Understanding power measurements and the issues associated with various power measurement methodologies is key to utilizing, procuring, and deploying large HPC (High Performance Computing) clusters, and to maximizing bottom-line profit in the enterprise world. In the HPC space, high FLOPs/watt ratios are now a key design and procurement requirement because the operating costs of today’s petascale systems are on par with the acquisition costs of the supercomputer hardware itself (Subramaniam & Wu-chun, 2010). The focus on energy efficiency will become even more pronounced as the industry moves to exascale computing and beyond.
Measuring the power usage of Intel® Xeon Phi™ processors and coprocessors is of particular importance to the HPC community because – as noted by James Reinders and Jim Jeffers in their book High Performance Parallelism Pearls – by mid-2013 Intel Xeon Phi coprocessors “exceeded the combined FLOPs contributed by all the graphics processing units (GPUs) installed as floating-point accelerators in the TOP 500 list” (Reinders & Jeffers, 2014). Reinders and Jeffers further observed that the “only device type contributing more FLOPs to TOP 500 supercomputers [were] Intel Xeon® processors”, which is convenient because many of the details that affect power measurements in the Intel Xeon Phi product family also affect power measurements for Intel Xeon processors. This design consistency between the Intel Xeon and Intel Xeon Phi product families will become even more evident with the forthcoming introduction of the Intel® Xeon Phi™ x200 product family, also known as Knights Landing (KNL): devices based on Intel® Atom™ processor cores with the Silvermont architecture. The Green 500 has pronounced the new Knights Landing chip “the most power efficient parallel processor in the world” (Johnson, 2014).
The efficiency of the new KNL processors is clearly a next step on Intel’s roadmap to exascale computing. In general, researchers have confirmed the energy benefits of running on the current Intel® Xeon Phi™ x100 generation of coprocessors over conventional CPUs. One example is the paper, “Performance and energy evaluation of CoMD on Intel Xeon Phi co-processors”, that found energy savings as high as 30% when running a molecular dynamics code on Intel® Xeon Phi™ x100 coprocessors as opposed to Intel Xeon CPUs (Lawson, Sosonkina, & Shen, 2014).
The limits of the measurement hardware and software present in the current generation (KNC) of Intel® Xeon Phi™ x100 coprocessors restrict the type of analysis that a programmer can perform on an application. Even so, the literature contains examples where the current generation of Intel Xeon Phi coprocessors has been successfully used (a) to evaluate the FLOPs/watt of applications and improve their power performance; and (b) to determine the improved power efficiency that Intel Xeon Phi can deliver for workloads that contain irregular memory access patterns (Choi, Dukhan, Liu, & Vuduc, 2014). Notably, the coprocessor requires roughly an order of magnitude less energy per access during random memory access operations, which is a boon for sparse matrix and graph algorithms.
The design of the next generation of Intel® Xeon Phi™ processors and coprocessors includes improvements in functionality and measurement capabilities that will expand the scope and depth of the power analysis possible for HPC and enterprise applications, and potentially expand the class of applications that can be analyzed and benefit from improved power efficiency.
Approaches to Power Measurement
- Measuring power with a watt meter
- Measuring Intel Xeon Phi coprocessor power using micsmc
- Measuring cluster power consumption using the PAPI RAPL API
- Measuring power using the micras sysfs nodes for the Intel Xeon Phi coprocessor
We assume that the FLOPs/watt metric is used even though this metric is considered by the Green 500 to be somewhat biased against larger systems due to the challenges of achieving perfect scalability across many nodes. In other words, communications and algorithmic limitations tend to cause most applications to scale sub-linearly as the number of processors increases, while power consumption scales linearly or super-linearly (Subramaniam & Wu-chun, 2010). As a result, smaller supercomputers appear to have better energy efficiency under the FLOPs/watt metric. For now, FLOPs/watt remains popular because it is easy to measure and it reflects floating-point performance, which is still the primary performance target for the HPC community.
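The metric itself is simple arithmetic. The sketch below, using hypothetical node counts, per-node numbers, and scaling efficiencies, illustrates why sub-linear performance scaling combined with linear power scaling makes larger systems score worse:

```python
def flops_per_watt(sustained_flops, avg_power_w):
    """Energy efficiency as sustained floating-point rate per watt."""
    return sustained_flops / avg_power_w

# Hypothetical cluster: 1 TFLOP/s and 300 W per node (assumed values).
node_flops = 1.0e12
node_power = 300.0

# Small system scales perfectly; the 10x larger system reaches only 75%
# parallel efficiency, yet its power consumption scales perfectly linearly.
small = flops_per_watt(100 * node_flops * 1.00, 100 * node_power)
large = flops_per_watt(1000 * node_flops * 0.75, 1000 * node_power)

print(small > large)  # True: the smaller system scores higher FLOPs/watt
```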
Measuring power with a watt meter
A power meter is the simplest way to measure power for a single node or a rack of nodes as shown in Figure 1. Many student competitions utilize the simplicity of this setup to trigger a system shutdown should an application exceed a specified peak power utilization.
Figure 1 Power Meter Set-Up
A modern compute system, be it a single server or a large cluster, has a variety of subsystems that consume power. The power usage of many of these subsystems (e.g. disk drives, fans, etc.) has little relationship, or at best an inconsistent relationship, to the workload being tested. In addition, many activities in a modern system occur on a sporadic and largely unpredictable basis. These activities and subsystems create a noisy background that makes it difficult to measure the effect of a workload on energy consumption, whether instantaneous or average power is being used. Though it is possible to reduce the impact of this background noise through multiple runs and averaging, the unpredictable and sporadic nature of the noise makes reproducibility an issue, especially when the execution time of the workload is long enough to limit the number of runs being averaged.
Even so, the Green 500 list will accept the average power consumption for an entire application run (Green500.org, 2015). Instantaneous power consumption (an alternative power meter measurement) is not accepted because of the significant variability – up to 35% – that can occur even when the CPU is idle, caused by communication operations and/or advanced features such as dynamically adjusted processor clock rates (Subramaniam & Wu-chun, 2010).
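A minimal sketch of the accepted whole-run measurement: integrate sampled meter readings into total energy and divide by elapsed time. The timestamps and readings below are illustrative, not from a real meter:

```python
def run_average_power_w(timestamps_s, power_w):
    """Average power over an entire run: total energy (trapezoidal rule)
    divided by elapsed wall-clock time."""
    energy_j = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        energy_j += 0.5 * (power_w[i] + power_w[i - 1]) * dt  # J = W * s
    return energy_j / (timestamps_s[-1] - timestamps_s[0])

# Hypothetical meter samples: power ramps up during compute, then falls off.
print(run_average_power_w([0.0, 1.0, 2.0], [100.0, 120.0, 100.0]))  # 110.0
```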
The simplicity of a power meter is appealing, but ideally a programmer wants to know the power measurements associated with specific parts of their application. Such information allows the program to be modified to both improve performance and be more power efficient. Unfortunately, average power measurements are a fairly blunt measure that discards much information about an application’s energy and performance efficiency. For both mobile and HPC applications, Intel is dedicated to teaching programmers that power efficiency and performance often go hand-in-hand (Intel Corp, 2015). Detailed information can be critical when optimizing applications to potentially save megawatts of energy on leadership class systems.
Measuring Intel Xeon Phi Power Using micsmc
To manage the Intel Xeon Phi power configuration and facilitate power measurements on Intel Xeon Phi coprocessors, Intel includes the micsmc utility, which measures total power and coprocessor silicon temperature as well as coprocessor usage (Intel Corp, 2015).
An example micsmc screenshot is shown below for one coprocessor, but micsmc has the ability to simultaneously report measurements for all the Intel Xeon Phi coprocessors in a system.
Figure 2: Example micsmc display (image courtesy Morgan Kaufmann)
Starting micsmc in text mode with the -t option provides additional detailed information about the coprocessor, including fan intake and exhaust thermal readings and memory temperature.
Figure 3: Text output for two devices from micsmc (image courtesy Morgan Kaufmann)
Chapter 14 of High Performance Parallelism Pearls includes example scripts that collect the text-based power information from micsmc and display it as a graph as seen below (Wright, 2014). The y-axis displays power consumption as reported by micsmc and the x-axis shows exact timestamps. From this plot, it is possible to examine the timeline of the application in a profiler and correlate runtime to power consumption. The example scripts can be freely downloaded from lotsofcores.com.
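As a hedged illustration of the same idea – not the actual lotsofcores.com scripts, and with an assumed log format – a few lines of Python can turn timestamped micsmc-style text into (timestamp, watts) pairs ready for plotting:

```python
import re

# Hypothetical micsmc-style text log; the layout is assumed for illustration.
log = """\
2015-03-01 12:00:01  mic0  Total Power: ......... 113 Watts
2015-03-01 12:00:02  mic0  Total Power: ......... 167 Watts
2015-03-01 12:00:03  mic0  Total Power: ......... 171 Watts
"""

def parse_power_log(text):
    """Extract (timestamp, watts) pairs for plotting power vs. time."""
    pattern = re.compile(
        r"^(\S+ \S+)\s+\S+\s+Total Power:\s*\.*\s*(\d+)\s+Watts", re.M)
    return [(ts, int(w)) for ts, w in pattern.findall(text)]

samples = parse_power_log(log)
print(samples[1])  # ('2015-03-01 12:00:02', 167)
```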
Future products, specifically some members of the new Intel Xeon Phi x200 product family (i.e., the KNL family of processors and coprocessors), will utilize a new memory technology called MCDRAM that delivers ~5x the bandwidth of DDR4 memory while also being 5x more power efficient. This dramatic increase in memory bandwidth will greatly benefit application performance resulting in faster runtime while also potentially reducing energy consumption. The micsmc tool can be used to monitor these changes and calculate the energy cost benefits that might motivate an upgrade to the new KNL architecture.
Figure 4: Excel plot of micsmc text output (image courtesy Morgan Kaufmann)
Measuring and Reducing Intel Xeon Phi Power Consumption in a Cluster
Large aggregate energy savings can be achieved in a cluster environment as optimizing an application to achieve even a small energy savings per device can result in big savings for the entire cluster. Further, optimizing the power efficiency of popular applications can deliver even greater energy savings over time.
To put this in perspective, the fastest supercomputer in the world (as of the November 2014 TOP500 list) is the 33 PF/s RMAX (54 PF/s RPEAK) Tianhe-2 supercomputer, which contains 48,000 Intel Xeon Phi 31S1P coprocessors. This system has a peak power consumption of 24 megawatts (million watts). Instrumenting such systems at scale highlights the importance of understanding application power efficiency, as even simple configuration and software changes can potentially save megawatts of power during a single run. For example, a 20 watt power savings per Intel Xeon Phi coprocessor translates to nearly a megawatt of power savings for any application that utilizes all the Tianhe-2 coprocessors. Smaller clusters can also deliver significant, if less dramatic, energy savings over time when popular applications are optimized for power efficiency.
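The back-of-the-envelope arithmetic is worth making explicit:

```python
def cluster_savings_w(per_device_savings_w, device_count):
    """Aggregate power savings when a per-device optimization runs cluster-wide."""
    return per_device_savings_w * device_count

# Tianhe-2 scale: 20 W saved on each of the 48,000 coprocessors.
savings_w = cluster_savings_w(20, 48_000)
print(savings_w / 1e6)  # 0.96 megawatts -- nearly a megawatt
```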
A number of power saving studies that report significant energy savings for the Intel Xeon Phi x100 coprocessor exist in the literature. For example, the paper, “Energy Evaluation for Applications with Different Thread Affinities on the Intel Xeon Phi”, by Lawson et al. measured energy consumption as a function of thread affinity and number of threads on an Intel Xeon Phi coprocessor. Changing the thread affinity and thread count on Intel Xeon Phi coprocessors is easily accomplished by defining a few shell environment variables – no code modifications are required. The Lawson paper showed that “varying thread affinity may improve both performance and energy, which is the most apparent under the compact affinity tests when the number of threads is larger than three per core. The energy savings reached as high as 48% for the CG NAS benchmark” (Lawson, Sosonkina, & Yuzhong, 2014).
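With the Intel OpenMP runtime, affinity and thread count are controlled by the KMP_AFFINITY and OMP_NUM_THREADS environment variables. The sketch below launches a stand-in child process with those variables set; the workload command itself is a placeholder for a real application:

```python
import os
import subprocess
import sys

def run_with_affinity(cmd, affinity="compact", num_threads=240):
    """Launch a workload with OpenMP thread affinity and count set via the
    environment -- the same no-code-change mechanism described above."""
    env = dict(os.environ)
    env["KMP_AFFINITY"] = affinity          # e.g. compact, scatter, balanced
    env["OMP_NUM_THREADS"] = str(num_threads)
    return subprocess.run(cmd, env=env, capture_output=True, text=True)

# Stand-in child process that simply echoes the settings it inherited.
child = [sys.executable, "-c",
         "import os; print(os.environ['KMP_AFFINITY'],"
         " os.environ['OMP_NUM_THREADS'])"]
result = run_with_affinity(child, affinity="balanced", num_threads=120)
print(result.stdout.strip())  # balanced 120
```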
Other reported savings, while not as dramatic as the Lawson results, are still significant – especially when the power reduction is multiplied across the 48,000 Intel Xeon Phi x100 coprocessors in the Tianhe-2 supercomputer. Shao and Brooks investigated the synthetic Linpack benchmark suite using an instruction-level energy model. They observed increases in energy efficiency as high as 10% on Linpack and between 1% and 5% on real applications (Shao & Brooks, 2013). A microbenchmarking study by Choi et al. found that the Intel Xeon Phi coprocessors offer energy benefits to highly irregular data processing workloads such as sparse matrix and graph algorithms (Choi, Dukhan, Liu, & Vuduc, 2014).
The need for large-scale power profiling is only going to become more important, as highlighted by recent large, high-profile Intel® Xeon Phi™ product family processor and coprocessor procurements, including: KNL for the Trinity supercomputer, a joint effort between Los Alamos and Sandia National Laboratories; KNL for the Cori supercomputer, announced by the U.S. Department of Energy’s (DOE) National Energy Research Scientific Computing (NERSC) Center; private procurements such as that of DownUnder GeoSolutions, who recently announced the largest commercial deployment of current-generation Intel Xeon Phi x100 coprocessors; and the new National Supercomputing Center IT4Innovations supercomputer that will become the largest Intel Xeon Phi x100 coprocessor-based cluster in Europe.
Visualizing performance data and precise energy consumption on such large systems can be accomplished via waterfall plots, where data from each node is displayed in a line that is part of a three dimensional plot displaying information from all the nodes utilized by the application. Following is a set of example power consumption waterfall plots for an NWchem run utilizing Intel Xeon Phi coprocessors at Pacific Northwest National Laboratory.
Figure 5: Example waterfall plots showing Intel Xeon Phi power consumptions for an NWchem run (Image courtesy EMSL, a DOE Office of Science User Facility sponsored by the Office of Biological and Environmental Research and located at Pacific Northwest National Laboratory.)
The freely available NWPerf software was used to collect the preceding performance data. An important benefit of NWPerf is that it can be used to collect performance data for every job that runs on the supercomputer or compute cluster. The software was designed (and can be configured) to have a minimal impact on application performance while gathering as much performance data as possible. The historical record created by NWPerf has proven invaluable for software vendors, scientists, programmers, and analysts evaluating balance metrics for future procurements. Succinctly, the performance behavior of any run or set of runs can be accessed via a web interface, which means that silent performance regressions from vendor and programmer updates can be identified – even months after the software changes are pushed into production – while workload assessments and balance metrics can be determined from queries of the NWPerf database.
NWPerf is an example of a number of profiling tools available for profiling Intel Xeon Phi coprocessors and Intel Xeon processors. The NWPerf software can be downloaded from https://github.com/EMSL-MSC/NWPerf.
Using the PAPI RAPL API
With the current Intel Xeon Phi x100 generation of devices, users need to measure power usage and energy consumption via PAPI's (Performance API) MIC component instead of the Intel RAPL (Running Average Power Limit) component.
As of release 5.3.2, PAPI includes two components oriented to KNC-based Intel® Xeon Phi™ coprocessors:
- micpower - runs in native mode where both the actual application as well as PAPI are running natively on the coprocessor without being offloaded from a host system.
- host_micpower - runs on the host and collects power data from the Intel Xeon Phi coprocessor(s). The power data is exported through the MicAccessAPI interfaces distributed with the Intel® MPSS (Manycore Platform Software Stack).
More information can be found on the PAPI website (http://icl.cs.utk.edu/papi/) and in the paper “Power Monitoring with PAPI for Extreme Scale Architectures and Dataflow-based Programming Models” (McCraw, Ralph, Danalis, & Dongarra, 2014).
The next generation of Knights Landing devices will be based on Silvermont processing cores, which means they are already compatible with software that utilizes the PAPI RAPL Application Programming Interface (API). The PAPI team added support for Silvermont processors as of release 5.3.2, which highlights the benefits of Knights Landing utilizing existing Intel Xeon power and performance APIs such as:
- Socket RAPL and Turbo, which provide the same base algorithm/interface as used in current and previous generations of Intel processors.
- PECI (Performance Environment Control Interface) for power/thermal management/optimization
Intel’s RAPL technology is designed to measure the power usage of different parts of the silicon chip. It is available in modern Intel microarchitectures starting with Sandy Bridge and is intended for controlling and limiting power usage at the chip level; in addition, it provides capabilities to measure energy and power usage. The following RAPL domains are supported by server-grade silicon chips:
- PP0 (Power Plane 0) – processor core subsystem.
- PKG (Package) – processor die.
- DRAM (Memory) – directly-attached DRAM.
RAPL provides measurements at approximately 1 ms resolution. RAPL sensor data is read via model-specific registers (MSRs) and exported by the Linux kernel via devfs, which makes these measurements readily available to profiling tools.
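As an illustration of working with these counters, the sketch below assumes a Linux system that exposes RAPL through the powercap sysfs tree (the intel-rapl:0 path is typical but not guaranteed on every system). The wraparound arithmetic is exercised with synthetic counter values:

```python
from pathlib import Path

def energy_delta_uj(start_uj, end_uj, max_range_uj):
    """Energy consumed between two RAPL counter reads, in microjoules,
    accounting for the counter wrapping at max_range_uj."""
    return (end_uj - start_uj) % (max_range_uj + 1)

def average_power_w(start_uj, end_uj, max_range_uj, interval_s):
    """Average power over the sampling interval (uJ -> J, then J/s = W)."""
    return energy_delta_uj(start_uj, end_uj, max_range_uj) / 1e6 / interval_s

def read_pkg_energy_uj(domain=Path("/sys/class/powercap/intel-rapl:0")):
    """Read the PKG domain's cumulative energy counter. Requires a kernel
    with the intel_rapl powercap driver; the path is an assumption."""
    return int((domain / "energy_uj").read_text())

# Arithmetic check with synthetic counter values: 30 J over 1 s -> 30 W.
print(average_power_w(1_000_000, 31_000_000, 2**32 - 1, 1.0))  # 30.0
```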
Using the micras sysfs Nodes for the Intel® Xeon Phi™ Family of Coprocessors
The micrasd tool is an application that runs on the host platform and handles and logs Intel® Xeon Phi™ coprocessor errors. It can run in a single-use mode or as a daemon. Information is collected every 5 milliseconds and displayed in /sys/class/micras/power.
Following is an example output. Note both the high resolution of the samples and variety of information that is reported as compared to micsmc. For clarity, this output was annotated with comments after the ‘#’ mark:
$ cat /sys/class/micras/power
119000000 #Total power for running average Window 0
118000000 #Total power for running average Window 1
121000000 #power for current 5 msec sample (Instantaneous power)
189000000 #Max Instantaneous power over some sampling period
34000000 #PCI-E connector power
33000000 #power delivered by 2x3 power connector
54000000 #power delivered by 2x4 power connector
38000000 0 869000 #Core rail (Power, Current, Voltage)
32000000 0 1000000 # Non-core rail (Power, Current, Voltage)
33000000 0 1501000 #Memory subsystem rail (Power, Current, Voltage)
$
Micras-reported power is specified in uWatts, current in uAmps, and voltage in uVolts. Note that the minor inconsistencies between the connector power (e.g. 121 W), total power (e.g. 119 W), and rail power (103 W) are most likely the result of different and independent sampling circuits.
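A small parser makes these unit conversions and consistency checks concrete; this sketch assumes the fixed ten-line layout shown above:

```python
# Sample /sys/class/micras/power contents (comments stripped); all values
# are micro-units as reported by micras.
SAMPLE = """\
119000000
118000000
121000000
189000000
34000000
33000000
54000000
38000000 0 869000
32000000 0 1000000
33000000 0 1501000
"""

def parse_micras_power(text):
    """Parse the ten-line /sys/class/micras/power layout into base units."""
    rows = [line.split() for line in text.strip().splitlines()]
    u = lambda s: int(s) / 1e6  # uW/uA/uV -> W/A/V
    return {
        "avg_win0_w": u(rows[0][0]),   # running average window 0
        "avg_win1_w": u(rows[1][0]),   # running average window 1
        "inst_w":     u(rows[2][0]),   # current 5 ms sample
        "max_w":      u(rows[3][0]),   # max instantaneous power
        "pcie_w":     u(rows[4][0]),   # PCI-E connector
        "conn_2x3_w": u(rows[5][0]),   # 2x3 power connector
        "conn_2x4_w": u(rows[6][0]),   # 2x4 power connector
        # Rails: (power W, current A, voltage V)
        "core":   tuple(u(v) for v in rows[7]),
        "uncore": tuple(u(v) for v in rows[8]),
        "memory": tuple(u(v) for v in rows[9]),
    }

p = parse_micras_power(SAMPLE)
print(p["pcie_w"] + p["conn_2x3_w"] + p["conn_2x4_w"])  # 121.0 W (connectors)
print(p["core"][0] + p["uncore"][0] + p["memory"][0])   # 103.0 W (rails)
```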
Choice of Benchmark
The choice of benchmark depends upon what quantities are to be measured. Generally the best benchmarks are the most popular workloads on the system. Alternatively, benchmarks can be chosen to characterize the entire system at maximum load to determine utility power requirements. Care must be taken to ensure that the benchmarks are representative of production workloads so that optimization efforts are well targeted. After that, consider the impact of the measurement technique utilized, e.g. from the wall, from direct processor PMU (Performance Monitoring Unit) measurements, or from external rail sensors.
No matter your measuring technique, several principles need to be followed:
(a) Characterize and factor in background power usage by other hardware
(b) Characterize and factor in background processes that may be competing with the workload
(c) Select the proper workload or workload suite for the type and purpose of the measurements
(d) Select an appropriate spectrum of working sets for characterizing power usage
(e) Select an experiment with enough sample points to draw reasonable statistics
(f) Take note of and characterize errors due to small sample size or background noise
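Several of these principles – (a), (e), and (f) – reduce to straightforward statistics. A minimal sketch with hypothetical samples:

```python
import statistics

def workload_power_stats(workload_w, background_w):
    """Report mean/stdev/stderr of workload power after subtracting the
    characterized background draw (principles a, e, and f above)."""
    background = statistics.mean(background_w)
    net = [s - background for s in workload_w]
    stdev = statistics.stdev(net)
    return {
        "background_w": background,
        "mean_w": statistics.mean(net),
        "stdev_w": stdev,
        "stderr_w": stdev / len(net) ** 0.5,  # shrinks with more samples
    }

# Hypothetical data: five workload readings against a 20 W idle baseline.
stats = workload_power_stats([120.0, 122.0, 118.0, 121.0, 119.0],
                             [20.0, 20.0, 20.0])
print(stats["mean_w"])  # 100.0
```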
Idle Power
Idle power measurement on Intel Xeon Phi product family coprocessors is surprisingly challenging because these devices run Linux*, which means that the coprocessor has to wake up to service the idle power measurement request – thus making some samples less useful and more difficult to characterize. Depending upon the circumstances, the wake-up overhead may invalidate the idle power measurement.
Software power measurement utilizes the MPSS libmicmgmt library, which can read sensor data from the card for many different domains such as PCIe, 2x3, 2x4, VCCP, VDDG, VDDQ, instantaneous power, and more. The drawback of software-based sensor readings is that the libmicmgmt library must send an interrupt over the PCIe bus to the Linux uOS. While Linux has many excellent power saving features, the libmicmgmt interrupt unfortunately forces the coprocessor to switch to a higher power state, which means this software cannot be used to measure idle power consumption.
Direct measurement of a single Intel Xeon Phi SE10/7120 coprocessor card has been accomplished by physically removing the card from the server enclosure and monitoring the actual power draw of the card. This approach shows that idle power consumption is around 17 watts.
For more information, Intel has produced a detailed document, “Determining the Idle Power of an Intel® Xeon Phi™ Coprocessor,” that discusses both in-band and out-of-band idle power measurements (Intel Corp, 2014).
Measuring Power on KNL (Intel® Xeon Phi™ x200 Product Family)
Measuring power on KNL promises to be far more direct, and offers the potential to give programmers insight into application power usage that can be characterized nearly to the level of an individual instruction. This remarkable ability results from the availability of software-accessible power measurement events provided by a modern Intel Atom Silvermont core. Taking measurements directly from the processing core rather than from external sensors (such as those on the Intel Xeon Phi x100 product family motherboards or from an external power meter) simplifies the software and hardware while also providing programmers with very detailed information about their applications.
Conclusion
Informed developers with access to detailed application power profiles can reduce power consumption by a megawatt or more on a leadership-class supercomputer, translating into real dollars saved for both HPC and enterprise customers. The impact of even a few percent energy savings per device can be significant as the industry moves to exascale computing, where tens of thousands of devices are used to achieve such performance. Based on current industry trends, the vast majority of the FLOPs in future TOP 500 systems will be delivered by a combination of Intel Xeon processors and Intel Xeon Phi x200 product family processors and coprocessors, which is one of many reasons why Intel is striving to increase compatibility between the forthcoming Intel Knights Landing products and existing Intel Xeon APIs for power and performance profiling.
Even on smaller HPC clusters and enterprise systems, optimizing frequently utilized applications can reduce power, cooling, and, over the long term, infrastructure maintenance costs. Education is key to achieving these savings as developers need to know how to: (1) get access to detailed power data, and (2) use the information contained in the power profiles to reduce application power utilization while preserving performance.
This white paper has striven to raise awareness of the potentially significant impact that even small reductions in per-device power consumption can have on overall system and data center power consumption both at scale and over time. However, this document is only a starting point as the literature contains many studies not mentioned here, plus there are many other power profiling tools available from both the open-source and commercial software communities that were also not covered.
Works Cited
Choi, J., Dukhan, M., Liu, X., & Vuduc, R. (2014). Algorithmic time, energy, and power on candidate HPC compute building blocks. IEEE 28th International Symposium on Parallel & Distributed Processing (IPDPS). Arizona, USA: IEEE.
Green500.org. (2015, Jan 16). Energy Efficient High Performance Computing Power Measurement Methodology. Retrieved from green500.org: http://www.green500.org/sites/default/files/eehpcwg/EEHPCWG_PowerMeasurementMethodology.pdf
Intel Corp. (2014, June 26). Determining the Idle Power of an Intel® Xeon Phi™ Coprocessor. Retrieved from software.intel.com: https://software.intel.com/en-us/articles/determining-the-idle-power-of-an-intel-xeon-phi-coprocessor
Intel Corp. (2015, Feb.). Energy Efficient Software. Retrieved from software.intel.com: https://software.intel.com/en-us/energy-efficient-software
Intel Corp. (2015). Intel® Xeon Phi™ coprocessor Power Management Configuration: Using the micsmc command-line Interface. Retrieved from software.intel.com.
Johnson, R. C. (2014, June 23). Intel Supercomputer Processors Roadmap. Retrieved from InformationWeek: http://www.informationweek.com/infrastructure/pc-and-servers/intel-supercomputer-processors-roadmap/d/d-id/1278798
Lawson, G., Sosonkina, M., & Shen, Y. (2014). Performance and energy evaluation of CoMD on Intel Xeon Phi co-processors. Proceedings of the 1st International Workshop on Hardware-Software Co-Design for High Performance Computing (pp. 49-54). Piscataway, NJ, USA: IEEE Press.
Lawson, G., Sosonkina, M., & Yuzhong, S. (2014). Energy Evaluation for Applications with Different Thread Affinities on the Intel Xeon Phi. Computer Architecture and High Performance Computing Workshop (SBAC-PADW), 2014 International Symposium on. Paris: IEEE.
McCraw, H., Ralph, J., Danalis, A., & Dongarra, J. (2014). Power Monitoring with PAPI for Extreme Scale Architectures and Dataflow-based Programming Models. Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications, IEEE Cluster 2014.
Reinders, J., & Jeffers, J. (2014). High Performance Parallelism Pearls. Burlington, Massachusetts: Morgan Kaufmann.
Shao, Y., & Brooks, D. (2013). Energy characterization and instruction-level energy model of Intel’s Xeon Phi processor. IEEE International Symposium on Low Power Electronics and Design (ISLPED) (pp. 389–394). IEEE.
Subramaniam, B., & Wu-chun, F. (2010). Understanding power measurement implications in the Green500 list. IEEE/ACM International Conference on Green Computing and Communications & International Conference on Cyber, Physical and Social Computing (pp. 245-251).
Wright, C. J. (2014). Power Analysis on the Intel Xeon Phi Coprocessor. In J. Reinders, & J. Jeffers, High Performance Parallelism Pearls. Morgan Kaufmann.