Abstract
This paper discusses a methodology for optimizing Message Passing Interface (MPI) collective performance for applications linked with Intel® MPI Library. Intel MPI Library incorporates the I_MPI_ADJUST environment variable family to allow explicit selection of the algorithm used by each of its collective operations. The methodology for MPI collective optimization with Intel MPI Library is demonstrated by measuring message passing latencies for the Intel® MPI Benchmarks while applying process pinning and interconnect fabric control. The Intel MPI Benchmarks analysis for collectives was run on the Intel® Xeon Phi™ coprocessor in native mode, but this scheme for selecting the best collective algorithms is applicable to other Intel® microarchitectures.
- Introduction
- Who might benefit from this article?
- What methodology is used to improve the Message Passing Interface (MPI) collective performance?
- How is this article organized?
- Collective Operation Control within Intel® MPI Library
- Architecture of the Intel® Xeon Phi™ Coprocessor
- The Intel® MPI Benchmarks
- MPI_Allreduce Performance Improvement due to Adjusting the I_MPI_ADJUST_ALLREDUCE Environment Variable Setting
- Conclusion
- References
1. Introduction
Who might benefit from this article?
If you use collective operations within a Message Passing Interface (MPI)1,2,3 application that is linked with Intel® MPI Library4 for execution on the Linux* or Windows* OS, and profiling analysis from software tools such as Intel® VTune™ Amplifier XE5 or Intel® Trace Analyzer and Collector6 indicates that collective operations are hotspot bottlenecks, then this article might help you achieve better execution performance.
What methodology is used to improve the MPI collective performance?
Intel MPI Library4 provides a way to control the collective algorithm selection explicitly by using the I_MPI_ADJUST environment variable family. The collective optimization process will be demonstrated on the Intel® Xeon Phi™ coprocessor architecture using native-mode executables built from the Intel® MPI Benchmarks7 and linked with Intel MPI Library.4 You can apply this approach to your own MPI applications that are linked with Intel MPI Library and that incorporate collective operations. This optimization technique is also relevant to other Intel® microarchitectures, as well as to MPI implementations that provide algorithmic collective control for Intel® microarchitectures and for non-Intel microprocessors that support Intel® architecture.
How is this article organized?
Section 2 describes collective operations for MPI1,2,3 and the class of environment variables supported by Intel MPI Library that can be used to control collective optimization. Section 3 provides background on the Intel Xeon Phi coprocessor architecture and a set of Intel MPI Library environment variables that will be used to exploit the compute cores and the communication between Intel Xeon Phi coprocessor sockets. Section 4 briefly describes the Intel MPI Benchmarks.7 Section 5 demonstrates performance results on the Intel Xeon Phi coprocessor using the MPI_Allreduce collective within the Intel MPI Benchmarks. Finally, conclusions are provided.
2. Collective Operation Control within Intel® MPI Library
Collective operations for MPI processes come in three basic forms:3
- Barrier synchronization
- Global communication functions such as Broadcast, Gather, Scatter, and Gather/Scatter from all members to all members of a group (complete exchange or all-to-all)
- Global reduction operations such as max, min, product, sum, or user-defined functions
In all these cases, a message-passing library can take advantage of knowledge about the structure of the computing system so as to increase the parallelism within these collective operations1 and thus improve execution performance of an application.3
Figure 1, based on the MPI Standard,1 shows MPI data transitions for a subset of MPI collective operations, namely, the Broadcast, Scatter, Gather, Allgather, and Alltoall collective functions. The rows represent MPI processes, and the columns represent the data associated with those MPI processes. The tokens Ai, Bi, Ci, Di, Ei, and Fi represent the data items that can be associated with each MPI process.
Figure 1. Illustration of how the MPI collective functions Broadcast, Scatter, Gather, Allgather, and Alltoall respectively interact with data across MPI processes.1
The typical approaches for implementing the collective operations in Figure 1 can be summarized into two categories:6
- Short vectors that are treated with, say, one optimization technique
- Long vectors that are treated with a different approach
For MPI applications that use collectives, it is imperative that an MPI library perform well for all vector lengths.8 For example,9 collective optimization algorithms can be designed to distinguish between short and long messages, and switch between the short-message and long-message10 algorithms. Thus, in general, optimizing MPI collective operations can encompass the following:11,12
- A wide range of possible algorithms
- Knowledge about the network topologies
- Programming methods
Therefore, by building a parameterized model for MPI collectives, an automatic process can be put into place to choose the most efficient implementation of a collective algorithm at runtime.13 This article does not use such an automatic process; instead, the focus is on the collective-based environment variables available within Intel MPI Library, which let you compile your MPI application once and then use the collective environment variable controls to target the application to various cluster topologies.
As mentioned in the introduction, each collective operation within Intel MPI Library supports a number of communication algorithms. In addition to highly optimized default settings, the Intel MPI Library provides a way to control the algorithm selection explicitly by using the I_MPI_ADJUST environment variable family, which is described in this section.
The I_MPI_ADJUST environment variable family is available for both Intel and non-Intel microprocessors, but it may enable additional optimizations for Intel microprocessors beyond those it performs for non-Intel microprocessors that support Intel® architecture.4
Table 1 provides a list of I_MPI_ADJUST environment variable names and the set of algorithms that have been implemented within Intel MPI Library for each collective operation.4 Depending on the cluster topology, the interconnection fabric, and shared memory communication, one collective algorithm may perform better than the other algorithms associated with that collective operation. In regard to the MPI collective functions Broadcast, Scatter, Gather, Allgather, and Alltoall that are illustrated in Figure 1, the Algorithm Selection column in Table 1 lists possible algorithms that can be chosen to carry out the respective collective operations.
Table 1. Environment Variable Definitions for the I_MPI_ADJUST Family of Intel® MPI Library Collective Operations.4
| Environment Variable Name | Collective Operation | Algorithm Selection |
| --- | --- | --- |
| I_MPI_ADJUST_ALLGATHER | MPI_Allgather | See reference 4 |
| I_MPI_ADJUST_ALLGATHERV | MPI_Allgatherv | See reference 4 |
| I_MPI_ADJUST_ALLREDUCE | MPI_Allreduce | 1. Recursive doubling algorithm; 2. Rabenseifner's algorithm; 3. Reduce + Bcast algorithm; 4. Topology aware Reduce + Bcast algorithm; 5. Binomial gather + scatter algorithm; 6. Topology aware binomial gather + scatter algorithm; 7. Shumilin's ring algorithm; 8. Ring algorithm; 9. Knomial algorithm |
| I_MPI_ADJUST_ALLTOALL | MPI_Alltoall | See reference 4 |
| I_MPI_ADJUST_ALLTOALLV | MPI_Alltoallv | See reference 4 |
| I_MPI_ADJUST_ALLTOALLW | MPI_Alltoallw | Isend/Irecv + waitall algorithm |
| I_MPI_ADJUST_BARRIER | MPI_Barrier | See reference 4 |
| I_MPI_ADJUST_BCAST | MPI_Bcast | See reference 4 |
| I_MPI_ADJUST_EXSCAN | MPI_Exscan | See reference 4 |
| I_MPI_ADJUST_GATHER | MPI_Gather | See reference 4 |
| I_MPI_ADJUST_GATHERV | MPI_Gatherv | See reference 4 |
| I_MPI_ADJUST_REDUCE_SCATTER | MPI_Reduce_scatter | See reference 4 |
| I_MPI_ADJUST_REDUCE | MPI_Reduce | See reference 4 |
| I_MPI_ADJUST_SCAN | MPI_Scan | See reference 4 |
| I_MPI_ADJUST_SCATTER | MPI_Scatter | 1. Binomial algorithm; 2. Topology aware binomial algorithm; 3. Shumilin's algorithm |
| I_MPI_ADJUST_SCATTERV | MPI_Scatterv | See reference 4 |
In regard to Table 1, if an application uses the MPI_Scatter collective, the environment variable I_MPI_ADJUST_SCATTER can be set to an integer value of 1, 2, or 3 to select, respectively, the binomial algorithm, the topology-aware binomial algorithm, or Shumilin's algorithm. Shumilin's algorithm is applicable to small-scale clusters and is bandwidth-efficient. Descriptions of the various algorithm implementations can be found in the literature (for example, see Thakur et al. and the references therein).12
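As a minimal sketch of this kind of control, the lines below select Shumilin's algorithm for MPI_Scatter before launching an application; the executable name my_mpi_app and the host names node1 and node2 are placeholders for your own program and cluster nodes:

export I_MPI_ADJUST_SCATTER=3    # 3 selects Shumilin's algorithm for MPI_Scatter
mpirun -hosts node1,node2 -ppn 16 -n 32 ./my_mpi_app

Unsetting I_MPI_ADJUST_SCATTER returns the library to its highly optimized default selection.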
As mentioned in the introduction, the Intel microarchitecture that will be used to experiment with collective performance is the Intel Xeon Phi coprocessor, which is based on Intel® Many Integrated Core Architecture (Intel® MIC Architecture). The next section provides a brief description of the Intel Xeon Phi coprocessor architecture and discusses a select set of Intel MPI Library environment variables so as to explain:
- How MPI processes are pinned to multiple cores within an Intel Xeon Phi coprocessor socket
- How communication between Intel Xeon Phi coprocessor sockets is controlled
3. Architecture of the Intel® Xeon Phi™ Coprocessor
The Intel Xeon Phi coprocessor principally consists of 61 processing cores with four hardware threads per core. The coprocessor also has caches, memory controllers, PCIe* (Peripheral Component Interconnect Express*) client logic, and a very high bandwidth, bidirectional ring interconnection network (Figure 2).14 Each core has a private L2 cache that is kept fully coherent by a globally distributed tag directory (labeled “TD” in Figure 2). The L2 cache is eight-way set associative and is 512 KB in size; it is unified, caching both data and instructions. Each core also has an eight-way set associative L1 cache with 32 KB for instructions and 32 KB for data. The memory controllers and the PCIe client logic provide a direct interface to the GDDR5 memory (double data rate type five synchronous graphics random access memory) on the coprocessor and to the PCIe bus, respectively. All these components are connected together by the ring interconnection network.
Figure 2. The microarchitecture of the Intel® Xeon Phi™ coprocessor14
Figure 2 shows a subset of the cores, the L2 cache associated with each core, and the tag directories.
As mentioned earlier, the I_MPI_ADJUST environment variable family is the key to controlling algorithm selection for Intel MPI Library collectives. In addition, four other Intel MPI Library environment variables will be used to help exploit the architectural features of the Intel Xeon Phi coprocessor, namely:
export I_MPI_MIC=1
When the I_MPI_MIC environment variable is set to 1, or enable, or yes, Intel MPI Library will try to detect and work with the Intel MIC Architecture components.
export I_MPI_PIN_MODE=lib
The value lib causes the pinning of processes to occur inside the Intel MPI Library.
export I_MPI_PIN_CELL=core
The I_MPI_PIN_CELL environment variable defines the pinning resolution granularity. I_MPI_PIN_CELL specifies the minimal processor cell allocated when an MPI process is running. The value core refers to a physical processor core, such as the core abstractions shown in Figure 2.
export I_MPI_FABRICS=<fabric>|<intra-node fabric>:<inter-nodes fabric>
The I_MPI_FABRICS environment variable is used to select the particular network fabrics to be used, where:
<fabric> := shm | dapl | tcp | tmi | ofa | ofi
<intra-node fabric> := shm | dapl | tcp | tmi | ofa | ofi
<inter-nodes fabric> := dapl | tcp | tmi | ofa | ofi
Table 2 summarizes the possible fabric values and the definition associated with each value.
Table 2. Possible Values for the I_MPI_FABRICS Environment Variable4
| <fabric> | Network Fabric Definition |
| --- | --- |
| shm | Shared memory |
| dapl | DAPL-capable network fabrics, such as InfiniBand*, iWarp*, Dolphin*, and XPMEM* (through DAPL*) |
| tcp | TCP/IP-capable network fabrics, such as Ethernet and InfiniBand (through IPoIB*) |
| tmi | TMI-capable network fabrics, including Intel® True Scale Fabric and Myrinet* (through the Tag Matching Interface) |
| ofa | OFA-capable network fabrics, including InfiniBand (through OFED* verbs) |
| ofi | OFI (OpenFabrics Interfaces*)-capable network fabrics, including Intel® True Scale Fabric and TCP (through the OFI* API) |
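Taken together, these settings might be combined for a native coprocessor run as in the sketch below, where the coprocessor names mic0 and mic1 and the executable name my_mpi_app.mic are placeholders that should match your own system:

export I_MPI_MIC=1             # detect and work with the Intel MIC Architecture components
export I_MPI_PIN_MODE=lib      # let Intel MPI Library perform the process pinning
export I_MPI_PIN_CELL=core     # pin at the granularity of a physical core
export I_MPI_FABRICS=shm:tcp   # shared memory within a coprocessor, TCP/IP between coprocessors
mpirun -hosts mic0,mic1 -ppn 30 -n 60 ./my_mpi_app.mic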
The Intel MPI Library environment variables described above will be applied to the Intel MPI Benchmarks7 executables to perform MPI_Allreduce experiments on the Intel Xeon Phi coprocessor. A brief review of how to build the Intel MPI Benchmarks for Intel MIC Architecture is therefore warranted, along with a subset of command-line options that are useful when experimenting with the assortment of MPI collective tests available within the benchmarks.
4. The Intel® MPI Benchmarks
The Intel MPI Benchmarks7 perform a set of MPI performance measurements for point-to-point and global communication operations on a range of message sizes. The generated benchmark data fully characterizes:
- The performance of a cluster system, including the node performance, the network latency, and the throughput
- The efficiency of an MPI implementation. This allows one to make performance comparisons between various implementations of the MPI Standard
Based on the guide “Getting Started with Intel® MPI Benchmarks 4.1”7 (and targeting the Intel MPI Benchmarks executables for Intel MIC Architecture), one can issue the following on a Linux OS:
cd <path to the Intel MPI Benchmarks directory>/src
make -f make_ict_mic
After the make command completes, the following executables should reside in the src subdirectory referenced above:
IMB-EXT.mic
IMB-IO.mic
IMB-MPI1.mic
IMB-NBC.mic
IMB-RMA.mic
Assuming that there are at least four MIC (Many Integrated Core) cards on the cluster system and that the MIC cards have names such as mic0, mic1, mic2, and mic3, Intel MPI Benchmarks analysis of the collective function MPI_Allreduce might be as follows:15
mpirun -hosts mic0,mic1,mic2,mic3 -ppn 15 -n 60 ./IMB-MPI1.mic ALLREDUCE -npmin 60 -time 50
where in general, the mpirun command-line option
-ppn <# of processes>
places (pins) the indicated number of consecutive MPI processes on each host in the group, using round-robin scheduling across the hosts. In general, the pseudo-argument <# of processes> represents an integer value for the number of MPI processes to be pinned to a host. For the mpirun command shown above, the process pinning value is 15 MPI ranks per host.
The Intel MPI Benchmarks command-line option -npmin specifies the minimum number of processes, P_min, to run all selected benchmarks on.15 The P_min value after -npmin must be an integer. Given a value P_min, the selected Intel MPI Benchmarks will run on the following sequence of process counts:
P_min, 2×P_min, 4×P_min, …, largest 2^n×P_min < P, P
For example, with P_min = 16 and P = 60 MPI processes, the benchmarks run on 16, 32, and then 60 processes.
The Intel MPI Benchmarks command-line switch -time specifies the number of seconds for the benchmark to run per message size.15 The argument after -time is a floating-point number.
The combination of this flag with the -iter flag or its default alternative ensures that the Intel MPI Benchmarks always chooses the maximum number of repetitions that conform to all restrictions. By default, the number of iterations is controlled through the parameters MSGSPERSAMPLE, OVERALL_VOL, MSGS_NONAGGR, and ITER_POLICY defined in <path to the Intel MPI Benchmarks directory>/src/IMB_settings.h.
A rough number of repetitions per sample to fulfill the -time request is estimated in preparatory runs that use approximately 1 second of overhead.
For the default -time parameter, the floating-point value specifying the runtime seconds per sample is set in the SECS_PER_SAMPLE variable defined in <path to the Intel MPI Benchmarks directory>/src/IMB_settings.h, or <path to the Intel MPI Benchmarks directory>/src/IMB_settings_io.h.
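As an illustration of how these options interact, the sketch below runs the Allreduce benchmark on a single coprocessor, where the host name mic0 is a placeholder. With -npmin 4 and 16 MPI ranks, the benchmark executes on 4, 8, and then 16 processes, and each message size is measured for roughly 10 seconds:

# Run the MPI_Allreduce benchmark on 4, 8, and 16 processes, about 10 seconds per message size
mpirun -hosts mic0 -ppn 16 -n 16 ./IMB-MPI1.mic allreduce -npmin 4 -time 10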
5. MPI_Allreduce Performance Improvement Due to Adjusting the I_MPI_ADJUST_ALLREDUCE Environment Variable Setting
As a review, the following topics have thus far been covered:
- Intel MPI Library collective operation control
- The architecture of the Intel Xeon Phi coprocessor and environment variables associated with controlling the compute cores and socket communication
- How to build the Intel MPI Benchmarks for Intel Xeon Phi coprocessor architecture
In this section, the information discussed in prior segments will be coalesced so as to do experiments with the MPI_Allreduce collective. Note that the experimental outcomes that you may achieve will most likely vary from those provided here. The results observed can be a function of the Intel Xeon Phi coprocessor stepping, the interconnection fabric, the socket configuration, the software stack, and the memory size. Focus on tuning for your specific cluster configuration and do not be concerned with matching the chart results that are illustrated here.
The script called test_on_2_mics.sh for running the “allreduce” component of the Intel MPI Benchmarks on two Intel Xeon Phi coprocessor sockets named mic0 and mic1 is:
#!/bin/sh
export I_MPI_MIC=1
export I_MPI_PIN_MODE=lib
export I_MPI_PIN_CELL=core
export I_MPI_ADJUST_ALLREDUCE=$3
export I_MPI_FABRICS=$4

# Map the algorithm selection argument to a short tag used in the report file name
case "$3" in
   1) ADJUST_ALLREDUCE="RDA" ;;
   2) ADJUST_ALLREDUCE="RaA" ;;
   3) ADJUST_ALLREDUCE="RBA" ;;
   4) ADJUST_ALLREDUCE="TARBA" ;;
   5) ADJUST_ALLREDUCE="BGSA" ;;
   6) ADJUST_ALLREDUCE="TABGSA" ;;
   7) ADJUST_ALLREDUCE="SRA" ;;
   8) ADJUST_ALLREDUCE="RA" ;;
   9) ADJUST_ALLREDUCE="KA" ;;
   *) ADJUST_ALLREDUCE="ALL_REDUCE_ERROR"
      echo "An error occurred regarding the collective algorithm selection. Value is $3; Exiting..."
      exit 1 ;;
esac

# Map the fabric selection argument to the I_MPI_FABRICS setting and a report-name tag
case "$4" in
   1) export I_MPI_FABRICS=shm:tcp
      FABRIC="shm_tcp" ;;
   2) export I_MPI_FABRICS=tcp
      FABRIC="tcp" ;;
   *) FABRIC="FABRIC_ERROR"
      echo "An error occurred regarding the fabric selection. Value was $4; Exiting..."
      exit 1 ;;
esac

mpirun -hosts mic0,mic1 -ppn $1 -n $2 ./IMB-MPI1.mic allreduce -npmin $2 -time 20 > allreduce_report.$1.$2.${ADJUST_ALLREDUCE}.${FABRIC} 2>&1
For the script above, note the references to the environment variables I_MPI_MIC, I_MPI_PIN_MODE, I_MPI_PIN_CELL, I_MPI_ADJUST_ALLREDUCE (see Table 3), and I_MPI_FABRICS (see Table 2). You can cut and paste the text above into your own test_on_2_mics.sh script and modify it to match your cluster. The pseudo-arguments for the command-line script might be something like:
./test_on_2_mics.sh <processes-to-pin> <mpi-ranks> <adjust-allreduce> <fabric>
where the pseudo-argument <processes-to-pin> has the value 30
<mpi-ranks> has the value 60
<adjust-allreduce> can have the values 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<fabric> can have the values 1 | 2
For convenience, Table 3 provides semantic meaning for the pseudo-argument <adjust-allreduce>.
Table 3. Implemented Algorithms Available for the MPI_Allreduce Collective.4
| Environment Variable Name | Collective Operation | Algorithm Selection |
| --- | --- | --- |
| I_MPI_ADJUST_ALLREDUCE | MPI_Allreduce | 1. Recursive doubling algorithm; 2. Rabenseifner's algorithm; 3. Reduce + Bcast algorithm; 4. Topology aware Reduce + Bcast algorithm; 5. Binomial gather + scatter algorithm; 6. Topology aware binomial gather + scatter algorithm; 7. Shumilin's ring algorithm; 8. Ring algorithm; 9. Knomial algorithm |
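To generate comparable report files for every algorithm in Table 3 over both fabric settings, the script can be driven by a simple loop. The following sketch assumes that test_on_2_mics.sh and IMB-MPI1.mic reside in the current working directory:

#!/bin/sh
# Sweep all nine MPI_Allreduce algorithm selections over both fabric settings,
# pinning 30 MPI ranks on each of the two coprocessor sockets (60 ranks in total).
# Each run writes its own allreduce_report.* file for later comparison.
for fabric in 1 2; do
    for algorithm in 1 2 3 4 5 6 7 8 9; do
        ./test_on_2_mics.sh 30 60 $algorithm $fabric
    done
done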
Figure 3 shows the results of using command-line syntax that looks something like:
./test_on_2_mics.sh 30 60 4 1
where the topology-aware Reduce + Bcast algorithm for MPI_Allreduce was selected and the interconnection fabric was set to shm:tcp, and
./test_on_2_mics.sh 30 60 4 2
where the topology-aware Reduce + Bcast algorithm for MPI_Allreduce was selected and the interconnection fabric was set to tcp. For each command-line reference to test_on_2_mics.sh shown above, two Intel Xeon Phi coprocessor sockets were used, where each socket had 61 cores. The application was run with 60 MPI ranks, and 30 MPI ranks were pinned (see Figure 2) to each of the two sockets.
Figure 3. Results of analyzing the “allreduce” component of the Intel MPI Benchmarks in native mode on two Intel® Xeon Phi™ coprocessor sockets using the script test_on_2_mics.sh with 30 MPI processes pinned on each Intel Xeon Phi coprocessor socket. There are a total of 60 MPI ranks.
The script called test_on_4_mics.sh for running the “allreduce” component of the Intel MPI Benchmarks on four Intel Xeon Phi coprocessor sockets named mic0, mic1, mic2, and mic3 is:
#!/bin/sh
export I_MPI_MIC=1
export I_MPI_PIN_MODE=lib
export I_MPI_PIN_CELL=core
export I_MPI_ADJUST_ALLREDUCE=$3
export I_MPI_FABRICS=$4

# Map the algorithm selection argument to a short tag used in the report file name
case "$3" in
   1) ADJUST_ALLREDUCE="RDA" ;;
   2) ADJUST_ALLREDUCE="RaA" ;;
   3) ADJUST_ALLREDUCE="RBA" ;;
   4) ADJUST_ALLREDUCE="TARBA" ;;
   5) ADJUST_ALLREDUCE="BGSA" ;;
   6) ADJUST_ALLREDUCE="TABGSA" ;;
   7) ADJUST_ALLREDUCE="SRA" ;;
   8) ADJUST_ALLREDUCE="RA" ;;
   9) ADJUST_ALLREDUCE="KA" ;;
   *) ADJUST_ALLREDUCE="ALL_REDUCE_ERROR"
      echo "An error occurred regarding the collective algorithm selection. Value was $3; Exiting..."
      exit 1 ;;
esac

# Map the fabric selection argument to the I_MPI_FABRICS setting and a report-name tag
case "$4" in
   1) export I_MPI_FABRICS=shm:tcp
      FABRIC="shm_tcp" ;;
   2) export I_MPI_FABRICS=tcp
      FABRIC="tcp" ;;
   *) FABRIC="FABRIC_ERROR"
      echo "An error occurred regarding the fabric selection. Value was $4; Exiting..."
      exit 1 ;;
esac

mpirun -hosts mic0,mic1,mic2,mic3 -ppn $1 -n $2 ./IMB-MPI1.mic allreduce -npmin $2 -time 50 > allreduce_report.$1.$2.${ADJUST_ALLREDUCE}.${FABRIC} 2>&1
Again, you can cut and paste the text above into your own test_on_4_mics.sh script. The pseudo-arguments for the command-line might be something like:
./test_on_4_mics.sh <processes-to-pin> <mpi-ranks> <adjust-allreduce> <fabric>
where pseudo-argument <processes-to-pin> has the value 15 | 60
<mpi-ranks> has the value 60 | 240
<adjust-allreduce> can have the value 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<fabric> can have the values 1 | 2
See Table 3 above for semantic meanings for the <adjust-allreduce> pseudo-argument to the test_on_4_mics.sh script.
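As with the two-socket case, a small driver loop can produce report files for the rank configurations examined below. The following sketch assumes that test_on_4_mics.sh and IMB-MPI1.mic reside in the current working directory:

#!/bin/sh
# Sweep the nine MPI_Allreduce algorithm selections on four coprocessor sockets
# for the two configurations used in Figures 4 and 5: 15 ranks per socket
# (60 ranks in total) and 60 ranks per socket (240 ranks in total), over both fabrics.
for fabric in 1 2; do
    for algorithm in 1 2 3 4 5 6 7 8 9; do
        ./test_on_4_mics.sh 15 60 $algorithm $fabric
        ./test_on_4_mics.sh 60 240 $algorithm $fabric
    done
done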
Figure 4 shows the results of using command-line syntax that looked something like:
./test_on_4_mics.sh 15 60 1 1
where the recursive doubling algorithm for MPI_Allreduce was selected and the interconnection fabric was set to shm:tcp, and
./test_on_4_mics.sh 15 60 1 2
where the recursive doubling algorithm for MPI_Allreduce was selected and the interconnection fabric was set to tcp. For each command-line reference to test_on_4_mics.sh shown above, four Intel Xeon Phi coprocessor sockets were used, where each socket had 61 cores. The application was run with 60 MPI ranks, and 15 MPI ranks were pinned to each of the four sockets.
Figure 4. Results of analyzing the “allreduce” component of the Intel MPI Benchmarks in native mode on four Intel Xeon Phi coprocessor sockets using the script test_on_4_mics.sh with 15 MPI processes pinned on each Intel Xeon Phi coprocessor socket. There are a total of 60 MPI ranks.
Figure 5 shows the results of using command-line syntax that looked something like:
./test_on_4_mics.sh 60 240 2 1
where Rabenseifner's algorithm for MPI_Allreduce was selected and the interconnection fabric was set to shm:tcp, and
./test_on_4_mics.sh 60 240 2 2
where Rabenseifner's algorithm for MPI_Allreduce was selected and the interconnection fabric was set to tcp. Again, for each command-line reference to test_on_4_mics.sh shown above, four Intel Xeon Phi coprocessor sockets were used, where each socket had 61 cores. The application was run with 240 MPI ranks, and 60 MPI ranks were pinned to each of the four sockets.
Figure 5. Results of analyzing the “allreduce” component of the Intel MPI Benchmarks in native mode on four Intel® Xeon Phi™ coprocessor sockets using the script test_on_4_mics.sh with 60 MPI processes pinned on each Intel Xeon Phi coprocessor socket. There are a total of 240 MPI ranks.
In general, for Figures 3, 4, and 5, smaller latency values on the ordinate axis are better for each of the curves shown.
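When comparing the report files produced by the scripts above, it can be convenient to extract the average latency for a single message size across all algorithm and fabric combinations. The following sketch assumes the standard IMB-MPI1 Allreduce output columns (#bytes, #repetitions, t_min[usec], t_max[usec], t_avg[usec]); adjust the message size to the region of the latency curves that matters for your application:

#!/bin/sh
# Print the average MPI_Allreduce latency (t_avg, in microseconds) reported by
# IMB-MPI1 for one message size, taken from each allreduce_report.* file.
MSG_SIZE=${1:-65536}
for report in allreduce_report.*; do
    t_avg=$(awk -v size="$MSG_SIZE" '$1 == size { print $5 }' "$report")
    echo "$report : t_avg = ${t_avg:-not found} usec at $MSG_SIZE bytes"
done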
6. Conclusion
This article has outlined a methodology, using Intel MPI Library, for choosing an implementation of an MPI collective algorithm that will enable the best performance for an application running on a cluster. Experiments that exemplify this approach were done using the Intel MPI Benchmarks on the Intel Xeon Phi coprocessor architecture. When an application uses MPI collective operations, the emphasis has been on using the collective-based environment variables to select an appropriate optimization algorithm. For the Intel Xeon Phi coprocessor socket topologies and MPI process pinning configurations examined, the experiments demonstrated that increased performance can be achieved by choosing an appropriate setting for the collective-based environment variables.
The framework for achieving MPI collective performance is not limited to the Intel Xeon Phi coprocessor architecture, but is also applicable to other Intel microarchitectures. In general, MPI library implementations that support environment variable selection of collective algorithms can apply this type of methodology to improve MPI application performance.
You can hopefully use this scheme as a model for optimizing collective performance for your specific MPI applications.
7. References
1. “MPI Documents,” http://www.mpi-forum.org/docs/.
2. W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, 2nd Edition, The MIT Press, Cambridge, Massachusetts, 1999.
3. M. Snir, S. W. Otto, S. Huss-Lederman, D. W. Walker, and J. Dongarra, MPI: The Complete Reference, The MIT Press, Cambridge, Massachusetts, 1996.
4. “Intel® MPI Library for Linux* OS Reference Manual,” https://software.intel.com/sites/default/files/Reference_Manual_1.pdf.
5. “Intel® VTune™ Amplifier,” https://software.intel.com/en-us/intel-vtune-amplifier-xe.
6. “Intel® Trace Analyzer and Collector,” https://software.intel.com/en-us/intel-trace-analyzer.
7. “Getting Started with Intel® MPI Benchmarks 4.1,” https://software.intel.com/en-us/articles/intel-mpi-benchmarks, October 2013.
8. M. Barnett, L. Shuler, R. van de Geijn, S. Gupta, D. G. Payne, and J. Watts, “Interprocessor Collective Communication Library,” Proceedings of the IEEE Scalable High-Performance Computing Conference, May 23-25, 1994, p. 357.
9. R. Rabenseifner, “A New Optimized MPI Reduce Algorithm,” https://fs.hlrs.de/projects/par/mpi/myreduce.html, November 1997.
10. R. Thakur and W. Gropp, “Improving the Performance of Collective Operations in MPICH,” Proceedings of the 10th European PVM/MPI Users’ Group Meeting (Euro PVM/MPI 2003), Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science, LNCS 2840, Springer, September 2003, pp. 257-267.
11. J. Pješivac-Grbovic, T. Angskun, G. Bosilca, G. E. Fagg, E. Gabriel, and J. J. Dongarra, “Performance Analysis of MPI Collective Operations,” Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium, April 2005.
12. R. Thakur, R. Rabenseifner, and W. Gropp, “Optimization of Collective Communication Operations in MPICH,” International Journal of High Performance Computing Applications, Spring 2005, Vol. 19, No. 1, pp. 49-66.
13. H. N. Mamadou, T. Nanri, and K. Murakami, “A Robust Dynamic Optimization for MPI Alltoall Operation,” IEEE International Symposium on Parallel and Distributed Processing, May 2009.
14. G. Chrysos, “The Intel® Xeon Phi™ Coprocessor – the Architecture,” https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner.
15. “Intel® MPI Benchmarks User Guide and Methodology Description,” https://software.intel.com/sites/default/files/managed/66/e8/IMB_Users_Guide.pdf.