Channel: Intel Developer Zone Articles

How to Get Intel® MKL/IPP/DAAL


This page provides links to the current ways to get the Intel® performance libraries: Intel® Math Kernel Library, Intel® Integrated Performance Primitives, and Intel® Data Analytics Acceleration Library (Intel® MKL/IPP/DAAL).

•   Sub-component of Intel Parallel Studio XE/Intel System Studio

     https://software.intel.com/en-us/intel-parallel-studio-xe
     https://software.intel.com/en-us/system-studio

The following article provides the version information for the Intel® performance libraries included in the bundled products: https://software.intel.com/en-us/articles/which-version-of-the-intel-ipp-intel-mkl-and-intel-tbb-libraries-are-included-in-the-intel

•   Free standalone access through each product's main page (Intel® MKL/IPP/DAAL)
    https://software.intel.com/en-us/mkl
    https://software.intel.com/en-us/intel-ipp
    https://software.intel.com/en-us/intel-daal

•   Free access through the High Performance Libraries page (Intel® MKL/IPP/DAAL)
    https://software.intel.com/en-us/performance-libraries
 


Only for Intel® MKL:

•   Cloudera* Parcels support since Intel® MKL 2017 update 2
    https://software.intel.com/en-us/articles/installing-intel-mkl-cloudera-cdh-parcel
•   Conda* package/Anaconda Cloud* support since Intel® MKL 2017 update 2
    https://software.intel.com/en-us/articles/using-intel-distribution-for-python-with-anaconda
 
The libraries obtained through any of the above channels are identical in functionality and performance. The main differences are in the license agreement and the support priority. Please refer to the Intel MKL license FAQ:
https://software.intel.com/en-us/mkl/license-faq
 

Improve Performance Using Vectorization and Intel® Xeon® Scalable Processors

$
0
0

Introduction

Modern CPUs include different levels of parallelism. High-performance software needs to take advantage of all opportunities for parallelism in order to fully benefit from modern hardware. These opportunities include vectorization, multithreading, memory optimization, and more.

The need for increased performance in software continues to grow, but instead of getting better performance from increasing clock speeds as in the past, now software applications need to use parallelism in the form of multiple cores and, in each core, an increasing number of execution units, referred to as single instruction, multiple data (SIMD) architectures, or vector processors.

To take advantage of both multiple cores and wider SIMD units, we need to add vectorization and multithreading to our software. Vectorization in each core is a critical step because of the multiplicative effect of vectorization and multithreading. To get good performance from multiple cores we need to extract good performance from each individual core.

Intel’s new processors support the rising demands in performance with Intel® Advanced Vector Extensions 512 (Intel® AVX-512), which is a set of new instructions that can accelerate performance for demanding computational workloads.

Intel AVX-512 may increase the vectorization efficiency of our codes, both for current hardware and also for future generations of parallel hardware.

Vectorization Basics

Figure 1 illustrates the basics of vectorization using Intel AVX-512.

In the grey box at the top of Figure 1, we have a very simple loop performing addition of the elements of two arrays containing single precision numbers. In scalar mode, without vectorization, every instruction produces one result. If the loop is vectorized, 16 elements from each operand are going to be packed into 512-bit vector registers, and a single Intel AVX-512 instruction is going to produce 16 results. Or 8 results, if we use double-precision floating point numbers.

How do we know that this loop was vectorized? One way is to get an optimization report from the Intel® C compiler or the Intel® C++ compiler, as shown in the green box at the bottom of the figure. The next section in this article shows how to generate these reports.

However, not all loops can run in vector mode. This simple loop is vectorizable because different iterations of this loop can be processed independently from each other. For example, the iteration when the index variable i is 3 can be executed at the same time when i is 5 or 6, or has any other value in the iteration range, from 0 to N in this case. In other words, this loop has no data dependencies between iterations.


Figure 1: Simple loop adding two arrays of single precision floating point numbers. Operations are performed simultaneously on 512-bit registers. No dependencies present. Loop is vectorized.
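
For reference, here is a minimal C sketch of the kind of loop shown in Figure 1 (the array and variable names are illustrative):

void add_arrays(const float *a, const float *b, float *c, int N)
{
    /* Each iteration is independent, so the compiler can pack 16 single-precision
       elements per 512-bit register and produce 16 results per Intel AVX-512 add. */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}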

Figure 2 illustrates the case where a loop might not be directly vectorizable. The figure shows a loop where each iteration produces one value in the array c, which is computed using another value of c just produced in the previous iteration.

If this loop is executed in vector mode, the old values of the array with name c (16 of those values in this case) are going to be loaded into the vector register (as shown by the red circles in Figure 2), so the result will be different compared to the execution in scalar mode.

And so, this loop cannot be vectorized. This type of data dependency is called a Read-After-Write, or a Flow dependency, and we can see that the compiler will detect it and will not vectorize this loop.

There are other types of data dependencies; this is just an example to illustrate one that will prevent the compiler from automatically vectorizing this loop. In these cases we need to modify the loop or the data layout to come up with a vectorizable loop.


Figure 2: Simple loop, yet not vectorized because a data dependency is present.
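
A minimal sketch of a loop with such a dependency (the exact expression used in Figure 2 may differ):

void running_sum(float *c, const float *a, int N)
{
    /* Read-after-write (flow) dependency: iteration i reads c[i-1], which was
       written by iteration i-1, so the iterations cannot run simultaneously. */
    for (int i = 1; i < N; i++)
        c[i] = c[i-1] + a[i];
}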

Intel® Advanced Vector Extensions 512 (Intel® AVX-512): Enhanced Vector Processing Capabilities

The Intel AVX-512 instruction set increases vector processing capabilities because the new instructions can operate on 512-bit registers. There is support in Intel AVX-512 for 32 of these vector registers, and each register can pack either 16 single-precision floating point numbers or 8 double-precision numbers.

Compared to the Intel® Advanced Vector Extensions 2 instruction set (Intel® AVX2), Intel AVX-512 doubles the number of vector registers, and each vector register can pack twice the number of floating point or double-precision numbers. Intel AVX2 offers 256-bit support. This means more work can be achieved per CPU cycle, because the registers can hold more data to be processed simultaneously.

Intel AVX-512 is currently available on Intel® Xeon Phi™ processors x200 and on the new Intel® Xeon® Scalable processors.
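
To make the 512-bit register width concrete, the following is a minimal sketch using compiler intrinsics (illustration only; the examples in this article rely on automatic vectorization rather than intrinsics):

#include <immintrin.h>

/* One Intel AVX-512 add on 512-bit registers produces 16 single-precision results. */
void add16(const float *a, const float *b, float *c)
{
    __m512 va = _mm512_loadu_ps(a);
    __m512 vb = _mm512_loadu_ps(b);
    _mm512_storeu_ps(c, _mm512_add_ps(va, vb));
}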

The full specification of the Intel AVX-512 instruction set consists of several separate subsets:

A. Some are present in both the Intel Xeon Phi processors x200 and in the Intel Xeon Scalable processors:

  • Intel AVX-512 Foundation Instructions (AVX512-F)
  • Intel AVX-512 Conflict Detection Instructions (AVX512-CD)

B. Some are supported only by Intel Xeon Phi processors x200:

  • Intel AVX-512 Exponential and Reciprocal Instructions (AVX512-ER)
  • Intel AVX-512 Prefetch Instructions (AVX512-PF)

C. Some are supported only by Intel Xeon Scalable processors:

  • Intel AVX-512 Byte (char/int8) and Word (short/int16) Instructions (AVX512-BW)
  • Intel AVX-512 Double-word (int32/int) and Quad-word (int64/long) Instructions (AVX512-DQ)
  • Intel AVX-512 Vector Length Extensions (AVX512-VL)

The subsets shown above can be accessed in different ways. The easiest way is to use a compiler option. As an example, Intel C++ compiler options that control which subsets to use are as follows:

  • Option -xCOMMON-AVX512 will use:
    • Intel AVX-512 Foundation Instructions (AVX512-F)
    • Intel AVX-512 Conflict Detection Instructions (AVX512-CD)
    • Instructions enabled with -xCORE-AVX2
  • Option -xMIC-AVX512 will use:
    • Intel AVX-512 Foundation Instructions (AVX512-F)
    • Intel AVX-512 Conflict Detection Instructions (AVX512-CD)
    • Intel AVX-512 Exponential and Reciprocal Instructions (AVX512-ER)
    • Intel AVX-512 Prefetch Instructions (AVX512-PF)
    • Instructions enabled with -xCORE-AVX2
  • Option -xCORE-AVX512 will use:
    • Intel AVX-512 Foundation Instructions (AVX512-F)
    • Intel AVX-512 Conflict Detection Instructions (AVX512-CD)
    • Intel AVX-512 Byte and Word Instructions (AVX512-BW)
    • Intel AVX-512 Double-word and Quad-word Instructions (AVX512-DQ)
    • Intel AVX-512 Vector Length Extensions (AVX512-VL)
    • Instructions enabled with -xCORE-AVX2
  • Option -xCORE-AVX2 will use:
    • Intel Advanced Vector Extensions 2 (Intel AVX2), Intel® Advanced Vector Extensions (Intel® AVX), Intel SSE 4.2, Intel SSE 4.1, Intel SSE 3, Intel SSE 2, Intel SSE, and Supplemental Streaming SIMD Extensions 3 instructions for Intel® processors

For example, if it is necessary to keep binary compatibility for both Intel Xeon Phi processors x200 and Intel Xeon Scalable processors, code can be compiled using the Intel C++ compiler as follows:

icpc Example.cpp -xCOMMON-AVX512 <more options>

But if the executable will run on Intel Xeon Scalable processors, the code can be compiled as follows:

icpc Example.cpp -xCORE-AVX512 <more options>

Vectorization First

The combination of vectorization and multithreading can be much faster than either one alone, and this difference in performance is growing with each generation of hardware.

In order to use the high degree of parallelism present on modern CPUs, like the Intel Xeon Phi processors x200 and the new Intel Xeon Scalable processors, we need to write new applications in such a way that they take advantage of vector processing on individual cores and multithreading on multiple cores. And we want to do that in a way that guarantees that the optimizations will be preserved as much as possible for future generations of processors, to preserve code and optimization efforts, and to maximize software development investment.

When optimizing code, the first efforts should be focused on vectorization. Data parallelism in the algorithm/code is exploited in this stage of the optimization process.

There are several ways to take advantage of vectorization capabilities on a single core on Intel Xeon Phi processors x200 and the new Intel Xeon Scalable processors:

  1. The easiest way is to use libraries that are already optimized for Intel processors. An example is the Intel® Math Kernel Library, which is optimized to take advantage of vectorization and multithreading. In this case we can get excellent improvements in performance just by linking with this library. Another example is if we are using Python*. Using the Intel® Distribution for Python* will automatically increase performance, because this distribution accelerates computational packages in Python, like NumPy* and others.
  2. On top of using optimized libraries, we can also write our code in a way that the compiler will automatically vectorize it, or we can modify existing code for this purpose. This is commonly called automatic vectorization. Sometimes we can also add keywords or directives to the code to help or guide automatic vectorization (a short sketch of such a directive appears after this list). This is a very good methodology because code optimized in this way will likely be optimized for future generations of processors, probably with minor or no modifications. Only recompilation might be needed.
  3. A third option is to directly call vector instructions using intrinsic functions or in assembly language.
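
As an illustration of the second approach, a directive can be used to guide automatic vectorization. The following is a generic sketch (not code from the article's example); with the Intel compiler, the OpenMP SIMD directive takes effect when compiling with -qopenmp or -qopenmp-simd:

void scale_add(float *c, const float *a, const float *b, int n)
{
    /* The simd directive asserts that the iterations are independent, so the
       compiler can vectorize without having to prove the absence of dependencies. */
#pragma omp simd
    for (int i = 0; i < n; i++)
        c[i] = a[i] + 2.0f * b[i];
}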

In this article, we will focus on examples of optimizing applications using the first two methods shown above.

A typical vectorization workflow is as follows:

  1. Compile + Profile
  2. Optimize
  3. Repeat

The first step above can be performed using modern compilers, profilers, and optimization tools. In this article, an example of an optimization flow will be shown using the optimization reports generated by the Intel® compilers, as well as advanced optimization tools like Intel® Advisor and Intel® VTune™ Amplifier, which are part of the new Intel® Parallel Studio XE 2018.

Example: American Option Pricing

This example is taken from a book by James Reinders and Jim Jeffers titled High Performance Parallelism Pearls Volume Two [1]. In chapter 8, Shuo Li (the author of that chapter and the code that will be used in this article) describes a possible solution to the problem of American option pricing. It consists of finding an approximate solution to partial differential equations based on the Black-Scholes model using the Newton-Raphson method (based on work by Barone-Adesi and Whaley [2]). They have implemented this solution as C++ code and the source code is freely available at the authors’ website (http://lotsofcores.com/).

Figure 3 shows two fragments of the source code am_opt.cpp (downloaded from the link shown above), containing the two main loops in their code. The first loop initializes arrays with random values. The second loop performs the pricing operation for the number of options indicated in the variable OptPerThread, which in this case has a value of about 125 million options. In the rest of this article we will focus on Loop 2, which uses most of the CPU time. In particular, we will focus on the call to function baw_scalaropt (line 206), which performs option pricing for one option.


Figure 3: Fragments of the source code from the author's book and website showing the two main loops in the program.

The following code snippet shows the definition of function baw_scalaropt:

 90 float baw_scalaropt( const float S,
 91                  const float X,
 92                  const float r,
 93                  const float b,
 94                  const float sigma,
 95                  const float time)
 96 {
 97     float sigma_sqr = sigma*sigma;
 98     float time_sqrt = sqrtf(time);
 99     float nn_1 = 2.0f*b/sigma_sqr-1;
100     float m = 2.0f*r/sigma_sqr;
101     float K = 1.0f-expf(-r*time);
102     float rq2 = 1/((-(nn_1)+sqrtf((nn_1)*(nn_1) +(4.f*m/K)))*0.5f);
103
104     float rq2_inf = 1/(0.5f * ( -(nn_1) + sqrtf(nn_1*nn_1+4.0f*m)));
105     float S_star_inf = X / (1.0f - rq2_inf);
106     float h2 = -(b*time+2.0f*sigma*time_sqrt)*(X/(S_star_inf-X));
107     float S_seed = X + (S_star_inf-X)*(1.0f-expf(h2));
108     float cndd1 = 0;
109     float Si=S_seed;
110     float g=1.f;
111     float gprime=1.0f;
112     float expbr=expf((b-r)*time);
113     for ( int no_iterations =0; no_iterations<100; no_iterations++) {
114         float c  = european_call_opt(Si,X,r,b,sigma,time);
115         float d1 = (logf(Si/X)+
116                    (b+0.5f*sigma_sqr)*time)/(sigma*time_sqrt);
117         float cndd1=cnd_opt(d1);
118         g=(1.0f-rq2)*Si-X-c+rq2*Si*expbr*cndd1;
119         gprime=( 1.0f-rq2)*(1.0f-expbr*cndd1)+rq2*expbr*n_opt(d1)*
120                (1.0f/(sigma*time_sqrt));
121         Si=Si-(g/gprime);
122     };
123     float S_star = 0;
124     if (fabs(g)>ACCURACY) { S_star = S_seed; }
125     else { S_star = Si; };
126     float C=0;
127     float c  = european_call_opt(S,X,r,b,sigma,time);
128     if (S>=S_star) {
129         C=S-X;
130     }
131     else {
132         float d1 = (logf(S_star/X)+
133                    (b+0.5f*sigma_sqr)*time)/(sigma*time_sqrt);
134         float A2 =  (1.0f-expbr*cnd_opt(d1))* (S_star*rq2);
135         C=c+A2*powf((S/S_star),1/rq2);
136     };
137     return (C>c)?C:c;
138 };

Notice the following in the code snippet shown above:

  • There is a loop performing the Newton-Raphson optimization on line 113.
  • There is a call to function european_call_opt on line 114 (inside the loop) and on line 127 (outside the loop). This function performs a pricing operation for European options, which is needed for the pricing of American options (see details of the algorithm in [1]).

For reference, the following code snippet shows the definition of the european_call_opt function. Notice that this function only contains computation (and calls to math functions), but no loops:

 75 float european_call_opt( const float S,
 76                 const float X,
 77                 const float r,
 78                 const float q,
 79                 const float sigma,
 80                 const float time)
 81 {
 82     float sigma_sqr = sigma*sigma;
 83     float time_sqrt = sqrtf(time);
 84     float d1 = (logf(S/X) + (r-q + 0.5f*sigma_sqr)*time)/(sigma*time_sqrt);
 85     float d2 = d1-(sigma*time_sqrt);
 86     float call_price=S*expf(-q*time)*cnd_opt(d1)-X*expf(-r*time)* cnd_opt(d2);
 87     return call_price;
 88 };

To better visualize the code, Figure 4 shows the structure of functions and loops in the am_opt.cpp program. Notice that we have labeled the two loops we will be focusing on as Loop 2.0 and Loop 2.1.


Figure 4: Structure of functions and loops in the program.

As a first approach, we can compile this code with the options shown at the top of the yellow section in Figure 5 (the line indicated by label 1). Notice that we have included the option “O2”, which specifies a moderate level of optimization, and also the option “-qopt-report=5”, which asks the compiler to generate an optimization report with the maximum level of information (possible levels range from 1–5).

A fragment of the optimization report is shown in the yellow section in Figure 5. Notice the following:

  • Loop 2.0 on line 201 was not vectorized. The compiler suggests a loop interchange and the use of the SIMD directive.
  • The compiler also reports Loop 2.1 was not vectorized because of an assumed dependence, and also reports that function baw_scalaropt has been inlined.
  • Inlining the function baw_scalaropt presents both loops (2.0 and 2.1) as a nested loop, which the compiler reports as an “Imperfect Loop Nest” (a loop nest with extra instructions between the two loops), and for which it suggests a loop interchange (permutation).


Figure 5: Compiling the code with optimization option "O2" and targeting the Intel® AVX-512 instructions.

Before trying the SIMD option (which would force vectorization), we can try compiling this code using the optimization flag “O3”, which specifies a higher level of optimization from the compiler. The result is shown in Figure 6. We can observe in the yellow section that:

  • The compiler reports that the outer loop (Loop 2.0) has been vectorized, using a vector length of 16.
  • The compiler’s estimated potential speedup for the vectorized loop 2.0 is 7.46.
  • The compiler reports a number of vectorized math library calls and one serialized function call.

Given that the vector length used for this loop was 16, we would expect a potential speedup close to 16 from this vectorized loop. Why was it only 7.46?

The reason seems to be the serialized function call reported by the compiler, which refers to the call to the function european_call_opt inside the inner loop (Loop 2.1). One possible way to fix this is to ask the compiler to recursively inline all the function calls. For this we can use the directive “#pragma inline recursive” right before the call to the function baw_scalaropt.
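
A rough sketch of the directive placement (the loop body and array names here are illustrative, not the exact code of am_opt.cpp):

for (int opt = 0; opt < OptPerThread; opt++) {
    /* Ask the compiler to inline baw_scalaropt and, recursively, everything it
       calls (including european_call_opt), removing the serialized function call. */
#pragma inline recursive
    CallResult[opt] = baw_scalaropt(S[opt], X[opt], r, b, sigma[opt], T[opt]);
}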

After compiling the code (using the same compiler options as in the previous experiment), we get a new optimization report showing that the compiler’s estimated potential speedup for the vectorized Loop 2.0 (Figure 7) is now 14.180, which is much closer to the ideal speedup of 16.


Figure 6: Compiling the code with optimization option "O3" and targeting the Intel® AVX-512 instructions.


Figure 7: Adding the "#pragma inline recursive" directive.


Figure 8: Fragment of optimization report showing both loops vectorized.

Figure 8 shows the sections of the optimization report confirming that both loops have been vectorized. More precisely, as we can see in the line with label 2, the inner loop (Loop 2.1) has been vectorized along with the outer loop, which means that the compiler has generated efficient code, taking into account all the operations involved in the nested loops.

Once the compiler reports a reasonable level of vectorization, we can perform a runtime analysis. For this, we can use Intel Advisor 2018.

Intel Advisor 2018 is one of the analysis tools in the Intel Parallel Studio XE 2018 suite that lets us analyze our application and gives us advice on how to improve vectorization in our code.

Intel Advisor analyzes our application and reports not only the extent of vectorization but also possible ways to achieve more vectorization and increase the effectiveness of the current vectorization.

Although Intel Advisor works with any compiler, it is particularly effective when applications are compiled using Intel compilers, because Intel Advisor will use the information from the reports generated by Intel compilers.

The most effective way to use Intel Advisor is via the graphical user interface. This interface gives us access to all the information and recommendations that Intel Advisor collects from our code. Detailed information is available at Intel Advisor Support, which provides documentation, training materials, and code samples. Product support and access to the Intel Advisor community forum can also be found there.

Intel Advisor also offers a command-line interface (CLI) that lets the user work on remote hosts and generate information in a way that makes it easy to automate analysis tasks, for example, using scripts.

As an example, let us run the Intel Advisor CLI to perform a basic analysis of the example code am_opt.cpp.


Figure 9: Running Intel® Advisor to collect survey information.

The first step for a quick analysis is to create an optimized executable that will run on the Intel processor (in this case, an Intel Xeon Scalable processor 81xx). Figure 9 shows how we can run the CLI version of the Intel Advisor tool. The survey analysis is a good starting point because it provides information that lets us identify how our code uses vectorization and where the hotspots for analysis are.

The command labeled as 1 in Figure 9 runs the Intel Advisor tool and creates a project directory VecAdv-01. Inside this directory, Intel Advisor creates, among other things, a directory named e000, containing the results of the analysis. The command is:

$ advixe-cl --collect survey --project-dir ./VecAdv-01 --search-dir src:r=./. -- ./am_call

The next step is to view the results of the survey analysis performed by the Intel Advisor tool. We will use the CLI to generate the report. To do this, we replace the -collect option with the -report one (as shown in the command with label 2 in Figure 9), making sure we refer to the project directory where the data is collected. We can use the following command to generate a survey report from the data contained in the results directory in our project directory:

$ advixe-cl -report  survey -format=xml --project-dir ./VecAdv-01 --search-dir src:r=./. -- ./am_call

The above command creates a report named advisor-survey.xml in the project directory. If we do not use the -format=xml option, a text-formatted report is generated instead. The text report contains several columns, so it can be a little difficult to read on a console. One option for a quick read is to limit the number of columns to be displayed using the filter option.

Another option is to create an XML-formatted report. We can do this if we change the value for the -format option from text to XML, which is what we did in Figure 9.

The XML-formatted report might be easier to read on a small screen because the information in the report file is condensed into one column. Figure 9 (the area labeled with 4) shows a fragment of the report corresponding to the results of Loop 2.0.

The survey option in the Intel Advisor tool generates a performance overview of the loops in the application. For example, Figure 9 shows that the loop starting on line 200 in the source code has been vectorized using Intel AVX-512. It also shows an estimate of the improvement of the loop’s performance (compared to a scalar version) and timing information.

Also notice that the different loops have been assigned a loop ID, which is the way the Intel Advisor tool labels the loops in order to keep track of them in future analysis (for example, after looking at the performance overview shown above, we might want to generate more detailed information about a specific loop by including the loop ID in the command line).

Once we have looked at the performance summary reported by the Intel Advisor tool using the Survey option, we can use other options to add more specific information to the reports. One option is to run the Trip counts analysis to get information about the number of times loops are executed.

To add this information to our project, we can use the Intel Advisor tool to run a Trip Counts analysis on the same project we used for the survey analysis. Figure 10 shows how this can be done. The commands we used are:

$ advixe-cl --collect tripcounts --project-dir ./VecAdv-01 --search-dir src:r=./. -- ./am_call
$ advixe-cl -report tripcounts -format=xml --project-dir ./VecAdv-01 --search-dir src:r=./. -- ./am_call

Now the XML-formatted report contains information about the number of times the loops have been executed. Specifically, the Trip_Counts fields in the XML report will be populated, while the information from the survey report will be preserved. This is shown in Figure 10 in the area labeled 5.

In a similar way, we can generate other types of reports that give us other useful information about our loops. The -help collect and -help report options in the command-line Intel Advisor tool show what types of collections and reports are available. For example, to obtain memory access pattern details in our source code, we can run a Memory Access Patterns (MAP) analysis using the -map option. This is shown in Figure 11.

Figure 11 shows the results of running Advisor using the -map option to collect MAP information. The MAP analysis is an in-depth analysis of memory access patterns and, depending on the number of loops and complexity in the program, it can take some time to finish the data collection. The report might also be very long. For that reason, notice that in Figure 11 we selected only the loop we are focusing on in this example, which has a loop ID of 3. In this way, data will be collected and reported only for that loop. The commands we are using (from Figure 11) to collect and report MAP information for this loop are:

$ advixe-cl --collect map --project-dir ./VecAdv-01 --mark-up-list=3 --search-dir src:r=./. -- ./am_call
$ advixe-cl -report map -format=xml --project-dir ./VecAdv-01 --search-dir src:r=./. -- ./am_call

Also notice that the stride distribution for all patterns found in this loop is reported, along with information for every memory access pattern. In Figure 11, only one of them is shown (the pattern with pattern ID = “186”, showing unit-stride access).


Figure 10: Adding the "tripcounts" analysis.


Figure 11: Adding the "MAP" analysis.

Summary

Software needs to take advantage of all opportunities for parallelism present in modern processors to obtain top performance. As modern processors continue to increase the number of cores and to widen SIMD registers, the combination of vectorization, threading and efficient memory use will make our code run faster. Vectorization in each core is going to be a critical step because of the multiplicative effect of vectorization and multithreading.

Extracting good performance from each individual core will be a necessary step to efficiently use multiple cores.

This article presented an overview of vectorization using Intel compilers and Intel optimization tools, in particular Intel Advisor 2018. The purpose was to illustrate a methodology for vectorization performance analysis that can be used as an initial step in a code modernization effort to get software running faster on a single core.

The code used in this example is from chapter 8 of [1], which implements a Newton-Raphson algorithm to approximate American call options. The source code is available for download from the book’s website (http://www.lotsofcores.com/).

References

1. J. Reinders and J. Jeffers, High Performance Parallelism Pearls, vol. 2, Morgan Kaufmann, 2015.

2. G. Barone-Adesi and R. E. Whaley, Efficient Analytic Approximation of American Option Values, J. Finance, 1987.

Exploit Nested Parallelism with OpenMP* Tasking Model


The new-generation Intel® Xeon® Scalable processor family (formerly code-named Skylake-SP), Intel’s most scalable processor to date, has up to 28 cores per socket with options to scale from 2 to 8 sockets. The Intel® Xeon Phi™ processor provides massive parallelism with up to 72 cores per package. Hardware keeps introducing more parallelism capabilities that software must exploit.

This is not always easy, however: there may not be enough parallel tasks, temporary memory may grow sharply as the thread count grows, or the load may be imbalanced. In these cases, nested parallelism can help scale the number of parallel tasks across multiple levels. It can also limit the growth of temporary memory by sharing memory and parallelizing at one level while enclosed by another parallel region.

There are two ways to enable nested parallelism with OpenMP*. One is explicitly documented in the OpenMP specification: set the OMP_NESTED environment variable or call the omp_set_nested runtime routine. There are good examples and explanations of this topic in the online tutorial OpenMP Lab on Nested Parallelism and Task.

The other is to use the OpenMP tasking model. Compared to the worksharing constructs in OpenMP, the tasking constructs provide more flexibility in supporting various kinds of parallelism: they can be nested inside a parallel region, inside other task constructs, or inside worksharing constructs. With the introduction of taskloop reduction and taskgroup, this becomes even more useful.

Here we use an example to demonstrate how to apply nested parallelism in different ways.

void fun1()
{
    for (int i=0; i<80; i++)
        ...
}


void main()
{
#pragma omp parallel
   {
#pragma omp for
       for (int i=0; i<100; i++)
           ...

#pragma omp for
       for (int i=0; i<10; i++)
           fun1();
    }
}

In the above example, the second loop in main has a small trip count, so omp for can distribute it to at most 10 threads. However, there are 80 loop iterations in fun1, which is called 10 times in the main loop. The product of the trip counts of fun1's loop and the main loop yields 800 iterations in total! This offers much more potential parallelism if parallelism can be exploited at both levels.

Here is how nested parallel regions work:

void fun1()
{
#pragma omp parallel for
    for (int i=0; i<80; i++)
        ...
}

void main()
{
#pragma omp parallel
    {
        #pragma omp for
        for (int i=0; i<100; i++)
            ...

        #pragma omp for
        for (int i=0; i<10; i++)
            fun1();
    }
}

The problem with this implementation is that you may either have insufficient threads for the first main loop, which has the larger trip count, or create an exploding number of threads for the second main loop when OMP_NESTED=TRUE. A simple workaround is to split the parallel region in main and create a separate one for each loop, each with a distinct thread number specified, as sketched below.
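
A rough sketch of that workaround (the thread counts are illustrative; fun1 keeps its own parallel for, as in the listing above):

void main()
{
    /* Region sized for the 1st loop, which has the larger trip count. */
#pragma omp parallel num_threads(100)
    {
#pragma omp for
        for (int i=0; i<100; i++)
            ...
    }

    /* Separate, smaller region for the 2nd loop; each fun1() call opens its
       own nested parallel region when OMP_NESTED=TRUE. */
#pragma omp parallel num_threads(10)
    {
#pragma omp for
        for (int i=0; i<10; i++)
            fun1();
    }
}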

In contrast, here's how omp tasking works:

void fun1()
{
#pragma omp taskloop
     for (int i = 0; i < 80; i++)
         ...
}

void main()
{
#pragma omp parallel
    {
#pragma omp for
        for (int i=0; i<100; i++)
            ...
      
#pragma omp for
        for (int i=0; i<10; i++)
            fun1();
    }
}

As you can see, you don't have to worry about changing the thread number between the first and second main loops. Even though only a small number of threads (10) receive work from the second main loop's omp for, the remaining available threads can pick up the tasks generated by omp taskloop in fun1.

In general, OpenMP nested parallel regions are a way to distribute work by creating (forking) more threads. In OpenMP, the parallel region is the only construct that determines the number of execution threads and controls thread affinity. Using nested parallel regions means each thread in the parent region spawns multiple threads in the enclosed region, multiplying the total thread count.

OpenMP tasking offers another way to expose parallelism: by adding more tasks instead of more threads. Although the thread count stays as specified at the entry of the parallel region, the additional tasks created by the nested tasking constructs can be distributed to and executed by any available or idle thread in the team of the same parallel region. This makes it possible to use all threads' capacity and improves load balance automatically.

With the introduction of omp taskloop, taskloop reduction, omp taskgroup, and task reductions in taskgroups, the OpenMP tasking model becomes an even more powerful solution for nested parallelism. For more details on these new features in the OpenMP 5.0 TR, please refer to OpenMP* 5.0 support in Intel® Compiler 18.0.
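
As a minimal sketch of a taskloop reduction (assuming a compiler with OpenMP 5.0 task-reduction support; the function and variable names are illustrative):

float sum_array(const float *a, int n)
{
    float sum = 0.0f;
    /* Expected to be called from inside a parallel region (for example, by one
       thread in a single construct) so the generated tasks can be executed by
       the whole team. */
#pragma omp taskgroup task_reduction(+:sum)
    {
#pragma omp taskloop in_reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += a[i];
    }
    return sum;
}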

Please note that we have also received reports of a known issue with nested parallelism and the reduction clause in the initial 18.0 release. This issue is expected to be fixed in 2018 Update 1, which will be available soon.

Recipe: Building NAMD on Intel® Xeon® and Intel® Xeon Phi™ Processors on a Single Node


By: Alexander Bobyr, Mikhail Shiryaev, and Smahane Douyeb

For cluster runs, please refer to the recipe: Building NAMD on Intel® Xeon® and Intel® Xeon Phi™ Processors for Multi-node Runs

Purpose

This recipe describes a step-by-step process for getting, building, and running NAMD (scalable molecular dynamics code) on the Intel® Xeon Phi™ processor and Intel® Xeon® processor E5 family to achieve better performance.

Introduction

NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecule systems. Based on Charm++ parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 500,000 cores for the largest simulations. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM, and X-PLOR.

NAMD is distributed free of charge with source code. You can build NAMD yourself or download binaries for a wide variety of platforms. Below are the details for how to build NAMD on the Intel Xeon Phi processor and Intel Xeon processor E5 family. You can learn more about NAMD at http://www.ks.uiuc.edu/Research/namd/.

Building and Running NAMD on the Intel® Xeon® Processor E5-2697 v4 (formerly Broadwell (BDW)), Intel® Xeon Phi™ Processor 7250 (formerly Knights Landing (KNL)), and Intel® Xeon® Gold 6148 Processor (formerly Skylake (SKX))

Download the Code

  1. Download the latest NAMD source code from this site: http://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD
  2. Download the Charm++ 6.7.1 version.

    a. You can get Charm++ from the nightly build version of the NAMD source code.

    b. Or download it separately: http://charmplusplus.org/download/

  3. Download the fftw3 version: http://www.fftw.org/download.html

    Version 3.3.4 is used in this run.

  4. Download the apoa1 and stmv workloads: http://www.ks.uiuc.edu/Research/namd/utilities/

Build the Binaries

  1. Set environment for compilation:
    CC=icc; CXX=icpc; F90=ifort; F77=ifort
    export CC CXX F90 F77
    source /opt/intel/compiler/<version>/compilervars.sh intel64​
  2. Build fftw3:

    a. cd <fftw_root_path>

    b. ./configure --prefix=<fftw_install_path> --enable-single --disable-fortran CC=icc
       (Use -xCORE-AVX512 for SKX, -xMIC-AVX512 for KNL, and -xCORE-AVX2 for BDW.)

    c. make CFLAGS="-O3 -xMIC-AVX512 -fp-model fast=2 -no-prec-div -qoverride-limits" clean install
  3. Build a multicore version of Charm++:

    a. cd <charm_root_path>

    b. ./build charm++ multicore-linux64 iccstatic --with-production "-O3 -ip"
  4. Build NAMD:

    a. Modify the arch/Linux-x86_64-icc to look like the following (select one of the FLOATOPTS options depending on the CPU type):

    NAMD_ARCH = Linux-x86_64
    CHARMARCH = multicore-linux64-iccstatic
    
    # For KNL
    FLOATOPTS = -ip -xMIC-AVX512  -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
    
    # For SKX
    FLOATOPTS = -ip -xCORE-AVX512  -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
    
    # For BDW
    FLOATOPTS = -ip -xCORE-AVX2  -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
    
    CXX = icpc -std=c++11 -DNAMD_KNL
    CXXOPTS = -static-intel -O2 $(FLOATOPTS)
    CXXNOALIASOPTS = -O3 -fno-alias $(FLOATOPTS) -qopt-report-phase=loop,vec -qopt-report=4
    CXXCOLVAROPTS = -O2 -ip
    CC = icc
    COPTS = -static-intel -O2 $(FLOATOPTS)
    

    b. Compile NAMD:

       i. ./config Linux-x86_64-icc --charm-base <charm_root_path> --charm-arch multicore-linux64-iccstatic --with-fftw3 --fftw-prefix <fftw_install_path> --without-tcl --charm-opts -verbose

       ii. gmake -j

Other System Setup

  1. Change the kernel settings for KNL: “nmi_watchdog=0 rcu_nocbs=2-271 nohz_full=2-271”. Here is one way to change the settings (this could be different for every system):

    a. To be safe, first save your original grub.cfg:

    cp /boot/grub2/grub.cfg /boot/grub2/grub.cfg.ORIG

    b. In “/etc/default/grub” add (append) the following to

    “GRUB_CMDLINE_LINUX”: nmi_watchdog=0 rcu_nocbs=2-271 nohz_full=2-271

    c. Save your new configuration:

    grub2-mkconfig -o /boot/grub2/grub.cfg 

    d. Reboot the system. After logging in, verify the settings with “cat /proc/cmdline”.

  2. Change the following lines in the *.namd file for both workloads:

    numsteps 1000

    outputtiming 20

    outputenergies 600

Run NAMD

  • on SKX/BDW (ppn = 40 / ppn = 72, respectively):
    ./namd2 +p $ppn apoa1/apoa1.namd +pemap 0-($ppn-1)
  • on KNL (ppn = 136, i.e., 2 hyper-threads per core; MCDRAM in flat mode, with similar performance in cache mode):
    numactl -p 1 ./namd2 +p $ppn apoa1/apoa1.namd +pemap 0-($ppn-1)

KNL Example

numactl -p 1 <namd_root_path>/Linux-KNL-icc/namd2 +p 136 apoa1/apoa1.namd +pemap 0-135

Performance results reported in the Intel Salesforce repository (ns/day; higher is better):

Workload | 2S Intel® Xeon® Processor E5-2697 v4 18c 2.3 GHz (ns/day) | Intel® Xeon Phi™ Processor 7250 bin1 (ns/day) | Intel® Xeon Phi™ Processor 7250 versus 2S Intel® Xeon® Processor E5-2697 v4 (speedup)
stmv     | 0.45 | 0.55 | 1.22x
apoa1    | 5.5  | 6.18 | 1.12x

Workload       | 2S Intel® Xeon® Gold 6148 Processor 20c 2.4 GHz (ns/day) | Intel® Xeon Phi™ Processor 7250 versus 2S Intel® Xeon® Processor E5-2697 v4 (speedup)
stmv           | 0.73 | 1.44x
apoa1 original | 7.68 | 1.43x
apoa1          | 8.70 | 1.44x

Systems configuration

Processor                   | Intel® Xeon® Processor E5-2697 v4 | Intel® Xeon® Gold 6148 Processor | Intel® Xeon Phi™ Processor 7250
Stepping                    | 1 (B0) | 1 (B0) | 1 (B0) Bin1
Sockets / TDP               | 2S / 290W | 2S / 300W | 1S / 215W
Frequency / Cores / Threads | 2.3 GHz / 36 / 72 | 2.4 GHz / 40 / 80 | 1.4 GHz / 68 / 272
DDR4                        | 8x16 GB 2400 MHz (128 GB) | 12x16 GB 2666 MHz (192 GB) | 6x16 GB 2400 MHz
MCDRAM                      | N/A | N/A | 16 GB Flat
Cluster/Snoop Mode/Mem Mode | Home | Home | Quadrant/flat
Turbo                       | On | On | On
BIOS                        | GRRFSDP1.86B0271.R00.1510301446 |  | GVPRCRB1.86B.0010.R02.1608040407
Compiler                    | ICC-2017.0.098 | ICC-2016.4.298 | ICC-2017.0.098
Operating System            | Red Hat Enterprise Linux* 7.2 (3.10.0-327.e17.x86_64) | Red Hat Enterprise Linux 7.3 (3.10.0-514.6.2.0.1.el7.x86_64.knl1) | Red Hat Enterprise Linux 7.2 (3.10.0-327.22.2.el7.xppsl_1.4.1.3272._86_64)

About the Authors

Alexander Bobyr is a CRT application engineer at the INNL lab at Intel, supporting and providing feedback for HPC deals and software tools. He serves as a technical expert and representative for SPEC HPG. Alexander has a Bachelor’s degree in Intelligent Systems and a Master’s degree in Artificial Intelligence from the Moscow Power Engineering Institute, Russia.

Mikhail Shiryaev is a Software Development Engineer in Software and Services Group (SSG) at Intel. He is part of the Cluster Tools team working on the development of Intel MPI and Intel MLSL libraries. His major interests are high performance computing, distributed systems and distributed deep learning. Mikhail received his Master’s degree and his Bachelor’s degree in Software Engineering from Lobachevsky State University of Nizhny Novgorod, Russia.

Smahane Douyeb is currently working as a Software Apps Engineer in the Software and Services Group (SSG) at Intel. Part of her job is to run and validate recipes and benchmarks for various HPC platforms for competitive testing purposes. She also works on HPC Python apps optimization on some Intel platforms. She received her Software Engineering Bachelor’s degree from Oregon Institute of Technology. She is very passionate about growing and learning to achieve her dream of becoming a Principal Engineer.

Recipe: Building NAMD on Intel® Xeon® and Intel® Xeon Phi™ Processors for Multi-node Runs


By: Alexander Bobyr, Mikhail Shiryaev, and Smahane Douyeb


For single-node runs, refer to the recipe: Building NAMD on Intel® Xeon® and Intel® Xeon Phi™ Processors on a Single Node

Purpose

This recipe describes a step-by-step process for getting, building, and running NAMD (scalable molecular dynamics code) on the Intel® Xeon Phi™ processor and Intel® Xeon® processor family to achieve better performance.

Introduction

NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecule systems. Based on Charm++ parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 500,000 cores for the largest simulations. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM, and X-PLOR.

NAMD is distributed free of charge with source code. You can build NAMD yourself or download binaries for a wide variety of platforms. Below are the details for how to build NAMD on Intel Xeon Phi processor and Intel Xeon processor E5 family. You can learn more about NAMD at http://www.ks.uiuc.edu/Research/namd/.

Building and Running NAMD for Clusters on the Intel® Xeon® Processor E5-2697 v4 (formerly Broadwell (BDW)), Intel® Xeon Phi™ Processor 7250 (formerly Knights Landing (KNL)), and Intel® Xeon® Gold 6148 Processor (formerly Skylake (SKX))

Download the code

  1. Download the latest NAMD source code from this site: http://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD
  2. Download Open Fabric Interfaces (OFI). NAMD uses Charm++/OFI for multi-node.
    • You can use the installed OFI library, which comes with the IFS package, or download and build it manually.
    • To check the version of the installed OFI use the “fi_info --version” command (OFI1.4.2 was used here).
    • The OFI library can be downloaded from https://github.com/ofiwg/libfabric/releases.
  3. Download Charm++ with OFI support:

    From here: http://charmplusplus.org/download/
    or
    git clone: http://charm.cs.illinois.edu/gerrit/charm.git

  4. Download the fftw3 version: http://www.fftw.org/download.html

    Version 3.3.4 is used in this run.

  5. Download the apoa1 and stmv workloads: http://www.ks.uiuc.edu/Research/namd/utilities/

Build the Binaries

  1. Set the environment for compilation:
    CC=icc; CXX=icpc; F90=ifort; F77=ifort
    export CC CXX F90 F77
    source /opt/intel/compiler/<version>/compilervars.sh intel64
  2. Build the OFI library (you can skip this step if you want to use the installed OFI library):
    1. cd <libfabric_root_path>
    2. ./autogen.sh
    3. ./configure --prefix=<libfabric_install_path> --enable-psm2
    4. make clean && make -j12 all && make install
    5. custom OFI can be used further using LD_PRELOAD or LD_LIBRARY_PATH:

    export LD_LIBRARY_PATH=<libfabric_install_path>/lib:${LD_LIBRARY_PATH}
    mpiexec.hydra …
    or
    LD_PRELOAD=<libfabric_install_path>/lib/libfabric.so mpiexec.hydra …

  3. Build fftw3:
    1. cd <fftw_root_path>
    2. ./configure --prefix=<fftw_install_path> --enable-single --disable-fortran CC=icc
      (Use -xCORE-AVX512 for SKX, -xMIC-AVX512 for KNL, and -xCORE-AVX2 for BDW.)
    3. make CFLAGS="-O3 -xMIC-AVX512 -fp-model fast=2 -no-prec-div -qoverride-limits" clean install
  4. Build multi-node version of Charm++:
    1. cd <charm_root_path>
    2. ./build charm++ ofi-linux-x86_64 icc smp --basedir <libfabric_root_path> --with-production "-O3 -ip" -DCMK_OPTIMIZE
  5. Build NAMD:
    1. Modify the arch/Linux-x86_64-icc to look like the following (select one of the FLOATOPTS options depending on the CPU type):
      NAMD_ARCH = Linux-x86_64
      CHARMARCH = multicore-linux64-iccstatic
      
      # For KNL
      FLOATOPTS = -ip -xMIC-AVX512  -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
      
      # For SKX
      FLOATOPTS = -ip -xCORE-AVX512  -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
      
      # For BDW
      FLOATOPTS = -ip -xCORE-AVX2  -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
      
      CXX = icpc -std=c++11 -DNAMD_KNL
      CXXOPTS = -static-intel -O2 $(FLOATOPTS)
      CXXNOALIASOPTS = -O3 -fno-alias $(FLOATOPTS) -qopt-report-phase=loop,vec -qopt-report=4
      CXXCOLVAROPTS = -O2 -ip
      CC = icc
      COPTS = -static-intel -O2 $(FLOATOPTS)
    2. Compile NAMD
      1. ./config Linux-x86_64-icc --charm-base <charm_root_path> --charm-arch ofi-linux-x86_64-smp-icc --with-fftw3 --fftw-prefix <fftw_install_path> --without-tcl --charm-opts -verbose
      2. cd Linux-x86_64-icc
      3. make clean && gmake -j
  6. Build memopt NAMD binaries:

     Like the BDW/KNL build, but with the extra option "--with-memopt" passed to config.

Other Setup

Change the following lines in the *.namd file for both the stmv and apoa1 workloads:
numsteps 1000
outputtiming 20
outputenergies 600

Run the Binaries

  1. Set the environment for launching:
    1. source /opt/intel/compiler/<version>/compilervars.sh intel64
    2. source /opt/intel/impi/<version>/intel64/bin/mpivars.sh
    3. specify host names to run on in “hosts” file
    4. export MPIEXEC="mpiexec.hydra -hostfile ./hosts"
    5. export PSM2_SHAREDCONTEXTS=0 (if you use PSM2 < 10.2.85)
  2. Launch the task (for example with N nodes, with 1 process per node and PPN cores):
    1. $MPIEXEC -n N -ppn 1 ./namd2 +ppn (PPN-1) <workload_path> +pemap 1-(PPN-1) +commap 0

      For example for BDW (PPN=72):
      $MPIEXEC -n 8 -ppn 1 ./namd2 +ppn 71 <workload_path> +pemap 1-71 +commap 0

      For example for KNL (PPN=68, without hyper-threads):
      $MPIEXEC -n 8 -ppn 1 ./namd2 +ppn 67 <workload_path> +pemap 1-67 +commap 0

      For example for KNL (with 2 hyper-threads per core):
      $MPIEXEC -n 8 -ppn 1 ./namd2 +ppn 134 <workload_path> +pemap 0-66+68 +commap 67
    2. For KNL with MCDRAM in flat mode:
      $MPIEXEC -n N -ppn 1 numactl -p 1 ./namd2 +ppn (PPN-1) <workload_path> +pemap 1-(PPN-1) +commap 0

Remarks

To achieve better scaling on multiple nodes, increase the number of communication threads (1, 2, 4, 8, 13, 17). For example, the following command is for N KNL nodes with 17 processes per node and 8 threads per process (7 worker threads and 1 communication thread):

$MPIEXEC -n $(($N*17)) -ppn 17 numactl -p 1 ./namd2 +ppn 7 <workload_path> +pemap 0-67,68-135:4.3 +commap 71-135:4

Basic Charm++/OFI knobs (should be added as NAMD parameters)

  • +ofi_eager_maxsize: (default: 65536) Threshold between buffered and RMA paths
  • +ofi_cq_entries_count: (default: 8) Maximum number of entries to read from the completion queue with each call to fi_cq_read().
  • +ofi_use_inject: (default: 1) Whether to use buffered send.
  • +ofi_num_recvs: (default: 8) Number of pre-posted receive buffers.
  • +ofi_runtime_tcp: (default: off) During the initialization phase, the OFI EP names need to be exchanged among all nodes.
    By default, the exchange is done with both PMI and OFI. If this flag is set, the exchange is done with PMI only.

For example:

$MPIEXEC -n 2 -ppn 1 ./namd2 +ppn 1 <workload_path> +ofi_eager_maxsize 32768 +ofi_num_recvs 16

Best performance results reported on a cluster of up to 128 Intel® Xeon Phi™ processor nodes (ns/day; higher is better):

Workload / Nodes (2 HT) | 1    | 2    | 4    | 8    | 16
stmv (ns/day)           | 0.55 | 1.05 | 1.86 | 3.31 | 5.31

Workload / Nodes (2 HT) | 8     | 16    | 32    | 64   | 128
stmv.28M (ns/day)       | 0.152 | 0.310 | 0.596 | 1.03 | 1.91

About the Authors

See the author biographies in the single-node recipe above.

How to use the MPI-3 Shared Memory in Intel® Xeon Phi™ Processors


This whitepaper introduces the MPI-3 shared memory feature, the corresponding APIs, and a sample program to illustrate the use of MPI-3 shared memory in the Intel® Xeon Phi™ processor.

Introduction to MPI-3 Shared Memory

MPI-3 shared memory is a feature introduced in version 3.0 of the message passing interface (MPI) standard. It is implemented in Intel® MPI Library version 5.0.2 and beyond. MPI-3 shared memory allows multiple MPI processes to allocate and have access to the shared memory in a compute node. For applications that require multiple MPI processes to exchange huge local data, this feature reduces the memory footprint and can improve performance significantly.

In the MPI standard, each MPI process has its own address space. With MPI-3 shared memory, each MPI process exposes its own memory to other processes. The following figure illustrates the concept of shared memory: Each MPI process allocates and maintains its own local memory, and exposes a portion of its memory to the shared memory region. All processes then can have access to the shared memory region. Using the shared memory feature, users can reduce the data exchange among the processes.

Data exchange among the processes

By default, the memory created by an MPI process is private. It is best to use MPI-3 shared memory when only memory needs to be shared and all other resources remain private. As each process has access to the shared memory region, users need to pay attention to process synchronization when using shared memory.

Sample Code

In this section, sample code is provided to illustrate the use of MPI-3 shared memory.

A total of eight MPI processes are created on the node. Each process maintains a long array of 32 million elements. For each element j in the array, the process updates its value based on the current value and the values of element j in the arrays of its two nearest processes; the same procedure is applied to the whole array. The following pseudo-code shows the computation when the program runs with eight MPI processes for 64 iterations:

Repeat the following procedure 64 times:
    for each MPI process n from 0 to 7:
        for each element j in the array A_n:
            A_n[j] ← 0.5*A_n[j] + 0.25*A_previous[j] + 0.25*A_next[j]

where A_n is the long array belonging to process n, and A_n[j] is the value of element j in that array. In this program, since each process exposes its local memory, all processes have access to all arrays, although each process only needs the two neighboring arrays (for example, process 0 needs data from processes 1 and 7, process 1 needs data from processes 0 and 2, and so on).

Shared Memory Diagram

Besides the basic APIs used for MPI programming, the following MPI-3 shared memory APIs are introduced in this example (a minimal usage sketch follows the list):

  • MPI_Comm_split_type: Used to create a new communicator where all processes share a common property. In this case, we pass MPI_COMM_TYPE_SHARED as an argument in order to create a shared memory from a parent communicator such as MPI_COMM_WORLD, and decompose the communicator into a shared memory communicator shmcomm.
  • MPI_Win_allocate_shared: Used to create a shared memory that is accessible by all processes in the shared memory communicator. Each process exposes its local memory to all other processes, and the size of the local memory allocated by each process can be different. By default, the total shared memory is allocated contiguously. The user can pass an info hint “alloc_shared_noncontig” to specify that the shared memory does not have to be contiguous, which can cause performance improvement, depending on the underlying hardware architecture. 
  • MPI_Win_free: Used to release the memory.
  • MPI_Win_shared_query: Used to query the address of the shared memory of an MPI process.
  • MPI_Win_lock_all and MPI_Win_unlock_all: Used to start an access epoch to all processes in the window. Only shared epochs are needed. The calling process can access the shared memory on all processes.
  • MPI_Win_sync: Used to ensure the completion of copying the local memory to the shared memory.
  • MPI_Barrier: Used to block the caller process on the node until all processes reach a barrier. The barrier synchronization API works across all processes.
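
The following is a minimal sketch (not the article's downloadable sample) showing how these APIs fit together; the array size, neighbor choice, and variable names are illustrative, and error checking is omitted:

#include <mpi.h>
#include <stdio.h>

#define NELEMS (1024*1024)

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Split MPI_COMM_WORLD into a communicator of processes that can share memory. */
    MPI_Comm shmcomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &shmcomm);

    int rank, size;
    MPI_Comm_rank(shmcomm, &rank);
    MPI_Comm_size(shmcomm, &size);

    /* Each process contributes NELEMS doubles to the shared window. The info
       hint "alloc_shared_noncontig" could be passed instead of MPI_INFO_NULL
       to allow a non-contiguous layout. */
    double *my_base;
    MPI_Win win;
    MPI_Win_allocate_shared(NELEMS * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, shmcomm, &my_base, &win);

    /* Query the base address of the left neighbor's segment. */
    int left = (rank + size - 1) % size;
    MPI_Aint seg_size;
    int disp_unit;
    double *left_base;
    MPI_Win_shared_query(win, left, &seg_size, &disp_unit, &left_base);

    /* Start a shared access epoch, fill our segment, and make it visible. */
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
    for (int j = 0; j < NELEMS; j++)
        my_base[j] = rank;
    MPI_Win_sync(win);
    MPI_Barrier(shmcomm);

    /* It is now safe to read the neighbor's data directly through left_base. */
    printf("rank %d sees left neighbor value %.1f\n", rank, left_base[0]);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}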

Basic Performance Tuning for Intel® Xeon Phi™ Processor

This test is run on an Intel Xeon Phi processor 7250 at 1.40 GHz with 68 cores, installed with Red Hat Enterprise Linux* 7.2, Intel® Xeon Phi™ Processor Software 1.5.1, and Intel® Parallel Studio 2017 Update 2. By default, the Intel compiler will try to vectorize the code, and each MPI process has a single thread of execution. An OpenMP* pragma is added at loop level for later use. To compile and run the code, use the following command lines; the generated binary is mpishared.out:

$ mpiicc mpishared.c -qopenmp -o mpishared.out
$ mpirun -n 8 ./mpishared.out
Elapsed time in msec: 5699 (after 64 iterations)

To exploit thread parallelism, run four OpenMP threads per MPI rank, and re-compile with -xMIC-AVX512 to take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions:

$ mpiicc mpishared.c -qopenmp -xMIC-AVX512 -o mpishared.out
$ export OMP_NUM_THREADS=4
$ mpirun -n 8 ./mpishared.out
Elapsed time in msec: 4535 (after 64 iterations)

As MCDRAM in this system is currently configured as flat, the Intel Xeon Phi processor appears as two NUMA nodes. Node 0 contains all the CPUs and the on-platform DDR4 memory, while node 1 holds the on-package MCDRAM memory:

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 98200 MB
node 0 free: 92775 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15925 MB
node distances:
node   0   1
  0:  10  31
  1:  31  10

To allocate the memory in the MCDRAM (node 1), pass the argument -m 1 to the command numactl as follows:

$ numactl -m 1 mpirun -n 8 ./mpishared.out
Elapsed time in msec: 3070 (after 64 iterations)

This simple optimization technique greatly improves performance.

Summary

This whitepaper introduced the MPI-3 shared memory feature, followed by sample code, which used MPI-3 shared memory APIs. The pseudo-code explained what the program is doing along with an explanation of shared memory APIs. The program ran on an Intel Xeon Phi processor, and it was further optimized with simple techniques.

Reference

  1. MPI Forum, MPI 3.0
  2. Message Passing Interface Forum, MPI: A Message-Passing Interface Standard Version 3.0
  3. The MIT Press, Using Advanced MPI
  4. James Reinders, Jim Jeffers, Publisher: Morgan Kaufmann, Chapter 16 - MPI-3 Shared Memory Programming Introduction, High Performance Parallelism Pearls Volume Two

Appendix

The code of the sample MPI program is available for download.

Demo: Software Defined Visualization Using Intel® Xeon Phi™ Processor


In this demo we showcase the use of the Intel® Xeon Phi™ processor to do a 3D visualization of a tumor in a human brain. This can help advance research in the medical field by enabling precise detection and removal of something like a tumor in the human brain.

More information

The tool used for visualization is Paraview, with OSPRay as the rendering library.

Prerequisites

Intel® Xeon Phi™ processor system with CentOS 7.2 Linux* (internet enabled)

Open a terminal in your work area directory and follow the steps below:

  1. Create directory for the demo

    mkdir Intel_brain_demo

  2. Change directory

    cd Intel_brain_demo

  3. Create two directories under this

    mkdir paraview
    mkdir ospray

  4. Access the files from Dropbox:

    https://www.dropbox.com/s/wj0qp1clxv5xssv/SC_2016_BrainDemo.tar.gz?dl=0

  5. Copy the Paraview and Ospray tar files into the respective directories you created in steps above

    mv SC_2016_BrainDemo/paraview_sc_demo.tgz paraview/
    mv SC_2016_BrainDemo/ospray.tgz ospray/

  6. Untar each of the *.tgz files in its respective directory

    tar -xzvf *.tgz

  7. Point the library path

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<…../Intel_brain_demo/ospray/install/lib64>

  8. Optional step: set the Qt graphics system variable, only if Paraview doesn’t load normally

    export QT_GRAPHICSSYSTEM=gtk

  9. Change directory to paraview/install where the binaries are

    cd paraview/install

  10. Run Paraview

    ./bin/paraview

  11. Once Paraview loads

    Select File/Load State

  12. Then load the brain_demo.pvsm state file from the SC_2016_BrainDemo package that you downloaded in the step above

  13. Paraview will then ask you to load the VTK files; click the “...” button to select the appropriate *tumor1.vtk file, then the *tumor2.vtk file, and then the *Tumor1.vtk file, in that order, on your local machine. Then click OK.

  14. An Output Messages pop-up window will appear with warnings. Ignore the warnings and click Close, and you should see something like the following:

  15. Now you can go to File/Save State and save this state. From then on, you can load this state file and skip the previous step of locating the data files.
  16. Then, on the Properties tab on the left side, enable OSPRay for every view (all the RenderViews 1/2/3) by selecting each view and clicking Enable OSPRay

  17. Once you do that you should see the images for all three views look as below:

  18. You can also rotate the views and see how they look.

A few issues and how to resolve them

Missing OpenGL: install Mesa for OpenGL

sudo yum -y install mesa-libGL
sudo yum -y install mesa-libGL-devel

libQtGui.so.4 error: install the qt-x11 package

yum -y install qt-x11

Acknowledgements

Special thanks to Carson Brownlee and James Jeffers from Intel Corporation for all their contributions and support. Without their efforts, it wouldn’t have been possible to get this demo running.

References

  1. http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html
  2. https://software.intel.com/en-us/blogs/Intel-Parallel-Studio-XE-2016
  3. https://gitlab.kitware.com/carson/paraview
  4. https://gitlab.kitware.com/carson/vtk
  5. http://www.ospray.org
  6. http://www.ospray.org/getting_ospray.html
  7. http://dap.xeonphi.com
  8. https://ispc.github.io/downloads.html
  9. https://www.threadingbuildingblocks.org
  10. https://en.wikipedia.org/wiki/Software_rendering