
Case Study: BerkeleyGW using Intel® Xeon Phi™ Processors


BerkeleyGW is a materials science application for calculating the excited-state properties of materials, such as band gaps, band structures, absorption spectroscopy, photoemission spectroscopy, and more. It requires as input the Kohn-Sham orbitals and energies from a DFT code such as Quantum ESPRESSO, PARATEC, or PARSEC. Like those DFT codes, it depends heavily on FFTs, dense linear algebra, and tensor-contraction operations similar in nature to those found in quantum chemistry applications.

The target science application for the Cori timeframe is to study realistic interfaces in organic photovoltaics. Such systems require 1000+ atoms and a considerable amount of vacuum, which adds to the computational complexity. GW calculations generally scale as the number of atoms to the fourth power (the vacuum space roughly counting as additional atoms), so this problem is 2-5 times bigger than anything done in the past. Therefore, successfully completing these runs on Cori requires not only taking advantage of the compute capabilities of the Intel® Xeon Phi™ processor architecture but also improving the scalability of the code in order to reach full-machine capability.

Check out the entire paper: http://www.nersc.gov/users/computational-systems/cori/application-porting-and-performance/application-case-studies/berkeleygw-case-study/

Lessons Learned

1. Optimal performance for this code required restructuring to enable optimal thread scaling, vectorization and improved data reuse.

2. Long loops are best for vectorization. In the limit of long loops, effects of loop peeling and remainders can be neglected.

3. There are many coding practices that prevent compiler auto-vectorization of code. The use of profilers and compiler reports can greatly aid in producing vectorizable code.

4. The absence of L3 cache on the Intel® Xeon Phi™ architecture makes data locality even more important than on traditional Intel® Xeon® architectures.

5. Optimization is a continuous process. The limiting factor in code performance may shift among I/O, communication, memory bandwidth, latency, and CPU clock speed as you continue to optimize.

 


Improve Performance with Vectorization


This article focuses on the steps to improve software performance with vectorization. Included are examples of full applications along with some simpler cases to illustrate the steps to vectorization. As hardware moves forward, adding more cores and wider vector registers, software must modernize to match the hardware and deliver higher levels of parallelism and vectorization. This increased performance makes it possible to solve more complex problems at finer resolution. A previous article, Recognizing and Measuring Vectorization Performance, explained how to measure how effectively the software was vectorized.

In this article the terms SIMD and vectorization are used synonymously. A brief review of vectorization is discussed in the introduction. SIMD stands for Single Instruction Multiple Data; the same instruction is used on multiple data elements simultaneously. Intel® Advanced Vector Extensions 512 (Intel® AVX-512) provides registers that are 512 bits wide. One of these registers may be filled with 8 double precision floating point values, or 16 single precision floating point values, or 16 integers. When the register is fully populated a single instruction is applied that operates on 8 to 16 values (depending on the types of values and instruction used) simultaneously (see Figure 1).

Figure 1: Example of SIMD operations on zmm Intel® Advanced Vector Extensions 512 registers.

When the Intel AVX-512 registers are filled across all available lanes, performance can be 8 to 16 times faster. Typically there are many scalar operations to perform, as well as some shuffling and data movement, so a speedup of 8 to 16 is not delivered for a full application (though performance still increases significantly). Intel® Advanced Vector Extensions 2 (Intel® AVX2) and Intel® Advanced Vector Extensions (Intel® AVX) instructions are half the width of Intel AVX-512 instructions, and Intel® Streaming SIMD Extensions 4 instructions are half the width of Intel AVX and Intel AVX2 instructions. All deliver increasing levels of performance.
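To make this concrete, here is a minimal sketch of the kind of loop a compiler can auto-vectorize when the iterations are independent. The restrict qualifiers and the compiler flags mentioned in the comment are assumptions about a typical Intel compiler build of that era (for example, icc -O3 -xMIC-AVX512 -qopt-report=2); check your own toolchain for the exact options.

#include <stddef.h>

/* Independent iterations: the compiler can pack 8 doubles (or 16 floats)
   per 512-bit zmm register and process them with a single instruction. */
void triad(double *restrict a, const double *restrict b,
           const double *restrict c, double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}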

Taking software to new performance levels with vectorization

How does a software developer go from scalar code to vector mode? There are several steps to follow:

  1. Use vectorized libraries (for example, the Intel® Math Kernel Library)
  2. Think SIMD (think what is done repetitively or multiple times)
  3. Do a performance analysis (collect data, and then improve)
  4. Express SIMD

The first three steps are covered in the article referenced above. This article focuses on expressing SIMD and code modifications and includes the following:

  • Use pragmas and directives
  • Write SIMD-enabled functions
  • Remove loop invariant dependencies
  • Expand temporary variables to short vectors
  • Pay attention to data alignment
  • Improve data layout

Pragmas/Directives and dependence

Compilers do a great job of vectorization. There are still times when a compiler is not able to disambiguate memory references inside a repeated do loop to safely guarantee that there are no dependencies between one loop iteration and another. When the compiler cannot determine data independence it will produce scalar code instead of vectorized code. Other times the analysis for the compiler may become too complex, and the compiler may stop and produce scalar code rather than vectorized code.

The most common method to improve vectorization is the addition of pragmas or directives. A compiler pragma (C/C++) or directive (Fortran*) gives the compiler more information about the code segment than it can discern from the code alone allowing the compiler to perform more optimizations. The most common problem for which pragmas or directives are applied is pointer disambiguation, which is the inability to determine if it is safe to vectorize a for or do loop. If you know that the data entering the do (or for) loop does not overlap, you can add a pragma or a directive instructing the compiler to ignore the potential dependency and vectorize it.

Various pragmas and directives were developed by different compiler groups. The need to tell compilers to ignore potential data dependencies is so common that it is easier for software developers to have a common set of pragmas and directives recognized by all compilers rather than a proprietary set for each compiler. The OpenMP* organization filled that void by adding a SIMD pragma/directive to the OpenMP 4.0 specification in 2013. This way you can use a standard directive/pragma and know that it will be recognized by many compilers.

The syntax for this directive/pragma is:

#pragma omp simd
!$omp simd
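For example, when a function receives two pointers that the compiler cannot prove point to distinct arrays, the loop may be left scalar. The pragma asserts that the iterations are independent; a minimal sketch is shown below, and the correctness of that assertion is the programmer's responsibility.

/* Without more information, the compiler must assume a and b might overlap. */
void scale_add(float *a, const float *b, float s, int n)
{
    #pragma omp simd          /* assert: no loop-carried dependence */
    for (int i = 0; i < n; i++)
        a[i] = a[i] + s * b[i];
}

With the Intel compiler, such pragmas are typically enabled with -qopenmp or -qopenmp-simd.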

There are additional options to the SIMD pragma that may also be useful:

safelen(length), linear(list:linear-step), aligned(list[:alignment]), . . .

These additional parameters may be useful if the code contains something like:

for (i = 8; i < n; i++)
{
   . . .
   x[i] += x[i-8] * . . . ;
   . . .
}

There is a dependence between iterations. However, there is a safe length for this (the dependence only goes back 8 values) so the compiler can safely vectorize for a SIMD length of 8. In this case, use a pragma like:

#pragma omp simd safelen(8)

When you use the SIMD pragma/directive, you’re telling the compiler that you know enough about the program and data access that the assumption of no dependencies is safe. If you’re wrong, the code may deliver incorrect results. There is a quick sanity check some developers use: run the loop with its iteration order reversed. This is not a guarantee. If reversing the order of the loop changes the results, the code is not safe to vectorize or thread without changes; additional specifications such as safelen, an atomic operation, or other controls may be required to make the code safe to vectorize.

If the results are correct, it might be that for this particular dataset the order of operations did not matter, though it could matter with a different dataset. So failing the check clearly shows a dependence, while passing it is not a guarantee of safety. Tools such as Intel® Advisor XE provide a more rigorous analysis and can help identify dependencies and give you hints as to when it is safe to apply SIMD pragmas or directives.
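One rough way to run the reversed-order sanity check mentioned above is sketched below. It assumes the caller passes in two identical copies of the data; remember that only disagreement is conclusive, agreement proves nothing.

#include <math.h>

/* Run the suspect loop forward on one copy and backward on another,
   then compare. A difference proves a loop-carried dependence. */
int order_matters(double *x_fwd, double *x_bwd, int n)
{
    for (int i = 8; i < n; i++)        /* original iteration order */
        x_fwd[i] += 0.5 * x_fwd[i-8];

    for (int i = n - 1; i >= 8; i--)   /* reversed iteration order */
        x_bwd[i] += 0.5 * x_bwd[i-8];

    for (int i = 0; i < n; i++)
        if (fabs(x_fwd[i] - x_bwd[i]) > 1e-12)
            return 1;                  /* results differ: not safe */
    return 0;                          /* inconclusive */
}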

Functions

Frequently a function or subroutine call within a loop prevents vectorization. There are a couple of possibilities that may help in this situation. A short function can be written as a vectorizable function. Vector versions of many transcendental functions exist in math libraries used by the compiler. With the proper syntax, the compiler will generate both vector and scalar versions of a user-written function.

Let’s say there is a function that squares a number and adds a constant to the value. The function might be like this:

double sqadd(double a, double r)
{
   double t ;
   t = a*a + r ;
   return(t) ;
}

To tell the compiler to create a vector version of this function as well as a scalar version write it like this:

#pragma omp declare simd // tell compiler to generate a vector version
double sqadd(double a, double r)
{
   double t ;
   t = a*a + r ;
   return(t) ;
}

Next, at the places where the function is called, further instruct the compiler to vectorize the loop in the presence of a function call like this:

#pragma omp simd // instruct compiler to generate vector code even though loop invokes function
for (i=0 ; i<n; ++i)
{
   . . .
  anarray[i] = sqadd(r, t) ;
  . . .
}

In Fortran the correct directive to use is: 

!$omp declare simd(subroutine name) 

For example the above function in Fortran may be written as shown below to generate vectorizable implementations:

     real*8 function sqadd(r,t) result(qreal)
!$omp declare simd(sqadd)
         real*8 :: r
         real*8 :: t
         qreal = r*r + t
       end function sqadd

In the file where the function is invoked, I declared the interface like this:

      INTERFACE
         real*8 function sqadd(r,t)
!$omp declare simd(sqadd)
         real*8 r
         real*8 t
         end function sqadd
      END INTERFACE

The compiler vectorized the do loop containing the call to sqadd. If modules are used, and the function or subroutine is declared SIMD enabled, any file that uses that module can see the function/subroutines are SIMD enabled and then generate vectorized code. For more information on creating SIMD functions/subroutines in Fortran, see this article on explicit vector programming in Fortran.

Loop invariant dependencies

It is common to see do/for loops with an if/else statement in the middle of them where the condition of the if/else statement does not change within the loop. The presence of the conditional may prevent vectorization. This can be rewritten so that it is easier to vectorize. For example if the code looks something like:

for (int ii = 0 ; ii < n ; ++ii)
{
     . . .
     if (method == 0)
        ts[ii] = . . . . ;
    else
         ss[ii] = . . . ;
     . . .
}

This can be rewritten as:

if (method == 0)
    for (int ii = 0; ii < n ; ++ii)
    {
        . . .
       ts[ii] = . . . ;
       . . .
    }
else
   for (int ii = 0 ; ii < n; ++ii)
   {
       . . .
      ss[ii] = . . . ;
       . . .
   }

The above two techniques were applied to the MPAS-O oceanic codei. The MPAS-O code simulates the earth's ocean system and can work on time scales of months to thousands of years. It handles regions below 1 km as well as global circulation. Doug Jacobsenii from LANL participated in a short collaboration with ParaTools in which we used the TAU Performance System* to measure the vectorization intensity of this routine on the Intel® Xeon Phi™ coprocessor. Based on that data, we looked at where the report indicated low vector intensity and then examined the code and the compiler report.

Excerpts of the code are shown below:

 do k=3,maxLevelCell(iCell)-1
    if(vert4thOrder) then
       high_order_vert_flux(k, iCell) = &
          mpas_tracer_advection_vflux4( tracer_cur(k-2,iCell),tracer_cur(k-1,iCell), &
          tracer_cur(k,iCell),tracer_cur(k+1,iCell), w(k,iCell))
    else if(vert3rdOrder) then
       high_order_vert_flux(k, iCell) = &
          mpas_tracer_advection_vflux3( tracer_cur(k-2,iCell),tracer_cur(k-1,iCell), &
          tracer_cur(k,iCell),tracer_cur(k+1,iCell), w(k,iCell), coef_3rd_order )
    else if (vert2ndOrder) then
       verticalWeightK = verticalCellSize(k-1, iCell) / (verticalCellSize(k, iCell) +&
          verticalCellSize(k-1, iCell))
       verticalWeightKm1 = verticalCellSize(k, iCell) / (verticalCellSize(k, iCell) +&
          verticalCellSize(k-1, iCell))
       high_order_vert_flux(k,iCell) = w(k,iCell) * (verticalWeightK * tracer_cur(k,iCell) +&
          verticalWeightKm1 * tracer_cur(k-1,iCell))
    end if
    tracer_max(k,iCell) = max(tracer_cur(k-1,iCell),tracer_cur(k,iCell),tracer_cur(k+1,iCell))
    tracer_min(k,iCell) = min(tracer_cur(k-1,iCell),tracer_cur(k,iCell),tracer_cur(k+1,iCell))
 end do

We quickly saw that this code contained both a subroutine call and invariant conditional within the do loop. So we made the subroutine vectorizable and swapped the positions of the loop and conditional. The new code looks like this:

 ! Example flipped loop
 if ( vert4thOrder ) then
    do k = 3, maxLevelCell(iCell) - 1
       high_order_vert_flux(k, iCell) = &
          mpas_tracer_advection_vflux4( tracer_cur(k-2,iCell),tracer_cur(k-1,iCell), &
          tracer_cur(k,iCell),tracer_cur(k+1,iCell), w(k,iCell))
    end do
 else if ( vert3rdOrder ) then
    do k = 3, maxLevelCell(iCell) - 1
       high_order_vert_flux(k, iCell) = &
          mpas_tracer_advection_vflux3( tracer_cur(k-2,iCell),tracer_cur(k-1,iCell), &
          tracer_cur(k,iCell),tracer_cur(k+1,iCell), w(k,iCell), coef_3rd_order )
    end do
 else if ( vert2ndOrder ) then
    do k = 3, maxLevelCell(iCell) - 1
        verticalWeightK = verticalCellSize(k-1, iCell) / (verticalCellSize(k, iCell) +&
           verticalCellSize(k-1, iCell))
        verticalWeightKm1 = verticalCellSize(k, iCell) / (verticalCellSize(k, iCell) +&
           verticalCellSize(k-1, iCell))
        high_order_vert_flux(k,iCell) = w(k,iCell) * (verticalWeightK * tracer_cur(k,iCell) +&
           verticalWeightKm1 * tracer_cur(k-1,iCell))
    end do
 end if

In this case we combined two of the techniques listed in this article to get vectorized code and improve performance: making subroutines/functions SIMD-enabled (vectorizable) and moving invariant conditionals out of the loop. You will typically find that you must apply multiple techniques to tune your code. Each change is a step in the right direction, but sometimes the performance doesn't really jump ahead until all the changes are made: changing data layout, applying pragmas or directives, aligning data, and so on. Several more techniques follow in the upcoming sections.

Expand temporary scalars into short arrays

Temporary values are commonly calculated in the middle of a for or do loop. Sometimes this is done because an intermediate value is used in several calculations, and the common subset of computations is calculated and then held for a short time for reuse. Sometimes it just makes the code easier to read. If you apply “think SIMD” to this situation, it means turning those temporary scalars into short arrays that are as long as the widest SIMD register the code will run on.

This technique was applied to finite element and finite volume methods solving a system of partial differential equations related to calcium dynamics in a heart cell. Calcium ions play a crucial role in driving the heartbeat. Calcium is released into the heart cell at a lattice of discrete positions throughout the cell, known as calcium release units (CRUs). The probability that calcium will be released from a CRU depends on the calcium concentration at that CRU. When a CRU releases calcium (it “fires”), the local concentration of calcium ions increases sharply, and the calcium that diffuses raises the probability for release at neighboring sites. As a sequence of CRUs begins to release calcium throughout the cell, the release can self-organize into a wave of increasing concentration. A wave that is triggered among the normal physiological signaling (for example, the cardiac action potential) can lead to irregular heartbeats and possibly a life-threatening ventricular fibrillation. These dynamics are simulated by a system of three time-dependent partial differential equations developed by Leighton T. Izu.

Consider special-purpose MPI code that simulates these calcium dynamics developed by Matthias K. Gobbert and his collaborators (www.umbc.edu/~gobbert/calcium). This code solves the system of partial differential equations using finite element or finite volume methods and matrix-free linear solvers. Using a profiler such as the TAU Performance System, we find that most of the runtime is spent in the matrix vector multiplication function.

Here is a representative code snippetiii from this function (the temporary variable t and the call to getreact_3d(), which are changed in the optimized version below, are the parts to watch):

  for(iy = 0; iy < Ny; iy++)
  {
    for(ix = 0; ix < Nx; ix++)
    {
      iz  = l_iz +   spde.vNz_cum[id];
      i   =   ix +   (iy * Nx) + (iz   * ng);
      l_i =   ix +   (iy * Nx) + (l_iz * ng);

      t = 0.0;

      if (ix == 0)
      {
        t -= (        8.0/3.0 * Dxdx) * l_x[l_i+1 ];
        diag_x = 8.0/3.0 * Dxdx;
      } else if (ix == 1)
      {
        t -= (  bdx + 4.0/3.0 * Dxdx) * l_x[l_i-1 ];
        t -= (                  Dxdx) * l_x[l_i+1 ];
        diag_x = bdx + 7.0/3.0 * Dxdx;
      }
      if (iy == 0)
      {
          . . .
      } else if (iy == 1)
      {
        t -= (  bdy + 4.0/3.0 * Dydy) * l_x[l_i-Nx];
        t -= (                  Dydy) * l_x[l_i+Nx];
        diag_y = bdy + 7.0/3.0 * Dydy;
      } else if (iy == Ny-2)
      {
        t -=    (bdy + Dydy) * l_x[l_i-Nx];
        t -= 4.0/3.0 * Dydy  * l_x[l_i+Nx];
        diag_y = bdy +  7.0/3.0 * Dydy;
      } else if (iy == Ny-1)
      {
        t -= (2*bdy + 8.0/3.0 * Dydy) * l_x[l_i-Nx];
        diag_y = 2*bdy + 8.0/3.0 * Dydy;
      }else
      {
        t -= (bdy + Dydy) * l_x[l_i-Nx];
        t -=        Dydy  * l_x[l_i+Nx];
        diag_y = bdy + 2.0 * Dydy;
      }

      if (iz == 0)
      {
       .
       .
       .
      }
      .
      .
      .
      if (il == 1)
      {
         .
         .
         .
         l_y[l_i] += t*dt + (d + dt*(diag_x+diag_y+diag_z + a +   
            getreact_3d (is,js,ns, l_uold, l_i) )) * l_x[l_i];
      }else
      {
          l_y[l_i] += t*dt + (d + dt*(diag_x+diag_y+diag_z + a )) *
             l_x[l_i];
      }
    }
  }

The temporary variable t is expanded to be a short array temp[8]. In addition, a new short vector alocals is created to store values equivalent to the function call to getreact_3d(). To facilitate ease of programming, Intel® Cilk™ Plus array notation is used. With the temporary arrays in place, the new code looks like this:

  for(iy = 0; iy < Ny; iy++)
  {
    ...
    for(ix = 8; ix < Nx-9; ix+=8)
    {
      i   =   ix +   (iy * Nx) + (iz   * ng);
      l_i =   ix +   (iy * Nx) + (l_iz * ng);

      temp[0:8] = 0.0;
      temp[0:8] -= (  bdx +           Dxdx) * l_x[l_i-1:8 ];
      temp[0:8] -= (                  Dxdx) * l_x[l_i+1:8 ];
      diag_x = bdx + 2.0 * Dxdx;

      if (iy == 0)
      {
        temp[0:8] -= (          8.0/3.0 * Dydy) * l_x[l_i+Nx:8];
        diag_y = 8.0/3.0 * Dydy;
      }      else if (iy == 1)
      {
        temp[0:8] -= (  bdy + 4.0/3.0 * Dydy) * l_x[l_i-Nx:8];
        temp[0:8] -= (                  Dydy) * l_x[l_i+Nx:8];
        diag_y = bdy + 7.0/3.0 * Dydy;
      } else if (iy == Ny-2)
      {
        temp[0:8] -=    (bdy + Dydy) * l_x[l_i-Nx:8];
        temp[0:8] -= 4.0/3.0 * Dydy  * l_x[l_i+Nx:8];
        diag_y = bdy +  7.0/3.0 * Dydy;
      }else if (iy == Ny-1)
      {
        temp[0:8] -= (2*bdy + 8.0/3.0 * Dydy) * l_x[l_i-Nx:8];
        diag_y = 2*bdy + 8.0/3.0 * Dydy;
      }else
      {
        temp[0:8] -= (bdy + Dydy) * l_x[l_i-Nx:8];
        temp[0:8] -=        Dydy  * l_x[l_i+Nx:8];
        diag_y = bdy + 2.0 * Dydy;
      }

      if (iz == 0)
      {
          .
          .
          .
      }
      .
      .
      .
      if (il == 1)
      {
        . . .
        alocals[0:8] = . . .
        // operations that would be done by function getreact_3d()
        l_y[l_i:8] += temp[0:8]*dt + (d + dt*(diag_x+diag_y+diag_z + a +
          alocals[0:8])) * l_x[l_i:8];
     }else
     {
        l_y[l_i:8] += temp[0:8]*dt + (d + dt*(diag_x+diag_y+diag_z + a )) *
           l_x[l_i:8];
      }
      . . .
   }
}

When these changes were applied to the code, the execution time on an Intel® Xeon Phi™ coprocessor 5110P went from 68 hours 45 minutes down to 38 hours 10 minutes, cutting runtime by 44 percent!

Data alignment

Notice that the original loop went from ix = 0 to Nx - 1. There were some special conditions to cover ix = 0 or 1, but instead of the new code going from ix = 2 to Nx - 1 it begins at 8. You might recognize this is done to maintain data alignment in the kernel loop.

In addition to operating on the full SIMD register width, performance is best when the data moving into the SIMD registers is aligned on cache-line boundaries. This is especially true for the Intel Xeon Phi coprocessor (alignment helps all processors, but the percentage performance impact is largest on Intel Xeon Phi coprocessors). The arrays are initially aligned on cache boundaries, the cases ix = 0 through 7 are handled before entering the ix loop, and then all accesses within the ix loop operate on subsections of arrays where each subsection is aligned to a cache-line boundary. It would have been good if the software developers in this case had also added an assume-aligned pragma; it would not have improved code performance here, but it may have eliminated the need for the compiler to generate a peel loop to execute before the aligned kernel loop. In a previous article I pointed out that unaligned matrices could decrease code performance for selected kernels by over 53 percent. The times for aligned versus unaligned matrices in a recent simple matrix multiply test on a platform based on an Intel® Xeon® processor E5-2620 are shown in Table 1:

Version         Time (seconds)
Aligned         3.78
Un-aligned      4.86

Table 1: Matrix multiply test results—aligned versus unaligned.

For further investigation, the code segments to repeat that exercise are available at: https://github.com/drmackay/samplematrixcode
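As a hedged illustration of the alignment techniques described above, the sketch below uses _mm_malloc from the Intel intrinsics headers and the OpenMP aligned clause; the original application may use a different mechanism, and the function and array names here are invented for the example.

#include <xmmintrin.h>   /* _mm_malloc / _mm_free */

void saxpy_aligned(int n, float s)
{
    /* Allocate on 64-byte boundaries so full cache lines feed the
       SIMD registers and no peel loop is needed. */
    float *a = (float *)_mm_malloc(n * sizeof(float), 64);
    float *b = (float *)_mm_malloc(n * sizeof(float), 64);

    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = (float)i; }

    /* Tell the compiler both pointers are 64-byte aligned. */
    #pragma omp simd aligned(a, b: 64)
    for (int i = 0; i < n; i++)
        a[i] += s * b[i];

    _mm_free(a);
    _mm_free(b);
}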

Data Layout

Software performs best when the data layout is organized the way data is accessed. An example that illustrates this is the tutorial that accompanies ParaTools’ ThreadSpotter*iv, which you can download and run (ThreadSpotter is not required to build or run the different cases). In the first case a linked list is used to fill and access a database. The initial data storage element is a structure that is defined as shown below:

struct car_t {
    void randomize();

    color_t color;
    model_t model;
    std::string regnr;
    double weight;
    double hp;
};

A linked list is formed of these structures, and then queries are made. The linked list allows data to be accessed in a random, unpredictable method that is difficult for the prefetch section of the processor to accurately predict. In Table 2, this is shown in the “Linked list” row. When the linked list is replaced by a standard C++ vector class, data is accessed in a linear fashion.

The vector-based class is shown below:

class database_2_vector_t : public single_question_database_t
{
public:
    virtual void add_one(const car_t &c);
    virtual void finalize_adding();
    virtual void ask_one_question(query_t &query) const;

private:
    typedef std::vector<car_t> cars_t;
    cars_t cars;
};

In Table 2 this is shown in the “Vector” row. As shown in Table 2, performance increases significantly with this improved access pattern. This change preserved the same structure. When a query is related to the car color, notice that the entire structure is brought into cache, but only one element of the structure— color—is used. In this case the memory bus is occupied transferring extra data to and from memory and cache. This consumes both extra bandwidth as well as cache resources.

In the next case the cars structure used in the previous two examples is replaced with a new class:

class database_3_hot_cold_vector_t : public single_question_database_t
{
public:
    virtual void add_one(const car_t &c);
    virtual void finalize_adding();
    virtual void ask_one_question(query_t &query) const;

private:
    typedef std::vector<color_t> colors_t;
    typedef std::vector<model_t> models_t;
    typedef std::vector<double> weights_t;

    colors_t colors;
    models_t models;
    weights_t weights;

    typedef std::vector<car_t> cars_t;
    cars_t cars;
};

Notice that there are separate vectors colors, models, and weights. When a query is based on color, only the colors vector is brought into cache; the model and weight information is not, reducing the pressure on data bandwidth and L1 cache use. The vector brought into cache for the searches is “hot”; the vectors not brought into cache are “cold.” This is shown in the “Hot-cold vectors” row. This layout also allows data to be accessed in the preferred unit-stride fashion and uses all lanes of the SIMD registers. See the performance improvements shown in Table 2.

This last data layout change is equivalent to changing an Array of Structures (AoS) to a Structure of Arrays (SoA). The idea behind changing from AoS to SoA is to place data contiguously in the manner in which it is accessed. Developers are encouraged to learn more about AoS-to-SoA conversion, in particular when it is advantageous to adopt a Structure of Arrays rather than operate on an Array of Structures.

Version             Average Time/Query Set    Speed Up
Linked list         11.1                      1
Vector              0.32                      34
Hot-cold vectors    0.22                      50

Table 2: Data layout impact on performance on a platform based on the Intel® Xeon® processor E5-2620.
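The same idea can be expressed in plain C terms. The sketch below is generic and is not taken from the ThreadSpotter tutorial sources; the struct and function names are invented for the example.

#define N 100000

/* Array of Structures: a query that touches only 'color' still drags
   the whole record through the cache. */
struct car_aos {
    int    color;
    int    model;
    double weight;
};
struct car_aos cars[N];

/* Structure of Arrays: a color query streams through one dense array,
   filling every SIMD lane and leaving model/weight data cold. */
struct car_soa {
    int    color[N];
    int    model[N];
    double weight[N];
};
struct car_soa db;

int count_color(int wanted)
{
    int count = 0;
    for (int i = 0; i < N; i++)      /* unit-stride, vectorizable scan */
        if (db.color[i] == wanted)
            count++;
    return count;
}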

Summary

Performance increases significantly when software effectively fills and uses the vector registers available on current platforms. The steps listed in this article were:  

  • Use appropriate pragmas and directives
  • Utilize SIMD-enabled user-defined functions/subroutines
  • Move invariant conditionals outside of the inner-most loops
  • Store data in the order in which it is used
  • Align data on cache line boundaries

Software developers who embrace these techniques and use the information from compiler optimization reports should see their software performance improve. Tools such as Intel Advisor XE can help improve vectorization as well. Vectorization is an important step in software modernization and performance optimization. Adopt these techniques, check out the Intel Developer Zone Modern Code site for more details and answers to your questions, and please share your experiences in the comments section of this article.

 

i MPAS-O oceanic code development was supported by the DOE Office of Science BER.  

ii Douglas Jacobsen work was supported by the DOE Office of Science Biological and Environmental Research (BER), MPAS code example used by permission. 

iii Samuel Khuvis. Porting and Tuning Numerical Kernels in Real-World Applications to Many-Core Intel Xeon Phi Accelerators, PhD Thesis, Department of Mathematics and Statistics, University of Maryland, Baltimore County, May 2016. Code segments used by permission.  

iv ThreadSpotter is distributed by ParaTools, Inc.; tutorial code segments used by permission. Intel® Advisor XE is part of Intel® Parallel Studio XE, distributed by Intel Corp.

Monte-Carlo simulation on Asian Options Pricing


This is an exercise in performance optimization on heterogeneous Intel architecture systems based on multi-core processors and manycore (MIC) coprocessors.

NOTE: this lab follows the discussion in Section 4.7.1 and 4.7.2 in the book "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors", second edition (2015). The book can be obtained at xeonphi.com/book

In this step, we will look at how to load-balance an MPI application running on a heterogeneous cluster. The provided source code is a Monte-Carlo simulation of Asian options pricing. For the purposes of this exercise, the actual implementation of the simulation is not important; however, if you are interested in learning more about the simulation itself, refer to the Colfax Research website.

Asian Options Code Sample link: https://github.com/ColfaxResearch/Asian-Options


  1. Study "workdistribution.cc" and compile it. Then run the MPI application across all the nodes available to you (including MICs), with one process on each node.

    You should see that there is load imbalance: some nodes finish faster than others.

  2. A simple solution to this load imbalance is to distribute work unevenly depending on the target system. Implement a tuning variable "alpha" (of type float or double) where the workload a MIC receives is alpha times the workload the CPU receives. Each node should calculate which options to work on. To do this, use the function input "rankTypes", which stores the type (CPU or MIC) of all nodes in the world: "rankTypes[i]" is "1" if the rank "i" node is on a coprocessor, and "0" if it is on a CPU. Make sure every option is accounted for.

    Compile and run the application. Then try to find the "alpha" value that provides the best performance.

  3. The previous implementation, although simple, has the drawback that the best alpha value depends on the cluster. To make the application independent of the cluster it runs on, implement a boss-worker model in which the boss assigns work to the workers as the workers complete it.

    Compile and run the code to see the performance. Remember that the node hosting the boss process should have 2 processes.

    Hint: To implement the boss-worker model, you will need an if statement with two while loops in it. The worker loop should send its rank to the boss and receive the index that it needs to calculate. The boss should use MPI_ANY_SOURCE in its receive for the rank, and send the next index to the worker rank that it received. When there are no more options to be simulated, the boss should send a "terminate" index (say, an index of -1). When the worker receives this "terminate" index, it should exit the while loop. The boss should exit the while loop when "terminate" has been sent to every other process. Finally, don't forget to have an MPI_Barrier before the MPI_Reduce to make sure all processes are done before the reduction happens. A minimal sketch of this exchange appears after this list.
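The following is a minimal sketch of that boss-worker exchange. The per-option kernel simulate_option() is a placeholder for whatever the lab code provides, and the real solution must still accumulate and reduce the pricing results.

#include <mpi.h>

#define TERMINATE -1

/* Placeholder for the per-option Monte-Carlo pricing kernel. */
static void simulate_option(int index) { /* ... price option 'index' ... */ }

void boss_worker(int numOptions, int rank, int worldSize)
{
    MPI_Status status;

    if (rank == 0) {                    /* boss */
        int next = 0, workerRank, terminated = 0;
        while (terminated < worldSize - 1) {
            /* An idle worker reports in with its rank. */
            MPI_Recv(&workerRank, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, &status);
            if (next < numOptions) {
                MPI_Send(&next, 1, MPI_INT, workerRank, 0, MPI_COMM_WORLD);
                next++;
            } else {                    /* nothing left: tell it to stop */
                int stop = TERMINATE;
                MPI_Send(&stop, 1, MPI_INT, workerRank, 0, MPI_COMM_WORLD);
                terminated++;
            }
        }
    } else {                            /* worker */
        int index;
        while (1) {
            MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            MPI_Recv(&index, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            if (index == TERMINATE) break;
            simulate_option(index);
        }
    }

    /* Everyone must be done before the MPI_Reduce in the main code. */
    MPI_Barrier(MPI_COMM_WORLD);
}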

Asian-Options code GitHub link: https://github.com/ColfaxResearch/Asian-Options

Direct N-body Simulation


Exercise in performance optimization on Intel Architecture, including Intel® Xeon Phi processors

NOTE: this lab is an overview of various optimizations discussed in Chapter 4 in the book "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors", second edition (2015). The book can be obtained at xeonphi.com/book

In this step we will look at how to modernize a piece of code through an example application. The provided source code is an N-body simulation: a simulation of many particles that interact with each other gravitationally or electrostatically. We keep track of the position and the velocity of each particle in the structure "Particle". The simulation is discretized into timesteps. In each timestep, first, the force on each particle is calculated with a direct all-to-all algorithm (O(n^2) complexity). Next, the velocity of each particle is modified using the explicit Euler method. Finally, the positions of the particles are updated, also using the explicit Euler method.

N-body simulations are used in astrophysics to model galaxy evolution, colliding galaxies, dark matter distribution in the Universe, and planetary systems. They are also used in simulations of molecular structures. Real astrophysical N-body simulations, targeted to systems with billions of particles, use simplifications to reduce the complexity of the method to O(n log n). However, our toy model is the basis on which the more complex models are built.

In this lab, you will mostly be modifying the function MoveParticles(). A rough sketch of the baseline kernel is shown below.
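The baseline kernel has roughly the following shape. This is a sketch under assumed names: the structure layout, the softening constant, and the use of powf() are assumptions, so consult the repository for the actual code.

#include <math.h>

struct ParticleType {        /* AoS layout used by the baseline code */
    float x, y, z;
    float vx, vy, vz;
};

void MoveParticles(const int nParticles, struct ParticleType *const p,
                   const float dt)
{
    for (int i = 0; i < nParticles; i++) {
        float Fx = 0.0f, Fy = 0.0f, Fz = 0.0f;

        /* Direct all-to-all force calculation: O(n^2) */
        for (int j = 0; j < nParticles; j++) {
            const float dx = p[j].x - p[i].x;
            const float dy = p[j].y - p[i].y;
            const float dz = p[j].z - p[i].z;
            const float rr = dx*dx + dy*dy + dz*dz + 1e-20f; /* softened r^2 */
            const float drPower32 = powf(rr, 3.0f/2.0f);     /* expensive */
            Fx += dx / drPower32;
            Fy += dy / drPower32;
            Fz += dz / drPower32;
        }

        /* Explicit Euler update of the velocity */
        p[i].vx += dt * Fx;
        p[i].vy += dt * Fy;
        p[i].vz += dt * Fz;
    }

    /* Explicit Euler update of the positions */
    for (int i = 0; i < nParticles; i++) {
        p[i].x += p[i].vx * dt;
        p[i].y += p[i].vy * dt;
        p[i].z += p[i].vz * dt;
    }
}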

  1. Study the code, then compile and run the application to get the baseline performance. To run the application on the host, use the command "make run-cpu" and for coprocessor, use "make run-mic".

  2. Parallelize MoveParticles() by using OpenMP. Remember that there are two loops that need to be parallelized. You only need to parallelize the outer-most loop.

    Also modify the print statement, which is hardwired to print "1 thread", so that it prints the actual number of threads used.

    Compile and run the application to see if you got an improvement.

  3. Apply strength reduction for the calculation of force (the j-loop). You should be able to limit the use of expensive operations to one sqrtf() and one division, with the rest being multiplications. Also make sure to control the precision of constants and functions.

    Compile and run the application to see if you got an improvement.

  4. In the current implementation the particle data is stored in an Array of Structures (AoS), namely an array of "ParticleType" structures. Although this is great for readability and abstraction, it is sub-optimal for performance because the coordinates of consecutive particles are not adjacent. Thus, when the positions and the velocities are accessed in the vectorized loop, the data has non-unit-stride access, which hampers performance. It is therefore often beneficial to implement a Structure of Arrays (SoA) instead, where a single structure holds coordinate arrays.

    Implement SoA by replacing "ParticleType" with "ParticleSet". ParticleSet should have 6 arrays of size "n", one for each dimension of the coordinates (x, y, z) and velocities (vx, vy, vz). The i-th element of each array is the coordinate or velocity of the i-th particle. Be sure to also modify the initialization in main(), and modify the access to the arrays in MoveParticles(). Compile, then run to see if you get a performance improvement. A sketch combining steps 2-4 appears after this list.

  5. Let's analyze this application in terms of arithmetic intensity. Currently, the vectorized inner j-loop iterates through all particles for each i-th element. Since the cache line length and the vector length are the same, the arithmetic intensity is simply the number of operations in the inner-most loop. Not counting the reduction at the bottom, the number of operations per iteration is ~20, which is less than the ~30 that the roofline model calls for.

    To fix this, we can use tiling to increase cache reuse. By tiling in "i" or "j" with Tile=16 (we chose 16 because it is both the cache line length and the vector length), we can increase the number of operations to ~16*20 = ~320. This is more than enough to be in the compute-bound region of the roofline model.

    Although the loop can be tiled in "i" or "j" (if we allow a loop swap), it is more beneficial to tile in "i" and therefore vectorize in "i". If we have "j" as the inner-most loop, each iteration requires three reductions of the vector register (for Fx, Fy, and Fz). This is costly because the reduction is not vectorizable. On the other hand, if we vectorize in "i" with tile = 16, no reduction is required. Note, though, that you will need to create three buffers of length 16 where you can store Fx, Fy, and Fz for the "i"-th elements.

    Implement tiling in "i". then compile and run to see the performance.

  6. Using MPI, parallelize the simulation across multiple processes (or compute nodes). To make this work doable in a short time span, keep the entire data set in each process; however, each MPI process should execute only a portion of the loop in the MoveParticles() function. Try to minimize the amount of communication between the nodes. You may find the MPI function MPI_Allgather() useful. Compile and run the code to see if you get a performance improvement.
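Pulling steps 2 through 4 together, here is a hedged sketch of what the kernel might look like after threading, strength reduction, and the SoA conversion. Tiling and the MPI decomposition are left out for brevity, and the names follow the wording of the exercise rather than the reference solution.

#include <math.h>

struct ParticleSet {         /* SoA: six arrays of length n */
    float *x, *y, *z;
    float *vx, *vy, *vz;
};

void MoveParticles(const int nParticles, struct ParticleSet p, const float dt)
{
    #pragma omp parallel for                    /* step 2: thread the i-loop */
    for (int i = 0; i < nParticles; i++) {
        float Fx = 0.0f, Fy = 0.0f, Fz = 0.0f;

        for (int j = 0; j < nParticles; j++) {  /* unit-stride accesses */
            const float dx = p.x[j] - p.x[i];
            const float dy = p.y[j] - p.y[i];
            const float dz = p.z[j] - p.z[i];
            const float rr = dx*dx + dy*dy + dz*dz + 1e-20f;
            /* step 3: one sqrtf, one division, the rest multiplications */
            const float invR  = 1.0f / sqrtf(rr);
            const float invR3 = invR * invR * invR;
            Fx += dx * invR3;
            Fy += dy * invR3;
            Fz += dz * invR3;
        }

        p.vx[i] += dt * Fx;
        p.vy[i] += dt * Fy;
        p.vz[i] += dt * Fz;
    }

    #pragma omp parallel for
    for (int i = 0; i < nParticles; i++) {
        p.x[i] += p.vx[i] * dt;
        p.y[i] += p.vy[i] * dt;
        p.z[i] += p.vz[i] * dt;
    }
}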

Direct N-Body simulation code GitHub link: https://github.com/ColfaxResearch/N-body

Multithreaded Transposition of Square Matrices with Common Code for Intel® Xeon® Processors and Intel® Xeon Phi™ Coprocessors


In-place matrix transposition, a standard operation in linear algebra, is a memory bandwidth-bound operation. The theoretical maximum performance of transposition is the memory copy bandwidth. However, due to non-contiguous memory access in the transposition operation, practical performance is usually lower. The ratio of the transposition rate to the memory copy bandwidth is a measure of the transposition algorithm efficiency.

This paper demonstrates and discusses an efficient C language implementation of parallel in-place square matrix transposition. For large matrices, it achieves a transposition rate of 49 GB/s (82% efficiency) on Intel® Xeon® processors and 113 GB/s (67% efficiency) on Intel® Xeon Phi™ coprocessors. The code is tuned with pragma-based compiler hints and compiler arguments. Thread parallelism in the code is handled by OpenMP*, and vectorization is automatically implemented by the Intel compiler. This approach allows the same C code to be used for both the CPU and the MIC architecture executables, both demonstrating high efficiency. For the benchmarks, an Intel Xeon Phi 7110P coprocessor is used.
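The following is a simplified sketch of the approach; the tuned code, tile size, and pragma hints in the repository differ, and this only illustrates tiled in-place transposition with OpenMP threading.

#define TILE 16   /* illustrative tile size */

/* In-place transposition of an n-by-n row-major matrix, swapping
   TILE x TILE blocks so both access patterns stay cache resident. */
void transpose_inplace(float *A, const int n)
{
    #pragma omp parallel for schedule(dynamic)
    for (int ii = 0; ii < n; ii += TILE)
        for (int jj = ii; jj < n; jj += TILE)        /* upper tile triangle */
            for (int i = ii; i < ii + TILE && i < n; i++) {
                /* In a diagonal tile, start past the diagonal element. */
                const int j0 = (ii == jj) ? i + 1 : jj;
                for (int j = j0; j < jj + TILE && j < n; j++) {
                    const float tmp = A[i*n + j];
                    A[i*n + j] = A[j*n + i];
                    A[j*n + i] = tmp;
                }
            }
}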

To run the benchmark, execute the script ./benchmark.sh

The included Makefile and script ./benchmark.sh are designed for Linux.

In order for the CPU code to compile, you must have the Intel C++ compiler installed in the system.

In order to compile and run the MIC platform code, you must have an Intel Xeon Phi coprocessor in the system and the MIC Platform Software Stack (MPSS) installed and running.

Multithreaded Transposition of Square Matrices code GitHub link: https://github.com/ColfaxResearch/Transposition

 

Fine-Tuning Vectorization and Memory Traffic on Intel® Xeon Phi™ Coprocessors: LU Decomposition of Small Matrices


by Andrey Vladimirov, Colfax International

LU-decomposition

Common techniques for fine-tuning the performance of automatically vectorized loops in applications for Intel® Xeon Phi™ coprocessors are discussed. These techniques include strength reduction, regularizing the vectorization pattern, data alignment with aligned-data hints, and pointer disambiguation. In addition, the loop tiling technique for tuning memory traffic is shown. The optimization methods are illustrated on the example of single-threaded LU decomposition of a single precision matrix of size 128×128.

Benchmarks show that the discussed optimizations improve the performance on the coprocessor by a factor of 2.8 compared to the unoptimized code, and by a factor of 1.7 on the multi-core host system, achieving roughly the same performance on the host and on the coprocessor.
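For reference, the unoptimized kernel being tuned is essentially a Doolittle LU decomposition without pivoting. A plain sketch looks like the following; the tuned version in the repository applies the techniques listed above.

/* In-place LU decomposition (no pivoting) of an n x n single precision
   row-major matrix: U ends up in the upper triangle, the unit-diagonal L
   in the strict lower triangle. */
void lu_decompose(float *A, const int n)
{
    for (int k = 0; k < n; k++) {
        const float invPivot = 1.0f / A[k*n + k];    /* strength reduction */
        for (int i = k + 1; i < n; i++) {
            A[i*n + k] *= invPivot;                  /* L(i,k) */
            const float lik = A[i*n + k];
            /* Trailing-submatrix update: the loop where vectorization,
               alignment, and tiling matter most. */
            for (int j = k + 1; j < n; j++)
                A[i*n + j] -= lik * A[k*n + j];
        }
    }
}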

The code discussed in the paper can be freely downloaded from https://github.com/ColfaxResearch/LU-decomposition

Offload over Fabric to Intel® Xeon Phi™ Processor: Tutorial


The OpenMP* 4.0 device constructs supported by the Intel® C++ Compiler can be used to offload a workload from an Intel® Xeon® processor-based host machine to Intel® Xeon Phi™ coprocessors over Peripheral Component Interconnect Express* (PCIe*). Offload over Fabric (OoF) extends this offload programming model to support the 2nd generation Intel® Xeon Phi™ processor; that is, the Intel® Xeon® processor-based host machine uses OoF to offload a workload to 2nd generation Intel Xeon Phi processors over high-speed networks such as Intel® Omni-Path Architecture (Intel® OPA) or Mellanox InfiniBand*.

This tutorial shows how to install the OoF software, configure the hardware, test the basic configuration, and enable OoF. Sample source code is provided to illustrate how OoF works.

Hardware Installation

In this tutorial, two machines are used: an Intel® Xeon® processor E5-2670 2.6 GHz serves as the host machine and an Intel® Xeon Phi™ processor serves as the target machine. Both host and target machines are running Red Hat Enterprise Linux* 7.2, and each has Gigabit Ethernet adapters to enable remote log in. Note that the hostnames of the host and target machines are host-device and knl-sb2 respectively.

First we need to set up a high-speed network. We used InfiniBand in our lab due to the hardware availability, but Intel OPA is also supported.

Prior to the test, both host and target machines are powered off to set up a high-speed network between the machines. Mellanox ConnectX*-3 VPI InfiniBand adapters are installed into PCIe slots in these machines and are connected using an InfiniBand cable with no intervening router. After rebooting the machines, we first verify that the Mellanox network adapter is installed on the host:

[host-device]$ lspci | grep Mellanox
84:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

And on the target:

[knl-sb2 ~]$ lspci | grep Mellanox
01:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

Software Installation

The host machine and target machines are running Red Hat Enterprise Linux 7.2. On the host, you can verify the current Linux kernel version:

[host-device]$ uname -a
Linux host-device 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

You can also verify the current operating system kernel running on the target:

[knl-sb2 ~]$ uname -a
Linux knl-sb2 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

On the host machine, install the latest OoF software here to enable OoF. In this tutorial, the OoF software version 1.4.0 for Red Hat Enterprise Linux 7.2 (xppsl-1.4.0-offload-host-rhel7.2.tar) was installed. Refer to the document “Intel® Xeon Phi™ Processor x200 Offload over Fabric User’s Guide” for details on the installation. In addition, the Intel® Parallel Studio XE 2017 is installed on the host to enable the OoF support, specifically support of offload programming models provided by the Intel compiler.

On the target machine, install the latest Intel Xeon Phi processor software here. In this tutorial, the Intel Xeon Phi processor software version 1.4.0 for Red Hat Enterprise Linux 7.2 (xppsl-1.4.0-rhel7.2.tar) was installed. Refer to the document “Intel® Xeon Phi™ Processor Software User’s Guide” for details on the installation.

On both host and target machines, the Mellanox OpenFabrics Enterprise Distribution (OFED) for Linux driver MLNX_OFED_LINUX 3.2-2 for Red Hat Enterprise Linux 7.2 is installed to set up the InfiniBand network between the host and target machines. This driver can be downloaded from www.mellanox.com (navigate to Products > Software > InfiniBand/VPI Drivers, and download Mellanox OFED Linux).

Basic Hardware Testing

After you have installed the Mellanox driver on both the host and target machines, test the network cards to ensure that the Mellanox InfiniBand HCAs are working properly. To do this, bring the InfiniBand network up, and then test the network link using the ibping command.

First start InfiniBand and the subnet manager on the host, and then display the link information:

[host-device ~]$ sudo service openibd start
Loading HCA driver and Access Layer:                       [  OK  ]

[host-device ~]$ sudo service opensm start
Redirecting to /bin/systemctl start  opensm.service

[host-device ~]$ iblinkinfo
CA: host-device HCA-1:
      0x7cfe900300a13b41      1    1[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>       2    1[  ] "knl-sb2 HCA-1" ( )
CA: knl-sb2 HCA-1:
      0xf4521403007d2b91      2    1[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>       1    1[  ] "host-device HCA-1" ( )

Similarly, start InfiniBand and the subnet manager on the target, and then display the link information of each port in the InfiniBand network:

[knl-sb2 ~]$ sudo service openibd start
Loading HCA driver and Access Layer:                       [  OK  ]

[knl-sb2 ~]$ sudo service opensm start
Redirecting to /bin/systemctl start  opensm.service

[knl-sb2 ~]$ iblinkinfo
CA: host-device HCA-1:
      0x7cfe900300a13b41      1    1[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>       2    1[  ] "knl-sb2 HCA-1" ( )
CA: knl-sb2 HCA-1:
      0xf4521403007d2b91      2    1[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>       1    1[  ] "host-device HCA-1" ( )

iblinkinfo reports the link information for all the ports in the fabric, one at the target machine and one at the host machine. Next, use the ibping command to test the link (it is equivalent to the ping command for Ethernet). Start the ibping server on the host machine using:

[host-device ~]$ ibping -S

From the target machine, ping the port identification of the host:

[knl-sb2 ~]$ ibping -G 0x7cfe900300a13b41
Pong from host-device.(none) (Lid 1): time 0.259 ms
Pong from host-device.(none) (Lid 1): time 0.444 ms
Pong from host-device.(none) (Lid 1): time 0.494 ms

Similarly, start the ibping server on the target machine:

[knl-sb2 ~]$ ibping -S

This time, ping the port identification of the target from the host:

[host-device ~]$ ibping -G 0xf4521403007d2b91
Pong from knl-sb2.jf.intel.com.(none) (Lid 2): time 0.469 ms
Pong from knl-sb2.jf.intel.com.(none) (Lid 2): time 0.585 ms
Pong from knl-sb2.jf.intel.com.(none) (Lid 2): time 0.572 ms

IP over InfiniBand (IPoIB) Configuration

So far we have verified that the InfiniBand network is functional. Next, to use Offload over Fabric, we must configure IP over InfiniBand (IPoIB). This configuration provides the target IP address that is used to offload computations over the fabric.

First verify that the ib_ipoib driver is installed:

[host-device ~]$ lsmod | grep ib_ipoib
ib_ipoib              136906  0
ib_cm                  47035  3 rdma_cm,ib_ucm,ib_ipoib
ib_sa                  33950  5 rdma_cm,ib_cm,mlx4_ib,rdma_ucm,ib_ipoib
ib_core               141088  12 rdma_cm,ib_cm,ib_sa,iw_cm,mlx4_ib,mlx5_ib,ib_mad,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
mlx_compat             16639  17 rdma_cm,ib_cm,ib_sa,iw_cm,mlx4_en,mlx4_ib,mlx5_ib,ib_mad,ib_ucm,ib_addr,ib_core,ib_umad,ib_uverbs,mlx4_core,mlx5_core,rdma_ucm ib_ipoib

If the ib_ipoib driver is not listed, you need to add the module to the Linux kernel using the following command:

[host-device ~]$ modprobe ib_ipoib

Next list the InfiniBand interface ib0 on the host using the ifconfig command:

[host-device ~]$ ifconfig ib0
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband A0:00:02:20:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 1024  (InfiniBand)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Configure 10.0.0.1 as the IP address on this interface:

[host-device ~]$ sudo ifconfig ib0 10.0.0.1/24
[host-device ~]$ ifconfig ib0
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
        inet 10.0.0.1  netmask 255.255.255.0  broadcast 10.0.0.255
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband A0:00:02:20:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 1024  (InfiniBand)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 10  bytes 2238 (2.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Similarly on the target, configure 10.0.0.2 as the IP address on this InfiniBand interface:

[knl-sb2 ~]$ ifconfig ib0
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband A0:00:02:20:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 1024  (InfiniBand)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[knl-sb2 ~]$ sudo ifconfig ib0 10.0.0.2/24
[knl-sb2 ~]$ ifconfig ib0
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
        inet 10.0.0.2  netmask 255.255.255.0  broadcast 10.0.0.255
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband A0:00:02:20:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 1024  (InfiniBand)
        RX packets 3  bytes 168 (168.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 10  bytes 1985 (1.9 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Finally, verify the new IP address 10.0.0.2 of the target using the ping command on the host to test the connectivity:

[host-device ~]$ ping 10.0.0.2
PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.
64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=0.443 ms
64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=0.410 ms<CTRL-C>

Similarly, from the target, verify the new IP address 10.0.0.1 of the host:

[knl-sb2 ~]$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.313 ms
64 bytes from 10.0.0.1: icmp_seq=2 ttl=64 time=0.359 ms
64 bytes from 10.0.0.1: icmp_seq=3 ttl=64 time=0.375 ms<CTRL-C>

SSH Password-Less Setting (Optional)

When offloading a workload to the target machine, Secure Shell (SSH) requires the target’s password to log on to target and execute the workload. To enable this transaction without manual intervention, you must enable the ssh login without a password. To do this, first generate a pair of authentication keys on the host without entering a passphrase:

[host-device ~]$ ssh-keygen -t rsa

Then append the host’s new public key to the target’s public key using the command ssh-copy-id:

[host-device ~]$ ssh-copy-id @10.0.0.2

Offload over Fabric

At this point, the high-speed network is enabled and functional. To enable OoF functionality, you need to install Intel® Parallel Studio XE 2017 for Linux on the host. For the purposes of this paper we installed the Intel Parallel Studio XE 2017 Beta Update 1 for Linux. Next, set up your shell environment using:

[host-device]$ source /opt/intel/parallel_studio_xe_2017.0.024/psxevars.sh intel64
Intel(R) Parallel Studio XE 2017 Beta Update 1 for Linux*
Copyright (C) 2009-2016 Intel Corporation. All rights reserved.

Below is the sample program used to test the OoF functionality. This sample program allocates buffers x, y, and z on the host, initializes them along with a constant A, and then offloads the computation to the target using OpenMP device construct directives (#pragma omp target map…). The target directive creates a device data environment (on the target). At runtime, the values of the variables x, y, and A are copied to the target before the computation begins, and the values of variable y are copied back to the host when the target completes the computation. In this example, the target parses CPU information from the lscpu command and spawns a team of OpenMP threads to compute a vector-scalar product and add the result to a vector.

#include <stdio.h>
#include <stdlib.h>   /* malloc, free */
#include <string.h>   /* strcpy */
#include <omp.h>      /* omp_get_thread_num, omp_get_num_threads */

int main(int argc, char* argv[])
{
    int i, num = 1024;
    float A = 2.0f;

    float *x = (float*) malloc(num*sizeof(float));
    float *y = (float*) malloc(num*sizeof(float));
    float *z = (float*) malloc(num*sizeof(float));

    for (i=0; i<num; i++)
    {
       x[i] = i;
       y[i] = 1.5f;
       z[i] = A*x[i] + y[i];
    }

    printf("Workload is executed in a system with CPU information:\n");
    #pragma omp target map(to: x[0:num], A) \
                       map(tofrom: y[0:num])
    {
        char command[64];
        strcpy(command, "lscpu | grep Model");
        system(command);
        int done = 0;

        #pragma omp parallel for
        for (i=0; i<num; i++)
        {
            y[i] = A*x[i] + y[i];

            if ((omp_get_thread_num() == 0) && (done == 0))
            {
               int numthread = omp_get_num_threads();
               printf("Total number of threads: %d\n", numthread);
               done = 1;
            }
        }
    }

    int passed = 1;

    for (i=0; i<num; i++)
        if (z[i] != y[i]) passed = 0;

    if (passed == 1)
        printf("PASSED!\n");
    else
        printf("FAILED!\n");

    free(x);
    free(y);
    free(z);

    return 0;
}

Compile this OpenMP program with the Intel compiler option -qoffload-arch=mic-avx512 to indicate the offload portion is built for the 2nd generation Intel Xeon Phi processor. Prior to executing the program, set the environment variable OFFLOAD_NODES to the IP address of the target machine, in this case 10.0.0.2, to indicate that the high-speed network is to be used.

[host-device]$ icc -qopenmp -qoffload-arch=mic-avx512 -o OoF-OpenMP-Affinity OoF-OpenMP-Affinity.c

[host-device]$ export OFFLOAD_NODES=10.0.0.2

[host-device]$ ./OoF-OpenMP-Affinity
Workload is executed in a system with CPU information:
Model:                 87
Model name:            Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz
PASSED!
Total number of threads: 268

Note that the offload processing is internally handled by the Intel® Coprocessor Offload Infrastructure (Intel® COI). By default, the offload code runs with all OpenMP threads available in the target. The target has 68 cores, and the Intel COI daemon running on one core of the target leaves the remaining 67 cores available; the total number of threads is 268 (4 threads/core). You can use the coitrace command to trace all Intel COI API invocations:

[host-device]$ coitrace ./OoF-OpenMP-Affinity
COIEngineGetCount [ThID:0x7f02fdd04780]
        in_DeviceType = COI_DEVICE_MIC
        out_pNumEngines = 0x7fffc8833e00 0x00000001 (hex) : 1 (dec)

COIEngineGetHandle [ThID:0x7f02fdd04780]
        in_DeviceType = COI_DEVICE_MIC
        in_EngineIndex = 0x00000000 (hex) : 0 (dec)
        out_pEngineHandle = 0x7fffc8833de8 0x7f02f9bc4320

Workload is executed in a system with CPU information:
COIEngineGetHandle [ThID:0x7f02fdd04780]
        in_DeviceType = COI_DEVICE_MIC
        in_EngineIndex = 0x00000000 (hex) : 0 (dec)
        out_pEngineHandle = 0x7fffc88328e8 0x7f02f9bc4320

COIEngineGetInfo [ThID:0x7f02fdd04780]
        in_EngineHandle = 0x7f02f9bc4320
        in_EngineInfoSize = 0x00001440 (hex) : 5184 (dec)
        out_pEngineInfo = 0x7fffc8831410
                DriverVersion:
                DeviceType: COI_DEVICE_KNL
                NumCores: 68
                NumThreads: 272

<output truncated>

OpenMP* Thread Affinity

The result from the above program shows the default number of threads (272) that run on the target; however, you can set the number of threads that run on the target explicitly. One method uses environment variables on the host to modify the target’s execution environment. First, define a target-specific environment variable prefix, and then add this prefix to the OpenMP thread affinity environment variables. For example, the following environment variable settings configure the offload runtime to use 8 threads on the target:

[host-device]$ export MIC_ENV_PREFIX=PHI
[host-device]$ export PHI_OMP_NUM_THREADS=8

The Intel OpenMP runtime extensions KMP_PLACE_THREAD and KMP_AFFINITY environment variables can be used to bind threads to physical processing units (that is, cores) (refer to the section Thread Affinity Interface in the Intel® C++ Compiler User and Reference Guide for more information). For example, the following environment variable settings configure the offload runtime to use 8 threads close to each other:

[host-device]$ export PHI_KMP_AFFINITY=verbose,granularity=thread,compact
[host-device]$ ./OoF-OpenMP-Affinity

You can also use OpenMP affinity by using the OMP_PROC_BIND environment variable. For example, to duplicate the previous example to run 8 threads close to each other using OMP_PROC_BIND use the following:

[host-device]$ export MIC_ENV_PREFIX=PHI
[host-device]$ export PHI_KMP_AFFINITY=verbose
[host-device]$ export PHI_OMP_PROC_BIND=close
[host-device]$ export PHI_OMP_NUM_THREADS=8
[host-device]$ ./OoF-OpenMP-Affinity

Or run with 8 threads and spread them out using:

[host-device]$ export PHI_OMP_PROC_BIND=spread
[host-device]$ ./OoF-OpenMP-Affinity

The result is shown in the following table:

OpenMP* thread number    Core number
0                        0
1                        10
2                        19
3                        31
4                        39
5                        48
6                        56
7                        65
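
For reference, a small verification program along the following lines can be used to see where each OpenMP thread lands (this is a hypothetical example, not the OoF-OpenMP-Affinity source; sched_getcpu() is glibc-specific and reports the logical CPU, which the processor enumerates per hardware thread rather than per core):

/* affinity_check.c - minimal thread-placement report (hypothetical example) */
#define _GNU_SOURCE
#include <sched.h>   /* sched_getcpu() */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* Each OpenMP thread prints the logical CPU it is currently running on. */
        printf("OpenMP thread %d runs on logical CPU %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}

Compiled with icc -qopenmp and offloaded in the same way as the sample above (or run natively on the target), it prints one line per thread so the effect of the affinity settings can be inspected directly.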

To run 8 threads with 2 threads per core (4 cores total), use:

[host-device]$ export PHI_OMP_PROC_BIND=close
[host-device]$ export PHI_OMP_PLACES="cores(4)"
[host-device]$ export PHI_OMP_NUM_THREADS=8
[host-device]$ ./OoF-OpenMP-Affinity

The result is shown in the following table:

OpenMP* thread number    Core number
0                        0
1                        0
2                        1
3                        1
4                        2
5                        2
6                        3
7                        3

Summary

This tutorial showed how to set up and run an OoF application, covering both the hardware and software installation. Mellanox InfiniBand Host Channel Adapters were used in this example, but Intel OPA can be used instead. The sample code was an OpenMP offload application that runs on an Intel Xeon processor host and offloads its computation to an Intel Xeon Phi processor target over a high-speed network. The tutorial also showed how to compile and run the offload program and how to control OpenMP thread affinity on the Intel Xeon Phi processor.

About the Author

Loc Q Nguyen received an MBA from University of Dallas, a master’s degree in Electrical Engineering from McGill University, and a bachelor's degree in Electrical Engineering from École Polytechnique de Montréal. He is currently a software engineer with Intel Corporation's Software and Services Group. His areas of interest include computer networking, parallel computing, and computer graphics.

How to Mount a Shared Directory on Intel® Xeon Phi™ Coprocessor


In order to run a native program on the Intel® Xeon Phi™ coprocessor, the program and any dependencies must be copied to the target platform. However, because the coprocessor's file system resides in memory, this approach takes memory away from the native application. To conserve memory (the coprocessor has 16 GB of GDDR5 memory on board), it is practical to mount a Network File System (NFS) shared directory on the Intel Xeon Phi coprocessor from the host server so that most of its memory remains available for applications. This article shows two ways to accomplish this task: the preferred method uses the micctrl utility, and the second is a manual procedure.

Using micctrl utility

The preferred method to mount a shared directory on an Intel Xeon Phi coprocessor is to use the micctrl utility shipped with the Intel® Manycore Platform Software Stack (Intel® MPSS). The following example shows how to share the Intel® C++ Compiler library using micctrl. On the host machine used for this example, Intel MPSS 3.4.8 was installed.

  1. On the host machine, ensure that the shared directory exists:
    [host ~]# ls /opt/intel/compilers_and_libraries_2017.0.098/linux/
  2. Add a new descriptor to the /etc/exports configuration file on the host machine in order to export the directory /opt/intel/compilers_and_libraries_2017.0.098/linux to the coprocessor mic0, whose IP address is 172.31.1.1. Use the read-only option so that the coprocessor cannot accidentally modify anything in the shared library directory:
    [host ~]# cat /etc/exports
    	/opt/intel/compilers_and_libraries_2017.0.098/linux 172.31.1.1(ro,async,no_root_squash)

    For more information on the export options, you can refer to http://nfs.sourceforge.net/nfs-howto/ar01s03.html.
     
  3. Next, update the NFS export table in the host:
    [host ~]# exportfs -a
  4. From the host, use the micctrl utility to add an NFS entry on the coprocessors:
    [host ~]# micctrl --addnfs=/opt/intel/compilers_and_libraries_2017.0.098/linux --dir=/mnt-library --options=defaults
  5. Restart the MPSS service:
    [host ~]# service mpss restart
    	Shutting down Intel(R) MPSS:                               [  OK  ]
    	Starting Intel(R) MPSS:                                    [  OK  ]
    	mic0: online (mode: linux image: /usr/share/mpss/boot/bzImage-knightscorner)
    	mic1: online (mode: linux image: /usr/share/mpss/boot/bzImage-knightscorner)
  6. Finally, from the coprocessor, verify that the remote directory is accessible:
    [host ~]# ssh mic0 cat /etc/fstab
    	rootfs          /               auto            defaults                1  1
    	proc            /proc           proc            defaults                0  0
    	devpts          /dev/pts        devpts          mode=0620,gid=5         0  0
    	172.31.1.254:/opt/intel/compilers_and_libraries_2017.0.098/linux  /mnt-library  nfs             defaults 1 1
    
    	[host ~]# ssh mic0 ls /mnt-library

Mounting manually

As an example of the manual procedure, let’s assume we want to mount an NFS shared directory /mnt-mic0 on the Intel Xeon Phi coprocessor from the host machine (/var/mpss/mic0.export is the directory that the host machine exports). Steps 1-3 of this method parallel those of the previous method:

  1. On the host machine, ensure that the shared directory exists; if it doesn’t exist, create it:
    [host ~]# mkdir /var/mpss/mic0.export
  2. Add a descriptor to the /etc/exports configuration file on the host machine to export the directory /var/mpss/mic0.export to the coprocessor mic0, which in this case has an IP address of 172.31.1.1:
    [host ~]# cat /etc/exports
    	/var/mpss/mic0.export 172.31.1.1(rw,async,no_root_squash)

    For more information on the export options, you can refer to http://nfs.sourceforge.net/nfs-howto/ar01s03.html.
     
  3. Next, update the NFS export table:
    [host ~]# exportfs -a
  4. Next, login on the coprocessor mic0:
    [host ~]# ssh mic0
  5. Create the mount point /mnt-mic0 on the coprocessor:
    (mic0)# mkdir /mnt-mic0
  6. Add the following descriptor to the /etc/fstab file of the coprocessor to specify the server, the path name of the exported directory, the local directory (mount point), the type of the file system, and the list of mount options: “172.31.1.254:/var/mpss/mic0.export /mnt-mic0 nfs defaults 1 1”
    (mic0)# cat /etc/fstab
    	rootfs          /               auto             defaults                1  1
    	proc            /proc           proc             defaults                0  0
    	devpts          /dev/pts        devpts           mode=0620,gid=5         0  0
    	172.31.1.254:/var/mpss/mic0.export /mnt-mic0 nfs defaults                1  1
  7. To mount the shared directory /var/mpss/mic0.export on the coprocessor, we can type:
    (mic0)# mount -a
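
To confirm that the share is mounted, a quick check from the coprocessor (mount and df are available in the coprocessor's BusyBox environment) is:

    (mic0)# mount | grep /mnt-mic0
    (mic0)# df -h /mnt-mic0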

Notes:

  • If "Connection refused" error is received, restart NFS server in the host:
    [host~]# service nfs restart
    Shutting down NFS daemon:                                  [  OK  ]
    Shutting down NFS mountd:                                  [  OK  ]
    Shutting down NFS quotas:                                  [  OK  ]
    Shutting down NFS services:                                [  OK  ]
    Starting NFS services:                                     [  OK  ]
    Starting NFS quotas:                                       [  OK  ]
    Starting NFS mountd:                                       [  OK  ]
    Stopping RPC idmapd:                                       [  OK  ]
    Starting RPC idmapd:                                       [  OK  ]
    Starting NFS daemon:                                       [  OK  ]
  • If "Permission denied" error is received, review and correct the /etc/exports file in the host.
  • If the coprocessor reboots, you have to mount the directory in the coprocessor again.
  • The above shared directory can be read/write. To change to read only option, use the option (ro,async,no_root_squash) as seen in step 2.

Conclusion

This article shows two methods to mount a shared directory on the Intel Xeon Phi coprocessor: one uses the micctrl utility, and the other is the manual method. Although both methods work, the micctrl utility is preferred because it prevents users from entering data incorrectly in the coprocessor’s /etc/fstab file.


Introduction to Heterogeneous Streams Library


Introduction

To efficiently utilize all available resources for task-concurrency applications on heterogeneous platforms, designers need to understand the memory architecture, the thread utilization on each platform, and the pipeline used to offload workloads to the different platforms, and they need to coordinate all of these activities.

To relieve designers of the burden of implementing the necessary infrastructure, the Heterogeneous Streaming (hStreams) library provides a set of well-defined APIs to support a task-based parallelism model on heterogeneous platforms. hStreams builds on the Intel® Coprocessor Offload Infrastructure (Intel® COI) to implement this infrastructure. That is, the host decomposes the workload into tasks, one or more tasks are executed on separate targets, and finally the host gathers the results from all of the targets. Note that the host can be a target as well.

Intel® Manycore Platform Software Stack (Intel® MPSS) version 3.6 contains the hStreams library, documentation, and sample code. Starting with Intel MPSS 3.7, hStreams was removed from the Intel MPSS software and became an open source project. The current version 1.0 supports the Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor as targets. hStreams version 1.0.0 binaries can be downloaded from the project repository listed below.

Users can contribute to hStreams development at https://github.com/01org/hetero-streams. The following tables summarize the tools that support hStreams in Linux and Windows:

Name of Tool (Linux*)                       Supported Version
Intel® Manycore Platform Software Stack     3.4, 3.5, 3.6, 3.7
Intel® C++ Compiler                         15.0, 16.0
Intel® Math Kernel Library                  11.2, 11.3

Name of Tool (Windows*)                     Supported Version
Intel MPSS                                  3.4, 3.5, 3.6, 3.7
Intel C++ Compiler                          15.0, 16.0
Intel Math Kernel Library                   11.2, 11.3
Visual Studio*                              11.0 (2012)

This whitepaper briefly introduces hStreams and highlights its concepts. For a full description, readers are encouraged to read the tutorial included in the hStreams package mentioned above.

Execute model concepts

This section highlights some basic concepts of hStreams: source and sink, domains, streams, buffers, and actions:

  • Streams are FIFO queues where actions are enqueued. Streams are associated with logical domains. Each stream has two endpoints: a source and a sink; the sink is bound to a logical domain.
  • Source is where the work is enqueued, and sink is where the work is executed. In the current implementation, the source process runs on an Intel Xeon processor-based machine, and the sink process runs on a machine that can be the host itself, an Intel Xeon Phi coprocessor, or, in the future, any other hardware platform. The library allows the source machine to invoke a user-defined function on the target machine.
  • Domains represent the resources of heterogeneous platforms. A physical domain is the set of all resources available in a platform (memory and computing). For example, an Intel Xeon processor-based machine and an Intel Xeon Phi coprocessor are two different physical domains. A logical domain is a subset of a given physical domain; it uses any subset of available cores in a physical domain. The only restriction is that two logical domains cannot partially overlap.
  • Buffers represent memory resources used to transfer data between source and sink. In order to transfer data, the user must create a buffer by calling an appropriate API, and a corresponding physical buffer is instantiated at the sink. Buffers can have properties such as memory type (for example, DDR or HBW) and affinity (for example, sub-NUMA clustering).
  • Actions are requests to execute functions at the sinks (compute actions), to transfer data from source to sink or vice versa (memory movement actions), and to synchronize tasks among streams (synchronization actions). Actions enqueued in a stream are processed with first in, first out (FIFO) semantics: the source places the action in the queue and the sink removes it. All actions are non-blocking (asynchronous) and have completion events. Remote invocations can be user-defined functions or optimized convenience functions (for example, dgemm). Thus, the FIFO stream queue handles dependencies within a stream, while synchronization actions handle dependencies among streams.

In a typical scenario, the source-side code allocates stream resources, allocates memory, transfers data to the sink, invokes the sink to execute a predefined function, handles synchronization, and eventually terminates streams. Note that actions such as data transferring, remote invocation, and synchronization are handled in FIFO streams. The sink-side code simply executes the function that the source requested.
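
Before looking at the pseudo-code, the in-stream FIFO ordering can be made concrete with a small analogy in standard C++. This is not the hStreams API; it simply models a single stream as a worker thread (the "sink") that drains a queue of asynchronously enqueued actions in order:

// toy_stream.cpp -- a standard C++ analogy for one stream (not the hStreams API)
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

class ToyStream {
public:
    ToyStream() : done_(false), worker_([this] { run(); }) {}
    ~ToyStream() {                       // acts like a final synchronization:
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();                  // drains remaining actions, then exits
    }
    void enqueue(std::function<void()> action) {   // returns immediately (asynchronous)
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(action)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> action;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !q_.empty(); });
                if (q_.empty()) return;  // done and nothing left to execute
                action = std::move(q_.front());
                q_.pop();
            }
            action();                    // executed strictly in the order enqueued
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> q_;
    bool done_;
    std::thread worker_;
};

int main() {
    ToyStream stream;
    stream.enqueue([] { std::cout << "transfer A, B to sink\n"; });
    stream.enqueue([] { std::cout << "compute A + B -> C on sink\n"; });
    stream.enqueue([] { std::cout << "transfer C back to source\n"; });
}   // destructor waits until the queue is drained

The real library layers data transfers, remote invocation, completion events, and cross-stream synchronization actions on top of this basic ordering guarantee.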

For example, consider the pseudo-code of a simple hStreams application that creates two streams; the source transfers data to the sinks, invokes remote computation at the sinks, and then transfers the results back to the source host:

Step 1: Initialize two streams 0 and 1

Step 2: Allocate buffers A0, B0, C0, A1, B1, C1

Step 3: Use stream i, transfer memory Ai, Bi to sink (i=0,1)

Step 4: Invoke remote computing in stream i: Ai + Bi -> Ci (i=0,1)

Step 5: Transfer memory Ci back to host (i=0,1)

Step 6: Synchronize

Step 7: Terminate streams

The following figure illustrates the actions generated at the host:

Actions are placed in the corresponding streams and removed at the sinks:

hStreams provides two levels of APIs: the app API and the core API. The app API offers simple interfaces and is targeted at novice users so they can quickly ramp up on the hStreams library. The core API gives advanced users the full functionality of the library. The app APIs in fact call the core-layer APIs, which in turn use Intel COI and the Symmetric Communication Interface (SCIF). Note that users can mix the two levels of API when writing their applications. For more details on the hStreams API, refer to the document Programming Guide and API Reference. The following figure illustrates the relation between the hStreams app API and the core API.

Refer to the document “Hetero Streams Library 1.0 Programming Guide and API” and the tutorial included in the hStreams download package for more information.

Building and running a sample hStreams program

This section illustrates a sample code that makes use of the hStreams app API. It also demonstrates how to build and run the application. The sample code is an MPI program running on an Intel Xeon processor host with two Intel Xeon Phi coprocessors connected.

First, download the package from https://github.com/01org/hetero-streams. Then, follow the instructions to build and install the hStreams library on an Intel Xeon processor-based host machine, which in this case runs Intel MPSS 3.7.2. This host machine has two Intel Xeon Phi coprocessors installed and connects to a remote Intel Xeon processor-based machine. This remote machine (10.23.3.32) also has two Intel Xeon Phi coprocessors.

This sample code creates two streams; each stream runs explicitly on a separate coprocessor. An MPI rank manages these two streams.

The application consists of two parts: The source-side code is shown in Appendix A and the corresponding sink-side code is shown in Appendix B. The sink-side code contains a user-defined function vector_add, which is to be invoked by the source.

This sample MPI program is designed to run with two MPI ranks. Each MPI rank runs on a different domain (Intel Xeon processor host) and initializes two streams; each stream is responsible for communicating with one coprocessor. Each MPI rank enqueues the required actions into its streams in the following order: memory transfer from source to sink, remote invocation, and memory transfer from sink to source. The following app APIs are called in the source-side code:

  • hStreams_app_init: Initialize and create streams across all available Intel Xeon Phi coprocessors. This API assumes one logical domain per physical domain.
  • hStreams_app_create_buf: Create an instantiation of buffers in all currently existing logical domains.
  • hStreams_app_xfer_memory: Enqueue memory transfer action in a stream; depending on the specified direction, memory is transferred from source to sink or sink to source.
  • hStreams_app_invoke: Enqueue a user-defined function in a stream. This function is executed at the stream sink. Note that the user also needs to implement the remote target function in the sink-side program.
  • hStreams_app_event_wait: This sync action blocks until the set of specified events is completed. In this example, only the last transaction in a stream is required, since all other actions should be completed.
  • hStreams_app_fini: Destroy hStreams internal structures and clear the library state.

Intel MPSS 3.7.2 and Intel® Parallel Studio XE 2016 update 3 are installed on the Intel® Xeon® processor E5-2600 based host machine. First, bring the Intel MPSS service up and set up the compiler environment variables on the host machine:

$ sudo service mpss start

$ source /opt/intel/composerxe/bin/compilervars.sh intel64

To compile the source-side code, link the source-side code with the dynamic library hstreams_source which provides source functionality:

$ mpiicpc hstream_sample_src.cpp -O3 -o hstream_sample -lhstreams_source \
    -I/usr/include/hStreams -qopenmp

The above command generates the executable hstream_sample. To generate the user kernel library for the coprocessor (as sink), compile with the flag -mmic:

$ mpiicpc -mmic -fPIC -O3 hstream_sample_sink.cpp -o ./mic/hstream_sample_mic.so \
    -I/usr/include/hStreams -qopenmp -shared

By convention, the target library name takes the form <exec_name>_mic.so for the Intel Xeon Phi coprocessor and <exec_name>_host.so for the host. The command above generates the library hstream_sample_mic.so in the ./mic folder.

To run this application, set the environment variable SINK_LD_LIBRARY_PATH so that the hStreams runtime can find the user kernel library hstream_sample_mic.so:

$ export SINK_LD_LIBRARY_PATH=/opt/mpss/3.7.2/sysroots/k1om-mpss-linux/usr/lib64:~/work/hStreams/collateral/delivery/mic:$MIC_LD_LIBRARY_PATH

Run this program with two ranks, one rank running on this current host and one rank running on the host whose IP address is 10.23.3.32, as follows:

$ mpiexec.hydra -n 1 -host localhost ~/work/hstream_sample : -n 1 -wdir ~/work -host 10.23.3.32 ~/work/hstream_sample

Hello world! rank 0 of 2 runs on knightscorner5
Hello world! rank 1 of 2 runs on knightscorner0.jf.intel.com
Rank 0: stream 0 moves A
Rank 0: stream 0 moves B
Rank 0: stream 1 moves A
Rank 0: stream 1 moves B
Rank 0: compute on stream 0
Rank 0: compute on stream 1
Rank 0: stream 0 Xtransfer data in C back
knightscorner5-mic0
knightscorner5-mic1
Rank 1: stream 0 moves A
Rank 1: stream 0 moves B
Rank 1: stream 1 moves A
Rank 1: stream 1 moves B
Rank 1: compute on stream 0
Rank 1: compute on stream 1
Rank 1: stream 0 Xtransfer data in C back
knightscorner0-mic0.jf.intel.com
knightscorner0-mic1.jf.intel.com
Rank 0: stream 1 Xtransfer data in C back
Rank 1: stream 1 Xtransfer data in C back
sink: compute on sink in stream num: 0
sink: compute on sink in stream num: 0
sink: compute on sink in stream num: 1
sink: compute on sink in stream num: 1
C0=97.20 C1=90.20 C0=36.20 C1=157.20 PASSED!

Conclusion

hStreams provides a well-defined set of APIs that allows users to quickly design task-based applications for heterogeneous platforms. Two levels of hStreams API co-exist: the app API offers simple interfaces so that novice users can quickly ramp up on the hStreams library, and the core API gives advanced users the full functionality of the library. This paper presented some basic hStreams concepts and illustrated how to build and run an MPI program that takes advantage of the hStreams interface.

 

About the Author

Loc Q Nguyen received an MBA from University of Dallas, a master’s degree in Electrical Engineering from McGill University, and a bachelor's degree in Electrical Engineering from École Polytechnique de Montréal. He is currently a software engineer with Intel Corporation's Software and Services Group. His areas of interest include computer networking, parallel computing, and computer graphics.

Caffe* Optimized for Intel® Architecture: Applying Modern Code Techniques


Improving the computational performance of a deep learning framework

Authors

Vadim Karpusenko, Ph.D., Intel Corporation
Andres Rodriguez, Ph.D., Intel Corporation
Jacek Czaja, Intel Corporation
Mariusz Moczala, Intel Corporation

Abstract

This paper demonstrates a special version of Caffe* — a deep learning framework originally developed by the Berkeley Vision and Learning Center (BVLC) — that is optimized for Intel® architecture. This version of Caffe, known as Caffe optimized for Intel architecture, is currently integrated with the latest release of Intel® Math Kernel Library 2017, is optimized for Intel® Advanced Vector Extensions 2, and will include Intel® Advanced Vector Extensions 512 instructions. This solution is supported by Intel® Xeon® processors and Intel® Xeon Phi™ processors, among others. This paper includes performance results for the CIFAR-10* image-classification dataset, and it describes the tools and code modifications that can be used to improve computational performance for the BVLC Caffe code and other deep learning frameworks.

Introduction

Deep learning is a subset of general machine learning that in recent years has produced groundbreaking results in image and video recognition, speech recognition, natural language processing (NLP), and other big-data and data-analytics domains. Recent advances in computation, large datasets, and algorithms have been key ingredients behind the success of deep learning, which works by passing data through a series of layers, with each layer extracting features of increasing complexity.

Figure 1. Each layer in a deep network is trained to identify features of higher complexity—this figure shows a small subset of the features of a deep network projected down to the pixels space (the gray images on the left) and the corresponding images (colored images on the right) that activate those features.
Zeiler, Matthew D. and Fergus, Rob. New York University, Department of Computer Science. “Visualizing and Understanding Convolutional Networks.” 2014. https://www.cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf.

Supervised deep learning requires a labeled dataset. Three popular types of supervised deep networks are multilayer perceptrons (MLPs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs). In these networks, the input is passed through a series of linear and non-linear transformations as it progresses through each layer, and an output is produced. An error and the respective cost of the error are then computed before a gradient of the costs of the weights and activations in the network is computed and iteratively backward propagated to lower layers. Finally, the weights or models are updated based on the computed gradient.

In MLPs, the input data at each layer (represented by a vector) is first multiplied by a dense matrix unique to that layer. In RNNs, the dense matrix (or matrices) is the same for every layer (the layer is recurrent), and the length of the network is determined by the length of the input signal. CNNs are similar to MLPs, but they use a sparse matrix for the convolutional layers. This matrix multiplication is represented by convolving a 2-D representation of the weights with a 2-D representation of the layer’s input. CNNs are popular in image recognition, but they are also used for speech recognition and NLP. For a detailed explanation of CNNs, see “CS231n Convolutional Neural Networks for Visual Recognition” at http://cs231n.github.io/convolutional-networks/.

Caffe

Caffe* is a deep learning framework developed by the Berkeley Vision and Learning Center (BVLC) and community contributors. This paper refers to that original version of Caffe as “BVLC Caffe.”

In contrast, Caffe optimized for Intel® architecture is a specific, optimized fork of the BVLC Caffe framework. Caffe optimized for Intel architecture is currently integrated with the latest release of Intel® Math Kernel Library (Intel® MKL) 2017, and it is optimized for Intel® Advanced Vector Extensions 2 (Intel® AVX2) and will include Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions, which are supported by Intel® Xeon® processors and Intel® Xeon Phi™ processors, among others. For a detailed description of compiling, training, fine-tuning, testing, and using the various tools available, read “Training and Deploying Deep Learning Networks with Caffe* Optimized for Intel® Architecture” at https://software.intel.com/en-us/articles/training-and-deploying-deep-learning-networks-with-caffe-optimized-for-intel-architecture.

Intel would like to thank Boris Ginsburg for his ideas and initial contribution to the OpenMP* multithreading implementation of Caffe* optimized for Intel® architecture.

This paper describes the performance of Caffe optimized for Intel architecture compared to BVLC Caffe running on Intel architecture, and it discusses the tools and code modifications used to improve computational performance for the Caffe framework. It also shows performance results from using the CIFAR-10* image-classification dataset (https://www.cs.toronto.edu/~kriz/cifar.html) and the CIFAR-10 full-sigmoid model that composes layers of convolution, max and average pooling, and batch normalization: (https://github.com/BVLC/caffe/blob/master/examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt).

Figure 2. Example of CIFAR-10* dataset images


Image Classification

The CIFAR-10 dataset consists of 60,000 color images, each with dimensions of 32 × 32, equally divided and labeled into the following 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The classes are mutually exclusive; there is no overlap between different types of automobiles (such as sedans or sport utility vehicles [SUVs]) or trucks (which includes only big trucks), and neither group includes pickup trucks (see Figure 2).

When Intel tested the Caffe frameworks, we used the CIFAR-10 full-sigmoid model, a CNN model with multiple layers including convolution, max pooling, batch normalization, fully connected, and softmax layers. For layer descriptions, refer to the Code Parallelization with OpenMP* section.

Initial Performance Profiling

One method for benchmarking Caffe optimized for Intel architecture and BVLC Caffe is using the time command, which computes the layer-by-layer forward and backward propagation time. The time command is useful for measuring the time spent in each layer and for providing the relative execution times for different models:

./build/tools/caffe time \
    --model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
    -iterations 1000

In this context, an iteration is defined as one forward and backward pass over a batch of images. The previous command returns the average execution time per iteration for 1,000 iterations per layer and for the entire network. Figure 3 shows the full output.

Figure 3. Output from the Caffe* time command

In our testing, we used a dual-socket system with one Intel Xeon processor E5-2699 v3 at 2.30 GHz per socket, 18 physical cores per CPU, and Intel® Hyper-Threading Technology (Intel® HT Technology) disabled. This dual-socket system had 36 cores in total, so the default number of OpenMP* threads, specified by the OMP_NUM_THREADS environment variable, was 36 for our tests, unless otherwise specified (note that we recommend letting Caffe optimized for Intel architecture automatically specify the OpenMP environment rather than setting it manually). The system also had 64 GB of DDR4 memory installed, operating at a frequency of 2,133 MHz.

Using those numbers, this paper demonstrates the performance results of code optimizations made by Intel engineers. We used the following tools for performance monitoring:

  • Callgrind* from Valgrind* toolchain
  • Intel® VTune™ Amplifier XE 2017 beta

Intel VTune Amplifier XE tools provide the following information:

  • Functions with the highest total execution time (hotspots)
  • System calls (including task switching)
  • CPU and cache usage
  • OpenMP multithreading load balance
  • Thread locks
  • Memory usage

We can use the performance analysis to find good candidates for optimization, such as code hotspots and long function calls. Figure 4 shows important data points from the Intel VTune Amplifier XE 2017 beta summary analysis running 100 iterations. The Elapsed Time, Figure 4 top, is 37 seconds.

This is the time that the code takes to execute on the test system. The CPU Time, shown below Elapsed Time, is 1,306 seconds—this is slightly less than 37 seconds multiplied by 36 cores (1,332 seconds). CPU Time is the combined duration sum of all threads (or cores, because hyper-threading was disabled in our test) contributing to the execution.

Figure 4. Intel® VTune™ Amplifier XE 2017 beta analysis; summary for BVLC Caffe* CIFAR-10* execution

The CPU Usage Histogram, Figure 4 bottom, shows how often a given number of threads ran simultaneously during the test. Most of the time, only a single thread (a single core) was running—14 seconds out of the 37-second total. The rest of the time, we had a very inefficient multithreaded run with less than 20 threads contributing to the execution.

The Top Hotspots section of the execution summary, Figure 4 middle, gives an indication of what is happening here. It lists function calls and their corresponding combined CPU times. The kmp_fork_barrier function is an internal OpenMP function for implicit barriers, and it is used to synchronize thread execution. With the kmp_fork_barrier function taking 1,130 seconds of CPU time, this means that during 87 percent of the CPU execution time, threads were spinning at this barrier without doing any useful work.

The source code of the BVLC Caffe package contains no #pragma omp parallel code line. In the BVLC Caffe code, there is no explicit use of the OpenMP library for multithreading. However, OpenMP threads are used inside of the Intel MKL to parallelize some of the math-routine calls. To confirm this parallelization, we can look at a bottom-up tab view (see Figure 5 and review the function calls with Effective Time by Utilization [at the top] and the individual thread timelines [at the bottom]).

Figure 5 shows the function-call hotspots for BVLC Caffe on the CIFAR-10 dataset.

Figure 5. Timeline visualization and function-call hotspots for BVLC Caffe* CIFAR-10* dataset training

The gemm_omp_driver_v2 function, part of libmkl_intel_thread.so, is a general matrix-matrix (GEMM) multiplication implementation of Intel MKL. This function uses OpenMP multithreading behind the scenes. Optimized Intel MKL matrix-matrix multiplication is the main function used for forward and backward propagation—that is, for weight calculation, prediction, and adjustment. Intel MKL initializes OpenMP multithreading, which usually reduces the computation time of GEMM operations. However, in this particular case (convolution for 32 × 32 images), the workload is not big enough to efficiently utilize all 36 OpenMP threads on 36 cores in a single GEMM operation. Because of this, a different multithreading-parallelization scheme is needed, as will be shown later in this paper.

To demonstrate the overhead of OpenMP thread utilization, we run code with the OMP_NUM_THREADS=1 environment variable, and then compare the execution times for the same workload: 31.1 seconds instead of 37 seconds (see the Elapsed Time section in Figure 4 and Figure 6 top). By using this environment variable, we force OpenMP to create only a single thread and to use it for code execution. The resulting almost six seconds of runtime difference in the BVLC Caffe code implementation provides an indication of the OpenMP thread initialization and synchronization overhead.
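
For example, the single-threaded measurement can be repeated with the same time command shown earlier (an illustrative command line; the exact invocation behind the Figure 6 analysis is not given in the original):

$ export OMP_NUM_THREADS=1
$ ./build/tools/caffe time \
    --model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
    -iterations 100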

Figure 6. Intel® VTune™ Amplifier XE 2017 beta analysis summary for BVLC Caffe* CIFAR-10* dataset execution with a single thread: OMP_NUM_THREADS=1

With this analysis setup, we identified three main candidates for performance optimization in the BVLC Caffe implementation: the im2col_cpu, col2im_cpu, and PoolingLayer::Forward_cpu function calls (see Figure 6 middle).

Code Optimizations

The Caffe optimized for Intel architecture implementation for the CIFAR-10 dataset is about 13.5 times faster than BVLC Caffe code (20 milliseconds [ms] versus 270 ms for forward-backward propagation). Figure 7 shows the results of our forward-backward propagation averaged across 1,000 iterations. The left column shows the BVLC Caffe results, and the right column shows the results for Caffe optimized for Intel architecture.

Figure 7. Forward-backward propagation results

For an in-depth description of these individual layers, refer to the Neural-Network-Layers Optimization Results section below.

For more information about defining calculation parameters for layers, visit http://caffe.berkeleyvision.org/tutorial/layers.html.

The following sections describe the optimizations used to improve the performance of various layers. Our techniques followed the methodology guidelines of Intel® Modern Code Developer Code, and some of these optimizations rely on Intel MKL 2017 math primitives. The optimization and parallelization techniques used in Caffe optimized for Intel architecture are presented here to help you better understand how the code is implemented and to empower code developers to apply these techniques for other machine learning and deep learning applications and frameworks.

Scalar and Serial Optimizations

Code Vectorization

After profiling the BVLC Caffe code and identifying hotspots—function calls that consumed most of the CPU time—we applied optimizations for vectorization. These optimizations included the following:

  • Basic Linear Algebra Subprograms (BLAS) libraries (switch from Automatically Tuned Linear Algebra System [ATLAS*] to Intel MKL) 
  • Optimizations in assembly (Xbyak just-in-time [JIT] assembler) 
  • GNU Compiler Collection* (GCC*) and OpenMP code vectorization

BVLC Caffe has the option to use Intel MKL BLAS function calls or other implementations. For example, the GEMM function is optimized for vectorization, multithreading, and better cache traffic. For better vectorization, we also used Xbyak — a JIT assembler for x86 (IA-32) and x64 (AMD64* or x86-64). Xbyak currently supports the following list of vector-instruction sets: MMX™ technology, Intel® Streaming SIMD Extensions (Intel® SSE), Intel SSE2, Intel SSE3, Intel SSE4, floating-point unit, Intel AVX, Intel AVX2, and Intel AVX-512.

The Xbyak assembler is an x86/x64 JIT assembler for C++, a library specifically created for developing code efficiently. The Xbyak assembler is provided as header-only code. It can also dynamically assemble x86 and x64 mnemonics. JIT binary-code generation while code is running allows for several optimizations, such as quantization, an operation that divides the elements of a given array by the elements of a second array, and polynomial calculation, an operation that creates actions according to constant, variable x, add, sub, mul, div, and so on. With the support of Intel AVX and Intel AVX2 vector-instruction sets, Xbyak can get a better vectorization ratio in the code implementation of Caffe optimized for Intel architecture. The latest version of Xbyak has Intel AVX-512 vector-instruction-set support, which can improve computational performance on the Intel Xeon Phi processor x200 product family. This improved vectorization ratio allows Xbyak to process more data simultaneously with single instruction, multiple data (SIMD) instructions, which more efficiently utilize data parallelism. We used Xbyak to vectorize this operation, which improved the performance of the pooling layer significantly. If we know the pooling parameters, we can generate assembly code to handle a particular pooling model for a specific pooling window or pooling algorithm. The result is plain assembly code that is proven to be more efficient than compiled C++ code.
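
To give a flavor of how Xbyak is used, the following minimal example (an illustration only, not the Caffe pooling kernel; it assumes Linux x86-64 and the System V calling convention, where the first integer argument arrives in edi) JIT-generates a function whose behavior depends on a value known only at run time:

// xbyak_add_const.cpp: compile with icpc or g++, adding -I<path to xbyak headers>
#include <xbyak/xbyak.h>

struct AddConstGenerator : Xbyak::CodeGenerator {
    explicit AddConstGenerator(int c) {
        mov(eax, edi);  // copy the integer argument into the return register
        add(eax, c);    // the run-time constant is baked into the generated code
        ret();
    }
};

int main() {
    AddConstGenerator gen(42);
    // getCode() returns a callable pointer to the freshly generated code.
    int (*add42)(int) = gen.getCode<int (*)(int)>();
    return add42(1) == 43 ? 0 : 1;   // returns 0 on success
}

In the same way, the pooling generator in Caffe optimized for Intel architecture emits a kernel specialized for the known pooling window and algorithm instead of relying on generic C++ code.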

Generic Code Optimizations

Other serial optimizations included:

  • Reducing algorithm complexity
  • Reducing the amount of calculations
  • Unwinding loops

Common-code elimination is one of the scalar optimization techniques that we applied during code optimization: it identifies values that can be calculated once, outside of the innermost for-loop.

For example, consider the following code snippet:

for (int h_col = 0; h_col < height_col; ++h_col) {
  for (int w_col = 0; w_col < width_col; ++w_col) {
    int h_im = h_col * stride_h - pad_h + h_offset;
    int w_im = w_col * stride_w - pad_w + w_offset;

In the third line of this code snippet, the h_im calculation does not use the w_col index of the innermost loop, yet it is still performed on every iteration of that loop. Instead, we can move this line outside of the innermost loop, as in the following code:

for (int h_col = 0; h_col < height_col; ++h_col) {
  int h_im = h_col * stride_h - pad_h + h_offset;
  for (int w_col = 0; w_col < width_col; ++w_col) {
    int w_im = w_col * stride_w - pad_w + w_offset;

CPU-Specific, System-Specific, and Other Generic Code-Optimization Techniques

The following additional generic optimizations were applied:

  • Improved im2col_cpu/col2im_cpu implementation
  • Complexity reduction for batch normalization
  • CPU/system-specific optimizations
  • Use one core per computing thread
  • Avoid thread movement

Intel VTune Amplifier XE 2017 beta identified the im2col_cpu function as one of the hotspot functions—making it a good candidate for performance optimization. The im2col_cpu function is a common step in performing direct convolution as a GEMM operation for using the highly optimized BLAS libraries. Each local patch is expanded to a separate vector, and the whole image is converted to a larger (more memory-intensive) matrix whose rows correspond to the multiple locations where filters will be applied.

One of the optimization techniques for the im2col_cpu function is index-calculation reduction. The BVLC Caffe code had three nested loops for going through image pixels:

for (int c_col = 0; c_col < channels_col; ++c_col)
  for (int h_col = 0; h_col < height_col; ++h_col)
    for (int w_col = 0; w_col < width_col; ++w_col)
      data_col[(c_col*height_col+h_col)*width_col+w_col] = // ...

In this code snippet, BVLC Caffe was originally calculating the corresponding index of the data_col array element, although the indexes of this array are simply processed sequentially. Therefore, four arithmetic operations (two additions and two multiplications) can be substituted by a single index-incrementation operation. In addition, the complexity of the conditional check can be reduced due to the following:

/* Function uses casting from int to unsigned to compare if value
of parameter a is greater or equal to zero and lower than value of
parameter b. The b parameter has signed type and always positive,
therefore its value is always lower than 0x800... where casting
negative parameter value converts it to value higher than 0x800...
The casting allows to use one condition instead of two. */
inline bool is_a_ge_zero_and_a_lt_b(int a, int b) {
  return static_cast<unsigned>(a) < static_cast<unsigned>(b);
}

In BVLC Caffe, the original code had the conditional check if (x >= 0 && x < N), where x and N are both signed integers, and N is always positive. By converting the type of those integer numbers into unsigned integers, the interval for the comparison can be changed. Instead of running two compares with logical AND, a single comparison is sufficient after type casting:

if (((unsigned) x) < ((unsigned) N))
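
Putting the two ideas together, a sketch of one channel-column of the im2col loop might look like the following. This is illustrative only (parameter names follow the BVLC snippets above, and data_im/height/width refer to the source image plane); it is not the actual Caffe optimized for Intel architecture source:

template <typename Dtype>
void im2col_channel_sketch(const Dtype* data_im, int height, int width,
                           int height_col, int width_col,
                           int stride_h, int stride_w, int pad_h, int pad_w,
                           int h_offset, int w_offset, Dtype* data_col) {
  int index = 0;  // sequential write position replaces the recomputed
                  // (c_col * height_col + h_col) * width_col + w_col expression
  for (int h_col = 0; h_col < height_col; ++h_col) {
    const int h_im = h_col * stride_h - pad_h + h_offset;  // hoisted out of the inner loop
    for (int w_col = 0; w_col < width_col; ++w_col) {
      const int w_im = w_col * stride_w - pad_w + w_offset;
      // one unsigned comparison per coordinate (see is_a_ge_zero_and_a_lt_b above)
      data_col[index++] =
          (is_a_ge_zero_and_a_lt_b(h_im, height) &&
           is_a_ge_zero_and_a_lt_b(w_im, width))
              ? data_im[h_im * width + w_im]
              : Dtype(0);
    }
  }
}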

To avoid thread movement by the operating system, we used the OpenMP affinity environment variable, KMP_AFFINITY=compact,granularity=fine. Compact placement of neighboring threads can improve performance of GEMM operations because all threads that share the same last-level cache (LLC) can reuse previously prefetched cache lines with data.

For cache-blocking-optimization implementations and for data layout and vectorization, please refer to the following publication: http://arxiv.org/pdf/1602.06709v1.pdf.

Code Parallelization with OpenMP*

Neural-Network-Layers Optimization Results

The following neural network layers were optimized by applying OpenMP multithreading parallelization to them:

  • Convolution
  • Deconvolution
  • Local response normalization (LRN)
  • ReLU
  • Softmax
  • Concatenation
  • Utilities for OpenBLAS* optimization—such as the vPowx - y[i] = x[i]β operation, caffe_set, caffe_copy, and caffe_rng_bernoulli
  • Pooling
  • Dropout
  • Batch normalization
  • Data
  • Eltwise

Convolution Layer

The convolution layer, as the name suggests, convolves the input with a set of learned weights or filters, each producing one feature map in the output image. This optimization prevents under-utilization of hardware for a single set of input feature maps.

template <typename Dtype>
void ConvolutionLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& \
      bottom, const vector<Blob<Dtype>*>& top) {
  const Dtype* weight = this->blobs_[0]->cpu_data();
  // If we have more threads available than batches to be processed, then
  // we are wasting resources (batch sizes lower than 36 on the Xeon E5 system used here),
  // so we instruct Intel MKL to run single-threaded while OpenMP
  // parallelizes over the images in the batch (see the discussion below).
  for (int i = 0; i < bottom.size(); ++i) {
    const Dtype* bottom_data = bottom[i]->cpu_data();
    Dtype* top_data = top[i]->mutable_cpu_data();
#ifdef _OPENMP
    #pragma omp parallel for num_threads(this->num_of_threads_)
#endif
      for (int n = 0; n < this->num_; ++n) {
        this->forward_cpu_gemm(bottom_data + n*this->bottom_dim_,
                               weight,
                               top_data + n*this->top_dim_);
        if (this->bias_term_) {
          const Dtype* bias = this->blobs_[1]->cpu_data();
          this->forward_cpu_bias(top_data + n * this->top_dim_, bias);
        }
      }
  }
}

We process k = min(num_threads, batch_size) sets of input feature maps; that is, k im2col operations and k corresponding calls to Intel MKL happen in parallel. Intel MKL switches to a single-threaded execution flow automatically, and overall performance is better than it was when Intel MKL processed the whole batch with its own internal threading. This behavior is defined in the source code file src/caffe/layers/base_conv_layer.cpp; the OpenMP multithreading shown above is implemented in src/caffe/layers/conv_layer.cpp.

Pooling or Subsampling

Max-pooling, average-pooling, and stochastic-pooling (not implemented yet) are different methods for downsampling, with max-pooling being the most popular method. The pooling layer partitions the results of the previous layer into a set of usually non-overlapping rectangular tiles. For each such sub-region, the layer then outputs the maximum, the arithmetic mean, or (in the future) a stochastic value sampled from a multinomial distribution formed from the activations of each tile.

Pooling is useful in CNNs for three main reasons:

  • Pooling reduces the dimensionality of the problem and the computational load for upper layers.
  • Pooling lower layers allows the convolutional kernels in higher layers to cover larger areas of the input data and therefore learn more complex features; for example, a lower-layer kernel usually learns to recognize small edges, whereas a higher-layer kernel might learn to recognize sceneries like forests or beaches.
  • Max-pooling provides a form of translation invariance. Out of eight possible directions in which a 2 × 2 tile (the typical tile for pooling) can be translated by a single pixel, three will return the same max value; for a 3 × 3 window, five will return the same max value.

Pooling works on a single feature map, so we used Xbyak to make an efficient assembly procedure that can perform average and max pooling for one or more input feature maps. This pooling procedure can be applied to a batch of input feature maps when the procedure is run in parallel with OpenMP.

The pooling layer is parallelized with OpenMP multithreading; because images are independent, they can be processed in parallel by different threads:

#ifdef _OPENMP
  #pragma omp parallel for collapse(2)
#endif
  for (int image = 0; image < num_batches; ++image)
    for (int channel = 0; channel < num_channels; ++channel)
      generator_func(bottom_data, top_data, top_count, image, image+1,
                        mask, channel, channel+1, this, use_top_mask);
}

With the collapse(2) clause, the OpenMP #pragma omp parallel for applies to both nested for-loops (iterating over the images in the batch and over the image channels), combining them into a single loop that is then parallelized.

Softmax and Loss Layer

The loss (cost) function is the key component in machine learning that guides the network training process by comparing a prediction output to a target or label and then readjusting weights to minimize the cost by calculating gradients—partial derivatives of the weights with respect to the loss function.

The softmax (normalized exponential) function is the gradient-log normalizer of the categorical probability distribution. In general, this is used to calculate the possible results of a random event that can take on one of K possible outcomes, with the probability of each outcome separately specified. Specifically, in multinomial logistic regression (a multi-class classification problem), the input to this function is the result of K distinct linear functions, and the predicted probability for the jth class for sample vector x is:
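
The formula itself is not reproduced in this version of the article; in standard notation, the softmax probability of class j for input x with per-class weight vectors w_k is:

P(y = j \mid x) = \frac{\exp(x^{T} w_j)}{\sum_{k=1}^{K} \exp(x^{T} w_k)}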

OpenMP multithreading, when applied to these calculations, parallelizes the work by using a master thread to fork a specified number of subordinate threads and dividing the task among them. The threads then run concurrently as they are allocated to different processors. For example, in the following code, the individual arithmetic operations with independent data access (division by the calculated norm in each channel) are parallelized:

    // division
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (int j = 0; j < channels; j++) {
      caffe_div(inner_num_, top_data + j*inner_num_, scale_data,
              top_data + j*inner_num_);
    }

Rectified Linear Unit (ReLU) and Sigmoid—Activation/ Neuron Layers

ReLUs are currently the most popular non-linear functions used in deep learning algorithms. Activation/neuron layers are element-wise operators that take one bottom blob and produce one top blob of the same size. (A blob is the standard array and unified memory interface for the framework. As data and derivatives flow through the network, Caffe stores, communicates, and manipulates the information as blobs.)

The ReLU layer takes input value x and computes the output as x for positive values and scales them by negative_slope for negative values:
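
The formula is not reproduced here, but it can be read directly from the forward-pass code below:

f(x) = \max(x, 0) + \text{negative\_slope} \cdot \min(x, 0)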

The default parameter value for negative_slope is zero, which is equivalent to the standard ReLU function of taking max(x, 0). Due to the data-independent nature of the activation process, each blob can be processed in parallel, as shown below:

template <typename Dtype>
void ReLULayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  const Dtype* bottom_data = bottom[0]->cpu_data();
  Dtype* top_data = top[0]->mutable_cpu_data();
  const int count = bottom[0]->count();
  Dtype negative_slope=this->layer_param_.relu_param().negative_slope();
#ifdef _OPENMP
#pragma omp parallel for
#endif
  for (int i = 0; i < count; ++i) {
    top_data[i] = std::max(bottom_data[i], Dtype(0))
        + negative_slope * std::min(bottom_data[i], Dtype(0));
  }
}

Similar parallel calculations can be used for backward propagation, as shown below:

template <typename Dtype>
void ReLULayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down,
    const vector<Blob<Dtype>*>& bottom) {
  if (propagate_down[0]) {
    const Dtype* bottom_data = bottom[0]->cpu_data();
    const Dtype* top_diff = top[0]->cpu_diff();
    Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
    const int count = bottom[0]->count();
    Dtype negative_slope=this->layer_param_.relu_param().negative_slope();
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (int i = 0; i < count; ++i) {
      bottom_diff[i] = top_diff[i] * ((bottom_data[i] > 0)
          + negative_slope * (bottom_data[i] <= 0));
    }
  }
}

In the same fashion, the sigmoid function S(x) = 1 / (1 + exp(-x)) can be parallelized in the following way:

#ifdef _OPENMP
  #pragma omp parallel for
#endif
  for (int i = 0; i < count; ++i) {
    top_data[i] = sigmoid(bottom_data[i]);
  }

Because Intel MKL does not provide math primitives for ReLUs, we tried to add this functionality by implementing a performance-optimized version of the ReLU layer with assembly code (via Xbyak). However, we found no visible gain for Intel Xeon processors — perhaps due to limited memory bandwidth. Parallelization of the existing C++ code was good enough to improve the overall performance.

Conclusion

The previous section discussed various components and layers of neural networks and how blobs of processed data in these layers were distributed among available OpenMP threads and Intel MKL threads. The CPU Usage Histogram in Figure 8 shows how often a given number of threads ran concurrently after our optimizations and parallelizations were applied.

With Caffe optimized for Intel architecture, the number of simultaneously operating threads is significantly increased. The execution time on our test system dropped from 37 seconds in the original, unmodified run to only 3.6 seconds with Caffe optimized for Intel architecture—improving the overall execution performance by more than 10 times.

Figure 8. Intel® VTune™ Amplifier XE 2017 beta analysis summary of the Caffe* optimized for Intel® architecture implementation for CIFAR-10* training

As shown in the Elapsed Time section, Figure 8 top, there is still some Spin Time present during the execution of this run. As a result, the execution’s performance does not scale linearly with the increased thread count (in accordance with Amdahl’s law). In addition, there are still serial execution regions in the code that are not parallelized with OpenMP multithreading. Re-initialization of OpenMP parallel regions was significantly optimized for the latest OpenMP library implementations, but it still introduces non-negligible performance overhead. Moving OpenMP parallel regions into the main function of the code could potentially improve the performance even more, but it would require significant code refactoring.

Figure 9 summarizes the described optimization techniques and code-rewriting principles that we followed with Caffe optimized for Intel architecture.

Figure 9. Step-by-step approach of Intel® Modern Code Developer Code

In our testing, we used Intel VTune Amplifier XE 2017 beta to find hotspots—good code candidates for optimization and parallelization. We implemented scalar and serial optimizations, including common-code elimination and reduction/simplification of arithmetic operations for loop index and conditional calculations. Next, we optimized the code for vectorization following the general principles described in “Auto-vectorization in GCC” (https://gcc.gnu.org/projects/tree-ssa/vectorization.html). The JIT assembler Xbyak allowed us to use SIMD operations more efficiently.

We implemented multithreading with an OpenMP library inside the neural-network layers, where data operations on images or channels were data-independent. The last step in implementing the Intel Modern Code Developer Code approach involved scaling the single-node application for many-core architectures and a multi-node cluster environment. This is the main focus of our research and implementation at this moment. We also applied optimizations for memory (cache) reuse for better computational performance. For more information see: http://arxiv.org/pdf/1602.06709v1.pdf. Our optimizations for the Intel Xeon Phi processor x200 product family included the use of high-bandwidth MCDRAM memory and utilization of the quadrant NUMA mode.

Caffe optimized for Intel architecture not only improves computational performance, but it enables you to extract increasingly complex features from data. The optimizations, tools, and modifications included in this paper will help you achieve top computational performance from Caffe optimized for Intel architecture.


Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information, visit intel.com/performance.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice Revision #20110804

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Intel technologies may require enabled hardware, specific software, or services activation. Check with your system manufacturer or retailer.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting intel.com/design/literature.htm.

This sample source code is released under the Intel Sample Source Code License Agreement.

Intel, the Intel logo, Intel Xeon Phi, VTune, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Copyright © 2016 Intel Corporation. All rights reserved.

0816/VK/PRW/PDF 334759-001US

Hybrid Parallelism: A MiniFE* Case Study


In my first article, Hybrid Parallelism: Parallel Distributed Memory and Shared Memory Computing, I discussed the chief forms of parallelism: shared memory parallel programming and distributed memory message passing parallel programming. That article explained the basics of threading for shared memory programming and Message Passing Interface* (MPI) message passing. It included an analysis of a hybrid hierarchical version of one of the NAS Parallel Benchmarks*. In that case study the parallelism for threading was done at a lower level than the parallelism for MPI message passing.

This case study examines the situation where the problem decomposition is the same for threading as it is for MPI; that is, the threading parallelism is elevated to the same level as the MPI parallelism. The reasons to elevate threading to the same parallel level as MPI message passing are to see whether performance can improve because of lower overhead in thread libraries than in MPI calls, and to check whether memory consumption is reduced by using threads rather than large numbers of MPI ranks.

This paper provides a brief overview of threading and MPI, followed by a discussion of the changes made to miniFE*. Performance results are also shared. The performance gains are minimal, but threading consumed less memory, and data sets 15 percent larger could be solved using the threaded model; the main benefit is memory usage. This article will be of most interest to those who want to optimize for memory footprint, especially those working on first-generation Intel® Xeon Phi™ coprocessors (code-named Knights Corner), where available memory on the card is limited.

Many examples of hybrid distributed/shared memory parallel programming follow a hierarchical approach: MPI distributed memory programming is done at the top, and shared memory parallel programming is introduced in multiple regions of the software underneath (OpenMP* is a popular choice). Many software developers design good problem decomposition for MPI programming. However, when using OpenMP, some developers fall back to simply placing pragmas around do or for loops without considering overall problem decomposition and data locality. For this reason some say that if they want good parallel threaded code, they would write the MPI code first and then port it to OpenMP, because they are confident this would force them to implement good problem decomposition with good data locality.

This leads to another question: if good performance can be obtained either way, does it matter whether a threading model or MPI is used? There are two things to consider. One is performance, of course: is MPI or threading inherently faster than the other? The second consideration is memory consumption. When an MPI job begins, an initial number of MPI processes, or ranks, that will cooperate to complete the overall work is specified. As ever larger problem sizes or data sets are run, the number of systems in a cluster dedicated to a particular job increases, and thus the number of MPI ranks increases. As the number of MPI ranks increases, the MPI runtime libraries consume more memory in order to be ready to handle a larger number of potential messages (see Hybrid Parallelism: Parallel Distributed Memory and Shared Memory Computing).

This case study compares both performance and memory consumption. The code used in this case study is miniFE. MiniFE was developed at Sandia National Labs and is now distributed as part of the Mantevo project (see mantevo.org).

Shared Memory Threaded Parallel Programming

In the threading model, all the resources belong to the same process. Memory belongs to the process so sharing memory between threads can be easy. Each thread must be given a pointer to the common shared memory location. This means that there must be at least one common pointer or address passed into each thread so each thread can access the shared memory regions. Each thread has its own instruction pointer and stack.

A problem with threaded software is the potential for data race conditions. A data race occurs when two or more threads access the same memory address and at least one of the threads alters the value in memory. Whether the writing thread completes its write before or after the reading thread reads the value can alter the results of the computation. Mutexes, barriers, and locks were designed to control execution flow, protect memory, and prevent data races. These constructs create other problems: deadlock can occur, preventing any forward progress in the code, and contention for mutexes or locks can restrict execution flow and become a bottleneck. Mutexes and locks are not a simple cure-all. If not used correctly, data races can still exist. Placing locks that protect code segments rather than memory references is the most common error.
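A minimal, self-contained illustration of the idea (not miniFE code): two C++ threads update a shared counter, and a mutex protects the shared memory reference so the increments do not race.

#include <cstdio>
#include <mutex>
#include <thread>

int main() {
    long counter = 0;
    std::mutex m;

    auto work = [&]() {
        for (int i = 0; i < 1000000; ++i) {
            std::lock_guard<std::mutex> guard(m);  // lock protects the memory update
            ++counter;                             // without the lock this is a data race
        }
    };

    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    std::printf("counter = %ld\n", counter);       // always 2000000 with the lock held
    return 0;
}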

Distributed Memory MPI Parallel Programming

Distributed memory parallel programming models offer a range of methods with MPI; the most commonly used elements of MPI are message passing constructs. The discussion in this case study uses the traditional explicit message passing interface of MPI. Any data that one MPI rank has that may be needed by another MPI rank must explicitly be sent by the first MPI rank to the ranks that need that data. In addition, the receiving MPI rank must explicitly request that the data be received before it can access and use the data sent. The developer must define the buffers used to send and receive data as well as pack or unpack them if necessary; if data is received into its desired location, it doesn't need to be unpacked.
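A minimal sketch of this explicit send/receive pattern (illustrative only, not miniFE code): rank 0 packs and sends a small buffer, and rank 1 posts the matching receive before using the data.

#include <cstdio>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[4] = {0.0, 0.0, 0.0, 0.0};
    if (rank == 0) {
        for (int i = 0; i < 4; ++i) buf[i] = i + 1.0;   // pack the send buffer
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // the receive must be posted explicitly before the data can be used
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %g %g %g %g\n", buf[0], buf[1], buf[2], buf[3]);
    }

    MPI_Finalize();
    return 0;
}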

Finite Element Analysis

In finite element analysis a physical domain whose behavior is modeled by partial differential equations is divided into very small regions called elements. A set of basis functions (often polynomials) is defined for each element. The parameters of the basis functions approximate the solution to the partial differential equations within each element. The solution phase typically is a step of minimizing the difference between the true value of the physical property and the value approximated by the basis functions. The operations form a linear system of equations for each finite element, known as an element stiffness matrix. Each of these element stiffness matrices is added into a global stiffness matrix, which is solved to determine the values of interest. The solution values represent a physical property: displacement, stress, density, velocity, and so on.
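The scatter-add of element stiffness matrices into the global matrix is the heart of the assembly step. The following toy, self-contained sketch (not miniFE code) assembles 1-D two-node elements into a dense global matrix purely to illustrate the idea; real codes such as miniFE use sparse storage and 3-D hexahedral elements.

#include <cstdio>
#include <vector>

int main() {
    const int nElems = 4, nNodes = nElems + 1;
    std::vector<double> K(nNodes * nNodes, 0.0);            // global stiffness matrix
    const double Ke[2][2] = {{ 1.0, -1.0 }, { -1.0, 1.0 }}; // 2x2 element stiffness matrix

    for (int e = 0; e < nElems; ++e) {
        const int nodes[2] = { e, e + 1 };                  // local-to-global node map
        for (int i = 0; i < 2; ++i)
            for (int j = 0; j < 2; ++j)
                K[nodes[i] * nNodes + nodes[j]] += Ke[i][j];  // scatter-add contribution
    }

    for (int r = 0; r < nNodes; ++r) {
        for (int c = 0; c < nNodes; ++c)
            std::printf("%6.1f", K[r * nNodes + c]);
        std::printf("\n");
    }
    return 0;
}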

 

MiniFE is representative of more general finite element analysis packages. In a general finite element program the elements may be irregular, of varying size, and of different physical properties. Only one element type and one domain are used within miniFE. The domain selected is always a rectangular prism that is divided into an integral number of elements along the three major axes: x, y, and z. The rectangular prism is then sliced parallel to the principal planes recursively to create smaller sets of the finite elements. This is illustrated in Figure 1.


Figure 1: A rectangular prism divided into finite elements and then split into four subdomains.

 

The domain is divided into several regions or subdomains containing a set of finite elements. Figure 1 illustrates the recursive splitting. The dark purple line near the center of the prism on the left shows where the domain may be divided into two subdomains. Further splitting may occur as shown by the two green lines. The figure on the right shows the global domain split into four subdomains. This example shows the splitting occurring perpendicular to the z-axis. However, the splitting may be done parallel to any plane. The splitting is done recursively to obtain the desired number of subdomains. In the original miniFE code, each of these subdomains is assigned to an MPI rank: one subdomain per each MPI rank. Each MPI rank determines a local numbering of its subdomain and maps that numbering to the numbering of the global domain. For example, an MPI rank may hold a 10×10×10 subdomain. Locally it would number this from 0–9 in the x, y, and z directions. Globally, however, this may belong to the region numbered 100–109 on the x-axis, 220–229 on the y-axis, and 710–719 along the z axis. Each MPI rank determines the MPI ranks with which it shares edges or faces, and then initializes the buffers it will use to send and receive data during the conjugate gradient solver phase used to solve the linear system of equations. There are additional MPI communication costs; when a dot product is formed, each rank calculates its local contribution to the dot product and then the value must be reduced across all MPI ranks.

The miniFE code has options using threading models to parallelize the for loops surrounding many activities. In this mode the MPI ranks do not further subdivide their subdomain recursively into multiple smaller subdomains for each thread. Instead, for loops within the calculations for each subdomain are divided into parallel tasks using OpenMP*, Intel® Cilk™ Plus, or qthreads*. Initial performance data on the original reference code showed that running calculations for a specified problem set on 10 MPI ranks without threading was much faster than running one MPI rank with 10 threads. 

So the division among threads was not as efficient as the division between MPI ranks. Software optimization should begin with software performance analysis, so I used the TAU* Performance System to analyze the code. The data showed that the waxpy (vector w = alpha*x + y) operations consumed much more time in the hybrid thread/MPI version than in the MPI-only version. The waxpy operation is inherently parallel: it doesn't involve any reduction like a dot product, and there are no potential data-sharing problems that would complicate threading. The only reason for the waxpy operation to consume more time is that the threading models used are all fork-join models. That is, the work for each thread is forked off at the beginning of a for loop, and then all the threads join back again at the end. The effort to initiate computation at the fork and then synchronize at the end adds considerable overhead, which was not present in an MPI-only version of the code.

The original miniFE code divided the domain into a number of subdomains that matches the number of MPI ranks (this number is called numprocs). The identifier for each subdomain was the MPI rank number, called myproc. In the new hybrid code the original domain is divided into a number of subdomains that matches the total number of threads globally; this is the number of MPI ranks times the number of threads per rank (numthreads). This global count of subdomains is called idivs (idivs = numprocs * numthreads). Each thread is given a local identifier, mythread (beginning at zero, of course). The identifier of each subdomain changes from myproc to mypos (mypos = myproc * numthreads + mythread). When there is only one thread per MPI rank, mypos and myproc are equal. Code changes were implemented in each file to change references from numprocs to idivs, and from myproc to mypos. A new routine was written between the main program and the routine driver. The main program forks off the number of threads indicated. Each thread begins execution of this new routine, which then calls the driver, and in turn each thread calls all of the subroutines that execute the full code path below the routine driver.

The principle of dividing work into compact local regions, or subdomains, remains the same. For example when a subdomain needs to share data with an adjacent subdomain it loops through all its neighbors and shares the necessary data. The code snippets below show the loops for sharing data from one subdomain to another in the original code and the new code. In these code snippets each subdomain is sending data to its adjacent neighbors with which it shares faces or edges. In the original code each subdomain maps to an MPI rank. These code snippets come from the file exchange_externals.hpp.  The original code is shown below in the first text box. Comments are added to increase clarity. 

Original code showing sends for data exchanges:

// prepare data to send to neighbors by copying data into send buffers
for(size_t i=0; i<total_to_be_sent; ++i) {
  send_buffer[i] = x.coefs[elements_to_send[i]];
}
//loop over all adjacent subdomain neighbors – Send data to each neighbor
Scalar* s_buffer = &send_buffer[0];
for(int i=0; i<num_neighbors; ++i) {
  int n_send = send_length[i];
  MPI_Send(s_buffer, n_send, mpi_dtype, neighbors[i],
           MPI_MY_TAG, MPI_COMM_WORLD);
  s_buffer += n_send;
}


New code showing sends for data exchanges:

//loop over all adjacent subdomain neighbors – communicate data to each neighbor
for(int i=0; i<num_neighbors; ++i) {
  int n_send = send_length[i];
  if (neighbors[i]/numthreads != myproc)
  { // neighbor is in a different MPI rank: pack and send the data
    for (int ij = ibuf ; ij < ibuf + n_send ; ++ij)
      send_buffer[ij] = x.coefs[elements_to_send[ij]] ;
    MPI_Send(s_buffer, n_send, mpi_dtype,
             neighbors[i]/numthreads,
             MPI_MY_TAG+(neighbors[i]*numthreads)+mythread, MPI_COMM_WORLD);
  } else
  { // neighbor is another thread in this MPI rank: wait until the recipient
    // flags that it is safe to write, then write into the shared buffer
    while (sg_sends[neighbors[i]%numthreads][mythread]);
    stmp = (Scalar *) (sg_recvs[neighbors[i]%numthreads][mythread]);
    for (int ij = ibuf ; ij < ibuf + n_send ; ++ij)
      stmp[ij-ibuf] = x.coefs[elements_to_send[ij]] ;
    // set the flag indicating that the write completed
    sg_sends[neighbors[i]%numthreads][mythread] = 2 ;
  }
  s_buffer += n_send;
  ibuf += n_send ;
}


In the new code each subdomain maps to a thread. So each thread now communicates with threads responsible for neighboring subdomains. These other threads may or may not be in the same MPI rank. The setup of communicating data remains nearly the same. When communication mapping is set up, a vector of pointers is shared within each MPI rank. When communication is between threads in the same MPI rank (process), a buffer is allocated and both threads have access to the pointer to that buffer. When it is time to exchange data, a thread loops through all its neighbors. If the recipient is in another MPI rank, the thread makes a regular MPI send call. If the recipient is in the same process as the sender, the sending thread writes the data to the shared buffer and marks a flag that it completed the write.

Additional changes were also required. By default, MPI assumes only one thread in a process or the MPI rank sends and receives messages. In this new miniFE thread layout each thread may send or receive data from another MPI rank. This required changing MPI_Init to MPI_Init_thread with the setting MPI_THREAD_MULTIPLE. This sets up the MPI runtime library to behave in a thread-safe manner. It is important to remember that MPI message passing is between processes (MPI ranks) not threads, so by default when a thread sends a message to a remote system there is no distinction made between threads on the remote system. One method to handle this would be to create multiple MPI communicators. If there were a separate communicator for each thread in an MPI rank, a developer could control which thread received the message in the other MPI rank by its selection of the communicator. Another method would be to use different tags for each thread so that the tags identify which thread should receive a particular message. The latter was used in this implementation; MPI message tags were used to control which thread received messages. The changes in MPI message tags can be seen in the code snippets as well. In miniFE the sender fills the send buffer in the order the receiver prefers. Thus the receiver does not need to unpack data on receipt and can use the data directly from the receive buffer destination.  
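A minimal sketch of the initialization change (assuming C++ with the MPI C API; not the actual miniFE source) that requests full thread support and checks the level actually provided:

#include <cstdio>
#include <mpi.h>

int main(int argc, char **argv) {
    int provided = 0;
    // request full thread support so any thread in the rank may make MPI calls
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        std::fprintf(stderr, "MPI library does not provide MPI_THREAD_MULTIPLE\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    // ... launch threads; each sender offsets the message tag by its thread id
    //     so the matching receive is delivered to the intended thread ...
    MPI_Finalize();
    return 0;
}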

Dot products are more complicated because they are handled in a hierarchical fashion. First, a local sum is made by all threads in the MPI rank. One thread makes the MPI_Allreduce call, while the other threads stop at a thread barrier waiting for the MPI_Allreduce to be completed and the data recorded in an appropriate location for all threads to get a copy.
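The following sketch illustrates that hierarchy using OpenMP for the thread-level reduction (the actual miniFE port uses its own thread barriers and a simple tree reduction, described below): each rank's threads form a local partial sum, then a single MPI_Allreduce per rank combines the per-rank sums.

#include <mpi.h>

double hybrid_dot(const double *x, const double *y, int n) {
    double local = 0.0, global = 0.0;

    // thread-level reduction: each thread accumulates a private partial sum
    #pragma omp parallel for reduction(+ : local)
    for (int i = 0; i < n; ++i)
        local += x[i] * y[i];

    // one MPI call per rank combines the rank-local sums across the cluster
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}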

In this initial port, all of the data collected here used Intel® Threading Building Blocks (Intel® TBB) threads. These closely match the C++ thread specification, so it will be trivial to test using standard C++ threads.

Optimizations

The initial threading port achieved the goal of matching the vector axpy operation execution time. Even though this metric improved, prior to further tuning the threading model was still slower than the MPI version. Three principal optimization steps were applied to improve the threaded code performance.

The first step was to improve parallel operations like dot products. The initial port had each thread accumulate results such as dot products using simple locks. The first attempt replaced POSIX* mutexes with Intel TBB locks and then with atomic operations as flags. These steps made no appreciable improvement. Although the simple lock method worked for reductions or gathers during quick development with four threads, it did not scale well to a couple of hundred threads. A simple tree was created to add some thread parallelism to reductions such as the dot products. Implementing a simple tree for a parallel reduction offered a significant performance improvement; further refinements may offer small incremental improvements.

The second optimization was to make a copy of some global arrays for each thread (these arrays came from an MPI_Allgather). Because none of the threads alters the array values, there is no opportunity for race conditions or cache invalidation; the arrays are used in a read-only mode from that point on. The initial port therefore shared one copy of each array among all threads. For performance purposes this proved to be the wrong choice: creating a private copy of the array for each thread improved performance. Even after these optimizations, performance of the code with many threads still lagged behind the case with only one thread per MPI rank.
This leads to the third and last optimization step. The slow region was in problem setup and initialization, and the bottleneck was in dynamic memory allocation; a better memory allocator would resolve it. The default memory allocation libraries on Linux* do not scale for numerous threads. Several third-party scalable memory allocation libraries are available to resolve this problem, and all of them work better than the default Linux memory allocation runtime libraries. I used the Intel TBB memory allocator because I am familiar with it and it can be adopted without any code modification by simply using LD_PRELOAD. At runtime, LD_PRELOAD was defined so that the Intel TBB memory allocator, which is designed for parallel software, was substituted for all of the object and dynamic memory creation. This change of the memory allocation runtime libraries closed the performance gap and provided the biggest single performance improvement.
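A minimal example of that substitution (the library path and the miniFE invocation are illustrative and depend on your installation):

# preload the Intel TBB scalable allocator so malloc/new calls are redirected;
# adjust <tbb_install_dir> and the miniFE command line for your system
$ export LD_PRELOAD=<tbb_install_dir>/lib/intel64/gcc4.7/libtbbmalloc_proxy.so.2
$ mpirun -np 4 ./miniFE.x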

Performance

This new hybrid miniFE code ran on both the Intel® Xeon Phi™ coprocessor and the Intel® Xeon Phi™ processor. The data collected varied the number of MPI ranks and threads using a problem size that nearly consumed all of the system memory on each of the two platforms. For the first-generation Intel Xeon Phi coprocessor, the MPI rank-to-thread ratio varied from 1:244 to 244:1. A problem size of 256×256×512 was used for the tests. The results are shown in Figure 2.

Figure 2. MiniFE* performance on an Intel® Xeon Phi™ coprocessor.

 

The results show variations in performance based on the different MPI-to-thread ratios. Each ratio was run at least twice, and the fastest time was selected for reporting. More runs were collected for the ratios with slower execution times; the differences in time proved repeatable. Figure 3 shows the same tests on the Intel Xeon Phi processor using a larger problem size.

Figure 3: MiniFE* performance on Intel® Xeon Phi™ Processor

 


The performance on the Intel Xeon Phi processor showed less variation than the performance of miniFE on the Intel Xeon Phi coprocessor. No explanation is offered for the runtime differences between the various ratios of MPI ranks to threads. It may be possible to close those differences by explicitly pinning threads and MPI ranks to specific cores; these tests left process and thread assignment to the OS.

There is much less variation in miniFE performance than was reported for the NAS SP-MZ* benchmark hybrid code discussed in Hybrid Parallelism: Parallel Distributed Memory and Shared Memory Computing. The NAS benchmark code, though, did not create subdomains for each thread as was done in this investigation of miniFE, and the NAS SP-MZ code did not scale as well with threads as it did with MPI. This case study shows that, following the same decomposition, threads do as well as MPI ranks. On the Intel® Xeon Phi™ product family, miniFE performance was slightly better using the maximum number of threads and only one MPI rank than using the maximum number of MPI ranks with only one thread each. Best performance was achieved with a mixture of MPI ranks and threads.

Memory consumption proves to be the most interesting aspect. The Intel Xeon Phi coprocessor is frequently not set up with a disk to swap pages to virtual memory, which provides an ideal platform to evaluate the size of a problem that can be run with the associated runtime libraries. When running the miniFE hybrid code on the Intel Xeon Phi coprocessor, the largest problem size that ran successfully with one MPI rank for each core was 256×256×512. This is a problem size of 33,554,432 elements. The associated global stiffness matrix contained 908,921,857 nonzero entries. When running with only 1 MPI rank and creating a number of threads that match the number of cores, the same number of subdomains are created and a larger problem size—256×296×512—runs to completion. This larger problem contained 38,797,312 elements, and the corresponding global stiffness matrix had 1,050,756,217 nonzero elements. Based on the number of finite elements, the threading model allows developers to run a model 15 percent larger. Based on nonzero elements in the global stiffness matrix, the model solved a matrix that is 15.6 percent larger. The ability to run a larger problem size is a significant advantage that may appeal to some project teams.

There are further opportunities for optimization of the threaded software (for example, pinning threads and MPI ranks to specific cores and improving the parallel reductions). However, the principal tuning has been done, and further tuning would probably yield minimal changes in performance. The principal motivation to follow the same problem decomposition for threading as for MPI is the improvement in memory consumption.

Summary

The effort to write code for both threads and MPI is time consuming. Projects such as the Multi-Processor Computing (MPC) framework (see mpc.hpcframework.paratools.com) may make writing code in MPI and running via threads just as efficient in the future. The one-sided communication features of MPI-3 may allow developers to write code more like the threaded version of miniFE, where one thread writes the necessary data to the other threads' desired locations, minimizing the need for MPI runtime libraries to hold so much memory in reserve. When adding threading to MPI code, remember best practices such as watching for runtime libraries and system calls that may not be scalable or thread-friendly by default, such as memory allocation or rand().

Threaded software performs comparably with MPI when both follow the same parallel layout: one subdomain per MPI rank and one subdomain per thread. In cases like miniFE, threading consumes less memory than the MPI runtime libraries and allows larger problem sizes to be solved on the same system; for this implementation of miniFE, problem sizes 15 percent larger could be run on the same platform. Those seeking to optimize for memory consumption should consider the same parallel layout for both threading and MPI and will likely benefit from the transition.

Notes

Data collected using the Intel® C++ Compiler 16.0 and Intel® MPI Library 5.1.

Direct N-body Simulation


Exercise in performance optimization on Intel Architecture, including Intel® Xeon Phi processors.

NOTE: this lab is an overview of various optimizations discussed in Chapter 4 in the book "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors", second edition (2015). The book can be obtained at xeonphi.com/book

In this step we will look at how to modernize a piece of code through an example application. The provided source code is an N-body simulation: a simulation of many particles that interact with each other gravitationally or electrostatically. We keep track of the position and the velocity of each particle in the structure "ParticleType". The simulation is discretized into timesteps. In each timestep, first, the force on each particle (stored in the structure) is calculated with a direct all-to-all algorithm (O(n^2) complexity). Next, the velocity of each particle is modified using the explicit Euler method. Finally, the positions of the particles are updated using the explicit Euler method.

N-body simulations are used in astrophysics to model galaxy evolution, colliding galaxies, dark matter distribution in the Universe, and planetary systems. They are also used in simulations of molecular structures. Real astrophysical N-body simulations, targeted to systems with billions of particles, use simplifications to reduce the complexity of the method to O(n log n). However, our toy model is the basis on which the more complex models are built.

In this lab, you will mostly be modifying the function MoveParticles().

  1. Study the code, then compile and run the application to get the baseline performance. To run the application on the host, use the command "make run-cpu" and for coprocessor, use "make run-mic".

  2. Parallelize MoveParticles() by using OpenMP. Remember that there are two loops that need to be parallelized. You only need to parallelize the outer-most loop.

    Also modify the print statement, which is hardwired to print "1 thread" (i.e., print the actual number of threads used).

    Compile and run the application to see if you got an improvement.

  3. Apply strength reduction for the calculation of force (the j-loop). You should be able to limit the use of expensive operations to one sqrtf() and one division, with the rest being multiplications. Also make sure to control the precision of constants and functions (a sketch combining steps 2 and 3 appears after this list).

    Compile and run the application to see if you got an improvement.

  4. In the current implementation the particle data is stored in an Array of Structures (AoS), namely an array of "ParticleType" structures. Although this is great for readability and abstraction, it is sub-optimal for performance because the coordinates of consecutive particles are not adjacent in memory. Thus, when the positions and the velocities are accessed in the loop and vectorized, the data access has a non-unit stride, which hampers performance. Therefore it is often beneficial to instead implement a Structure of Arrays (SoA), where a single structure holds coordinate arrays.

    Implement SoA by replacing "ParticleType" with "ParticleSet". "ParticleSet" should have 6 arrays of size "n", one for each dimension of the coordinates (x, y, z) and velocities (vx, vy, vz). The i-th element of each array is the coordinate or velocity of the i-th particle. Be sure to also modify the initialization in main(), and modify the access to the arrays in "MoveParticles()". Compile, then run to see if you get a performance improvement.

  5. Let's analyze this application in terms of arithmetic intensity. Currently, the vectorized inner j-loop iterates through all particles for each i-th element. Since the cache line length and the vector length are the same, arithmetic intensity is simply the number of operations in the inner-most loop. Not counting the reduction at the bottom, the number of operations per iteration is ~20, which is less than the ~30 that the roofline model calls for.

    To fix this, we can use tiling to increase cache re-use. By tiling in "i" or "j" with Tile=16 (we chose 16 because it is both the cache line length and the vector length, in single-precision elements), we can increase the number of operations to ~16*20 = ~320. This is more than enough to be in the compute-bound region of the roofline model.

    Although the loop can be tiled in "i" or "j" (if we allow a loop swap), it is more beneficial to tile in "i" and therefore vectorize in "i". If we have "j" as the inner-most loop, each iteration requires three reductions of the vector register (for Fx, Fy, and Fz). This is costly because the reduction is not vectorizable. On the other hand, if we vectorize in "i" with Tile=16, no reduction is required. Note, though, that you will need to create three buffers of length 16 where you can store Fx, Fy, and Fz for the "i"-th elements.

    Implement tiling in "i", then compile and run to see the performance.

  6. Using MPI, parallelize the simulation across multiple processes (or compute nodes). To make this work doable in a short time span, keep the entire data set in each process. However, each MPI process should execute only a portion of the loop in the MoveParticles() function. Try to minimize the amount of communication between the nodes. You may find the MPI function MPI_Allgather() useful. Compile and run the code to see if you get a performance improvement.
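The following is a minimal sketch of steps 2 and 3 combined (OpenMP threading of the outer loop plus strength reduction in the inner loop). It is illustrative only, not the lab's reference code; the structure layout, the softening constant, and the update details are assumptions.

#include <math.h>

typedef struct { float x, y, z, vx, vy, vz; } ParticleType;

void MoveParticles(const int n, ParticleType* const p, const float dt) {
  /* step 2: thread the outer i-loop with OpenMP */
  #pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < n; i++) {
    float Fx = 0.0f, Fy = 0.0f, Fz = 0.0f;
    /* step 3: one division and one sqrtf per iteration, the rest multiplications */
    for (int j = 0; j < n; j++) {
      const float dx = p[j].x - p[i].x;
      const float dy = p[j].y - p[i].y;
      const float dz = p[j].z - p[i].z;
      const float rr = dx*dx + dy*dy + dz*dz + 1e-6f;  /* softening term (assumed) */
      const float rInv = 1.0f / sqrtf(rr);
      const float w = rInv * rInv * rInv;
      Fx += dx * w;  Fy += dy * w;  Fz += dz * w;
    }
    p[i].vx += dt * Fx;  p[i].vy += dt * Fy;  p[i].vz += dt * Fz;
  }
  /* explicit Euler position update */
  for (int i = 0; i < n; i++) {
    p[i].x += p[i].vx * dt;  p[i].y += p[i].vy * dt;  p[i].z += p[i].vz * dt;
  }
}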

Direct N-Body simulation code GitHub link: https://github.com/ColfaxResearch/N-body

Recipe: Building NAMD on Intel® Xeon® and Intel® Xeon Phi™ Processors


Purpose

This recipe describes a step-by-step process for how to get, build, and run the NAMD (Scalable Molecular Dynamics) code on the Intel® Xeon Phi™ processor and Intel® Xeon® E5 processors for better performance.

Introduction

NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecule systems. Based on Charm++ parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 500,000 cores for the largest simulations. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM, and X-PLOR.

NAMD is distributed free of charge with source code. You can build NAMD yourself or download binaries for a wide variety of platforms. Find the details below of how to build on Intel® Xeon Phi™ processor and Intel® Xeon® E5 processors and learn more about NAMD at http://www.ks.uiuc.edu/Research/namd/

Building NAMD on Intel® Xeon® Processor E5-2697 v4 (BDW) and Intel® Xeon Phi™ Processor 7250 (KNL)

  1. Download the latest NAMD source code(Nightly Build) from this site: http://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD
  2. Download fftw3 from this site: http://www.fftw.org/download.html
    • Version 3.3.4 is used in this run
  3. Build fftw3:
    1. cd <path>/fftw-3.3.4
    2. ./configure --prefix=$base/fftw3 --enable-single --disable-fortran CC=icc
                        Use -xMIC-AVX512 for KNL or -xCORE-AVX2 for BDW
    3. make CFLAGS="-O3 -xMIC-AVX512 -fp-model fast=2 -no-prec-div -qoverride-limits" clean install
  4. Download charm++* version 6.7.1
  5. Build multicore version of charm++:
    1. cd <path>/charm-6.7.1
    2. ./build charm++ multicore-linux64 iccstatic --with-production "-O3 -ip"
  6. Build BDW:
    1. Modify the Linux-x86_64-icc.arch to look like the following:
      NAMD_ARCH = Linux-x86_64
      CHARMARCH = multicore-linux64-iccstatic
      FLOATOPTS = -ip -xCORE-AVX2 -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
      CXX = icpc -std=c++11 -DNAMD_KNL
      CXXOPTS = -static-intel -O2 $(FLOATOPTS)
      CXXNOALIASOPTS = -O3 -fno-alias $(FLOATOPTS) -qopt-report-phase=loop,vec -qopt-report=4
      CXXCOLVAROPTS = -O2 -ip
      CC = icc
      COPTS = -static-intel -O2 $(FLOATOPTS)
    2. ./config Linux-x86_64-icc --charm-base <charm_path> --charm-arch multicore-linux64-iccstatic --with-fftw3 --fftw-prefix <fftw_path> --without-tcl --charm-opts -verbose
    3. gmake -j
  7. Build KNL:
    1. Modify the arch/Linux-KNL-icc.arch to look like the following:
      NAMD_ARCH = Linux-KNL
      CHARMARCH = multicore-linux64-iccstatic
      FLOATOPTS = -ip -xMIC-AVX512 -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
      CXX = icpc -std=c++11 -DNAMD_KNL
      CXXOPTS = -static-intel -O2 $(FLOATOPTS)
      CXXNOALIASOPTS = -O3 -fno-alias $(FLOATOPTS) -qopt-report-phase=loop,vec -qopt-report=4
      CXXCOLVAROPTS = -O2 -ip
      CC = icc
      COPTS = -static-intel -O2 $(FLOATOPTS)
    2. ./config Linux-KNL-icc --charm-base <charm_path> --charm-arch multicore-linux64-iccstatic --with-fftw3 --fftw-prefix <fftw_path> --without-tcl --charm-opts -verbose
    3. gmake -j
  8. Change the kernel setting for KNL: “nmi_watchdog=0 rcu_nocbs=2-271 nohz_full=2-271”
  9. Download apoa and stmv workloads from here: http://www.ks.uiuc.edu/Research/namd/utilities/
  10. Change the following lines in the *.namd file for both workloads:
            numsteps         1000
            outputtiming     20
            outputenergies   600

Run NAMD workloads on Intel® Xeon® Processor E5-2697 v4 and Intel® Xeon Phi™ Processor 7250

Run BDW (ppn = 72):

           $BIN +p $ppn apoa1/apoa1.namd +pemap 0-($ppn-1)

Run KNL (ppn = 136, MCDRAM in flat mode, similar performance in cache mode):

           numactl -m 1 $BIN +p $ppn apoa1/apoa1.namd +pemap 0-($ppn-1)
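For example, with the values given above, the substituted commands would look like the following ($BIN stands for the path to the namd2 binary built earlier; the exact paths are assumptions):

# BDW: 72 processing elements pinned to logical cores 0-71
$ $BIN +p 72 apoa1/apoa1.namd +pemap 0-71

# KNL: 136 processing elements allocated from MCDRAM (NUMA node 1), cores 0-135
$ numactl -m 1 $BIN +p 136 apoa1/apoa1.namd +pemap 0-135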

Performance results reported in Intel® Salesforce repository

(ns/day; higher is better):

Workload    Intel® Xeon® Processor E5-2697 v4 (ns/day)    Intel® Xeon Phi™ Processor 7250 (ns/day)    KNL vs. 2S BDW (speedup)
stmv        0.45                                          0.55                                        1.22x
apoa1       5.5                                           6.18                                        1.12x

Systems configuration:

Processor                      Intel® Xeon® Processor E5-2697 v4 (BDW)    Intel® Xeon Phi™ Processor 7250 (KNL)
Stepping                       1 (B0)                                     1 (B0) Bin1
Sockets / TDP                  2S / 290W                                  1S / 215W
Frequency / Cores / Threads    2.3 GHz / 36 / 72                          1.4 GHz / 68 / 272
DDR4                           8x16 GB 2400 MHz (128 GB)                  6x16 GB 2400 MHz
MCDRAM                         N/A                                        16 GB Flat
Cluster/Snoop Mode/Mem Mode    Home                                       Quadrant / flat
Turbo                          On                                         On
BIOS                           GRRFSDP1.86B0271.R00.1510301446            GVPRCRB1.86B.0010.R02.1608040407
Compiler                       ICC-2017.0.098                             ICC-2017.0.098
Operating System               Red Hat* Enterprise Linux* 7.2             Red Hat Enterprise Linux 7.2
                               (3.10.0-327.e17.x86_64)                    (3.10.0-327.22.2.el7.xppsl_1.4.1.3272._86_64)

Introducing DNN primitives in Intel® Math Kernel Library


    Deep Neural Networks (DNNs) are on the cutting edge of the Machine Learning domain. These algorithms received wide industry adoption in the late 1990s and were initially applied to tasks such as handwriting recognition on bank checks. Deep Neural Networks have been widely successful in this task, matching and even exceeding human capabilities. Today DNNs have been used for image recognition and video and natural language processing, as well as in solving complex visual understanding problems such as autonomous driving. DNNs are very demanding in terms of compute resources and the volume of data they must process. To put this into perspective, the modern image recognition topology AlexNet takes a few days to train on modern compute systems and uses slightly over 14 million images. Tackling this complexity requires well optimized building blocks to decrease the training time in order to meet the needs of the industrial application.

    Intel® Math Kernel Library (Intel® MKL) 2017 introduces the DNN domain, which includes functions necessary to accelerate the most popular image recognition topologies, including AlexNet, VGG, GoogleNet and ResNet.

    These DNN topologies rely on a number of standard building blocks, or primitives, that operate on data in the form of multidimensional sets called tensors. These primitives include convolution, normalization, activation and inner product functions along with functions necessary to manipulate tensors. Performing computations effectively on Intel architectures requires taking advantage of SIMD instructions via vectorization and of multiple compute cores via threading. Vectorization is extremely important as modern processors operate on vectors of data up to 512 bits long (16 single-precision numbers) and can perform up to two multiply and add (Fused Multiply Add, or FMA) operations per cycle. Taking advantage of vectorization requires data to be located consecutively in memory. As typical dimensions of a tensor are relatively small, changing the data layout introduces significant overhead; we strive to perform all the operations in a topology without changing the data layout from primitive to primitive.

Intel MKL provides primitives for most widely used operations implemented for vectorization-friendly data layout:

  • Direct batched convolution
  • Inner product
  • Pooling: maximum, minimum, average
  • Normalization: local response normalization across channels (LRN), batch normalization
  • Activation: rectified linear unit (ReLU)
  • Data manipulation: multi-dimensional transposition (conversion), split, concat, sum and scale.

Programming model

    Execution flow for the neural network topology includes two phases: setup and execution. During the setup phase the application creates descriptions of all DNN operations necessary to implement scoring, training, or other application-specific computations. To pass data from one DNN operation to the next one, some applications create intermediate conversions and allocate temporary arrays if the appropriate output and input data layouts do not match. This phase is performed once in a typical application and followed by multiple execution phases where actual computations happen.

    During the execution step the data is fed to the network in a plain layout like BCWH (batch, channel, width, height) and is converted to a SIMD-friendly layout. As data propagates between layers the data layout is preserved and conversions are made when it is necessary to perform operations that are not supported by the existing implementation.

 

    Intel MKL DNN primitives implement a plain C application programming interface (API) that can be used in the existing C/C++ DNN framework. An application that calls Intel MKL DNN functions should involve the following stages:

    Setup stage: for given a DNN topology, the application creates all DNN operations necessary to implement scoring, training, or other application-specific computations. To pass data from one DNN operation to the next one, some applications create intermediate conversions and allocate temporary arrays if the appropriate output and input data layouts do not match.

    Execution stage: at this stage, the application calls to the DNN primitives that apply the DNN operations, including necessary conversions, to the input, output, and temporary arrays.

    Appropriate examples for training and scoring computations can be found in the Intel MKL package directory: <mklroot>\examples\dnnc\source
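As a rough illustration of the two-stage model, the sketch below creates and executes a single forward ReLU primitive using the C API declared in mkl_dnn.h. It is a minimal sketch only; the tensor dimensions are arbitrary, and exact usage (error checking, layout conversions, multi-layer flows) should be taken from the examples shipped with the package.

#include "mkl_dnn.h"

int main(void) {
    /* plain tensor: sizes/strides are listed from the innermost dimension */
    size_t sizes[4]   = {13, 13, 64, 16};            /* W, H, C, N (arbitrary) */
    size_t strides[4] = {1, 13, 13*13, 13*13*64};

    dnnLayout_t layout = NULL;
    dnnPrimitive_t relu = NULL;
    void *resources[dnnResourceNumber] = {0};
    float *src = NULL, *dst = NULL;

    /* setup phase: describe the data layout and create the primitive */
    dnnLayoutCreate_F32(&layout, 4, sizes, strides);
    dnnReLUCreateForward_F32(&relu, NULL, layout, 0.0f);
    dnnAllocateBuffer_F32((void **)&src, layout);
    dnnAllocateBuffer_F32((void **)&dst, layout);

    /* ... fill src with input activations here ... */

    /* execution phase: bind buffers to resource slots and run the primitive */
    resources[dnnResourceSrc] = src;
    resources[dnnResourceDst] = dst;
    dnnExecute_F32(relu, resources);

    /* cleanup */
    dnnReleaseBuffer_F32(src);
    dnnReleaseBuffer_F32(dst);
    dnnDelete_F32(relu);
    dnnLayoutDelete_F32(layout);
    return 0;
}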

Performance

Caffe, a deep learning framework developed by the Berkeley Vision and Learning Center (BVLC), is one of the most popular community frameworks for image recognition. Together with AlexNet, a neural network topology for image recognition, and ImageNet, a database of labeled images, Caffe is often used as a benchmark. The chart below shows a performance comparison of the original Caffe implementation and the Intel-optimized version, which takes advantage of optimized matrix-matrix multiplication and the new Intel MKL 2017 DNN primitives, on the Intel® Xeon® processor E5-2699 v4 (codename Broadwell) and the Intel® Xeon Phi™ processor 7250 (codename Knights Landing).

Summary

DNN primitives available in Intel MKL 2017 can be used to accelerate Deep Learning workloads on Intel Architecture. Please refer to Intel MKL Developer Reference Manual and examples for detailed information.

 

 

Intel® HPC Developer Conference 2016 - Session Presentations


The 2016 Intel® HPC Developer Conference brought together developers from around the world to discuss code modernization in high-performance computing. For those who missed the event, or who want to catch presentations they didn't get to see, we have posted the top tech sessions of 2016 to the HPC Developer Conference webpage. The sessions are split out by track, including Artificial Intelligence/Machine Learning, Systems, Software Visualization, Parallel Programming, and others.

Artificial Intelligence/Machine Learning Track

Systems Track


High Productivity Languages Track

Software Visualization Track


Parallel Programming Track

 


Thread Parallelism in Cython*


Introduction

Cython* is a superset of Python* that additionally supports C functions and C-style type declarations for variables and class attributes. Cython is commonly used for wrapping external C libraries to speed up the execution of a Python program. Cython generates C extension modules, which are used by the main Python program via the import statement.

One interesting feature of Cython is that it supports native parallelism (see the cython.parallel module). The cython.parallel.prange function can be used for parallel loops; thus one can take advantage of Intel® Many Integrated Core Architecture (Intel® MIC Architecture) using the thread parallelism in Python.

Cython in Intel® Distribution for Python* 2017

Intel® Distribution for Python* 2017 is a binary distribution of the Python interpreter that accelerates core Python packages, including NumPy, SciPy, Jupyter, matplotlib, Cython, and so on. The package integrates Intel® Math Kernel Library (Intel® MKL), Intel® Data Analytics Acceleration Library (Intel® DAAL), pyDAAL, Intel® MPI Library and Intel® Threading Building Blocks (Intel® TBB). For more information on these packages, please refer to the Release Notes.

The Intel Distribution for Python 2017 can be downloaded here. It is available for free for Python 2.7.x and 3.5.x on OS X*, Windows* 7 and later, and Linux*. The package can be installed as a standalone or with the Intel® Parallel Studio XE 2017.

Intel Distribution for Python supports both Python 2 and Python 3. There are two separate packages available in the Intel Distribution for Python: Python 2.7 and Python 3.5. In this article, the Intel® Distribution for Python 2.7 on Linux (l_python27_pu_2017.0.035.tgz) is installed on a 1.4 GHz, 68-core Intel® Xeon Phi™ processor 7250 with four hardware threads per core (a total of 272 hardware threads). To install, extract the package content, run the install script, and then follow the installer prompts:

$ tar -xvzf l_python27_pu_2017.0.035.tgz
$ cd l_python27_pu_2017.0.035
$ ./install.sh

After the installation completes, activate the root environment (see the Release Notes):

$ source /opt/intel/intelpython27/bin/activate root

Thread Parallelism in Cython

In Python, the Global Interpreter Lock (GIL) is a mutex that prevents multiple native threads from executing bytecodes at the same time. Because of this, threads in pure Python code cannot run in parallel. This section explores thread parallelism in Cython. The resulting functionality is imported into the Python code as an extension module, allowing the Python code to utilize all the cores and threads of the hardware underneath.

To generate an extension module, one can write Cython code (file with extension .pyx). The .pyx file is then compiled by the Cython compiler to convert it into efficient C code (file with extension .c). The .c file is in turn compiled and linked by a C/C++ compiler to generate a shared library (.so file). The shared library can be imported in Python as a module.

In the following multithreads.pyx file, the function serial_loop computes log(a)*log(b) for each entry in the A and B arrays and stores the result in the C array. The log function is imported from the C math library. The NumPy module, the high-performance scientific computation and data analysis package, is used in order to vectorize operations on A and B arrays.

Similarly, the function parallel_loop performs the same computation using OpenMP* threads to execute the loop body. Instead of range, prange (parallel range) is used to allow multiple threads to execute in parallel. prange is a function of the cython.parallel module and can be used for parallel loops. When this function is called, OpenMP starts a thread pool and distributes the work among the threads. Note that the prange function can be used only when the Global Interpreter Lock (GIL) is released by putting the loop in a nogil context (the GIL prevents multiple threads from running Python code concurrently). With wraparound(False), Cython never checks for negative indices; with boundscheck(False), Cython doesn't do bounds checking on the arrays.

$ cat multithreads.pyx

cimport cython
import numpy as np
cimport openmp
from libc.math cimport log
from cython.parallel cimport prange
from cython.parallel cimport parallel

THOUSAND = 1024
FACTOR = 100
NUM_TOTAL_ELEMENTS = FACTOR * THOUSAND * THOUSAND
X1 = -1 + 2*np.random.rand(NUM_TOTAL_ELEMENTS)
X2 = -1 + 2*np.random.rand(NUM_TOTAL_ELEMENTS)
Y = np.zeros(X1.shape)

def test_serial():
    serial_loop(X1,X2,Y)

def serial_loop(double[:] A, double[:] B, double[:] C):
    cdef int N = A.shape[0]
    cdef int i

    for i in range(N):
        C[i] = log(A[i]) * log(B[i])

def test_parallel():
    parallel_loop(X1,X2,Y)

@cython.boundscheck(False)
@cython.wraparound(False)
def parallel_loop(double[:] A, double[:] B, double[:] C):
    cdef int N = A.shape[0]
    cdef int i

    with nogil:
        for i in prange(N, schedule='static'):
            C[i] = log(A[i]) * log(B[i])

After completing the Cython code, the Cython compiler converts it to a C extension file. This can be done with a distutils setup.py file (distutils is used to distribute Python modules). To use OpenMP support, one must tell the compiler to enable OpenMP by providing the flag -fopenmp as a compile argument and a link argument in the setup.py file, as shown below. The setup.py file invokes the setuptools build process that generates the extension modules. By default, this setup.py uses GNU GCC* to compile the C code of the Python extension. In addition, we add the -O0 compile flag (disable all optimizations) to create a baseline measurement.

$ cat setup.py
from distutils.core import setup
from Cython.Build import cythonize
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
  name = "multithreads",
  cmdclass = {"build_ext": build_ext},
  ext_modules =
  [
    Extension("multithreads",
              ["multithreads.pyx"],
              extra_compile_args = ["-O0", "-fopenmp"],
              extra_link_args=['-fopenmp']
              )
  ]
)

Use the command below to build C/C++ extensions:

$ python setup.py build_ext --inplace

Alternatively, you can also manually compile the Cython code:

$ cython multithreads.pyx

This generates the multithreads.c file, which contains the Python extension code. You can compile the extension code with the gcc compiler to generate the shared object multithreads.so file.

$ gcc -O0 -shared -pthread -fPIC -fwrapv -Wall -fno-strict-aliasing
-fopenmp multithreads.c -I/opt/intel/intelpython27/include/python2.7 -L/opt/intel/intelpython27/lib -lpython2.7 -o multithreads.so

After the shared object is generated, Python code can import this module to take advantage of thread parallelism. The following section shows how to improve its performance.

You can import the timeit module to measure the execution time of a Python function. Note that by default, timeit runs the measured function 1,000,000 times; set the number of executions to 100 in the following examples for a shorter run time. Basically, timeit.Timer() imports the multithreads module and measures the time spent by the function multithreads.test_serial(). The argument number=100 tells the Python interpreter to perform the run 100 times. Thus, t1.timeit(number=100) measures the time to execute the serial loop (only one thread performs the loop) 100 times.

Similarly, t12.timeit(number=100) measures the time when executing the parallel loop (multiple threads perform the computation in parallel) 100 times.

  • Measure the serial loop with the gcc compiler, compiler option -O0 (all optimizations disabled).
$ python
Python 2.7.12 |Intel Corporation| (default, Oct 20 2016, 03:10:12)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Intel(R) Distribution for Python is brought to you by Intel Corporation.
Please check out: https://software.intel.com/en-us/python-distribution

Import timeit and use timer t1 to measure the time spent in the serial loop. Note that you built with the gcc compiler and disabled all optimizations. The result is displayed in seconds.

>>> import timeit
>>> t1 = timeit.Timer("multithreads.test_serial()","import multithreads")
>>> t1.timeit(number=100)
2874.419779062271
  • Measure the parallel loop with the gcc compiler, compiler option -O0 (all optimizations disabled).

The parallel loop is measured by t2 (again, you built with gcc compiler and disabled all optimizations).

>>> t2 = timeit.Timer("multithreads.test_parallel()","import multithreads")
>>> t2.timeit(number=100)
26.016316175460815

As you can observe, the parallel loop improves the performance by roughly a factor of 110.

  • Measure the parallel loop with the icc compiler, compiler option -O0 (all optimizations disabled).

Next, recompile using the Intel® C Compiler and compare the performance. For the Intel® C/C++ Compiler, use the -qopenmp flag instead of -fopenmp to enable OpenMP. After installing the Intel Parallel Studio XE 2017, set the proper environment variables and delete all previous build artifacts:

$ source /opt/intel/parallel_studio_xe_2017.1.043/psxevars.sh intel64
Intel(R) Parallel Studio XE 2017 Update 1 for Linux*
Copyright (C) 2009-2016 Intel Corporation. All rights reserved.

$ rm multithreads.so multithreads.c -r build

To explicitly use the Intel icc to compile this application, execute the setup.py file with the following command:

$ LDSHARED="icc -shared" CC=icc python setup.py build_ext --inplace

The parallel loop is measured by t2 (this time, you built with Intel compiler, disabled all optimizations):

$ python
>>> import timeit
>>> t2 = timeit.Timer("multithreads.test_parallel()","import multithreads")
>>> t2.timeit(number=100)
23.89365792274475
  • Measure the parallel loop with the icc compiler, compiler option -O3.

For the third try, you may want to see whether using -O3 optimization and enabling the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) ISA on the Intel® Xeon Phi™ processor can improve the performance. To do this, in setup.py replace -O0 with -O3 and add -xMIC-AVX512. Repeat the compilation, and then run the parallel loop as indicated in the previous step, which results in 21.027512073516846 seconds. The following graph shows the results (in seconds) when compiling with gcc, with icc without optimization enabled, and with icc with optimization plus the Intel AVX-512 ISA:

The result shows that the best result (21.03 seconds) is obtained when you compile the parallel loop with the Intel compiler, and enable auto-vectorization (-O3) combined with Intel AVX-512 ISA (-xMIC-AVX512) for the Intel Xeon Phi processor.

By default, the Intel Xeon Phi processor uses all available resources: it has 68 cores, and each core runs four hardware threads, so a total of 272 threads (four threads per core) run in a parallel region. It is possible to modify the number of cores used and the number of threads run by each core. The last section shows how to use an environment variable to accomplish this.

  • To run 68 threads on 68 cores (one thread per core) executing the parallel loop 100 times, set the KMP_PLACE_THREADS environment variable as below:

$ export KMP_PLACE_THREADS=68c,1t

  • To run 136 threads on 68 cores (two threads per core) running the parallel loop 100 times, set the KMP_PLACE_THREADS environment variable as below:

$ export KMP_PLACE_THREADS=68c,2t

  • To run 204 threads on 68 cores (three threads per core) running the parallel loop 100 times, set the KMP_PLACE_THREADS environment variable as below:

$ export KMP_PLACE_THREADS=68c,3t

The following graph summarizes the result:

Conclusion

This article showed how to use Cython to build an extension module for Python in order to take advantage of multithreading support on the Intel Xeon Phi processor. It showed how to use the setup script to build a shared library, and how the parallel loop performance can be improved by trying different compiler options in the setup script. This article also showed how to set a different number of threads per core.

Exploring MPI for Python* on Intel® Xeon Phi™ Processor


Introduction

Message Passing Interface (MPI) is a standardized message-passing library interface designed for distributed memory programming. MPI is widely used in the High Performance Computing (HPC) domain because it is well-suited for distributed memory architectures.

Python* is a modern, powerful interpreted language that supports modules and packages. Python also supports extensions written in C/C++. While HPC applications are usually written in C or Fortran for speed, Python can be used to quickly prototype a proof of concept and for rapid application development because of its simplicity and modularity support.

The MPI for Python (mpi4py) package provides Python bindings for the MPI standard. The mpi4py package translates MPI syntax and semantics and uses Python objects to communicate. Thus, programmers can implement MPI applications in Python quickly. Note that mpi4py is object-oriented. Not all functions in the MPI standard are available in mpi4py; however, almost all the commonly used functions are. More information on mpi4py can be found here. In mpi4py, COMM_WORLD is an instance of the base class of communicators.

mpi4py supports two types of communication (a short example follows the list below):

  • Communication of generic Python objects: The methods of a communicator object are lower-case (send(), recv(), bcast(), scatter(), gather(), etc.). In this type of communication, the sent object is passed as a parameter to the communication call.
  • Communication of buffer-like objects: The methods of a communicator object are upper-case (Send(), Recv(), Bcast(), Scatter(), Gather(), etc.). Buffer arguments to these calls are specified using tuples. This type of communication is much faster than communication of generic Python objects.
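A minimal sketch of both styles (illustrative only; run it with something like mpirun -n 2 python example.py):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# 1) Generic Python objects: lower-case methods, the object is pickled.
if rank == 0:
    comm.send({'step': 1, 'note': 'hello'}, dest=1, tag=11)
elif rank == 1:
    obj = comm.recv(source=0, tag=11)
    print('rank 1 received', obj)

# 2) Buffer-like objects: upper-case methods, buffers passed as [data, datatype].
data = np.arange(8, dtype=np.float64) if rank == 0 else np.empty(8, dtype=np.float64)
comm.Bcast([data, MPI.DOUBLE], root=0)
print('rank', rank, 'has', data)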

Intel® Distribution for Python* 2017

Intel® Distribution for Python* is a binary distribution of the Python interpreter; it accelerates core Python packages including NumPy, SciPy, Jupyter, matplotlib, mpi4py, etc. The package integrates Intel® Math Kernel Library (Intel® MKL), Intel® Data Analytics Acceleration Library (Intel® DAAL), pyDAAL, Intel® MPI Library, and Intel® Threading Building Blocks (Intel® TBB).

The Intel Distribution for Python 2017 is available free for Python 2.7.x and 3.5.x on OS X*, Windows* 7 and later, and Linux*. The package can be installed as a standalone or with the Intel® Parallel Studio XE 2017.

In the Intel Distribution for Python, mpi4py is a Python wrapper around the native Intel MPI implementation (Intel MPI Library). This document shows how to write an MPI program in Python, and how to take advantage of Intel® multicore architecture using OpenMP threads and Intel® AVX-512 instructions.

Intel Distribution for Python supports both Python 2 and Python 3. There are two separate packages available in the Intel Distribution for Python: Python 2.7 and Python 3.5. In this example, the Intel Distribution for Python 2.7 on Linux (l_python27_pu_2017.0.035.tgz) is installed on an Intel® Xeon Phi™ processor 7250 @ 1.4 GHz and 68 cores with 4 hardware threads per core (a total of 272 hardware threads). To install, extract the package content, run the install script, and follow the installer prompts:

$ tar -xvzf l_python27_pu_2017.0.035.tgz
$ cd l_python27_pu_2017.0.035
$ ./install.sh

After the installation completes, activate the root Intel® Python Conda environment:

$ source /opt/intel/intelpython27/bin/activate root

Parallel Computing: OpenMP and SIMD

While multithreaded Python workloads can use Intel TBB optimized thread scheduling, another approach is to use OpenMP to take advantage of Intel multicore architecture. This section shows how to use OpenMP multithreading and the C math library in Cython*.

Cython is a Python-like language whose code is translated into C and compiled into native code. Cython is similar to Python, but it supports C function calls and C-style declarations of variables and class attributes. Cython is often used for wrapping external C libraries to speed up the execution of a Python program. Cython generates C extension modules, which are used by the main Python program via the import statement.

For example, to generate an extension module, one can write Cython code in a .pyx file. The .pyx file is compiled by Cython to generate a .c file, which contains the code of a Python extension module. The .c file is in turn compiled by a C compiler to generate a shared object library (.so file).

One way to build Cython code is to write a distutils setup.py file (distutils is used to distribute Python modules). In the following multithreads.pyx file, the function vector_log_multiplication computes log(a)*log(b) for each entry in the A and B arrays and stores the results in the C array. Note that a parallel loop (prange) is used to allow multiple threads to execute in parallel. The log function is imported from the C math library. The function getnumthreads() returns the number of threads:

$ cat multithreads.pyx

cimport cython
import numpy as np
cimport openmp
from libc.math cimport log
from cython.parallel cimport prange
from cython.parallel cimport parallel

@cython.boundscheck(False)
def vector_log_multiplication(double[:] A, double[:] B, double[:] C):
    cdef int N = A.shape[0]
    cdef int i

    with nogil, cython.boundscheck(False), cython.wraparound(False):
        for i in prange(N, schedule='static'):
            C[i] = log(A[i]) * log(B[i])

def getnumthreads():
    cdef int num_threads

    with nogil, parallel():
        num_threads = openmp.omp_get_num_threads()
        with gil:
            return num_threads

The setup.py file invokes the setuptools build process that generates the extension modules. By default, this setup.py uses GNU GCC* to compile the C code of the Python extension. In order to take advantage of AVX-512 and OpenMP multithreading on the Intel Xeon Phi processor, one can specify the options -xMIC-avx512 and -qopenmp in the compile and link flags, and use the Intel® compiler icc. For more information on how to create the setup.py file, refer to the Writing the Setup Script section of the Python documentation.

$ cat setup.py

from distutils.core import setup
from Cython.Build import cythonize
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
  name = "multithreads",
  cmdclass = {"build_ext": build_ext},
  ext_modules = [
    Extension("multithreads",
              ["multithreads.pyx"],
              libraries=["m"],
              extra_compile_args = ["-O3", "-xMIC-avx512", "-qopenmp" ],
              extra_link_args=['-qopenmp', '-xMIC-avx512']
              )
  ]

)

In this example, Intel Parallel Studio XE 2017 Update 1 is installed. First, set the proper environment variables for the Intel C compiler:

$ source /opt/intel/parallel_studio_xe_2017.1.043/psxevars.sh intel64
Intel(R) Parallel Studio XE 2017 Update 1 for Linux*
Copyright (C) 2009-2016 Intel Corporation. All rights reserved.

To explicitly use the Intel compiler icc to compile this application, execute the setup.py file with the following command:

$ LDSHARED="icc -shared" CC=icc python setup.py build_ext --inplace

running build_ext
cythoning multithreads.pyx to multithreads.c
building 'multithreads' extension
creating build
creating build/temp.linux-x86_64-2.7
icc -fno-strict-aliasing -Wformat -Wformat-security -D_FORTIFY_SOURCE=2 -fstack-protector -O3 -fpic -fPIC -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/intel/intelpython27/include/python2.7 -c multithreads.c -o build/temp.linux-x86_64-2.7/multithreads.o -O3 -xMIC-avx512 -march=native -qopenmp
icc -shared build/temp.linux-x86_64-2.7/multithreads.o -L/opt/intel/intelpython27/lib -lm -lpython2.7 -o /home/plse/test/v7/multithreads.so -qopenmp -xMIC-avx512

As mentioned above, this process first generates the extension code multithreads.c. The Intel compiler compiles this extension code to generate the dynamic shared object library multithreads.so.

How to write a Python Application with Hybrid MPI/OpenMP

In this section, we write an MPI application in Python. This program imports mpi4py and multithreads modules. The MPI application uses a communicator object, MPI.COMM_WORLD, to identify a set of processes which can communicate within the set. The MPI functions MPI.COMM_WORLD.Get_size(), MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.send(), and MPI.COMM_WORLD.recv() are methods of this communicator object. Note that in mpi4py there is no need to call MPI_Init() and MPI_Finalize() as in the MPI standard because these functions are called when the module is imported and when the Python process ends, respectively.

The sample Python application first initializes two large input arrays consisting of random numbers between 1 and 2. Each MPI rank uses OpenMP threads to do the computation in parallel; each OpenMP thread in turn computes the product of two natural logarithms c = log(a)*log(b) where a and b are random numbers between 1 and 2 (1 <= a,b <= 2). To do that, each MPI rank calls the vector_log_multiplication function defined in the multithreads.pyx file. Execution time of this function is short, about 1.5 seconds. For illustration purposes, we use the timeit utility to invoke the function ten times just to have enough time to demonstrate the number of OpenMP threads involved.

Below is the application source code mpi_sample.py:

from mpi4py import MPI
from multithreads import *
import numpy as np
import timeit

def time_vector_log_multiplication():
    vector_log_multiplication(A, B, C)

size = MPI.COMM_WORLD.Get_size()
rank = MPI.COMM_WORLD.Get_rank()
name = MPI.Get_processor_name()

THOUSAND = 1024
FACTOR = 512
NUM_TOTAL_ELEMENTS = FACTOR * THOUSAND * THOUSAND
NUM_ELEMENTS_RANK = NUM_TOTAL_ELEMENTS / size
repeat = 10
numthread = getnumthreads()

if rank == 0:
   print "Initialize arrays for %d million of elements" % FACTOR

A = 1 + np.random.rand(NUM_ELEMENTS_RANK)
B = 1 + np.random.rand(NUM_ELEMENTS_RANK)
C = np.zeros(A.shape)

if rank == 0:
    print "Start timing ..."
    print "Call vector_log_multiplication with iter = %d" % repeat
    t1 =  timeit.timeit("time_vector_log_multiplication()", setup="from __main__ import time_vector_log_multiplication",number=repeat)
    print "Rank %d of %d running on %s with %d threads in %d seconds" % (rank, size, name, numthread, t1)

    for i in xrange(1, size):
        rank, size, name, numthread, t1 = MPI.COMM_WORLD.recv(source=i, tag=1)
        print "Rank %d of %d running on %s with %d threads in %d seconds" % (rank, size, name, numthread, t1)
    print "End  timing ..."

else:
    t1 =  timeit.timeit("time_vector_log_multiplication()", setup="from __main__ import time_vector_log_multiplication",number=repeat)
    MPI.COMM_WORLD.send((rank, size, name, numthread, t1), dest=0, tag=1)

Run the following command line to launch the above Python application with two MPI ranks:

$ mpirun -host localhost -n 2 python mpi_sample.py

Initialize arrays for 512 million of elements
Start timing ...
Call vector_log_multiplication with iter = 10
Rank 0 of 2 running on knl-sb2.jf.intel.com with 136 threads in 14 seconds
Rank 1 of 2 running on knl-sb2.jf.intel.com with 136 threads in 15 seconds
End  timing ...

While the Python program is running, the top command in a new terminal displays two MPI ranks (shown as two Python processes). When the main module enters the loop (shown with the message “Start timing…”), the top command reports almost 136 threads running (~13600 %CPU). This is because by default, all 272 hardware threads on this system are utilized by two MPI ranks, thus each MPI rank has 272/2 = 136 threads.

To get detailed information about MPI at run time, we can set the I_MPI_DEBUG environment variable to a value ranging from 0 to 1000. The following command runs 4 MPI ranks and sets the I_MPI_DEBUG to the value 4. Each MPI rank has 272/4 = 68 OpenMP threads as indicated by the top command:

$ mpirun -n 4 -genv I_MPI_DEBUG 4 python mpi_sample.py

[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[2] MPI startup(): shm data transfer mode
[3] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank    Pid      Node name             Pin cpu
[0] MPI startup(): 0       84484    knl-sb2.jf.intel.com  {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152, 204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220}
[0] MPI startup(): 1       84485    knl-sb2.jf.intel.com  {17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237}
[0] MPI startup(): 2       84486    knl-sb2.jf.intel.com  {34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254}
[0] MPI startup(): 3       84487    knl-sb2.jf.intel.com  {51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271}
Initialize arrays for 512 million of elements
Start timing ...
Call vector_log_multiplication with iter = 10
Rank 0 of 4 running on knl-sb2.jf.intel.com with 68 threads in 16 seconds
Rank 1 of 4 running on knl-sb2.jf.intel.com with 68 threads in 15 seconds
Rank 2 of 4 running on knl-sb2.jf.intel.com with 68 threads in 15 seconds
Rank 3 of 4 running on knl-sb2.jf.intel.com with 68 threads in 15 seconds
End  timing ...

We can specify the number of OpenMP threads used by each rank in the parallel region by setting the OMP_NUM_THREADS environment variable. The following command starts 4 MPI ranks with 34 threads for each MPI rank (or 2 threads/core):

$  mpirun -host localhost -n 4 -genv OMP_NUM_THREADS 34 python mpi_sample.py

Initialize arrays for 512 million of elements
Start timing ...
Call vector_log_multiplication with iter = 10
Rank 0 of 4 running on knl-sb2.jf.intel.com with 34 threads in 18 seconds
Rank 1 of 4 running on knl-sb2.jf.intel.com with 34 threads in 17 seconds
Rank 2 of 4 running on knl-sb2.jf.intel.com with 34 threads in 17 seconds
Rank 3 of 4 running on knl-sb2.jf.intel.com with 34 threads in 17 seconds
End  timing ...

Finally, we can force the program to allocate memory in MCDRAM (the High-Bandwidth Memory on the Intel Xeon Phi processor). For example, before the execution of the program, the "numactl --hardware" command shows that the system has 2 NUMA nodes: node 0 consists of the CPUs and 96 GB of DDR4 memory, and node 1 is the on-package 16 GB MCDRAM:

$ numactl --hardware

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 98200 MB
node 0 free: 73585 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15925 MB
node distances:
node   0   1
  0:  10  31
  1:  31  10

Run the following command, which indicates allocating memory in MCDRAM if possible:

$ mpirun -n 4 numactl --preferred 1 python mpi_sample.py

While the program is running, we can observe that it allocates memory in MCDRAM (NUMA node 1):

$ numactl --hardware

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 98200 MB
node 0 free: 73590 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 3428 MB
node distances:
node   0   1
  0:  10  31
  1:  31  10

Readers can also try the above code on an Intel® Xeon® processor system with the appropriate settings. For example, on an Intel® Xeon® processor E5-2690 v4, use -xCORE-AVX2 instead of -xMIC-AVX512, and set the number of available threads to 28 instead of 272. Also note that the E5-2690 v4 does not have High-Bandwidth Memory.

Conclusion

This article introduced the MPI for Python package and demonstrated how to use it via the Intel Distribution for Python. Furthermore, it showed how to use OpenMP and Intel AVX-512 instructions in order to fully take advantage of the Intel Xeon Phi processor architecture. A simple example showed how one can write a parallel Cython function with OpenMP, compile it with the Intel compiler with the AVX-512 option enabled, and integrate it with an MPI Python program.

About the Author

Loc Q Nguyen received an MBA from University of Dallas, a master’s degree in Electrical Engineering from McGill University, and a bachelor's degree in Electrical Engineering from École Polytechnique de Montréal. He is currently a software engineer with Intel Corporation's Software and Services Group. His areas of interest include computer networking, parallel computing, and computer graphics.

Quick Analysis of Vectorization Using the Intel® Advisor 2017 Tool

In this article we continue our exploration of vectorization on an Intel® Xeon Phi™ processor using examples of loops that we used in a previous article. We will discuss how to use the command-line interface in Intel® Advisor 2017 for a quick, initial analysis of loop performance that gives an overview of the hotspots in the code. This initial analysis can be then followed by more in-depth analysis using the graphical user interface (GUI) in Intel Advisor 2017.

Introduction

Intel has developed several software products aimed at increasing productivity of software developers and helping them to make the best use of Intel® processors. One of these tools is Intel® Parallel Studio XE, which contains a set of compilers and analysis tools that let the user write, analyze and optimize their application on Intel hardware.

In this article, we explore Intel® Advisor 2017, which is one of the analysis tools in the Intel Parallel Studio XE suite that lets us analyze our application and gives us advice on how to improve vectorization in our code.

Why is vectorization important? And how does Intel® Advisor help?

Vector-level parallelism allows the software to use special hardware such as vector registers and SIMD (Single Instruction Multiple Data) instructions. Recent Intel® processors, such as the Intel® Xeon Phi™ processor, feature 512-bit wide vector registers which, in conjunction with the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) ISA, allow the use of two vector processing units in each core, each of them capable of processing 16 single-precision (32-bit) or 8 double-precision (64-bit) floating point numbers.

To further realize the full performance of modern processors, code must be also threaded to take advantage of multiple cores. The multiplicative effect of vectorization and threading will accelerate code more than the effect of only vectorization or threading.

Intel Advisor analyzes our application and reports not only the extent of vectorization but also possible ways to achieve more vectorization and increase the effectiveness of the current vectorization.

Although Intel Advisor works with any compiler, it is particularly effective when applications are compiled using Intel compilers, because Intel Advisor will use the information from the reports generated by Intel compilers.

How to use Intel® Advisor

The most effective way to use Intel Advisor is via the GUI. This interface gives us access to all the information and recommendations that Intel Advisor collects from our code. Detailed information can be found at https://software.intel.com/en-us/intel-advisor-xe-support, where documentation, training materials, and code samples are available. Product support and access to the Intel Advisor community forum can also be found at that link.

Intel Advisor also offers a command-line interface (CLI) that lets the user work on remote hosts and generate information in a way that makes it easy to automate analysis tasks, for example using scripts.

When working on an Intel Xeon Phi processor system, which runs the Linux* OS, we might need to use a combination of Advisor's GUI and CLI for our specific analysis workflow. In some cases the CLI is a good starting point for a quick view of a performance summary, as well as for the initial phases of our workflow analysis. Detailed information about the Intel Advisor CLI for Linux can be found at https://software.intel.com/en-us/node/634769.

In the next sections, a procedure for a quick initial performance analysis on Linux using the Intel Advisor CLI is described. This quick analysis will give us an idea of the performance bottlenecks in our application and where to focus initial optimization efforts. For testing purposes, this procedure also allows the user to automate testing and results reporting.

This analysis is intended as an initial step and will provide access to only limited information. The full extent of the information and help offered by Intel Advisor is available using a combination of the Intel Advisor GUI and CLI.

Using Intel Advisor on an Intel® Xeon Phi™ processor: Running a quick survey analysis

To illustrate this procedure, I will use the code sample from a previous article that shows vectorization improvements when using the Intel AVX-512 ISA. Details of the source code are discussed in that article. The sample code can be downloaded from here.

This example will be run on the following hardware:

Processor: Intel Xeon Phi processor, model 7250 (1.40 GHz)
Number of cores: 68
Number of threads: 272

The first step for a quick analysis is to create an optimized executable that will run on the Intel Xeon Phi processor. We start by compiling our application with a set of options that direct the compiler to create the executable in a way that lets Intel Advisor extract information from it. The options that must be used are -xMIC-AVX512, which enables the use of all the subsets of Intel Advanced Vector Extensions 512 that are supported by the Intel® Xeon Phi™ processor (Zhang, 2016), and -g to generate debugging information and symbols. The -O3 option is also used because the executable must be optimized; either -O2 or -O3 can be used for this purpose.

$ icpc Histogram_Example.cpp -g -O3 -restrict -xMIC-AVX512 -o run512 -lopencv_highgui -lopencv_core -lopencv_imgproc

Notice that we have also used the -restrict option, which informs the compiler that the pointers used in this application are not aliased. Also notice that we are linking the application with the OpenCV* library (www.opencv.org), which we use in this application to read an image from disk. A Makefile is included with the sample code download and can be used to generate an executable for Intel Advisor.

Next, we can run the CLI version of the Intel Advisor tool. The survey analysis is a good starting point for analysis, because it provides information that will let us identify how our code is using vectorization and where the hotspots for analysis are.

$ advixe-cl -collect survey -project-dir ./AdvProj-Example-AVX512 -search-dir all:=./src -- ./run512 image01.jpg

The above command runs the Intel Advisor tool and creates a project directory AdvProj-Example-AVX512. Inside this directory, Intel Advisor creates, among other things, a directory named e000 containing the results of the analysis. If we list the contents of the results directory, we see the following:

$ ls AdvProj-Example-AVX512/e000/
e000.advixeexp  hs000  loop_hashes.def
$

The directory hs000 contains results from the survey analysis just created.

The next step is to view the results of the survey analysis performed by the Intel Advisor tool. Here we will use the CLI to generate the report. To do this, we replace the -collect option with the -report one, making sure we refer to the same project directory where the data has been collected. We can use the following command to generate a survey report from the survey data that is contained in the results directory in our project directory:

$ advixe-cl -report survey -project-dir ./AdvProj-Example-AVX512 -format=text -report-output=./REPORTS/survey-AVX512.txt

The above command will create a report named survey-AVX512.txt in the subdirectory REPORTS. This report is in a column format and contains several columns, so it can be a little difficult to read on a console. One option for a quick read is to limit the number of columns displayed using the -filter option (only the survey report supports filtering in the current version of Intel Advisor).

Another option is to create an xml-formatted report. We can do this if we change the value for the -format option from text to xml:

$ advixe-cl -report survey -project-dir ./AdvProj-Example-AVX512 -format=xml -report-output=./REPORTS/survey-AVX512.xml

The xml-formatted report might be easier to read on a small screen, because the information in the columns in the report file is condensed into one column. Here is a fragment of it:

(…)
</function_call_site_or_loop><function_call_site_or_loop ID="4"
   Function_Call_Sites_and_Loops="[child]-[loop in main at Histogram_Example.cpp:107]"
                              Self_Time="0.060s"
                              Total_Time="0.120s"
                              Type="Vectorized (Body)"
                              Why_No_Vectorization=""
                              Vector_ISA="AVX512"
                              Compiler_Estimated_Gain="3.37x"
                              Trip_Counts_Average=""
                              Trip_Counts_Min=""
                              Trip_Counts_Max=""
                              Trip_Counts_Call_Count=""
                              Transformations=""
                              Source_Location="Histogram_Example.cpp:107"
                              Module="run512">
  (…)
  </function_call_site_or_loop><function_call_site_or_loop ID="8" name="[loop in main at Histogram_Example.cpp:87]"
                              Self_Time="0.030s"
                              Total_Time="0.030s"
                              Type="Vectorized (Body; [Remainder])"
                              Why_No_Vectorization="1 vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override "
                              Vector_ISA="AVX512"
                              Compiler_Estimated_Gain="20.53x"
                              Trip_Counts_Average=""
                              Trip_Counts_Min=""
                              Trip_Counts_Max=""
                              Trip_Counts_Call_Count=""
                              Transformations=""
                              Source_Location="Histogram_Example.cpp:87"
                              Module="run512"></function_call_site_or_loop><function_call_site_or_loop ID="1"
   Function_Call_Sites_and_Loops="[child]-[loop in main at Histogram_Example.cpp:87]"
                              Self_Time="0.030s"
                              Total_Time="0.030s"
                              Type="Vectorized (Body)"
                              Why_No_Vectorization=""
                              Vector_ISA="AVX512"
                              Compiler_Estimated_Gain="20.53x"
                              Trip_Counts_Average=""
                              Trip_Counts_Min=""
                              Trip_Counts_Max=""
                              Trip_Counts_Call_Count=""
                              Transformations=""
                              Source_Location="Histogram_Example.cpp:87"
                              Module="run512">

Recall that the survey option in the Intel Advisor tool generates a performance overview of the loops in the application. For example, the fragment shown above shows that the loop starting on line 107 in the source code has been vectorized using the Intel AVX-512 ISA. It also shows an estimate of the improvement of the loop's performance (compared to a scalar version) and timing information. The second and third blocks in the example above give a performance overview for the loop at line 87 in the source code. They show that the body of the loop has been vectorized, but the remainder of the loop has not.
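
As a side note, when a survey report flags a loop with "vectorization possible but seems inefficient," as in the fragment above, one way to override the compiler's heuristic for a specific loop is the vector always directive mentioned in the report text. The following minimal sketch is not taken from Histogram_Example.cpp; the function and array names are illustrative only, and the pragma applies to Intel compilers:

/* Hedged sketch: force vectorization of the loop that follows,
   overriding the compiler's efficiency heuristic. */
void scale_add(int n, float alpha, const float *x, float *y)
{
#pragma vector always
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}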

Also notice that the different loops have been assigned a loop ID, which is the way the Intel Advisor tool labels the loops in order to keep track of them in future analysis (for example, after looking at the performance overview shown above, we might want to generate more detailed information about a specific loop by including the loop ID in the command line).

The above is a quick way to run and visualize a vectorization analysis on the Intel Xeon Phi processor. This procedure lets us quickly visualize the basic vectorization information from our codes with minimum effort. It also lets us create quick summaries of progressive optimization steps in the form of tables or plots (if we have run several of these analyses at different stages of the optimization process). However, if we need access to more advanced information from our analysis, like traits or the assembly code, we can use the Intel Advisor GUI, possibly from a different computer (either by copying the project folder to another computer or by accessing it over the network), to access the complete information that Intel Advisor offers.

For example, the next figure shows what the Intel Advisor GUI looks like for the survey analysis shown above. We can see that, besides the information contained in the CLI report, the Intel Advisor GUI offers other information, like traits and source and assembly code.

Collecting more detailed information

Once we have looked at the performance summary reported by the Intel Advisor tool using the Survey option, we can use other options to add more specific information to the reports. One option is to run the Tripcounts analysis to get information about the number of times loops are executed.

To add this information to our project, we can use the Intel Advisor tool to run a tripcounts analysis on the same project we used for the survey analysis:

$ advixe-cl -collect tripcounts -project-dir ./AdvProj-Example-AVX512 -search-dir all:=./src -- ./run512 image01.jpg

And similarly to generate a tripcounts report:

$ advixe-cl -report tripcounts -project-dir ./AdvProj-Example-AVX512 -format=xml -report-output=./REPORTS/tripcounts-AVX512.xml

Now the xml-formatted report will contain information about the number of times the loops have been executed. Specifically, the Trip_Counts fields in the xml report will be populated, while the information from the survey report will be preserved. Next is a fragment of the enhanced report (only the first, most time-consuming loop is shown):

(…)
  </function_call_site_or_loop><function_call_site_or_loop ID="4"
   Function_Call_Sites_and_Loops="[child]-[loop in main at Histogram_Example.cpp:107]"
                              Self_Time="0.070s"
                              Total_Time="0.120s"
                              Type="Vectorized (Body)"
                              Why_No_Vectorization=""
                              Vector_ISA="AVX512"
                              Compiler_Estimated_Gain="3.37x"
                              Trip_Counts_Average="761670"
                              Trip_Counts_Min="761670"
                              Trip_Counts_Max="761670"
                              Trip_Counts_Call_Count="1"
                              Transformations=""
                              Source_Location="Histogram_Example.cpp:107"
                              Module="run512">

In a similar way, we can generate other types of reports that will give us other useful information about our loops. The -help collect and -help report options of the command-line Intel Advisor tool show which types of collections and reports are available:

$ advixe-cl -help collect
Intel(R) Advisor Command Line Tool
Copyright (C) 2009-2016 Intel Corporation. All rights reserved.

-c, -collect=<string>         Collect specified data. Specifying --search-dir
                              when collecting data is strongly recommended.

Usage: advixe-cl -collect=<string> [-action-option] [-global-option] [--]<target> [<target options>]<string> is one of the following analysis types to perform on <target>:

            survey        - Explore where to add efficient vectorization and/or threading.
            dependencies  - Identify and explore loop-carried dependencies for marked loops.
            map           - Identify and explore complex memory accesses for marked loops.
            suitability   - Analyze the annotated program to check its predicted parallel performance.
            tripcounts    - Find how many iterations are executed.
$ advixe-cl -help report
Intel(R) Advisor Command Line Tool
Copyright (C) 2009-2016 Intel Corporation. All rights reserved.

-R, -report=<string>          Report the results that were previously gathered.

Generates a formatted data report with the specified type and action options.

 Usage: advixe-cl -report=<string> [-action-option] [-global-option] [--]<target> [<target options>]<string> is the list of available reports:

            survey        - shows results of the survey analysis
            annotations   - lists the annotations in the sources
            dependencies  - shows possible dependencies
            hotspots      -
            issues        -
            map           - reports memory access patterns
            suitability   - shows possible performance gains
            summary       - shows the collection summary
            threads       - shows the list of threads
            top-down      - shows the report in a top-down view
            tripcounts    - shows survey report with tripcounts data added

For example, to obtain memory access pattern details in our source code, we can run a memory access patterns (MAP) analysis using the map option:

$ advixe-cl -collect map -project-dir ./AdvProj-Example-AVX512 -search-dir all:=./src -- ./run512 image01.jpg

$ advixe-cl -report map -project-dir ./AdvProj-Example-AVX512 -format=xml -report-output=./REPORTS/map-AVX512.xml

In all the above cases, the project directory (in this example, AdvProj-Example-AVX512) contains all the information necessary to perform a full analysis using the GUI. When we are ready to use the GUI, we can copy the project directory to a workstation/laptop (or access it over the filesystem) and run the GUI-based Intel Advisor from there, as was shown in a previous section in this article.

Summary

This article showed a simple way to quickly explore vectorization performance using Intel Advisor 2017. This was achieved by using the CLI of Intel Advisor to perform a quick, preliminary analysis and report on the Intel Xeon Phi processor from a text window, with the idea of later obtaining more information about our codes by using the Intel Advisor GUI.

This procedure will also be useful for consolidating performance information after several iterations of source code optimization. A Unix* script (or similar) can be used to collect information from different reports and quickly consolidate it into tables or plots.

References

Zhang, B. (2016). "Guide to Automatic Vectorization With Intel AVX-512 Instructions in Knights Landing Processors."

Intel® Xeon Phi™ Processor 7200 Family Memory Management Optimizations

This paper examines software performance optimization for an implementation of a non-library version of DGEMM executing on the Intel® Xeon Phi™ processor (code-named Knights Landing, with acronym KNL) running the Linux* Operating System (OS). The performance optimizations will incorporate the use of C/C++ High Bandwidth Memory (HBM) application programming interfaces (APIs) for doing dynamic storage allocation from Multi-Channel DRAM (MCDRAM), _mm_malloc dynamic storage allocation calls into Double Data Rate (DDR) memory, high-level abstract vector register management, and data prefetching. The dynamic storage allocations will be used to manage tiled data structure objects that will accommodate the memory hierarchy of the Intel Xeon Phi processor architecture. The focus in terms of optimizing application performance execution based on storage allocation is to:

  • Align the starting addresses of data objects during storage allocation so that vector operations on the Intel Xeon Phi processor will not require additional special vector alignment when using a vector processing unit associated with each hardware thread.
  • Select data tile sizes and do abstract vector register management that will allow for cache reuse and data locality.
  • Place select data structures into MCDRAM, through the HBM software package.
  • Use data prefetching to improve timely referencing of the tiled data structures into the Intel Xeon Phi processor cache hierarchy.

These methodologies are intended to provide you with insight when applying code modernization to legacy software applications and when developing new software for the Intel Xeon Phi processor architecture.

Contents

  1. Introduction
           ∘     What strategies are used to improve application performance?
           ∘     How is this article organized?
  2. The Intel® Xeon Phi™ Processor Architecture
  3. Why Does the Intel Xeon Phi Processor Need HBM?
           ∘     How does a software application distinguish between data assigned to DDR versus MCDRAM in flat mode?
           ∘     How does a software application interface to MCDRAM?
  4. Prefetch Tuning
  5. Matrix Multiply Background and Programming Example
  6. Performance Results for a C/C++ Implementation of DGEMM Based on Intel® Math Kernel Library/DGEMM
  7. Conclusions
  8. References

Introduction

The information in this article might help you achieve better execution performance if you are optimizing software applications for the Intel® Xeon Phi™ processor architecture (code-named Knights Landing 1) that is running the Linux* OS. The scenario is that optimization opportunities are exposed from using profiling analysis software tools such as Intel® VTune™ Amplifier XE 2, and/or Intel® Trace Analyzer and Collector 3, and/or MPI Performance Snapshot 4 where these software tools reveal possible memory management bottlenecks.

What strategies are used to improve application performance?

This article examines memory management, which involves tiling of data structures using the following strategies:

  • Aligned data storage allocation. This paper examines the use of the _mm_malloc intrinsic for dynamic storage allocation of data objects that reside in Double Data Rate (DDR) memory.
  • Use of Multi-Channel Dynamic Random-Access Memory (MCDRAM). This article discusses the use of a 16-gigabyte MCDRAM, which is High-Bandwidth Memory (HBM) 1. MCDRAM on the Intel Xeon Phi processor comprises eight devices (2 gigabytes each). This HBM is integrated on the Intel® Xeon Phi™ processor package and is connected to the Knights Landing die via a proprietary on-package I/O. All eight MCDRAM devices collectively provide an aggregate STREAM triad benchmark bandwidth of more than 450 gigabytes per second 1.
  • Vector register management. An attempt will be made to manage the vector registers on the Intel Xeon Phi processor by using explicit compiler semantics including C/C++ Extensions for Array Notation (CEAN) 5.
  • Compiler prefetching controls. Compiler prefetching control will be applied to application data objects to manage data look-ahead into the Intel Xeon Phi processor’s L2 and L1 cache hierarchy.

Developers of applications for the Intel Xeon Phi processor architecture may find these methodologies useful for optimizing programming applications that exploit, at the core level, hybrid parallel programming consisting of a combination of threading and vectorization technologies.

How is this article organized?

Section 2 provides insight into the Intel Xeon Phi processor architecture and what software developers may want to think about when doing code modernization for existing applications or when developing new software applications. Section 3 examines storage allocations for HBM (MCDRAM); in this article and for the experiments, data objects that are not allocated in MCDRAM will reside in DDR. Section 4 examines prefetch tuning capabilities. Section 5 provides background material for the matrix multiply algorithm and applies the outlined memory management techniques to a double-precision floating-point matrix multiply algorithm (DGEMM), working through restructuring transformations to improve execution performance. Section 6 describes performance results, and Section 7 presents conclusions.

The Intel® Xeon Phi™ Processor Architecture

A Knights Landing processor socket has at most 36 active tiles, where a tile is defined as consisting of two cores (Figure 1) 1. This means that a Knights Landing socket can have at most 72 cores. The tiles communicate with each other via a 2D mesh on-die interconnect architecture that is based on a ring architecture (Figure 1) 1. The communication mesh consists of four parallel networks, each of which delivers different types of packet information (for example, commands, data, and responses) and is highly optimized for the Knights Landing traffic flows and protocols. The mesh can deliver greater than 700 gigabytes per second of total aggregate bandwidth.

Figure 1. Intel® Xeon Phi™ processor block diagram showing tiles. (DDR MC = DDR memory controller, DMI = Direct Media Interface, EDC = MCDRAM controllers, MCDRAM = Multi-Channel DRAM) 1.

Each core has two Vector Processing Units (VPUs), and the two cores within a tile share 1 megabyte of level-2 (L2) cache (Figure 2) 1. Each core within a tile has 32 kilobytes of L1 instruction cache and 32 kilobytes of L1 data cache. The cache lines are 512 bits wide, implying that a cache line can contain 64 bytes of data.

Figure 2. Intel® Xeon Phi™ processor illustration of a tile from Figure 1 that contains two cores (CHA = Caching/Home Agent, VPU = Vector Processing Unit) 1.

In terms of single precision and double-precision floating-point data, the 64-byte cache lines can hold 16 single-precision floating-point objects or 8 double-precision floating-point objects.

Looking at the details of a core in Figure 2, there are four hardware threads (hardware contexts) per core 1, where each hardware thread acts as a logical processor 6. A hardware thread has 32 512-bit-wide vector registers (Figure 3) to provide Single Instruction Multiple Data (SIMD) support 6. To manage the 512-bit wide SIMD registers (ZMM0-ZMM31), the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set is used 7. For completeness in regard to Figure 3, the lower 256-bits of the ZMM registers are aliased to the respective 256-bit YMM registers, and the lower 128-bits are aliased to the respective 128-bit XMM registers.

Figure 3. 512-bit-wide vectors and SIMD register set 6.

The rest of this article focuses on the instructions that support the 512-bit wide SIMD registers (ZMM0-ZMM31). Regarding the Intel AVX-512 instruction set extensions, a 512-bit VPU also supports Fused Multiply-Add (FMA) instructions 6, where each of the three registers acts as a source and one of them also functions as a destination to store the result. The FMA instructions in conjunction with the 512-bit wide SIMD registers can do 32 single-precision floating-point computations or 16 double-precision floating-point operations per clock cycle for computational semantics such as:

Cij = Cij + Aip × Bpj

where subscripts “i”, “j”, and “p” serve as respective row and column indices for matrices A, B, and C.
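
As an illustration of these semantics (this sketch is not part of the original text), the function below uses AVX-512 intrinsics to update 8 double-precision elements of C with a single fused multiply-add; the function and array names are assumptions, and the pointers are assumed to be 64-byte aligned:

#include <immintrin.h>

/* Hedged sketch: c[0..7] = a[0..7] * b[0..7] + c[0..7] with one FMA,
   operating on a full 512-bit (8 x double) vector register. */
static inline void fma8(const double *a, const double *b, double *c)
{
    __m512d va = _mm512_load_pd(a);
    __m512d vb = _mm512_load_pd(b);
    __m512d vc = _mm512_load_pd(c);
    vc = _mm512_fmadd_pd(va, vb, vc);   /* vc = va * vb + vc */
    _mm512_store_pd(c, vc);
}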

Why Does the Intel Xeon Phi Processor Need HBM?

Conventional Dynamic Random-Access Memory (DRAM) and Dual-Inline Memory Modules (DIMMs) cannot meet the data-bandwidth consumption capabilities of the Intel Xeon Phi processor 8. To address this “processor to memory bandwidth” issue, two memory technologies can be used that place the physical memory closer to the Knights Landing processor socket, namely 8:

  • MCDRAM: This is a proprietary HBM that physically sits atop the family of Intel Xeon Phi processors.
  • HBM: This memory architecture is compatible with the Joint Electron Device Engineering Council (JEDEC) standards 9, and is a high-bandwidth memory designed for a generation of Intel Xeon Phi processors, code named Knights Hill.

From a performance point of view, there is no conceptual difference between MCDRAM and HBM.

For the Intel Xeon Phi processor, MCDRAM as shown in Figure 4 has three memory modes 1: cache mode, flat mode, and hybrid mode. When doing code modernization for existing applications or performing new application development on Intel Xeon Phi processor architecture, you may want to experiment with the three configurations to find the one that provides the best performance optimization for your applications. Below are some details about the three modes that may help you make informed decisions about which configuration may provide the best execution performance for software applications. 

Figure 4. The three MCDRAM memory modes—cache, flat, and hybrid—in the Intel® Xeon Phi™ processor. These modes are selectable through the BIOS at boot time 1.

The cache mode does not require any software change and works well for many applications 1. For those applications that do not show a good hit rate in MCDRAM, the other two memory modes provide more user control to better utilize MCDRAM.

In flat mode, both the MCDRAM memory and the DDR memory act as regular memory and are mapped into the same system address space 1. The flat mode configuration is ideal for applications that can separate their data into a larger, lower-bandwidth region and a smaller, higher bandwidth region. Accesses to MCDRAM in flat mode see guaranteed high bandwidth compared to cache mode, where it depends on the hit rates. Unless the data structures for the application workload can fit entirely within MCDRAM, the flat mode configuration requires software support to enable the application to take advantage of this mode.

For the hybrid mode, the MCDRAM is partitioned such that either a half or a quarter of the MCDRAM is used as cache, and the rest is used as flat memory 1. The cache portion will serve all of the DDR memory. This is ideal for a mixture of software applications that have data structures that benefit from general caching, but also can take advantage by storing critical or frequently accessed data in the flat memory partition. As with the flat mode, software enabling is required to access the flat mode section of the MCDRAM when software does not entirely fit into it. Again as mentioned above, the cache mode section does not require any software support 1.

How does a software application distinguish between data assigned to DDR versus MCDRAM in flat mode?

When MCDRAM is configured in flat mode, the application software is required to explicitly allocate memory into MCDRAM 1. In a flat mode configuration, the MCDRAM is accessed as memory by relying on mechanisms that are already supported in the existing Linux* OS software stack. This minimizes any major enabling effort and ensures that the applications written for flat MCDRAM mode remain portable to systems that do not have a flat MCDRAM configuration. This software architecture is based on the Non-Uniform Memory Access (NUMA) memory support model 10 that exists in current operating systems and is widely used to optimize software for current multi-socket systems. The same mechanism is used to expose the two types of memory on Knights Landing as two separate NUMA nodes (DDR and MCDRAM). This provides software with a way to address the two types of memory using NUMA mechanisms. By default, the BIOS sets the Knights Landing cores to have a higher affinity to DDR than MCDRAM. This affinity helps direct all default and noncritical memory allocations to DDR and thus keeps them out of MCDRAM.

On a Knights Landing system one can type the NUMA command:

numactl -H

or

numactl --hardware

and you will see attributes about the DDR memory (node 0) and MCDRAM (node 1). The attributes might look something like the following:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
node 0 size: 32664 MB
node 0 free: 30414 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15958 MB
node distances:
node 0 1
  0: 10 31
  1: 31 10

There is an environment variable called MEMKIND_HBW_NODES which controls the binding of high bandwidth memory allocations to one of the two NUMA nodes listed above. For example, if this environment variable is set to 0, it will bind high bandwidth memory allocations to NUMA node 0. Alternatively, setting this environment variable to 1 will bind high bandwidth allocations to NUMA node 1.

How does a software application interface to MCDRAM?

To allocate critical memory in MCDRAM in flat mode, a high-bandwidth (HBW) malloc library is available that can be downloaded at reference 11 or by clicking here. This memkind library 11 has functions that can align data objects on, say, 64-byte boundaries. Such alignments can lead to efficient use of cache lines, the L2 and L1 caches, and the SIMD vector registers. Once the memkind library is installed, the LD_LIBRARY_PATH environment variable will need to be updated to include the directory path to the memkind library.

One other topic should be noted regarding huge pages. On Knights Landing, huge pages are managed by the kernel a bit differently when you try to perform memory allocation using them (memory pages of size 2 MB instead of the standard 4 KB) 12. In those cases, huge pages need to be enabled prior to use. The content of the file called:

/proc/sys/vm/nr_hugepages

contains the current number of preallocated huge pages of the default size. If for example, you issue the Linux command on the Knights Landing system:

cat /proc/sys/vm/nr_hugepages

and the file contains a 0, the system administrator can issue the Linux OS command:

echo 20 > /proc/sys/vm/nr_hugepages

to dynamically allocate and deallocate default sized persistent huge pages, and thus adjust the number of default sized huge pages in the huge page pool to 20. Therefore, the system will allocate or free huge pages, as required. Note that one does not need to explicitly set the number of huge pages by echoing to the file /proc/sys/vm/nr_hugepages as long as the content of /sys/kernel/mm/transparent_hugepage/enabled is set to “always”.

A detailed review for setting the environment variables MEMKIND_HBW_NODES and LD_LIBRARY_PATH in regard to the memkind library and adjusting the content of the file nr_hugepages is discussed in Performance Results for a C/C++ Implementation of DGEMM Based on Intel® Math Kernel Library/DGEMM.

Prefetch Tuning

Compiler prefetching is disabled by default for the Intel Xeon Phi processor 13. To enable compiler prefetching for Knights Landing architecture use the compiler options:

-O3 -xmic-avx512 -qopt-prefetch=<n>

where the values of meta-symbol <n> are explained in Table 1

Table 1. Intel® C/C++ compiler switch settings for -qopt-prefetch 

How does the -qopt-prefetch=<n> compiler switch work for the Intel® Xeon Phi™ processor architecture?

Value of meta-symbol “<n>”    Prefetch Semantic Actions

0    This is the default when the -qopt-prefetch option is omitted; no auto-prefetching is done by the compiler.
2    This is the default if you use -qopt-prefetch with no explicit “<n>” argument. Prefetches are inserted for direct references where the compiler thinks the hardware prefetcher may not be able to handle them.
3    Prefetching is turned on for all direct memory references without regard to the hardware prefetcher.
4    Same as n=3 (currently).
5    Additional prefetching for all indirect references (Intel® Advanced Vector Extensions 512 (Intel® AVX-512) and above):
  • Indirect prefetches (hint 1) are done using the AVX-512-PF gatherpf instructions on Knights Landing (not all cases, but a subset).
  • Extra prefetches are issued for strided vector accesses (hint 0) to cover all cache lines.

The prefetch distance is the number of iterations of look-ahead when a prefetch is issued. Prefetching is done after the vectorization phase, and therefore the distance is in terms of vectorized iterations if an entire serial loop or part of a serial loop is vectorized. The Intel Xeon Phi processor also has a hardware L2 prefetcher that is enabled by default. In general, if the software prefetching algorithm is performing well for an executing application, the hardware prefetcher will not join in with the software prefetcher.

For this article the Intel C/C++ Compiler option:

-qopt-prefetch-distance=n1[,n2]

is explored. The arguments n1 and n2 have the following semantic actions in regard to the -qopt-prefetch-distance compiler switch:

  • The distance n1 (number of future loop iterations) for first-level prefetching into the Intel Xeon Phi processor L2 cache.
  • The distance n2 for second-level prefetching from the L2 cache into the L1 cache, where n2 ≤ n1. The exception is that n1 can be 0 for values of n2 (no first-level prefetches will be issued by the compiler).

Some useful values to try for n1 are 0, 4, 8, 16, 32, and 64 14. Similarly, useful values to try for n2 are 0, 1, 2, 4, and 8. These L2 prefetching values signified by n1 can be permuted with prefetching values n2 that control data movement from the L2 cache into the L1 cache. This permutation process can reveal the best combination of n1 and n2 values. For example, a setting might be:

-qopt-prefetch-distance=0,1

where the value 0 tells the compiler to disable compiler prefetching into the L2 cache, and the n2 value of 1 indicates that 1 iteration of compiler prefetching should be done from the L2 cache into the L1 cache.
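
For finer-grained control than the command-line switch, the Intel compiler also accepts a prefetch pragma on individual loops. The sketch below is an assumption-based illustration only (the loop body, the variable name, the hint value of 1 for L2 prefetching, and the distance of 16 iterations are not taken from this article):

/* Hedged sketch: request that a[] be prefetched into the L2 cache
   16 iterations ahead of the loop that follows. */
void scale(int n, const double *a, double *b)
{
#pragma prefetch a:1:16
    for (int i = 0; i < n; i++)
        b[i] = 2.0 * a[i];
}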

The optimization report output from the compiler (enabled using -qopt-report=<m>) will provide details on the number of prefetch instructions inserted by the compiler for each loop.

In summary, Section 2 discussed the Intel Xeon Phi processor many-core architecture, including the on-die interconnection network for the cores, hardware threads, VPUs, the L1 and L2 caches, 512-bit wide vector registers, and 512-bit wide cache lines. Section 3 examined MCDRAM and the memkind library, which helps establish efficient data alignment of data structures (memory objects). The present section discussed prefetching of these memory objects into the cache hierarchy. In the next section, these techniques will be applied to optimize an algorithm 15 such as a double-precision version of matrix multiply. The transformation techniques will use a high-level programming language in an attempt to maintain portability from one processor generation to the next 16.

Matrix Multiply Background and Programming Example

Matrix multiply has the core computational assignment:

Cij = Cij + Aip × Bpj

A basic matrix multiply loop structure implemented in a high-level programming language might look something like the following pseudo-code:

integer i, j, p;
for p = 1:K
    for j = 1:N
        for i = 1:M
            Cij = Cij + Aip × Bpj
        endfor
    endfor
endfor

where matrix A has dimensions M × K, matrix B has dimensions K × N, and matrix C has dimensions M × N. For the memory offset computation for matrices A, B, and C we will assume column-major-order data organization.
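
As a concrete illustration (not part of the original pseudo-code), the same triple loop can be written in C with explicit column-major offsets, where element (i,j) of a matrix with leading dimension ld is stored at index i + j * ld; the function name and argument layout are assumptions:

/* Hedged sketch of the naive column-major matrix multiply. */
void dgemm_naive(int M, int N, int K,
                 const double *A,   /* M x K, column-major, lda = M */
                 const double *B,   /* K x N, column-major, ldb = K */
                 double *C)         /* M x N, column-major, ldc = M */
{
    for (int p = 0; p < K; p++)
        for (int j = 0; j < N; j++)
            for (int i = 0; i < M; i++)
                C[i + j * M] += A[i + p * M] * B[p + j * K];
}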

For various processor architectures, software vendor libraries are available for performing matrix multiply in a highly efficient manner. For example, the matrix multiplication above can be computed using DGEMM, which calculates the product for a matrix C using double-precision matrix elements 17. Note that a DGEMM core solver for implementing the above algorithm may be implemented in assembly language (e.g., DGEMM for the Intel® Math Kernel Library 17), and an assembly language solution may not necessarily be portable from one processor architecture to the next.

In regard to this article, the focus is to do code restructuring transformations to achieve code modernization performance improvements using a high-level programming language. The reason for using matrix multiply as an example in applying the high-level memory-allocation-optimization techniques is that the basic algorithm is roughly four lines long and is easily understandable. Additionally, it is hoped that after you see a before and after of the applied restructuring transformations using a high-level programming language, you will think about associating restructuring transformations of a similar nature to the applications that you have written in a high-level programming language which are targeted for code modernization techniques.

Goto et al. 15 have looked at restructuring transformations for the basic matrix multiply loop structure shown above in order to optimize it for various processor architectures. This has required organizing the A, B, and C matrices from the pseudo-code example above into sub-matrix tiles. Figure 5 shows a tile rendering abstraction, but note that the access patterns required in Figure 5 are different from those described in reference 15. For the Ã and B̃ tiles in Figure 5, data packing is done to promote efficient matrix-element memory referencing.


Figure 5. Partitioning of DGEMM for the Intel® Xeon Phi™ processor, where the Ã buffer is shared by all cores, and the B̃ buffer and sections of matrix C are not shared by all cores. The data partitioning is based on an Intel® Xeon Phi™ processor/DGEMM implementation from the Intel® Math Kernel Library 17.

Regarding the matrix partitioning in Figure 5 for the Intel Xeon Phi processor, Ã is shared by all the cores, and matrices B̃ and C are not shared by all the cores. This is for a multi-threaded DGEMM solution. Parallelism on the Intel Xeon Phi processor takes the form of threading at the core level (Figure 1), and, as shown in Figure 2, the VPUs can exploit vectorization with 512-bit vector registers and SIMD semantics.

For the sub-matrices in Figure 5 that are shared by all of the Intel Xeon Phi processor cores (for example, sub-matrix Ã), and for the sub-matrices that are not shared (for example, sub-matrix B̃ and the partitions of matrix C), the next question is: which memory configuration should be used for each (DDR or MCDRAM)?


Figure 6. DGEMM kernel solver for the Intel® Xeon Phi™ processor with partitions for Ã, B̃, and C 17.

Recall that there are 16 gigabytes of multi-channel DRAM; since sub-matrix Ã is shared by all the cores, it will be placed into MCDRAM using the flat mode configuration.

In Section 3, we examined HBM, where MCDRAM has three configurations: cache mode, flat mode, and hybrid mode (Figure 4). It was mentioned that the flat mode configuration is ideal for applications that can separate their data into a larger, lower-bandwidth region and a smaller, higher-bandwidth region. Following this rule for flat mode, we will place Ã (Figure 6) into MCDRAM using the following memkind library prototype:

int hbw_posix_memalign(void **memptr, size_t alignment, size_t size);

where the alignment argument "size_t alignment" might have a value of 64, which is a power of 2 and allows the starting address of Ã to be aligned on the beginning of a 64-byte cache line.
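As a minimal usage sketch (the tile name and dimensions below are illustrative assumptions for the Ã buffer, not code from the article), an MCDRAM allocation through the memkind library might look like the following:

#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>   /* memkind high-bandwidth memory API */

int main(void)
{
    const size_t M = 43008, Kc = 336;   /* assumed A-tile dimensions */
    double *A_tile = NULL;

    /* Allocate the shared A-tile from MCDRAM, aligned on a 64-byte cache line */
    int rc = hbw_posix_memalign((void **)&A_tile, 64, M * Kc * sizeof(double));
    if (rc != 0) {
        fprintf(stderr, "MCDRAM allocation failed (rc=%d)\n", rc);
        return EXIT_FAILURE;
    }

    /* ... pack and use the tile ... */

    hbw_free(A_tile);
    return EXIT_SUCCESS;
}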

In Figure 6, note that matrix C consists of a core partition that has 8 rows and 28 columns. From an abstraction point of view, the 8 double-precision elements of a column (64 bytes total) fit into one 512-bit (64-byte) vector register. Also, recall that there are 32 512-bit vector registers per hardware thread. To manage register pressure on a hardware thread, 28 of these vector registers will hold the C partition for the core solver on the right side of Figure 6, leaving the remaining registers available for Ã and B̃ operands.

Similarly, for the other tiling objects in Figure 6, the _mm_malloc intrinsic will be used to allocate storage in DDR memory on Knights Landing. The _mm_malloc function prototype looks as follows:

void *_mm_malloc (size_t size, size_t align);

The _mm_malloc prototype also has a "size_t align" argument, which again is an alignment constraint. Using a value of 64 allows dynamically allocated data objects to have their starting address aligned on the beginning of a cache line.
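A corresponding DDR allocation sketch (again, the tile name and sizes are illustrative assumptions) might be:

#include <stdio.h>
#include <stdlib.h>
#include <xmmintrin.h>   /* _mm_malloc / _mm_free */

int main(void)
{
    const size_t Kc = 336, Nc = 112;    /* assumed B-tile dimensions */

    /* Allocate the per-core B-tile from DDR, aligned on a 64-byte cache line */
    double *B_tile = (double *)_mm_malloc(Kc * Nc * sizeof(double), 64);
    if (B_tile == NULL) {
        fprintf(stderr, "DDR allocation failed\n");
        return EXIT_FAILURE;
    }

    /* ... pack and use the tile ... */

    _mm_free(B_tile);
    return EXIT_SUCCESS;
}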

For Figure 6, matrix B̃ will be migrated into the L2 cache.

To summarize, we have discussed the partitioning of the A, B, and C matrices into sub-matrix data tiles, and we have utilized 28 of the 32 512-bit vector registers. We looked at data storage prototypes for placing data into MCDRAM or DDR. Next, we want to explore how the data elements within the sub-matrices will be organized (packed) to provide efficient access and reuse.


Figure 7. Data element packing for 8 rows by 336 columns of matrix segment Ã using column-major-order memory offsets 17.

Recall from Figure 5 and Figure 6 that matrix segment Ã has a large number of rows and 336 columns. The data is packed into strips of 8 row elements for each of the 336 columns (Figure 7) using column-major-order memory offsets. The number of strips for matrix segment Ã is equal to:

Large-number-of-rows / 8

Note that 8 double-precision row elements for each of the 336 columns can provide efficient use of the 512-bit wide cache lines for the L2 and L1 caches.
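A minimal packing sketch for one such 8-row strip (the function and variable names are assumptions, and error handling is omitted) could look like this:

#include <stddef.h>

/* Pack one 8-row strip of a column-major matrix A (lda rows) into a
 * contiguous 8 x Kc tile, also column-major: the 8 doubles of each
 * column land in a single 64-byte (512-bit) cache line. */
void pack_A_strip(const double *A, size_t lda, size_t row0, size_t Kc,
                  double *A_tile)
{
    for (size_t p = 0; p < Kc; p++)        /* 336 columns per strip */
        for (size_t i = 0; i < 8; i++)     /* 8 rows per strip      */
            A_tile[p * 8 + i] = A[p * lda + row0 + i];
}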

For matrix segment B̃ in Figure 5 and Figure 6, the 336-row by 112-column tiles are sub-partitioned into 336-row by 28-column strips (Figure 8). In Figure 8, the matrix segment B̃ strips use row-major-order memory offsets, and therefore the 28 elements in a row are contiguous. The 28 elements of a row of matrix segment B̃ correspond to the 28 elements of matrix C that are used in the core solver computation illustrated in the right portion of Figure 6.


Figure 8. Data element packing for 336 rows by 28 columns of matrix segment B̃ using row-major-order memory offsets 17.

Figure 9 shows a column-major-order partition of matrix C that has 8 array elements within each column and 28 columns (in green). As mentioned earlier, the 8 contiguous double-precision elements within a column fit into a 512-bit (64-byte) vector register, and the 28 columns of 8 row elements each can map onto 28 of the 32 vector registers associated with a hardware thread. In Figure 9, note that when the next 8-row by 28-column segment of matrix C (in white) is processed, the first element in each of its columns is adjacent to the last element of the corresponding column in the green partition. This column-major ordering therefore allows the 8-row by 28-column segment (in white) to be efficiently prefetched for the anticipated FMA computation.


Figure 9. Data element organization for 8 rows by 28 columns of matrix segment C using column-major-order memory offsets 17.
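To make the register mapping concrete, the following AVX-512 intrinsics sketch (an illustration, not the article's actual solver; the packing layout and names are assumptions) updates one 8-element column of the C block with FMA instructions:

#include <immintrin.h>

/* Accumulate one 8-row column slice of the C block over Kc steps.
 * A_strip : packed 8 x Kc tile (8 doubles per column, 64-byte aligned).
 * B_col   : pointer to the first element of one B-tile column; consecutive
 *           rows are strideB doubles apart (28 for the row-major strips above).
 * C_col   : 8 contiguous, 64-byte-aligned doubles of the C block.          */
void update_c_column(const double *A_strip, const double *B_col,
                     int strideB, double *C_col, int Kc)
{
    __m512d c = _mm512_load_pd(C_col);                  /* 8 doubles of C  */
    for (int p = 0; p < Kc; p++) {
        __m512d a = _mm512_load_pd(&A_strip[p * 8]);    /* 8 doubles of A  */
        __m512d b = _mm512_set1_pd(B_col[p * strideB]); /* broadcast one B */
        c = _mm512_fmadd_pd(a, b, c);                   /* c += a * b      */
    }
    _mm512_store_pd(C_col, c);
}

A full core solver would keep 28 such accumulators live at once, one per column of the C block, rather than reloading C between columns.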

In regard to Figures 7, 8, and 9, we have completed the data-referencing analysis for the Ã, B̃, and C matrices, which are color coded to make it easy to associate the rectangular data-tiling abstractions with the three matrix multiply data objects. Putting this all together into a core matrix multiply implementation, a possible pseudo-code representation of the core solver in Figure 6 that also reflects the data-referencing patterns of Figures 7, 8, and 9 is described next, with a sketch following the description.


C/C++ Extensions for Array Notation (CEAN) 5 are used in the core-solver pseudo-code (see, for example, the snippets in the next section), as indicated by the colon notation "…:8" within the subscripts. This language extension describes computing 8 double-precision partial results for matrix C by using 8 double-precision elements of Ã, while a single element of B̃ is replicated eight times and placed into a 512-bit vector register for the B̃ operand, making it possible to take advantage of FMA computation. For matrix B̃, the subscript "l" (the letter L) references the level (an entire strip in Figure 8); there are 4 levels in the B̃ matrix, each containing 336 rows and 28 columns, which accounts for the value 112 (28 × 4) in Figure 6.
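Based on that description, a plain-C sketch of the core solver (loop bounds, names, and the packed layouts are assumptions; the article's snippet uses CEAN colon notation such as C[ir+iir:8, jc+jjc] instead of the explicit 8-element inner loops shown here) might look like:

/* Core solver: update one 8-row x 28-column block of C.
 * A_strip : packed 8 x Kc strip of the A tile (8 doubles per column).
 * B_panel : packed Kc x 28 strip of the B tile (row-major, 28 per row).
 * C_blk   : 8 x 28 block of C, column-major with leading dimension ldc. */
void core_solver(const double *A_strip, const double *B_panel,
                 double *C_blk, int ldc, int Kc)
{
    for (int jj = 0; jj < 28; jj++) {           /* 28 columns of C           */
        double acc[8];                          /* one abstract register     */
        for (int i = 0; i < 8; i++)
            acc[i] = C_blk[jj * ldc + i];
        for (int p = 0; p < Kc; p++) {
            double b = B_panel[p * 28 + jj];    /* replicate one B element   */
            for (int i = 0; i < 8; i++)         /* maps onto one 512-bit FMA */
                acc[i] += A_strip[p * 8 + i] * b;
        }
        for (int i = 0; i < 8; i++)
            C_blk[jj * ldc + i] = acc[i];
    }
}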

Performance Results for a C/C++ Implementation of DGEMM Based on Intel® Math Kernel Library/DGEMM

This section describes three DGEMM experiments that were run on a single Intel Xeon Phi processor socket with 64 cores and 16 gigabytes of MCDRAM. The first experiment establishes a baseline for floating-point-operations-per-second performance. Experiments 2 and 3 attempt to demonstrate increasing floating-point-operations-per-second performance. All three executables do storage allocation into MCDRAM or DDR using the respective function prototypes:

int hbw_posix_memalign(void **memptr, size_t alignment, size_t size);

and

void *_mm_malloc (size_t size, size_t align);

MCDRAM for these experiments was configured in flat mode. The executables were built with the Intel C/C++ Compiler. Cache management was used for the A and B matrices by transferring data into tile data structures that fit into the L2 and L1 caches. MCDRAM was used for the Ã tile. The other data structures allocated for this Intel® Math Kernel Library (Intel® MKL)/DGEMM-based implementation used DDR memory.

Please note that on your system, the floating-point-operations-per-second results will vary from those shown in Figures 10, 11, and 12. Results will be influenced, for example, by factors such as the version of the OS, the software stack component versions, the processor stepping, the number of cores on a socket, and the storage capacity of MCDRAM.

A shell script for running the experiment that produced the data in Figure 10 had the following arguments:

64 1 336 112 43008 43008 dynamic 2 <path-to-memkind-library>

  • 64 defines the number of core threads.
  • 1 defines the number of hardware threads per core that are to be used.
  • 336 defines the number of columns for the Ã and the number of rows for the B̃ tiling data structures. See Figure 5 and Figure 6.
  • 112 defines the number of columns for the B̃ data structure tile. Also, see Figure 5 and Figure 6.
  • 43008 is the matrix order.
  • The second value 43008 refers to the number of rows for Ã.
  • The values dynamic and 2 are used to control the OpenMP* scheduling 18,19 (see Table 2 below).
  • The meta-symbol <path-to-memkind-library> refers to the directory path of the memkind library installed on the user's Knights Landing system.

The first experiment is based on the data tiling storage diagrams from Section 5. Recall that each 512-bit vector register on the Intel Xeon Phi processor can hold eight double-precision floating-point values, and there is also an opportunity to use the FMA vector instruction for the core computation:

for ( … )
     C[ir+iir:8,jc+jjc] += …
     C[ir+iir:8,jc+jjc+1] += …

     …

     C[ir+iir:8,jc+jjc+26] += …
     C[ir+iir:8,jc+jjc+27] += …
endfor

For the compilation of orig_knl_dgemm.c into an executable for running on the Intel Xeon Phi processor, the floating-point-operations-per-second results might look something like the following:


Figure 10. Intel® Xeon Phi™ processor result for the executable orig_knl_dgemm.exe using 64 core threads and 1 OpenMP* thread per core. The matrix order was 43,008. The tile data structure for the A matrix had 336 columns and the M-rows value was 43,008. No abstract vector registers were used for the matrix-multiply core solver.

In the next experiment, the source file called opt_knl_dgemm.c is used to build the executable called reg_opt_knl_dgemm.exe. In this file, references to the 28 columns of matrix C in the core solver are replaced with the following:

t0[0:8] = C[ir+iir:8,jc+jjc];
t1[0:8] = C[ir+iir:8,jc+jjc+1];

…

t26[0:8] = C[ir+iir:8,jc+jjc+26];
t27[0:8] = C[ir+iir:8,jc+jjc+27];

for ( … )
    t0[0:8] += …
    t1[0:8] += …

    …

    t27[0:8] += …
endfor

C[ir+iir:8,jc+jjc] = t0[0:8];
C[ir+iir:8,jc+jjc+1] = t1[0:8];

…

C[ir+iir:8,jc+jjc+26] = t26[0:8];
C[ir+iir:8,jc+jjc+27] = t27[0:8];

The notion of using the array temporaries t0 through t27 can be thought of as assigning abstract vector registers in the computation of partial results for the core matrix multiply algorithm. For this experiment on the Intel Xeon Phi processor, the floating-point-operations-per-second results might look something like:


Figure 11. Intel® Xeon Phi™ processor performance comparison between the executable, orig_knl_dgemm.exe and the executable, reg_opt_knl_dgemm.exe using 64 core threads and 1 OpenMP* thread per core. The matrix order was 43,008. The tile data structure for the A-matrix had 336 columns and the M-rows value was 43,008. The executable reg_opt_knl_dgemm.exe used 28 abstract vector registers for the matrix-multiply core solver

Note that in Figure 11, the result for the executable orig_knl_dgemm.exe is compared with the result for the executable reg_opt_knl_dgemm.exe (where 28 abstract vector registers were used). As mentioned previously, from an abstract-vector-register perspective, the intent was to explicitly manage 28 of the thirty-two 512-bit vector registers available to a hardware thread within a Knights Landing core.

The last experiment (experiment 3) builds the executable called pref_32_0_reg_opt_knl_dgemm.exe using the Intel C/C++ Compiler options -qopt-prefetch=2 and -qopt-prefetch-distance=n1,n2, where n1 and n2 are replaced with integer constants. The -qopt-prefetch-distance switch controls the number of iterations ahead that data is prefetched for the L2 and L1 caches on Knights Landing. The L2, L1 combination reported here is (32,0). Figure 12 shows a comparison of experiments 1, 2, and 3.


Figure 12. Intel® Xeon Phi™ processor performance comparisons for executables, orig_knl_dgemm.exe, reg_opt_knl_dgem.exe, and pref_32_0_reg_opt_knl_dgemm.exe. Each executable used 64 core threads and 1 OpenMP* thread per core. The matrix order was 43,008. The tile data structure for the A-matrix had 336 columns and the M-rows value was 43,008. The executables reg_opt_knl_dgemm.exe and pref_32_0_reg_opt_knl_dgemm.exe used 28 abstract vector registers for the matrix-multiply core solver. The executable pref_32_0_reg_opt_knl_dgemm.exe was also built with the Intel® C/C++ Compiler prefetch switches -qopt-prefetch=2 and -qopt-prefetch-distance=32,0

For the three experiments discussed above, the user can download the shell scripts, makefiles, C/C++ source files, and a README.TXT file at the following URL:

Knights Landing/DGEMM Download Package

After downloading and untarring the package, note the following checklist:

  1. Make sure that the HBM software package called memkind is installed on your Knights Landing system. The package can be retrieved from the memkind project 11 if it is not already installed.
  2. Set the following environment variables:
    export MEMKIND_HBW_NODES=1
    export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:<path-to-memkind-library>/lib
    where <path-to-memkind-library> is a meta-symbol and represents the directory path to the memkind library where the user has done their installation of this library.
  3. Issue the command:
    cat /proc/sys/vm/nr_hugepages
    If it does not have a value of 20, then ask your system administrator to change this value on your Knights Landing system using root privileges. With a system administrator account, this can be done by issuing the command:
    echo 20 > /proc/sys/vm/nr_hugepages
    followed by the verification command:
    cat /proc/sys/vm/nr_hugepages
    As mentioned earlier in a subsection of Section 3, one does not need to explicitly set the number of huge pages by echoing to /proc/sys/vm/nr_hugepages as long as the content of /sys/kernel/mm/transparent_hugepage/enabled is set to “always”.
  4. Review the content of the README.TXT file that is within the directory opt_knl_dgemm_v1 on your Knights Landing system. This read-me file contains information about how to build and run the executables from a host Knights Landing system. The README.TXT file should be used as a guide for doing your own experiments.

Once you complete the checklist on the Intel Xeon Phi processor system, you can source an Intel Parallel Studio XE Cluster Edition script called psxevars.sh by doing the following:

. <path-to-Intel-Parallel-Studio-XE-Cluster-Edition>/psxevars.sh intel64

This script is sourced in particular to set up the Intel C/C++ compilation environment.

For experiment 1, issue a command sequence that looks something like the following within the directory opt_knl_dgemm_v1:

$ cd ./scripts
$ ./orig_knl_dgemm.sh <path-to-memkind-library>

The output report for a run with respect to the scripts sub-directory will be placed in the sibling directory called reports, and the report file should have a name something like:

orig_knl_dgemm_report.64.1.336.112.43008.43008.dynamic.2

where the suffix notation for the report file name has the following meaning:

  • 64 defines the number of core threads.
  • 1 defines the number of hardware threads per core that are to be used.
  • 336 defines the number of columns for the Ã and the number of rows for the B̃ tiling data structures. See Figures 5 and 6.
  • 112 defines the number of columns for the B̃ data structure tile. Also, see Figures 5 and 6.
  • 43008 is the matrix order.
  • The second value 43008 refers to the number of rows for Ã.
  • The values dynamic and 2 are used to control the OpenMP scheduling (see below).

As mentioned earlier, OpenMP is used to manage threaded parallelism. In so doing, the OpenMP Standard 18 provides a scheduling option for work-sharing loops:

schedule(kind[, chunk_size])

This scheduling option is part of the C/C++ directive: #pragma omp parallel for, or #pragma omp for, and the Fortran* directive: !$omp parallel do, or !$omp do. The schedule clause specifies how iterations of the associated loops are divided into contiguous non-empty subsets, called chunks, and how these chunks are distributed among threads of a team. Each thread executes its assigned chunk or chunks in the context of its implicit task. The chunk_size expression is evaluated using the original list items of any variables that are made private in the loop construct.
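For example, the outer loop that hands out blocks of matrix C might be work-shared with a dynamic schedule and a chunk size of 2, matching the "dynamic 2" script arguments used above (the loop body and bounds here are placeholders):

#include <omp.h>

void threaded_outer_loop(int num_blocks)
{
    /* Hand out blocks of the C matrix to threads two at a time */
    #pragma omp parallel for schedule(dynamic, 2)
    for (int jb = 0; jb < num_blocks; jb++) {
        /* ... pack the B strip and run the core solver for block jb ... */
    }
}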

Table 2 provides a summary of the possible settings for the “kind” component for the “schedule” option.

Table 2. “kind” scheduling values for the OpenMP* schedule(kind[, chunk_size])directive component for OpenMP work sharing loops 18,19.

Static: Divide the loop into equal-sized chunks, or as equal as possible when the number of loop iterations is not evenly divisible by the number of threads multiplied by the chunk size. By default, the chunk size is loop-count/number-of-threads. Set chunk to 1 to interleave the iterations.

Dynamic: Use an internal work queue to give a chunk-sized block of loop iterations to each thread. When a thread is finished, it retrieves the next block of loop iterations from the top of the work queue. By default, the chunk size is 1. Be careful when using this scheduling type because of the extra overhead involved.

Guided: Similar to dynamic scheduling, but the chunk size starts off large and decreases to better handle load imbalance between iterations. The optional chunk parameter specifies the minimum chunk size to use. By default, the chunk size is approximately loop-count/number-of-threads.

Auto: When schedule(auto) is specified, the scheduling decision is delegated to the compiler. The programmer gives the compiler the freedom to choose any possible mapping of iterations to threads in the team.

Runtime: Uses the OMP_SCHEDULE environment variable to specify which of the loop-scheduling kinds should be used. OMP_SCHEDULE is a string formatted exactly as it would appear in the schedule clause.

An alternative to using the scheduling option of the C/C++ directives:

#pragma omp parallel for or #pragma omp for

or the Fortran directives:

!$omp parallel do or !$omp do

is to use the OpenMP environment variable OMP_SCHEDULE, which has the options:

type[,chunk]

where:

  • type is one of static, dynamic, guided, or auto.
  • chunk is an optional positive integer that specifies the chunk size.
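As a sketch of how these settings connect (the kind and chunk values are only examples), a loop declared with schedule(runtime) defers the choice to OMP_SCHEDULE or to the equivalent OpenMP runtime call:

#include <omp.h>

void runtime_scheduled_loop(int n)
{
    /* Equivalent to exporting OMP_SCHEDULE="dynamic,2" before the run */
    omp_set_schedule(omp_sched_dynamic, 2);

    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; i++) {
        /* ... work item i ... */
    }
}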

Finally, experiments 2 and 3 can be launched from the scripts directory by using the commands:

./reg_opt_knl_dgemm.sh <path-to-memkind-library>

and

./pref_reg_opt_knl_dgemm.sh <path-to-memkind-library>

In a similar manner, the output report for each run with respect to the scripts sub-directory will be placed in the sibling directory called reports.

Conclusions 

The experiments on the Intel Xeon Phi processor architecture, using HBM library storage allocation into MCDRAM for a non-library C/C++ implementation of DGEMM, indicate that data alignment, data placement, and management of the vector registers can help provide good performance on the Intel Xeon Phi processor. Management of the Intel Xeon Phi processor vector registers at the programming-language application level was done with abstract vector registers. In general, you may want to use conditional-compilation macros within your applications to control the selection of the high-bandwidth libraries for managing dynamic storage allocations into MCDRAM versus DDR. In this way, you can experiment with the application to see which storage-allocation methodology provides the best execution performance for your application running on Intel Xeon Phi processor architectures. Finally, compiler prefetching controls were used for the L1 and L2 data caches, and the experiments showed that adjusting prefetching further improved execution performance.

As mentioned earlier, the core solver for Intel® MKL DGEMM is written in assembly language, and when you need DGEMM as part of an application, Intel® MKL DGEMM should be used. For completeness, the following URL provides performance charts for Intel® MKL DGEMM on Knights Landing:

https://software.intel.com/en-us/intel-mkl/benchmarks#DGEMM3

References 

  1. A. Sodani, R. Gramunt, J. Corbal, H. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, Y. Liu, “KNIGHTS LANDING: SECOND GENERATION INTEL® XEON PHI PRODUCT,” IEEE MICRO, March/April 2016, pp. 34-46. 
  2. Intel® VTune™ Amplifier 2017. 
  3. Intel® Trace Analyzer and Collector. 
  4. Getting Started with the MPI Performance Snapshot. 
  5. C/C++ Extensions for Array Notations Programming Model. 
  6. Intel® 64 and IA-32 Architectures Software Developer Manuals. 
  7. Intel® Architecture Instruction Set Extensions Programming Reference (PDF 74 KB). 
  8. B. Brett, Multi-Channel DRAM (MCDRAM) and High-Bandwidth Memory (HBM). 
  9. https://www.jedec.org/ 
  10. http://man7.org/linux/man-pages/man7/numa.7.html 
  11. https://github.com/memkind/memkind 
  12. https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt 
  13. Intel® C++ Compiler 17.0 Developer Guide and Reference.
  14. R. Krishnaiyer, Compiler Prefetching for the Intel® Xeon Phi™ coprocessor (PDF 336 KB). 
  15. K. Goto and R. van de Geijn, “Anatomy of High-Performance Matrix Multiplication,” ACM Transactions on Mathematical Software, Vol. 34, No. 3, May 2008, pp. 1-25. 
  16. Guide to Automatic Vectorization with Intel AVX-512 Instructions in Knights Landing Processors, May 2016. 
  17. Intel® Math Kernel Library (Intel® MKL). 
  18. The OpenMP API Specification for Parallel Programming. 
  19. R. Green, OpenMP Loop Scheduling.

3D Isotropic Acoustic Finite-Difference Wave Equation Code: A Many-Core Processor Implementation and Analysis

$
0
0

Finite difference is a simple and efficient mathematical tool that helps solve differential equations. In this paper, we solve an isotropic acoustic 3D wave equation using explicit, time domain finite differences.

Propagating seismic waves remains a compute-intensive task even when considering the simplest expression of the wave equation. In this paper, we explain how to implement and optimize a three-dimensional isotropic kernel with finite differences to run on the Intel® Xeon® processor v4 family and the Intel® Xeon Phi™ processor.

We also give a brief overview of the new memory hierarchy introduced with the Intel® Xeon Phi™ processor and the settings and source code modifications needed to use the C/C++ High Bandwidth Memory (HBM) application programming interfaces (APIs) for dynamic storage allocation from Multi-Channel DRAM (MCDRAM).
