How to Mount a Shared Directory on Intel® Xeon Phi™ Coprocessor


In order to run a native program on the Intel® Xeon Phi™ coprocessor, the program and any dependencies must be copied to the target platform. However, copying these files consumes memory on the coprocessor. To preserve the coprocessor's limited on-board memory (16 GB of GDDR5), it is practical to mount a Network File System (NFS) shared directory on the Intel Xeon Phi coprocessor from the host server so that most of its memory remains available to applications. This article shows two ways to accomplish this task: the preferred method uses the micctrl utility; the second is a manual procedure.

Using micctrl utility

The preferred method to mount a shared directory on an Intel Xeon Phi coprocessor is to use the micctrl utility shipped with the Intel® Manycore Platform Software Stack (Intel® MPSS). The following example shows how to share the Intel® C++ Compiler libraries using micctrl. On the host machine used for this example, MPSS 3.4.8 was installed.

  1. On the host machine, ensure that the shared directory exists:
    [host ~]# ls /opt/intel/compilers_and_libraries_2017.0.098/linux/
  2. Add a new entry to the /etc/exports configuration file on the host machine to export the directory /opt/intel/compilers_and_libraries_2017.0.098/linux to the coprocessor mic0, whose IP address is 172.31.1.1. Use the read-only option so that the coprocessor cannot mistakenly delete anything in the shared library directory:
    [host ~]# cat /etc/exports
    	/opt/intel/compilers_and_libraries_2017.0.098/linux 172.31.1.1(ro,async,no_root_squash)

    For more information on the export options, you can refer to http://nfs.sourceforge.net/nfs-howto/ar01s03.html.
     
  3. Next, update the NFS export table in the host:
    [host ~]# exportfs -a
  4. From the host, use the micctrl utility to add an NFS entry on the coprocessors:
    [host ~]# micctrl --addnfs=/opt/intel/compilers_and_libraries_2017.0.098/linux --dir=/mnt-library --options=defaults
  5. Restart the MPSS service:
    [host ~]# service mpss restart
    	Shutting down Intel(R) MPSS:                               [  OK  ]
    	Starting Intel(R) MPSS:                                    [  OK  ]
    	mic0: online (mode: linux image: /usr/share/mpss/boot/bzImage-knightscorner)
    	mic1: online (mode: linux image: /usr/share/mpss/boot/bzImage-knightscorner)
  6. Finally, from the coprocessor, verify that the remote directory is accessible:
    [host ~]# ssh mic0 cat /etc/fstab
    	rootfs          /               auto            defaults                1  1
    	proc            /proc           proc            defaults                0  0
    	devpts          /dev/pts        devpts          mode=0620,gid=5         0  0
    	172.31.1.254:/opt/intel/compilers_and_libraries_2017.0.098/linux  /mnt-library  nfs             defaults 1 1
    
    	[host ~]# ssh mic0 ls /mnt-library
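
Once the share is visible, a native binary on the coprocessor can pick up its runtime libraries directly from the mount point. The following is a minimal sketch; the binary name my_native_app is hypothetical, and the library subpath assumes the coprocessor runtime libraries live under compiler/lib/mic inside the mounted tree:

    [host ~]# ssh mic0
    (mic0)# export LD_LIBRARY_PATH=/mnt-library/compiler/lib/mic:$LD_LIBRARY_PATH
    (mic0)# ./my_native_app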

Mounting manually

As an example of the manual procedure, let’s assume we want to mount an NFS shared directory /mnt-mic0 on the Intel Xeon Phi coprocessor, where /var/mpss/mic0.export is the directory exported by the host machine. Steps 1-3 are analogous to those in the previous method:

  1. On the host machine, ensure that the shared directory exists; if it doesn’t exist, create it:
    [host ~]# mkdir /var/mpss/mic0.export
  2. Add a descriptor to the /etc/exports configuration file on the host machine to export the directory /var/mpss/mic0.export to the coprocessor mic0, which in this case has an IP address of 172.31.1.1:
    [host ~]# cat /etc/exports
    	/var/mpss/mic0.export 172.31.1.1(rw,async,no_root_squash)

    For more information on the export options, you can refer to http://nfs.sourceforge.net/nfs-howto/ar01s03.html.
     
  3. Next, update the NFS export table:
    [host ~]# exportfs -a
  4. Next, login on the coprocessor mic0:
    [host ~]# ssh mic0
  5. Create the mount point /mnt-mic0 on the coprocessor:
    (mic0)# mkdir /mnt-mic0
  6. Add the following descriptor to the /etc/fstab file of the coprocessor to specify the server, the path name of the exported directory, the local directory (mount point), the type of the file system, and the list of mount options: “172.31.1.254:/var/mpss/mic0.export /mnt-mic0 nfs defaults 1 1
    (mic0)# cat /etc/fstab
    	rootfs          /               auto             defaults                1  1
    	proc            /proc           proc             defaults                0  0
    	devpts          /dev/pts        devpts           mode=0620,gid=5         0  0
    	172.31.1.254:/var/mpss/mic0.export /mnt-mic0 nfs defaults                1  1
  7. To mount the shared directory /var/mpss/mic0.export on the coprocessor, we can type:
    (mic0)# mount -a
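
To confirm that the share is actually mounted, a quick check (not part of the original procedure) is to query the mount table or the free space on the mount point:

    (mic0)# mount | grep /mnt-mic0
    (mic0)# df -h /mnt-mic0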

Notes:

  • If a "Connection refused" error is received, restart the NFS server on the host:
    [host~]# service nfs restart
    Shutting down NFS daemon:                                  [  OK  ]
    Shutting down NFS mountd:                                  [  OK  ]
    Shutting down NFS quotas:                                  [  OK  ]
    Shutting down NFS services:                                [  OK  ]
    Starting NFS services:                                     [  OK  ]
    Starting NFS quotas:                                       [  OK  ]
    Starting NFS mountd:                                       [  OK  ]
    Stopping RPC idmapd:                                       [  OK  ]
    Starting RPC idmapd:                                       [  OK  ]
    Starting NFS daemon:                                       [  OK  ]
  • If a "Permission denied" error is received, review and correct the /etc/exports file on the host.
  • If the coprocessor reboots, you have to mount the directory on the coprocessor again.
  • The shared directory above is mounted read/write. To make it read-only, use the (ro,async,no_root_squash) options as shown in step 2 of the micctrl method.
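
After editing /etc/exports, it can also be useful to verify what the host is currently exporting; exportfs with the verbose flag lists the active exports and their options:

    [host ~]# exportfs -v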

Conclusion

This article showed two methods to mount a shared directory on the Intel Xeon Phi coprocessor: one using the micctrl utility, the other a common manual procedure. Although both methods work, the micctrl utility is preferred because it prevents users from entering data incorrectly in the coprocessor's /etc/fstab file.


Recipe: Building NAMD on Intel® Xeon® and Intel® Xeon Phi™ Processors


Purpose

This recipe describes a step-by-step process to get, build, and run NAMD (a scalable molecular dynamics code) on Intel® Xeon Phi™ processors and Intel® Xeon® E5 processors for better performance.

Introduction

NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecule systems. Based on Charm++ parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 500,000 cores for the largest simulations. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM, and X-PLOR.

NAMD is distributed free of charge with source code. You can build NAMD yourself or download binaries for a wide variety of platforms. Find details below on how to build NAMD for Intel® Xeon Phi™ processors and Intel® Xeon® E5 processors, and learn more about NAMD at http://www.ks.uiuc.edu/Research/namd/.

Building and running NAMD on Intel® Xeon® Processor E5-2697 v4 (BDW) and Intel® Xeon Phi™ Processor 7250 (KNL)

Download the Code:

  1. Download the latest “Source Code” of NAMD from this site: http://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD
  2. Download charm++ version 6.7.1
  3. Download fftw3 (http://www.fftw.org/download.html)
    • Version 3.3.4 is used in this run
  4. Download the apoa1 and stmv workloads from here: http://www.ks.uiuc.edu/Research/namd/utilities/

Build the Binaries:

  1. Recommended steps to build fftw3:
    • cd <path>/fftw3.3.4
    • ./configure --prefix=$base/fftw3 --enable-single --disable-fortran CC=icc
      Use -xMIC-AVX512 for KNL or -xCORE-AVX2 for BDW
    • make CFLAGS="-O3 -xMIC-AVX512 -fp-model fast=2 -no-prec-div -qoverride-limits" clean install
  2. Build multicore version of charm++:
    • cd <path>/charm-6.7.1
    • ./build charm++ multicore-linux64 iccstatic --with-production "-O3 -ip"
  3. Build BDW:
    • Modify the Linux-x86_64-icc.arch to look like the following (use -xCORE-AVX2 and drop the KNL-specific -DNAMD_KNL define):
      NAMD_ARCH = Linux-x86_64
      CHARMARCH = multicore-linux64-iccstatic
      FLOATOPTS = -ip -xCORE-AVX2 -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
      CXX = icpc -std=c++11
      CXXOPTS = -static-intel -O2 $(FLOATOPTS)
      CXXNOALIASOPTS = -O3 -fno-alias $(FLOATOPTS) -qopt-report-phase=loop,vec -qopt-report=4
      CXXCOLVAROPTS = -O2 -ip
      CC = icc
      COPTS = -static-intel -O2 $(FLOATOPTS)
    • ./config Linux-x86_64-icc --charm-base <charm_path> --charm-arch multicore-linux64-iccstatic --with-fftw3 --fftw-prefix <fftw_path> --without-tcl --charm-opts -verbose
    • gmake -j
  4. Build KNL:
    • Modify the arch/Linux-KNL-icc.arch to look like the following:
      NAMD_ARCH = Linux-KNL
      CHARMARCH = multicore-linux64-iccstatic
      FLOATOPTS = -ip -xMIC-AVX512 -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
      CXX = icpc -std=c++11 -DNAMD_KNL
      CXXOPTS = -static-intel -O2 $(FLOATOPTS)
      CXXNOALIASOPTS = -O3 -fno-alias $(FLOATOPTS) -qopt-report-phase=loop,vec -qopt-report=4
      CXXCOLVAROPTS = -O2 -ip
      CC = icc
      COPTS = -static-intel -O2 $(FLOATOPTS)
    • ./config Linux-KNL-icc --charm-base <charm_path> --charm-arch multicore-linux64-iccstatic --with-fftw3 --fftw-prefix <fftw_path> --without-tcl --charm-opts -verbose
    • gmake -j

Other system setup:

  1. Change the kernel settings for KNL: “nmi_watchdog=0 rcu_nocbs=2-271 nohz_full=2-271”. One way to change the settings (this could be different for every system):
    • First save your original grub.cfg to be safe
        cp /boot/grub2/grub.cfg /boot/grub2/grub.cfg.ORIG
    • In “/etc/default/grub”, add (append) the following to “GRUB_CMDLINE_LINUX”:
        nmi_watchdog=0 rcu_nocbs=2-271 nohz_full=2-271
    • Save your new configuration
        grub2-mkconfig -o /boot/grub2/grub.cfg
    • Reboot the system. After logging in, verify the settings with 'cat /proc/cmdline'
  2. Change next lines in *.namd file for both workloads:
         numsteps             1000
         outputtiming          20
         outputenergies     600

Run NAMD:

  1. Run BDW (ppn = 72):
    $BIN +p $ppn apoa1/apoa1.namd +pemap 0-($ppn-1)
  2. Run KNL (ppn = 136, MCDRAM in flat mode, similar performance in cache):
    numactl -m 1 $BIN +p $ppn apoa1/apoa1.namd +pemap 0-($ppn-1)

Example: numactl -m 1 /NAMD_2.11_Source/Linux-KNL-icc/namd2 +p 136 apoa1/apoa1.namd +pemap 0-135
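
The +pemap range depends on the chosen ppn; in a launch script the upper bound can be computed with shell arithmetic. A minimal sketch, reusing the $BIN and $ppn placeholders from above:

    # BDW: 72 worker threads pinned to cores 0-71
    ppn=72
    $BIN +p $ppn apoa1/apoa1.namd +pemap 0-$((ppn-1))

    # KNL: 136 worker threads, memory bound to MCDRAM (NUMA node 1)
    ppn=136
    numactl -m 1 $BIN +p $ppn apoa1/apoa1.namd +pemap 0-$((ppn-1))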

Performance results reported in Intel Salesforce repository (ns/day; higher is better):

Workload | 2S BDW 18c 2.3 GHz (ns/day) | KNL bin1 (ns/day) | KNL vs. 2S BDW (speedup)
stmv     | 0.45                        | 0.55              | 1.22x
apoa1    | 5.5                         | 6.18              | 1.12x

Systems configuration:

Processor                   | Intel® Xeon® Processor E5-2697 v4 (BDW) | Intel® Xeon Phi™ Processor 7250 (KNL)
Stepping                    | 1 (B0)                                  | 1 (B0) Bin1
Sockets / TDP               | 2S / 290W                               | 1S / 215W
Frequency / Cores / Threads | 2.3 GHz / 36 / 72                       | 1.4 GHz / 68 / 272
DDR4                        | 8x16 GB 2400 MHz (128 GB)               | 6x16 GB 2400 MHz
MCDRAM                      | N/A                                     | 16 GB Flat
Cluster/Snoop Mode/Mem Mode | Home                                    | Quadrant/Flat
Turbo                       | On                                      | On
BIOS                        | GRRFSDP1.86B0271.R00.1510301446         | GVPRCRB1.86B.0010.R02.1608040407
Compiler                    | ICC-2017.0.098                          | ICC-2017.0.098
Operating System            | Red Hat Enterprise Linux* 7.2 (3.10.0-327.el7.x86_64) | Red Hat Enterprise Linux 7.2 (3.10.0-327.22.2.el7.xppsl_1.4.1.3272.x86_64)

Building and running NAMD for Cluster on Intel® Xeon® Processor E5-2697 v4 (BDW) and Intel® Xeon Phi™ Processor 7250 (KNL)

Build the Binaries:

  1. Set Intel tools for compilation:
    I_MPI_CC=icc;I_MPI_CXX=icpc;I_MPI_F90=ifort;I_MPI_F77=ifort
    export I_MPI_CC I_MPI_CXX I_MPI_F90 I_MPI_F77
    CC=icc;CXX=icpc;F90=ifort;F77=ifort
    export CC CXX F90 F77
    export I_MPI_LINK=opt_mt
  2. Recommended steps to build fftw3:
    • cd <path>/fftw3.3.4
    • ./configure --prefix=$base/fftw3 --enable-single --disable-fortran CC=icc
    • Use -xMIC-AVX512 for KNL or -xCORE-AVX2 for BDW
    • make CFLAGS="-O3 -xMIC-AVX512 -fp-model fast=2 -no-prec-div -qoverride-limits" clean install
  3. Recommended steps to build multicore version of charm++:
    • cd <path>/charm-6.7.1
    • chmod -R 777 *
    • source /opt/intel/compiler/<version>/compilervars.sh intel64
    • source /opt/intel/impi/<version>/bin/mpivars.sh
    •  ./build charm++ mpi-linux-x86_64 smp mpicxx ifort --with-production $base_charm_opts -DCMK_OPTIMIZE -DMPICH_IGNORE_CXX_SEEK
  4. Build on KNL:
    •  ./config Linux-KNL-icc --charm-base <fullPath>/charm-6.7.1 --charm-arch mpi-linux-x86_64-ifort-smp-mpicxx --with-fftw3 --fftw-prefix <fullPath>/fftw3 --without-tcl --charm-opts -verbose
    • cd “Linux-KNL-icc”
    •   gmake -j
  5. Build on BDW:
    • ./config Linux-KNL-icc --charm-base $FULLPATH/charm-6.7.1 --charm-arch mpi-linux-x86_64-ifort-smp-mpicxx --with-fftw3 --fftw-prefix $FULLPATH/fftw3 --without-tcl --charm-opts -verbose
    •  cd Linux-KNL-icc
    • make clean
    • gmake -j

Run the Binaries (“hosts” is the file that contains the host names to run on):

  1. BDW run on single node:
    export I_MPI_PROVIDER=psm2
    export I_MPI_FALLBACK=no
    export I_MPI_FABRICS=tmi
    
    source /opt/intel/compiler/<version>/compilervars.sh intel64
    source /opt/intel/impi/<version>/intel64/bin/mpivars.sh
    
    NTASKS_PER_NODE=1
    export MPPEXEC="time -p mpiexec.hydra -perhost $NTASKS_PER_NODE -f ./hosts "
    $MPPEXEC -n $node $BINPATH/$BINNAME +ppn 71 $FULLPATH/$WORKLOAD +pemap 1-71 +commap 0

    Example:
    $MPPEXEC -n 1 $FULLPATH/namd2 +ppn 71 $FULLPATH/stmv/stmv.namd +pemap 1-71 +commap 0
     
  2. KNL Run on single node:
    export I_MPI_PROVIDER=psm2
    export I_MPI_FALLBACK=0
    export I_MPI_FABRICS=tmi
    export PSM2_IDENTIFY=1
    export PSM2_RCVTHREAD=0
    export TMI_PSM2_TEST_POLL=1
    
    NTASKS_PER_NODE=1
    export MPPEXEC="mpiexec.hydra -perhost $NTASKS_PER_NODE -f ./hosts "
    numactl -m 1 $MPPEXEC $BINPATH/$BINNAME +ppn 135 $FULLPATH/$WORKLOAD +pemap 1-135 +commap 0

    Example:
    numactl -m 1 $MPPEXEC $FULLPATH/namd2 +ppn 135 $FULLPATH/stmv/stmv.namd +pemap 1-135 +commap 0
     
  3. KNL Run on multi-node (node = number of nodes to run on):
    export MPPEXEC="mpiexec.hydra -perhost 1 -f ./hosts "
    numactl -m 1 $MPPEXEC -n $node numactl -m 1 $BINPATH/$BINNAME +ppn 134 $FULLPATH/$WORKLOAD +pemap 0-($ppn-1) +commap 67 

    Example:
    numactl -m 1 $MPPEXEC -n 8 numactl -m 1 $FULLPATH/namd2 +ppn 134 $FULLPATH/stmv/stmv.namd +pemap 0-66+68 +commap 67

Remark:

For better scale on multinodes run, please increase count of communication threads (1, 2, 4, 8, 13, 17). Example of a command run that can be used:

export MPPEXEC="mpiexec.hydra -perhost 17 -f ./hosts "
numactl -m 1 $MPPEXEC -n $(($node*17)) numactl -m 1 $BINPATH/$BINNAME +ppn 7  $FULLPATH/$WORKLOAD +pemap 0-67,68-135:4.3 +commap 71-135:4 > ${WKL}_cluster_commapN/${WKL}.$node.$

One usage example:

nodes="16 8 4 2 1"
for node in ${nodes}
do
  	export MPPEXEC="mpiexec.hydra -perhost 17 -f ./hosts "
numactl -m 1 $MPPEXEC -n $(($node*17)) numactl -m 1 $FullPath.namd2  +ppn 8  $WorkloadPath/$WKL/$WKL.namd  +pemap 0-67+68 +commap 71-135:4 > $ResultFile.$node.$BINNAME.68c2t.commap_8th_from2cx4t
done

Best performance results reported on up to 128 Intel Xeon Phi nodes cluster (ns/day; higher is better):

Workload\node (2HT) | 1    | 2    | 4    | 8    | 16
stmv (ns/day)       | 0.55 | 1.05 | 1.86 | 3.31 | 5.31

Workload\node (2HT) | 8     | 16    | 32    | 64   | 128
stmv.28M (ns/day)   | 0.152 | 0.310 | 0.596 | 1.03 | 1.91

Recipe: Building and Running MILC on Intel® Xeon® Processors and Intel® Xeon Phi™ Processors


Introduction

MILC software represents a set of codes written by the MIMD Lattice Computation (MILC) collaboration used to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics. It performs simulations of four-dimensional SU(3) lattice gauge theory on MIMD (Multiple Instruction, Multiple Data) parallel machines. “Strong interactions” are responsible for binding quarks into protons and neutrons and holding them all together in the atomic nucleus. MILC applications address fundamental questions in high energy and nuclear physics and are directly related to major experimental programs in these fields. MILC is one of the largest compute-cycle users at many U.S. and European supercomputing centers.

This article provides instructions for code access, build, and run directions for the “ks_imp_rhmc” application on Intel® Xeon® processors and Intel® Xeon Phi™ processors. The “ks_imp_rhmc” is a dynamical RHMC (rational hybrid Monte Carlo algorithm) code for staggered fermions. In addition to the naive and asqtad staggered actions, the highly improved staggered quark (HISQ) action is also supported.

Currently, the conjugate gradient (CG) solver in the code uses the QPhiX library. Efforts are ongoing to integrate other operations (gauge force (GF), fermion force (FF)) with the QPhiX library as well.

The QPhiX library provides sparse solvers and Dslash kernels for Lattice QCD simulations optimized for Intel® architectures.

Code Access

The MILC Software and QPhiX library are primarily required. The MILC software can be downloaded from GitHub here: https://github.com/milc-qcd/milc_qcd. Download the master branch. QPhiX support is integrated into this branch for CG solvers.

The QPhiX library and code generator for use with Wilson-Clover fermions (for example, for use with chroma) are available from https://github.com/jeffersonlab/qphix.git and https://github.com/jeffersonlab/qphix-codegen.git, respectively. For the most up to date version, we suggest you use the devel branch of QPhiX. The MILC version is currently not open source. Please contact the MILC collaboration group for access to the QPhiX (MILC) branch.

Build Directions

Compile the QPhiX Library

Users need to build QPhiX first before building the MILC package.

The QPhiX library comes as two tar files, mbench*.tar and qphix-codegen*.tar.

Untar both.

Build qphix-codegen

The files with intrinsics for QPhiX are built in the qphix-codegen directory.

Enter the qphix-codegen directory.

Edit line #3 in “Makefile_xyzt” to enable the “milc=1” variable.

Compile as:

source /opt/intel/compiler/<version>/bin/compilervars.sh intel64
source /opt/intel/impi/<version>/mpi/intel64/bin/mpivars.sh
make -f Makefile_xyzt avx512 -- [for Intel® Xeon Phi™ Processor]
make -f Makefile_xyzt avx2 -- [for Intel® Xeon® v3 / v4 Processors]

Build mbench

Enter the mbench directory.

Edit line #3 in “Makefile_qphixlib”, set “mode=mic” to compile with Intel® AVX-512 for Intel® Xeon Phi™ Processor and “mode=avx” to compile with Intel® Advanced Vector Extensions 2 (Intel® AVX2) for Intel® Xeon® Processors.

Edit line #13 in “Makefile_qphixlib” to enable MPI. Set ENABLE_MPI = 1.

Compile as:

make -f Makefile_qphixlib mode=mic AVX512=1 -- [Intel® Xeon Phi™ Processor]
make -f Makefile_qphixlib mode=avx AVX2=1 -- [Intel® Xeon® Processors]

Compile MILC Code

Install/download the master branch from the above GitHub location.

Download the Makefile.qphix file from the following location:

http://denali.physics.indiana.edu/~sg/MILC_Performance_Recipe/.

Copy the Makefile.qphix to the corresponding application directory. In this case, copy the Makefile.qphix to the “ks_imp_rhmc” application directory and rename it as Makefile.

Make the following changes to the Makefile (a sketch of the resulting settings appears after this list):

  • On line #17 - Add/uncomment the appropriate ARCH variable:
    • For example, ARCH = knl (compile with Intel AVX-512 for Intel® Xeon Phi™ Processor architecture).
    • For example, ARCH = bdw (compile with Intel AVX2 for Intel® Xeon® Processor architecture).
  • On line #28 - Change MPP variable to “true” if you want MPI.
  • On line #34 - Pick the PRECISION you want:
    • 1 = Single, 2 = Double. We use Double for our runs.
  • Starting at line #37 - The compiler setup begins; it should work as-is:
    • If the directions above were followed, no changes are needed. If not, customize starting at line #40.
  • On line #124 - Setup of Intel compiler starts:
    • Based on ARCH it will use the appropriate flags.
  • On line #395 - QPhiX customizations starts: 
    • On line #399 – Set QPHIX_HOME to correct QPhiX path (Path to mbench directory).
    • The appropriate QPhiX FLAGS will be set if the above is defined correctly.
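
For reference, after these edits the relevant variables might look like the fragment below. This is only a sketch for a KNL build with MPI and double precision; line numbers and exact values depend on the downloaded Makefile.qphix:

    ARCH = knl                      # line 17: target architecture (bdw for Intel Xeon)
    MPP = true                      # line 28: enable MPI
    PRECISION = 2                   # line 34: 1 = single, 2 = double
    QPHIX_HOME = <path-to>/mbench   # line 399: path to the QPhiX mbench directory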

Compile as:

Enter the ks_imp_rhmc directory. The Makefile with the above changes should be in this directory. Source the latest Intel® compilers and Intel® MPI Library.

make su3_rhmd_hisq -- Build su3_rhmd_hisq binary
make su3_rhmc_hisq -- Build su3_rhmc_hisq binary

Compile the above binaries for Intel® Xeon Phi™ Processor and Intel® Xeon® Processor (edit Makefile accordingly).

Run Directions

Input Files

There are two required input files, params.rest and rat.m013m065m838.

They can be downloaded from here:

http://denali.physics.indiana.edu/~sg/MILC_Performance_Recipe/.

The file rat.m013m065m838 defines the residues and poles of the rational functions needed in the calculation. The file params.rest sets all the run time parameters, including the lattice size, the length of the calculation (number of trajectories), and the precision of the various conjugate-gradient solutions.

In addition, a params.<lattice-size> file with the required lattice size will be created during runtime. This file is essentially the lattice size (nx * ny * nz * nt) to run, with params.rest appended to it.

The Lattice Sizes

The size of the four-dimensional space-time lattice is controlled by the “nx, ny, nz, nt” parameters.

As an example, consider a problem as (nx x ny x nz x nt) = 32 x 32 x 32 x 64 running on 64 MPI ranks. To weak scale this problem a user would begin by multiplying nt by 2, then nz by 2, then ny by 2, then nx by 2 and so on, such that all variables get sized accordingly in a round-robin fashion.

This is illustrated in the table below. The original problem size is 32 x 32 x 32 x 64; to keep the elements/rank constant (weak scaling) for a 128-rank count, first multiply nt by 2 (32 x 32 x 32 x 128). Similarly, for 512 ranks, multiply nt by 2, nz by 2, and ny by 2 from the original problem size to keep the same elements/rank. A short sketch after the table illustrates the doubling pattern.

Ranks          | 64      | 128     | 256     | 512
nx             | 32      | 32      | 32      | 32
ny             | 32      | 32      | 32      | 64
nz             | 32      | 32      | 64      | 64
nt             | 64      | 128     | 128     | 128
Total Elements | 2097152 | 4194304 | 8388608 | 16777216
Multiplier     | 1       | 2       | 4       | 8
Elements/Rank  | 32768   | 32768   | 32768   | 32768

Table: Illustrates Weak Scaling of Lattice Sizes
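
The round-robin doubling can be written as a short shell loop. This is only an illustrative sketch of the bookkeeping, not part of the MILC run scripts:

    # Double nt, nz, ny, nx in round-robin order so that elements/rank stays constant.
    nx=32; ny=32; nz=32; nt=64
    dims=(nt nz ny nx)
    i=0
    for ranks in 64 128 256 512; do
        echo "ranks=$ranks lattice=${nx}x${ny}x${nz}x${nt} elements/rank=$(( nx*ny*nz*nt / ranks ))"
        d=${dims[$(( i % 4 ))]}; eval "$d=\$(( $d * 2 ))"; i=$(( i + 1 ))
    done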

Running with MPI x OpenMP*

The calculation takes place on a four-dimensional hypercubic lattice, representing three spatial dimensions and one time dimension. The quark fields have values on each of the lattice points and the gluon field has values on each of the links connecting nearest-neighbors of the lattice sites. 

The lattice is divided into equal subvolumes, one per MPI rank. The MPI ranks can be thought of as being organized into a four-dimensional grid of ranks. It is possible to control the grid dimensions with the params.rest file. Of course, the grid dimensions must be integer factors of the lattice coordinate dimensions.

Each MPI rank executes the same code. The calculation requires frequent exchanges of quark and gluon values between MPI ranks that own neighboring lattice sites. Within a single MPI rank, the site-by-site calculation is threaded using OpenMP* directives, which have been inserted throughout the code. The most time-consuming part of production calculations is the CG solver. In the QPhiX version of the CG solver, the data layout and the calculation at the thread level are further organized to take advantage of the SIMD (single instruction, multiple data) lanes of the Intel Xeon and Intel Xeon Phi processors.
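
As a concrete illustration (hypothetical numbers, not taken from params.rest): splitting a 32 x 32 x 32 x 64 lattice over a 2 x 2 x 2 x 8 grid of 64 MPI ranks gives each rank a 16 x 16 x 16 x 8 local subvolume, because each grid dimension must evenly divide the corresponding lattice dimension:

    echo $(( (32/2) * (32/2) * (32/2) * (64/8) ))   # 32768 lattice sites per rank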

Running the Test Cases

  1. Create a “run” directory in the top-level directory and add the input files obtained from above.
  2. cd <milc>/run
    Note: Run the appropriate binary for each architecture.
  3. Create the lattice volume:
    cat << EOF > params.${nx}x${ny}x${nz}x${nt}
    prompt 0
    nx $nx
    ny $ny
    nz $nz
    nt $nt
    EOF
    cat params.rest >> params.${nx}x${ny}x${nz}x${nt}

    For this performance recipe, we evaluate the single node and multinode (16 nodes) performance with the following weak scaled lattice volume:

    Single Node (nx * ny * nz * nt): 24 x 24 x 24 x 60

    Multinode [16 nodes] (nx * ny * nz * nt): 48 x 48 x 48 x 120

  4. Run on Intel Xeon processor (E5-2697v4).
    Source the latest Intel compilers and Intel MPI Library
    • Intel® Parallel Studio 2017 and above recommended

    Single Node:

    mpiexec.hydra -n 12 -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' <path-to>/ks_imp_rhmc/su3_rhmd_hisq.bdw < params.24x24x24x60

    Multinode (16 nodes, via Intel® Omni-Path Host Fabric Interface (Intel® OP HFI)):

    # Create a runScript (run-bdw) #
    <path-to>/ks_imp_rhmc/su3_rhmd_hisq.bdw < params.48x48x48x120
    #Intel® OPA fabric-related environment variables#
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_TMI_PROVIDER=psm2
    export PSM2_IDENTIFY=1
    export I_MPI_FALLBACK=0
    #Create nodeconfig.txt with the following#
    -host <hostname1> -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 12 <path-to>/run-bdw
    …..
    …..
    …..
    -host <hostname16> -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 12 <path-to>/run-bdw
    #mpirun command#
    mpiexec.hydra -configfile nodeconfig.txt
  5. Run on Intel Xeon Phi processor (7250).
    Source Intel compilers and Intel MPI Library
    • Intel® Parallel Studio 2017 and above recommended

    Single Node:

    mpiexec.hydra -n 20 -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' numactl -p 1 <path-to>/ks_imp_rhmc/su3_rhmd_hisq.knl < params.24x24x24x60

    Multinode (16 nodes, via Intel OP HFI):

    # Create a runScript (run-knl) #
    numactl -p 1 <path-to>/ks_imp_rhmc/su3_rhmd_hisq.knl < params.48x48x48x120
    #Intel OPA fabric-related environment variables#
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_TMI_PROVIDER=psm2
    export PSM2_IDENTIFY=1
    export I_MPI_FALLBACK=0
    #Create nodeconfig.txt with the following#
    -host <hostname1> -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 20 <path-to>/run-knl
    …..
    …..
    …..
    -host <hostname16> -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 20 <path-to>/run-knl
    #mpirun command#
    mpiexec.hydra -configfile nodeconfig.txt

Performance Results and Optimizations

The output prints the total time to solution for the entire application, which takes into account the time for the different solvers and operators (for example, CG solver, fermion force, link fattening, gauge force, and so on).

The performance chart below shows the speedup relative to the 2S Intel Xeon processor E5-2697 v4 based on the total run time.

Figure: Speedup w.r.t. 2S Intel® Xeon® processor E5-2697 v4

The optimizations as part of the QPhiX library include data layout changes to target vectorization and generation of packed aligned loads/stores, cache blocking, load balancing and improved code generation for each architecture (Intel Xeon processor, Intel Xeon Phi processor) with corresponding intrinsics, where necessary. See References and Resources section for details.

Testing Platform Configurations

The following hardware was used for the above recipe and performance testing.

Processor                   | Intel® Xeon® Processor E5-2697 v4 | Intel® Xeon Phi™ Processor 7250F
Sockets / TDP               | 2S / 290W                         | 1S / 215W
Frequency / Cores / Threads | 2.3 GHz / 36 / 72                 | 1.4 GHz / 68 / 272
DDR4                        | 8x16 GB 2400 MHz                  | 6x16 GB 2400 MHz
MCDRAM                      | N/A                               | 16 GB Flat
Cluster/Snoop Mode          | Home                              | Quadrant
Memory Mode                 | -                                 | Flat
Turbo                       | OFF                               | OFF
BIOS                        | SE5C610.86B.01.01.0016.033120161139 | GVPRCRB1.86B.0010.R02.1606082342
Operating System            | Oracle Linux* 7.2 (3.10.0-229.20.1.el6.x86_64) | Oracle Linux* 7.2 (3.10.0-229.20.1.el6.x86_64)

MILC Build Configurations

The following configurations were used for the above recipe and performance testing.

MILC Version               | Master version as of 28 January 2017
Intel® Compiler Version    | 2017.1.132
Intel® MPI Library Version | 2017.0.098
MILC Makefiles Used        | Makefile.qphix, Makefile_qphixlib, Makefile

References and Resources

  1. MIMD Lattice Computation (MILC) Collaboration: http://physics.indiana.edu/~sg/milc.html
  2. QPhiX Case Study: http://www.nersc.gov/users/computational-systems/cori/application-porting-and-performance/application-case-studies/qphix-case-study/
  3. MILC Staggered Conjugate Gradient Performance on Intel® Xeon Phi™ Processor: https://anl.app.box.com/v/IXPUG2016-presentation-10

Recipe: Building and Running GROMACS* on Intel® Processors


Purpose

This recipe describes how to get, build, and run the GROMACS* code on Intel® Xeon® and Intel® Xeon Phi™ processors for better performance on a single node.

Introduction

GROMACS is a versatile package for performing molecular dynamics, using Newtonian equations of motion, for systems with hundreds to millions of particles. GROMACS is primarily designed for biochemical molecules like proteins, lipids, and nucleic acids that have a multitude of complicated bonded interactions. But, since GROMACS is extremely fast at calculating the non-bonded interactions typically dominating simulations, many researchers use it for research on non-biological systems, such as polymers.

GROMACS supports all the usual algorithms expected from a modern molecular dynamics implementation.

The GROMACS code is maintained by developers around the world. The code is available under the GNU General Public License from www.gromacs.org.

Code Access

Download GROMACS:

  • The GROMACS source code is available from www.gromacs.org.

Workloads Access

Download the workloads:

  • The water_GMX50_bare.tar.gz and lignocellulose-rf benchmark workloads used in this recipe.

Generate Water Workloads Input Files:

To generate the .tpr input file:

  • tar xf water_GMX50_bare.tar.gz
  • cd water-cut1.0_GMX50_bare/1536
  • gmx_mpi grompp -f pme.mdp -c conf.gro -p topol.top -o topol_pme.tpr
  • gmx_mpi grompp -f rf.mdp -c conf.gro -p topol.top -o topol_rf.tpr

Build Directions

Build the GROMACS binary. Use cmake configuration for Intel® Compiler 2017.1.132 + Intel® MKL + Intel® MPI 2017.1.132:

Set the Intel Xeon Phi BIOS options to be:

  • Quadrant Cluster mode
  • MCDRAM Flat mode
  • Turbo Enabled

For Intel Xeon Phi, build the code as:

  • BuildDir="${GromacsPath}/build" # Create the build directory
  • installDir="${GromacsPath}/install"
  • mkdir $BuildDir
     

  • source /opt/intel/<version>/bin/compilervars.sh intel64 # Source the Intel compiler, MKL and IMPI
  • source /opt/intel/impi/<version>/mpivars.sh
  • source /opt/intel/mkl/<version>/mklvars.sh intel64
     

  • cd $BuildDir # Set the build environments for Intel Xeon Phi
FLAGS="-xMIC-AVX512 -g -static-intel"; CFLAGS=$FLAGS CXXFLAGS=$FLAGS CC=mpiicc CXX=mpiicpc cmake .. -DBUILD_SHARED_LIBS=OFF -DGMX_FFT_LIBRARY=mkl -DCMAKE_INSTALL_PREFIX=$installDir -DGMX_MPI=ON -DGMX_OPENMP=ON -DGMX_CYCLE_SUBCOUNTERS=ON -DGMX_GPU=OFF -DGMX_BUILD_HELP=OFF -DGMX_HWLOC=OFF -DGMX_SIMD=AVX_512_KNL -DGMX_OPENMP_MAX_THREADS=256

For Intel Xeon, set the build environments and build the code as above with changes:

  • FLAGS="-xCORE-AVX2 -g -static-intel"
  • -DGMX_SIMD=AVX2_256
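
Putting those two changes together, the Intel Xeon configuration line would look roughly like this (a sketch assuming the same build directory layout as the Intel Xeon Phi build above):

    FLAGS="-xCORE-AVX2 -g -static-intel"; CFLAGS=$FLAGS CXXFLAGS=$FLAGS CC=mpiicc CXX=mpiicpc cmake .. -DBUILD_SHARED_LIBS=OFF -DGMX_FFT_LIBRARY=mkl -DCMAKE_INSTALL_PREFIX=$installDir -DGMX_MPI=ON -DGMX_OPENMP=ON -DGMX_CYCLE_SUBCOUNTERS=ON -DGMX_GPU=OFF -DGMX_BUILD_HELP=OFF -DGMX_HWLOC=OFF -DGMX_SIMD=AVX2_256 -DGMX_OPENMP_MAX_THREADS=256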

Build GROMACS:

  • make -j 4
  • sleep 5
  • make check

Run Directions

Run workloads on Intel Xeon Phi with the environment settings and command lines as (nodes.txt : localhost:272):


	export  I_MPI_DEBUG=5
	export I_MPI_FABRICS=shm
	export I_MPI_PIN_MODE=lib
	export KMP_AFFINITY=verbose,compact,1

	gmxBin="${installDir}/bin/gmx_mpi"

	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 66 numactl -m 1 $gmxBin mdrun -npme 0 -notunepme -ntomp 4 -dlb yes -v -nsteps 4000 -resethway -noconfout -pin on -s ${WorkloadPath}water-cut1.0_GMX50_bare/1536/topol_pme.tpr
	export KMP_BLOCKTIME=0
	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 66 numactl -m 1 $gmxBin mdrun -ntomp 4 -dlb yes -v -nsteps 1000 -resethway -noconfout -pin on -s ${WorkloadPath}lignocellulose-rf.BGQ.tpr
	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 64 numactl -m 1 $gmxBin mdrun -ntomp 4 -dlb yes -v -nsteps 5000 -resethway -noconfout -pin on -s ${WorkloadPath}water-cut1.0_GMX50_bare/1536/topol_rf.tpr

Run workloads on Intel Xeon with the environment settings and command lines as:


	export  I_MPI_DEBUG=5
	export I_MPI_FABRICS=shm
	export I_MPI_PIN_MODE=lib
	export KMP_AFFINITY=verbose,compact,1

	gmxBin="${installDir}/bin/gmx_mpi"

	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 72 $gmxBin mdrun -notunepme -ntomp 1 -dlb yes -v -nsteps 4000 -resethway -noconfout -s ${WorkloadPath}water-cut1.0_GMX50_bare/1536_bdw/topol_pme.tpr
	export KMP_BLOCKTIME=0
	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 72 $gmxBin mdrun -ntomp 1 -dlb yes -v -nsteps 1000 -resethway -noconfout -s ${WorkloadPath}lignocellulose-rf.BGQ.tpr
	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 72 $gmxBin mdrun -ntomp 1 -dlb yes -v -nsteps 5000 -resethway -noconfout -s ${WorkloadPath}water-cut1.0_GMX50_bare/1536_bdw/topol_rf.tpr

Performance Testing

Performance tests for GROMACS are illustrated below with comparisons between an Intel Xeon processor and an Intel Xeon Phi processor against three standard workloads: water1536k_pme, water1536k_rf, and lignocellulose3M_rf. In all cases, turbo mode is turned on.

Testing Platform Configurations

The following hardware was used for the above recipe and performance testing.

Processor                   | Intel® Xeon® Processor E5-2697 v4 | Intel® Xeon Phi™ Processor 7250
Stepping                    | 1 (B0)                            | 1 (B0) Bin1
Sockets / TDP               | 2S / 290W                         | 1S / 215W
Frequency / Cores / Threads | 2.3 GHz / 36 / 72                 | 1.4 GHz / 68 / 272
DDR4                        | 8x16 GB 2400 MHz (128 GB)         | 6x16 GB 2400 MHz
MCDRAM                      | N/A                               | 16 GB Flat
Cluster/Snoop Mode/Mem Mode | Home                              | Quadrant/Flat
Turbo                       | On                                | On
BIOS                        | GRRFSDP1.86B.0271.R00.1510301446  | GVPRCRB1.86B.0011.R04.1610130403
Compiler                    | ICC-2017.1.132                    | ICC-2017.1.132
Operating System            | Red Hat Enterprise Linux* 7.2 (3.10.0-327.el7.x86_64) | Red Hat Enterprise Linux 7.2 (3.10.0-327.13.1.el7.xppsl_1.3.3.151.x86_64)

GROMACS Build Configurations

The following configurations were used for the above recipe and performance testing.

  • GROMACS Version: GROMACS-2016.1
  • Intel® Compiler Version: 2017.1.132
  • Intel® MPI Library Version: 2017.1.132
  • Workloads used: water1536k_pme, water1536k_rf, and lignocellulose3M_rf

Recipe: Building and running NEMO* on Intel® Xeon Phi™ Processors


About NEMO*

The NEMO* (Nucleus for European Modelling of the Ocean) numerical solutions framework encompasses models of ocean, sea ice, tracers, and biochemistry equations and their related physics. It also incorporates the pre- and post-processing tools and the interface to other components of the Earth System. NEMO allows several ocean-related components of the Earth System to work together or separately, and also allows for two-way nesting via AGRIF software. It is interfaced with the remaining components of the Earth System package (atmosphere, land surfaces, and so on) via the OASIS coupler.

This recipe shows the performance advantages of using the Intel® Xeon Phi™ processor 7250.

NEMO 3.6 is the current stable version.

Downloading the Code

  1. Download the NEMO source code from the official NEMO repository (you should register at www.nemo-ocean.eu ):

    svn co -r 6939 http://forge.ipsl.jussieu.fr/nemo/svn/branches/2015/nemo_v3_6_STABLE/NEMOGCM nemo

  2. Download the XIOS IO server from the official XIOS repository:

    svn co -r 703 http://forge.ipsl.jussieu.fr/ioserver/svn/XIOS/branchs/xios-1.0 xios

  3. If your system already has NetCDF libraries with Fortran bindings installed and they link with the NEMO and XIOS binaries, go to the section “Building XIOS for the Intel Xeon Processor”.
  4. Otherwise, download NetCDF-Fortran from https://github.com/Unidata/netcdf-fortran/archive/netcdf-fortran-4.2.tar.gz

Building Additional Libraries for the Intel® Xeon® Processor

  1. First, choose a directory for your experiments, such as “~/NEMO-BDW”:
    export base=~/NEMO-BDW
  2. Create a directory and copy all required libraries in $base:
    mkdir -p $base/libraries
  3. Unpack the tarball files in $base/libraries/src.
  4. To build an Intel® Advanced Vector Extensions 2 (Intel® AVX2) version of libraries, set:
    export arch="-xCORE-AVX2"
  5. Set the following environment variables:
    export PREFIX=$base/libraries
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${PREFIX}/lib
    export CFLAGS="-I$PREFIX/include -L$PREFIX/lib -O3 -g -traceback -openmp ${arch} -fPIC"
    export CPPFLAGS=$CFLAGS
    export CXXFLAGS=$CFLAGS
    export FFFLAGS=$CFLAGS
    export FCFLAGS=$CFLAGS
    export LDFLAGS="-L$PREFIX/lib -openmp ${arch} -fPIC"
    export FC=mpiifort
    export CXX=mpiicc
    export CC=mpiicc
    export CPP="icc -E"
  6. Build szip:
    cd $base/libraries/src/szip-2.1
    ./configure --prefix=$PREFIX
    make -j 4
    make install
  7. Build zlib:
    cd $base/libraries/src/zlib-1.2.8
    ./configure --prefix=$PREFIX
    make -j 4
    make install
  8. Build HDF5:
    cd $base/libraries/src/hdf5-1.8.12
    ./configure --with-zlib=$PREFIX --prefix=$PREFIX --enable-fortran --with-szlib=$PREFIX --enable-hl
    make
    make install
  9. Build CURL:
    cd $base/libraries/src/curl-7.42.1
    ./configure --prefix=$PREFIX
    make -j 4
    make install
  10. Build NetCDF:
    cd $base/libraries/src/netcdf-4.3.3
    export LIBS=" -lhdf5_hl -lhdf5 -lz -lsz -lmpi"
    export LD_FLAGS+=" -L$PREFIX/lib"
    ./configure --prefix=$PREFIX
    make
    make install
  11. Build NetCDF Fortran wrapper:
    cd $base/libraries/src/netcdf-fortran-4.2/
    export LIBS=""
    export CFLAGS="$CFLAGS -lnetcdf"
    export CPPFLAGS=$CFLAGS
    export CXXFLAGS=$CFLAGS
    export FFFLAGS=$CFLAGS
    export FCFLAGS=$CFLAGS
    export FC=ifort
    export CXX=mpiicc
    export CC=mpiicc
    export LDFLAGS+=" -L$I_MPI_ROOT/lib64/"
    ./configure --prefix=$PREFIX
    make
    make install

Building XIOS for the Intel Xeon Processor

  1. Copy XIOS source code to $base/xios
  2. Create files:
    $base/xios/arch/arch-ifort_linux.env
    $base/xios/arch/arch-ifort_linux.fcm
    $base/xios/arch/arch-ifort_linux.path
  3. Add the following lines to the $base/xios/arch/arch-ifort_linux.env file:
    export NETCDF_INC_DIR=$base/libraries/include
    export NETCDF_LIB_DIR=$base/libraries/lib
    export HDF5_INC_DIR=$base/libraries/include
    export HDF5_LIB_DIR=$base/libraries/lib
  4. Add the following lines to the $base/xios/arch/arch-ifort_linux.fcm file:
    %NCDF_INC            -I$base/libraries/include
    %NCDF_LIB            -L$base/libraries/lib -lnetcdff -lnetcdf -lhdf5 -lcurl -lz -lsz
    %FC                  mpiifort
    %FCFLAGS             -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %FFLAGS              -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %LD                  mpiifort
    %FPPFLAGS            -P -C -traditional
    %LDFLAGS             -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %AR                  ar
    %ARFLAGS             -r
    %MK                  gmake
    %USER_INC            %NCDF_INC_DIR
    %USER_LIB            %NCDF_LIB_DIR
    
    %MAKE                gmake
    %BASE_LD        -lstdc++ -lifcore -lintlc
    %LINKER         mpiifort -nofor-main
    %BASE_INC       -D__NONE__
    %CCOMPILER      mpiicc
    %FCOMPILER      mpiifort
    %CPP            cpp
    %FPP            cpp -P
    
    %BASE_CFLAGS    -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %PROD_CFLAGS    -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %DEV_CFLAGS    -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %DEBUG_CFLAGS  -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %BASE_FFLAGS   -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %PROD_FFLAGS    -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %DEV_FFLAGS    -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %DEBUG_FFLAGS   -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
  5. Add the following lines to the $base/xios/arch/arch-ifort_linux.path file:
    NETCDF_INCDIR="-I $NETCDF_INC_DIR"
    NETCDF_LIBDIR="-L $NETCDF_LIB_DIR"
    NETCDF_LIB="-lnetcdff -lnetcdf -lcurl"
    MPI_INCDIR=""
    MPI_LIBDIR=""
    MPI_LIB=""
    HDF5_INCDIR="-I $HDF5_INC_DIR"
    HDF5_LIBDIR="-L $HDF5_LIB_DIR"
    HDF5_LIB="-lhdf5_hl -lhdf5 -lz -lcurl"
  6. Change directory to $base/xios and execute the following command:
    ./make_xios --full --prod --arch ifort_linux

Building NEMO for the Intel Xeon Processor and Preparing Workloads

  1. Copy NEMO source code to $base/nemo
  2. Apply the following patch to the file $base/nemo/NEMOGCM/NEMO/OPA_SRC/nemogcm.F90:
    @@ -116,6 +116,7 @@
           !!              Madec, 2008, internal report, IPSL.
           !!----------------------------------------------------------------------
           INTEGER ::   istp       ! time step index
    +DOUBLE PRECISION :: mpi_wtime, sstart, send
           !!----------------------------------------------------------------------
           !
     #if defined key_agrif
    @@ -163,18 +164,19 @@
     #if defined key_agrif
               CALL Agrif_Regrid()
     #endif
    -
              DO WHILE ( istp <= nitend .AND. nstop == 0 )
    +sstart = mpi_wtime()
     #if defined key_agrif
                 CALL stp                         ! AGRIF: time stepping
     #else
                 CALL stp( istp )                 ! standard time stepping
     #endif
    +send=mpi_wtime()
    +print *, "Step ", istp, " - " , send-sstart , "s."
                 istp = istp + 1
                 IF( lk_mpp )   CALL mpp_max( nstop )
              END DO
     #endif
    -
           IF( lk_diaobs   )   CALL dia_obs_wri
           !
           IF( ln_icebergs )   CALL icb_end( nitend )
  3. Create the file $base/nemo/ARCH/arch-mpiifort_linux.fcm and add the following lines:
    %NCDF_INC            -I/$base/libraries/include
    %NCDF_LIB            -L$base/libraries/lib -lnetcdff -lnetcdf -lz -lcurl -lhdf5_hl -lhdf5 -lz -lcurl
    %CPP                 icc -E
    %FC                  mpiifort
    %FCFLAGS          -r8 -g -traceback -qopenmp -O3 -xCORE-AVX2 -g -traceback
    %FFLAGS             -r8 -g -traceback -qopenmp -O3 -xCORE-AVX2 -g -traceback
    %LD                  mpiifort
    %FPPFLAGS            -P -C -traditional
    %LDFLAGS             -lstdc++ -lifcore -O3 -xCORE-AVX2 -g -traceback
    %AR                  ar
    %ARFLAGS             -r
    %MK                  gmake
    %XIOS_INC            -I$base/xios/inc
    %XIOS_LIB            -L$base/xios/lib -lxios
    %USER_INC            %NCDF_INC %XIOS_INC
    %USER_LIB            %NCDF_LIB %XIOS_LIB
  4. Build the binary for the GYRE workload:
    cd $base/nemo/NEMOGCM/CONFIG
    ./makenemo -n GYRE -m mpiifort_linux -j 4
  5. Create a sandbox directory for the GYRE runs:
    1.  mkdir -p $base/nemo/gyre-exp
       cp -r $base/nemo/NEMOGCM/CONFIG/GYRE/BLD/bin/nemo.exe $base/nemo/gyre-exp
       cp -r $base/nemo/NEMOGCM/CONFIG/GYRE/EXP00/* $base/nemo/gyre-exp
    2. Switch off creating mesh files by changing “nn_msh” to 0 in the namelist_ref file
    3. Enable benchmark mode by changing “nn_bench” to 1 in namelist_ref  file.
    4. Set the following parameters in the “&namcfg” section:
      jp_cfg = 70
      jpidta = 2102
      jpjdta = 1402
      jpkdta = 31
      jpiglo = 2102
      jpjglo = 1402
    5. Switch off using the IO server in the iodef.xml file (“using_server = false”)
  6. Build a binary for the ORCA025 workload:
    1. Change  “$base/nemo/NEMOGCM/CONFIG/ORCA2_LIM3/cpp_ORCA2_LIM3.fcm” content to “bld::tool::fppkeys key_trabbl key_vvl key_dynspg_ts key_ldfslp key_traldf_c2d key_traldf_eiv key_dynldf_c3d key_zdfddm key_zdftmx key_mpp_mpi key_zdftke key_lim3 key_iomput”
    2. Change the line “ORCA2_LIM3 OPA_SRC LIM_SRC_3 NST_SRC” to “ORCA2_LIM3 OPA_SRC LIM_SRC_3” in file $base/nemo/NEMOGCM/CONFIG/cfg.txt
    3. ./makenemo -n ORCA2_LIM3 -m mpiifort_linux -j 4
  7. Go to the Barcelona Supercomputing Center (in Spanish), and in section 9 locate the paragraph, “PREGUNTAS Y RESPUESTAS:” with a path to the ftp server and credentials to log in.
  8. Download the BenchORCA025L75.tar.gz file from directory Benchmarks_aceptacion/NEMO/
  9. Extract the contents of the tarball file to $base/nemo/orca-exp
  10. Copy the NEMO binary to the sandbox directory:
    cp $base/nemo/NEMOGCM/CONFIG/ORCA2_LIM3/BLD/bin/nemo.exe $base/nemo/orca-exp
  11. Edit the file $base/nemo/orca-exp/iodef.xml and add the following lines into the “<context id="xios">    <variable_definition>” section:
    <variable id="min_buffer_size" type="int">994473778</variable><variable id="buffer_size" type="int">994473778</variable> 
  12. In the file namelist_ref in section “&namrun” set the following variables:
    nn_itend     =   10
    nn_stock    =    10
    nn_write    =    10
  13. Copy the $base/nemo/NEMOGCM/CONFIG/SHARED/namelist_ref file to $base/nemo/exp-orca
  14. Switch off using the IO server in the iodef.xml file (“using_server = false”)
  15. To build the KNL binaries change “-xCORE-AVX2” to “-xMIC-AVX512”, change $base to another directory, and do all of the steps again.

Running the GYRE Workload with the Intel Xeon Processor

  1. Go to $base/nemo/gyre-exp
  2. Source the environment variables for the compiler and the Intel® MPI Library:
    source /opt/intel/compiler/latest/bin/compilervars.sh intel64
    source /opt/intel/impi/latest/intel64/bin/mpivars.sh
  3. Add libraries to LD_LIBRARY_PATH:
    export LD_LIBRARY_PATH=$base/libraries/lib/:$LD_LIBRARY_PATH
  4. Set additional variables for the Intel MPI Library:
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_PIN_CELL=core
  5. Run NEMO:
    mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe
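
For example, on four dual-socket Xeon nodes with 36 ranks per node the launch would look like this (node count, rank count, and the hosts.txt file name are hypothetical, not part of the original recipe):

    mpiexec.hydra -genvall -f hosts.txt -n 144 -perhost 36 ./nemo.exe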

Running the ORCA025 Workload with the Intel Xeon Processor

  1. Go to $base/nemo/orca-exp
  2. Source the environment variables for the compiler and the Intel MPI Library:
    source /opt/intel/compiler/latest/bin/compilervars.sh intel64
    source /opt/intel/impi/latest/intel64/bin/mpivars.sh
  3. Add libraries to LD_LIBRARY_PATH:
    export LD_LIBRARY_PATH=$base/libraries/lib/:$LD_LIBRARY_PATH
  4. Set additional variables for Intel MPI Library:
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_PIN_CELL=core
  5. Run NEMO:
    mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe
  6. If you are faced with hangs while the application is running you can run NEMO with the XIOS server in detached mode:
    1. Copy xios_server.exe from $base/xios/bin to $base/nemo/orca-exp
    2. Edit iodef.xml file and set “using_server = true”
    3. mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe : -n 2 ./xios_server.exe

Building Additional Libraries for the Intel® Xeon Phi™ Processor

  1. First, choose a directory for your experiments, such as “~/NEMO-KNL”
    export base=~/NEMO-KNL
  2. Create the directory and copy all required libraries in $base:
    mkdir -p $base/libraries
  3. Unpack the tarball files in $base/libraries/src
  4. To build an Intel® AVX-512 version of the libraries, set:
    export arch="-xMIC-AVX512"
  5. Set the following environment variables:
     export PREFIX=$base/libraries
     export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${PREFIX}/lib
     export CFLAGS="-I$PREFIX/include -L$PREFIX/lib -O3 -g -traceback -openmp ${arch} -fPIC"
     export CPPFLAGS=$CFLAGS
     export CXXFLAGS=$CFLAGS
     export FFFLAGS=$CFLAGS
     export FCFLAGS=$CFLAGS
     export LDFLAGS="-L$PREFIX/lib -openmp ${arch} -fPIC"
     export FC=mpiifort
     export CXX=mpiicc
     export CC=mpiicc
     export CPP="icc -E"
  6. Build szip:
     cd $base/libraries/src/szip-2.1
     ./configure --prefix=$PREFIX
     make -j 4
     make install
  7. Build zlib:
    cd $base/libraries/src/zlib-1.2.8
    ./configure --prefix=$PREFIX
    make -j 4
    make install
  8. Build HDF5:
    cd $base/libraries/src/hdf5-1.8.12
    ./configure --with-zlib=$PREFIX --prefix=$PREFIX --enable-fortran --with-szlib=$PREFIX --enable-hl
    make
    make install
  9. Build CURL:
    cd $base/libraries/src/curl-7.42.1
    ./configure --prefix=$PREFIX
    make -j 4
    make install
  10. Build NetCDF:
    cd $base/libraries/src/netcdf-4.3.3
    export LIBS=" -lhdf5_hl -lhdf5 -lz -lsz -lmpi"
    export LD_FLAGS+=" -L$PREFIX/lib"
    ./configure --prefix=$PREFIX
    make
    make install
  11. Build the NetCDF Fortran wrapper:
    cd $base/libraries/src/netcdf-fortran-4.2/
    export LIBS=""
    export CFLAGS="$CFLAGS -lnetcdf"
    export CPPFLAGS=$CFLAGS
    export CXXFLAGS=$CFLAGS
    export FFFLAGS=$CFLAGS
    export FCFLAGS=$CFLAGS
    export FC=ifort
    export CXX=mpiicc
    export CC=mpiicc
    export LDFLAGS+=" -L$I_MPI_ROOT/lib64/"
    ./configure --prefix=$PREFIX
    make
    make install

Building XIOS for the Intel Xeon Phi Processor

  1. Copy XIOS source code to $base/xios
  2. Create files:
    $base/xios/arch/arch-ifort_linux.env
    $base/xios/arch/arch-ifort_linux.fcm
    $base/xios/arch/arch-ifort_linux.path
  3. Add the following lines to the $base/xios/arch/arch-ifort_linux.env file:
    export NETCDF_INC_DIR=$base/libraries/include
    export NETCDF_LIB_DIR=$base/libraries/lib
    export HDF5_INC_DIR=$base/libraries/include
    export HDF5_LIB_DIR=$base/libraries/lib
  4. Add the following lines to the $base/xios/arch/arch-ifort_linux.fcm file:
    %NCDF_INC            -I$base/libraries/include
    %NCDF_LIB            -L$base/libraries/lib -lnetcdff -lnetcdf -lhdf5 -lcurl -lz -lsz
    %FC                  mpiifort
    %FCFLAGS             -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %FFLAGS              -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %LD                  mpiifort
    %FPPFLAGS            -P -C -traditional
    %LDFLAGS             -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %AR                  ar
    %ARFLAGS             -r
    %MK                  gmake
    %USER_INC            %NCDF_INC_DIR
    %USER_LIB            %NCDF_LIB_DIR
    
    %MAKE                gmake
    %BASE_LD        -lstdc++ -lifcore -lintlc
    %LINKER         mpiifort -nofor-main
    %BASE_INC       -D__NONE__
    %CCOMPILER      mpiicc
    %FCOMPILER      mpiifort
    %CPP            cpp
    %FPP            cpp -P
    
    %BASE_CFLAGS    -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %PROD_CFLAGS    -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %DEV_CFLAGS     -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %DEBUG_CFLAGS   -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %BASE_FFLAGS    -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %PROD_FFLAGS    -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %DEV_FFLAGS     -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %DEBUG_FFLAGS   -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
  5. Add the following lines to the $base/xios/arch/arch-ifort_linux.path file:
    NETCDF_INCDIR="-I $NETCDF_INC_DIR"
    NETCDF_LIBDIR="-L $NETCDF_LIB_DIR"
    NETCDF_LIB="-lnetcdff -lnetcdf -lcurl"
    MPI_INCDIR=""
    MPI_LIBDIR=""
    MPI_LIB=""
    HDF5_INCDIR="-I $HDF5_INC_DIR"
    HDF5_LIBDIR="-L $HDF5_LIB_DIR"
    HDF5_LIB="-lhdf5_hl -lhdf5 -lz -lcurl"
  6. Change the directory to $base/xios and execute the following command:
    ./make_xios --full --prod --arch ifort_linux

Building NEMO for the Intel Xeon Phi Processor and Preparing Workloads

  1. Copy the NEMO source code to $base/nemo
  2. Apply the following patch to the file $base/nemo/NEMOGCM/NEMO/OPA_SRC/nemogcm.F90:
    @@ -116,6 +116,7 @@
           !!              Madec, 2008, internal report, IPSL.
           !!----------------------------------------------------------------------
           INTEGER ::   istp       ! time step index
    +DOUBLE PRECISION :: mpi_wtime, sstart, send
           !!----------------------------------------------------------------------
           !
     #if defined key_agrif
    @@ -163,18 +164,19 @@
     #if defined key_agrif
               CALL Agrif_Regrid()
     #endif
    -
              DO WHILE ( istp <= nitend .AND. nstop == 0 )
    +sstart = mpi_wtime()
     #if defined key_agrif
                 CALL stp                         ! AGRIF: time stepping
     #else
                 CALL stp( istp )                 ! standard time stepping
     #endif
    +send=mpi_wtime()
     +print *, "Step ", istp, " - " , send-sstart , "s."
                 istp = istp + 1
                 IF( lk_mpp )   CALL mpp_max( nstop )
              END DO
     #endif
    -
           IF( lk_diaobs   )   CALL dia_obs_wri
           !
           IF( ln_icebergs )   CALL icb_end( nitend )
  3. Create the file $base/nemo/ARCH/arch-mpiifort_linux.fcm and add the following lines:
    %NCDF_INC            -I/$base/libraries/include
    %NCDF_LIB            -L$base/libraries/lib -lnetcdff -lnetcdf -lz -lcurl -lhdf5_hl -lhdf5 -lz -lcurl
    %CPP                 icc -E
    %FC                  mpiifort
    %FCFLAGS          -r8 -g -traceback -qopenmp -O3 -xMIC-AVX512 -g -traceback
    %FFLAGS             -r8 -g -traceback -qopenmp -O3 -xMIC-AVX512 -g -traceback
    %LD                  mpiifort
    %FPPFLAGS            -P -C -traditional
    %LDFLAGS             -lstdc++ -lifcore -O3 -xMIC-AVX512 -g -traceback
    %AR                  ar
    %ARFLAGS             -r
    %MK                  gmake
    %XIOS_INC            -I$base/xios/inc
    %XIOS_LIB            -L$base/xios/lib -lxios
    %USER_INC            %NCDF_INC %XIOS_INC
    %USER_LIB            %NCDF_LIB %XIOS_LIB
  4. Build the binary for the GYRE workload:
    cd $base/nemo/NEMOGCM/CONFIG
    ./makenemo -n GYRE -m mpiifort_linux -j 4
  5. Create a sandbox directory for the GYRE runs:
    1. mkdir -p $base/nemo/gyre-exp
       cp -r $base/nemo/NEMOGCM/CONFIG/GYRE/BLD/bin/nemo.exe $base/nemo/gyre-exp
       cp -r $base/nemo/NEMOGCM/CONFIG/GYRE/EXP00/* $base/nemo/gyre-exp
    2. Switch off creating mesh files by changing “nn_msh” to 0 in the namelist_ref file
    3. Enable benchmark mode by changing “nn_bench” to 1 in the namelist_ref  file.
    4. Set the following parameters in the “&namcfg” section:
      jp_cfg = 70
      jpidta = 2102
      jpjdta = 1402
      jpkdta = 31
      jpiglo = 2102
      jpjglo = 1402
    5. Switch off using the IO server in the iodef.xml file (“using_server = false”)
  6. Build the binary for ORCA025 workload:
    1. Change  $base/nemo/NEMOGCM/CONFIG/ORCA2_LIM3/cpp_ORCA2_LIM3.fcm content to “bld::tool::fppkeys key_trabbl key_vvl key_dynspg_ts key_ldfslp key_traldf_c2d key_traldf_eiv key_dynldf_c3d key_zdfddm key_zdftmx key_mpp_mpi key_zdftke key_lim3 key_iomput”
    2. Change line “ORCA2_LIM3 OPA_SRC LIM_SRC_3 NST_SRC” to “ORCA2_LIM3 OPA_SRC LIM_SRC_3” in the file $base/nemo/NEMOGCM/CONFIG/cfg.txt 
    3. ./makenemo -n ORCA2_LIM3 -m mpiifort_linux -j 4
  7. Go to the Barcelona Supercomputing Center (in Spanish), and in section 9 locate the paragraph, “PREGUNTAS Y RESPUESTAS:” with the path to the ftp server and credentials to log in.
  8. Download the BenchORCA025L75.tar.gz file from the Benchmarks_aceptacion/NEMO/ directory
  9. Extract the contents of the tarball file to $base/nemo/orca-exp
  10. Copy the NEMO binary to the sandbox directory:
    cp $base/nemo/NEMOGCM/CONFIG/ORCA2_LIM3/BLD/bin/nemo.exe $base/nemo/orca-exp
  11. Edit the file $base/nemo/orca-exp/iodef.xml and add the following lines into the “<context id="xios">    <variable_definition>” section:
    <variable id="min_buffer_size" type="int">994473778</variable><variable id="buffer_size" type="int">994473778</variable>
  12. In the file namelist_ref in section “&namrun” set the following variables:
    nn_itend    =  10
    nn_stock    =    10
    nn_write    =    10
  13. Copy the $base/nemo/NEMOGCM/CONFIG/SHARED/namelist_ref file to the $base/nemo/exp-orca directory
  14. Switch off using the IO server in the iodef.xml file (“using_server = false”)
  15. To build the BDW binaries, change “-xMIC-AVX512” back to “-xCORE-AVX2”, change $base to another directory, and do all of the steps again.

Running the GYRE Workload with the Intel Xeon Phi Processor

  1. Go to $base/nemo/gyre-exp
  2. Source the environment variables for the compiler and Intel MPI Library:
    source /opt/intel/compiler/latest/bin/compilervars.sh intel64
    source /opt/intel/impi/latest/bin/compilervars.sh intel64
  3. Add the libraries to LD_LIBRARY_PATH:
    export LD_LIBRARY_PATH=$base/libraries/lib/:$LD_LIBRARY_PATH
  4. Set additional variables for Intel MPI Library:
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_PIN_CELL=core
  5. Run NEMO:
    mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe

Running the ORCA025 Workload with the Intel Xeon Phi Processor

  1. Go to $base/nemo/orca-exp
  2. Source environment variables for the compiler and Intel MPI Library:
    source /opt/intel/compiler/latest/bin/compilervars.sh intel64
    source /opt/intel/impi/latest/bin/compilervars.sh intel64
  3. Add libraries to LD_LIBRARY_PATH:
    export LD_LIBRARY_PATH=$base/libraries/lib/:$LD_LIBRARY_PATH
  4. Set additional variables for the Intel MPI Library:
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_PIN_CELL=core
  5. Run NEMO:
    mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe
  6. If you observe hangs while the application is running, you can run NEMO with the XIOS server in detached mode:
    1. Copy xios_server.exe from $base/xios/bin to $base/nemo/orca-exp
    2. Edit iodef.xml file and set “using_server = true”
    3. mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe : -n 2 ./xios_server.exe

Configuring Test Systems

CPU

Intel® Xeon® processor system: Dual-socket Intel® Xeon® processor E5-2697 v4, 2.3 GHz (turbo OFF), 18 cores/socket, 36 cores, 72 threads (HT on)

Intel® Xeon Phi™ processor system: Intel® Xeon Phi™ processor 7250, 68 cores, 136 threads, 1400 MHz core freq. (turbo OFF), 1700 MHz uncore freq.

RAM

Intel® Xeon® processor system: 128 GB (8 x 16 GB) DDR4 2400 MHz DIMMs

Intel® Xeon Phi™ processor system: 96 GB (6 x 16 GB) DDR4 2400 MHz RDIMMs

Cluster File System

Both systems: Intel® Enterprise Edition for Lustre* software (Intel® EE for Lustre* software), SSD, 136 TB storage

Interconnect

Both systems: Intel® Omni-Path Architecture (Intel® OPA) Si 100 series

OS / Kernel / IB stack

Both systems: Oracle Linux* server release 7.2; kernel 3.10.0-229.20.1.el6.x86_64.knl2; OFED version 10.2.0.0.158_72

  • NEMO configuration: V3.6 r6939 with XIOS 1.0 r703, Intel® Parallel Studio XE 17.0.0.098, Intel MPI Library 2017 for Linux*
  • MPI configuration:
    • I_MPI_FABRICS=shm:tmi
    • I_MPI_PIN_CELL=core

Performance Results for the Intel Xeon Processor and Intel Xeon Phi Processor

    1. Time of second step for GYRE workload:

# nodes    Intel® Xeon® processor    Intel® Xeon Phi™ processor
1          6.546229                  3.642156
2          3.011352                  2.075075
4          1.326501                  0.997129
8          0.640632                  0.492369
16         0.321378                  0.284348

    2. Time of second step for ORCA workload:

# nodes    Intel® Xeon® processor    Intel® Xeon Phi™ processor
2          5.764083                  n/a
4          2.642725                  2.156876
8          1.305238                  1.0546
16         0.67725                   0.643372

Demo: Software Defined Visualization Using Intel® Xeon Phi™ Processor


In this demo we showcase the use of the Intel® Xeon Phi™ processor to do a 3D visualization of a tumor in a human brain. This can help advance research in the medical field by enabling precise detection and removal of structures such as a brain tumor.

More information

The tool used for visualization is Paraview, with OSPRay as the rendering library.

Pre-requisites

Intel® Xeon Phi™ processor system with CentOS 7.2 Linux* (internet enabled)

Open a terminal in your work-area directory and follow the steps below:

  1. Create directory for the demo

    mkdir Intel_brain_demo

  2. Change directory

    cd Intel_brain_demo

  3. Create two directories under this

    mkdir paraview
    mkdir ospray

  4. Access the files from Dropbox:

    https://www.dropbox.com/s/wj0qp1clxv5xssv/SC_2016_BrainDemo.tar.gz?dl=0

  5. Move the Paraview and OSPRay tar files into the respective directories you created in the steps above:

    mv SC_2016_BrainDemo/paraview_sc_demo.tgz paraview/
    mv SC_2016_BrainDemo/ospray.tgz ospray/

  6. Untar each of the *.tgz files in its respective directory:

    tar -xzvf *.tgz

  7. Set the library path:

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<…../Intel_brain_demo/ospray/install/lib64>

  8. Optional step: set the Qt graphics system variable, only if Paraview doesn't load normally

    export QT_GRAPHICSSYSTEM=gtk

  9. Change directory to paraview/install where the binaries are

    cd paraview/install

  10. Run Paraview

    ./bin/paraview

  11. Once Paraview loads

    Select File/Load State

  12. Then load the brain_demo.pvsm state file from the SC_2016_BrainDemo archive that you downloaded in the step above

  13. It will then ask you to load VTK files; click the “...” button to select the appropriate *tumor1.vtk file, then the *tumor2.vtk file, and then the *Tumor1.vtk file, in order, on your local machine. Then click OK.

  14. An Output Messages pop-up window will appear with warnings. Ignore the warnings and click Close, and you should see something like the following:

  15. Now you can go to File/Save State and save this state. Every time you load, you can load this state file to skip the previous step of having to locate the data files.
  16. Then, on the Properties tab on the left side, enable OSPRay for every view (all the RenderViews 1/2/3) by selecting each view and clicking Enable OSPRay.

  17. Once you do that, you should see that the images for all three views look as shown below:

  18. You can also rotate the views and see how they look.

A few issues and how to resolve them

Missing OpenGL: install Mesa for OpenGL

sudo yum -y install mesa-libGL
sudo yum -y install mesa-libGL-devel

libQtGui.so.4 error: install the qt-x11 package

yum -y install qt-x11

Acknowledgements

Special thanks to Carson Brownlee and James Jeffers from Intel Corporation for all their contributions and support. Without their efforts, it wouldn’t have been possible to get this demo running.

References

  1. http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html
  2. https://software.intel.com/en-us/blogs/Intel-Parallel-Studio-XE-2016
  3. https://gitlab.kitware.com/carson/paraview
  4. https://gitlab.kitware.com/carson/vtk
  5. http://www.ospray.org
  6. http://www.ospray.org/getting_ospray.html
  7. http://dap.xeonphi.com
  8. https://ispc.github.io/downloads.html
  9. https://www.threadingbuildingblocks.org
  10. https://en.wikipedia.org/wiki/Software_rendering

Using MPI-3 Shared Memory in Intel® Xeon Phi™ Processors


This whitepaper introduces the MPI-3 shared memory feature, the corresponding APIs, and a sample program to illustrate the use of MPI-3 shared memory in the Intel® Xeon Phi™ processor.

Introduction to MPI-3 Shared Memory

MPI-3 shared memory is a feature introduced in version 3.0 of the message passing interface (MPI) standard. It is implemented in Intel® MPI Library version 5.0.2 and beyond. MPI-3 shared memory allows multiple MPI processes to allocate and have access to the shared memory in a compute node. For applications that require multiple MPI processes to exchange huge local data, this feature reduces the memory footprint and can improve performance significantly.

In the MPI standard, each MPI process has its own address space. With MPI-3 shared memory, each MPI process exposes its own memory to other processes. The following figure illustrates the concept of shared memory: Each MPI process allocates and maintains its own local memory, and exposes a portion of its memory to the shared memory region. All processes then can have access to the shared memory region. Using the shared memory feature, users can reduce the data exchange among the processes.

Figure 1

By default, the memory created by an MPI process is private. It is best to use MPI-3 shared memory when only memory needs to be shared and all other resources remain private. As each process has access to the shared memory region, users need to pay attention to process synchronization when using shared memory.

Sample Code

In this section, sample code is provided to illustrate the use of MPI-3 shared memory.

A total of eight MPI processes are created on the node. Each process maintains a long array of 32 million elements. For each element j in the array, the process updates this element value based on its current value and the values of the element j in the corresponding arrays of its two nearest-neighbor processes, and the same procedure is applied to the whole array. The following pseudo-code shows the processing when the program runs with eight MPI processes for 64 iterations:

Repeat the following procedure 64 times:
    for each MPI process n from 0 to 7:
        for each element j in the array An:
            An[j] ← 0.5*An[j] + 0.25*Aprevious[j] + 0.25*Anext[j]

where An is the long array belonging to the process n, and An[j] is the value of the element j in the array belonging to the process n. In this program, since each process exposes its local memory, all processes can have access to all arrays, although each process just needs the two neighbor arrays (for example, process 0 needs data from processes 1 and 7, process 1 needs data from processes 0 and 2,…).

Figure 2

Besides the basic APIs used for MPI programming, the following MPI-3 shared memory APIs are introduced in this example (a minimal usage sketch follows the list):

  • MPI_Comm_split_type: Used to create a new communicator where all processes share a common property. In this case, we pass MPI_COMM_TYPE_SHARED as an argument in order to create a shared memory from a parent communicator such as MPI_COMM_WORLD, and decompose the communicator into a shared memory communicator shmcomm.
  • MPI_Win_allocate_shared: Used to create a shared memory that is accessible by all processes in the shared memory communicator. Each process exposes its local memory to all other processes, and the size of the local memory allocated by each process can be different. By default, the total shared memory is allocated contiguously. The user can pass an info hint “alloc_shared_noncontig” to specify that the shared memory does not have to be contiguous, which can cause performance improvement, depending on the underlying hardware architecture. 
  • MPI_Win_free: Used to release the memory.
  • MPI_Win_shared_query: Used to query the address of the shared memory of an MPI process.
  • MPI_Win_lock_all and MPI_Win_unlock_all: Used to start an access epoch to all processes in the window. Only shared epochs are needed. The calling process can access the shared memory on all processes.
  • MPI_Win_sync: Used to ensure the completion of copying the local memory to the shared memory.
  • MPI_Barrier: Used to block the caller process on the node until all processes reach a barrier. The barrier synchronization API works across all processes.
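
To make the API list above concrete, here is a minimal, hedged sketch (not the article's full sample program, which is available for download in the Appendix): each rank allocates one slice of an MPI-3 shared-memory window, publishes a value, and reads its left neighbor's slice directly. The slice size N_ELEMS and the neighbor choice are illustrative assumptions.

#include <mpi.h>
#include <stdio.h>

#define N_ELEMS (1 << 20)    /* illustrative per-rank slice size (floats) */

int main( int argc, char **argv )
{
    MPI_Init( &argc, &argv );

    /* Decompose MPI_COMM_WORLD into a communicator of ranks that can share memory. */
    MPI_Comm shmcomm;
    MPI_Comm_split_type( MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &shmcomm );

    int rank, size;
    MPI_Comm_rank( shmcomm, &rank );
    MPI_Comm_size( shmcomm, &size );

    /* Each rank contributes N_ELEMS floats to one shared-memory window. */
    float *my_base;
    MPI_Win win;
    MPI_Win_allocate_shared( N_ELEMS * sizeof( float ), sizeof( float ),
                             MPI_INFO_NULL, shmcomm, &my_base, &win );

    /* Query the base address of the left neighbor's slice. */
    int left = ( rank + size - 1 ) % size;
    MPI_Aint left_size;
    int left_disp;
    float *left_base;
    MPI_Win_shared_query( win, left, &left_size, &left_disp, &left_base );

    /* Passive-target access epoch over all ranks; only shared epochs are needed. */
    MPI_Win_lock_all( MPI_MODE_NOCHECK, win );

    for( long j = 0; j < N_ELEMS; j++ )
        my_base[j] = ( float )rank;       /* fill my slice */
    MPI_Win_sync( win );                  /* make my stores visible */
    MPI_Barrier( shmcomm );               /* wait until every rank has written */
    MPI_Win_sync( win );                  /* see the other ranks' stores */

    printf( "rank %d read %.1f from rank %d\n", rank, left_base[0], left );

    MPI_Win_unlock_all( win );
    MPI_Win_free( &win );
    MPI_Finalize();
    return 0;
}

Compiled with mpiicc and run with mpirun -n 8, each rank should report the value written by its left neighbor.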

Basic Performance Tuning for Intel® Xeon Phi™ Processor

This test is run on an Intel Xeon Phi processor 7250 at 1.40 GHz with 68 cores, installed with Red Hat Enterprise Linux* 7.2 and Intel® Xeon Phi™ Processor Software 1.5.1, and Intel® Parallel Studio 2017 update 2. By default, the Intel compiler will try to vectorize the code, and each MPI process has a single thread of execution. OpenMP* pragma is added at loop level for later use. To compile the code, run the following command line to generate the binary mpishared.out:

$ mpiicc mpishared.c -qopenmp -o mpishared.out
$ mpirun -n 8 ./mpishared.out
Elapsed time in msec: 5699 (after 64 iterations)

To explore the thread parallelism, run four threads per core, and re-compile with -xMIC-AVX512 to take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions:

$ mpiicc mpishared.c -qopenmp -xMIC-AVX512 -o mpishared.out
$ export OMP_NUM_THREADS=4
$ mpirun -n 8 ./mpishared.out
Elapsed time in msec: 4535 (after 64 iterations)

As the MCDRAM in this system is currently configured as 'Flat', the Intel Xeon Phi processor appears as two NUMA nodes. Node 0 contains all CPUs and the on-platform DDR4 memory, while node 1 has the on-package MCDRAM memory:

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 98200 MB
node 0 free: 92775 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15925 MB
node distances:
node   0   1
  0:  10  31
  1:  31  10

To allocate the memory in the MCDRAM (node 1), pass the argument -m 1 to the command numactl as follows:

$ numactl -m 1 mpirun -n 8 ./mpishared.out
Elapsed time in msec: 3070 (after 64 iterations)

This simple optimization technique greatly improves performance.

Summary

This whitepaper introduced the MPI-3 shared memory feature, followed by sample code that used the MPI-3 shared memory APIs. The pseudo-code explained what the program is doing, along with an explanation of the shared memory APIs. The program ran on an Intel Xeon Phi processor, and it was further optimized with simple techniques.

Reference

  1. MPI Forum, MPI 3.0
  2. Message Passing Interface Forum, MPI: A Message-Passing Interface Standard Version 3.0
  3. The MIT Press, Using Advanced MPI
  4. James Reinders, Jim Jeffers, Publisher: Morgan Kaufmann, Chapter 16 - MPI-3 Shared Memory Programming Introduction, High Performance Parallelism Pearls Volume Two

Appendix

The code of the sample MPI program is available for download.

Benefits of Intel® Optimized Caffe* in comparison with BVLC Caffe*


Overview

 This article introduces Berkeley Vision and Learning Center (BVLC) Caffe* and  a custom version of Caffe*, Intel® Optimized Caffe*. We explain why and how Intel® Optimized Caffe* performs efficiently on Intel® Architecture via Intel® VTune™ Amplifier and the time profiling option of Caffe* itself.

 

Introduction to BVLC Caffe* and Intel® Optimized Caffe*

Caffe* is a well-known and widely used machine-vision-based deep learning framework developed by the Berkeley Vision and Learning Center (BVLC). It is an open-source framework that is still actively evolving. It allows users to control a variety of options, such as the BLAS library, CPU- or GPU-focused computation, CUDA, OpenCV, MATLAB, and Python, before building Caffe* through 'Makefile.config'. You can easily change the options in the configuration file, and BVLC provides intuitive instructions for developers on the project web page.

Intel® Optimized Caffe* is an Intel-distributed, customized Caffe* version for Intel architectures. It offers all the goodness of mainline Caffe* with the addition of functionality optimized for Intel architectures and multi-node distributed training and scoring. Intel® Optimized Caffe* makes it possible to utilize CPU resources more efficiently.

To see in detail how Intel® Optimized Caffe* has been changed in order to optimize it for Intel architectures, please refer to this page: https://software.intel.com/en-us/articles/caffe-optimized-for-intel-architecture-applying-modern-code-techniques

In this article, we will first profile the performance of BVLC Caffe* with the Cifar 10 example, and then profile the performance of Intel® Optimized Caffe* with the same example. Performance profiling is conducted through two different methods.

Tested platform: Intel® Xeon Phi™ processor 7210 (1.3 GHz, 64 cores) with 96 GB RAM, CentOS 7.2

1. Caffe* provides its own timing option, for example:

./build/tools/caffe time \
    --model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
    -iterations 1000

2. Intel® VTune™ Amplifier :  Intel® VTune™ Amplifier is a powerful profiling tool that provides advanced CPU profiling features with a modern analysis interface.  https://software.intel.com/en-us/intel-vtune-amplifier-xe

 

 

How to Install BVLC Caffe*

Please refer to the BVLC Caffe* project web page for installation instructions: http://caffe.berkeleyvision.org/installation.html

If you have Intel® MKL installed on your system, it is better to use MKL as the BLAS library.

In your Makefile.config, choose BLAS := mkl and specify the MKL path. (The default setting is BLAS := atlas.)

In our test, we kept all configuration options at their defaults except the CPU-only option.

 

Test example

In this article, we will use 'Cifar 10' example included in Caffe* package as default. 

You can refer to the BVLC Caffe* project page for detailed information about this example: http://caffe.berkeleyvision.org/gathered/examples/cifar10.html

You can simply run the Cifar 10 training example as follows:

cd $CAFFE_ROOT
./data/cifar10/get_cifar10.sh
./examples/cifar10/create_cifar10.sh
./examples/cifar10/train_full_sigmoid_bn.sh

First, we will try Caffe's own benchmark method to obtain performance results, as follows:

./build/tools/caffe time \
    --model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
    -iterations 1000

As a result, we get the layer-by-layer forward and backward propagation times. The command above measures the time of each forward and backward pass over a batch of images. At the end, it shows the average execution time per iteration for 1,000 iterations, per layer and for the entire calculation.

This test was run on an Intel® Xeon Phi™ processor 7210 (1.3 GHz, 64 cores) with 96 GB of DDR4 RAM, installed with CentOS 7.2.

The numbers in the above results will be compared later with the results of Intel® Optimized Caffe*.

Before that, let's also take a look at the VTune™ results to observe the behavior of Caffe* in detail.

 

VTune Profiling

Intel® VTune™ Amplifier is a modern processor performance profiler that is capable of analyzing top hotspots quickly and helping you tune your target application. You can find the details of Intel® VTune™ Amplifier at the following link:

Intel® VTune™ Amplifier : https://software.intel.com/en-us/intel-vtune-amplifier-xe

We used Intel® VTune™ Amplifier in this article to find the functions with the highest total CPU utilization time, and to observe how the OpenMP threads are working.

 

VTune result analysis

 

What we can see here are the functions listed on the left side of the screen that are taking most of the CPU time. They are called 'hotspots' and are candidate functions for performance optimization.

In this case, we will focus on the 'caffe::im2col_cpu<float>' function as an optimization candidate.

'im2col_cpu<float>' is one of the steps in performing direct convolution as a GEMM operation in order to use highly optimized BLAS libraries. This function consumed the most CPU time in our test of training the Cifar 10 model using BVLC Caffe*.

Let's take a look at the threading behavior of this function. In VTune™, you can choose a function and filter other workloads out to observe only the workloads of the specified function.

In the above result, we can see that the CPI (cycles per instruction) of the function is 0.907, and that the function utilizes only a single thread for the entire calculation.

VTune provides one more intuitive piece of data here.

This 'CPU Usage Histogram' shows the number of CPUs that were running simultaneously. The number of CPUs the training process utilized appears to be about 25. The platform has 64 physical cores with Intel® Hyper-Threading Technology, so it has 256 logical CPUs. The CPU usage histogram here might imply that the process is not efficiently threaded.

However, we cannot simply judge these results as 'bad', because we did not set any performance standard or target against which to classify them. We will compare these results with the results of Intel® Optimized Caffe* later.

 

Let's move on to Intel® Optimized Caffe* now.

 

How to Install Intel® Optimized Caffe*

The basic installation procedure of Intel® Optimized Caffe* is the same as for BVLC Caffe*.

When cloning Intel® Optimized Caffe* from Git, use this repository:

git clone https://github.com/intel/caffe

 

Additionally, it is required to install Intel® MKL to bring out the best performance of Intel® Optimized Caffe*.

Please download and install Intel® MKL. Intel offers MKL for free without technical support, or for a license fee to get one-on-one private support. The default BLAS library of Intel® Optimized Caffe* is set to MKL.

 Intel® MKL : https://software.intel.com/en-us/intel-mkl

After downloading Intel® Optimized Caffe* and installing MKL, in your Makefile.config, make sure you choose MKL as your BLAS library and point BLAS_INCLUDE and BLAS_LIB to the MKL include and lib folders:

BLAS := mkl

BLAS_INCLUDE := /opt/intel/mkl/include
BLAS_LIB := /opt/intel/mkl/lib/intel64

 

If you encounter a 'libstdc++'-related error during the compilation of Intel® Optimized Caffe*, install 'libstdc++-static'. For example:

sudo yum install libstdc++-static

 

 

 

Optimization factors and tunes

Before we run and test the performance of examples, there are some options we need to change or adjust to optimize performance.

  • Use 'mkl' as the BLAS library: Specify 'BLAS := mkl' in Makefile.config and also configure the locations of your MKL include and lib directories.
  • Set CPU utilization limit : 
    echo "100" | sudo tee /sys/devices/system/cpu/intel_pstate/min_perf_pct
    echo "0" | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
  • Put 'engine:"MKL2017"' at the top of your train_val.prototxt or solver.prototxt file or use this option with caffe tool : -engine "MKL2017"
  • The current implementation uses OpenMP threads. By default, the number of OpenMP threads is set to the number of CPU cores. Each thread is bound to a single core to achieve the best performance results. It is, however, possible to provide your own configuration through OpenMP environment variables such as KMP_AFFINITY, OMP_NUM_THREADS, or GOMP_CPU_AFFINITY. For the example run below, 'OMP_NUM_THREADS = 64' has been used.
  • Intel® Optimized Caffe* has changed many parts of the original BVLC Caffe* code to achieve better code parallelization with OpenMP*. Depending on other processes running in the background, it is often useful to adjust the number of threads utilized by OpenMP*. For the Intel Xeon Phi™ product family on a single node, we recommend using OMP_NUM_THREADS = number_of_cores - 2.
  • Please also refer here : Intel Recommendation to Achieve the best performance 

If you observe too much overhead because of too-frequent thread migration by the OS, you can try adjusting the OpenMP* affinity environment variable:

KMP_AFFINITY=compact,granularity=fine

 

Test example

 For Intel® Optimized Caffe* we run the same example to compare the results with the previous results. 

cd $CAFFE_ROOT
./data/cifar10/get_cifar10.sh
./examples/cifar10/create_cifar10.sh
./build/tools/caffe time \
    --model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
    -iterations 1000

 

Comparison

The results of the above example are as follows.

Again, the platform used for the test is: Intel® Xeon Phi™ processor 7210 (1.3 GHz, 64 cores) with 96 GB RAM, CentOS 7.2.

First, let's look at the BVLC Caffe* and Intel® Optimized Caffe* results together.

To make it easy to compare, please see the table below. The duration each layer took is listed in milliseconds, and in the 5th column we state how many times faster Intel® Optimized Caffe* is than BVLC Caffe* at each layer. You can observe significant performance improvements, except for the bn layers. Bn stands for "batch normalization", which requires fairly simple calculations with small optimization potential. Bn forward layers show better results, while bn backward layers show 2~3% slower results than the original; the worse performance can occur here as a result of threading overhead. Overall, Intel® Optimized Caffe* achieved about 28 times faster performance in this case.

Layer       Direction          BVLC (ms)    Intel (ms)   Performance Benefit (x)
conv1       Forward            40.2966      1.65063      24.413
conv1       Backward           54.5911      2.24787      24.286
pool1       Forward            162.288      1.97146      82.319
pool1       Backward           21.7133      0.459767     47.227
bn1         Forward            1.60717      0.812487     1.978
bn1         Backward           1.22236      1.24449      0.982
Sigmoid1    Forward            132.515      2.24764      58.957
Sigmoid1    Backward           17.9085      0.262797     68.146
conv2       Forward            125.811      3.8915       32.330
conv2       Backward           239.459      8.45695      28.315
bn2         Forward            1.58582      0.854936     1.855
bn2         Backward           1.2253       1.25895      0.973
Sigmoid2    Forward            132.443      2.2247       59.533
Sigmoid2    Backward           17.9186      0.234701     76.347
pool2       Forward            17.2868      0.38456      44.952
pool2       Backward           27.0168      0.661755     40.826
conv3       Forward            40.6405      1.74722      23.260
conv3       Backward           79.0186      4.95822      15.937
bn3         Forward            0.918853     0.779927     1.178
bn3         Backward           1.18006      1.18185      0.998
Sigmoid3    Forward            66.2918      1.1543       57.430
Sigmoid3    Backward           8.98023      0.121766     73.750
pool3       Forward            12.5598      0.220369     56.994
pool3       Backward           17.3557      0.333837     51.989
ipl         Forward            0.301847     0.186466     1.619
ipl         Backward           0.301837     0.184209     1.639
loss        Forward            0.802242     0.641221     1.251
loss        Backward           0.013722     0.013825     0.993
Ave.        Forward            735.534      21.6799      33.927
Ave.        Backward           488.049      21.7214      22.469
Ave.        Forward-Backward   1223.86      43.636       28.047
Total                          1223860      43636        28.047

 

Some of many reasons this optimization was possible are :

  • Code vectorization for SIMD 
  • Finding hotspot functions and reducing function complexity and the amount of calculations
  • CPU / system specific optimizations
  • Reducing thread movements
  • Efficient OpenMP* utilization

 

Additionally, let's compare the VTune results of this example between BVLC Caffe and Intel® Optimized Caffe*. 

We will simply look at how efficiently the im2col_cpu function has been utilized.

BVLC Caffe*'s im2col_cpu function had a CPI of 0.907 and was single-threaded.

In the case of Intel® Optimized Caffe*, im2col_cpu has a CPI of 2.747 and is multi-threaded by OpenMP workers.

The CPI rate increased here because of vectorization, which brings a higher CPI rate due to the longer latency of each instruction, and because of multi-threading, which can introduce spinning while waiting for other threads to finish their jobs. However, in this example, the benefits from vectorization and multi-threading exceed the latency and overhead, and bring performance improvements after all.

VTune suggests that a CPI rate close to 2.0 is theoretically ideal, and in our case we achieved about the right CPI for the function. The training workload for the Cifar 10 example handles 32 x 32-pixel images in each iteration, so when the workload is split across many threads, each thread gets a very small task, which may cause transition overhead for multi-threading. With larger images we would see lower spinning time and a smaller CPI rate.

The CPU Usage Histogram for the whole process also shows better threading results in this case.

 

 

 

Useful links

BVLC Caffe* Project : http://caffe.berkeleyvision.org/ 
 
Intel® Optimized Caffe* Git : https://github.com/intel/caffe
Intel® Optimized Caffe* Recommendations for the best performance : https://github.com/intel/caffe/wiki/Recommendations-to-achieve-best-performance 
 

 

Summary

Intel® Optimized Caffe* is a customized Caffe* version for Intel Architectures with modern code techniques.

In Intel® Optimized Caffe*, Intel leverages optimization tools and Intel® performance libraries, performs scalar and serial optimizations, and implements vectorization and parallelization.

 

 

Performance of Classic Matrix Multiplication Algorithm on Intel® Xeon Phi™ Processor System


Contents

Introduction
An Overview of the Classic Matrix Multiplication Algorithm
Total Number of Floating Point Operations
Implementation Complexity
Optimization Techniques
Memory Allocation Schemes
Loop Processing Schemes
Compute Schemes
Error Analysis
Performance on Intel® Xeon Phi™ Processor System
OpenMP* Product Thread Affinity Control
Recommended Intel® C++ Compiler Command-Line Options
Conclusion
References
Downloads
Abbreviations
Appendix A - Technical Specifications of Intel Xeon Phi Processor System
Appendix B - Comparison of Processing Times for MMAs vs. MTA
Appendix C - Error Analysis (Absolute Errors for SP FP Data Type)
Appendix D - Performance of MMAs for Different MASs
About the Author

Introduction

Matrix multiplication (MM) of two matrices is one of the most fundamental operations in linear algebra. The algorithm for MM is very simple, it could be easily implemented in any programming language, and its performance significantly improves when different optimization techniques are applied.

Several versions of the classic matrix multiplication algorithm (CMMA) to compute a product of square dense matrices are evaluated in four test programs. Performance of these CMMAs is compared to a highly optimized 'cblas_sgemm' function of the Intel® Math Kernel Library (Intel® MKL)7. Tests are completed on a computer system with Intel® Xeon Phi™ processor 72105 running the Linux Red Hat* operating system in 'All2All' Cluster mode and for 'Flat', 'Hybrid 50-50', and 'Cache' MCDRAM modes.

All versions of CMMAs for single and double precision floating point data types described in the article are implemented in the C programming language and compiled with Intel® C++ Compiler versions 17 and 16 for Linux*6.

The article targets experienced C/C++ software engineers and can be considered as a reference on application optimization techniques, analysis of performance, and accuracy of computations related to MMAs.

If needed, the reader may review the contents of References1 or2 for a description of mathematical fundamentals of MM, because theoretical topics related to MM are not covered in this article.

An Overview of the Classic Matrix Multiplication Algorithm

A fundamental property of any algorithm is its asymptotic complexity (AC)3.

In generic form, AC for MMA can be expressed as follows:

MMA AC = O(N^Omega)

where O stands for operation on a data element, also known in computer science as a Big O; N is one dimension of the matrix, and omega is a matrix exponent which equals 3.0 for CMMA. That is:

CMMA AC = O(N^3)

In order to compute a product of two square matrices using CMMA, a cubic number of floating point (FP) multiplication operations is required. In other words, the CMMA runs in O(N^3) time.

An omega lower than 3.0 is possible, and it means that an MMA computes a product of two matrices faster because an optimization technique, mathematical or programming, is applied and fewer FP multiplication operations are required to compute the product.

A list of several MMAs with different values of omega is as follows:

Algorithm                        Omega        Note
Francois Le Gall                 2.3728639    (1)
Virginia Vassilevska Williams    2.3728642
Stothers                         2.3740000
Coppersmith-Winograd             2.3760000
Bini                             2.7790000
Pan                              2.7950000
Strassen                         2.8070000    (2)
Strassen-Winograd                2.8070000
Classic                          3.0000000    (3)

Table 1. Algorithms are sorted by omega in ascending order.

Total Number of Floating Point Operations

Let's assume that:

M x N is a dimension of a matrix A, or A[M,N]
N x P is a dimension of a matrix B, or B[N,P]
M x P is a dimension of a matrix C, or C[M,P]

There are three relations between M, N and P:

Relation #1: A[...,N] = B[N,...]
Relation #2: A[M,...] = C[M,...]
Relation #3: B[...,P] = C[...,P]

If one of these three relations is not met, the product of two matrices cannot be computed.

In this article only square matrices of dimension N, where M = N = P, will be considered. Therefore:

A[N,N] is the same as A[M,N]
B[N,N] is the same as B[N,P]
C[N,N] is the same as C[M,P]

The following table shows how many multiplications are needed to compute a product of two square matrices of different Ns for three algorithms from Table 1 with omega = 2.3728639 (1), omega = 2.807 (2) and omega = 3.0 (3).

N         Omega = 2.3728639 (1)    Omega = 2.807 (2)         Omega = 3.0 (3)
128       100,028                  822,126                   2,097,152
256       518,114                  5,753,466                 16,777,216
512       2,683,668                40,264,358                134,217,728
1024      13,900,553               281,781,176               1,073,741,824
2048      72,000,465               1,971,983,042             8,589,934,592
4096      372,939,611              13,800,485,780            68,719,476,736
8192      1,931,709,091            96,579,637,673            549,755,813,888
16384     10,005,641,390           675,891,165,093           4,398,046,511,104
32768     51,826,053,965           4,730,074,351,662         35,184,372,088,832
65536     268,442,548,034          33,102,375,837,652        281,474,976,710,656

Table 2.

For example, to compute a product of two square dense matrices of dimension N equal to 32,768, Francois Le Gall (1) MMA needs ~51,826,053,965 multiplications and Classic (3) MMA needs ~35,184,372,088,832 multiplications.

Imagine the case of the product of two square matrices where N equals 32,768 needs to be computed on a very slow computer system. It means that if the Francois Le Gall MMA completes the processing in one day, then the classic MMA will need ~679 days on the same computer system, or almost two years. This is because the Francois Le Gall MMA needs ~679x fewer multiplications to compute a product:

~35,184,372,088,832 / ~51,826,053,965 = ~678.9

In the case of using a famous Strassen (2) MMA, ~91 days would be needed:

~4,730,074,351,662 / ~51,826,053,965 = ~91.3

In many software benchmarks the performance of an algorithm, or some processing, is measured in floating point operations per second (FLOPS), and not in elapsed time intervals, like days, hours, minutes, or seconds. That is why it is very important to know an exact total number (TN) of FP operations completed to calculate a FLOPS value.

With modern C++ compilers, it is very difficult to estimate an exact TN of FP operations per unit of time completed at run time due to extensive optimizations of generated binary codes. It means that an analysis of binary codes could be required, and this is outside of the scope of this article.

However, an estimate value of the TN of FP operations, multiplications and additions, for CMMA when square matrices are used can be easily calculated. Here are two simple examples:

Example 1: N = 2

	Multiplications	= 8				// 2 * 2 * 2 = 2^3
	Additions	= 4				// 2 * 2 * 1 = 2^2*(2-1)
	TN FP Ops	= 8 + 4 = 12

Example 2: N = 3

	Multiplications	= 27				// 3 * 3 * 3 = 3^3
	Additions	= 18				// 3 * 3 * 2 = 3^2*(3-1)
	TN FP Ops	= 27 + 18 = 45

It is apparent that the TN of FP operations to compute a product of two square matrices can be calculated using a simple formula:

TN FP Ops = (N^3) + ((N^2) * (N-1))

Note: Take into account that in the versions of the MMA used for sparse matrices, no FP operations are performed if the matrix element at position (i,j) is equal to zero.
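
To make the formula concrete, here is a minimal, hedged helper sketch (not one of the article's test programs) that evaluates TN FP Ops and converts an elapsed time into a GFLOPS estimate; the matrix dimension and elapsed time used below are hypothetical values:

#include <stdio.h>

/* TN FP Ops = (N^3) + ((N^2) * (N-1)), computed in double to avoid integer overflow. */
static double cmma_total_fp_ops( double n )
{
	return ( n * n * n ) + ( n * n * ( n - 1.0 ) );
}

int main( void )
{
	double n = 1024.0;            /* hypothetical matrix dimension N */
	double elapsed_sec = 0.25;    /* hypothetical measured processing time */

	double tn = cmma_total_fp_ops( n );
	printf( "N = %.0f, TN FP Ops = %.0f, GFLOPS = %.3f\n", n, tn, tn / elapsed_sec / 1.0e9 );
	return 0;
}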

Implementation Complexity

In the C programming language only four lines of code are needed to implement a core part of the CMMA:

for( i = 0; i < N; i += 1 )
		for( j = 0; j < N; j += 1 )
			for( k = 0; k < N; k += 1 )
				C[i][j] += A[i][k] * B[k][j];

Therefore, CMMA's implementation complexity (IC) could be rated as very simple.

Declarations of all intermediate variables, memory allocations, and initialization of matrices are usually not taken into account.

More complex versions of MMA, like Strassen or Strassen-Winograd, could have several thousands of code lines.

Optimization Techniques

In computer programming, matrices could be represented in memory as 1-D or 2-D data structures.

Here is a static declaration of matrices A, B, and C as 1-D data structures of a single precision (SP) FP data type (float):

	float fA[N*N];
	float fB[N*N];
	float fC[N*N];

and this is what a core part of the CMMA looks like:

	for( i = 0; i < N; i += 1 )
		for( j = 0; j < N; j += 1 )
			for( k = 0; k < N; k += 1 )
				C[N*i+j] += A[N*i+k] * B[N*k+j];

Here is a static declaration of matrices A, B, and C as 2-D data structures of a single precision (SP) FP data type (float):

	float fA[N][N];
	float fB[N][N];
	float fC[N][N];

and this is what the core part of CMMA looks like:

	for( i = 0; i < N; i += 1 )
		for( j = 0; j < N; j += 1 )
			for( k = 0; k < N; k += 1 )
				C[i][j] += A[i][k] * B[k][j];

Many other variants of the core part of CMMA are possible and they will be reviewed.

Memory Allocation Schemes

In the previous section of this article, two examples of a static declaration of matrices A, B, and C were given. In the case of dynamic allocation of memory for matrices, explicit calls to memory allocation functions need to be made. In this case, declarations and allocations of memory can look like the following:

Declaration of matrices A, B, and C as 1-D data structures:

	__attribute__( ( aligned( 64 ) ) ) float *fA;
	__attribute__( ( aligned( 64 ) ) ) float *fB;
	__attribute__( ( aligned( 64 ) ) ) float *fC;

and this is how memory needs to be allocated:

	fA = ( float * )_mm_malloc( N * N * sizeof( float ), 64 );
	fB = ( float * )_mm_malloc( N * N * sizeof( float ), 64 );
	fC = ( float * )_mm_malloc( N * N * sizeof( float ), 64 );

Note: Allocated memory blocks are 64-byte aligned, contiguous, and not fragmented by an operating system memory manager; this improves performance of processing.

Declaration of matrices A, B, and C as 2-D data structures:

	__attribute__( ( aligned( 64 ) ) ) float **fA;
	__attribute__( ( aligned( 64 ) ) ) float **fB;
	__attribute__( ( aligned( 64 ) ) ) float **fC;

and this is how memory needs to be allocated:

	fA = ( float ** )calloc( N, sizeof( float * ) );
	fB = ( float ** )calloc( N, sizeof( float * ) );
	fC = ( float ** )calloc( N, sizeof( float * ) );
	for( i = 0; i < N; i += 1 )
	{
		fA[i] = ( float * )calloc( N, sizeof( float ) );
		fB[i] = ( float * )calloc( N, sizeof( float ) );
		fC[i] = ( float * )calloc( N, sizeof( float ) );
	}

Note: Allocated memory blocks are not contiguous and can be fragmented by an operating system memory manager, and fragmentation can degrade performance of processing.

In the previous examples, a DDR4-type RAM memory was allocated for matrices. However, on an Intel Xeon Phi processor system5 a multichannel DRAM (MCDRAM)-type RAM memory could be allocated as well, using functions from a memkind library11 when MCDRAM mode is configured to 'Flat' or 'Hybrid'. For example, this is how an MCDRAM-type RAM memory can be allocated:

	fA = ( float * )hbw_malloc( N * N * sizeof( float ) );
	fB = ( float * )hbw_malloc( N * N * sizeof( float ) );
	fC = ( float * )hbw_malloc( N * N * sizeof( float ) );

Note: An 'hbw_malloc' function of the memkind library was used instead of an '_mm_malloc' function.
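
The following is a hedged sketch (assuming the memkind library and its 'hbwmalloc.h' header are installed and the program is linked with -lmemkind) of allocating a matrix in MCDRAM with a fallback to DDR4 when high-bandwidth memory is not available; the dimension N below is an illustrative placeholder:

#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>

#define N 1024    /* illustrative matrix dimension */

int main( void )
{
	/* hbw_check_available() returns 0 when high-bandwidth (MCDRAM) memory is usable. */
	int use_hbw = ( hbw_check_available() == 0 );
	size_t bytes = ( size_t )N * N * sizeof( float );

	float *fA = use_hbw ? ( float * )hbw_malloc( bytes )
	                    : ( float * )malloc( bytes );
	if( fA == NULL )
	{
		fprintf( stderr, "Allocation of %zu bytes failed\n", bytes );
		return 1;
	}

	/* ... initialize and use the matrix here ... */

	if( use_hbw )
		hbw_free( fA );
	else
		free( fA );
	return 0;
}

If 64-byte alignment is needed, as with '_mm_malloc' above, 'hbw_posix_memalign' from the same library can be used instead of 'hbw_malloc'.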

On an Intel Xeon Phi processor system, eight variants of memory allocation for matrices A, B, and C are possible:

Matrix A    Matrix B    Matrix C    Note
DDR4        DDR4        DDR4        (1)
DDR4        DDR4        MCDRAM      (2)
DDR4        MCDRAM      DDR4
DDR4        MCDRAM      MCDRAM
MCDRAM      DDR4        DDR4
MCDRAM      DDR4        MCDRAM
MCDRAM      MCDRAM      DDR4
MCDRAM      MCDRAM      MCDRAM

Table 3.

It is recommended to use MCDRAM memory as much as possible because its bandwidth is ~400 GB/s, and it is ~5 times faster than the ~80 GB/s bandwidth of DDR4 memory5.

Here is an example of how 'cblas_sgemm' MMA performs for two memory allocation schemes (MASs) (1) and (2):

	Matrix multiplication C=A*B where matrix A (32768x32768) and matrix B (32768x32768)
	Allocating memory for matrices A, B, C: MAS=DDR4:DDR4:DDR4
	Initializing matrix data
	Matrix multiplication started
	Matrix multiplication completed at 50.918 seconds
	Allocating memory for matrices A, B, C: MAS=DDR4:DDR4:MCDRAM
	Initializing matrix data
	Matrix multiplication started
	Matrix multiplication completed at 47.385 seconds

It is clear that there is a performance improvement of ~7 percent when an MCDRAM memory was allocated for matrix C.

Loop Processing Schemes

A loop processing scheme (LPS) describes what optimization techniques are applied to the 'for' statements of the C language of the core part of CMMA. For example, the following code:

	for( i = 0; i < N; i += 1 )						// loop 1
		for( j = 0; j < N; j += 1 )					// loop 2
			for( k = 0; k < N; k += 1 )				// loop 3
				C[i][j] += A[i][k] * B[k][j];

corresponds to an LPS=1:1:1, and it means that loop counters are incremented by 1.

Table 4 below includes short descriptions of different LPSs:

LPS      Note
1:1:1    Loops not unrolled
1:1:2    3rd loop unrolls to 2-in-1 computations
1:1:4    3rd loop unrolls to 4-in-1 computations
1:1:8    3rd loop unrolls to 8-in-1 computations
1:2:1    2nd loop unrolls to 2-in-1 computations
1:4:1    2nd loop unrolls to 4-in-1 computations
1:8:1    2nd loop unrolls to 8-in-1 computations

Table 4.

For example, the following code corresponds to an LPS=1:1:2, and it means that counters 'i' and 'j' for loops 1 and 2 are incremented by 1, and counter 'k' for loop 3 is incremented by 2:

	for( i = 0; i < N; i += 1 )						// :1
	{
		for( j = 0; j < N; j += 1 )					// :1
		{
			for( k = 0; k < N; k += 2 )				// :2 (unrolled loop)
			{
				C[i][j] += A[i][k  ] * B[k   ][j];
				C[i][j] += A[i][k+1] * B[k+1][j];
			}
		}
	}

Note: A C++ compiler could unroll loops as well if command-line options for unrolling are used. A software engineer should avoid combining such compiler unrolling with source-code unrolling at the same time, because it can prevent vectorization of inner loops and degrade performance of processing.
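
For comparison with the LPS=1:1:2 fragment above, a hedged sketch of an LPS=1:4:1 variant (the 2nd loop unrolled to 4-in-1 computations, assuming N is a multiple of 4) could look like this:

	for( i = 0; i < N; i += 1 )						// :1
	{
		for( j = 0; j < N; j += 4 )					// :4 (unrolled loop)
		{
			for( k = 0; k < N; k += 1 )				// :1
			{
				C[i][j  ] += A[i][k] * B[k][j  ];
				C[i][j+1] += A[i][k] * B[k][j+1];
				C[i][j+2] += A[i][k] * B[k][j+2];
				C[i][j+3] += A[i][k] * B[k][j+3];
			}
		}
	}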

Another optimization technique is the loop interchange optimization technique (LIOT). When the LIOT is used, a core part of CMMA looks as follows:

	for( i = 0; i < N; i += 1 )						// loop 1
		for( k = 0; k < N; k += 1 )					// loop 2
			for( j = 0; j < N; j += 1 )				// loop 3
				C[i][j] += A[i][k] * B[k][j];

It is worth noting that counters 'j' and 'k' for loops 2 and 3 were exchanged.

The loops unrolling and LIOT allow for improved performance of processing because elements of matrices A and B are accessed more efficiently.

Compute Schemes

A compute scheme (CS) describes the computation of final or intermediate values and how elements of matrices are accessed.

In a CMMA an element (i,j) of the matrix C is calculated as follows:

	C[i][j] += A[i][k] * B[k][j]

and its CS is ij:ik:kj.

However, elements of matrix B are accessed in a very inefficient way. That is, the next element of matrix B, which needs to be used in the calculation, is located at a distance of (N * sizeof (datatype)) bytes. For very small matrices this is not critical because they can fit into CPU caches. However, for larger matrices it affects performance of computations, which can be significantly degraded, due to cache misses.

In order to solve that problem and improve performance of computations, a very simple optimization technique is used. If matrix B is transposed, the next element that needs to be used in the calculation will be located at a distance of (sizeof (datatype)) bytes. Thus, access to the elements of matrix B will be similar to the access to the elements of matrix A.

In a transpose-based CMMA, an element (i,j) of the matrix C is calculated as follows:

	C[i][j] += A[i][k] * B[j][k]

and its CS is ij:ik:jk. Here B[j][k] is used instead of B[k][j].

It is very important to use the fastest possible algorithm for the matrix B transposition before processing is started. In Appendix B an example is given on how much time is needed to transpose a square matrix of 32,768 elements, and how much time is needed to compute the product on an Intel Xeon Phi processor system.
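
As an illustration, a hedged sketch of the transposed compute scheme is shown below; it assumes an additional N x N buffer BT (declared like the matrices in the fragments above) that holds the transposed copy of matrix B:

	// Transpose B once (classic transpose), so that BT[j][k] == B[k][j]
	for( i = 0; i < N; i += 1 )
		for( j = 0; j < N; j += 1 )
			BT[j][i] = B[i][j];

	// Multiply using CS=ij:ik:jk; A and BT are both accessed with unit stride
	for( i = 0; i < N; i += 1 )
		for( j = 0; j < N; j += 1 )
			for( k = 0; k < N; k += 1 )
				C[i][j] += A[i][k] * BT[j][k];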

Another optimization technique is the loop blocking optimization technique (LBOT) and it allows the use of smaller subsets of A, B, and C matrices to compute the product. When the LBOT is used, a core part of CMMA looks as follows:

	for( i = 0; i < N; i += BlockSize )
	{
		for( j = 0; j < N; j += BlockSize )
		{
			for( k = 0; k < N; k += BlockSize )
			{
				for( ii = i; ii < ( i+BlockSize ); ii += 1 )
					for( jj = j; jj < ( j+BlockSize ); jj += 1 )
						for( kk = k; kk < ( k+BlockSize ); kk += 1 )
							C[ii][jj] += A[ii][kk] * B[kk][jj];
			}
		}
	}

Note: A detailed description of LBOT can be found at10.

Table 5 shows four examples of CSs:

CS                Note
ij:ik:kj          Default
ij:ik:jk          Transposed
iijj:iikk:kkjj    Default LBOT
iijj:iikk:jjkk    Transposed LBOT

Table 5.

Error Analysis

In any version of MMA many FP operations need to be done in order to compute values of elements of matrix C. Since FP data types SP or DP have limited precision4, rounding errors accumulate very quickly. A common misconception is that rounding errors can occur only in cases where large or very large matrices need to be multiplied. This is not true because, in the case of floating point arithmetic (FPA), a rounding error is also a function of the range of an input value, and it is not a function of the size of input matrices.

However, a very simple optimization technique allows improvement in the accuracy of computations.

If matrices A and B are declared as an SP FP data type, then intermediate values could be stored in a variable of DP FP data type:

	for( i = 0; i < N; i += 1 )
	{
		for( j = 0; j < N; j += 1 )
		{
			double sum = 0.0;
			for( k = 0; k < N; k += 1 )
			{
				sum += ( double )( A[i][k] * B[k][j] );
			}
			C[i][j] = sum;
		}
	}

The accuracy of computations will be improved, but performance of processing can be lower.

An error analysis (EA) is completed using the mmatest4.c test program for different sizes of matrices of SP and DP FP data types (see Table 6 in Appendix C with results).

Performance on the Intel® Xeon Phi™ Processor System

Several versions of the CMMA to compute a product of square dense matrices are evaluated in four test programs. Performance of these CMMAs is compared to a highly optimized 'cblas_sgemm' function of the Intel MKL7. Also see Appendix D for more evaluations.

Figure 1. Performance tests for matrix multiply algorithms on Intel® Xeon Phi™ processor using mmatest1.c with KMP_AFFINITY environment variable set to 'scatter', 'balanced', and 'compact'. A lower bar height means faster processing.

Here are the names of source files with a short description of tests:

mmatest1.c - Performance tests matrix multiply algorithms on an Intel Xeon Phi processor.
mmatest2.c - Performance tests matrix multiply algorithms on an Intel Xeon Phi processor in one MCDRAM mode ('Flat') for DDR4:DDR4:DDR4 and DDR4:DDR4:MCDRAM MASs.
mmatest3.c - Performance tests matrix multiply algorithms on an Intel Xeon Phi processor in three MCDRAM modes ('All2All', 'Flat', and 'Cache') for DDR4:DDR4:DDR4 and MCDRAM:MCDRAM:MCDRAM MASs. Note: In 'Cache' MCDRAM mode, MCDRAM:MCDRAM:MCDRAM MAS cannot be used.
mmatest4.c - Verification matrix multiply algorithms accuracy of computations on an Intel Xeon Phi processor.

OpenMP* Product Thread Affinity Control

OpenMP* product compiler directives can be easily used to parallelize processing and significantly speed up processing. However, it is very important to execute OpenMP threads on different logical CPUs of modern multicore processors in order to utilize their internal resources as best as possible.

In the case of using the Intel C++ compiler and Intel OpenMP run-time libraries, the KMP_AFFINITY environment variable provides flexibility and simplifies that task. Here are three simple examples of using the KMP_AFFINITY environment variable:

	KMP_AFFINITY = scatter
	KMP_AFFINITY = balanced
	KMP_AFFINITY = compact
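
As a minimal, hedged sketch (not the article's mmatest code), the outer loop of the 1-D CMMA can be parallelized with a single OpenMP directive; the number of threads and their placement are then controlled at run time through OMP_NUM_THREADS and KMP_AFFINITY:

#include <omp.h>

// Product of two square N x N matrices stored as 1-D arrays; the outer
// loop is distributed across OpenMP threads.
void cmma_omp( int n, const float *A, const float *B, float *C )
{
	int i, j, k;
	#pragma omp parallel for private( j, k )
	for( i = 0; i < n; i += 1 )
		for( j = 0; j < n; j += 1 )
			for( k = 0; k < n; k += 1 )
				C[n*i+j] += A[n*i+k] * B[n*k+j];
}

For example, the function could be compiled with 'icc -qopenmp' and run with KMP_AFFINITY set to 'scatter' or 'balanced', as recommended above.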

These two screenshots of the Htop* utility12 demonstrate how OpenMP threads are assigned (pinned) to Intel Xeon Phi processor 72105 logical CPUs during processing of an MMA using 64 cores of the processor:

Screenshot 1. KMP_AFFINITY = scatter or balanced. Note: Processing is faster when compared to KMP_AFFINITY = compact.

Screenshot 2. KMP_AFFINITY = compact. Note: Processing is slower when compared to KMP_AFFINITY = scatter or balanced.

Recommended Intel® C++ Compiler Command-Line Options

Here is a list of Intel C++ Compiler command-line options that a software engineer should consider, which can improve performance of processing of CMMAs:

O3
fp-model
parallel
unroll
unroll-aggressive
opt-streaming-stores
opt-mem-layout-trans

Os
openmp
ansi-alias
fma
opt-matmul
opt-block-factor
opt-prefetch

The reader can use 'icpc -help' or 'icc -help' to learn more about these command-line options.

Conclusion

The application of different optimization techniques to the CMMA was reviewed in this article.

Three versions of CMMA to compute a product of square dense matrices were evaluated in four test programs. Performance of these CMMAs was compared to a highly optimized 'cblas_sgemm' function of the Intel MKL7.

Tests were completed on a computer system with an Intel® Xeon Phi processor 72105 running the Linux Red Hat operating system in 'All2All' Cluster mode and for 'Flat', 'Hybrid 50-50', and 'Cache' MCDRAM modes.

It was demonstrated that CMMA could be used for cases when matrices of small sizes, up to 1,024 x 1,024, need to be multiplied.

It was demonstrated that performance of MMAs is higher when MCDRAM-type RAM memory is allocated for matrices with sizes up to 16,384 x 16,384 instead of DDR4-type RAM memory.

Advantages of using CMMA to compute the product of two matrices are as follows:

  • In any programming language, simple to implement to run on CPUs or GPUs9
  • Highly portable source codes when implemented in C, C++, or Java programming languages
  • Simple to integrate with existing software for a wide range of computer platforms
  • Simple to debug and troubleshoot
  • Predictable memory footprint at run time
  • Easy to optimize using parallelization and vectorization techniques
  • Low overheads and very good performance for matrices of sizes ranging from 256 x 256 to 1,024 x 1,024 (see Figures 1 through 5)
  • Very good accuracy of computations for matrices of sizes ranging from 8 x 8 to 2,048 x 2,048 (see Table 6 in Appendix C)

Disadvantages of using CMMA to compute a product of two matrices are as follows:

  • Poor performance for large matrices with sizes greater than 2048 x 2048
  • Poor performance when implemented using high-level programming languages due to processing overheads
  • Reduced accuracy of computations for matrices of sizes ranging from 2,048 x 2,048 to 65,536 x 65,536 (see Table 6 in Appendix C)

References

1. Matrix Multiplication on Mathworld

http://mathworld.wolfram.com/MatrixMultiplication.html

2. Matrix Multiplication on Wikipedia

https://en.wikipedia.org/wiki/Matrix_multiplication

3. Asymptotic Complexity of an Algorithm

https://en.wikipedia.org/wiki/Time_complexity

4. The IEEE 754 Standard for Floating Point Arithmetic

http://standards.ieee.org/

5. Intel® Many Integrated Core Architecture

https://software.intel.com/en-us/xeon-phi/x200-processor
http://ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core
https://software.intel.com/en-us/forums/intel-many-integrated-core

6. Intel® C++ Compiler

https://software.intel.com/en-us/c-compilers
https://software.intel.com/en-us/forums/intel-c-compiler

7. Intel® MKL

https://software.intel.com/en-us/intel-mkl
https://software.intel.com/en-us/intel-mkl/benchmarks
https://software.intel.com/en-us/forums/intel-math-kernel-library

8. Intel® Developer Zone Forums

https://software.intel.com/en-us/forum

9. Optimizing Matrix Multiply for Intel® Processor Graphics Architecture Gen 9

https://software.intel.com/en-us/articles/sgemm-ocl-opt

10. Performance Tools for Software Developers Loop Blocking

https://software.intel.com/en-us/articles/performance-tools-for-software-developers-loop-blocking

11. Memkind library

https://github.com/memkind/memkind

12. Htop* monitoring utility

https://sourceforge.net/projects/htop

Downloads

Performance_CMMA_system.zip

List of all files (sources, test reports, and so on):

Performance_CMMA_system.pdf - Copy of this paper.

mmatest1.c - Performance tests for matrix multiply algorithms on Intel® Xeon Phi processors.

dataset1.txt - Results of tests.

mmatest2.c - Performance tests for matrix multiply algorithms on Intel® Xeon Phi processors for DDR4:DDR4:DDR4 and DDR4:DDR4:MCDRAM MASs.

dataset2.txt - Results of tests.

mmatest3.c - Performance tests for matrix multiply algorithms on Intel® Xeon Phi processors in three MCDRAM modes for DDR4:DDR4:DDR4 and MCDRAM:MCDRAM:MCDRAM MASs.

dataset3.txt - Results of tests.

mmatest4.c - Verification of matrix multiply algorithms accuracy of computations on Intel® Xeon Phi processors.

dataset4.txt - Results of tests.

Note:   Intel C++ Compiler versions used to compile tests:
17.0.1 Update 132 for Linux*
16.0.3 Update 210 for Linux*

Abbreviations

CPU - Central processing unit
GPU - Graphics processing unit
ISA - Instruction set architecture
MIC - Intel® Many Integrated Core Architecture
RAM - Random access memory
DRAM - Dynamic random access memory
MCDRAM - Multichannel DRAM
HBW - High bandwidth memory
DDR4 - Double data rate (generation) 4
SIMD - Single instruction multiple data
SSE - Streaming SIMD extensions
AVX - Advanced vector extensions
FP - Floating point
FPA - Floating point arithmetic4
SP - Single precision4
DP - Double precision4
FLOPS - Floating point operations per second
MM - Matrix multiplication
MMA - Matrix multiplication algorithm
CMMA - Classic matrix multiplication algorithm
MTA - Matrix transpose algorithm
AC - Asymptotic complexity
IC - Implementation complexity
EA - Error analysis
MAS - Memory allocation scheme
LPS - Loop processing scheme
CS - Compute scheme
LIOT - Loop interchange optimization technique
LBOT - Loop blocking optimization technique
ICC - Intel C++ Compiler6
MKL - Math kernel library7
CBLAS - C basic linear algebra subprograms
IDZ - Intel® Developer Zone8
IEEE - Institute of Electrical and Electronics Engineers4
GB - Gigabytes
TN - Total number

Appendix A - Technical Specifications of the Intel® Xeon Phi™ Processor System

Summary of the Intel Xeon Phi processor system used for testing:

Process technology: 14nm
Processor name: Intel Xeon Phi processor 7210
Frequency: 1.30 GHz
Packages (sockets): 1
Cores: 64
Processors (CPUs): 256
Cores per package: 64
Threads per core: 4
On-Package Memory: 16 GB high bandwidth MCDRAM (bandwidth ~400 GB/s)
DDR4 Memory: 96 GB 6 Channel (Bandwidth ~ 80 GB/s)
ISA: Intel® AVX-512 (Vector length 512-bit)

Detailed processor specifications:

http://ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core

Summary of a Linux operating system:

[guest@... ~]$ uname -a

Linux c002-n002 3.10.0-327.13.1.el7.xppsl_1.4.0.3211.x86_64 #1 SMP
Fri Jul 8 11:44:24 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

[guest@... ~]$ cat /proc/version

Linux version 3.10.0-327.13.1.el7.xppsl_1.4.0.3211.x86_64 (qb_user@89829b4f89a5)
(gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)) #1 SMP Fri Jul 8 11:44:24 UTC 2016

Appendix B - Comparison of Processing Times for MMAs versus MTA

Comparison of processing times for Intel MKL 'cblas_sgemm' and CMMA vs. MTA:

[Intel MKL & CMMA]

Matrix A [32768 x 32768] Matrix B [32768 x 32768]
Number of OpenMP threads: 64
MKL - Completed in: 51.2515874 seconds
CMMA - Completed in: 866.5838490 seconds

[MTA]

Matrix size: 32768 x 32768
Transpose Classic - Completed in: 1.730 secs
Transpose Diagonal - Completed in: 1.080 secs
Transpose Eklundh - Completed in: 0.910 secs

Comparing the processing time of the MTA to:
MKL 'cblas_sgemm', the transposition takes ~2.42 percent of the processing time.
CMMA, the transposition takes ~0.14 percent of the processing time.

Appendix C - Error Analysis (Absolute Errors for SP FP Data Type)

N        MMA     Calculated SP Value    Absolute Error
8        MKL     8.000080               0.000000
8        CMMA    8.000080               0.000000
16       MKL     16.000160              0.000000
16       CMMA    16.000160              0.000000
32       MKL     32.000309              -0.000011
32       CMMA    32.000320              0.000000
64       MKL     64.000671              0.000031
64       CMMA    64.000641              0.000001
128      MKL     128.001160             -0.000120
128      CMMA    128.001282             0.000002
256      MKL     256.002319             -0.000241
256      CMMA    256.002563             0.000003
512      MKL     512.004639             -0.000481
512      CMMA    512.005005             -0.000115
1024     MKL     1024.009521            -0.000719
1024     CMMA    1024.009888            -0.000352
2048     MKL     2048.019043            -0.001437
2048     CMMA    2048.021484            0.001004
4096     MKL     4096.038574            -0.002386
4096     CMMA    4096.037109            -0.003851
8192     MKL     8192.074219            -0.007701
8192     CMMA    8192.099609            0.017689
16384    MKL     16384.14648            -0.017356
16384    CMMA    16384.09961            -0.064231
32768    MKL     32768.33594            0.008258
32768    CMMA    32768.10156            -0.226118
65536    MKL     65536.71875            0.063390
65536    CMMA    65536.10156            -0.553798

Table 6.

Appendix D - Performance of MMAs for Different MASs

Figure 2. Performance of Intel® MKL 'cblas_sgemm'. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Flat'. Test program mmatest2.c. A lower bar height means faster processing.

Figure 3. Performance of Intel® MKL 'cblas_sgemm' vs. CMMA. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Flat'. Test program mmatest3.c. A lower bar height means faster processing.

Figure 4. Performance of Intel® MKL 'cblas_sgemm' vs. CMMA. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Hybrid 50-50'. Test program mmatest3.c. A lower bar height means faster processing.

Figure 5. Performance of Intel® MKL 'cblas_sgemm' vs. CMMA. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Cache'. Test program mmatest3.c. A lower bar height means faster processing.
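For reference, the runs behind Figures 2 through 5 combine the KMP_AFFINITY setting with an MCDRAM placement policy. A hedged sketch of a typical launch in 'Flat' mode is shown below; the MCDRAM NUMA node is usually node 1 on a single-socket Intel Xeon Phi processor system but should be confirmed with numactl -H, and in 'Cache' mode no explicit binding is needed because MCDRAM acts as a memory-side cache.

$ export KMP_AFFINITY=scatter
$ numactl -H                      # identify the MCDRAM NUMA node in Flat/Hybrid mode
$ numactl --membind=1 ./mmatest3  # bind allocations to MCDRAM (node 1 assumed)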

About the Author

Sergey Kostrov is a highly experienced C/C++ software engineer and Intel® Black Belt Developer. He is an expert in design and implementation of highly portable C/C++ software for embedded and desktop platforms, scientific algorithms, and high performance computing of big data sets.

Intel Solutions and Technologies for the Evolving Data Center


 

One Stop for Optimizing Your Data Center

From AI to Big Data to HPC: End-to-end Solutions

Whether your data center is data- or compute-intensive and whether it serves cloud, high-performance computing, enterprise, storage, networking, or big data analytics, we have solutions and technologies to make your life easier. 

Explore

 

Data center managers, integrators, and developers can now optimize the entire stack to run faster and more efficiently on Intel® architecture. The Intel® Xeon® and Intel® Xeon Phi™ product family paired with Intel® Solid State Drives and NVMe* storage provide a strong foundation. Intel is committed to a standardized, shared platform for virtualization including SDN/NFV (networking), while providing hardware-based security and manageability for now and in the future.

But Intel is more than a hardware innovator. Regardless of your challenges, Intel provides optimized industry SDKs, libraries, and tuning tools. And these tools are supplemented by expert-provided training plus documentation including code samples, configuration guides, walk-throughs, use cases, and support forums.
 

 

AI: MACHINE LEARNING AND DEEP LEARNING

Intel supports rapid innovation in artificial intelligence focusing on community, tools, and training. Starting with the Intel® Nervana™ AI Academy, this section of the Intel® Software Developer Zone drills down into computational machine learning and deep learning, with extensive Intel-optimized libraries and frameworks along with documentation and tutorials.

The Deep Learning Training Tool Beta helps you easily develop and train deep learning solutions using your own hardware. It can ease your data preparation, as well as design and train models using automated experiments and advanced visualizations.

Tools available include:
BigDL open source distributed library for Apache Spark*
Intel® Distribution for Python*
Deep Learning Webinar

 

MODERN CODE

You’ve no doubt heard of recent hardware innovations of the Intel® Many Integrated Core Architecture (Intel® MIC) including the multilevel extreme parallelism, vectorization and threading of the Intel® Xeon® and Intel® Xeon Phi™ product family. Plus, there are larger caches, new SIMD extensions, new memory and file architectures and hardware enforced security of select data and application code via Intel® Software Guard Extensions (Intel® SGX).

But they all require code and tool changes to get the most from the data center. To address this, Intel provides training and tools to quickly and easily optimize code for new technologies.

Extensive free training on code improvements and parallel programming is available online and by workshops and events.

Tools available include:
Intel® Parallel Studio XE (vectorization advisor and MPI profiling)
Intel® Advisor (vectorization optimization and threading design tool)
Intel® C/C++ Compilers and Intel® Fortran Compilers
Intel® VTune™ Amplifier XE (performance analysis of multiple CPUs and FPUs)
Application Performance Snapshot Tool

 

BIG DATA ANALYTICS

When handling huge volumes of data, Intel can help you provide faster, easier, and more insightful big data analytics using open software platforms, libraries, developer kits, and tools that take advantage of the Intel Xeon and Intel Xeon Phi product family’s extreme parallelism and vectorization. Fully integrated with popular platforms (Apache* Hadoop*, Spark*, R, Matlab*, Java*, and NoSQL), Intel optimizations have been well-tested and benchmarked.

Extensive documentation is available on how real-life developers are using Intel hardware, software, and tools to effectively store, manage, process, and analyze data.

The Intel® Data Analytics Acceleration Library (Intel® DAAL) provides highly-optimized algorithmic building blocks and can be paired with the Intel® Math Kernel Library (Intel® MKL) containing optimized threaded and vectorized functions. In fact, the TAP Analytics Toolkit (TAP ATK) provides both Intel® DAAL and Intel® MKL already integrated with Spark.

 

HIGH-PERFORMANCE STORAGE

Intel is at the cutting edge of Storage not only with Intel® SSDs and NVMe but by working with the open source community to optimize and secure the infrastructure. Training is available at Intel® Storage Builders University.


Major tools available include:
Intel® Intelligent Storage Acceleration Library (Intel® ISA-L)
Storage Performance Development Kit (SPDK)
Intel® QuickAssist Technology
Intel® VTune™ Amplifier
Storage Performance Snapshot
Intel® Cache Acceleration Software (Intel® CAS)

 

SDN/NFV NETWORKING

Besides providing a standardized open platform ideal for SDN/NFV (virtualized networking) and the unique hardware capabilities in Intel’s network controllers, Intel has provided extensive additions to, and testing of, the Data Plane Development Kit (DPDK) and training through Intel® Network Builders University. Check out the thriving community of developers and subscribe to the 'Out of the Box' Network Developers Newsletter.

   

HPC AND CLUSTER

If you run visualization or other massive parallelism applications, you know the advantages of using the Intel Xeon and Intel Xeon Phi product family with MCDRAM and associated NUMA/Memory/Cache Modes, wide vector units and up to 68 cores. While the Intel® Scalable System Framework (Intel® SSF) and Intel® Omni-Path Architecture (Intel® OPA) focus on performance, balance and scalability, Intel is working with research and production HPC and clusters to support integration with all the major stacks as well as developing code and tools to optimize and simplify the work.

The Intel® HPC Orchestrator provides a modular integrated validated stack including the Lustre* parallel file system. It is supplemented by critical tools for cluster optimization:

Intel® Trace Analyzer and Collector which quickly finds MPI bottlenecks
Intel® MPI Library and docs to improve implementation of MPI 3.1 on multiple fabrics
MPI Performance Snapshot to help with performance tuning.
Intel® VTune™ Amplifier XE for performance analysis of multiple CPUs, FPUs and NUMA

 

 

Conclusion

Regardless of your job title and data center activities, Intel helps streamline and optimize your work to gain a competitive edge with end-to-end solutions, from high-performance hardware to new technologies, optimizations, tools and training. See what resources Intel provides to optimize and speed up your development now and remain competitive in the industry.

Explore

Intel® Manycore Platform Software Stack for Intel® Xeon Phi™ Coprocessor x200


Summary of (latest) changes

This article describes the most recent changes that have been made to the Intel® Manycore Platform Software Stack (Intel® MPSS) 4.x. If you've subscribed to get update notifications, you can use this information to quickly determine whether these changes apply to you.

  • May 8, 2017, Intel® MPSS 4.4.0 HotFix 1 released for Linux* and Windows*

‍‍About the Intel® Manycore Platform Software Stack 4.x

The Intel MPSS 4.x is necessary to run the Intel® Xeon Phi™ coprocessor x200. It has been tested to work with specific versions of 64-bit operating systems.

The readme files (referenced in the Downloads section) list the supported operating system versions and have more information on how to build and install the stack.

One important component of Intel MPSS is the Symmetric Communications Interface (SCIF). The SCIF is included in the RPM bundle. SCIF provides a mechanism for inter-node communications within a single platform. A node, for SCIF purposes, is defined as either an Intel® Xeon Phi™ coprocessor or the Intel® Xeon® processor. In particular, the SCIF abstracts the details of communicating over the PCI Express* bus. The SCIF APIs are callable from both user space (uSCIF) and kernel space (kSCIF).
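As an illustration of the user-space SCIF API, the sketch below opens an endpoint on the host and connects to a peer listening on the first coprocessor (node 1). This is a hedged outline rather than a complete program: the port number 2050 and the message payload are arbitrary, and the header scif.h plus the calls scif_open/scif_bind/scif_connect/scif_send are taken from the SCIF user-space API shipped with Intel MPSS; consult the SCIF documentation in the package for the authoritative signatures and a matching listen/accept peer.

#include <stdio.h>
#include <scif.h>

int main(void)
{
    struct scif_portID dst;
    scif_epd_t epd;
    char msg[] = "hello from host";

    dst.node = 1;      /* node 1 = first coprocessor (node 0 is the host) */
    dst.port = 2050;   /* arbitrary port; the peer must listen on it */

    epd = scif_open();                      /* create an endpoint descriptor */
    if (epd == SCIF_OPEN_FAILED) { perror("scif_open"); return 1; }

    scif_bind(epd, 0);                      /* bind to an available local port */
    if (scif_connect(epd, &dst) < 0) {      /* connect to a peer that called scif_listen/scif_accept */
        perror("scif_connect");
        return 1;
    }

    scif_send(epd, msg, sizeof(msg), SCIF_SEND_BLOCK);  /* blocking send across the PCI Express bus */
    scif_close(epd);
    return 0;
}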

Intel MPSS is downloadable from the sources below. Note that these packages include documentation and APIs (for example, the SCIF API).

For Linux systems, users can measure Intel® Xeon Phi™ processor and coprocessor x200 product family performance with a tool called micperf. micperf is designed to incorporate a variety of benchmarks into a simple user experience with a single interface for execution. For the coprocessor, the micperf package is distributed as an RPM file within Intel MPSS. The following table summarizes all the benchmarks that can be run with the micperf tool:

Benchmark | CLI Name | Target Operations | Component | Comments
Intel® Math Kernel Library (Intel® MKL) DGEMM | dgemm | Double-precision floating point | VFU | For the processor, micperf provides a MCDRAM and DDR version
Intel MKL SGEMM | sgemm | Single-precision floating point | VFU | For the processor, micperf provides a MCDRAM and DDR version
Intel MKL SMP Linpack | linpack | Double-precision floating point | VFU |
SHOC Download* | shoc download | Bus transfer host to device | PCIe* bus | Only available for the coprocessor
SHOC Readback* | shoc readback | Bus transfer device to host | PCIe bus | Only available for the coprocessor
STREAM* | stream | Round-trip memory to registers | MCDRAM, GDDR and caches | For the processor, micperf provides a MCDRAM and DDR version
HPLinpack* | hplinpack | Double-precision floating point | VFU | Only available for the processor
HPCG* | hpcg | Double-precision floating point | VFU | Only available for the processor; requires Intel® MPI Library
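As a usage illustration only, a benchmark is selected by the CLI name listed in the table above. The launcher name micprun and its -k option below are assumptions based on the micperf package and should be verified against the micperf documentation and man pages installed with Intel MPSS:

$ micprun -k sgemm    # hypothetical invocation running the Intel MKL SGEMM benchmark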

Note: the Intel MPSS download files for Linux marked “.gz” should end in “.gz” when downloaded; most browsers leave the extension alone, but Windows Explorer* may rename the files. If this affects you, we recommend renaming the file to the proper extension after downloading.

‍‍Getting notified of future updates

If you want to receive updates when we publish a new Intel MPSS 4.x stack, add a comment at the bottom of this page.

‍‍Release support schedule?

The following table shows when releases were issued and when Intel will no longer support them. Releases with a strikethrough are no longer supported. For an overview of Intel's release structure and support length, please see this article.

Downloads

There are currently two major releases available for the Intel MPSS 4.x. The most recent major release is 4.4.x.

We recommend that new adopters start by using the 4.4 release. Support for each Intel MPSS release ends 6 months from the date it was posted, except for long-term support products.

 

Intel MPSS 4.4.0 HotFix 1 release for Linux

Intel® Manycore Platform Software Stack version | Downloads available | Size (range) | MD5 Checksum
mpss-4.4.0 Hotfix 1 (released: May 8, 2017) | RedHat 7.3 | 214MB | 8a015c38379b8be42c8045d3ceb44545
 | RedHat 7.2 | 214MB | 694b7b908c12061543d2982750985d8b
 | SuSE 12.2 | 213MB | 506ab12af774f78fa8e107fd7a4f96fd
 | SuSE 12.1 | 213MB | b8520888954e846e8ac8604d62a9ba96
 | SuSE 12.0 | 213MB | 88a3a4415afae1238453ced7a0df28ea
 | Card installer file (mpss-4.4.0-card.tar) | 761MB | d26e26868297cea5fd4ffafe8d78b66e
 | Source file (mpss-4.4.0-card-source.tar) | 514MB | 127713d06496090821b5bb3613c95b30
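After downloading a package, you can verify its integrity against the MD5 values listed above; the file name below is one entry from the table, so substitute the file you actually downloaded. The printed checksum must match the value in the table:

$ md5sum mpss-4.4.0-card.tar
d26e26868297cea5fd4ffafe8d78b66e  mpss-4.4.0-card.tar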

 

Documentation link | Description | Last Updated On | Size (approx)
releasenotes-linux.txt | Release Notes (English) | May 2017 | 15KB
README.txt | Readme (includes installation instructions) for Linux (English) | May 2017 | 17KB
MPSS_Users_Guide.pdf | MPSS User's guide | May 2017 | 3MB
EULA.txt | End User License Agreement (IMPORTANT: Read Before Downloading, Installing, or Using) | May 2017 | 33KB
   

 

 

 

Intel MPSS 4.4.0 HotFix 1 release for Microsoft Windows

Intel® Manycore Platform Software Stack version | Downloads available | Size | MD5 Checksum
64-bit Install Package (released May 8, 2017) | mpss-4.4.0-windows.zip | 1091MB | 204a65b36858842f472a37c77129eb53

 

Documentation link | Description | Last Updated On | Size
releaseNotes-windows.txt | English - release notes | May 2017 | 7KB
readme-windows.pdf | English - readme for Microsoft* Windows | May 2017 | 399KB
MPSS_Users_Guide-windows.pdf | MPSS User Guide for Windows | May 2017 | 3MB
EULA.txt | End User License Agreement (IMPORTANT: Read Before Downloading, Installing, or Using) | May 2017 | 33KB

 

‍‍Additional documentation

The Intel MPSS packages contain additional documentation for Linux: man pages and documents in /usr/share/doc/ (see myo, intel-coi-* and micperf-* directories). The Platform Control Panel User’s Guide is now in /usr/share/doc/systools/micmgmt/

Also, below is a link to the Intel® MPSS Performance Guide, which documents best-known methods for fine-tuning the Intel MPSS runtime environment to get the best application performance.

http://software.intel.com/sites/default/files/managed/72/db/mpss-performance-guide.pdf‍‍

‍‍Where to ask questions and get more information

The discussion forum at http://software.intel.com/en-us/forums/intel-many-integrated-core is available to join and discuss any enhancements or issues with Intel® MPSS.

Information about Intel MPSS security can be found here. 

You can also find support collaterals here or submit an issue.

Intel® Xeon Phi™ Coprocessor x200 Quick Start Guide


Introduction

This document introduces the basic concept of the Intel® Xeon Phi™ coprocessor x200 product family, tells how to install the coprocessor software stack, discusses the build environment, and points to important documents so that you can write code and run applications.

The Intel Xeon Phi coprocessor x200 is the second generation of the Intel Xeon Phi product family. Unlike the first generation running on an embedded Linux* uOS, this second generation supports the standard Linux kernel. The Intel Xeon Phi coprocessor x200 is designed for installation in a third-generation PCI Express* (PCIe*) slot of an Intel® Xeon® processor host. The following figure shows a typical configuration:

Figure 1

Benefits of the Intel Xeon Phi coprocessor:

  • System flexibility: Build a system that can support a wide range of applications, from serial to highly parallel, while leveraging code optimized for Intel Xeon processors or Intel Xeon Phi processors.
  • Maximize density: Gain significant performance improvements with limited acquisition cost by maximizing system density.
  • Upgrade path: Improve performance by adding to an Intel Xeon processor system or upgrading from the first generation of the Intel Xeon Phi product family with minimum code changes.

For workloads that fit within 16 GB coprocessor memory, adding a coprocessor to a host server allows customers to avoid costly networking. For workloads that have a significant portion of highly parallel phases, offload can offer significant performance with minimal code optimization investment.

Additional Documentation

Basic System Architecture

The Intel Xeon Phi coprocessor x200 is based on a modern Intel® Atom™ microarchitecture with considerable high performance computing (HPC)-focused performance improvements. It has up to 72 cores with four threads per core, giving a total of 288 CPUs as viewed by the operating system, and has up to 16 GB of high-bandwidth on-package MCDRAM memory that provides over 500 GB/s effective bandwidth. The coprocessor has an x16 PCI Express Gen3 interface (8 GT/s) to connect to the host system.

The cores are laid out in units called tiles. Each tile contains a pair of cores, a shared 1 MB L2 cache, and a hub connecting the tile to a mesh interface. Each core contains two 512-bit wide vector processing units. The coprocessor supports the Intel® AVX-512F (foundation), Intel AVX-512CD (conflict detection), Intel AVX-512PF (prefetching), and Intel AVX-512ER (exponential reciprocal) ISA.
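A quick way to confirm that these AVX-512 subsets are exposed by the kernel running on the coprocessor is to inspect the CPU flags on the coprocessor itself (a hedged illustration; the flag names follow the standard Linux /proc/cpuinfo convention, and the output is trimmed to the unique AVX-512 flags):

$ grep -o 'avx512[a-z]*' /proc/cpuinfo | sort -u
avx512cd
avx512er
avx512f
avx512pf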

Figure 2

Intel® Manycore Platform Software Stack

Intel® Manycore Platform Software Stack (Intel® MPSS) is the user and system software that allows programs to run on and communicate with the Intel Xeon Phi coprocessor, which runs a standard Linux kernel. Intel MPSS version 4.x.x is used for the Intel Xeon Phi coprocessor x200 and can be downloaded from here (https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-for-intel-xeon-phi-coprocessor-x200). (Note that the older Intel MPSS version 3.x.x is used for the Intel Xeon Phi coprocessor x100.)

You can download the Intel MPSS stack at https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-for-intel-xeon-phi-coprocessor-x200. The following host operating systems are supported: Red Hat* Enterprise Linux Server, SUSE* Linux Enterprise Server, and Microsoft Windows*. For detailed information on requirements and on installation, please consult the README file for Intel MPSS. The figure below shows a high-level representation of the Intel MPSS. The host software stack is on the left and the coprocessor software stack is on the right.

Figure 3

Install the Software Stack and Start the Coprocessor

Installation Guide for Linux* Host:

  1. From the “Intel Manycore Platform Software Stack for Intel Xeon Phi Coprocessor x200” page (https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-for-intel-xeon-phi-coprocessor-x200), navigate to the latest version of the Intel MPSS release for Linux and download “Readme for Linux (English)” (README.txt). Also download the release notes (releasenotes-linux.txt) and the User’s Guide for Intel MPSS.
  2. Install one of the following supported operating systems in the host:
    • Red Hat Enterprise Linux Server 7.2 64-bit kernel 3.10.0-327
    • Red Hat Enterprise Linux Server 7.3 64-bit kernel 3.10.0-514
    • SUSE Linux Enterprise Server SLES 12 kernel 3.12.28-4-default
    • SUSE Linux Enterprise Server SLES 12 SP1 kernel 3.12.49-11-default
    • SUSE Linux Enterprise Server SLES 12 SP2 kernel 4.4.21-69-default

    Be sure to install ssh, which is used to log in to the card.

    WARNING: On installing Red Hat, it may automatically update you to a new version of the Linux kernel. If this happens, you will not be able to use the prebuilt host driver, but will need to rebuild it manually for the new kernel version. Please see Section 5 in the readme.txt for instructions on building an Intel MPSS host driver for a specific Linux kernel.

  3. Log in as root.
  4. Download the release driver appropriate for the operating system you installed in Step 2 (<mpss-version>-linux.tar), where <mpss-version> is mpss-4.3.3 at the time this document was written.
  5. Install the host driver RPMs as detailed in Section 6 of readme.txt. Don’t skip the creation of configuration files for your coprocessor.
  6. Update the flash on your coprocessor(s) as detailed in Section 8 of readme.txt.
  7. Reboot the system.
  8. Start the Intel Xeon Phi coprocessor (you can set up the card to start with the host system; it will not do so by default), and then run micinfo to verify that it is set up properly:
    # systemctl start mpss
    # micctrl -w
    # /usr/bin/micinfo
    micinfo Utility Log
    Created On Mon Apr 10 12:14:08 2017
    
    System Info:
        Host OS                        : Linux
        OS Version                     : 3.10.0-327.el7.x86_64
        MPSS Version                   : 4.3.2.5151
        Host Physical Memory           : 128529 MB
    
    Device No: 0, Device Name: mic0 [x200]
    
    Version:
        SMC Firmware Version           : 121.27.10198
        Coprocessor OS Version         : 4.1.36-mpss_4.3.2.5151 GNU/Linux
        Device Serial Number           : QSKL64000441
        BIOS Version                   : GVPRCRB8.86B.0012.R02.1701111545
        BIOS Build date                : 01/11/2017
        ME Version                     : 3.2.2.4
    
    Board:
        Vendor ID                      : 0x8086
        Device ID                      : 0x2260
        Subsystem ID                   : 0x7494
        Coprocessor Stepping ID        : 0x01
        UUID                           : A03BAF9B-5690-E611-8D4F-001E67FC19A4
        PCIe Width                     : x16
        PCIe Speed                     : 8.00 GT/s
        PCIe Ext Tag Field             : Disabled
        PCIe No Snoop                  : Enabled
        PCIe Relaxed Ordering          : Enabled
        PCIe Max payload size          : 256 bytes
        PCIe Max read request size     : 128 bytes
        Coprocessor Model              : 0x57
        Coprocessor Type               : 0x00
        Coprocessor Family             : 0x06
        Coprocessor Stepping           : B0
        Board SKU                      : B0 SKU _NA_A
        ECC Mode                       : Enabled
        PCIe Bus Information           : 0000:03:00.0
        Coprocessor SMBus Address      : 0x00000030
        Coprocessor Brand              : Intel(R) Corporation
        Coprocessor Board Type         : 0x0a
        Coprocessor TDP                : 300.00 W
    
    Core:
        Total No. of Active Cores      : 68
        Threads per Core               : 4
        Voltage                        : 900.00 mV
        Frequency                      : 1.20 GHz
    
    Thermal:
        Thermal Dissipation            : Active
        Fan RPM                        : 6000
        Fan PWM                        : 100 %
        Die Temp                       : 38 C
    
    Memory:
        Vendor                         : INTEL
        Size                           : 16384.00 MB
        Technology                     : MCDRAM
        Speed                          : 6.40 GT/s
        Frequency                      : 6.40 GHz
        Voltage                        : Not Available

Installation Guide for Windows* Host:

  1. From the “Intel Manycore Platform Software Stack for Intel Xeon Phi Coprocessor x200” page (https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-for-intel-xeon-phi-coprocessor-x200), navigate to the latest version of the Intel MPSS release for Microsoft Windows. Download “Readme file for Microsoft Windows” (readme-windows.pdf). Also download the “Release notes” (releaseNotes-windows.txt) and the “Intel MPSS User’s Guide” (MPSS_Users_Guide-windows.pdf).
  2. Install one of the following supported operating systems in the host:
    • Microsoft Windows 8.1 (64-bit)
    • Microsoft Windows® 10 (64-bit)
    • Microsoft Windows Server 2012 R2 (64-bit)
    • Microsoft Windows Server 2016 (64-bit)
  3. Log in as “administrator”.
  4. Install .NET Framework* 4.5 or higher on the system (http://www.microsoft.com/net/download), Python* 2.7.5 x86-64 or higher (Python 3.x is not supported), Pywin32 build or higher (https://sourceforge.net/projects/pywin32).
  5. Be sure to install PuTTY* and PuTTYgen*, which are used to log in to the card’s OS.
  6. Follow the preliminary steps as instructed in Section 2.2.1 of the Readme file.
  7. Restart the system.
  8. Download the drivers package mpss-4.*-windows.zip for your Windows operating system from the page described in Step 1.
  9. Unzip the zip file to get the Windows exec files (“mpss-4.*.exe” and “mpss-essentials-4*.exe”).
  10. Install the Windows Installer file “mpss-4.*.exe” as detailed in Section 3.2 of the User’s Guide. Note that if a previous version of the Intel Xeon Phi coprocessor stack is already installed, use Windows Control Panel to uninstall it prior to installing the current version. By default, Intel MPSS is installed in “c:\Program Files\Intel\MPSS”. Also, install “mpss-essentials-4*.exe”, the native binary utilities for the Intel Xeon Phi coprocessor. These are required when using offload programming or cross compilers.
  11. Confirm that the new Intel MPSS stack is successfully installed by looking at Control Panel > Programs > Programs and Features: Intel Xeon Phi (see the following illustrations).

    Figure 4

  12. Update the flash according to Section 2.2.3 of the readme-windows.pdf file.
  13. Reboot the system.
  14. Log in to the host and verify that the Intel Xeon Phi x200 coprocessors are detected by the Device Manager (Control Panel > Hardware > Device Manager, and click “System devices”):

    Figure 5
  15. Start the Intel Xeon Phi coprocessor (you can set up the card to start with the host system; it will not do so by default). Launch a command-prompt window and start the Intel MPSS stack:
        prompt> micctrl --start
  16. Run the command “micinfo” to verify that it is set up properly:
        prompt> micinfo.exe

    Figure 6

Intel® Parallel Studio XE

After starting the Intel MPSS stack, users can write applications running on the coprocessor using Intel Parallel Studio XE.

Intel Parallel Studio XE is a software development suite that helps boost application performance by taking advantage of the ever-increasing processor core count and vector register width available in Intel Xeon processors, Intel Xeon Phi processors and coprocessors, and other compatible processors. Starting with the Intel Parallel Studio 2018 beta, the following Intel® products support program development on the Intel Xeon Phi coprocessor x200:

  • Intel® C Compiler/Intel® C++ Compiler/Intel® Fortran Compiler
  • Intel® Math Kernel Library (Intel® MKL)
  • Intel® Data Analytics Acceleration Library (Intel® DAAL)
  • Intel® Integrated Performance Primitives (Intel® IPP)
  • Intel® Cilk™ Plus
  • Intel® Threading Building Blocks (Intel® TBB)
  • Intel® VTune™ Amplifier XE
  • Intel® Advisor XE
  • Intel® Inspector XE
  • Intel® MPI Library
  • Intel® Trace Analyzer and Collector
  • Intel® Cluster Ready
  • Intel® Cluster Checker

To get started writing programs running on the coprocessor, you can get the code samples at https://software.intel.com/en-us/product-code-samples. The packages “Intel Parallel Studio XE for Linux - Sample Bundle”, and “Intel Parallel Studio XE for Windows - Sample Bundle” contain code samples for Linux and Windows, respectively.

Programming Models on Coprocessor

There are three programming models that can be used for the Intel Xeon Phi coprocessor x200: the offload programming model, the symmetric programming model, and the native programming model.

  • Offload programming: The main application runs on the host and offloads selected, highly parallel portions of the program to the coprocessor(s) to take advantage of the manycore architecture; the serial portion of the program still runs on the host to take advantage of the big-core architecture (see the sketch after this list).
  • Symmetric programming: The coprocessors and the host are treated as separate nodes. This model is suitable for distributed computing.
  • Native programming: The coprocessors are used as independent nodes, just like a host. Users compile the binary for the coprocessor on the host, transfer the binary, and log in to the coprocessor to run it.
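To make the offload model concrete, the fragment below uses the Intel compiler's offload pragma to run a loop on the first coprocessor. This is a minimal sketch: the array names, sizes, and the doubling operation are placeholders, and the in/out clauses must be adapted to real data.

#include <stdio.h>

#define N 1024

int main(void)
{
    float a[N], b[N];
    int i;

    for (i = 0; i < N; i++)
        a[i] = (float)i;

    /* The marked block executes on coprocessor mic:0; a is copied in, b is copied back out. */
    #pragma offload target(mic:0) in(a) out(b)
    {
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            b[i] = 2.0f * a[i];
    }

    printf("b[10] = %f\n", b[10]);
    return 0;
}

Compiled on the host with the Intel compiler (for example, with -qopenmp), the serial part runs on the host and only the marked region is offloaded.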

The figure below summarizes different programming models used for the Intel Xeon Phi coprocessor:

Figure 7

Call for submissions: Intel HPC Developer Conference


Please consider giving a talk, tutorial or presenting a poster at this year's Intel HPC Developer Conference (November 11-12, 2017 - just before SC17 in Denver).

Submissions will be reviewed and responded to in a rolling fashion - so submit soon! (Best to submit by July 20, but okay until August 18.)

Submit online: https://intelhpcdc2017cfa.hubb.me (full information on dates, topics, etc. is on that web site).

The prior Intel HPC Developer Conferences have been very well rated by attendees - and that is due to the high quality of speakers (talks, tutorials, panels, etc.) that we have enjoyed. We are adding poster sessions this year to open up more discussions with attendees.

Submissions of technical talks (30 minutes), tutorials (90, 120, or 180 minutes), and poster sessions are encouraged. Topics include Parallel Programming, AI (ML/HPDA), High Productivity Languages, Visualization (esp. Software Defined Visualization and In Situ Visualization), Enterprise, and Systems.

We expect to have another great conference this year - and we know that rests on the high quality presenters. We look forward to your submissions.  Feel free to drop me a note if you have any questions - or simply put in your proposal online, and put any questions in with your submission (we can talk!).

 

CPUs are set to dominate high end visualization


It is certainly provocative to say that CPUs will dominate any part of visualization - but I say it with confidence that the data supports why this is happening. The primary drivers are (1) data sizes, (2) minimizing data movement, and (3) the ability to change to O(n log n) algorithms. Couple that with the ultra-hot topic of "Software Defined Visualization" that makes these three things possible - and you have a lot to consider about how the world is changing.

Of course, what is "high end" today often becomes commonplace over time... so this trend may affect us all eventually. It's at least worth understanding the elements at play.

At ISC17, in Germany, this week (June 19-21) Intel is demoing (and selling) their vision of a “dream machine” for doing software defined visualization with a special eye towards in situ visualization development. Jim Jeffers, Intel, and friends are demonstrating it at ISC'17 in Germany, and they will be at SIGGRAPH'17 too. The "dream machine" can support visualization of data sets up to 1.5TB in size. They designed it to address the needs of the scientific visualization and professional rendering markets.

Photo credit (above): Asteroid Deep Water Impact Analysis; Data Courtesy: John Patchett, Galen Glisner per Los Alamos National Laboratory tech report LA-UR-17-21595. Visualization: Carson Brownlee, Intel.

With Jim's help, I wrote an article with more information about how CPUs now offer higher performance and lower cost than competing GPU-based solutions for the largest visualization tasks. The full article is posted with coverage at the TechEnablement site.

In the full article, aside from writing about the trend, I provide links to technical papers that show this trend toward CPUs as the preferred solution for visualization of large data (really, really big), as well as links to conferences and links about the "visualization dream machine" (how I describe it, not what Intel calls it officially).

Dream Machine for Software Defined Visualization

Photo: Intel/Colfax Visualization "Dream" Machine

Using Intel® MPI Library on Intel® Xeon Phi™ Product Family


Introduction

The Message Passing Interface (MPI) standard is a message-passing library specification, a collection of routines used in distributed-memory parallel programming. This document is designed to help users get started writing code and running MPI applications using the Intel® MPI Library on a development platform that includes the Intel® Xeon Phi™ processor or coprocessor. The Intel MPI Library is a multi-fabric message passing library that implements the MPI-3.1 specification (see Table 1).

In this document, the Intel MPI Library 2017 and 2018 Beta for Linux* OS are used.

Table 1. Intel® MPI Library at a glance

Processors

Intel® processors, coprocessors, and compatibles

Languages

Natively supports C, C++, and Fortran development

Development Environments

Microsoft Visual Studio* (Windows*), Eclipse*/CDT* (Linux*)

Operating Systems

Linux and Windows

Interconnect Fabric Support

Shared memory
RDMA-capable network fabrics through DAPL* (for example, InfiniBand*, Myrinet*)
Intel® Omni-Path Architecture
Sockets (for example, TCP/IP over Ethernet, Gigabit Ethernet*) and others.

This document summarizes the steps to build and run an MPI application on an Intel® Xeon Phi™ processor x200, on an Intel® Xeon Phi™ coprocessor x200, and on an Intel® Xeon Phi™ coprocessor x100, natively or symmetrically. First, we introduce the Intel Xeon Phi processor x200 product family, the Intel Xeon Phi processor x100 product family, and the MPI programming models.

Intel® Xeon Phi™ Processor Architecture

Intel Xeon Phi processor x200 product family architecture: There are two versions of this product. The processor version is the host processor and the coprocessor version requires an Intel® Xeon® processor host. Both versions share the architecture below (see Figure 1):

  • Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
  • Up to 72 cores with 2D mesh architecture
  • Each core has two 512-bit vector processing units (VPUs) and four hardware threads
  • Each pair of cores (tile) shares 1 MB L2 cache
  • 8 or 16 GB high-bandwidth on package memory (MCDRAM)
  • 6 channels DDR4, up to 384 GB (available in the processor version only)
  • For the coprocessor, the third-generation PCIe* is connected to the host


Figure 1. Intel® Xeon Phi™ processor x200 architecture.

To enable the functionalities of the Intel Xeon Phi processor x200, you need to download and install the Intel Xeon Phi processor software available here.

The Intel Xeon Phi coprocessor x200 attaches to an Intel Xeon processor-based host via a third-generation PCIe interface. The coprocessor runs on a standard Linux OS. It can be used as an extension to the host (so the host can offload the workload) or as an independent compute node. The first step to bring an Intel Xeon Phi coprocessor x200 into service is to install the Intel® Manycore Platform Software Stack (Intel® MPSS) 4.x on the host, which is available here. The Intel MPSS is a collection of software including device drivers, coprocessor management utilities, and the Linux OS for the coprocessor.

Intel Xeon Phi coprocessor x100 architecture: the Intel Xeon Phi coprocessor x100 is the first-generation of the Intel Xeon Phi product family. The coprocessor attaches to an Intel Xeon processor-based host via a second-generation PCIe interface. It runs on an OS separate from the host and has the following architecture (see Figure 2):

  • Intel® Initial Many Core Instructions
  • Up to 61 cores with high-bandwidth, bidirectional ring interconnect architecture
  • Each core has a 512-bit wide VPU and four hardware threads
  • Each core has a private 512-KB L2 cache
  • 16 GB GDDR5 memory
  • The second-generation PCIe is connected to the host


Figure 2. Intel® Xeon Phi™ processor x100 architecture.

To bring the Intel Xeon Phi coprocessor x100 into service, you must install the Intel MPSS 3.x on the host, which can be downloaded here.

MPI Programming Models

The Intel MPI Library supports the following MPI programming models (see Figure 3):

  • Host-only model (Intel Xeon processor or Intel Xeon Phi processor): In this mode, all MPI ranks reside and execute the workload on the host CPU only (or Intel Xeon Phi processor only).
  • Offload model: In this mode, the MPI ranks reside solely on the Intel Xeon processor host. The MPI ranks use offload capabilities of the Intel® C/C++ Compiler or Intel® Fortran Compiler to offload some workloads to the coprocessors. Typically, one MPI rank is used per host, and the MPI rank offloads to the coprocessor(s).
  • Coprocessor-only model: In this native mode, the MPI ranks reside solely inside the coprocessor. The application can be launched from the coprocessor.
  • Symmetric model: In this mode, the MPI ranks reside on the host and the coprocessors. The application can be launched from the host.

Figure 3. MPI programming models.

Using the Intel® MPI Library

This section shows how to build and run an MPI application in the following configurations: on an Intel Xeon Phi processor x200, on a system with one or more Intel Xeon Phi coprocessor x200, and on a system with one or more Intel Xeon Phi coprocessor x100 (see Figure 4).


Figure 4. Different configurations: (a) standalone Intel® Xeon Phi™ processor x200, (b) Intel Xeon Phi coprocessor x200 connected to a system with an Intel® Xeon® processor, and (c) Intel® Xeon Phi™ coprocessor x100 connected to a system with an Intel Xeon processor.

Installing the Intel® MPI Library

The Intel MPI Library is packaged as a standalone product or as a part of the Intel® Parallel Studio XE Cluster Edition.

By default, the Intel MPI Library will be installed in the path /opt/intel/impi on the host or the Intel Xeon Phi processor. To start, follow the appropriate directions to install the latest versions of the Intel C/C++ Compiler and the Intel Fortran Compiler.

You can purchase or try the free 30-day evaluation of the Intel Parallel Studio XE from https://software.intel.com/en-us/intel-parallel-studio-xe. These instructions assume that you have the Intel MPI Library tar file - l_mpi_<version>.<package_num>.tgz. This is the latest stable release of the library at the time of writing this article. To check if a newer version exists, log into the Intel® Registration Center. The instructions below are valid for all current and subsequent releases.

As root user, untar the tar file l_mpi_<version>.<package_num>.tgz:

# tar -xzvf l_mpi_<version>.<package_num>.tgz
# cd l_mpi_<version>.<package_num>

Execute the install script on the host and follow the instructions. The installation will be placed in the default installation directory /opt/intel/impi/<version>.<package_num> assuming you are installing the library with root permission.

# ./install.sh

Compiling an MPI program

To compile an MPI program on the host or on an Intel Xeon Phi processor x200:

Before compiling an MPI program, you need to establish the proper environment settings for the compiler and for the Intel MPI Library:

$ source /opt/intel/compilers_and_libraries_<version>/linux/bin/compilervars.sh intel64
$ source /opt/intel/impi/<version>.<package_num>/bin64/mpivars.sh

or if you installed the Intel® Parallel Studio XE Cluster Edition, you can simply source the configuration script:

$ source /opt/intel/parallel_studio_xe_<version>/psxevars.sh intel64

Compile and link your MPI program using an appropriate compiler command:

To compile and link with the Intel MPI Library, use the appropriate commands from Table 2.

Table 2. MPI compilation Linux* command.

Programming Language | MPI Compilation Linux* Command
C | mpiicc
C++ | mpiicpc
Fortran 77 / 95 | mpiifort

For example, to compile the C program for the host, you can use the wrapper mpiicc:

$ mpiicc ./myprogram.c -o myprogram

To compile the program for the Intel Xeon Phi processor x200 and the Intel Xeon Phi coprocessor x200, add the -xMIC-AVX512 flag to take advantage of the Intel AVX-512 instruction set architecture (ISA) available on this architecture. For example, the following command compiles a C program for the Intel Xeon Phi product family x200 using the Intel AVX-512 ISA:

$ mpiicc -xMIC-AVX512 ./myprogram.c -o myprogram.knl

To compile the program for the Intel Xeon Phi coprocessor x100, add the -mmic flag. The following command shows how to compile a C program for the Intel Xeon Phi coprocessor x100:

$ mpiicc -mmic ./myprogram.c -o myprogram.knc

Running an MPI program on the Intel Xeon Phi processor x200

To run the application on the Intel Xeon Phi processor x200, use the script mpirun:

$ mpirun -n <# of processes> ./myprogram.knl

where n is the number of MPI processes to launch on the processor.

Running an MPI program on the Intel Xeon Phi coprocessor x200 and Intel Xeon Phi coprocessor x100

To run an application on the coprocessors, the following steps are needed:

  • Start the MPSS service if it was stopped previously:

    $ sudo systemctl start mpss

  • Transfer the MPI executable from the host to the coprocessor. For example, use the scp utility to transfer the executable for the Intel Xeon Phi coprocessor x100 to the coprocessor named mic0:

    $ scp myprogram.knc mic0:~/myprogram.knc

  • Transfer the MPI libraries and compiler libraries to the coprocessors: before the first run of an MPI application on the Intel Xeon Phi coprocessors, you need to copy the appropriate MPI and compiler libraries to each coprocessor installed in the system. For the coprocessor x200, the libraries under the /lib64 directory are transferred; for the coprocessor x100, the libraries under the /mic directory are transferred.

For example, we issue the copy to the first coprocessor x100, called mic0, which is accessible via the IP address 172.31.1.1. Note that all coprocessors have unique IP addresses since they are treated as just other uniquely addressable machines. You can refer to the first coprocessor as mic0 or by its IP address.

# sudo scp /opt/intel/impi/2017.3.196/mic/bin/* mic0:/bin/
# sudo scp /opt/intel/impi/2017.3.196/mic/lib/* mic0:/lib64/
# sudo scp /opt/intel/composer_xe_2017.3.196/compiler/lib/mic/* mic0:/lib64/

Instead of copying the MPI and compiler libraries manually, you can also run the script shown below to transfer them to the two coprocessors mic0 and mic1:

#!/bin/sh

export COPROCESSORS="mic0 mic1"
export BINDIR="/opt/intel/impi/2017.3.196/mic/bin"
export LIBDIR="/opt/intel/impi/2017.3.196/mic/lib"
export COMPILERLIB="/opt/intel/compilers_and_libraries_2017/linux/lib/mic"

for coprocessor in `echo $COPROCESSORS`
do
   for prog in mpiexec mpiexec.hydra pmi_proxy mpirun
   do
      sudo scp $BINDIR/$prog $coprocessor:/bin/$prog
   done

   for lib in libmpi.so.12 libmpifort.so.12 libmpicxx.so.12
   do
      sudo scp $LIBDIR/$lib $coprocessor:/lib64/$lib
   done

   for lib in libimf.so libsvml.so libintlc.so.5
   do
      sudo scp $COMPILERLIB/$lib $coprocessor:/lib64/$lib
   done
done

Script used for transferring MPI libraries to two coprocessors.

Another approach is to NFS mount the coprocessors’ file system from the host so that the coprocessors can have access to their MPI libraries from there. One advantage of using NFS mounts is that it saves RAM space on the coprocessors. The details on how to set up NFS mounts can be found in the first example in this document.

To run the application natively on the coprocessor, log in to the coprocessor and then run the mpirun script:

$ ssh mic0
$ mpirun -n <# of processes> ./myprogram.knc

where n is the number of MPI processes to launch on the coprocessor.

Finally, to run an MPI program from the host (symmetrically), additional steps are needed:

Set the Intel MPI environment variable I_MPI_MIC to let the Intel MPI Library recognize the coprocessors:

$ export I_MPI_MIC=enable

Disable the firewall in the host:

$ systemctl status firewalld
$ sudo systemctl stop firewalld

For multi-card use, configure Intel MPSS peer-to-peer so that each card can ping others:

$ sudo /sbin/sysctl -w net.ipv4.ip_forward=1

If you want to get debug information, include the flags -verbose and -genv I_MPI_DEBUG=n when running the application.
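For example, a symmetric launch of the binaries built above with verbose launcher output and MPI debug level 5 might look like the following (the rank counts are placeholders):

$ export I_MPI_MIC=enable
$ mpirun -verbose -genv I_MPI_DEBUG=5 -host localhost -n 2 ./myprogram : -host mic0 -n 2 ~/myprogram.knl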

The following sections include sample MPI programs written in C. The first example shows how to compile and run a program for Intel Xeon Phi processor x200 and for Intel Xeon Phi coprocessor x200. The second example shows how to compile and run a program for Intel Xeon Phi coprocessor x100.

Example 1

For illustration purposes, this example shows how to build and run an Intel MPI application in symmetric mode on a host that connects to two Intel Xeon Phi coprocessors x200. Note that the driver Intel MPSS 4.x should be installed on the host to enable the Intel Xeon Phi coprocessor x200.

In this example, use the integral representation below to calculate Pi (π):

π = ∫₀¹ 4 / (1 + x²) dx

Appendix A includes the implementation program. The workload is divided among the MPI ranks. Each rank spawns a team of OpenMP* threads, and each thread works on a chunk of the workload to take advantage of vectorization. First, compile and run this application on the Intel Xeon processor host. Since this program uses OpenMP, you need to compile the program with OpenMP libraries. Note that the Intel Parallel Studio XE 2018 is used in this example.

Set the environment variables, compile the application for the host, and then generate the optimization report on vectorization and OpenMP:

$ source /opt/intel/compilers_and_libraries_2018/linux/bin/compilervars.sh intel64
$ mpiicc mpitest.c -qopenmp -O3 -qopt-report=5 -qopt-report-phase:vec,openmp -o mpitest

To run two ranks on the host:

$ mpirun -host localhost -n 2 ./mpitest
Hello world: rank 0 of 2 running on knl-lb0.jf.intel.com
Hello world: rank 1 of 2 running on knl-lb0.jf.intel.com
FROM RANK 1 - numthreads = 32
FROM RANK 0 - numthreads = 32

Elapsed time from rank 0:    8246.90 (usec)
Elapsed time from rank 1:    8423.09 (usec)
rank 0 pi=   3.141613006592

Next, compile the application for the Intel Xeon Phi coprocessor x200 and transfer the executable to the coprocessors mic0 and mic1 (assuming you have already set up passwordless SSH access to the coprocessors).

$ mpiicc mpitest.c -qopenmp -O3 -qopt-report=5 -qopt-report-phase:vec,openmp -xMIC-AVX512 -o mpitest.knl
$ scp mpitest.knl mic0:~/.
$ scp mpitest.knl mic1:~/.

Enable MPI for the coprocessors and disable the firewall in the host:

$ export I_MPI_MIC=enable
$ sudo systemctl stop firewalld

This example also shows how to mount a shared directory using the Network File System (NFS). As root, you mount the /opt/intel directory where the Intel C++ Compiler and Intel MPI are installed. First, add descriptors in the /etc/exports configuration file on the host to share the directory /opt/intel with the coprocessors, whose IP addresses are 172.31.1.1 and 172.31.2.1, with read-only (ro) privilege.

[host~]# cat /etc/exports
/opt/intel 172.31.1.1(ro,async,no_root_squash)
/opt/intel 172.31.2.1(ro,async,no_root_squash)

Update the NFS export table and restart the NFS server in the host:

[host~]# exportfs -a
[host~]# service nfs restart

Next, log in on the coprocessors and create the mount point /opt/intel:

[host~]# ssh mic0
mic0:~# mkdir /opt
mic0:~# mkdir /opt/intel

 

Insert the descriptor “172.31.1.254:/opt/intel /opt/intel nfs defaults 1 1” into the /etc/fstab file in mic0:

mic0:~# cat /etc/fstab
/dev/root            /                    auto       defaults              1  1
proc                 /proc                proc       defaults              0  0
devpts               /dev/pts             devpts     mode=0620,gid=5       0  0
tmpfs                /run                 tmpfs      mode=0755,nodev,nosuid,strictatime 0  0
tmpfs                /var/volatile        tmpfs      defaults,size=85%     0  0
172.31.1.254:/opt/intel /opt/intel nfs defaults                            1  1

Finally, mount the shared directory /opt/intel on the coprocessor:

mic0:~# mount -a

Repeat this procedure for mic1 with this descriptor “172.31.2.254:/opt/intel /opt/intel nfs defaults 1 1” added to the /etc/fstab file in mic1.

Make sure that mic0 and mic1 are included in the /etc/hosts file:

$ cat /etc/hosts
127.0.0.1       localhost
::1             localhost
172.31.1.1      mic0
172.31.2.1      mic1

$ mpirun -host localhost -n 1 ./mpitest : -host mic0 -n 1 ~/mpitest.knl : -host mic1 -n 1 ~/mpitest.knl
Hello world: rank 0 of 3 running on knl-lb0
Hello world: rank 1 of 3 running on mic0
Hello world: rank 2 of 3 running on mic1
FROM RANK 0 - numthreads = 64
FROM RANK 2 - numthreads = 272
FROM RANK 1 - numthreads = 272
Elapsed time from rank 0:   12114.05 (usec)
Elapsed time from rank 1:  136089.09 (usec)
Elapsed time from rank 2:  125049.11 (usec)
rank 0 pi=   3.141597270966

By default, the maximum number of hardware threads available on each compute node is used. However, you can change this default behavior by using the local -env option to set environment variables for that compute node. For example, to set the number of OpenMP threads on mic0 to 68 and set the compact affinity, you can use the command:

$ mpirun -host localhost -n 1 ./mpitest : -host mic0 -n 1 -env OMP_NUM_THREADS=68 -env KMP_AFFINITY=compact ~/mpitest : -host mic1 -n 1 ~/mpitest
Hello world: rank 0 of 3 running on knl-lb0.jf.intel.com
Hello world: rank 1 of 3 running on mic0
Hello world: rank 2 of 3 running on mic1
FROM RANK 0 - numthreads = 64
FROM RANK 1 - numthreads = 68
FROM RANK 2 - numthreads = 272
Elapsed time from rank 0:   11068.11 (usec)
Elapsed time from rank 1:   57780.98 (usec)
Elapsed time from rank 2:  133417.13 (usec)
rank 0 pi=   3.141597270966

To simplify the launch process, define a file with all the machine names, give all the executables the same name, and move them to a predefined directory. For example, all executables are named mpitest and are located in the user home directories:

$ cat hosts_file
knl-lb0:1
mic0:2
mic1:2

$ mpirun -machinefile hosts_file -n 5 ~/mpitest
Hello world: rank 0 of 5 running on knl-lb0
Hello world: rank 1 of 5 running on mic0
Hello world: rank 2 of 5 running on mic0
Hello world: rank 3 of 5 running on mic1
Hello world: rank 4 of 5 running on mic1
FROM RANK 0 - numthreads = 64
FROM RANK 1 - numthreads = 136
FROM RANK 3 - numthreads = 136
FROM RANK 2 - numthreads = 136
FROM RANK 4 - numthreads = 136
Elapsed time from rank 0:   11260.03 (usec)
Elapsed time from rank 1:   71480.04 (usec)
Elapsed time from rank 2:   69352.15 (usec)
Elapsed time from rank 3:   74187.99 (usec)
Elapsed time from rank 4:   67718.98 (usec)
rank 0 pi=   3.141598224640

 

Example 2

Example 2 shows how to build and run an MPI application in symmetric model on a host that connects to two Intel Xeon Phi coprocessors x100. Note that the driver Intel MPSS 3.x should be installed for the Intel Xeon Phi coprocessor x100.

The sample program estimates Pi (π) using a Monte Carlo method. Consider a sphere centered at the origin and circumscribed by a cube. The sphere’s radius is r and the cube edge length is 2r. The volumes of the sphere and the cube are given by:

V_sphere = (4/3)·π·r³     V_cube = (2r)³ = 8·r³

The first octant of the coordinate system contains one eighth of the volumes of both the sphere and the cube; the volumes in that octant are given by:

V_sphere / 8 = π·r³ / 6     V_cube / 8 = r³

If we generate Nc points uniformly and randomly in the cube within this octant, we expect that about Ns points will be inside the sphere’s volume according to the following ratio:

Ns / Nc ≈ (π·r³ / 6) / r³ = π / 6

Therefore, the estimated Pi (π) is calculated by

π ≈ 6 · Ns / Nc

where Nc is the number of points generated in the portion of the cube residing in the first octant, and Ns is the total number of points found inside the portion of the sphere residing in the first octant.

In the implementation, rank 0 (process) is responsible for dividing the work among the other n ranks. Each rank is assigned a chunk of work, and the summation is used to estimate the number Pi. Rank 0 divides the x-axis into n equal segments. Each rank generates (Nc /n) points in the assigned segment, and then computes the number of points in the first octant of the sphere (see Figure 5).


Figure 5. Each MPI rank handles a different portion in the first octant.

The pseudo code is shown below:

Rank 0 generates n random seeds
Rank 0 broadcasts all random seeds to the n ranks
For each rank i in [0, n-1]
    receive the corresponding seed
    set num_inside = 0
    For j = 0 to Nc / n
        generate a point with coordinates
            x between [i/n, (i+1)/n]
            y between [0, 1]
            z between [0, 1]
        compute the distance d = x^2 + y^2 + z^2
        if distance d <= 1, increment num_inside
    Send num_inside back to rank 0
Rank 0 sets Ns to the sum of all num_inside
Rank 0 computes Pi = 6 * Ns / Nc
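A minimal C sketch of the per-rank loop from the pseudo code above (this is not the Appendix B program; rand_r with a per-rank seed and the helper name count_inside are illustrative choices):

#include <stdlib.h>

/* Count the points that fall inside the unit sphere for rank 'rank' of 'nranks',
   generating 'chunk' points in the x-slice [rank/nranks, (rank+1)/nranks]. */
long count_inside(int rank, int nranks, long chunk, unsigned int seed)
{
    long   j, num_inside = 0;
    double xlo = (double)rank / nranks;
    double xhi = (double)(rank + 1) / nranks;
    double x, y, z;

    for (j = 0; j < chunk; j++)
    {
        x = xlo + (xhi - xlo) * rand_r(&seed) / (double)RAND_MAX;
        y = rand_r(&seed) / (double)RAND_MAX;
        z = rand_r(&seed) / (double)RAND_MAX;
        if (x * x + y * y + z * z <= 1.0)   /* is the point inside the unit sphere? */
            num_inside++;
    }
    return num_inside;   /* rank 0 sums these counts and computes pi = 6.0 * Ns / Nc */
}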

To build the application montecarlo.knc for the Intel Xeon Phi coprocessor x100, the Intel C++ Compiler 2017 is used. Appendix B includes the implementation program. Note that this example simply shows how to run the code on an Intel Xeon Phi coprocessor x100; you can optimize the sample code for further improvement.

$ source /opt/intel/compilers_and_libraries_2017/linux/bin/compilervars.sh intel64
$ mpiicc -mmic montecarlo.c -o montecarlo.knc

Build the application for the host:

$ mpiicc montecarlo.c -o montecarlo

Transfer the application montecarlo.knc to the /tmp directory on the coprocessors using the scp utility. In this example, we issue the copy to two Intel Xeon Phi coprocessors x100.

$ scp ./montecarlo.knc mic0:/tmp/montecarlo.knc
montecarlo.knc     100% 17KB 16.9KB/s 00:00
$ scp ./montecarlo.knc mic1:/tmp/montecarlo.knc
montecarlo.knc     100% 17KB 16.9KB/s 00:00

Transfer the MPI libraries and compiler libraries to the coprocessors using the script shown earlier (“Script used for transferring MPI libraries to two coprocessors”). Enable the MPI communication between the host and the Intel Xeon Phi coprocessors x100:

$ export I_MPI_MIC=enable

Run the mpirun script to start the application. The flag -n specifies the number of MPI processes and the flag -host specifies the machine name:

$ mpirun -n <# of processes> -host <hostname> <application>

We can run the application on multiple hosts by separating them with “:”. The first MPI rank (rank 0) always starts on the first part of the command:

$ mpirun -n <# of processes> -host <hostname1> <application> : -n <# of processes> -host <hostname2> <application>

This starts the rank 0 on hostname1 and other ranks on hostname2.

Now run the application on the host. The mpirun command shown below starts the application with 2 ranks on the host, 3 ranks on the coprocessor mic0, and 5 ranks on coprocessor mic1:

$ mpirun -n 2 -host localhost ./montecarlo : -n 3 -host mic0 /tmp/montecarlo.knc \
: -n 5 -host mic1 /tmp/montecarlo.knc

Hello world: rank 0 of 10 running on knc0
Hello world: rank 1 of 10 running on knc0
Hello world: rank 2 of 10 running on knc0-mic0
Hello world: rank 3 of 10 running on knc0-mic0
Hello world: rank 4 of 10 running on knc0-mic0
Hello world: rank 5 of 10 running on knc0-mic1
Hello world: rank 6 of 10 running on knc0-mic1
Hello world: rank 7 of 10 running on knc0-mic1
Hello world: rank 8 of 10 running on knc0-mic1
Hello world: rank 9 of 10 running on knc0-mic1
Elapsed time from rank 0:      13.87 (sec)
Elapsed time from rank 1:      14.01 (sec)
Elapsed time from rank 2:     195.16 (sec)
Elapsed time from rank 3:     195.17 (sec)
Elapsed time from rank 4:     195.39 (sec)
Elapsed time from rank 5:     195.07 (sec)
Elapsed time from rank 6:     194.98 (sec)
Elapsed time from rank 7:     223.32 (sec)
Elapsed time from rank 8:     194.22 (sec)
Elapsed time from rank 9:     193.70 (sec)
Out of 4294967295 points, there are 2248849344 points inside the sphere => pi=  3.141606330872

A shorthand way of doing this in symmetric mode is to use the -machinefile option for the mpirun command in coordination with the I_MPI_MIC_POSTFIX environment variable. In this case, make sure all executables are in the same location on the host and mic0 and mic1 cards.

The I_MPI_MIC_POSTFIX environment variable simply tells the library to add the .knc postfix when running on the cards (since the executables there are called montecarlo.knc).

$ export I_MPI_MIC_POSTFIX=.knc

Now set the rank mapping in your hosts file (by using the <host>:<#_ranks> format):

$ cat hosts_file
localhost:2
mic0:3
mic1:5

And run your executable:

$ mpirun -machinefile hosts_file /tmp/montecarlo

The nice thing about this syntax is that you only have to edit the hosts_file when deciding to change your number of ranks or need to add more cards.

As an alternative, you can ssh to a coprocessor and launch the application from there:

$ ssh mic0
$ mpirun -n 3 /tmp/montecarlo.knc
Hello world: rank 0 of 3 running on knc0-mic0
Hello world: rank 1 of 3 running on knc0-mic0
Hello world: rank 2 of 3 running on knc0-mic0
Elapsed time from rank 0:     650.47 (sec)
Elapsed time from rank 1:     650.61 (sec)
Elapsed time from rank 2:     648.01 (sec)
Out of 4294967295 points, there are 2248795855 points inside the sphere => pi=  3.141531467438

 

Summary

This document showed you how to compile and run simple MPI applications in symmetric model. In a heterogeneous computing system, the performance of each computational unit is different, and this system behavior leads to the load imbalance problem. The Intel® Trace Analyzer and Collector can be used to analyze and understand the behavior of a complex MPI program running on a heterogeneous system. Using the Intel Trace Analyzer and Collector, you can quickly identify bottlenecks, evaluate load balancing, analyze performance, and identify communication hotspots. This powerful tool is essential for debugging and improving the performance of an MPI program running on a cluster with multiple computational units. For more details on using the Intel Trace Analyzer and Collector, read the whitepaper “Understanding MPI Load Imbalance with Intel® Trace Analyzer and Collector” available on /mic-developer. For more details, tips and tricks, and known workarounds, visit our Intel® Cluster Tools and the Intel® Xeon Phi™ Coprocessors page.

References

Appendix A

The code of the first sample program is shown below.

/*
 *  Copyright (c) 2017 Intel Corporation. All Rights Reserved.
 *
 *  Portions of the source code contained or described herein and all documents related
 *  to portions of the source code ("Material") are owned by Intel Corporation or its
 *  suppliers or licensors.  Title to the Material remains with Intel
 *  Corporation or its suppliers and licensors.  The Material contains trade
 *  secrets and proprietary and confidential information of Intel or its
 *  suppliers and licensors.  The Material is protected by worldwide copyright
 *  and trade secret laws and treaty provisions.  No part of the Material may
 *  be used, copied, reproduced, modified, published, uploaded, posted,
 *  transmitted, distributed, or disclosed in any way without Intel's prior
 *  express written permission.
 *
 *  No license under any patent, copyright, trade secret or other intellectual
 *  property right is granted to or conferred upon you by disclosure or
 *  delivery of the Materials, either expressly, by implication, inducement,
 *  estoppel or otherwise. Any license under such intellectual property rights
 *  must be express and approved by Intel in writing.
 */
//******************************************************************************
// Content: (version 1.0)
//      Calculate the number PI using its integral representation.
//
//******************************************************************************
#include <stdio.h>
#include <omp.h>   /* needed for omp_get_num_threads() */
#include "mpi.h"

#define MASTER 0
#define TAG_HELLO 1
#define TAG_TIME 2

const long ITER = 1024 * 1024;
const long SCALE = 16;
const long NUM_STEP = ITER * SCALE;

float calculate_partialPI(int n, int num) {
   unsigned long i;
   int  numthreads;
   float x, dx, pi = 0.0f;

   #pragma omp parallel
   #pragma omp master
   {
      numthreads = omp_get_num_threads();
      printf("FROM RANK %d - numthreads = %d\n", n, numthreads);
   }

   dx = 1.0 / NUM_STEP;

   unsigned long NUM_STEP1 = NUM_STEP / num;
   unsigned long begin = n * NUM_STEP1;
   unsigned long end = (n + 1) * NUM_STEP1;
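   // Midpoint-rule approximation of pi = integral from 0 to 1 of 4/(1+x^2) dx:
   // each MPI rank sums only its own [begin, end) slice of the NUM_STEP intervals,
   // and the per-rank partial sums are later combined with MPI_Reduce in main().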
   #pragma omp parallel for reduction(+:pi)
   for (i = begin; i < end; i++)
   {
      x = (i + 0.5f) / NUM_STEP;
      pi += (4.0f * dx) / (1.0f + x*x);
   }

   return pi;
}

int main(int argc, char **argv)
{
   float pi1, total_pi;
   double startprocess;
   int i, id, remote_id, num_procs, namelen;
   char name[MPI_MAX_PROCESSOR_NAME];
   MPI_Status stat;

   if (MPI_Init (&argc, &argv) != MPI_SUCCESS)
   {
      printf ("Failed to initialize MPI\n");
      return (-1);
   }

   // Create the communicator, and retrieve the number of processes.
   MPI_Comm_size (MPI_COMM_WORLD, &num_procs);

   // Determine the rank of the process.
   MPI_Comm_rank (MPI_COMM_WORLD, &id);

   // Get machine name
   MPI_Get_processor_name (name, &namelen);

   if (id == MASTER)
   {
      printf ("Hello world: rank %d of %d running on %s\n", id, num_procs, name);

      for (i = 1; i<num_procs; i++)
      {
         MPI_Recv (&remote_id, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
         MPI_Recv (&num_procs, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
         MPI_Recv (&namelen, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
         MPI_Recv (name, namelen+1, MPI_CHAR, i, TAG_HELLO, MPI_COMM_WORLD, &stat);

         printf ("Hello world: rank %d of %d running on %s\n", remote_id, num_procs, name);
      }
   }
   else
   {
      MPI_Send (&id, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&num_procs, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&namelen, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (name, namelen+1, MPI_CHAR, MASTER, TAG_HELLO, MPI_COMM_WORLD);
   }

   startprocess = MPI_Wtime();

   pi1 = calculate_partialPI(id, num_procs);

   double elapsed = MPI_Wtime() - startprocess;

   MPI_Reduce (&pi1, &total_pi, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);
   if (id == MASTER)
   {
      double timeprocess[num_procs];

      timeprocess[MASTER] = elapsed;
      printf("Elapsed time from rank %d: %10.2f (usec)\n", MASTER, 1000000 * timeprocess[MASTER]);

      for (i = 1; i < num_procs; i++)
      {
         // Rank 0 waits for elapsed time value
         MPI_Recv (&timeprocess[i], 1, MPI_DOUBLE, i, TAG_TIME, MPI_COMM_WORLD, &stat);
         printf("Elapsed time from rank %d: %10.2f (usec)\n", i, 1000000 *timeprocess[i]);
      }

      printf("rank %d pi= %16.12f\n", id, total_pi);
   }
   else
   {
      // Send back the processing time (in second)
      MPI_Send (&elapsed, 1, MPI_DOUBLE, MASTER, TAG_TIME, MPI_COMM_WORLD);
   }

   // Terminate MPI.
   MPI_Finalize();
   return 0;
}

 

Appendix B

The code of the second sample program is shown below.

/*
 *  Copyright (c) 2017 Intel Corporation. All Rights Reserved.
 *
 *  Portions of the source code contained or described herein and all documents related
 *  to portions of the source code ("Material") are owned by Intel Corporation or its
 *  suppliers or licensors.  Title to the Material remains with Intel
 *  Corporation or its suppliers and licensors.  The Material contains trade
 *  secrets and proprietary and confidential information of Intel or its
 *  suppliers and licensors.  The Material is protected by worldwide copyright
 *  and trade secret laws and treaty provisions.  No part of the Material may
 *  be used, copied, reproduced, modified, published, uploaded, posted,
 *  transmitted, distributed, or disclosed in any way without Intel's prior
 *  express written permission.
 *
 *  No license under any patent, copyright, trade secret or other intellectual
 *  property right is granted to or conferred upon you by disclosure or
 *  delivery of the Materials, either expressly, by implication, inducement,
 *  estoppel or otherwise. Any license under such intellectual property rights
 *  must be express and approved by Intel in writing.
 */
//******************************************************************************
// Content: (version 0.5)
//      Based on a Monte Carlo method, this MPI sample code uses volumes to
//      estimate the number PI.
//
//******************************************************************************
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

#include "mpi.h"

#define MASTER 0
#define TAG_HELLO 4
#define TAG_TEST 5
#define TAG_TIME 6

int main(int argc, char *argv[])
{
  int i, id, remote_id, num_procs;

  MPI_Status stat;
  int namelen;
  char name[MPI_MAX_PROCESSOR_NAME];

  // Start MPI.
  if (MPI_Init (&argc, &argv) != MPI_SUCCESS)
    {
      printf ("Failed to initialize MPI\n");
      return (-1);
    }

  // Create the communicator, and retrieve the number of processes.
  MPI_Comm_size (MPI_COMM_WORLD, &num_procs);

  // Determine the rank of the process.
  MPI_Comm_rank (MPI_COMM_WORLD, &id);
    // Get machine name
  MPI_Get_processor_name (name, &namelen);

  if (id == MASTER)
    {
      printf ("Hello world: rank %d of %d running on %s\n", id, num_procs, name);

      for (i = 1; i<num_procs; i++)
	{
	  MPI_Recv (&remote_id, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
	  MPI_Recv (&num_procs, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
	  MPI_Recv (&namelen, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
	  MPI_Recv (name, namelen+1, MPI_CHAR, i, TAG_HELLO, MPI_COMM_WORLD, &stat);

	  printf ("Hello world: rank %d of %d running on %s\n", remote_id, num_procs, name);
	}
    }
  else
    {
      MPI_Send (&id, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&num_procs, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&namelen, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (name, namelen+1, MPI_CHAR, MASTER, TAG_HELLO, MPI_COMM_WORLD);
    }

  // Rank 0 distributes random seeds to all processes.
  double startprocess, endprocess;

  int distributed_seed = 0;
  int *buff;

  buff = (int *)malloc(num_procs * sizeof(int));

  unsigned int MAX_NUM_POINTS = pow (2,32) - 1;
  unsigned int num_local_points = MAX_NUM_POINTS / num_procs;

  if (id == MASTER)
    {
      srand (time(NULL));

      for (i=0; i<num_procs; i++)
	{
	  distributed_seed = rand();
	  buff[i] = distributed_seed;
	}
    }

  // Broadcast the seed to all processes
  MPI_Bcast(buff, num_procs, MPI_INT, MASTER, MPI_COMM_WORLD);

  // At this point, every process (including rank 0) has a different seed. Using its seed,
  // each process generates num_local_points random points: p_x within its own slab of
  // width 1/num_procs inside [0,1], and p_y, p_z in [0,1].
  startprocess = MPI_Wtime();

  srand (buff[id]);

  unsigned int point = 0;
  unsigned int rand_MAX = 128000;
  float p_x, p_y, p_z;
  float temp, temp2, pi;
  double result;
  unsigned int inside = 0, total_inside = 0;
    for (point=0; point<num_local_points; point++)
    {
      temp = (rand() % (rand_MAX+1));
      p_x = temp / rand_MAX;
      p_x = p_x / num_procs;

      temp2 = (float)id / num_procs;	// id belongs to 0, num_procs-1
      p_x += temp2;

      temp = (rand() % (rand_MAX+1));
      p_y = temp / rand_MAX;

      temp = (rand() % (rand_MAX+1));
      p_z = temp / rand_MAX;

      // Compute the number of points residing inside of the 1/8 of the sphere
      result = p_x * p_x + p_y * p_y + p_z * p_z;

      if (result <= 1)
	  {
		inside++;
	  }
    }

  double elapsed = MPI_Wtime() - startprocess;

  MPI_Reduce (&inside, &total_inside, 1, MPI_UNSIGNED, MPI_SUM, MASTER, MPI_COMM_WORLD);

#if DEBUG
  printf ("rank %d counts %u points inside the sphere\n", id, inside);
#endif

  if (id == MASTER)
    {
      double timeprocess[num_procs];

      timeprocess[MASTER] = elapsed;
      printf("Elapsed time from rank %d: %10.2f (sec) \n", MASTER, timeprocess[MASTER]);

      for (i=1; i<num_procs; i++)
	{
	  // Rank 0 waits for elapsed time value
	  MPI_Recv (&timeprocess[i], 1, MPI_DOUBLE, i, TAG_TIME, MPI_COMM_WORLD, &stat);
	  printf("Elapsed time from rank %d: %10.2f (sec) \n", i, timeprocess[i]);
	}
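
      // Points are uniform in the unit cube [0,1]^3, and we count those inside the
      // octant of the unit sphere (x*x + y*y + z*z <= 1), whose volume is (1/8)(4/3)pi = pi/6.
      // Hence total_inside/MAX_NUM_POINTS approximates pi/6, so pi is estimated as
      // 6 * total_inside / MAX_NUM_POINTS.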

      temp = 6 * (float)total_inside;
      pi = temp / MAX_NUM_POINTS;
      printf ( "Out of %u points, there are %u points inside the sphere => pi=%16.12f\n", MAX_NUM_POINTS, total_inside, pi);
    }
  else
    {
      // Send back the processing time (in second)
      MPI_Send (&elapsed, 1, MPI_DOUBLE, MASTER, TAG_TIME, MPI_COMM_WORLD);
    }

  free(buff);

  // Terminate MPI.
  MPI_Finalize();

  return 0;
}
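
If you want to rebuild and redeploy this sample yourself, a minimal sketch looks like the following (this assumes the source file is saved as montecarlo.c and that the Intel® compiler and Intel® MPI Library environments are already sourced; -mmic cross-compiles for the first-generation coprocessor, and the exact flags used in the original build steps may differ):

$ mpiicc montecarlo.c -o montecarlo               # host binary
$ mpiicc -mmic montecarlo.c -o montecarlo.knc     # Intel Xeon Phi coprocessor binary
$ scp montecarlo.knc mic0:/tmp/
$ scp montecarlo.knc mic1:/tmp/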

Intel® Manycore Platform Software Stack Archive for the Intel® Xeon Phi™ Coprocessor x200 Product Family


On this page you will find the past releases of the Intel® Manycore Platform Software Stack (Intel® MPSS) for the Intel® Xeon Phi™ coprocessor x200 product family. The most recent release is found here: https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-for-intel-xeon-phi-coprocessor-x200. We recommend customers use the latest release wherever possible.

  • N-1 release for Intel® MPSS 4.4.x

Intel MPSS 4.4.0 HotFix 1 release for Linux*

Intel MPSS 4.4.0 Hotfix 1 (released: May 8, 2017)

Downloads Available                          Size (range)   MD5 Checksum
RHEL 7.3                                     214MB          8a015c38379b8be42c8045d3ceb44545
RHEL 7.2                                     214MB          694b7b908c12061543d2982695985d8b
SLES 12.2                                    213MB          506ab12af774f78fa8e107fd7a4f96fd
SLES 12.1                                    213MB          b8520888954e846e8ac8604d62a9ba96
SLES 12.0                                    213MB          88a3a4415afae1238453ced7a0df28ea
Card installer file (mpss-4.4.0-card.tar)    761MB          d26e26868297cea5fd4ffafe8d78b66e
Source file (mpss-4.4.0-card-source.tar)     514MB          127713d06496090821b5bb3613c95b30

Document Link             Description                                                                              Last Updated On   Size (approx.)
releaseNotes-linux.txt    Release notes (English)                                                                  May 2017          15KB
readme.txt                Readme (includes installation instructions) for Linux (English)                         May 2017          17KB
mpss_user_guide.pdf       Intel MPSS user guide                                                                    May 2017          3MB
eula.txt                  End User License Agreement (Important: Read before downloading, installing, or using)   May 2017          33KB

 

Intel MPSS 4.4.0 HotFix 1 release for Windows*

Intel MPSS 4.4.0 Hotfix 1 (released: May 8, 2017)

Downloads Available          Size      MD5 Checksum
mpss-4.4.0-windows.zip       1091MB    204a65b36858842f472a37c77129eb53

Document Link               Description                                                                              Last Updated On   Size (approx.)
releasenotes-windows.txt    English - Release notes                                                                  May 2017          7KB
readme-windows.pdf          English - Readme for Windows                                                             May 2017          399KB
mpss_users_guide_windows    Intel MPSS user guide for Windows                                                        May 2017          3MB
eula.txt                    End User License Agreement (Important: Read before downloading, installing, or using)   May 2017          33KB

 

The discussion forum at http://software.intel.com/en-us/forums/intel-many-integrated-core is available to join and discuss any enhancements or issues with the Intel MPSS.

Recipe: Building and Running GROMACS* on Intel® Processors


Purpose

This recipe describes how to get, build, and run the GROMACS* code on Intel® Xeon® and Intel® Xeon Phi™ processors for better performance on a single node.

Introduction

GROMACS is a versatile package for performing molecular dynamics, using Newtonian equations of motion, for systems with hundreds to millions of particles. GROMACS is primarily designed for biochemical molecules like proteins, lipids, and nucleic acids that have a multitude of complicated bonded interactions. But, since GROMACS is extremely fast at calculating the non-bonded interactions typically dominating simulations, many researchers use it for research on non-biological systems, such as polymers.

GROMACS supports all the usual algorithms expected from a modern molecular dynamics implementation.

The GROMACS code is maintained by developers around the world. The code is available under the GNU General Public License from www.gromacs.org.

Code Access

Download GROMACS:

Workloads Access

Download the workloads:

Generate Water Workloads Input Files:

To generate the .tpr input file:

  • tar xf water_GMX50_bare.tar.gz
  • cd water-cut1.0_GMX50_bare/1536
  • gmx_mpi grompp -f pme.mdp -c conf.gro -p topol.top -o topol_pme.tpr
  • gmx_mpi grompp -f rf.mdp -c conf.gro -p topol.top -o topol_rf.tpr

Build Directions

Build the GROMACS binary. Use cmake configuration for Intel® Compiler 2017.1.132 + Intel® MKL + Intel® MPI 2017.1.132:

Set the Intel Xeon Phi BIOS options to be:

  • Quadrant Cluster mode
  • MCDRAM Flat mode
  • Turbo Enabled

For Intel Xeon Phi, build the code as:

  • BuildDir="${GromacsPath}/build" # Create the build directory
  • installDir="${GromacsPath}/install"
  • mkdir $BuildDir
     

  • source /opt/intel/<version>/bin/compilervars.sh intel64 # Source the Intel compiler, MKL and IMPI
  • source /opt/intel/impi/<version>/mpivars.sh
  • source /opt/intel/mkl/<version>/mklvars.sh intel64
     

  • cd $BuildDir # Set the build environments for Intel Xeon Phi
FLAGS="-xMIC-AVX512 -g -static-intel"; CFLAGS=$FLAGS CXXFLAGS=$FLAGS CC=mpiicc CXX=mpiicpc cmake .. -DBUILD_SHARED_LIBS=OFF -DGMX_FFT_LIBRARY=mkl -DCMAKE_INSTALL_PREFIX=$installDir -DGMX_MPI=ON -DGMX_OPENMP=ON -DGMX_CYCLE_SUBCOUNTERS=ON -DGMX_GPU=OFF -DGMX_BUILD_HELP=OFF -DGMX_HWLOC=OFF -DGMX_SIMD=AVX_512_KNL -DGMX_OPENMP_MAX_THREADS=256

For Intel Xeon, set the build environments and build the code as above with changes:

  • FLAGS="-xCORE-AVX2 -g -static-intel"
  • -DGMX_SIMD=AVX2_256
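
Putting these two changes together, the Intel Xeon configure step would look roughly as follows (a sketch only: it is the Intel Xeon Phi cmake line above with just FLAGS and -DGMX_SIMD swapped, so verify the remaining options against your own build):

cd $BuildDir
FLAGS="-xCORE-AVX2 -g -static-intel"; CFLAGS=$FLAGS CXXFLAGS=$FLAGS CC=mpiicc CXX=mpiicpc cmake .. -DBUILD_SHARED_LIBS=OFF -DGMX_FFT_LIBRARY=mkl -DCMAKE_INSTALL_PREFIX=$installDir -DGMX_MPI=ON -DGMX_OPENMP=ON -DGMX_CYCLE_SUBCOUNTERS=ON -DGMX_GPU=OFF -DGMX_BUILD_HELP=OFF -DGMX_HWLOC=OFF -DGMX_SIMD=AVX2_256 -DGMX_OPENMP_MAX_THREADS=256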

Other system setup:

Change the kernel settings for KNL to “nmi_watchdog=0 nohz_full=0-270”. One way to change the settings (this could be different for every system):

  • First save your original grub.cfg to be safe
      cp /boot/grub2/grub.cfg /boot/grub2/grub.cfg.ORIG
  • In “/etc/default/grub”, add (append) the following to “GRUB_CMDLINE_LINUX”
      nmi_watchdog=0 nohz_full=0-270
  • Save your new configuration
      grub2-mkconfig -o /boot/grub2/grub.cfg
  • Reboot the system. After logging in, verify the settings with “cat /proc/cmdline”

Build GROMACS:

  • make -j 4
  • sleep 5
  • make check

Run Directions

Run workloads on the Intel Xeon Phi processor with the following environment settings and command lines (where nodes.txt contains: localhost:272):


	export  I_MPI_DEBUG=5
	export I_MPI_FABRICS=shm
	export I_MPI_PIN_MODE=lib
	export KMP_AFFINITY=verbose,compact,1

	gmxBin="${installDir}/bin/gmx_mpi"

	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 66 numactl -m 1 $gmxBin mdrun -npme 0 -notunepme -ntomp 4 -dlb yes -v -nsteps 4000 -resethway -noconfout -pin on -s ${WorkloadPath}water-cut1.0_GMX50_bare/1536/topol_pme.tpr
	export KMP_BLOCKTIME=0
	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 66 numactl -m 1 $gmxBin mdrun -ntomp 4 -dlb yes -v -nsteps 1000 -resethway -noconfout -pin on -s ${WorkloadPath}lignocellulose-rf.BGQ.tpr
	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 64 numactl -m 1 $gmxBin mdrun -ntomp 4 -dlb yes -v -nsteps 5000 -resethway -noconfout -pin on -s ${WorkloadPath}water-cut1.0_GMX50_bare/1536/topol_rf.tpr
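
A note on the numactl -m 1 prefix used above: with MCDRAM in flat mode, the 16 GB of MCDRAM is exposed as a separate NUMA node (commonly node 1 on this kind of single-socket configuration), and -m 1 binds the GROMACS allocations to it. Before running, you can confirm the node numbering on your own system (a generic check, not part of the original recipe):

	numactl -H     # the ~16 GB node that lists no CPUs is the MCDRAM node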

Run workloads on the Intel Xeon processor with the following environment settings and command lines:


	export  I_MPI_DEBUG=5
	export I_MPI_FABRICS=shm
	export I_MPI_PIN_MODE=lib
	export KMP_AFFINITY=verbose,compact,1

	gmxBin="${installDir}/bin/gmx_mpi"

	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 72 $gmxBin mdrun -notunepme -ntomp 1 -dlb yes -v -nsteps 4000 -resethway -noconfout -s ${WorkloadPath}water-cut1.0_GMX50_bare/1536_bdw/topol_pme.tpr
	export KMP_BLOCKTIME=0
	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 72 $gmxBin mdrun -ntomp 1 -dlb yes -v -nsteps 1000 -resethway -noconfout -s ${WorkloadPath}lignocellulose-rf.BGQ.tpr
	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 72 $gmxBin mdrun -ntomp 1 -dlb yes -v -nsteps 5000 -resethway -noconfout -s ${WorkloadPath}water-cut1.0_GMX50_bare/1536_bdw/topol_rf.tpr

Performance Testing

Performance tests for GROMACS are illustrated below with comparisons between an Intel Xeon processor and an Intel Xeon Phi processor against three standard workloads: water1536k_pme, water1536k_rf, and lignocellulose3M_rf. In all cases, turbo mode is turned on.

Testing Platform Configurations

The following hardware was used for the above recipe and performance testing.

Processor                     Intel® Xeon® Processor E5-2697 v4    Intel® Xeon Phi™ Processor 7250
Stepping                      1 (B0)                               1 (B0) Bin1
Sockets / TDP                 2S / 290W                            1S / 215W
Frequency / Cores / Threads   2.3 GHz / 36 / 72                    1.4 GHz / 68 / 272
DDR4                          8x16 GB 2400 MHz (128 GB)            6x16 GB 2400 MHz
MCDRAM                        N/A                                  16 GB Flat
Cluster/Snoop Mode/Mem Mode   Home                                 Quadrant/flat
Turbo                         On                                   On
BIOS                          GRRFSDP1.86B.0271.R00.1510301446     GVPRCRB1.86B.0011.R04.1610130403
Compiler                      ICC-2017.1.132                       ICC-2017.1.132
Operating System              Red Hat Enterprise Linux* 7.2        Red Hat Enterprise Linux 7.2
Kernel                        3.10.0-327.el7.x86_64                3.10.0-327.13.1.el7.xppsl_1.3.3.151.x86_64

GROMACS Build Configurations

The following configurations were used for the above recipe and performance testing.

  • GROMACS Version: GROMACS-2016.1
  • Intel® Compiler Version: 2017.1.132
  • Intel® MPI Library Version: 2017.1.132
  • Workloads used: water1536k_pme, water1536k_rf, and lignocellulose3M_rf

Recipe: Building and running NEMO* on Intel® Xeon Phi™ Processors


About NEMO*

The NEMO* (Nucleus for European Modelling of the Ocean) numerical solutions framework encompasses models of ocean, sea ice, tracers, and biochemistry equations and their related physics. It also incorporates the pre- and post-processing tools and the interface to other components of the Earth System. NEMO allows several ocean-related components of the Earth System to work together or separately, and also allows for two-way nesting via AGRIF software. It is interfaced with the remaining components of the Earth System package (atmosphere, land surfaces, and so on) via the OASIS coupler.

This recipe shows the performance advantages of using the Intel® Xeon Phi™ processor 7250.

NEMO 3.6 is the current stable version.

Downloading the Code

  1. Download the NEMO source code from the official NEMO repository (you should register at www.nemo-ocean.eu ):

    svn co -r 6939 http://forge.ipsl.jussieu.fr/nemo/svn/branches/2015/nemo_v3_6_STABLE/NEMOGCM nemo

  2. Download the XIOS IO server from the official XIOS repository:

    svn co -r 703 http://forge.ipsl.jussieu.fr/ioserver/svn/XIOS/branchs/xios-1.0 xios

  3. If your system already has NetCDF libraries with Fortran bindings installed and they link with the NEMO and XIOS binaries, skip ahead to the section “Building XIOS for the Intel Xeon Processor”. Otherwise, download:
  4. NetCDF-Fortran from https://github.com/Unidata/netcdf-fortran/archive/netcdf-fortran-4.2.tar.gz

Building Additional Libraries for the Intel® Xeon® Processor

  1. First, choose a directory for your experiments, such as “~/NEMO-BDW”:
    export base="~/NEMO-BDW"
  2. Create a directory and copy all required libraries in $base:
    mkdir -p $base/libraries
  3. Unpack the tarball files in $base/libraries/src.
  4. To build an Intel® Advanced Vector Extensions 2 (Intel® AVX2) version of the libraries, set:
    export arch="-xCORE-AVX2"
  5. Set the following environment variables:
    export PREFIX=$base/libraries
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${PREFIX}/lib
    export CFLAGS="-I$PREFIX/include -L$PREFIX/lib -O3 -g -traceback -openmp ${arch} -fPIC"
    export CPPFLAGS=$CFLAGS
    export CXXFLAGS=$CFLAGS
    export FFFLAGS=$CFLAGS
    export FCFLAGS=$CFLAGS
    export LDFLAGS="-L$PREFIX/lib -openmp ${arch} -fPIC"
    export FC=mpiifort
    export CXX=mpiicc
    export CC=mpiicc
    export CPP="icc -E"
  6. Build szip:
    cd $base/libraries/src/szip-2.1
    ./configure --prefix=$PREFIX
    make -j 4
    make install
  7. Build zlib:
    cd $base/libraries/src/zlib-1.2.8
    ./configure --prefix=$PREFIX
    make -j 4
    make install
  8. Build HDF5:
    cd $base/libraries/src/hdf5-1.8.12
    ./configure --with-zlib=$PREFIX --prefix=$PREFIX --enable-fortran --with-szlib=$PREFIX --enable-hl
    make
    make install
  9. Build CURL:
    cd $base/libraries/src/curl-7.42.1
    ./configure --prefix=$PREFIX
    make -j 4
    make install
  10. Build NetCDF:
    cd $base/libraries/src/netcdf-4.3.3
    export LIBS=" -lhdf5_hl -lhdf5 -lz -lsz -lmpi"
    export LD_FLAGS+=" -L$PREFIX/lib"
    ./configure --prefix=$PREFIX
    make
    make install
  11. Build the NetCDF Fortran wrapper:
    cd $base/libraries/src/netcdf-fortran-4.2/
    export LIBS=""
    export CFLAGS="$CFLAGS -lnetcdf"
    export CPPFLAGS=$CFLAGS
    export CXXFLAGS=$CFLAGS
    export FFFLAGS=$CFLAGS
    export FCFLAGS=$CFLAGS
    export FC=ifort
    export CXX=mpiicc
    export CC=mpiicc
    export LDFLAGS+=" -L$I_MPI_ROOT/lib64/"
    ./configure --prefix=$PREFIX
    make
    make install

Building XIOS for the Intel Xeon Processor

  1. Copy XIOS source code to $base/xios
  2. Create files:
    $base/xios/arch/arch-ifort_linux.env
    $base/xios/arch/arch-ifort_linux.fcm
    $base/xios/arch/arch-ifort_linux.path
  3. Add the following lines to the $base/xios/arch/arch-ifort_linux.env file:
    export NETCDF_INC_DIR=$base/libraries/include
    export NETCDF_LIB_DIR=$base/libraries/lib
    export HDF5_INC_DIR=$base/libraries/include
    export HDF5_LIB_DIR=$base/libraries/lib
  4. Add the following lines to the $base/xios/arch/arch-ifort_linux.fcm file:
    %NCDF_INC            -I$base/libraries/include
    %NCDF_LIB            -L$base/libraries/lib -lnetcdff -lnetcdf -lhdf5 -lcurl -lz -lsz
    %FC                  mpiifort
    %FCFLAGS             -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %FFLAGS              -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %LD                  mpiifort
    %FPPFLAGS            -P -C -traditional
    %LDFLAGS             -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %AR                  ar
    %ARFLAGS             -r
    %MK                  gmake
    %USER_INC            %NCDF_INC_DIR
    %USER_LIB            %NCDF_LIB_DIR
    
    %MAKE                gmake
    %BASE_LD        -lstdc++ -lifcore -lintlc
    %LINKER         mpiifort -nofor-main
    %BASE_INC       -D__NONE__
    %CCOMPILER      mpiicc
    %FCOMPILER      mpiifort
    %CPP            cpp
    %FPP            cpp -P
    
    %BASE_CFLAGS    -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %PROD_CFLAGS    -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %DEV_CFLAGS    -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %DEBUG_CFLAGS  -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %BASE_FFLAGS   -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %PROD_FFLAGS    -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %DEV_FFLAGS    -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %DEBUG_FFLAGS   -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
  5. Add the following lines to the $base/xios/arch/arch-ifort_linux.path file:
    NETCDF_INCDIR="-I $NETCDF_INC_DIR"
    NETCDF_LIBDIR="-L $NETCDF_LIB_DIR"
    NETCDF_LIB="-lnetcdff -lnetcdf -lcurl"
    MPI_INCDIR=""
    MPI_LIBDIR=""
    MPI_LIB=""
    HDF5_INCDIR="-I $HDF5_INC_DIR"
    HDF5_LIBDIR="-L $HDF5_LIB_DIR"
    HDF5_LIB="-lhdf5_hl -lhdf5 -lz -lcurl"
  6. Change directory to $base/xios and execute the following command:
    ./make_xios --full --prod --arch ifort_linux

Building NEMO for the Intel Xeon Processor and Preparing Workloads

  1. Copy NEMO source code to $base/nemo
  2. Apply the following patch to the file $base/nemo/NEMOGCM/NEMO/OPA_SRC/nemogcm.F90:
    @@ -116,6 +116,7 @@
           !!              Madec, 2008, internal report, IPSL.
           !!----------------------------------------------------------------------
           INTEGER ::   istp       ! time step index
    +DOUBLE PRECISION :: mpi_wtime, sstart, send
           !!----------------------------------------------------------------------
           !
     #if defined key_agrif
    @@ -163,18 +164,19 @@
     #if defined key_agrif
               CALL Agrif_Regrid()
     #endif
    -
              DO WHILE ( istp <= nitend .AND. nstop == 0 )
    +sstart = mpi_wtime()
     #if defined key_agrif
                 CALL stp                         ! AGRIF: time stepping
     #else
                 CALL stp( istp )                 ! standard time stepping
     #endif
    +send=mpi_wtime()
    +print *, "Step ", istp, " - " , send-sstart , "s."
                 istp = istp + 1
                 IF( lk_mpp )   CALL mpp_max( nstop )
              END DO
     #endif
    -
           IF( lk_diaobs   )   CALL dia_obs_wri
           !
           IF( ln_icebergs )   CALL icb_end( nitend )
  3. Create the file $base/nemo/ARCH/arch-mpiifort_linux.fcm and add the following lines:
    %NCDF_INC            -I/$base/libraries/include
    %NCDF_LIB            -L$base/libraries/lib -lnetcdff -lnetcdf -lz -lcurl -lhdf5_hl -lhdf5 -lz -lcurl
    %CPP                 icc -E
    %FC                  mpiifort
    %FCFLAGS          -r8 -g -traceback -qopenmp -O3 -xCORE-AVX2 -g -traceback
    %FFLAGS             -r8 -g -traceback -qopenmp -O3 -xCORE-AVX2 -g -traceback
    %LD                  mpiifort
    %FPPFLAGS            -P -C -traditional
    %LDFLAGS             -lstdc++ -lifcore -O3 -xCORE-AVX2 -g -traceback
    %AR                  ar
    %ARFLAGS             -r
    %MK                  gmake
    %XIOS_INC            -I$base/xios/inc
    %XIOS_LIB            -L$base/xios/lib -lxios
    %USER_INC            %NCDF_INC %XIOS_INC
    %USER_LIB            %NCDF_LIB %XIOS_LIB
  4. Build the binary for the GYRE workload:
    cd $base/nemo/NEMOGCM/CONFIG
    ./makenemo -n GYRE -m mpiifort_linux -j 4
  5. Create a sandbox directory for the GYRE runs:
    1. mkdir -p $base/nemo/gyre-exp
       cp -r $base/nemo/NEMOGCM/CONFIG/GYRE/BLD/bin/nemo.exe $base/nemo/gyre-exp
       cp -r $base/nemo/NEMOGCM/CONFIG/GYRE/EXP00/* $base/nemo/gyre-exp
    2. Switch off creating mesh files by changing “nn_msh” to 0 in the namelist_ref file
    3. Enable benchmark mode by changing “nn_bench” to 1 in the namelist_ref file.
    4. Set the following parameters in the “&namcfg” section:
      jp_cfg = 70
      jpidta = 2102
      jpjdta = 1402
      jpkdta = 31
      jpiglo = 2102
      jpjglo = 1402
    5. Switch off using the IO server in the iodef.xml file (“using_server = false”)
  6. Build a binary for the ORCA025 workload:
    1. Change  “$base/nemo/NEMOGCM/CONFIG/ORCA2_LIM3/cpp_ORCA2_LIM3.fcm” content to “bld::tool::fppkeys key_trabbl key_vvl key_dynspg_ts key_ldfslp key_traldf_c2d key_traldf_eiv key_dynldf_c3d key_zdfddm key_zdftmx key_mpp_mpi key_zdftke key_lim3 key_iomput”
    2. Change the line “ORCA2_LIM3 OPA_SRC LIM_SRC_3 NST_SRC” to “ORCA2_LIM3 OPA_SRC LIM_SRC_3” in file $base/nemo/NEMOGCM/CONFIG/cfg.txt
    3. ./makenemo -n ORCA2_LIM3 -m mpiifort_linux -j 4
  7. Go to the Barcelona Supercomputing Center (in Spanish), and in section 9 locate the paragraph, “PREGUNTAS Y RESPUESTAS:” with a path to the ftp server and credentials to log in.
  8. Download the BenchORCA025L75.tar.gz file from directory Benchmarks_aceptacion/NEMO/
  9. Extract the contents of the tarball file to $base/nemo/orca-exp
  10. Copy the NEMO binary to the sandbox directory:
    cp $base/nemo/NEMOGCM/CONFIG/ORCA2_LIM3/BLD/bin/nemo.exe $base/nemo/orca-exp
  11. Edit the file $base/nemo/orca-exp/iodef.xml and add the following lines into the “<context id="xios">    <variable_definition>” section:
    <variable id="min_buffer_size" type="int">994473778</variable><variable id="buffer_size" type="int">994473778</variable> 
  12. In the file namelist_ref in section “&namrun” set the following variables:
    nn_itend     =   10
    nn_stock    =    10
    nn_write    =    10
  13. Copy the $base/nemo/NEMOGCM/CONFIG/SHARED/namelist_ref file to $base/nemo/exp-orca
  14. Switch off using the IO server in the iodef.xml file (“using_server = false”)
  15. To build the KNL binaries change “-xCORE-AVX2” to “-xMIC-AVX512”, change $base to another directory, and do all of the steps again.

Running the GYRE Workload with the Intel Xeon Processor

  1. Go to $base/nemo/gyre-exp
  2. Source the environment variables for the compiler and the Intel® MPI Library:
    source /opt/intel/compiler/latest/bin/compilervars.sh intel64
    source /opt/intel/impi/latest/bin/compilervars.sh intel64
  3. Add libraries to LD_LIBRARY_PATH:
    export LD_LIBRARY_PATH=$base/libraries/lib/:$LD_LIBRARY_PATH
  4. Set additional variables for the Intel MPI Library:
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_PIN_CELL=core
  5. Run NEMO:
    mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe
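
    As a purely hypothetical illustration (the rank count and host file are invented here, using one rank per core of the dual-socket Intel Xeon system described under “Configuring Test Systems”), a single-node launch could look like:

    echo localhost > hostfile.txt
    mpiexec.hydra -genvall -f hostfile.txt -n 36 -perhost 36 ./nemo.exe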

Running the ORCA025 Workload with the Intel Xeon Processor

  1. Go to $base/nemo/orca-exp
  2. Source the environment variables for the compiler and the Intel MPI Library:
    source /opt/intel/compiler/latest/bin/compilervars.sh intel64
    source /opt/intel/impi/latest/bin/compilervars.sh intel64
  3. Add libraries to LD_LIBRARY_PATH:
    export LD_LIBRARY_PATH=$base/libraries/lib/:$LD_LIBRARY_PATH
  4. Set additional variables for Intel MPI Library:
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_PIN_CELL=core
  5. Run NEMO:
    mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe
  6. If the application hangs while running, you can run NEMO with the XIOS server in detached mode:
    1. Copy xios_server.exe from $base/xios/bin to $base/nemo/orca-exp
    2. Edit the iodef.xml file and set “using_server = true”
    3. mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe : -n 2 ./xios_server.exe

Building Additional Libraries for the Intel® Xeon Phi™ Processor

  1. First, choose a directory for your experiments, such as “~/NEMO-KNL”:
    export base="~/NEMO-KNL"
  2. Create the directory and copy all required libraries in $base:
    mkdir -p $base/libraries
  3. Unpack the tarball files in $base/libraries/src.
  4. To build an Intel® AVX-512 version of the libraries, set:
    export arch="-xMIC-AVX512"
  5. Set the following environment variables:
    export PREFIX=$base/libraries
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${PREFIX}/lib
    export CFLAGS="-I$PREFIX/include -L$PREFIX/lib -O3 -g -traceback -openmp ${arch} -fPIC"
    export CPPFLAGS=$CFLAGS
    export CXXFLAGS=$CFLAGS
    export FFFLAGS=$CFLAGS
    export FCFLAGS=$CFLAGS
    export LDFLAGS="-L$PREFIX/lib -openmp ${arch} -fPIC"
    export FC=mpiifort
    export CXX=mpiicc
    export CC=mpiicc
    export CPP="icc -E"
  6. Build szip:
    cd $base/libraries/src/szip-2.1
    ./configure --prefix=$PREFIX
    make -j 4
    make install
  7. Build zlib:
    cd $base/libraries/src/zlib-1.2.8
    ./configure --prefix=$PREFIX
    make -j 4
    make install
  8. Build HDF5:
    cd $base/libraries/src/hdf5-1.8.12
    ./configure --with-zlib=$PREFIX --prefix=$PREFIX --enable-fortran --with-szlib=$PREFIX --enable-hl
    make
    make install
  9. Build CURL:
    cd $base/libraries/src/curl-7.42.1
    ./configure --prefix=$PREFIX
    make -j 4
    make install
  10. Build NetCDF:
    cd $base/libraries/src/netcdf-4.3.3
    export LIBS=" -lhdf5_hl -lhdf5 -lz -lsz -lmpi"
    export LD_FLAGS+=" -L$PREFIX/lib"
    ./configure --prefix=$PREFIX
    make
    make install
  11. Build the NetCDF Fortran wrapper:
    cd $base/libraries/src/netcdf-fortran-4.2/
    export LIBS=""
    export CFLAGS="$CFLAGS -lnetcdf"
    export CPPFLAGS=$CFLAGS
    export CXXFLAGS=$CFLAGS
    export FFFLAGS=$CFLAGS
    export FCFLAGS=$CFLAGS
    export FC=ifort
    export CXX=mpiicc
    export CC=mpiicc
    export LDFLAGS+=" -L$I_MPI_ROOT/lib64/"
    ./configure --prefix=$PREFIX
    make
    make install

Building XIOS for the Intel Xeon Phi Processor

  1. Copy XIOS source code to $base/xios
  2. Create files:
    $base/xios/arch/arch-ifort_linux.env
    $base/xios/arch/arch-ifort_linux.fcm
    $base/xios/arch/arch-ifort_linux.path
  3. Add the following lines to the $base/xios/arch/arch-ifort_linux.env file:
    export NETCDF_INC_DIR=$base/libraries/include
    export NETCDF_LIB_DIR=$base/libraries/lib
    export HDF5_INC_DIR=$base/libraries/include
    export HDF5_LIB_DIR=$base/libraries/lib
  4. Add the following lines to the $base/xios/arch/arch-ifort_linux.fcm file:
    %NCDF_INC            -I$base/libraries/include
    %NCDF_LIB            -L$base/libraries/lib -lnetcdff -lnetcdf -lhdf5 -lcurl -lz -lsz
    %FC                  mpiifort
    %FCFLAGS             -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %FFLAGS              -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %LD                  mpiifort
    %FPPFLAGS            -P -C -traditional
    %LDFLAGS             -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %AR                  ar
    %ARFLAGS             -r
    %MK                  gmake
    %USER_INC            %NCDF_INC_DIR
    %USER_LIB            %NCDF_LIB_DIR
    
    %MAKE                gmake
    %BASE_LD        -lstdc++ -lifcore -lintlc
    %LINKER         mpiifort -nofor-main
    %BASE_INC       -D__NONE__
    %CCOMPILER      mpiicc
    %FCOMPILER      mpiifort
    %CPP            cpp
    %FPP            cpp -P
    
    %BASE_CFLAGS    -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %PROD_CFLAGS    -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %DEV_CFLAGS     -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %DEBUG_CFLAGS   -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %BASE_FFLAGS    -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %PROD_FFLAGS    -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %DEV_FFLAGS     -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %DEBUG_FFLAGS   -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
  5. Add the following lines to the $base/xios/arch/arch-ifort_linux.path file:
    NETCDF_INCDIR="-I $NETCDF_INC_DIR"
    NETCDF_LIBDIR="-L $NETCDF_LIB_DIR"
    NETCDF_LIB="-lnetcdff -lnetcdf -lcurl"
    MPI_INCDIR=""
    MPI_LIBDIR=""
    MPI_LIB=""
    HDF5_INCDIR="-I $HDF5_INC_DIR"
    HDF5_LIBDIR="-L $HDF5_LIB_DIR"
    HDF5_LIB="-lhdf5_hl -lhdf5 -lz -lcurl"
  6. Change the directory to $base/xios and execute the following command:
    ./make_xios --full --prod --arch ifort_linux

Building NEMO for the Intel Xeon Phi Processor and Preparing Workloads

  1. Copy the NEMO source code to $base/nemo
  2. Apply the following patch to the file $base/nemo/NEMOGCM/NEMO/OPA_SRC/nemogcm.F90:
    @@ -116,6 +116,7 @@
           !!              Madec, 2008, internal report, IPSL.
           !!----------------------------------------------------------------------
           INTEGER ::   istp       ! time step index
    +DOUBLE PRECISION :: mpi_wtime, sstart, send
           !!----------------------------------------------------------------------
           !
     #if defined key_agrif
    @@ -163,18 +164,19 @@
     #if defined key_agrif
               CALL Agrif_Regrid()
     #endif
    -
              DO WHILE ( istp <= nitend .AND. nstop == 0 )
    +sstart = mpi_wtime()
     #if defined key_agrif
                 CALL stp                         ! AGRIF: time stepping
     #else
                 CALL stp( istp )                 ! standard time stepping
     #endif
    +send=mpi_wtime()
     +print *, "Step ", istp, " - " , send-sstart , "s."
                 istp = istp + 1
                 IF( lk_mpp )   CALL mpp_max( nstop )
              END DO
     #endif
    -
           IF( lk_diaobs   )   CALL dia_obs_wri
           !
           IF( ln_icebergs )   CALL icb_end( nitend )
  3. Create the file $base/nemo/ARCH/arch-mpiifort_linux.fcm and add the following lines:
    %NCDF_INC            -I/$base/libraries/include
    %NCDF_LIB            -L$base/libraries/lib -lnetcdff -lnetcdf -lz -lcurl -lhdf5_hl -lhdf5 -lz -lcurl
    %CPP                 icc -E
    %FC                  mpiifort
    %FCFLAGS          -r8 -g -traceback -qopenmp -O3 -xMIC-AVX512 -g -traceback
    %FFLAGS             -r8 -g -traceback -qopenmp -O3 -xMIC-AVX512 -g -traceback
    %LD                  mpiifort
    %FPPFLAGS            -P -C -traditional
    %LDFLAGS             -lstdc++ -lifcore -O3 -xMIC-AVX512 -g -traceback
    %AR                  ar
    %ARFLAGS             -r
    %MK                  gmake
    %XIOS_INC            -I$base/xios/inc
    %XIOS_LIB            -L$base/xios/lib -lxios
    %USER_INC            %NCDF_INC %XIOS_INC
    %USER_LIB            %NCDF_LIB %XIOS_LIB
  4. Build the binary for the GYRE workload:
    cd $base/nemo/NEMOGCM/CONFIG
    ./makenemo -n GYRE -m mpiifort_linux -j 4
  5. Create a sandbox directory for the GYRE runs:
    1. mkdir -p $base/nemo/gyre-exp
       cp -r $base/nemo/NEMOGCM/CONFIG/GYRE/BLD/bin/nemo.exe $base/nemo/gyre-exp
       cp -r $base/nemo/NEMOGCM/CONFIG/GYRE/EXP00/* $base/nemo/gyre-exp
    2. Switch off creating mesh files by changing “nn_msh” to 0 in the namelist_ref file
    3. Enable benchmark mode by changing “nn_bench” to 1 in the namelist_ref  file.
    4. Set the following parameters in the “&namcfg” section:
      jp_cfg = 70
      jpidta = 2102
      jpjdta = 1402
      jpkdta = 31
      jpiglo = 2102
      jpjglo = 1402
    5. Switch off using the IO server in the iodef.xml file (“using_server = false”)
  6. Build the binary for ORCA025 workload:
    1. Change  $base/nemo/NEMOGCM/CONFIG/ORCA2_LIM3/cpp_ORCA2_LIM3.fcm content to “bld::tool::fppkeys key_trabbl key_vvl key_dynspg_ts key_ldfslp key_traldf_c2d key_traldf_eiv key_dynldf_c3d key_zdfddm key_zdftmx key_mpp_mpi key_zdftke key_lim3 key_iomput”
    2. Change line “ORCA2_LIM3 OPA_SRC LIM_SRC_3 NST_SRC” to “ORCA2_LIM3 OPA_SRC LIM_SRC_3” in the file $base/nemo/NEMOGCM/CONFIG/cfg.txt 
    3. ./makenemo -n ORCA2_LIM3 -m mpiifort_linux -j 4
  7. Go to the Barcelona Supercomputing Center (in Spanish), and in section 9 locate the paragraph, “PREGUNTAS Y RESPUESTAS:” with the path to the ftp server and credentials to log in.
  8. Download the BenchORCA025L75.tar.gz file from the Benchmarks_aceptacion/NEMO/ directory
  9. Extract the contents of the tarball file to $base/nemo/orca-exp
  10. Copy the NEMO binary to the sandbox directory:
    cp $base/nemo/NEMOGCM/CONFIG/ORCA2_LIM3/BLD/bin/nemo.exe $base/nemo/orca-exp
  11. Edit the file $base/nemo/orca-exp/iodef.xml and add the following lines into the “<context id="xios">    <variable_definition>” section:
    <variable id="min_buffer_size" type="int">994473778</variable><variable id="buffer_size" type="int">994473778</variable>
  12. In the file namelist_ref in section “&namrun” set the following variables:
    nn_itend    =  10
    nn_stock    =    10
    nn_write    =    10
  13. Copy the $base/nemo/NEMOGCM/CONFIG/SHARED/namelist_ref file to the $base/nemo/exp-orca directory
  14. Switch off using the IO server in the iodef.xml file (“using_server = false”)
  15. To build the Intel Xeon binaries, change “-xMIC-AVX512” back to “-xCORE-AVX2”, change $base to another directory, and do all of the steps again.

Running the GYRE Workload with the Intel Xeon Phi Processor

  1. Go to $base/nemo/gyre-exp
  2. Source the environment variables for the compiler and Intel MPI Library:
    source /opt/intel/compiler/latest/bin/compilervars.sh intel64
    source /opt/intel/impi/latest/bin/compilervars.sh intel64
  3. Add the libraries to LD_LIBRARY_PATH:
    export LD_LIBRARY_PATH=$base/libraries/lib/:$LD_LIBRARY_PATH
  4. Set additional variables for Intel MPI Library:
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_PIN_CELL=core
  5. Run NEMO:
    mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe

Running the ORCA025 Workload with the Intel Xeon Phi Processor

  1. Go to $base/nemo/orca-exp
  2. Source environment variables for the compiler and Intel MPI Library:
    source /opt/intel/compiler/latest/bin/compilervars.sh intel64
    source /opt/intel/impi/latest/bin/compilervars.sh intel64
  3. Add libraries to LD_LIBRARY_PATH:
    export LD_LIBRARY_PATH=$base/libraries/lib/:$LD_LIBRARY_PATH
  4. Set additional variables for the Intel MPI Library:
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_PIN_CELL=core
  5. Run NEMO:
    mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe
  6. If the application hangs while running, you can run NEMO with the XIOS server in detached mode:
    1. Copy xios_server.exe from $base/xios/bin to $base/nemo/orca-exp
    2. Edit the iodef.xml file and set “using_server = true”
    3. mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe : -n 2 ./xios_server.exe

Configuring Test Systems

CPU
  Intel® Xeon® system: Dual-socket Intel® Xeon® processor E5-2697 v4, 2.3 GHz (turbo OFF), 18 cores/socket, 36 cores, 72 threads (HT on)
  Intel® Xeon Phi™ system: Intel® Xeon Phi™ processor 7250, 68 cores, 136 threads, 1400 MHz core freq. (turbo OFF), 1700 MHz uncore freq.

RAM
  Intel® Xeon® system: 128 GB (8 x 16 GB) DDR4 2400 MHz DIMMs
  Intel® Xeon Phi™ system: 96 GB (6 x 16 GB) DDR4 2400 MHz RDIMMs

Cluster File System
  Both systems: Intel® Enterprise Edition for Lustre* software (Intel® EE for Lustre* software), SSD (136 TB storage)

Interconnect
  Both systems: Intel® Omni-Path Architecture (Intel® OPA) Si 100 series

OS / Kernel / IB stack
  Both systems: Oracle Linux* server release 7.2, kernel 3.10.0-229.20.1.el6.x86_64.knl2, OFED version 10.2.0.0.158_72

  • NEMO configuration: V3.6 r6939 with XIOS 1.0 r703, Intel® Parallel Studio XE 17.0.0.098, Intel MPI Library 2017 for Linux*
  • MPI configuration:
    • I_MPI_FABRICS=shm:tmi
    • I_MPI_PIN_CELL=core

Performance Results for the Intel Xeon Processor and Intel Xeon Phi Processor

    1. Time of second step for GYRE workload:

# nodes    Intel® Xeon® Processor    Intel® Xeon Phi™ Processor
1          6.546229                  3.642156
2          3.011352                  2.075075
4          1.326501                  0.997129
8          0.640632                  0.492369
16         0.321378                  0.284348

    2. Time of second step for ORCA workload:

# nodes    Intel® Xeon® processor    Intel® Xeon Phi™ processor
2          5.764083
4          2.642725                  2.156876
8          1.305238                  1.0546
16         0.67725                   0.643372

Demo: Software Defined Visualization Using Intel® Xeon Phi™ Processor


In this demo we showcase the use of the Intel® Xeon Phi™ processor to do a 3D visualization of a tumor in a human brain. This can help advance research in the medical field by enabling more precise detection and removal of abnormalities such as brain tumors.

More information

The tool used for visualization is Paraview, with OSPRay as the rendering library.

Pre-requisites

Intel® Xeon Phi™ processor system with CentOS 7.2 Linux* (internet enabled)

Open a terminal, in your work area directory and follow the steps below:

  1. Create directory for the demo

    mkdir Intel_brain_demo

  2. Change directory

    cd Intel_brain_demo

  3. Create two directories under this

    mkdir paraview
    mkdir ospray

  4. Access the files from Dropbox:

    https://www.dropbox.com/s/wj0qp1clxv5xssv/SC_2016_BrainDemo.tar.gz?dl=0

  5. Copy the Paraview and Ospray tar files into the respective directories you created in steps above

    mv SC_2016_BrainDemo/paraview_sc_demo.tgz paraview/
    mv SC_2016_BrainDemo/ospray.tgz ospray/

  6. Untar each of the *tgz directories in the respective area

    tar -xzvf *.tgz

  7. Point the library path

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<…../Intel_brain_demo/ospray/install/lib64>

  8. Optional step: set the Qt graphics system variable, only if Paraview doesn’t load normally

    export QT_GRAPHICSSYSTEM=gtk

  9. Change directory to paraview/install where the binaries are

    cd paraview/install

  10. Run Paraview

    ./bin/paraview

  11. Once Paraview loads

    Select File/Load State

  12. Then load the brain_demo.pvsm state file from the SC_2016_BrainDemo package that you downloaded in the steps above

  13. It will then ask you to load the VTK files. Click the “...” button to select the appropriate *tumor1.vtk file, then the *tumor2.vtk file, and then the *Tumor1.vtk file, in that order, on your local machine. Then click OK.

  14. An Output Messages pop-up window will appear with warnings. Ignore the warnings and click Close; you should see something like the following:

  15. Now you can go to File/Save State and save this state. From then on, you can load this state file to skip the previous step of locating the data files.
  16. Then, on the Properties tab on the left side, enable OSPRay for every view (all the RenderViews 1/2/3) by selecting each view and clicking Enable OSPRay

  17. Once you do that you should see the images for all three views look as below:

  18. You can also rotate the views and see how they look.

A few issues and how to resolve

Missing OpenGL, install mesa for OpenGL

sudo yum -y install mesa-libGL
sudo yum -y install mesa-libGL-devel

libQtGui.so.4 error, install qt-x11 package

yum -y install qt-x11
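
If other shared-library errors show up, a quick generic check (not specific to this demo) is to ask the loader which libraries the ParaView binary still cannot find:

ldd ./bin/paraview | grep "not found"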

Acknowledgements

Special thanks to Carson Brownlee and James Jeffers from Intel Corporation for all their contributions and support. Without their efforts, it wouldn’t have been possible to get this demo running.

References

  1. http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html
  2. https://software.intel.com/en-us/blogs/Intel-Parallel-Studio-XE-2016
  3. https://gitlab.kitware.com/carson/paraview
  4. https://gitlab.kitware.com/carson/vtk
  5. http://www.ospray.org
  6. http://www.ospray.org/getting_ospray.html
  7. http://dap.xeonphi.com
  8. https://ispc.github.io/downloads.html
  9. https://www.threadingbuildingblocks.org
  10. https://en.wikipedia.org/wiki/Software_rendering

Recipe: Building and Running MILC on Intel® Xeon® Processors and Intel® Xeon Phi™ Processors


Introduction

MILC software represents a set of codes written by the MIMD Lattice Computation (MILC) collaboration used to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics. It performs simulations of four-dimensional SU(3) lattice gauge theory on MIMD (Multiple Instruction, Multiple Data) parallel machines. “Strong interactions” are responsible for binding quarks into protons and neutrons and holding them all together in the atomic nucleus. MILC applications address fundamental questions in high energy and nuclear physics and are directly related to major experimental programs in these fields. MILC is one of the largest compute cycle users at many U.S. and European supercomputing centers.

This article provides instructions for code access, build, and run directions for the “ks_imp_rhmc” application on Intel® Xeon® processors and Intel® Xeon Phi™ processors. The “ks_imp_rhmc” is a dynamical RHMC (rational hybrid Monte Carlo algorithm) code for staggered fermions. In addition to the naive and asqtad staggered actions, the highly improved staggered quark (HISQ) action is also supported.

Currently, the conjugate gradient (CG) solver in the code uses the QPhiX library. Efforts are ongoing to integrate other operations (gauge force (GF), fermion force (FF)) with the QPhiX library as well.

The QPhiX library provides sparse solvers and Dslash kernels for Lattice QCD simulations optimized for Intel® architectures.

Code Access

The MILC Software and QPhiX library are primarily required. The MILC software can be downloaded from GitHub here: https://github.com/milc-qcd/milc_qcd. Download the master branch. QPhiX support is integrated into this branch for CG solvers.

The QPhiX library and code generator for use with Wilson-Clover fermions (for example, for use with chroma) are available from https://github.com/jeffersonlab/qphix.git and https://github.com/jeffersonlab/qphix-codegen.git, respectively. For the most up to date version, we suggest you use the devel branch of QPhiX. The MILC version is currently not open source. Please contact the MILC collaboration group for access to the QPhiX (MILC) branch.

Build Directions

Compile the QPhiX Library

Users need to build QPhiX first before building the MILC package.

The QPhiX library will have two tar files, mbench*.tar and qphix-codegen*.tar.

Untar the above.

Build qphix-codegen

The files with intrinsics for QPhiX are built in the qphix-codegen directory.

Enter the qphix-codegen directory.

Edit line #3 in “Makefile_xyzt”, enable “milc=1” variable.

Compile as:

source /opt/intel/compiler/<version>/bin/compilervars.sh intel64
source /opt/intel/impi/<version>/mpi/intel64/bin/mpivars.sh
make -f Makefile_xyzt avx512 -- [for Intel® Xeon Phi™ Processors]
make -f Makefile_xyzt avx2 -- [for Intel® Xeon® v3 / v4 Processors]

Build mbench

The mbench is part of the QPhiX library. The MILC version is currently not open source. Please contact the MILC collaboration group for access to the QPhiX (MILC) branch.

Enter the mbench directory.

Edit line #3 in “Makefile_qphixlib”, set “mode=mic” to compile with Intel® AVX-512 for Intel® Xeon Phi™ Processor and “mode=avx” to compile with Intel® Advanced Vector Extensions 2 (Intel® AVX2) for Intel® Xeon® Processors.

Edit line #13 in “Makefile_qphixlib” to enable MPI. Set ENABLE_MPI = 1.

Compile as:

make -f Makefile_qphixlib mode=mic AVX512=1 -- [Intel® Xeon Phi™ Processor]
make -f Makefile_qphixlib mode=avx AVX2=1 -- [Intel® Xeon® Processors]

Compile MILC Code

Install/download the master branch from the above GitHub location.

Download the Makefile.qphix file from the following location:

http://denali.physics.indiana.edu/~sg/MILC_Performance_Recipe/.

Copy the Makefile.qphix to the corresponding application directory. In this case, copy the Makefile.qphix to the “ks_imp_rhmc” application directory and rename it as Makefile.

Make the following changes to the Makefile:

  • On line #17 - Add/uncomment the appropriate ARCH variable:
    • For example, ARCH = knl (compile with Intel AVX-512 for Intel® Xeon Phi™ Processor architecture).
    • For example, ARCH = bdw (compile with Intel AVX2 for Intel® Xeon® Processor architecture).
  • On line #28 - Change MPP variable to “true” if you want MPI.
  • On line #34 - Pick the PRECISION you want:
    • 1 = Single, 2 = Double. We use Double for our runs.
  • Starting line #37 - Compiler is set up and this should work:
    •  If directions above were followed. If not, customize starting at line #40.
  • On line #124 - Setup of Intel compiler starts:
    • Based on ARCH it will use the appropriate flags.
  • On line #395 - QPhiX customizations starts: 
    • On line #399 – Set QPHIX_HOME to correct QPhiX path (Path to mbench directory).
    • The appropriate QPhiX FLAGS will be set if the above is defined correctly.

Compile as:

Enter the ks_imp_rhmc directory. The Makefile with the above changes should be in this directory. Source the latest Intel® compilers and Intel® MPI Library.

make su3_rhmd_hisq -- Build su3_rhmd_hisq binary
make su3_rhmc_hisq -- Build su3_rhmc_hisq binary

Compile the above binaries for Intel® Xeon Phi™ Processor and Intel® Xeon® Processor (edit Makefile accordingly).

Run Directions

Input Files

There are two required input files, params.rest, and rat.m013m065m838.

They can be downloaded from here:

http://denali.physics.indiana.edu/~sg/MILC_Performance_Recipe/.

The file rat.m013m065m838 defines the residues and poles of the rational functions needed in the calculation. The file params.rest sets all the run time parameters, including the lattice size, the length of the calculation (number of trajectories), and the precision of the various conjugate-gradient solutions.

In addition, a params.<lattice-size> file with the required lattice size will be created during runtime. This file essentially has params.rest appended to it, together with the lattice size (Nx * Ny * Nz * Nt) to run.

The Lattice Sizes

The size of the four-dimensional space-time lattice is controlled by the “nx, ny, nz, nt” parameters.

As an example, consider a problem as (nx x ny x nz x nt) = 32 x 32 x 32 x 64 running on 64 MPI ranks. To weak scale this problem a user would begin by multiplying nt by 2, then nz by 2, then ny by 2, then nx by 2 and so on, such that all variables get sized accordingly in a round-robin fashion.

This is illustrated in the table below. The original problem size is 32 x 32 x 32 x 64; to keep the elements/rank constant (weak scaling) at a rank count of 128, first multiply nt by 2 (32 x 32 x 32 x 128). Similarly, for 512 ranks, multiply nt by 2, nz by 2, and ny by 2 from the original problem size to keep the same elements/rank.

Ranks            64        128       256       512
nx               32        32        32        32
ny               32        32        32        64
nz               32        32        64        64
nt               64        128       128       128
Total Elements   2097152   4194304   8388608   16777216
Multiplier       1         2         4         8
Elements/Rank    32768     32768     32768     32768

Table: Illustrates Weak Scaling of Lattice Sizes
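
The round-robin doubling described above is easy to script. The helper below is only a sketch (a hypothetical weak_scale.sh, not part of MILC) that reproduces the table for any power-of-two multiple of 64 ranks:

# weak_scale.sh (hypothetical helper) - usage: bash weak_scale.sh <ranks>, with <ranks> = 64 * 2^k
base_ranks=64
dims=(32 32 32 64)            # nx ny nz nt of the 64-rank baseline
target_ranks=${1:?usage: bash weak_scale.sh <ranks>}
ranks=$base_ranks
i=3                           # start doubling with nt
while [ "$ranks" -lt "$target_ranks" ]; do
  dims[$i]=$(( dims[$i] * 2 ))
  i=$(( (i + 3) % 4 ))        # nt -> nz -> ny -> nx -> nt ...
  ranks=$(( ranks * 2 ))
done
echo "nx=${dims[0]} ny=${dims[1]} nz=${dims[2]} nt=${dims[3]} (ranks=$ranks)"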

Running with MPI x OpenMP*

The calculation takes place on a four-dimensional hypercubic lattice, representing three spatial dimensions and one time dimension. The quark fields have values on each of the lattice points and the gluon field has values on each of the links connecting nearest-neighbors of the lattice sites. 

The lattice is divided into equal subvolumes, one per MPI rank. The MPI ranks can be thought of as being organized into a four-dimensional grid of ranks. It is possible to control the grid dimensions with the params.rest file. Of course, the grid dimensions must be integer factors of the lattice coordinate dimensions.
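
Because of that divisibility requirement, it can be worth sanity-checking a candidate decomposition before launching. The sketch below is a hypothetical helper (the 2 x 2 x 2 x 2 grid for 16 ranks is an assumed example, not taken from the recipe):

nx=48; ny=48; nz=48; nt=120      # lattice used in the 16-node example below
gx=2;  gy=2;  gz=2;  gt=2        # candidate 4-D rank grid (16 ranks)
for pair in "$nx $gx" "$ny $gy" "$nz $gz" "$nt $gt"; do
  set -- $pair
  if [ $(( $1 % $2 )) -ne 0 ]; then
    echo "grid dimension $2 does not divide lattice dimension $1"
  fi
done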

Each MPI rank executes the same code. The calculation requires frequent exchanges of quark and gluon values between MPI ranks with neighboring lattice sites. Within a single MPI rank, the site-by-site calculation is threaded using OpenMP* directives, which have been inserted throughout the code. The most time-consuming part of production calculations is the CG solver. In the QPhiX version of the CG solver, the data layout and the calculation at the thread level are further organized to take advantage of the SIMD (single instruction, multiple data) lanes of the Intel Xeon and Intel Xeon Phi processors.

Running the Test Cases

  1. Create a “run” directory in the top-level directory and add the input files obtained from above.
  2. cd <milc>/run
    P.S: Run the appropriate binary for each architecture.
  3. Create the lattice volume:
    cat << EOF > params.${nx}x${ny}x${nz}x${nt}
    prompt 0
    nx $nx
    ny $ny
    nz $nz
    nt $nt
    EOF
    cat params.rest >> params.${nx}x${ny}x${nz}x${nt}

    For this performance recipe, we evaluate the single node and multinode (16 nodes) performance with the following weak scaled lattice volume:

    Single Node (nx * ny * nz * nt): 24 x 24 x 24 x 60

    Multinode [16 nodes] (nx * ny * nz * nt): 48 x 48 x 48 x 120

  4. Run on Intel Xeon processor (E5-2697v4).
    Source the latest Intel compilers and Intel MPI Library (see the environment setup sketch after this list)
    • Intel® Parallel Studio 2017 and above recommended

    Single Node:

    mpiexec.hydra -n 12 -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' <path-to>/ks_imp_rhmc/su3_rhmd_hisq.bdw < params.24x24x24x60

    Multinode (16 nodes, via Intel® Omni-Path Host Fabric Interface (Intel® OP HFI)):

    # Create a runScript (run-bdw) #
    <path-to>/ks_imp_rhmc/su3_rhmd_hisq.bdw < params.48x48x48x120
    #Intel® OPA fabric-related environment variables#
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_TMI_PROVIDER=psm2
    export PSM2_IDENTIFY=1
    export I_MPI_FALLBACK=0
    #Create nodeconfig.txt with the following#
    -host <hostname1> -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 12 <path-to>/run-bdw
    …..
    …..
    …..
    -host <hostname16> -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 12 <path-to>/run-bdw
    #mpirun command#
    mpiexec.hydra -configfile nodeconfig.txt
  5. Run on Intel Xeon Phi processor (7250).
    Source the Intel compilers and Intel MPI Library (see the environment setup sketch after this list)
    • Intel® Parallel Studio 2017 and above recommended

    Single Node:

    mpiexec.hydra -n 20 -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' numactl -p 1 <path-to>/ks_imp_rhmc/su3_rhmd_hisq.knl < params.24x24x24x60

    Multinode (16 nodes, via Intel OP HFI):

    # Create a runScript (run-knl) #
    numactl -p 1 <path-to>/ks_imp_rhmc/su3_rhmd_hisq.knl < params.48x48x48x120
    #Intel OPA fabric-related environment variables#
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_TMI_PROVIDER=psm2
    export PSM2_IDENTIFY=1
    export I_MPI_FALLBACK=0
    #Create nodeconfig.txt with the following#
    -host <hostname1> -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 20 <path-to>/run-knl
    …..
    …..
    …..
    -host <hostname16> -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 20 <path-to>/run-knl
    #mpirun command#
    mpiexec.hydra -configfile nodeconfig.txt
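
The two run steps above assume that the Intel compiler and MPI environments have already been sourced and, on the Intel Xeon Phi processor, that MCDRAM is exposed as NUMA node 1 (flat memory mode, quadrant cluster mode). The following is a minimal sketch of that setup; the installation path is an assumption and should be adjusted to the local Intel® Parallel Studio install:

    # Sketch: source the compiler and MPI environments (path is an assumption;
    # adjust to your Intel Parallel Studio installation directory).
    source /opt/intel/parallel_studio_xe_2017/psxevars.sh intel64

    # On the Intel Xeon Phi node, confirm that MCDRAM appears as NUMA node 1
    # before using "numactl -p 1"; node numbering can differ in other cluster modes.
    numactl --hardware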

Performance Results and Optimizations

The output prints the total time to solution for the entire application, which takes into account the time for the different solvers and operators (for example, CG solver, fermion force, link fattening, gauge force, and so on).

The performance chart below shows the speedup relative to the 2S Intel Xeon processor E5-2697 v4, based on the total run time.

Figure: Speedup with respect to the 2S Intel® Xeon® processor E5-2697 v4 baseline

The optimizations in the QPhiX library include data layout changes that target vectorization and the generation of packed, aligned loads/stores, cache blocking, load balancing, and improved code generation for each architecture (Intel Xeon processor, Intel Xeon Phi processor) with corresponding intrinsics where necessary. See the References and Resources section for details.

Testing Platform Configurations

The following hardware was used for the above recipe and performance testing.

Processor                   | Intel® Xeon® Processor E5-2697 v4              | Intel® Xeon Phi™ Processor 7250F
Sockets / TDP               | 2S / 290W                                      | 1S / 215W
Frequency / Cores / Threads | 2.3 GHz / 36 / 72                              | 1.4 GHz / 68 / 272
DDR4                        | 8x16 GB 2400 MHz                               | 6x16 GB 2400 MHz
MCDRAM                      | N/A                                            | 16 GB Flat
Cluster/Snoop Mode          | Home                                           | Quadrant
Memory Mode                 | N/A                                            | Flat
Turbo                       | OFF                                            | OFF
BIOS                        | SE5C610.86B.01.01.0016.033120161139            | GVPRCRB1.86B.0010.R02.1606082342
Operating System            | Oracle Linux* 7.2 (3.10.0-229.20.1.el6.x86_64) | Oracle Linux* 7.2 (3.10.0-229.20.1.el6.x86_64)

MILC Build Configurations

The following configurations were used for the above recipe and performance testing.

MILC Version               | Master version as of 28 January 2017
Intel® Compiler Version    | 2017.1.132
Intel® MPI Library Version | 2017.0.098
MILC Makefiles Used        | Makefile.qphix, Makefile_qphixlib, Makefile

References and Resources

  1. MIMD Lattice Computation (MILC) Collaboration: http://physics.indiana.edu/~sg/milc.html
  2. QPhiX Case Study: http://www.nersc.gov/users/computational-systems/cori/application-porting-and-performance/application-case-studies/qphix-case-study/
  3. MILC Staggered Conjugate Gradient Performance on Intel® Xeon Phi™ Processor: https://anl.app.box.com/v/IXPUG2016-presentation-10