How to Mount a Shared Directory on Intel® Xeon Phi™ Coprocessor


In order to run a native program on the Intel® Xeon Phi™ coprocessor, the program and any dependencies must be copied to the target platform. However, copying these files consumes memory on the coprocessor. To preserve the coprocessor's limited on-board memory (16 GB of GDDR5), it is practical to mount a Network File System (NFS) shared directory on the Intel Xeon Phi coprocessor from the host server so that most of its memory remains available to applications. This article shows two ways to accomplish this task: the preferred method uses the micctrl utility; the second is a manual procedure.

Using micctrl utility

The preferred method to mount a shared directory on an Intel Xeon Phi coprocessor is to use the micctrl utility shipped with the Intel® Manycore Platform Software Stack (Intel® MPSS). The following example shows how to share the Intel® C++ Compiler libraries using micctrl. On the host machine used for this example, MPSS 3.4.8 was installed.

  1. On the host machine, ensure that the shared directory exists:
    [host ~]# ls /opt/intel/compilers_and_libraries_2017.0.098/linux/
  2. Add a new entry to the /etc/exports configuration file on the host machine to export the directory /opt/intel/compilers_and_libraries_2017.0.098/linux to the coprocessor mic0, whose IP address is 172.31.1.1. Use the read-only option so that the coprocessor cannot mistakenly delete anything in the shared library directory:
    [host ~]# cat /etc/exports
    	/opt/intel/compilers_and_libraries_2017.0.098/linux 172.31.1.1(ro,async,no_root_squash)

    For more information on the export options, you can refer to http://nfs.sourceforge.net/nfs-howto/ar01s03.html.
     
  3. Next, update the NFS export table in the host:
    [host ~]# exportfs -a
  4. From the host, use the micctrl utility to add an NFS entry on the coprocessors:
    [host ~]# micctrl --addnfs=/opt/intel/compilers_and_libraries_2017.0.098/linux --dir=/mnt-library --options=defaults
  5. Restart the MPSS service:
    [host ~]# service mpss restart
    	Shutting down Intel(R) MPSS:                               [  OK  ]
    	Starting Intel(R) MPSS:                                    [  OK  ]
    	mic0: online (mode: linux image: /usr/share/mpss/boot/bzImage-knightscorner)
    	mic1: online (mode: linux image: /usr/share/mpss/boot/bzImage-knightscorner)
  6. Finally, from the coprocessor, verify that the remote directory is accessible:
    [host ~]# ssh mic0 cat /etc/fstab
    	rootfs          /               auto            defaults                1  1
    	proc            /proc           proc            defaults                0  0
    	devpts          /dev/pts        devpts          mode=0620,gid=5         0  0
    	172.31.1.254:/opt/intel/compilers_and_libraries_2017.0.098/linux  /mnt-library  nfs             defaults 1 1
    
    	[host ~]# ssh mic0 ls /mnt-library
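
Once the share is visible, a native binary on the coprocessor can pick up its runtime libraries directly from the mount point. The following is a minimal sketch; the binary name my_native_app is hypothetical, and the library subpath assumes the coprocessor runtime libraries live under compiler/lib/mic inside the mounted tree:

    [host ~]# ssh mic0
    (mic0)# export LD_LIBRARY_PATH=/mnt-library/compiler/lib/mic:$LD_LIBRARY_PATH
    (mic0)# ./my_native_app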

Mounting manually

As an example of the manual procedure, let’s assume we want to mount an NFS shared directory /mnt-mic0 on the Intel Xeon Phi coprocessor, where /var/mpss/mic0.export is the directory exported by the host machine. Steps 1-3 are analogous to those in the previous method:

  1. On the host machine, ensure that the shared directory exists; if it doesn’t exist, create it:
    [host ~]# mkdir /var/mpss/mic0.export
  2. Add a descriptor to the /etc/exports configuration file on the host machine to export the directory /var/mpss/mic0.export to the coprocessor mic0, which in this case has an IP address of 172.31.1.1:
    [host ~]# cat /etc/exports
    	/var/mpss/mic0.export 172.31.1.1(rw,async,no_root_squash)

    For more information on the export options, you can refer to http://nfs.sourceforge.net/nfs-howto/ar01s03.html.
     
  3. Next, update the NFS export table:
    [host ~]# exportfs -a
  4. Next, login on the coprocessor mic0:
    [host ~]# ssh mic0
  5. Create the mount point /mnt-mic0 on the coprocessor:
    (mic0)# mkdir /mnt-mic0
  6. Add the following descriptor to the /etc/fstab file of the coprocessor to specify the server, the path name of the exported directory, the local directory (mount point), the type of the file system, and the list of mount options: “172.31.1.254:/var/mpss/mic0.export /mnt-mic0 nfs defaults 1 1
    (mic0)# cat /etc/fstab
    	rootfs          /               auto             defaults                1  1
    	proc            /proc           proc             defaults                0  0
    	devpts          /dev/pts        devpts           mode=0620,gid=5         0  0
    	172.31.1.254:/var/mpss/mic0.export /mnt-mic0 nfs defaults                1  1
  7. To mount the shared directory /var/mpss/mic0.export on the coprocessor, we can type:
    (mic0)# mount -a
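
To confirm that the share is actually mounted, a quick check (not part of the original procedure) is to query the mount table or the free space on the mount point:

    (mic0)# mount | grep /mnt-mic0
    (mic0)# df -h /mnt-mic0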

Notes:

  • If a "Connection refused" error is received, restart the NFS server on the host:
    [host~]# service nfs restart
    Shutting down NFS daemon:                                  [  OK  ]
    Shutting down NFS mountd:                                  [  OK  ]
    Shutting down NFS quotas:                                  [  OK  ]
    Shutting down NFS services:                                [  OK  ]
    Starting NFS services:                                     [  OK  ]
    Starting NFS quotas:                                       [  OK  ]
    Starting NFS mountd:                                       [  OK  ]
    Stopping RPC idmapd:                                       [  OK  ]
    Starting RPC idmapd:                                       [  OK  ]
    Starting NFS daemon:                                       [  OK  ]
  • If a "Permission denied" error is received, review and correct the /etc/exports file on the host.
  • If the coprocessor reboots, you have to mount the directory on the coprocessor again.
  • The shared directory above is mounted read/write. To make it read-only, use the (ro,async,no_root_squash) options as shown in step 2 of the micctrl method.
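
After editing /etc/exports, it can also be useful to verify what the host is currently exporting; exportfs with the verbose flag lists the active exports and their options:

    [host ~]# exportfs -v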

Conclusion

This article showed two methods to mount a shared directory on the Intel Xeon Phi coprocessor: one using the micctrl utility, the other a common manual procedure. Although both methods work, the micctrl utility is preferred because it prevents users from entering data incorrectly in the coprocessor's /etc/fstab file.


Recipe: Building NAMD on Intel® Xeon® and Intel® Xeon Phi™ Processors


Purpose

This recipe describes a step-by-step process to get, build, and run NAMD (a scalable molecular dynamics code) on Intel® Xeon Phi™ processors and Intel® Xeon® E5 processors for better performance.

Introduction

NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecule systems. Based on Charm++ parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 500,000 cores for the largest simulations. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM, and X-PLOR.

NAMD is distributed free of charge with source code. You can build NAMD yourself or download binaries for a wide variety of platforms. Find details below on how to build NAMD for Intel® Xeon Phi™ processors and Intel® Xeon® E5 processors, and learn more about NAMD at http://www.ks.uiuc.edu/Research/namd/.

Building and running NAMD on Intel® Xeon® Processor E5-2697 v4 (BDW) and Intel® Xeon Phi™ Processor 7250 (KNL)

Download the Code:

  1. Download the latest “Source Code” of NAMD from this site: http://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD
  2. Download charm++ version 6.7.1
  3. Download fftw3 (http://www.fftw.org/download.html)
    • Version 3.3.4 is used in this run
  4. Download the apoa1 and stmv workloads from here: http://www.ks.uiuc.edu/Research/namd/utilities/

Build the Binaries:

  1. Recommended steps to build fftw3:
    • cd <path>/fftw3.3.4
    • ./configure --prefix=$base/fftw3 --enable-single --disable-fortran CC=icc
      Use -xMIC-AVX512 for KNL or -xCORE-AVX2 for BDW
    • make CFLAGS="-O3 -xMIC-AVX512 -fp-model fast=2 -no-prec-div -qoverride-limits" clean install
  2. Build multicore version of charm++:
    • cd <path>/charm-6.7.1
    • ./build charm++ multicore-linux64 iccstatic --with-production "-O3 -ip"
  3. Build BDW:
    • Modify the Linux-x86_64-icc.arch to look like the following (use -xCORE-AVX2 and drop the KNL-specific -DNAMD_KNL define):
      NAMD_ARCH = Linux-x86_64
      CHARMARCH = multicore-linux64-iccstatic
      FLOATOPTS = -ip -xCORE-AVX2 -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
      CXX = icpc -std=c++11
      CXXOPTS = -static-intel -O2 $(FLOATOPTS)
      CXXNOALIASOPTS = -O3 -fno-alias $(FLOATOPTS) -qopt-report-phase=loop,vec -qopt-report=4
      CXXCOLVAROPTS = -O2 -ip
      CC = icc
      COPTS = -static-intel -O2 $(FLOATOPTS)
    • ./config Linux-x86_64-icc --charm-base <charm_path> --charm-arch multicore-linux64-iccstatic --with-fftw3 --fftw-prefix <fftw_path> --without-tcl --charm-opts -verbose
    • gmake -j
  4. Build KNL:
    • Modify the arch/Linux-KNL-icc.arch to look like the following:
      NAMD_ARCH = Linux-KNL
      CHARMARCH = multicore-linux64-iccstatic
      FLOATOPTS = -ip -xMIC-AVX512 -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
      CXX = icpc -std=c++11 -DNAMD_KNL
      CXXOPTS = -static-intel -O2 $(FLOATOPTS)
      CXXNOALIASOPTS = -O3 -fno-alias $(FLOATOPTS) -qopt-report-phase=loop,vec -qopt-report=4
      CXXCOLVAROPTS = -O2 -ip
      CC = icc
      COPTS = -static-intel -O2 $(FLOATOPTS)
    • ./config Linux-KNL-icc --charm-base <charm_path> --charm-arch multicore-linux64-iccstatic --with-fftw3 --fftw-prefix <fftw_path> --without-tcl --charm-opts -verbose
    • gmake -j

Other system setup:

  1. Change the kernel settings for KNL: “nmi_watchdog=0 rcu_nocbs=2-271 nohz_full=2-271”. One way to change the settings (this could be different for every system):
    • First save your original grub.cfg to be safe
        cp /boot/grub2/grub.cfg /boot/grub2/grub.cfg.ORIG
    • In “/etc/default/grub”, add (append) the following to “GRUB_CMDLINE_LINUX”:
        nmi_watchdog=0 rcu_nocbs=2-271 nohz_full=2-271
    • Save your new configuration
        grub2-mkconfig -o /boot/grub2/grub.cfg
    • Reboot the system. After logging in, verify the settings with 'cat /proc/cmdline'
  2. Change next lines in *.namd file for both workloads:
         numsteps             1000
         outputtiming          20
         outputenergies     600

Run NAMD:

  1. Run BDW (ppn = 72):
    $BIN +p $ppn apoa1/apoa1.namd +pemap 0-($ppn-1)
  2. Run KNL (ppn = 136, MCDRAM in flat mode, similar performance in cache):
    numactl -m 1 $BIN +p $ppn apoa1/apoa1.namd +pemap 0-($ppn-1)

Example: numactl -m 1 /NAMD_2.11_Source/Linux-KNL-icc/namd2 +p 136 apoa1/apoa1.namd +pemap 0-135
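
The +pemap range depends on the chosen ppn; in a launch script the upper bound can be computed with shell arithmetic. A minimal sketch, reusing the $BIN and $ppn placeholders from above:

    # BDW: 72 worker threads pinned to cores 0-71
    ppn=72
    $BIN +p $ppn apoa1/apoa1.namd +pemap 0-$((ppn-1))

    # KNL: 136 worker threads, memory bound to MCDRAM (NUMA node 1)
    ppn=136
    numactl -m 1 $BIN +p $ppn apoa1/apoa1.namd +pemap 0-$((ppn-1))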

Performance results reported in Intel Salesforce repository (ns/day; higher is better):

Workload | 2S BDW 18c 2.3 GHz (ns/day) | KNL bin1 (ns/day) | KNL vs. 2S BDW (speedup)
stmv     | 0.45                        | 0.55              | 1.22x
apoa1    | 5.5                         | 6.18              | 1.12x

Systems configuration:

Processor                   | Intel® Xeon® Processor E5-2697 v4 (BDW) | Intel® Xeon Phi™ Processor 7250 (KNL)
Stepping                    | 1 (B0)                                  | 1 (B0) Bin1
Sockets / TDP               | 2S / 290W                               | 1S / 215W
Frequency / Cores / Threads | 2.3 GHz / 36 / 72                       | 1.4 GHz / 68 / 272
DDR4                        | 8x16 GB 2400 MHz (128 GB)               | 6x16 GB 2400 MHz
MCDRAM                      | N/A                                     | 16 GB Flat
Cluster/Snoop Mode/Mem Mode | Home                                    | Quadrant/Flat
Turbo                       | On                                      | On
BIOS                        | GRRFSDP1.86B0271.R00.1510301446         | GVPRCRB1.86B.0010.R02.1608040407
Compiler                    | ICC-2017.0.098                          | ICC-2017.0.098
Operating System            | Red Hat Enterprise Linux* 7.2 (3.10.0-327.el7.x86_64) | Red Hat Enterprise Linux 7.2 (3.10.0-327.22.2.el7.xppsl_1.4.1.3272.x86_64)

Building and running NAMD for Cluster on Intel® Xeon® Processor E5-2697 v4 (BDW) and Intel® Xeon Phi™ Processor 7250 (KNL)

Build the Binaries:

  1. Set Intel tools for compilation:
    I_MPI_CC=icc;I_MPI_CXX=icpc;I_MPI_F90=ifort;I_MPI_F77=ifort
    export I_MPI_CC I_MPI_CXX I_MPI_F90 I_MPI_F77
    CC=icc;CXX=icpc;F90=ifort;F77=ifort
    export CC CXX F90 F77
    export I_MPI_LINK=opt_mt
  2. Recommended steps to build fftw3:
    • cd <path>/fftw3.3.4
    • ./configure --prefix=$base/fftw3 --enable-single --disable-fortran CC=icc
    • Use -xMIC-AVX512 for KNL or -xCORE-AVX2 for BDW
    • make CFLAGS="-O3 -xMIC-AVX512 -fp-model fast=2 -no-prec-div -qoverride-limits" clean install
  3. Recommended steps to build multicore version of charm++:
    • cd <path>/charm-6.7.1
    • chmod -R 777 *
    • source /opt/intel/compiler/<version>/compilervars.sh intel64
    • source /opt/intel/impi/<version>/bin/mpivars.sh
    •  ./build charm++ mpi-linux-x86_64 smp mpicxx ifort --with-production $base_charm_opts -DCMK_OPTIMIZE -DMPICH_IGNORE_CXX_SEEK
  4. Build on KNL:
    •  ./config Linux-KNL-icc --charm-base <fullPath>/charm-6.7.1 --charm-arch mpi-linux-x86_64-ifort-smp-mpicxx --with-fftw3 --fftw-prefix <fullPath>/fftw3 --without-tcl --charm-opts -verbose
    • cd “Linux-KNL-icc”
    •   gmake -j
  5. Build on BDW:
    • ./config Linux-KNL-icc --charm-base $FULLPATH/charm-6.7.1 --charm-arch mpi-linux-x86_64-ifort-smp-mpicxx --with-fftw3 --fftw-prefix $FULLPATH/fftw3 --without-tcl --charm-opts -verbose
    •  cd Linux-KNL-icc
    • make clean
    • gmake -j

Run the Binaries (“hosts” is the file that contains the host names to run on):

  1. BDW run on single node:
    export I_MPI_PROVIDER=psm2
    export I_MPI_FALLBACK=no
    export I_MPI_FABRICS=tmi
    
    source /opt/intel/compiler/<version>/compilervars.sh intel64
    source /opt/intel/impi/<version>/intel64/bin/mpivars.sh
    
    NTASKS_PER_NODE=1
    export MPPEXEC="time -p mpiexec.hydra -perhost $NTASKS_PER_NODE -f ./hosts "
    $MPPEXEC -n $node $BINPATH/$BINNAME +ppn 71 $FULLPATH/$WORKLOAD +pemap 1-71 +commap 0

    Example:
    $MPPEXEC -n 1 $FULLPATH/namd2 +ppn 71 $FULLPATH/stmv/stmv.namd +pemap 1-71 +commap 0
     
  2. KNL Run on single node:
    export I_MPI_PROVIDER=psm2
    export I_MPI_FALLBACK=0
    export I_MPI_FABRICS=tmi
    export PSM2_IDENTIFY=1
    export PSM2_RCVTHREAD=0
    export TMI_PSM2_TEST_POLL=1
    
    NTASKS_PER_NODE=1
    export MPPEXEC="mpiexec.hydra -perhost $NTASKS_PER_NODE -f ./hosts "
    numactl -m 1 $MPPEXEC $BINPATH/$BINNAME +ppn 135 $FULLPATH/$WORKLOAD +pemap 1-135 +commap 0

    Example:
    numactl -m 1 $MPPEXEC $FULLPATH/namd2 +ppn 135 $FULLPATH/stmv/stmv.namd +pemap 1-135 +commap 0
     
  3. KNL Run on multi-node (node = number of nodes to run on):
    export MPPEXEC="mpiexec.hydra -perhost 1 -f ./hosts "
    numactl -m 1 $MPPEXEC -n $node numactl -m 1 $BINPATH/$BINNAME +ppn 134 $FULLPATH/$WORKLOAD +pemap 0-($ppn-1) +commap 67 

    Example:
    numactl -m 1 $MPPEXEC -n 8 numactl -m 1 $FULLPATH/namd2 +ppn 134 $FULLPATH/stmv/stmv.namd +pemap 0-66+68 +commap 67

Remark:

For better scale on multinodes run, please increase count of communication threads (1, 2, 4, 8, 13, 17). Example of a command run that can be used:

export MPPEXEC="mpiexec.hydra -perhost 17 -f ./hosts "
numactl -m 1 $MPPEXEC -n $(($node*17)) numactl -m 1 $BINPATH/$BINNAME +ppn 7  $FULLPATH/$WORKLOAD +pemap 0-67,68-135:4.3 +commap 71-135:4 > ${WKL}_cluster_commapN/${WKL}.$node.$

One usage example:

nodes="16 8 4 2 1"
for node in ${nodes}
do
  	export MPPEXEC="mpiexec.hydra -perhost 17 -f ./hosts "
numactl -m 1 $MPPEXEC -n $(($node*17)) numactl -m 1 $FullPath.namd2  +ppn 8  $WorkloadPath/$WKL/$WKL.namd  +pemap 0-67+68 +commap 71-135:4 > $ResultFile.$node.$BINNAME.68c2t.commap_8th_from2cx4t
done

Best performance results reported on up to 128 Intel Xeon Phi nodes cluster (ns/day; higher is better):

Workload\node (2HT) | 1    | 2    | 4    | 8    | 16
stmv (ns/day)       | 0.55 | 1.05 | 1.86 | 3.31 | 5.31

Workload\node (2HT) | 8     | 16    | 32    | 64   | 128
stmv.28M (ns/day)   | 0.152 | 0.310 | 0.596 | 1.03 | 1.91

Recipe: Building and Running MILC on Intel® Xeon® Processors and Intel® Xeon Phi™ Processors


Introduction

MILC software represents a set of codes written by the MIMD Lattice Computation (MILC) collaboration used to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics. It performs simulations of four-dimensional SU(3) lattice gauge theory on MIMD (Multiple Instruction, Multiple Data) parallel machines. “Strong interactions” are responsible for binding quarks into protons and neutrons and holding them all together in the atomic nucleus. MILC applications address fundamental questions in high energy and nuclear physics and are directly related to major experimental programs in these fields. MILC is one of the largest compute-cycle users at many U.S. and European supercomputing centers.

This article provides instructions for code access, build, and run directions for the “ks_imp_rhmc” application on Intel® Xeon® processors and Intel® Xeon Phi™ processors. The “ks_imp_rhmc” is a dynamical RHMC (rational hybrid Monte Carlo algorithm) code for staggered fermions. In addition to the naive and asqtad staggered actions, the highly improved staggered quark (HISQ) action is also supported.

Currently, the conjugate gradient (CG) solver in the code uses the QPhiX library. Efforts are ongoing to integrate other operations (gauge force (GF), fermion force (FF)) with the QPhiX library as well.

The QPhiX library provides sparse solvers and Dslash kernels for Lattice QCD simulations optimized for Intel® architectures.

Code Access

The MILC Software and QPhiX library are primarily required. The MILC software can be downloaded from GitHub here: https://github.com/milc-qcd/milc_qcd. Download the master branch. QPhiX support is integrated into this branch for CG solvers.

The QPhiX library and code generator for use with Wilson-Clover fermions (for example, for use with chroma) are available from https://github.com/jeffersonlab/qphix.git and https://github.com/jeffersonlab/qphix-codegen.git, respectively. For the most up to date version, we suggest you use the devel branch of QPhiX. The MILC version is currently not open source. Please contact the MILC collaboration group for access to the QPhiX (MILC) branch.

Build Directions

Compile the QPhiX Library

Users need to build QPhiX first before building the MILC package.

The QPhiX library comes as two tar files, mbench*.tar and qphix-codegen*.tar.

Untar both.

Build qphix-codegen

The files with intrinsics for QPhiX are built in the qphix-codegen directory.

Enter the qphix-codegen directory.

Edit line #3 in “Makefile_xyzt” to enable the “milc=1” variable.

Compile as:

source /opt/intel/compiler/<version>/bin/compilervars.sh intel64
source /opt/intel/impi/<version>/mpi/intel64/bin/mpivars.sh
make -f Makefile_xyzt avx512 -- [for Intel® Xeon Phi™ Processor]
make -f Makefile_xyzt avx2 -- [for Intel® Xeon® v3 / v4 Processors]

Build mbench

Enter the mbench directory.

Edit line #3 in “Makefile_qphixlib”, set “mode=mic” to compile with Intel® AVX-512 for Intel® Xeon Phi™ Processor and “mode=avx” to compile with Intel® Advanced Vector Extensions 2 (Intel® AVX2) for Intel® Xeon® Processors.

Edit line #13 in “Makefile_qphixlib” to enable MPI. Set ENABLE_MPI = 1.

Compile as:

make -f Makefile_qphixlib mode=mic AVX512=1 -- [Intel® Xeon Phi™ Processor]
make -f Makefile_qphixlib mode=avx AVX2=1 -- [Intel® Xeon® Processors]

Compile MILC Code

Install/download the master branch from the above GitHub location.

Download the Makefile.qphix file from the following location:

http://denali.physics.indiana.edu/~sg/MILC_Performance_Recipe/.

Copy the Makefile.qphix to the corresponding application directory. In this case, copy the Makefile.qphix to the “ks_imp_rhmc” application directory and rename it as Makefile.

Make the following changes to the Makefile (a sketch of the resulting settings appears after this list):

  • On line #17 - Add/uncomment the appropriate ARCH variable:
    • For example, ARCH = knl (compile with Intel AVX-512 for Intel® Xeon Phi™ Processor architecture).
    • For example, ARCH = bdw (compile with Intel AVX2 for Intel® Xeon® Processor architecture).
  • On line #28 - Change MPP variable to “true” if you want MPI.
  • On line #34 - Pick the PRECISION you want:
    • 1 = Single, 2 = Double. We use Double for our runs.
  • Starting at line #37 - The compiler setup begins; it should work as-is:
    • If the directions above were followed, no changes are needed. If not, customize starting at line #40.
  • On line #124 - Setup of Intel compiler starts:
    • Based on ARCH it will use the appropriate flags.
  • On line #395 - QPhiX customizations starts: 
    • On line #399 – Set QPHIX_HOME to correct QPhiX path (Path to mbench directory).
    • The appropriate QPhiX FLAGS will be set if the above is defined correctly.
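
For reference, after these edits the relevant variables might look like the fragment below. This is only a sketch for a KNL build with MPI and double precision; line numbers and exact values depend on the downloaded Makefile.qphix:

    ARCH = knl                      # line 17: target architecture (bdw for Intel Xeon)
    MPP = true                      # line 28: enable MPI
    PRECISION = 2                   # line 34: 1 = single, 2 = double
    QPHIX_HOME = <path-to>/mbench   # line 399: path to the QPhiX mbench directory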

Compile as:

Enter the ks_imp_rhmc directory. The Makefile with the above changes should be in this directory. Source the latest Intel® compilers and Intel® MPI Library.

make su3_rhmd_hisq -- Build su3_rhmd_hisq binary
make su3_rhmc_hisq -- Build su3_rhmc_hisq binary

Compile the above binaries for Intel® Xeon Phi™ Processor and Intel® Xeon® Processor (edit Makefile accordingly).

Run Directions

Input Files

There are two required input files, params.rest and rat.m013m065m838.

They can be downloaded from here:

http://denali.physics.indiana.edu/~sg/MILC_Performance_Recipe/.

The file rat.m013m065m838 defines the residues and poles of the rational functions needed in the calculation. The file params.rest sets all the run time parameters, including the lattice size, the length of the calculation (number of trajectories), and the precision of the various conjugate-gradient solutions.

In addition, a params.<lattice-size> file with the required lattice size will be created during runtime. This file is essentially the lattice size (nx * ny * nz * nt) to run, with params.rest appended to it.

The Lattice Sizes

The size of the four-dimensional space-time lattice is controlled by the “nx, ny, nz, nt” parameters.

As an example, consider a problem as (nx x ny x nz x nt) = 32 x 32 x 32 x 64 running on 64 MPI ranks. To weak scale this problem a user would begin by multiplying nt by 2, then nz by 2, then ny by 2, then nx by 2 and so on, such that all variables get sized accordingly in a round-robin fashion.

This is illustrated in the table below. The original problem size is 32 x 32 x 32 x 64; to keep the elements/rank constant (weak scaling) for a 128-rank count, first multiply nt by 2 (32 x 32 x 32 x 128). Similarly, for 512 ranks, multiply nt by 2, nz by 2, and ny by 2 from the original problem size to keep the same elements/rank. A short sketch after the table illustrates the doubling pattern.

Ranks          | 64      | 128     | 256     | 512
nx             | 32      | 32      | 32      | 32
ny             | 32      | 32      | 32      | 64
nz             | 32      | 32      | 64      | 64
nt             | 64      | 128     | 128     | 128
Total Elements | 2097152 | 4194304 | 8388608 | 16777216
Multiplier     | 1       | 2       | 4       | 8
Elements/Rank  | 32768   | 32768   | 32768   | 32768

Table: Illustrates Weak Scaling of Lattice Sizes
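
The round-robin doubling can be written as a short shell loop. This is only an illustrative sketch of the bookkeeping, not part of the MILC run scripts:

    # Double nt, nz, ny, nx in round-robin order so that elements/rank stays constant.
    nx=32; ny=32; nz=32; nt=64
    dims=(nt nz ny nx)
    i=0
    for ranks in 64 128 256 512; do
        echo "ranks=$ranks lattice=${nx}x${ny}x${nz}x${nt} elements/rank=$(( nx*ny*nz*nt / ranks ))"
        d=${dims[$(( i % 4 ))]}; eval "$d=\$(( $d * 2 ))"; i=$(( i + 1 ))
    done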

Running with MPI x OpenMP*

The calculation takes place on a four-dimensional hypercubic lattice, representing three spatial dimensions and one time dimension. The quark fields have values on each of the lattice points and the gluon field has values on each of the links connecting nearest-neighbors of the lattice sites. 

The lattice is divided into equal subvolumes, one per MPI rank. The MPI ranks can be thought of as being organized into a four-dimensional grid of ranks. It is possible to control the grid dimensions with the params.rest file. Of course, the grid dimensions must be integer factors of the lattice coordinate dimensions.

Each MPI rank executes the same code. The calculation requires frequent exchanges of quark and gluon values between MPI ranks that own neighboring lattice sites. Within a single MPI rank, the site-by-site calculation is threaded using OpenMP* directives, which have been inserted throughout the code. The most time-consuming part of production calculations is the CG solver. In the QPhiX version of the CG solver, the data layout and the calculation at the thread level are further organized to take advantage of the SIMD (single instruction, multiple data) lanes of the Intel Xeon and Intel Xeon Phi processors.
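
As a concrete illustration (hypothetical numbers, not taken from params.rest): splitting a 32 x 32 x 32 x 64 lattice over a 2 x 2 x 2 x 8 grid of 64 MPI ranks gives each rank a 16 x 16 x 16 x 8 local subvolume, because each grid dimension must evenly divide the corresponding lattice dimension:

    echo $(( (32/2) * (32/2) * (32/2) * (64/8) ))   # 32768 lattice sites per rank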

Running the Test Cases

  1. Create a “run” directory in the top-level directory and add the input files obtained from above.
  2. cd <milc>/run
    Note: Run the appropriate binary for each architecture.
  3. Create the lattice volume:
    cat << EOF > params.${nx}x${ny}x${nz}x${nt}
    prompt 0
    nx $nx
    ny $ny
    nz $nz
    nt $nt
    EOF
    cat params.rest >> params.${nx}x${ny}x${nz}x${nt}

    For this performance recipe, we evaluate the single node and multinode (16 nodes) performance with the following weak scaled lattice volume:

    Single Node (nx * ny * nz * nt): 24 x 24 x 24 x 60

    Multinode [16 nodes] (nx * ny * nz * nt): 48 x 48 x 48 x 120

  4. Run on Intel Xeon processor (E5-2697v4).
    Source the latest Intel compilers and Intel MPI Library
    • Intel® Parallel Studio 2017 and above recommended

    Single Node:

    mpiexec.hydra -n 12 -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' <path-to>/ks_imp_rhmc/su3_rhmd_hisq.bdw < params.24x24x24x60

    Multinode (16 nodes, via Intel® Omni-Path Host Fabric Interface (Intel® OP HFI)):

    # Create a runScript (run-bdw) #
    <path-to>/ks_imp_rhmc/su3_rhmd_hisq.bdw < params.48x48x48x120
    #Intel® OPA fabric-related environment variables#
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_TMI_PROVIDER=psm2
    export PSM2_IDENTIFY=1
    export I_MPI_FALLBACK=0
    #Create nodeconfig.txt with the following#
    -host <hostname1> -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 12 <path-to>/run-bdw
    …..
    …..
    …..
    -host <hostname16> -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 12 <path-to>/run-bdw
    #mpirun command#
    mpiexec.hydra -configfile nodeconfig.txt
  5. Run on Intel Xeon Phi processor (7250).
    Source Intel compilers and Intel MPI Library
    • Intel® Parallel Studio 2017 and above recommended

    Single Node:

    mpiexec.hydra -n 20 -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' numactl -p 1 <path-to>/ks_imp_rhmc/su3_rhmd_hisq.knl < params.24x24x24x60

    Multinode (16 nodes, via Intel OP HFI):

    # Create a runScript (run-knl) #
    numactl -p 1 <path-to>/ks_imp_rhmc/su3_rhmd_hisq.knl < params.48x48x48x120
    #Intel OPA fabric-related environment variables#
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_TMI_PROVIDER=psm2
    export PSM2_IDENTIFY=1
    export I_MPI_FALLBACK=0
    #Create nodeconfig.txt with the following#
    -host <hostname1> -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 20 <path-to>/run-knl
    …..
    …..
    …..
    -host <hostname16> -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 20 <path-to>/run-knl
    #mpirun command#
    mpiexec.hydra -configfile nodeconfig.txt

Performance Results and Optimizations

The output prints the total time to solution for the entire application, which takes into account the time for the different solvers and operators (for example, CG solver, fermion force, link fattening, gauge force, and so on).

The performance chart below shows the speedup relative to the 2S Intel Xeon processor E5-2697 v4 based on the total run time.

Figure: Speedup w.r.t. 2S Intel® Xeon® processor E5-2697 v4

The optimizations as part of the QPhiX library include data layout changes to target vectorization and generation of packed aligned loads/stores, cache blocking, load balancing and improved code generation for each architecture (Intel Xeon processor, Intel Xeon Phi processor) with corresponding intrinsics, where necessary. See References and Resources section for details.

Testing Platform Configurations

The following hardware was used for the above recipe and performance testing.

Processor                   | Intel® Xeon® Processor E5-2697 v4 | Intel® Xeon Phi™ Processor 7250F
Sockets / TDP               | 2S / 290W                         | 1S / 215W
Frequency / Cores / Threads | 2.3 GHz / 36 / 72                 | 1.4 GHz / 68 / 272
DDR4                        | 8x16 GB 2400 MHz                  | 6x16 GB 2400 MHz
MCDRAM                      | N/A                               | 16 GB Flat
Cluster/Snoop Mode          | Home                              | Quadrant
Memory Mode                 | -                                 | Flat
Turbo                       | OFF                               | OFF
BIOS                        | SE5C610.86B.01.01.0016.033120161139 | GVPRCRB1.86B.0010.R02.1606082342
Operating System            | Oracle Linux* 7.2 (3.10.0-229.20.1.el6.x86_64) | Oracle Linux* 7.2 (3.10.0-229.20.1.el6.x86_64)

MILC Build Configurations

The following configurations were used for the above recipe and performance testing.

MILC Version               | Master version as of 28 January 2017
Intel® Compiler Version    | 2017.1.132
Intel® MPI Library Version | 2017.0.098
MILC Makefiles Used        | Makefile.qphix, Makefile_qphixlib, Makefile

References and Resources

  1. MIMD Lattice Computation (MILC) Collaboration: http://physics.indiana.edu/~sg/milc.html
  2. QPhiX Case Study: http://www.nersc.gov/users/computational-systems/cori/application-porting-and-performance/application-case-studies/qphix-case-study/
  3. MILC Staggered Conjugate Gradient Performance on Intel® Xeon Phi™ Processor: https://anl.app.box.com/v/IXPUG2016-presentation-10

Recipe: Building and Running GROMACS* on Intel® Processors


Purpose

This recipe describes how to get, build, and run the GROMACS* code on Intel® Xeon® and Intel® Xeon Phi™ processors for better performance on a single node.

Introduction

GROMACS is a versatile package for performing molecular dynamics, using Newtonian equations of motion, for systems with hundreds to millions of particles. GROMACS is primarily designed for biochemical molecules like proteins, lipids, and nucleic acids that have a multitude of complicated bonded interactions. But, since GROMACS is extremely fast at calculating the non-bonded interactions typically dominating simulations, many researchers use it for research on non-biological systems, such as polymers.

GROMACS supports all the usual algorithms expected from a modern molecular dynamics implementation.

The GROMACS code is maintained by developers around the world. The code is available under the GNU General Public License from www.gromacs.org.

Code Access

Download GROMACS:

  • The GROMACS source code is available from www.gromacs.org.

Workloads Access

Download the workloads:

  • The water_GMX50_bare.tar.gz and lignocellulose-rf benchmark workloads used in this recipe.

Generate Water Workloads Input Files:

To generate the .tpr input file:

  • tar xf water_GMX50_bare.tar.gz
  • cd water-cut1.0_GMX50_bare/1536
  • gmx_mpi grompp -f pme.mdp -c conf.gro -p topol.top -o topol_pme.tpr
  • gmx_mpi grompp -f rf.mdp -c conf.gro -p topol.top -o topol_rf.tpr

Build Directions

Build the GROMACS binary. Use cmake configuration for Intel® Compiler 2017.1.132 + Intel® MKL + Intel® MPI 2017.1.132:

Set the Intel Xeon Phi BIOS options to be:

  • Quadrant Cluster mode
  • MCDRAM Flat mode
  • Turbo Enabled

For Intel Xeon Phi, build the code as:

  • BuildDir="${GromacsPath}/build" # Create the build directory
  • installDir="${GromacsPath}/install"
  • mkdir $BuildDir
     

  • source /opt/intel/<version>/bin/compilervars.sh intel64 # Source the Intel compiler, MKL and IMPI
  • source /opt/intel/impi/<version>/mpivars.sh
  • source /opt/intel/mkl/<version>/mklvars.sh intel64
     

  • cd $BuildDir # Set the build environments for Intel Xeon Phi
FLAGS="-xMIC-AVX512 -g -static-intel"; CFLAGS=$FLAGS CXXFLAGS=$FLAGS CC=mpiicc CXX=mpiicpc cmake .. -DBUILD_SHARED_LIBS=OFF -DGMX_FFT_LIBRARY=mkl -DCMAKE_INSTALL_PREFIX=$installDir -DGMX_MPI=ON -DGMX_OPENMP=ON -DGMX_CYCLE_SUBCOUNTERS=ON -DGMX_GPU=OFF -DGMX_BUILD_HELP=OFF -DGMX_HWLOC=OFF -DGMX_SIMD=AVX_512_KNL -DGMX_OPENMP_MAX_THREADS=256

For Intel Xeon, set the build environments and build the code as above with changes:

  • FLAGS="-xCORE-AVX2 -g -static-intel"
  • -DGMX_SIMD=AVX2_256
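
Putting those two changes together, the Intel Xeon configuration line would look roughly like this (a sketch assuming the same build directory layout as the Intel Xeon Phi build above):

    FLAGS="-xCORE-AVX2 -g -static-intel"; CFLAGS=$FLAGS CXXFLAGS=$FLAGS CC=mpiicc CXX=mpiicpc cmake .. -DBUILD_SHARED_LIBS=OFF -DGMX_FFT_LIBRARY=mkl -DCMAKE_INSTALL_PREFIX=$installDir -DGMX_MPI=ON -DGMX_OPENMP=ON -DGMX_CYCLE_SUBCOUNTERS=ON -DGMX_GPU=OFF -DGMX_BUILD_HELP=OFF -DGMX_HWLOC=OFF -DGMX_SIMD=AVX2_256 -DGMX_OPENMP_MAX_THREADS=256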

Build GROMACS:

  • make -j 4
  • sleep 5
  • make check

Run Directions

Run workloads on Intel Xeon Phi with the environment settings and command lines as (nodes.txt : localhost:272):


	export  I_MPI_DEBUG=5
	export I_MPI_FABRICS=shm
	export I_MPI_PIN_MODE=lib
	export KMP_AFFINITY=verbose,compact,1

	gmxBin="${installDir}/bin/gmx_mpi"

	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 66 numactl -m 1 $gmxBin mdrun -npme 0 -notunepme -ntomp 4 -dlb yes -v -nsteps 4000 -resethway -noconfout -pin on -s ${WorkloadPath}water-cut1.0_GMX50_bare/1536/topol_pme.tpr
	export KMP_BLOCKTIME=0
	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 66 numactl -m 1 $gmxBin mdrun -ntomp 4 -dlb yes -v -nsteps 1000 -resethway -noconfout -pin on -s ${WorkloadPath}lignocellulose-rf.BGQ.tpr
	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 64 numactl -m 1 $gmxBin mdrun -ntomp 4 -dlb yes -v -nsteps 5000 -resethway -noconfout -pin on -s ${WorkloadPath}water-cut1.0_GMX50_bare/1536/topol_rf.tpr

Run workloads on Intel Xeon with the environment settings and command lines as:


	export  I_MPI_DEBUG=5
	export I_MPI_FABRICS=shm
	export I_MPI_PIN_MODE=lib
	export KMP_AFFINITY=verbose,compact,1

	gmxBin="${installDir}/bin/gmx_mpi"

	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 72 $gmxBin mdrun -notunepme -ntomp 1 -dlb yes -v -nsteps 4000 -resethway -noconfout -s ${WorkloadPath}water-cut1.0_GMX50_bare/1536_bdw/topol_pme.tpr
	export KMP_BLOCKTIME=0
	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 72 $gmxBin mdrun -ntomp 1 -dlb yes -v -nsteps 1000 -resethway -noconfout -s ${WorkloadPath}lignocellulose-rf.BGQ.tpr
	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 72 $gmxBin mdrun -ntomp 1 -dlb yes -v -nsteps 5000 -resethway -noconfout -s ${WorkloadPath}water-cut1.0_GMX50_bare/1536_bdw/topol_rf.tpr

Performance Testing

Performance tests for GROMACS are illustrated below with comparisons between an Intel Xeon processor and an Intel Xeon Phi processor against three standard workloads: water1536k_pme, water1536k_rf, and lignocellulose3M_rf. In all cases, turbo mode is turned on.

Testing Platform Configurations

The following hardware was used for the above recipe and performance testing.

Processor                   | Intel® Xeon® Processor E5-2697 v4 | Intel® Xeon Phi™ Processor 7250
Stepping                    | 1 (B0)                            | 1 (B0) Bin1
Sockets / TDP               | 2S / 290W                         | 1S / 215W
Frequency / Cores / Threads | 2.3 GHz / 36 / 72                 | 1.4 GHz / 68 / 272
DDR4                        | 8x16 GB 2400 MHz (128 GB)         | 6x16 GB 2400 MHz
MCDRAM                      | N/A                               | 16 GB Flat
Cluster/Snoop Mode/Mem Mode | Home                              | Quadrant/Flat
Turbo                       | On                                | On
BIOS                        | GRRFSDP1.86B.0271.R00.1510301446  | GVPRCRB1.86B.0011.R04.1610130403
Compiler                    | ICC-2017.1.132                    | ICC-2017.1.132
Operating System            | Red Hat Enterprise Linux* 7.2 (3.10.0-327.el7.x86_64) | Red Hat Enterprise Linux 7.2 (3.10.0-327.13.1.el7.xppsl_1.3.3.151.x86_64)

GROMACS Build Configurations

The following configurations were used for the above recipe and performance testing.

  • GROMACS Version: GROMACS-2016.1
  • Intel® Compiler Version: 2017.1.132
  • Intel® MPI Library Version: 2017.1.132
  • Workloads used: water1536k_pme, water1536k_rf, and lignocellulose3M_rf

Recipe: Building and running NEMO* on Intel® Xeon Phi™ Processors


About NEMO*

The NEMO* (Nucleus for European Modelling of the Ocean) numerical solutions framework encompasses models of ocean, sea ice, tracers, and biochemistry equations and their related physics. It also incorporates the pre- and post-processing tools and the interface to other components of the Earth System. NEMO allows several ocean-related components of the Earth System to work together or separately, and also allows for two-way nesting via AGRIF software. It is interfaced with the remaining components of the Earth System package (atmosphere, land surfaces, and so on) via the OASIS coupler.

This recipe shows the performance advantages of using the Intel® Xeon Phi™ processor 7250.

NEMO 3.6 is the current stable version.

Downloading the Code

  1. Download the NEMO source code from the official NEMO repository (you should register at www.nemo-ocean.eu ):

    svn co -r 6939 http://forge.ipsl.jussieu.fr/nemo/svn/branches/2015/nemo_v3_6_STABLE/NEMOGCM nemo

  2. Download the XIOS IO server from the official XIOS repository:

    svn co -r 703 http://forge.ipsl.jussieu.fr/ioserver/svn/XIOS/branchs/xios-1.0 xios

  3. If your system already has NetCDF libraries with Fortran bindings installed and they link with the NEMO and XIOS binaries, go to the section “Building XIOS for the Intel Xeon Processor”.
  4. Otherwise, download NetCDF-Fortran from https://github.com/Unidata/netcdf-fortran/archive/netcdf-fortran-4.2.tar.gz

Building Additional Libraries for the Intel® Xeon® Processor

  1. First, choose a directory for your experiments, such as “~/NEMO-BDW”:
    export base=~/NEMO-BDW
  2. Create a directory and copy all required libraries in $base:
    mkdir -p $base/libraries
  3. Unpack the tarball files in $base/libraries/src.
  4. To build an Intel® Advanced Vector Extensions 2 (Intel® AVX2) version of libraries, set:
    export arch="-xCORE-AVX2"
  5. Set the following environment variables:
    export PREFIX=$base/libraries
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${PREFIX}/lib
    export CFLAGS="-I$PREFIX/include -L$PREFIX/lib -O3 -g -traceback -openmp ${arch} -fPIC"
    export CPPFLAGS=$CFLAGS
    export CXXFLAGS=$CFLAGS
    export FFFLAGS=$CFLAGS
    export FCFLAGS=$CFLAGS
    export LDFLAGS="-L$PREFIX/lib -openmp ${arch} -fPIC"
    export FC=mpiifort
    export CXX=mpiicc
    export CC=mpiicc
    export CPP="icc -E"
  6. Build szip:
    cd $base/libraries/src/szip-2.1
    ./configure --prefix=$PREFIX
    make -j 4
    make install
  7. Build zlib:
    cd $base/libraries/src/zlib-1.2.8
    ./configure --prefix=$PREFIX
    make -j 4
    make install
  8. Build HDF5:
    cd $base/libraries/src/hdf5-1.8.12
    ./configure --with-zlib=$PREFIX --prefix=$PREFIX --enable-fortran --with-szlib=$PREFIX --enable-hl
    make
    make install
  9. Build CURL:
    cd $base/libraries/src/curl-7.42.1
    ./configure --prefix=$PREFIX
    make -j 4
    make install
  10. Build NetCDF:
    cd $base/libraries/src/netcdf-4.3.3
    export LIBS=" -lhdf5_hl -lhdf5 -lz -lsz -lmpi"
    export LD_FLAGS+=" -L$PREFIX/lib"
    ./configure --prefix=$PREFIX
    make
    make install
  11. Build NetCDF Fortran wrapper:
    cd $base/libraries/src/netcdf-fortran-4.2/
    export LIBS=""
    export CFLAGS="$CFLAGS -lnetcdf"
    export CPPFLAGS=$CFLAGS
    export CXXFLAGS=$CFLAGS
    export FFFLAGS=$CFLAGS
    export FCFLAGS=$CFLAGS
    export FC=ifort
    export CXX=mpiicc
    export CC=mpiicc
    export LDFLAGS+=" -L$I_MPI_ROOT/lib64/"
    ./configure --prefix=$PREFIX
    make
    make install

Building XIOS for the Intel Xeon Processor

  1. Copy XIOS source code to $base/xios
  2. Create files:
    $base/xios/arch/arch-ifort_linux.env
    $base/xios/arch/arch-ifort_linux.fcm
    $base/xios/arch/arch-ifort_linux.path
  3. Add the following lines to the $base/xios/arch/arch-ifort_linux.env file:
    export NETCDF_INC_DIR=$base/libraries/include
    export NETCDF_LIB_DIR=$base/libraries/lib
    export HDF5_INC_DIR=$base/libraries/include
    export HDF5_LIB_DIR=$base/libraries/lib
  4. Add the following lines to the $base/xios/arch/arch-ifort_linux.fcm file:
    %NCDF_INC            -I$base/libraries/include
    %NCDF_LIB            -L$base/libraries/lib -lnetcdff -lnetcdf -lhdf5 -lcurl -lz -lsz
    %FC                  mpiifort
    %FCFLAGS             -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %FFLAGS              -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %LD                  mpiifort
    %FPPFLAGS            -P -C -traditional
    %LDFLAGS             -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %AR                  ar
    %ARFLAGS             -r
    %MK                  gmake
    %USER_INC            %NCDF_INC_DIR
    %USER_LIB            %NCDF_LIB_DIR
    
    %MAKE                gmake
    %BASE_LD        -lstdc++ -lifcore -lintlc
    %LINKER         mpiifort -nofor-main
    %BASE_INC       -D__NONE__
    %CCOMPILER      mpiicc
    %FCOMPILER      mpiifort
    %CPP            cpp
    %FPP            cpp -P
    
    %BASE_CFLAGS    -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %PROD_CFLAGS    -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %DEV_CFLAGS    -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %DEBUG_CFLAGS  -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %BASE_FFLAGS   -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %PROD_FFLAGS    -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %DEV_FFLAGS    -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %DEBUG_FFLAGS   -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
  5. Add the following lines to the $base/xios/arch/arch-ifort_linux.path file:
    NETCDF_INCDIR="-I $NETCDF_INC_DIR"
    NETCDF_LIBDIR="-L $NETCDF_LIB_DIR"
    NETCDF_LIB="-lnetcdff -lnetcdf -lcurl"
    MPI_INCDIR=""
    MPI_LIBDIR=""
    MPI_LIB=""
    HDF5_INCDIR="-I $HDF5_INC_DIR"
    HDF5_LIBDIR="-L $HDF5_LIB_DIR"
    HDF5_LIB="-lhdf5_hl -lhdf5 -lz -lcurl"
  6. Change directory to $base/xios and execute the following command:
    ./make_xios --full --prod --arch ifort_linux

Building NEMO for the Intel Xeon Processor and Preparing Workloads

  1. Copy NEMO source code to $base/nemo
  2. Apply the following patch to the file $base/nemo/NEMOGCM/NEMO/OPA_SRC/nemogcm.F90:
    @@ -116,6 +116,7 @@
           !!              Madec, 2008, internal report, IPSL.
           !!----------------------------------------------------------------------
           INTEGER ::   istp       ! time step index
    +DOUBLE PRECISION :: mpi_wtime, sstart, send
           !!----------------------------------------------------------------------
           !
     #if defined key_agrif
    @@ -163,18 +164,19 @@
     #if defined key_agrif
               CALL Agrif_Regrid()
     #endif
    -
              DO WHILE ( istp <= nitend .AND. nstop == 0 )
    +sstart = mpi_wtime()
     #if defined key_agrif
                 CALL stp                         ! AGRIF: time stepping
     #else
                 CALL stp( istp )                 ! standard time stepping
     #endif
    +send=mpi_wtime()
    +print *, "Step ", istp, " - " , send-sstart , "s."
                 istp = istp + 1
                 IF( lk_mpp )   CALL mpp_max( nstop )
              END DO
     #endif
    -
           IF( lk_diaobs   )   CALL dia_obs_wri
           !
           IF( ln_icebergs )   CALL icb_end( nitend )
  3. Create the file $base/nemo/ARCH/arch-mpiifort_linux.fcm and add the following lines:
    %NCDF_INC            -I/$base/libraries/include
    %NCDF_LIB            -L$base/libraries/lib -lnetcdff -lnetcdf -lz -lcurl -lhdf5_hl -lhdf5 -lz -lcurl
    %CPP                 icc -E
    %FC                  mpiifort
    %FCFLAGS          -r8 -g -traceback -qopenmp -O3 -xCORE-AVX2 -g -traceback
    %FFLAGS             -r8 -g -traceback -qopenmp -O3 -xCORE-AVX2 -g -traceback
    %LD                  mpiifort
    %FPPFLAGS            -P -C -traditional
    %LDFLAGS             -lstdc++ -lifcore -O3 -xCORE-AVX2 -g -traceback
    %AR                  ar
    %ARFLAGS             -r
    %MK                  gmake
    %XIOS_INC            -I$base/xios/inc
    %XIOS_LIB            -L$base/xios/lib -lxios
    %USER_INC            %NCDF_INC %XIOS_INC
    %USER_LIB            %NCDF_LIB %XIOS_LIB
  4. Build the binary for the GYRE workload:
    cd $base/nemo/NEMOGCM/CONFIG
    ./makenemo -n GYRE -m mpiifort_linux -j 4
  5. Create a sandbox directory for the GYRE runs:
    1.  mkdir -p $base/nemo/gyre-exp
       cp -r $base/nemo/NEMOGCM/CONFIG/GYRE/BLD/bin/nemo.exe $base/nemo/gyre-exp
       cp -r $base/nemo/NEMOGCM/CONFIG/GYRE/EXP00/* $base/nemo/gyre-exp
    2. Switch off creating mesh files by changing “nn_msh” to 0 in the namelist_ref file
    3. Enable benchmark mode by changing “nn_bench” to 1 in namelist_ref  file.
    4. Set the following parameters in the “&namcfg” section:
      jp_cfg = 70
      jpidta = 2102
      jpjdta = 1402
      jpkdta = 31
      jpiglo = 2102
      jpjglo = 1402
    5. Switch off using the IO server in the iodef.xml file (“using_server = false”)
  6. Build a binary for the ORCA025 workload:
    1. Change  “$base/nemo/NEMOGCM/CONFIG/ORCA2_LIM3/cpp_ORCA2_LIM3.fcm” content to “bld::tool::fppkeys key_trabbl key_vvl key_dynspg_ts key_ldfslp key_traldf_c2d key_traldf_eiv key_dynldf_c3d key_zdfddm key_zdftmx key_mpp_mpi key_zdftke key_lim3 key_iomput”
    2. Change the line “ORCA2_LIM3 OPA_SRC LIM_SRC_3 NST_SRC” to “ORCA2_LIM3 OPA_SRC LIM_SRC_3” in file $base/nemo/NEMOGCM/CONFIG/cfg.txt
    3. ./makenemo -n ORCA2_LIM3 -m mpiifort_linux -j 4
  7. Go to the Barcelona Supercomputing Center (in Spanish), and in section 9 locate the paragraph, “PREGUNTAS Y RESPUESTAS:” with a path to the ftp server and credentials to log in.
  8. Download the BenchORCA025L75.tar.gz file from directory Benchmarks_aceptacion/NEMO/
  9. Extract the contents of the tarball file to $base/nemo/orca-exp
  10. Copy the NEMO binary to the sandbox directory:
    cp $base/nemo/NEMOGCM/CONFIG/ORCA2_LIM3/BLD/bin/nemo.exe $base/nemo/orca-exp
  11. Edit the file $base/nemo/orca-exp/iodef.xml and add the following lines into the “<context id="xios">    <variable_definition>” section:
    <variable id="min_buffer_size" type="int">994473778</variable><variable id="buffer_size" type="int">994473778</variable> 
  12. In the file namelist_ref in section “&namrun” set the following variables:
    nn_itend     =   10
    nn_stock    =    10
    nn_write    =    10
  13. Copy the $base/nemo/NEMOGCM/CONFIG/SHARED/namelist_ref file to $base/nemo/exp-orca
  14. Switch off using the IO server in the iodef.xml file (“using_server = false”)
  15. To build the KNL binaries change “-xCORE-AVX2” to “-xMIC-AVX512”, change $base to another directory, and do all of the steps again.

Running the GYRE Workload with the Intel Xeon Processor

  1. Go to $base/nemo/gyre-exp
  2. Source the environment variables for the compiler and the Intel® MPI Library:
    source /opt/intel/compiler/latest/bin/compilervars.sh intel64
    source /opt/intel/impi/latest/intel64/bin/mpivars.sh
  3. Add libraries to LD_LIBRARY_PATH:
    export LD_LIBRARY_PATH=$base/libraries/lib/:$LD_LIBRARY_PATH
  4. Set additional variables for the Intel MPI Library:
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_PIN_CELL=core
  5. Run NEMO:
    mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe
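
For example, on four dual-socket Xeon nodes with 36 ranks per node the launch would look like this (node count, rank count, and the hosts.txt file name are hypothetical, not part of the original recipe):

    mpiexec.hydra -genvall -f hosts.txt -n 144 -perhost 36 ./nemo.exe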

Running the ORCA025 Workload with the Intel Xeon Processor

  1. Go to $base/nemo/orca-exp
  2. Source the environment variables for the compiler and the Intel MPI Library:
    source /opt/intel/compiler/latest/bin/compilervars.sh intel64
    source /opt/intel/impi/latest/intel64/bin/mpivars.sh
  3. Add libraries to LD_LIBRARY_PATH:
    export LD_LIBRARY_PATH=$base/libraries/lib/:$LD_LIBRARY_PATH
  4. Set additional variables for Intel MPI Library:
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_PIN_CELL=core
  5. Run NEMO:
    mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe
  6. If you are faced with hangs while the application is running you can run NEMO with the XIOS server in detached mode:
    1. Copy xios_server.exe from $base/xios/bin to $base/nemo/orca-exp
    2. Edit iodef.xml file and set “using_server = true”
    3. mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe : -n 2 ./xios_server.exe

Building Additional Libraries for the Intel® Xeon Phi™ Processor

  1. First, choose a directory for your experiments, such as “~/NEMO-KNL”
    export base=~/NEMO-KNL
  2. Create the directory and copy all required libraries in $base:
    mkdir -p $base/libraries
  3. Unpack the tarball files in $base/libraries/src
  4. To build an Intel® AVX-512 version of the libraries, set:
    export arch="-xMIC-AVX512"
  5. Set the following environment variables:
     export PREFIX=$base/libraries
     export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${PREFIX}/lib
     export CFLAGS="-I$PREFIX/include -L$PREFIX/lib -O3 -g -traceback -openmp ${arch} -fPIC"
     export CPPFLAGS=$CFLAGS
     export CXXFLAGS=$CFLAGS
     export FFFLAGS=$CFLAGS
     export FCFLAGS=$CFLAGS
     export LDFLAGS="-L$PREFIX/lib -openmp ${arch} -fPIC"
     export FC=mpiifort
     export CXX=mpiicc
     export CC=mpiicc
     export CPP="icc -E"
  6. Build szip:
     cd $base/libraries/src/szip-2.1
     ./configure --prefix=$PREFIX
     make -j 4
     make install
  7. Build zlib:
    cd $base/libraries/src/zlib-1.2.8
    ./configure --prefix=$PREFIX
    make -j 4
    make install
  8. Build HDF5:
    cd $base/libraries/src/hdf5-1.8.12
    ./configure --with-zlib=$PREFIX --prefix=$PREFIX --enable-fortran --with-szlib=$PREFIX --enable-hl
    make
    make install
  9. Build CURL:
    cd $base/libraries/src/curl-7.42.1
    ./configure --prefix=$PREFIX
    make -j 4
    make install
  10. Build NetCDF:
    cd $base/libraries/src/netcdf-4.3.3
    export LIBS=" -lhdf5_hl -lhdf5 -lz -lsz -lmpi"
    export LD_FLAGS+=" -L$PREFIX/lib"
    ./configure --prefix=$PREFIX
    make
    make install
  11. Build the NetCDF Fortran wrapper:
    cd $base/libraries/src/netcdf-fortran-4.2/
    export LIBS=""
    export CFLAGS="$CFLAGS -lnetcdf"
    export CPPFLAGS=$CFLAGS
    export CXXFLAGS=$CFLAGS
    export FFFLAGS=$CFLAGS
    export FCFLAGS=$CFLAGS
    export FC=ifort
    export CXX=mpiicc
    export CC=mpiicc
    export LDFLAGS+=" -L$I_MPI_ROOT/lib64/"
    ./configure --prefix=$PREFIX
    make
    make install

Building XIOS for the Intel Xeon Phi Processor

  1. Copy XIOS source code to $base/xios
  2. Create files:
    $base/xios/arch/arch-ifort_linux.env
    $base/xios/arch/arch-ifort_linux.fcm
    $base/xios/arch/arch-ifort_linux.path
  3. Add the following lines to the $base/xios/arch/arch-ifort_linux.env file:
    export NETCDF_INC_DIR=$base/libraries/include
    export NETCDF_LIB_DIR=$base/libraries/lib
    export HDF5_INC_DIR=$base/libraries/include
    export HDF5_LIB_DIR=$base/libraries/lib
  4. Add the following lines to the $base/xios/arch/arch-ifort_linux.fcm file:
    %NCDF_INC            -I$base/libraries/include
    %NCDF_LIB            -L$base/libraries/lib -lnetcdff -lnetcdf -lhdf5 -lcurl -lz -lsz
    %FC                  mpiifort
    %FCFLAGS             -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %FFLAGS              -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %LD                  mpiifort
    %FPPFLAGS            -P -C -traditional
    %LDFLAGS             -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %AR                  ar
    %ARFLAGS             -r
    %MK                  gmake
    %USER_INC            %NCDF_INC_DIR
    %USER_LIB            %NCDF_LIB_DIR
    
    %MAKE                gmake
    %BASE_LD        -lstdc++ -lifcore -lintlc
    %LINKER         mpiifort -nofor-main
    %BASE_INC       -D__NONE__
    %CCOMPILER      mpiicc
    %FCOMPILER      mpiifort
    %CPP            cpp
    %FPP            cpp -P
    
    %BASE_CFLAGS    -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %PROD_CFLAGS    -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %DEV_CFLAGS     -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %DEBUG_CFLAGS   -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %BASE_FFLAGS    -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %PROD_FFLAGS    -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %DEV_FFLAGS     -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %DEBUG_FFLAGS   -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
  5. Add the following lines to the $base/xios/arch/arch-ifort_linux.path file:
    NETCDF_INCDIR="-I $NETCDF_INC_DIR"
    NETCDF_LIBDIR="-L $NETCDF_LIB_DIR"
    NETCDF_LIB="-lnetcdff -lnetcdf -lcurl"
    MPI_INCDIR=""
    MPI_LIBDIR=""
    MPI_LIB=""
    HDF5_INCDIR="-I $HDF5_INC_DIR"
    HDF5_LIBDIR="-L $HDF5_LIB_DIR"
    HDF5_LIB="-lhdf5_hl -lhdf5 -lz -lcurl"
  6. Change the directory to $base/xios and execute the following command:
    ./make_xios --full --prod --arch ifort_linux

Building NEMO for the Intel Xeon Phi Processor and Preparing Workloads

  1. Copy the NEMO source code to $base/nemo
  2. Apply the following patch to the file $base/nemo/NEMOGCM/NEMO/OPA_SRC/nemogcm.F90:
    @@ -116,6 +116,7 @@
           !!              Madec, 2008, internal report, IPSL.
           !!----------------------------------------------------------------------
           INTEGER ::   istp       ! time step index
    +DOUBLE PRECISION :: mpi_wtime, sstart, send
           !!----------------------------------------------------------------------
           !
     #if defined key_agrif
    @@ -163,18 +164,19 @@
     #if defined key_agrif
               CALL Agrif_Regrid()
     #endif
    -
              DO WHILE ( istp <= nitend .AND. nstop == 0 )
    +sstart = mpi_wtime()
     #if defined key_agrif
                 CALL stp                         ! AGRIF: time stepping
     #else
                 CALL stp( istp )                 ! standard time stepping
     #endif
    +send=mpi_wtime()
     +print *, "Step ", istp, " - " , send-sstart , "s."
                 istp = istp + 1
                 IF( lk_mpp )   CALL mpp_max( nstop )
              END DO
     #endif
    -
           IF( lk_diaobs   )   CALL dia_obs_wri
           !
           IF( ln_icebergs )   CALL icb_end( nitend )
  3. Create the file $base/nemo/ARCH/arch-mpiifort_linux.fcm and add the following lines:
    %NCDF_INC            -I/$base/libraries/include
    %NCDF_LIB            -L$base/libraries/lib -lnetcdff -lnetcdf -lz -lcurl -lhdf5_hl -lhdf5 -lz -lcurl
    %CPP                 icc -E
    %FC                  mpiifort
    %FCFLAGS          -r8 -g -traceback -qopenmp -O3 -xMIC-AVX512 -g -traceback
    %FFLAGS             -r8 -g -traceback -qopenmp -O3 -xMIC-AVX512 -g -traceback
    %LD                  mpiifort
    %FPPFLAGS            -P -C -traditional
    %LDFLAGS             -lstdc++ -lifcore -O3 -xMIC-AVX512 -g -traceback
    %AR                  ar
    %ARFLAGS             -r
    %MK                  gmake
    %XIOS_INC            -I$base/xios/inc
    %XIOS_LIB            -L$base/xios/lib -lxios
    %USER_INC            %NCDF_INC %XIOS_INC
    %USER_LIB            %NCDF_LIB %XIOS_LIB
  4. Build the binary for the GYRE workload:
    cd $base/nemo/NEMOGCM/CONFIG
    ./makenemo -n GYRE -m mpiifort_linux -j 4
  5. Create a sandbox directory for the GYRE runs:
    1. mkdir -p $base/nemo/gyre-exp
       cp -r $base/nemo/NEMOGCM/CONFIG/GYRE/BLD/bin/nemo.exe $base/nemo/gyre-exp
       cp -r $base/nemo/NEMOGCM/CONFIG/GYRE/EXP00/* $base/nemo/gyre-exp
    2. Switch off creating mesh files by changing “nn_msh” to 0 in the namelist_ref file
    3. Enable benchmark mode by changing “nn_bench” to 1 in the namelist_ref  file.
    4. Set the following parameters in the “&namcfg” section:
      jp_cfg = 70
      jpidta = 2102
      jpjdta = 1402
      jpkdta = 31
      jpiglo = 2102
      jpjglo = 1402
    5. Switch off using the IO server in the iodef.xml file (“using_server = false”)
  6. Build the binary for ORCA025 workload:
    1. Change  $base/nemo/NEMOGCM/CONFIG/ORCA2_LIM3/cpp_ORCA2_LIM3.fcm content to “bld::tool::fppkeys key_trabbl key_vvl key_dynspg_ts key_ldfslp key_traldf_c2d key_traldf_eiv key_dynldf_c3d key_zdfddm key_zdftmx key_mpp_mpi key_zdftke key_lim3 key_iomput”
    2. Change line “ORCA2_LIM3 OPA_SRC LIM_SRC_3 NST_SRC” to “ORCA2_LIM3 OPA_SRC LIM_SRC_3” in the file $base/nemo/NEMOGCM/CONFIG/cfg.txt 
    3. ./makenemo -n ORCA2_LIM3 -m mpiifort_linux -j 4
  7. Go to the Barcelona Supercomputing Center (in Spanish), and in section 9 locate the paragraph, “PREGUNTAS Y RESPUESTAS:” with the path to the ftp server and credentials to log in.
  8. Download the BenchORCA025L75.tar.gz file from the Benchmarks_aceptacion/NEMO/ directory
  9. Extract the contents of the tarball file to $base/nemo/orca-exp
  10. Copy the NEMO binary to the sandbox directory:
    cp $base/nemo/NEMOGCM/CONFIG/ORCA2_LIM3/BLD/bin/nemo.exe $base/nemo/orca-exp
  11. Edit the file $base/nemo/orca-exp/iodef.xml and add the following lines into the “<context id="xios">    <variable_definition>” section:
    <variable id="min_buffer_size" type="int">994473778</variable><variable id="buffer_size" type="int">994473778</variable>
  12. In the file namelist_ref in section “&namrun” set the following variables:
    nn_itend    =  10
    nn_stock    =    10
    nn_write    =    10
  13. Copy the $base/nemo/NEMOGCM/CONFIG/SHARED/namelist_ref file to the $base/nemo/exp-orca directory
  14. Switch off using the IO server in the iodef.xml file (“using_server = false”)
  15. To build the BDW binaries, change “-xMIC-AVX512” back to “-xCORE-AVX2”, change $base to another directory, and do all of the steps again.

Running the GYRE Workload with the Intel Xeon Phi Processor

  1. Go to $base/nemo/gyre-exp
  2. Source the environment variables for the compiler and Intel MPI Library:
    source /opt/intel/compiler/latest/bin/compilervars.sh intel64
    source /opt/intel/impi/latest/bin/compilervars.sh intel64
  3. Add the libraries to LD_LIBRARY_PATH:
    export LD_LIBRARY_PATH=$base/libraries/lib/:$LD_LIBRARY_PATH
  4. Set additional variables for Intel MPI Library:
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_PIN_CELL=core
  5. Run NEMO:
    mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe

Running the ORCA025 Workload with the Intel Xeon Phi Processor

  1. Go to $base/nemo/orca-exp
  2. Source environment variables for the compiler and Intel MPI Library:
    source /opt/intel/compiler/latest/bin/compilervars.sh intel64
    source /opt/intel/impi/latest/bin/compilervars.sh intel64
  3. Add libraries to LD_LIBRARY_PATH:
    export LD_LIBRARY_PATH=$base/libraries/lib/:$LD_LIBRARY_PATH
  4. Set additional variables for the Intel MPI Library:
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_PIN_CELL=core
  5. Run NEMO:
    mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe
  6. If you observe hangs while the application is running, you can run NEMO with the XIOS server in detached mode:
    1. Copy xios_server.exe from $base/xios/bin to $base/nemo/orca-exp
    2. Edit iodef.xml file and set “using_server = true”
    3. mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe : -n 2 ./xios_server.exe

Configuring Test Systems

CPU

Intel® Xeon® processor system: Dual-socket Intel® Xeon® processor E5-2697 v4, 2.3 GHz (turbo OFF), 18 cores/socket, 36 cores, 72 threads (HT on)

Intel® Xeon Phi™ processor system: Intel® Xeon Phi™ processor 7250, 68 cores, 136 threads, 1400 MHz core freq. (turbo OFF), 1700 MHz uncore freq.

RAM

Intel® Xeon® processor system: 128 GB (8 x 16 GB) DDR4 2400 MHz DIMMs

Intel® Xeon Phi™ processor system: 96 GB (6 x 16 GB) DDR4 2400 MHz RDIMMs

Cluster File System

Both systems: Intel® Enterprise Edition for Lustre* software (Intel® EE for Lustre* software), SSD, 136 TB storage

Interconnect

Both systems: Intel® Omni-Path Architecture (Intel® OPA) Si 100 series

OS / Kernel / IB stack

Both systems: Oracle Linux* server release 7.2; kernel 3.10.0-229.20.1.el6.x86_64.knl2; OFED version 10.2.0.0.158_72

  • NEMO configuration: V3.6 r6939 with XIOS 1.0 r703, Intel® Parallel Studio XE 17.0.0.098, Intel MPI Library 2017 for Linux*
  • MPI configuration:
    • I_MPI_FABRICS=shm:tmi
    • I_MPI_PIN_CELL=core

Performance Results for the Intel Xeon Processor and Intel Xeon Phi Processor

    1. Time of second step for GYRE workload:

# nodes    Intel® Xeon® processor    Intel® Xeon Phi™ processor
1          6.546229                  3.642156
2          3.011352                  2.075075
4          1.326501                  0.997129
8          0.640632                  0.492369
16         0.321378                  0.284348

    2. Time of second step for ORCA workload:

# nodes    Intel® Xeon® processor    Intel® Xeon Phi™ processor
2          5.764083                  n/a
4          2.642725                  2.156876
8          1.305238                  1.0546
16         0.67725                   0.643372

Demo: Software Defined Visualization Using Intel® Xeon Phi™ Processor


In this demo we showcase the use of the Intel® Xeon Phi™ processor to do a 3D visualization of a tumor in a human brain. This can help advance research in the medical field by enabling precise detection and removal of structures such as a brain tumor.

More information

The tool used for visualization is Paraview, with OSPRay as the rendering library.

Pre-requisites

Intel® Xeon Phi™ processor system with CentOS 7.2 Linux* (internet enabled)

Open a terminal in your work-area directory and follow the steps below:

  1. Create directory for the demo

    mkdir Intel_brain_demo

  2. Change directory

    cd Intel_brain_demo

  3. Create two directories under this

    mkdir paraview
    mkdir ospray

  4. Access the files from Dropbox:

    https://www.dropbox.com/s/wj0qp1clxv5xssv/SC_2016_BrainDemo.tar.gz?dl=0

  5. Move the Paraview and OSPRay tar files into the respective directories you created in the steps above:

    mv SC_2016_BrainDemo/paraview_sc_demo.tgz paraview/
    mv SC_2016_BrainDemo/ospray.tgz ospray/

  6. Untar each of the *.tgz files in its respective directory:

    tar -xzvf *.tgz

  7. Set the library path:

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<…../Intel_brain_demo/ospray/install/lib64>

  8. Optional step: set the Qt graphics system variable, only if Paraview doesn't load normally

    export QT_GRAPHICSSYSTEM=gtk

  9. Change directory to paraview/install where the binaries are

    cd paraview/install

  10. Run Paraview

    ./bin/paraview

  11. Once Paraview loads

    Select File/Load State

  12. Then load the brain_demo.pvsm state file from the SC_2016_BrainDemo archive that you downloaded in the step above

  13. It will then ask you to load VTK files; click the “...” button to select the appropriate *tumor1.vtk file, then the *tumor2.vtk file, and then the *Tumor1.vtk file, in order, on your local machine. Then click OK.

  14. An Output Messages pop-up window will appear with warnings. Ignore the warnings and click Close, and you should see something like the following:

  15. Now you can go to File/Save State and save this state. Every time you load, you can load this state file to skip the previous step of having to locate the data files.
  16. Then, on the Properties tab on the left side, enable OSPRay for every view (all the RenderViews 1/2/3) by selecting each view and clicking Enable OSPRay.

  17. Once you do that, you should see that the images for all three views look as shown below:

  18. You can also rotate the views and see how they look.

A few issues and how to resolve them

Missing OpenGL: install Mesa for OpenGL

sudo yum -y install mesa-libGL
sudo yum -y install mesa-libGL-devel

libQtGui.so.4 error: install the qt-x11 package

yum -y install qt-x11

Acknowledgements

Special thanks to Carson Brownlee and James Jeffers from Intel Corporation for all their contributions and support. Without their efforts, it wouldn’t have been possible to get this demo running.

References

  1. http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html
  2. https://software.intel.com/en-us/blogs/Intel-Parallel-Studio-XE-2016
  3. https://gitlab.kitware.com/carson/paraview
  4. https://gitlab.kitware.com/carson/vtk
  5. http://www.ospray.org
  6. http://www.ospray.org/getting_ospray.html
  7. http://dap.xeonphi.com
  8. https://ispc.github.io/downloads.html
  9. https://www.threadingbuildingblocks.org
  10. https://en.wikipedia.org/wiki/Software_rendering

Using MPI-3 Shared Memory in Intel® Xeon Phi™ Processors


This whitepaper introduces the MPI-3 shared memory feature, the corresponding APIs, and a sample program to illustrate the use of MPI-3 shared memory in the Intel® Xeon Phi™ processor.

Introduction to MPI-3 Shared Memory

MPI-3 shared memory is a feature introduced in version 3.0 of the message passing interface (MPI) standard. It is implemented in Intel® MPI Library version 5.0.2 and beyond. MPI-3 shared memory allows multiple MPI processes to allocate and have access to the shared memory in a compute node. For applications that require multiple MPI processes to exchange huge local data, this feature reduces the memory footprint and can improve performance significantly.

In the MPI standard, each MPI process has its own address space. With MPI-3 shared memory, each MPI process exposes its own memory to other processes. The following figure illustrates the concept of shared memory: Each MPI process allocates and maintains its own local memory, and exposes a portion of its memory to the shared memory region. All processes then can have access to the shared memory region. Using the shared memory feature, users can reduce the data exchange among the processes.

Figure 1

By default, the memory created by an MPI process is private. It is best to use MPI-3 shared memory when only memory needs to be shared and all other resources remain private. As each process has access to the shared memory region, users need to pay attention to process synchronization when using shared memory.

Sample Code

In this section, sample code is provided to illustrate the use of MPI-3 shared memory.

A total of eight MPI processes are created on the node. Each process maintains a long array of 32 million elements. For each element j in the array, the process updates this element value based on its current value and the values of the element j in the corresponding arrays of its two nearest-neighbor processes, and the same procedure is applied to the whole array. The following pseudo-code shows the processing when the program runs with eight MPI processes for 64 iterations:

Repeat the following procedure 64 times:
    for each MPI process n from 0 to 7:
        for each element j in the array An:
            An[j] ← 0.5*An[j] + 0.25*Aprevious[j] + 0.25*Anext[j]

where An is the long array belonging to the process n, and An[j] is the value of the element j in the array belonging to the process n. In this program, since each process exposes its local memory, all processes can have access to all arrays, although each process just needs the two neighbor arrays (for example, process 0 needs data from processes 1 and 7, process 1 needs data from processes 0 and 2,…).

Figure 2

Besides the basic APIs used for MPI programming, the following MPI-3 shared memory APIs are introduced in this example (a minimal usage sketch follows the list):

  • MPI_Comm_split_type: Used to create a new communicator where all processes share a common property. In this case, we pass MPI_COMM_TYPE_SHARED as an argument in order to create a shared memory from a parent communicator such as MPI_COMM_WORLD, and decompose the communicator into a shared memory communicator shmcomm.
  • MPI_Win_allocate_shared: Used to create a shared memory that is accessible by all processes in the shared memory communicator. Each process exposes its local memory to all other processes, and the size of the local memory allocated by each process can be different. By default, the total shared memory is allocated contiguously. The user can pass an info hint “alloc_shared_noncontig” to specify that the shared memory does not have to be contiguous, which can cause performance improvement, depending on the underlying hardware architecture. 
  • MPI_Win_free: Used to release the memory.
  • MPI_Win_shared_query: Used to query the address of the shared memory of an MPI process.
  • MPI_Win_lock_all and MPI_Win_unlock_all: Used to start an access epoch to all processes in the window. Only shared epochs are needed. The calling process can access the shared memory on all processes.
  • MPI_Win_sync: Used to ensure the completion of copying the local memory to the shared memory.
  • MPI_Barrier: Used to block the caller process on the node until all processes reach a barrier. The barrier synchronization API works across all processes.
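
To make the API list above concrete, here is a minimal, hedged sketch (not the article's full sample program, which is available for download in the Appendix): each rank allocates one slice of an MPI-3 shared-memory window, publishes a value, and reads its left neighbor's slice directly. The slice size N_ELEMS and the neighbor choice are illustrative assumptions.

#include <mpi.h>
#include <stdio.h>

#define N_ELEMS (1 << 20)    /* illustrative per-rank slice size (floats) */

int main( int argc, char **argv )
{
    MPI_Init( &argc, &argv );

    /* Decompose MPI_COMM_WORLD into a communicator of ranks that can share memory. */
    MPI_Comm shmcomm;
    MPI_Comm_split_type( MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &shmcomm );

    int rank, size;
    MPI_Comm_rank( shmcomm, &rank );
    MPI_Comm_size( shmcomm, &size );

    /* Each rank contributes N_ELEMS floats to one shared-memory window. */
    float *my_base;
    MPI_Win win;
    MPI_Win_allocate_shared( N_ELEMS * sizeof( float ), sizeof( float ),
                             MPI_INFO_NULL, shmcomm, &my_base, &win );

    /* Query the base address of the left neighbor's slice. */
    int left = ( rank + size - 1 ) % size;
    MPI_Aint left_size;
    int left_disp;
    float *left_base;
    MPI_Win_shared_query( win, left, &left_size, &left_disp, &left_base );

    /* Passive-target access epoch over all ranks; only shared epochs are needed. */
    MPI_Win_lock_all( MPI_MODE_NOCHECK, win );

    for( long j = 0; j < N_ELEMS; j++ )
        my_base[j] = ( float )rank;       /* fill my slice */
    MPI_Win_sync( win );                  /* make my stores visible */
    MPI_Barrier( shmcomm );               /* wait until every rank has written */
    MPI_Win_sync( win );                  /* see the other ranks' stores */

    printf( "rank %d read %.1f from rank %d\n", rank, left_base[0], left );

    MPI_Win_unlock_all( win );
    MPI_Win_free( &win );
    MPI_Finalize();
    return 0;
}

Compiled with mpiicc and run with mpirun -n 8, each rank should report the value written by its left neighbor.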

Basic Performance Tuning for Intel® Xeon Phi™ Processor

This test is run on an Intel Xeon Phi processor 7250 at 1.40 GHz with 68 cores, installed with Red Hat Enterprise Linux* 7.2 and Intel® Xeon Phi™ Processor Software 1.5.1, and Intel® Parallel Studio 2017 update 2. By default, the Intel compiler will try to vectorize the code, and each MPI process has a single thread of execution. OpenMP* pragma is added at loop level for later use. To compile the code, run the following command line to generate the binary mpishared.out:

$ mpiicc mpishared.c -qopenmp -o mpishared.out
$ mpirun -n 8 ./mpishared.out
Elapsed time in msec: 5699 (after 64 iterations)

To explore the thread parallelism, run four threads per core, and re-compile with -xMIC-AVX512 to take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions:

$ mpiicc mpishared.c -qopenmp -xMIC-AVX512 -o mpishared.out
$ export OMP_NUM_THREADS=4
$ mpirun -n 8 ./mpishared.out
Elapsed time in msec: 4535 (after 64 iterations)

As the MCDRAM in this system is currently configured as 'Flat', the Intel Xeon Phi processor appears as two NUMA nodes. Node 0 contains all CPUs and the on-platform DDR4 memory, while node 1 has the on-package MCDRAM memory:

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 98200 MB
node 0 free: 92775 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15925 MB
node distances:
node   0   1
  0:  10  31
  1:  31  10

To allocate the memory in the MCDRAM (node 1), pass the argument -m 1 to the command numactl as follows:

$ numactl -m 1 mpirun -n 8 ./mpishared.out
Elapsed time in msec: 3070 (after 64 iterations)

This simple optimization technique greatly improves performance.

Summary

This whitepaper introduced the MPI-3 shared memory feature, followed by sample code that used the MPI-3 shared memory APIs. The pseudo-code explained what the program is doing, along with an explanation of the shared memory APIs. The program ran on an Intel Xeon Phi processor, and it was further optimized with simple techniques.

Reference

  1. MPI Forum, MPI 3.0
  2. Message Passing Interface Forum, MPI: A Message-Passing Interface Standard Version 3.0
  3. The MIT Press, Using Advanced MPI
  4. James Reinders, Jim Jeffers, Publisher: Morgan Kaufmann, Chapter 16 - MPI-3 Shared Memory Programming Introduction, High Performance Parallelism Pearls Volume Two

Appendix

The code of the sample MPI program is available for download.

Benefits of Intel® Optimized Caffe* in comparison with BVLC Caffe*


Overview

 This article introduces Berkeley Vision and Learning Center (BVLC) Caffe* and  a custom version of Caffe*, Intel® Optimized Caffe*. We explain why and how Intel® Optimized Caffe* performs efficiently on Intel® Architecture via Intel® VTune™ Amplifier and the time profiling option of Caffe* itself.

 

Introduction to BVLC Caffe* and Intel® Optimized Caffe*

Caffe* is a well-known and widely used machine-vision-based deep learning framework developed by the Berkeley Vision and Learning Center (BVLC). It is an open-source framework that is still actively evolving. It allows users to control a variety of options, such as the BLAS library, CPU- or GPU-focused computation, CUDA, OpenCV, MATLAB, and Python, before building Caffe* through 'Makefile.config'. You can easily change the options in the configuration file, and BVLC provides intuitive instructions for developers on the project web page.

Intel® Optimized Caffe* is an Intel-distributed, customized Caffe* version for Intel architectures. It offers all the goodness of mainline Caffe* with the addition of functionality optimized for Intel architectures and multi-node distributed training and scoring. Intel® Optimized Caffe* makes it possible to utilize CPU resources more efficiently.

To see in detail how Intel® Optimized Caffe* has been changed in order to optimize it for Intel architectures, please refer to this page: https://software.intel.com/en-us/articles/caffe-optimized-for-intel-architecture-applying-modern-code-techniques

In this article, we will first profile the performance of BVLC Caffe* with the Cifar 10 example, and then profile the performance of Intel® Optimized Caffe* with the same example. Performance profiling is conducted through two different methods.

Tested platform: Intel® Xeon Phi™ processor 7210 (1.3 GHz, 64 cores) with 96 GB RAM, CentOS 7.2

1. Caffe* provides its own timing option, for example:

./build/tools/caffe time \
    --model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
    -iterations 1000

2. Intel® VTune™ Amplifier :  Intel® VTune™ Amplifier is a powerful profiling tool that provides advanced CPU profiling features with a modern analysis interface.  https://software.intel.com/en-us/intel-vtune-amplifier-xe

 

 

How to Install BVLC Caffe*

Please refer to the BVLC Caffe* project web page for installation instructions: http://caffe.berkeleyvision.org/installation.html

If you have Intel® MKL installed on your system, it is better to use MKL as the BLAS library.

In your Makefile.config, choose BLAS := mkl and specify the MKL path. (The default setting is BLAS := atlas.)

In our test, we kept all configuration options at their defaults except the CPU-only option.

 

Test example

In this article, we will use 'Cifar 10' example included in Caffe* package as default. 

You can refer to the BVLC Caffe* project page for detailed information about this example: http://caffe.berkeleyvision.org/gathered/examples/cifar10.html

You can simply run the Cifar 10 training example as follows:

cd $CAFFE_ROOT
./data/cifar10/get_cifar10.sh
./examples/cifar10/create_cifar10.sh
./examples/cifar10/train_full_sigmoid_bn.sh

First, we will try Caffe's own benchmark method to obtain performance results, as follows:

./build/tools/caffe time \
    --model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
    -iterations 1000

As a result, we get the layer-by-layer forward and backward propagation times. The command above measures the time of each forward and backward pass over a batch of images. At the end, it shows the average execution time per iteration for 1,000 iterations, per layer and for the entire calculation.

This test was run on an Intel® Xeon Phi™ processor 7210 (1.3 GHz, 64 cores) with 96 GB of DDR4 RAM, installed with CentOS 7.2.

The numbers in the above results will be compared later with the results of Intel® Optimized Caffe*.

Before that, let's also take a look at the VTune™ results to observe the behavior of Caffe* in detail.

 

VTune Profiling

Intel® VTune™ Amplifier is a modern processor performance profiler that is capable of analyzing top hotspots quickly and helping you tune your target application. You can find the details of Intel® VTune™ Amplifier at the following link:

Intel® VTune™ Amplifier : https://software.intel.com/en-us/intel-vtune-amplifier-xe

We used Intel® VTune™ Amplifier in this article to find the functions with the highest total CPU utilization time, and to observe how the OpenMP threads are working.

 

VTune result analysis

 

What we can see here are the functions listed on the left side of the screen that are taking most of the CPU time. They are called 'hotspots' and are candidate functions for performance optimization.

In this case, we will focus on the 'caffe::im2col_cpu<float>' function as an optimization candidate.

'im2col_cpu<float>' is one of the steps in performing direct convolution as a GEMM operation in order to use highly optimized BLAS libraries. This function consumed the most CPU time in our test of training the Cifar 10 model using BVLC Caffe*.

Let's take a look at the threading behavior of this function. In VTune™, you can choose a function and filter other workloads out to observe only the workloads of the specified function.

In the above result, we can see that the CPI (cycles per instruction) of the function is 0.907, and that the function utilizes only a single thread for the entire calculation.

VTune provides one more intuitive piece of data here.

This 'CPU Usage Histogram' shows the number of CPUs that were running simultaneously. The number of CPUs the training process utilized appears to be about 25. The platform has 64 physical cores with Intel® Hyper-Threading Technology, so it has 256 logical CPUs. The CPU usage histogram here might imply that the process is not efficiently threaded.

However, we cannot simply judge these results as 'bad', because we did not set any performance standard or target against which to classify them. We will compare these results with the results of Intel® Optimized Caffe* later.

 

Let's move on to Intel® Optimized Caffe* now.

 

How to Install Intel® Optimized Caffe*

The basic installation procedure of Intel® Optimized Caffe* is the same as for BVLC Caffe*.

When cloning Intel® Optimized Caffe* from Git, use this repository:

git clone https://github.com/intel/caffe

 

Additionally, it is required to install Intel® MKL to bring out the best performance of Intel® Optimized Caffe*.

Please download and install Intel® MKL. Intel offers MKL for free without technical support, or for a license fee to get one-on-one private support. The default BLAS library of Intel® Optimized Caffe* is set to MKL.

 Intel® MKL : https://software.intel.com/en-us/intel-mkl

After downloading Intel® Optimized Caffe* and installing MKL, in your Makefile.config, make sure you choose MKL as your BLAS library and point BLAS_INCLUDE and BLAS_LIB to the MKL include and lib folders:

BLAS := mkl

BLAS_INCLUDE := /opt/intel/mkl/include
BLAS_LIB := /opt/intel/mkl/lib/intel64

 

If you encounter a 'libstdc++'-related error during the compilation of Intel® Optimized Caffe*, install 'libstdc++-static'. For example:

sudo yum install libstdc++-static

 

 

 

Optimization factors and tunes

Before we run and test the performance of examples, there are some options we need to change or adjust to optimize performance.

  • Use 'mkl' as the BLAS library: Specify 'BLAS := mkl' in Makefile.config and also configure the locations of your MKL include and lib directories.
  • Set CPU utilization limit : 
    echo "100" | sudo tee /sys/devices/system/cpu/intel_pstate/min_perf_pct
    echo "0" | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
  • Put 'engine:"MKL2017"' at the top of your train_val.prototxt or solver.prototxt file or use this option with caffe tool : -engine "MKL2017"
  • The current implementation uses OpenMP threads. By default, the number of OpenMP threads is set to the number of CPU cores. Each thread is bound to a single core to achieve the best performance results. It is, however, possible to provide your own configuration through OpenMP environment variables such as KMP_AFFINITY, OMP_NUM_THREADS, or GOMP_CPU_AFFINITY. For the example run below, 'OMP_NUM_THREADS = 64' has been used.
  • Intel® Optimized Caffe* has changed many parts of the original BVLC Caffe* code to achieve better code parallelization with OpenMP*. Depending on other processes running in the background, it is often useful to adjust the number of threads utilized by OpenMP*. For the Intel Xeon Phi™ product family on a single node, we recommend using OMP_NUM_THREADS = number_of_cores - 2.
  • Please also refer here : Intel Recommendation to Achieve the best performance 

If you observe too much overhead because of too-frequent thread migration by the OS, you can try adjusting the OpenMP* affinity environment variable:

KMP_AFFINITY=compact,granularity=fine

 

Test example

 For Intel® Optimized Caffe* we run the same example to compare the results with the previous results. 

cd $CAFFE_ROOT
./data/cifar10/get_cifar10.sh
./examples/cifar10/create_cifar10.sh
./build/tools/caffe time \
    --model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
    -iterations 1000

 

Comparison

The results of the above example are as follows.

Again, the platform used for the test is: Intel® Xeon Phi™ processor 7210 (1.3 GHz, 64 cores) with 96 GB RAM, CentOS 7.2.

First, let's look at the BVLC Caffe* and Intel® Optimized Caffe* results together.

To make it easy to compare, please see the table below. The duration each layer took is listed in milliseconds, and in the 5th column we state how many times faster Intel® Optimized Caffe* is than BVLC Caffe* at each layer. You can observe significant performance improvements, except for the bn layers. Bn stands for "batch normalization", which requires fairly simple calculations with small optimization potential. Bn forward layers show better results, while bn backward layers show 2~3% slower results than the original; the worse performance can occur here as a result of threading overhead. Overall, Intel® Optimized Caffe* achieved about 28 times faster performance in this case.

Layer       Direction          BVLC (ms)    Intel (ms)   Performance Benefit (x)
conv1       Forward            40.2966      1.65063      24.413
conv1       Backward           54.5911      2.24787      24.286
pool1       Forward            162.288      1.97146      82.319
pool1       Backward           21.7133      0.459767     47.227
bn1         Forward            1.60717      0.812487     1.978
bn1         Backward           1.22236      1.24449      0.982
Sigmoid1    Forward            132.515      2.24764      58.957
Sigmoid1    Backward           17.9085      0.262797     68.146
conv2       Forward            125.811      3.8915       32.330
conv2       Backward           239.459      8.45695      28.315
bn2         Forward            1.58582      0.854936     1.855
bn2         Backward           1.2253       1.25895      0.973
Sigmoid2    Forward            132.443      2.2247       59.533
Sigmoid2    Backward           17.9186      0.234701     76.347
pool2       Forward            17.2868      0.38456      44.952
pool2       Backward           27.0168      0.661755     40.826
conv3       Forward            40.6405      1.74722      23.260
conv3       Backward           79.0186      4.95822      15.937
bn3         Forward            0.918853     0.779927     1.178
bn3         Backward           1.18006      1.18185      0.998
Sigmoid3    Forward            66.2918      1.1543       57.430
Sigmoid3    Backward           8.98023      0.121766     73.750
pool3       Forward            12.5598      0.220369     56.994
pool3       Backward           17.3557      0.333837     51.989
ipl         Forward            0.301847     0.186466     1.619
ipl         Backward           0.301837     0.184209     1.639
loss        Forward            0.802242     0.641221     1.251
loss        Backward           0.013722     0.013825     0.993
Ave.        Forward            735.534      21.6799      33.927
Ave.        Backward           488.049      21.7214      22.469
Ave.        Forward-Backward   1223.86      43.636       28.047
Total                          1223860      43636        28.047

 

Some of many reasons this optimization was possible are :

  • Code vectorization for SIMD 
  • Finding hotspot functions and reducing function complexity and the amount of calculations
  • CPU / system specific optimizations
  • Reducing thread movements
  • Efficient OpenMP* utilization

 

Additionally, let's compare the VTune results of this example between BVLC Caffe and Intel® Optimized Caffe*. 

We will simply look at how efficiently the im2col_cpu function has been utilized.

BVLC Caffe*'s im2col_cpu function had a CPI of 0.907 and was single-threaded.

In the case of Intel® Optimized Caffe*, im2col_cpu has a CPI of 2.747 and is multi-threaded by OpenMP workers.

The CPI rate increased here because of vectorization, which brings a higher CPI rate due to the longer latency of each instruction, and because of multi-threading, which can introduce spinning while waiting for other threads to finish their jobs. However, in this example, the benefits from vectorization and multi-threading exceed the latency and overhead, and bring performance improvements after all.

VTune suggests that a CPI rate close to 2.0 is theoretically ideal, and in our case we achieved about the right CPI for the function. The training workload for the Cifar 10 example handles 32 x 32-pixel images in each iteration, so when the workload is split across many threads, each thread gets a very small task, which may cause transition overhead for multi-threading. With larger images we would see lower spinning time and a smaller CPI rate.

The CPU Usage Histogram for the whole process also shows better threading results in this case.

 

 

 

Useful links

BVLC Caffe* Project : http://caffe.berkeleyvision.org/ 
 
Intel® Optimized Caffe* Git : https://github.com/intel/caffe
Intel® Optimized Caffe* Recommendations for the best performance : https://github.com/intel/caffe/wiki/Recommendations-to-achieve-best-performance 
 

 

Summary

Intel® Optimized Caffe* is a customized Caffe* version for Intel Architectures with modern code techniques.

In Intel® Optimized Caffe*, Intel leverages optimization tools and Intel® performance libraries, performs scalar and serial optimizations, and implements vectorization and parallelization.

 

 

Performance of Classic Matrix Multiplication Algorithm on Intel® Xeon Phi™ Processor System


Contents

Introduction
An Overview of the Classic Matrix Multiplication Algorithm
Total Number of Floating Point Operations
Implementation Complexity
Optimization Techniques
Memory Allocation Schemes
Loop Processing Schemes
Compute Schemes
Error Analysis
Performance on Intel® Xeon Phi™ Processor System
OpenMP* Product Thread Affinity Control
Recommended Intel® C++ Compiler Command-Line Options
Conclusion
References
Downloads
Abbreviations
Appendix A - Technical Specifications of Intel Xeon Phi Processor System
Appendix B - Comparison of Processing Times for MMAs vs. MTA
Appendix C - Error Analysis (Absolute Errors for SP FP Data Type)
Appendix D - Performance of MMAs for Different MASs
About the Author

Introduction

Matrix multiplication (MM) of two matrices is one of the most fundamental operations in linear algebra. The algorithm for MM is very simple, it could be easily implemented in any programming language, and its performance significantly improves when different optimization techniques are applied.

Several versions of the classic matrix multiplication algorithm (CMMA) to compute a product of square dense matrices are evaluated in four test programs. Performance of these CMMAs is compared to a highly optimized 'cblas_sgemm' function of the Intel® Math Kernel Library (Intel® MKL)7. Tests are completed on a computer system with Intel® Xeon Phi™ processor 72105 running the Linux Red Hat* operating system in 'All2All' Cluster mode and for 'Flat', 'Hybrid 50-50', and 'Cache' MCDRAM modes.

All versions of CMMAs for single and double precision floating point data types described in the article are implemented in the C programming language and compiled with Intel® C++ Compiler versions 17 and 16 for Linux*6.

The article targets experienced C/C++ software engineers and can be considered as a reference on application optimization techniques, analysis of performance, and accuracy of computations related to MMAs.

If needed, the reader may review the contents of References1 or2 for a description of mathematical fundamentals of MM, because theoretical topics related to MM are not covered in this article.

An Overview of the Classic Matrix Multiplication Algorithm

A fundamental property of any algorithm is its asymptotic complexity (AC)3.

In generic form, AC for MMA can be expressed as follows:

MMA AC = O(N^Omega)

where O stands for operation on a data element, also known in computer science as a Big O; N is one dimension of the matrix, and omega is a matrix exponent which equals 3.0 for CMMA. That is:

CMMA AC = O(N^3)

In order to compute a product of two square matrices using CMMA, a cubic number of floating point (FP) multiplication operations is required. In other words, the CMMA runs in O(N^3) time.

An omega lower than 3.0 is possible, and it means that an MMA computes a product of two matrices faster because an optimization technique, mathematical or programming, is applied and fewer FP multiplication operations are required to compute the product.

A list of several MMAs with different values of omega is as follows:

Algorithm                        Omega        Note
Francois Le Gall                 2.3728639    (1)
Virginia Vassilevska Williams    2.3728642
Stothers                         2.3740000
Coppersmith-Winograd             2.3760000
Bini                             2.7790000
Pan                              2.7950000
Strassen                         2.8070000    (2)
Strassen-Winograd                2.8070000
Classic                          3.0000000    (3)

Table 1. Algorithms are sorted by omega in ascending order.

Total Number of Floating Point Operations

Let's assume that:

M x N is a dimension of a matrix A, or A[M,N]
N x P is a dimension of a matrix B, or B[N,P]
M x P is a dimension of a matrix C, or C[M,P]

There are three relations between M, N and P:

Relation #1: A[...,N] = B[N,...]
Relation #2: A[M,...] = C[M,...]
Relation #3: B[...,P] = C[...,P]

If one of these three relations is not met, the product of two matrices cannot be computed.

In this article only square matrices of dimension N, where M = N = P, will be considered. Therefore:

A[N,N] is the same as A[M,N]
B[N,N] is the same as B[N,P]
C[N,N] is the same as C[M,P]

The following table shows how many multiplications are needed to compute a product of two square matrices of different Ns for three algorithms from Table 1 with omega = 2.3728639 (1), omega = 2.807 (2) and omega = 3.0 (3).

N         Omega = 2.3728639 (1)    Omega = 2.807 (2)         Omega = 3.0 (3)
128       100,028                  822,126                   2,097,152
256       518,114                  5,753,466                 16,777,216
512       2,683,668                40,264,358                134,217,728
1024      13,900,553               281,781,176               1,073,741,824
2048      72,000,465               1,971,983,042             8,589,934,592
4096      372,939,611              13,800,485,780            68,719,476,736
8192      1,931,709,091            96,579,637,673            549,755,813,888
16384     10,005,641,390           675,891,165,093           4,398,046,511,104
32768     51,826,053,965           4,730,074,351,662         35,184,372,088,832
65536     268,442,548,034          33,102,375,837,652        281,474,976,710,656

Table 2.

For example, to compute a product of two square dense matrices of dimension N equal to 32,768, Francois Le Gall (1) MMA needs ~51,826,053,965 multiplications and Classic (3) MMA needs ~35,184,372,088,832 multiplications.

Imagine the case of the product of two square matrices where N equals 32,768 needs to be computed on a very slow computer system. It means that if the Francois Le Gall MMA completes the processing in one day, then the classic MMA will need ~679 days on the same computer system, or almost two years. This is because the Francois Le Gall MMA needs ~679x fewer multiplications to compute a product:

~35,184,372,088,832 / ~51,826,053,965 = ~678.9

In the case of using a famous Strassen (2) MMA, ~91 days would be needed:

~4,730,074,351,662 / ~51,826,053,965 = ~91.3

In many software benchmarks the performance of an algorithm, or some processing, is measured in floating point operations per second (FLOPS), and not in elapsed time intervals, like days, hours, minutes, or seconds. That is why it is very important to know an exact total number (TN) of FP operations completed to calculate a FLOPS value.

With modern C++ compilers, it is very difficult to estimate an exact TN of FP operations per unit of time completed at run time due to extensive optimizations of generated binary codes. It means that an analysis of binary codes could be required, and this is outside of the scope of this article.

However, an estimate value of the TN of FP operations, multiplications and additions, for CMMA when square matrices are used can be easily calculated. Here are two simple examples:

Example 1: N = 2

	Multiplications	= 8				// 2 * 2 * 2 = 2^3
	Additions	= 4				// 2 * 2 * 1 = 2^2*(2-1)
	TN FP Ops	= 8 + 4 = 12

Example 2: N = 3

	Multiplications	= 27				// 3 * 3 * 3 = 3^3
	Additions	= 18				// 3 * 3 * 2 = 3^2*(3-1)
	TN FP Ops	= 27 + 18 = 45

It is apparent that the TN of FP operations to compute a product of two square matrices can be calculated using a simple formula:

TN FP Ops = (N^3) + ((N^2) * (N-1))

Note: Take into account that in the versions of the MMA used for sparse matrices, no FP operations are performed if the matrix element at position (i,j) is equal to zero.
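
To make the formula concrete, here is a minimal, hedged helper sketch (not one of the article's test programs) that evaluates TN FP Ops and converts an elapsed time into a GFLOPS estimate; the matrix dimension and elapsed time used below are hypothetical values:

#include <stdio.h>

/* TN FP Ops = (N^3) + ((N^2) * (N-1)), computed in double to avoid integer overflow. */
static double cmma_total_fp_ops( double n )
{
	return ( n * n * n ) + ( n * n * ( n - 1.0 ) );
}

int main( void )
{
	double n = 1024.0;            /* hypothetical matrix dimension N */
	double elapsed_sec = 0.25;    /* hypothetical measured processing time */

	double tn = cmma_total_fp_ops( n );
	printf( "N = %.0f, TN FP Ops = %.0f, GFLOPS = %.3f\n", n, tn, tn / elapsed_sec / 1.0e9 );
	return 0;
}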

Implementation Complexity

In the C programming language only four lines of code are needed to implement a core part of the CMMA:

for( i = 0; i < N; i += 1 )
		for( j = 0; j < N; j += 1 )
			for( k = 0; k < N; k += 1 )
				C[i][j] += A[i][k] * B[k][j];

Therefore, CMMA's implementation complexity (IC) could be rated as very simple.

Declarations of all intermediate variables, memory allocations, and initialization of matrices are usually not taken into account.

More complex versions of MMA, like Strassen or Strassen-Winograd, could have several thousands of code lines.

Optimization Techniques

In computer programming, matrices could be represented in memory as 1-D or 2-D data structures.

Here is a static declaration of matrices A, B, and C as 1-D data structures of a single precision (SP) FP data type (float):

	float fA[N*N];
	float fB[N*N];
	float fC[N*N];

and this is what a core part of the CMMA looks like:

	for( i = 0; i < N; i += 1 )
		for( j = 0; j < N; j += 1 )
			for( k = 0; k < N; k += 1 )
				C[N*i+j] += A[N*i+k] * B[N*k+j];

Here is a static declaration of matrices A, B, and C as 2-D data structures of a single precision (SP) FP data type (float):

	float fA[N][N];
	float fB[N][N];
	float fC[N][N];

and this is what the core part of CMMA looks like:

	for( i = 0; i < N; i += 1 )
		for( j = 0; j < N; j += 1 )
			for( k = 0; k < N; k += 1 )
				C[i][j] += A[i][k] * B[k][j];

Many other variants of the core part of CMMA are possible and they will be reviewed.

Memory Allocation Schemes

In the previous section of this article, two examples of a static declaration of matrices A, B, and C were given. In the case of dynamic allocation of memory for matrices, explicit calls to memory allocation functions need to be made. In this case, declarations and allocations of memory can look like the following:

Declaration of matrices A, B, and C as 1-D data structures:

	__attribute__( ( aligned( 64 ) ) ) float *fA;
	__attribute__( ( aligned( 64 ) ) ) float *fB;
	__attribute__( ( aligned( 64 ) ) ) float *fC;

and this is how memory needs to be allocated:

	fA = ( float * )_mm_malloc( N * N * sizeof( float ), 64 );
	fB = ( float * )_mm_malloc( N * N * sizeof( float ), 64 );
	fC = ( float * )_mm_malloc( N * N * sizeof( float ), 64 );

Note: Allocated memory blocks are 64-byte aligned, contiguous, and not fragmented by an operating system memory manager; this improves performance of processing.

Declaration of matrices A, B, and C as 2-D data structures:

	__attribute__( ( aligned( 64 ) ) ) float **fA;
	__attribute__( ( aligned( 64 ) ) ) float **fB;
	__attribute__( ( aligned( 64 ) ) ) float **fC;

and this is how memory needs to be allocated:

	fA = ( float ** )calloc( N, sizeof( float * ) );
	fB = ( float ** )calloc( N, sizeof( float * ) );
	fC = ( float ** )calloc( N, sizeof( float * ) );
	for( i = 0; i < N; i += 1 )
	{
		fA[i] = ( float * )calloc( N, sizeof( float ) );
		fB[i] = ( float * )calloc( N, sizeof( float ) );
		fC[i] = ( float * )calloc( N, sizeof( float ) );
	}

Note: Allocated memory blocks are not contiguous and can be fragmented by an operating system memory manager, and fragmentation can degrade performance of processing.

In the previous examples, a DDR4-type RAM memory was allocated for matrices. However, on an Intel Xeon Phi processor system5 a multichannel DRAM (MCDRAM)-type RAM memory could be allocated as well, using functions from a memkind library11 when MCDRAM mode is configured to 'Flat' or 'Hybrid'. For example, this is how an MCDRAM-type RAM memory can be allocated:

	fA = ( float * )hbw_malloc( N * N * sizeof( float ) );
	fB = ( float * )hbw_malloc( N * N * sizeof( float ) );
	fC = ( float * )hbw_malloc( N * N * sizeof( float ) );

Note: An 'hbw_malloc' function of the memkind library was used instead of an '_mm_malloc' function.
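
The following is a hedged sketch (assuming the memkind library and its 'hbwmalloc.h' header are installed and the program is linked with -lmemkind) of allocating a matrix in MCDRAM with a fallback to DDR4 when high-bandwidth memory is not available; the dimension N below is an illustrative placeholder:

#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>

#define N 1024    /* illustrative matrix dimension */

int main( void )
{
	/* hbw_check_available() returns 0 when high-bandwidth (MCDRAM) memory is usable. */
	int use_hbw = ( hbw_check_available() == 0 );
	size_t bytes = ( size_t )N * N * sizeof( float );

	float *fA = use_hbw ? ( float * )hbw_malloc( bytes )
	                    : ( float * )malloc( bytes );
	if( fA == NULL )
	{
		fprintf( stderr, "Allocation of %zu bytes failed\n", bytes );
		return 1;
	}

	/* ... initialize and use the matrix here ... */

	if( use_hbw )
		hbw_free( fA );
	else
		free( fA );
	return 0;
}

If 64-byte alignment is needed, as with '_mm_malloc' above, 'hbw_posix_memalign' from the same library can be used instead of 'hbw_malloc'.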

On an Intel Xeon Phi processor system, eight variants of memory allocation for matrices A, B, and C are possible:

Matrix A    Matrix B    Matrix C    Note
DDR4        DDR4        DDR4        (1)
DDR4        DDR4        MCDRAM      (2)
DDR4        MCDRAM      DDR4
DDR4        MCDRAM      MCDRAM
MCDRAM      DDR4        DDR4
MCDRAM      DDR4        MCDRAM
MCDRAM      MCDRAM      DDR4
MCDRAM      MCDRAM      MCDRAM

Table 3.

It is recommended to use MCDRAM memory as much as possible because its bandwidth is ~400 GB/s, and it is ~5 times faster than the ~80 GB/s bandwidth of DDR4 memory5.

Here is an example of how 'cblas_sgemm' MMA performs for two memory allocation schemes (MASs) (1) and (2):

	Matrix multiplication C=A*B where matrix A (32768x32768) and matrix B (32768x32768)
	Allocating memory for matrices A, B, C: MAS=DDR4:DDR4:DDR4
	Initializing matrix data
	Matrix multiplication started
	Matrix multiplication completed at 50.918 seconds
	Allocating memory for matrices A, B, C: MAS=DDR4:DDR4:MCDRAM
	Initializing matrix data
	Matrix multiplication started
	Matrix multiplication completed at 47.385 seconds

It is clear that there is a performance improvement of ~7 percent when an MCDRAM memory was allocated for matrix C.

Loop Processing Schemes

A loop processing scheme (LPS) describes what optimization techniques are applied to the 'for' statements of the C language of the core part of CMMA. For example, the following code:

	for( i = 0; i < N; i += 1 )						// loop 1
		for( j = 0; j < N; j += 1 )					// loop 2
			for( k = 0; k < N; k += 1 )				// loop 3
				C[i][j] += A[i][k] * B[k][j];

corresponds to an LPS=1:1:1, and it means that loop counters are incremented by 1.

Table 4 below includes short descriptions of different LPSs:

LPS      Note
1:1:1    Loops not unrolled
1:1:2    3rd loop unrolls to 2-in-1 computations
1:1:4    3rd loop unrolls to 4-in-1 computations
1:1:8    3rd loop unrolls to 8-in-1 computations
1:2:1    2nd loop unrolls to 2-in-1 computations
1:4:1    2nd loop unrolls to 4-in-1 computations
1:8:1    2nd loop unrolls to 8-in-1 computations

Table 4.

For example, the following code corresponds to an LPS=1:1:2, and it means that counters 'i' and 'j' for loops 1 and 2 are incremented by 1, and counter 'k' for loop 3 is incremented by 2:

	for( i = 0; i < N; i += 1 )						// :1
	{
		for( j = 0; j < N; j += 1 )					// :1
		{
			for( k = 0; k < N; k += 2 )				// :2 (unrolled loop)
			{
				C[i][j] += A[i][k  ] * B[k   ][j];
				C[i][j] += A[i][k+1] * B[k+1][j];
			}
		}
	}

Note: A C++ compiler could unroll loops as well if command-line options for unrolling are used. A software engineer should avoid combining such compiler unrolling with source-code unrolling at the same time, because it can prevent vectorization of inner loops and degrade performance of processing.
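
For comparison with the LPS=1:1:2 fragment above, a hedged sketch of an LPS=1:4:1 variant (the 2nd loop unrolled to 4-in-1 computations, assuming N is a multiple of 4) could look like this:

	for( i = 0; i < N; i += 1 )						// :1
	{
		for( j = 0; j < N; j += 4 )					// :4 (unrolled loop)
		{
			for( k = 0; k < N; k += 1 )				// :1
			{
				C[i][j  ] += A[i][k] * B[k][j  ];
				C[i][j+1] += A[i][k] * B[k][j+1];
				C[i][j+2] += A[i][k] * B[k][j+2];
				C[i][j+3] += A[i][k] * B[k][j+3];
			}
		}
	}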

Another optimization technique is the loop interchange optimization technique (LIOT). When the LIOT is used, a core part of CMMA looks as follows:

	for( i = 0; i < N; i += 1 )						// loop 1
		for( k = 0; k < N; k += 1 )					// loop 2
			for( j = 0; j < N; j += 1 )				// loop 3
				C[i][j] += A[i][k] * B[k][j];

It is worth noting that counters 'j' and 'k' for loops 2 and 3 were exchanged.

The loops unrolling and LIOT allow for improved performance of processing because elements of matrices A and B are accessed more efficiently.

Compute Schemes

A compute scheme (CS) describes the computation of final or intermediate values and how elements of matrices are accessed.

In a CMMA an element (i,j) of the matrix C is calculated as follows:

	C[i][j] += A[i][k] * B[k][j]

and its CS is ij:ik:kj.

However, elements of matrix B are accessed in a very inefficient way. That is, the next element of matrix B, which needs to be used in the calculation, is located at a distance of (N * sizeof (datatype)) bytes. For very small matrices this is not critical because they can fit into CPU caches. However, for larger matrices it affects performance of computations, which can be significantly degraded, due to cache misses.

In order to solve that problem and improve performance of computations, a very simple optimization technique is used. If matrix B is transposed, the next element that needs to be used in the calculation will be located at a distance of (sizeof (datatype)) bytes. Thus, access to the elements of matrix B will be similar to the access to the elements of matrix A.

In a transpose-based CMMA, an element (i,j) of the matrix C is calculated as follows:

	C[i][j] += A[i][k] * B[j][k]

and its CS is ij:ik:jk. Here B[j][k] is used instead of B[k][j].

It is very important to use the fastest possible algorithm for the matrix B transposition before processing is started. In Appendix B an example is given on how much time is needed to transpose a square matrix of 32,768 elements, and how much time is needed to compute the product on an Intel Xeon Phi processor system.
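
As an illustration, a hedged sketch of the transposed compute scheme is shown below; it assumes an additional N x N buffer BT (declared like the matrices in the fragments above) that holds the transposed copy of matrix B:

	// Transpose B once (classic transpose), so that BT[j][k] == B[k][j]
	for( i = 0; i < N; i += 1 )
		for( j = 0; j < N; j += 1 )
			BT[j][i] = B[i][j];

	// Multiply using CS=ij:ik:jk; A and BT are both accessed with unit stride
	for( i = 0; i < N; i += 1 )
		for( j = 0; j < N; j += 1 )
			for( k = 0; k < N; k += 1 )
				C[i][j] += A[i][k] * BT[j][k];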

Another optimization technique is the loop blocking optimization technique (LBOT) and it allows the use of smaller subsets of A, B, and C matrices to compute the product. When the LBOT is used, a core part of CMMA looks as follows:

	for( i = 0; i < N; i += BlockSize )
	{
		for( j = 0; j < N; j += BlockSize )
		{
			for( k = 0; k < N; k += BlockSize )
			{
				for( ii = i; ii < ( i+BlockSize ); ii += 1 )
					for( jj = j; jj < ( j+BlockSize ); jj += 1 )
						for( kk = k; kk < ( k+BlockSize ); kk += 1 )
							C[ii][jj] += A[ii][kk] * B[kk][jj];
			}
		}
	}

Note: A detailed description of LBOT can be found at10.

Table 5 shows four examples of CSs:

CS                Note
ij:ik:kj          Default
ij:ik:jk          Transposed
iijj:iikk:kkjj    Default LBOT
iijj:iikk:jjkk    Transposed LBOT

Table 5.

Error Analysis

In any version of MMA many FP operations need to be done in order to compute values of elements of matrix C. Since FP data types SP or DP have limited precision4, rounding errors accumulate very quickly. A common misconception is that rounding errors can occur only in cases where large or very large matrices need to be multiplied. This is not true because, in the case of floating point arithmetic (FPA), a rounding error is also a function of the range of an input value, and it is not a function of the size of input matrices.

However, a very simple optimization technique allows improvement in the accuracy of computations.

If matrices A and B are declared as an SP FP data type, then intermediate values could be stored in a variable of DP FP data type:

	for( i = 0; i < N; i += 1 )
	{
		for( j = 0; j < N; j += 1 )
		{
			double sum = 0.0;
			for( k = 0; k < N; k += 1 )
			{
				sum += ( double )( A[i][k] * B[k][j] );
			}
			C[i][j] = sum;
		}
	}

The accuracy of computations will be improved, but performance of processing can be lower.

An error analysis (EA) is completed using the mmatest4.c test program for different sizes of matrices of SP and DP FP data types (see Table 6 in Appendix C with results).

Performance on the Intel® Xeon Phi™ Processor System

Several versions of the CMMA to compute a product of square dense matrices are evaluated in four test programs. Performance of these CMMAs is compared to a highly optimized 'cblas_sgemm' function of the Intel MKL7. Also see Appendix D for more evaluations.

Figure 1. Performance tests for matrix multiply algorithms on Intel® Xeon Phi™ processor using mmatest1.c with KMP_AFFINITY environment variable set to 'scatter', 'balanced', and 'compact'. A lower bar height means faster processing.

Here are the names of source files with a short description of tests:

mmatest1.c - Performance tests matrix multiply algorithms on an Intel Xeon Phi processor.
mmatest2.c - Performance tests matrix multiply algorithms on an Intel Xeon Phi processor in one MCDRAM mode ('Flat') for DDR4:DDR4:DDR4 and DDR4:DDR4:MCDRAM MASs.
mmatest3.c - Performance tests matrix multiply algorithms on an Intel Xeon Phi processor in three MCDRAM modes ('All2All', 'Flat', and 'Cache') for DDR4:DDR4:DDR4 and MCDRAM:MCDRAM:MCDRAM MASs. Note: In 'Cache' MCDRAM mode, MCDRAM:MCDRAM:MCDRAM MAS cannot be used.
mmatest4.c - Verification matrix multiply algorithms accuracy of computations on an Intel Xeon Phi processor.

OpenMP* Product Thread Affinity Control

OpenMP* product compiler directives can be easily used to parallelize processing and significantly speed up processing. However, it is very important to execute OpenMP threads on different logical CPUs of modern multicore processors in order to utilize their internal resources as best as possible.

In the case of using the Intel C++ compiler and Intel OpenMP run-time libraries, the KMP_AFFINITY environment variable provides flexibility and simplifies that task. Here are three simple examples of using the KMP_AFFINITY environment variable:

	KMP_AFFINITY = scatter
	KMP_AFFINITY = balanced
	KMP_AFFINITY = compact
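
As a minimal, hedged sketch (not the article's mmatest code), the outer loop of the 1-D CMMA can be parallelized with a single OpenMP directive; the number of threads and their placement are then controlled at run time through OMP_NUM_THREADS and KMP_AFFINITY:

#include <omp.h>

// Product of two square N x N matrices stored as 1-D arrays; the outer
// loop is distributed across OpenMP threads.
void cmma_omp( int n, const float *A, const float *B, float *C )
{
	int i, j, k;
	#pragma omp parallel for private( j, k )
	for( i = 0; i < n; i += 1 )
		for( j = 0; j < n; j += 1 )
			for( k = 0; k < n; k += 1 )
				C[n*i+j] += A[n*i+k] * B[n*k+j];
}

For example, the function could be compiled with 'icc -qopenmp' and run with KMP_AFFINITY set to 'scatter' or 'balanced', as recommended above.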

These two screenshots of the Htop* utility12 demonstrate how OpenMP threads are assigned (pinned) to Intel Xeon Phi processor 72105 logical CPUs during processing of an MMA using 64 cores of the processor:

Screenshot 1. KMP_AFFINITY = scatter or balanced. Note: Processing is faster when compared to KMP_AFFINITY = compact.

Screenshot 2. KMP_AFFINITY = compact. Note: Processing is slower when compared to KMP_AFFINITY = scatter or balanced.

Recommended Intel® C++ Compiler Command-Line Options

Here is a list of Intel C++ Compiler command-line options that a software engineer should consider, which can improve performance of processing of CMMAs:

O3
fp-model
parallel
unroll
unroll-aggressive
opt-streaming-stores
opt-mem-layout-trans

Os
openmp
ansi-alias
fma
opt-matmul
opt-block-factor
opt-prefetch

The reader can use 'icpc -help' or 'icc -help' to learn more about these command-line options.

Conclusion

The application of different optimization techniques to the CMMA was reviewed in this article.

Three versions of CMMA to compute a product of square dense matrices were evaluated in four test programs. Performance of these CMMAs was compared to a highly optimized 'cblas_sgemm' function of the Intel MKL7.

Tests were completed on a computer system with an Intel® Xeon Phi processor 72105 running the Linux Red Hat operating system in 'All2All' Cluster mode and for 'Flat', 'Hybrid 50-50', and 'Cache' MCDRAM modes.

It was demonstrated that CMMA could be used for cases when matrices of small sizes, up to 1,024 x 1,024, need to be multiplied.

It was demonstrated that performance of MMAs is higher when MCDRAM-type RAM memory is allocated for matrices with sizes up to 16,384 x 16,384 instead of DDR4-type RAM memory.

Advantages of using CMMA to compute the product of two matrices are as follows:

  • In any programming language, simple to implement to run on CPUs or GPUs9
  • Highly portable source codes when implemented in C, C++, or Java programming languages
  • Simple to integrate with existing software for a wide range of computer platforms
  • Simple to debug and troubleshoot
  • Predictable memory footprint at run time
  • Easy to optimize using parallelization and vectorization techniques
  • Low overheads and very good performance for matrices of sizes ranging from 256 x 256 to 1,024 x 1,024 (see Figures 1 through 5)
  • Very good accuracy of computations for matrices of sizes ranging from 8 x 8 to 2,048 x 2,048 (see Table 6 in Appendix C)

Disadvantages of using CMMA to compute a product of two matrices are as follows:

  • Poor performance for large matrices with sizes greater than 2048 x 2048
  • Poor performance when implemented using high-level programming languages due to processing overheads
  • Reduced accuracy of computations for matrices of sizes ranging from 2,048 x 2,048 to 65,536 x 65,536 (see Table 6 in Appendix C)

References

1. Matrix Multiplication on Mathworld

http://mathworld.wolfram.com/MatrixMultiplication.html

2. Matrix Multiplication on Wikipedia

https://en.wikipedia.org/wiki/Matrix_multiplication

3. Asymptotic Complexity of an Algorithm

https://en.wikipedia.org/wiki/Time_complexity

4. The IEEE 754 Standard for Floating Point Arithmetic

http://standards.ieee.org/

5. Intel® Many Integrated Core Architecture

https://software.intel.com/en-us/xeon-phi/x200-processor
http://ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core
https://software.intel.com/en-us/forums/intel-many-integrated-core

6. Intel® C++ Compiler

https://software.intel.com/en-us/c-compilers
https://software.intel.com/en-us/forums/intel-c-compiler

7. Intel® MKL

https://software.intel.com/en-us/intel-mkl
https://software.intel.com/en-us/intel-mkl/benchmarks
https://software.intel.com/en-us/forums/intel-math-kernel-library

8. Intel® Developer Zone Forums

https://software.intel.com/en-us/forum

9. Optimizing Matrix Multiply for Intel® Processor Graphics Architecture Gen 9

https://software.intel.com/en-us/articles/sgemm-ocl-opt

10. Performance Tools for Software Developers Loop Blocking

https://software.intel.com/en-us/articles/performance-tools-for-software-developers-loop-blocking

11. Memkind library

https://github.com/memkind/memkind

12. Htop* monitoring utility

https://sourceforge.net/projects/htop

Downloads

Performance_CMMA_system.zip

List of all files (sources, test reports, and so on):

Performance_CMMA_system.pdf - Copy of this paper.

mmatest1.c - Performance tests for matrix multiply algorithms on Intel® Xeon Phi processors.

dataset1.txt - Results of tests.

mmatest2.c - Performance tests for matrix multiply algorithms on Intel® Xeon Phi processors for DDR4:DDR4:DDR4 and DDR4:DDR4:MCDRAM MASs.

dataset2.txt - Results of tests.

mmatest3.c - Performance tests for matrix multiply algorithms on Intel® Xeon Phi processors in three MCDRAM modes for DDR4:DDR4:DDR4 and MCDRAM:MCDRAM:MCDRAM MASs.

dataset3.txt - Results of tests.

mmatest4.c - Verification of matrix multiply algorithms accuracy of computations on Intel® Xeon Phi processors.

dataset4.txt - Results of tests.

Note:   Intel C++ Compiler versions used to compile tests:
17.0.1 Update 132 for Linux*
16.0.3 Update 210 for Linux*

Abbreviations

CPU - Central processing unit
GPU - Graphics processing unit
ISA - Instruction set architecture
MIC - Intel® Many Integrated Core Architecture
RAM - Random access memory
DRAM - Dynamic random access memory
MCDRAM - Multichannel DRAM
HBW - High bandwidth memory
DDR4 - Double data rate (generation) 4
SIMD - Single instruction multiple data
SSE - Streaming SIMD extensions
AVX - Advanced vector extensions
FP - Floating point
FPA - Floating point arithmetic4
SP - Single precision4
DP - Double precision4
FLOPS - Floating point operations per second
MM - Matrix multiplication
MMA - Matrix multiplication algorithm
CMMA - Classic matrix multiplication algorithm
MTA - Matrix transpose algorithm
AC - Asymptotic complexity
IC - Implementation complexity
EA - Error analysis
MAS - Memory allocation scheme
LPS - Loop processing scheme
CS - Compute scheme
LIOT - Loop interchange optimization technique
LBOT - Loop blocking optimization technique
ICC - Intel C++ Compiler6
MKL - Math kernel library7
CBLAS - C basic linear algebra subprograms
IDZ - Intel® Developer Zone8
IEEE - Institute of Electrical and Electronics Engineers4
GB - Gigabytes
TN - Total number

Appendix A - Technical Specifications of the Intel® Xeon Phi™ Processor System

Summary of the Intel Xeon Phi processor system used for testing:

Process technology: 14nm
Processor name: Intel Xeon Phi processor 7210
Frequency: 1.30 GHz
Packages (sockets): 1
Cores: 64
Processors (CPUs): 256
Cores per package: 64
Threads per core: 4
On-Package Memory: 16 GB high bandwidth MCDRAM (bandwidth ~400 GB/s)
DDR4 Memory: 96 GB 6 Channel (Bandwidth ~ 80 GB/s)
ISA: Intel® AVX-512 (Vector length 512-bit)

Detailed processor specifications:

http://ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core

Summary of a Linux operating system:

[guest@... ~]$ uname -a

Linux c002-n002 3.10.0-327.13.1.el7.xppsl_1.4.0.3211.x86_64 #1 SMP
Fri Jul 8 11:44:24 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

[guest@... ~]$ cat /proc/version

Linux version 3.10.0-327.13.1.el7.xppsl_1.4.0.3211.x86_64 (qb_user@89829b4f89a5)
(gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)) #1 SMP Fri Jul 8 11:44:24 UTC 2016

Appendix B - Comparison of Processing Times for MMAs versus MTA

Comparison of processing times for Intel MKL 'cblas_sgemm' and CMMA vs. MTA:

[Intel MKL & CMMA]

Matrix A [32768 x 32768] Matrix B [32768 x 32768]
Number of OpenMP threads: 64
MKL - Completed in: 51.2515874 seconds
CMMA - Completed in: 866.5838490 seconds

[MTA]

Matrix size: 32768 x 32768
Transpose Classic - Completed in: 1.730 secs
Transpose Diagonal - Completed in: 1.080 secs
Transpose Eklundh - Completed in: 0.910 secs

Comparing the processing time of the MTA to:
MKL 'cblas_sgemm', the transposition takes ~2.42 percent of the processing time.
CMMA, the transposition takes ~0.14 percent of the processing time.

Appendix C - Error Analysis (Absolute Errors for SP FP Data Type)

N        MMA     Calculated SP Value    Absolute Error
8        MKL     8.000080               0.000000
8        CMMA    8.000080               0.000000
16       MKL     16.000160              0.000000
16       CMMA    16.000160              0.000000
32       MKL     32.000309              -0.000011
32       CMMA    32.000320              0.000000
64       MKL     64.000671              0.000031
64       CMMA    64.000641              0.000001
128      MKL     128.001160             -0.000120
128      CMMA    128.001282             0.000002
256      MKL     256.002319             -0.000241
256      CMMA    256.002563             0.000003
512      MKL     512.004639             -0.000481
512      CMMA    512.005005             -0.000115
1024     MKL     1024.009521            -0.000719
1024     CMMA    1024.009888            -0.000352
2048     MKL     2048.019043            -0.001437
2048     CMMA    2048.021484            0.001004
4096     MKL     4096.038574            -0.002386
4096     CMMA    4096.037109            -0.003851
8192     MKL     8192.074219            -0.007701
8192     CMMA    8192.099609            0.017689
16384    MKL     16384.14648            -0.017356
16384    CMMA    16384.09961            -0.064231
32768    MKL     32768.33594            0.008258
32768    CMMA    32768.10156            -0.226118
65536    MKL     65536.71875            0.063390
65536    CMMA    65536.10156            -0.553798

Table 6.

Appendix D - Performance of MMAs for Different MASs

Figure 2. Performance of Intel® MKL 'cblas_sgemm'. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Flat'. Test program mmatest2.c. A lower bar height means faster processing.

Figure 3. Performance of Intel® MKL 'cblas_sgemm' vs. CMMA. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Flat'. Test program mmatest3.c. A lower bar height means faster processing.

Figure 4. Performance of Intel® MKL 'cblas_sgemm' vs. CMMA. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Hybrid 50-50'. Test program mmatest3.c. A lower bar height means faster processing.

Figure 5. Performance of Intel® MKL 'cblas_sgemm' vs. CMMA. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Cache'. Test program mmatest3.c. A lower bar height means faster processing.
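For reference, the runs behind Figures 2 through 5 combine the KMP_AFFINITY setting with an MCDRAM placement policy. A hedged sketch of a typical launch in 'Flat' mode is shown below; the MCDRAM NUMA node is usually node 1 on a single-socket Intel Xeon Phi processor system but should be confirmed with numactl -H, and in 'Cache' mode no explicit binding is needed because MCDRAM acts as a memory-side cache.

$ export KMP_AFFINITY=scatter
$ numactl -H                      # identify the MCDRAM NUMA node in Flat/Hybrid mode
$ numactl --membind=1 ./mmatest3  # bind allocations to MCDRAM (node 1 assumed)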

About the Author

Sergey Kostrov is a highly experienced C/C++ software engineer and Intel® Black Belt Developer. He is an expert in design and implementation of highly portable C/C++ software for embedded and desktop platforms, scientific algorithms, and high performance computing of big data sets.

Intel Solutions and Technologies for the Evolving Data Center


 

One Stop for Optimizing Your Data Center

From AI to Big Data to HPC: End-to-end Solutions

Whether your data center is data- or compute-intensive and whether it serves cloud, high-performance computing, enterprise, storage, networking, or big data analytics, we have solutions and technologies to make your life easier. 

Explore

 

Data center managers, integrators, and developers can now optimize the entire stack to run faster and more efficiently on Intel® architecture. The Intel® Xeon® and Intel® Xeon Phi™ product family paired with Intel® Solid State Drives and NVMe* storage provide a strong foundation. Intel is committed to a standardized, shared platform for virtualization including SDN/NFV (networking), while providing hardware-based security and manageability for now and in the future.

But Intel is more than a hardware innovator. Regardless of your challenges, Intel provides optimized industry SDKs, libraries, and tuning tools. And these tools are supplemented by expert-provided training plus documentation including code samples, configuration guides, walk-throughs, use cases, and support forums.
 

 

AI: MACHINE LEARNING AND DEEP LEARNING

Intel supports rapid innovation in artificial intelligence focusing on community, tools, and training. Starting with the Intel® Nervana™ AI Academy, this section of the Intel® Software Developer Zone drills down into computational machine learning and deep learning, with extensive Intel-optimized libraries and frameworks along with documentation and tutorials.

The Deep Learning Training Tool Beta helps you easily develop and train deep learning solutions using your own hardware. It can ease your data preparation, as well as design and train models using automated experiments and advanced visualizations.

Tools available include:
BigDL open source distributed library for Apache Spark*
Intel® Distribution for Python*
Deep Learning Webinar

 

MODERN CODE

You’ve no doubt heard of recent hardware innovations of the Intel® Many Integrated Core Architecture (Intel® MIC) including the multilevel extreme parallelism, vectorization and threading of the Intel® Xeon® and Intel® Xeon Phi™ product family. Plus, there are larger caches, new SIMD extensions, new memory and file architectures and hardware enforced security of select data and application code via Intel® Software Guard Extensions (Intel® SGX).

But they all require code and tool changes to get the most from the data center. To address this, Intel provides training and tools to quickly and easily optimize code for new technologies.

Extensive free training on code improvements and parallel programming is available online and by workshops and events.

Tools available include:
Intel® Parallel Studio XE (vectorization advisor and MPI profiling)
Intel® Advisor (vectorization optimization and threading design tool)
Intel® C/C++ Compilers and Intel® Fortran Compilers
Intel® VTune™ Amplifier XE (performance analysis of multiple CPUs and FPUs)
Application Performance Snapshot Tool

 

BIG DATA ANALYTICS

When handling huge volumes of data, Intel can help you provide faster, easier, and more insightful big data analytics using open software platforms, libraries, developer kits, and tools that take advantage of the Intel Xeon and Intel Xeon Phi product family’s extreme parallelism and vectorization. Fully integrated with popular platforms (Apache* Hadoop*, Spark*, R, Matlab*, Java*, and NoSQL), Intel optimizations have been well-tested and benchmarked.

Extensive documentation is available on how real-life developers are using Intel hardware, software, and tools to effectively store, manage, process, and analyze data.

The Intel® Data Analytics Acceleration Library (Intel® DAAL) provides highly-optimized algorithmic building blocks and can be paired with the Intel® Math Kernel Library (Intel® MKL) containing optimized threaded and vectorized functions. In fact, the TAP Analytics Toolkit (TAP ATK) provides both Intel® DAAL and Intel® MKL already integrated with Spark.

 

HIGH-PERFORMANCE STORAGE

Intel is at the cutting edge of Storage not only with Intel® SSDs and NVMe but by working with the open source community to optimize and secure the infrastructure. Training is available at Intel® Storage Builders University.


Major tools available include:
Intel® Intelligent Storage Acceleration Library (Intel® ISA-L)
Storage Performance Development Kit (SPDK)
Intel® QuickAssist Technology
Intel® VTune™ Amplifier
Storage Performance Snapshot
Intel® Cache Acceleration Software (Intel® CAS)

 

SDN/NFV NETWORKING

Besides providing a standardized open platform ideal for SDN/NFV (virtualized networking) and the unique hardware capabilities in Intel’s network controllers, Intel has provided extensive additions to, and testing of, the Data Plane Development Kit (DPDK) and training through Intel® Network Builders University. Check out the thriving community of developers and subscribe to the 'Out of the Box' Network Developers Newsletter.

   

HPC AND CLUSTER

If you run visualization or other massive parallelism applications, you know the advantages of using the Intel Xeon and Intel Xeon Phi product family with MCDRAM and associated NUMA/Memory/Cache Modes, wide vector units and up to 68 cores. While the Intel® Scalable System Framework (Intel® SSF) and Intel® Omni-Path Architecture (Intel® OPA) focus on performance, balance and scalability, Intel is working with research and production HPC and clusters to support integration with all the major stacks as well as developing code and tools to optimize and simplify the work.

The Intel® HPC Orchestrator provides a modular integrated validated stack including the Lustre* parallel file system. It is supplemented by critical tools for cluster optimization:

Intel® Trace Analyzer and Collector which quickly finds MPI bottlenecks
Intel® MPI Library and docs to improve implementation of MPI 3.1 on multiple fabrics
MPI Performance Snapshot to help with performance tuning.
Intel® VTune™ Amplifier XE for performance analysis of multiple CPUs, FPUs and NUMA

 

 

Conclusion

Regardless of your job title and data center activities, Intel helps streamline and optimize your work to gain a competitive edge with end-to-end solutions, from high-performance hardware to new technologies, optimizations, tools and training. See what resources Intel provides to optimize and speed up your development now and remain competitive in the industry.

Explore

Intel® Manycore Platform Software Stack for Intel® Xeon Phi™ Coprocessor x200


Summary of (latest) changes

This article describes the most recent changes that have been made to the Intel® Manycore Platform Software Stack (Intel® MPSS) 4.x. If you've subscribed to get update notifications, you can use this information to quickly determine whether these changes apply to you.

  • May 8, 2017, Intel® MPSS 4.4.0 HotFix 1 released for Linux* and Windows*

‍‍About the Intel® Manycore Platform Software Stack 4.x

The Intel MPSS 4.x is necessary to run the Intel® Xeon Phi™ coprocessor x200. It has been tested to work with specific versions of 64-bit operating systems.

The readme files (referenced in the Downloads section) list the supported operating system versions and have more information on how to build and install the stack.

One important component of Intel MPSS is the Symmetric Communications Interface (SCIF). The SCIF is included in the RPM bundle. SCIF provides a mechanism for inter-node communications within a single platform. A node, for SCIF purposes, is defined as either an Intel® Xeon Phi™ coprocessor or the Intel® Xeon® processor. In particular, the SCIF abstracts the details of communicating over the PCI Express* bus. The SCIF APIs are callable from both user space (uSCIF) and kernel space (kSCIF).
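As an illustration of the user-space SCIF API, the sketch below opens an endpoint on the host and connects to a peer listening on the first coprocessor (node 1). This is a hedged outline rather than a complete program: the port number 2050 and the message payload are arbitrary, and the header scif.h plus the calls scif_open/scif_bind/scif_connect/scif_send are taken from the SCIF user-space API shipped with Intel MPSS; consult the SCIF documentation in the package for the authoritative signatures and a matching listen/accept peer.

#include <stdio.h>
#include <scif.h>

int main(void)
{
    struct scif_portID dst;
    scif_epd_t epd;
    char msg[] = "hello from host";

    dst.node = 1;      /* node 1 = first coprocessor (node 0 is the host) */
    dst.port = 2050;   /* arbitrary port; the peer must listen on it */

    epd = scif_open();                      /* create an endpoint descriptor */
    if (epd == SCIF_OPEN_FAILED) { perror("scif_open"); return 1; }

    scif_bind(epd, 0);                      /* bind to an available local port */
    if (scif_connect(epd, &dst) < 0) {      /* connect to a peer that called scif_listen/scif_accept */
        perror("scif_connect");
        return 1;
    }

    scif_send(epd, msg, sizeof(msg), SCIF_SEND_BLOCK);  /* blocking send across the PCI Express bus */
    scif_close(epd);
    return 0;
}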

Intel MPSS is downloadable from the sources below. Note that these packages include documentation and APIs (for example, the SCIF API).

For Linux systems, users can measure Intel® Xeon Phi™ processor and coprocessor x200 product family performance with a tool called micperf. micperf is designed to incorporate a variety of benchmarks into a simple user experience with a single interface for execution. For the coprocessor, the micperf package is distributed as an RPM file within Intel MPSS. The following table summarizes all the benchmarks that can be run with the micperf tool:

Benchmark | CLI Name | Target Operations | Component | Comments
Intel® Math Kernel Library (Intel® MKL) DGEMM | dgemm | Double-precision floating point | VFU | For the processor, micperf provides a MCDRAM and DDR version
Intel MKL SGEMM | sgemm | Single-precision floating point | VFU | For the processor, micperf provides a MCDRAM and DDR version
Intel MKL SMP Linpack | linpack | Double-precision floating point | VFU |
SHOC Download* | shoc download | Bus transfer host to device | PCIe* bus | Only available for the coprocessor
SHOC Readback* | shoc readback | Bus transfer device to host | PCIe bus | Only available for the coprocessor
STREAM* | stream | Round-trip memory to registers | MCDRAM, GDDR and caches | For the processor, micperf provides a MCDRAM and DDR version
HPLinpack* | hplinpack | Double-precision floating point | VFU | Only available for the processor
HPCG* | hpcg | Double-precision floating point | VFU | Only available for the processor; requires Intel® MPI Library
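As a usage illustration only, a benchmark is selected by the CLI name listed in the table above. The launcher name micprun and its -k option below are assumptions based on the micperf package and should be verified against the micperf documentation and man pages installed with Intel MPSS:

$ micprun -k sgemm    # hypothetical invocation running the Intel MKL SGEMM benchmark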

Note: the Intel MPSS download files for Linux marked “.gz” should end in “.gz” when downloaded; most browsers leave the extension alone, but Windows Explorer* may rename the files. If this affects you, we recommend renaming the file to the proper extension after downloading.

‍‍Getting notified of future updates

If you want to receive updates when we publish a new Intel MPSS 4.x stack, add a comment at the bottom of this page.

‍‍Release support schedule?

The following table shows when releases were issued and when Intel will no longer support them. Releases with a strikethrough are no longer supported. For an overview of Intel's release structure and support length, please see this article.

Downloads

There are currently two major releases available for the Intel MPSS 4.x. The most recent major release is 4.4.x.

We recommend that new adopters start by using the 4.4 release. Support for each Intel MPSS release ends 6 months from the date it was posted, except for long-term support products.

 

Intel MPSS 4.4.0 HotFix 1 release for Linux

Intel® Manycore Platform Software Stack version | Downloads available | Size (range) | MD5 Checksum
mpss-4.4.0 Hotfix 1 (released: May 8, 2017) | RedHat 7.3 | 214MB | 8a015c38379b8be42c8045d3ceb44545
 | RedHat 7.2 | 214MB | 694b7b908c12061543d2982750985d8b
 | SuSE 12.2 | 213MB | 506ab12af774f78fa8e107fd7a4f96fd
 | SuSE 12.1 | 213MB | b8520888954e846e8ac8604d62a9ba96
 | SuSE 12.0 | 213MB | 88a3a4415afae1238453ced7a0df28ea
 | Card installer file (mpss-4.4.0-card.tar) | 761MB | d26e26868297cea5fd4ffafe8d78b66e
 | Source file (mpss-4.4.0-card-source.tar) | 514MB | 127713d06496090821b5bb3613c95b30
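After downloading a package, you can verify its integrity against the MD5 values listed above; the file name below is one entry from the table, so substitute the file you actually downloaded. The printed checksum must match the value in the table:

$ md5sum mpss-4.4.0-card.tar
d26e26868297cea5fd4ffafe8d78b66e  mpss-4.4.0-card.tar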

 

Documentation link | Description | Last Updated On | Size (approx)
releasenotes-linux.txt | Release Notes (English) | May 2017 | 15KB
README.txt | Readme (includes installation instructions) for Linux (English) | May 2017 | 17KB
MPSS_Users_Guide.pdf | MPSS User's guide | May 2017 | 3MB
EULA.txt | End User License Agreement (IMPORTANT: Read Before Downloading, Installing, or Using) | May 2017 | 33KB
   

 

 

 

Intel MPSS 4.4.0 HotFix 1 release for Microsoft Windows

Intel® Manycore Platform Software Stack version | Downloads available | Size | MD5 Checksum
64-bit Install Package (released May 8, 2017) | mpss-4.4.0-windows.zip | 1091MB | 204a65b36858842f472a37c77129eb53

 

Documentation link | Description | Last Updated On | Size
releaseNotes-windows.txt | English - release notes | May 2017 | 7KB
readme-windows.pdf | English - readme for Microsoft* Windows | May 2017 | 399KB
MPSS_Users_Guide-windows.pdf | MPSS User Guide for Windows | May 2017 | 3MB
EULA.txt | End User License Agreement (IMPORTANT: Read Before Downloading, Installing, or Using) | May 2017 | 33KB

 

‍‍Additional documentation

The Intel MPSS packages contain additional documentation for Linux: man pages and documents in /usr/share/doc/ (see myo, intel-coi-* and micperf-* directories). The Platform Control Panel User’s Guide is now in /usr/share/doc/systools/micmgmt/

Also, below is a link to the Intel® MPSS Performance Guide, which documents best-known methods for fine-tuning the Intel MPSS runtime environment to get the best application performance.

http://software.intel.com/sites/default/files/managed/72/db/mpss-performance-guide.pdf‍‍

‍‍Where to ask questions and get more information

The discussion forum at http://software.intel.com/en-us/forums/intel-many-integrated-core is available to join and discuss any enhancements or issues with Intel® MPSS.

Information about Intel MPSS security can be found here. 

You can also find support collaterals here or submit an issue.

Intel® Xeon Phi™ Coprocessor x200 Quick Start Guide


Introduction

This document introduces the basic concept of the Intel® Xeon Phi™ coprocessor x200 product family, tells how to install the coprocessor software stack, discusses the build environment, and points to important documents so that you can write code and run applications.

The Intel Xeon Phi coprocessor x200 is the second generation of the Intel Xeon Phi product family. Unlike the first generation running on an embedded Linux* uOS, this second generation supports the standard Linux kernel. The Intel Xeon Phi coprocessor x200 is designed for installation in a third-generation PCI Express* (PCIe*) slot of an Intel® Xeon® processor host. The following figure shows a typical configuration:

Figure 1

Benefits of the Intel Xeon Phi coprocessor:

  • System flexibility: Build a system that can support a wide range of applications, from serial to highly parallel, while leveraging code optimized for Intel Xeon processors or Intel Xeon Phi processors.
  • Maximize density: Gain significant performance improvements with limited acquisition cost by maximizing system density.
  • Upgrade path: Improve performance by adding to an Intel Xeon processor system or upgrading from the first generation of the Intel Xeon Phi product family with minimum code changes.

For workloads that fit within 16 GB coprocessor memory, adding a coprocessor to a host server allows customers to avoid costly networking. For workloads that have a significant portion of highly parallel phases, offload can offer significant performance with minimal code optimization investment.

Additional Documentation

Basic System Architecture

The Intel Xeon Phi coprocessor x200 is based on a modern Intel® Atom™ microarchitecture with considerable high performance computing (HPC)-focused performance improvements. It has up to 72 cores with four threads per core, giving a total of 288 CPUs as viewed by the operating system, and has up to 16 GB of high-bandwidth on-package MCDRAM memory that provides over 500 GB/s effective bandwidth. The coprocessor has an x16 PCI Express Gen3 interface (8 GT/s) to connect to the host system.

The cores are laid out in units called tiles. Each tile contains a pair of cores, a shared 1 MB L2 cache, and a hub connecting the tile to a mesh interface. Each core contains two 512-bit wide vector processing units. The coprocessor supports the Intel® AVX-512F (foundation), Intel AVX-512CD (conflict detection), Intel AVX-512PF (prefetching), and Intel AVX-512ER (exponential reciprocal) ISA.
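A quick way to confirm that these AVX-512 subsets are exposed by the kernel running on the coprocessor is to inspect the CPU flags on the coprocessor itself (a hedged illustration; the flag names follow the standard Linux /proc/cpuinfo convention, and the output is trimmed to the unique AVX-512 flags):

$ grep -o 'avx512[a-z]*' /proc/cpuinfo | sort -u
avx512cd
avx512er
avx512f
avx512pf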

Figure 2

Intel® Manycore Platform Software Stack

Intel® Manycore Platform Software Stack (Intel® MPSS) is the user and system software that allows programs to run on and communicate with the Intel Xeon Phi coprocessor, which runs a standard Linux kernel. Intel MPSS version 4.x.x is used for the Intel Xeon Phi coprocessor x200 and can be downloaded from here (https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-for-intel-xeon-phi-coprocessor-x200). (Note that the older Intel MPSS version 3.x.x is used for the Intel Xeon Phi coprocessor x100.)

You can download the Intel MPSS stack at https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-for-intel-xeon-phi-coprocessor-x200. The following host operating systems are supported: Red Hat* Enterprise Linux Server, SUSE* Linux Enterprise Server, and Microsoft Windows*. For detailed information on requirements and on installation, please consult the README file for Intel MPSS. The figure below shows a high-level representation of the Intel MPSS. The host software stack is on the left and the coprocessor software stack is on the right.

Figure 3

Install the Software Stack and Start the Coprocessor

Installation Guide for Linux* Host:

  1. From the “Intel Manycore Platform Software Stack for Intel Xeon Phi Coprocessor x200” page (https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-for-intel-xeon-phi-coprocessor-x200), navigate to the latest version of the Intel MPSS release for Linux and download “Readme for Linux (English)” (README.txt). Also download the release notes (releasenotes-linux.txt) and the User’s Guide for Intel MPSS.
  2. Install one of the following supported operating systems in the host:
    • Red Hat Enterprise Linux Server 7.2 64-bit kernel 3.10.0-327
    • Red Hat Enterprise Linux Server 7.3 64-bit kernel 3.10.0-514
    • SUSE Linux Enterprise Server SLES 12 kernel 3.12.28-4-default
    • SUSE Linux Enterprise Server SLES 12 SP1 kernel 3.12.49-11-default
    • SUSE Linux Enterprise Server SLES 12 SP2 kernel 4.4.21-69-default

    Be sure to install ssh, which is used to log in to the card.

    WARNING: On installing Red Hat, it may automatically update you to a new version of the Linux kernel. If this happens, you will not be able to use the prebuilt host driver, but will need to rebuild it manually for the new kernel version. Please see Section 5 in the readme.txt for instructions on building an Intel MPSS host driver for a specific Linux kernel.

  3. Log in as root.
  4. Download the release driver appropriate for the operating system you installed in Step 2 (<mpss-version>-linux.tar), where <mpss-version> is mpss-4.3.3 at the time this document was written.
  5. Install the host driver RPMs as detailed in Section 6 of readme.txt. Don’t skip the creation of configuration files for your coprocessor.
  6. Update the flash on your coprocessor(s) as detailed in Section 8 of readme.txt.
  7. Reboot the system.
  8. Start the Intel Xeon Phi coprocessor (you can set up the card to start with the host system; it will not do so by default), and then run micinfo to verify that it is set up properly:
    # systemctl start mpss
    # micctrl -w
    # /usr/bin/micinfo
    micinfo Utility Log
    Created On Mon Apr 10 12:14:08 2017
    
    System Info:
        Host OS                        : Linux
        OS Version                     : 3.10.0-327.el7.x86_64
        MPSS Version                   : 4.3.2.5151
        Host Physical Memory           : 128529 MB
    
    Device No: 0, Device Name: mic0 [x200]
    
    Version:
        SMC Firmware Version           : 121.27.10198
        Coprocessor OS Version         : 4.1.36-mpss_4.3.2.5151 GNU/Linux
        Device Serial Number           : QSKL64000441
        BIOS Version                   : GVPRCRB8.86B.0012.R02.1701111545
        BIOS Build date                : 01/11/2017
        ME Version                     : 3.2.2.4
    
    Board:
        Vendor ID                      : 0x8086
        Device ID                      : 0x2260
        Subsystem ID                   : 0x7494
        Coprocessor Stepping ID        : 0x01
        UUID                           : A03BAF9B-5690-E611-8D4F-001E67FC19A4
        PCIe Width                     : x16
        PCIe Speed                     : 8.00 GT/s
        PCIe Ext Tag Field             : Disabled
        PCIe No Snoop                  : Enabled
        PCIe Relaxed Ordering          : Enabled
        PCIe Max payload size          : 256 bytes
        PCIe Max read request size     : 128 bytes
        Coprocessor Model              : 0x57
        Coprocessor Type               : 0x00
        Coprocessor Family             : 0x06
        Coprocessor Stepping           : B0
        Board SKU                      : B0 SKU _NA_A
        ECC Mode                       : Enabled
        PCIe Bus Information           : 0000:03:00.0
        Coprocessor SMBus Address      : 0x00000030
        Coprocessor Brand              : Intel(R) Corporation
        Coprocessor Board Type         : 0x0a
        Coprocessor TDP                : 300.00 W
    
    Core:
        Total No. of Active Cores      : 68
        Threads per Core               : 4
        Voltage                        : 900.00 mV
        Frequency                      : 1.20 GHz
    
    Thermal:
        Thermal Dissipation            : Active
        Fan RPM                        : 6000
        Fan PWM                        : 100 %
        Die Temp                       : 38 C
    
    Memory:
        Vendor                         : INTEL
        Size                           : 16384.00 MB
        Technology                     : MCDRAM
        Speed                          : 6.40 GT/s
        Frequency                      : 6.40 GHz
        Voltage                        : Not Available

Installation Guide for Windows* Host:

  1. From the “Intel Manycore Platform Software Stack for Intel Xeon Phi Coprocessor x200” page (https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-for-intel-xeon-phi-coprocessor-x200), navigate to the latest version of the Intel MPSS release for Microsoft Windows. Download “Readme file for Microsoft Windows” (readme-windows.pdf). Also download the “Release notes” (releaseNotes-windows.txt) and the “Intel MPSS User’s Guide” (MPSS_Users_Guide-windows.pdf).
  2. Install one of the following supported operating systems in the host:
    • Microsoft Windows 8.1 (64-bit)
    • Microsoft Windows® 10 (64-bit)
    • Microsoft Windows Server 2012 R2 (64-bit)
    • Microsoft Windows Server 2016 (64-bit)
  3. Log in as “administrator”.
  4. Install .NET Framework* 4.5 or higher on the system (http://www.microsoft.com/net/download), Python* 2.7.5 x86-64 or higher (Python 3.x is not supported), Pywin32 build or higher (https://sourceforge.net/projects/pywin32).
  5. Be sure to install PuTTY* and PuTTYgen*, which are used to log in to the card’s OS.
  6. Follow the preliminary steps as instructed in Section 2.2.1 of the Readme file.
  7. Restart the system.
  8. Download the drivers package mpss-4.*-windows.zip for your Windows operating system from the page described in Step 1.
  9. Unzip the zip file to get the Windows exec files (“mpss-4.*.exe” and “mpss-essentials-4*.exe”).
  10. Install the Windows Installer file “mpss-4.*.exe” as detailed in Section 3.2 of the User’s Guide. Note that if a previous version of the Intel Xeon Phi coprocessor stack is already installed, use Windows Control Panel to uninstall it prior to installing the current version. By default, Intel MPSS is installed in “c:\Program Files\Intel\MPSS”. Also, install “mpss-essentials-4*.exe”, the native binary utilities for the Intel Xeon Phi coprocessor. These are required when using offload programming or cross compilers.
  11. Confirm that the new Intel MPSS stack is successfully installed by looking at Control Panel > Programs > Programs and Features: Intel Xeon Phi (see the following illustrations).

    Figure 4

  12. Update the flash according to Section 2.2.3 of the readme-windows.pdf file.
  13. Reboot the system.
  14. Log in to the host and verify that the Intel Xeon Phi x200 coprocessors are detected by the Device Manager (Control Panel > Hardware > Device Manager, and click “System devices”):

    Figure 5
  15. Start the Intel Xeon Phi coprocessor (you can set up the card to start with the host system; it will not do so by default). Launch a command-prompt window and start the Intel MPSS stack:
        prompt> micctrl --start
  16. Run the command “micinfo” to verify that it is set up properly:
        prompt> micinfo.exe

    Figure 6

Intel® Parallel Studio XE

After starting the Intel MPSS stack, users can write applications running on the coprocessor using Intel Parallel Studio XE.

Intel Parallel Studio XE is a software development suite that helps boost application performance by taking advantage of the ever-increasing processor core count and vector register width available in Intel Xeon processors, Intel Xeon Phi processors and coprocessors, and other compatible processors. Starting with the Intel Parallel Studio 2018 beta, the following Intel® products support program development on the Intel Xeon Phi coprocessor x200:

  • Intel® C Compiler/Intel® C++ Compiler/Intel® Fortran Compiler
  • Intel® Math Kernel Library (Intel® MKL)
  • Intel® Data Analytics Acceleration Library (Intel® DAAL)
  • Intel® Integrated Performance Primitives (Intel® IPP)
  • Intel® Cilk™ Plus
  • Intel® Threading Building Blocks (Intel® TBB)
  • Intel® VTune™ Amplifier XE
  • Intel® Advisor XE
  • Intel® Inspector XE
  • Intel® MPI Library
  • Intel® Trace Analyzer and Collector
  • Intel® Cluster Ready
  • Intel® Cluster Checker

To get started writing programs running on the coprocessor, you can get the code samples at https://software.intel.com/en-us/product-code-samples. The packages “Intel Parallel Studio XE for Linux - Sample Bundle”, and “Intel Parallel Studio XE for Windows - Sample Bundle” contain code samples for Linux and Windows, respectively.

Programming Models on Coprocessor

There are three programming models that can be used for the Intel Xeon Phi coprocessor x200: the offload programming model, the symmetric programming model, and the native programming model.

  • Offload programming: The main application runs on the host and offloads selected, highly parallel portions of the program to the coprocessor(s) to take advantage of the manycore architecture; the serial portion of the program still runs on the host to take advantage of the big-core architecture (see the sketch after this list).
  • Symmetric programming: The coprocessors and the host are treated as separate nodes. This model is suitable for distributed computing.
  • Native programming: The coprocessors are used as independent nodes, just like a host. Users compile the binary for the coprocessor on the host, transfer the binary, and log in to the coprocessor to run it.
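To make the offload model concrete, the fragment below uses the Intel compiler's offload pragma to run a loop on the first coprocessor. This is a minimal sketch: the array names, sizes, and the doubling operation are placeholders, and the in/out clauses must be adapted to real data.

#include <stdio.h>

#define N 1024

int main(void)
{
    float a[N], b[N];
    int i;

    for (i = 0; i < N; i++)
        a[i] = (float)i;

    /* The marked block executes on coprocessor mic:0; a is copied in, b is copied back out. */
    #pragma offload target(mic:0) in(a) out(b)
    {
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            b[i] = 2.0f * a[i];
    }

    printf("b[10] = %f\n", b[10]);
    return 0;
}

Compiled on the host with the Intel compiler (for example, with -qopenmp), the serial part runs on the host and only the marked region is offloaded.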

The figure below summarizes different programming models used for the Intel Xeon Phi coprocessor:

Figure 7

Call for submissions: Intel HPC Developer Conference


Please consider giving a talk, tutorial or presenting a poster at this year's Intel HPC Developer Conference (November 11-12, 2017 - just before SC17 in Denver).

Submissions will be reviewed and responded to in a rolling fashion - so submit soon! (Best to submit by July 20, but okay until August 18.)

Submit online: https://intelhpcdc2017cfa.hubb.me (full information on dates, topics, etc. is on that web site).

The prior Intel HPC Developer Conferences have been very well rated by attendees - and that is due to the high quality of speakers (talks, tutorials, panels, etc.) that we have enjoyed. We are adding poster sessions this year to open up more discussions with attendees.

Submissions of technical talks (30 minutes), tutorials (90, 120, or 180 minutes), and poster sessions are encouraged. Topics include Parallel Programming, AI (ML/HPDA), High Productivity Languages, Visualization (esp. Software Defined Visualization and In Situ Visualization), Enterprise, and Systems.

We expect to have another great conference this year - and we know that rests on the high quality presenters. We look forward to your submissions.  Feel free to drop me a note if you have any questions - or simply put in your proposal online, and put any questions in with your submission (we can talk!).

 

CPUs are set to dominate high end visualization


It is certainly provocative to say that CPUs will dominate any part of visualization - but I say it with confidence that the data supports why this is happening. The primary drivers are (1) data sizes, (2) minimizing data movement, and (3) the ability to change to O(n log n) algorithms. Couple that with the ultra-hot topic of "Software Defined Visualization" that makes these three things possible - and you have a lot to consider about how the world is changing.

Of course, what is "high end" today often becomes commonplace over time... so this trend may affect us all eventually. It's at least worth understanding the elements at play.

At ISC17, in Germany, this week (June 19-21) Intel is demoing (and selling) their vision of a “dream machine” for doing software defined visualization with a special eye towards in situ visualization development. Jim Jeffers, Intel, and friends are demonstrating it at ISC'17 in Germany, and they will be at SIGGRAPH'17 too. The "dream machine" can support visualization of data sets up to 1.5TB in size. They designed it to address the needs of the scientific visualization and professional rendering markets.

Photo credit (above): Asteroid Deep Water Impact Analysis; Data Courtesy: John Patchett, Galen Glisner per Los Alamos National Laboratory tech report LA-UR-17-21595. Visualization: Carson Brownlee, Intel.

With Jim's help, I wrote an article with more information about how CPUs now offer higher performance and lower cost than competing GPU-based solutions for the largest visualization tasks. The full article is posted with coverage at the TechEnablement site.

In the full article, aside from writing about the trend, I provide links to technical papers that show this trend toward CPUs as the preferred solution for visualization of large data (really, really big), as well as links to conferences and links about the "visualization dream machine" (how I describe it, not what Intel calls it officially).

Dream Machine for Software Defined Visualization

Photo: Intel/Colfax Visualization "Dream" Machine

Using Intel® MPI Library on Intel® Xeon Phi™ Product Family


Introduction

The Message Passing Interface (MPI) standard is a message-passing library specification, a collection of routines used in distributed-memory parallel programming. This document is designed to help users get started writing code and running MPI applications using the Intel® MPI Library on a development platform that includes the Intel® Xeon Phi™ processor or coprocessor. The Intel MPI Library is a multi-fabric message passing library that implements the MPI-3.1 specification (see Table 1).

In this document, the Intel MPI Library 2017 and 2018 Beta for Linux* OS are used.

Table 1. Intel® MPI Library at a glance

Processors

Intel® processors, coprocessors, and compatibles

Languages

Natively supports C, C++, and Fortran development

Development Environments

Microsoft Visual Studio* (Windows*), Eclipse*/CDT* (Linux*)

Operating Systems

Linux and Windows

Interconnect Fabric Support

Shared memory
RDMA-capable network fabrics through DAPL* (for example, InfiniBand*, Myrinet*)
Intel® Omni-Path Architecture
Sockets (for example, TCP/IP over Ethernet, Gigabit Ethernet*) and others.

This document summarizes the steps to build and run an MPI application on an Intel® Xeon Phi™ processor x200, on an Intel® Xeon Phi™ coprocessor x200, and on an Intel® Xeon Phi™ coprocessor x100, natively or symmetrically. First, we introduce the Intel Xeon Phi processor x200 product family, the Intel Xeon Phi processor x100 product family, and the MPI programming models.

Intel® Xeon Phi™ Processor Architecture

Intel Xeon Phi processor x200 product family architecture: There are two versions of this product. The processor version is the host processor and the coprocessor version requires an Intel® Xeon® processor host. Both versions share the architecture below (see Figure 1):

  • Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
  • Up to 72 cores with 2D mesh architecture
  • Each core has two 512-bit vector processing units (VPUs) and four hardware threads
  • Each pair of cores (tile) shares 1 MB L2 cache
  • 8 or 16 GB high-bandwidth on package memory (MCDRAM)
  • 6 channels DDR4, up to 384 GB (available in the processor version only)
  • For the coprocessor, the third-generation PCIe* is connected to the host


Figure 1. Intel® Xeon Phi™ processor x200 architecture.

To enable the functionalities of the Intel Xeon Phi processor x200, you need to download and install the Intel Xeon Phi processor software available here.

The Intel Xeon Phi coprocessor x200 attaches to an Intel Xeon processor-based host via a third-generation PCIe interface. The coprocessor runs on a standard Linux OS. It can be used as an extension to the host (so the host can offload the workload) or as an independent compute node. The first step to bring an Intel Xeon Phi coprocessor x200 into service is to install the Intel® Manycore Platform Software Stack (Intel® MPSS) 4.x on the host, which is available here. The Intel MPSS is a collection of software including device drivers, coprocessor management utilities, and the Linux OS for the coprocessor.

Intel Xeon Phi coprocessor x100 architecture: the Intel Xeon Phi coprocessor x100 is the first-generation of the Intel Xeon Phi product family. The coprocessor attaches to an Intel Xeon processor-based host via a second-generation PCIe interface. It runs on an OS separate from the host and has the following architecture (see Figure 2):

  • Intel® Initial Many Core Instructions
  • Up to 61 cores with high-bandwidth, bidirectional ring interconnect architecture
  • Each core has a 512-bit wide VPU and four hardware threads
  • Each core has a private 512-KB L2 cache
  • 16 GB GDDR5 memory
  • The second-generation PCIe is connected to the host


Figure 2. Intel® Xeon Phi™ processor x100 architecture.

To bring the Intel Xeon Phi coprocessor x100 into service, you must install the Intel MPSS 3.x on the host, which can be downloaded here.

MPI Programming Models

The Intel MPI Library supports the following MPI programming models (see Figure 3):

  • Host-only model (Intel Xeon processor or Intel Xeon Phi processor): In this mode, all MPI ranks reside and execute the workload on the host CPU only (or Intel Xeon Phi processor only).
  • Offload model: In this mode, the MPI ranks reside solely on the Intel Xeon processor host. The MPI ranks use offload capabilities of the Intel® C/C++ Compiler or Intel® Fortran Compiler to offload some workloads to the coprocessors. Typically, one MPI rank is used per host, and the MPI rank offloads to the coprocessor(s).
  • Coprocessor-only model: In this native mode, the MPI ranks reside solely inside the coprocessor. The application can be launched from the coprocessor.
  • Symmetric model: In this mode, the MPI ranks reside on the host and the coprocessors. The application can be launched from the host.

Figure 3. MPI programming models.

Using the Intel® MPI Library

This section shows how to build and run an MPI application in the following configurations: on an Intel Xeon Phi processor x200, on a system with one or more Intel Xeon Phi coprocessor x200, and on a system with one or more Intel Xeon Phi coprocessor x100 (see Figure 4).


Figure 4. Different configurations: (a) standalone Intel® Xeon Phi™ processor x200, (b) Intel Xeon Phi coprocessor x200 connected to a system with an Intel® Xeon® processor, and (c) Intel® Xeon Phi™ coprocessor x100 connected to a system with an Intel Xeon processor.

Installing the Intel® MPI Library

The Intel MPI Library is packaged as a standalone product or as a part of the Intel® Parallel Studio XE Cluster Edition.

By default, the Intel MPI Library will be installed in the path /opt/intel/impi on the host or the Intel Xeon Phi processor. To start, follow the appropriate directions to install the latest versions of the Intel C/C++ Compiler and the Intel Fortran Compiler.

You can purchase or try the free 30-day evaluation of the Intel Parallel Studio XE from https://software.intel.com/en-us/intel-parallel-studio-xe. These instructions assume that you have the Intel MPI Library tar file - l_mpi_<version>.<package_num>.tgz. This is the latest stable release of the library at the time of writing this article. To check if a newer version exists, log into the Intel® Registration Center. The instructions below are valid for all current and subsequent releases.

As root user, untar the tar file l_mpi_<version>.<package_num>.tgz:

# tar -xzvf l_mpi_<version>.<package_num>.tgz
# cd l_mpi_<version>.<package_num>

Execute the install script on the host and follow the instructions. The installation will be placed in the default installation directory /opt/intel/impi/<version>.<package_num> assuming you are installing the library with root permission.

# ./install.sh

Compiling an MPI program

To compile an MPI program on the host or on an Intel Xeon Phi processor x200:

Before compiling an MPI program, you need to establish the proper environment settings for the compiler and for the Intel MPI Library:

$ source /opt/intel/compilers_and_libraries_<version>/linux/bin/compilervars.sh intel64
$ source /opt/intel/impi/<version>.<package_num>/bin64/mpivars.sh

or if you installed the Intel® Parallel Studio XE Cluster Edition, you can simply source the configuration script:

$ source /opt/intel/parallel_studio_xe_<version>/psxevars.sh intel64

Compile and link your MPI program using an appropriate compiler command:

To compile and link with the Intel MPI Library, use the appropriate commands from Table 2.

Table 2. MPI compilation Linux* command.

Programming Language | MPI Compilation Linux* Command
C | mpiicc
C++ | mpiicpc
Fortran 77 / 95 | mpiifort

For example, to compile the C program for the host, you can use the wrapper mpiicc:

$ mpiicc ./myprogram.c -o myprogram

To compile the program for the Intel Xeon Phi processor x200 and the Intel Xeon Phi coprocessor x200, add the -xMIC-AVX512 flag to take advantage of the Intel AVX-512 instruction set architecture (ISA) available on this architecture. For example, the following command compiles a C program for the Intel Xeon Phi product family x200 using the Intel AVX-512 ISA:

$ mpiicc -xMIC-AVX512 ./myprogram.c -o myprogram.knl

To compile the program for the Intel Xeon Phi coprocessor x100, add the -mmic flag. The following command shows how to compile a C program for the Intel Xeon Phi coprocessor x100:

$ mpiicc -mmic ./myprogram.c -o myprogram.knc

Running an MPI program on the Intel Xeon Phi processor x200

To run the application on the Intel Xeon Phi processor x200, use the script mpirun:

$ mpirun -n <# of processes> ./myprogram.knl

where n is the number of MPI processes to launch on the processor.

Running an MPI program on the Intel Xeon Phi coprocessor x200 and Intel Xeon Phi coprocessor x100

To run an application on the coprocessors, the following steps are needed:

  • Start the MPSS service if it was stopped previously:

    $ sudo systemctl start mpss

  • Transfer the MPI executable from the host to the coprocessor. For example, use the scp utility to transfer the executable for the Intel Xeon Phi coprocessor x100 to the coprocessor named mic0:

    $ scp myprogram.knc mic0:~/myprogram.knc

  • Transfer the MPI libraries and compiler libraries to the coprocessors: before the first run of an MPI application on the Intel Xeon Phi coprocessors, you need to copy the appropriate MPI and compiler libraries to each coprocessor installed in the system. For the coprocessor x200, the libraries under the /lib64 directory are transferred; for the coprocessor x100, the libraries under the /mic directory are transferred.

For example, we issue the copy to the first coprocessor x100, called mic0, which is accessible via the IP address 172.31.1.1. Note that all coprocessors have unique IP addresses since they are treated as just other uniquely addressable machines. You can refer to the first coprocessor as mic0 or by its IP address.

# sudo scp /opt/intel/impi/2017.3.196/mic/bin/* mic0:/bin/
# sudo scp /opt/intel/impi/2017.3.196/mic/lib/* mic0:/lib64/
# sudo scp /opt/intel/composer_xe_2017.3.196/compiler/lib/mic/* mic0:/lib64/

Instead of copying the MPI and compiler libraries manually, you can also run the script shown below to transfer them to the two coprocessors mic0 and mic1:

#!/bin/sh

export COPROCESSORS="mic0 mic1"
export BINDIR="/opt/intel/impi/2017.3.196/mic/bin"
export LIBDIR="/opt/intel/impi/2017.3.196/mic/lib"
export COMPILERLIB="/opt/intel/compilers_and_libraries_2017/linux/lib/mic"

for coprocessor in `echo $COPROCESSORS`
do
   for prog in mpiexec mpiexec.hydra pmi_proxy mpirun
   do
      sudo scp $BINDIR/$prog $coprocessor:/bin/$prog
   done

   for lib in libmpi.so.12 libmpifort.so.12 libmpicxx.so.12
   do
      sudo scp $LIBDIR/$lib $coprocessor:/lib64/$lib
   done

   for lib in libimf.so libsvml.so libintlc.so.5
   do
      sudo scp $COMPILERLIB/$lib $coprocessor:/lib64/$lib
   done
done

Script used for transferring MPI libraries to two coprocessors.

Another approach is to NFS mount the coprocessors’ file system from the host so that the coprocessors can have access to their MPI libraries from there. One advantage of using NFS mounts is that it saves RAM space on the coprocessors. The details on how to set up NFS mounts can be found in the first example in this document.

To run the application natively on the coprocessor, log in to the coprocessor and then run the mpirun script:

$ ssh mic0
$ mpirun -n <# of processes> ./myprogram.knc

where n is the number of MPI processes to launch on the coprocessor.

Finally, to run an MPI program from the host (symmetrically), additional steps are needed:

Set the Intel MPI environment variable I_MPI_MIC to let the Intel MPI Library recognize the coprocessors:

$ export I_MPI_MIC=enable

Disable the firewall in the host:

$ systemctl status firewalld
$ sudo systemctl stop firewalld

For multi-card use, configure Intel MPSS peer-to-peer so that each card can ping others:

$ sudo /sbin/sysctl -w net.ipv4.ip_forward=1

If you want to get debug information, include the flags -verbose and -genv I_MPI_DEBUG=n when running the application.
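For example, a symmetric launch of the binaries built above with verbose launcher output and MPI debug level 5 might look like the following (the rank counts are placeholders):

$ export I_MPI_MIC=enable
$ mpirun -verbose -genv I_MPI_DEBUG=5 -host localhost -n 2 ./myprogram : -host mic0 -n 2 ~/myprogram.knl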

The following sections include sample MPI programs written in C. The first example shows how to compile and run a program for Intel Xeon Phi processor x200 and for Intel Xeon Phi coprocessor x200. The second example shows how to compile and run a program for Intel Xeon Phi coprocessor x100.

Example 1

For illustration purposes, this example shows how to build and run an Intel MPI application in symmetric mode on a host that connects to two Intel Xeon Phi coprocessors x200. Note that the driver Intel MPSS 4.x should be installed on the host to enable the Intel Xeon Phi coprocessor x200.

In this example, use the integral representation below to calculate Pi (π):

π = ∫₀¹ 4 / (1 + x²) dx

Appendix A includes the implementation program. The workload is divided among the MPI ranks. Each rank spawns a team of OpenMP* threads, and each thread works on a chunk of the workload to take advantage of vectorization. First, compile and run this application on the Intel Xeon processor host. Since this program uses OpenMP, you need to compile the program with OpenMP libraries. Note that the Intel Parallel Studio XE 2018 is used in this example.

Set the environment variables, compile the application for the host, and then generate the optimization report on vectorization and OpenMP:

$ source /opt/intel/compilers_and_libraries_2018/linux/bin/compilervars.sh intel64
$ mpiicc mpitest.c -qopenmp -O3 -qopt-report=5 -qopt-report-phase:vec,openmp -o mpitest

To run two ranks on the host:

$ mpirun -host localhost -n 2 ./mpitest
Hello world: rank 0 of 2 running on knl-lb0.jf.intel.com
Hello world: rank 1 of 2 running on knl-lb0.jf.intel.com
FROM RANK 1 - numthreads = 32
FROM RANK 0 - numthreads = 32

Elapsed time from rank 0:    8246.90 (usec)
Elapsed time from rank 1:    8423.09 (usec)
rank 0 pi=   3.141613006592

Next, compile the application for the Intel Xeon Phi coprocessor x200 and transfer the executable to the coprocessors mic0 and mic1 (assuming you have already set up passwordless SSH access to the coprocessors).

$ mpiicc mpitest.c -qopenmp -O3 -qopt-report=5 -qopt-report-phase:vec,openmp -xMIC-AVX512 -o mpitest.knl
$ scp mpitest.knl mic0:~/.
$ scp mpitest.knl mic1:~/.

Enable MPI for the coprocessors and disable the firewall in the host:

$ export I_MPI_MIC=enable
$ sudo systemctl stop firewalld

This example also shows how to mount a shared directory using the Network File System (NFS). As root, you mount the /opt/intel directory where the Intel C++ Compiler and Intel MPI are installed. First, add descriptors in the /etc/exports configuration file on the host to share the directory /opt/intel with the coprocessors, whose IP addresses are 172.31.1.1 and 172.31.2.1, with read-only (ro) privilege.

[host~]# cat /etc/exports
/opt/intel 172.31.1.1(ro,async,no_root_squash)
/opt/intel 172.31.2.1(ro,async,no_root_squash)

Update the NFS export table and restart the NFS server in the host:

[host~]# exportfs -a
[host~]# service nfs restart

Next, log in on the coprocessors and create the mount point /opt/intel:

[host~]# ssh mic0
mic0:~# mkdir /opt
mic0:~# mkdir /opt/intel

 

Insert the descriptor “172.31.1.254:/opt/intel /opt/intel nfs defaults 1 1” into the /etc/fstab file in mic0:

mic0:~# cat /etc/fstab
/dev/root            /                    auto       defaults              1  1
proc                 /proc                proc       defaults              0  0
devpts               /dev/pts             devpts     mode=0620,gid=5       0  0
tmpfs                /run                 tmpfs      mode=0755,nodev,nosuid,strictatime 0  0
tmpfs                /var/volatile        tmpfs      defaults,size=85%     0  0
172.31.1.254:/opt/intel /opt/intel nfs defaults                            1  1

Finally, mount the shared directory /opt/intel on the coprocessor:

mic0:~# mount -a

Repeat this procedure for mic1 with this descriptor “172.31.2.254:/opt/intel /opt/intel nfs defaults 1 1” added to the /etc/fstab file in mic1.

Make sure that mic0 and mic1 are included in the /etc/hosts file:

$ cat /etc/hosts
127.0.0.1       localhost
::1             localhost
172.31.1.1      mic0
172.31.2.1      mic1

$ mpirun -host localhost -n 1 ./mpitest : -host mic0 -n 1 ~/mpitest.knl : -host mic1 -n 1 ~/mpitest.knl
Hello world: rank 0 of 3 running on knl-lb0
Hello world: rank 1 of 3 running on mic0
Hello world: rank 2 of 3 running on mic1
FROM RANK 0 - numthreads = 64
FROM RANK 2 - numthreads = 272
FROM RANK 1 - numthreads = 272
Elapsed time from rank 0:   12114.05 (usec)
Elapsed time from rank 1:  136089.09 (usec)
Elapsed time from rank 2:  125049.11 (usec)
rank 0 pi=   3.141597270966

By default, the maximum number of hardware threads available on each compute node is used. However, you can change this default behavior by using the local -env option to set environment variables for that compute node. For example, to set the number of OpenMP threads on mic0 to 68 and set the compact affinity, you can use the command:

$ mpirun -host localhost -n 1 ./mpitest : -host mic0 -n 1 -env OMP_NUM_THREADS=68 -env KMP_AFFINITY=compact ~/mpitest : -host mic1 -n 1 ~/mpitest
Hello world: rank 0 of 3 running on knl-lb0.jf.intel.com
Hello world: rank 1 of 3 running on mic0
Hello world: rank 2 of 3 running on mic1
FROM RANK 0 - numthreads = 64
FROM RANK 1 - numthreads = 68
FROM RANK 2 - numthreads = 272
Elapsed time from rank 0:   11068.11 (usec)
Elapsed time from rank 1:   57780.98 (usec)
Elapsed time from rank 2:  133417.13 (usec)
rank 0 pi=   3.141597270966

To simplify the launch process, define a file with all the machine names, give all the executables the same name, and move them to a predefined directory. For example, all executables are named mpitest and are located in the user home directories:

$ cat hosts_file
knl-lb0:1
mic0:2
mic1:2

$ mpirun -machinefile hosts_file -n 5 ~/mpitest
Hello world: rank 0 of 5 running on knl-lb0
Hello world: rank 1 of 5 running on mic0
Hello world: rank 2 of 5 running on mic0
Hello world: rank 3 of 5 running on mic1
Hello world: rank 4 of 5 running on mic1
FROM RANK 0 - numthreads = 64
FROM RANK 1 - numthreads = 136
FROM RANK 3 - numthreads = 136
FROM RANK 2 - numthreads = 136
FROM RANK 4 - numthreads = 136
Elapsed time from rank 0:   11260.03 (usec)
Elapsed time from rank 1:   71480.04 (usec)
Elapsed time from rank 2:   69352.15 (usec)
Elapsed time from rank 3:   74187.99 (usec)
Elapsed time from rank 4:   67718.98 (usec)
rank 0 pi=   3.141598224640

 

Example 2

Example 2 shows how to build and run an MPI application in symmetric model on a host that connects to two Intel Xeon Phi coprocessors x100. Note that the driver Intel MPSS 3.x should be installed for the Intel Xeon Phi coprocessor x100.

The sample program estimates Pi (π) using a Monte Carlo method. Consider a sphere centered at the origin and circumscribed by a cube. The sphere’s radius is r and the cube edge length is 2r. The volumes of the sphere and the cube are given by:

V_sphere = (4/3)·π·r³     V_cube = (2r)³ = 8·r³

The first octant of the coordinate system contains one eighth of the volumes of both the sphere and the cube; the volumes in that octant are given by:

V_sphere / 8 = π·r³ / 6     V_cube / 8 = r³

If we generate Nc points uniformly and randomly in the cube within this octant, we expect that about Ns points will be inside the sphere’s volume according to the following ratio:

Ns / Nc ≈ (π·r³ / 6) / r³ = π / 6

Therefore, the estimated Pi (π) is calculated by

π ≈ 6 · Ns / Nc

where Nc is the number of points generated in the portion of the cube residing in the first octant, and Ns is the total number of points found inside the portion of the sphere residing in the first octant.

In the implementation, rank 0 (process) is responsible for dividing the work among the other n ranks. Each rank is assigned a chunk of work, and the summation is used to estimate the number Pi. Rank 0 divides the x-axis into n equal segments. Each rank generates (Nc /n) points in the assigned segment, and then computes the number of points in the first octant of the sphere (see Figure 5).


Figure 5. Each MPI rank handles a different portion in the first octant.

The pseudo code is shown below:

Rank 0 generates n random seeds
Rank 0 broadcasts all random seeds to the n ranks
For each rank i in [0, n-1]
    receive the corresponding seed
    set num_inside = 0
    For j = 0 to Nc / n
        generate a point with coordinates
            x between [i/n, (i+1)/n]
            y between [0, 1]
            z between [0, 1]
        compute the distance d = x^2 + y^2 + z^2
        if distance d <= 1, increment num_inside
    Send num_inside back to rank 0
Rank 0 sets Ns to the sum of all num_inside
Rank 0 computes Pi = 6 * Ns / Nc
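A minimal C sketch of the per-rank loop from the pseudo code above (this is not the Appendix B program; rand_r with a per-rank seed and the helper name count_inside are illustrative choices):

#include <stdlib.h>

/* Count the points that fall inside the unit sphere for rank 'rank' of 'nranks',
   generating 'chunk' points in the x-slice [rank/nranks, (rank+1)/nranks]. */
long count_inside(int rank, int nranks, long chunk, unsigned int seed)
{
    long   j, num_inside = 0;
    double xlo = (double)rank / nranks;
    double xhi = (double)(rank + 1) / nranks;
    double x, y, z;

    for (j = 0; j < chunk; j++)
    {
        x = xlo + (xhi - xlo) * rand_r(&seed) / (double)RAND_MAX;
        y = rand_r(&seed) / (double)RAND_MAX;
        z = rand_r(&seed) / (double)RAND_MAX;
        if (x * x + y * y + z * z <= 1.0)   /* is the point inside the unit sphere? */
            num_inside++;
    }
    return num_inside;   /* rank 0 sums these counts and computes pi = 6.0 * Ns / Nc */
}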

To build the application montecarlo.knc for the Intel Xeon Phi coprocessor x100, the Intel C++ Compiler 2017 is used. Appendix B includes the implementation program. Note that this example simply shows how to run the code on an Intel Xeon Phi coprocessor x100; you can optimize the sample code for further improvement.

$ source /opt/intel/compilers_and_libraries_2017/linux/bin/compilervars.sh intel64
$ mpiicc -mmic montecarlo.c -o montecarlo.knc

Build the application for the host:

$ mpiicc montecarlo.c -o montecarlo

Transfer the application montecarlo.knc to the /tmp directory on the coprocessors using the scp utility. In this example, we issue the copy to two Intel Xeon Phi coprocessors x100.

$ scp ./montecarlo.knc mic0:/tmp/montecarlo.knc
montecarlo.knc     100% 17KB 16.9KB/s 00:00
$ scp ./montecarlo.knc mic1:/tmp/montecarlo.knc
montecarlo.knc     100% 17KB 16.9KB/s 00:00

Transfer the MPI libraries and compiler libraries to the coprocessors using the script shown earlier (“Script used for transferring MPI libraries to two coprocessors”). Enable the MPI communication between the host and the Intel Xeon Phi coprocessors x100:

$ export I_MPI_MIC=enable

Run the mpirun script to start the application. The flag -n specifies the number of MPI processes and the flag -host specifies the machine name:

$ mpirun -n <# of processes> -host <hostname> <application>

We can run the application on multiple hosts by separating them with “:”. The first MPI rank (rank 0) always starts on the first part of the command:

$ mpirun -n <# of processes> -host <hostname1> <application> : -n <# of processes> -host <hostname2> <application>

This starts the rank 0 on hostname1 and other ranks on hostname2.

Now run the application on the host. The mpirun command shown below starts the application with 2 ranks on the host, 3 ranks on the coprocessor mic0, and 5 ranks on coprocessor mic1:

$ mpirun -n 2 -host localhost ./montecarlo : -n 3 -host mic0 /tmp/montecarlo.knc \
: -n 5 -host mic1 /tmp/montecarlo.knc

Hello world: rank 0 of 10 running on knc0
Hello world: rank 1 of 10 running on knc0
Hello world: rank 2 of 10 running on knc0-mic0
Hello world: rank 3 of 10 running on knc0-mic0
Hello world: rank 4 of 10 running on knc0-mic0
Hello world: rank 5 of 10 running on knc0-mic1
Hello world: rank 6 of 10 running on knc0-mic1
Hello world: rank 7 of 10 running on knc0-mic1
Hello world: rank 8 of 10 running on knc0-mic1
Hello world: rank 9 of 10 running on knc0-mic1
Elapsed time from rank 0:      13.87 (sec)
Elapsed time from rank 1:      14.01 (sec)
Elapsed time from rank 2:     195.16 (sec)
Elapsed time from rank 3:     195.17 (sec)
Elapsed time from rank 4:     195.39 (sec)
Elapsed time from rank 5:     195.07 (sec)
Elapsed time from rank 6:     194.98 (sec)
Elapsed time from rank 7:     223.32 (sec)
Elapsed time from rank 8:     194.22 (sec)
Elapsed time from rank 9:     193.70 (sec)
Out of 4294967295 points, there are 2248849344 points inside the sphere => pi=  3.141606330872

A shorthand way of doing this in symmetric mode is to use the -machinefile option for the mpirun command in coordination with the I_MPI_MIC_POSTFIX environment variable. In this case, make sure all executables are in the same location on the host and mic0 and mic1 cards.

The I_MPI_MIC_POSTFIX environment variable simply tells the library to add the .knc postfix when running on the cards (since the executables there are called montecarlo.knc).

$ export I_MPI_MIC_POSTFIX=.knc

Now set the rank mapping in your hosts file (by using the <host>:<#_ranks> format):

$ cat hosts_file
localhost:2
mic0:3
mic1:5

And run your executable:

$ mpirun -machinefile hosts_file /tmp/montecarlo

The nice thing about this syntax is that you only have to edit the hosts_file when deciding to change your number of ranks or need to add more cards.

As an alternative, you can ssh to a coprocessor and launch the application from there:

$ ssh mic0
$ mpirun -n 3 /tmp/montecarlo.knc
Hello world: rank 0 of 3 running on knc0-mic0
Hello world: rank 1 of 3 running on knc0-mic0
Hello world: rank 2 of 3 running on knc0-mic0
Elapsed time from rank 0:     650.47 (sec)
Elapsed time from rank 1:     650.61 (sec)
Elapsed time from rank 2:     648.01 (sec)
Out of 4294967295 points, there are 2248795855 points inside the sphere => pi=  3.141531467438

 

Summary

This document showed you how to compile and run simple MPI applications in symmetric model. In a heterogeneous computing system, the performance of each computational unit is different, and this system behavior leads to the load imbalance problem. The Intel® Trace Analyzer and Collector can be used to analyze and understand the behavior of a complex MPI program running on a heterogeneous system. Using the Intel Trace Analyzer and Collector, you can quickly identify bottlenecks, evaluate load balancing, analyze performance, and identify communication hotspots. This powerful tool is essential for debugging and improving the performance of an MPI program running on a cluster with multiple computational units. For more details on using the Intel Trace Analyzer and Collector, read the whitepaper “Understanding MPI Load Imbalance with Intel® Trace Analyzer and Collector” available on /mic-developer. For more details, tips and tricks, and known workarounds, visit our Intel® Cluster Tools and the Intel® Xeon Phi™ Coprocessors page.

References

Appendix A

The code of the first sample program is shown below.

/*
 *  Copyright (c) 2017 Intel Corporation. All Rights Reserved.
 *
 *  Portions of the source code contained or described herein and all documents related
 *  to portions of the source code ("Material") are owned by Intel Corporation or its
 *  suppliers or licensors.  Title to the Material remains with Intel
 *  Corporation or its suppliers and licensors.  The Material contains trade
 *  secrets and proprietary and confidential information of Intel or its
 *  suppliers and licensors.  The Material is protected by worldwide copyright
 *  and trade secret laws and treaty provisions.  No part of the Material may
 *  be used, copied, reproduced, modified, published, uploaded, posted,
 *  transmitted, distributed, or disclosed in any way without Intel's prior
 *  express written permission.
 *
 *  No license under any patent, copyright, trade secret or other intellectual
 *  property right is granted to or conferred upon you by disclosure or
 *  delivery of the Materials, either expressly, by implication, inducement,
 *  estoppel or otherwise. Any license under such intellectual property rights
 *  must be express and approved by Intel in writing.
 */
//******************************************************************************
// Content: (version 1.0)
//      Calculate the number PI using its integral representation.
//
//******************************************************************************
#include <stdio.h>
#include <omp.h>   /* needed for omp_get_num_threads() */
#include "mpi.h"

#define MASTER 0
#define TAG_HELLO 1
#define TAG_TIME 2

const long ITER = 1024 * 1024;
const long SCALE = 16;
const long NUM_STEP = ITER * SCALE;

float calculate_partialPI(int n, int num) {
   unsigned long i;
   int  numthreads;
   float x, dx, pi = 0.0f;

   #pragma omp parallel
   #pragma omp master
   {
      numthreads = omp_get_num_threads();
      printf("FROM RANK %d - numthreads = %d\n", n, numthreads);
   }

   dx = 1.0 / NUM_STEP;

   unsigned long NUM_STEP1 = NUM_STEP / num;
   unsigned long begin = n * NUM_STEP1;
   unsigned long end = (n + 1) * NUM_STEP1;
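   // Midpoint-rule approximation of pi = integral from 0 to 1 of 4/(1+x^2) dx:
   // each MPI rank sums only its own [begin, end) slice of the NUM_STEP intervals,
   // and the per-rank partial sums are later combined with MPI_Reduce in main().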
   #pragma omp parallel for reduction(+:pi)
   for (i = begin; i < end; i++)
   {
      x = (i + 0.5f) / NUM_STEP;
      pi += (4.0f * dx) / (1.0f + x*x);
   }

   return pi;
}

int main(int argc, char **argv)
{
   float pi1, total_pi;
   double startprocess;
   int i, id, remote_id, num_procs, namelen;
   char name[MPI_MAX_PROCESSOR_NAME];
   MPI_Status stat;

   if (MPI_Init (&argc, &argv) != MPI_SUCCESS)
   {
      printf ("Failed to initialize MPI\n");
      return (-1);
   }

   // Create the communicator, and retrieve the number of processes.
   MPI_Comm_size (MPI_COMM_WORLD, &num_procs);

   // Determine the rank of the process.
   MPI_Comm_rank (MPI_COMM_WORLD, &id);

   // Get machine name
   MPI_Get_processor_name (name, &namelen);

   if (id == MASTER)
   {
      printf ("Hello world: rank %d of %d running on %s\n", id, num_procs, name);

      for (i = 1; i<num_procs; i++)
      {
         MPI_Recv (&remote_id, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
         MPI_Recv (&num_procs, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
         MPI_Recv (&namelen, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
         MPI_Recv (name, namelen+1, MPI_CHAR, i, TAG_HELLO, MPI_COMM_WORLD, &stat);

         printf ("Hello world: rank %d of %d running on %s\n", remote_id, num_procs, name);
      }
   }
   else
   {
      MPI_Send (&id, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&num_procs, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&namelen, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (name, namelen+1, MPI_CHAR, MASTER, TAG_HELLO, MPI_COMM_WORLD);
   }

   startprocess = MPI_Wtime();

   pi1 = calculate_partialPI(id, num_procs);

   double elapsed = MPI_Wtime() - startprocess;

   MPI_Reduce (&pi1, &total_pi, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);
   if (id == MASTER)
   {
      double timeprocess[num_procs];

      timeprocess[MASTER] = elapsed;
      printf("Elapsed time from rank %d: %10.2f (usec)\n", MASTER, 1000000 * timeprocess[MASTER]);

      for (i = 1; i < num_procs; i++)
      {
         // Rank 0 waits for elapsed time value
         MPI_Recv (&timeprocess[i], 1, MPI_DOUBLE, i, TAG_TIME, MPI_COMM_WORLD, &stat);
         printf("Elapsed time from rank %d: %10.2f (usec)\n", i, 1000000 *timeprocess[i]);
      }

      printf("rank %d pi= %16.12f\n", id, total_pi);
   }
   else
   {
      // Send back the processing time (in second)
      MPI_Send (&elapsed, 1, MPI_DOUBLE, MASTER, TAG_TIME, MPI_COMM_WORLD);
   }

   // Terminate MPI.
   MPI_Finalize();
   return 0;
}

 

Appendix B

The code of the second sample program is shown below.

/*
 *  Copyright (c) 2017 Intel Corporation. All Rights Reserved.
 *
 *  Portions of the source code contained or described herein and all documents related
 *  to portions of the source code ("Material") are owned by Intel Corporation or its
 *  suppliers or licensors.  Title to the Material remains with Intel
 *  Corporation or its suppliers and licensors.  The Material contains trade
 *  secrets and proprietary and confidential information of Intel or its
 *  suppliers and licensors.  The Material is protected by worldwide copyright
 *  and trade secret laws and treaty provisions.  No part of the Material may
 *  be used, copied, reproduced, modified, published, uploaded, posted,
 *  transmitted, distributed, or disclosed in any way without Intel's prior
 *  express written permission.
 *
 *  No license under any patent, copyright, trade secret or other intellectual
 *  property right is granted to or conferred upon you by disclosure or
 *  delivery of the Materials, either expressly, by implication, inducement,
 *  estoppel or otherwise. Any license under such intellectual property rights
 *  must be express and approved by Intel in writing.
 */
//******************************************************************************
// Content: (version 0.5)
//      Based on a Monte Carlo method, this MPI sample code uses volumes to
//      estimate the number PI.
//
//******************************************************************************
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

#include "mpi.h"

#define MASTER 0
#define TAG_HELLO 4
#define TAG_TEST 5
#define TAG_TIME 6

int main(int argc, char *argv[])
{
  int i, id, remote_id, num_procs;

  MPI_Status stat;
  int namelen;
  char name[MPI_MAX_PROCESSOR_NAME];

  // Start MPI.
  if (MPI_Init (&argc, &argv) != MPI_SUCCESS)
    {
      printf ("Failed to initialize MPI\n");
      return (-1);
    }

  // Create the communicator, and retrieve the number of processes.
  MPI_Comm_size (MPI_COMM_WORLD, &num_procs);

  // Determine the rank of the process.
  MPI_Comm_rank (MPI_COMM_WORLD, &id);
    // Get machine name
  MPI_Get_processor_name (name, &namelen);

  if (id == MASTER)
    {
      printf ("Hello world: rank %d of %d running on %s\n", id, num_procs, name);

      for (i = 1; i<num_procs; i++)
	{
	  MPI_Recv (&remote_id, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
	  MPI_Recv (&num_procs, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
	  MPI_Recv (&namelen, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
	  MPI_Recv (name, namelen+1, MPI_CHAR, i, TAG_HELLO, MPI_COMM_WORLD, &stat);

	  printf ("Hello world: rank %d of %d running on %s\n", remote_id, num_procs, name);
	}
    }
  else
    {
      MPI_Send (&id, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&num_procs, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&namelen, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (name, namelen+1, MPI_CHAR, MASTER, TAG_HELLO, MPI_COMM_WORLD);
    }

  // Rank 0 distributes random seeds to all processes.
  double startprocess, endprocess;

  int distributed_seed = 0;
  int *buff;

  buff = (int *)malloc(num_procs * sizeof(int));

  unsigned int MAX_NUM_POINTS = pow (2,32) - 1;
  unsigned int num_local_points = MAX_NUM_POINTS / num_procs;

  if (id == MASTER)
    {
      srand (time(NULL));

      for (i=0; i<num_procs; i++)
	{
	  distributed_seed = rand();
	  buff[i] = distributed_seed;
	}
    }

  // Broadcast the seed to all processes
  MPI_Bcast(buff, num_procs, MPI_INT, MASTER, MPI_COMM_WORLD);

  // At this point, every process (including rank 0) has a different seed. Using its seed,
  // each process generates num_local_points random points: p_x within its own slab of
  // width 1/num_procs inside [0,1], and p_y, p_z in [0,1].
  startprocess = MPI_Wtime();

  srand (buff[id]);

  unsigned int point = 0;
  unsigned int rand_MAX = 128000;
  float p_x, p_y, p_z;
  float temp, temp2, pi;
  double result;
  unsigned int inside = 0, total_inside = 0;
    for (point=0; point<num_local_points; point++)
    {
      temp = (rand() % (rand_MAX+1));
      p_x = temp / rand_MAX;
      p_x = p_x / num_procs;

      temp2 = (float)id / num_procs;	// id belongs to 0, num_procs-1
      p_x += temp2;

      temp = (rand() % (rand_MAX+1));
      p_y = temp / rand_MAX;

      temp = (rand() % (rand_MAX+1));
      p_z = temp / rand_MAX;

      // Compute the number of points residing inside of the 1/8 of the sphere
      result = p_x * p_x + p_y * p_y + p_z * p_z;

      if (result <= 1)
	  {
		inside++;
	  }
    }

  double elapsed = MPI_Wtime() - startprocess;

  MPI_Reduce (&inside, &total_inside, 1, MPI_UNSIGNED, MPI_SUM, MASTER, MPI_COMM_WORLD);

#if DEBUG
  printf ("rank %d counts %u points inside the sphere\n", id, inside);
#endif

  if (id == MASTER)
    {
      double timeprocess[num_procs];

      timeprocess[MASTER] = elapsed;
      printf("Elapsed time from rank %d: %10.2f (sec) \n", MASTER, timeprocess[MASTER]);

      for (i=1; i<num_procs; i++)
	{
	  // Rank 0 waits for elapsed time value
	  MPI_Recv (&timeprocess[i], 1, MPI_DOUBLE, i, TAG_TIME, MPI_COMM_WORLD, &stat);
	  printf("Elapsed time from rank %d: %10.2f (sec) \n", i, timeprocess[i]);
	}
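
      // Points are uniform in the unit cube [0,1]^3, and we count those inside the
      // octant of the unit sphere (x*x + y*y + z*z <= 1), whose volume is (1/8)(4/3)pi = pi/6.
      // Hence total_inside/MAX_NUM_POINTS approximates pi/6, so pi is estimated as
      // 6 * total_inside / MAX_NUM_POINTS.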

      temp = 6 * (float)total_inside;
      pi = temp / MAX_NUM_POINTS;
      printf ( "Out of %u points, there are %u points inside the sphere => pi=%16.12f\n", MAX_NUM_POINTS, total_inside, pi);
    }
  else
    {
      // Send back the processing time (in second)
      MPI_Send (&elapsed, 1, MPI_DOUBLE, MASTER, TAG_TIME, MPI_COMM_WORLD);
    }

  free(buff);

  // Terminate MPI.
  MPI_Finalize();

  return 0;
}
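
If you want to rebuild and redeploy this sample yourself, a minimal sketch looks like the following (this assumes the source file is saved as montecarlo.c and that the Intel® compiler and Intel® MPI Library environments are already sourced; -mmic cross-compiles for the first-generation coprocessor, and the exact flags used in the original build steps may differ):

$ mpiicc montecarlo.c -o montecarlo               # host binary
$ mpiicc -mmic montecarlo.c -o montecarlo.knc     # Intel Xeon Phi coprocessor binary
$ scp montecarlo.knc mic0:/tmp/
$ scp montecarlo.knc mic1:/tmp/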

Intel® Manycore Platform Software Stack Archive for the Intel® Xeon Phi™ Coprocessor x200 Product Family


On this page you will find the past releases of the Intel® Manycore Platform Software Stack (Intel® MPSS) for the Intel® Xeon Phi™ coprocessor x200 product family. The most recent release is found here: https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-for-intel-xeon-phi-coprocessor-x200. We recommend customers use the latest release wherever possible.

  • N-1 release for Intel® MPSS 4.4.x

Intel MPSS 4.4.0 HotFix 1 release for Linux*

Intel MPSS 4.4.0 Hotfix 1 (released: May 8, 2017)

Downloads Available                          Size (range)   MD5 Checksum
RHEL 7.3                                     214MB          8a015c38379b8be42c8045d3ceb44545
RHEL 7.2                                     214MB          694b7b908c12061543d2982695985d8b
SLES 12.2                                    213MB          506ab12af774f78fa8e107fd7a4f96fd
SLES 12.1                                    213MB          b8520888954e846e8ac8604d62a9ba96
SLES 12.0                                    213MB          88a3a4415afae1238453ced7a0df28ea
Card installer file (mpss-4.4.0-card.tar)    761MB          d26e26868297cea5fd4ffafe8d78b66e
Source file (mpss-4.4.0-card-source.tar)     514MB          127713d06496090821b5bb3613c95b30

Document Link             Description                                                                              Last Updated On   Size (approx.)
releaseNotes-linux.txt    Release notes (English)                                                                  May 2017          15KB
readme.txt                Readme (includes installation instructions) for Linux (English)                         May 2017          17KB
mpss_user_guide.pdf       Intel MPSS user guide                                                                    May 2017          3MB
eula.txt                  End User License Agreement (Important: Read before downloading, installing, or using)   May 2017          33KB

 

Intel MPSS 4.4.0 HotFix 1 release for Windows*

Intel MPSS 4.4.0 Hotfix 1 (released: May 8, 2017)

Downloads Available          Size      MD5 Checksum
mpss-4.4.0-windows.zip       1091MB    204a65b36858842f472a37c77129eb53

Document Link               Description                                                                              Last Updated On   Size (approx.)
releasenotes-windows.txt    English - Release notes                                                                  May 2017          7KB
readme-windows.pdf          English - Readme for Windows                                                             May 2017          399KB
mpss_users_guide_windows    Intel MPSS user guide for Windows                                                        May 2017          3MB
eula.txt                    End User License Agreement (Important: Read before downloading, installing, or using)   May 2017          33KB

 

The discussion forum at http://software.intel.com/en-us/forums/intel-many-integrated-core is available to join and discuss any enhancements or issues with the Intel MPSS.

Recipe: Building and Running GROMACS* on Intel® Processors


Purpose

This recipe describes how to get, build, and run the GROMACS* code on Intel® Xeon® and Intel® Xeon Phi™ processors for better performance on a single node.

Introduction

GROMACS is a versatile package for performing molecular dynamics, using Newtonian equations of motion, for systems with hundreds to millions of particles. GROMACS is primarily designed for biochemical molecules like proteins, lipids, and nucleic acids that have a multitude of complicated bonded interactions. But, since GROMACS is extremely fast at calculating the non-bonded interactions typically dominating simulations, many researchers use it for research on non-biological systems, such as polymers.

GROMACS supports all the usual algorithms expected from a modern molecular dynamics implementation.

The GROMACS code is maintained by developers around the world. The code is available under the GNU General Public License from www.gromacs.org.

Code Access

Download GROMACS:

Workloads Access

Download the workloads:

Generate Water Workloads Input Files:

To generate the .tpr input file:

  • tar xf water_GMX50_bare.tar.gz
  • cd water-cut1.0_GMX50_bare/1536
  • gmx_mpi grompp -f pme.mdp -c conf.gro -p topol.top -o topol_pme.tpr
  • gmx_mpi grompp -f rf.mdp -c conf.gro -p topol.top -o topol_rf.tpr

Build Directions

Build the GROMACS binary. Use cmake configuration for Intel® Compiler 2017.1.132 + Intel® MKL + Intel® MPI 2017.1.132:

Set the Intel Xeon Phi BIOS options to be:

  • Quadrant Cluster mode
  • MCDRAM Flat mode
  • Turbo Enabled

For Intel Xeon Phi, build the code as:

  • BuildDir="${GromacsPath}/build" # Create the build directory
  • installDir="${GromacsPath}/install"
  • mkdir $BuildDir
     

  • source /opt/intel/<version>/bin/compilervars.sh intel64 # Source the Intel compiler, MKL and IMPI
  • source /opt/intel/impi/<version>/mpivars.sh
  • source /opt/intel/mkl/<version>/mklvars.sh intel64
     

  • cd $BuildDir # Set the build environments for Intel Xeon Phi
FLAGS="-xMIC-AVX512 -g -static-intel"; CFLAGS=$FLAGS CXXFLAGS=$FLAGS CC=mpiicc CXX=mpiicpc cmake .. -DBUILD_SHARED_LIBS=OFF -DGMX_FFT_LIBRARY=mkl -DCMAKE_INSTALL_PREFIX=$installDir -DGMX_MPI=ON -DGMX_OPENMP=ON -DGMX_CYCLE_SUBCOUNTERS=ON -DGMX_GPU=OFF -DGMX_BUILD_HELP=OFF -DGMX_HWLOC=OFF -DGMX_SIMD=AVX_512_KNL -DGMX_OPENMP_MAX_THREADS=256

For Intel Xeon, set the build environments and build the code as above with changes:

  • FLAGS="-xCORE-AVX2 -g -static-intel"
  • -DGMX_SIMD=AVX2_256
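
Putting these two changes together, the Intel Xeon configure step would look roughly as follows (a sketch only: it is the Intel Xeon Phi cmake line above with just FLAGS and -DGMX_SIMD swapped, so verify the remaining options against your own build):

cd $BuildDir
FLAGS="-xCORE-AVX2 -g -static-intel"; CFLAGS=$FLAGS CXXFLAGS=$FLAGS CC=mpiicc CXX=mpiicpc cmake .. -DBUILD_SHARED_LIBS=OFF -DGMX_FFT_LIBRARY=mkl -DCMAKE_INSTALL_PREFIX=$installDir -DGMX_MPI=ON -DGMX_OPENMP=ON -DGMX_CYCLE_SUBCOUNTERS=ON -DGMX_GPU=OFF -DGMX_BUILD_HELP=OFF -DGMX_HWLOC=OFF -DGMX_SIMD=AVX2_256 -DGMX_OPENMP_MAX_THREADS=256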

Other system setup:

Change the kernel settings for KNL to “nmi_watchdog=0 nohz_full=0-270”. One way to change the settings (this could be different for every system):

  • First save your original grub.cfg to be safe
      cp /boot/grub2/grub.cfg /boot/grub2/grub.cfg.ORIG
  • In “/etc/default/grub”, add (append) the following to “GRUB_CMDLINE_LINUX”
      nmi_watchdog=0 nohz_full=0-270
  • Save your new configuration
      grub2-mkconfig -o /boot/grub2/grub.cfg
  • Reboot the system. After logging in, verify the settings with “cat /proc/cmdline”

Build GROMACS:

  • make -j 4
  • sleep 5
  • make check

Run Directions

Run workloads on the Intel Xeon Phi processor with the following environment settings and command lines (where nodes.txt contains: localhost:272):


	export  I_MPI_DEBUG=5
	export I_MPI_FABRICS=shm
	export I_MPI_PIN_MODE=lib
	export KMP_AFFINITY=verbose,compact,1

	gmxBin="${installDir}/bin/gmx_mpi"

	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 66 numactl -m 1 $gmxBin mdrun -npme 0 -notunepme -ntomp 4 -dlb yes -v -nsteps 4000 -resethway -noconfout -pin on -s ${WorkloadPath}water-cut1.0_GMX50_bare/1536/topol_pme.tpr
	export KMP_BLOCKTIME=0
	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 66 numactl -m 1 $gmxBin mdrun -ntomp 4 -dlb yes -v -nsteps 1000 -resethway -noconfout -pin on -s ${WorkloadPath}lignocellulose-rf.BGQ.tpr
	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 64 numactl -m 1 $gmxBin mdrun -ntomp 4 -dlb yes -v -nsteps 5000 -resethway -noconfout -pin on -s ${WorkloadPath}water-cut1.0_GMX50_bare/1536/topol_rf.tpr
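
A note on the numactl -m 1 prefix used above: with MCDRAM in flat mode, the 16 GB of MCDRAM is exposed as a separate NUMA node (commonly node 1 on this kind of single-socket configuration), and -m 1 binds the GROMACS allocations to it. Before running, you can confirm the node numbering on your own system (a generic check, not part of the original recipe):

	numactl -H     # the ~16 GB node that lists no CPUs is the MCDRAM node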

Run workloads on the Intel Xeon processor with the following environment settings and command lines:


	export  I_MPI_DEBUG=5
	export I_MPI_FABRICS=shm
	export I_MPI_PIN_MODE=lib
	export KMP_AFFINITY=verbose,compact,1

	gmxBin="${installDir}/bin/gmx_mpi"

	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 72 $gmxBin mdrun -notunepme -ntomp 1 -dlb yes -v -nsteps 4000 -resethway -noconfout -s ${WorkloadPath}water-cut1.0_GMX50_bare/1536_bdw/topol_pme.tpr
	export KMP_BLOCKTIME=0
	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 72 $gmxBin mdrun -ntomp 1 -dlb yes -v -nsteps 1000 -resethway -noconfout -s ${WorkloadPath}lignocellulose-rf.BGQ.tpr
	mpiexec.hydra -genvall -machinefile ./nodes.txt -np 72 $gmxBin mdrun -ntomp 1 -dlb yes -v -nsteps 5000 -resethway -noconfout -s ${WorkloadPath}water-cut1.0_GMX50_bare/1536_bdw/topol_rf.tpr

Performance Testing

Performance tests for GROMACS are illustrated below with comparisons between an Intel Xeon processor and an Intel Xeon Phi processor against three standard workloads: water1536k_pme, water1536k_rf, and lignocellulose3M_rf. In all cases, turbo mode is turned on.

Testing Platform Configurations

The following hardware was used for the above recipe and performance testing.

Processor                     Intel® Xeon® Processor E5-2697 v4    Intel® Xeon Phi™ Processor 7250
Stepping                      1 (B0)                               1 (B0) Bin1
Sockets / TDP                 2S / 290W                            1S / 215W
Frequency / Cores / Threads   2.3 GHz / 36 / 72                    1.4 GHz / 68 / 272
DDR4                          8x16 GB 2400 MHz (128 GB)            6x16 GB 2400 MHz
MCDRAM                        N/A                                  16 GB Flat
Cluster/Snoop Mode/Mem Mode   Home                                 Quadrant/flat
Turbo                         On                                   On
BIOS                          GRRFSDP1.86B.0271.R00.1510301446     GVPRCRB1.86B.0011.R04.1610130403
Compiler                      ICC-2017.1.132                       ICC-2017.1.132
Operating System              Red Hat Enterprise Linux* 7.2        Red Hat Enterprise Linux 7.2
Kernel                        3.10.0-327.el7.x86_64                3.10.0-327.13.1.el7.xppsl_1.3.3.151.x86_64

GROMACS Build Configurations

The following configurations were used for the above recipe and performance testing.

  • GROMACS Version: GROMACS-2016.1
  • Intel® Compiler Version: 2017.1.132
  • Intel® MPI Library Version: 2017.1.132
  • Workloads used: water1536k_pme, water1536k_rf, and lignocellulose3M_rf

Recipe: Building and running NEMO* on Intel® Xeon Phi™ Processors


About NEMO*

The NEMO* (Nucleus for European Modelling of the Ocean) numerical solutions framework encompasses models of ocean, sea ice, tracers, and biochemistry equations and their related physics. It also incorporates the pre- and post-processing tools and the interface to other components of the Earth System. NEMO allows several ocean-related components of the Earth System to work together or separately, and also allows for two-way nesting via AGRIF software. It is interfaced with the remaining components of the Earth System package (atmosphere, land surfaces, and so on) via the OASIS coupler.

This recipe shows the performance advantages of using the Intel® Xeon Phi™ processor 7250.

NEMO 3.6 is the current stable version.

Downloading the Code

  1. Download the NEMO source code from the official NEMO repository (you should register at www.nemo-ocean.eu ):

    svn co -r 6939 http://forge.ipsl.jussieu.fr/nemo/svn/branches/2015/nemo_v3_6_STABLE/NEMOGCM nemo

  2. Download the XIOS IO server from the official XIOS repository:

    svn co -r 703 http://forge.ipsl.jussieu.fr/ioserver/svn/XIOS/branchs/xios-1.0 xios

  3. If your system already has NetCDF libraries with Fortran bindings installed and they link with the NEMO and XIOS binaries, skip ahead to the section “Building XIOS for the Intel Xeon Processor”. Otherwise, download:
  4. NetCDF-Fortran from https://github.com/Unidata/netcdf-fortran/archive/netcdf-fortran-4.2.tar.gz

Building Additional Libraries for the Intel® Xeon® Processor

  1. First, choose a directory for your experiments, such as “~/NEMO-BDW”:
    export base="~/NEMO-BDW"
  2. Create a directory and copy all required libraries in $base:
    mkdir -p $base/libraries
  3. Unpack the tarball files in $base/libraries/src.
  4. To build an Intel® Advanced Vector Extensions 2 (Intel® AVX2) version of the libraries, set:
    export arch="-xCORE-AVX2"
  5. Set the following environment variables:
    export PREFIX=$base/libraries
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${PREFIX}/lib
    export CFLAGS="-I$PREFIX/include -L$PREFIX/lib -O3 -g -traceback -openmp ${arch} -fPIC"
    export CPPFLAGS=$CFLAGS
    export CXXFLAGS=$CFLAGS
    export FFFLAGS=$CFLAGS
    export FCFLAGS=$CFLAGS
    export LDFLAGS="-L$PREFIX/lib -openmp ${arch} -fPIC"
    export FC=mpiifort
    export CXX=mpiicc
    export CC=mpiicc
    export CPP="icc -E"
  6. Build szip:
    cd $base/libraries/src/szip-2.1
    ./configure --prefix=$PREFIX
    make -j 4
    make install
  7. Build zlib:
    cd $base/libraries/src/zlib-1.2.8
    ./configure --prefix=$PREFIX
    make -j 4
    make install
  8. Build HDF5:
    cd $base/libraries/src/hdf5-1.8.12
    ./configure --with-zlib=$PREFIX --prefix=$PREFIX --enable-fortran --with-szlib=$PREFIX --enable-hl
    make
    make install
  9. Build CURL:
    cd $base/libraries/src/curl-7.42.1
    ./configure --prefix=$PREFIX
    make -j 4
    make install
  10. Build NetCDF:
    cd $base/libraries/src/netcdf-4.3.3
    export LIBS=" -lhdf5_hl -lhdf5 -lz -lsz -lmpi"
    export LD_FLAGS+=" -L$PREFIX/lib"
    ./configure --prefix=$PREFIX
    make
    make install
  11. Build the NetCDF Fortran wrapper:
    cd $base/libraries/src/netcdf-fortran-4.2/
    export LIBS=""
    export CFLAGS="$CFLAGS -lnetcdf"
    export CPPFLAGS=$CFLAGS
    export CXXFLAGS=$CFLAGS
    export FFFLAGS=$CFLAGS
    export FCFLAGS=$CFLAGS
    export FC=ifort
    export CXX=mpiicc
    export CC=mpiicc
    export LDFLAGS+=" -L$I_MPI_ROOT/lib64/"
    ./configure --prefix=$PREFIX
    make
    make install

Building XIOS for the Intel Xeon Processor

  1. Copy XIOS source code to $base/xios
  2. Create files:
    $base/xios/arch/arch-ifort_linux.env
    $base/xios/arch/arch-ifort_linux.fcm
    $base/xios/arch/arch-ifort_linux.path
  3. Add the following lines to the $base/xios/arch/arch-ifort_linux.env file:
    export NETCDF_INC_DIR=$base/libraries/include
    export NETCDF_LIB_DIR=$base/libraries/lib
    export HDF5_INC_DIR=$base/libraries/include
    export HDF5_LIB_DIR=$base/libraries/lib
  4. Add the following lines to the $base/xios/arch/arch-ifort_linux.fcm file:
    %NCDF_INC            -I$base/libraries/include
    %NCDF_LIB            -L$base/libraries/lib -lnetcdff -lnetcdf -lhdf5 -lcurl -lz -lsz
    %FC                  mpiifort
    %FCFLAGS             -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %FFLAGS              -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %LD                  mpiifort
    %FPPFLAGS            -P -C -traditional
    %LDFLAGS             -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %AR                  ar
    %ARFLAGS             -r
    %MK                  gmake
    %USER_INC            %NCDF_INC_DIR
    %USER_LIB            %NCDF_LIB_DIR
    
    %MAKE                gmake
    %BASE_LD        -lstdc++ -lifcore -lintlc
    %LINKER         mpiifort -nofor-main
    %BASE_INC       -D__NONE__
    %CCOMPILER      mpiicc
    %FCOMPILER      mpiifort
    %CPP            cpp
    %FPP            cpp -P
    
    %BASE_CFLAGS    -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %PROD_CFLAGS    -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %DEV_CFLAGS    -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %DEBUG_CFLAGS  -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %BASE_FFLAGS   -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %PROD_FFLAGS    -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %DEV_FFLAGS    -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
    %DEBUG_FFLAGS   -O3 -g -traceback -xCORE-AVX2 -I$base/libraries/include -L$base/libraries/lib
  5. Add the following lines to the $base/xios/arch/arch-ifort_linux.path file:
    NETCDF_INCDIR="-I $NETCDF_INC_DIR"
    NETCDF_LIBDIR="-L $NETCDF_LIB_DIR"
    NETCDF_LIB="-lnetcdff -lnetcdf -lcurl"
    MPI_INCDIR=""
    MPI_LIBDIR=""
    MPI_LIB=""
    HDF5_INCDIR="-I $HDF5_INC_DIR"
    HDF5_LIBDIR="-L $HDF5_LIB_DIR"
    HDF5_LIB="-lhdf5_hl -lhdf5 -lz -lcurl"
  6. Change directory to $base/xios and execute the following command:
    ./make_xios --full --prod --arch ifort_linux

Building NEMO for the Intel Xeon Processor and Preparing Workloads

  1. Copy NEMO source code to $base/nemo
  2. Apply the following patch to the file $base/nemo/NEMOGCM/NEMO/OPA_SRC/nemogcm.F90:
    @@ -116,6 +116,7 @@
           !!              Madec, 2008, internal report, IPSL.
           !!----------------------------------------------------------------------
           INTEGER ::   istp       ! time step index
    +DOUBLE PRECISION :: mpi_wtime, sstart, send
           !!----------------------------------------------------------------------
           !
     #if defined key_agrif
    @@ -163,18 +164,19 @@
     #if defined key_agrif
               CALL Agrif_Regrid()
     #endif
    -
              DO WHILE ( istp <= nitend .AND. nstop == 0 )
    +sstart = mpi_wtime()
     #if defined key_agrif
                 CALL stp                         ! AGRIF: time stepping
     #else
                 CALL stp( istp )                 ! standard time stepping
     #endif
    +send=mpi_wtime()
    +print *, "Step ", istp, " - " , send-sstart , "s."
                 istp = istp + 1
                 IF( lk_mpp )   CALL mpp_max( nstop )
              END DO
     #endif
    -
           IF( lk_diaobs   )   CALL dia_obs_wri
           !
           IF( ln_icebergs )   CALL icb_end( nitend )
  3. Create the file $base/nemo/ARCH/arch-mpiifort_linux.fcm and add the following lines:
    %NCDF_INC            -I/$base/libraries/include
    %NCDF_LIB            -L$base/libraries/lib -lnetcdff -lnetcdf -lz -lcurl -lhdf5_hl -lhdf5 -lz -lcurl
    %CPP                 icc -E
    %FC                  mpiifort
    %FCFLAGS          -r8 -g -traceback -qopenmp -O3 -xCORE-AVX2 -g -traceback
    %FFLAGS             -r8 -g -traceback -qopenmp -O3 -xCORE-AVX2 -g -traceback
    %LD                  mpiifort
    %FPPFLAGS            -P -C -traditional
    %LDFLAGS             -lstdc++ -lifcore -O3 -xCORE-AVX2 -g -traceback
    %AR                  ar
    %ARFLAGS             -r
    %MK                  gmake
    %XIOS_INC            -I$base/xios/inc
    %XIOS_LIB            -L$base/xios/lib -lxios
    %USER_INC            %NCDF_INC %XIOS_INC
    %USER_LIB            %NCDF_LIB %XIOS_LIB
  4. Build the binary for the GYRE workload:
    cd $base/nemo/NEMOGCM/CONFIG
    ./makenemo -n GYRE -m mpiifort_linux -j 4
  5. Create a sandbox directory for the GYRE runs:
    1. mkdir -p $base/nemo/gyre-exp
       cp -r $base/nemo/NEMOGCM/CONFIG/GYRE/BLD/bin/nemo.exe $base/nemo/gyre-exp
       cp -r $base/nemo/NEMOGCM/CONFIG/GYRE/EXP00/* $base/nemo/gyre-exp
    2. Switch off creating mesh files by changing “nn_msh” to 0 in the namelist_ref file
    3. Enable benchmark mode by changing “nn_bench” to 1 in the namelist_ref file.
    4. Set the following parameters in the “&namcfg” section:
      jp_cfg = 70
      jpidta = 2102
      jpjdta = 1402
      jpkdta = 31
      jpiglo = 2102
      jpjglo = 1402
    5. Switch off using the IO server in the iodef.xml file (“using_server = false”)
  6. Build a binary for the ORCA025 workload:
    1. Change  “$base/nemo/NEMOGCM/CONFIG/ORCA2_LIM3/cpp_ORCA2_LIM3.fcm” content to “bld::tool::fppkeys key_trabbl key_vvl key_dynspg_ts key_ldfslp key_traldf_c2d key_traldf_eiv key_dynldf_c3d key_zdfddm key_zdftmx key_mpp_mpi key_zdftke key_lim3 key_iomput”
    2. Change the line “ORCA2_LIM3 OPA_SRC LIM_SRC_3 NST_SRC” to “ORCA2_LIM3 OPA_SRC LIM_SRC_3” in file $base/nemo/NEMOGCM/CONFIG/cfg.txt
    3. ./makenemo -n ORCA2_LIM3 -m mpiifort_linux -j 4
  7. Go to the Barcelona Supercomputing Center (in Spanish), and in section 9 locate the paragraph, “PREGUNTAS Y RESPUESTAS:” with a path to the ftp server and credentials to log in.
  8. Download the BenchORCA025L75.tar.gz file from directory Benchmarks_aceptacion/NEMO/
  9. Extract the contents of the tarball file to $base/nemo/orca-exp
  10. Copy the NEMO binary to the sandbox directory:
    cp $base/nemo/NEMOGCM/CONFIG/ORCA2_LIM3/BLD/bin/nemo.exe $base/nemo/orca-exp
  11. Edit the file $base/nemo/orca-exp/iodef.xml and add the following lines into the “<context id="xios">    <variable_definition>” section:
    <variable id="min_buffer_size" type="int">994473778</variable><variable id="buffer_size" type="int">994473778</variable> 
  12. In the file namelist_ref in section “&namrun” set the following variables:
    nn_itend     =   10
    nn_stock    =    10
    nn_write    =    10
  13. Copy the $base/nemo/NEMOGCM/CONFIG/SHARED/namelist_ref file to $base/nemo/exp-orca
  14. Switch off using the IO server in the iodef.xml file (“using_server = false”)
  15. To build the KNL binaries change “-xCORE-AVX2” to “-xMIC-AVX512”, change $base to another directory, and do all of the steps again.

Running the GYRE Workload with the Intel Xeon Processor

  1. Go to $base/nemo/gyre-exp
  2. Source the environment variables for the compiler and the Intel® MPI Library:
    source /opt/intel/compiler/latest/bin/compilervars.sh intel64
    source /opt/intel/impi/latest/bin/compilervars.sh intel64
  3. Add libraries to LD_LIBRARY_PATH:
    export LD_LIBRARY_PATH=$base/libraries/lib/:$LD_LIBRARY_PATH
  4. Set additional variables for the Intel MPI Library:
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_PIN_CELL=core
  5. Run NEMO:
    mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe
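
    As a purely hypothetical illustration (the rank count and host file are invented here, using one rank per core of the dual-socket Intel Xeon system described under “Configuring Test Systems”), a single-node launch could look like:

    echo localhost > hostfile.txt
    mpiexec.hydra -genvall -f hostfile.txt -n 36 -perhost 36 ./nemo.exe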

Running the ORCA025 Workload with the Intel Xeon Processor

  1. Go to $base/nemo/orca-exp
  2. Source the environment variables for the compiler and the Intel MPI Library:
    source /opt/intel/compiler/latest/bin/compilervars.sh intel64
    source /opt/intel/impi/latest/bin/compilervars.sh intel64
  3. Add libraries to LD_LIBRARY_PATH:
    export LD_LIBRARY_PATH=$base/libraries/lib/:$LD_LIBRARY_PATH
  4. Set additional variables for Intel MPI Library:
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_PIN_CELL=core
  5. Run NEMO:
    mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe
  6. If the application hangs while running, you can run NEMO with the XIOS server in detached mode:
    1. Copy xios_server.exe from $base/xios/bin to $base/nemo/orca-exp
    2. Edit the iodef.xml file and set “using_server = true”
    3. mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe : -n 2 ./xios_server.exe

Building Additional Libraries for the Intel® Xeon Phi™ Processor

  1. First, choose a directory for your experiments, such as “~/NEMO-KNL”:
    export base="~/NEMO-KNL"
  2. Create the directory and copy all required libraries in $base:
    mkdir -p $base/libraries
  3. Unpack the tarball files in $base/libraries/src.
  4. To build an Intel® AVX-512 version of the libraries, set:
    export arch="-xMIC-AVX512"
  5. Set the following environment variables:
    export PREFIX=$base/libraries
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${PREFIX}/lib
    export CFLAGS="-I$PREFIX/include -L$PREFIX/lib -O3 -g -traceback -openmp ${arch} -fPIC"
    export CPPFLAGS=$CFLAGS
    export CXXFLAGS=$CFLAGS
    export FFFLAGS=$CFLAGS
    export FCFLAGS=$CFLAGS
    export LDFLAGS="-L$PREFIX/lib -openmp ${arch} -fPIC"
    export FC=mpiifort
    export CXX=mpiicc
    export CC=mpiicc
    export CPP="icc -E"
  6. Build szip:
    cd $base/libraries/src/szip-2.1
    ./configure --prefix=$PREFIX
    make -j 4
    make install
  7. Build zlib:
    cd $base/libraries/src/zlib-1.2.8
    ./configure --prefix=$PREFIX
    make -j 4
    make install
  8. Build HDF5:
    cd $base/libraries/src/hdf5-1.8.12
    ./configure --with-zlib=$PREFIX --prefix=$PREFIX --enable-fortran --with-szlib=$PREFIX --enable-hl
    make
    make install
  9. Build CURL:
    cd $base/libraries/src/curl-7.42.1
    ./configure --prefix=$PREFIX
    make -j 4
    make install
  10. Build NetCDF:
    cd $base/libraries/src/netcdf-4.3.3
    export LIBS=" -lhdf5_hl -lhdf5 -lz -lsz -lmpi"
    export LD_FLAGS+=" -L$PREFIX/lib"
    ./configure --prefix=$PREFIX
    make
    make install
  11. Build the NetCDF Fortran wrapper:
    cd $base/libraries/src/netcdf-fortran-4.2/
    export LIBS=""
    export CFLAGS="$CFLAGS -lnetcdf"
    export CPPFLAGS=$CFLAGS
    export CXXFLAGS=$CFLAGS
    export FFFLAGS=$CFLAGS
    export FCFLAGS=$CFLAGS
    export FC=ifort
    export CXX=mpiicc
    export CC=mpiicc
    export LDFLAGS+=" -L$I_MPI_ROOT/lib64/"
    ./configure --prefix=$PREFIX
    make
    make install

Building XIOS for the Intel Xeon Phi Processor

  1. Copy XIOS source code to $base/xios
  2. Create files:
    $base/xios/arch/arch-ifort_linux.env
    $base/xios/arch/arch-ifort_linux.fcm
    $base/xios/arch/arch-ifort_linux.path
  3. Add the following lines to the $base/xios/arch/arch-ifort_linux.env file:
    export NETCDF_INC_DIR=$base/libraries/include
    export NETCDF_LIB_DIR=$base/libraries/lib
    export HDF5_INC_DIR=$base/libraries/include
    export HDF5_LIB_DIR=$base/libraries/lib
  4. Add the following lines to the $base/xios/arch/arch-ifort_linux.fcm file:
    %NCDF_INC            -I$base/libraries/include
    %NCDF_LIB            -L$base/libraries/lib -lnetcdff -lnetcdf -lhdf5 -lcurl -lz -lsz
    %FC                  mpiifort
    %FCFLAGS             -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %FFLAGS              -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %LD                  mpiifort
    %FPPFLAGS            -P -C -traditional
    %LDFLAGS             -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %AR                  ar
    %ARFLAGS             -r
    %MK                  gmake
    %USER_INC            %NCDF_INC_DIR
    %USER_LIB            %NCDF_LIB_DIR
    
    %MAKE                gmake
    %BASE_LD        -lstdc++ -lifcore -lintlc
    %LINKER         mpiifort -nofor-main
    %BASE_INC       -D__NONE__
    %CCOMPILER      mpiicc
    %FCOMPILER      mpiifort
    %CPP            cpp
    %FPP            cpp -P
    
    %BASE_CFLAGS    -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %PROD_CFLAGS    -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %DEV_CFLAGS     -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %DEBUG_CFLAGS   -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %BASE_FFLAGS    -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %PROD_FFLAGS    -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %DEV_FFLAGS     -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
    %DEBUG_FFLAGS   -O3 -g -traceback -xMIC-AVX512 -I$base/libraries/include -L$base/libraries/lib
  5. Add the following lines to the $base/xios/arch/arch-ifort_linux.path file:
    NETCDF_INCDIR="-I $NETCDF_INC_DIR"
    NETCDF_LIBDIR="-L $NETCDF_LIB_DIR"
    NETCDF_LIB="-lnetcdff -lnetcdf -lcurl"
    MPI_INCDIR=""
    MPI_LIBDIR=""
    MPI_LIB=""
    HDF5_INCDIR="-I $HDF5_INC_DIR"
    HDF5_LIBDIR="-L $HDF5_LIB_DIR"
    HDF5_LIB="-lhdf5_hl -lhdf5 -lz -lcurl"
  6. Change the directory to $base/xios and execute the following command:
    ./make_xios --full --prod --arch ifort_linux

Building NEMO for the Intel Xeon Phi Processor and Preparing Workloads

  1. Copy the NEMO source code to $base/nemo
  2. Apply the following patch to the file $base/nemo/NEMOGCM/NEMO/OPA_SRC/nemogcm.F90:
    @@ -116,6 +116,7 @@
           !!              Madec, 2008, internal report, IPSL.
           !!----------------------------------------------------------------------
           INTEGER ::   istp       ! time step index
    +DOUBLE PRECISION :: mpi_wtime, sstart, send
           !!----------------------------------------------------------------------
           !
     #if defined key_agrif
    @@ -163,18 +164,19 @@
     #if defined key_agrif
               CALL Agrif_Regrid()
     #endif
    -
              DO WHILE ( istp <= nitend .AND. nstop == 0 )
    +sstart = mpi_wtime()
     #if defined key_agrif
                 CALL stp                         ! AGRIF: time stepping
     #else
                 CALL stp( istp )                 ! standard time stepping
     #endif
    +send=mpi_wtime()
     +print *, "Step ", istp, " - " , send-sstart , "s."
                 istp = istp + 1
                 IF( lk_mpp )   CALL mpp_max( nstop )
              END DO
     #endif
    -
           IF( lk_diaobs   )   CALL dia_obs_wri
           !
           IF( ln_icebergs )   CALL icb_end( nitend )
  3. Create the file $base/nemo/ARCH/arch-mpiifort_linux.fcm and add the following lines:
    %NCDF_INC            -I/$base/libraries/include
    %NCDF_LIB            -L$base/libraries/lib -lnetcdff -lnetcdf -lz -lcurl -lhdf5_hl -lhdf5 -lz -lcurl
    %CPP                 icc -E
    %FC                  mpiifort
    %FCFLAGS          -r8 -g -traceback -qopenmp -O3 -xMIC-AVX512 -g -traceback
    %FFLAGS             -r8 -g -traceback -qopenmp -O3 -xMIC-AVX512 -g -traceback
    %LD                  mpiifort
    %FPPFLAGS            -P -C -traditional
    %LDFLAGS             -lstdc++ -lifcore -O3 -xMIC-AVX512 -g -traceback
    %AR                  ar
    %ARFLAGS             -r
    %MK                  gmake
    %XIOS_INC            -I$base/xios/inc
    %XIOS_LIB            -L$base/xios/lib -lxios
    %USER_INC            %NCDF_INC %XIOS_INC
    %USER_LIB            %NCDF_LIB %XIOS_LIB
  4. Build the binary for the GYRE workload:
    cd $base/nemo/NEMOGCM/CONFIG
    ./makenemo -n GYRE -m mpiifort_linux -j 4
  5. Create a sandbox directory for the GYRE runs:
    1. mkdir -p $base/nemo/gyre-exp
       cp -r $base/nemo/NEMOGCM/CONFIG/GYRE/BLD/bin/nemo.exe $base/nemo/gyre-exp
       cp -r $base/nemo/NEMOGCM/CONFIG/GYRE/EXP00/* $base/nemo/gyre-exp
    2. Switch off creating mesh files by changing “nn_msh” to 0 in the namelist_ref file
    3. Enable benchmark mode by changing “nn_bench” to 1 in the namelist_ref  file.
    4. Set the following parameters in the “&namcfg” section:
      jp_cfg = 70
      jpidta = 2102
      jpjdta = 1402
      jpkdta = 31
      jpiglo = 2102
      jpjglo = 1402
    5. Switch off using the IO server in the iodef.xml file (“using_server = false”)
  6. Build the binary for ORCA025 workload:
    1. Change  $base/nemo/NEMOGCM/CONFIG/ORCA2_LIM3/cpp_ORCA2_LIM3.fcm content to “bld::tool::fppkeys key_trabbl key_vvl key_dynspg_ts key_ldfslp key_traldf_c2d key_traldf_eiv key_dynldf_c3d key_zdfddm key_zdftmx key_mpp_mpi key_zdftke key_lim3 key_iomput”
    2. Change line “ORCA2_LIM3 OPA_SRC LIM_SRC_3 NST_SRC” to “ORCA2_LIM3 OPA_SRC LIM_SRC_3” in the file $base/nemo/NEMOGCM/CONFIG/cfg.txt 
    3. ./makenemo -n ORCA2_LIM3 -m mpiifort_linux -j 4
  7. Go to the Barcelona Supercomputing Center (in Spanish), and in section 9 locate the paragraph, “PREGUNTAS Y RESPUESTAS:” with the path to the ftp server and credentials to log in.
  8. Download the BenchORCA025L75.tar.gz file from the Benchmarks_aceptacion/NEMO/ directory
  9. Extract the contents of the tarball file to $base/nemo/orca-exp
  10. Copy the NEMO binary to the sandbox directory:
    cp $base/nemo/NEMOGCM/CONFIG/ORCA2_LIM3/BLD/bin/nemo.exe $base/nemo/orca-exp
  11. Edit the file $base/nemo/orca-exp/iodef.xml and add the following lines into the “<context id="xios">    <variable_definition>” section:
    <variable id="min_buffer_size" type="int">994473778</variable><variable id="buffer_size" type="int">994473778</variable>
  12. In the file namelist_ref in section “&namrun” set the following variables:
    nn_itend    =  10
    nn_stock    =    10
    nn_write    =    10
  13. Copy the $base/nemo/NEMOGCM/CONFIG/SHARED/namelist_ref file to the $base/nemo/exp-orca directory
  14. Switch off using the IO server in the iodef.xml file (“using_server = false”)
  15. To build the Intel Xeon binaries, change “-xMIC-AVX512” back to “-xCORE-AVX2”, change $base to another directory, and do all of the steps again.

Running the GYRE Workload with the Intel Xeon Phi Processor

  1. Go to $base/nemo/gyre-exp
  2. Source the environment variables for the compiler and Intel MPI Library:
    source /opt/intel/compiler/latest/bin/compilervars.sh intel64
    source /opt/intel/impi/latest/bin/compilervars.sh intel64
  3. Add the libraries to LD_LIBRARY_PATH:
    export LD_LIBRARY_PATH=$base/libraries/lib/:$LD_LIBRARY_PATH
  4. Set additional variables for Intel MPI Library:
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_PIN_CELL=core
  5. Run NEMO:
    mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe

Running the ORCA025 Workload with the Intel Xeon Phi Processor

  1. Go to $base/nemo/orca-exp
  2. Source environment variables for the compiler and Intel MPI Library:
    source /opt/intel/compiler/latest/bin/compilervars.sh intel64
    source /opt/intel/impi/latest/bin/compilervars.sh intel64
  3. Add libraries to LD_LIBRARY_PATH:
    export LD_LIBRARY_PATH=$base/libraries/lib/:$LD_LIBRARY_PATH
  4. Set additional variables for the Intel MPI Library:
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_PIN_CELL=core
  5. Run NEMO:
    mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe
  6. If the application hangs while running, you can run NEMO with the XIOS server in detached mode:
    1. Copy xios_server.exe from $base/xios/bin to $base/nemo/orca-exp
    2. Edit the iodef.xml file and set “using_server = true”
    3. mpiexec.hydra -genvall -f <hostfile> -n <number of ranks> -perhost <ppn> ./nemo.exe : -n 2 ./xios_server.exe

Configuring Test Systems

CPU
  Intel® Xeon® system: Dual-socket Intel® Xeon® processor E5-2697 v4, 2.3 GHz (turbo OFF), 18 cores/socket, 36 cores, 72 threads (HT on)
  Intel® Xeon Phi™ system: Intel® Xeon Phi™ processor 7250, 68 cores, 136 threads, 1400 MHz core freq. (turbo OFF), 1700 MHz uncore freq.

RAM
  Intel® Xeon® system: 128 GB (8 x 16 GB) DDR4 2400 MHz DIMMs
  Intel® Xeon Phi™ system: 96 GB (6 x 16 GB) DDR4 2400 MHz RDIMMs

Cluster File System
  Both systems: Intel® Enterprise Edition for Lustre* software (Intel® EE for Lustre* software), SSD (136 TB storage)

Interconnect
  Both systems: Intel® Omni-Path Architecture (Intel® OPA) Si 100 series

OS / Kernel / IB stack
  Both systems: Oracle Linux* server release 7.2, kernel 3.10.0-229.20.1.el6.x86_64.knl2, OFED version 10.2.0.0.158_72

  • NEMO configuration: V3.6 r6939 with XIOS 1.0 r703, Intel® Parallel Studio XE 17.0.0.098, Intel MPI Library 2017 for Linux*
  • MPI configuration:
    • I_MPI_FABRICS=shm:tmi
    • I_MPI_PIN_CELL=core

Performance Results for the Intel Xeon Processor and Intel Xeon Phi Processor

    1. Time of second step for GYRE workload:

# nodes    Intel® Xeon® Processor    Intel® Xeon Phi™ Processor
1          6.546229                  3.642156
2          3.011352                  2.075075
4          1.326501                  0.997129
8          0.640632                  0.492369
16         0.321378                  0.284348

    2. Time of second step for ORCA workload:

# nodes    Intel® Xeon® processor    Intel® Xeon Phi™ processor
2          5.764083
4          2.642725                  2.156876
8          1.305238                  1.0546
16         0.67725                   0.643372

Demo: Software Defined Visualization Using Intel® Xeon Phi™ Processor


In this demo we showcase the use of the Intel® Xeon Phi™ processor to do a 3D visualization of a tumor in a human brain. This can help advance research in the medical field by enabling more precise detection and removal of abnormalities such as brain tumors.

More information

The tool used for visualization is Paraview, with OSPRay as the rendering library.

Pre-requisites

Intel® Xeon Phi™ processor system with CentOS 7.2 Linux* (internet enabled)

Open a terminal, in your work area directory and follow the steps below:

  1. Create directory for the demo

    mkdir Intel_brain_demo

  2. Change directory

    cd Intel_brain_demo

  3. Create two directories under this

    mkdir paraview
    mkdir ospray

  4. Access the files from Dropbox:

    https://www.dropbox.com/s/wj0qp1clxv5xssv/SC_2016_BrainDemo.tar.gz?dl=0

  5. Copy the Paraview and Ospray tar files into the respective directories you created in steps above

    mv SC_2016_BrainDemo/paraview_sc_demo.tgz paraview/
    mv SC_2016_BrainDemo/ospray.tgz ospray/

  6. Untar each of the *tgz directories in the respective area

    tar -xzvf *.tgz

  7. Point the library path

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<…../Intel_brain_demo/ospray/install/lib64>

  8. Optional step: set the Qt graphics system variable, only if Paraview doesn’t load normally

    export QT_GRAPHICSSYSTEM=gtk

  9. Change directory to paraview/install where the binaries are

    cd paraview/install

  10. Run Paraview

    ./bin/paraview

  11. Once Paraview loads

    Select File/Load State

  12. Then load the brain_demo.pvsm state file from the SC_2016_BrainDemo package that you downloaded in the steps above

  13. It will then ask you to load the VTK files. Click the “...” button to select the appropriate *tumor1.vtk file, then the *tumor2.vtk file, and then the *Tumor1.vtk file, in that order, on your local machine. Then click OK.

  14. An Output Messages pop-up window will appear with warnings. Ignore the warnings and click Close; you should see something like the following:

  15. Now you can go to File/Save State and save this state. From then on, you can load this state file to skip the previous step of locating the data files.
  16. Then, on the Properties tab on the left side, enable OSPRay for every view (all the RenderViews 1/2/3) by selecting each view and clicking Enable OSPRay

  17. Once you do that you should see the images for all three views look as below:

  18. You can also rotate the views and see how they look.

A few issues and how to resolve

Missing OpenGL, install mesa for OpenGL

sudo yum -y install mesa-libGL
sudo yum -y install mesa-libGL-devel

libQtGui.so.4 error, install qt-x11 package

yum -y install qt-x11
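
If other shared-library errors show up, a quick generic check (not specific to this demo) is to ask the loader which libraries the ParaView binary still cannot find:

ldd ./bin/paraview | grep "not found"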

Acknowledgements

Special thanks to Carson Brownlee and James Jeffers from Intel Corporation for all their contributions and support. Without their efforts, it wouldn’t have been possible to get this demo running.

References

  1. http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html
  2. https://software.intel.com/en-us/blogs/Intel-Parallel-Studio-XE-2016
  3. https://gitlab.kitware.com/carson/paraview
  4. https://gitlab.kitware.com/carson/vtk
  5. http://www.ospray.org
  6. http://www.ospray.org/getting_ospray.html
  7. http://dap.xeonphi.com
  8. https://ispc.github.io/downloads.html
  9. https://www.threadingbuildingblocks.org
  10. https://en.wikipedia.org/wiki/Software_rendering

Recipe: Building and Running MILC on Intel® Xeon® Processors and Intel® Xeon Phi™ Processors


Introduction

MILC software represents a set of codes written by the MIMD Lattice Computation (MILC) collaboration used to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics. It performs simulations of four-dimensional SU(3) lattice gauge theory on MIMD (Multiple Instruction, Multiple Data) parallel machines. “Strong interactions” are responsible for binding quarks into protons and neutrons and holding them all together in the atomic nucleus. MILC applications address fundamental questions in high energy and nuclear physics and are directly related to major experimental programs in these fields. MILC is one of the largest compute cycle users at many U.S. and European supercomputing centers.

This article provides instructions for code access, build, and run directions for the “ks_imp_rhmc” application on Intel® Xeon® processors and Intel® Xeon Phi™ processors. The “ks_imp_rhmc” is a dynamical RHMC (rational hybrid Monte Carlo algorithm) code for staggered fermions. In addition to the naive and asqtad staggered actions, the highly improved staggered quark (HISQ) action is also supported.

Currently, the conjugate gradient (CG) solver in the code uses the QPhiX library. Efforts are ongoing to integrate other operations (gauge force (GF), fermion force (FF)) with the QPhiX library as well.

The QPhiX library provides sparse solvers and Dslash kernels for Lattice QCD simulations optimized for Intel® architectures.

Code Access

The MILC Software and QPhiX library are primarily required. The MILC software can be downloaded from GitHub here: https://github.com/milc-qcd/milc_qcd. Download the master branch. QPhiX support is integrated into this branch for CG solvers.

The QPhiX library and code generator for use with Wilson-Clover fermions (for example, for use with chroma) are available from https://github.com/jeffersonlab/qphix.git and https://github.com/jeffersonlab/qphix-codegen.git, respectively. For the most up to date version, we suggest you use the devel branch of QPhiX. The MILC version is currently not open source. Please contact the MILC collaboration group for access to the QPhiX (MILC) branch.

Build Directions

Compile the QPhiX Library

Users need to build QPhiX first before building the MILC package.

The QPhiX library will have two tar files, mbench*.tar and qphix-codegen*.tar.

Untar the above.

Build qphix-codegen

The files with intrinsics for QPhiX are built in the qphix-codegen directory.

Enter the qphix-codegen directory.

Edit line #3 in “Makefile_xyzt”, enable “milc=1” variable.

Compile as:

source /opt/intel/compiler/<version>/bin/compilervars.sh intel64
source /opt/intel/impi/<version>/mpi/intel64/bin/mpivars.sh
make -f Makefile_xyzt avx512 -- [for Intel® Xeon Phi™ Processors]
make -f Makefile_xyzt avx2 -- [for Intel® Xeon® v3 / v4 Processors]

Build mbench

The mbench is part of the QPhiX library. The MILC version is currently not open source. Please contact the MILC collaboration group for access to the QPhiX (MILC) branch.

Enter the mbench directory.

Edit line #3 in “Makefile_qphixlib”, set “mode=mic” to compile with Intel® AVX-512 for Intel® Xeon Phi™ Processor and “mode=avx” to compile with Intel® Advanced Vector Extensions 2 (Intel® AVX2) for Intel® Xeon® Processors.

Edit line #13 in “Makefile_qphixlib” to enable MPI. Set ENABLE_MPI = 1.

Compile as:

make -f Makefile_qphixlib mode=mic AVX512=1 -- [Intel® Xeon Phi™ Processor]
make -f Makefile_qphixlib mode=avx AVX2=1 -- [Intel® Xeon® Processors]

Compile MILC Code

Install/download the master branch from the above GitHub location.

Download the Makefile.qphix file from the following location:

http://denali.physics.indiana.edu/~sg/MILC_Performance_Recipe/.

Copy the Makefile.qphix to the corresponding application directory. In this case, copy the Makefile.qphix to the “ks_imp_rhmc” application directory and rename it as Makefile.

Make the following changes to the Makefile:

  • On line #17 - Add/uncomment the appropriate ARCH variable:
    • For example, ARCH = knl (compile with Intel AVX-512 for Intel® Xeon Phi™ Processor architecture).
    • For example, ARCH = bdw (compile with Intel AVX2 for Intel® Xeon® Processor architecture).
  • On line #28 - Change MPP variable to “true” if you want MPI.
  • On line #34 - Pick the PRECISION you want:
    • 1 = Single, 2 = Double. We use Double for our runs.
  • Starting line #37 - Compiler is set up and this should work:
    •  If directions above were followed. If not, customize starting at line #40.
  • On line #124 - Setup of Intel compiler starts:
    • Based on ARCH it will use the appropriate flags.
  • On line #395 - QPhiX customizations starts: 
    • On line #399 – Set QPHIX_HOME to correct QPhiX path (Path to mbench directory).
    • The appropriate QPhiX FLAGS will be set if the above is defined correctly.

Compile as:

Enter the ks_imp_rhmc directory. The Makefile with the above changes should be in this directory. Source the latest Intel® compilers and Intel® MPI Library.

make su3_rhmd_hisq -- Build su3_rhmd_hisq binary
make su3_rhmc_hisq -- Build su3_rhmc_hisq binary

Compile the above binaries for Intel® Xeon Phi™ Processor and Intel® Xeon® Processor (edit Makefile accordingly).

Run Directions

Input Files

There are two required input files, params.rest, and rat.m013m065m838.

They can be downloaded from here:

http://denali.physics.indiana.edu/~sg/MILC_Performance_Recipe/.

The file rat.m013m065m838 defines the residues and poles of the rational functions needed in the calculation. The file params.rest sets all the run time parameters, including the lattice size, the length of the calculation (number of trajectories), and the precision of the various conjugate-gradient solutions.

In addition, a params.<lattice-size> file with the required lattice size will be created during runtime. This file essentially has params.rest appended to it, together with the lattice size (Nx * Ny * Nz * Nt) to run.

The Lattice Sizes

The size of the four-dimensional space-time lattice is controlled by the “nx, ny, nz, nt” parameters.

As an example, consider a problem as (nx x ny x nz x nt) = 32 x 32 x 32 x 64 running on 64 MPI ranks. To weak scale this problem a user would begin by multiplying nt by 2, then nz by 2, then ny by 2, then nx by 2 and so on, such that all variables get sized accordingly in a round-robin fashion.

This is illustrated in the table below. The original problem size is 32 x 32 x 32 x 64; to keep the elements/rank constant (weak scaling) at a rank count of 128, first multiply nt by 2 (32 x 32 x 32 x 128). Similarly, for 512 ranks, multiply nt by 2, nz by 2, and ny by 2 from the original problem size to keep the same elements/rank.

Ranks            64        128       256       512
nx               32        32        32        32
ny               32        32        32        64
nz               32        32        64        64
nt               64        128       128       128
Total Elements   2097152   4194304   8388608   16777216
Multiplier       1         2         4         8
Elements/Rank    32768     32768     32768     32768

Table: Illustrates Weak Scaling of Lattice Sizes
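
The round-robin doubling described above is easy to script. The helper below is only a sketch (a hypothetical weak_scale.sh, not part of MILC) that reproduces the table for any power-of-two multiple of 64 ranks:

# weak_scale.sh (hypothetical helper) - usage: bash weak_scale.sh <ranks>, with <ranks> = 64 * 2^k
base_ranks=64
dims=(32 32 32 64)            # nx ny nz nt of the 64-rank baseline
target_ranks=${1:?usage: bash weak_scale.sh <ranks>}
ranks=$base_ranks
i=3                           # start doubling with nt
while [ "$ranks" -lt "$target_ranks" ]; do
  dims[$i]=$(( dims[$i] * 2 ))
  i=$(( (i + 3) % 4 ))        # nt -> nz -> ny -> nx -> nt ...
  ranks=$(( ranks * 2 ))
done
echo "nx=${dims[0]} ny=${dims[1]} nz=${dims[2]} nt=${dims[3]} (ranks=$ranks)"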

Running with MPI x OpenMP*

The calculation takes place on a four-dimensional hypercubic lattice, representing three spatial dimensions and one time dimension. The quark fields have values on each of the lattice points and the gluon field has values on each of the links connecting nearest-neighbors of the lattice sites. 

The lattice is divided into equal subvolumes, one per MPI rank. The MPI ranks can be thought of as being organized into a four-dimensional grid of ranks. It is possible to control the grid dimensions with the params.rest file. Of course, the grid dimensions must be integer factors of the lattice coordinate dimensions.
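
Because of that divisibility requirement, it can be worth sanity-checking a candidate decomposition before launching. The sketch below is a hypothetical helper (the 2 x 2 x 2 x 2 grid for 16 ranks is an assumed example, not taken from the recipe):

nx=48; ny=48; nz=48; nt=120      # lattice used in the 16-node example below
gx=2;  gy=2;  gz=2;  gt=2        # candidate 4-D rank grid (16 ranks)
for pair in "$nx $gx" "$ny $gy" "$nz $gz" "$nt $gt"; do
  set -- $pair
  if [ $(( $1 % $2 )) -ne 0 ]; then
    echo "grid dimension $2 does not divide lattice dimension $1"
  fi
done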

Each MPI rank executes the same code. The calculation requires frequent exchanges of quark and gluon values between MPI ranks with neighboring lattice sites. Within a single MPI rank, the site-by-site calculation is threaded using OpenMP* directives, which have been inserted throughout the code. The most time-consuming part of production calculations is the CG solver. In the QPhiX version of the CG solver, the data layout and the calculation at the thread level are further organized to take advantage of the SIMD (single instruction, multiple data) lanes of the Intel Xeon and Intel Xeon Phi processors.

Running the Test Cases

  1. Create a “run” directory in the top-level directory and add the input files obtained from above.
  2. cd <milc>/run
    P.S: Run the appropriate binary for each architecture.
  3. Create the lattice volume:
    cat << EOF > params.${nx}x${ny}x${nz}x${nt}
    prompt 0
    nx $nx
    ny $ny
    nz $nz
    nt $nt
    EOF
    cat params.rest >> params.${nx}x${ny}x${nz}x${nt}

    For this performance recipe, we evaluate the single node and multinode (16 nodes) performance with the following weak scaled lattice volume:

    Single Node (nx * ny * nz * nt): 24 x 24 x 24 x 60

    Multinode [16 nodes] (nx * ny * nz * nt): 48 x 48 x 48 x 120

  4. Run on Intel Xeon processor (E5-2697v4).
    Source the latest Intel compilers and Intel MPI Library (see the environment setup sketch after this list)
    • Intel® Parallel Studio 2017 and above recommended

    Single Node:

    mpiexec.hydra -n 12 -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' <path-to>/ks_imp_rhmc/su3_rhmd_hisq.bdw < params.24x24x24x60

    Multinode (16 nodes, via Intel® Omni-Path Host Fabric Interface (Intel® OP HFI)):

    # Create a runScript (run-bdw) #
    <path-to>/ks_imp_rhmc/su3_rhmd_hisq.bdw < params.48x48x48x120
    #Intel® OPA fabric-related environment variables#
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_TMI_PROVIDER=psm2
    export PSM2_IDENTIFY=1
    export I_MPI_FALLBACK=0
    #Create nodeconfig.txt with the following#
    -host <hostname1> -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 12 <path-to>/run-bdw
    …..
    …..
    …..
    -host <hostname16> -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 12 <path-to>/run-bdw
    #mpirun command#
    mpiexec.hydra -configfile nodeconfig.txt
  5. Run on Intel Xeon Phi processor (7250).
    Source the Intel compilers and Intel MPI Library (see the environment setup sketch after this list)
    • Intel® Parallel Studio 2017 and above recommended

    Single Node:

    mpiexec.hydra -n 20 -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' numactl -p 1 <path-to>/ks_imp_rhmc/su3_rhmd_hisq.knl < params.24x24x24x60

    Multinode (16 nodes, via Intel OP HFI):

    # Create a runScript (run-knl) #
    numactl -p 1 <path-to>/ks_imp_rhmc/su3_rhmd_hisq.knl < params.48x48x48x120
    #Intel OPA fabric-related environment variables#
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_TMI_PROVIDER=psm2
    export PSM2_IDENTIFY=1
    export I_MPI_FALLBACK=0
    #Create nodeconfig.txt with the following#
    -host <hostname1> -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 20 <path-to>/run-knl
    …..
    …..
    …..
    -host <hostname16> -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 20 <path-to>/run-knl
    #mpirun command#
    mpiexec.hydra -configfile nodeconfig.txt
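
The two run steps above assume that the Intel compiler and MPI environments have already been sourced and, on the Intel Xeon Phi processor, that MCDRAM is exposed as NUMA node 1 (flat memory mode, quadrant cluster mode). The following is a minimal sketch of that setup; the installation path is an assumption and should be adjusted to the local Intel® Parallel Studio install:

    # Sketch: source the compiler and MPI environments (path is an assumption;
    # adjust to your Intel Parallel Studio installation directory).
    source /opt/intel/parallel_studio_xe_2017/psxevars.sh intel64

    # On the Intel Xeon Phi node, confirm that MCDRAM appears as NUMA node 1
    # before using "numactl -p 1"; node numbering can differ in other cluster modes.
    numactl --hardware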

Performance Results and Optimizations

The output prints the total time to solution for the entire application, which takes into account the time for the different solvers and operators (for example, CG solver, fermion force, link fattening, gauge force, and so on).

The performance chart below shows the speedup relative to the 2S Intel Xeon processor E5-2697 v4, based on the total run time.

Figure: Speedup with respect to the 2S Intel® Xeon® processor E5-2697 v4 baseline

The optimizations in the QPhiX library include data layout changes that target vectorization and the generation of packed, aligned loads/stores, cache blocking, load balancing, and improved code generation for each architecture (Intel Xeon processor, Intel Xeon Phi processor) with corresponding intrinsics where necessary. See the References and Resources section for details.

Testing Platform Configurations

The following hardware was used for the above recipe and performance testing.

Processor                   | Intel® Xeon® Processor E5-2697 v4              | Intel® Xeon Phi™ Processor 7250F
Sockets / TDP               | 2S / 290W                                      | 1S / 215W
Frequency / Cores / Threads | 2.3 GHz / 36 / 72                              | 1.4 GHz / 68 / 272
DDR4                        | 8x16 GB 2400 MHz                               | 6x16 GB 2400 MHz
MCDRAM                      | N/A                                            | 16 GB Flat
Cluster/Snoop Mode          | Home                                           | Quadrant
Memory Mode                 | N/A                                            | Flat
Turbo                       | OFF                                            | OFF
BIOS                        | SE5C610.86B.01.01.0016.033120161139            | GVPRCRB1.86B.0010.R02.1606082342
Operating System            | Oracle Linux* 7.2 (3.10.0-229.20.1.el6.x86_64) | Oracle Linux* 7.2 (3.10.0-229.20.1.el6.x86_64)

MILC Build Configurations

The following configurations were used for the above recipe and performance testing.

MILC Version               | Master version as of 28 January 2017
Intel® Compiler Version    | 2017.1.132
Intel® MPI Library Version | 2017.0.098
MILC Makefiles Used        | Makefile.qphix, Makefile_qphixlib, Makefile

References and Resources

  1. MIMD Lattice Computation (MILC) Collaboration: http://physics.indiana.edu/~sg/milc.html
  2. QPhiX Case Study: http://www.nersc.gov/users/computational-systems/cori/application-porting-and-performance/application-case-studies/qphix-case-study/
  3. MILC Staggered Conjugate Gradient Performance on Intel® Xeon Phi™ Processor: https://anl.app.box.com/v/IXPUG2016-presentation-10