Channel: Intel Developer Zone Articles

How to get WRF running on the Intel® Xeon Phi™ Coprocessor


WRF on Intel® Xeon Phi™ Coprocessors and Intel® Xeon® processors

I. Overview

This document describes how to obtain, build, and run the WRF model on a single Intel® Xeon Phi™ coprocessor and on a single Intel® Xeon® processor-based server. It also describes the WRF software configuration and affinity settings that extract the best performance from each of these single-node systems.

II. Introduction

The Weather Research and Forecasting (WRF) model is a numerical weather prediction system designed to serve both atmospheric research and operational forecasting needs. WRF is used by academic atmospheric scientists, forecast teams at operational centers, application scientists, and others. See http://www.wrf-model.org/index.php for more details about WRF. The source code and input files can be downloaded from the NCAR website.

III. Compiling, Running, and Validating WRF on a Standalone Intel® Xeon Phi™ Coprocessor (Single Card)

 

Compile WRF

1. Download and untar the WRF 3.5 source code from the NCAR repository: http://www.mmm.ucar.edu/wrf/users/download/get_sources.html#V341

2. Source the setup scripts for the Intel® MPI Library and the Intel® Compiler, for example:

source /opt/intel/impi/4.1.0.030/mic/bin/mpivars.sh
source /opt/intel/composer_xe_2013.2.146/bin/compilervars.sh intel64

3. Export the path to the netCDF installation built for the Intel Xeon Phi coprocessor. A coprocessor build of netCDF is a prerequisite (see Section V).

export NETCDF=/localdisk/igokhale/KNC/trunk/WRFV3.4/netcdf/mic/

4. cd into the ../WRFV3/ directory, run ./configure, and select option 21.

5. Edit configure.wrf and change mpicc to mpiicc in the DM_CC setting. (This will be fixed in the next release of WRF.)

6. Run ./compile wrf >& build.mic

7. This builds wrf.exe in the ../WRFV3/main directory.

8. For a new, clean build, run ./clean -a and repeat the process.
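Steps 3 and 5 above can also be checked or applied from the shell. The two functions below are a minimal sketch under stated assumptions, not part of WRF or netCDF: check_netcdf_dir and fix_dm_cc are hypothetical names, check_netcdf_dir assumes the netCDF 3.x install layout (lib/libnetcdf.a and include/netcdf.inc under $NETCDF), and fix_dm_cc assumes GNU sed and a configure.wrf line that begins with DM_CC.

```shell
# Hypothetical helpers (not part of WRF or netCDF) for steps 3 and 5 above.

# check_netcdf_dir: assumes a netCDF 3.x install tree containing
# lib/libnetcdf.a and include/netcdf.inc.
check_netcdf_dir() {
    local dir="${1%/}" f    # drop one trailing slash, e.g. .../netcdf/mic/
    for f in lib/libnetcdf.a include/netcdf.inc; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            return 1
        fi
    done
    echo "ok: $dir"
}

# fix_dm_cc: on the line that starts with DM_CC, replace the first "mpicc"
# with Intel MPI's "mpiicc" wrapper, then print the edited line.
# Uses GNU sed -i (in-place edit). Running it twice is harmless because
# "mpiicc" does not contain the string "mpicc".
fix_dm_cc() {
    local f="${1:-configure.wrf}"
    sed -i '/^DM_CC[[:space:]]*=/s/mpicc/mpiicc/' "$f"
    grep '^DM_CC' "$f"
}
```

For example, run check_netcdf_dir "$NETCDF" before ./configure, and fix_dm_cc configure.wrf after it; the printed DM_CC line makes the change easy to confirm.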

Run WRF

1. Download the CONUS12_rundir from http://www.mmm.ucar.edu/WG2bench/conus12km_data_v3/ and place it in ../WRFV3.

2. Copy the binary ../WRFV3/main/wrf.exe to CONUS12_rundir/wrf.exe.

3. cd into CONUS12_rundir and run WRF natively on the coprocessor, using the runtime settings in the script below.

Script to run natively on the Intel Xeon Phi coprocessor:

bash-4.1$ cat wrf.sh
source /opt/intel/impi/4.1.0.030/mic/bin/mpivars.sh
export LD_LIBRARY_PATH=/opt/intel/composer_xe_2013.2.146/compiler/lib/mic/
export KMP_STACKSIZE=62m
ulimit -s unlimited
export I_MPI_DEBUG=5
export WRF_NUM_TILES_X=3
export WRF_NUM_TILES_Y=60
export I_MPI_PIN_MODE=mpd
export KMP_PLACE_THREADS=60C,3T
export OMP_NUM_THREADS=180
export KMP_AFFINITY=balanced,granularity=thread
export KMP_LIBRARY=turnaround
export KMP_BLOCKTIME=infinite
mpiexec.hydra -np 1 ./wrf.exe

4. The run is complete when it prints ‘wrf: SUCCESS COMPLETE WRF’ on the screen. Two files, rsl.error.0000 and rsl.out.0000, are written to the CONUS12_rundir directory.

5. After the run, compute the total time taken to simulate 149 time steps. The sum and mean values are the figures of interest for WRF (lower is better). The pipeline below skips the first ‘Timing for main’ line, which typically includes start-up overhead, and summarizes the next 149 steps. The following parsing script may help:

bash-4.1$ cat gettiming.sh 
grep 'Timing for main' rsl.out.0000 | sed '1d' | head -149 | awk '{print $9}' | awk -f stats.awk
bash-4.1$ cat stats.awk 
BEGIN{ a = 0.0 ; i = 0 ; max = -999999999  ; min = 9999999999 }
{
    i ++ 
    a += $1
    if ( $1 > max ) max = $1
    if ( $1 < min ) min = $1
}
END{ printf("---\n%10s  %8d\n%10s  %15f\n%10s  %15f\n%10s  %15f\n%10s  %15f\n%10s  %15f\n","items:",i,"max:",max,"min:",min,"sum:",a,"mean:",a/(i*1.0),"mean/max:",(a/(i*1.0))/max) }
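Steps 4 and 5 can be combined into one post-run check. The sketch below is illustrative only: wrf_run_ok and timing_stats are hypothetical helpers, wrf_run_ok assumes the completion message is also written to rsl.out.0000, and timing_stats assumes, like gettiming.sh, that field 9 of each ‘Timing for main’ line is the elapsed time in seconds. It computes the same items/max/min/sum/mean/mean-max statistics as stats.awk, in a single awk pass.

```shell
# Hypothetical helpers (not part of WRF); run them from CONUS12_rundir.

# wrf_run_ok: assumes "SUCCESS COMPLETE WRF" is also written to rsl.out.0000.
wrf_run_ok() {
    local log="${1:-rsl.out.0000}"
    if grep -q 'SUCCESS COMPLETE WRF' "$log" 2>/dev/null; then
        echo "completed"
    else
        echo "not completed; check rsl.error.0000"
        return 1
    fi
}

# timing_stats: one-pass equivalent of gettiming.sh + stats.awk.
# Skips the first "Timing for main" line, uses the next N (default 149),
# and assumes field 9 holds the elapsed seconds, as gettiming.sh does.
timing_stats() {
    grep 'Timing for main' "${1:-rsl.out.0000}" | awk -v n="${2:-149}" '
        NR == 1 { next }        # first step: typically includes start-up overhead
        NR > n + 1 { exit }     # stop after n timed steps
        {
            t = $9 + 0
            if (cnt == 0 || t > max) max = t
            if (cnt == 0 || t < min) min = t
            sum += t
            cnt++
        }
        END {
            if (cnt == 0) { print "no timing lines found"; exit 1 }
            printf("---\n%10s  %8d\n%10s  %15f\n%10s  %15f\n%10s  %15f\n%10s  %15f\n%10s  %15f\n",
                   "items:", cnt, "max:", max, "min:", min, "sum:", sum,
                   "mean:", sum / cnt, "mean/max:", (sum / cnt) / max)
        }'
}
```

Usage from CONUS12_rundir: wrf_run_ok && timing_stats rsl.out.0000 149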

Validation of your runs

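To check whether the WRF run is valid, compare your output file with a reference output using diffwrf:

diffwrf your_output wrfout_reference > diffout_tag

The ‘DIGITS’ column reports how many significant digits agree between the two files for each field, and it should contain high values (>3). If it does, the WRF run is considered valid.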
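When diffout_tag lists many fields, scanning the DIGITS column by eye is error-prone. The function below is a sketch: low_digit_fields is a hypothetical name, and it assumes that each data row of the diffwrf output begins with the field name and has DIGITS as its sixth whitespace-separated column, so check that against your own diffout_tag before relying on it. It prints every field whose DIGITS value is 3 or lower; empty output means every listed field met the >3 threshold.

```shell
# low_digit_fields (hypothetical): print fields whose DIGITS value is 3 or lower.
# Assumes data rows start with the field name and that DIGITS is column 6;
# rows whose sixth column is not an integer (such as a header) are skipped.
low_digit_fields() {
    awk '$6 ~ /^[0-9]+$/ && $6 + 0 <= 3 { print $1, "DIGITS=" $6 }' "${1:-diffout_tag}"
}
```

Usage: low_digit_fields diffout_tag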

IV. Compiling, Running, and Validating WRF on a 2-Socket Intel® Xeon® Server

We used these instructions on a 2-Socket Intel® Xeon® E5-26xx system.

Compile WRF

1. Download and untar the WRF 3.5 source code from the NCAR repository: http://www.mmm.ucar.edu/wrf/users/download/get_sources.html

2. Source the setup scripts for the Intel MPI Library and the Intel Compiler, for example:

source /opt/intel/impi/4.1.0.030/intel64/bin/mpivars.sh
source /opt/intel/composer_xe_2013.2.146/bin/compilervars.sh intel64

3. Export the path to the host netCDF installation. A netCDF library built for the host (in our case, an Intel® Xeon® processor-based server) is a prerequisite.

export NETCDF=/localdisk/igokhale/KNC/trunk/WRFV3.4/netcdf/xeon/

4. cd into the WRFV3 directory created in step 1, run ./configure, and select option 25: “Linux x86_64 i486 i586 i686, Xeon (SNB with AVX mods) ifort compiler with icc  (dm+sm)”. At the next prompt for nesting options, press Enter to accept the default, which is 1.

5. Edit configure.wrf and remove -DINTEL_ALIGN64 from the ARCH_LOCAL flags. (This will be fixed in the next release of WRFV3.5.)

6. Run ./compile wrf >& build.snb.avx to build wrf.exe in the ../WRFV3/main directory. (Note: to speed up compilation, set the environment variable J to “-j 4”, or to however many parallel make tasks you want to use.)

7. For a new, clean build, run ./clean -a and repeat the process.
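Step 5 can likewise be scripted. This is a minimal sketch assuming GNU sed and an ARCH_LOCAL definition that sits on a single line of configure.wrf; drop_intel_align64 is a hypothetical helper name, not part of WRF.

```shell
# drop_intel_align64 (hypothetical): remove -DINTEL_ALIGN64 from the ARCH_LOCAL
# line of configure.wrf and print the result. Assumes GNU sed -i and that
# ARCH_LOCAL is defined on a single line (no backslash continuation).
drop_intel_align64() {
    local f="${1:-configure.wrf}"
    # Delete the token together with any whitespace in front of it.
    sed -i '/^ARCH_LOCAL/s/[[:space:]]*-DINTEL_ALIGN64//g' "$f"
    grep '^ARCH_LOCAL' "$f"
}
```

Usage, after ./configure and before ./compile: drop_intel_align64 configure.wrf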

Run WRF

1. Download the CONUS12_rundir from http://www.mmm.ucar.edu/WG2bench/conus12km_data_v3/ and place it in ../WRFV3.

2. Copy the binary ../WRFV3/main/wrf.exe to CONUS12_rundir/wrf.exe.

3. cd into CONUS12_rundir and run WRF using the runtime settings in the script below.

Here is an example script to run it on an Intel® Xeon® host:

bash-4.1$ cat run.sh
source /opt/intel/composer_xe_2013.2.146/bin/compilervars.sh intel64
source /opt/intel/impi/4.1.0.030/intel64/bin/mpivars.sh
ulimit -s unlimited
# export is required so that mpiexec.hydra and wrf.exe inherit these settings
export I_MPI_PIN_MODE=mpd
export OMP_NUM_THREADS=8
export KMP_STACKSIZE=64m
export KMP_AFFINITY=scatter,granularity=thread
export KMP_BLOCKTIME=infinite
export KMP_LIBRARY=turnaround
export WRF_NUM_TILES=32
mpiexec.hydra -np 2 ./wrf.exe

4. The run is complete when it prints ‘wrf: SUCCESS COMPLETE WRF’ on the screen. WRF writes an rsl.error.NNNN and an rsl.out.NNNN file for each MPI rank in your CONUS12_rundir directory; the timing script below reads rsl.out.0000, the log for rank 0.

5. After the run, compute the total time taken to simulate 149 time steps with the scripts below. The sum and mean values are the figures of interest for WRF (lower is better).

The following script should help parse the output:

bash-4.1$ cat gettiming.sh 
grep 'Timing for main' rsl.out.0000 | sed '1d' | head -149 | awk '{print $9}' | awk -f stats.awk
bash-4.1$ cat stats.awk 
BEGIN{ a = 0.0 ; i = 0 ; max = -999999999  ; min = 9999999999 }
{
    i ++ 
    a += $1
    if ( $1 > max ) max = $1
    if ( $1 < min ) min = $1
}
END{ printf("---\n%10s  %8d\n%10s  %15f\n%10s  %15f\n%10s  %15f\n%10s  %15f\n%10s  %15f\n","items:",i,"max:",max,"min:",min,"sum:",a,"mean:",a/(i*1.0),"mean/max:",(a/(i*1.0))/max) }
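The runtime settings in run.sh only reach mpiexec.hydra and wrf.exe if they are exported: an assignment such as OMP_NUM_THREADS=8 on a line by itself sets a shell variable that child processes never see. The bash demonstration below uses a hypothetical variable, DEMO_THREADS, to show the difference.

```shell
# Demonstration only; DEMO_THREADS is a hypothetical variable name.
demo_export() {
    unset DEMO_THREADS
    DEMO_THREADS=8                       # plain assignment: shell variable only
    bash -c 'echo "without export: ${DEMO_THREADS:-unset}"'
    export DEMO_THREADS                  # now inherited by child processes
    bash -c 'echo "with export: ${DEMO_THREADS:-unset}"'
}

demo_export
# without export: unset
# with export: 8
```

Writing each setting as export NAME=value, as the coprocessor script wrf.sh does, ensures that the OpenMP, Intel MPI, and WRF settings reach the MPI ranks.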

Validation of the run

To check whether the WRF run is valid, compare your output file with a reference output using diffwrf:

diffwrf your_output wrfout_reference > diffout_tag

The ‘DIGITS’ column should contain high values (>3). If it does, the WRF run is considered valid.

Compiler Options:

-mmic : build an application that runs natively on the Intel® Xeon Phi™ coprocessor.

-openmp : enable the compiler to generate multithreaded code based on OpenMP* directives (same as -fopenmp).

-O3 : enable aggressive compiler optimizations.

-opt-streaming-stores always : generate streaming stores.

-fimf-precision=low : use low-precision math function implementations for higher performance.

-fimf-domain-exclusion=15 : tell the compiler that math functions need not handle extreme values, NaNs, infinities, or denormals (bit mask 1+2+4+8), so it can use the lowest-precision, fastest sequences for single- and double-precision math.

-opt-streaming-cache-evict=0 : turn off all cache-line evict instructions.
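The value 15 in -fimf-domain-exclusion=15 is a bit mask. The function below decodes such a mask into the classes it excludes. decode_domain_exclusion is a hypothetical helper, and the bit values it uses (1 extremes, 2 NaNs, 4 infinities, 8 denormals, 16 zeros) are our reading of the Intel compiler documentation for this option, so confirm them against the documentation for your compiler version.

```shell
# decode_domain_exclusion (hypothetical): list the value classes excluded by a
# -fimf-domain-exclusion bit mask. Assumed bit values:
# 1 extremes, 2 nans, 4 infinities, 8 denormals, 16 zeros.
decode_domain_exclusion() {
    local mask=$1 out="" bit
    local -a names=(extremes nans infinities denormals zeros)
    for bit in 0 1 2 3 4; do
        if (( mask & (1 << bit) )); then
            out="$out${out:+,}${names[bit]}"   # comma-separate after the first
        fi
    done
    echo "${out:-none}"
}
```

For example, decode_domain_exclusion 15 prints extremes,nans,infinities,denormals.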

V. Additional Resources

The steps below describe how to compile netCDF for the Intel® Xeon Phi™ coprocessor. We have not tested these instructions; if you find that something is wrong or missing, please let us know in a comment on this article or in the community forum.

1. Download netCDF from http://www.unidata.ucar.edu/downloads/netcdf/netcdf-3_6_2/index.jsp

2. Create a directory called NETCDF: ‘mkdir NETCDF’

3. cd into the NETCDF directory and create two new directories, ‘xeonphi’ and ‘intel64’.

4. cd into ‘intel64’ and untar netcdf-3.6.2 there. This creates a directory called ‘netcdf-3.6.2’. Untar a second copy into ‘xeonphi’ as well, because the coprocessor build below is run from ../NETCDF/xeonphi/netcdf-3.6.2.

5. cd into ../NETCDF/intel64/netcdf-3.6.2

  • Source the Intel compiler setup script, e.g. source /opt/intel/composer_xe_2013.1.117/bin/compilervars.sh intel64
  • Configure as follows: ./configure --prefix=/path/to/NETCDF/intel64 --disable-cxx CC=icc F77=ifort FC=ifort
  • Run: make
  • Run: make install
  • Open a new terminal and cd into ../NETCDF/xeonphi/netcdf-3.6.2.
  • Source the Intel compiler setup script, e.g. source /opt/intel/composer_xe_2013.1.117/bin/compilervars.sh intel64
  • Configure the coprocessor build as follows: NM=nm ./configure --prefix=/path/to/NETCDF/xeonphi/ F77=ifort CC=icc CXX=icpc --disable-cxx --host=x86_64-k1om-linux --build=x86_64-unknown-linux CFLAGS=-mmic FFLAGS=-mmic LDFLAGS=-mmic
  • Edit ../NETCDF/xeonphi/netcdf-3.6.2/fortran/nfconfig.inc so that it contains the following settings:

#define NF_INT1_IS_C_SIGNED_CHAR 1
#undef NF_INT1_IS_C_SHORT
#undef NF_INT1_IS_C_INT  
#undef NF_INT1_IS_C_LONG
#define NF_INT2_IS_C_SHORT 1 
#undef NF_INT2_IS_C_INT
#undef NF_INT2_IS_C_LONG
#define NF_INT_IS_C_INT 1
#undef NF_INT_IS_C_LONG
#define NF_REAL_IS_C_FLOAT 1
#undef NF_REAL_IS_C_DOUBLE
#define NF_DOUBLEPRECISION_IS_C_DOUBLE 1
#undef NF_DOUBLEPRECISION_IS_C_FLOAT
  • Run: make
  • Run: make install

6. netCDF is now built for the Intel® Xeon Phi™ coprocessor and ready to use.
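The nfconfig.inc edit in step 5 can also be applied with sed instead of by hand, after the coprocessor ./configure and before make. The sketch below assumes GNU sed and assumes that each macro already appears on its own #define or #undef line in the generated nfconfig.inc; set_nfconfig is a hypothetical helper name, not part of netCDF.

```shell
# set_nfconfig (hypothetical): force the nfconfig.inc type mappings listed above.
# Assumes GNU sed (-i, \| and \+) and that every macro is already on its own
# "#define NAME ..." or "#undef NAME" line in the generated file.
set_nfconfig() {
    local f="${1:-fortran/nfconfig.inc}" name
    # Macros that must be defined to 1
    for name in NF_INT1_IS_C_SIGNED_CHAR NF_INT2_IS_C_SHORT NF_INT_IS_C_INT \
                NF_REAL_IS_C_FLOAT NF_DOUBLEPRECISION_IS_C_DOUBLE; do
        sed -i "s/^#[[:space:]]*\(define\|undef\)[[:space:]]\+$name\b.*/#define $name 1/" "$f"
    done
    # Macros that must be undefined
    for name in NF_INT1_IS_C_SHORT NF_INT1_IS_C_INT NF_INT1_IS_C_LONG \
                NF_INT2_IS_C_INT NF_INT2_IS_C_LONG NF_INT_IS_C_LONG \
                NF_REAL_IS_C_DOUBLE NF_DOUBLEPRECISION_IS_C_FLOAT; do
        sed -i "s/^#[[:space:]]*\(define\|undef\)[[:space:]]\+$name\b.*/#undef $name/" "$f"
    done
}
```

Usage, from ../NETCDF/xeonphi/netcdf-3.6.2: set_nfconfig fortran/nfconfig.inc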

VI. Acknowledgements

The author would like to thank all who contributed to the WRF project to date.

VII. About the Author

Indraneil Gokhale is a Software Architect in the Intel Software and Services Group (Intel SSG).

