I. Overview
This document describes how to obtain, build, and run the Weather Research and Forecasting (WRF) model on an Intel® Xeon® processor-based server, in native mode on a single Intel® Xeon Phi™ coprocessor, and in symmetric mode using both. It also describes the WRF software configuration and affinity settings that extract the best performance from this hardware.
II. Introduction
The WRF model is a numerical weather prediction system designed to serve both atmospheric research and operational forecasting needs. WRF is used by academic atmospheric scientists, forecast teams at operational centers, and application scientists. Please see http://www.wrf-model.org/index.php for more details about WRF. The source code and input files can be downloaded from the NCAR website. The latest version as of this writing is WRFV3.6. In this article, we use the conus12km benchmark.
III. Compiling, Running, and Validating WRF to run natively on an Intel® Xeon Phi™ coprocessor (Single Card)
You can obtain Intel® Composer XE, which includes the Intel® C/C++ and Fortran compilers, from https://registrationcenter.intel.com/regcenter/register.aspx, or register at https://software.intel.com/en-us/ to get a free 30-day evaluation copy.
Compile WRF
- Download and un-tar the WRFV3.6 source code from the NCAR repository http://www.mmm.ucar.edu/wrf/users/download/get_sources.html#V351.
- Source the environment for Intel® MPI Library and for the Intel Compiler:
- source /opt/intel/impi/4.1.1.036/mic/bin/mpivars.sh
- source /opt/intel/composer_xe_2013/bin/compilervars.sh intel64
- On bash, export the paths for netcdf and pnetcdf. Having netcdf and pnetcdf built for the Intel Xeon Phi coprocessor is a prerequisite.
- export NETCDF=/localdisk/igokhale/k1om/trunk/WRFV3.6/netcdf/mic/
- export PNETCDF=/localdisk/igokhale/k1om/trunk/WRFV3.6/pnetcdf/mic/
- Turn on Large file IO support by entering: export WRFIO_NCD_LARGE_FILE_SUPPORT=1
- cd into the ../WRFV3/ directory, run ./configure, and select the option to build with the coprocessor (option 17). At the next prompt, for nesting options, press return for the default, which is 1.
- In the configure.wrf file that is created, remove -DUSE_NETCDF4_FEATURES and replace -O3 with -O2.
- Replace !DEC$ vector always with !DEC$ SIMD on line 7578 in the dyn_em/module_advect_em.F source file.
- Run ./compile wrf >& build.mic. This will build a wrf.exe in the ../WRFV3/main folder.
- For a new, clean build, run ./clean -a and repeat the process.
Run WRF
- Download the CONUS12_rundir from http://www.mmm.ucar.edu/WG2bench/conus12km_data_v3/ and place it in ../WRFV3.
- Copy the binary from ../WRFV3/main/wrf.exe to the ../CONUS12_rundir/wrf.exe.
- Copy/link all files in WRFV3/run to the directory where you are executing wrf.exe (i.e., the CONUS12_rundir).
- Edit the namelist.input to add "use_baseparam_fr_nml = .t." under the &dynamics heading without the quotation marks (").
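After this edit, the &dynamics section of namelist.input should look roughly like the fragment below. Only the use_baseparam_fr_nml line is added; the other entries are illustrative placeholders for whatever your namelist already contains:

```
&dynamics
 use_baseparam_fr_nml = .t.
 diff_opt             = 1,
 km_opt               = 4,
/
```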
- cd into the CONUS12_rundir and execute WRF as follows on a coprocessor natively with the runtime parameters in the following script:
Script to run on coprocessor (native)
bash-4.1$ cat wrf.sh
source /opt/intel/impi/4.1.0.030/mic/bin/mpivars.sh
export LD_LIBRARY_PATH=/opt/intel/composer_xe_2013.2.146/compiler/lib/mic/
export KMP_STACKSIZE=62m
ulimit -s unlimited
export I_MPI_DEBUG=5
export WRF_NUM_TILES_X=3
export WRF_NUM_TILES_Y=60
export I_MPI_PIN_MODE=mpd
export KMP_PLACE_THREADS=60C,3T
export OMP_NUM_THREADS=180
export KMP_AFFINITY=balanced,granularity=thread
export KMP_LIBRARY=turnaround
export KMP_BLOCKTIME=infinite
mpiexec.hydra -np 1 ./wrf.exe
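The tile and thread settings in this script are linked: KMP_PLACE_THREADS=60C,3T yields 60 cores x 3 threads = 180 OpenMP threads, and the tile grid is sized so each thread gets exactly one tile. A quick sanity check of that arithmetic (the one-tile-per-thread pairing is the usual tuning heuristic, not something WRF enforces):

```shell
# 3 x 60 tiles should equal the 180 OpenMP threads (60 cores x 3 threads/core)
tiles_x=3; tiles_y=60
threads=$((tiles_x * tiles_y))
echo "tiles = $threads"   # prints: tiles = 180
```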
- The run is complete when 'wrf: SUCCESS COMPLETE WRF' is printed on the screen. You will find two files, rsl.error.0000 and rsl.out.0000, in your CONUS12_rundir directory.
- After the run, compute the total time taken to simulate the 149 timesteps with the script below. The sum and mean values are of interest for WRF (lower is better). The following parsing script may help:
bash-4.1$ cat gettiming.sh
grep 'Timing for main' rsl.out.0000 | sed '1d' | head -149 | awk '{print $9}' | awk -f stats.awk

bash-4.1$ cat stats.awk
BEGIN{ a = 0.0 ; i = 0 ; max = -999999999 ; min = 9999999999 }
{
  i++
  a += $1
  if ( $1 > max ) max = $1
  if ( $1 < min ) min = $1
}
END{ printf("---\n%10s %8d\n%10s %15f\n%10s %15f\n%10s %15f\n%10s %15f\n%10s %15f\n",
            "items:", i, "max:", max, "min:", min, "sum:", a,
            "mean:", a/(i*1.0), "mean/max:", (a/(i*1.0))/max) }
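The extraction logic can be sanity-checked without a full run by feeding it synthetic timing lines. The layout below mimics WRF's rsl output, where field 9 of each 'Timing for main' line is the per-timestep elapsed time; the file name and values are made up:

```shell
# Fake rsl output; field 9 is the elapsed seconds (layout assumed from WRF output)
cat > rsl_sample.txt <<'EOF'
Timing for main: time 2001-10-25_03:03:00 on domain 1: 5.00000 elapsed seconds.
Timing for main: time 2001-10-25_03:06:00 on domain 1: 3.00000 elapsed seconds.
Timing for main: time 2001-10-25_03:09:00 on domain 1: 4.00000 elapsed seconds.
EOF
# Sum and mean of the elapsed-seconds column, as in gettiming.sh/stats.awk
grep 'Timing for main' rsl_sample.txt | awk '{print $9}' |
  awk '{ s += $1; n++ } END { printf "sum=%.1f mean=%.1f\n", s, s/n }'
# prints: sum=12.0 mean=4.0
```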
Validation of your runs
To check whether the WRF run is correct or bogus, do the following:
- Run diffwrf your_output wrfout_reference > diffout_tag.
- The 'DIGITS' column should contain a high value (>3). If it does, the WRF run is considered valid.
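If you want to automate this check, the DIGITS column can be located by name from the header row rather than by a hard-coded position. The sample diffout_tag below is illustrative; diffwrf's exact column names and layout vary by WRF version, so treat this as a sketch:

```shell
# Illustrative diffout_tag; the header names are assumptions about diffwrf's
# layout, so the script finds the DIGITS column by name before checking it.
cat > diffout_sample.txt <<'EOF'
Field Ndifs Dims RMS_1 RMS_2 DIGITS RMSE Pntwise_max
U 100 3 1.234567 1.234568 6 0.00001 0.00002
V 100 3 2.345678 2.345680 5 0.00002 0.00003
EOF
awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "DIGITS") c = i; next }
     NR == 2 || $c < min { min = $c }
     END { print (min > 3 ? "VALID" : "SUSPECT") " (min DIGITS = " min ")" }' diffout_sample.txt
```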
IV. Compiling WRF to run on a 2-Socket Intel® Xeon® processor-based server
We used these instructions on a 2-Socket Intel® Xeon® E5-26xx processor-based server.
Compile WRF
- Download and un-tar the WRFV3.6 source code from the NCAR repository http://www.mmm.ucar.edu/wrf/users/download/get_sources.html#V351.
- Source the environment for Intel® MPI Library and for the Intel Compiler
- source /opt/intel/impi/4.1.1.036/intel64/bin/mpivars.sh
- source /opt/intel/composer_xe_2013/bin/compilervars.sh intel64
- Export the path for the host netcdf and pnetcdf. Having netcdf and pnetcdf built for the host is a prerequisite.
- export NETCDF=/localdisk/igokhale/IVB/trunk/WRFV3.6/netcdf/xeon/
- export PNETCDF=/localdisk/igokhale/IVB/trunk/WRFV3.6/pnetcdf/xeon/
- Turn on Large file IO support by typing: export WRFIO_NCD_LARGE_FILE_SUPPORT=1
- cd into the WRFV3 directory created in step 1, run ./configure and select option 21: "Linux x86_64 i486 i586 i686, Xeon (SNB with AVX mods) ifort compiler with icc (dm+sm)". On the next prompt for nesting options, hit return for the default, which is 1.
- In the configure.wrf file that is created, remove -DUSE_NETCDF4_FEATURES and replace -O3 with -O2.
- Replace !DEC$ vector always with !DEC$ SIMD on line 7578 in the dyn_em/module_advect_em.F source file.
- Run ./compile wrf >& build.snb.avx. This will build a wrf.exe in the ../WRFV3/main folder. (Note: to speed up compile times, set the environment variable J to "-j 4" or however many parallel make tasks you wish to use.)
- For a new, clean build, run ./clean -a and repeat the process.
Run WRF
- Download the CONUS12_rundir from http://www.mmm.ucar.edu/WG2bench/conus12km_data_v3/ and place it in ../WRFV3.
- Copy the binary from ../WRFV3/main/wrf.exe to the ../CONUS12_rundir/wrf.exe.
- Copy/link all files in WRFV3/run to the directory where you are executing wrf.exe (i.e., the CONUS12_rundir).
- Edit the namelist.input to add "use_baseparam_fr_nml = .t." under the &dynamics heading without the quotation marks (").
- cd into the CONUS12_rundir and execute WRF with the runtime parameters in the following script:
Here is an example script to run on an Intel Xeon processor-based host:
bash-4.1$ cat run.sh
source /opt/intel/composer_xe_2013.2.146/bin/compilervars.sh intel64
source /opt/intel/impi/4.1.0.030/intel64/bin/mpivars.sh
ulimit -s unlimited
export I_MPI_PIN_MODE=mpd
export OMP_NUM_THREADS=2
export KMP_STACKSIZE=64m
export KMP_AFFINITY=scatter,granularity=thread
export KMP_BLOCKTIME=infinite
export KMP_LIBRARY=turnaround
export WRF_NUM_TILES=48
mpiexec.hydra -np 12 ./wrf.exe
- The run is complete when 'wrf: SUCCESS COMPLETE WRF' is printed on the screen. You will find two files, rsl.error.0000 and rsl.out.0000, in your CONUS12_rundir directory.
- After the run, compute the total time taken to simulate the 149 timesteps with the script below. The sum and mean values are of interest for WRF (lower is better).
The following script should help parse the output:
bash-4.1$ cat gettiming.sh
grep 'Timing for main' rsl.out.0000 | sed '1d' | head -149 | awk '{print $9}' | awk -f stats.awk

bash-4.1$ cat stats.awk
BEGIN{ a = 0.0 ; i = 0 ; max = -999999999 ; min = 9999999999 }
{
  i++
  a += $1
  if ( $1 > max ) max = $1
  if ( $1 < min ) min = $1
}
END{ printf("---\n%10s %8d\n%10s %15f\n%10s %15f\n%10s %15f\n%10s %15f\n%10s %15f\n",
            "items:", i, "max:", max, "min:", min, "sum:", a,
            "mean:", a/(i*1.0), "mean/max:", (a/(i*1.0))/max) }
Validation of the run
To check whether the WRF run is successful or bogus, do the following:
- Run diffwrf your_output wrfout_reference > diffout_tag.
- The 'DIGITS' column should contain a high value (>3). If it does, the WRF run is considered valid.
Compiler Options:
- -mmic : build an application that runs natively on an Intel Xeon Phi coprocessor
- -openmp : enable the compiler to generate multi-threaded code based on OpenMP* directives (same as -fopenmp)
- -O3 : enable aggressive optimizations by the compiler
- -opt-streaming-stores always : generate streaming stores
- -fimf-precision=low : low precision for higher performance
- -fimf-domain-exclusion=15 : generate the lowest precision sequences for single precision and double precision
- -opt-streaming-cache-evict=0 : turn off all cache line evicts
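In configure.wrf, these options typically end up on the optimization lines. A plausible arrangement for the coprocessor build is sketched below; the variable names follow configure.wrf conventions, but the exact contents depend on the configure option you selected, so treat this as an assumption rather than a drop-in replacement:

```
# Possible optimization lines in configure.wrf for the coprocessor build (sketch)
CFLAGS_LOCAL = -w -O2 -mmic
FCOPTIM      = -O2 -mmic -openmp -fimf-precision=low \
               -fimf-domain-exclusion=15 -opt-streaming-stores always \
               -opt-streaming-cache-evict=0
```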
V. Additional Resources (NETCDF)
Here is how you can compile NETCDF for the Intel Xeon Phi coprocessor.
- Download NETCDF from http://www.unidata.ucar.edu/downloads/netcdf/netcdf-3_6_2/index.jsp.
- Create a directory called NETCDF: ‘mkdir NETCDF’.
- cd into NETCDF directory and untar netcdf-3.6.2.tar.gz (tar xvzf netcdf-3.6.2.tar.gz).
- cd into netcdf-3.6.2 (created after untarring netcdf-3.6.2.tar.gz).
- Source the Intel compiler, e.g., source /opt/intel/composer_xe_2013/bin/compilervars.csh intel64.
- Set the following environment variables:
setenv CPPFLAGS "-DpgiFortran"
setenv CXX "icpc"
setenv CC "icc"
setenv F77 "ifort"
- Run this command on the terminal: ./configure NM=nm --prefix=/path/to/NETCDF --disable-cxx --host=x86_64-k1om-linux --build=x86_64-unknown-linux.
- Run this command on the terminal: make CFLAGS=-mmic FFLAGS=-mmic LDFLAGS=-mmic.
- Then run: make install.
Now netcdf has been built for the Intel Xeon Phi coprocessor and is ready for use.
VI. Run WRF Conus12km in symmetric mode on a 2-Socket Intel® Xeon® Processor-based server with Intel® Xeon Phi™ Coprocessors
Script to run in Symmetric mode
In this example, the host node is node01.
When you request nodes, make sure you set a large stack size: MIC_ULIMIT_STACKSIZE=365536.
source /opt/intel/impi/4.1.0.036/mic/bin/mpivars.sh
source /opt/intel/composer_xe_2013_sp1.1.106/bin/compilervars.sh intel64
export I_MPI_DEVICE=rdssm
export I_MPI_MIC=1
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0
export I_MPI_PIN_MODE=pm
export I_MPI_PIN_DOMAIN=auto
./run.symmetric
Below is the run.symmetric script that runs the code in symmetric mode:
run.symmetric script
#!/bin/sh
mpiexec.hydra -host node01 -n 12 \
    -env WRF_NUM_TILES 20 -env KMP_AFFINITY scatter -env OMP_NUM_THREADS 2 \
    -env KMP_LIBRARY turnaround -env OMP_SCHEDULE static \
    -env KMP_STACKSIZE 190M -env I_MPI_DEBUG 5 \
    /path/to/CONUS2.5_rundir/x86/wrf.exe : \
  -host node01-mic1 -n 8 \
    -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 30 \
    -env KMP_LIBRARY turnaround -env OMP_SCHEDULE static \
    -env KMP_STACKSIZE 190M -env I_MPI_DEBUG 5 \
    /path/to/CONUS2.5_rundir/mic/wrf.sh

Below is the wrf.sh needed for the Intel Xeon Phi coprocessor part of the run. In ../CONUS2.5_rundir/mic, create a wrf.sh file as follows:
wrf.sh script
export LD_LIBRARY_PATH=/opt/intel/compiler/2013_sp1.1.106/composer_xe_2013_sp1.1.106/compiler/lib/mic:$LD_LIBRARY_PATH
/path/to/CONUS2.5_rundir/mic/wrf.exe
VII. Acknowledgements
The author would like to thank all who have contributed to the WRF project to date.
VIII. About the Author
Indraneil Gokhale is a Software Architect in the Intel Software and Services Group (Intel AZ SSG).