Running R with Support for Intel® Xeon Phi™ Coprocessors

Introduction

R is a free, open-source software environment for statistical computing and analysis. It builds and runs on a wide variety of UNIX* and Linux* platforms, Windows*, and MacOS*. The project web site is http://www.r-project.org/. Users can either download the source code from the web site and build their own executables, or download prebuilt executables for each supported operating system. Most users download the prebuilt executables. This recipe reviews how to build R using Intel compilers and math libraries, and describes how to use the resulting build to run on Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors together.

Obtaining R and Intel Software Tools

To download R

Go to http://www.r-project.org/ and click on the "CRAN" (Comprehensive R Archive Network) link on the left-hand side of the page. That takes you to a list of CRAN mirrors.
Select a mirror and click on its link.
Click on the "R Sources" link on the left.
The latest source code release should be prominently featured near the top of the page. Click on the link and follow your browser's directions to download the software.
The package is usually archived and compressed. Use the appropriate decompression tools for your OS to unpack the source code.
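On Linux, the unpacking step above might look like the following minimal sketch. The helper name unpack_r_source is hypothetical, and the sketch assumes the download is a gzip-compressed tar archive whose first entry is the top-level source directory.

```shell
# Unpack an R source tarball and print the name of its top-level directory.
# unpack_r_source is a hypothetical helper name used only for illustration.
unpack_r_source() {
    tarball=$1
    # Assumption: the first archive entry names the top-level source directory.
    topdir=$(tar -tzf "$tarball" | head -n 1 | cut -d/ -f1)
    # Extract into the current directory, then report where the source landed.
    tar -xzf "$tarball" && printf '%s\n' "$topdir"
}
```

Call it with the name of the file you downloaded, then change into the directory it prints before configuring the build.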

To download the Intel software tools

Go to https://software.intel.com/en-us/intel-parallel-studio-xe-evaluation-options
Decide if you want to evaluate or buy the tools. If you want to buy, click on the "Buy" button for your OS. If you want to evaluate, click on the "Download <OS> version" link on the left.
Follow the instructions on the ensuing web pages to complete the evaluation or purchase process.
You will receive an email containing a license file, serial number, and installation instructions. Follow the instructions to download and install the tools.

Building with Intel Software Tools

Assuming the Intel tools are installed in /opt/intel/composerxe, the build process on Linux, run from the top-level directory of the unpacked R source, is as follows:

$ source /opt/intel/composerxe/bin/compilervars.sh intel64
$ ./configure --with-blas="-L/opt/intel/composerxe/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm" --with-lapack CC=icc CFLAGS=-O2 CXX=icpc CXXFLAGS=-O2 F77=ifort FFLAGS=-O2 FC=ifort FCFLAGS=-O2
$ make
$ make check

The Intel software tools team provides more in-depth articles on building R on the Intel Developer Zone website.

Building R with Intel® Math Kernel Library support gives you access to optimized matrix multiplication routines at the heart of many R data analysis computations.

Once you have built your own R executable, you run it the same way you would run a default build or a downloaded executable.

Baseline Performance on Intel® Xeon® Processors

The prebuilt executables for the Linux variants are built with the GNU* tools. Unfortunately, these builds run matrix operations on a single thread, even on multicore systems where those operations could run in parallel. The table below compares R built with the Intel 14.0.1 compilers and Intel® Math Kernel Library (Intel® MKL) on Red Hat* 6.3 against a "default" build (i.e., no configure options) using gcc 4.4.6. The build with Intel® MKL runs matrix operations on multiple cores, so it is much faster on those operations. The benchmark script used, R-benchmark-25, is available at

http://r.research.att.com/benchmarks/R-benchmark-25.R

The matrix sizes were increased to reflect a larger workload size. The results show R built with Intel® MKL is 13x faster than the gcc build.

Test	Time for gcc build (sec)	Time for icc/MKL build (sec)
Creation, transp., deformation of a 5000x5000 matrix	3.25	2.95
5000x5000 normal distributed random matrix ^1000	5.13	1.52
Sorting of 14,000,000 random values	1.61	1.64
5600x5600 cross-product matrix (b = a' * a)	97.44	0.56
Linear regr. over a 4000x4000 matrix (c = a \ b')	46.06	0.49
FFT over 4,800,000 random values	0.65	0.61
Eigenvalues of a 1200x1200 random matrix	5.55	1.37
Determinant of a 5000x5000 random matrix	34.18	0.55
Cholesky decomposition of a 6000x6000 matrix	37.07	0.47
Inverse of a 3200x3200 random matrix	29.49	0.57
3,500,000 Fibonacci numbers calculation (vector calc)	1.31	0.38
Creation of a 6000x6000 Hilbert matrix (matrix calc)	0.77	0.99
Grand common divisors of 400,000 pairs (recursion)	0.63	0.56
Creation of a 1000x1000 Toeplitz matrix (loops)	2.24	2.34
Escoufier's method on a 90x90 matrix (mixed)	9.55	6.02
Total	274.93	21.01
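
The 13x figure can be checked directly from the table by dividing the gcc time by the icc/MKL time. The sketch below does this for a few rows copied from the table; the helper name speedups and the shortened row labels are illustrative, not part of the benchmark.

```shell
# Compute speedup (gcc time / icc+MKL time) for rows of the benchmark table.
# Input columns, tab-separated: label, gcc time, icc/MKL time.
# speedups is a hypothetical helper name; the numbers come from the table.
speedups() {
    awk -F'\t' '{ printf "%s\t%.1fx\n", $1, $2 / $3 }'
}

# printf reuses its format for every group of three arguments,
# so each label and its two timings become one tab-separated row.
printf '%s\t%s\t%s\n' \
    "Cross-product 5600x5600" 97.44 0.56 \
    "Cholesky 6000x6000" 37.07 0.47 \
    "Sorting 14,000,000 values" 1.61 1.64 \
    "Total" 274.93 21.01 | speedups
```

The Total row works out to about 13.1x. The per-row ratios show that the gain comes almost entirely from the dense linear algebra tests, while tests such as sorting are essentially unchanged.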

The following hardware and software were used for the above recipe and performance testing.

2-socket/24 cores:
Processor: Intel® Xeon® processor E5-2697 V2 @ 2.70GHz (12 cores) with Intel® Hyper-Threading
Operating System: Red Hat Enterprise Linux* 2.6.32-358.6.2.el6.x86_64.crt1 #4 SMP Fri May 17 15:33:33 MDT 2013 x86_64 x86_64 x86_64 GNU/Linux
Memory: 64GB
Coprocessor: 2X Intel® Xeon Phi™ Coprocessor 7120P: 61 cores @ 1.238 GHz, 4-way Intel Hyper-Threading, Memory: 15872 MB
Intel® Many-core Platform Software Stack Version 2.1.6720-16
Intel® C++ Compiler Version 13.1.3 20130607 (2013.5.192)

There are independent but similar comparisons elsewhere on the Internet. An article by Revolution Analytics at http://www.revolutionanalytics.com/revolution-revor-enterprise-benchmark-details describes a test using their own Intel-built R product on Windows. A blog post at http://www.r-bloggers.com/speeding-up-r-with-intels-math-kernel-library-mkl/ compares an R executable built with generic BLAS against one built with MKL, both running on Ubuntu.

R Support for Intel® Xeon Phi™ Coprocessors

An additional advantage of building with the Intel software tools is that, if you have an Intel® Xeon Phi™ coprocessor in your system, Intel® MKL can automatically offload certain parallel matrix operations to it. If you followed the build directions above for the Intel software tools, R is ready to use with an Intel® Xeon Phi™ coprocessor without a rebuild. You enable offload of matrix operations by setting the following environment variable before starting R:

$ export MKL_MIC_ENABLE=1

When R starts a matrix operation, Intel® MKL will partition the work between the host processors and the coprocessors in the platform. You can configure the ratio of work with the MKL_HOST_WORKDIVISION and MKL_MIC_0_WORKDIVISION environment variables. For example,

$ export MKL_HOST_WORKDIVISION=0.1
$ export MKL_MIC_0_WORKDIVISION=0.9

This tells Intel® MKL to send 90% of the work to the Intel® Xeon Phi™ coprocessor and keep 10% of the work on the host processors. With two Intel® Xeon Phi™ coprocessors, the settings

$ export MKL_HOST_WORKDIVISION=0.2
$ export MKL_MIC_0_WORKDIVISION=0.4
$ export MKL_MIC_1_WORKDIVISION=0.4

tell Intel® MKL to send 80% of the work to the coprocessors (split evenly between the two cards) and keep 20% of the work on the host processors. You can experiment with different ratios to find the best work distribution for your workload.
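
When experimenting with ratios, it can help to derive the per-card share from the host share instead of typing each value by hand. The following is a minimal sketch of that idea: set_workdivision is a hypothetical helper, not an Intel tool, and it assumes the cards are numbered from 0 and should receive equal shares. Only the environment variables named above are used.

```shell
# Export work-division variables for a given host share and card count,
# splitting the remaining work evenly across the coprocessors.
# set_workdivision is a hypothetical helper name, not part of Intel MKL.
set_workdivision() {
    host=$1      # fraction of work kept on the host, e.g. 0.2
    cards=$2     # number of coprocessors, e.g. 2
    # Shell arithmetic is integer-only, so awk computes the per-card share.
    per_card=$(awk -v h="$host" -v n="$cards" 'BEGIN { printf "%g", (1 - h) / n }')
    export MKL_MIC_ENABLE=1
    export MKL_HOST_WORKDIVISION="$host"
    i=0
    while [ "$i" -lt "$cards" ]; do
        # Assumption: cards are numbered 0, 1, ... as in the examples above.
        export "MKL_MIC_${i}_WORKDIVISION=$per_card"
        i=$((i + 1))
    done
}

# Reproduce the two-card example: 20% on the host, 40% on each card.
set_workdivision 0.2 2
env | grep '^MKL_' | sort
```

Run this in the same shell session that will launch R, so that R inherits the exported variables.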

The performance using Intel® MKL automatic offload depends heavily on the size of the workload you are analyzing with R. Intel® MKL contains a set of heuristics that determine when a workload is large enough to benefit from automatic offload. This is an area of ongoing exploration, and you are invited to give your workloads a try and share what you are able to accomplish.
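
One way to explore this size dependence is to time the same matrix product at several sizes, once on the host only and once with offload enabled. The sketch below assumes Rscript from your MKL-linked build is on the PATH; the helper names r_matmul_expr and sweep_matmul are hypothetical, and a square matrix product is only a stand-in for your own workload.

```shell
# Emit an R expression that times an n x n matrix product.
# r_matmul_expr and sweep_matmul are hypothetical helper names.
r_matmul_expr() {
    # %% in the printf format yields a literal %, giving R's %*% operator.
    printf 'n <- %d; a <- matrix(runif(n*n), n, n); print(system.time(a %%*%% a))\n' "$1"
}

# Time the product at each given size, host only and then with offload.
sweep_matmul() {
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript not found on PATH" >&2
        return 1
    fi
    for n in "$@"; do
        echo "n=$n, host only"
        # Subshell keeps the unset from affecting the calling shell.
        ( unset MKL_MIC_ENABLE; Rscript -e "$(r_matmul_expr "$n")" )
        echo "n=$n, automatic offload enabled"
        MKL_MIC_ENABLE=1 Rscript -e "$(r_matmul_expr "$n")"
    done
}
```

For example, the following compares three arbitrary sizes:

$ sweep_matmul 1000 4000 8000

Comparing the elapsed times at each size gives a rough idea of where offload starts to pay off on your system.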

Summary

R built with the Intel software tools can show a significant performance improvement over the prebuilt executables or executables built by the user with the GNU* tools. This paper described how to download R and the Intel software tools, how to build R with the Intel software tools, and how to run R across host processors and Intel® Xeon Phi™ coprocessors with Intel® MKL automatic offload support. We invite you to try this yourself and share your experiences with us.