
Webinar: Get Ready for Intel® Math Kernel Library on Intel® Xeon Phi™ Coprocessors

Intel recently unveiled the new Intel® Xeon Phi™ product – a coprocessor based on the Intel® Many Integrated Core architecture. Intel® Math Kernel Library (Intel® MKL) 11.0 introduces high-performance and comprehensive math functionality support for the Intel® Xeon Phi™ coprocessor. You can download the audio recording of the webinar and the presentation slides from the links below.

  • Webinar audio recording (WMV)
  • Webinar presentation slides (PDF)

More information can be found on our "Intel® Math Kernel Library on Intel® Xeon Phi™ Coprocessors" central page. If you have questions, please ask them either on the public Intel MKL forum or through private, secure Intel® Premier Support.

Questions and Answers from the webinar

  • Is anyone using the Intel Xeon Phi product? What kinds of applications do they run on it?
    Many users have already benefited from it. For example, seven supercomputers on the most recent Top 500 list use Intel Xeon Phi coprocessors in combination with Intel Xeon processors. Many HPC applications, for example in drug discovery, weather prediction, global financial analysis, oil exploration, and movie special effects, can make good use of the compute power the Intel Xeon Phi coprocessor provides.

  • Is Intel® Cluster Studio XE 2013 or Intel® Parallel Studio XE 2013 required in order to use Intel Xeon Phi coprocessors?
    Intel Cluster Studio XE 2013 and Intel Parallel Studio XE 2013 are bundle products that contain the tools needed to program the coprocessor. For example, the Intel compilers (FORTRAN or C/C++) are required to build code for native execution on the coprocessor, and the pragmas and directives used to offload computations to the coprocessor are supported only by the Intel compilers (see the sketch below). Intel MKL provides highly optimized math functions for the coprocessor. Intel MPI (a component of Intel Cluster Studio XE) enables building code that scales to multiple coprocessors and hosts. These bundles also provide tools for threading assistance, performance and thread profiling, memory and threading error detection, and so on.
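
    For illustration, here is a minimal compiler-assisted offload sketch in C. It assumes the Intel C/C++ compiler's offload pragma syntax and an MKL link line (for example, icc -mkl); the matrix size and variable names are purely illustrative.

        /* Offload one DGEMM call to coprocessor 0 with the Intel compiler's
           offload pragma. The in/inout clauses control the data transfer. */
        #include <stdio.h>
        #include <mkl.h>

        #define N 1024

        int main(void)
        {
            double *A = (double *)mkl_malloc(N * N * sizeof(double), 64);
            double *B = (double *)mkl_malloc(N * N * sizeof(double), 64);
            double *C = (double *)mkl_malloc(N * N * sizeof(double), 64);
            for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

            /* Send A and B to the coprocessor, run DGEMM there, copy C back. */
            #pragma offload target(mic:0) in(A, B : length(N * N)) inout(C : length(N * N))
            {
                cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                            N, N, N, 1.0, A, N, B, N, 0.0, C, N);
            }

            printf("C[0] = %f\n", C[0]);
            mkl_free(A); mkl_free(B); mkl_free(C);
            return 0;
        }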


  • What if a system has multiple coprocessors? Does Intel MKL try to balance the tasks across them?
    In the case of automatic offload, MKL will try to make use of multiple coprocessors for a computation. Users can also pick which coprocessors to use. In the case of compiler-assisted offload, it is up to the user to specify which coprocessors to use and to orchestrate the work division among them.
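
    As a sketch of these controls, the following C fragment enables automatic offload in code and divides the work between the host and two coprocessors. It assumes the automatic offload control functions documented for Intel MKL 11.0 (mkl_mic_enable, mkl_mic_set_workdivision); the work-division fractions are only illustrative.

        #include <stdio.h>
        #include <mkl.h>

        #define N 4096

        int main(void)
        {
            double *A = (double *)mkl_malloc(N * N * sizeof(double), 64);
            double *B = (double *)mkl_malloc(N * N * sizeof(double), 64);
            double *C = (double *)mkl_malloc(N * N * sizeof(double), 64);
            for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

            mkl_mic_enable();                                  /* same effect as MKL_MIC_ENABLE=1 */
            mkl_mic_set_workdivision(MKL_TARGET_HOST, 0, 0.4); /* keep 40% of the work on the host */
            mkl_mic_set_workdivision(MKL_TARGET_MIC,  0, 0.3); /* 30% to coprocessor 0 */
            mkl_mic_set_workdivision(MKL_TARGET_MIC,  1, 0.3); /* 30% to coprocessor 1 */

            /* A plain DGEMM call: no pragmas needed; the MKL runtime decides
               whether and where to offload, honoring the work division above. */
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        N, N, N, 1.0, A, N, B, N, 0.0, C, N);

            printf("C[0] = %f\n", C[0]);
            mkl_free(A); mkl_free(B); mkl_free(C);
            return 0;
        }

    Omitting the mkl_mic_set_workdivision calls lets the MKL runtime divide the work automatically.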

  • Do the performance charts published online include the cost of data transfer between the host and the coprocessor?
    The performance charts compare native execution performance on the coprocessor with execution performance on the host processor, so the data transfer cost is not reflected.

  • Do the performance charts published online compare dual-socket E5-2680 CPU performance against single-coprocessor performance?
    Yes. The host used to obtain the performance charts has two sockets of Intel Xeon E5-2680 CPUs with 8 cores per socket. The coprocessor is an Intel Xeon Phi SE10 with 61 cores. Each of the online performance charts lists its detailed configuration at the bottom.

  • What happens if multiple user processes or threads call Intel MKL functions with automatic offload?
    Currently, a process or thread doing automatic offload is not aware of other processes or threads that may also be offloading at the same time. In this scenario, all processes/threads will offload to a coprocessor, which risks thread oversubscription and running out of memory on the coprocessor. It is possible, however, with careful memory management and thread affinity settings, to have multiple offloading processes/threads use different groups of cores on the coprocessor at the same time.

  • Will more routines get automatic offload support in the future?
    Automatic offload works well when there is enough computation in a function to offset the data transfer overhead. Currently, only GEMM, TRSM, TRMM, and the LU, QR, and Cholesky factorizations are supported with this model. There may be other functions in Intel MKL that are good candidates for automatic offload; we are investigating all opportunities. Please contact us via our support channels if you see more need for automatic offload.

  • Can you show us in detail the configurations of running the LINPACK benchmark natively on the coprocessor?
    The Intel optimized SMP LINPACK benchmark is included in the Intel MKL 11.0 installation packages; you can find it in $MKLROOT/benchmarks/linpack. See the execution scripts in that location for the default configuration.

  • Is the memory for the arguments of an Intel MKL routine allocated on the coprocessor or on the host?
    Unless the input data already exists on the coprocessor or the output data is not needed on the host, the input arguments of an MKL routine are allocated on the host and then copied to the coprocessor; enough space must be allocated on the coprocessor to receive the data. Output arguments are copied back to the host. The offload pragmas offer a rich set of controls for data transfer and memory management on the coprocessor (see the sketch below). In the case of MKL automatic offload, however, the MKL runtime handles all of this transparently.
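
    The sketch below illustrates a few of those pragma controls: it allocates an array on the coprocessor once, reuses it in a second offload region, and frees it only at the end. The alloc_if/free_if/nocopy clauses are part of the Intel compiler's offload pragma syntax; the function, sizes, and names are illustrative, and the fragment is meant to be called from a complete program.

        #include <mkl.h>

        #define N 2048

        void two_offloads(double *A, double *B, double *C)
        {
            /* First offload: allocate A on the coprocessor (alloc_if(1))
               and keep it resident afterwards (free_if(0)). */
            #pragma offload target(mic:0) \
                    in(A : length(N * N) alloc_if(1) free_if(0)) \
                    in(B : length(N * N)) out(C : length(N * N))
            {
                cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                            N, N, N, 1.0, A, N, B, N, 0.0, C, N);
            }

            /* Second offload: A is already resident, so transfer nothing
               (nocopy, alloc_if(0)) and free it when this region ends (free_if(1)). */
            #pragma offload target(mic:0) \
                    nocopy(A : length(N * N) alloc_if(0) free_if(1)) \
                    in(B : length(N * N)) out(C : length(N * N))
            {
                cblas_dgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                            N, N, N, 1.0, A, N, B, N, 0.0, C, N);
            }
        }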

  • If memory population between the host and the coprocessor is transparent, there are now two copies of the data. What about data synchronization?
    In the case of Intel MKL automatic offload, data synchronization is handled transparently by the Intel MKL runtime. If a function call is offloaded using pragmas, the user needs to rely on the facilities provided by the pragmas to synchronize and merge data. The Intel Xeon Phi coprocessor also supports a shared-memory programming model called MYO (“mine”, “yours”, “ours”), in which data synchronization between host processors and coprocessors is handled implicitly.
    Refer to this article for more information.

  • If I have two automatic offload function calls with a non-automatic-offload function call in between, and these functions reuse data, will the data persist on the coprocessor so it can be reused?
    Data persistence on the coprocessor is currently not supported for function calls that use Intel MKL automatic offload. The input data is always copied from the host to the coprocessor at the beginning of an automatic offload execution, and the output data is always copied back at the end.

  • Can PARDISO and other sparse solvers make use of the coprocessor? How does the performance compare with, say, running on an 8-core Xeon processor?
    Yes. The Intel MKL sparse solvers, including PARDISO, can make use of the coprocessor. However, our optimization effort has so far focused on dense matrices (BLAS and LAPACK), and the sparse solvers are not yet optimized to the same extent. The performance of sparse solvers, on the processor or on the coprocessor, largely depends on the properties of the sparse system, so it is hard to make a performance comparison without a particular sparse system as context.

  • Is Intel® Integrated Performance Primitives (Intel® IPP) supported on the Intel Xeon Phi product?
    Support for Intel IPP is still to be determined. If you have a request for supporting Intel IPP on the Intel Xeon Phi coprocessor, please follow the regular Intel IPP support channel to submit a formal request.

  • There are a lot of pragmas to set. Are there any preprocessors to scan one's FORTRAN code for LAPACK calls and automatically insert all the appropriate pragmas?
    There is no such tool for automatically scanning your code and inserting pragmas. But if you use MKL automatic offload (when applicable), you can take advantage of computation offloading without using pragmas at all.

  • The offload pragmas from the Intel compilers are very different from OpenACC. Can users use either one for the Intel Xeon Phi coprocessor?
    There are no plans for the Intel compilers to support OpenACC.

  • What is the difference, if any, between using the Intel-specific compiler directives to offload to the coprocessor and using the newly proposed OpenMP coprocessor/accelerator directives? Am I correct that these new OpenMP directives will be added to the Intel compilers next year?
    The Intel compiler offload directives offer a much richer set of features than the proposed OpenMP offload directives; a side-by-side sketch follows. Intel Compiler 13.0 update 2 (both FORTRAN and C/C++) will add support for the OpenMP offload directives.
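
    As a rough side-by-side sketch, the two C functions below perform the same offloaded SAXPY, first with the Intel-specific offload pragma and then with the proposed OpenMP target/map form. The saxpy loop and its arguments are illustrative only.

        void saxpy_intel_offload(long n, float a, float *x, float *y)
        {
            /* Intel compiler offload pragma: in/inout clauses describe the transfers. */
            #pragma offload target(mic:0) in(x : length(n)) inout(y : length(n))
            {
                #pragma omp parallel for
                for (long i = 0; i < n; i++)
                    y[i] += a * x[i];
            }
        }

        void saxpy_omp_target(long n, float a, float *x, float *y)
        {
            /* Proposed OpenMP offload form: map clauses take the place of in/inout. */
            #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
            #pragma omp parallel for
            for (long i = 0; i < n; i++)
                y[i] += a * x[i];
        }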

  • Does GCC support Intel Xeon Phi?
    Please see this page for information on third-party tools available with support for the Intel Xeon Phi coprocessor.
    Our changes to the GCC tool chain, available as of June 2012, allow it to build the coprocessor’s Linux environment, including our drivers, for the Intel Xeon Phi coprocessor. The changes do not include support for the vector instructions or the related optimization improvements. GCC for the Intel Xeon Phi coprocessor is really only for building the kernel and related tools; it is not for building applications. Using GCC to build an application for the Intel Xeon Phi coprocessor will most often result in low-performance code due to its current inability to vectorize for the new Knights Corner vector instructions. Future changes to make full use of the Knights Corner vector instructions would require work on the GCC vectorizer to utilize those instructions’ masking capabilities.

  • Is debugging supported on the coprocessor?
    Yes. Debugging is supported. At this time, the Intel debugger, GDB, TotalView, and Allinea DDT are available with support for the Intel Xeon Phi coprocessor. See this page for more information.

  • Is the 8GB memory shared by all cores on the coprocessor? Are there memory hierarchies on the Intel Xeon Phi coprocessor?
    Yes. All cores on a coprocessor share the 8 GB of memory. The memory hierarchy consists of the shared 8 GB memory plus, for each core, a 32 KB L1 instruction cache, a 32 KB L1 data cache, and a 512 KB unified L2 cache. The caches are fully coherent and implement the x86 memory order model. See here for a description of the Intel Many Integrated Core architecture.

  • How lightweight are threads on the coprocessor? Is context switching expensive?
    Context switching is more expensive on Intel Xeon Phi coprocessors than on Intel Xeon processors, because the coprocessor has more vector registers to save and restore and because a coprocessor core is typically slower than a processor core.

  • What MPI implementations are supported?
    At present, Intel MPI and MPICH2 are the two implementations that support Intel Xeon Phi coprocessors.

  • Can I put an MPI rank on the host processor and another MPI rank on the coprocessor to have a 2-node MPI environment?
    Yes. This usage model has been supported since Intel MPI 4.1. Please refer to the Intel MPI product page for more information on Intel MPI support for Intel Xeon Phi coprocessors.
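
    A minimal C sketch of this usage model is below: each MPI rank reports where it is running, so a rank placed on the host and a rank placed on the coprocessor can be told apart. The launch details (building host and coprocessor binaries and listing both in the mpirun host set) are covered in the Intel MPI documentation and are not shown here.

        #include <stdio.h>
        #include <mpi.h>

        int main(int argc, char **argv)
        {
            int rank, size, len;
            char name[MPI_MAX_PROCESSOR_NAME];

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);
            MPI_Get_processor_name(name, &len);

            /* e.g. "rank 0 of 2 on node01" (host) and "rank 1 of 2 on node01-mic0" (coprocessor) */
            printf("rank %d of %d on %s\n", rank, size, name);

            MPI_Finalize();
            return 0;
        }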

  • Can you explain the motherboard requirements for Intel Xeon Phi coprocessors, e.g. power, BIOS, PCI bandwidth?
    The system requirements for Intel Xeon Phi coprocessors are documented in the README file of the MIC Platform Software Stack (MPSS), which is part of every coprocessor installation. There are two recommended configurations:
    Configuration 1: Motherboard: W2600CR2 (formerly Crown Pass), BIOS: 44, Host CPU: Intel Xeon processor E5-2600 family, RAM: 32 GB ECC DDR3-1333 (8 x 4GB), HDD: 1 TB SATA2, Power supply: 2x 1200W PMBus non-redundant, Graphics card: distribution-compatible graphics card.
    Configuration 2: Motherboard: S5520SC (formerly Shady Cove), BIOS: X2_59, Host CPU: Intel Xeon processor 5600 family, RAM: 24 GB 1333 MHz DDR3 (6 x 4GB), HDD: minimum 500 GB, Power supply: 1200W, Graphics card: distribution-compatible graphics card.

  • What is the estimated price of the Intel Xeon Phi coprocessor?
    Please contact your OEM or your Intel field representative for estimated pricing of the Intel Xeon Phi coprocessor.

  • Where can I buy the Intel software tools that support the Intel Xeon Phi coprocessor?
    Please contact your local Intel® Software Development Products Resellers for more details.
