Tuning the performance of Intel MKL DFT functions on Intel® Xeon Phi™ coprocessors

Overview

Intel® Math Kernel Library (Intel® MKL) includes DFT functions optimized for Intel® Xeon Phi™ coprocessors. These functions are carefully vectorized and threaded to take advantage of the hardware features. This article provides performance tuning tips for running the MKL DFT functions on Intel Xeon Phi coprocessors, starting with a simple example.

Building the example code

The attached "fftsample_native.cpp" file is a simple example of using the Intel MKL DFT functions. The code computes a two-dimensional real DFT on single-precision floating-point data. It first creates a new DFT descriptor, changes the default settings of the descriptor, and then calls the MKL DFT computation function to perform the forward transform. The transform is called once to make sure the data is in the cache, and then the call is repeated several times to measure the average performance. The code reports the performance in GFLOPS (giga floating-point operations per second). For a two-dimensional real transform, the flop count is computed as 2.5*N*M*log2(M*N), where M and N are the transform sizes.
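The measurement described above can be sketched in a few lines of C++. This is only an illustration of the flop-count formula and the warm-up-then-average timing pattern, not the code from the attached sample; the helper names real_2d_dft_flops, gflops, and average_seconds are hypothetical, and the transform is passed in as an arbitrary callable.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>

// Flop count for an M x N two-dimensional real transform, using the
// formula quoted in the article: 2.5 * N * M * log2(M * N).
inline double real_2d_dft_flops(std::size_t m, std::size_t n) {
    const double points = static_cast<double>(m) * static_cast<double>(n);
    return 2.5 * points * std::log2(points);
}

// Convert the flop count and the average time of one transform into GFLOPS.
inline double gflops(std::size_t m, std::size_t n, double seconds_per_transform) {
    assert(seconds_per_transform > 0.0);
    return real_2d_dft_flops(m, n) / seconds_per_transform / 1e9;
}

// Hypothetical timing helper: one untimed warm-up call, followed by
// `reps` timed calls. Returns the average time of one call in seconds.
template <typename Transform>
double average_seconds(Transform&& transform, int reps) {
    assert(reps > 0);
    transform();  // warm-up call, not included in the measurement
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        transform();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count() / reps;
}
```

For example, a 1024 x 1024 transform counts as 2.5 * 2^20 * 20 = 52,428,800 flops, so an average time of 1 ms per transform corresponds to about 52.4 GFLOPS.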

MKL functions can be called by applications that run natively on the coprocessor. This is a fast way to make an existing application run on the Intel MIC Architecture with minimal changes to the source code.

Build the sample code for native execution on the Intel® Xeon Phi™ coprocessor with Intel Composer XE:

>icc -mmic -o fftsample_native fftsample_native.cpp -openmp -mkl

Copy the application to the Intel® Xeon Phi™ coprocessor and run it natively:

>./fftsample_native
performance: 24.256216 GFLOPS

Tip 1: reusing the DFTI structures when possible

The MKL descriptor functions DftiCreateDescriptor/DftiCommitDescriptor allocate the necessary internal memory and perform the initialization needed for the FFT computation. This initialization may also involve exploring different factorizations of the input length and searching for the most efficient computation method. If the DFT configuration, including the transform type, input data size, and other parameters, does not change, the descriptor can be reused by subsequent DFT transforms, which avoids the overhead of initializing it again.

In the sample code, the DFT parameters do not change, so the DFT descriptor can be reused. Comment out the calls to the descriptor functions DftiCreateDescriptor/DftiSetValue/DftiCommitDescriptor/DftiFreeDescriptor inside the repeating loop (approximately lines 81 to 85, and line 89).

Recompile and run the application:

>./fftsample_native
performance: 61.380058 GFLOPS

Tip 2: aligning the memory for the input and output data

To improve data access performance, it is recommended that the memory addresses of the input and output data be aligned to 64 bytes. You can use the MKL function mkl_malloc() or another aligned memory allocator provided by the system to allocate such buffers.
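As a rough illustration of the "other aligned memory allocators" option, the sketch below uses the POSIX function posix_memalign() to obtain a 64-byte-aligned buffer and checks the resulting address. The helper names aligned_alloc_bytes and is_aligned are hypothetical; with MKL available, mkl_malloc(size, 64) paired with mkl_free() serves the same purpose.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign (POSIX), free

// Hypothetical helper: allocate `bytes` bytes whose address is a multiple
// of `alignment`. posix_memalign requires the alignment to be a power of
// two and a multiple of sizeof(void*). Returns nullptr on failure.
// Release the buffer with free().
inline void* aligned_alloc_bytes(std::size_t bytes, std::size_t alignment = 64) {
    assert(alignment % sizeof(void*) == 0);
    assert((alignment & (alignment - 1)) == 0);
    void* p = nullptr;
    if (posix_memalign(&p, alignment, bytes) != 0) {
        return nullptr;
    }
    return p;
}

// Check whether an address is aligned to `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment = 64) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A buffer obtained this way can be passed to the DFT functions like any other pointer; the only difference from plain malloc() is the guaranteed 64-byte alignment of the start address.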

In the sample code, undefine "SYS_MALLOC" and define "ALIGN_MALLOC" (around line 27), so that mkl_malloc() is used to allocate the memory with 64-byte alignment.

Recompile and run the sample code:

>./fftsample_native
performance: 62.137836 GFLOPS

Tip 3: the leading dimension size for multi-dimensional DFT

For two-dimensional or higher-dimensional DFTs, the leading dimension size in bytes should be divisible by 64 (the size of a cache line), but not divisible by 128. That is, for single-precision complex data, use leading dimensions (strides) divisible by 8 but not by 16, and for double-precision complex data, use leading dimensions divisible by 4 but not by 8.
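The padding rule can be expressed as a small helper that rounds a row length up to the next size that satisfies it. This is a minimal sketch; the function name padded_leading_dim is hypothetical, and it is not taken from the attached sample, which controls padding through its own "PADDING_LEN" macro.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: return the smallest leading dimension (in elements)
// that is >= n and whose size in bytes is a multiple of 64 but not of 128.
// `elem_bytes` is the size of one element, for example 8 for
// single-precision complex and 16 for double-precision complex data.
inline std::size_t padded_leading_dim(std::size_t n, std::size_t elem_bytes) {
    assert(elem_bytes > 0 && 64 % elem_bytes == 0);
    const std::size_t line = 64;  // cache line size in bytes
    // Round the row size up to a whole number of cache lines.
    std::size_t bytes = (n * elem_bytes + line - 1) / line * line;
    // If the result is a multiple of 128 bytes, add one more cache line.
    if (bytes % (2 * line) == 0) {
        bytes += line;
    }
    return bytes / elem_bytes;
}
```

For a row of 1024 single-precision complex elements (8192 bytes, a multiple of 128), the helper returns 1032, which is divisible by 8 but not by 16. A row of 1000 such elements (8000 bytes) already satisfies the rule and is returned unchanged.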

Undefine "NO_PADDING", and define "PADDING_LEN". Recompile and run the sample code:

>./fftsample_native
performance: 103.776588 GFLOPS

Tip 4: thread settings for the FFT functions

On multi-core coprocessors, the best performance is achieved when threads do not migrate from core to core. To ensure this, set an affinity mask that binds the threads to the coprocessor cores. For example, you can use the KMP_AFFINITY environment variable to control this behavior.

In addition, the FFT functions achieve high performance when the number of threads is a power of 2. It is recommended to use 128 threads when the size of the input and output data is less than 32 MB (approximately the last-level cache size of the coprocessor). For data larger than 32 MB, it is recommended to set the number of threads to the total number of threads available on the system.
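The thread-count recommendation can be written down as a simple selection rule. The sketch below is only an illustration: the function name recommended_fft_threads is hypothetical, "32M" is interpreted as 32 * 1024 * 1024 bytes, and capping the result at the number of available threads is an added assumption for systems with fewer than 128 threads.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: pick a thread count following the recommendation
// above. `data_bytes` is the size of the input and output data, and
// `total_threads` is the total number of threads available on the system.
inline int recommended_fft_threads(std::size_t data_bytes, int total_threads) {
    assert(total_threads > 0);
    // Assumption: "32M" means 32 * 1024 * 1024 bytes.
    const std::size_t threshold = static_cast<std::size_t>(32) * 1024 * 1024;
    const int small_data_threads = 128;  // power of 2 recommended for small data
    if (data_bytes < threshold) {
        // Assumption: never request more threads than the system provides.
        return small_data_threads < total_threads ? small_data_threads : total_threads;
    }
    return total_threads;
}
```

The returned value is what would be exported through OMP_NUM_THREADS or MKL_NUM_THREADS before running the application.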

Set the following environment variables and run the application:
>export KMP_AFFINITY=scatter,granularity=fine
>export OMP_NUM_THREADS=240 (or MKL_NUM_THREADS=240)
>./fftsample_native
performance: 103.816588 GFLOPS

Tip 5: using huge memory pages

In many cases, the DFT functions perform better when the input and output data buffers are allocated with huge memory pages (2 MB pages). Compared with memory allocated with the default page size (4 KB), memory backed by huge pages reduces TLB misses and page faults.

For native execution, you can allocate memory with 2 MB pages by using the mmap system call or the libhugetlbfs library. See the article on how to use huge memory pages to improve application performance for recipes for both approaches.
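A minimal sketch of the mmap approach on Linux is shown below. It is not the recipe from the referenced article: the names PageBuffer, alloc_pages, and free_pages are hypothetical, and falling back to default-sized pages when no huge pages are available is a design choice made here so that the helper still returns usable memory.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Hypothetical descriptor for a page-backed buffer.
struct PageBuffer {
    void* ptr;          // start of the mapping, or nullptr on failure
    std::size_t bytes;  // mapped size, rounded up to a multiple of 2 MB
    bool huge;          // true if the mapping was created with MAP_HUGETLB
};

// Try to map `bytes` bytes of anonymous memory backed by huge pages.
// If that fails, fall back to a mapping with the default page size.
inline PageBuffer alloc_pages(std::size_t bytes) {
    assert(bytes > 0);
    const std::size_t huge_page = static_cast<std::size_t>(2) * 1024 * 1024;
    const std::size_t rounded = (bytes + huge_page - 1) / huge_page * huge_page;
#ifdef MAP_HUGETLB
    void* p = mmap(nullptr, rounded, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) {
        return {p, rounded, true};
    }
#endif
    void* q = mmap(nullptr, rounded, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (q == MAP_FAILED) {
        return {nullptr, 0, false};
    }
    return {q, rounded, false};
}

// Unmap the buffer and reset the descriptor.
inline void free_pages(PageBuffer& b) {
    if (b.ptr != nullptr) {
        munmap(b.ptr, b.bytes);
    }
    b = {nullptr, 0, false};
}
```

The `huge` flag lets the caller report whether huge pages were actually obtained, which is useful when comparing performance numbers between runs.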

Note: MPSS 2.1.4982-15 upgraded the kernel to version 2.6.38.8, which includes transparent huge page support. The OS tries to allocate huge pages for applications transparently, so the FFT function performance is only slightly affected even if the application allocates memory with the default page size. Allocating memory with huge pages manually is only needed when transparent huge page support is disabled.

Tip 6: increasing the performance with offload execution

In addition to using Intel MKL functions in a native application, you can use compiler-assisted offload to offload computations to Intel Xeon Phi coprocessors and call Intel MKL functions there. The Intel compiler and its offload pragmas manage the functions and data offloaded to a coprocessor.

For DFT transforms, data transfer can be critical to offload FFT performance. The Intel compiler provides several options that control data transfer. If a memory buffer is specified as "nocopy", its value is reused from the last target execution and the data copy is avoided. Also, if the input or output data size does not change, you can use "alloc_if" and "free_if" to control whether fresh memory is allocated on the coprocessor or the existing allocation is reused. Refer to the Intel® C++ Compiler XE 13.0 User and Reference Guides for more information.

See the offload sample code "fftsample_offload.cpp" for an offloaded FFT transform. Here the DFTI structures are allocated on the coprocessor first, and then reused by different DFT transforms. The memory for the input and output buffers is allocated only in the first loop iteration and freed in the last one, so the memory on the host CPU and the coprocessor is reused.

Compile and run the sample code:
>icc -o fftsample_offload fftsample_offload.cpp -mkl
>export MIC_ENV_PREFIX=MIC
>export MIC_KMP_AFFINITY=scatter,granularity=fine
>export MIC_OMP_NUM_THREADS=240
>export MIC_USE_2MB_BUFFERS=64K
>./fftsample_offload

Summary

Intel® Math Kernel Library (Intel® MKL) provides highly optimized DFT functions for Intel® Xeon Phi™ coprocessors. For more information on MKL performance tips, see the following articles:

Performance Tips of Using Intel® MKL on Intel® Xeon Phi™ Coprocessor
Intel® Math Kernel Library for Linux* OS User's Guide
Improving Performance on Intel Xeon Phi Coprocessors

Appendix
Test environment: The code was tested on an Intel(R) Core(TM) i7 CPU 970 @ 3.20GHz with an Intel(R) Xeon Phi(TM) Coprocessor 5110P and MPSS 2.1.4982-15, compiled with Intel Compiler XE Version 13.0.1.117 and Intel MKL 11.0 update 1.