
Memory Management Optimizations on the Intel® Xeon Phi™ Coprocessor Using Abstract Vector Register Selection, _mm_malloc, mmap, and Prefetching



Abstract

This paper examines software performance optimization for an implementation of a non-library version of DGEMM executing in native mode on the Intel® Xeon Phi™ coprocessor running Linux* OS. The performance optimizations incorporate the C/C++ _mm_malloc intrinsic or the mmap function for dynamic storage allocation, together with high-level vector register management and data prefetching. Dynamic storage allocation is used to manage tiled data structure objects that accommodate the memory hierarchy of the Intel Xeon Phi coprocessor architecture. With respect to storage allocation, the focus for optimizing application performance is to:

  • Align the starting addresses of data objects during storage allocation so that vector operations on the Intel Xeon Phi coprocessor do not require additional alignment handling by the vector processing unit associated with each hardware thread.
  • Select tile sizes and manage abstract vector registers in a way that allows for cache reuse and data locality.
  • Use data prefetching so that the tiled data structures are brought into the Intel Xeon Phi coprocessor cache hierarchy in a timely manner.

These methodologies are intended to provide you with insight when applying code modernization to legacy software applications and when developing new software.

Table of Contents

  1. Introduction
    What strategies are used to improve application performance?
    How is this article organized?

  2. The Intel® Xeon Phi™ Coprocessor Architecture

  3. The Storage Allocation Intrinsic _mm_malloc

  4. The Storage Allocation Utility mmap

  5. Semantic Action Description for the Environment Variable MIC_USE_2MB_BUFFERS

  6. Prefetch Tuning

  7. Matrix Multiply Background and Programming Example

  8. Performance Results for the BLIS/DGEMM Implementation

  9. Conclusions

  10. References

 

1.  Introduction

If you are optimizing applications for Intel® Xeon Phi™ coprocessor architecture using Linux* OS in symmetric, offload, or native modes, and profiling analysis from software tools such as Intel® VTune™ Amplifier XE [1], and/or Intel® Trace Analyzer and Collector [2] indicates that there are memory management bottlenecks [3], this article might help you achieve better execution performance.

What strategies are used to improve application performance?

This article examines memory management, which involves tiling of data structures using the following strategies:

  • Aligned data storage allocation. This paper examines the use of the _mm_malloc intrinsic and mmap for dynamic storage allocation on Intel Xeon Phi coprocessor architecture.
  • Use of the MIC_USE_2MB_BUFFERS environment variable. This environment variable is used in conjunction with the mmap storage allocation utility.
  • Vector register management. An attempt will be made to manage the vector registers on the Intel Xeon Phi coprocessor by using explicit compiler semantics including C/C++ Extensions for Array Notation (CEAN) [4].
  • Prefetching controls. Prefetching control will be applied to the Intel Xeon Phi coprocessor cache hierarchy.

Developers targeting the Intel Xeon Phi coprocessor architecture may find these methodologies useful for optimizing applications that exploit, at the core level, hybrid parallel programming combining both threading and vectorization techniques.

How is this article organized?

Section 2 looks at the Intel Xeon Phi coprocessor architecture so as to provide insight into what a software developer may want to think about when doing code modernization for existing applications or when developing new applications. Section 3 examines the storage allocation intrinsics _mm_malloc and _mm_free for creating data structures and tiles for memory management. Similarly, Section 4 discusses using the storage allocation utility mmap to create data structures and tiles, and munmap for storage deallocation. Section 5 examines the semantic actions associated with the MIC_USE_2MB_BUFFERS environment variable. Section 6 discusses prefetch tuning. Section 7 applies these memory management techniques to a double-precision floating-point matrix multiply algorithm and works through restructuring transformations to improve execution performance. Section 8 describes performance results. Section 9 provides conclusions, and Section 10 lists references.

 

2.  The Intel® Xeon Phi™ Coprocessor Architecture

Figure 1 shows the microarchitecture for the Intel Xeon Phi coprocessor [5]. The Intel Xeon Phi coprocessor principally consists of 61 processing cores with four hardware threads per core. The coprocessor also has caches, memory controllers, PCIe* (Peripheral Component Interconnect Express*) client logic, and a very high bandwidth, bidirectional ring interconnection network (Figure 1) [5]. Each core has a private L2 cache that is kept fully coherent by a global-distributed tag directory (labeled TD in Figure 1). The L2 cache is eight-way set associative and is 512 KB in size. The L2 cache is unified, holding both data and instructions. The L1 cache consists of an eight-way set associative 32 KB L1 instruction and 32 KB L1 data cache. The memory controllers and the PCIe client logic provide a direct interface to the GDDR5 memory (double data rate type five synchronous graphics random access memory) on the coprocessor and the PCIe bus, respectively. All of these components are connected together by the ring interconnection network.

The microarchitecture of the Intel® Xeon Phi™ coprocessor

Figure 1. The microarchitecture of the Intel® Xeon Phi™ coprocessor [5]

Figure 1 shows a subset of the cores, the L2 cache associated with each core, and the tag directories. Each core consists of an in-order, dual-issue Intel® x86 pipeline, a local L1 and L2 cache, and a separate vector processing unit (VPU) [5]. Table 1 shows the local cache organization for each of the Intel Xeon Phi coprocessor cores.

Table 1. Intel® Xeon Phi™ Coprocessor Cache Hierarchy

Level                          | Storage Size | Attributes
L1 instruction cache           | 32 KB        | eight-way set associative with 64-byte cache line size
L1 data cache                  | 32 KB        | eight-way set associative with 64-byte cache line size
L2 instruction and data cache  | 512 KB       | eight-way set associative with 64-byte cache line size

In Table 1, a 64-byte cache line can contain 512 bits of data. In terms of floating-point data, a 64-byte cache line can hold 16 single-precision or 8 double-precision floating-point objects. Matching the 64-byte (512-bit) cache-line width, the VPU on the Intel Xeon Phi coprocessor [6] executes vector instructions that process 512 bits of data, and the VPUs support a Single Instruction Multiple Data (SIMD) capability. Each hardware thread within a core in Figure 1 has 32 512-bit-wide vector registers (zmm0–zmm31), and the architecture also offers vector mask registers (k0–k7) to support a rich set of conditional operations on data elements within the zmm vector registers (see Figure 2). These eight vector mask registers (k0–k7) are 16 bits wide, and they control the updating of the vector registers during VPU computation [7].

The vector register configuration for each hardware thread within a core of an Intel® Xeon Phi™ coprocessor

Figure 2. The vector register configuration for each hardware thread within a core of an Intel® Xeon Phi™ coprocessor [6].

In Figure 2, the MXCSR is a 32-bit control and status register that maintains the status of the following [6]:

  • Exception flags to indicate SIMD floating-point exceptions signaled by floating-point instructions
  • Rounding behavior and control
  • Exception suppression

The vector operations can be checked for exceptions during floating-point execution.

Loading data into vector registers is most efficient when the beginning of the data is aligned on 64-byte memory boundaries, and this type of alignment strategy may allow you to exploit efficient use of these 512-bit vector registers that can do 16 single-precision floating-point operations or 8 double-precision floating-point operations simultaneously. A 512-bit VPU also supports Fused Multiply-Add (FMA) instructions [5], where each of the three registers acts as a source and one of them is also a destination [8]. Hence, the FMA vector instructions can execute 32 single-precision or 16 double-precision floating point operations per clock cycle. A VPU also provides support for processing 16 32-bit integers at a time [9].
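To make the relationship between 64-byte alignment, the 512-bit registers, and FMA concrete, the following minimal sketch (not taken from the downloadable package; the function name fma8 is purely illustrative) uses the Intel compiler vector intrinsics to combine eight aligned double-precision values in a single fused multiply-add:

#include <immintrin.h>   /* Intel compiler vector intrinsics */

/* Sketch only: a, b, and c are assumed to start on 64-byte boundaries,
   for example because they were allocated with _mm_malloc(size, 64). */
void fma8(const double *a, const double *b, double *c)
{
    __m512d va = _mm512_load_pd(a);      /* 8 aligned doubles fill one 512-bit register */
    __m512d vb = _mm512_load_pd(b);
    __m512d vc = _mm512_load_pd(c);
    vc = _mm512_fmadd_pd(va, vb, vc);    /* vc = va * vb + vc in one FMA instruction */
    _mm512_store_pd(c, vc);
}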

In summary, the Intel Xeon Phi coprocessor has 61 cores with four hardware threads per core. The L1 and L2 cache lines for each core are 512 bits wide. Each hardware thread has 32 vector registers, which are also 512 bits wide. A 512-bit VPU is capable of 16 simultaneous arithmetic operations in single-precision mode or 8 in double-precision mode. The FMA instructions enable the execution of 32 single-precision or 16 double-precision floating-point operations per clock cycle. These architectural features and the alignment of data on 64-byte memory boundaries for the Intel Xeon Phi coprocessor should be kept in mind when doing code modernization for existing software applications or when developing new software.

In regard to data-structure alignment for data management that includes tiling and prefetching on the Intel Xeon Phi coprocessor, sections 3 through 6 discuss memory allocation techniques and data look-ahead for improving application performance.

 

3.  The Storage Allocation Intrinsic _mm_malloc

The _mm_malloc and _mm_free intrinsics [10] have the following prototype interfaces; the prototype declarations can be found in the <malloc.h> include file:

#include <malloc.h>
void *_mm_malloc (size_t size, size_t align);
void _mm_free (void *p);

The _mm_malloc routine takes an extra parameter, which is the alignment constraint. This constraint must be a power of two. The pointer that is returned from _mm_malloc is guaranteed to be aligned on the specified boundary.

A call to _mm_free will return the dynamically allocated storage back to the free-list.

An actual _mm_malloc call for allocating storage might look something like:

_mm_malloc(size,64);

where size is the amount of storage to be allocated in bytes. The integer value 64 specifies the storage alignment in bytes, so in this example storage is aligned on 64-byte memory address boundaries.

The paper by Krishnaiyer et al. [10] provides additional details about the _mm_malloc and _mm_free intrinsics.
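As a concrete illustration, the sketch below (not part of the downloadable package; the tile dimensions simply reuse the mc and kc values discussed in Section 8) allocates a 64-byte-aligned tile of double-precision data and releases it:

#include <malloc.h>     /* _mm_malloc and _mm_free */
#include <stddef.h>

#define MC 120          /* tile rows, matching the mc value used in Section 8 */
#define KC 240          /* tile columns, matching the kc value used in Section 8 */

int main(void)
{
    /* Allocate an MC x KC tile of doubles aligned on a 64-byte boundary. */
    double *a_tile = (double *) _mm_malloc((size_t) MC * KC * sizeof(double), 64);
    if (a_tile == NULL)
        return 1;

    /* ... pack and process the tile ... */

    _mm_free(a_tile);   /* return the storage to the free list */
    return 0;
}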

 

4.  The Storage Allocation Utility mmap

The mmap and munmap prototype declarations [11] can be found in the <sys/mman.h> include file and they have the following prototype interfaces:

#include <sys/mman.h>
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

int munmap(void *addr, size_t length);

With respect to the mmap prototype, the starting address for the new mapping is specified in the parameter argument addr. The length argument specifies the length of the mapping. If addr is NULL, the kernel chooses the address at which to create the mapping, and use of the NULL actual argument to mmap is the most portable method for creating a new mapping. If addr is not NULL, the kernel takes it as a hint about where to place the mapping. On Linux OS for example, the mapping will be created at a nearby page boundary. The address of the new mapping is returned as the result of the call. The prot argument describes the desired memory protection for the mapping, and it must not conflict with the open mode of the file. It is either PROT_NONE or the bitwise OR of one or more of the following flag protection values [11]:

PROT_EXEC - Pages may be executed.

PROT_READ - Pages may be read.

PROT_WRITE - Pages may be written.

PROT_NONE - Pages may not be accessed.

The flags argument determines whether updates to the mapping are visible to other processes mapping the same region and whether updates are carried through to the underlying file. This behavior is determined by including exactly one of MAP_SHARED or MAP_PRIVATE in flags; additional modifier flags such as MAP_HUGETLB, MAP_ANONYMOUS, and MAP_POPULATE may be bitwise ORed in. The flag constants used in this article have the following meanings:

MAP_ANONYMOUS - The mapping is not backed by any file, but rather its contents are initialized to zero. The fd and offset arguments are ignored, but some implementations require the value of fd to be -1, if MAP_ANONYMOUS (or MAP_ANON) is specified, and portable applications should ensure this.

MAP_PRIVATE - Create a private copy-on-write mapping.

MAP_HUGETLB - Allocate the mapping using "huge pages".

MAP_POPULATE - Populate (pre-page-fault) page tables for a mapping.

An mmap call for allocating storage might look something like the following:

mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_HUGETLB | MAP_ANONYMOUS | MAP_POPULATE, 0, 0);

where size is the amount of storage to be allocated in bytes.

In contrast to mmap, the munmap function removes the mapping for the specified address range.

The “Linux Programmer’s Manual” [11] provides additional details about the mmap and munmap functions.
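Putting the pieces together, an allocation/deallocation sequence for a tile might look something like the sketch below (an illustration rather than the code from the downloadable package). The fd argument is passed as -1 for portability with MAP_ANONYMOUS, as noted above, and MAP_HUGETLB succeeds only when huge pages are available on the system:

#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t size = (size_t) 120 * 240 * sizeof(double);   /* one tile; placeholder dimensions */

    /* Request an anonymous, pre-faulted, huge-page-backed mapping. The mapping is
       page aligned, which more than satisfies the 64-byte alignment needed by the VPU. */
    void *tile = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_POPULATE,
                      -1, 0);
    if (tile == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* ... pack and process the tile ... */

    if (munmap(tile, size) != 0) {
        perror("munmap");
        return 1;
    }
    return 0;
}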

 

5.  Semantic Action Description for the Environment Variable MIC_USE_2MB_BUFFERS

The size of memory pages on the Intel Xeon Phi coprocessor may be controlled by the environment variable MIC_USE_2MB_BUFFERS [12]. This environment variable is applicable to native execution mode as well as offload execution mode on the Intel Xeon Phi coprocessor. However, this article limits the discussion to native mode execution of an application, using the mmap storage allocation utility. For additional approaches, refer to How to Use Huge Pages to Improve Application Performance on Intel® Xeon Phi™ Coprocessor [12].

In native mode, to achieve good execution performance on the Intel Xeon Phi coprocessor, huge memory pages (2MB) may be necessary for the coprocessor storage allocations. This is because large storage requirements for variables and buffers are sometimes handled more efficiently with 2MB pages versus 4KB pages. With 2MB pages, Translation Lookaside Buffer (TLB) misses and page faults may be reduced, and there is also a lower cost for storage allocation. This implies that when using the MIC_USE_2MB_BUFFERS environment variable for native execution mode, pointer-based objects will be allocated in large pages when their runtime lengths exceed the value of this environment variable setting.

The value setting for the MIC_USE_2MB_BUFFERS environment variable has the following syntax using C Shell or T Shell [12]:

setenv MIC_USE_2MB_BUFFERS <integer>B|K|M|G|T

where B, K, M, G, and T are suffixes immediately following the positive <integer> value and have the semantic meanings:

B = bytes

K = kilobytes

M = megabytes

G = gigabytes

T = terabytes

The Bourne Shell syntax is [12]:

export MIC_USE_2MB_BUFFERS=<integer>B|K|M|G|T

An example environment variable setting for MIC_USE_2MB_BUFFERS using Bourne Shell syntax might be:

export MIC_USE_2MB_BUFFERS=3M

Notice that there is no space between the "3" and the "M". With the threshold set to MIC_USE_2MB_BUFFERS=3M, memory allocations with sizes ≥ 3 megabytes are placed in huge pages.

 

6.  Prefetch Tuning

Compiler prefetching is turned on by default for the Intel Xeon Phi coprocessor [13]. The default Intel® C/C++ Compiler and Intel® Fortran Compiler optimization level is -O2, and therefore compiler prefetching is enabled at the -O2 setting and above. Compiler prefetching can be controlled by compiler command-line switches, by directives within the C/C++ or Fortran* programming application, and by prefetch intrinsics within the programming application. The prefetch distance is the number of iterations ahead that a prefetch is issued. Prefetching is done after the vectorization phase, so the distance is in terms of vectorized iterations if an entire serial loop or part of a serial loop is vectorized.

The Intel Xeon Phi coprocessor also has a hardware L2 prefetcher that is enabled by default. In general, if the software prefetching algorithm is performing well for an application, the hardware prefetcher will not join in with the software prefetcher.

For this article the Intel C/C++ Compiler option:

-opt-prefetch-distance=n1[,n2]

is explored. The arguments n1 and n2 have the following semantic actions in regard to the -opt-prefetch-distance compiler switch:

  • The distance n1 (number of future loop iterations) for first-level prefetching into the Intel Xeon Phi coprocessor L2 cache.
  • The distance n2 for second-level prefetching from the L2 cache into the L1 cache, where n2 ≤ n1. The exception is that n1 may be 0 (disabling first-level prefetching) while n2 is nonzero.

Some useful values to try for n1 are 0, 4, 8, 16, 32, and 64 [13]. Similarly, useful values to try for n2 are 0, 1, 2, 4, and 8. The L2 prefetch distances signified by n1 can be combined with the n2 values that control data movement from the L2 cache into the L1 cache, and sweeping these combinations can reveal the best pairing of n1 and n2. For example, a setting might be

-opt-prefetch-distance=0,1

where the value 0 tells the compiler to disable compiler prefetching into the L2 cache, and the n2 value of 1 indicates that data should be prefetched 1 (vectorized) iteration ahead from the L2 cache into the L1 cache.
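For reference, a native-mode compile line that applies this option might look something like the following (the file name and the remaining switches are illustrative only; the Makefiles in the downloadable package show the options actually used):

icc -mmic -O3 -openmp -opt-prefetch-distance=0,1 opt_dgemm.c -o opt_dgemm.mic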

In summary, sections 3 through 5 discussed memory allocation techniques for allocating data storage in an aligned fashion and for controlling page sizes; these storage allocation methodologies can be used to manage data that is moved from, say, main memory into the cache hierarchy of a given processor architecture. This section discussed prefetching of these memory objects into the cache hierarchy. In the sections that follow, these techniques are applied to optimize a double-precision matrix multiply algorithm.

 

7.  Matrix Multiply Background and Programming Example

Matrix multiply has the core computational assignment:

C[i][j] = C[i][j] + A[i][p] × B[p][j]

A basic matrix multiply loop structure implemented in the C/C++ programming language might look something like the following:

int i, j, p;
/* The loop bounds assume square matrices (row == col), as used in this article. */
for (i = 0; i < row; i++)
   for (j = 0; j < col; j++)
      for (p = 0; p < row; p++)
         c[i][j] = c[i][j] + a[i][p] * b[p][j];

Software vendor libraries are available for performing matrix multiply in a highly efficient manner. However, matrix multiply is used as the example for applying the memory-allocation-optimization techniques in this article because the basic algorithm is roughly four lines long. Additionally, it is hoped that the reader, after seeing the restructuring transformations before and after they are applied, will consider applying similar restructuring transformations to the applications they are targeting for code modernization.

Goto et al. [14] have looked at restructuring transformations for the basic matrix multiply loop structure shown above so as to optimize it for various processor architectures. This has required organizing the A, B, and C matrices into tiles and packing the matrix elements to promote efficient memory referencing.

Smith et al. [15] have extended the work done in reference [14] into a matrix multiply solver that looks something like the following in pseudo-code form:

for jc = 0, … , n - 1 in steps of nc
   for pc = 0, … , k - 1 in steps of kc
      for ic = 0, … , m - 1 in steps of mc
         for jr = 0, … , nc - 1 in steps of nr
            for ir = 0, … , mc - 1 in steps of mr
               C[ir:ir+mr-1,jr:jr+nr-1] += …
            endfor
         endfor
      endfor
   endfor
endfor

The term pseudo-code is used for the loop structure above because the for-loop control structures and the vector-section syntax:

C[ir:ir+mr-1,jr:jr+nr-1] += …

do not conform to the C/C++ Standard. For the matrix multiply pseudo-code above, the notion of CEAN has the following syntactical structure [4]:

section_operator ::= [<lower bound> : <length> : <stride>]

where the <lower bound>, <length>, and <stride> are of integer types, representing a set of integer values as follows:

<lower bound>, <lower bound> + <stride>, …, <lower bound> + (<length> - 1) × <stride>

For the core solver pseudo-code referenced above, the array notation:

C[ir:ir+mr-1,jr:jr+nr-1]

does not provide a stride component, and therefore the stride access is unity. Proper application of the CEAN for the A, B, and C matrix objects can allow for the CEAN translation of

C[ir:ir+mr-1][jr:jr+nr-1] += …

into FMA instructions for the Intel Xeon Phi coprocessor architecture.
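As a simplified sketch of what that translation can look like at the source level (this is not the micro-kernel from the downloadable package; the function name, pointer arguments, and packed-panel layout are assumptions for illustration), one register-wide row of the update can be written in CEAN as:

/* Sketch: c points at 8 consecutive doubles of the C tile, a at kc packed
   elements of a row of the A tile, and b at a kc-by-8 packed panel of the B tile
   stored row by row. Each loop iteration maps onto one FMA over 8 doubles. */
static void micro_row(double *restrict c, const double *restrict a,
                      const double *restrict b, int kc)
{
    int p;
    for (p = 0; p < kc; p++)
        c[0:8] += a[p] * b[p * 8 : 8];
}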

The matrix multiply solver for the referenced BLIS* (BLAS-like Library Instantiation Software) [15] implementation is written in assembly language. In contrast, for the implementation that is described in this paper, the inner solver for matrix multiply is written in CEAN.

In addition to vectorization, threading semantics have been added to exploit multi-core and many-core processor architectures [15]. The ic loop and the jr loop are used for implementing two levels of threaded parallelism. In essence, a second level of threaded parallelism is nested within a first level of parallelism:

for ic = 0, … , m - 1 in steps of mc
   for jr = 0, … , nc - 1 in steps of nr

Assuming that there are 61 cores on the Intel Xeon Phi coprocessor, the iterations of loop ic will be distributed across the 61 cores. Each core has four hardware (logical) threads, and iterations of loop jr will be distributed across the four hardware threads within each core. OpenMP* semantics [16] are used to represent the two levels of threading. Figure 3 shows an abstract rendering of the possible 61 Intel Xeon Phi coprocessor cores and the four hardware threads associated with each core.
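A skeletal rendering of this two-level threading might look something like the sketch below (the function name and loop bounds are placeholders, nested parallelism is assumed to be enabled, for example with OMP_NESTED=TRUE, and the schedule clauses anticipate the dynamic scheduling with a chunk size of 2 discussed in Section 8):

#include <omp.h>

/* Sketch of the two-level threading applied to the ic and jr loops. */
void dgemm_block(int m, int nc, int mc, int nr)
{
    int ic;

    /* First level: iterations of the ic loop are distributed across the cores. */
    #pragma omp parallel for schedule(dynamic, 2) num_threads(61)
    for (ic = 0; ic < m; ic += mc) {
        int jr;

        /* Second level: iterations of the jr loop are distributed across the
           four hardware threads within each core. */
        #pragma omp parallel for schedule(dynamic, 2) num_threads(4)
        for (jr = 0; jr < nc; jr += nr) {
            /* ... pack the A and B tiles and invoke the micro-kernel ... */
        }
    }
}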

Figure 3 is an extension of Figure 1, where it is assumed that the Intel Xeon Phi coprocessor has 61 cores. Figure 3 also shows how iterations of loop ic are associated with each of the cores and how iterations of loop jr are associated with the four hardware threads within each core.

Abstract rendering of at most 61 cores on an Intel® Xeon Phi™ coprocessor

Figure 3. Abstract rendering of at most 61 cores on an Intel® Xeon Phi™ coprocessor. There are four hardware threads associated with each core.

The source for the non-library implementation of BLIS/DGEMM that is linked to this paper shows the details of how the loop structures above are used to manage the tile references for the A, B, and C matrices. Also, conditional compilation macros are used to control, at compile time, which storage allocation function _mm_malloc or mmap is to be used to allocate dynamic storage for matrices.
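A minimal sketch of that kind of conditional compilation is shown below; the USE_MMAP macro name and the alloc_tile helper are hypothetical, chosen only to illustrate the idea (the downloadable source defines its own macros):

#include <stddef.h>
#include <malloc.h>      /* _mm_malloc / _mm_free */
#include <sys/mman.h>    /* mmap / munmap */

/* Compile with -DUSE_MMAP to select mmap-based allocation; otherwise _mm_malloc is used. */
static double *alloc_tile(size_t size)
{
#ifdef USE_MMAP
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_POPULATE, -1, 0);
    return (p == MAP_FAILED) ? NULL : (double *) p;
#else
    return (double *) _mm_malloc(size, 64);   /* 64-byte-aligned allocation */
#endif
}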

 

8.  Performance Results for the BLIS/DGEMM Implementation

As noted earlier, each of the four hardware threads for the Intel Xeon Phi coprocessor core has 32 vector registers that are 512 bits wide. Figure 4 shows, at a high level of abstraction, how the C, Ã, and B̃ matrix data components progress from main memory into the L2 cache, the L1 cache, and the vector registers through iterations of the three innermost loops [15]. Note that for the à and B̃ data structures there has been blocking and packing of data within these tiles.

The tiled objects Ã and B̃ have the respective dimensions mc × kc and kc × nc. In reference to Figure 4, the for-loop control structures that reference these three variables have the following values associated with them for the experiments that will be discussed:

  • mc is 120
  • kc is 240
  • nc is equal to the matrix order

In Figure 4, the green rectangle represents 30 rows of matrix C, and each row contains eight double-precision words. The blue rectangles are strips of the Ã tile, and a blue strip represents 30 rows by 240 columns of double-precision data. The red rectangles are strips of the B̃ tile consisting of 240 rows, and each row contains eight double-precision words.

Tiling illustration of the three inner-most loops encapsulating the micro-kernel

Figure 4. Tiling illustration of the three inner-most loops encapsulating the micro-kernel C[ir:ir+mr-1,jr:jr+nr-1] += … for the BLIS* [15] implementation demonstrated in orig_dgemm.c. The tiled objects are Ã and B̃, and these tiles have respective sizes of mc × kc and kc × nc as referenced in the for-loop control structures.

Recall that each vector register can hold eight double-precision floating-point values. Also, for Figure 4(c), there is an opportunity to use the FMA vector instruction for the core computation:

c[ir:ir+mr-1,jr:jr+nr-1] += …

For the source file called orig_dgemm.c, the references to the 30 assignments to a section of the C matrix look something like the following:

c[ir:ir+mr-1, jr:jr+nr-1] += …
c[ir+1:ir+mr-1,jr:jr+nr-1] += …

…

c[ir+29:ir+mr-1,jr:jr+nr-1] += …

For the compilation of orig_dgemm.c into native mode executables for Intel Xeon Phi coprocessor, the floating-point-operations-per-second results for using a simple malloc, _mm_malloc(<size>,64), _mm_malloc(<size>,8092), and mmap are as follows (Figure 5):

Floating-point-operations-per-second results

Figure 5. Floating-point-operations-per-second results for “orig_dgemm.c” with a matrix order of 23,040.

Note that the “malloc” implementation provides a baseline only. The C/C++ source provided with this article does not contain all of the semantics necessary to actually run a “malloc” experiment, which would require removing all of the #pragma vector aligned directives.
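For context, the alignment directive is used ahead of the vectorized loops roughly as in the sketch below (an illustration, not an excerpt from the package; the function name axpy8 is made up). The directive is only safe when the referenced storage really is aligned, which is why the _mm_malloc and mmap allocations discussed earlier matter:

/* Sketch: the directive asserts to the compiler that t and b start on aligned
   boundaries, so no unaligned-access or peel code is generated for this loop. */
static void axpy8(double *restrict t, const double *restrict b, double alpha)
{
    int j;
    #pragma vector aligned
    for (j = 0; j < 8; j++)
        t[j] += alpha * b[j];
}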

For the source file called opt_dgemm.c, the references to the 30 rows of matrix C are replaced with the following:

t0[0:8] = c[ir:ir+mr-1,jr:jr+nr-1];
t1[0:8] = c[ir+1:ir+mr-1,jr:jr+nr-1];

…

t29[0:8] = c[ir+29:ir+mr-1,jr:jr+nr-1];

for ( … )

t0[0:8] += …
t1[0:8] += …

…

t29[0:8] += …

endfor

c[ir:ir+mr-1,jr:jr+nr-1] = t0[0:8];
c[ir+1:ir+mr-1,jr:jr+nr-1] = t1[0:8];

…

c[ir+29:ir+mr-1,jr:jr+nr-1] = t29[0:8];

The notion of using the array temporaries t0 through t29 can be thought of as assigning abstract vector registers in the computation of partial results for the matrix multiply algorithm. Running in native mode on the Intel Xeon Phi coprocessor, the floating-point-operations-per-second results for using _mm_malloc(<size>,64), _mm_malloc(<size>,8092), and mmap are as follows (Figure 6):

floating-point-operations-per-second-results-without-the-compiler-prefetching-option

Figure 6. Floating-point-operations-per-second results for “opt_dgemm.c” without the compiler prefetching option -opt-prefetch-distance. The matrix order was 23,040.

For the data gathered in Figure 6 using the array temporaries t0 through t29, the default prefetch settings for the compiler were applied to the three executables that were built.

Finally, for prefetching optimization experiments, one could use the n1 prefetching values 0, 4, 8, 16, 32, and 64 for the L2 cache. These L2 prefetching values could be combined with the n2 prefetching values 0, 1, 2, 4, and 8, subject to n2 ≤ n1.

These various combinations of n1 and n2 integer values for the -opt-prefetch-distance compiler switch were tried for the executables built from opt_dgemm.c, but the floating-point-operations-per-second results for all combinations are not shown in this paper. Figure 7 shows Intel Xeon Phi coprocessor native mode performance results from applying the prefetching compiler option:

-opt-prefetch-distance=0,1

to the source file opt_dgemm.c. The floating-point-operations-per-second results are based on building three respective executables using dynamic storage allocation from _mm_malloc(<size>,64), _mm_malloc(<size>,8092), and mmap.

Floating-point-operations-per-second results for “opt_dgemm.c” with the compiler prefetching option

Figure 7. Floating-point-operations-per-second results for “opt_dgemm.c” with the compiler prefetching option -opt-prefetch-distance=0,1. The matrix order is 23,040.

The user can download the shell scripts, Makefiles, C/C++ source files, and a README.TXT file at the following URL:

DGEMM Download Package

The README.TXT file contains information about how to build and run the executables from a host system.

After downloading and untarring the package, the reader can try similar experiments in native mode (as described here) on their Intel Xeon Phi coprocessor by first sourcing, on the host system, an Intel® Parallel Studio XE Cluster Edition script:

<Path-to-Intel-Parallel-Studio-XE-Cluster-Edition>/psxevars.sh intel64

and then issuing a command that looks something like the following within the directory opt_dgemm_v1:

$ cd ./scripts
$ ./orig_dgemm.sh <mic-coprocessor-name>

Again, note that this script must be launched from a host system, and <mic-coprocessor-name> should be replaced by an appropriate coprocessor name on the user’s system, for example, mic0.

The output report for a run should look something like the following and can be found in the reports sub-directory:

Matrix padding = 64
Matrix padding stride = 8
a = 47460625940480; b = 47464884477952; c = 47469143023616

OpenMP Dense matrix-matrix multiplication
Matrix order          = 23040
Matrix tile k-size         = 240
Matrix tile m-size         = 120
Number of core threads     = 61
Number of hardware threads     = 4
Number of iterations  = 1
Solution validates
Rate (MFlops/s): 248346.156132,  Avg time (s): 98.496314,  Min time (s): 98.496314, Max time (s): 98.496314

Please note that on your system the floating-point-operations-per-second results will vary from those shown in Figures 5, 6, and 7. Results will be influenced by factors such as the version of the operating system, the software stack component versions, and the coprocessor stepping. A Shell script for running an mmap executable as referenced in Figures 5, 6, and 7 has the following arguments:

61 4 240 120 23040 dynamic 2 2M

where:

  • 61 defines the number of core threads.
  • 4 defines the number of hardware threads per core that are to be used.
  • 240 is the value of kc and defines the number of columns for the Ã tile and the number of rows for the B̃ tile. The variable kc was mentioned at the beginning of this section (Section 8) as being part of the tile size. Also see Figure 4.
  • 120 is the value of mc and defines the number of rows for the Ã tile. The variable mc was also mentioned at the beginning of this section (Section 8) as being part of the tile size. Also see Figure 4.
  • 23040 is the matrix order.
  • The values dynamic and 2 are used to control the OpenMP scheduling (see below).
  • 2M is for the value for the MIC_USE_2MB_BUFFERS environment variable.

As mentioned earlier, OpenMP is used to manage threaded parallelism. In so doing, the OpenMP Standard [16] provides a scheduling option for work-sharing loops:

schedule(kind[, chunk_size])

This scheduling option is part of the C/C++ directive: #pragma omp parallel for, or #pragma omp for, and the Fortran directive: !$omp parallel do, or !$omp do. The schedule clause specifies how iterations of the associated loops are divided into contiguous non-empty subsets, called chunks, and how these chunks are distributed among threads of a team. Each thread executes its assigned chunk or chunks in the context of its implicit task. The chunk_size expression is evaluated using the original list items of any variables that are made private in the loop construct.

Table 2 provides a summary of the possible settings for the “kind” component of the “schedule” option:

Table 2.  “kind” scheduling values for the OpenMP* schedule(kind[, chunk_size]) directive component for OpenMP work sharing loops [16,17].

Kind    | Description
static  | Divide the loop into equal-sized chunks, or as equal as possible when the number of loop iterations is not evenly divisible by the number of threads multiplied by the chunk size. By default, the chunk size is loop-count/number-of-threads. Set chunk to 1 to interleave the iterations.
dynamic | Use an internal work queue to give a chunk-sized block of loop iterations to each thread. When a thread is finished, it retrieves the next block of loop iterations from the top of the work queue. By default, the chunk size is 1. Be careful when using this scheduling type because of the extra overhead involved.
guided  | Similar to dynamic scheduling, but the chunk size starts off large and decreases to better handle load imbalance between iterations. The optional chunk parameter specifies the minimum chunk size to use. By default, the chunk size is approximately loop-count/number-of-threads.
auto    | When schedule(auto) is specified, the scheduling decision is delegated to the compiler and runtime system, which are given the freedom to choose any possible mapping of iterations to threads in the team.
runtime | Uses the OMP_SCHEDULE environment variable to specify which of the loop-scheduling types should be used. OMP_SCHEDULE is a string formatted exactly as the schedule clause argument would appear.

An alternative to using the scheduling option of the C/C++ directives:

#pragma omp parallel for or #pragma omp for

or the Fortran directives:

!$omp parallel do or !$omp do

is to use the OpenMP environment variable OMP_SCHEDULE, which has the form:

type[,chunk]

where:

  • type is one of static, dynamic, guided, or auto.
  • chunk is an optional positive integer that specifies the chunk size.

For the executable example used in this paper, we found that dynamic scheduling provided the best results. Various chunk size values were also tested; the values that provided the best results for this paper were 2 and 102. These values were determined experimentally by varying the chunk size for the Ã and B̃ components.
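To experiment with schedule settings without recompiling, one possibility is to put schedule(runtime) on the work-sharing loop and drive it from OMP_SCHEDULE, as in the minimal sketch below (not part of the downloadable package):

/* Sketch: run with, for example,  export OMP_SCHEDULE="dynamic,2"  before launching. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    int i, count[4] = {0, 0, 0, 0};

    #pragma omp parallel for schedule(runtime) num_threads(4)
    for (i = 0; i < 64; i++)
        count[omp_get_thread_num()]++;    /* each thread tallies its own iterations */

    for (i = 0; i < 4; i++)
        printf("thread %d executed %d iterations\n", i, count[i]);
    return 0;
}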

 

9.  Conclusions  

The experiments on the Intel Xeon Phi coprocessor architecture using _mm_malloc and mmap storage allocation for a non-library implementation of DGEMM indicate that both data-alignment storage allocation utilities can be competitive in performance on the Intel Xeon Phi coprocessor. The mmap experiments also used the MIC_USE_2MB_BUFFERS environment variable to help improve their results. Management of the Intel Xeon Phi coprocessor vector registers at the programming-language level is important. In general, software developers may want to use conditional compilation macros within their applications to select either _mm_malloc or mmap for managing dynamic storage allocation. In this way, they can experiment to see which storage allocation methodology provides the best execution performance for an application running on a targeted processor architecture. Lastly, prefetching controls were used for the L1 and L2 data caches. The experiments showed that adjusting prefetching further improved execution performance.

 

10.  References

  1. “Intel® VTune™ Amplifier 2016,” https://software.intel.com/en-us/intel-vtune-amplifier-xe.
  2. “Intel® Trace Analyzer and Collector,” https://software.intel.com/en-us/intel-trace-analyzer.
  3.  S. Cepeda, “Optimization and Performance Tuning for Intel® Xeon Phi™ Coprocessors, Part 2: Understanding and Using Hardware Events,” https://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding, November 2012.
  4. “C/C++ Extensions for Array Notations Programming Model,” https://software.intel.com/en-us/node/522649.
  5. G. Chrysos, “The Intel® Xeon Phi™ X100 Family Coprocessor – the Architecture,” https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner, November 2012.
  6. Intel® Xeon Phi™ Coprocessor Instruction Set Architecture Reference Manual, https://software.intel.com/sites/default/files/forum/278102/327364001en.pdf, September 2012.
  7. R. Rahman, “Intel® Xeon Phi™ Coprocessor Vector Microarchitecture,” https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-vector-microarchitecture, May 2013.
  8. M. Cornea, “Numerical Computation on Intel® Xeon Phi™ Coprocessors Using the Intel® Compilers and Math Libraries,” http://www.lip6.fr/public/2013-10-15_Cornea.pdf, October 2013.
  9. R. Rahman, Intel® Coprocessor Architecture and Tools: The Guide for Application Developers (Expert’s Voice in Microprocessors), 1st Edition, Apress, September 2013.
  10. R. Krishnaiyer, A. Sharp, R. W. Green, and M. Corden, “Data Alignment to Assist Vectorization,” https://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization, September 2013.
  11. “Linux Programmer’s Manual,” http://man7.org/linux/man-pages/man2/mmap.2.html.
  12. “How to Use Huge Pages to Improve Application Performance on Intel® Xeon Phi™ Coprocessor,” https://software.intel.com/sites/default/files/Large_pages_mic_0.pdf.
  13. R. Krishnaiyer, “Compiler Prefetching for the Intel® Xeon Phi™ coprocessor,” https://software.intel.com/sites/default/files/managed/54/77/5.3-prefetching-on-mic-update.pdf.
  14. K. Goto and R. van de Geijn, “Anatomy of High-Performance Matrix Multiplication,” ACM Transactions on Mathematical Software, Vol. 34, No. 3, May 2008, pp. 1-25.
  15. T. M. Smith, R. van de Geijn, M. Smelyanskiy, and J. R. Hammond, “Anatomy of High-Performance Many-Threaded Matrix Multiplication,” Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, May 2014, pp. 1049-1059.
  16. “The OpenMP API Specification for Parallel Programming,” http://openmp.org/wp/openmp-specifications.
  17. R. Green, “OpenMP Loop Scheduling,” https://software.intel.com/en-us/articles/openmp-loop-scheduling.
