BKM: Coaxing the Compiler to Vectorize Structured Data via Gathers

Best Known Method

To help the compiler generate better vector code, sometimes it helps to decompose complex data structures to allow the compiler to understand the available parallelism and vectorize the code. Decomposing data accesses may allow the compiler to use more advanced features like vector gather and scatter. Though adjacent data elements are preferred in order to maximize the performance advantage of vector loads and stores, sometimes the requirement to access data through indices is unavoidable. Though use of vector gathers and scatters is generally slower than vector loads and stores, they can be beneficial if the amount of vector computation following the creation of the vectors is enough to offset the vector gather time. The compiler will generate vector gathers and scatters as needed, but sometimes it can be challenged by complex data structures. Showing the compiler how to access these complex data structures can help.

Original Code

This BKM originated with analysis of a Molecular Dynamics (MD) code. In MD codes, atom forces are typically maintained by three values: x, y, and z. These force values are arranged in a position array that contains all atoms.

// atom forces
typedef struct float3 {
   float x;
   float y;
   float z;
} float3;

// position is an array of float3
float3* position;

Force calculations for a given atom are determined by its neighbor atoms. For each atom, there is a list of neighbor atoms. To compute the forces for a given atom, this list is accessed to find all of its neighbors.

In the code segment below the position array, containing all of the atoms, is accessed by using an index in the neighList array. The neighbors’ values are placed into the struct jpos. This is followed by calculations using the atom forces of this neighbor atom.

        181    for (int k=0; k<dis; k++){

        183        jpos = position[ neighList[j*dis + k + maxNeighbors * i] ];

        192        float delx = ipos.x - jpos.x;

        193        float dely = ipos.y - jpos.y;

        194        float delz = ipos.z - jpos.z;

        195        float r2inv = delx*delx + dely*dely + delz*delz;

The following is a portion of the assembly code generated for the above code segment. It shows that the position array elements are accessed one at a time even though vector instructions are being used. The neighList is accessed is get the index for the positionarray (line #183). This index is loaded into register %rax. Each of the three adjacent elements of position indexed by %rax, x, y, and z, are loaded from memory separately at the vsubps subtraction instructions (line #192, #193, #194). Notice the {1to16} instruction modifier on the vsubps instruction indicating a broadcast of a single data value into all 16 elements of the source vector for the subtraction. All subsequent computations are completed as scalar operations even though vector instructions are used.

    movl      8(%r14,%r15,4), %eax                          #183.30 c1

    movslq    %eax, %rax                                    #183.21 c3

    shlq      $4, %rax                                      #183.21 c5

    vsubps    (%rax,%rbx){1to16}, %zmm19, %zmm28{%k4}       #192.28 c8

    vsubps    4(%rax,%rbx){1to16}, %zmm18, %zmm27{%k4}      #193.28 c10

    vsubps    8(%rax,%rbx){1to16}, %zmm22, %zmm26{%k4}      #194.32 c12

    vmulps    %zmm27, %zmm27, %zmm0{%k4}                    #195.41 c14

    vmovaps   %zmm28, %zmm1                                 #195.53 c16

    vmovaps   %zmm26, %zmm2                                 #195.53 c18

    vfmadd213ps %zmm0, %zmm28, %zmm1{%k4}                   #195.53 c20

    nop                                                     #195.53 c22

    vfmadd213ps %zmm1, %zmm26, %zmm2{%k4}                   #195.53 c24

Improved Code

To improve performance, we would like the compiler to exploit vector memory accesses and vector instructions using full vectors rather than scalar values. After loading the neighbor values from the position array, there are many force calculations to be completed (not shown in the code segment). Ideally these calculations will be completed using the performance advantage of full vectors rather than scalar values.

We note that the neighList is a sequential list of indexes that are used to access the position array. This is ideal for vector gathers. The position array could be accessed using vector gather instructions, where the index is the vector of neighList that was just loaded.

Creating vectors for complicated data types like structs is a challenge for today’s compilers that will likely be addressed in the future. For now, to help the compiler on this code, we recompose indexing the complex data type to explicitly access the individual data elements. After this code change, the compiler is able to use vector gathers and fully vectorize the code. In the code below, the accesses to the elements in position array are made explicitly.

181     for (int k=0;k<dis;k++){

184         jposx = position[ neighList[j*dis + k + maxNeighbors * i] ].x;

185         jposy = position[ neighList[j*dis + k + maxNeighbors * i] ].y;

186         jposz = position[ neighList[j*dis + k + maxNeighbors * i] ].z;

192         float delx = iposx - jposx;

193         float dely = iposy - jposy;

194         float delz = iposz - jposz;

195         float r2inv = delx*delx + dely*dely + delz*delz;

The following is a portion of the assembly code generated for the above code segment. It shows that the position array is accessed by vector gather instructions (line #184 #185 #186). The vector-gather instructions place the x, y, and z data values into three vector registers %zmm21, %zmm22, and %zmm23. The vector-subtract (vsubps) instructions (line #192 #193 #194) now use full vectors rather than scalar values:

        vgatherdps 4(%r13,%zmm20,4), %zmm22{%k4}                #185.22

        jkzd      ..L318, %k4   # Prob 50%                      #185.22

 ..L319 vgatherdps 4(%r13,%zmm20,4), %zmm22{%k4}                #185.22

        jknzd     ..L319, %k4   # Prob 50%                      #185.22

 ..L318 vgatherdps (%r13,%zmm20,4), %zmm21{%k3}                 #184.22

        jkzd      ..L320, %k3   # Prob 50%                      #184.22

 ..L321 vgatherdps (%r13,%zmm20,4), %zmm21{%k3}                 #184.22

        jknzd     ..L321, %k3   # Prob 50%                      #184.22

 ..L320 vsubps    %zmm22, %zmm11, %zmm2                         #193.27 c27

        vgatherdps 8(%r13,%zmm20,4), %zmm23{%k5}                #186.22

        jkzd      ..L322, %k5   # Prob 50%                      #186.22

 ..L323 vgatherdps 8(%r13,%zmm20,4), %zmm23{%k5}                #186.22

        jknzd     ..L323, %k5   # Prob 50%                      #186.22

 ..L322 vsubps    %zmm21, %zmm12, %zmm3                         #192.27 c33

        cmpl      %r15d, %r10d                                  #158.9 c33

        vmulps    %zmm2, %zmm2, %zmm31                          #195.41 c35

        vsubps    %zmm23, %zmm9, %zmm1                          #194.31 c37

        vfmadd231ps %zmm3, %zmm3, %zmm31                        #195.41 c39

        vfmadd231ps %zmm1, %zmm1, %zmm31                        #195.41

To understand this listing, note that the vector gather instructions work in pairs, the first one initiating an operation and the second one continuing it until it is completed. The remaining vector instructions in this example operate on vectors of distinct values, leaving a full vector of results in %zmm1.

With this code change, there was a 2.5X performance improvement in the overall MD application as the force calculations were completed using full vectors rather than scalar values.

Intel® C++ Composer XE

Intel® Composer XE

Intel® Many Integrated Core Architecture

Optimization

Vectorization

URL