To help the compiler generate better vector code, sometimes it helps to decompose complex data structures to allow the compiler to understand the available parallelism and vectorize the code. Decomposing data accesses may allow the compiler to use more advanced features like vector gather and scatter. Though adjacent data elements are preferred in order to maximize the performance advantage of vector loads and stores, sometimes the requirement to access data through indices is unavoidable. Though use of vector gathers and scatters is generally slower than vector loads and stores, they can be beneficial if the amount of vector computation following the creation of the vectors is enough to offset the vector gather time. The compiler will generate vector gathers and scatters as needed, but sometimes it can be challenged by complex data structures. Showing the compiler how to access these complex data structures can help.
Original Code
This BKM originated with analysis of a Molecular Dynamics (MD) code. In MD codes, atom forces are typically maintained by three values: x, y, and z. These force values are arranged in a position array that contains all atoms.
// atom forces
typedef struct float3 {
float x;
float y;
float z;
} float3;
// position is an array of float3
float3* position;
Force calculations for a given atom are determined by its neighbor atoms. For each atom, there is a list of neighbor atoms. To compute the forces for a given atom, this list is accessed to find all of its neighbors.
In the code segment below the position array, containing all of the atoms, is accessed by using an index in the neighList array. The neighbors’ values are placed into the struct jpos. This is followed by calculations using the atom forces of this neighbor atom.
181 for (int k=0; k<dis; k++){
183 jpos = position[ neighList[j*dis + k + maxNeighbors * i] ];
192 float delx = ipos.x - jpos.x;
193 float dely = ipos.y - jpos.y;
194 float delz = ipos.z - jpos.z;
195 float r2inv = delx*delx + dely*dely + delz*delz;
The following is a portion of the assembly code generated for the above code segment. It shows that the position array elements are accessed one at a time even though vector instructions are being used. The neighList is accessed is get the index for the positionarray (line #183). This index is loaded into register %rax. Each of the three adjacent elements of position indexed by %rax, x, y, and z, are loaded from memory separately at the vsubps subtraction instructions (line #192, #193, #194). Notice the {1to16} instruction modifier on the vsubps instruction indicating a broadcast of a single data value into all 16 elements of the source vector for the subtraction. All subsequent computations are completed as scalar operations even though vector instructions are used.
movl 8(%r14,%r15,4), %eax #183.30 c1
movslq %eax, %rax #183.21 c3
shlq $4, %rax #183.21 c5
vsubps (%rax,%rbx){1to16}, %zmm19, %zmm28{%k4} #192.28 c8
vsubps 4(%rax,%rbx){1to16}, %zmm18, %zmm27{%k4} #193.28 c10
vsubps 8(%rax,%rbx){1to16}, %zmm22, %zmm26{%k4} #194.32 c12
vmulps %zmm27, %zmm27, %zmm0{%k4} #195.41 c14
vmovaps %zmm28, %zmm1 #195.53 c16
vmovaps %zmm26, %zmm2 #195.53 c18
vfmadd213ps %zmm0, %zmm28, %zmm1{%k4} #195.53 c20
nop #195.53 c22
vfmadd213ps %zmm1, %zmm26, %zmm2{%k4} #195.53 c24
Improved Code
To improve performance, we would like the compiler to exploit vector memory accesses and vector instructions using full vectors rather than scalar values. After loading the neighbor values from the position array, there are many force calculations to be completed (not shown in the code segment). Ideally these calculations will be completed using the performance advantage of full vectors rather than scalar values.
We note that the neighList is a sequential list of indexes that are used to access the position array. This is ideal for vector gathers. The position array could be accessed using vector gather instructions, where the index is the vector of neighList that was just loaded.
Creating vectors for complicated data types like structs is a challenge for today’s compilers that will likely be addressed in the future. For now, to help the compiler on this code, we recompose indexing the complex data type to explicitly access the individual data elements. After this code change, the compiler is able to use vector gathers and fully vectorize the code. In the code below, the accesses to the elements in position array are made explicitly.
181 for (int k=0;k<dis;k++){
184 jposx = position[ neighList[j*dis + k + maxNeighbors * i] ].x;
185 jposy = position[ neighList[j*dis + k + maxNeighbors * i] ].y;
186 jposz = position[ neighList[j*dis + k + maxNeighbors * i] ].z;
192 float delx = iposx - jposx;
193 float dely = iposy - jposy;
194 float delz = iposz - jposz;
195 float r2inv = delx*delx + dely*dely + delz*delz;
The following is a portion of the assembly code generated for the above code segment. It shows that the position array is accessed by vector gather instructions (line #184 #185 #186). The vector-gather instructions place the x, y, and z data values into three vector registers %zmm21, %zmm22, and %zmm23. The vector-subtract (vsubps) instructions (line #192 #193 #194) now use full vectors rather than scalar values:
vgatherdps 4(%r13,%zmm20,4), %zmm22{%k4} #185.22
jkzd ..L318, %k4 # Prob 50% #185.22
..L319 vgatherdps 4(%r13,%zmm20,4), %zmm22{%k4} #185.22
jknzd ..L319, %k4 # Prob 50% #185.22
..L318 vgatherdps (%r13,%zmm20,4), %zmm21{%k3} #184.22
jkzd ..L320, %k3 # Prob 50% #184.22
..L321 vgatherdps (%r13,%zmm20,4), %zmm21{%k3} #184.22
jknzd ..L321, %k3 # Prob 50% #184.22
..L320 vsubps %zmm22, %zmm11, %zmm2 #193.27 c27
vgatherdps 8(%r13,%zmm20,4), %zmm23{%k5} #186.22
jkzd ..L322, %k5 # Prob 50% #186.22
..L323 vgatherdps 8(%r13,%zmm20,4), %zmm23{%k5} #186.22
jknzd ..L323, %k5 # Prob 50% #186.22
..L322 vsubps %zmm21, %zmm12, %zmm3 #192.27 c33
cmpl %r15d, %r10d #158.9 c33
vmulps %zmm2, %zmm2, %zmm31 #195.41 c35
vsubps %zmm23, %zmm9, %zmm1 #194.31 c37
vfmadd231ps %zmm3, %zmm3, %zmm31 #195.41 c39
vfmadd231ps %zmm1, %zmm1, %zmm31 #195.41
To understand this listing, note that the vector gather instructions work in pairs, the first one initiating an operation and the second one continuing it until it is completed. The remaining vector instructions in this example operate on vectors of distinct values, leaving a full vector of results in %zmm1.
With this code change, there was a 2.5X performance improvement in the overall MD application as the force calculations were completed using full vectors rather than scalar values.