FDTD Algorithm Optimization on Intel® Xeon Phi™ coprocessor

Content

Download Article

Download FDTD Algorithm Optimization on Intel® Xeon Phi™ coprocessor [PDF 609KB]

1. Introduction

1.1 Background

Finite Difference Time Domain (FDTD) is an algorithm which is widely used in electromagnetic simulation. It is achieved by solving the Maxwell equations to calculate electrical and magnetic value. This algorithm appears in many industries and science domains, such as radar, navigation, microwave, antenna, electromagnetic detection and protection. Figure 1 shows different application domain images.

Figure 1 FDTD application domain

1.2 Algorithm introduction

The basic principles of the FDTD algorithm replace the differential formula of Maxwell curl equations with finite difference formulas. First, it transfers the differential curl equations into a set of partial differential equations of electric and magnetic field components, and then changes this set of equations to finite difference formulas. Cross calculate the value of electric and magnetic fields following the time step, and by the data samples of the continuous electromagnetic field in a certain space, and a period of time, to achieve simulation of the electromagnetic waves propagation in the time domain.

Maxwell equations show below,

E is electric field value, H is magnetic field value. Transfer the equations to a set of equations of six filed components in x, y, z directions,

Assume represents one of the electric or magnetic field components, discrete in the time and space domain,

Difference approximation formulas can be written as,

Figure2 Yee grid

Figure 2 is the arrangement of E and H in x, y, z directions. It is called the Yee grid. Every H is surrounded by four E and every E is surrounded by four H. This space sample method is very friendly for difference calculations of Maxwell equations. Take Ex as an example, the formula 3 can be written as

Then we can get the difference formula of Ex,

. Other field components Ey, Ez, Hx, Hy, Hz also can be represented in the same form. This is the basic iteration of FDTD algorithm. After enough time steps, it can obtain the stable result. Figure 3 gives the iteration process.

Figure3 FDTD cross calculation iteration in the time domain

2. Implementation on Intel® Xeon Phi™ coprocessor

The baseline code is written in C, parallelized by OpenMP. We implemented and tested this code on a preproduction Intel® Xeon Phi™ coprocessor based on the first generation Intel® Xeon Phi™ product codenamed Knights Corner, with 61 cores running at 1.09 GHz with 8GB 5.5GT/s GDDR RAM. The card is hosted on an Intel® Xeon™ E5-2670 with two 8-core processors running at 2.6GHz with 32GB DDR3-1600 RAM. The operation system on host is RedHat* Enterprise Linux* 6.1. The card is executing Intel® Manycore Platform Software Stack (Intel® MPSS) version 2.1.4346-16 with flash version 2.1.02.0314 and the Linux μOS version is 2.6.34.11-g65c0cd9. The development environment is Intel® Composer XE 2013 Linux release.

2.1 Baseline Code Analysis

Figure 4 shows the calculation process of FDTD. There are three dataset modes for this workload: SMALL dataset (60*110*150), MEDIUM dataset (240*300*300), and LARGE dataset (270*140*830). In this case, we use the MEDIUM dataset to deep dive the workload optimization on Intel® Many Integrated Core Architecture (Intel® MIC) in native mode. The hotspot functions in this MEDIUM dataset mode include:

update_E_PML(21.5%), update_H_PML(24.7%)
update_e(18.8% CPU time), update_h(19.1%),
SAR_EH_CPX(10.4%),

The main loops of hotspot functions can be summarized,

The outer loop count is too small and not a good candidate for parallelization by OpenMP.
Most loops have the similar loop count and structures.
It needs to check that if it can be auto-vectorized well by compiler.

Figure4 calculation process

2.2 Baseline Performance Profiling

First, we get the scalability result and it shows very badly. It can only scale to 62 threads. More threads bring the speed ratio down. But for Intel MIC, it’s better to scale to 100+ threads because of the pipeline design of micro-architecture. So for this workload, scalability tuning is very critical. Another important factor is vectorization performance as Intel MIC has 512-bit wide vector length. Adding compiler option –vec-report[n] will give us more information in the vectorization report. The vectorization report by compiler is shown below. Some of them cannot be auto-vectorized.

new_pml.cpp(828): (col. 25) remark: loop was not vectorized: existence of vector dependence.

new_pml.cpp(806): (col. 25) remark: loop was not vectorized: not inner loop.

new_pml.cpp(917): (col. 33) remark: loop was not vectorized: existence of vector dependence.

3. Optimization on Intel Xeon Phi coprocessor

In this workload, the main optimization technologies include two aspects, which will help to get good performance on Intel MIC.

3.1 Parallelism

We mainly adjust the code structure to increase the scalability. Almost all of the hotspot functions are modified in the same way.

Loop fusion

It’s obvious that x, y, z directions have the similar calculation. And the loop counts are not much different.

Merge the nested loops and reduce from 3 nest layers to 2 layers. The detailed method shows below, change the code from

#pragma omp for nowait
 for(i=0; i <= nx; i++)
     for(j=1; j <= ny; j++) {
          for(k=1; k <= nz; k++) {
                         ……
          }
  }
To
int j_end = ny+1;
 int ij_end=(nx+1)*(ny+1);
 int ij_start = j_end + 1;
#pragma omp for nowait
    for(int ij=ij_start; ij < ij_end; ij++) {
          i = ij/j_end;
          j = ij%j_end;
          if(!j) continue;
        for(k=1; k <= nz; k++) { 
……
}

This will help to increase openMP load balance and get better scalability.

Figure5 shows the scalability improved on Intel® Xeon Phi™ coprocessor after modification.

Figure5 FDTD scalability on Intel® Xeon Phi™ coprocessor

2.3 Vectorization

From the source code, we can read that there is not any dependency and the memory access is contiguous, which is very helpful for vectorization. There are many methods to help vectorization. What we do is to use pragma and add the compiler option “-ansi-alias”. But this option has a limitation. Since it is used to disable ANSI alias checking, it requires that source code should comply ANSI rules or ensure that two pointers cannot overlap the same memory area in the code.

Combine with “#pragma simd” and the switch “-ansi-alias” will make it easier for vectorization. Compared with scalar code, vectorization will bring about 4X speed up on Intel ®MIC.

Below is the performance of Intel® Xeon® 2S microarchitecture codename Sandy Bridge. This workload FDTD achieves 3.2X on Intel MIC compared with baseline code on Intel microarchitecture code name Sandy Bridge and it’s 1.3x of optimized code on Intel® microarchitecture code name Sandy Bridge.

Figure6 FDTD performance comparison

*Knights Corner

3. Conclusion

In this case study, we optimize the code on Intel Xeon microarchitecture code name Sandy Bridge and then port to Intel MIC just by adding the option “-mmic” to then achieve the expected performance. Then we can conclude that:

Tuning the CPU can bring even more improvement on Intel MIC.
In this workload, most loops vectorization can be achieved by compiler with “#pragma simd” and the option “-ansi-alias”. The effect is almost the same with using intrinsics.
Scalability tuning brings great performance improved and play an important role in this workload.

References

[1]http://software.intel.com/mic-developer
[2]http://www.openmp.org/
[3] K.S. Yee. Numerical solution of initial boundary value problems involving Maxwell's equations in isotropic media [ J] .IEEE AP - 14 1966.

[4] Gao, Benqing. FDTD Method. 1995.

About the Authors

Shan Zhou works in DRD, Software & Service Group, Intel Corporation. She joined Intel in 2011. She works closely with PRC HPC customers and helps them achieve performance on Intel Architecture platforms. Shan got her Master’s degree in Electronic Engineering from Xidian University in 2009.

Guangming Tan, Associate Professor, works at State Key Laboratory of Computer Architecture, ICT, CAS. His main research interest is parallel algorithms and programming on Multi/Many-core architectures.

Notices

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to:http://www.intel.com/design/literature.htm

Intel, the Intel logo, VTune, Phi and Xeon are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others
Copyright © 2013 Intel Corporation. All rights reserved.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families: Go to:Learn About Intel® Processor Numbers

MIC

Intel® Xeon Phi™ Coprocessor

Finite Difference Time Domain

Maxwell's equations

differential curl equations

Shan Zhou

Guangming Tan

multi-core

Finite Difference Time Domain Algorithm Optimization on Intel® Xeon Phi™ coprocessor

electromagnetic simulation

Server

Intel® Many Integrated Core Architecture

URL