How to Detect and Repair Correctness Issues in Code to Run on the Intel® Xeon Phi™ Coprocessor Architecture with Intel® Inspector XE

Intel® Xeon Phi™ coprocessors combine advanced power performance with the benefits of standard CPU programming models. Developing and tuning for Intel® Xeon Phi™ coprocessors means you get both great coprocessor performance and improved performance on Intel® Xeon® processors.

Parallel computing has many of the same challenges across all computing devices. The developer is responsible for making appropriate choices in designing a parallel algorithm, and for understanding the implications of these choices.

Intel® Inspector XE is a product for analyzing correctness issues that is available alone or as part of the Intel® Parallel Studio XE or Intel® Cluster Studio XE. As part of the studio, it can also be used to visualize static analysis data generated by the Intel® Composer XE. Intel Inspector XE can be used to detect and repair correctness issues while you are transitioning to the Intel Xeon Phi coprocessor or, with some build manipulation, to analyze your code after the transition is complete.

Code Preparation for the Transition to Using the Intel Xeon Phi Coprocessor

The best way to prepare code aimed at Intel Xeon Phi coprocessors for performance and correctness is to maximize the performance and fix the correctness issues on Intel Xeon processors first.

Step 1: If you have the Intel Parallel Studio XE or Intel® Cluster Studio XE, run static analysis to check for any API or memory issues found at compile time. Any issues found using static analysis while running on the Intel Xeon processor are issues that would also happen when running on the Intel Xeon Phi coprocessor, so it is best to eliminate them first.

Step 2: Run your code using the Intel Inspector XE memory error analysis tools. This will help you find any memory issues that are only evident during runtime. As with static analysis, all the issues you find on the Intel Xeon processor would also happen after the transition to the Intel Xeon Phi coprocessor.

Step 3: Use the Intel Inspector XE threading error analysis to find any threading errors in the code as written. Since the scheduling of threads will be different when the code actually runs on the Intel Xeon Phi coprocessor, the race detection will be slightly different on the Intel Xeon processor. Although the frequency of any given race will change and the probability of a race actually happening during a particular run on the Intel Xeon Phi coprocessor will be different, there will be no false positives. All races reported by the tool will be possible in the actual execution, whether run on the Intel Xeon Phi coprocessor or on the Intel Xeon processor.
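As an illustration of the kinds of defects Steps 2 and 3 are meant to surface, here is a minimal C sketch. The function names are hypothetical and not taken from any Intel sample. It shows a heap overrun of the sort a memory error analysis would be expected to flag (kept as a comment so the example is safe to run), a data race of the sort a threading error analysis would be expected to flag, and one possible repair for each. The OpenMP pragmas take effect only when the code is compiled with OpenMP enabled.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Memory defect (shown only as a comment so nothing here overruns a buffer):
 *     char *copy = malloc(strlen(s));   // one byte short, no room for '\0'
 *     strcpy(copy, s);                  // writes one byte past the block
 * Repaired version: allocate room for the terminator as well. */
char *copy_string(const char *s)
{
    size_t len = strlen(s) + 1;          /* include the trailing '\0' */
    char *copy = malloc(len);
    if (copy == NULL)
        return NULL;
    memcpy(copy, s, len);
    return copy;                         /* caller must free() the result */
}

/* Threading defect: every thread performs an unsynchronized
 * read-modify-write on the shared variable `sum`. The result may be
 * wrong on some runs and right on others, on either architecture. */
double sum_racy(const double *x, size_t n)
{
    double sum = 0.0;
    #pragma omp parallel for
    for (long i = 0; i < (long)n; ++i)
        sum += x[i];                     /* data race on sum */
    return sum;
}

/* Repaired version: the reduction clause gives each thread a private
 * partial sum and combines the partial sums at the end of the loop. */
double sum_fixed(const double *x, size_t n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; ++i)
        sum += x[i];
    return sum;
}
```

Because the race in sum_racy exists in the source regardless of how threads are scheduled, finding and repairing it on the Intel Xeon processor also removes it from the coprocessor build.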

Issues Found after the Transition to Using the Intel Xeon Phi Coprocessor

No matter how carefully the code is checked before the transition, changes you make in your code to add functionality later may cause new issues to arise. Intel Inspector XE can also be used to deal with these.

Intel Inspector XE will not run on the Intel Xeon Phi coprocessor architecture directly. You will have to make some code or compiler option modifications and take advantage of the Intel Composer XE and the ability to compile code for the Intel Xeon processor to run your analysis.

There are two major methodologies for taking advantage of the Intel Xeon Phi coprocessor in your application:

· Compile specifically for the Intel Xeon Phi coprocessor and directly copy the code to the coprocessor. Then you can run directly on the Intel Xeon Phi coprocessor. This is frequently called native mode.

· Offload selective portions of an application to the Intel Xeon Phi coprocessor, taking advantage of the #pragma offload directive supported by the Intel Composer XE compiler. This is frequently referred to as offload mode.

Exactly what you have to do to use the Intel Inspector XE after your code has been transitioned for use on the Intel Xeon Phi coprocessor depends on how you transitioned your code.

Threading and Memory Correctness with Various Programming Models for the Intel Xeon Phi Coprocessor

Native Mode

Use the Intel MPI Library to program for the Intel Xeon Phi coprocessor.

Using the Intel MPI Library to program for the Intel Xeon Phi coprocessor is one of the more common ways of running code on the Intel Xeon Phi coprocessor. The Intel MPI Library provides a very clean and clear way to address the Intel Xeon Phi coprocessor without a great deal of initial cost, although there is a tuning process to improve performance and you will have to choose the most relevant programming model yourself.

Using this model, the application runs as one (or several) ranks on Intel Xeon processors and Intel Xeon Phi coprocessors. Each rank (process) is multithreaded using whatever model you prefer (usually the OpenMP* or pthreads model, but perhaps the Intel® Cilk™ Plus model as well).

In order to run on the Intel Xeon Phi coprocessor, the code must be compiled using the -mmic option and then located on the NFS shared drive or copied directly to the coprocessor. Intel Xeon processor ranks have their own compilation of the code.

Since Intel Inspector XE 2013 does not run on the Intel Xeon Phi coprocessor, to perform threading and memory correctness analysis you have to invoke mpirun so that every rank runs on the host side. The simplest way to do this is to use a host file that does not contain any Intel Xeon Phi coprocessors.

An example command line for the Intel Inspector XE would be:

mpirun -n 2 -f onlyXeon.hosts inspxe-cl -collect ti3 -result-dir foo ./a.out

Once you have done this, all code will be analyzed by the Intel Inspector XE.

SSH to the card and use the native threads/OpenMP*/Intel Cilk Plus program there directly.

This is much like using the Intel MPI Library to address the Intel Xeon Phi coprocessor. The source is compiled with the correct options for addressing the Intel Xeon Phi coprocessor, possibly including specific options for the Intel Xeon Phi coprocessor 512-bit-wide vectors. It is then copied directly to the Intel Xeon Phi coprocessor and executed there.

In order to analyze this code with the Intel Inspector XE, you will need to recompile to address the Intel Xeon processors using the Intel Composer XE. If you have set the array alignment in your code to better align on the Intel Xeon Phi coprocessor (in C++ by using __assume_aligned(array, 64), or in Fortran by using !dec$ assume_aligned Y:64), you will need to return to the default setting.
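One possible way to make that switch painless is to guard the alignment hint so it is only seen by the coprocessor build. The sketch below is an assumption about how this could be arranged, not a prescribed method. It relies on the __MIC__ macro that the Intel compiler defines when targeting the coprocessor, and the function name scale_array is hypothetical. When the same source is recompiled for the Intel Xeon processor, the hint disappears and the default alignment assumptions apply.

```c
#include <stddef.h>

/* Multiply every element of x by a and store the result in y.
 * The alignment hint is compiled only when building for the
 * coprocessor with the Intel compiler; a host build used for
 * correctness analysis does not see it. */
void scale_array(float *y, const float *x, float a, size_t n)
{
#if defined(__INTEL_COMPILER) && defined(__MIC__)
    /* Promise to the compiler: both arrays start on 64-byte boundaries. */
    __assume_aligned(x, 64);
    __assume_aligned(y, 64);
#endif
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i];
}
```

With a guard like this, the arrays passed in a coprocessor build still have to be allocated with 64-byte alignment. The guard only keeps the promise out of the host build.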

With this recompile to address the Intel Xeon processors using the Composer XE, and removing any flag to use the Intel Xeon Phi coprocessor-specific vectors, all of the code will be made available for analysis to Intel Inspector XE.

Note: As mentioned previously in this article, the scheduling of threads will be different when the code actually runs on the Intel Xeon Phi coprocessor, so the race detection will be slightly different. Although the frequency of any given race will change and the probability of a race actually happening during a particular run on the Intel Xeon Phi coprocessor will be different, there will be no false positives. All races reported by the tool will be possible in the actual execution, whether on the Intel Xeon Phi coprocessor or on the Intel Xeon processor. All memory issues reported by the tool will be real issues that would also happen when running on the Intel Xeon Phi coprocessor.

Offload Mode

Another way to program for the Intel Xeon Phi coprocessor is to use the #pragma offload directive supported by the Intel Composer XE to tell the compiler which portions of your code are intended to run on the Intel Xeon Phi coprocessor.

If this is how you are addressing the Intel Xeon Phi coprocessor, theoretically you will be able to just #ifdef out all of the offload pragmas and you will have a working C program after recompiling to address the Intel Xeon processor. Realistically, you will have to be careful because it is possible to generate self-deadlocking programs that way. In addition, if you have changed the array alignment, you will need to revert to the native alignment for the Intel Xeon processor.
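A minimal sketch of wrapping an offload pragma in #ifdef follows. The macro USE_OFFLOAD and the function square_all are hypothetical names chosen for illustration. The offload clause follows the Intel compiler's target(mic) syntax. When USE_OFFLOAD is not defined, the pragma vanishes and the braces become an ordinary block that runs on the host, where the analysis tool can see it.

```c
#include <stddef.h>

/* Square each element of src into dst.
 * Build with -DUSE_OFFLOAD (hypothetical macro) to offload the block
 * to the coprocessor; build without it to keep everything on the host
 * for correctness analysis. */
void square_all(const float *src, float *dst, int n)
{
#ifdef USE_OFFLOAD
    #pragma offload target(mic) in(src : length(n)) out(dst : length(n))
#endif
    {
        for (int i = 0; i < n; ++i)
            dst[i] = src[i] * src[i];
    }
}
```

Using a single macro for every offload region means one compiler option switches the whole program between the offload build and the host-only analysis build.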

You could just run the Intel Inspector XE without wrapping or otherwise turning off the pragmas. However, in that case the Intel Inspector XE will only work for the code that runs on the host and not for any code that would have been offloaded to the Intel Xeon Phi coprocessor.

A note about Intel Inspector XE dynamic analysis using offload: The Intel Inspector XE will not be able to analyze any code that is #ifdef-ed out or otherwise elided in the conversion to run on the Intel Xeon processor. Therefore, the Intel Inspector XE will not be able to detect races or memory errors involved in the actual offloading of your code to the Intel Xeon Phi coprocessor that would happen at run time.

Static Analysis on Code Written for the Intel Xeon Phi Coprocessor

Static analysis of code is performed by the Intel® Composer XE using a special compile-time switch, and the resulting data is then examined using the command line or GUI interface of the Intel Inspector XE.

The static analysis engine, as of the Intel Composer XE 2013 Update 1, does not recognize the offload #pragma and will work as if there were no Intel Xeon Phi coprocessor-specific code in the program. While collection will work, there may be some false positives generated by the tool if the offload #pragma is not #ifdef-ed out of your code before analysis.

Conclusion

The Intel Xeon Phi coprocessor provides a powerful means of unleashing the power of parallelism to drive your workflow. Parallelism, however, can come with many pitfalls. It is strongly recommended that you take advantage of the Intel Inspector XE to clean up correctness issues in your code before you optimize your performance for the coprocessor.

Once you have optimized for the Intel Xeon Phi coprocessor, you can still take advantage of the correctness checking functionality of Intel Inspector XE if you make some basic changes to your code and your compilation. This allows you to have the same level of confidence and testability in your Intel Xeon Phi coprocessor code as you would have for any other CPU.

Intel® Cluster Studio XE

Intel® Parallel Studio XE

Intel® Inspector XE