Performance tuning of OpenCL* applications on Intel® Xeon Phi™ coprocessor using Intel® VTune™ Amplifier XE 2013
Introduction
The Intel® SDK for OpenCL* Applications XE 2013 release provides a development environment for OpenCL 1.2 applications on both Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors for Linux* operating systems. The SDK is available at http://www.intel.com/software/opencl-xe/ and includes development tools, runtime, and support for optimization tools. In addition, recent releases of the VTune™ Amplifier XE 2013 provide essential functionality for tuning OpenCL applications on Intel Xeon Phi coprocessors. This article provides the basic workflow for profiling OpenCL on Intel Xeon Phi coprocessors and some examples of performance analysis.
Steps to profile your OpenCL application
Profiling an OpenCL application with the VTune Amplifier XE 2013 is similar to profiling any native or offload application on the Intel Xeon Phi coprocessor.
Here are the steps you need to follow to profile your OpenCL application. This information is aggregated from a few pages in the VTune analyzer manual at http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/amplifierxe/lin/ug_docs/GUID-06A94CB9-56AB-4DB3-B1E4-E03535A50A9C.htm.
- Install the sampling driver.
To enable the sampling collection driver:
- cd /opt/intel/vtune_amplifier_xe_2013/bin64/k1om
- ./sep_micboot_create.sh
- ./sep_micboot_install.sh
- ./userapi_micboot_install.sh
To enable the JIT collection on the Xeon Phi coprocessor:
- service mpss restart
You must restart the system for the new drivers to be loaded.
Note: You need to perform these steps only once.
- Create a VTune Amplifier XE 2013 project.
- source /opt/intel/vtune_amplifier_xe_2013/amplxe-vars.sh
- amplxe-gui &
- Create a project via the File -> New menu.
- When the “Project Properties” dialog appears, specify your OpenCL binary as the application to launch.
- You should also specify additional directories to search for libraries:
- Click the “Search Directories” tab.
- Search directories for: “All”
- /lib/firmware/mic
- /opt/intel/composer_xe_2013/lib/mic
- Run a Lightweight-hotspots collection.
- Click on the “New Analysis” button. This will bring up the New Analysis dialog (below).
- Scroll down to Knights Corner Platform -> Lightweight hotspots.
- Click Start.
Figure 1. Intel® VTune™ Amplifier XE 2013 screen shot for choosing the analysis type for the Intel® Xeon Phi™ coprocessor
Analyzing your OpenCL application
Once your analysis completes, you should see a view similar to the following. If you click on “Bottom-up” and then choose the grouping as selected below, you will be ready to start tuning your application.
Figure 2. Intel® VTune™ Amplifier XE 2013 Bottom-Up screen shot showing a summary by process/module for the Intel® Xeon Phi™ coprocessor.
Some important comments:
- You should focus on the mic_server process. This process covers all the device-side OpenCL activities. It is generally recommended to filter by this process.
- For the overall activity aggregated in the “CPU Time” chart in the figure (CPU here means “MIC cores”), it is recommended to zoom into and filter by the area of actual kernel execution. In our example analysis, this area is the largest red rectangle in Figure 2.
- Notice that time spent in the mic_server consists of:
- [Dynamic Code], which constitutes the kernels
- Intel® Threading Building Blocks (TBB) costs
- SVML (vector math library that is responsible for most heavy built-ins like math)
- Other functions: for example Linux kernel routines inside vmlinux.
The previous screen shot showed the hotspots of the processes. Now let’s inspect the same trace for the top hotspots over all modules, assuming you already filtered by the mic_server process. This is easy to do by switching to the Top-down Tree view:
Figure 3. Intel® VTune™ Amplifier XE 2013 screen shot of the Top-down Tree view showing the top hotspots for the Intel® Xeon Phi™ coprocessor.
Here you get the top list of hotspots from all modules. In this example, notice that most are from dynamic code (specific kernel names are listed). There is some contribution from the TBB library as well, and finally some heavy math (__ocl_svml_b2_sqrt) attributed to the SVML module.
In general, seeing many entries for TBB in the hotspots breakdown might indicate some inefficiency in work group scheduling. For example, a large contribution from TBB could mean that the work groups are too lightweight. Refer to the section called “Example analysis of Intel Xeon Phi coprocessor cores utilization” for an example analysis of work group parallelism.
If you click on a specific kernel, you can inspect the resulting assembly code. This is useful for locating expensive instructions, for example:
- Heavy math built-ins that are candidates for native or relaxed math experiments. Refer to the Intel SDK for OpenCL Applications XE 2013 Optimization Guide listed in the references at the end of the article.
- If prefetching instructions are costly according to the trace, it is likely that the prefetching itself is inefficient. Consider the dedicated section on HW/SW prefetching in the Optimization Guide.
- Similarly, if you are observing gather/scatter instructions in the instruction hotspots, it is likely your data layout and/or access needs some improvement. Refer to the corresponding section in the Optimization Guide.
- If there are masked instructions in the instruction hotspot regions, it is likely your code suffers from divergent branches and associated penalties. For help on this, refer to the Optimization Guide.
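To illustrate the data layout point: gather instructions typically show up when a vectorized loop reads one field of an array of structures. The C sketch below is host-side and purely illustrative (the type and function names are invented for this example); both loops compute the same sum, but only the structure-of-arrays loop reads memory with unit stride, which vectorizes to packed loads instead of gathers:

```c
#include <stddef.h>

#define NPOINTS 1024

/* Array of structures: fields of one point are adjacent, so a vectorized
 * loop over .x must gather values that are 3 floats apart. */
typedef struct { float x, y, z; } PointAoS;

/* Structure of arrays: all x values are contiguous, so a vectorized loop
 * over x[] uses unit-stride (packed) loads -- no gather needed. */
typedef struct { float x[NPOINTS], y[NPOINTS], z[NPOINTS]; } PointsSoA;

float sum_x_aos(const PointAoS *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += p[i].x;            /* stride of sizeof(PointAoS) bytes */
    return s;
}

float sum_x_soa(const PointsSoA *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += p->x[i];           /* unit stride */
    return s;
}
```

The same transformation applies to OpenCL kernels: restructuring buffers from AoS to SoA lets the implicitly vectorized work items access consecutive addresses.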
General Exploration on Intel Xeon Phi coprocessor using events
In addition to analyzing by hotspot, you can conduct experiments using a variety of hardware events and associated efficiency metrics. For example, analyzing kernels for data read/write misses might help you identify potential improvements in the prefetching code, or better data reuse via tiling.
Figure 4. Intel® VTune™ Amplifier XE 2013 screen shot of event-driven profiling analysis for Intel® Xeon Phi™ coprocessors, with data read misses as the metric of interest.
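As a concrete illustration of tiling for data reuse (host-side C, not SDK code; the matrix and tile sizes are arbitrary), compare a naive matrix transpose with a tiled one. Both produce identical results, but the tiled version revisits each destination cache line within a small block rather than once per full row sweep, which is the kind of restructuring that reduces data read/write misses:

```c
#define DIM  64   /* matrix is DIM x DIM, row-major */
#define TILE 8    /* tile edge; chosen so one tile fits comfortably in cache */

/* Naive transpose: dst is written with a stride of DIM floats, so each
 * destination cache line is revisited once per source row -- for large
 * DIM the line has usually been evicted in between, causing misses. */
void transpose_naive(const float *src, float *dst) {
    for (int i = 0; i < DIM; ++i)
        for (int j = 0; j < DIM; ++j)
            dst[j * DIM + i] = src[i * DIM + j];
}

/* Tiled transpose: processes TILE x TILE blocks, so the handful of source
 * and destination lines a block touches stay resident until it is done. */
void transpose_tiled(const float *src, float *dst) {
    for (int ii = 0; ii < DIM; ii += TILE)
        for (int jj = 0; jj < DIM; jj += TILE)
            for (int i = ii; i < ii + TILE; ++i)
                for (int j = jj; j < jj + TILE; ++j)
                    dst[j * DIM + i] = src[i * DIM + j];
}
```

In an OpenCL kernel the same idea is usually expressed by having each work group process a tile, optionally staging it through local memory.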
The event-driven analysis for the OpenCL application on Intel Xeon Phi coprocessors is conceptually similar to the analysis for the regular native (or offload) application for the coprocessor. For further details, we direct you to this introduction: http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding.
Example analysis of Intel Xeon Phi coprocessor cores utilization
Although looking at individual events is useful, one of the most important general hints from profiling is the aggregated cores utilization. You can select and filter by a region of execution on the timeline. See Figure 2 for a sample screen shot of the timeline, where the approximate area of interest is marked with a red rectangle. You can inspect the level of activity for individual threads, but the overall activity, aggregated in the “CPU Time” chart, is an important metric of efficiency.
Specifically, large holes in the core utilization might indicate insufficient parallel slack, tasks that are too short, too frequent synchronization, or other pitfalls. Let’s consider an example of the overall cores utilization for a custom data-mining application implemented in OpenCL and executed on an Intel Xeon Phi coprocessor. The application uses an iterative algorithm where the data size grows from iteration to iteration. Since early iterations are pretty time consuming, it is particularly important to saturate the available compute resources efficiently. Inspecting the aggregated cores utilization revealed poor utilization, especially for the first five iterations.
Figure 5. Original (aggregated) cores utilization on the VTune™ Amplifier timeline for a custom OpenCL* data-mining application executed on the Intel® Xeon Phi™ coprocessor. The iterative nature of the algorithm and the demand for compute power that grows every iteration are clearly seen.
The Intel SDK for OpenCL Applications XE 2013 Optimization Guide explains the internals of the Intel OpenCL implementation for the Intel Xeon Phi coprocessor (see the reference at the end of the article). Take particular note of how individual work groups are mapped to hardware threads, and of the important recommendation to have a sufficient number of work groups in flight.
In the given example, the actual reason behind the poor utilization is the suboptimal work group size used in this OpenCL application. Specifically, the value of 32, which originated as the NVIDIA* GPU warp size because the application initially targeted GPUs, combined with the input problem size resulted in too few work groups for the number of Intel Xeon Phi coprocessor cores. This is an example of the insufficient parallel slack that we mentioned earlier.
After changing the work group size to 16, which still preserves vectorization, the resulting utilization was improved considerably:
Figure 6. Improved (aggregated) cores utilization on the VTune™ Amplifier timeline for the custom OpenCL* data-mining application executed on the Intel® Xeon Phi™ coprocessor. With the optimized work group size, the device is well saturated after the first algorithm iteration.
Accordingly, the kernel execution time improved by 10-40%, depending on the test.
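The arithmetic behind this work group size fix can be sketched in a few lines of C. The hardware figures and the global size below are illustrative assumptions (a 61-core coprocessor with 4 hardware threads per core and a hypothetical 1-D NDRange of 4096 work items), not the actual application’s values:

```c
/* Illustrative: 61 cores x 4 hardware threads per core = 244 threads. */
#define HW_THREADS (61 * 4)

/* In OpenCL 1.2 the global size must be a multiple of the work group
 * size, so this division is exact. */
int num_work_groups(int global_size, int wg_size) {
    return global_size / wg_size;
}
```

For a global size of 4096, a work group size of 32 yields 128 work groups, fewer than the 244 hardware threads, so many threads sit idle (the insufficient parallel slack mentioned earlier). A work group size of 16 yields 256 work groups, enough to occupy every thread, while 16 work items still fill the coprocessor’s 16-wide single-precision vector unit, so vectorization is preserved.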
For further information see the following documents:
- Information on the general VTune analyzer workflow: http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/amplifierxe/lin/ug_docs/GUID-06A94CB9-56AB-4DB3-B1E4-E03535A50A9C.htm
- Optimization Guide with guidelines for OpenCL* applications targeting Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors: http://software.intel.com/sites/products/documentation/ioclsdk/2013XE/OG/index.htm
Copyright © 2013 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
OpenCL and the OpenCL logo are trademarks of Apple Inc. and are used by permission by Khronos.