Accelerator Overview


* Compute Canada documentation:


What is an accelerator? When should I use it?

On Guillimin, we have two types of hardware accelerators: the Intel Xeon Phi and the Nvidia Kepler GPU. Both types of devices are cards that interface with a node through the PCI-express bus, and are designed to accelerate a computation through massive parallelization. These accelerator devices contain a large number of processing cores, as well as their own internal memory. They are most often used in conjunction with the CPUs of the node to accelerate certain ‘hot spots’ of a computation that require a large number of algebraic operations.

Your code may experience significant performance enhancements using an accelerator if:

  • Your code spends a lot of time doing a task that can be easily divided into hundreds (or more) of parallel tasks (i.e. your application can benefit from the massive parallelism offered by an accelerator), and

  • A large number of algebraic operations are needed per gigabyte of data processed (i.e. the computation is heavy enough to justify transferring the data through the PCI-express bus)
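
The second condition can be made concrete with a back-of-the-envelope estimate. The sketch below is only an illustration: the type `rates_t` and the function names are hypothetical, and every rate passed in is an assumption that you would replace with values measured for your own application. It also ignores any overlap between data transfer and computation.

```c
#include <stdbool.h>

/* Caller-supplied rates. These are assumptions for the estimate,
   not measured Guillimin figures. */
typedef struct {
    double host_gflops;    /* sustained rate on the host CPUs, Gflop/s */
    double accel_gflops;   /* sustained rate on the accelerator, Gflop/s */
    double pcie_gb_per_s;  /* effective PCI-express bandwidth, GB/s */
} rates_t;

/* Time to do the work on the host: no data transfer is needed. */
double host_seconds(double gflop, const rates_t *r)
{
    return gflop / r->host_gflops;
}

/* Time to do the work on the accelerator: compute time plus the time
   to move the data across the PCI-express bus (no overlap assumed). */
double accel_seconds(double gflop, double gigabytes_moved, const rates_t *r)
{
    return gflop / r->accel_gflops + gigabytes_moved / r->pcie_gb_per_s;
}

/* True when the estimated accelerator time beats the host time. */
bool offload_pays_off(double gflop, double gigabytes_moved, const rates_t *r)
{
    return accel_seconds(gflop, gigabytes_moved, r) < host_seconds(gflop, r);
}
```

With the same amount of data moved, a task with few operations per gigabyte is dominated by the transfer term and stays faster on the host, while a task with many operations per gigabyte amortizes the transfer and benefits from the accelerator.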


How does the performance compare between Nvidia Tesla K20s and Xeon Phi 5110Ps?

The performance will be application-dependent (see below). However, the following table shows the theoretical peak performance in terms of floating point operations per second (flops/s) and memory bandwidth.


Nvidia GPU specifications

Intel Xeon Phi specifications



| Device | Single Precision (Tflops/s) | Double Precision (Tflops/s) | Cores (*) | Memory Bandwidth (GB/s) | Memory Size (GB) |
|---|---|---|---|---|---|
| Tesla K20 | | | 13 SMXs, 2496 CUDA cores | | |
| Xeon Phi 5110P | | | | | |

Figure: Comparison between Xeon Phi, K20, and Sandy Bridge CPU

(*) Note that CUDA GPUs employ a different architecture than familiar x86 systems. In contrast to general-purpose x86 processing cores, CUDA GPUs use streaming multiprocessors (SMXs), which are designed to execute blocks of SIMT threads in parallel. Neither SMXs nor CUDA cores can be directly compared to x86 processing cores. The cores of the Xeon Phi, on the other hand, can be compared to the cores in the worker nodes. The Xeon Phi cores are Pentium-generation cores with a clock speed of 1.053 GHz, making each of them slower than the Sandy Bridge cores of the Phase II worker nodes at 2.6 GHz.
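
A lower clock speed does not by itself mean a lower theoretical peak, because the peak also depends on the number of cores and on how many floating point operations each core can issue per cycle. The sketch below shows the usual formula. The function name `peak_gflops` is hypothetical, and apart from the two clock speeds quoted above, any core counts or per-cycle throughputs you pass in are assumptions that should be checked against the vendor specifications.

```c
/* Theoretical peak in Gflop/s:
     cores x clock speed (GHz) x floating point operations per core per cycle.
   The per-cycle figure depends on vector width and on whether fused
   multiply-add is available; it is an input here, not something this
   function knows. */
double peak_gflops(int cores, double clock_ghz, int flops_per_cycle)
{
    return cores * clock_ghz * flops_per_cycle;
}
```

For example, calling `peak_gflops(60, 1.053, 16)` models a hypothetical device with 60 cores at 1.053 GHz issuing 16 operations per cycle, and `peak_gflops(8, 2.6, 8)` models a hypothetical 8-core CPU at 2.6 GHz issuing 8 operations per cycle. The many slow, wide cores come out well ahead in this idealized comparison, although real applications rarely reach the theoretical peak.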

Please see this article or this article for a practical benchmark comparison between similar accelerators.


How do I choose between Nvidia GPUs and Intel Xeon Phis?

Nvidia GPUs and Intel Xeon Phis are accelerators with different hardware architectures designed to address different types of problems. There is considerable overlap in their capabilities, but it is important to understand their differences in order to choose the accelerator best suited to your problem.

Nvidia GPUs are specifically designed for solving problems that can be expressed in a single-instruction, multiple-thread (SIMT) model. For example, processing a large vector of data where each element of the vector can be treated independently maps easily onto the SIMT model. Note that the K20 GPUs are also capable of multiple-instruction, multiple-thread (MIMT) processing through asynchronous CUDA streams or the new Hyper-Q capability in CUDA 5. Nvidia has developed a mature ecosystem for developing applications on CUDA-enabled GPUs, including a programming model (CUDA-C) as well as profilers, debuggers, libraries, examples, and other useful applications. Because CUDA GPUs have been available for a longer period of time, there is a richer collection of highly optimized third-party libraries and applications available for them. Unfortunately, the parallel programming model for GPUs is specialized, and the learning curve is steeper for newcomers than with the Xeon Phi.
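
The vector example above can be sketched in plain C to show the shape of a SIMT computation. This is not CUDA code and does not run on a GPU: the function names are hypothetical, and the loop is a serial stand-in for what a GPU would do with one hardware thread per element.

```c
#include <stddef.h>

/* The work of one "thread": every thread executes this same instruction
   stream, each on its own element i, with no dependence on other elements. */
static void axpy_element(size_t i, float a, const float *x, float *y)
{
    y[i] = a * x[i] + y[i];
}

/* Serial stand-in for launching n threads. On a GPU this loop would
   disappear and each index i would be assigned to its own thread. */
void axpy_all(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; ++i)
        axpy_element(i, a, x, y);
}
```

Because no iteration reads a value written by another, the elements can be processed in any order or all at once, which is the property that makes a problem a good match for the SIMT model.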

The Intel Xeon Phi has a less specialized architecture than a GPU, and is designed to be familiar to anyone who has experience with parallel programming in an x86 environment. The Phi contains Intel Pentium-generation cores and runs a version of the Linux operating system. Thus, it can execute parallel code written for ‘normal’ computers using a wide variety of modern and legacy programming models including Pthreads, OpenMP, MPI, and even OpenCL, which is more commonly associated with GPUs. So, you may be able to port your applications to the Phi without much modification. However, optimizing your application specifically for the Phi is still recommended to achieve the best performance. Because the Phi is compatible with standard x86 hardware, Phi programmers can enjoy using their favourite compilers, profilers, and debuggers. The Xeon Phi supports an offload programming model similar to how GPUs are used, but programs can also be run natively on the card. Some of the most exciting new capabilities of the Kepler generation of Nvidia GPUs (Hyper-Q and dynamic parallelism) are quite natural on the Xeon Phi.
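
As an illustration of the kind of ‘normal’ parallel code meant here, the following is a minimal OpenMP sketch of a dot product. The function name `dot` is hypothetical, and nothing in it is specific to the Phi: the same source is meant to be compiled for an ordinary multi-core node or for the card, with the build details depending on your compiler. If OpenMP is not enabled at compile time, the pragma is ignored and the loop simply runs serially.

```c
/* Dot product with an OpenMP reduction. Each thread accumulates a
   private partial sum, and the partial sums are combined when the
   parallel loop ends. */
double dot(const double *x, const double *y, long n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

Note that with floating point data the result may differ in the last digits from run to run, because the order in which the partial sums are combined is not fixed.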


How can my code/applications use an accelerator? How much effort is required to port my code?

There are four main ways to use an accelerator to speed up your computation:

  1. Explicit programming: The programmer writes explicit instructions for the accelerator device to execute as well as instructions for transferring data to and from the device (e.g. CUDA-C for GPUs or OpenMP+Cilk Plus for Phis). This method requires the most effort and knowledge from programmers because algorithms must be ported and optimized on the accelerator device.

  2. Accelerator-specific pragmas/directives: Accelerator code is automatically generated from your serial code by a compiler (e.g. OpenACC, OpenMP 4.0). For many applications, adding a few lines of code (pragmas/directives) can result in good performance gains on the accelerator.

  3. Accelerator-enabled libraries: You only need to call the library; no explicit accelerator programming is necessary once the library has been written. The programmer effort is similar to using a non-accelerator-enabled scientific library.

  4. Accelerator-aware applications: These software packages have been programmed by other scientists/engineers/software developers to use accelerators and may require little or no programming for the end-user.
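
To give a feel for the directive-based approach (option 2), here is a minimal sketch using the OpenMP 4.0 `target` construct. The function name `scale_offload` is hypothetical, and whether the loop actually runs on an accelerator depends on your compiler and how it was configured: without offload support, or without OpenMP enabled, the same code runs on the host.

```c
/* Scale an array in place, asking the compiler to offload the loop.
   The map clause describes the data movement: data[0:n] is copied to
   the device before the loop and copied back afterwards. */
void scale_offload(float *data, int n, float factor)
{
    #pragma omp target teams distribute parallel for map(tofrom: data[0:n])
    for (int i = 0; i < n; ++i)
        data[i] *= factor;
}
```

The loop body is unchanged from the serial version; the only addition is the single directive line. With explicit programming (option 1), by contrast, both the data transfers and the device code would be written by hand.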


I have an application that can use an accelerator. How should I submit my job?

Please see our separate documentation for more specific details about how to use our GPU or Xeon Phi nodes.