Using Xeon Phis on Guillimin

Xeon Phi Hardware on Guillimin

The following summarizes the Phase 2 Xeon Phi nodes available on Guillimin:

  • Node Type: AW (Intel)
  • Count: 50
  • Total cores: 800
  • Memory: 4 GB/core (64 GB/node)
  • Total memory: 3,200 GB
  • Cards: Dual Intel Xeon Phi 5110P per node (60 cores, 1.053 GHz, 30 MB cache, 8 GB memory; Peak SP FP: 2.0 TFlops, Peak DP FP: 1.0 TFlops per card)
  • Total Peak SP FP: 200 TFlops
  • Total Peak DP FP: 100 TFlops

Full Xeon Phi device specifications are found in Intel's documentation.

 

Submitting Xeon Phi jobs

Submitting a Xeon Phi job is similar to submitting a regular job. The main difference is that the submission script and/or the qsub command should specify the "mics" resource in the resource list (see the example below).

Example Intel Xeon Phi job:

$ qsub -l nodes=1:ppn=16:mics=2 ./submit_script.sh

This job is automatically routed to the phi queue. Each accelerator node has two devices (two Xeon Phis, or two K20 GPUs) and 16 cores. You may request one or both accelerators on any node in your job, but you will not be able to access accelerators that have not been requested. You must request at least one CPU core per node to gain access to the Xeon Phis.


If you use a pmem value (in general this is not necessary for Phi jobs, as all Phi nodes have 64 GB of memory), please note that it applies per node rather than per core.
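
For reference, the "mics" resource request can also be carried in the submission script itself. The following is only a sketch: the walltime, account string, job name, and the program run at the end are placeholders, not values specific to Guillimin.

#!/bin/bash
# submit_script.sh: minimal sketch of a Xeon Phi submission script.
# Replace the walltime, account, and job name with your own values.
#PBS -l nodes=1:ppn=16:mics=2
#PBS -l walltime=1:00:00
#PBS -A xyz-123-aa
#PBS -N phi_example

cd $PBS_O_WORKDIR
module add ifort_icc
./your_phi_program

With the directives in the script, the job can then be submitted with a plain "qsub ./submit_script.sh"; resources given on the qsub command line normally override the corresponding directives in the script.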

 

Xeon Phi software modules

If you are using our Xeon Phi nodes, please note the availability of the following modules (NOTE: as of 2014-09-30 the MIC module is no longer necessary):

  • Intel compilers version 14.0.4 (including support for Xeon Phi)  - module add ifort_icc/14.0.4
  • Intel SDK for OpenCL applications 1.2 (OpenCL compilers and ICD) - module add intel_opencl
  • Intel MPI 5.0.1 (including support for Xeon Phi) - module add ifort_icc/14.0.4 intel_mpi

 

Offload Mode Jobs

In offload mode, the accelerator is sent work (computational hotspots) by a process running on the host CPU(s). To use this mode, special instructions (such as directives or pragmas) must be used in the source code to indicate to the compiler how the accelerator is to be used. Please see our training materials for examples.

$ module add ifort_icc
$ icc -o offload -openmp offload.c
$ ./offload
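
For illustration, an offload.c along the following lines would build and run with the commands above. This is only a sketch using Intel's offload pragma, not the exact source from our training materials.

/* offload.c: minimal sketch of directive-based offload (illustrative only). */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int nthreads = 0;

    /* The marked region is sent to the first Xeon Phi card (mic:0);
       the scalar nthreads is copied back to the host automatically. */
    #pragma offload target(mic:0)
    #pragma omp parallel
    {
        #pragma omp master
        nthreads = omp_get_num_threads();
    }

    printf("Offload region used %d threads\n", nthreads);
    return 0;
}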

 

Native Mode Jobs

A native mode program is compiled for execution directly on the Xeon Phi and does not normally use any host resources. Often, parallel code does not need to be modified from its host-only version to compile and run in native mode. To compile for native mode, use the -mmic compiler flag:

$ module add ifort_icc
$ icc -mmic -o mm_omp.MIC -openmp mm_omp.c
$ micnativeloadex ./mm_omp.MIC

It is a good policy to add a .MIC extension to MIC binaries so they are not confused with CPU binaries. micnativeloadex is a program supplied by Intel that attempts to copy any required libraries over to the MIC device and then runs the MIC binary. In some cases, users may wish to set their paths manually instead of using micnativeloadex:

$ scp mm_omp.MIC mic0:~/.
$ ssh mic0 "export LD_LIBRARY_PATH=/software/compilers/Intel/2013-sp1-14.0/lib/mic:$LD_LIBRARY_PATH; ./mm_omp.MIC"
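
For reference, an mm_omp.c along these lines (a sketch, not necessarily the exact source used above) needs no source changes between the host and native MIC builds; only the -mmic flag differs.

/* mm_omp.c: illustrative OpenMP matrix multiply for native mode builds. */
#include <stdio.h>
#include <omp.h>

#define N 512

/* Static arrays keep the data off the stack; about 6 MB total for N = 512. */
static double a[N][N], b[N][N], c[N][N];

int main(void)
{
    int i, j, k;
    double t0;

    /* Initialize the input matrices. */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            a[i][j] = i + j;
            b[i][j] = i - j;
            c[i][j] = 0.0;
        }

    t0 = omp_get_wtime();

    /* Plain OpenMP parallel loop: no offload directives are needed, since
       the whole program runs on whichever target it was compiled for. */
    #pragma omp parallel for private(j, k)
    for (i = 0; i < N; i++)
        for (k = 0; k < N; k++)
            for (j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];

    printf("%d threads, %.3f seconds\n", omp_get_max_threads(), omp_get_wtime() - t0);
    return 0;
}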

To use MPI in native mode, load the intel_mpi module and set the I_MPI_MIC environment variable:

$ module add intel_mpi ifort_icc
$ export I_MPI_MIC=enable
$ export I_MPI_MIC_POSTFIX=.MIC
$ mpiicc -mmic -o hello.MIC hello.c
$ mpirun -n 60 -host mic0 ./hello
MIC: Hello from aw-4r12-n37-mic0 4 of 60
MIC: Hello from aw-4r12-n37-mic0 21 of 60
MIC: Hello from aw-4r12-n37-mic0 30 of 60
...
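
A hello.c consistent with the output above could look like the following sketch (not necessarily the exact source used here). It uses the Intel compiler's __MIC__ macro, predefined when compiling with -mmic, to label ranks running on the coprocessor, which also makes it suitable for the symmetric mode example below.

/* hello.c: minimal MPI hello-world sketch matching the output shown above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);

#ifdef __MIC__
    /* __MIC__ is predefined by the Intel compiler when building with -mmic */
    printf("MIC: Hello from %s %d of %d\n", name, rank, size);
#else
    printf("CPU: Hello from %s %d of %d\n", name, rank, size);
#endif

    MPI_Finalize();
    return 0;
}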

 

Symmetric Mode

It is possible to have MPI processes run on both the host cores and the accelerator cores. Note that the different speeds of these cores can create load-balancing problems that should be addressed when designing MPI code for symmetric mode use on Xeon Phis. To use symmetric mode, compile the MPI code separately for the host and for the device. All of the advice for running MPI in native mode also applies:

$ module add intel_mpi ifort_icc
$ export I_MPI_MIC=enable
$ export I_MPI_MIC_POSTFIX=.MIC
$ mpiicc -o hello hello.c
$ mpiicc -mmic -o hello.MIC hello.c
$ mpirun -perhost 1 -n 4 -host $(cat $PBS_NODEFILE $PBS_MICFILE | tr '\n' ',') ./hello
CPU: Hello from aw-4r12-n40 1 of 4
CPU: Hello from aw-4r12-n37 0 of 4
MIC: Hello from aw-4r12-n37-mic0 2 of 4
MIC: Hello from aw-4r12-n40-mic0 3 of 4

Note: if the job is contained within a single node (that is, nodes=1, with the MPI processes running on the host and one or both devices, or on the devices only), you may or may not get better performance (higher bandwidth, but also higher latency) by setting

$ export I_MPI_FABRICS=shm:dapl
$ export I_MPI_DAPL_PROVIDER=ofa-v2-scif0

before running the mpirun command instead of

$ export I_MPI_FABRICS=shm:tmi

which is set by default when you load the intel_mpi module.

Note 2: by default, the tmi (PSM) protocol supports up to 13 MPI processes in total per host and its Phi cards combined. This limit can be raised to 52, 39, or 26 (that is, 13 × 4, 13 × 3, or 13 × 2), at a small cost in efficiency, by setting:

$ export PSM_RANKS_PER_CONTEXT=4

(or 3, or 2, respectively).

Xeon Phi Training/Education

Please see our recent Xeon Phi training event materials for more information about how to use Intel Xeon Phi co-processors effectively for your research. General information about parallel programming with MPI or OpenMP can also be found in our training materials.

 

Recommended Reading