CentOS 6.3 environment (Phase 1 + 2)


Phase 1 and 2 Hardware Overview

The following tables summarize the new hardware available in Phase 2, which runs CentOS Linux version 6.3.

Compute nodes:

Node Type | Count | Processors | Total cores | Memory (GB/core) | Total memory (GB) | Network
SW2 | 216 | Dual Intel Sandy Bridge EP E5-2670 (8-core, 2.6 GHz, 20MB Cache, 115W) | 3456 | 4 | 13,824 | InfiniBand QDR 2:1 blocking
LM2 | 146 | Dual Intel Sandy Bridge EP E5-2670 (8-core, 2.6 GHz, 20MB Cache, 115W) | 2304 | 8 | 18,432 | InfiniBand QDR 1:1 blocking
XLM2 | 6 | Dual Intel Sandy Bridge EP E5-2670 (8-core, 2.6 GHz, 20MB Cache, 115W) | 96 | 16 | 1,536 |
XLM2 | 1 | Quad Intel Sandy Bridge EP E5-4620 (8-core, 2.2 GHz, 16MB Cache, 95W) | 32 | 12 | 384 |
XLM2 | 1 | Quad Intel Sandy Bridge EP E5-4620 (8-core, 2.2 GHz, 16MB Cache, 95W) | 32 | 32 | 1,024 |

Accelerator nodes:

 

Node Type | Count | Processors | Total cores | Memory (GB/core) | Total memory (GB) | Cards | Total Peak SP FP (TFlops) | Total Peak DP FP (TFlops) | Network
AW nVidia | 25 | Dual Intel Sandy Bridge EP E5-2670 (8-core, 2.6 GHz, 20MB Cache, 115W) | 400 | 8 | 3,200 | Dual nVidia K20 (Peak SP FP: 3.52 TFlops, Peak DP FP: 1.17 TFlops) | 176 | 58.5 | InfiniBand QDR 2:1 blocking
AW nVidia | 25 | Dual Intel Sandy Bridge EP E5-2670 (8-core, 2.6 GHz, 20MB Cache, 115W) | 400 | 4 | 1,600 | Dual nVidia K20 (Peak SP FP: 3.52 TFlops, Peak DP FP: 1.17 TFlops) | 176 | 58.5 | InfiniBand QDR 2:1 blocking
AW Intel | 50 | Dual Intel Sandy Bridge EP E5-2670 (8-core, 2.6 GHz, 20MB Cache, 115W) | 800 | 4 | 3,200 | Dual Intel Phi 5110P (60 cores, 1.053 GHz, 30 MB cache, 8 GB memory, Peak SP FP: 2.0 TFlops, Peak DP FP: 1.0 TFlops) | 200 | 100 | InfiniBand QDR 2:1 blocking

The following table summarizes the Phase 1 hardware that is now accessible from the Phase 2 login nodes:

Compute nodes:

Node Type | Count | Processors | Total cores | Memory (GB/core) | Total memory (GB) | Network
LM | 189 | Dual Intel Westmere EP Xeon X5650 (6-core, 2.66 GHz, 12MB Cache, 95W) | 2268 | 6 | 13,608 | InfiniBand QDR 1:1 blocking
SW | 468 | Dual Intel Westmere EP Xeon X5650 (6-core, 2.66 GHz, 12MB Cache, 95W) | 5616 | 3 | 16,848 | InfiniBand QDR 2:1 blocking
HB | 400 | Dual Intel Westmere EP Xeon X5650 (6-core, 2.66 GHz, 12MB Cache, 95W) | 4800 | 2 | 9,600 | InfiniBand QDR 1:1 blocking

 

For an overview of all of Guillimin’s Phase 1 and Phase 2 hardware, see our documentation page.

 

How to get access to the CentOS 6.3 Environment

As of January 27, 2014, all Guillimin users have access to the CentOS 6.3 environment (all Phase 2 nodes and all Phase 1 nodes).

How to login to the CentOS 6.3 Environment

Access to the CentOS 6.3 login nodes is provided through Secure Shell (ssh). The login process is similar to that of Phase 1 (see our documentation); however, users must specify a different host name in place of guillimin.clumeq.ca. For example, from the Unix command line:

ssh guillimin-p2.hpc.mcgill.ca -l username

or

ssh guillimin5.hpc.mcgill.ca -l username
ssh guillimin6.hpc.mcgill.ca -l username
ssh guillimin7.hpc.mcgill.ca -l username
ssh guillimin8.hpc.mcgill.ca -l username

 

Disk Storage Setup

On the Phase 1 and 2 CentOS 6.3 nodes, you have access to various storage spaces, depending on the location of your project space(s):

  • /sb/home or /home
  • /gs/scratch
  • /gs/project
  • /sb/project
  • /lb/project

 

Your home directory is the same for both Phase 1 and 2.
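
For example, a quick way to see how much space is free on these file systems is the standard df command (mount points that do not exist on a given node will simply report "No such file or directory"):

$ df -h /home /gs/scratch /gs/project /sb/project /lb/project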

 

Operating system and software environment

While some of our Phase 1 nodes are still based upon the CentOS 5 operating system, the bulk of the Phase 1 systems and all of the Phase 2 systems are running CentOS Linux version 6.3. Some software compiled for CentOS 5 may run under the new operating system; however, to ensure the best performance and to prevent possible errors, we strongly recommend rebuilding your software for the new environment. By the end of March 2014, we will phase out CentOS 5 and migrate all Phase 1 nodes to CentOS 6.3 so that there is a consistent environment across all of the compute nodes.

Nearly all of the software packages available for Phase 1 have been installed for Phase 2 and are available through our module system (Use the command ‘module av’ for a complete list). Default versions for modules have now been updated to the latest available stable versions. For example, loading the module "ifort_icc" will now give you ifort_icc/14.0.1. To continue using the old modules you have to explicitly specify the older version, for example, ifort_icc/12.0.4. For more details on our software, including compilers, libraries, tools, and applications please see our documentation.
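
For example, a typical session for listing the available software and switching between compiler versions (using the ifort_icc module mentioned above) looks like this:

$ module av                      # list all available modules
$ module add ifort_icc           # loads the current default, ifort_icc/14.0.1
$ module list                    # show what is currently loaded
$ module rm ifort_icc            # unload it again
$ module add ifort_icc/12.0.4    # or request the older version explicitly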

If you are using our new accelerator nodes (GPU or Xeon Phi), please note the availability of the following modules (a short example session follows the list):

  • The CUDA Toolkit 5.0 (including CUDA-C and OpenCL compilers) - module add CUDA_Toolkit
  • PGI accelerator compilers (including OpenACC and CUDA-fortran) - module add pgiaccel
  • Intel compilers version 14.0.4 (including support for Xeon Phi)  - module add ifort_icc/14.0.4
  • Xeon Phi development environment variables - module add MIC
  • Intel SDK for OpenCL applications 1.2 (OpenCL compilers and ICD) - module add intel_opencl MIC
  • Intel MPI 5.0.1 (including support for Xeon Phi) - module add ifort_icc/14.0.4 intel_mpi MIC
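
As an illustration only (the module names are those listed above, while the source files, output names, and compiler invocations are generic examples rather than a prescribed build recipe), building a simple CUDA-C program and a native Xeon Phi program might look like:

$ module add CUDA_Toolkit
$ nvcc -o hello_gpu hello.cu           # CUDA-C compiler from the CUDA Toolkit

$ module add ifort_icc/14.0.4 MIC
$ icc -mmic -o hello_mic hello.c       # -mmic builds a native Xeon Phi executable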

 

Batch System Setup

Job submission in the new CentOS 6.3 environment is very similar to the Phase 1 job submission environment, which is based upon CentOS 5.8. Please see our Phase 1 job submission documentation.

Please Note: The environment described below has been configured for the purpose of making the Phase 2 systems accessible and testing them. Future changes to the scheduler are underway to simplify the experience of working with the batch submission system.

Please Note: We are changing from msub to qsub as the recommended submission program. Specifying a RAP ID of the form #PBS -A xyz-123-ab in your submission script is compulsory for all jobs submitted with qsub.
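
For reference, a minimal submission script might look like the following sketch (xyz-123-ab is the placeholder RAP ID from above; the job name, program, and resource requests are examples to be adapted to your own work):

#!/bin/bash
#PBS -A xyz-123-ab
#PBS -N my_job
#PBS -q sw
#PBS -l nodes=1:ppn=16
#PBS -l walltime=24:00:00
cd $PBS_O_WORKDIR      # start in the directory the job was submitted from
./my_program           # placeholder for your own executable

The script is then submitted with, for example, qsub ./submit_script.sh.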

The main difference between Phase 1 and Phase 2 hardware is that the new nodes have 16 cores per node instead of the 12 cores per node of Phase 1. We now also offer an extra large memory queue (xlm2), an Intel Xeon Phi queue (phi), and an nVidia Tesla K20 GPU queue (k20). There is a phase 2 debug queue (debug).
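
To see the full list of queues currently defined in the batch system, along with their limits, the standard Torque command can be used:

$ qstat -q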

 

Example partial-node sw job (goes to the Phase 1 sw compute nodes only, so 3 GB of memory per core is available):

$ qsub -q sw -l nodes=1:ppn=1 ./submit_script.sh

 

Example whole-node job (use queue sw or, if more memory is required, lm or xlm2):

$ qsub -q lm -l nodes=1:ppn=16 ./submit_script.sh

Example whole-node job for Phase 1 compute nodes (use queue sw or, if less memory is required, hb):

$ qsub -q sw -l nodes=1:ppn=12 ./submit_script.sh

 

Example extremely large memory jobs (i.e., jobs requiring the 384 GB or 1024 GB nodes):

$ qsub -q xlm2 -l nodes=1:ppn=32:m384G,mem=378gb 
$ qsub -q xlm2 -l nodes=1:ppn=32:m1024G,mem=1009gb 

 

Example Intel Xeon Phi job:

$ qsub -q phi -l nodes=1:ppn=16:mics=2 ./submit_script.sh

 

Example nVidia K20 GPU job:

$ qsub -q k20 -l nodes=1:ppn=16:gpus=2 ./submit_script.sh

 

Note that the phi and k20 queues are currently configured for whole node scheduling. If your job requests fewer resources than the entire node, your job will still receive the resources of the entire node. Please try to pack your jobs to use these resources effectively if possible. Each accelerator node has two devices (two Xeon Phis, or two K20 GPUs) and 16 cores.
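
For example (a sketch only, with placeholder program and input names), a whole-node k20 job could keep both GPUs busy by launching one process per device and selecting the device through the standard CUDA_VISIBLE_DEVICES environment variable:

#!/bin/bash
#PBS -A xyz-123-ab
#PBS -q k20
#PBS -l nodes=1:ppn=16:gpus=2
cd $PBS_O_WORKDIR
CUDA_VISIBLE_DEVICES=0 ./my_gpu_program input0 &   # first K20
CUDA_VISIBLE_DEVICES=1 ./my_gpu_program input1 &   # second K20
wait                                               # wait for both runs to finish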

The maximum walltime of a job in the CentOS 6.3 environment is currently 30 days, with the exception of the xlm2 nodes, where it is 7 days. The debug queue maximum walltime is 2 hours.
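
Walltime is requested in the usual Torque/PBS way as part of the resource list. For example, to request 7 days (168 hours) on the sw queue:

$ qsub -q sw -l nodes=1:ppn=16,walltime=168:00:00 ./submit_script.sh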

 

How to submit jobs from the worker nodes (jobs within jobs)

On Phase 1, jobs can be submitted from the worker nodes by using the msub command on the gm-schrm server through ssh. The process is similar on the combined Phase 1 + 2 CentOS 6.3 environment, but gm-schrmat is used instead of gm-schrm. From within your job’s submission script, the job can spawn a child job using the following command:

ssh gm-schrmat qsub -q sw -l nodes=1:ppn=16 ./child_job_submission_script
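
Putting this together, a parent job that launches a follow-up job once its own work is done could look like the following sketch (the program and child script names are placeholders):

#!/bin/bash
#PBS -A xyz-123-ab
#PBS -q sw
#PBS -l nodes=1:ppn=16
cd $PBS_O_WORKDIR
./my_program                      # the parent job's actual work
# spawn the child job through the scheduler server
ssh gm-schrmat qsub -q sw -l nodes=1:ppn=16 $PBS_O_WORKDIR/child_job_submission_script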

 

Contact Us - for questions and feedback

 


 

Please contact us not only if there are problems or issues, but also to let us know if your software experiences a speed-up on the new equipment, or if you can recommend any improvements to the configuration. Do not hesitate to let us know about your experiences on the phase 2 system.