Submitting your Job


Job management commands

Computational jobs on Guillimin are run in batch mode (deferred execution, as opposed to interactive use). Jobs are submitted to queues and then run depending on a number of factors controlled by the scheduler. The scheduler (MOAB) ensures that the cluster resources are used in the most effective way while being shared fairly between users. You manage your jobs on the cluster (submit, check status, kill, etc.) ONLY through a special set of scheduler commands.

We summarize the basic MOAB and Torque commands below (further details can be found with man qsub, or man followed by any of the commands given below).
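For example, a few commonly used commands (job IDs and script names below are placeholders):

qsub script_file       # submit the job described in script_file
qstat -u $USER         # list your jobs and their current states
showq -u $USER         # show your jobs as seen by the MOAB scheduler
checkjob <jobid>       # show detailed information about a given job
canceljob <jobid>      # kill a queued or running job (qdel <jobid> also works)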

Job queues on Guillimin

The computing nodes of the Guillimin cluster are grouped into several partitions:

Partition | Memory per core, Westmere nodes (ppn=12) | Memory per core, Sandy Bridge nodes (ppn=16)
Serial Workload (SW, SW2) - for serial jobs and "light" parallel jobs | 3 GB | 4 GB
High Bandwidth (HB) - for massively parallel jobs | 2 GB | -
Large Memory (LM, LM2) - for jobs requiring a large memory footprint | 6 GB | 8 GB
Extra Large Memory (XLM2) - limited selection of extra large memory nodes | - | 12, 16, or 32 GB
Accelerated Workload (AW) - nodes with GPUs and Xeon Phis | - | 4 or 8 GB

The queues on Guillimin are organized accordingly. Depending on the type and requirements of your code, you submit your job to one of the queues described below.

In general, the default queue (with no explicit queue name specified) is recommended in order to minimize waiting times. The scheduler will steer the job to a fitting node based on three parameters: the walltime, the number of processors per node (ppn), and the minimum amount of memory needed per core (pmem). The sw, hb, lm, aw, and xlm2 queues exist as shorthands for running on specific node types and for backwards compatibility.
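For example, a job submitted to the default queue might specify the three routing parameters like this (the resource values are only illustrative):

#PBS -l walltime=24:00:00
#PBS -l nodes=1:ppn=12
#PBS -l pmem=2700m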

The debug queue is a special one. It was created to let you test your code before you submit it for a long run. Jobs submitted to the debug queue should normally start almost immediately, so you can quickly see whether your program behaves as you expect. There are strict resource and time limitations for this queue, though: the default running time is only 30 minutes and the maximum is 2 hours. If the parameters in your submission file exceed these limits, your job will be rejected!
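For example, a short test could be sent to the debug queue as follows (the walltime shown is illustrative and must stay within the 2-hour limit):

qsub -q debug -l walltime=00:30:00 script_file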

IMPORTANT:

Tables showing how queues map to nodes

We advise beginning users to read the sections below to get started, but more advanced users will find it useful to know which queues map to which compute nodes. The first table shows the mapping at the highest level. In general, jobs are categorized as "serial" jobs of fewer than 12 cores on a single node (nodes=1:ppn<12), parallel jobs with 12 cores per node (ppn=12) for Westmere nodes, parallel jobs with 16 cores per node (ppn=16) for Sandy Bridge nodes, and parallel jobs with an unspecified number of cores per node (procs=n).

Queue | nodes=1:ppn<12 | ppn=12 | ppn=16 | procs=n, n≥12
default (metaq) | SW, SW2, AW | SW, HB, LM | SW2, LM2, XLM2 | SW, HB, LM, SW2, LM2, XLM2
hb | NOT ALLOWED | HB | SW2 | HB
sw | SW, SW2 | SW | SW2 | SW, SW2
lm | NOT ALLOWED | LM | LM2 | LM, LM2
aw | AW | AW | AW | AW

The behaviour of the default queue depends first on whether the job is a "serial" (<12 cores) job or not. A serial job will most likely run on one of around 250 Westmere SW nodes, but it can also run on one of around 45 Sandy Bridge SW2 nodes or, for shorter-duration jobs, on one of around 80 Sandy Bridge AW nodes. By specifying the "sandybridge" or "westmere" attribute, a job can be forced to run on a particular node type (see the example after the table below). Depending on the walltime, the job is routed to either the internal serial-short or sw-serial queue. Note: you should not submit directly to such internal queues, but you will see them appearing in checkjob and other commands.

PBS -l value (with n<12) | walltime ≤ 36h (serial-short) | walltime > 36h (sw-serial)
nodes=1:ppn=n | SW, SW2, AW | SW
nodes=1:ppn=n:westmere | SW | SW
nodes=1:ppn=n:sandybridge | SW2, AW | SW2
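For example, to force a short serial-type job onto a Sandy Bridge node (the core count and walltime are only illustrative):

#PBS -l nodes=1:ppn=4:sandybridge
#PBS -l walltime=12:00:00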

For parallel jobs using 12 cores or more, the behaviour depends on the pmem value (the minimum number of megabytes of memory required per core), the ppn value, and the walltime. Shorter jobs with low memory requirements can run almost anywhere on the cluster. Longer-duration jobs will only run on the smallest node type that fits the stated memory requirements, so that they do not occupy over-specified nodes. Note that jobs with high memory requirements (>=5700m) always have a higher priority than jobs with lower memory requirements, so even short-duration low-memory jobs enter LM and LM2 nodes last.

pmem value (internal queue) | ppn=12 (walltime ≤ 72h) | ppn=16 (≤ 72h) | procs (≤ 72h) | ppn=12 (walltime > 72h) | ppn=16 (> 72h) | procs (> 72h)
1700m(*) (hbplus, hb) | HB, SW, LM | SW2, LM2 | HB, SW, SW2, LM, LM2 | HB | SW2 | HB
2700m(†) (swplus, sw-parallel, sw2-parallel) | SW, LM | SW2, LM2 | SW, SW2, LM, LM2 | SW | SW2 | SW
3700m(‡) (sw2plus, sw2-parallel) | LM | SW2, LM2 | SW2, LM, LM2 | LM | SW2 | SW2
5700m (lm, lm2) | LM | LM2 | LM, LM2 | LM | LM2 | LM, LM2
7700m (lm2) | NOT ALLOWED | LM2 | LM2 | NOT ALLOWED | LM2 | LM2
>7800m (xlm2, see below) | XLM2 | XLM2 | XLM2 | XLM2 | XLM2 | XLM2

(*) default if procs>12 or nodes>1 (such jobs need to communicate over the InfiniBand network, so "HB" nodes are useful).
(†) default if procs=12 or nodes=1:ppn=12.
(‡) default if ppn=16.
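As an illustration, the following (hypothetical) resource request asks for 24 cores anywhere on the cluster with 2700 MB per core for a 48-hour run; according to the table above it can be placed on SW, SW2, LM, or LM2 nodes:

#PBS -l procs=24
#PBS -l pmem=2700m
#PBS -l walltime=48:00:00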

Job Accounting

For any user with an NRAC/LRAC allocation, a special account with the Resource Allocation Project identifier (RAPid) has been created within the Compute Canada Database. These special accounts have correspondingly been set up within the scheduler environment of Guillimin and must be specified as part of the job submission for users to access their NRAC/LRAC allocation. If no RAPid is specified, the job will be rejected. Specifying the RAPid for your project is important in order to have the job scheduled with the priority assigned to the project.

The RAPid can be specified either as part of the job submission script or on the command line as an option to the qsub command. For example, if your RAPid is xyz-123-ab, you could include the following line at the beginning of your job script:

#PBS -A xyz-123-ab

You could also specify the RAPid on the command line as an argument to the qsub command:

qsub -A xyz-123-ab script_file

Examples

Submitting a single-processor job

You would create a job submission file, e.g. script_file, with the following content:

#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -l walltime=12:00:00
#PBS -A xyz-123-ab
#PBS -o outputfile
#PBS -e errorfile
#PBS -N jobname

module load ...
cd $PBS_O_WORKDIR
./your_app
# OR use an interpreter for your code: bash, perl, python, Rscript, sas
interpreter your_code

The statements in your script_file above have the following meaning:

#PBS -l nodes=1:ppn=1 - request one core (ppn=1) on a single node
#PBS -l walltime=12:00:00 - set the maximum running time (here 12 hours); the job is terminated when this limit is reached
#PBS -A xyz-123-ab - charge the job to your RAPid account
#PBS -o outputfile and #PBS -e errorfile - write the standard output and error streams of the job to the named files
#PBS -N jobname - give the job a name that is displayed by the queue commands
module load ... - load the environment modules needed by your application
cd $PBS_O_WORKDIR - change to the directory from which the job was submitted
The last lines start your application, either directly or through an interpreter (bash, perl, python, Rscript, sas).

You submit your job with the following command:

qsub script_file

Note: as of September 27, 2013, you can no longer submit serial jobs to the "hb" and "lm" queues.

Note: use #PBS -l nodes=1:ppn=1:sandybridge if you wish to run on a newer Sandy Bridge node; all serial jobs submitted this way are now accepted, and shorter-duration jobs (presently less than 36 hours) are also allowed to run on accelerator nodes.

Submitting an OpenMP job

Submitting programs compiled with OpenMP options is similar to submitting serial jobs. The difference is that you now reserve more than one core on the node. The following sample script submits OpenMP code to be executed on 4 cores.

#!/bin/bash
#PBS -l nodes=1:ppn=4
#PBS -A xyz-123-ab
#PBS -o outputfile
#PBS -e errorfile
#PBS -N jobname

module load ...
cd $PBS_O_WORKDIR

export OMP_NUM_THREADS=4
./your_omp_app
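If you prefer not to hard-code the thread count, you can derive it from the node file provided by Torque; this is a sketch assuming the usual $PBS_NODEFILE layout of one line per reserved core:

# $PBS_NODEFILE lists one line per reserved core, so the count matches ppn
export OMP_NUM_THREADS=$(wc -l < $PBS_NODEFILE)
./your_omp_app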

IMPORTANT:

Submitting MPI jobs

Parallel programs compiled with MPI libraries are conceptually different from serial or OpenMP codes. They normally run across multiple computing nodes, with data and instruction exchange performed over the cluster network. Therefore more than one physical computing node is usually reserved for the job, and the executable itself is started with the help of a special launcher, which handles the data exchange between the processes.

Here is an example submission script for an MPI job, which sends your program to be executed on 48 (3*12 or 4*16) CPU cores:

#!/bin/bash
#PBS -l procs=48
#PBS -l pmem=1700m
#PBS -A xyz-123-ab
#PBS -o outputfile
#PBS -e errorfile
#PBS -N jobname

module load ...
cd  $PBS_O_WORKDIR

mpiexec -n 48 ./your_mpi_app

The particular features of this submission script are as follows:

#PBS -l procs=48 - request 48 cores anywhere on the cluster, without specifying how they are distributed over the nodes
#PBS -l pmem=1700m - request 1700 MB of memory per core, which lets the scheduler place the job on almost any node type (see the table above)
mpiexec -n 48 - launch 48 MPI processes; the launcher obtains the list of assigned cores from the scheduler
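Similarly, the number of MPI processes does not have to be hard-coded; a sketch that derives it from the scheduler-provided node file (one line per reserved core):

mpiexec -n $(wc -l < $PBS_NODEFILE) ./your_mpi_app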

IMPORTANT:

Submitting hybrid (MPI and OpenMP) jobs

For parallel programs compiled with MPI libraries that also use OpenMP for intra-node parallelism, the above example needs to be modified. For 48 CPU cores on 4 nodes with 12 cores each, it reads as follows:

#!/bin/bash
#PBS -l nodes=4:ppn=12
#PBS -A xyz-123-ab
#PBS -o outputfile
#PBS -e errorfile
#PBS -N jobname
module load ...
cd $PBS_O_WORKDIR
export IPATH_NO_CPUAFFINITY=1
export OMP_NUM_THREADS=12
mpiexec -n 4 -npernode 1 ./your_app

The particular features of this submission script are as follows:

#PBS -l nodes=4:ppn=12 - reserve 4 complete nodes with 12 cores each
export IPATH_NO_CPUAFFINITY=1 - disable the InfiniPath CPU affinity so that the OpenMP threads are not all pinned to a single core
export OMP_NUM_THREADS=12 - start 12 OpenMP threads per MPI process, one per core
mpiexec -n 4 -npernode 1 - launch 4 MPI processes in total, one per node; each process spawns its own OpenMP threads
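The same structure can be used on Sandy Bridge nodes by adjusting the per-node core count; a sketch of the relevant lines, giving 4 x 16 = 64 cores in total (the rest of the script stays the same):

#PBS -l nodes=4:ppn=16
export OMP_NUM_THREADS=16
mpiexec -n 4 -npernode 1 ./your_app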

Submitting to extra large memory nodes

There are 4 types of extra large memory nodes, each with its own properties:

Node Type | Count | Cores per node | Memory (GB/core) | Maximal pmem value | Maximal walltime
m256G | 5 | 16 | 16 | 15700m | 12 hours (7 days)
m256G | 1 | 16 | 16 | 15700m | 7 days
m384G | 1 | 16 (32) | 12 | 11700m | 12 hours (7 days)
m512G | 4 | 16 | 32 | 31700m | 7 days
m1024G | 1 | 16 (32) | 32 | 31700m | 12 hours (7 days)


Here the values in parentheses are only available to specific groups (CFI grant holders); everybody else has the shorter walltime.

Any submission using a pmem value of at least 8 GB will go to one of the extra large memory nodes. A higher value should be specified if your job needs it.

Examples:

This job is allowed to run on any extra large memory node:

#PBS -l nodes=1:ppn=16,pmem=11700m,walltime=10:00:00

This job is only allowed to run on the non-reserved extra large memory nodes (1 x m256G, 4 x m512G), except for the special groups:

#PBS -l nodes=1:ppn=16,pmem=11700m,walltime=1:00:00:00

Serial jobs are also allowed:

#PBS -l nodes=1:ppn=1,pmem=31700m,walltime=10:00:00

This job can run on any of the m512G and m1024G nodes:

#PBS -l nodes=1:ppn=16,pmem=31700m,walltime=10:00:00

These jobs can only run on the m512G nodes, except for the special groups:

#PBS -l nodes=1:ppn=16,pmem=31700m,walltime=1:00:00:00

#PBS -l nodes=4:ppn=16,pmem=31700m,walltime=1:00:00:00

If you wish to run on a specific node type, add the corresponding property, e.g.:

#PBS -l nodes=1:ppn=32:m1024G,pmem=31700m

#PBS -l nodes=1:ppn=16:m256G,pmem=15700m

Submitting jobs with a Python script

A job can also be a Python script that contains all the PBS directives in its header, and an MPI and/or OpenMP job can be launched from within the Python script:

#!/usr/bin/python
#PBS -l nodes=4:ppn=12
#PBS -l walltime=1:00:00:00
#PBS -A xyz-123-ab
#PBS -j oe
#PBS -N jobname

import os
import subprocess

os.chdir(os.getenv('PBS_O_WORKDIR', '/home/username/your_project_name'))
os.environ['IPATH_NO_CPUAFFINITY'] = '1'
os.environ['OMP_NUM_THREADS'] = '12'

subprocess.call("mpiexec -n 4 -npernode 1 ./your_app > results.txt", shell=True)

# Some other work in pure Python
...

In the previous example:

The script is interpreted by Python (#!/usr/bin/python), while the #PBS lines in its header are still read by the scheduler.
#PBS -j oe joins the standard output and error streams into a single file.
os.chdir() changes to the directory from which the job was submitted ($PBS_O_WORKDIR), falling back to the given project directory if the variable is not set.
subprocess.call() launches the hybrid MPI/OpenMP run and redirects its output to results.txt; the rest of the script can then continue with pure Python work.

How to submit jobs from the worker nodes (jobs within jobs)

From within your job’s submission script, the job can spawn a child job using the following qsub command:

qsub -A <RAPid> ./child_job_submission_script
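For example, a parent job script could finish its own work and then spawn a follow-up job from the worker node (the application and script names, the RAPid, and the resource values are placeholders):

#!/bin/bash
#PBS -A xyz-123-ab
#PBS -l nodes=1:ppn=1,walltime=2:00:00
cd $PBS_O_WORKDIR
./first_stage_app
# submit the next stage once this stage has finished its own work
qsub -A xyz-123-ab ./child_job_submission_script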