Submitting your Job


 

Job management commands

Computational jobs on Guillimin are run in batch mode (deferred execution, as opposed to interactive execution). Jobs are submitted to queues and then run depending on a number of factors controlled by the scheduler. The scheduler (MOAB) ensures that the cluster resources are used in the most effective way and, at the same time, are shared between users in a fair manner. You manage your jobs on the cluster (submit, check status, kill, etc.) ONLY with the help of a special set of scheduler commands.

We summarize the basic MOAB and Torque commands here; further details can be found by running man qsub, or man with any of the other commands given below.
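
For quick reference, here is a short list of the most commonly used commands (the job ID 1234567 is a hypothetical placeholder):

qsub script_file      # submit a job script; prints the job ID
qstat -u $USER        # list your jobs and their current states (Torque)
showq -u $USER        # show your jobs as seen by the scheduler (MOAB)
checkjob 1234567      # detailed scheduling information for one job (MOAB)
qdel 1234567          # delete (kill) a job (Torque)
canceljob 1234567     # cancel a job through the scheduler (MOAB)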

Job queues on Guillimin

The computing nodes of the Guillimin cluster are grouped into 4 partitions. Note: the Westmere nodes have been decommissioned, so ppn=12 no longer has a special meaning and should not be used anymore, except in serial jobs with nodes=1.

Memory per core on Sandy Bridge nodes (ppn=16), by partition:

Serial Workload (SW2) - for jobs not requiring a large memory footprint: 4 GB
Large Memory (LM2) - for jobs requiring a large memory footprint: 8 GB
Extra Large Memory (XLM2) - a limited selection of extra large memory nodes: 12, 16, or 32 GB
Accelerated Workload (AW) - nodes with GPUs and Xeon Phis: 4 or 8 GB

The queues on Guillimin are organized accordingly. Depending on the type and requirements of your code, you submit your job to one of the available queues, described below.

In general, the default queue (with no explicit queue name specified) is recommended, to minimize waiting times in the queue. The scheduler will steer the job to a fitting node based on three parameters: the walltime, the number of processors per node (ppn), and the minimum amount of memory needed per core (pmem).
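
For example, a request such as the following (the values are placeholders chosen for illustration) gives the scheduler all three parameters at once:

#PBS -l nodes=2:ppn=16,pmem=3700m,walltime=24:00:00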

The debug queue is a special one, created to let you test your code before submitting it for a long run. Jobs submitted to the "debug" queue should normally start almost immediately, so you can see whether your program behaves as you expect. There are strict resource and time limitations for this queue, though: the default running time is only 30 minutes and the maximum is 2 hours. If the parameters in your submission file exceed these limits, your job will be rejected!
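
For instance, a short test run could be sent to the debug queue as follows (script_file stands for your own submission script):

qsub -q debug -l walltime=00:30:00 script_file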

IMPORTANT:

Tables showing how queues map to nodes

We advise beginning users to read the sections below to get started, but for more advanced users it is useful to know which queues map to which compute nodes. The first table shows the mapping at the highest level. In general, jobs are categorized into "serial" jobs of fewer than 16 cores on a single node (nodes=1:ppn<16), parallel jobs with 16 cores per node (ppn=16) on Sandy Bridge nodes, and parallel jobs with an unspecified number of cores per node (procs=n). Note: the Westmere nodes have been decommissioned, so ppn=12 no longer has a special meaning and should not be used anymore, except in serial jobs with nodes=1.

Queue              nodes=1:ppn<16        ppn=16                procs=n, n≥16
default (metaq)    SW2, LM2, XLM2, AW    SW2, LM2, XLM2, AW    SW2, LM2, XLM2

The behaviour of the default queue first depends on whether the job is a "serial" (<16 cores) job or not. A serial job will most likely run on one of around 45 Sandy Bridge SW2 nodes or, for shorter jobs, on one of around 80 Sandy Bridge AW nodes. Depending on the walltime and pmem value, the job is routed to the internal serial-short, sw-serial, or lm2-serial queue. Note: you should not submit directly to these internal queues, but you will see them appear in checkjob and other commands.

PBS -l value where n<16    pmem value    walltime ≤ 36h (internal queue)    walltime > 36h (internal queue)
nodes=1:ppn=n              3700m         SW2, AW (serial-short)             SW2 (sw-serial)
nodes=1:ppn=n              7700m         LM2 (lm2-serial)                   LM2 (lm2-serial)
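
As an illustration (the values are placeholders), a single-core job asking for 7700m per core and a two-day walltime would be routed to the internal lm2-serial queue:

#PBS -l nodes=1:ppn=1,pmem=7700m,walltime=2:00:00:00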

For parallel jobs using 16 cores or more, the behaviour depends on the pmem value (the minimum number of required megabytes per core), the ppn value, and the walltime. Shorter jobs with low memory requirements can run almost anywhere on the cluster. Longer jobs will only run on the smallest node type that fits the given memory requirements, so that they do not occupy nodes with more memory than they need. Note that jobs with high memory requirements (>=7700m) also always have a higher priority than jobs with lower memory requirements, so even short, low-memory jobs are the last to be placed on LM2 nodes.

pmem value (internal queue)         walltime ≤ 72h              walltime > 72h
                                    ppn=16        procs         ppn=16        procs
3700m(‡) (sw2plus, sw2-parallel)    SW2, LM2      SW2, LM2      SW2           SW2
7700m (lm2)                         LM2           LM2           LM2           LM2
>7800m (xlm2, see below)            XLM2          XLM2          XLM2          XLM2


(‡) default
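
For example (with placeholder values), a long parallel job that needs 7700m per core will only be placed on LM2 nodes, whether it is requested with ppn=16 or with procs:

#PBS -l procs=32,pmem=7700m,walltime=96:00:00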

Job Accounting

For any user with an NRAC/LRAC allocation, a special account with the Resource Allocation Project identifier (RAPid) has been created within the Compute Canada Database. These accounts have correspondingly been set up within the scheduler environment of Guillimin and must be specified as part of the job submission for users to access their NRAC/LRAC allocation. If no RAPid is specified, the job will be rejected unless you are a member of only one RAPid. Specifying the RAPid for your project is important in order to have the job scheduled with the priority assigned to the project.

The RAPid can be specified either as part of the job submission script or on the command line as an option to the qsub command. For example, if your RAPid is xyz-123-ab you could include the following line at the beginning of your job script:

#PBS -A xyz-123-ab

You could also specify the RAPid on the command line as an argument to the qsub command:

qsub -A xyz-123-ab script_file

Examples

Submitting a single-processor job

You would create a job submission file, e.g. script_file, with the following content:

#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -l walltime=12:00:00
#PBS -A xyz-123-ab
#PBS -o outputfile
#PBS -e errorfile
#PBS -N jobname

module load ...
cd $PBS_O_WORKDIR
./your_app
# OR use an interpreter for your code: bash, perl, python, Rscript, sas
interpreter your_code

The statements in your script_file above have the following meaning:

- #PBS -l nodes=1:ppn=1 reserves 1 processor (core) on 1 node;
- #PBS -l walltime=12:00:00 limits the running time to 12 hours;
- #PBS -A charges the job to your RAPid;
- #PBS -o and #PBS -e write the job's standard output and standard error to the given files;
- #PBS -N gives the job a name;
- module load ... loads the software modules your application requires;
- cd $PBS_O_WORKDIR changes into the directory from which the job was submitted;
- the last lines start your application, either directly or through an interpreter (bash, perl, python, Rscript, sas).

You submit your job with the following command:

qsub script_file

Submitting an OpenMP job

Submitting programs compiled with OpenMP options is similar to submitting serial jobs. The difference is that you now reserve more than one core on the node. The following sample script submits OpenMP code to be executed on 4 cores.

#!/bin/bash
#PBS -l nodes=1:ppn=4
#PBS -A xyz-123-ab
#PBS -o outputfile
#PBS -e errorfile
#PBS -N jobname

module load ...
cd $PBS_O_WORKDIR

export OMP_NUM_THREADS=4
./your_omp_app
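
If you prefer not to hard-code the thread count, one possible approach (a sketch, not a requirement of the scheduler) is to derive it from the Torque node file, which contains one line per reserved core:

export OMP_NUM_THREADS=$(wc -l < "$PBS_NODEFILE")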

IMPORTANT:

Submitting MPI jobs

Parallel programs compiled with MPI libraries are conceptually different from serial or OpenMP codes. They normally run across multiple computing nodes, with data and instruction exchange performed over the cluster network. Therefore, more than one physical computing node is usually reserved for the job, and the executable itself is started with the help of a special launcher, which handles the data exchange between the processes.

Here is an example submission script for an MPI job, which sends your program to be executed on 48 (3 x 16) CPU cores:

#!/bin/bash
#PBS -l procs=48
#PBS -l pmem=3700m
#PBS -A xyz-123-ab
#PBS -o outputfile
#PBS -e errorfile
#PBS -N jobname

module load ...
cd $PBS_O_WORKDIR

mpiexec -n 48 ./your_mpi_app

The particular features of this submission script are as follows:

- procs=48 requests 48 cores anywhere on the cluster, without imposing a particular layout of cores per node;
- pmem=3700m specifies the minimum memory required per core, which the scheduler uses to route the job (see the tables above);
- mpiexec -n 48 starts 48 MPI processes, one per reserved core.
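
For illustration, an equivalent request that forces the 48 cores onto 3 complete nodes (rather than letting the scheduler place them anywhere) would replace the #PBS -l procs=48 line with:

#PBS -l nodes=3:ppn=16

The pmem and mpiexec lines stay the same in either case.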

IMPORTANT:

Submitting hybrid (MPI and OpenMP) jobs

For parallel programs compiled with MPI libraries that use OpenMP for intra-node parallelism, the above example needs to be modified. For 48 CPU cores on 3 nodes with 16 cores each, it reads as follows:

#!/bin/bash
#PBS -l nodes=3:ppn=16
#PBS -A xyz-123-ab
#PBS -o outputfile
#PBS -e errorfile
#PBS -N jobname
module load ...
cd $PBS_O_WORKDIR
export IPATH_NO_CPUAFFINITY=1
export OMP_NUM_THREADS=16
mpiexec -n 3 -npernode 1 ./your_app

The particular features of this submission script are as follows:

- nodes=3:ppn=16 reserves 3 complete nodes with 16 cores each;
- export OMP_NUM_THREADS=16 makes each MPI process run 16 OpenMP threads, so all 48 reserved cores are used;
- mpiexec -n 3 -npernode 1 starts only one MPI process per node (3 in total), leaving the remaining cores to the OpenMP threads;
- export IPATH_NO_CPUAFFINITY=1 prevents the InfiniPath (PSM) layer from pinning each process to a single core, which would otherwise confine all of its OpenMP threads to that core.

Submitting to extra large memory nodes

There are 4 types of extra large memory nodes, each with its own properties:

Node Type    Count    Cores per node    Memory (GB/core)    Maximal pmem value    Maximal walltime
m256G        5        16                16                  15700m                12 hours (7 days)
m256G        1        16                16                  15700m                7 days
m384G        1        16 (32)           12                  11700m                12 hours (7 days)
m512G        4        16                32                  31700m                7 days
m1024G       1        16 (32)           32                  31700m                12 hours (7 days)


Here the values in brackets are only available to specific groups (CFI grant holders); everybody else is limited to the shorter walltime.

Any submission using a pmem value of at least 8 GB will go to one of the extra large memory nodes. A higher value should be specified if your job needs it.

Examples:

This job is allowed to run on any extra large memory node:

#PBS -l nodes=1:ppn=16,pmem=11700m,walltime=10:00:00

Except for the special groups, this job is only allowed on the non-reserved extra large memory nodes (1 x m256G, 4 x m512G):

#PBS -l nodes=1:ppn=16,pmem=11700m,walltime=1:00:00:00

Serial jobs are also allowed:

#PBS -l nodes=1:ppn=1,pmem=31700m,walltime=10:00:00

This job can run on any of the m512G and m1024G nodes:

#PBS -l nodes=1:ppn=16,pmem=31700m,walltime=10:00:00

These jobs can only run on the m512G nodes, except for the special groups:

#PBS -l nodes=1:ppn=16,pmem=31700m,walltime=1:00:00:00

#PBS -l nodes=4:ppn=16,pmem=31700m,walltime=1:00:00:00

If you wish to run on a specific type of node, add its property to the request, e.g.:

#PBS -l nodes=1:ppn=32:m1024G,pmem=31700m

#PBS -l nodes=1:ppn=16:m256G,pmem=15700m

Submitting jobs with a Python script

A job can also be a Python script that contains all the PBS information in its header. It is also possible to run an MPI and/or OpenMP job from the Python script:

#!/usr/bin/python
#PBS -l nodes=3:ppn=16
#PBS -l walltime=1:00:00:00
#PBS -A xyz-123-ab
#PBS -j oe
#PBS -N jobname

import os
import subprocess

os.chdir(os.getenv('PBS_O_WORKDIR', '/home/username/your_project_name'))
os.environ['IPATH_NO_CPUAFFINITY'] = '1'
os.environ['OMP_NUM_THREADS'] = '16'

subprocess.call("mpiexec -n 3 -npernode 1 ./your_app > results.txt", shell=True)

# Some other work in pure Python
...

In the previous example:

- the #PBS directives in the header play the same role as in a shell script; -j oe merges the job's standard error into its standard output;
- os.chdir() moves to the submission directory, with a fallback path in case PBS_O_WORKDIR is not set;
- the environment variables for the hybrid MPI/OpenMP run are set from Python before the launcher is called;
- subprocess.call(..., shell=True) runs mpiexec exactly as a shell script would, after which the script can continue with further work in pure Python.

How to submit jobs from the worker nodes (jobs within jobs)

From within your job’s submission script, the job can spawn a child job using the following qsub command:

qsub -A <RAPid> ./child_job_submission_script
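
As a minimal sketch (the preprocessing step and file names below are hypothetical), a parent job that prepares some input and then submits a follow-up job could look like this:

#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -l walltime=02:00:00
#PBS -A xyz-123-ab
#PBS -N parent_job

cd $PBS_O_WORKDIR
./prepare_input                                    # hypothetical preprocessing step
# once the input is ready, submit the child job from the worker node
qsub -A xyz-123-ab ./child_job_submission_script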