Submitting your Job

Job management commands

Computational jobs on Guillimin run in batch mode (deferred execution, as opposed to interactive). Jobs are submitted to queues and then run depending on a number of factors controlled by the scheduler. The scheduler (MOAB) ensures that cluster resources are used as effectively as possible and, at the same time, are shared fairly between users. You manage your jobs on the cluster (submit, check status, kill, etc.) ONLY through a special set of scheduler commands.

We summarize the basic MOAB and Torque commands below (further details can be found with man qsub, or with the man page of any command listed; an example session follows the list).

  • qsub script_file - submits a new job, where script_file is a text file containing the name of your executable program and execution options (number of cores requested, working directory, job name, etc.). An example script_file is given below. Can be used with the following arguments:
    • qsub -q qname - submits your job to a particular queue qname
    • qsub -I (capital i) - submits an interactive job, you have to wait until you get a shell running on a worker node
  • qstat - shows the current list of submitted jobs.
  • showq - shows the current list of submitted jobs (note: it can take a few minutes before your job shows up in showq). Can be used with the following arguments:
    • showq -v - shows a full detailed list of submitted jobs
    • showq -u username - shows only the jobs submitted under username
    • showq -r - the same as the previous, but a list of assigned nodes is shown for each job
  • checkjob -v job_ID - shows why the job is still waiting to be executed
  • canceljob job_ID or qdel job_ID - kills the job, or removes it from the queue. The job_ID can be obtained from the qstat and showq commands.
  • showstart job_ID - shows when the scheduler estimates the job will start (not very reliable)
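
For example, a typical session might look like the following (the job ID and username shown are placeholders):

qsub script_file       # submit the job; the scheduler prints a job ID, e.g. 1234567
showq -u username      # check the status of your own jobs
checkjob -v 1234567    # if the job has not started yet, see why it is still waiting
canceljob 1234567      # kill the job or remove it from the queue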

Job queues on Guillimin

The computing nodes of the Guillimin cluster are grouped into several partitions:

Partition | Memory per core, Westmere nodes (ppn=12) | Memory per core, Sandy Bridge nodes (ppn=16)
Serial Workload (SW, SW2) - for serial jobs and "light" parallel jobs | 3 GB | 4 GB
High Bandwidth (HB) - for massively parallel jobs | 2 GB | -
Large Memory (LM, LM2) - for jobs requiring a large memory footprint | 6 GB | 8 GB
Extra Large Memory (XLM2) - limited selection of extra large memory nodes | - | 12, 16, or 32 GB
Accelerated Workload (AW) - nodes with GPUs and Xeon Phis | - | 4 or 8 GB

The queues on Guillimin are organized accordingly. Depending on the type and requirements of your code you submit your job to one of the following queues:

  • metaq (default)
  • sw
  • hb
  • lm
  • xlm2
  • aw
  • debug (made of three nodes in the SW2 partition)

In general, the default queue (with no explicit queue name specified) is recommended to minimize waiting times in the queue. The scheduler will steer the job to a fitting node, depending on 3 parameters: the walltime, the number of processors per node specified (ppn), and the minimum amount of memory needed per core (pmem). The sw, hb, lm, aw, and xlm2 queues are there to provide shorthands to run on specific node types and for backwards compatibility.

The debug queue is a special one: it exists so that you can test your code before submitting it for a long run. Jobs submitted to the "debug" queue should normally start almost immediately, so you can quickly see whether your program behaves as expected. There are strict resource and time limitations for this queue, though: the default running time is only 30 minutes and the maximum is 2 hours. If the parameters in your submission file exceed these limits, your job will be rejected!
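
For example, a short test could be submitted to the debug queue as follows (the resource values are illustrative and stay within the limits above):

qsub -q debug -l nodes=1:ppn=1,walltime=00:30:00 script_file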

IMPORTANT:

  • The computing nodes assigned to the SW, HB, and LM partitions mentioned above have different amounts of memory! The memory available per CPU core is about 2.76 GB, 1.73 GB, and 5.85 GB for SW, HB, and LM respectively. Be aware that if your job exceeds these limits, it will be killed automatically. Also, if you decide to explicitly specify memory requirements for the job in your submission file (normally not required, but see the sketch after this list) and these requirements exceed the limits above, your job will be automatically blocked.
  • Please note that the "sw" queue can also handle parallel jobs. If your code does not require a lot of interprocessor communication, you will probably not notice any performance issues while using this queue. However, if your program performs large data exchanges between nodes (for example, a 3D FFT parallelized with MPI), then it may be beneficial to use the "hb" queue instead.
  • If you are using thread-type parallelization (like OpenMP), the default queue will usually steer your job to an SW node.
  • It is ALWAYS a good idea to submit a short test job to the "debug" queue before a long run. In this way you will immediately know whether your program works as expected. However, please do NOT use the "debug" queue for real production runs! Remember that, by default, your job will be automatically killed after 30 minutes!
  • The DEFAULT walltime for each of the other queues is 3 hours, with a MAXIMUM allowed walltime of 30 days, except for XLM2 nodes for which it is 7 days.
  • For jobs using GPUs or Xeon Phis, the pmem parameter denotes memory per node, NOT per core.
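
If you do need to request memory explicitly, a minimal sketch of the relevant resource lines looks like this (the values are illustrative; about 3.7 GB per core on a 12-core node stays within the LM limit of roughly 5.85 GB per core):

#PBS -l nodes=1:ppn=12
#PBS -l pmem=3700m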

Tables showing how queues map to nodes

We advise beginning users to read the sections below to get started; more advanced users will find it useful to know which queues map to which compute nodes. The first table shows the mapping at the highest level. In general, jobs are categorized as "serial" jobs of fewer than 12 cores on a single node (nodes=1:ppn<12), parallel jobs with 12 cores per node (ppn=12) on Westmere nodes, parallel jobs with 16 cores per node (ppn=16) on Sandy Bridge nodes, and parallel jobs with an unspecified number of cores per node (procs=n).

Queue | nodes=1:ppn<12 | ppn=12 | ppn=16 | procs=n, n≥12
default (metaq) | SW, SW2, AW | SW, HB, LM | SW2, LM2, XLM2 | SW, HB, LM, SW2, LM2, XLM2
hb | NOT ALLOWED | HB | SW2 | HB
sw | SW, SW2 | SW | SW2 | SW, SW2
lm | NOT ALLOWED | LM | LM2 | LM, LM2
aw | AW | AW | AW | AW

The behaviour of the default queue depends first on whether the job is a "serial" (<12 cores) job or not. A serial job will most likely run on one of around 250 Westmere SW nodes, but it can also run on one of around 45 Sandy Bridge SW2 nodes or, for shorter duration jobs, on one of around 80 Sandy Bridge AW nodes. By specifying the "sandybridge" or "westmere" attribute a job can be forced to run on a particular node type (an example follows the table below). Depending on the walltime, the job is routed to the internal serial-short or sw-serial queue. Note: you should not submit directly to such internal queues, but you will see them appearing in checkjob and other commands.

PBS -l value, where n<12 | walltime ≤ 36h (serial-short) | walltime > 36h (sw-serial)
nodes=1:ppn=n | SW, SW2, AW | SW
nodes=1:ppn=n:westmere | SW | SW
nodes=1:ppn=n:sandybridge | SW2, AW | SW2
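
For example, to force a short serial job onto a Westmere SW node, the resource request in the submission script could read as follows (the walltime value is illustrative):

#PBS -l nodes=1:ppn=1:westmere
#PBS -l walltime=24:00:00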

For parallel jobs using 12 cores or more, the behaviour depends on the pmem value (minimum number of required megabytes per core), the ppn value, and the walltime. Shorter jobs with low memory requirements can run almost anywhere on the cluster. Longer duration jobs only run on the smallest node type that fits the given memory requirements, so that they do not occupy over-specified nodes. Note that jobs with high memory requirements (>=5700m) also always have a higher priority than jobs with lower memory requirements, so even short-duration low-memory jobs are the last to be placed on LM and LM2 nodes. An example follows the table and footnotes below.

Walltime ≤ 72h:

pmem value (internal queue) | ppn=12 | ppn=16 | procs
1700m(*) (hbplus, hb) | HB, SW, LM | SW2, LM2 | HB, SW, SW2, LM, LM2
2700m(†) (swplus, sw-parallel, sw2-parallel) | SW, LM | SW2, LM2 | SW, SW2, LM, LM2
3700m(‡) (sw2plus, sw2-parallel) | LM | SW2, LM2 | SW2, LM, LM2
5700m (lm, lm2) | LM | LM2 | LM, LM2
7700m (lm2) | NOT ALLOWED | LM2 | LM2
>7800m (xlm2, see below) | XLM2 | XLM2 | XLM2

Walltime > 72h:

pmem value (internal queue) | ppn=12 | ppn=16 | procs
1700m(*) (hbplus, hb) | HB | SW2 | HB
2700m(†) (swplus, sw-parallel, sw2-parallel) | SW | SW2 | SW
3700m(‡) (sw2plus, sw2-parallel) | LM | SW2 | SW2
5700m (lm, lm2) | LM | LM2 | LM, LM2
7700m (lm2) | NOT ALLOWED | LM2 | LM2
>7800m (xlm2, see below) | XLM2 | XLM2 | XLM2

(*) default if procs>12 or nodes>1 (which need to communicate over the InfiniBand network, so "HB" nodes are useful).
(†) default if procs=12 or nodes=1:ppn=12
(‡) default if ppn=16
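
For example, according to the tables above, a 24-core job requesting 2700m per core with a 48-hour walltime may be placed on SW, SW2, LM, or LM2 nodes (the core count and walltime are illustrative):

#PBS -l procs=24
#PBS -l pmem=2700m
#PBS -l walltime=48:00:00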

Job Accounting

For any user with an NRAC/LRAC allocation, a special account with the Resource Allocation Project identifier (RAPid) has been created within the Compute Canada Database. These accounts have correspondingly been set up within the scheduler environment of Guillimin and must be specified as part of the job submission for users to access their NRAC/LRAC allocation. If no RAPid is specified, the job will be rejected. Specifying the RAPid for your project is important so that the job is scheduled with the priority assigned to the project.

The RAPid can be specified as either part of the job submission script, or on the command line as an option to the qsub command. For example, if your RAPid is xyz-123-ab you could include the following line at the beginning of your job script:

#PBS -A xyz-123-ab

You could also specify the RAPid on the command line as an argument to the qsub command:

qsub -A xyz-123-ab script_file

Examples

Submitting a single-processor job

You would create a job submission file, e.g. script_file, with the following content:

#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -l walltime=12:00:00
#PBS -A xyz-123-ab
#PBS -o outputfile
#PBS -e errorfile
#PBS -N jobname

module load ...
cd $PBS_O_WORKDIR
./your_app
# OR, to run interpreted code instead (bash, perl, python, Rscript, sas), replace the line above with:
# interpreter your_code

The statements in your script_file above have the following meaning:

  • #PBS -l nodes=1:ppn=1 - you are requesting one cpu core on one computational node
  • #PBS -l walltime=12:00:00 - sets the execution time limit for your job (12 hours in this case; if this parameter is omitted, the system sets only 3 hours)
  • #PBS -A xyz-123-ab - see the "Job Accounting" section above
  • #PBS -o outputfile - the stdout of your code will be redirected to the file outputfile
  • #PBS -e errorfile - the stderr of your code will be redirected to the file errorfile
  • #PBS -N jobname - your job will be named jobname
  • module load ... - loads any modules required by the job
  • cd $PBS_O_WORKDIR - changes directory to the same place you were in when you submitted your job
  • ./your_app - starts execution of your_app, which is an executable file

You submit your job with the following command:

qsub script_file

Note: as of September 27, 2013, you can no longer submit serial jobs to the "hb" and "lm" queues.

Note: use #PBS -l nodes=1:ppn=1:sandybridge if you wish to run on a newer Sandy Bridge node. All serial jobs are now accepted this way, and shorter duration jobs (presently less than 36 hours) are also allowed to run on accelerator nodes.

Submitting an OpenMP job

Submitting programs compiled with OpenMP options is similar to submitting serial jobs. The difference is that you now reserve more than one core on the node. The following sample script submits OpenMP code to be executed on 4 cores.

#!/bin/bash
#PBS -l nodes=1:ppn=4
#PBS -A xyz-123-ab
#PBS -o outputfile
#PBS -e errorfile
#PBS -N jobname

module load ...
cd $PBS_O_WORKDIR

export OMP_NUM_THREADS=4
./your_omp_app

IMPORTANT:

  • The number of nodes for an OpenMP job is always 1 (you cannot go beyond 1 node with OpenMP), and the number of cores (ppn) should never exceed the number of physical CPU cores of that node (12 or 16 in our case). So the largest job you can run is "nodes=1:ppn=12" or "nodes=1:ppn=16".
  • The OMP_NUM_THREADS variable should correspond to the number of requested CPU cores. It should NEVER exceed the number of cores you reserved!
  • OpenMP jobs should normally be submitted to the default queue. In special cases where your code needs a large memory footprint, you can submit OpenMP jobs with a higher pmem value, e.g. "nodes=1:ppn=12,pmem=5700m" or "nodes=1:ppn=16,pmem=7700m" (the full 6 or 8 GB cannot be used because the operating system takes space too); a full script for this case is sketched after this list. As of September 27, 2013, jobs submitted to the "lm" and "hb" queues with ppn<12 are no longer accepted.
  • For programs compiled with MPI support, it is also necessary to set the IPATH_NO_CPUAFFINITY variable, for instance using export IPATH_NO_CPUAFFINITY=1.
  • Use #PBS -l nodes=1:ppn=m:sandybridge, where m<12, or #PBS -l nodes=1:ppn=16 if you wish to run on a newer Sandy Bridge node.
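
As a sketch of the large-memory OpenMP case mentioned above (the walltime, job name, and application name are placeholders):

#!/bin/bash
#PBS -l nodes=1:ppn=12,pmem=5700m
#PBS -l walltime=12:00:00
#PBS -A xyz-123-ab
#PBS -o outputfile
#PBS -e errorfile
#PBS -N jobname

module load ...
cd $PBS_O_WORKDIR

export OMP_NUM_THREADS=12
./your_omp_app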

Submitting MPI jobs

Parallel programs compiled with MPI libraries are conceptually different from serial or OpenMP codes. They normally run across multiple computing nodes, with data and instruction exchange performed over the cluster network. Therefore more than one physical computing node is usually reserved for the job, and the executable itself is started with the help of a special launcher, which handles the data exchange between the processes.

Here is an example submission script for an MPI job that runs your program on 48 (3*12 or 4*16) CPU cores:

#!/bin/bash
#PBS -l procs=48
#PBS -l pmem=1700m
#PBS -A xyz-123-ab
#PBS -o outputfile
#PBS -e errorfile
#PBS -N jobname

module load ...
cd  $PBS_O_WORKDIR

mpiexec -n 48 ./your_mpi_app

The particular features of this submission script are as follows:

  • The line "#PBS -l procs=48" asks the scheduler to reserve 48 CPU cores for your job on 3 or 4 nodes in the cluster, depending on availability. If this number is not a multiple of 12 or 16 it will be rounded up.
  • If instead, you use "#PBS -l nodes=4:ppn=12" you ask the scheduler to reserve 4*12=48 CPU cores for your job on 4 older Westmere nodes in the cluster.
  • If instead, you use "#PBS -l nodes=3:ppn=16" you ask the scheduler to reserve 3*16=48 CPU cores for your job on 3 newer Sandy Bridge nodes in the cluster.
  • The line "#PBS -l pmem=1700m" corresponds to the default memory reserved per core; if you need more memory this value should be higher. Recommended pmem values are 2700m (exclude HB nodes), 3700m (only run on LM or Sandy Bridge nodes), 5700m (only run on LM), and 7700m (only run on Sandy Bridge LM); see below for the extra large memory XLM2 nodes.
  • mpiexec -n 48 ./your_mpi_app - starts the program your_mpi_app, compiled with MPI, in parallel on 48 cores. The program "mpiexec" is the launcher mentioned above, which "organizes" all communications between the MPI processes. The parameter -n should NEVER be larger than "nodes"*"ppn" or "procs" in the "#PBS -l" line.
  • We no longer recommend using "ppn" values less than 12 for multi-node jobs.
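
For example, the same 48-core job expressed explicitly for 3 Sandy Bridge nodes could look like this (a sketch based on the script above; with ppn=16 the default pmem of 3700m applies):

#!/bin/bash
#PBS -l nodes=3:ppn=16
#PBS -A xyz-123-ab
#PBS -o outputfile
#PBS -e errorfile
#PBS -N jobname

module load ...
cd $PBS_O_WORKDIR

mpiexec -n 48 ./your_mpi_app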

IMPORTANT:

  • Although many third-party application packages provide MPI sources as part of the package, we strongly advise building your application with our MPI packages, which are already installed on the system and are accessible through the "module" environment. These packages are specially built with InfiniBand libraries and ensure that the MPI traffic of your parallel application goes through the InfiniBand network.
  • Inside your MPI job, please make sure that the software modules corresponding to your compiler and MPI package are loaded (a sketch follows this list). This way you can be sure that these modules will always be loaded on every computing node where your job is sent.
  • We no longer recommend adding corresponding "module load ..." lines to your .bashrc file, or using "#PBS -V". A job is most reliably re-run if it is self-contained and has all information needed, instead of relying on the environment.
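
As a minimal sketch of the point about modules (the module names below are placeholders only; use "module avail" to see which modules are actually installed on Guillimin):

# inside your job script, before calling mpiexec
module load your_compiler_module    # placeholder: the compiler module used to build your code
module load your_mpi_module         # placeholder: the matching MPI module
mpiexec -n 48 ./your_mpi_app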

Submitting hybrid (MPI and OpenMP) jobs

For parallel programs compiled with MPI libraries that use OpenMP for intra-node parallelism, the above example needs to be modified. For 48 CPU cores on 4 nodes with 12 cores each, it reads as follows (a variant for Sandy Bridge nodes is sketched after the notes below):

#!/bin/bash
#PBS -l nodes=4:ppn=12
#PBS -A xyz-123-ab
#PBS -o outputfile
#PBS -e errorfile
#PBS -N jobname
module load ...
cd $PBS_O_WORKDIR
export IPATH_NO_CPUAFFINITY=1
export OMP_NUM_THREADS=12
mpiexec -n 4 -npernode 1 ./your_app

The particular features of this submission script are as follows:

  • export IPATH_NO_CPUAFFINITY=1 - tells the underlying software not to pin each process to one CPU core, which would effectively disable OpenMP parallelism.
  • export OMP_NUM_THREADS=12 - specifies the number of threads used for OpenMP for all 4 processes.
  • mpiexec -n 4 -npernode 1 ./your_app - starts program your_app, compiled with MPI, in parallel on 4 nodes, with 1 process per node.
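
A variant of the same script for 48 cores on 3 Sandy Bridge nodes (16 cores each) could look like this (a sketch based on the example above):

#!/bin/bash
#PBS -l nodes=3:ppn=16
#PBS -A xyz-123-ab
#PBS -o outputfile
#PBS -e errorfile
#PBS -N jobname
module load ...
cd $PBS_O_WORKDIR
export IPATH_NO_CPUAFFINITY=1
export OMP_NUM_THREADS=16
mpiexec -n 3 -npernode 1 ./your_app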

Submitting to extra large memory nodes

There are 4 types of extra large memory nodes, each with its own properties:

Node Type | Count | Cores per node | Memory (GB/core) | Maximal pmem value | Maximal walltime
m256G | 5 | 16 | 16 | 15700m | 12 hours (7 days)
m256G | 1 | 16 | 16 | 15700m | 7 days
m384G | 1 | 16 (32) | 12 | 11700m | 12 hours (7 days)
m512G | 4 | 16 | 32 | 31700m | 7 days
m1024G | 1 | 16 (32) | 32 | 31700m | 12 hours (7 days)


Here the values in parentheses are only available to specific groups (CFI grant holders); everybody else is limited to the shorter walltime.

Any submission using a pmem value of at least 8 GB will go to one of the extra large memory nodes. A higher value should be specified if your job needs it.

Examples:

This job is allowed to run on any extra large memory node:

#PBS -l nodes=1:ppn=16,pmem=11700m,walltime=10:00:00

This job is only allowed on the non-reserved extra large memory nodes (1 x m256G, 4 x m512G), except for the special groups:

#PBS -l nodes=1:ppn=16,pmem=11700m,walltime=1:00:00:00

Serial jobs are also allowed:

#PBS -l nodes=1:ppn=1,pmem=31700m,walltime=10:00:00

This job can run on any of the m512G and m1024G nodes:

#PBS -l nodes=1:ppn=16,pmem=31700m,walltime=10:00:00

These jobs can only run on the m512G nodes, except for the special groups:

#PBS -l nodes=1:ppn=16,pmem=31700m,walltime=1:00:00:00

#PBS -l nodes=4:ppn=16,pmem=31700m,walltime=1:00:00:00

If you wish to run on a specific node, please add a property, e.g.:

#PBS -l nodes=1:ppn=32:m1024G,pmem=31700m

#PBS -l nodes=1:ppn=16:m256G,pmem=15700m

Submitting jobs with a Python script

A job can be a Python script that contains all the PBS directives in its header. It is also possible to run an MPI and/or OpenMP job from the Python script:

#!/usr/bin/python
#PBS -l nodes=4:ppn=12
#PBS -l walltime=1:00:00:00
#PBS -A xyz-123-ab
#PBS -j oe
#PBS -N jobname

import os
import subprocess

os.chdir(os.getenv('PBS_O_WORKDIR', '/home/username/your_project_name'))
os.environ['IPATH_NO_CPUAFFINITY'] = '1'
os.environ['OMP_NUM_THREADS'] = '12'

subprocess.call("mpiexec -n 4 -npernode 1 ./your_app > results.txt", shell=True)

# Some other work in pure Python
...

In the previous example:

  • #PBS -l walltime=1:00:00:00 - reserves nodes for one day
  • #PBS -j oe - merges output and error files in the output file. Since -o and -e are not given, the output file will be "jobname.oJOBID"
  • os.getenv('PBS_O_WORKDIR') - as in Bash, the environment variable PBS_O_WORKDIR holds the directory from which you submitted the job. The Python function getenv() lets you provide a default value in case the variable is not set.
  • os.environ['var_name'] = 'value' - to set an environment variable
  • subprocess.call("command arg1 arg2 ... > results.txt", shell=True) - to run a shell command. While the output of the Python script will be in jobname.oJOBID, the output of command will be in results.txt.

How to submit jobs from the worker nodes (jobs within jobs)

From within your job’s submission script, the job can spawn a child job using the following qsub command:

qsub -A <RAPid> ./child_job_submission_script
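
As a minimal sketch (the walltime, job name, RAPid, and script names are placeholders), a parent job could resubmit a follow-up job once its own work is done:

#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -l walltime=12:00:00
#PBS -A xyz-123-ab
#PBS -N parent_job

cd $PBS_O_WORKDIR
./your_app

# once this stage has finished, submit the next stage from the worker node
qsub -A xyz-123-ab ./child_job_submission_script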