Frequently Asked Questions


Why do I get a warning about feupdateenv not being implemented?

You may see this warning when using the Intel compilers on Guillimin, particularly when compiling MPI code with MPI wrappers such as mpicc or mpif90:

/software/compilers/Intel/2013-5-13.1.3/lib/intel64/libimf.so: warning: warning: feupdateenv is not implemented and will always fail

Intel links MPI programs with its math library, libimf. This library does not implement feupdateenv, a function from the C math library, and so it produces this warning to inform users. The warning is mostly harmless and can usually be ignored. To be certain that there will be no problems, you may wish to link your programs against both the C math library and the Intel math library (this may not suppress the warning):

mpicc -lm -limf ...

Why is my job blocked?

More information about blocked jobs can be found by using the 'checkjob -v jobID' command. Often, there is a line near the bottom of the output with a 'BLOCK MSG':

BLOCK MSG: job 12345678 violates active SOFT MAXPS limit of 77760000 for acct abc-123-aa  partition ALL (Req: 7142400  InUse: 99175336) (recorded at last scheduling iteration)

The most common cause of blocked jobs is a violation of our MAXPS limit, indicating that your group has scheduled too many outstanding processor seconds at the same time. Descriptions of this and other scheduler policies (MAXIJOB, MAXPROC, etc.) are available on our Moab scheduling policies documentation page.
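As a rough sketch, the numbers in a BLOCK MSG can be extracted with standard shell tools to see how far a group is over its limit. The function name parse_block_msg is hypothetical, the line format is assumed to match the example above, and the "over" figure assumes the limit is compared against InUse plus Req, which may not be exactly how the scheduler evaluates it.

```shell
#!/bin/sh
# Hypothetical helper: pull the MAXPS figures out of a BLOCK MSG line.
# Assumes the line format shown in the checkjob -v example above.
parse_block_msg() {
    limit=$(printf '%s\n' "$1" | sed -n 's/.*MAXPS limit of \([0-9][0-9]*\).*/\1/p')
    req=$(printf '%s\n' "$1" | sed -n 's/.*Req: *\([0-9][0-9]*\).*/\1/p')
    inuse=$(printf '%s\n' "$1" | sed -n 's/.*InUse: *\([0-9][0-9]*\).*/\1/p')
    # Assumption: the job is blocked while InUse + Req exceeds the limit.
    echo "limit=$limit req=$req inuse=$inuse over=$((inuse + req - limit))"
}

msg='BLOCK MSG: job 12345678 violates active SOFT MAXPS limit of 77760000 for acct abc-123-aa  partition ALL (Req: 7142400  InUse: 99175336) (recorded at last scheduling iteration)'
parse_block_msg "$msg"
```

All values are in processor seconds, so in this example the group would need roughly 28.6 million processor seconds of outstanding work to finish before the job becomes eligible.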

Jobs may also be blocked because of user-defined dependencies. In this case, there will not be a BLOCK MSG, but there will be a note with more information:

 NOTE:  job cannot run  (dependency 12345678 jobsuccessfulcomplete not met)

If your job is in a BatchHold state, there is a problem with the job submission that is causing the scheduler to repeatedly fail to schedule the job. Often, this is because the user has requested more processors per node than are available on the nodes. Please double check your submission parameters. If you think the job should run as it is, try using the 'releasehold jobID' command.

Please contact us if you can't find an explanation for your blocked job in the checkjob -v output.


How do I submit a job from a running job?

From a worker node, just use the qsub command. You should no longer connect to gm-schrmat:

qsub -A <RAPid> -q <queue_name> /path/to/script/<script_name>

Specifying a RAPid is always required when using qsub on a worker node.


How do I activate email notifications?

In your PBS script:

#PBS -M your_email_address
#PBS -m abe

You can enable any combination of messages you need (a, b, e, ab, ae, be, abe): a sends mail when the job is aborted, b when it begins, and e when it ends.


What does Exit_status=### mean?

Every job on our system returns an Exit_status code upon completion. These codes are listed in the PBS Epilogue information printed in your job's output file, and can be used to identify problems that may have occurred. Exit_status=0 usually indicates a successful job. Here is a list of some of the most common exit codes and what they mean:

Negative error codes usually point to a failure of the scheduler or the nodes. For these errors, please contact us with the jobID. Examples:

Exit_status Description
-11 JOB_EXEC_RERUN: Job was rerun
-10 JOB_EXEC_FAILUID: Invalid UID/GID for job
-4 JOB_EXEC_INITABT: Job aborted on MOM initialization
-3 JOB_EXEC_RETRY: Job execution failed, do retry
-2 JOB_EXEC_FAIL2: Job exec failed, after files, no retry
-1 JOB_EXEC_FAIL1: Job exec failed, before files, no retry

Exit codes between 0 and 127 indicate the exit code given by the last command in the job script. Examples:

Exit_status Description
0 Job success
1 General error
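Because the code comes from the last command, an earlier failure can be hidden by a later command that succeeds. The sketch below illustrates this with two one-line stand-ins for a job script (run through sh -c); the variable names are only for illustration.

```shell
#!/bin/sh
# Stand-in for a job script whose last command fails.
last_fails=0
sh -c 'echo "step 1 ok"; false' || last_fails=$?

# Stand-in for a job script where an early failure is followed by a
# command that succeeds: the failure is not reflected in the exit code.
early_fails=0
sh -c 'false; echo "cleanup done"' || early_fails=$?

echo "last command failed:  exit code $last_fails"
echo "early command failed: exit code $early_fails"
```

If an important step should determine the job's exit code, one option is to save its status in a variable right after it runs and end the script with exit followed by that variable.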

Exit codes between 128 and 173 indicate that the process ended due to receiving a signal; for the signal entries below, the signal number is the exit code minus 128. Examples:

Exit_status Description
128 Invalid argument to exit()
131 SIGQUIT: ctrl-\, core dumped
132 SIGILL: Malformed, unknown, or privileged instruction
133 SIGTRAP: Debugger breakpoint
134 SIGABRT: Process itself called abort
135 SIGBUS: Bus error (on Guillimin: often a file system issue)
136 SIGFPE: Bad arithmetic operation (e.g. division by zero)
137 SIGKILL (e.g. kill -9 command)
139 SIGSEGV: Segmentation Fault
143 SIGTERM (probably not canceljob or oom)
151 SIGURG: Urgent condition on socket

Exit codes between 174 and 253 indicate a "Fatal error signal".

Exit codes larger than 253:

254 Command invoked cannot execute
255 Command not found, possible path problem
265 SIGKILL (e.g. kill -9 command), possible out-of-memory error
271 SIGTERM (e.g. canceljob or oom), possible memory error
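The ranges above can be summarized in a small shell helper. This is only a sketch: the function name describe_exit_status is hypothetical, and the rule that codes above 256 correspond to the signal number plus 256 is inferred from the 265 and 271 entries in the list rather than stated by the scheduler documentation.

```shell
#!/bin/sh
# Hypothetical helper: classify an Exit_status value using the ranges in this FAQ.
describe_exit_status() {
    code=$1
    if [ "$code" -lt 0 ]; then
        echo "scheduler or node failure, contact support with the jobID"
    elif [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -le 127 ]; then
        echo "exit code of the last command in the job script"
    elif [ "$code" -eq 128 ]; then
        echo "invalid argument to exit()"
    elif [ "$code" -le 173 ]; then
        echo "killed by signal $((code - 128))"
    elif [ "$code" -le 253 ]; then
        echo "fatal error signal"
    elif [ "$code" -eq 254 ]; then
        echo "command invoked cannot execute"
    elif [ "$code" -eq 255 ]; then
        echo "command not found, possible path problem"
    elif [ "$code" -gt 256 ]; then
        # Assumption inferred from 265 (SIGKILL) and 271 (SIGTERM) above.
        echo "killed by signal $((code - 256))"
    else
        echo "not covered by this list"
    fi
}

for status in -3 0 1 137 139 265 271; do
    echo "Exit_status=$status: $(describe_exit_status "$status")"
done
```

Signal 9 is SIGKILL, 11 is SIGSEGV, and 15 is SIGTERM, matching the entries listed above.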

 


What is the best way to share files with my colleagues?

It is best to keep any shared files in your group's project space, and use your home directory only for personal files. Anyone who needs access to shared files in a group's project space should consider becoming a member of that group by applying for a new role in the CCDB portal. Once your new role is approved by the group's principal investigator, you will have access to the group's project space and other resource allocations. It is ultimately the responsibility of each user and/or group to correctly set the permissions on their folders and files. However, we are happy to help if you have any questions.

We recommend using your home directory for personal, non-shared files. However, you may wish to have shared files which make use of our backup policy for home directories (by default we do not back up project spaces). In this case, you may choose to set up access control lists (ACLs) on your home directory or to change the permissions. We advise against making the permissions on your home directory too open, as you may unintentionally expose private information to other users.
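As a minimal sketch of the permissions route, the commands below open one subdirectory to the owning Unix group while keeping everyone else out. It runs in a temporary directory standing in for a home directory so that it is safe to try; the directory name shared is hypothetical, and it assumes your colleagues belong to the group that owns the directory.

```shell
#!/bin/sh
# Temporary directory standing in for $HOME, so nothing real is changed.
demo_home=$(mktemp -d)
mkdir "$demo_home/shared"   # "shared" is a hypothetical directory name

# Shared directory: owner has full access, group may read and enter,
# others have no access.
chmod u=rwx,g=rx,o= "$demo_home/shared"

# Parent directory: group members may pass through it to reach "shared",
# but cannot list its contents; others have no access.
chmod u=rwx,g=x,o= "$demo_home"

ls -ld "$demo_home" "$demo_home/shared"
```

Granting only traverse permission on the parent keeps the rest of the home directory private, which is in line with the advice above about not opening permissions too widely.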


How can I contact McGill HPC?

Our main email address: (address protected from spambots; JavaScript is required to view it on the web page)

To contact a specific member of our staff, please visit our staff page.


What information should I include when requesting support?

The following pieces of information are useful for diagnosing problems: