PBS exit codes

Interpreting PBS exit codes

  • The PBS Server logs and accounting logs record the ‘exit status’ of jobs.
  • A zero or positive exit status is the exit status of the job's top-level shell.
  • Certain negative exit statuses are used internally and will never be reported to the user.
  • Positive exit status values above a base value indicate that a signal killed the job.
  • The base value is 128 on most systems and 256 on some; see wait(2) or waitpid(2) for more information. The amount by which the exit status exceeds the base is the number of the signal that killed the job.
  • To interpret (or ‘decode’) the signal contained in the exit status value, subtract the base value from the exit status.
  • For example, an exit status of 143 indicates that the job was killed by SIGTERM: 143 - 128 = 15, and signal 15 is SIGTERM.
    • See the kill(1) manual page for a mapping of signal numbers to signal names on your operating system.
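
The decoding rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of PBS: the function name decode_exit_status and its return format are assumptions made for this example, and the default base of 128 should be changed on systems that use 256. Signal names come from Python's standard signal module, so they reflect the operating system the script runs on.

```python
import signal


def decode_exit_status(status, base=128):
    """Classify a job exit status as (kind, value).

    kind is "pbs-internal" for negative values, "signal" for values
    above the base, and "exit" for an ordinary shell exit status.
    """
    if status < 0:
        # Negative statuses are used internally by PBS.
        return ("pbs-internal", status)
    if status > base:
        signum = status - base
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # No named signal with this number on this platform.
            name = str(signum)
        return ("signal", name)
    return ("exit", status)


if __name__ == "__main__":
    for status in (0, 1, 143, -3):
        print(status, decode_exit_status(status))
```

On a typical Linux host, decode_exit_status(143) reports SIGTERM, matching the worked example above.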

Job termination

  • The exit code of a batch job follows the standard Unix exit status convention.
  • Exit code 0 typically means successful completion.
  • Codes 1-127 are generated by the job calling exit() with a non-zero value to indicate an error.
  • Exit codes 129-255 represent jobs terminated by Unix signals.
  • Each signal has a corresponding value which is indicated in the job exit code.
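
The 128 + signal number convention can be observed directly, outside of PBS. The sketch below assumes a POSIX sh is available on PATH; the helper name shell_status_after_signal is made up for this example. An inner shell sends a signal to itself, and the outer shell prints the status it saw in $?.

```python
import subprocess


def shell_status_after_signal(signal_name):
    """Return the $? an outer sh reports after its child dies from a signal."""
    # The inner shell signals itself ($$ is its own PID); the outer
    # shell then echoes the exit status it observed for that child.
    script = f"sh -c 'kill -{signal_name} $$'; echo $?"
    result = subprocess.run(
        ["sh", "-c", script], capture_output=True, text=True
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    for name in ("TERM", "KILL"):
        print(name, shell_status_after_signal(name))
```

This only shows how a shell reports a signalled child. How a given PBS installation records the status in its logs is still governed by the rules described in this document.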

Job termination signals

Signal Name  Signal Number  Exit Type  Reason
SIGHUP       1              Term       Hangup detected on controlling terminal or death of controlling process
SIGINT       2              Term       Interrupt from keyboard
SIGQUIT      3              Core       Quit from keyboard
SIGILL       4              Core       Illegal Instruction
SIGABRT      6              Core       Abort signal from abort(3)
SIGFPE       8              Core       Floating point exception
SIGKILL      9              Term       Kill signal
SIGSEGV      11             Core       Invalid memory reference
SIGPIPE      13             Term       Broken pipe: write to pipe with no readers
SIGALRM      14             Term       Timer signal from alarm(2)
SIGTERM      15             Term       Termination signal

NOTE: Consult the signal(7) man page for a complete list of signals.

Job exit status

Exit Code  Reason
9          Ran out of CPU time.
64         The job ended cleanly, but it was running out of CPU time. The solution is to submit the job to a queue with more resources (a higher CPU time limit).
125        An ErrMsg(severe) was reached in your job.
127        Possibly a problem with the execution host. By shell convention, 127 also means that a command was not found.
130        The job ran out of CPU or swap time. If swap time is the culprit, check for memory leaks.
131        The job ran out of CPU or swap time. If swap time is the culprit, check for memory leaks.
134        The job was killed with an abort signal, and a core dump was probably written. Often this is caused by an assert() or an ErrMsg(fatal) being hit in your job. There may be a run-time bug in your code. Use a debugger such as gdb or TotalView to find out what is wrong.
137        The job was killed because it exceeded the time limit.
139        Segmentation violation. Usually indicates a pointer error.
140        The job exceeded the "wall clock" time limit (as opposed to the CPU time limit).
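
For scripts that post-process job results, the table above can be kept as a small lookup. The sketch below is only an illustration: the names SITE_EXIT_REASONS and explain_exit_code are invented here, the reasons are condensed from the table, and the meanings are specific to the site this table describes rather than universal.

```python
# Reasons condensed from the exit status table above (site-specific).
SITE_EXIT_REASONS = {
    9: "ran out of CPU time",
    64: "ended cleanly, but was running out of CPU time",
    125: "an ErrMsg(severe) was reached",
    127: "possible problem with the machine",
    130: "ran out of CPU or swap time",
    131: "ran out of CPU or swap time",
    134: "killed with an abort signal, probably with a core dump",
    137: "killed because it exceeded the time limit",
    139: "segmentation violation",
    140: "exceeded the wall clock time limit",
}


def explain_exit_code(code):
    """Return a one-line explanation for a job exit code."""
    reason = SITE_EXIT_REASONS.get(code)
    if reason is None:
        return f"exit code {code}: no entry in the site table"
    return f"exit code {code}: {reason}"


if __name__ == "__main__":
    for code in (0, 134, 139):
        print(explain_exit_code(code))
```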

Interpreting PBS error codes

The following errors can be returned in response to a batch request (for example, from qstat, qdel, or qsub).

Each error is prefixed with the string PBSE_, for Portable (Posix) Batch System Error. The numeric values start at 15000 because the POSIX Batch Extensions working group is 1003.15.

PBS error codes

PBS Variable     Error Code  Description
PBSE_NONE        0           no error
PBSE_UNKJOBID    15001       Unknown Job Identifier
PBSE_NOATTR      15002       Undefined Attribute
PBSE_ATTRRO      15003       attempt to set READ ONLY attribute
PBSE_IVALREQ     15004       Invalid request
PBSE_UNKREQ      15005       Unknown batch request
PBSE_TOOMANY     15006       Too many submit retries
PBSE_PERM        15007       No permission
PBSE_BADHOST     15008       access from host not allowed
PBSE_JOBEXIST    15009       job already exists
PBSE_SYSTEM      15010       system error occurred
PBSE_INTERNAL    15011       internal server error occurred
PBSE_REGROUTE    15012       parent job of dependent in route queue
PBSE_UNKSIG      15013       unknown signal name
PBSE_BADATVAL    15014       bad attribute value
PBSE_MODATRRUN   15015       Cannot modify attribute in run state
PBSE_BADSTATE    15016       request invalid for job state
PBSE_UNKQUE      15018       Unknown queue name
PBSE_BADCRED     15019       Invalid Credential in request
PBSE_EXPIRED     15020       Expired Credential in request
PBSE_QUNOENB     15021       Queue not enabled
PBSE_QACESS      15022       No access permission for queue
PBSE_BADUSER     15023       Bad user - no password entry
PBSE_HOPCOUNT    15024       Max hop count exceeded
PBSE_QUEEXIST    15025       Queue already exists
PBSE_ATTRTYPE    15026       incompatible queue attribute type
PBSE_QUEBUSY     15027       Queue Busy (not empty)
PBSE_QUENBIG     15028       Queue name too long
PBSE_NOSUP       15029       Feature/function not supported
PBSE_QUENOEN     15030       Cannot enable queue, needs add def
PBSE_PROTOCOL    15031       Protocol (ASN.1) error
PBSE_BADATLST    15032       Bad attribute list structure
PBSE_NOCONNECTS  15033       No free connections
PBSE_NOSERVER    15034       No server to connect to
PBSE_UNKRESC     15035       Unknown resource
PBSE_EXCQRESC    15036       Job exceeds Queue resource limits
PBSE_QUENODFLT   15037       No Default Queue Defined
PBSE_NORERUN     15038       Job Not Rerunnable
PBSE_ROUTEREJ    15039       Route rejected by all destinations
PBSE_ROUTEEXPD   15040       Time in Route Queue Expired
PBSE_MOMREJECT   15041       Request to MOM failed
PBSE_BADSCRIPT   15042       (qsub) cannot access script file
PBSE_STAGEIN     15043       Stage In of files failed
PBSE_RESCUNAV    15044       Resources temporarily unavailable
PBSE_BADGRP      15045       Bad Group specified
PBSE_MAXQUED     15046       Max number of jobs in queue
PBSE_CKPBSY      15047       Checkpoint Busy, may retry
PBSE_EXLIMIT     15048       Limit exceeds allowable
PBSE_BADACCT     15049       Bad Account attribute value
PBSE_ALRDYEXIT   15050       Job already in exit state
PBSE_NOCOPYFILE  15051       Job files not copied
PBSE_CLEANEDOUT  15052       unknown job id after clean init
PBSE_NOSYNCMSTR  15053       No Master in Sync Set
PBSE_BADDEPEND   15054       Invalid dependency
PBSE_DUPLIST     15055       Duplicate entry in List
PBSE_DISPROTO    15056       Bad DIS-based Request Protocol
PBSE_EXECTHERE   15057       cannot execute there
PBSE_SISREJECT   15058       sister rejected
PBSE_SISCOMM     15059       sister could not communicate
PBSE_SVRDOWN     15060       request rejected - server shutting down
PBSE_CKPSHORT    15061       not all tasks could checkpoint
PBSE_UNKNODE     15062       Named node is not in the list
PBSE_UNKNODEATR  15063       node-attribute not recognized
PBSE_NONODES     15064       Server has no node list
PBSE_NODENBIG    15065       Node name is too big
PBSE_NODEEXIST   15066       Node name already exists
PBSE_BADNDATVAL  15067       Bad node-attribute value
PBSE_MUTUALEX    15068       State values are mutually exclusive
PBSE_GMODERR     15069       Error(s) during global modification of nodes
PBSE_NORELYMOM   15070       could not contact Mom
PBSE_NOTSNODE    15071       no time-shared nodes
     
Resource monitor specific

PBS Variable     Error Code  Description
PBSE_RMUNKNOWN   15201       resource unknown
PBSE_RMBADPARAM  15202       parameter could not be used
PBSE_RMNOPARAM   15203       a parameter needed did not exist
PBSE_RMEXIST     15204       something specified didn't exist
PBSE_RMSYSTEM    15205       a system error occurred
PBSE_RMPART      15206       only part of reservation made

The following RM_ERR_ names are aliases for the resource monitor codes above:

Alias            PBS Variable
RM_ERR_UNKNOWN   PBSE_RMUNKNOWN
RM_ERR_BADPARAM  PBSE_RMBADPARAM
RM_ERR_NOPARAM   PBSE_RMNOPARAM
RM_ERR_EXIST     PBSE_RMEXIST
RM_ERR_SYSTEM    PBSE_RMSYSTEM
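
When a numeric PBSE_ code turns up in a log, it can be convenient to look it up programmatically. The sketch below parses rows in the "name, code, description" layout used by the tables above into a dictionary. It is an illustration only: parse_pbse_rows, describe_pbse, and SAMPLE_ROWS are hypothetical names, and only a handful of rows are included to keep the example short.

```python
import re

# A few rows copied from the tables above; any "NAME CODE description"
# line with whitespace-separated columns is accepted.
SAMPLE_ROWS = """\
PBSE_NONE 0 no error
PBSE_UNKJOBID 15001 Unknown Job Identifier
PBSE_PERM 15007 No permission
PBSE_UNKQUE 15018 Unknown queue name
PBSE_RMUNKNOWN 15201 resource unknown
"""

ROW = re.compile(r"^(PBSE_[A-Z]+)\s+(\d+)\s+(.+)$")


def parse_pbse_rows(text):
    """Return {code: (name, description)} for each well-formed row."""
    table = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            name, code, description = match.groups()
            table[int(code)] = (name, description.strip())
    return table


def describe_pbse(code, table):
    """Format one code using the parsed table, or flag it as unknown."""
    name, description = table.get(code, ("unknown", "not in the parsed table"))
    return f"{code} {name}: {description}"


if __name__ == "__main__":
    table = parse_pbse_rows(SAMPLE_ROWS)
    print(describe_pbse(15001, table))
    print(describe_pbse(15017, table))
```

Codes that do not appear in the parsed rows, such as 15017 here, are reported as unknown rather than guessed.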