4. Slurm Commands
4.1. Check nodes and partitions (sinfo)
View information about queues (partitions).
Command format
$ sinfo
Output
| Item | Explanation |
|---|---|
| PARTITION | Name of the partition. The suffix “*” identifies the default partition. |
| AVAIL | Partition state (up or down) |
| TIMELIMIT | Maximum time limit for any user job. infinite is shown for a partition without a job time limit. |
| NODES | Number of nodes in the partition that are in the listed state |
| STATE | State of the nodes. The suffix “*” identifies nodes that are presently not responding. |
| NODELIST | Names of the nodes |
Example of command execution
[UserY@loginvm-XXX ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
Interactive up infinite 1 down* fx-01-12-06
Interactive up infinite 2 alloc fx-01-12-[00-01]
Batch* up infinite 5 idle fx-01-12-[02-05,07]
・
・
・
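As a minimal sketch of how this output can be used in a script, the function below sums the NODES column for one STATE. The name count_nodes_in_state is hypothetical. It reads sinfo-formatted text from standard input, so it can be tried without a cluster, and it assumes the default six-column layout shown above.

```shell
#!/bin/sh
# count_nodes_in_state STATE
# Hypothetical helper: reads default-format sinfo output on stdin and
# prints the sum of the NODES column for rows whose STATE matches.
# A trailing "*" on the state (node not responding) is ignored.
count_nodes_in_state() {
    awk -v want="$1" '
        NR == 1 { next }              # skip the header line
        NF < 6  { next }              # skip anything that is not a full row
        {
            state = $5
            sub(/\*$/, "", state)     # "down*" -> "down"
            if (state == want) total += $4
        }
        END { print total + 0 }
    '
}

# Sample input copied from the example above.
sample='PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
Interactive up infinite 1 down* fx-01-12-06
Interactive up infinite 2 alloc fx-01-12-[00-01]
Batch* up infinite 5 idle fx-01-12-[02-05,07]'

printf '%s\n' "$sample" | count_nodes_in_state idle
```

With the sample input this prints 5. On a live system the same function could be fed directly, for example with sinfo | count_nodes_in_state idle.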
4.2. Job allocation (salloc)
Obtain a job allocation (a set of nodes), run the specified command on it, and release the allocation when the command finishes.
Command format
$ salloc <option> <command>
Options
| Option | Explanation |
|---|---|
| -J <job name> | Specify a name for the job allocation |
| -p <partition name> | Request resources from the specified queue (partition) |
| -N <number of nodes> | Specify the number of nodes |
| -n <number of processes> | Specify the number of processes |
| --time=<time> | Set a limit on the total run time of the job allocation |
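The options above can be combined as in the following sketch. The job name, node and process counts, and time limit are hypothetical values; the partition name Interactive is taken from the sinfo example. The snippet only prints the command line when salloc is not installed, so it can be tried on any machine.

```shell
#!/bin/sh
# Hypothetical values: adjust them for your own job.
JOB_NAME="alloc_test"
PARTITION="Interactive"    # partition name from the sinfo example
NODES=1
TASKS=4
TIME_LIMIT="00:30:00"      # hours:minutes:seconds

# The trailing command (hostname) runs inside the allocation, and the
# allocation is released as soon as that command exits.
alloc_cmd="salloc -J $JOB_NAME -p $PARTITION -N $NODES -n $TASKS --time=$TIME_LIMIT hostname"

if command -v salloc >/dev/null 2>&1; then
    $alloc_cmd                         # Slurm is available: request the allocation
else
    echo "would run: $alloc_cmd"       # no Slurm here: only show the command
fi
```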
4.3. Job execution (srun)
Run a parallel job on a cluster managed by Slurm.
Command format
$ srun <option> <execute job>
Options
| Option | Explanation |
|---|---|
| -J <job name> | Specify a name for the job |
| -p <partition name> | Submit the job to the specified queue (partition) |
| -N <number of nodes> | Specify the number of nodes |
| -n <number of processes> | Specify the number of processes |
| -o ./out_%j.log | Write standard output to a file called “out_*jobID*.log” |
| -e ./err_%j.log | Write standard error output to a file called “err_*jobID*.log” |
| --time=<time> | Set a limit on the total run time of the job |
| --pty <SHELL> | Run the given shell interactively in a pseudo-terminal |
| --preserve-env | Pass the current values of the environment variables SLURM_JOB_NUM_NODES and SLURM_NTASKS to the executable instead of computing them from the command-line options |
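To illustrate what the -n option means for the program being launched, here is a sketch of a small per-task script. srun starts one copy of the command per process and sets SLURM_PROCID (the index of that copy) and SLURM_NTASKS (the total number of copies) in its environment. The file name task_info.sh and the function name task_info are hypothetical.

```shell
#!/bin/sh
# Hypothetical per-task script, for example saved as task_info.sh and
# launched with:
#   srun -p Batch -N 1 -n 4 ./task_info.sh
# The defaults after ":-" only apply when the script runs outside Slurm.
task_info() {
    rank="${SLURM_PROCID:-0}"      # index of this task, starting at 0
    ntasks="${SLURM_NTASKS:-1}"    # total number of tasks in the job step
    echo "task $rank of $ntasks"
}

task_info
```

Launched with -n 4, each of the four copies would report its own task index. Run directly, without srun, the script falls back to the defaults and reports task 0 of 1.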
4.4. Job execution (sbatch)
Submit a batch script to Slurm.
Command format
$ sbatch <option> <job script>
Options
| Option | Explanation |
|---|---|
| -J <job name> | Specify a name for the job |
| -p <partition name> | Submit the job to the specified queue (partition) |
| -N <number of nodes> | Specify the number of nodes |
| -n <number of processes> | Specify the number of processes |
| -o ./out_%j.log | Write standard output to a file called “out_*jobID*.log” |
| -e ./err_%j.log | Write standard error output to a file called “err_*jobID*.log” |
| --time=<time> | Set a limit on the total run time of the job |
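The same options can also be written inside the job script as #SBATCH lines, which sbatch reads before running the script. The following sketch writes such a script and submits it. The file name example.job, the job name, and the resource sizes are hypothetical; the partition name Batch is taken from the sinfo example. Nothing is submitted when sbatch is not installed.

```shell
#!/bin/sh
# Write a small job script (hypothetical file name and values).
cat > example.job <<'EOF'
#!/bin/bash
#SBATCH -J example
#SBATCH -p Batch
#SBATCH -N 2
#SBATCH -n 8
#SBATCH -o ./out_%j.log
#SBATCH -e ./err_%j.log
#SBATCH --time=00:10:00

# Run one copy of hostname per requested process.
srun hostname
EOF

# Submit the script if sbatch is available; otherwise only report the file.
if command -v sbatch >/dev/null 2>&1; then
    sbatch example.job
else
    echo "wrote example.job (sbatch not found, nothing submitted)"
fi
```

Options given on the sbatch command line take precedence over the #SBATCH lines, so the script can keep sensible defaults that are overridden per submission.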
4.5. Check running jobs (squeue)
Display a list of your queued and running jobs together with their job information. Jobs submitted by other users are not displayed.
Command format
$ squeue
Output
| Item | Explanation |
|---|---|
| JOBID | Job ID assigned to the job |
| PARTITION | Name of the queue (partition) to which the job was submitted |
| NAME | Job name. The command string is displayed if no name was specified. |
| USER | User who submitted the job |
| ST | Status of the job. See the table below for the status list. |
| TIME | Job execution time |
| NODES | Number of nodes used for job execution |
| NODELIST(REASON) | List of host names on which the job is executed. For a pending job, the reason it is waiting is shown in parentheses instead. |
Job status description
| State | Explanation |
|---|---|
| CA(CANCELLED) | Cancelled by the user or an administrator |
| CD(COMPLETED) | All processes on all nodes terminated with an exit code of zero |
| CF(CONFIGURING) | Resources have been allocated; waiting for them to become ready for use |
| CG(COMPLETING) | Termination procedure in progress |
| F(FAILED) | Terminated with a non-zero exit code or other failure |
| NF(NODE_FAIL) | Terminated because one of the allocated nodes failed |
| PD(PENDING) | Waiting for resource allocation |
| PR(PREEMPTED) | Terminated due to preemption |
| R(RUNNING) | Currently running |
| S(SUSPENDED) | Has an allocation, but execution has been suspended |
| TO(TIMEOUT) | Terminated upon reaching its time limit |
Example of command execution
[UserY@loginvm-XXX ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
10114 Batch sleep UserY R 1:41 1 fx-01-10-02
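The ST column is convenient for scripting. As a sketch, the function below prints the job IDs of all jobs in a given state. The name jobs_in_state is hypothetical. It assumes the default eight-column layout shown above and job names without spaces, and it reads from standard input so it can be tried without a cluster.

```shell
#!/bin/sh
# jobs_in_state CODE
# Hypothetical helper: reads default-format squeue output on stdin and
# prints the JOBID of every job whose ST column equals CODE (e.g. R or PD).
jobs_in_state() {
    awk -v code="$1" 'NR > 1 && $5 == code { print $1 }'
}

# Sample input copied from the example above.
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
10114 Batch sleep UserY R 1:41 1 fx-01-10-02'

printf '%s\n' "$sample" | jobs_in_state R
```

With the sample input this prints 10114. On a live system the function could be used as squeue | jobs_in_state PD to list jobs that are still waiting.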
4.6. Check running jobs including other users (squeues)
Display node usage information for currently running jobs on the login server, including the jobs of other users. The information is refreshed every 10 seconds, so it might be slightly out of date. Unlike squeue, this is not a standard Slurm command.
Command format
$ squeues
Output
| Item | Explanation |
|---|---|
| JOBID | Job ID assigned to the job |
| NODES | Number of nodes used for job execution |
| END_TIME | Time at which the job is due to end |
| TIME_LEFT | Time remaining until the job ends |
| NODELIST | List of nodes used for job execution |
| ST | Status of the job. See the squeue command for the status list. |
| SCHEDNODES | Nodes that will be used if the job is pending. Shows (null) if the job is running. |
Example of command execution
[UserY@loginvm-XXX ~]$ squeues
JOBID NODES END_TIME TIME_LEFT NODELIST ST SCHEDNODES
10115 1 2023-03-01T0:30:00 2:00:00 fx-01-12-00 R (null)
4.7. Abort job (scancel)
Cancel a queued or running job by specifying its job ID. To cancel several jobs at once, list their job IDs separated by spaces.
Command format
$ scancel <JOBID> <JOBID>
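A short sketch of cancelling two jobs in one call follows. The job IDs are hypothetical, and the variable SCANCEL is an illustration-only switch: by default the command is only printed, so nothing is cancelled by accident.

```shell
#!/bin/sh
# Hypothetical job IDs; replace them with IDs reported by squeue.
job_ids="10114 10115"

# SCANCEL defaults to "echo scancel", which only prints the command.
# Set SCANCEL=scancel in the environment to really cancel the jobs.
SCANCEL="${SCANCEL:-echo scancel}"

# $job_ids is intentionally unquoted so each ID becomes its own argument.
$SCANCEL $job_ids
```

With the default setting this prints scancel 10114 10115.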
4.8. Check the jobs that have completed execution (sacct)
Display a list of jobs that have completed execution. Jobs executed by other users are not displayed.
Command format
$ sacct <option>
Options
| Option | Explanation |
|---|---|
| -j <job ID> | Specify the job ID |
| -o <item, item, …> | Specify the output items, separated by commas. See the table below for output items. |
| -e | Show the items that can be specified with the -o option |
| -S, --starttime | Display information after the specified date and time. If not specified, 00:00 of the current day is used. |
| -E, --endtime | Display information before the specified date and time |
Output items
| Item | Explanation |
|---|---|
| User | User who submitted the job |
| JobID | Job ID assigned to the job |
| Partition | Name of the queue (partition) to which the job was submitted |
| NNodes | Number of nodes used for job execution |
| Submit | Date and time the job was submitted |
| Start | Date and time job execution started |
| End | Date and time job execution completed |
| Elapsed | Job execution time |
| State | Job status |
Example of command execution
[UserY@loginvm-XXX ~]$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
10122 bash Interacti+ 48 COMPLETED 0:0
10122.0 bash 48 CANCELLED 0:53
10124 bash Interacti+ 48 COMPLETED 0:0
10124.0 bash 48 CANCELLED 0:53
10125 sim.job Batch 3072 FAILED 4:0
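The default column layout is awkward to parse because some fields can be empty. As a sketch, sacct can instead be asked for pipe-separated output with only the needed items, for example sacct -n -P -o JobID,State (-n suppresses the header, -P separates fields with "|"). The function below, with the hypothetical name failed_jobs, reads such lines and prints the IDs of failed jobs. The sample input was rewritten by hand from the example above.

```shell
#!/bin/sh
# failed_jobs
# Hypothetical helper: reads "JobID|State" lines on stdin and prints the
# IDs of jobs whose state is FAILED. Job steps (IDs containing ".") are
# skipped so each job is reported once.
failed_jobs() {
    awk -F'|' '$1 !~ /\./ && $2 == "FAILED" { print $1 }'
}

# Job IDs and states taken from the example above.
sample='10122|COMPLETED
10122.0|CANCELLED
10124|COMPLETED
10124.0|CANCELLED
10125|FAILED'

printf '%s\n' "$sample" | failed_jobs
```

With the sample input this prints 10125. On a live system the equivalent would be sacct -n -P -o JobID,State | failed_jobs.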