4. Slurm Commands

4.1. Check nodes and partitions (sinfo)

View information about queues (partitions).

Command format

$ sinfo

Output

Item       Explanation
PARTITION  Name of the partition. The suffix “*” identifies the default partition.
AVAIL      Partition state.
TIMELIMIT  Maximum time limit for any user job. “infinite” is shown for a partition without a job time limit.
NODES      Number of nodes in the partition that are in the listed state.
STATE      State of the nodes. The suffix “*” identifies nodes that are presently not responding.
NODELIST   Names of the nodes.

Example of command execution

[UserY@loginvm-XXX ~]$ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
Interactive    up   infinite      1  down* fx-01-12-06
Interactive    up   infinite      2  alloc fx-01-12-[00-01]
Batch*         up   infinite      5   idle fx-01-12-[02-05,07]
・
・
・
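The columns described above can be post-processed with standard tools. The following is a minimal sketch, not part of the original guide: the function name `nodes_by_state` is hypothetical, and it assumes the default six-column layout shown in the example (NODES in field 4, STATE in field 5).

```shell
# Sum the NODES column per STATE from sinfo-style output read on stdin.
# Assumption: default sinfo layout, i.e. NODES is field 4 and STATE is field 5.
# Lines with fewer than six fields (such as elided rows) are ignored.
nodes_by_state() {
    awk 'NR > 1 && NF >= 6 { n[$5] += $4 } END { for (s in n) print s, n[s] }' | sort
}

# Usage on a login node:
#   sinfo | nodes_by_state
```

For the three rows shown in the example, this would report 2 nodes in alloc, 1 in down* and 5 in idle.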

4.2. Job assignment (salloc)

Obtain an allocation (a set of nodes) for the job, run the specified command on it, and release the allocation when the command finishes.

Command format

$ salloc <option> <command>

Options

Option                    Explanation
-J <job name>             Specify a name for the job allocation.
-p <partition name>       Request the allocation in the specified queue (partition).
-N <number of nodes>      Specify the number of nodes.
-n <number of processes>  Specify the number of processes.
--time=<time>             Set a limit on the total run time of the job allocation.
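Slurm accepts several spellings for the time limit, including days-hours:minutes:seconds. As a small illustration, the sketch below converts a duration in seconds into that form. The helper name `to_slurm_time` and the values in the usage comment are assumptions made for this example only.

```shell
# Convert a duration in seconds to the D-HH:MM:SS form accepted by --time.
to_slurm_time() {
    s=$1
    printf '%d-%02d:%02d:%02d\n' \
        $((s / 86400)) $((s % 86400 / 3600)) $((s % 3600 / 60)) $((s % 60))
}

to_slurm_time 5400    # prints 0-01:30:00

# Hypothetical use on a login node (partition name taken from the sinfo example):
#   salloc -J demo -p Interactive -N 1 -n 4 --time="$(to_slurm_time 5400)" hostname
```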

4.3. Job execution (srun)

Run a parallel job on a cluster managed by Slurm.

Command format

$ srun <option> <execute job>

Options

Option                    Explanation
-J <job name>             Specify a name for the job allocation.
-p <partition name>       Submit the job to the specified queue (partition).
-N <number of nodes>      Specify the number of nodes.
-n <number of processes>  Specify the number of processes.
-o ./out_%j.log           Write standard output to a file named “out_<job ID>.log”.
-e ./err_%j.log           Write standard error to a file named “err_<job ID>.log”.
--time=<time>             Set a limit on the total run time of the job.
--pty <SHELL>             Run the given shell interactively.
--preserve-env            Pass the current values of the environment variables SLURM_JOB_NUM_NODES and SLURM_NTASKS to the executable.
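To make the `%j` placeholder in the -o and -e options concrete, the sketch below mimics the substitution for a given job ID. Slurm performs this replacement itself when the job starts; the function `expand_job_pattern` is a hypothetical helper for illustration, and the srun line in the comment uses placeholder values.

```shell
# Show the file name that a pattern such as ./out_%j.log resolves to.
# This only imitates the %j substitution that Slurm applies at run time.
expand_job_pattern() {
    printf '%s\n' "$1" | sed "s/%j/$2/g"
}

expand_job_pattern ./out_%j.log 10114    # prints ./out_10114.log

# Hypothetical srun call using the same patterns (./a.out is a placeholder):
#   srun -J demo -p Batch -N 1 -n 4 -o ./out_%j.log -e ./err_%j.log --time=00:10:00 ./a.out
```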

4.4. Job execution (sbatch)

Submit a batch script to Slurm.

Command format

$ sbatch <option> <job script>

Options

Option                    Explanation
-J <job name>             Specify a name for the job allocation.
-p <partition name>       Submit the job to the specified queue (partition).
-N <number of nodes>      Specify the number of nodes.
-n <number of processes>  Specify the number of processes.
-o ./out_%j.log           Write standard output to a file named “out_<job ID>.log”.
-e ./err_%j.log           Write standard error to a file named “err_<job ID>.log”.
--time=<time>             Set a limit on the total run time of the job.
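Besides the command line, sbatch also reads options from lines in the job script that begin with `#SBATCH`. The sketch below writes a minimal script that uses the options from the table and defines a helper to pick the job ID out of the "Submitted batch job <ID>" message that sbatch prints. The file location, job name, resource values, and the helper name `parse_job_id` are placeholders for illustration, not values prescribed by this system.

```shell
# Write a minimal job script. Every value (job name, partition, counts,
# time limit) is a placeholder chosen for illustration.
job_script="${TMPDIR:-/tmp}/demo_job.sh"
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH -J demo
#SBATCH -p Batch
#SBATCH -N 1
#SBATCH -n 4
#SBATCH -o ./out_%j.log
#SBATCH -e ./err_%j.log
#SBATCH --time=00:10:00
srun hostname
EOF

# Extract the job ID from the "Submitted batch job <ID>" line on stdin.
parse_job_id() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Usage on a login node:
#   jobid=$(sbatch "$job_script" | parse_job_id)
```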

4.5. Check running jobs (squeue)

Display a list of your queued and running jobs together with their job information. Jobs submitted by other users are not displayed.

Command format

$ squeue

Output

Item              Explanation
JOBID             Job ID assigned to the job.
PARTITION         Name of the queue (partition) to which the job was submitted.
NAME              Job name. If no name was specified, the command string is displayed.
USER              User who submitted the job.
ST                Job status. See the table below for the list of statuses.
TIME              Job execution time.
NODES             Number of nodes used for job execution.
NODELIST(REASON)  List of host names on which the job is running. For a job that is not yet running, the reason is shown in parentheses instead.

Job status description

State            Explanation
CA(CANCELLED)    Cancelled by the user or an administrator.
CD(COMPLETED)    All processes on all nodes terminated with an exit code of zero.
CF(CONFIGURING)  Resources have been allocated; the job is waiting for them to become ready for use.
CG(COMPLETING)   The job is in the process of terminating.
F(FAILED)        Terminated with a non-zero exit code or other failure.
NF(NODE_FAIL)    Terminated because one of the allocated nodes failed.
PD(PENDING)      Waiting for resource allocation.
PR(PREEMPTED)    Terminated due to preemption.
R(RUNNING)       Currently running.
S(SUSPENDED)     Has an allocation, but execution has been suspended.
TO(TIMEOUT)      Terminated because the time limit was reached.

Example of command execution

[UserY@loginvm-XXX ~]$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
10114     Batch    sleep    UserY  R       1:41      1 fx-01-10-02
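The ST column makes it easy to pick out jobs in a particular state. The following sketch is one possible way to do so; the function name `jobs_in_state` is hypothetical, and it assumes the default squeue layout shown above (JOBID in field 1, ST in field 5) and job names that contain no spaces.

```shell
# Print the JOBID of every job whose ST column equals the given state code.
# Assumption: default squeue layout (JOBID = field 1, ST = field 5),
# and job names without embedded spaces.
jobs_in_state() {
    awk -v st="$1" 'NR > 1 && $5 == st { print $1 }'
}

# Usage on a login node:
#   squeue | jobs_in_state R     # IDs of running jobs
#   squeue | jobs_in_state PD    # IDs of pending jobs
```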

4.6. Check running jobs including other users (squeues)

Display node usage information for currently running jobs, both your own and other users’, on the login server. The information may be slightly out of date because it is refreshed every 10 seconds. This command is not a standard Slurm command.

Command format

$ squeues

Output

Item        Explanation
JOBID       Job ID assigned to the job.
NODES       Number of nodes used for job execution.
END_TIME    Time at which the job is due to end.
TIME_LEFT   Time remaining until the job ends.
NODELIST    List of nodes used for job execution.
ST          Job status. See the squeue command for the list of statuses.
SCHEDNODES  Nodes that will be used if the job is pending. For a running job, (null) is shown.

Example of command execution

[UserY@loginvm-XXX ~]$ squeues
JOBID NODES END_TIME TIME_LEFT NODELIST ST SCHEDNODES
10115 1 2023-03-01T0:30:00 2:00:00 fx-01-12-00 R (null)

4.7. Abort job (scancel)

Abort a running job by specifying its job ID. To cancel several jobs at once, list their job IDs separated by spaces.

Command format

$ scancel <JOBID> <JOBID>
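Because a cancellation cannot be undone, it can help to review the command before running it. The sketch below assembles a single scancel command from a list of job IDs and prints it rather than executing it. The function name `scancel_preview` and the job IDs are illustrative assumptions.

```shell
# Build one scancel command from job IDs read on stdin (one per line)
# and print it, so the list can be reviewed before anything is cancelled.
scancel_preview() {
    awk 'NF { ids = ids " " $1 } END { if (ids != "") print "scancel" ids }'
}

printf '%s\n' 10114 10115 | scancel_preview    # prints: scancel 10114 10115

# Hypothetical use on a login node (-h drops the header, -o %i prints only job IDs):
#   squeue -h -o %i | scancel_preview
```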

4.8. Check the jobs that have completed execution (sacct)

Display a list of jobs that have completed execution. Jobs executed by other users are not displayed.

Command format

$ sacct <option>

Options

Option              Explanation
-j <job ID>         Specify the job ID.
-o <item, item, …>  Specify output items separated by commas. See the table below for output items.
-e                  Show the items that can be specified with the -o option.
-S, --starttime     Display information from the specified date and time onward. If not specified, 00:00 of the current day is used.
-E, --endtime       Display information up to the specified date and time.

Output items

Item       Explanation
User       User who submitted the job.
JobID      Job ID assigned to the job.
Partition  Name of the queue (partition) to which the job was submitted.
NNodes     Number of nodes used for job execution.
Submit     Date and time the job was submitted.
Start      Date and time job execution started.
End        Date and time job execution completed.
Elapsed    Job execution time.
State      Job status.

Example of command execution

[UserY@loginvm-XXX ~]$ sacct
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
10122              bash Interacti+                    48  COMPLETED      0:0
10122.0            bash                               48  CANCELLED     0:53
10124              bash Interacti+                    48  COMPLETED      0:0
10124.0            bash                               48  CANCELLED     0:53
10125           sim.job      Batch                  3072     FAILED      4:0
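In the default output above, the last two columns are State and ExitCode. As a rough sketch, the following filter lists the entries whose exit code is not 0:0. The function name `nonzero_exit` is hypothetical. It counts fields from the right so that the blank Account column does not shift positions, and it assumes State values without spaces.

```shell
# List entries whose ExitCode (last field) is not 0:0, reading default
# sacct output on stdin. The first two lines (header and rule) are skipped.
# Assumption: State is a single word, so it is always the second-to-last field.
nonzero_exit() {
    awk 'NR > 2 && NF >= 5 && $NF != "0:0" { print $1, $(NF - 1), $NF }'
}

# Usage on a login node:
#   sacct | nonzero_exit
# Selecting specific items from the table above for one job:
#   sacct -j 10125 -o JobID,Partition,NNodes,Elapsed,State
```

For the example output above, this would list 10122.0 and 10124.0 (CANCELLED, 0:53) and 10125 (FAILED, 4:0).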