TORQUE
Torque is a resource manager that provides the functionality to start, cancel, and monitor batch jobs sent to the cluster. The basic Torque scheduler has been disabled and scheduling is now done with Moab. Please reference the Moab page for details about submitting batch jobs, scheduling, queues, etc.
Torque documentation is being maintained here for now. Full documentation is available at Adaptive Computing [1].
Submitting Jobs with msub
The basic Torque scheduler has been replaced with Moab. Please use msub instead of qsub. Job scripts for Torque should also work for Moab.
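For example, the testjob script used in the examples below can be submitted through Moab with the same resource options:
msub -l nodes=2:ppn=4 testjob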
A job is created by submitting an executable script to the batch server with qsub [2]. The qsub documentation describes a variety of command line arguments for requesting resources, declaring the job name, specifying the priority or destination queue, defining the mail options, etc. The script contains the commands that will be executed on the compute node that TORQUE assigns to the job. For jobs that request multiple nodes, the script runs on a single node and should contain the commands necessary to utilize all of the processors assigned to the job; an example of an MPI job script is below. Job scripts can also contain PBS directives that replace the qsub command line arguments.
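For instance, a command line that names the job, selects a destination queue, and requests mail at abort, begin, and end might look like the following (the queue name batch and the script name myscript.sh are placeholders, not necessarily defined on this cluster):
qsub -N myjob -q batch -m abe -M username@mtech.edu myscript.sh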
Requesting Resources
There are 20 compute nodes in the cluster with 32 processors per node. If no resources are requested, a single processor on one node is assigned. Use the -l flag to request resources [3]. For example, "qsub -l nodes=4" allocates 1 processor on each of four nodes, because the default is 1 processor per requested node. To request all the processors on a node, use ppn=32 (e.g., qsub -l nodes=4:ppn=32). Other commonly requested resources include memory size and walltime.
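For example, a request for two full nodes, 8 GB of memory, and a four-hour walltime limit could be written as follows (the memory and walltime values are only illustrative):
qsub -l nodes=2:ppn=32,mem=8gb,walltime=04:00:00 testjob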
Examples
Interactive Job
To run a program interactively on a compute node:
qsub -I
If you want to request a specific node, use the -l option with the resource request:
qsub -I -l nodes=n9
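The -I option opens an interactive shell on the assigned compute node; exiting that shell ends the job.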
Script without PBS directives
A script does not require PBS directives. For instance, a simple testjob script that prints the host name and pings the management node would contain:
#!/bin/sh
hostname
ping -c 30 scyld
To request 2 nodes and 4 processors per node with a mail message when the job ends, the command line would look like:
qsub testjob -l nodes=2:ppn=4 -m e -M username@mtech.edu
An output file will be created that contains the hostname that the script ran on and the output from pinging the management node for 30 seconds.
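When the job finishes, the output and error files are returned to the directory the job was submitted from, named after the script and job id. For example, if the job id were 456, the output could be viewed with:
cat testjob.o456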
Script with PBS directives
Since scripts are normally submitted several times, it is more convenient to include the qsub options in the script file as PBS directives. The previous testjob script would become:
#!/bin/sh
#PBS -l nodes=2:ppn=4
#PBS -N PingJob
#PBS -d /home/mtech/username
#PBS -m e
#PBS -M username@mtech.edu
#PBS -l walltime=00:01:00
hostname
ping -c 30 scyld
The job is now simply submitted with:
qsub testjob
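qsub prints the job id that was assigned (for example 456.hpc), and because of the -N directive the output file will now be named PingJob.o456 instead of testjob.o456.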
Script for MPI job
Applications that use MPI require slightly more sophisticated scripts that set the shell and MPI version, identify the compute nodes allocated to the job, and launch the MPI processes on those nodes. An example for MPICH2:
#!/bin/bash
#PBS -l nodes=4:ppn=32
#PBS -N MPIJob
#PBS -d /home/mtech/username
#PBS -S /bin/bash
#PBS -m e
#PBS -M username@mtech.edu
#PBS -l walltime=00:10:00
# Build a host file with one entry per unique node assigned to the job
MPDHOSTS=mpd.hosts.$PBS_JOBID
sort -u $PBS_NODEFILE > $MPDHOSTS
# $PBS_NODEFILE lists each node once per assigned processor,
# so the line counts give the node and process counts
NODES=`cat $MPDHOSTS | wc -l`
NPROCS=`cat $PBS_NODEFILE | wc -l`
echo "NODES=$NODES"
echo "NPROCS=$NPROCS"
# Load the MPICH2 environment and start one MPI process per assigned processor
module load mpich2/gnu
mpirun -np $NPROCS --hostfile $MPDHOSTS mympiapp
# Remove the temporary host file
rm $MPDHOSTS
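With nodes=4:ppn=32, $PBS_NODEFILE contains 128 entries (one per processor), so the script reports NODES=4 and NPROCS=128 and starts 128 MPI processes.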
InfiniBand with OpenMPI
By default, the 1-gigabit Ethernet network is used. To direct OpenMPI to use the InfiniBand network, include --mca btl openib,sm,self:
module load openmpi/gnu
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib64
mpirun --mca btl openib,sm,self -np $NPROCS --hostfile $MPDHOSTS mympiapp
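In this list, openib selects the InfiniBand transport, sm uses shared memory for processes on the same node, and self handles a process sending messages to itself.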
Monitoring jobs with qstat
qstat will show the status of your jobs, but not the status of other users' jobs.
For the full status of your jobs, use the -f option. For the full status of an individual job, include the job id:
qstat -f 456.hpc
Another useful option is -n to view the nodes the job is running on.
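For example, to list the nodes assigned to the job from the qstat -f example above:
qstat -n 456.hpc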
Full qstat documentation is at http://www.adaptivecomputing.com/resources/docs/torque/4-0-1/help.htm#topics/commands/qstat.htm