TORQUE

Torque is a resource manager that provides the functionality to start, cancel, and monitor batch jobs sent to the cluster. The basic Torque scheduler has been disabled and scheduling is now done with Moab. Please reference the Moab page for details about submitting batch jobs, scheduling, queues, etc.

Torque documentation is being maintained here for now. Full documentation is available at Adaptive Computing [1].

Submitting Jobs with msub

The basic Torque scheduler has been replaced with Moab. Please use msub instead of qsub. Job scripts for Torque should also work for Moab.
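
For example, a script such as the testjob script shown below can be submitted with msub in place of qsub:

msub testjob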

A job is created by submitting an executable script to the batch server with qsub [2]. The qsub documentation describes a variety of command line arguments for requesting resources, declaring the job name, specifying the priority or destination queue, defining the mail options, etc. The script contains the commands that will be executed on the compute node assigned to the job by TORQUE. For jobs that request multiple nodes, the script runs on a single node and should contain the commands necessary to utilize all the processors assigned to the job. An example of an MPI job script is given below. Job scripts can contain PBS directives that replace the corresponding qsub command line arguments.
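
For example, a minimal submission that names the job and requests an e-mail when it ends might look like the following (myscript.sh is only a placeholder for your own script):

qsub -N MyJob -m e -M username@mtech.edu myscript.sh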

Requesting Resources

There are 20 compute nodes with 32 processors per node in the cluster. If no resources are requested, then a single processor on a node will be assigned. Use the -l flag to request resources [3]. For example, "qsub -l nodes=4" will allocate 1 processor on each of four nodes for the job, because the default is to assign 1 processor per node requested. To request all the processors on a node, use ppn=32 (e.g., qsub -l nodes=4:ppn=32). Other resources that are often requested are memory size and walltime.
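
For instance, a single -l request can combine nodes, processors per node, memory, and walltime (the values here are only illustrative, and myscript.sh is a placeholder):

qsub -l nodes=2:ppn=32,mem=8gb,walltime=02:00:00 myscript.sh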

Examples

Interactive Job

To run a program interactively on a compute node:

qsub -I

If you want to request a specific node, use the -l option with the resource request:

qsub -I -l nodes=n9
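
An interactive session can also combine -I with a larger resource request, for example:

qsub -I -l nodes=1:ppn=4,walltime=01:00:00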

Script without PBS directives

A script does not require PBS directives. For instance, a simple testjob script to print the host name and ping the management node would contain:

#!/bin/sh
hostname
ping -c 30 scyld

To request 2 nodes and 4 processors per node with a mail message when the job ends, the command line would look like:

qsub -l nodes=2:ppn=4 -m e -M username@mtech.edu testjob

An output file will be created that contains the hostname of the node the script ran on and the output from pinging the management node (30 packets, roughly 30 seconds).
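
By default, TORQUE names the standard output file <jobname>.o<jobid> and the standard error file <jobname>.e<jobid>. If the job above were assigned id 456, for example, its output could be viewed with:

cat testjob.o456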

Script with PBS directives

Since scripts are normally submitted several times, it is more convenient to include the qsub options in the script file as PBS directives. The previous testjob script would become:

#!/bin/sh
#PBS -l nodes=2:ppn=4
#PBS -N PingJob
#PBS -d /home/mtech/username
#PBS -m e
#PBS -M username@mtech.edu
#PBS -l walltime=00:01:00
hostname
ping -c 30 scyld

The job is now simply submitted with:

qsub testjob

Script for MPI job

Applications that use MPI require slightly more sophisticated scripts that set the shell and MPI version, identify the compute nodes allocated for the job, and launch the MPI processes on the assigned compute nodes. An example for MPICH2:

#!/bin/bash
#PBS -l nodes=4:ppn=32
#PBS -N MPIJob
#PBS -d /home/mtech/username
#PBS -S /bin/bash
#PBS -m e
#PBS -M username@mtech.edu
#PBS -l walltime=00:10:00

MPDHOSTS=mpd.hosts.$PBS_JOBID            # per-job host file
sort -u $PBS_NODEFILE > $MPDHOSTS        # one entry per unique node
NODES=`cat $MPDHOSTS | wc -l`            # number of nodes allocated
NPROCS=`cat $PBS_NODEFILE | wc -l`       # total number of processors allocated
echo "NODES=$NODES"
echo "NPROCS=$NPROCS"
module load mpich2/gnu
mpirun -np $NPROCS --hostfile $MPDHOSTS mympiapp
rm $MPDHOSTS                             # clean up the temporary host file
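
As with the earlier examples, the script is submitted with qsub; assuming it were saved as mpijob.sh (a placeholder name), the command would be:

qsub mpijob.sh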

InfiniBand with OpenMPI

By default, the 1 Gb Ethernet network is used. To make OpenMPI use the InfiniBand network instead, include --mca btl openib,sm,self:

module load openmpi/gnu
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib64
mpirun --mca btl openib,sm,self -np $NPROCS --hostfile $MPDHOSTS mympiapp
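
In a complete job script, these lines would take the place of the module load and mpirun lines in the MPICH2 example above; $NPROCS and $MPDHOSTS are assumed to be set as in that script.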

Monitoring jobs with qstat

qstat will show the status of your jobs, but not the status of other users' jobs.

For the full status of your jobs, use the -f option. For the full status of an individual job, include the job id:

qstat -f 456.hpc

Another useful option is -n to view the nodes the job is running on.
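
For example, to view the nodes assigned to the job above:

qstat -n 456.hpc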

Full qstat documentation is at http://www.adaptivecomputing.com/resources/docs/torque/4-0-1/help.htm#topics/commands/qstat.htm