High Performance Computing Environment

BATCH PROCESSING SYSTEM

Introduction
Modern supercomputers consist of a collection of more or less equal 'computing nodes', each node is equipped with one or more processors, memory, networking hardware and so on. In principle, such a node is not too different from a normal high-end PC.

In order to use the nodes efficiently, usage is made of a batch system. The concepts involved are not too difficult to understand, but there is a learning curve to get used to it. Moreover, there are some issues that are common to all batch systems, so that is why we wrote this Batch-HOWTO. We will explain what a batch system is, how to work with it and what the important details are. This document is by no means exhaustive and many important details and possibilities are not mentioned, but we hope that it suffices to give you a clear picture what a batch system is.

What is a batch system
Practically everybody is used to interactive computer systems: word processing, using the Internet, compiling and running programs, etc. Here one types in commands or clicks on some buttons, and the computer system gives an immediate response.
In a batch system however, one has to prepare a file containing the commands ( a 'job' ) that will be executed later. These commands invoke computer programs, define the location of files and so on. In general, the commands will be executed on another system, possible even a system in another country. Here we will restrict ourself to the situation that the jobs will run on the same system, albeit on other nodes than the node the users login to. The advantages of a batch system are:

Many jobs: It is possible to submit tens, hundreds or even thousands of jobs at the same time: the system will take care that they are run, as many at the same time as possible. Interactively, this would be not so easy.
Efficient usage of resources: When the computer is busy, interactive use would not be an option: one would have to wait until resources (a number of nodes for example) are available. In a job system, one simply submits the job, the system takes care when to run it.

Nodes, processors, cores, batch nodes and login nodes

These terms can be loosely defined as:

Node: a collection of processors, sharing the same memory and possibly the same (network) adapters. Example: a PC.
Processor: a housing containing one or more cores. Example: the Pentium processor in a PC.
Core: a (piece of) a microchip capable to carry out instructions in computer memory. Example: one of the cores of a dual core processor in a PC.
Batch node: a node dedicated for executing batch jobs.
Login node: a node dedicated for interactive work, for example to prepare a job, compile programs and so on.

A batch system is responsible for allocating cores, processors or nodes to a job. It depends on the system what kind of granularity is used. For example: in a system consisting of nodes, each with two single-core processors, it is customary to allocate a whole number of nodes to a job. On the other hand, if a node contains 64 or more processors, one can choose to allocate a number of processors to a job, allowing more than one job to run simultaneously on one node.

Common properties of batch systems
Various batch systems exist, but they all share the following parts:

A method to define the requirements of a job
A method to define the actions that are to be performed
Handling of standard output and standard error
A system to schedule jobs
Utilities to monitor the progress of jobs

Defining the requirements of a job
A job can be classified using two parameters:

How many processors or nodes are needed?
What is the estimated duration of the job?

Next to these requirements, one can specify more details: the amount of memory that is needed; request special hardware (fast network adapters for example); specify the name of the standard output file; and so on. The user prepares a file wherein the job-specifications are made in a special language. This language differs from one job system to another. Here follow two examples for a job needing 2 nodes for 1 hour:

Torque batch system:

#PBS -l walltime=1:00:00
#PBS -l nodes=2

Loadleveler batch system:

# @ wall_clock_limit = 1:00:00
# @ node = 2

Defining what has to be done
The things to be done are to be specified in a shell script. The shell can be bash, ksh, sh, csh and tcsh. We recommend to use in a job the same shell as the interactive login shell which normally will be bash. A typical shell script could be:
echo "Start of job at `date`"
cd $HOME/workdir
./myprog one two three
echo "End of job at `date`"
So, in order to use a batch system, one has to have a basic knowledge of shell programming. There are many bash tutorials on the web, here is an example

Putting it together
To prepare and submit a job, one logs in to a login node, using some communication program (winscp, putty, ssh). Example for a UNIX workstation, connecting to the virgo system at HPCE:
ssh ipaddress of server
and, using an editor (vi, emacs, joe) one puts the description of the requirements of a job and the shell script in one file. An example for the Torque batch system:
The contents of a file called 'myjob':
#PBS -lwalltime=3:00:00
#PBS -lnodes=2
echo "Start of job"
cd $HOME/workdir
./myprog phase1
echo "End of job"
To submit this job to the Torque batch system, issue the command:
qsub myjob
The system will take care that the job is run on two nodes and will abort the job after 3 hours since the start of the job, if it is still running. The job system assumes that the job is ended when the job script is ended. The accounted system usage is always the number of nodes or processors times the number of wall clock hours the job is running.

Parallel jobs
It is possible to write a computer program that uses more than one core at the same time in parallel. Such a program will need less time if there are more cores available. MPI and OpenMP are often used tools for creating parallel programs. Note that only parallelized programs can make use of more than one core. It is useless to ask for more than one core if the program is not parallelized.

Monitoring a job
Every job system offers some commands to monitor the status of a job. Have a look at the documentation for the system you are using.

Managing jobs
After a job has been submitted there are various possibilities to manage it. The most important one is canceling a submitted job: removing it from the queue if it not running yet, or removing it when it is running. Sometimes there are possibilities to change the requirements of an already submitted job.

Automatic rerun of jobs after system failure
Most batch systems arrange that jobs are automatically resubmitted after a system failure. This is not always desirable, and there is always a way to prevent automatic rescheduling of your jobs, look at the appropriate documentation.

Scratch disk space and fast disk access
Depending on the system, it can be important to optimize a job with respect to the usage of the input, output or intermediate (scratch) files. One can imagine, that a load of hundreds of jobs, all frequently accessing a database on the home file system can be very bad for the performance of the I/O performance if this file system is accessed by NFS or a similar system. Therefore it is wise to put frequent accessed files onto a special scratch file system. On a cluster this could be the local disks available in a node. Detailed information is available in the documentation of the various systems.

Running more than one process on a node
If the job system allocates a whole number of nodes to a job, and one wants to run serial (not parallelized) jobs, the usage of the resources can be bad. For example, if a node contains 8 cores, only 12.5% of the available CPU power would be used. Given the fact that submitters of serial jobs always want to run many of those (otherwise they would not need a supercomputer), the solution to enhance the efficiency is not difficult, without parallelizing the serial program. Here follows an example how to run 8 processes simultaneously on a eight-core node:
for i in {1..8}
do
cd $HOME/workdir.$i
myprog input output 2>&1 &
done
wait
In this example, 8 instances of the program `myprog` are started in the background, each running in a different working directory. The wait command is essential: without it, the job script itself would end almost immediately, and the job system would decide that the job is ended and presumably kill all the myprog processes. In practice, one would write a script to generate this kind of jobs.

How to find out if a script is running in the batch or interactively
Sometimes it comes in handy if one can determine in a script if this is running as part of a batch job or interactively. A general way to set an environment variable is as follows:
unset INTERACTIVE
/usr/bin/tty > /dev/null 2>&1 && export INTERACTIVE=1

The environment variable INTERACTIVE will be set in an interactive environment, and unset in a batch environment.