Description
Using a batch system is not difficult, but creating an efficient batch job takes some practical experience. The principles of a batch system are described in our Batch HOWTO.
On the Virgo system, LoadLeveler is installed as the batch system.
Batch jobs are submitted to the LoadLeveler system via
llsubmit
LoadLeveler is a batch scheduling system for submitting serial and parallel jobs. LoadLeveler matches job requirements with the available machine resources. Submitting and running your jobs through the batch system is necessary to avoid system overload and to enforce a fair share of the resources among the users. Before you can submit a job or perform any other job-related task, you need to build a job command file.
A job command file is simply a text file that may contain several types of information, such as LoadLeveler keyword statements (lines beginning with #@) and shell commands.
Your job command file may tell LoadLeveler what to run through the executable keyword, or it may serve as the executable itself, either by not specifying an executable or by explicitly setting executable to the job command file. Your job command file may therefore be a shell script that drives the programs you want to execute.
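As a minimal sketch, a job command file that serves as its own executable could look like the following. The job name, the file names and the 10-minute limit are illustrative assumptions, not Virgo defaults; a class keyword is omitted here because the available classes are site-specific (see llclass).

```shell
#!/bin/bash
# Hypothetical job command file; all names below are placeholders.
#@ job_name = hello
#@ job_type = serial
#@ output = $(job_name).$(jobid).out
#@ error = $(job_name).$(jobid).err
#@ wall_clock_limit = 00:10:00
#@ queue

# No executable keyword is set, so LoadLeveler runs this script itself.
# Replace the line below with the commands that start your program.
echo "Job started on $(hostname) at $(date)"
```

Saved under a name such as hello.job (again a placeholder), it would be submitted with llsubmit hello.job.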
The job command file may also contain several job steps that together constitute your job. You may name each job step, and you must have a queue keyword entry for each job step you want to run. Unless otherwise noted, the keywords you set for the first job step are inherited by all subsequent job steps. By default, LoadLeveler treats each job step as an independent entity, but with the dependency keyword you can run a step conditionally, depending on the return values of previous job steps.
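The following sketch shows how two dependent job steps might be written. The step names prepare and compute and the echoed messages are hypothetical; the script relies on LOADL_STEP_NAME, which LoadLeveler sets in the environment of each running step, and falls back to prepare when run outside the batch system.

```shell
#!/bin/bash
# Hypothetical two-step job; step names and commands are placeholders.
#@ job_name = pipeline
#@ output = $(job_name).$(jobid).$(stepid).out
#@ error = $(job_name).$(jobid).$(stepid).err
#@ wall_clock_limit = 00:30:00
#@ step_name = prepare
#@ queue

# The second step inherits the keywords above and is started only if
# the first step finished with return code 0.
#@ step_name = compute
#@ dependency = (prepare == 0)
#@ queue

# One script serves as the executable for both steps and branches
# on the name of the step that is currently running.
run_step() {
    case "$1" in
        prepare) echo "preparing input" ;;
        compute) echo "running computation" ;;
        *)       echo "unknown step: $1" >&2; return 1 ;;
    esac
}

run_step "${LOADL_STEP_NAME:-prepare}"
```

Each queue statement creates one job step, so the file above yields two steps from a single llsubmit call.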
Command | Description |
---|---|
llclass | Returns information about classes |
llq | Query information about jobs in the queues |
llcancel | Cancel job from the queue |
llstatus | Returns status information about nodes in the cluster |
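
A typical monitoring session might look like the sketch below. The job identifier is a made-up placeholder (llsubmit reports the real one when the job is submitted), and the guard only makes the snippet harmless on machines without LoadLeveler. Please check the man pages on Virgo for the options supported by the installed version.

```shell
# Hypothetical job step identifier; use the one reported by llsubmit.
JOB_ID="virgo.12345.0"

if command -v llq >/dev/null 2>&1; then
    llclass              # list the job classes and their limits
    llstatus             # show the state of the nodes in the cluster
    llq -u "$USER"       # show only your own jobs
    llq -s "$JOB_ID"     # ask why a job is still waiting
    llcancel "$JOB_ID"   # remove the job from the queue
else
    echo "LoadLeveler commands not found in PATH" >&2
fi
```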
Set environment variables before submitting your job
For some programs, we provide module files that modify the environment appropriately. If these modules are loaded inside the script section of the job command file, the environment is often not properly inherited. It is much safer to load them in your shell before submitting the job:
$ module load <module_name>
$ llsubmit file.job
Furthermore, environment variables should be explicitly copied to individual processes using the appropriate LoadLeveler keyword:
#@ environment = COPY_ALL
Do not oversubscribe for memory or CPU resources unnecessarily
Unless the entire node is reserved for running your job, requesting too much memory will make it more difficult for the system to schedule other jobs on that node, as there will be fewer jobs that can fit into the leftover portion of memory. Thus, part of the node may be idle while other researchers' jobs are needlessly delayed. Similarly, if your job can only scale well to 12 cores, requesting 24 cores will delay other researchers' jobs, while offering you only a marginal speedup. In this matter, please be considerate of your colleagues.
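To judge whether doubling the core count is worthwhile, you can compare two timed test runs. The helper below is a rough sketch using integer shell arithmetic; the function name and the timings are hypothetical and serve only as an illustration.

```shell
# Parallel efficiency (in percent) of a larger run relative to a smaller one:
#   100 * (t_small * n_small) / (t_large * n_large)
# Integer arithmetic is used, so the result is truncated.
scaling_efficiency() {
    local n_small=$1 t_small=$2 n_large=$3 t_large=$4
    echo $(( 100 * t_small * n_small / (t_large * n_large) ))
}

# Hypothetical timings: 100 s on 12 cores, 90 s on 24 cores.
echo "Efficiency going from 12 to 24 cores: $(scaling_efficiency 12 100 24 90)%"
```

With these made-up numbers the second set of 12 cores is used at only 55% efficiency, which is the kind of marginal speedup that does not justify the larger request.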
Do not oversubscribe for walltime
LoadLeveler uses backfill scheduling: if resources (cores, memory, or GPU devices) are reserved for a job that is still waiting for more resources, the scheduler can slip in a shorter job to use the currently idle resources. This increases scheduling efficiency and lets you receive your results sooner. The process, however, relies on a realistic estimate of the job's wall time, which is specified with the wall_clock_limit keyword.
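One way to arrive at a realistic value is to time a representative test run and add a modest safety margin. The helper below is a sketch of that arithmetic; the function name and the default margin of 20 percent are assumptions, not a site recommendation.

```shell
# Convert a measured runtime in seconds into HH:MM:SS, adding a safety margin.
# The margin (in percent) is a hypothetical tuning knob; it defaults to 20.
wall_clock_limit() {
    local runtime_s=$1
    local margin_percent=${2:-20}
    local total=$(( runtime_s + runtime_s * margin_percent / 100 ))
    printf '%02d:%02d:%02d\n' $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

# A test run that took 50 minutes (3000 s) leads to a one-hour request.
echo "#@ wall_clock_limit = $(wall_clock_limit 3000)"
```

The printed line can be pasted into the job command file in place of a generous guess.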