Starting Casper jobs¶
This page describes how to use PBS Pro to submit jobs to run on nodes in the Casper cluster. Unless GPUs are required, run jobs that need more than one compute node on Derecho instead.
Procedures for starting both interactive jobs and batch jobs on Casper are described below. Also:
- Compile your code on Casper nodes if you will run it on Casper.
- See Calculating charges to learn how core-hours charges are calculated for jobs that run on Casper.
Begin by logging in on Casper or Derecho.
Casper wall-clock limits
The wall-clock limit on the Casper cluster is 24 hours except as noted below. Specify the time your job needs as in the examples below, using either the hours:minutes:seconds format or the minutes:seconds format.
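For example, to request a 12-hour wall-clock limit in a batch script:
### Request 12 hours of wall-clock time
#PBS -l walltime=12:00:00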
Interactive jobs¶
Starting a remote command shell with qinteractive¶
Run the qinteractive command to start an interactive job from Casper login nodes. Invoking it without any arguments starts an interactive shell on the first available HTC node. The default wall-clock time is 6 hours.
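For example, to request a two-hour interactive session charged to a specific project (the project code and resource values below are placeholders):
### Request 4 CPUs and 20 GB of memory for 2 hours
qinteractive -A PROJ0001 -l select=1:ncpus=4:mem=20GB -l walltime=02:00:00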
You can also start an interactive job on casper directly from derecho nodes, as shown in the sketch below.
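A minimal sketch of cross-submission, assuming the PBS peer-scheduling @casper server suffix routes the request to Casper:
### Start an interactive Casper job from a Derecho login node
qinteractive @casper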
To use another type of node, include a select statement specifying the resources you need. The qinteractive command accepts all PBS flags and resource specifications, as detailed by man qsub.
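For example, the following sketch requests a node with a single V100 GPU (the project code and resource values are illustrative):
### Request 8 CPUs, 50 GB of memory, and one V100 GPU
qinteractive -A PROJ0001 -l select=1:ncpus=8:mem=50GB:ngpus=1 -l gpu_type=v100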
If you do not include a resource specification by using either a select statement or convenience flags, you will be assigned 1 CPU with 10 GB of memory and no GPUs.
If no project is assigned with either the -A option or the DAV_PROJECT environment variable, any valid project listed for your username will be chosen at random.
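To set a default project instead, export the variable in your shell (the project code is a placeholder):
### Set a default project for qinteractive submissions
export DAV_PROJECT=PROJ0001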
Starting a virtual desktop with vncmgr¶
If your work with complex programs such as MATLAB and VAPOR requires the use of virtual network computing (VNC) server and client software, use vncmgr instead of qinteractive. Using vncmgr simplifies configuring and running a VNC session in a Casper batch job. How to do that is documented here.
Batch jobs¶
Prepare a batch script by following one of the examples here. Most Casper batch jobs use the casper submission queue. The exception is GPU development jobs, which are submitted to the gpudev submission queue.
Be aware that the system does not import your login environment by default, so make sure your script loads the software modules that you will need to run the job.
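The following is a minimal sketch of a Casper batch script; the project code, resource values, module names, and program name are placeholders:
#!/bin/bash
#PBS -N my_casper_job
#PBS -A PROJ0001
#PBS -q casper
#PBS -l select=1:ncpus=4:mem=20GB
#PBS -l walltime=02:00:00
#PBS -j oe

### Load required modules explicitly; the login environment is not imported
module load ncarenv
module load intel

./my_program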
Caution: Avoid using the PBS -V option with cross-submission
Avoid using the PBS -V option to propagate your environment settings to the batch job; it can cause odd behaviors and job failures when used in submissions to Casper from Derecho. If you need to forward certain environment variables to your job, use the lower-case -v option to specify them. (See man qsub for details.)
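For example, forward only the variables you need (the variable names and values here are illustrative):
### Pass a named list of variables, with optional values, to the job
qsub -v TMPDIR,OMP_NUM_THREADS=4 job_script.pbs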
When your job script is ready, use qsub to submit it from the Casper login nodes.
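For example, assuming your script is named job_script.pbs:
qsub job_script.pbs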
GPU development jobs¶
A submission queue called gpudev is available between 8 a.m. and 5:30 p.m. Mountain time, Monday to Friday, to support application development and debugging efforts on general-purpose and ML/AI GPU applications. This queue provides rapid access to up to 4 V100 GPUs, avoiding the sometimes lengthy queue wait times in the gpgpu execution queue.
Job submissions to this queue are limited to 30 minutes of walltime instead of the 24-hour wall-clock limit for all other submissions. All jobs submitted to the queue must request one or more V100 GPUs (up to four) in their resource directives. Node memory can be specified explicitly as usual, but by default jobs will be assigned N/4 of the total memory on a node, where N is the number of V100 GPUs requested.
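A sketch of gpudev resource directives (the project code and CPU count are illustrative):
#PBS -q gpudev
#PBS -A PROJ0001
### Request 2 of the 4 available V100 GPUs; omitting mem accepts
### the default of N/4 of the node's memory, where N is the GPU count
#PBS -l select=1:ncpus=8:ngpus=2
#PBS -l gpu_type=v100
### Walltime cannot exceed the 30-minute queue limit
#PBS -l walltime=00:30:00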
Concurrent resource limits¶
Job limits are in place to ensure short dispatch times and a fair distribution of system resources. The specific limits that apply to your submission depend on the resources requested by your job. Based on your request, your submission will be classified as shown in the table.
| Submission queue | Job category (execution queue) | Job resource requests | Limits |
|---|---|---|---|
| casper (24-hour wall-clock limit) | largemem | mem > 768 GB; ncpus <= 64; ngpus = 0 | Up to 3 jobs can be run by a single user at once (consuming 50% of the largemem nodes); 5 jobs may be queued per user. |
| | htc | mem <= 768 GB; ncpus <= 64; ngpus = 0 | Up to 468 CPUs in use per user at any one time; up to 4,680 GB of memory per user at any one time (across all jobs in the category). |
| | vis | gpu_type = gp100 or l40 | Up to 2 GPUs in use per user at any one time; individual jobs are limited to a single gp100 or l40 (no multi-GPU jobs). |
| | gpgpu | gpu_type = v100, a100, or h100 | Up to 32 GPUs in use per user at any one time; users may submit jobs requesting more than 32 GPUs for execution on weekends. |
| gpudev (30-minute wall-clock limit) | | ncpus <= 36; 1 <= ngpus <= 4 | Queue is operational only from 8 a.m. to 5:30 p.m. Mountain time, Monday to Friday; users may have only one active job in the queue at any time. |
NVMe node-local storage¶
Casper nodes each have 2 TB of local NVMe solid-state disk (SSD) storage. Part of this space is used to augment memory, reducing the likelihood of jobs failing because of excessive memory use.
NVMe storage can also be used while a job is running (recommended only for I/O-intensive jobs). Data stored in /local_scratch/pbs.$PBS_JOBID are deleted when the job ends.
To use this disk space while your job is running, include the following in your batch script after customizing as needed.
### Copy input data to NVMe (can check that it fits first using "df -h")
cp -r /glade/scratch/$USER/input_data /local_scratch/pbs.$PBS_JOBID
### Run script to process data (NCL example takes input and output paths as command line arguments)
ncl proc_data.ncl /local_scratch/pbs.$PBS_JOBID/input_data /local_scratch/pbs.$PBS_JOBID/output_data
### Move output data before the job ends and your output is deleted
mv /local_scratch/pbs.$PBS_JOBID/output_data ${SCRATCH}
Script examples¶
See Casper job script examples for many sample PBS scripts.
When your script is ready, submit your batch job for scheduling as shown here.
Targeting node types¶
Casper has several node architectures available; full details are here. You can target a desired architecture using PBS select statements.
Example PBS select statements for targeting specific node types are shown here.
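As an illustrative sketch, select statements like the following can target specific node types (the resource values are assumptions; consult the node details page for actual hardware specifications):
### Target a large-memory node
#PBS -l select=1:ncpus=32:mem=1000GB
### Target a node with one A100 GPU
#PBS -l select=1:ncpus=16:ngpus=1
#PBS -l gpu_type=a100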