Gilbreth Queue Changes

July 18, 2019
Gilbreth

To support a wider range of workloads, reduce wait times in the queues, and improve GPU utilization on Gilbreth, we have made several changes to its queue configuration. Users now have access to two new queues on Gilbreth: long and highmem.

Because the training nodes have specialty hardware and are few in number (3), the training queue is restricted to users whose workloads can scale across 4 or more GPUs.

The intended use-cases of various queues are described below:

partner: (max walltime 24 hours) This is the default queue for submitting short to moderately long jobs. Since partner includes all the nodes in the cluster, your jobs will likely start sooner than they would in other queues.

long: (max walltime 7 days) If your job requires more than 24 hours to complete, you can submit it to the long queue. There are only 4 nodes in this queue, so you may have to wait for a considerable amount of time to get access to a node.

training: (max walltime 7 days) If your job can scale to 4 or more GPUs and requires longer than 24 hours, use this queue. Please note that ITaP staff may ask you to provide evidence that your jobs can fully utilize the GPUs before granting access to this queue. There are only 3 nodes in this queue, so you may have to wait a considerable amount of time before your job is scheduled.

highmem: (max walltime 4 hours) If your job requires GPUs with large memory (32 GB) but can finish in a short time, use the highmem queue. This queue shares nodes with the training queue, so you may need to wait until a node becomes available.

debug: (max walltime 30 minutes) This queue is intended for quick testing as you develop your code. Jobs are expected to start quickly on the debug queue if no one else is using it, and the short walltime limit keeps the queue from staying occupied for long.
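
The guidance above can be read as a small decision procedure. The sketch below is only an illustration of that reading, not an official tool: the function name `choose_queue` and its parameters are hypothetical, and the branch order reflects one interpretation of the descriptions. It does not submit jobs or check whether you have been granted access to the training queue.

```python
def choose_queue(walltime_hours, num_gpus=1, needs_32gb_gpu=False, quick_test=False):
    """Suggest a Gilbreth queue name from the descriptions above.

    Hypothetical helper for illustration only. All parameters are
    assumptions about how a user might describe a job.
    """
    if walltime_hours <= 0 or num_gpus < 1:
        raise ValueError("walltime and GPU count must be positive")
    if walltime_hours > 7 * 24:
        raise ValueError("no queue allows more than 7 days of walltime")

    if needs_32gb_gpu:
        # Only the highmem and training queues are described as having 32 GB GPUs.
        if walltime_hours <= 4:
            return "highmem"
        if num_gpus >= 4:
            return "training"  # access must be requested from staff
        raise ValueError("no documented queue fits; contact rcac-help@purdue.edu")

    if quick_test and walltime_hours <= 0.5:
        return "debug"
    if walltime_hours <= 24:
        return "partner"  # default queue, includes all nodes
    if num_gpus >= 4:
        return "training"  # access must be requested from staff
    return "long"


if __name__ == "__main__":
    print(choose_queue(2))                        # partner
    print(choose_queue(48))                       # long
    print(choose_queue(48, num_gpus=4))           # training
    print(choose_queue(3, needs_32gb_gpu=True))   # highmem
```

A job that needs 32 GB GPUs for more than 4 hours but cannot scale to 4 GPUs does not match any description above, so the sketch raises an error rather than guessing.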

Specifications

| Queue | GPU type | Intended use-case | Number of nodes | Max walltime |
|---|---|---|---|---|
| partner | Nvidia P100 (16 GB) or V100 (16 GB) | Short to moderately long jobs | 48 | 24 hours |
| long | V100 (16 GB) | Long jobs | 4 | 7 days |
| training | V100 (32 GB) | Long jobs such as Deep Learning model training; code must scale to 4 GPUs or more | 3* | 7 days |
| highmem | V100 (32 GB) | Short jobs that require large GPU memory | 3* | 4 hours |
| debug | P100 (16 GB) or V100 (16 GB) | Quick testing | 1 | 30 minutes |

* The training and highmem queues share the same 3 nodes.
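
The walltime limits in the specifications can also be written down as data, which makes it easy to check a planned request before submitting. This is a minimal sketch under stated assumptions: the names `MAX_WALLTIME`, `fits_queue`, and `queues_allowing` are hypothetical, and the check covers walltime only, not GPU type, node availability, or queue access.

```python
from datetime import timedelta

# Maximum walltime per queue, copied from the specifications above.
MAX_WALLTIME = {
    "partner": timedelta(hours=24),
    "long": timedelta(days=7),
    "training": timedelta(days=7),
    "highmem": timedelta(hours=4),
    "debug": timedelta(minutes=30),
}


def fits_queue(queue, requested):
    """Return True if `requested` (a timedelta) is within the queue's limit."""
    if queue not in MAX_WALLTIME:
        raise ValueError(f"unknown queue: {queue!r}")
    return timedelta(0) < requested <= MAX_WALLTIME[queue]


def queues_allowing(requested):
    """List the queues whose walltime limit covers `requested`."""
    return [q for q in MAX_WALLTIME if fits_queue(q, requested)]


if __name__ == "__main__":
    six_hours = timedelta(hours=6)
    print(fits_queue("highmem", six_hours))  # False
    print(queues_allowing(six_hours))        # ['partner', 'long', 'training']
```
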
If you have any questions about the changes, or need access to the training queue, please send an email to rcac-help@purdue.edu.

Originally posted: July 18, 2019  4:42pm