Gilbreth Queue Changes

July 18, 2019

To support a wider range of workloads, reduce queue wait times, and improve GPU utilization on Gilbreth, we have made several changes to its queue configuration. Users now have access to two new queues on Gilbreth: long and highmem.

Because the nodes in the training queue have specialty hardware and are few in number (3), access to them is restricted to users whose workloads can scale across 4 or more GPUs.

The intended use cases of the queues are described below:

partner: (max walltime 24 hours) This is the default queue for submitting short to moderately long jobs. Since partner includes all the nodes in the cluster, your jobs will likely start sooner than they would in other queues.

long: (max walltime 7 days) If your job requires more than 24 hours to complete, you can submit it to the long queue. There are only 4 nodes in this queue, so you may have to wait for a considerable amount of time to get access to a node.

training: (max walltime 7 days) If your job can scale to 4 or more GPUs and requires longer than 24 hours, use this queue. Please note that ITaP staff may ask you to provide evidence that your jobs can fully utilize the GPUs before granting access to this queue. There are only 3 nodes in this queue, so you may have to wait a considerable amount of time before your job is scheduled.

highmem: (max walltime 4 hours) If your job requires GPUs with large memory (32 GB) but can finish in a short time, use the highmem queue. This queue shares its nodes with the training queue, so you may need to wait until a node becomes available.

debug: (max walltime 30 mins) This queue is intended for quick testing as you develop your code. Jobs are expected to start quickly on the debug queue when no one else is using it, and the short walltime limit keeps the queue from staying occupied for long.
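The guidance above can be read as a small decision procedure. The following Python sketch is one possible interpretation, not an official tool: the function name `choose_queue` and its parameters are hypothetical, and the rule that a 32 GB job longer than 4 hours must go to training is an assumption drawn from the queue descriptions. Access to the training queue still has to be granted by staff.

```python
def choose_queue(walltime_hours, needs_32gb_gpu=False,
                 scales_to_4_gpus=False, quick_test=False):
    """Suggest a Gilbreth queue name from basic job requirements.

    Hypothetical helper: it encodes one reading of the queue
    descriptions and does not talk to the scheduler.
    """
    if walltime_hours <= 0:
        raise ValueError("walltime must be positive")
    if walltime_hours > 7 * 24:
        raise ValueError("no queue allows more than 7 days")

    # Quick tests of 30 minutes or less go to debug (16 GB GPUs only).
    if quick_test and not needs_32gb_gpu and walltime_hours <= 0.5:
        return "debug"

    # Only highmem and training offer 32 GB V100 GPUs.
    if needs_32gb_gpu:
        if walltime_hours <= 4:
            return "highmem"
        if scales_to_4_gpus:
            return "training"  # requires approved access
        raise ValueError(
            "32 GB GPUs beyond 4 hours are only in the training queue, "
            "which requires scaling to 4 or more GPUs"
        )

    # partner is the default for anything up to 24 hours.
    if walltime_hours <= 24:
        return "partner"

    # Longer jobs: training if the code scales, otherwise long.
    return "training" if scales_to_4_gpus else "long"


if __name__ == "__main__":
    print(choose_queue(2))
    print(choose_queue(48))
    print(choose_queue(3, needs_32gb_gpu=True))
```

The checks run from the most restrictive queues to the least, so a job that qualifies for a specialty queue is not sent to partner by accident.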

Queue Details
Queue	GPU type	Intended use case	Number of nodes	Max walltime
partner	Nvidia P100 (16 GB) or V100 (16 GB)	Short to moderately long jobs	48	24 hours
long	Nvidia V100 (16 GB)	Long jobs	4	7 days
training	Nvidia V100 (32 GB)	Long jobs such as deep learning model training; code must scale to 4 or more GPUs	3*	7 days
highmem	Nvidia V100 (32 GB)	Short jobs that require large GPU memory	3*	4 hours
debug	Nvidia P100 (16 GB) or V100 (16 GB)	Quick testing	1	30 mins

* The training and highmem queues share the same 3 nodes.
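For scripts that prepare job submissions, the walltime limits in the table can be kept as a small lookup. This is only an illustrative sketch: `QUEUE_LIMITS` and `fits_walltime` are hypothetical names, and the values are copied from the table as of this announcement, so they may not match later configurations.

```python
# Node counts and walltime limits (in hours) copied from the table above.
# training and highmem share the same 3 nodes.
QUEUE_LIMITS = {
    "partner":  {"nodes": 48, "max_hours": 24},
    "long":     {"nodes": 4,  "max_hours": 7 * 24},
    "training": {"nodes": 3,  "max_hours": 7 * 24},
    "highmem":  {"nodes": 3,  "max_hours": 4},
    "debug":    {"nodes": 1,  "max_hours": 0.5},
}


def fits_walltime(queue, hours):
    """Return True if a job of `hours` fits within the queue's limit."""
    if queue not in QUEUE_LIMITS:
        raise KeyError(f"unknown queue: {queue}")
    return 0 < hours <= QUEUE_LIMITS[queue]["max_hours"]


if __name__ == "__main__":
    for name in QUEUE_LIMITS:
        print(name, fits_walltime(name, 6))
```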

If you have any questions about the changes or need access to the training queue, please send an email to rcac-help@purdue.edu.

Originally posted: July 18, 2019 4:42pm EDT
