Unscheduled Brown outage

November 8, 2020  4:00pm – November 9, 2020  8:30am

UPDATE: November 9, 2020  9:19am

As of 8:30am on November 9, 2020, the problem has been isolated and fixed. The Brown cluster has been returned to normal service. Please report any issues to rcac-help@purdue.edu.

ORIGINAL: November 8, 2020  8:20pm

The Brown cluster began experiencing issues with its job scheduler around 4:00pm. The problem manifests as Slurm-related commands (slist, squeue, sinteractive, sbatch, etc.) being slow, unresponsive, or timing out. Queue selection dialogs in interactive job submission tools inside ThinLinc and the OnDemand gateway are affected as well. The scheduler itself seems to be functioning, and jobs already in the queue appear to be starting.

Engineers are currently diagnosing the issue and working to identify a fix. We will provide an update by noon tomorrow, or sooner if the investigation yields new information.
