Unscheduled Brown outage
November 8, 2020 4:00pm – November 9, 2020 8:30am
As of 8:30am on November 9, 2020, the problem has been isolated and fixed. The Brown cluster has been returned to normal service. Please report any issues to firstname.lastname@example.org.
ORIGINAL: November 8, 2020 8:20pm
The Brown cluster began experiencing issues with its job scheduler around 4:00pm. The problem manifests itself as Slurm-related commands (sbatch, etc.) being slow, unresponsive, or timing out. Queue selection dialogs in the interactive job submission tools inside ThinLinc and the OnDemand gateway are affected as well. The scheduler itself seems to be functioning, and jobs already in the queue appear to be starting.
Engineers are currently diagnosing the issue and working to identify a fix. We will provide an update by noon tomorrow, or sooner as our investigation progresses.