
Coates PBS scheduler issues


This week, ITaP engineers have been troubleshooting issues with the Coates cluster, with the most common symptom being PBS jobs that abort or restart after some period of run time.

Late yesterday afternoon, a change was made to the cluster's networking configuration that appears to have dramatically improved this issue.

Monitoring is in place through the coming weekend to watch for further PBS issues.

If you observe either of the following, please report it to rcac-help@purdue.edu:

  • Jobs that unexpectedly requeue and start over from the beginning
  • MPI communication errors in job output or error files

Be sure to include any job IDs or approximate timestamps of any observed issues.
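
For anyone with many jobs to check, the following is a minimal sketch of one way to collect job IDs and approximate timestamps for a report. It assumes the default PBS naming of output and error files (`<jobname>.o<jobid>` and `<jobname>.e<jobid>`) and uses a rough keyword pattern as a stand-in for "MPI communication error". The pattern and the function name `find_mpi_errors` are illustrative assumptions, not something provided by ITaP, so adjust them to the messages your MPI library actually prints.

```python
import re
import sys
from datetime import datetime
from pathlib import Path

# Default PBS naming: <jobname>.o<jobid> for output, <jobname>.e<jobid> for errors.
JOB_FILE = re.compile(r"\.[oe](\d+)$")

# Assumed heuristic: a line that mentions MPI together with an error-like word.
MPI_ERROR = re.compile(
    r"mpi.*(error|abort|fatal)|(error|abort|fatal).*mpi", re.IGNORECASE
)


def find_mpi_errors(directory):
    """Return (job_id, file_name, modified_time, first_matching_line) tuples."""
    hits = []
    for path in sorted(Path(directory).iterdir()):
        match = JOB_FILE.search(path.name)
        if not match or not path.is_file():
            continue
        with path.open(errors="replace") as handle:
            for line in handle:
                if MPI_ERROR.search(line):
                    # The file's modification time is only an approximate timestamp.
                    modified = datetime.fromtimestamp(path.stat().st_mtime)
                    hits.append((match.group(1), path.name, modified, line.strip()))
                    break  # one hit per file is enough for a report
    return hits


if __name__ == "__main__":
    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    for job_id, name, modified, line in find_mpi_errors(directory):
        print(f"job {job_id}  {name}  {modified:%Y-%m-%d %H:%M}  {line}")
```

Run from the directory that holds your job output, this prints one line per affected file, which can be pasted into a message to rcac-help@purdue.edu. It does not detect requeued jobs; those still need to be identified from the job's own output or from "qstat".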

Thank you again for your continued understanding.

Previous Message

An update on the previously described PBS issues:

  • Over the past several days, ITaP engineers have successfully resolved two separate network issues that were responsible for the symptoms described below.
  • Occurrences of the "qstat" and "qsub" issues described below have been greatly reduced but do not appear to be completely eliminated. Systems engineers are still investigating open issues with the responsiveness of "qstat" and with aborted or restarted jobs, including a number of jobs that were requeued shortly before 7:00pm yesterday evening (August 23).

We appreciate your continued understanding while this issue is being investigated.

Original message:

The Coates cluster is experiencing intermittent issues with its PBS job scheduler. These issues may manifest in any of the following ways:

  • qstat/qsub: cannot connect to server coates-adm.rcac.purdue.edu (errno=15007)
  • Slow response to "qstat" or "qsub" commands
  • Slow job scheduling, usually appearing as jobs remaining idle in the queue for longer than expected

ITaP systems engineers are troubleshooting this issue and hope to have a solution in place soon. Please contact rcac-help@purdue.edu if you experience any other issues with the community clusters.

Originally posted: