Bell Degraded Capacity

UPDATE:

As of Friday, November 18th, 2022, RCAC engineers have revived approximately 80 additional nodes, bringing the total number of available Bell-A nodes to 265 (or 55% of the full 480-node capacity). We have observed a significant improvement in job flow through the owner queues as a result of this recovered capacity.

We are now waiting for the vendor to deliver the parts needed for further repairs of the remaining downed nodes. We will provide more updates as soon as we have any new information. Thank you once again for your patience and understanding.

UPDATE:

Through selfless and creative efforts, on Thursday, November 17th, RCAC engineers revived and returned to service approximately 100 Bell nodes. This brings the cluster from 10% to approximately 30% of its capacity, and we have observed a significant increase in job flow.

Work continues on recovering as many nodes as possible, and we greatly appreciate your patience and understanding in this difficult situation. The standby queue remains disabled in order to reduce wait times in the owner queues. We will continue providing updates via this page and the Bell cluster mailing list. Please reach out to rcac-help@purdue.edu with any questions or concerns you might have.

UPDATE:

Dear Bell Community,

The Bell High Performance Computing cluster continues to operate at a very limited capacity due to cooling issues. Below is a summary of Bell’s degraded status.

  • As of September, a leak in Bell’s cooling system impacted a sizeable portion of the cluster. Due to the severity of the issue and the risk of long-term impact, several of Bell’s nodes were turned off. The removal of these nodes reduced Bell’s available capacity to around 60%. RCAC’s engineers immediately requested the vendor’s support and replacement parts, and were notified by the vendor of delivery delays due to the unavailability of these parts. The engineers are still waiting for the parts.
  • Bell’s capacity continued to decline gradually as the same issue spread. By early November, Bell’s capacity was down to around 35%.
  • On November 15th, 2022, a sudden escalation of the issue further reduced Bell’s capacity to 10%. RCAC’s engineers immediately started a mitigation plan, revisiting individual nodes to bring them back into service where the risks are deemed low. Furthermore, the standby queue was disabled to prioritize owners’ jobs.

We understand the impact of this issue on your research and are therefore working very closely with the vendor and dedicating as many resources as needed to resolve or alleviate the issue as quickly as possible. Please reach out to us at rcac-help@purdue.edu if you have any urgent computational needs. We will provide an update by Tuesday, November 22, or sooner if new information becomes available.

UPDATE:

On Tuesday, November 15th, 2022, more Bell nodes suffered failures of their power and cooling systems due to leaks in faulty cooling plates. At the moment Bell is down to 10% capacity and is operating in an emergency mode. While the vendor is shipping replacements as fast as they can, RCAC engineers are looking for ways to revive at least some of the downed nodes.

Please be aware that, given the lack of available core capacity, wait times in the owner queues can no longer be guaranteed and are expected to be very long until more nodes can be brought back online. Additionally, the standby queue has been disabled while we work to restore normal Bell operation.

ORIGINAL:

The Bell cluster continues to experience hardware issues. Engineers are currently diagnosing the issues and are working with vendors to schedule and perform repairs as quickly as possible.

Job scheduling continues, but you may experience longer-than-normal wait times as nodes are removed from service for repairs.

We will update this notice as we learn more.

Originally posted: