
Bell Degraded Capacity


UPDATE (May 23, 2023 11:22am EDT):

Despite ongoing supply chain delays, RCAC engineers and students continue work on reviving the remaining downed Bell nodes and bringing them back online. As of Tuesday, May 23rd, 2023, the Bell-A active node count is approaching 380 nodes, another significant high-water mark.

Due to this major advance, we have re-enabled the standby queue on Bell.
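
For users planning to take advantage of the re-enabled standby queue, the sketch below shows a minimal Slurm job script. This is illustrative only: it assumes RCAC's usual convention of submitting standby work under the "standby" account and the typical 4-hour standby walltime cap; consult the Bell user guide for the authoritative settings.

    #!/bin/bash
    #SBATCH -A standby          # assumed: standby jobs run under the "standby" account
    #SBATCH --nodes=1           # one node
    #SBATCH --ntasks=16         # sixteen tasks
    #SBATCH --time=04:00:00     # assumed: standby walltime is capped at 4 hours
    #SBATCH --job-name=standby-test

    srun ./my_program           # hypothetical executable; replace with your workload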

We are still waiting for the remaining parts necessary to bring 100% of the cluster back up. Thank you for your patience as we move toward that goal. Please reach out to rcac-help@purdue.edu with any questions or concerns.

UPDATE (December 10, 2022 10:46am EST):

On Friday, December 9th, 2022, the first batch of Bell's replacement cold plates was delivered. Through a concerted effort by RCAC engineers and students, the parts were immediately installed into the awaiting downed nodes. The Bell-A active node count now stands at 325 (68% of total capacity), a significant new high-water mark in our recovery efforts.

We are still waiting for other parts necessary to bring 100% of the cluster back up. Thank you for your patience as we move toward that goal. Please reach out to rcac-help@purdue.edu with any questions or concerns.

UPDATE (December 8, 2022 5:55pm EST):

Earlier on Thursday, December 8th, 2022, Bell users may have experienced job restarts and re-queues due to several additional node failures. This also led to increased wait times.

Engineers remediated the problems and returned the affected nodes to normal service, but users should examine their jobs' output and resubmit affected jobs if necessary.
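
If you are unsure whether any of your jobs were affected, Slurm's accounting query can list jobs that ended abnormally during the incident window. A hedged example using standard sacct options (adjust the start time as needed):

    # List your jobs since December 8th that failed or were hit by a node failure
    sacct -u $USER -S 2022-12-08 -X \
        --state=FAILED,NODE_FAIL \
        --format=JobID,JobName,State,ExitCode,End

Jobs shown as NODE_FAIL or FAILED can be resubmitted with sbatch once you have checked their output.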

We continue to wait on the vendor for the parts necessary to restore the Bell nodes that are still down due to the cooling issue. RCAC engineers are actively monitoring the cluster to ensure the remaining nodes stay available for user jobs. We greatly appreciate your patience.

UPDATE (November 18, 2022 3:30pm EST):

As of Friday, November 18th, 2022, RCAC engineers have revived approximately 80 additional nodes, bringing the available Bell-A node count to 265 (55% of the total 480-node capacity). We have observed a significant improvement in job flow through the owner queues thanks to this recovered capacity.

We are now waiting on the vendor to deliver the parts necessary for further repairs of the remaining downed nodes. We will provide more updates as soon as we have any new information. Thank you once again for your patience and understanding.

UPDATE (November 18, 2022 9:55am EST):

Through selfless and creative efforts, RCAC engineers revived and returned to service approximately 100 Bell nodes on Thursday, November 17th. This brings the cluster from 10% to approximately 30% of its capacity, and we have observed a significant increase in job flow.

Work continues on recovering as many nodes as possible, and we greatly appreciate your patience and understanding in this difficult situation. The standby queue remains disabled in order to reduce wait times in the owner queues. We will continue providing updates via this page and the Bell cluster mailing list. Please reach out to rcac-help@purdue.edu with any questions or concerns you might have.

UPDATE (November 17, 2022 11:12am EST):

Dear Bell Community,

The Bell High Performance Computing cluster continues to operate at very limited capacity due to cooling issues. Below is a summary of Bell’s degraded status.

  • As of September, a leak in Bell’s cooling system impacted a sizeable portion of the cluster. Due to the severity of the issue and the risk of long-term damage, several of Bell’s nodes were turned off, reducing Bell’s available capacity to around 60%. RCAC’s engineers immediately requested the vendor’s support and replacement parts, and were notified by the vendor of delivery delays due to the unavailability of these parts. The engineers are still waiting for the parts.
  • Bell's capacity continued to decline gradually as the same issue spread. By early November, Bell’s capacity was down to around 35%.
  • On November 15th, 2022, a sudden escalation of the issue further reduced Bell’s capacity to 10%. RCAC’s engineers immediately started a mitigation plan, revisiting individual nodes and returning them to service where the risk is deemed low. Furthermore, the standby queue was disabled to prioritize owners’ jobs.

We understand the impact of this issue on your research and are therefore working very closely with the vendor and dedicating as many resources as needed to resolve or alleviate the issue as quickly as possible. Please reach out to us at rcac-help@purdue.edu if you have any urgent computational needs. We will provide an update by Tuesday, November 22, or sooner if there is any new information.

UPDATE (November 15, 2022 6:58pm EST):

On Tuesday, November 15th, 2022, more Bell nodes suffered power and cooling failures due to leaks in faulty cooling plates. At the moment Bell is down to 10% capacity and is operating in an emergency mode. While the vendor is shipping replacements as fast as they can, RCAC engineers are looking for ways to revive at least some of the downed nodes.

Please be aware that with so little core capacity available, wait times in the owner queues can no longer be guaranteed and are expected to be very long until more nodes can be brought back online. Additionally, the standby queue has been disabled while we work to restore normal Bell operation.
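
To gauge how much of the cluster is currently available before submitting, the standard Slurm queries below should work on Bell (assuming a typical Slurm setup; partition names may differ):

    sinfo -s                   # per-partition summary: allocated/idle/other/total node counts
    squeue -u $USER --start    # estimated start times for your pending jobs, where Slurm can compute them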

ORIGINAL:

The Bell cluster continues to experience hardware issues. Engineers are currently diagnosing the problems and working with vendors to schedule and perform repairs as quickly as possible.

Job scheduling continues, but you may experience longer than normal wait times as nodes are removed from service for repairs.

We will update this notice as we learn more.

Originally posted: