Bell cluster will be unavailable Wednesday

October 28, 2020 8:00am - 6:00pm EDT
Maintenance
Bell

UPDATE: October 29, 2020 12:17pm EDT

The Bell cluster, including the highmem nodes, has been returned to early access testing status.

During this week’s maintenance, we completed benchmarking and finalized the cluster's internal configuration. Node performance has been tuned and is now steady and reproducible, and the Open OnDemand gateway is now available for use. The cluster’s software stack also continues to mature: AMD-optimized AOCL math libraries have been integrated, and the latest machine learning frameworks are now available.

However, pre-production testing has revealed an issue with the Lustre filesystem powering Bell’s scratch. The new Lustre on Bell has several exciting performance-enhancing features, but vendor engineers have been re-engaged to assist with tuning the filesystem’s configuration so that the scratch filesystem will function as designed.

While this work is ongoing, the scratch filesystem is temporarily unavailable.

Job scheduling has been re-enabled, but without access to scratch. In the meantime, your labs may use /home and /depot for current testing. Please take care to limit the I/O operations of your test workflows, as these filesystems are not as powerful as Bell’s Lustre.

We will notify you when the scratch filesystem is back online so that you can change your workflows to use scratch again. Thank you very much for your understanding and cooperation.

ORIGINAL: October 28, 2020 8:00am EDT

The Bell cluster will be unavailable beginning Wednesday, October 28, 2020 at 8:00am EDT for scheduled maintenance. During the maintenance, our Engineering team will be working with vendor representatives to complete benchmarking steps and finalize the cluster's internal configuration.

During this time, users will still be able to log in to the cluster front-ends for limited code testing (but please be aware that front-ends will be subject to at least one reboot to apply BIOS updates). As usual, scheduled jobs on compute nodes will not run during the maintenance period.

Originally posted: October 26, 2020 5:29pm EDT
Last updated: October 29, 2020 12:17pm EDT