Bell Cluster Deployment Status

September 30, 2020 – October 12, 2020

Bell, which is based on the AMD EPYC "Rome" processor, is entering the final stages of deployment. Dell and AMD have been tremendous partners on this project, and with their help we were able to make Bell the most powerful system ever deployed at Purdue while keeping costs at our usual levels.

So far, our RCAC team has completed:

  • Installation of a new liquid cooling loop, which cools the CPUs with liquid directly against the processors.
  • Deployment of a 100 Gbps HDR InfiniBand network. We are still working with Dell and Mellanox engineers to install the latest NIC firmware, which is needed to ensure reliability and full performance.
  • Creation of the system's Lustre filesystem, which will be the fastest one deployed at Purdue:
    172.18.44.27@o2ib:172.18.44.28@o2ib:/scratch     3.5P   28M  3.4P   1% /scratch/bell
  • Benchmarking and testing to identify a recommended MPI and the best-performing set of math libraries for the new CPU architecture.
  • Bringing the front-end nodes up; they are now running.
  • Booting the compute nodes, which are currently undergoing quality testing (to identify problematic hardware before your jobs find it) and running single-node LINPACK benchmarks.
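
As a rough illustration of how single-node benchmark results can be used to screen for problematic hardware, the sketch below flags any node whose score falls well below the median of its peers. This is not our actual test harness: the function name, the 10% tolerance, and the node names and scores are all hypothetical and chosen only for the example.

```python
from statistics import median


def flag_slow_nodes(results, tolerance=0.10):
    """Return the names of nodes scoring more than `tolerance` below the median.

    `results` maps a node name to its single-node benchmark score
    (for example, LINPACK GFLOPS), where higher is better.
    The default tolerance of 10% is an assumption for this sketch.
    """
    if not results:
        return []
    baseline = median(results.values())
    cutoff = baseline * (1.0 - tolerance)
    return sorted(name for name, score in results.items() if score < cutoff)


# Hypothetical node names and scores, for illustration only.
sample = {
    "node-a001": 1000.0,
    "node-a002": 1010.0,
    "node-a003": 990.0,
    "node-a004": 820.0,
}
print(flag_slow_nodes(sample))  # prints ['node-a004']
```

Comparing against the median rather than the mean keeps a few badly underperforming nodes from dragging the baseline down and hiding themselves.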

Once the network is clean and any problematic compute nodes are addressed:

  • Over the next several days, the full system will be benchmarked for inclusion in the November 2020 TOP500 list.
  • Engineers will complete deployment of the cluster's /home and /apps fileserver.
  • Our computational science staff will complete building and installation of the cluster's application software stack.
  • Open OnDemand and Globus data transfer nodes will be deployed.
  • Delivery of Bell's AMD GPUs is still pending.

The system will then enter early user mode and be ready for you to start using. We anticipate that Bell's early access period will begin the week of October 12. We will be in touch shortly with further details on the early access period.

During the early user period, maximum job runtimes on the cluster will be kept short to allow for weekly maintenance, which we will use to roll out bug fixes and new features. Many of you will be paired with one of our computational scientists, who will help answer questions and get your lab up and running as quickly as possible.

Please don't hesitate to contact us at rcac-help@purdue.edu with any questions even before Bell is available for login.

Originally posted: September 30, 2020  3:31pm