RCAC - Outages and Maintenance, Announcements, Science Highlights, Events, Coffee Hour Consultations, Outages, Maintenance, Student Events

Carter Cluster Decommission Early Notice

Sun, 30 Apr 2017 00:00:00 -0400

Thank you for partnering with ITaP in the community cluster program.

Carter has been running for more than 4.5 years now and will be decommissioned on April 30, 2017. This timeline coincides with the end of the machine's five-year warranty.

Our current timeline for Carter retirement is:

Winter 2017: Trade-in value for Carter established
March 1, 2017: Jobs that run past April 30 will no longer be accepted on Carter
April 30, 2017: Carter hardware will be decommissioned

As Carter winds down, please take time to migrate any data from Carter's scratch to other community clusters or to the Fortress archive. Additionally, we encourage you to consider purchasing access in Halstead, which is significantly faster, easier to scale, and much more energy efficient.

You can generate your usage report for Carter from https://www.rcac.purdue.edu/usage

As we decommission the Carter hardware, ITaP will seek bids for the residual value of the hardware and apportion that to the faculty owners - so, for example, if we are able to get $50/node, that dollar value would be provided as credit towards an upcoming cluster.

If you have any questions, please contact rcac-help@purdue.edu

Thank you again for your participation in the community cluster program and we look forward to our continued partnership.

Carter Decommission

Sun, 30 Apr 2017 00:00:00 -0400

As a reminder, Carter will be shut down and decommissioned on Sunday, April 30, 2017.

ALL DATA in /scratch/carter WILL NOT BE RETRIEVABLE after April 30, 2017. See below for options on how to move any data you wish to keep from Carter's scratch space.

Original Message:

Thank you for partnering with ITaP in the community cluster program.

Carter will be decommissioned on April 30, 2017.

Compute Nodes and Scratch Storage

The compute nodes and scratch storage will both be decommissioned, and all files will need to be migrated before this time.

ALL DATA in /scratch/carter WILL NOT BE RETRIEVABLE after April 30, 2017, by anyone from any system. ITaP recommends storing important data in the Fortress HPSS archive or Data Depot service. To copy data to another cluster scratch, we recommend rsync or Globus.

For example: rsync -auH /scratch/carter/r/rherban/* halstead:/scratch/halstead/r/rherban
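Before the scratch space is removed, it can be worth confirming that a copy is complete. The following is a minimal sketch, not an ITaP-provided tool: the function names manifest and verify_copy are hypothetical, and it assumes that find and sha256sum are available and that both directory trees are visible from the host where it runs. It compares regular files only; symbolic links and permissions are not checked.

```shell
# manifest DIR
# Print one "checksum  ./relative/path" line per regular file under DIR,
# sorted so that two manifests can be compared as plain text.
manifest() {
    ( cd "$1" && find . -type f -exec sha256sum {} + | LC_ALL=C sort )
}

# verify_copy SRC DST
# Return 0 only if SRC and DST hold the same regular files with identical
# contents, 1 if they differ, and 2 if either directory is missing.
verify_copy() {
    [ -d "$1" ] && [ -d "$2" ] || return 2
    [ "$(manifest "$1")" = "$(manifest "$2")" ]
}
```

For a copy made to another cluster, one option is to run manifest on each side, save the output to a file, and compare the two files. Checksumming a large scratch tree can take a long time.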

For more information on using Fortress or Data Depot, please visit the following links:

https://www.rcac.purdue.edu/storage/fortress

https://www.rcac.purdue.edu/storage/depot

Applied Credit

Your existing Carter nodes will be awarded a $100 per node credit that can be applied towards Halstead or Snyder. If you have already purchased into Halstead, we can issue a credit to your original purchasing account.

You can generate your usage report for Carter from https://www.rcac.purdue.edu/usage

If you have any questions, please contact rcac-help@purdue.edu

Thank you again for your participation in the community cluster program, and we look forward to our continued partnership.

Emergency Carter Cluster Maintenance

Wed, 15 Mar 2017 12:00:00 -0400

Update:

Owner queues on Carter have been restarted. While Carter is currently deemed stable, performance is still impacted. Engineers are closely monitoring the situation and will take corrective action if necessary.

Update:

At this time, only Carter’s standby queues remain enabled as engineers continue to monitor the scratch file system. Performance has improved and we are working to restore full owner access as soon as possible.

We will provide another update by 5pm today, March 16th.

Update:

Systems engineers alleviated performance problems on the scratch filesystem and brought the cluster back online so users can log in to the front-ends and submit jobs. The standby queue is enabled; however, owners' queues are still temporarily paused. We will continue monitoring the scratch performance and gradually release owners' queues as conditions allow.

We will provide the next update no later than 10am on Thursday, March 16, 2017.

Original message

The Carter cluster will be taken down on Wednesday, March 15th, 2017 at 12:00pm EDT for emergency maintenance. The scratch storage system that serves Carter is not performing correctly. Over the past several days, engineers have made several changes to try to isolate and resolve this issue, including pausing standby jobs to reduce load, but we believe the issue cannot be resolved while the cluster is in production.

Any PBS jobs in queue now which request a walltime which would take them past Wednesday, March 15th, 2017 at 12:00pm EDT will not start and will remain in the queue until after the maintenance is completed. Any jobs which have already started and do not complete by Wednesday, March 15th, 2017 at 12:00pm EDT will be forcibly stopped and requeued.
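The rule above is simple arithmetic: a job can start only if its start time plus its requested walltime does not go past the start of the maintenance window. The sketch below illustrates that check; it is not how the scheduler is implemented. The function names are hypothetical, walltimes are assumed to be in HH:MM:SS form, times are passed as seconds since the epoch, and a job that would end exactly at the start of the window is assumed to be allowed.

```shell
# walltime_to_seconds HH:MM:SS
# Print the walltime as a number of seconds. awk treats fields such as "08"
# as decimal numbers, which avoids octal surprises in shell arithmetic.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# job_can_start WALLTIME START_EPOCH MAINTENANCE_EPOCH
# Succeed if a job starting at START_EPOCH with the requested WALLTIME
# would end at or before MAINTENANCE_EPOCH (boundary behavior is an assumption).
job_can_start() {
    _end=$(( $2 + $(walltime_to_seconds "$1") ))
    [ "$_end" -le "$3" ]
}
```

With GNU date, for example, the following reports (through its exit status) whether a four-hour job started now could finish before this maintenance:

$ job_can_start 04:00:00 "$(date +%s)" "$(date -d '2017-03-15 12:00 -0400' +%s)"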

We will post an update on the status of this work by 5:00pm on March 15.

Partial scratch outages on Rice, Snyder, Carter, Scholar and Hammer

Mon, 06 Mar 2017 09:00:00 -0500

The scratch filesystems serving Carter, Conte, and Hansen started behaving abnormally this morning.

This may have affected some jobs, and anyone using one of the login nodes for these clusters may have had sessions freeze or seen delays or other issues accessing scratch space.

Storage engineers are investigating the issues, but we believe the clusters are continuing to function normally for most users at this time.

Emergency Security Patching of RCAC Clusters

Thu, 02 Feb 2017 17:00:00 -0500

Due to a recent security vulnerability, the clusters will have their operating system upgraded to a newer version between February 2, 2017 at 5:00pm and March 2, 2017 at 5:00pm EST. Unlike other cluster downtimes, this upgrade will follow a "rolling reboot" strategy: nodes will be updated and rebooted as the jobs currently running on them complete.

Potential impact on users:

Currently running batch jobs will NOT be impacted.
For each front-end server, users will be given a 48-hour notice to save their work and exit from any currently running interactive jobs. Interactive jobs still running at the time of reboot will be terminated.
Users may experience slightly longer scheduling delays during the initial hours of the updating process.

Clusters to switch to hierarchical modules

Wed, 01 Feb 2017 00:00:00 -0500

Carter, Conte, Hammer, Hathi, Rice, Scholar, and Snyder have been converted to hierarchical modules. All front-ends have been converted, and nodes will update over the next couple of hours. Running jobs will not be impacted by this change.

If you have any existing login SSH sessions open this morning, you will need to log out and back in to refresh the module environment.

Old-style module load commands should continue to function until the end of the Spring semester; however, you will be given a warning message and a suggestion on how to correct your scripts. Once you believe you have made all the necessary changes, you may turn off the old-name translation by running:

$ touch ~/.hierarchy

Then log out and back in, and the automatic translation will be disabled. This will allow you to be sure that your scripts are correctly updated before the translation is turned off system wide at the end of the semester. We will send further details and timing on turning off translations in the coming weeks.

Additionally, a default module has now been set. The ITaP-recommended set of compiler (Intel 16) and MPI (OpenMPI or IMPI, depending on cluster; essentially the devel module) will now be loaded by default when you log in. If you or your job scripts load your own version of a compiler or MPI, your version will automatically replace the default and you will see no change in behavior.

Original Message

On February 1st, 2017, the software stacks on Carter, Conte, Rice, Snyder, Hammer, and Scholar will be converted to a hierarchical configuration. This new software has been in use on Halstead since it was brought online. This change will help prevent errors and configuration problems due to version mismatches or conflicting software, and will allow ITaP to provide a more robust software stack.

This change may require you to make small modifications to your job scripts. The module command will attempt to automatically translate these for you to ease the transition, but an automatic translation may not always be possible, and this functionality will only be offered for a limited time.

All clusters are now live with the new environment, so the testing machines have been turned off.

Additional information and explanations about the module hierarchy can be found in the user guide.

If you have any questions or concerns please contact us at rcac-help@purdue.edu.

Carter Cluster Maintenance

Tue, 10 Jan 2017 08:00:00 -0500

The maintenance for the Carter cluster was cancelled and will be rescheduled at a later date. The cluster has remained in service.

Original Notice

The Carter cluster will be unavailable beginning Tuesday, January 10th, 2017 at 8:00am EST, for emergency maintenance to mitigate issues with Data Depot access. The cluster will return to full production by Tuesday, January 10th, 2017 at 8:00pm EST.

During this time, Carter will have its connection to the Data Depot storage systems improved.

Any PBS jobs which request a walltime which would take them past Tuesday, January 10th, 2017 at 8:00am EST will not start and will remain in the queue until after the maintenance is completed.

Carter Old Scratch Retirement

Fri, 18 Nov 2016 00:00:00 -0500

The old Carter scratch filesystem (Warp) will be retired and shut down in three weeks' time. To access this filesystem and ensure you have any files or data you need transferred, please refer to the Carter Scratch Transfer Tool news posting on this system.

After Friday, November 18th, 2016 at 12:00am EST, we will NOT be able to retrieve any files or data on this system.

Please contact us if you need any additional assistance with the file transfer process. Thank you.

Emergency Cluster Maintenance

Sat, 05 Nov 2016 08:00:00 -0400

The Carter Cluster was returned to production at 10:45pm on November 7. We apologize for this extended outage.

Update: November 7, 2016 6:01pm

Work on reinstalling the Carter nodes continues. All other systems have returned to normal operations. We will issue an update on Carter's status by midnight tonight.

Update: November 7, 2016 6:00pm

The Carter cluster updates did not deploy correctly to the Carter nodes, and these nodes are all now being prepared for reinstallation to get them back where they need to be. We will issue an update on Carter's status by 6:00pm.

Additionally, problems logging in to the Scholar and Hathi clusters have been addressed this morning. A problem with multi-node jobs and qpeek on Rice has also been identified, and a fix is being deployed across all Rice nodes at this moment.

Update: November 7, 2016 12:35am

All clusters other than Carter have been successfully updated and brought back online over the course of the weekend. Carter is posing some extra challenges. Systems administrators will continue to work on it and we will post an update on Carter's status no later than Monday, November 7th, 2016 at 10:45pm EST.

Original Message:

The clusters will be taken down for emergency cluster maintenance beginning at Saturday, November 5th, 2016 at 8:00am EDT. The clusters will return to normal operations by Sunday, November 6th, 2016 at 11:59pm.

During this time, the clusters will have critical kernel security patches applied.

Any PBS jobs already in progress which do not complete by Saturday, November 5th, 2016 at 8:00am EDT will unfortunately have to be terminated. Any new or queued PBS jobs which request a walltime which would take them past Saturday, November 5th, 2016 at 8:00am EDT will not start and will remain in the queue until after the maintenance is completed.

Security vulnerability patch impacts debugging

Mon, 31 Oct 2016 00:00:00 -0400

Due to a recently discovered vulnerability in the Linux kernel (known as the Dirty COW vulnerability), an emergency patch has been applied on the cluster nodes. This patch is necessary to avoid exploitation of the vulnerability. Unfortunately, the patch impacts/disables the Linux kernel's "ptrace" functionality. Tools that use "ptrace" (including TotalView and Intel VTune) are affected by this.

A complete fix of the vulnerability will require an upgrade of the Operating System on the cluster nodes. We are currently working on scheduling a downtime for the clusters to perform this upgrade.

Unscheduled Depot Outage

Tue, 04 Oct 2016 16:15:00 -0400

Measures taken within the first two hours of this problem seem to have resolved the issue.

Original Message:

A portion of the systems serving the Research Data Depot has suffered a failure. Some systems using Depot have been affected, particularly Carter, Snyder, and non-research systems accessing Depot over NFS. Some systems and jobs on these systems may be unaffected, as the impact varies.

Engineers are working to address the specific servers having issues, and the Depot should return to normal function as soon as they are able to do so.

Unscheduled Scratch Outage on Carter

Sat, 01 Oct 2016 16:45:00 -0400

UPDATE

As of about 6:30 pm, the new scratch system was brought back online, and scheduling has been restarted on Carter.

Original Message

The new scratch filesystem serving Carter, which was activated on Tuesday night, is currently unavailable.

Both currently running jobs and attempts to access files in scratch will block until the filesystem is back online.

Job scheduling on Carter has been paused while storage engineers address the issue.

Carter Scratch Transfer Tool

Fri, 30 Sep 2016 00:00:00 -0400

On September 27th, 2016, the Carter cluster scratch filesystem, which had been suffering from numerous issues, was replaced by an entirely new system. Unfortunately, in order to put the new system into place quickly, it was not possible to copy over all the existing files to the new system. However, these files are still available to all users for a limited period of time.

You will find your $CLUSTER_SCRATCH directory on Carter is now a new empty directory. Your previous files may be accessed by using a new transfer tool deployed on Carter today. We are trying to keep the old scratch filesystem available in this way for a few weeks to allow for file transfer.

To run the tool, log into Carter as usual using SSH or ThinLinc. Once on Carter, run the command:

/usr/site/rcac/bin/carter-scratch-transfer

This will copy all of your files from the previous scratch system into a directory called "carter-warp-files" inside your new scratch directory, that is, "$CLUSTER_SCRATCH/carter-warp-files".

Execution of this command may take some time depending on how many files you have. If you have a large number of files it may be helpful to launch the command from a ThinLinc session to allow the transfer to run unattended.
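If a ThinLinc session is not convenient, a common general-purpose alternative for leaving a long command running after you disconnect is nohup. The sketch below is an assumption on our part rather than a documented procedure for this tool: run_unattended is a hypothetical helper, and it presumes the command does not need to prompt for input.

```shell
# run_unattended LOGFILE COMMAND [ARGS...]
# Start COMMAND in the background so that it keeps running after logout.
# All output goes to LOGFILE, and stdin is /dev/null so the command
# cannot block waiting for keyboard input.
run_unattended() {
    _log=$1
    shift
    nohup "$@" > "$_log" 2>&1 < /dev/null &
    echo "started PID $! (output in $_log)"
}
```

For example, the transfer could be started with the following command and its progress checked later by reading the log file (the log file name here is only an illustration):

$ run_unattended ~/carter-transfer.log /usr/site/rcac/bin/carter-scratch-transfer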

If you need assistance in transferring files, please contact us at rcac-help@purdue.edu and we can provide assistance and instructions more specific to your situation.

Home Filesystem Maintenance - All Clusters

Tue, 27 Sep 2016 07:00:00 -0400

Conte has been returned to normal operations as well now. This concludes the home directory maintenance on all systems.

Update: September 27, 2016 11:55pm

All systems other than Conte have been successfully returned to normal operations with the new home directory filesystem. Work continues at this point on Conte to ensure the Phi accelerators are properly reconfigured.

Carter has also been given a new scratch filesystem during this maintenance. This should alleviate some of the problems with the previous scratch filesystem on Carter. For more details, please see the Carter-specific announcement on this topic: New Carter Scratch Filesystem

Reminder:

This is a reminder of the Home Filesystem Maintenance taking place next week on Tuesday, September 27th.

Details below.

Original Message:

All of the research clusters, as well as some other minor systems, will be unavailable beginning Tuesday, September 27th, 2016 at 7:00am EDT, for scheduled maintenance. All clusters other than Conte will return to full production by 11:59pm.

Conte will return to partial capacity by that time, but will not return to full production until the following day. Many Conte nodes will remain offline and gradually be returned to service over the following 12-24 hours to allow for power reconfiguration in the data center. Please see the separate article on Conte: Conte Cluster Maintenance.

During the large all-systems maintenance Tuesday, the /home filesystem used by all Research Computing systems will be replaced by a new filesystem. The new filesystem will be based on DDN's GRIDScaler technology and will run on new hardware dedicated exclusively to Research Computing home directories.

All files on the existing /home filesystem will be migrated to the new system during the maintenance window and prior to any of the clusters returning to service.

In the coming weeks, any jobs which request a walltime which would take them past Tuesday, September 27th, 2016 at 7:00am EDT will not start and will remain in the queue until after the maintenance is completed.

Software stack changes and upgrades

Tue, 27 Sep 2016 00:00:00 -0400

During the Home Filesystem Maintenance - All Clusters maintenance on September 27th, several upgrades and changes will be made to the software stack on the clusters. Changes will include updates to the default version of the Intel compiler and associated software stack as well as to the default MPI libraries. Some older versions of other software will also be removed. These changes are being made in order to bring clusters in line with the software environment that is being planned for the new Halstead cluster.

These upgrades will provide the best performance for the new and existing clusters and will provide a consistent Intel version stack across all of our clusters. The new software stack is currently available on the clusters for testing and upgrade. ITaP research computing staff recommends testing out the new compilers and upgrading prior to September 27th.

WHAT WILL BE THE IMPACT TO INTEL COMPILERS?

We will be upgrading the default Intel version from 13.1.1.163 to 16.0.1.150. The current default has been around for several years, and many researchers are already switching to the latest versions of Intel compilers. The 13.1.1.163 version will remain available on the current clusters for a period of time to give researchers time to finish up projects and upgrade to the latest. Any software dependent on the default version of Intel 13.1.1.163 will also have its default upgraded.

WHAT WILL BE THE IMPACT TO MPI LIBRARIES?

We will be upgrading the default version of OpenMPI from 1.6.3 to 1.8.1. This new version offers stability and performance enhancements and some new features. Version 1.8.1 has been available for some time and many researchers have already moved to 1.8.1.

We will be upgrading the default version of IMPI from 4.1.1.036 to 5.1.2.150. This new version offers stability and performance enhancements and some new features. Version 5.1.2.150 has been available for some time and many researchers have already moved to 5.1.2.150.

Any software dependent on one of these default MPI versions will also have its default upgraded appropriately.

We recommend that you upgrade to these new libraries; however, if you need to continue using the old default versions, you may do so by switching your "module load" to the specific version. The Intel 13 stack will remain available for those who require it. These new compilers offer bug fixes and enhanced performance and stability. Users are encouraged to send in any experiences with these new compilers to help us evaluate the direction of new compilers on RCAC systems.

WHAT OTHER SOFTWARE WILL BE IMPACTED?

There will be several changes to other miscellaneous software. Older versions of some software will be removed in favor of newer versions. Default versions of a few software packages will be updated to the latest version. In most cases, these older versions are used infrequently, so most users should not be impacted by these changes.

If any software you are using will be impacted by these changes you will see a notice message being printed to your session or in your job output files when loading an affected module. This notice will provide recommendations on the latest version.

HOW DO I KNOW IF MY WORKFLOW WILL BE IMPACTED?

Whenever a module that will be impacted is loaded, a notice is printed to your screen or job output log. Please take a look at your job output over the next couple of weeks and make note of any changes being advertised. You may continue using these modules as-is until September 27th to allow time to make any changes necessary. Users are encouraged to make any changes necessary beforehand to avoid disruption when changes are made.

WHAT IF AN IMPACTED MODULE IS REQUIRED BY MY RESEARCH?

We understand some users may not be able to change compilers or MPI libraries in the middle of a research project. Modules involved in a default version update will continue to be available; however, you will need to update your job scripts to request the specific version of the module. If you are already loading specific versions, no changes are necessary.
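To find job scripts that rely on a default version, one approach is to look for "module load" lines that name a module without a version. The sketch below is an illustration only: find_unpinned_modules is a hypothetical helper, and it assumes versions are requested in the usual name/version form (for example openmpi/1.8.1), so any module name without a slash is treated as relying on the default.

```shell
# find_unpinned_modules FILE...
# Print "file:line: module" for every module named on a "module load" line
# without an explicit /version suffix. Commented-out lines are ignored.
find_unpinned_modules() {
    awk '
        $1 == "module" && $2 == "load" {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /^#/) break    # the rest of the line is a comment
                if (index($i, "/") == 0) print FILENAME ":" FNR ": " $i
            }
        }
    ' "$@"
}
```

For example, run it against the job scripts in the current directory:

$ find_unpinned_modules *.sub

Each reported module is a candidate for replacing with a specific version if you need to stay on the old default.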

If a version of software you depend on is being completely removed and you are unable to upgrade, please contact us at rcac-help@purdue.edu. We will help you transition to a newer version if possible, or provide you with a copy of the old software version.

WHY ARE YOU CHANGING THE SOFTWARE STACK?

ITaP aims to provide a software stack that allows for optimal use (performance and stability) of the clusters. This necessitates periodic updates to the stack as compilers, libraries, and software are improved over time. By removing older modules from the main stack, we help keep the selection simple and make it easy for users to find the best compilers and libraries to use. If no modules were removed, the selection would become difficult to navigate as well as difficult for ITaP staff to manage. Any major changes will be coordinated with scheduled maintenance periods to minimize impact.

If you have any questions or concerns with the upcoming changes please contact rcac-help@purdue.edu

New Carter Scratch Filesystem

Tue, 27 Sep 2016 00:00:00 -0400

We are seeing some issues with the systems in the warp-scratch set of hosts. You may encounter an error with your home directory and/or a message about permissions upon login. Even if you see this, you may find the system is still able to access both of your scratch directories. However, sometimes this may not work. In that case, try logging in again, as this should direct you to a different transfer host which may be working better at the time.

Systems administrators are investigating these issues and will be correcting these as they are able to identify the problems.

Original Message:

The Carter cluster scratch filesystem, which had been suffering from numerous issues, has been replaced by an entirely new system today. Unfortunately, in order to put the new system into place quickly due to the ongoing issues, it was not possible to copy over all the existing files to the new system. However, these files are still available to all users.

You will find your $CLUSTER_SCRATCH directory on Carter is now a new empty directory. Your previous files may be accessed from a set of dedicated file transfer systems by SSH to "warp-scratch.rcac.purdue.edu". There you will not only find /scratch/carter (your new scratch space), but also /scratch/carter-warp (your previous scratch space). Please copy whatever files you need from the old space to your new scratch or download them elsewhere if you prefer. We are trying to keep the old scratch filesystem available in this way for a few weeks to allow for file transfer.

A tool has been deployed on Carter front-ends to assist in transferring files:

/usr/site/rcac/bin/carter-scratch-transfer

If you need assistance in transferring files, please contact us at rcac-help@purdue.edu and we can provide instructions more specific to your situation.

We hope that this new filesystem will improve the overall stability and experience on Carter. Thank you for your patience as we have dealt with this problem.

Unscheduled scratch outage on Carter

Thu, 22 Sep 2016 15:00:00 -0400

UPDATE: ITaP engineers have implemented a temporary solution so that work may continue on Carter until the scheduled upcoming maintenance window on Tuesday. Any jobs running which were using the scratch space have been stopped in order to allow for this issue to be addressed.

We are now in the process of re-enabling queues on Carter, which will be done gradually, starting with the faculty partner queues. We will monitor the scratch filesystem closely as this work ramps up.

As a permanent solution to Carter’s scratch file system problems, ITaP is installing a new scratch system this week. This new system will be available after the Tuesday maintenance and instructions will be sent shortly detailing that transition.

Original post: The scratch filesystem serving Carter is currently unavailable.

Both currently running jobs and attempts to access files in scratch will block until the filesystem is back online.

Job scheduling on Carter has been paused while storage engineers address the issue.

No estimated return to service is available at this time, but an update will be sent as soon as more information becomes available.

Degraded performance of several systems

Tue, 13 Sep 2016 00:00:00 -0400

We have seen a significant wave of these events this morning, September 21. For the most part, this wave seems to have been linked to a storage problem that has been resolved. However, we are implementing new monitoring and response procedures today to ensure a similar recurrence is caught and dealt with much more quickly.

Original Message:

System, Network, Storage, and Support staff are working to diagnose and correct issues that have been seen recently within ITaP's Research Computing systems.

Symptoms being reported involve an apparent complete freeze of open sessions, the inability to open new login sessions, difficulties using text editors, and disruptions in file access. In cases we have seen, these events seem to last for about 3-5 minutes, then clear up. However, there may be ongoing effects on jobs running on the Research Clusters, including job failure due to the storage access disruption.

We are examining log files and monitoring processes actively, and are working to correlate the timing of these events across our systems, and expect to identify a fundamental cause that we can then correct. At this time, however, we do not have an estimated time for a fix.

Please follow this news item for further information.

ECN Services Outage

Sat, 16 Jul 2016 08:00:00 -0400

Engineering Computing Network (ECN) will be performing scheduled maintenance this weekend on several ECN servers, resulting in their unavailability for a short time. Some ECN services will be affected, including several software license servers for ITaP Research Computing systems that are hosted by ECN. License servers are expected to become inaccessible around 8:00am EDT on Saturday, July 16th, 2016 and will return to service no later than 12:00pm EDT.

ITaP Research Computing cluster job scheduling is not affected by the outage, but licenses for software like Matlab, Ansys/Fluent, CFD++, Sentaurus, Comsol, Abaqus, PowerFlow, and PowerAcoustics will be unavailable during the outage period, which may lead to license-controlled software refusing to work and jobs exiting with error conditions.

Users of Matlab are encouraged to always submit jobs that explicitly request license tokens available from the job scheduler. These are specified using the gres attribute in your job submission command. For example, to request a single Matlab license:

$ qsub -l nodes=1:ppn=1,walltime=01:00:00,gres=MATLAB+1 myjob.sub

This way the job is guaranteed to only start execution when the necessary license is available. More examples for various Matlab toolboxes are available in the user guides.

Any other jobs using ECN-licensed software that start during this downtime will not be able to check out a license and may exit with errors. In addition, any software that requires a constant connection to the ECN licensing servers will stop during this time.

If you are unsure if your software will be affected or have any other concerns please contact us at rcac-help@purdue.edu.

POD Cluster Maintenance

Tue, 07 Jun 2016 05:30:00 -0400

Carter and Scholar are back online for use as of 6:25am, though they will be operating with many nodes still offline. Staff will be working through Wednesday to steadily increase the number of nodes available. This concludes the POD cluster maintenance.

Carter and Scholar are still being worked on. We will issue another update by 6:00am if not already in service.

The Rice, Hammer, and Peregrine1 clusters have been returned to normal operations as of 1:40am. Work continues on Carter and Scholar, and we will issue an update on those systems by 3:00am if not already in service.

The Snyder cluster has been returned to normal operations as of 12:00am. Work continues on the other clusters listed here.

The work continues on these clusters, although progress was substantially delayed by the concurrent storage systems failure (Unscheduled Storage Outage). We will post an update by 2:00am or sooner as clusters return to service.

The clusters will be unavailable beginning at Tuesday, June 7th, 2016 at 5:30am EDT, for scheduled maintenance. The clusters will return to full production by Tuesday, June 7th, 2016 at 10:00pm.

During this time, maintenance will be performed on the cooling systems used by these clusters. This maintenance period will also allow critical high-availability fixes to be made to the Research Data Depot while client clusters are offline.

Any PBS jobs which request a walltime which would take them past Tuesday, June 7th, 2016 at 5:30am EDT will not start and will remain in the queue until after the maintenance is completed.