Batch system and module changes on RCAC systems
Beginning with the new Carter cluster, RCAC users will note some differences in the PBS batch system and the module names available for use. This article aims to outline the reasons for these changes and describe some of the details.
Why has ITaP changed the PBS software used on Carter, Steele, Hansen, and Radon?
TORQUE and Moab provide a number of stability and reliability improvements over PBS Pro and many new features:
- Improved scalability for large clusters like Carter
- Node health monitoring to prevent job failures
- Advanced scheduling capabilities, such as providing fair-share scheduling within the standby queue
- A graphical interface for job submission and monitoring
- More resilient communication protocols to make the cluster more reliable
- Cpuset and NUMA capabilities to more effectively use today's NUMA architecture systems
- Full integration with and knowledge of 3rd-party software licenses
- Graphics processing unit (GPU) support
How does TORQUE differ from PBS Pro as used on other community clusters?
Users will notice some differences when interacting with TORQUE rather than PBS Pro. This table outlines the key differences in qsub options between the two systems. Please use it as a guide when converting submission scripts from PBS Pro to TORQUE:
You may test your PBS Pro submission scripts for compatibility with the TORQUE version of "qsub" using the "check_torque_syntax" command.
- Usage: check_torque_syntax [qsub options] filename
In addition, TORQUE offers the "checkjob" utility, which reports more detail about the status of one of your jobs, including reasons it may not be able to start.
Why are module names different from other community clusters?
- Module names on previous clusters took a variety of different formats both in name and version number. For example, openmpi-1.4.4-intel64/12.1, petsc-3.1-p8-openmpi-1.5.4-intel64/12.0.084, hdf5/1.8.0-gcc3, hdf5-1.8.5/gcc-4.1.2 are all formats that appear on community clusters.
- Modules were rarely removed or defaults changed, making maintenance of the software stacks on the clusters extremely complicated.
- As we transition to TORQUE, all modules will be given a consistent naming scheme: packagename/packageversion_compiler-compilerversion. For example, Open MPI 1.4.4 built with Intel 12.1 would be named openmpi/1.4.4_intel-12.1.
- With a standard in place, RCAC application staff can better manage the installed software: providing default versions, announcing when a default will change, and giving advance warning when a software module is to be retired. Additionally, this allows RCAC to provide an automatically updated online software catalog, which is already under development.
- Major changes to the software stack and defaults will be performed only at scheduled downtimes.
- Additionally, Carter will see a much smaller set of compiler and MPI versions. Only current default versions of each compiler and MPI will be installed, along with the previous default.
- Old versions of software will be archived to an unsupported module path which can be accessed by "module load unsupported". For example, if your code can only build and run with an old version of the PGI compiler that has been deprecated, you can still access that module in the "unsupported" path.
- Existing community clusters will begin to transition to this scheme through the spring and summer of 2012.
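The naming scheme described above can be checked mechanically. The following is a minimal sketch in Python, not an RCAC tool: the names ModuleName, parse_module_name, format_module_name, and follows_scheme are hypothetical, and the sketch assumes every module carries all four fields (how modules without a compiler suffix are named is not covered here).

```python
from typing import NamedTuple


class ModuleName(NamedTuple):
    """The four fields of packagename/packageversion_compiler-compilerversion."""
    package: str
    version: str
    compiler: str
    compiler_version: str


def parse_module_name(name: str) -> ModuleName:
    """Split a module name into its fields, or raise ValueError.

    Assumption: the first "/", the first "_" after it, and the first "-"
    after that are the field separators, and no field is empty.
    """
    package, slash, rest = name.partition("/")
    version, underscore, toolchain = rest.partition("_")
    compiler, hyphen, compiler_version = toolchain.partition("-")
    parts = (package, version, compiler, compiler_version)
    if not (slash and underscore and hyphen and all(parts)):
        raise ValueError(f"not in the new naming scheme: {name!r}")
    return ModuleName(*parts)


def format_module_name(module: ModuleName) -> str:
    """Rebuild the module name string from its fields."""
    return (f"{module.package}/{module.version}"
            f"_{module.compiler}-{module.compiler_version}")


def follows_scheme(name: str) -> bool:
    """Return True if the name parses under the new scheme."""
    try:
        parse_module_name(name)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    # The example name from the article, plus one of the older formats.
    for candidate in ("openmpi/1.4.4_intel-12.1", "hdf5-1.8.5/gcc-4.1.2"):
        print(candidate, follows_scheme(candidate))
```

Under these assumptions, the article's example openmpi/1.4.4_intel-12.1 parses into its four fields, while the older formats listed above are rejected because they lack the underscore that separates the package version from the compiler.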
Some module tips:
- "module load devel" will load the recommended combination of compiler, MPI, and math library for your cluster. In Carter's case, "devel" will load Open MPI 1.4.4, Intel 12.1, and Intel MKL.
- You can always execute "module load openmpi" to load the recommended Open MPI built with the recommended compiler for the cluster you are on, and similarly for other parallel libraries.
- All modules are 64-bit. The "bitness" of a module will no longer appear in its name.
Please contact us at firstname.lastname@example.org if you have any questions about the transition to TORQUE, the module changes, or any other concerns.