Moffett - Getting Started

Overview of Moffett

Moffett is a SiCortex 5832 system. It consists of 28 modules, each containing 27 six-processor SMP nodes for a total of 4536 processors. The SiCortex design is highly unusual; it pairs relatively slow individual processors (633 MHz) with an extraordinarily fast custom interconnect fabric, and provides these in very large numbers. In addition, the SiCortex design uses very little power and thereby generates very little heat. Moffett is best suited to very wide parallel jobs with very high communication needs and may be used to explore the scalability of parallel algorithms. Serial applications, on the other hand, would suffer from the individually slow processor speeds and are not recommended.

Namesake

Moffett is named in honor of David Moffett, the late Associate Vice President for Research Computing and Purdue graduate. More information about his life and impact on Purdue is available in an RCAC Biography of David Moffett.

Detailed Hardware Specification

  • Number of Nodes: 756
  • Processor: 633 MHz SiCortex 5832
  • Cores per Node: 6
  • Memory per Node: 8 GB
  • Interconnect: SiCortex Interconnect Fabric
  • TeraFlops: 5.74

All Moffett nodes run Linux kernel version 2.6.18 and use SLURM and Maui for resource and job management. Operating system patches are applied monthly or as security needs dictate. All nodes have been configured to allow for unlimited stack usage, as well as unlimited core dump size (though disk space and server quotas may still be a limiting factor).

Obtaining an Account

Moffett is a machine operated by RCAC. Accounts are granted to any Purdue faculty or staff who have massively parallel programs to develop, test, or run. To get an account, use the online Research Computing Account Request Form.

Login / SSH

To submit jobs on Moffett, log on to the front-end host moffett.rcac.purdue.edu via SSH.

SSH Client Software

All access to the RCAC systems must be through secure (encrypted) connections. Standard telnet and FTP are not supported. SSH, SCP, and SFTP may be used instead.

Secure Shell or SSH is a way of establishing a secure channel between a local and a remote computer. It uses public-key cryptography to authenticate the remote computer and (optionally) to allow the remote computer to authenticate the user. It is usually used to log in to a remote machine and execute commands similar to telnet, but it also supports tunneling and forwarding of X11 or arbitrary TCP connections. The associated SFTP and SCP protocols may be used to transfer files. There are many SSH clients available, depending on the operating system you use.

Linux / Solaris / AIX / HP-UX / Unix:

  • "ssh", "sftp", and "scp" are pre-installed. Log in using ssh myusername@servername.

Microsoft Windows:

Mac OS X:

  • "ssh", "sftp", and "scp" are pre-installed. You may start a local terminal window from "Applications->Utilities". Log in using ssh myusername@servername.
  • MacSSH and MacSFTP
  • NiftyTelnet 1.1 SSH

SSH Keys

SSH can be used in conjunction with many different means of authentication. One popular authentication method is called Public Key Authentication (PKA). PKA is a method of establishing your identity to a remote computer using related sets of encryption data called keys. PKA is a more secure alternative to traditional password-based authentication with which you are probably familiar.

To employ PKA via SSH, you manually generate a keypair (also called SSH keys) in the location from where you wish to initiate a connection to a remote machine. This keypair consists of two text files, one which is called a private key and one which is called a public key. You keep the private key file confidential on your local machine or local home directory (hence the name "private" key). You then login to a remote machine (if possible) and append the corresponding public key text to the end of a specific file, or have a system administrator do so on your behalf. In future login attempts, the public and private keys are compared to verify your identity, which then grants you access to the remote machine.

As a user, you can create, maintain, and employ as many keypairs as you wish. If you connect to a computational resource from your work laptop, your work desktop, and your home desktop, you can create and employ keypairs on each. You can also create multiple keypairs on a single local machine to serve different purposes, such as establishing access to different remote machines, or establishing different types of access to a single remote machine. In short, PKA via SSH offers a secure but flexible means of identifying yourself to all kinds of computational resources.
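
The keypair workflow described above can be sketched with the OpenSSH ssh-keygen tool. This is a minimal illustration, not the RCAC procedure: the temporary directory, the key name id_demo, and the demo passphrase are all hypothetical, and the authorized_keys file here is a local stand-in for the file that OpenSSH conventionally reads on the remote machine (~/.ssh/authorized_keys).

```shell
# Hypothetical demo directory; a real key would normally live in ~/.ssh.
keydir=$(mktemp -d)
trap 'rm -rf "$keydir"' EXIT

# Generate an RSA keypair. -N supplies the passphrase non-interactively for
# this demo only; omit -N in real use so ssh-keygen prompts for it instead.
ssh-keygen -q -t rsa -b 4096 -N 'demo passphrase, change me' -f "$keydir/id_demo"

# The private key stays local. The public key is the part that gets appended
# on the remote side; this file is a local stand-in for that remote file.
authorized="$keydir/authorized_keys"
cat "$keydir/id_demo.pub" >> "$authorized"
chmod 600 "$keydir/id_demo" "$authorized"

# Show the fingerprint of the new public key.
ssh-keygen -l -f "$keydir/id_demo.pub"
```

In real use the private key file (id_demo here) never leaves the local machine; only the contents of the .pub file are copied to the remote host.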

Passphrases and SSH Keys

When you create a keypair, you are prompted to provide a passphrase for the private key. This passphrase differs from a password in a number of ways. First, a passphrase is, as the name implies, a phrase. It can include most types of characters, including spaces, and has no limits on length. Second, this passphrase is not transmitted to the remote machine for verification. It is used only to unlock your local private key and applies only to that one private key.

Perhaps you are wondering why you would need a private key passphrase at all when using PKA. If the private key is kept secure, why the need for a passphrase just to use it? Indeed, if the location of your private keys were always completely secure, a passphrase might not be needed. In reality, a number of situations could arise in which someone may improperly gain access to your private key files. In these situations, a passphrase offers another level of security for you, the user who created the keypair.

Think of the private key/passphrase combination as being analogous to your ATM card/PIN combination. The ATM card itself is the object that grants access to your important accounts, and as such, should be kept secure at all times—just as a private key should. But if you ever lose your wallet or your ATM card is stolen, you are glad that your PIN exists to offer you another level of protection. The same is true for a private key passphrase.

When you create a keypair, you should always provide a corresponding private key passphrase. For security purposes, avoid using phrases that would be guessed by automated programs (e.g. phrases that consist solely of words in English-language dictionaries). This passphrase can never be recovered if forgotten, so make note of it. There are only limited situations when the use of a non-passphrase-protected private key is warranted—conducting automated file backups is one such situation. If you need to use a non-passphrase-protected private key to conduct automated backups to Fortress, see the No-Passphrase SSH Keys section.

Passwords

If you have received a default password as part of the process of obtaining your account, you should change it immediately when you log on for the first time. This can be done from any terminal/SSH session with the command "passwd". You will have the same password on all RCAC systems. If you change your password on any one RCAC system, it will change on all RCAC systems.

If you already have a Purdue career account, then you will initially be given the same userid and password as your career account. There is no need to change your career account password just because you have received an account on RCAC systems.

There is not currently any requirement regarding how often you must change your password within RCAC, but for security reasons changing a password every six months, preferably every three months, is good practice.

All passwords should:

  • Be something you have never used as a password before, on this or any other system.
  • Be easy for you to remember and difficult for others to guess.
  • Be at least eight characters long.
  • Be a combination of upper and lowercase letters, numbers, and symbols.
  • TIP: Abbreviate a sentence or song lyric: "The dog Samson ate 4 new slippers!" = "TdSa4ns!"

Never share your password with another user or make your password known to anyone else. Systems staff will NEVER ask for your password, by email or otherwise.

Storage Options

File storage options on RCAC systems include home directories, scratch file systems, /tmp, and long-term or permanent storage. Each of these has different performance characteristics and intended uses, and some vary from system to system as well. Home directories and long-term storage are backed up nightly, but scratch and /tmp are not and may occasionally be purged without warning. Below is more detail about each of these storage options.

Home Directories

Your home directory is the default directory you are placed in when you log in.

You should use this space for storing files you want to keep long term such as source code, scripts, input data sets, etc. It should also be used for files you want to keep and which you use often. The home directory will physically reside on the BlueArc NFS Server. You can find the path to your home directory by logging in, and typing pwd:

$ pwd
/home/ba01/u103/myusername

The second component of the reply indicates the name of the host where your home directory physically resides. In this example, the home directory is on the RCAC home directory file server named "ba01" under area "u103". This will vary from person to person. Remember, you can always check where your home directory is located by doing a pwd command in your home directory.

Regardless of its physical location, your home directory and its contents are available on almost all the RCAC front-end hosts and their nodes via the Network File System (NFS). The only exception is Black.

Note that your home directory has a quota capping the size and/or number of files you may store within. For more information, refer to the Storage Quotas / Limits Section.

Scratch Directories

Scratch directories are provided by RCAC and are intended for short-term file storage only.

Backups are not performed on scratch directories. In the event of a disk crash or file purge, files in scratch directories can not be recovered. Please be sure to copy any important files to more permanent storage.

All files stored in RCAC scratch directories older than 90 days will be automatically removed (purged). Owners of these files will be notified one week before removal via email. For more information, please refer to our Scratch File Purging Policy.
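
To see which files are approaching the 90-day limit, the standard find command can list files by age. The sketch below is an illustration under stated assumptions: it uses a temporary directory as a stand-in for your scratch directory, and it assumes the purge is based on modification time, which the policy text above does not specify.

```shell
# Demo stand-in for $RCAC_SCRATCH so the sketch runs anywhere.
scratch=$(mktemp -d)
trap 'rm -rf "$scratch"' EXIT

touch "$scratch/recent.dat"
# Backdate one file to 1 January 2000 so it is well past the 90-day window.
touch -t 200001010000 "$scratch/old.dat"

# List regular files last modified more than 90 days ago.
old_files=$(find "$scratch" -type f -mtime +90)
printf '%s\n' "$old_files"
```

On Moffett, replacing "$scratch" with "$RCAC_SCRATCH" in the find command would list the candidates in your own scratch directory.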

RCAC scratch directories are provided by a central BlueArc server and are accessible from most RCAC systems. There are two primary scratch file systems: scratch95 and scratch96. A scratch directory already exists for every Moffett user. Your RCAC scratch directory is located under scratch95 or scratch96, within a subdirectory named for the first letter of your username.

To find the path to your RCAC scratch directory, run myscratch:

	$ myscratch
	/scratch/scratch96/m/myusername

The variable $RCAC_SCRATCH is also set to your RCAC scratch directory path. Use this variable in any scripts. Your actual scratch directory path may change without warning, but this variable will remain current.

$ echo $RCAC_SCRATCH
/scratch/scratch96/m/myusername
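
As a sketch of how a script might rely on the variable rather than a hard-coded path, the following creates a project directory under scratch. The directory name myproject and the file results.txt are hypothetical, and the fallback to a temporary directory exists only so the example also runs on a machine where RCAC_SCRATCH is not defined.

```shell
# Fall back to a temporary directory when RCAC_SCRATCH is not set, so the
# sketch also runs off-cluster. On Moffett the variable is already defined.
: "${RCAC_SCRATCH:=$(mktemp -d)}"

# "myproject" is a hypothetical project name.
workdir="$RCAC_SCRATCH/myproject"
mkdir -p "$workdir"

echo "intermediate results" > "$workdir/results.txt"
echo "Wrote $workdir/results.txt"
```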

To find the path to someone else's RCAC scratch directory, use the command findscratch:

$ findscratch someuser
/scratch/scratch95/s/someuser

Note that your RCAC scratch directory has a quota capping the size and/or number of files you may store within. For more information, refer to the Storage Quotas / Limits Section.

/tmp Directory

The /tmp directory is intended for temporary files that are used during the execution of a process or job or while you examine files created by your jobs. Used properly, /tmp may provide faster local storage to an active process than any other storage option. However, do not use it for longer-term storage or critical results.

Files stored in /tmp are not backed up and are removed whenever space is low or whenever the system is rebooted. In the event of a loss, files in /tmp can not be recovered, so use it only for files that can be recreated relatively easily.

Long-Term Storage

Long-term Storage or Permanent Storage is available to RCAC users on the DXUL/UniTree archival storage system, commonly referred to as "Fortress". DXUL (DiskXtender for Unix and Linux) and UniTree are a software package that manages a hierarchical storage system. Program files, data files and any other files which are not used often, but which must be saved, can be put in permanent storage. Fortress currently has a 1.2 PB capacity. However, since two copies are retained for every file, the usable capacity is only 600 TB.

Recently used files smaller than 0.5 MB have their primary copy stored on low-cost disks, but the second copy is on tape or optical disks. This provides a rapid restore time to the disk cache. However, the large latency to access a larger file (usually involving a copy from a tape cartridge) makes it unsuitable for use as active storage.

In addition to performing poorly, the following two uses can cause severe problems with the system itself:

  • DO NOT store any actively used files on Fortress.
  • DO NOT store large collections of small files on Fortress.

Do not use Fortress as a second home directory. Instead, use tar or some similar archive tool to combine all the smaller files you wish to store into a single large file first.
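
The bundling step can be sketched as follows. The directory and file names are made up for the demonstration, and the example works in a temporary directory rather than touching Fortress itself.

```shell
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir "$work/results"

# Create 100 small files standing in for per-run output.
i=1
while [ "$i" -le 100 ]; do
    echo "run $i" > "$work/results/run_$i.txt"
    i=$((i + 1))
done

# Bundle and compress the whole directory into one archive. -C makes the
# paths stored in the archive relative to $work.
tar czf "$work/results.tar.gz" -C "$work" results

# Verify the archive before deleting anything: count the data files in it.
count=$(tar tzf "$work/results.tar.gz" | grep -c 'run_.*\.txt$')
echo "Archived $count files into results.tar.gz"
```

Only the single results.tar.gz file would then be copied to long-term storage, instead of the 100 individual files.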

For active data storage you should use either local storage or a scratch file system. You may then copy any results you wish to archive to Fortress when computation is complete.

Fortress is directly accessible (via FTP, SSH, SCP, SFTP, and NFS) from all RCAC systems, as well as most systems in ECN and CS and from several other major servers on campus. To access Fortress in any way other than NFS, you must login to fortress.rcac.purdue.edu. RCAC has more information about Fortress, including how to obtain a Fortress account and how to access your files on Fortress.

Environment Variables

There are many environment variables related to storage locations and paths which are automatically set for you upon log in, or may be changed if necessary.

Use environment variables instead of actual paths whenever possible to avoid problems if the specific paths to any of these change. Some of the environment variables you should have are:

  • $USER: your username
  • $HOME: path to your home directory
  • $PWD: path to your current directory
  • $RCAC_SCRATCH: path to scratch filesystem
  • $PATH: all directories searched for commands/applications
  • $HOSTNAME: name of the machine you are on
  • $SHELL: your current shell (bash, tcsh, csh, ksh)
  • $SSH_CLIENT: your local client's IP address
  • $TERM: type of terminal or terminal emulator being used
  • $OMP_NUM_THREADS: OpenMP number of threads

By convention, environment variable names are all uppercase. To use a variable's value, prefix its name with a dollar sign ($). Variables may be used on the command line or in any scripts in place of and in combination with hard-coded values:

$ ls $HOME
...

$ ls $RCAC_SCRATCH/myproject
...

$ ls $RCAC_SCRATCH/myproject/${HOSTNAME}_data
...
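
When a variable reference is followed directly by characters that could be part of a name, such as an underscore, braces mark where the name ends. A minimal sketch of the difference, using demo values in place of the ones set at login:

```shell
# Demo values; on Moffett, HOSTNAME and RCAC_SCRATCH are set at login.
HOSTNAME=moffett-fe00
RCAC_SCRATCH=/scratch/scratch96/m/myusername

# Without braces the shell looks up a variable named HOSTNAME_data, which is
# unset here, so the last path component expands to nothing.
unset HOSTNAME_data
wrong="$RCAC_SCRATCH/myproject/$HOSTNAME_data"

# With braces the name ends at HOSTNAME and "_data" is literal text.
right="$RCAC_SCRATCH/myproject/${HOSTNAME}_data"

echo "without braces: $wrong"
echo "with braces:    $right"
```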

You may find the value of any environment variable by using the echo command:

$ echo $RCAC_SCRATCH
/scratch/scratch95/m/myusername

$ echo $SHELL
/usr/local/bin/tcsh

You may list the values of all environment variables using the env command:

$ env
USER=myusername
HOME=/home/ba01/u101/myusername
RCAC_SCRATCH=/scratch/scratch95/m/myusername
SHELL=/usr/local/bin/tcsh
...

You may create or overwrite an environment variable using either export or setenv, depending on your shell:

  (for bash and sh)
$ export VARIABLE=value

  (for tcsh and csh)
% setenv VARIABLE value
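
For bash and sh, the reason export matters is that a plain assignment is visible only to the current shell, while an exported variable is inherited by the programs that shell starts. A short sketch, using the hypothetical variable name MYVAR:

```shell
# Start from a clean slate in case MYVAR is already in the environment.
unset MYVAR

# A plain assignment creates a shell variable that child processes do not see.
MYVAR=value
child_before=$(sh -c 'echo "${MYVAR:-unset}"')

# export places it in the environment, so child processes inherit it.
export MYVAR
child_after=$(sh -c 'echo "${MYVAR:-unset}"')

echo "before export: $child_before"
echo "after export:  $child_after"
```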

Storage Quotas / Limits

Your disk usage is limited on RCAC systems. However, each filesystem (scratch, home directory, etc.) may have a different limit. If you exceed the soft limit or quota, you will see warnings whenever writing to the disk that you are over quota, but the write will still succeed. If you exceed the hard limit or limit, your write will fail until you either remove other files or your quota is increased. Generally, RCAC systems do not impose a soft limit—only a hard limit.

Checking Quota Usage

You may find out what your current quota is by using the quota command:

$ quota
Disk quotas for user myusername (uid 12345): 
     Filesystem  blocks   quota   limit   grace   files   quota   limit   grace
     ba01:/u103 2346272       0 5000000           17508       0   65535

The columns are as follows:

  1. Filesystem: This indicates the line is for the user's files on /u103/, which doing echo $HOME confirms is the user's home directory filesystem.
  2. Blocks: This shows how many 1 KB blocks the user's files take up. In this case, 2346272 KB / 1024 = 2291 MB, or 2291 MB / 1024 = 2.24 GB.
  3. Quota: This shows that soft limits are not being imposed (0).
  4. Limit: This shows how many 1 KB blocks the user's hard limit is. In this case, (5000000 KB / 1024) / 1024 = 4.77 GB.
  5. Grace: This would show the grace period (in days) for any soft limit (none in this case).
  6. Files: This shows how many file pointers (inodes) the user is currently using. This is based more on the number of files and directories and not the size.
  7. Quota: This shows that soft limits are not being imposed for file pointers (0).
  8. Limit: This shows the user's file pointer hard limit. It is possible, though unlikely, to hit this and not the size limit if you create a large number of very small files.
  9. Grace: This would show the grace period (in days) for any file pointer soft limit (none in this case).
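
The arithmetic in the column descriptions can be reproduced with awk. The sketch below parses the sample line from the quota output above; the field positions assume that the grace columns are empty, as they are in that sample.

```shell
# Sample line copied from the quota output above.
line='ba01:/u103 2346272       0 5000000           17508       0   65535'

# Fields: 1=filesystem 2=blocks used (KB) 3=soft quota 4=hard limit (KB)
#         5=files used 6=file soft quota 7=file hard limit
# (the empty grace columns do not count as fields).
summary=$(printf '%s\n' "$line" | awk '{
    printf "%s: %.2f GB of %.2f GB (%.1f%%), %d of %d files (%.1f%%)",
           $1, $2 / 1024 / 1024, $4 / 1024 / 1024, 100 * $2 / $4,
           $5, $7, 100 * $5 / $7
}')
echo "$summary"
```

For the sample line this reports 2.24 GB used of a 4.77 GB limit, matching the figures worked out by hand above.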

You may also see the disk usage of any given directory by using du:

$ du -hs
1.1G    .

$ du -hs $HOME
138M    /home/ba01/u103/myusername

This can be very helpful in figuring out where your largest files or directories are, so that you may clean out unneeded large files and avoid hitting your quota.
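
One way to rank directories by size is to combine du with sort. The following sketch builds a small demo tree so that it can run anywhere; the directory names big and small are hypothetical.

```shell
# Build a small demo tree; on Moffett you would point this at $HOME instead.
top=$(mktemp -d)
trap 'rm -rf "$top"' EXIT
mkdir "$top/big" "$top/small"
dd if=/dev/urandom of="$top/big/data.bin" bs=1024 count=2048 2>/dev/null
echo "tiny" > "$top/small/note.txt"

# -s: one total per argument, -k: sizes in 1 KB blocks so sort -n can compare.
# sort -rn puts the largest entries first; head keeps the top five.
report=$(du -sk "$top"/* | sort -rn | head -n 5)
printf '%s\n' "$report"

# The first line of the report names the largest entry.
largest=$(printf '%s\n' "$report" | head -n 1 | awk '{print $2}')
```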

Requesting Quota Increase

If you find you need additional disk space on an RCAC account, please first consider archiving and compressing old files and moving them to long-term storage. If this option does not resolve the issue, you may send an email to rcac-help@purdue.edu and request additional space.

Archive and Compression

There are several options for archiving and compressing groups of files or directories on RCAC systems. All of the following tools are provided:

  • zip   (more information)
    Simple compression and file packaging utility.
    Examples:
      (compress file somefile.c)
    $ zip somefile.zip somefile.c
    
      (extract contents of somefile.zip)
    $ unzip somefile.zip
    
      (compress all files in a directory into one archive file)
    $ zip -r somefile.zip somedirectory/
    
      (compress all ".c" files in current directory into one archive file)
    $ zip -r somefile.zip . -i \*.c
    
  • tar   (more information)
    Saves many files together into a single archive file, and restores individual files from the archive. Includes automatic archive compression/decompression options and special features that allow tar to be used for incremental and full backups.
    Examples:
      (archive file somefile.c)
    $ tar cvf somefile.tar somefile.c
    
      (archive and compress file somefile.c)
    $ tar czvf somefile.tar.gz somefile.c
    
      (list contents of archive somefile.tar)
    $ tar tvf somefile.tar
    
      (extract contents of somefile.tar)
    $ tar xvf somefile.tar
    
      (extract contents of gzipped archive somefile.tar.gz)
    $ tar xzvf somefile.tar.gz
    
      (archive and compress all files in a directory into one archive file)
    $ tar czvf somefile.tar.gz somedirectory/
    
      (archive and compress all ".c" files in current directory into one archive file)
    $ tar czvf somefile.tar.gz *.c 
    
  • gzip   (more information)
    Compression utility designed as a replacement for compress, with much better compression and no patented algorithms. The standard compression system for all GNU software.
    Examples:
      (compress file somefile - also removes uncompressed file)
    $ gzip somefile
    
      (uncompress file somefile.gz - also removes compressed file)
    $ gunzip somefile.gz
    
  • bzip2   (more information)
    Strong, lossless data compressor based on the Burrows-Wheeler transform. Also available as a library.
    Examples:
      (compress file somefile - also removes uncompressed file)
    $ bzip2 somefile
    
      (uncompress file somefile.bz2 - also removes compressed file)
    $ bunzip2 somefile.bz2
    
  • compress   (more information)
    Adaptive Lempel-Ziv compressor. Not often used today.

Windows users can work with these same formats using some of the following software:

  • 7-Zip
    Free Windows software package that can handle all the above formats.
  • WinZip
    Commercial Windows software package that can handle all the above formats.
  • WinRAR
    Commercial Windows software package that can handle all the above formats.

File Transfer

There are a variety of ways to transfer data to and from RCAC systems. Which you should use depends on several factors, including ease of use for you personally, connection speed and bandwidth, and the size and number of files to be transferred. For more details on file transfer methods and applications, refer to the Moffett Complete User Guide.

Provided Applications

The third-party software on Moffett is shown in the following list. Additional software may be available on other RCAC systems. Please contact rcac-help@purdue.edu if you are interested in the availability of software not shown in this list.

  • Papiex/PAPI
  • mpipex/mpiP
  • ioex
  • gprof
  • hpcex
  • TAU/tauex
  • Vampir
  • GPTL/gptlex
  • PathScale compiler
  • GNU compiler
  • MPICH
Environment Management with the Module Command

Currently, modules are not installed on Moffett. All compilers and other programs are already in your path.

Provided Compilers

Compilers are available on Moffett for Fortran 77, Fortran 90, Fortran 95, C, and C++. The compilers can produce general-purpose and architecture-specific optimizations to improve performance. These include loop-level optimizations, inter-procedural analysis and cache optimizations. The compilers support automatic and user-directed parallelization of Fortran, C, and C++ applications for multiprocessing execution. More detailed documentation on each compiler set available on Moffett follows.

PathScale Compiler Set

To use the PathScale compiler set on Moffett, you do not need to load any modules. The compiler programs will already be in your path. Here are some examples:

Fortran77
    Serial Program:  $ pathf95 myprogram.f -o myprogram
    MPI Program:     $ mpif77 myprogram.f -o myprogram
    OpenMP Program:  $ pathf95 -mp myprogram.f -o myprogram
Fortran90
    Serial Program:  $ pathf95 myprogram.f90 -o myprogram
    MPI Program:     $ mpif90 myprogram.f90 -o myprogram
    OpenMP Program:  $ pathf95 -mp myprogram.f90 -o myprogram
Fortran95
    Serial Program:  $ pathf95 myprogram.f95 -o myprogram
    MPI Program:     (not available)
    OpenMP Program:  $ pathf95 -mp myprogram.f95 -o myprogram
C
    Serial Program:  $ pathcc myprogram.c -o myprogram
    MPI Program:     $ mpicc myprogram.c -o myprogram
    OpenMP Program:  $ pathcc -mp myprogram.c -o myprogram
C++
    Serial Program:  $ pathCC myprogram.cpp -o myprogram
    MPI Program:     $ mpicxx myprogram.cpp -o myprogram
    OpenMP Program:  $ pathCC -mp myprogram.cpp -o myprogram

More information on compiler options can be found in the official man pages, which can be accessed using the "man" command.

GNU Compiler Set

To use the GNU compiler set on Moffett, you do not need to load any modules. The compiler programs will already be in your path. Here are some examples:

Fortran77
    Serial Program:  (not available)
    MPI Program:     (not available)
    OpenMP Program:  (not available)
Fortran90
    Serial Program:  (not available)
    MPI Program:     (not available)
    OpenMP Program:  (not available)
Fortran95
    Serial Program:  (not available)
    MPI Program:     (not available)
    OpenMP Program:  (not available)
C
    Serial Program:  $ gcc myprogram.c -o myprogram
    MPI Program:     (not available)
    OpenMP Program:  (not available)
C++
    Serial Program:  $ g++ myprogram.cpp -o myprogram
    MPI Program:     (not available)
    OpenMP Program:  (not available)

More information on compiler options can be found in the official man pages, which can be accessed using the "man" command.

Running Jobs on Moffett

There are a number of different compilers and programs installed on the RCAC systems. On Moffett they are all already in your path.

There are two methods for submitting jobs to Moffett. First, you may submit jobs directly to a queue on Moffett. These jobs may be serial, message-passing, or shared-memory in nature. You use SLURM to submit jobs to a queue. Secondly, the Moffett cluster is a part of BoilerGrid. You may submit serial jobs to BoilerGrid and specifically request that the serial jobs be run on the resources on Moffett.

Running Jobs via SLURM

Aside from the PathScale and GNU compilers, there are a number of programs installed on the SiCortex. They include: Perl, Python, Tcl, perfmon2, PAPI, Tau, and Totalview. All programs are already in your path, which means the module command is NOT used on Moffett - a difference from most of the other RCAC machines.

Another difference from most of the other RCAC systems is that Moffett uses the SLURM job scheduler/management system with partitions, in place of PBS with queues.

To see a list of all installed packages, type epm -qa.

A few environment variables

There are a few SLURM environment variables which are especially useful. They are:

  • SLURM_NNODES: contains the number of nodes. Can be used to check if you got the number of nodes you thought you asked for - both in the interactive subshell and in a script.
  • SLURM_NPROCS: contains the number of processors allocated.
  • SLURM_JOBID: contains the id of the current job - only available in an interactive shell or during the run (you can ask for it in your batch script).
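
A job script might use these variables to confirm that it received the allocation it expected before starting real work. The sketch below is one possible shape for such a check; the function name check_allocation is hypothetical, and the variables are set by hand here only so the example runs outside a SLURM job.

```shell
# check_allocation EXPECTED_NODES
# Compares the node count SLURM granted with the count the job expected.
# Returns 0 on a match, 1 otherwise. The function name is hypothetical.
check_allocation() {
    expected=$1
    actual=${SLURM_NNODES:-0}    # 0 when run outside a SLURM allocation
    if [ "$actual" -ne "$expected" ]; then
        echo "expected $expected nodes, got $actual" >&2
        return 1
    fi
    echo "job ${SLURM_JOBID:-unknown}: $actual nodes, ${SLURM_NPROCS:-unknown} processors"
}

# Demo values standing in for what SLURM would set inside a job.
SLURM_NNODES=2 SLURM_NPROCS=12 SLURM_JOBID=935
check_allocation 2
```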

Basic SLURM

The srun command is used to run programs. It can start multiple tasks on multiple nodes, where each of the tasks is a separate process that executes the same program. By default, SLURM allocates one processor per task, but starts tasks on multiple processors as necessary. The argument -n specifies the number of tasks, and the argument -N specifies the number of nodes.

A few useful options to srun:

  • -N: number of nodes. If not given, enough nodes will be allocated to fulfill -n and/or -c. A range can be given; if you ask for, say, 1-1, then you will get exactly one node regardless of your other requests, which also ensures that all processors are allocated on the same node.
  • -n: number of tasks
  • -c: CPUs per task. Request that ncpus be allocated per process. This may be useful if the job is multithreaded and requires more than one CPU per task for optimal performance. The default is one CPU per process.
  • -b: batch job. Followed by the job submission file (also known as job script). The job submission files can just simply be a list of commands, like 'srun -N 1', they can be shell script commands, or they can be slightly more advanced SLURM job submission files:
                      #SLURM -N 2 -n 2
                      #SLURM --mpi=lam
    
  • -B: extended node information, also written --extra-node-info=sockets[:cores[:threads]]. This specifies more detailed allocation requests, such as the number and type of computational resources within a cluster: number of sockets (or physical processors) per node, cores per socket, and threads per core. The total amount of resources being requested is the product of all of the terms. Each value can be a single number or a range (e.g. min-max). An asterisk (*) can be used as a placeholder indicating that all available resources of that type are to be utilized. The individual levels can also be specified in separate options if desired:
                      --sockets-per-node=sockets
                      --cores-per-socket=cores
                      --threads-per-core=threads
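
When choosing values for -N, -n, and -c, it can help to work out how many six-processor nodes a request needs. The sketch below shows that arithmetic only; the function name nodes_needed is hypothetical, it assumes that all CPUs of one task must sit on the same node, and the placement SLURM actually chooses may differ.

```shell
# nodes_needed TASKS CPUS_PER_TASK
# Smallest number of six-core Moffett nodes that can hold the request,
# assuming the CPUs of one task must all sit on the same node.
nodes_needed() {
    tasks=$1
    cpus_per_task=$2
    cores_per_node=6
    tasks_per_node=$((cores_per_node / cpus_per_task))
    if [ "$tasks_per_node" -lt 1 ]; then
        echo "a task cannot use more than $cores_per_node CPUs" >&2
        return 1
    fi
    # Integer ceiling division: round up to a whole node.
    echo $(( (tasks + tasks_per_node - 1) / tasks_per_node ))
}

nodes_needed 4 1     # 4 single-CPU tasks fit on one node
nodes_needed 12 1    # 12 single-CPU tasks fill two nodes
nodes_needed 4 3     # two 3-CPU tasks per node, so two nodes
```

The last case corresponds to a request such as srun -N 2 -n 4 -c 3, using only the options described above.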
    

Moffett SLURM Tips

SLURM broadcasts stdin from the attached terminal to all of the processes and returns each process’s stdout and stderr to the terminal. Because SLURM buffers stdout, this behavior can cause unexpected results. For example, if a job crashes before completing, there may be no indication of it because SLURM continues to hold output while it waits for the job to finish. In this scenario, cancel the job using scancel.

Moffett SLURM Partitions (Queues)

Moffett runs the job scheduler/management system SLURM. Queues (as in the more familiar PBS) are not used; in their place are partitions, which serve more or less the same function. Two partitions are defined, scx and scx-comp. Unless your group has its own partition, you should use scx-comp. To see which partitions exist, type sinfo.

user123@moffett-fe00 ~/fortran $ sinfo
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
scx          up   infinite     2  alloc scx-m0n[0-1]
scx          up   infinite   538   idle scx-m0n[2-26],scx-m10n[0-26],scx-m11n[0-26],scx-m12n[0-26],scx-m13n[0-26],scx-m14n[0-26],scx-m15n[0-26],scx-m17n[0-26],scx-m1n[0-26],scx-m20n[0-26],scx-m23n[0-26],scx-m27n[0-26],scx-m30n[0-26],scx-m31n[0-26],scx-m32n[0-26],scx-m33n[0-26],scx-m35n[0-26],scx-m5n[0-26],scx-m6n[0-26],scx-m7n[0-26]
scx-comp     up   infinite     2  alloc scx-m0n[0-1]
scx-comp     up   infinite   528   idle scx-m0n[2-5,7-26],scx-m10n[0-26],scx-m11n[0-26],scx-m12n[0-26],scx-m13n[0-26],scx-m14n[0-26],scx-m15n[0-26],scx-m17n[0-26],scx-m1n[0-5,7-26],scx-m20n[0-26],scx-m23n[0-26],scx-m27n[0-26],scx-m30n[0-26],scx-m31n[0,2,4-26],scx-m32n[0-5,7-26],scx-m33n[0,2,4-26],scx-m35n[0,2,4-26],scx-m5n[0-26],scx-m6n[0-5,7-26],scx-m7n[0-26]
user123@moffett-fe00 ~/fortran $ 

From the output above you can see each partition's name, its time limit, and the number of nodes in each state.

You select a partition with the -p option to srun.

Moffett SLURM Submission Script

Instead of submitting the program to SLURM and giving all arguments on the command line, you can submit a batch job. This is done by writing a job submission file and submitting it to SLURM. The job submission file contains all the commands and arguments needed to run the job, such as other programs, MPI applications, or simple srun commands. When you have submitted the batch job for execution, srun will exit immediately. The job will run when SLURM determines that adequate resources are available. When the job has finished, you will find a file in your directory named slurm-<jobid>.out containing the output from the job.

The job submission file is submitted like this:

	srun -p scx-comp -b <path-to-jobscript>/myscript.sh

Examples of job submission files

Very simple job. Asks for 1 node and 4 processors. Runs the program 'hello'.

	#!/bin/sh
	srun -N 1 -n 4 ~user123/slurm/hello

Running the above.

	user123@moffett-fe00 ~/slurm $ srun -p scx-comp -b slurm_hello 
	srun: jobid 935 submitted
	user123@moffett-fe00 ~/slurm $ ls
	hello    hello.cc  hello.o    hello2.f  slurm-935.out
	hello.c  hello.f   hello2.cc  hello2.o  slurm_hello
	user123@moffett-fe00 ~/slurm $ less slurm-935.out 
	Processor 2 of 4: Hello World!
	Processor 0 of 4: Hello World!
	Processor 3 of 4: Hello World!
	Processor 1 of 4: Hello World!
	user123@moffett-fe00 ~/slurm $ 
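
A slightly larger job submission file can be written and checked before it is submitted. The sketch below creates such a file in a temporary directory and only syntax-checks it, so it runs without SLURM; the script name myjob.sh and the program ./myprogram are hypothetical.

```shell
jobdir=$(mktemp -d)
trap 'rm -rf "$jobdir"' EXIT
script="$jobdir/myjob.sh"     # hypothetical script name

# Quoted 'EOF' keeps $variables unexpanded until the job actually runs.
cat > "$script" <<'EOF'
#!/bin/sh
# Run from the scratch directory so output lands on scratch storage.
cd "$RCAC_SCRATCH" || exit 1
echo "job $SLURM_JOBID running on $SLURM_NNODES node(s)"
# Two nodes, twelve tasks: one task per processor on six-processor nodes.
srun -N 2 -n 12 ./myprogram
EOF
chmod +x "$script"

# Syntax-check the script without executing it.
sh -n "$script" && echo "syntax OK"

# On Moffett the script would then be submitted with:
echo "srun -p scx-comp -b $script"
```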

SLURM Job Submission

Running a single node job

	srun -p <partition> -N 1 <executable> [args]

Running a single node, single processor job

	srun -p <partition> -N1 -n1 <executable> [args]

Running a multinode job

	srun -p <partition> -N <nodes> -n <tasks> <executable> [args]

Running a batch job

	srun -p <partition> -b <jobscript.sh>

SLURM Job Status

See status of partitions and nodes

	user123@moffett-fe00 ~ $ sinfo
	PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
	scx          up   infinite     2  alloc scx-m0n[0-1]
	scx          up   infinite   538   idle scx-m0n[2-26],scx-m10n[0-26],scx-m11n[0-26],scx-m12n[0-26],scx-m13n[0-26],scx-m14n[0-26],scx-m15n[0-26],scx-m17n[0-26],scx-m1n[0-26],scx-m20n[0-26],scx-m23n[0-26],scx-m27n[0-26],scx-m30n[0-26],scx-m31n[0-26],scx-m32n[0-26],scx-m33n[0-26],scx-m35n[0-26],scx-m5n[0-26],scx-m6n[0-26],scx-m7n[0-26]
	scx-comp     up   infinite     2  alloc scx-m0n[0-1]
	scx-comp     up   infinite   528   idle scx-m0n[2-5,7-26],scx-m10n[0-26],scx-m11n[0-26],scx-m12n[0-26],scx-m13n[0-26],scx-m14n[0-26],scx-m15n[0-26],scx-m17n[0-26],scx-m1n[0-5,7-26],scx-m20n[0-26],scx-m23n[0-26],scx-m27n[0-26],scx-m30n[0-26],scx-m31n[0,2,4-26],scx-m32n[0-5,7-26],scx-m33n[0,2,4-26],scx-m35n[0,2,4-26],scx-m5n[0-26],scx-m6n[0-5,7-26],scx-m7n[0-26]
	user123@moffett-fe00 ~ $ 

Get the status of all SLURM jobs

	user123@moffett-fe00 ~ $ squeue
	  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
	    842  scx-comp           user1   R    4:20:40      1 scx-m0n0
	    870  scx-comp test_dax   user2   R      13:40      1 scx-m0n1
	user123@moffett-fe00 ~ $ 

Get the status of individual jobs

	user123@moffett-fe00 ~ $ scontrol show job 870
	JobId=870 UserId=user2(150963) GroupId=pucc(1233)
	   Name=test_daxpy.sicortex
	   Priority=4294901718 Partition=scx-comp BatchFlag=0
	   AllocNode:Sid=moffett-fe00:17704 TimeLimit=UNLIMITED ExitCode=0:0
	   JobState=COMPLETED StartTime=06/09-13:33:56 EndTime=06/09-13:47:55
	   NodeList=scx-m0n1 NodeListIndices=
	   AllocCPUs=6
	   ReqProcs=1 ReqNodes=1 ReqS:C:T=1-65535:1-65535:1-65535
	   Shared=0 Contiguous=0 CPUs/task=0
	   MinProcs=1 MinSockets=1 MinCores=1 MinThreads=1
	   MinMemory=1 MinTmpDisk=1 Features=(null)
	   Dependency=0 Account=(null) Reason=None Network=(null)
	   ReqNodeList=(null) ReqNodeListIndices=
	   ExcNodeList=(null) ExcNodeListIndices=
	   SubmitTime=06/09-13:33:56 SuspendTime=None PreSusTime=0
	
	user123@moffett-fe00 ~ $ 

While your job is running, you can use SLURM commands to track its progress and to stop or restart it. To do so, you need the job id. For a batch job, the job id is displayed when the job is submitted. In any case, you can find the job id with the squeue command, which displays the job id and job name together with status and resource information for every job currently managed by the SLURM control daemon. If you do not specify any options, the report displays the job id, partition, job name, user, job state, elapsed time (as hours:minutes:seconds), total nodes, and node list.

Example: squeue

	user123@moffett-fe00 ~ $ squeue
	  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
	    950  scx-comp           user1   R    3:02:36      1 scx-m0n4
	    979  scx-comp           user3   R      54:57      1 scx-m0n5
	user123@moffett-fe00 ~ $ 
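
If you only want the ids of your own jobs, the squeue report can be filtered with awk. This is a minimal sketch: jobs_for_user is a hypothetical helper, and it assumes the default column layout shown above. Because the NAME column may be empty, the user name is located by counting fields from the end of the line rather than from the start.

```shell
#!/bin/sh
# jobs_for_user (hypothetical helper): read an squeue report on stdin and
# print the job ids that belong to the user given as $1.
# Columns counted from the right: NODELIST, NODES, TIME, ST, USER.
jobs_for_user() {
    awk -v user="$1" 'NR > 1 && NF >= 7 && $(NF - 4) == user { print $1 }'
}

# Sample report in the format used in this guide.
jobs_for_user user123 <<'EOF'
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    950  scx-comp           user1   R    3:09:09      1 scx-m0n4
    979  scx-comp           user3   R    1:01:30      1 scx-m0n5
    991  scx-comp slurm_he  user123   R       0:01      1 scx-m0n0
EOF
```

On the front end, the same function would be fed from the live report, for example squeue | jobs_for_user user123.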

scontrol: this command provides more detailed information about individual jobs.

Example

	user123@moffett-fe00 ~/slurm $ scontrol show job 950
	JobId=950 UserId=user1(48811) GroupId=pucc(1233)
	   Name=
	   Priority=4294901638 Partition=scx-comp BatchFlag=0
	   AllocNode:Sid=moffett-fe00:16167 TimeLimit=UNLIMITED ExitCode=0:0
	   JobState=RUNNING StartTime=06/10-10:19:20 EndTime=NONE
	   NodeList=scx-m0n4 NodeListIndices=4-4
	   AllocCPUs=6
	   ReqProcs=1 ReqNodes=1 ReqS:C:T=1-65535:1-65535:1-65535
	   Shared=0 Contiguous=0 CPUs/task=0
	   MinProcs=1 MinSockets=1 MinCores=1 MinThreads=1
	   MinMemory=1 MinTmpDisk=1 Features=(null)
	   Dependency=0 Account=(null) Reason=None Network=(null)
	   ReqNodeList=(null) ReqNodeListIndices=
	   ExcNodeList=(null) ExcNodeListIndices=
	   SubmitTime=06/10-10:19:20 SuspendTime=None PreSusTime=0
	
	user123@moffett-fe00 ~/slurm $ 
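
The scontrol report is a series of Key=Value pairs, so a single field can be pulled out with a short awk filter. The sketch below is an illustration only: job_field is a hypothetical helper, and it assumes values contain no spaces, as in the report above.

```shell
#!/bin/sh
# job_field (hypothetical helper): read 'scontrol show job' output on stdin
# and print the value of the first field whose key is exactly $1.
job_field() {
    awk -v key="$1" '{
        for (i = 1; i <= NF; i++) {
            eq = index($i, "=")            # position of the first "="
            if (eq > 0 && substr($i, 1, eq - 1) == key) {
                print substr($i, eq + 1)
                exit
            }
        }
    }'
}

# A few lines copied from the report for job 950.
sample='JobId=950 UserId=user1(48811) GroupId=pucc(1233)
   JobState=RUNNING StartTime=06/10-10:19:20 EndTime=NONE
   NodeList=scx-m0n4 NodeListIndices=4-4'

printf '%s\n' "$sample" | job_field JobState
printf '%s\n' "$sample" | job_field NodeList
```

Matching the whole key, rather than searching for a substring, keeps NodeList from being confused with ReqNodeList or ExcNodeList.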

sinfo: monitoring node or partition status. The sinfo command reports the current status information on partitions and individual nodes. It is the equivalent of qstat -q under PBS. When you do not specify any options, the report displays (partition, availability, time limit, node count, node state, node list), for all nodes and partitions on the system.

	user123@moffett-fe00 ~ $ sinfo
	PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
	scx          up   infinite     2  alloc scx-m0n[4-5]
	scx          up   infinite   538   idle scx-m0n[0-3,6-26],scx-m10n[0-26],scx-m11n[0-26],scx-m12n[0-26],scx-m13n[0-26],scx-m14n[0-26],scx-m15n[0-26],scx-m17n[0-26],scx-m1n[0-26],scx-m20n[0-26],scx-m23n[0-26],scx-m27n[0-26],scx-m30n[0-26],scx-m31n[0-26],scx-m32n[0-26],scx-m33n[0-26],scx-m35n[0-26],scx-m5n[0-26],scx-m6n[0-26],scx-m7n[0-26]
	scx-comp     up   infinite     2  alloc scx-m0n[4-5]
	scx-comp     up   infinite   528   idle scx-m0n[0-3,7-26],scx-m10n[0-26],scx-m11n[0-26],scx-m12n[0-26],scx-m13n[0-26],scx-m14n[0-26],scx-m15n[0-26],scx-m17n[0-26],scx-m1n[0-5,7-26],scx-m20n[0-26],scx-m23n[0-26],scx-m27n[0-26],scx-m30n[0-26],scx-m31n[0,2,4-26],scx-m32n[0-5,7-26],scx-m33n[0,2,4-26],scx-m35n[0,2,4-26],scx-m5n[0-26],scx-m6n[0-5,7-26],scx-m7n[0-26]
	user123@moffett-fe00 ~ $ 
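
Before asking for a wide allocation it can help to know how many nodes of a partition are idle. The following sketch sums the NODES column of the sinfo report for one partition. idle_nodes is a hypothetical helper, and the node lists in the sample are shortened for readability.

```shell
#!/bin/sh
# idle_nodes (hypothetical helper): read an sinfo report on stdin and print
# the number of nodes in partition $1 whose STATE is "idle".
# Columns: PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
idle_nodes() {
    awk -v part="$1" '
        NR > 1 && $1 == part && $5 == "idle" { n += $4 }
        END { print n + 0 }'
}

# Sample report; node lists shortened to their first entry.
idle_nodes scx-comp <<'EOF'
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
scx          up   infinite     2  alloc scx-m0n[4-5]
scx          up   infinite   538   idle scx-m0n[0-3,6-26]
scx-comp     up   infinite     2  alloc scx-m0n[4-5]
scx-comp     up   infinite   528   idle scx-m0n[0-3,7-26]
EOF
```

On the front end the function would read the live report, for example sinfo | idle_nodes scx-comp.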

SLURM Job Cancellation

scancel and ^C: The scancel command cancels a running or pending job using the job’s id (only job owners and administrators can cancel jobs).

	scancel <jobid>

Example:

	user123@moffett-fe00 ~/slurm $ srun -p scx-comp -b slurm_hello
	srun: jobid 991 submitted
	user123@moffett-fe00 ~/slurm $ squeue
	  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
	    950  scx-comp           user1   R    3:09:09      1 scx-m0n4
	    979  scx-comp           user3   R    1:01:30      1 scx-m0n5
	    991  scx-comp slurm_he  user123   R       0:01      1 scx-m0n0
	user123@moffett-fe00 ~/slurm $ 
	user123@moffett-fe00 ~/slurm $ scancel 991 
	user123@moffett-fe00 ~/slurm $ squeue
	  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
	    950  scx-comp           user1   R    3:08:19      1 scx-m0n4
	    979  scx-comp           user3   R    1:00:40      1 scx-m0n5
	user123@moffett-fe00 ~/slurm $ 

Alternatively, you can issue ^C (SIGINT) signals to cancel a running job.

After srun starts a job, it blocks until all of the job’s tasks terminate (for a batch job, the prompt returns immediately while the job runs). Signals sent to srun during this time are broadcast to all of the tasks. SLURM handles ^C signals in a special way:

  • One ^C signal generates a status report for all of the associated tasks.
  • Two ^C signals within one second typically terminate all of the associated tasks.
  • Three ^C signals within one second immediately terminate the job and its remote tasks.

Example:

	user123@moffett-fe00 ~/slurm $ srun -p scx-comp -N 4 -n 16 hello
	srun: interrupt (one more within 1 sec to abort)
	srun: task[0-15]: initializing
	Processor 14 of 16: Hello World!
	Processor 5 of 16: Hello World!
	Processor 11 of 16: Hello World!
	Processor 9 of 16: Hello World!
	Processor 4 of 16: Hello World!
	Processor 15 of 16: Hello World!
	Processor 10 of 16: Hello World!
	Processor 0 of 16: Hello World!
	Processor 6 of 16: Hello World!
	Processor 13 of 16: Hello World!
	Processor 8 of 16: Hello World!
	Processor 12 of 16: Hello World!
	Processor 7 of 16: Hello World!
	Processor 3 of 16: Hello World!
	Processor 1 of 16: Hello World!
	Processor 2 of 16: Hello World!
	user123@moffett-fe00 ~/slurm $ srun -p scx-comp -N 4 -n 16 hello
	srun: interrupt (one more within 1 sec to abort)
	srun: task[0-15]: initializing
	srun: sending Ctrl-C to job
	srun: Cancelling job
	user123@moffett-fe00 ~/slurm $ 
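
The "one interrupt reports, a second one within a second aborts" behaviour can be imitated in your own shell scripts. The sketch below only illustrates the pattern and is not how srun implements it; on_interrupt, last_int, and action are hypothetical names, and date +%s is assumed to be available.

```shell
#!/bin/sh
# Escalating interrupt handler: the first interrupt asks for a status
# report, a second interrupt within one second requests an abort.
last_int=0      # time of the previous interrupt, in seconds since the epoch
action=none     # what the most recent interrupt asked for

on_interrupt() {
    now=$(date +%s)
    if [ $((now - last_int)) -le 1 ]; then
        echo "second interrupt within 1 sec: aborting"
        action=abort
    else
        echo "interrupt (one more within 1 sec to abort)"
        action=status
    fi
    last_int=$now
}

# In a long-running script the handler would be installed with:
#   trap 'on_interrupt; [ "$action" = abort ] && exit 130' INT
# Here it is called directly twice to show the escalation.
on_interrupt
on_interrupt
```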

Do not use kill or skill on srun to cancel a SLURM job! Doing so only terminates srun; the tasks continue to run, but not under SLURM management. If you mistakenly kill an srun process, use squeue to get the job id, and then either cancel the job with scancel, or reattach srun to the job with srun -p scx-comp -a <jobid> -j and use the ^C sequence to cancel it.

SLURM Interactive Jobs

SLURM schedules jobs subject to resource availability. You can use the -A (allocate) option to acquire and hold resources for your use. You can then run and work on your jobs interactively, much as with the -I option in PBS.

The command looks like this (using the partition scx-comp and asking for 4 nodes):

	srun -p scx-comp -A -N 4

This option blocks until the requested resources are available and then spawns a subshell. From this subshell, you can interactively run multiple parallel jobs, or a job submission file, on the allocated resources. Once the resources are allocated, you do not have to specify -p <partition> on subsequent invocations of srun.

Example: listing nodes

	user123@moffett-fe00 ~/slurm $ srun -N 4 hostname
	scx-m0n1.scsystem
	scx-m0n0.scsystem
	scx-m0n3.scsystem
	scx-m0n2.scsystem
	user123@moffett-fe00 ~/slurm $ 

When you are in an interactive session, the subshell has already allocated the required resources, so each job starts running immediately. When you are finished running your jobs, you must exit from the interactive session; the resources are not released until you do.

Example: starting an interactive session (asking for 4 nodes, each with 6 processors), then compiling and running a small MPI job.

	user123@moffett-fe00 ~ $ srun -p scx-comp -A -N 4
	user123@moffett-fe00 ~ $ cd mpi
	user123@moffett-fe00 ~/mpi $ mpicc hello.c -o hello 
	user123@moffett-fe00 ~/mpi $ srun -n 4 hello
	Processor 2 of 4: Hello World!
	Processor 1 of 4: Hello World!
	Processor 3 of 4: Hello World!
	Processor 0 of 4: Hello World!
	user123@moffett-fe00 ~/mpi $ srun -n 12 hello
	Processor 0 of 12: Hello World!
	Processor 5 of 12: Hello World!
	Processor 11 of 12: Hello World!
	Processor 3 of 12: Hello World!
	Processor 2 of 12: Hello World!
	Processor 7 of 12: Hello World!
	Processor 4 of 12: Hello World!
	Processor 1 of 12: Hello World!
	Processor 8 of 12: Hello World!
	Processor 6 of 12: Hello World!
	Processor 10 of 12: Hello World!
	Processor 9 of 12: Hello World!
	user123@moffett-fe00 ~/mpi $ exit
	exit
	user123@moffett-fe00 ~ $ 

SLURM Examples

Submitting and cancelling a job (the -b option submits the script as a batch job, which starts once the needed resources are available and no higher-priority jobs are pending)

	user123@moffett-fe00 ~/slurm $ srun -p scx-comp -b slurm_hello 
	srun: jobid 918 submitted
	user123@moffett-fe00 ~/slurm $ scancel 918
	user123@moffett-fe00 ~/slurm $ squeue
	  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
	    842  scx-comp           user1   R    7:09:28      1 scx-m0n0
	user123@moffett-fe00 ~/slurm $ 

Runs 2 tasks, each on a different processor

	user123@moffett-fe00 ~/slurm $ srun -p scx-comp -n2 hello
	Processor 1 of 2: Hello World!
	Processor 0 of 2: Hello World!
	user123@moffett-fe00 ~/slurm $ 

Runs 7 tasks distributed across 4 nodes

	user123@moffett-fe00 ~/slurm $ srun -p scx-comp -n 7 -N 4 hello
	Processor 6 of 7: Hello World!
	Processor 3 of 7: Hello World!
	Processor 0 of 7: Hello World!
	Processor 4 of 7: Hello World!
	Processor 2 of 7: Hello World!
	Processor 5 of 7: Hello World!
	Processor 1 of 7: Hello World!
	user123@moffett-fe00 ~/slurm $ 

Runs 9 tasks on 9 different nodes

	user123@moffett-fe00 ~/slurm $ srun -p scx-comp -N 9 hello
	Processor 2 of 9: Hello World!
	Processor 6 of 9: Hello World!
	Processor 4 of 9: Hello World!
	Processor 7 of 9: Hello World!
	Processor 3 of 9: Hello World!
	Processor 8 of 9: Hello World!
	Processor 5 of 9: Hello World!
	Processor 1 of 9: Hello World!
	Processor 0 of 9: Hello World!
	user123@moffett-fe00 ~/slurm $ 

Starts 3 tasks and allocates 2 processors per task

	user123@moffett-fe00 ~/slurm $ srun -p scx-comp -n 3 -c 2 hello
	Processor 2 of 3: Hello World!
	Processor 1 of 3: Hello World!
	Processor 0 of 3: Hello World!
	user123@moffett-fe00 ~/slurm $ 

Runs 6 tasks on six nodes in the partition named <name>

	srun -p <name> -N 6 <myprogram>

Ensure all processors are on the same node - here we ask for 6 processors on 1 node

	user123@moffett-fe00 ~/slurm $ srun -p scx-comp -N 1-1 -n 6 hello
	Processor 3 of 6: Hello World!
	Processor 0 of 6: Hello World!
	Processor 2 of 6: Hello World!
	Processor 5 of 6: Hello World!
	Processor 1 of 6: Hello World!
	Processor 4 of 6: Hello World!
	user123@moffett-fe00 ~/slurm $ 

If you specify more tasks than the requested nodes can handle, SLURM automatically allocates additional nodes and distributes the tasks across them. In the following example I ask for 2 nodes and 14 processors - remember that each node has 6 processors, so 2 nodes can hold at most 12 tasks. Though I ask for too few nodes, SLURM allocates extra, and the job runs.

	user123@moffett-fe00 ~/slurm $ srun -p scx-comp -N 2 -n 14 hello
	Processor 10 of 14: Hello World!
	Processor 0 of 14: Hello World!
	Processor 7 of 14: Hello World!
	Processor 12 of 14: Hello World!
	Processor 11 of 14: Hello World!
	Processor 3 of 14: Hello World!
	Processor 6 of 14: Hello World!
	Processor 4 of 14: Hello World!
	Processor 13 of 14: Hello World!
	Processor 2 of 14: Hello World!
	Processor 8 of 14: Hello World!
	Processor 1 of 14: Hello World!
	Processor 9 of 14: Hello World!
	Processor 5 of 14: Hello World!
	user123@moffett-fe00 ~/slurm $ 
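
The smallest node count that fits a given number of tasks is a ceiling division by the number of processors per node. The sketch below assumes one task per processor and 6 processors per node; nodes_needed is a hypothetical helper, and the number of nodes SLURM actually allocates may differ.

```shell
#!/bin/sh
# nodes_needed (hypothetical helper): smallest number of nodes that can hold
# $1 tasks when each node offers $2 processors (default 6, as on Moffett).
nodes_needed() {
    tasks=$1
    per_node=${2:-6}
    # Integer ceiling division: round up without floating point.
    echo $(( (tasks + per_node - 1) / per_node ))
}

nodes_needed 14    # the 14-task example above needs at least 3 nodes
```

Passing the result as -N avoids relying on SLURM to correct an undersized request.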

If you specify more nodes than tasks, SLURM issues a warning, reallocates resources, then proceeds to process the job.

	user123@moffett-fe00 ~/slurm $ srun -p scx-comp -N 4 -n 2 hello
	srun: Warning: can't run 2 processes on 4 nodes, setting nnodes to 2
	Processor 1 of 2: Hello World!
	Processor 0 of 2: Hello World!
	user123@moffett-fe00 ~/slurm $ 

Difference between -N and -n (-N sets the number of nodes, -n the number of tasks)

	user123@moffett-fe00 ~ $ srun -p scx-comp -N3 -l /bin/hostname
	0: scx-m5n12.scsystem
	1: scx-m5n21.scsystem
	2: scx-m5n24.scsystem
	user123@moffett-fe00 ~ $ srun -p scx-comp -n3 -l /bin/hostname
	0: scx-m5n12.scsystem
	2: scx-m5n12.scsystem
	1: scx-m5n12.scsystem
	user123@moffett-fe00 ~ $ 

Serial SLURM Example

Serial programs are highly discouraged, but to run a (small) serial program, just enter ./ followed by the executable name and any necessary arguments at the shell prompt:

	user123@moffett-fe00 ~/fortran $ ./hellof90 
	 Hello, world!
	user123@moffett-fe00 ~/fortran $ 

The program can also be submitted to the job scheduling system, which is the preferred way to run any program. A program can be run either interactively or through a job submission file.

Here is a short example of submitting the above program as a job (asking for one node and one processor on partition scx):

	user123@moffett-fe00 ~/fortran $ srun -p scx -N1 -n1 hellof90 
	 Hello, world!
	user123@moffett-fe00 ~/fortran $ 

MPI SLURM Example

There are two compiler suites for MPI: PathScale (Fortran and C/C++) and GNU (C/C++). The PathScale compilers are the default and are used unless the --gnu option is given. The SiCortex MPI library, libscmpi, implements the Message Passing Interface (MPI) for SiCortex systems. MPI programs written in C/C++ must include <mpi.h>, and Fortran programs must include mpif.h. Because of a name conflict between stdio.h and the MPI C++ binding involving SEEK_SET, SEEK_CUR, and SEEK_END, in MPI programs written in C++ you must either include mpi.h before stdio.h and iostream.h, or add -DMPICH_IGNORE_CXX_SEEK to the compiler command line to make it skip the MPI versions of the SEEK_* routines.

MPI Library Linking Order: you must link your program with the MPI library, either -lscmpi or -lscmpi_debug (included by default for mpicc, mpiCC, and mpif90). When you are using other libraries that depend on it, add the MPI library to the end of the linker's command line.

Example (ScaLAPACK)

C:

	pathcc -o <mpiprogram> <mpiprogram.c> -lscaLAPACK -lscmpi

C++:

	pathCC -o <mpiprogram> <mpiprogram.cpp> -lscaLAPACK -lscmpicxx -lscmpi

Fortran:

	pathf95 -o <mpiprogram> <mpiprogram.f90> -lscaLAPACK -lscmpi
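
The linking rule can be captured in a small wrapper that always places the MPI library last. This is a sketch only: mpi_link_cmd, solver, and solver.c are hypothetical names, and the function prints the command rather than running it.

```shell
#!/bin/sh
# mpi_link_cmd (hypothetical helper): print a compile line in which -lscmpi
# comes after every other library, wherever it appeared in the arguments.
#   $1 = compiler, $2 = output name, $3 = source file, $4... = libraries
mpi_link_cmd() {
    compiler=$1
    out=$2
    src=$3
    shift 3
    libs=
    for lib in "$@"; do
        # Drop -lscmpi here; it is appended once at the very end.
        [ "$lib" = "-lscmpi" ] || libs="$libs $lib"
    done
    echo "$compiler -o $out $src$libs -lscmpi"
}

# The MPI library is given first but ends up last on the printed line.
mpi_link_cmd pathcc solver solver.c -lscmpi -lscaLAPACK
```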

Instead of entering the commands for the compilers directly and adding the MPI library, you can use the MPI compiler scripts: mpicc, mpicxx, mpif77, mpif90.

Here is a table showing both ways to compile your MPI program (with and without using the compiler scripts):

	Language     Suite       Command
	C            PathScale   pathcc -o mpi_program mpi_program.c -lscmpi
	                         OR
	                         mpicc -o mpi_program mpi_program.c
	             GNU         gcc -o mpi_program mpi_program.c -lscmpi
	                         OR
	                         mpicc --gnu -o mpi_program mpi_program.c
	C++          PathScale   pathCC -o mpi_program mpi_program.cpp -lscmpicxx -lscmpi
	                         OR
	                         mpiCC -o mpi_program mpi_program.cpp
	             GNU         g++ -o mpi_program mpi_program.cpp -lscmpicxx -lscmpi
	                         OR
	                         mpiCC --gnu -o mpi_program mpi_program.cpp
	Fortran 77   PathScale   pathf95 -o mpi_program mpi_program.f -lscmpi
	                         OR
	                         mpif77 -o mpi_program mpi_program.f
	             GNU         -
	Fortran 90   PathScale   pathf95 -o mpi_program mpi_program.f90 -lscmpi
	                         OR
	                         mpif90 -o mpi_program mpi_program.f90
	             GNU         -

Here is an example of how to compile a small MPI C program, hello_mpi.c.

	user123@moffett-fe00 ~/mpi $ pathcc -o hello_mpi hello_mpi.c -lscmpi
	user123@moffett-fe00 ~/mpi $ 

	OR
	
	user123@moffett-fe00 ~/mpi $ mpicc -o hello_mpi hello_mpi.c
	user123@moffett-fe00 ~/mpi $ 

You can then run it directly with srun, interactively, or as a batch job (best for long runs).

Directly as a job (partition scx-comp and 4 nodes)

	user123@moffett-fe00 ~/mpi $ srun -p scx-comp -N 4 hello_mpi
	Processor 1 of 4: Hello World!
	Processor 2 of 4: Hello World!
	Processor 3 of 4: Hello World!
	Processor 0 of 4: Hello World!
	user123@moffett-fe00 ~/mpi $ 

Interactively (allocating 4 nodes, then running it in the subshell)

	user123@moffett-fe00 ~/mpi $ srun -p scx-comp -N 4 -A 
	user123@moffett-fe00 ~/mpi $ srun hello_mpi  
	Processor 1 of 4: Hello World!
	Processor 2 of 4: Hello World!
	Processor 3 of 4: Hello World!
	Processor 0 of 4: Hello World!
	user123@moffett-fe00 ~/mpi $ 

Batch job (slurm_hello)

	user123@moffett-fe00 ~/mpi $ srun -p scx-comp -b slurm_hello
	srun: jobid 1058 submitted
	user123@moffett-fe00 ~/mpi $ less slurm-1058.out 
	Processor 3 of 4: Hello World!
	Processor 1 of 4: Hello World!
	Processor 2 of 4: Hello World!
	Processor 0 of 4: Hello World!
	user123@moffett-fe00 ~/mpi $ 

Running Jobs via Condor

Condor allows users to run jobs on systems which would otherwise be idle, for as long as those systems are not needed by their primary users. Condor is one of several distributed computing systems RCAC makes available. Most RCAC resources, in addition to being available through normal means, are a part of BoilerGrid and can be used via Condor. If a primary user needs a machine, the Condor job is immediately checkpointed, migrated, or both, and the resource is made available. Thus, shorter jobs will have a better completion rate via Condor than longer jobs; however, even though jobs may have to be restarted elsewhere, BoilerGrid can offer a vast amount of computational resources to users. Not only are nearly all RCAC systems part of BoilerGrid, but so are large numbers of lab machines at the West Lafayette and other Purdue campuses. BoilerGrid is one of the largest Condor pools in the world. Some machines at other institutions are also a part of a larger Condor federation known as DiaGrid and can be used as well. For more information, refer to the BoilerGrid documentation.