Radon - Complete User Guide

Overview of Radon

The Radon cluster is composed of desktop PCs recycled from instructional computing labs. Radon currently consists entirely of 64-bit Dell systems with Intel Pentium4 or Xeon processors of various speeds and between 2 and 4 GB of RAM. Nodes are connected with either 100 Mbit/s or Gigabit Ethernet. Because these machines are older, slower, and lack high-speed interconnects, communication-intensive or sizeable multithreaded programs are not a good fit. Still, there are a fair number of machines, and some codes may be able to use them effectively.

Detailed Hardware Specification

Radon is currently divided into three different sub-clusters, each with a different combination of CPU speed, memory, and interconnect. Subcluster "a" nodes have 3.6 GHz single-core Intel Pentium4 CPUs, 2 GB RAM, and Gigabit Ethernet; subcluster "b" nodes, 3.2 GHz dual-core Intel Xeon CPUs, 2 GB RAM, and Gigabit Ethernet; and subcluster "c" nodes, 3.2 GHz dual-core Intel Xeon CPUs, 4 GB RAM, and Gigabit Ethernet.

Sub-Cluster        Number of Nodes   Processor                Cores per Node   Memory per Node   Interconnect       Theoretical Peak TeraFLOPS
radon-a[001-144]   144               3.6 GHz Intel Pentium4   1                2 GB              Gigabit Ethernet   1.03
radon-b[001-048]   48                3.2 GHz Intel Xeon       2                2 GB              Gigabit Ethernet   0.61
radon-c[001-048]   48                3.2 GHz Intel Xeon       2                4 GB              Gigabit Ethernet   0.61
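
A theoretical peak figure of this kind is usually estimated as nodes x cores per node x clock rate x floating-point operations per cycle. The sketch below uses the node counts and clock speeds given in the paragraph above, and it assumes 2 floating-point operations per cycle per core. That factor is an assumption, not something stated in this guide, and "peak_tflops" is a hypothetical helper name.

```shell
# Estimate theoretical peak TFLOPS for a sub-cluster.
# Assumption (not from this guide): each core completes 2 floating-point
# operations per clock cycle.
peak_tflops() {
    # $1 = number of nodes, $2 = cores per node, $3 = clock speed in GHz
    awk -v n="$1" -v c="$2" -v ghz="$3" \
        'BEGIN { printf "%.3f\n", n * c * ghz * 2 / 1000 }'
}

peak_tflops 144 1 3.6   # sub-cluster a
peak_tflops 48 2 3.2    # sub-cluster b
peak_tflops 48 2 3.2    # sub-cluster c
```

Under that assumption the three calls print 1.037, 0.614, and 0.614, which is close to the peak values listed in the table.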

All Radon nodes run Red Hat Enterprise Linux 5 and use PBSPro 9.x for resource and job management. Operating system patches are applied monthly or as security needs dictate. All nodes have been configured to allow for unlimited stack usage, as well as unlimited core dump size (though disk space and server quotas may still be a limiting factor).

Node Interconnect Systems

The system interconnect is the networking technology that is used to connect nodes of a cluster to each other. Note that this is often much faster and sometimes radically different from the networking available between a resource and other machines or the outside world. Interconnects have different characteristics that may affect parallel message-passing programs and their design. Each RCAC resource has different interconnect options available, and some have more than one available to all or only portions of the resource's nodes. For information on which interconnects are available, refer to the hardware specification for the resource above. Details about the specific interconnects available on Radon follow.

100 Megabit Ethernet

100 Mbit/s Ethernet is sometimes called "Fast Ethernet" since it was fast in comparison to the original Ethernet speed of 10 Mbit/s. The most common 100 Mbit/s Ethernet standard is also known as "100baseTX", where "TX" stands for "Twisted Pair Copper".

100 Mbps twisted pair runs over two pairs of category 5 or above cable (typical category 5 cable contains 4 pairs and can therefore support two 100baseTX links). Each cable run has a maximum distance of 100 meters (330 feet). In a typical configuration, 100baseTX uses one pair of twisted wires in each direction, providing 100 Mbps of throughput in each direction simultaneously, known as "Full-Duplex".

Gigabit Ethernet

Gigabit Ethernet (GigE) is a form of Ethernet, currently the most widely used network link technology, that is able to transfer data at rates of approximately one gigabit per second, ten times faster than 100 Mbps Ethernet. Over twisted-pair copper cabling (1000BASE-T), cable runs have the same 100-meter maximum length.
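
To get a feel for what the two line rates mean in practice, the following sketch estimates the best-case time to move a file across each link. It is a lower bound only: it ignores protocol overhead, disk speed, and contention from other traffic. "transfer_seconds" is a hypothetical helper, and 1 GB is taken as 1,000,000,000 bytes.

```shell
# Best-case transfer time at a nominal Ethernet line rate.
transfer_seconds() {
    # $1 = file size in bytes, $2 = link rate in Mbit/s
    awk -v bytes="$1" -v mbps="$2" \
        'BEGIN { printf "%.1f\n", bytes * 8 / (mbps * 1000000) }'
}

transfer_seconds 1000000000 100    # 1 GB over 100 Mbit/s Ethernet
transfer_seconds 1000000000 1000   # 1 GB over Gigabit Ethernet
```

The two calls print 80.0 and 8.0 (seconds). Real transfers will take longer.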

Obtaining an Account

Radon is a cluster operated by RCAC. Purdue faculty, staff, and students with the approval of their advisor may request access to Radon using the online Research Computing Account Request Form.

Login / SSH

To submit jobs on Radon, log in to the front-end host radon.rcac.purdue.edu via SSH.

SSH Client Software

All access to the RCAC systems must be through secure (encrypted) connections. Standard telnet and FTP are not supported. SSH, SCP, and SFTP may be used instead.

Secure Shell or SSH is a way of establishing a secure channel between a local and a remote computer. It uses public-key cryptography to authenticate the remote computer and (optionally) to allow the remote computer to authenticate the user. It is usually used to log in to a remote machine and execute commands similar to telnet, but it also supports tunneling and forwarding of X11 or arbitrary TCP connections. The associated SFTP and SCP protocols may be used to transfer files. There are many SSH clients available, depending on the operating system you use.

Linux / Solaris / AIX / HP-UX / Unix:

  • "ssh", "sftp", and "scp" are pre-installed. Log in using ssh myusername@servername.

Microsoft Windows:

Mac OS X:

  • "ssh", "sftp", and "scp" are pre-installed. You may start a local terminal window from "Applications->Utilities". Log in using ssh myusername@servername.
  • MacSSH and MacSFTP
  • NiftyTelnet 1.1 SSH

SSH Keys

SSH can be used in conjunction with many different means of authentication. One popular authentication method is called Public Key Authentication (PKA). PKA is a method of establishing your identity to a remote computer using related sets of encryption data called keys. PKA is a more secure alternative to traditional password-based authentication with which you are probably familiar.

To employ PKA via SSH, you manually generate a keypair (also called SSH keys) in the location from where you wish to initiate a connection to a remote machine. This keypair consists of two text files, one which is called a private key and one which is called a public key. You keep the private key file confidential on your local machine or local home directory (hence the name "private" key). You then login to a remote machine (if possible) and append the corresponding public key text to the end of a specific file, or have a system administrator do so on your behalf. In future login attempts, the public and private keys are compared to verify your identity, which then grants you access to the remote machine.

As a user, you can create, maintain, and employ as many keypairs as you wish. If you connect to a computational resource from your work laptop, your work desktop, and your home desktop, you can create and employ keypairs on each. You can also create multiple keypairs on a single local machine to serve different purposes, such as establishing access to different remote machines, or establishing different types of access to a single remote machine. In short, PKA via SSH offers a secure but flexible means of identifying yourself to all kinds of computational resources.

Passphrases and SSH Keys

When you create a keypair, you are prompted to provide a passphrase for the private key. This passphrase differs from a password in a number of ways. First, a passphrase is, as the name implies, a phrase. It can include most types of characters, including spaces, and has no limit on length. Second, the passphrase is not transmitted to the remote machine for verification. It is used only to unlock your local private key and applies only to that one key.

Perhaps you are wondering why you would need a private key passphrase at all when using PKA. If the private key is kept secure, why the need for a passphrase just to use it? Indeed, if the location of your private keys were always completely secure, a passphrase might not be needed. In reality, a number of situations could arise in which someone may improperly gain access to your private key files. In these situations, a passphrase offers another level of security for you, the user who created the keypair.

Think of the private key/passphrase combination as being analogous to your ATM card/PIN combination. The ATM card itself is the object that grants access to your important accounts, and as such, should be kept secure at all times—just as a private key should. But if you ever lose your wallet or your ATM card is stolen, you are glad that your PIN exists to offer you another level of protection. The same is true for a private key passphrase.

When you create a keypair, you should always provide a corresponding private key passphrase. For security purposes, avoid using phrases that would be guessed by automated programs (e.g. phrases that consist solely of words in English-language dictionaries). This passphrase can never be recovered if forgotten, so make note of it. There are only limited situations when the use of a non-passphrase-protected private key is warranted—conducting automated file backups is one such situation. If you need to use a non-passphrase-protected private key to conduct automated backups to Fortress, see the No-Passphrase SSH Keys section.

SSH X11 Forwarding

SSH supports tunneling of X11 (X-Windows). If you have an X11 server running on your local machine, you may use X11 applications on remote systems and have their graphical displays appear on your local machine. These X11 connections are tunneled and encrypted automatically by your SSH client. You will need to have a local X11 server running, but free and commercial X11 servers are available for various operating systems.

Linux / Solaris / AIX / HP-UX / Unix:

  • An X11 server is at the core of all graphical sessions. If you are logged in to a graphical environment on these operating systems, you are already running an X11 server.

Microsoft Windows:

  • Xming is a free X11 server available for all versions of Windows, although it may occasionally hang and require a restart. Download the "Public Domain Xming" or donate to the development for the newest version.
  • Hummingbird eXceed is a commercial X11 server available for all versions of Windows.
  • Cygwin is another free X11 server available for all versions of Windows. Download setup.exe and make sure you select the following packages which are not included by default:

    	Packages from the X11 group:
    	
    	X-startup-scripts
    	XFree86-lib-compat
    	xorg-*
    	xterm
    	xwinwm
    	lib-glitz-glx1
    	
    	Under the Graphics group, also select opengl, if you want OpenGL 
    	support. 
    
    Then, once the Cygwin X server is installed, start an xterm, type XWin -multiwindow in it, and press Enter. You can now run your SSH client.

Mac OS X:

  • X11 is available as an optional install on the Mac OS X v10.3 Panther and v10.4 Tiger install disks. Run the installer, select the X11 option, and follow the instructions.

Enabling the forwarding


Once you are running an X11 server, you will need to enable X11 forwarding/tunneling in your SSH client:

  • "ssh": X11 tunneling should be enabled by default. To be certain it is enabled, you may use "ssh -X".
  • PuTTY: Prior to connection, in your connection's options, under "Tunnels", check "Enable X11 forwarding", and save your connection.
  • Secure CRT: Right-click a saved connection, and select "Properties". Expand the "Connection" settings, then go to "Port Forwarding" -> "Remote/X11". Check "Forward X11 packets" and click "OK".

Note that SSH will set the remote environment variable $DISPLAY to "localhost:XX.YY" when this is working correctly. If you had previously set your $DISPLAY environment variable to your local IP or hostname, you must remove any set/export/setenv of this variable from your login scripts. The environment variable $DISPLAY must be left as SSH sets it, which is to a random local port address. Setting $DISPLAY to an IP or hostname will not work.
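
After logging in, you can quickly check whether $DISPLAY still has the form that SSH assigns. The following is a minimal sketch for Bourne-style shells such as bash. "check_display" is a hypothetical helper that only inspects the form of the value and does not test whether an X11 application can actually open a window.

```shell
# Classify a DISPLAY value as unset, set by SSH, or overridden.
check_display() {
    # $1 = the DISPLAY value to inspect
    case "$1" in
        "")               echo "unset: X11 forwarding is not active" ;;
        localhost:[0-9]*) echo "ok: set by SSH" ;;
        *)                echo "overridden: remove DISPLAY settings from your login scripts" ;;
    esac
}

check_display "${DISPLAY:-}"
```

If the result is "overridden", look for set, export, or setenv lines that assign DISPLAY in your login scripts.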

Passwords

If you have received a default password as part of the process of obtaining your account, you should change it immediately when you log on for the first time. This can be done from any terminal/SSH session with the command "passwd". You will have the same password on all RCAC systems. If you change your password on any one RCAC system, it will change on all RCAC systems.

If you already have a Purdue career account, then you will initially be given the same userid and password as your career account. There is no need to change your career account password simply because you have received an account on RCAC systems.

There is not currently any requirement regarding how often you must change your password within RCAC, but for security reasons changing a password every six months, preferably every three months, is good practice.

All passwords should:

  • Be something you have never used as a password before, on this or any other system.
  • Be easy for you to remember and difficult for others to guess.
  • Be at least eight characters long.
  • Be a combination of upper and lowercase letters, numbers, and symbols.
  • TIP: Abbreviate a sentence or song lyric: "The dog Samson ate 4 new slippers!" = "TdSa4ns!"

Never share your password with another user or make your password known to anyone else. Systems staff will NEVER ask for your password, by email or otherwise.

Email

There is no local mail delivery available on Radon. All email sent to Radon will be forwarded to mail.rcac.purdue.edu for delivery.

Login Shell

When your account is activated, your default shell will probably be set to tcsh—an enhanced version of the Berkeley UNIX C shell (csh). The tcsh shell is completely compatible with the standard csh, and all csh commands and scripts work unedited with tcsh. For more details on tcsh, enter "man tcsh" while logged in.

The other popular shell is GNU Bourne-Again SHell (bash), which is completely compatible with the Bourne shell (sh). For more details on bash, enter "man bash" while logged in.

To change your shell temporarily or to try out another shell, just type the shell name as a command ("bash", "tcsh", "ksh"). This will run the new shell as a subshell. To return to your original shell, simply type exit.

To permanently change your login shell, use the command chsh:

$ chsh -s bash
     (or)
$ chsh -s tcsh

To see a list of all available shells:

$ chsh -l

The next time you log on, you will start in the new shell. However, you may switch back at any time.

Storage Options

File storage options on RCAC systems include home directories, scratch file systems, /tmp, and long-term or permanent storage. Each of these has different performance and intended uses, and some vary from system to system as well. Home directories and long-term storage are backed up nightly, but scratch and /tmp are not and may be occasionally purged without warning. Below is more detail about each of these storage options.

Home Directories

Your home directory is the default directory you are placed in when you log in.

You should use this space for storing files you want to keep long term such as source code, scripts, input data sets, etc. It should also be used for files you want to keep and which you use often. The home directory will physically reside on the BlueArc NFS Server. You can find the path to your home directory by logging in, and typing pwd:

$ pwd
/home/ba01/u103/myusername

The second component of the reply indicates the name of the host where your home directory physically resides. In this example, the home directory is on the RCAC home directory file server named "ba01" under area "u103". This will vary from person to person. Remember, you can always check where your home directory is located by doing a pwd command in your home directory.

Regardless of its physical location, your home directory and its contents are available on almost all the RCAC front-end hosts and their nodes via the Network File System (NFS). The only exception is Black.

Note that your home directory has a quota capping the size and/or number of files you may store within. For more information, refer to the Storage Quotas / Limits Section.

Lost Home Directory File Recovery

Only files which have been backed up overnight can be recovered. If you lose a file the same day you created it, it can NOT be recovered.

Files Lost Less than Seven Days Ago

For files lost less than seven days ago, RCAC has implemented self-service file recovery. Backups of all your files are made at midnight daily and you may access these directly.

To recover files lost after midnight today (same day as loss):

$ set BACKUP=`echo $HOME | sed "s,/u1,_snap/backup_snap/u1,;s,/home/,/autohome/,"`
$ cd $BACKUP

  (now locate the file or directory you wish to recover within here)

$ cp mylostfile $HOME
  (or)
$ cp -r mylostdir $HOME

To recover files lost prior to today, but within the last week (2-7 day loss):

  (set this to the date you lost the files: 4-digit year, 2-digit month, 2-digit day)
$ set DATE=YYYYMMDD

$ set BACKUP=`echo $HOME | sed "s,/u1,_snap/backup_snap_$DATE*/u1,;s,/home/,/autohome/,"`
$ cd $BACKUP

  (now locate the file or directory you wish to recover within here)

$ cp mylostfile $HOME
  (or)
$ cp -r mylostdir $HOME
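
The "set VAR=..." lines above use csh/tcsh syntax. If your login shell is bash, the same path rewrite can be wrapped in a small function. The sketch below applies the sed expressions from this section to the example home directory path used earlier in this guide, so you can see what the snapshot path looks like. "snapshot_dir" is a hypothetical helper name, and the date 20090115 is only an example value.

```shell
# Print the snapshot location that corresponds to a home directory path.
snapshot_dir() {
    # $1 = home directory path, $2 = optional date of loss as YYYYMMDD
    if [ -n "${2:-}" ]; then
        snap="backup_snap_$2*"
    else
        snap="backup_snap"
    fi
    printf '%s\n' "$1" | sed "s,/u1,_snap/$snap/u1,;s,/home/,/autohome/,"
}

snapshot_dir /home/ba01/u103/myusername
snapshot_dir /home/ba01/u103/myusername 20090115
```

The first call prints /autohome/ba01_snap/backup_snap/u103/myusername and the second prints /autohome/ba01_snap/backup_snap_20090115*/u103/myusername. In bash you could then run cd $(snapshot_dir "$HOME" 20090115), leaving the substitution unquoted so that the shell expands the "*".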

Files Lost More than Seven Days Ago

For files lost more than seven days ago, you will need to request RCAC recover your files from backup tapes. Please do so using the flost command from the front-end host of an RCAC resource:

$ flost

Scratch Directories

Scratch directories are provided by RCAC and are intended for short-term file storage only.

Backups are not performed on scratch directories. In the event of a disk crash or file purge, files in scratch directories can not be recovered. Please be sure to copy any important files to more permanent storage.

All files stored in RCAC scratch directories older than 90 days will be automatically removed (purged). Owners of these files will be notified one week before removal via email. For more information, please refer to our Scratch File Purging Policy.
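
To see which of your files may be approaching the purge threshold, you can list files that have not been modified for more than 90 days. This is a rough sketch: it assumes that file age is judged by modification time, which may not match the actual purge policy, and "purge_candidates" is a hypothetical helper name.

```shell
# List regular files older than a given number of days (default 90).
purge_candidates() {
    # $1 = directory to scan, $2 = optional age threshold in days
    find "$1" -type f -mtime +"${2:-90}" -print
}
```

For example, purge_candidates "$(myscratch)" scans your own scratch directory, and purge_candidates "$(myscratch)" 80 gives some advance warning.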

RCAC scratch directories are provided by a central BlueArc server and are accessible from most RCAC systems. There are two primary scratch file systems: scratch95 and scratch96. A scratch directory already exists for all Radon users. Your RCAC scratch directory is located under scratch95 or scratch96 within a subdirectory by the first letter of your username.

To find the path to your RCAC scratch directory, run myscratch:

$ myscratch
/scratch/scratch96/m/myusername

The variable $RCAC_SCRATCH is also set to your RCAC scratch directory path. Use this variable in any scripts. Your actual scratch directory path may change without warning, but this variable will remain current.

$ echo $RCAC_SCRATCH
/scratch/scratch96/m/myusername
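
In a script, it can help to build paths from $RCAC_SCRATCH in one place and to fail clearly if the variable is missing. The following is a minimal sketch for bash. "scratch_project_dir" and the project name "myproject" are hypothetical.

```shell
# Build a project path under the scratch directory without hard-coding it.
scratch_project_dir() {
    # $1 = name of a project subdirectory
    if [ -z "${RCAC_SCRATCH:-}" ]; then
        echo "RCAC_SCRATCH is not set" >&2
        return 1
    fi
    printf '%s/%s\n' "$RCAC_SCRATCH" "$1"
}
```

A job script might then use workdir=$(scratch_project_dir myproject) || exit 1 followed by mkdir -p "$workdir".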

To find the path to someone else's RCAC scratch directory, use the command findscratch:

$ findscratch someuser
/scratch/scratch95/s/someuser

Note that your RCAC scratch directory has a quota capping the size and/or number of files you may store within. For more information, refer to the Storage Quotas / Limits Section.

/tmp Directory

The /tmp directory is intended for temporary files that are used during the execution of a process or job or while you examine files created by your jobs. Used properly, /tmp may provide faster local storage to an active process than any other storage option. However, do not use it for longer-term storage or critical results.

Files stored in /tmp are not backed up and are removed whenever space is low or whenever the system is rebooted. In the event of a loss, files in /tmp can not be recovered, so use it only for files that can be recreated relatively easily.
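
One common pattern is to create a private working directory under /tmp, do the work there, copy the results you want to keep to another storage location, and then clean up. The sketch below shows the idea for bash. "run_with_tmp", the "myjob" prefix, and "results.txt" are hypothetical names, and the echo line stands in for real work.

```shell
# Work in a private /tmp directory, keep the results, then clean up.
run_with_tmp() {
    # $1 = directory that should receive the results
    workdir=$(mktemp -d /tmp/myjob.XXXXXX) || return 1

    # Stand-in for the real job, which would write its temporary files here.
    echo "intermediate data" > "$workdir/results.txt"

    # Copy anything worth keeping out of /tmp before cleaning up.
    cp "$workdir/results.txt" "$1/"
    status=$?

    rm -rf "$workdir"
    return $status
}
```

For example, run_with_tmp "$RCAC_SCRATCH" leaves results.txt in your scratch directory and nothing behind in /tmp.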

Long-Term Storage

Long-term Storage or Permanent Storage is available to RCAC users on the DXUL/UniTree archival storage system, commonly referred to as "Fortress". DXUL (DiskXtender for Unix and Linux) and UniTree together form a software package that manages a hierarchical storage system. Program files, data files and any other files which are not used often, but which must be saved, can be put in permanent storage. Fortress currently has a 1.2 PB capacity. However, since two copies are retained for every file, the usable capacity is only 600 TB.

Recently used files smaller than 0.5 MB have their primary copy stored on low-cost disks, but the second copy is on tape or optical disks. This provides a rapid restore time to the disk cache. However, the large latency to access a larger file (usually involving a copy from a tape cartridge) makes it unsuitable for use as active storage.

In addition to poor performance, these two uses can cause severe problems with the system itself:

  • DO NOT store any actively used files on Fortress.
  • DO NOT store large collections of small files on Fortress.

Do not use Fortress as a second home directory. Instead, use tar or some similar archive tool to combine all the smaller files you wish to store into a single large file first.
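
As a sketch of the tar approach, the function below bundles one directory of small files into a single compressed archive in the current directory and checks that the archive can be listed before you copy it anywhere. "archive_dir" and the directory names are hypothetical.

```shell
# Bundle a directory into <name>.tar.gz in the current directory.
archive_dir() {
    # $1 = directory to bundle
    name=$(basename "$1")
    parent=$(dirname "$1")
    tar -czf "$name.tar.gz" -C "$parent" "$name" || return 1
    # List the archive to confirm it is readable before relying on it.
    tar -tzf "$name.tar.gz" > /dev/null || return 1
    echo "$name.tar.gz"
}
```

For example, archive_dir ~/myproject_results creates myproject_results.tar.gz, which you can then transfer to Fortress as a single file.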

For active data storage you should use either local storage or a scratch file system. You may then copy any results you wish to archive to Fortress when computation is complete.

Fortress is directly accessible (via FTP, SSH, SCP, SFTP, and NFS) from all RCAC systems, as well as most systems in ECN and CS and from several other major servers on campus. To access Fortress in any way other than NFS, you must login to fortress.rcac.purdue.edu. RCAC has more information about Fortress, including how to obtain a Fortress account and how to access your files on Fortress.

Manual File Transfer to Long-Term Storage

There are a variety of ways to manually transfer files to your Fortress home directory for long-term storage.

SCP

You can use an SCP client to interactively transfer individual files and directories to Fortress. More information on SCP can be found in the File Transfer - SCP section of this guide.

SFTP

You can use an SFTP client to interactively transfer individual files and directories to Fortress. More information on SFTP can be found in the File Transfer - SFTP section of this guide.

Scripted File Transfer to Long-Term Storage

In the absence of NFS access to Fortress, you must login to fortress.rcac.purdue.edu to transfer files to long-term storage. There are limited situations where the use of a login password or a passphrase-protected authentication keypair becomes impractical, and running scripted file backups to Fortress is one of them. When you attempt to establish a connection to Fortress, you are prompted to input a login password or a local private key passphrase. A script or automated process that needs to establish the connection is unable to respond to such a prompt. To enable truly automated transfer of files to Fortress, you need to employ public key authentication via SSH with a non-passphrase-protected private key. For a conceptual overview of public key authentication, see the SSH Keys section of this guide.

Now, if your home directory is compromised and an attacker obtains your non-passphrase-protected private key, the attacker will be able to masquerade as you on machines that contain the corresponding public key. Luckily, certain usage restrictions can be customized for each keypair you employ. For example, you could create a non-passphrase-protected keypair and later specify that this public key shall only be used to run a file-backup script, and additionally, is only valid when connecting from a specific machine. Then, if the non-protected private key were to be compromised, the attacker would be saddened to realize that he could only run your file-backup script repeatedly.

It is very important to place a passphrase on all of your generated keypairs. Only use non-protected keypairs when absolutely necessary.

No-Passphrase SSH Keys

Here is how to set up a non-passphrase-protected keypair for use with automated backup scripts to Fortress from Radon.

  1. Log in to Radon
  2. Create a non-passphrase-protected SSH keypair.

    You should use this keypair for the sole purpose of automating backups on Fortress.

    Specify your ~/.ssh/ directory and give the keypair a descriptive name (e.g. "bkup2fort_id_rsa") by using the "-f" flag:
    $ ssh-keygen -t rsa -N "" -f ~/.ssh/mykeypairname
    
    The ssh-keygen command should have created the following files:
    $ ls ~/.ssh/mykey*
    mykeypairname mykeypairname.pub
    
    The first file is the private key. The second file is the public key counterpart.

    Never distribute your private key or copy it to other machines.
  3. Open the public key file with your favorite text editor and prepend the following text to restrict its use:
    from="*.rcac.purdue.edu",no-port-forwarding,no-agent-forwarding,no-X11-forwarding,no-pty
    
    This tells SSH to only allow connections from RCAC resources, to disable a number of forwarding functions, and to not allow interactive shell commands, respectively.
  4. Copy your modified public key over to your Fortress home directory:
    $ scp ~/.ssh/mykeypairname.pub myusername@fortress.rcac.purdue.edu:~/
    
  5. Log into Fortress and cd to ~/.ssh. Create the ~/.ssh directory if necessary:
    $ ssh myusername@fortress.rcac.purdue.edu
    $ mkdir -p ~/.ssh
    $ cd ~/.ssh/
    
  6. If a file named "authorized_keys" exists in the .ssh directory, set the proper permissions for it:
    $ chmod 600 ~/.ssh/authorized_keys
    
    If it does not exist, create it:
    $ touch ~/.ssh/authorized_keys
    $ chmod 600 ~/.ssh/authorized_keys
    
  7. Append your modified public key to the "authorized_keys" file in your Fortress ~/.ssh directory:
    $ cat ~/mykeypairname.pub >> ~/.ssh/authorized_keys
    
  8. View your "authorized_keys" file. The last entry should look similar to this:
    $ cat ~/.ssh/authorized_keys
    
    from="*.rcac.purdue.edu",no-port-forwarding,no-agent-forwarding,no-X11-forwarding,
    no-pty ssh-rsa AABBB3NzaC1yc2EABBABIwAAAIEA3SXgmvos4jFLVFLRrh6YrN3s8FuBOUTCJ0NIsc+
    FtFrSGD2bVV6yMCgpdgz9RZS7U5uTJOW2VBWsJSb6cjjnA2WJzDcS0bEU3lw+TJszv2sEfl/CwF6dyj2U2
    k5VrXIpdosZVKyjoqzQXhFicIRv1/ykdO8xp+qcgc09NbcyGhs= myusername@resource.rcac.purdue.edu
    
  9. Delete your public key file on Fortress (it's now stored in the "authorized_keys" file):
    $ rm ~/mykeypairname.pub
    
  10. Log out of Fortress:
    $ exit
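
Step 3 can also be done without a text editor. The following sketch prints the public key line with the restriction options from step 3 placed in front of it. "restrict_pubkey" is a hypothetical helper name.

```shell
# Print a public key line prefixed with the usage restrictions from step 3.
restrict_pubkey() {
    # $1 = path to the public key file created by ssh-keygen
    opts='from="*.rcac.purdue.edu",no-port-forwarding,no-agent-forwarding,no-X11-forwarding,no-pty'
    # A single space must separate the options from the key type.
    printf '%s %s\n' "$opts" "$(cat "$1")"
}
```

Redirect the output to a new file, for example restrict_pubkey ~/.ssh/mykeypairname.pub > ~/.ssh/mykeypairname.restricted.pub, and inspect it before copying it to Fortress. Do not redirect the output onto the same file you are reading, because the shell empties that file before the function runs.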
    

SCP

If you have followed the instructions in the No-Passphrase SSH Keys section to employ an unprotected SSH keypair between Radon and Fortress, you can automate the backup process using backup scripts. Because of the restrictions you placed upon the public key, you cannot use this keypair to log in to an interactive SSH session on Fortress, but you can use it to send files from your Radon home directory to Fortress via SCP, or to run local scripts that employ SCP.

Since you can have multiple private keys on Radon (and, similarly, multiple public keys in any given "authorized_keys" file on Fortress), you always need to specify which keypair you intend to employ for a login attempt to Fortress. The most consistent way to do this is with SSH's "-o" flag. This passes options to configure SSH and can be used with all programs that use SSH for providing a secure connection (e.g. SCP, SFTP, and RSYNC).

To test automated SCP authentication from Radon to Fortress, use the following command:

$ scp -o IdentityFile=~/.ssh/mykeypairname ./mylocalfile myusername@fortress.rcac.purdue.edu:~/myremotefile

If this works (i.e. you are not prompted for a passphrase or login password), you can move on to implementing a script using SCP commands like the one above.

While only you can ultimately decide the best approach for your automated backup process, the example script below shows, in general, how to back up files from Radon using SCP commands and public key authentication via SSH. The following bash script, named "fortress_backup_script_scp", uses SCP to recursively copy two directories in a user's Radon home directory to the user's Fortress home directory:

#!/usr/local/bin/bash

# A script to use SCP to copy
# whole directories to Fortress

# Define some parameters

user=myusername
remotehost=fortress.rcac.purdue.edu
idfile=~/.ssh/mykeypairname

# Manually populate an array of directories on the 
# local machine we wish to back up on Fortress

localdir[0]=~/mydir2backup
localdir[1]=~/mydir2backup_also

# Get the number of directories to be backed up

numdirs=${#localdir[*]}
count=1

# Loop over every entry in the "localdir" array to
# copy each directory recursively to a folder of 
# the same name in our home directory on Fortress.

printf "\n>> Starting Secure Copy backup to Fortress\n"

for dir in "${localdir[@]}"
do
  printf ">> Copying directory %s to Fortress (%d of %d)\n" "$dir" "$count" "$numdirs"
  scp -r -o IdentityFile="$idfile" "$dir" "$user@$remotehost":~/
  let count++
done

printf ">> Done...\n\n"

The output for this script is as follows:

$ ./fortress_backup_script_scp
 
>> Starting Secure Copy backup to Fortress
>> Copying directory /home/ba01/u100/myusername/mydir2backup to Fortress (1 of 2)
bigfile2.tar.gz                                100%  121MB  30.3MB/s   00:04    
bigfile1.tar.gz                                100%  121MB  40.5MB/s   00:03    
>> Copying directory /home/ba01/u100/myusername/mydir2backup_also to Fortress (2 of 2)
bigfile4.tar.gz                                100%  121MB  40.5MB/s   00:03    
bigfile3.tar.gz                                100%  121MB  40.5MB/s   00:03  
>> Done...

By using these techniques, you can automate your file backups to Fortress safely and efficiently.
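
The script above does not check whether each copy succeeded. The following variation is a sketch of one way to do that in bash. It adds the standard OpenSSH option "BatchMode=yes", which makes the client fail instead of waiting at a password or passphrase prompt, and it stops at the first failed copy. The "DRY_RUN" switch and the "backup_dir" function are hypothetical additions. With the default shown here, the script only prints the commands it would run.

```shell
user=myusername
remotehost=fortress.rcac.purdue.edu
idfile=~/.ssh/mykeypairname

# Hypothetical switch: 1 (the default here) only prints each command,
# 0 actually runs it.
DRY_RUN=${DRY_RUN:-1}

backup_dir() {
    # $1 = local directory to copy recursively into the Fortress home directory
    set -- scp -r -o BatchMode=yes -o IdentityFile="$idfile" \
        "$1" "$user@$remotehost:~/"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

for dir in ~/mydir2backup ~/mydir2backup_also
do
    # Stop at the first failed copy instead of carrying on silently.
    backup_dir "$dir" || { echo ">> Backup of $dir failed" >&2; exit 1; }
done
```

Run it once as shown to review the commands, then set DRY_RUN=0 in the environment to perform the copies.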

SFTP

If you have followed the instructions in the No-Passphrase SSH Keys section to employ an unprotected SSH keypair between Radon and Fortress, you can automate the backup process using backup scripts. Because of the restrictions you placed upon the public key, you cannot use this keypair to log in to an interactive SSH session on Fortress, but you can use it to send files from your Radon home directory to Fortress via SFTP, or to run local scripts that employ SFTP.

Since you can have multiple private keys on Radon (and, similarly, multiple public keys in any given "authorized_keys" file on Fortress), you always need to specify which keypair you intend to employ for a login attempt to Fortress. The most consistent way to do this is with SSH's "-o" flag. This passes options to configure SSH and can be used with all programs that use SSH for providing a secure connection (e.g. SCP, SFTP, and RSYNC).

To test automated SFTP authentication from Radon to Fortress, use the following command:

$ sftp -o IdentityFile=~/.ssh/mykeypairname myusername@fortress.rcac.purdue.edu
sftp> bye
$

If this works (i.e. you are not prompted for a passphrase or login password), you can move on to implementing a script using SFTP commands like the one above.

While only you can ultimately decide the best approach for your automated backup process, the example script below shows, in general, how to back up files from Radon using SFTP commands and public key authentication via SSH. The following bash script, named "fortress_backup_script_sftp", uses SFTP commands to navigate through Fortress directories and pushes files from the user's Radon home directory when needed:

#!/usr/local/bin/bash

# A script to use SFTP to push files to 
# Fortress for backup.

# Set up some parameters

user=myusername
remotehost=fortress.rcac.purdue.edu
idfile=~/.ssh/mykeypairname

printf "\n>> Starting Secure FTP backup session to Fortress\n"

# Invoke SFTP mode, specifying the correct private key, 
# and forcing batch file input from a "here-document"
# (i.e. the rest of this script).

sftp -o IdentityFile=$idfile -b - $user@$remotehost << EOF

cd ./mydir2backup
lcd ./mydir2backup

put -P ./bigfile1.tar.gz 
put -P ./bigfile2.tar.gz 

cd ../mydir2backup_also
lcd ../mydir2backup_also

put -P ./bigfile3.tar.gz 
put -P ./bigfile4.tar.gz 

bye
EOF

# Now we're back to the bash shell...

printf ">> Done...\n\n"

The output for this script is as follows:

$ ./fortress_backup_script_sftp

>> Starting Secure FTP backup session to Fortress
sftp> 
sftp> cd ./mydir2backup
sftp> lcd ./mydir2backup
sftp> 
sftp> put -P ./bigfile1.tar.gz 
Uploading ./bigfile1.tar.gz to /archive/fortress/home/myusername/mydir2backup/bigfile1.tar.gz
sftp> put -P ./bigfile2.tar.gz 
Uploading ./bigfile2.tar.gz to /archive/fortress/home/myusername/mydir2backup/bigfile2.tar.gz
sftp> 
sftp> cd ../mydir2backup_also
sftp> lcd ../mydir2backup_also
sftp> 
sftp> put -P ./bigfile3.tar.gz 
Uploading ./bigfile3.tar.gz to /archive/fortress/home/myusername/mydir2backup_also/bigfile3.tar.gz
sftp> put -P ./bigfile4.tar.gz 
Uploading ./bigfile4.tar.gz to /archive/fortress/home/myusername/mydir2backup_also/bigfile4.tar.gz
sftp> 
sftp> bye
>> Done...

$ 

By using these techniques, you can automate your file backups to Fortress safely and efficiently.
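The batch commands fed to SFTP in the script above can also be generated from a list of files. The following is a minimal sketch rather than part of the original script: the helper name make_sftp_batch is hypothetical, and the directory and file names are the placeholders used earlier. It only prints the commands, so they can be inspected before being piped to "sftp -b -".

```shell
#!/bin/sh
# Print an SFTP batch: change into the same directory on both sides,
# upload each named file with attributes preserved, then disconnect.
# make_sftp_batch is a hypothetical helper name, not an sftp feature.
make_sftp_batch() {
    dir=$1
    shift
    printf 'cd %s\n' "$dir"
    printf 'lcd %s\n' "$dir"
    for f in "$@"; do
        printf 'put -P %s\n' "$f"
    done
    printf 'bye\n'
}

# Dry run: show the batch commands without contacting Fortress.
make_sftp_batch ./mydir2backup ./bigfile1.tar.gz ./bigfile2.tar.gz

# To run it for real (assumes the keypair and account names used above):
# make_sftp_batch ./mydir2backup ./bigfile1.tar.gz ./bigfile2.tar.gz |
#     sftp -o IdentityFile=~/.ssh/mykeypairname -b - myusername@fortress.rcac.purdue.edu
```

Keeping the command generation separate from the transfer makes it easy to review exactly what a scheduled backup will upload.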

Environment Variables

There are many environment variables related to storage locations and paths which are automatically set for you upon log-in, or may be changed if necessary. In addition, many more environment variables are set for specific applications, such as compilers, when "modules" for these applications are loaded. (See the module command section for more information.)

Use environment variables instead of actual paths whenever possible to avoid problems if the specific paths to any of these change. Some of the environment variables you should have are:

  • $USER: your username
  • $HOME: path to your home directory
  • $PWD: path to your current directory
  • $RCAC_SCRATCH: path to scratch filesystem
  • $PATH: all directories searched for commands/applications
  • $HOSTNAME: name of the machine you are on
  • $SHELL: your current shell (bash, tcsh, csh, ksh)
  • $SSH_CLIENT: your local client's IP address
  • $TERM: type of terminal or terminal emulator being used
  • $OMP_NUM_THREADS: number of OpenMP threads

To reference the value of an environment variable, prefix its name with the dollar sign ($). The names listed above are all uppercase. Environment variables may be used on the command line or in any script in place of, and in combination with, hard-coded values:

$ ls $HOME
...

$ ls $RCAC_SCRATCH/myproject
...

$ ls $RCAC_SCRATCH/myproject/${HOSTNAME}_data
...
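When a variable is followed directly by characters that are legal in a variable name, braces are needed to mark where the name ends. The sketch below illustrates the difference; the helper name data_dir_name and the host name are hypothetical.

```shell
#!/bin/sh
# Build "<host>_data". The braces end the variable name before "_data";
# without them the shell would look up a variable named "host_data".
# data_dir_name is a hypothetical helper used only for this illustration.
data_dir_name() {
    host=$1
    printf '%s\n' "${host}_data"
}

# Show both forms side by side, using "unset" as a visible fallback.
name=radon-a001            # hypothetical host name
echo "with braces:    ${name}_data"
echo "without braces: ${name_data:-unset}"
```

The second echo prints the fallback text because no variable called name_data exists.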

You may find the value of any environment variable by using the echo command:

$ echo $RCAC_SCRATCH
/scratch/scratch95/m/myusername

$ echo $SHELL
/usr/local/bin/tcsh

You may list the values of all environment variables using the env command:

$ env
USER=myusername
HOME=/home/ba01/u103/myusername
RCAC_SCRATCH=/scratch/scratch95/m/myusername
SHELL=/usr/local/bin/tcsh
...

You may create or overwrite an environment variable using either export or setenv, depending on your shell:

  (for bash and sh)
$ export VARIABLE=value

  (for tcsh and csh)
% setenv VARIABLE value

Storage Quotas / Limits

Your disk usage is limited on RCAC systems, and each filesystem (scratch, home directory, etc.) may have a different limit. If you exceed the soft limit (the "quota" value), you will see warnings whenever you write to the disk, but the write will still succeed. If you exceed the hard limit (the "limit" value), your writes will fail until you either remove other files or your quota is increased. Generally, RCAC systems impose only a hard limit, not a soft limit.

Checking Quota Usage

You may find out what your current quota is by using the quota command:

$ quota
Disk quotas for user myusername (uid 12345): 
     Filesystem  blocks   quota   limit   grace   files   quota   limit   grace
     ba01:/u103 2346272       0 5000000           17508       0   65535

The columns are as follows:

  1. Filesystem: This indicates that the line is for the user's files on /u103/, which the output of echo $HOME confirms is the user's home directory filesystem.
  2. Blocks: This shows how many 1 KB blocks the user's files take up. In this case, 2346272 KB / 1024 = 2291 MB, or 2291 MB / 1024 = 2.24 GB.
  3. Quota: This shows that soft limits are not being imposed (0).
  4. Limit: This shows how many 1 KB blocks the user's hard limit is. In this case, (5000000 KB / 1024) / 1024 = 4.77 GB.
  5. Grace: This would show the grace period (in days) for any soft limit (none in this case).
  6. Files: This shows how many file pointers (inodes) the user is currently using. This is based more on the number of files and directories and not the size.
  7. Quota: This shows that soft limits are not being imposed for file pointers (0).
  8. Limit: This shows the user's file pointer hard limit. It is possible, though unlikely, to hit this and not the size limit if you create a large number of very small files.
  9. Grace: This would show the grace period (in days) for any file pointer soft limit (none in this case).
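The unit conversions in the list above can be scripted. This is a small sketch using awk for the floating-point arithmetic; the helper names kb_to_gb and percent_used are hypothetical, and the numbers are taken from the sample quota output.

```shell
#!/bin/sh
# Convert a count of 1 KB blocks to GB (1 GB = 1024 * 1024 KB).
kb_to_gb() {
    awk -v kb="$1" 'BEGIN { printf "%.2f\n", kb / 1048576 }'
}

# Percentage of the hard limit in use, given blocks used and the limit.
percent_used() {
    awk -v used="$1" -v limit="$2" 'BEGIN { printf "%.1f\n", 100 * used / limit }'
}

kb_to_gb 2346272             # usage from the sample output
kb_to_gb 5000000             # hard limit from the sample output
percent_used 2346272 5000000
```

The first two calls reproduce the 2.24 GB and 4.77 GB figures given above.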

You may also see the disk usage of any given directory by using du:

$ du -hs
1.1G    .

$ du -hs $HOME
138M    /home/ba01/u103/myusername

This can be very helpful in figuring out where your largest files or directories are, so that you may clean out unneeded large files and avoid hitting your quota.
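To locate the largest items quickly, the du output can be sorted by size. The following is a sketch under the assumption that du, sort, and head behave as on a typical Linux system; largest_entries is a hypothetical helper name.

```shell
#!/bin/sh
# List the N largest entries directly under a directory, largest first.
# Sizes are in 1 KB units (du -k), so they sort numerically.
# Entries whose names begin with a dot are not matched by "*".
# largest_entries is a hypothetical helper name.
largest_entries() {
    dir=${1:-.}
    count=${2:-5}
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

# Example (uncomment to list the five largest entries in your home directory):
# largest_entries "$HOME" 5
```

Numeric sizes from "du -k" are used instead of the "-h" form because values such as 1.1G and 138M do not sort correctly with a plain numeric sort.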

Requesting Quota Increase

If you find you need additional disk space on an RCAC account, please first consider archiving and compressing old files and moving them to long-term storage. If this option does not resolve the issue, you may send an email to rcac-help@purdue.edu and request additional space.

Archive and Compression

There are several options for archiving and compressing groups of files or directories on RCAC systems. All of the following tools are provided:

  • zip
    Simple compression and file packaging utility.
    Examples:
      (compress file somefile.c)
    $ zip somefile.zip somefile.c
    
      (extract contents of somefile.zip)
    $ unzip somefile.zip
    
      (compress all files in a directory into one archive file)
    $ zip -r somefile.zip somedirectory/
    
      (compress all ".c" files in current directory into one archive file)
    $ zip -r somefile.zip . -i \*.c
    
  • tar
    Saves many files together into a single archive file, and restores individual files from the archive. Includes automatic archive compression/decompression options and special features that allow tar to be used for incremental and full backups.
    Examples:
      (archive file somefile.c)
    $ tar cvf somefile.tar somefile.c
    
      (archive and compress file somefile.c)
    $ tar czvf somefile.tar.gz somefile.c
    
      (list contents of archive somefile.tar)
    $ tar tvf somefile.tar
    
      (extract contents of somefile.tar)
    $ tar xvf somefile.tar
    
      (extract contents of gzipped archive somefile.tar.gz)
    $ tar xzvf somefile.tar.gz
    
      (archive and compress all files in a directory into one archive file)
    $ tar czvf somefile.tar.gz somedirectory/
    
      (archive and compress all ".c" files in current directory into one archive file)
    $ tar czvf somefile.tar.gz *.c 
    
  • gzip
    Compression utility designed as a replacement for compress, with much better compression and no patented algorithms. The standard compression system for all GNU software.
    Examples:
      (compress file somefile - also removes uncompressed file)
    $ gzip somefile
    
      (uncompress file somefile.gz - also removes compressed file)
    $ gunzip somefile.gz
    
  • bzip2
    Strong, lossless data compressor based on the Burrows-Wheeler transform. Also available as a library.
    Examples:
      (compress file somefile - also removes uncompressed file)
    $ bzip2 somefile
    
      (uncompress file somefile.bz2 - also removes compressed file)
    $ bunzip2 somefile.bz2
    
  • compress
    Adaptive Lempel-Ziv compressor. Not often used today.
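Before deleting originals to free up quota, it is worth confirming that the archive can be read back. The sketch below combines the tar commands listed above into one step; archive_dir is a hypothetical helper name, and the directory and file names are placeholders.

```shell
#!/bin/sh
# Archive and gzip-compress a directory, then read the archive back to
# confirm it is intact. archive_dir is a hypothetical helper name.
archive_dir() {
    src=$1
    out=$2
    # -C keeps paths in the archive relative to the parent directory.
    tar czf "$out" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # Listing the archive forces tar to read and decompress all of it.
    tar tzf "$out" > /dev/null
}

# Example with a throwaway directory (names are placeholders).
work=$(mktemp -d)
mkdir "$work/myproject"
echo "sample data" > "$work/myproject/results.txt"
archive_dir "$work/myproject" "$work/myproject.tar.gz" && echo "archive verified"
tar tzf "$work/myproject.tar.gz"
rm -rf "$work"
```

Only remove the original directory once the verification step has succeeded.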

Windows users can work with these same formats using some of the following software:

  • 7-Zip
    Free Windows software package that can handle all the above formats.
  • WinZip
    Commercial Windows software package that can handle all the above formats.
  • WinRAR
    Commercial Windows software package that can handle all the above formats.

File Transfer

There are a variety of ways to transfer data to and from RCAC systems. Which one you should use depends on several factors, including ease of use for you personally, connection speed and bandwidth, and the size and number of files to be transferred.

FTP

FTP (File Transfer Protocol) is a simple data transfer mechanism. FTP was not designed to provide secure communications, so it is no longer supported on any RCAC systems. However, most modern FTP clients also support SFTP or SCP, which are similar, secure protocols for file transfer. Use one of the other methods described here instead of FTP.

SCP

SCP (Secure CoPy) is a simple way of transferring files between two machines that use the SSH (Secure SHell) protocol. You may use SCP to connect to any system where you have SSH (log-in) access. SCP is available as a protocol choice in some graphical file transfer programs and also as a command-line program on most Linux, Unix, and Mac OS X systems. SCP can copy single files, and will also recursively copy directory contents when given a directory name together with the -r option.

Command-line usage:

  (to a remote system from local)
$ scp sourcefilename myusername@hostname:somedirectory/destinationfilename

  (from a remote system to local)
$ scp myusername@hostname:somedirectory/sourcefilename destinationfilename

  (recursive directory copy to a remote system from local)
$ scp -r sourcedirectory/ myusername@hostname:somedirectory/

Linux / Solaris / AIX / HP-UX / Unix:

  • The "scp" command line program should already be installed.

Microsoft Windows:

  • WinSCP is a full-featured and free graphical SCP and SFTP client.
  • PuTTY also offers "pscp.exe", which is an extremely small program and a basic SCP client.
  • Secure FX is a commercial SCP and SFTP client which is available free to Purdue students, faculty, and staff with a Purdue career account.

Mac OS X:

  • The "scp" command line program should already be installed. You may start a local terminal window from "Applications->Utilities".

SFTP

SFTP (Secure File Transfer Protocol) is a reliable way of transferring files between two machines. You may use SFTP to connect to most RCAC systems. SFTP is available as a protocol choice in some graphical file transfer programs and also as a command-line program on most Linux, Unix, and Mac OS X systems. SFTP has more features than SCP and allows for other operations on remote files, remote directory listing, and resuming interrupted transfers. Command-line SFTP cannot recursively copy directory contents; to do so, try using SCP or a graphical SFTP client.

Command-line usage:

$ sftp -B buffersize myusername@hostname

      (to a remote system from local)
sftp> put sourcefile somedir/destinationfile
sftp> put -P sourcefile somedir/

      (from a remote system to local)
sftp> get sourcefile somedir/destinationfile
sftp> get -P sourcefile somedir/

sftp> exit
  • -B: optional, specify buffer size for transfer; larger may increase speed, but costs memory
  • -P: optional, preserve file attributes and permissions

Linux / Solaris / AIX / HP-UX / Unix:

  • The "sftp" command line program should already be installed.

Microsoft Windows:

  • WinSCP is a full-featured and free graphical SFTP and SCP client.
  • PuTTY also offers "psftp.exe", which is an extremely small program and a basic SFTP client.
  • Secure FX is a commercial SFTP and SCP client which is available free to Purdue students, faculty, and staff with a Purdue career account.

Mac OS X:

  • The "sftp" command-line program should already be installed. You may start a local terminal window from "Applications->Utilities".
  • MacSFTP

LFTP

LFTP is a command-line file-transfer program for Linux and Unix systems. It supports SFTP, HTTP, and HTTPS file-transfers. LFTP has additional features not provided by SFTP such as bandwidth throttling, transfer queues, and parallel transfers. It may be used interactively or scripted.

LFTP with parallel transfers can be much faster than SCP or SFTP, so its use is encouraged when possible.

LFTP is provided only on some RCAC systems. However, it is simply a client, so it is not needed on the remote machine involved in a transfer (the remote system need only support SFTP).

Interactive usage:

$ lftp myusername@hostname

         (transfer all ".dat" files from remote system to local)
lftp :~> mget *.dat

         (transfer "filename.dat" file from local system to remote)
lftp :~> put filename.dat

         (transfer a directory and all contents from remote
          system to local, using 5 connections in parallel)
lftp :~> mirror --parallel=5 remotedirectory localdirectory/

         (transfer a directory and all contents from local
          system to remote, using 8 connections in parallel)
lftp :~> mirror -R --parallel=8 localdirectory remotedirectory/

Batch usage:

  (specify all actions on command line)
$ lftp myusername@hostname -e "mget *.dat"

  (specify all actions in the script file "mytransfer.lftp")
$ lftp myusername@hostname -f mytransfer.lftp
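The contents of a script file such as "mytransfer.lftp" are ordinary LFTP commands, one per line. The sketch below writes one possible version of that file. It makes two assumptions that are not from this guide: the connection is opened inside the script with "open sftp://..." (the lftp manual describes -f as an option used on its own), and the script ends with "quit". The helper name write_lftp_script and the directory names are placeholders.

```shell
#!/bin/sh
# Write an LFTP script file containing the commands shown in the
# interactive example. Host, user, and directory names are placeholders.
# write_lftp_script is a hypothetical helper name.
write_lftp_script() {
    cat > "$1" << 'EOF'
open sftp://myusername@hostname
mget *.dat
mirror -R --parallel=8 localdirectory remotedirectory/
quit
EOF
}

# Write the script to a scratch location and display it.
work=$(mktemp -d)
write_lftp_script "$work/mytransfer.lftp"
cat "$work/mytransfer.lftp"

# To run the transfer (requires lftp and access to the remote host):
# lftp -f "$work/mytransfer.lftp"
```

Keeping transfers in a script file makes them repeatable and easy to review before they are run.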

GridFTP

GridFTP is a fast method of transferring large files that uses Globus authentication credentials (x509 certificates). GridFTP is available on some RCAC resources, but only to users who are members of a Grid project, such as TeraGrid, NorthWest Indiana Computational Grid (NWICG), or Open Science Grid (OSG). Note that not all grids may access all RCAC resources.

For more information about how to use GridFTP, consult documentation for your participating grid.

Provided Applications

The third-party software on three commonly used RCAC systems is shown in the following table. Additional software may be available on other RCAC systems, and the software on a specific system can be seen by running the command "module avail" on that system. Please contact rcac-help@purdue.edu if you are interested in the availability of software not shown in this list.

Radon Steele Julius/Caesar
R
AcGrace
Amber
ANSYS
ATLAS
BinUtils
Boost
ClustalX
COMSOL
CPLEX
CUDA
DX
Ferret
FFTW
FLUENT
GAMESS
GAMS
Gaussian
GCC Compiler (C, C++, Fortran)
GCC IA64 Cross-Compiler (xgcc-ia64)
GMP
GMT
GrADS
GROMACS
GhostScript
GSL
HDF4 (Compiled for Intel, GNU, PGI)
HDF5 (Compiled for Intel, GNU, PGI)
ImageMagick
IMSL
Intel Compiler (C, C++, Fortran)
Jasper
Java
LAM
LAMMPS
LSTC
Maple
Mathematica
MATLAB
Mitrionics FPGA Tools (mitrion)
MPFR
MPICH
MPICH2
MPIExec
MrBayes
MUMPS
MVAPICH (for Intel, PGI compilers)
MVAPICH2 (for Intel, GNU, PGI compilers)
MWRank
NCBI
NCL
NCO
NetCDF (for Intel, GNU, PGI compilers)
NTL
NWChem
Octave
PGI Compiler (C, C++, Fortran)
PKG-Config
Python
RASC
SAS
ScaLAPACK
Stata
Subversion
Tau
TecPlot
TotalView
UDUNITS
VASP
Vis5D

Environment Management with the Module Command

RCAC uses the module command as the preferred method for a user to manage the processing environment. With this command, a user may load libraries and paths for using specific applications or compilers. These are organized into packages which may be loaded and unloaded as needed. Please use the module command and do not manually configure your environment, as RCAC staff will frequently make changes to the specifics of various packages. If you use the module command to manage your environment, these changes will not be noticeable.

Below follows a short introduction to the module command. You can see more in the man page for module. Typing module at the command line will give you a brief usage report.

List Available Modules

To see what modules are available on this system, use the "module avail" command:

$ module avail

------------------------ /apps/host/modules/versions -------------------------
3.1.6

-------------------- /apps/host/modules3.1.6/modulefiles ---------------------
dot         module-cvs  module-info modules     null        use.own

----------------------- /apps/host/modules/modulefiles -----------------------
R/2.6.2
R/2.7.0
amber/10
ansys/11.0
dx/4.4.4
fftw/2.1.5
fftw/3.1.2
fluent/6.3.26
gamess/24.MAR.2007.R3(default)
gaussian/D.01
gaussian/E.01(default)
gaussian03/D.01
gaussian03/E.01(default)
gcc/4.3.0
     ...
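A listing like the one above can be filtered to find which version of a module is marked as the default. This is a sketch only: default_version is a hypothetical helper, not part of the module command, and it expects one name/version entry per line on standard input.

```shell
#!/bin/sh
# Print the version marked "(default)" for one module name, reading a
# "module avail" style listing (one name/version per line) on stdin.
# default_version is a hypothetical helper, not part of the module command.
default_version() {
    sed -n "s|^$1/\(.*\)(default)\$|\1|p"
}

# Sample lines copied from the listing above.
listing='gamess/24.MAR.2007.R3(default)
gaussian/D.01
gaussian/E.01(default)
gcc/4.3.0'

printf '%s\n' "$listing" | default_version gaussian

# On a live system the listing could be piped in directly. This assumes
# the listing is written to standard error and that a terse, one-entry-
# per-line mode is available, which depends on the installed version:
# module -t avail 2>&1 | default_version gaussian
```

Because the pattern includes the slash after the name, "gaussian" does not match entries for "gaussian03".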

Load / Unload a Module

You should note that all modules consist of both a name and a version number. When loading a module, you may use only the name to load the default version, or specify which version you wish to load:

$ module load intel
  (load default Intel compiler)

$ module load intel/9.1.045
  (load version 9.1.045 of the Intel compiler)

Note that you will need to load any relevant modules within job submission scripts that use those applications. Loading the module before submitting your job is not sufficient. If you use bash or ksh as your login shell, you will also need to add a line to any submission script to source /etc/profile before invoking "module". Users of csh and tcsh do not need to do this.

     ...
. /etc/profile
module load intel
     ...
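The two lines above can be wrapped in a small guard so that a job stops early when a module cannot be loaded. The following is a sketch for sh-compatible job scripts; ensure_module_cmd and load_required are hypothetical helper names, and whether "module load" reports failure through its exit status depends on the installed version, so treat the error check as an assumption.

```shell
#!/bin/sh
# Make sure the "module" command exists, sourcing /etc/profile only
# when it is missing (as needed for bash and ksh job scripts).
ensure_module_cmd() {
    if ! command -v module > /dev/null 2>&1; then
        [ -r /etc/profile ] && . /etc/profile
    fi
    command -v module > /dev/null 2>&1
}

# Load each named module, stopping at the first reported failure.
# load_required is a hypothetical helper name.
load_required() {
    ensure_module_cmd || { echo "module command not available" >&2; return 1; }
    for m in "$@"; do
        module load "$m" || { echo "could not load $m" >&2; return 1; }
    done
}

# In a submission script, before running the application:
# load_required intel || exit 1
```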

To unload a module, use the "module unload" command. It will attempt to undo the changes that module made to your environment:

$ module unload intel
  (unload default Intel compiler)

$ module unload intel/9.1.045
  (unload version 9.1.045 of the Intel compiler)

List Currently Loaded Modules

To see what modules you have currently loaded, use "module list":

$ module list
Currently Loaded Modulefiles:
  1) intel/9.1.045

$ module unload intel
$ module list
No Modulefiles Currently Loaded.

Show Module Details

To learn more about what a module does to your environment, you may use the "module show module_name" command, where module_name is any name in the list from the "module avail" command. This can be either a default name such as "intel", "gcc", "pgi", or "matlab", or a specific version of a module, such as "intel/9.1.045". Here is an example showing what loading the default Intel compiler does to the processing environment:

$ module show intel
-------------------------------------------------------------------
/opt/modules/modulefiles/intel/9.1.045:

module-whatis    invoke Intel 9.1 Compilers 
prepend-path     PATH /opt/intel/cce/9.1.045/bin 
prepend-path     PATH /opt/intel/fce/9.1.040/bin 
prepend-path     PATH /opt/intel/idbe/9.1.045/bin 
prepend-path     LD_LIBRARY_PATH /opt/intel/cce/9.1.045/lib 
prepend-path     LD_LIBRARY_PATH /opt/intel/fce/9.1.040/lib 
prepend-path     LD_LIBRARY_PATH /opt/intel/idbe/9.1.045/lib 
prepend-path     LD_LIBRARY_PATH /opt/intel/mkl/9.0/lib/em64t 
setenv           CC icc 
setenv           CXX icpc 
setenv           FC ifort 
setenv           F90 ifort 
setenv           LAPACK_INCLUDE -I/opt/intel/mkl/9.0/include 
setenv           LINK_LAPACK -L/opt/intel/mkl/9.0/lib/em64t \
-lmkl_lapack64 -lmkl_em64t -lmkl -lguide -lpthread 
setenv           LINK_LAPACK_STATIC -L/opt/intel/mkl/9.0/lib/em64t \ 
-lmkl_lapack -lmkl_em64t -lguide -lpthread 
-------------------------------------------------------------------

Here is an example showing what loading a specific Intel compiler version does to the processing environment:

$ module show intel/9.1.045
-------------------------------------------------------------------
/apps/steele/modules/modulefiles/intel/9.1.045:

module-whatis    invoke Intel 9.1 Compilers 
prepend-path     PATH /opt/intel/cce/9.1.045/bin 
prepend-path     PATH /opt/intel/fce/9.1.040/bin 
prepend-path     PATH /opt/intel/idbe/9.1.045/bin 
prepend-path     LD_LIBRARY_PATH /opt/intel/cce/9.1.045/lib 
prepend-path     LD_LIBRARY_PATH /opt/intel/fce/9.1.040/lib 
prepend-path     LD_LIBRARY_PATH /opt/intel/idbe/9.1.045/lib 
prepend-path     LD_LIBRARY_PATH /opt/intel/mkl/9.1/lib/em64t 
setenv           CC icc 
setenv           CXX icpc 
setenv           FC ifort 
setenv           F90 ifort 
setenv           ICC_HOME /opt/intel/cce/9.1.045 
setenv           IFORT_HOME /opt/intel/fce/9.1.040 
setenv           MKL_HOME /opt/intel/mkl/9.1 
setenv           LAPACK_INCLUDE -I/opt/intel/mkl/9.1/include 
setenv           LINK_LAPACK -L/opt/intel/mkl/9.1/lib/em64t -lmkl_lapack -lmkl_em64t -lmkl -lguide -lpthread 
setenv           LINK_LAPACK_STATIC -L/opt/intel/mkl/9.1/lib/em64t -lmkl_lapack -lmkl_em64t -lguide -lpthread 
-------------------------------------------------------------------

Provided Compilers

Compilers are available on Radon for Fortran 77, Fortran 90, Fortran 95, C, and C++. The compilers can produce general-purpose and architecture-specific optimizations to improve performance. These include loop-level optimizations, inter-procedural analysis and cache optimizations. The compilers support automatic and user-directed parallelization of Fortran, C, and C++ applications for multiprocessing execution. More detailed documentation on each compiler set available on Radon follows.

Intel Compiler Set

To use the Intel compiler set (compilers and associated libraries) on Radon, load the "intel" module, using the "module" command.

Here are some examples for the Intel compilers:

Fortran77:
  (serial program)
$ module load intel
$ ifort myprogram.f -o myprogram
  (MPI program)
$ module load mpich2-intel
$ mpif77 myprogram.f -o myprogram
  (OpenMP program)
$ module load intel
$ ifort -openmp myprogram.f -o myprogram

Fortran90:
  (serial program)
$ module load intel
$ ifort myprogram.f90 -o myprogram
  (MPI program)
$ module load mpich2-intel
$ mpif90 myprogram.f90 -o myprogram
  (OpenMP program)
$ module load intel
$ ifort -openmp myprogram.f90 -o myprogram

Fortran95:
  (serial, MPI, and OpenMP programs: not available)

C:
  (serial program)
$ module load intel
$ icc myprogram.c -o myprogram
  (MPI program)
$ module load mpich2-intel
$ mpicc myprogram.c -o myprogram
  (OpenMP program)
$ module load intel
$ icc -openmp myprogram.c -o myprogram

C++:
  (serial program)
$ module load intel
$ icpc myprogram.cpp -o myprogram
  (MPI program)
$ module load mpich2-intel
$ mpiCC myprogram.cpp -o myprogram
  (OpenMP program)
$ module load intel
$ icpc -openmp myprogram.cpp -o myprogram

Other versions of the Intel compiler and/or libraries may also be available. To see which versions are currently installed, use the command "module avail".

More information on compiler options can be found in the official man pages, which can be accessed with the "man" command after loading the appropriate compiler module.

GNU Compiler Set

The official name of the GNU compilers is 'GNU Compiler Collection' or 'GCC'. To use the GNU compiler set (compilers and associated libraries) on Radon, load the "gcc" module, using the "module" command.

An older version of the GNU compiler will be in your path by default. Do NOT use this version. Instead, load the newer version using the "module" command.

Here are some examples for the GNU compilers:

Fortran77:
  (serial program)
$ module load gcc
$ gfortran myprogram.f -o myprogram
  (MPI program)
$ module load mpich2-gcc
$ mpif77 myprogram.f -o myprogram
  (OpenMP program)
$ module load gcc
$ gfortran -fopenmp myprogram.f -o myprogram

Fortran90:
  (serial program)
$ module load gcc
$ gfortran myprogram.f90 -o myprogram
  (MPI program)
$ module load mpich2-gcc
$ mpif90 myprogram.f90 -o myprogram
  (OpenMP program)
$ module load gcc
$ gfortran -fopenmp myprogram.f90 -o myprogram

Fortran95:
  (serial program)
$ module load gcc
$ gfortran myprogram.f95 -o myprogram
  (MPI program: not available)
  (OpenMP program)
$ module load gcc
$ gfortran -fopenmp myprogram.f95 -o myprogram

C:
  (serial program)
$ module load gcc
$ gcc myprogram.c -o myprogram
  (MPI program)
$ module load mpich2-gcc
$ mpicc myprogram.c -o myprogram
  (OpenMP program)
$ module load gcc
$ gcc -fopenmp myprogram.c -o myprogram

C++:
  (serial program)
$ module load gcc
$ g++ myprogram.cpp -o myprogram
  (MPI program)
$ module load mpich2-gcc
$ mpiCC myprogram.cpp -o myprogram
  (OpenMP program)
$ module load gcc
$ g++ -fopenmp myprogram.cpp -o myprogram

Other versions of the GNU compiler and/or libraries may also be available. To see which versions are currently installed, use the command "module avail".

More information on compiler options can be found in the official man pages, which can be accessed with the "man" command after loading the appropriate compiler module.

PGI Compiler Set

To use the PGI compiler set (compilers and associated libraries) on Radon, load the "pgi" module, using the "module" command.

Here are some examples for the PGI compilers:

Fortran77:
  (serial program)
$ module load pgi
$ pgf77 myprogram.f -o myprogram
  (MPI program)
$ module load mpich2-pgi
$ mpif77 myprogram.f -o myprogram
  (OpenMP program)
$ module load pgi
$ pgf77 -mp myprogram.f -o myprogram

Fortran90:
  (serial program)
$ module load pgi
$ pgf90 myprogram.f90 -o myprogram
  (MPI program)
$ module load mpich2-pgi
$ mpif90 myprogram.f90 -o myprogram
  (OpenMP program)
$ module load pgi
$ pgf90 -mp myprogram.f90 -o myprogram

Fortran95:
  (serial program)
$ module load pgi
$ pgf95 myprogram.f95 -o myprogram
  (MPI program: not available)
  (OpenMP program)
$ module load pgi
$ pgf95 -mp myprogram.f95 -o myprogram

C:
  (serial program)
$ module load pgi
$ pgcc myprogram.c -o myprogram
  (MPI program)
$ module load mpich2-pgi
$ mpicc myprogram.c -o myprogram
  (OpenMP program)
$ module load pgi
$ pgcc -mp myprogram.c -o myprogram

C++:
  (serial program)
$ module load pgi
$ pgCC myprogram.cpp -o myprogram
  (MPI program)
$ module load mpich2-pgi
$ mpiCC myprogram.cpp -o myprogram
  (OpenMP program)
$ module load pgi
$ pgCC -mp myprogram.cpp -o myprogram

Other versions of the PGI compiler and/or libraries may also be available. To see which versions are currently installed, use the command "module avail".

More information on compiler options can be found in the official man pages, which can be accessed with the "man" command after loading the appropriate compiler module.

Compiling OpenMP Programs

Compilers for C, C++, and several versions of Fortran are available. For a Fortran 77 program with OpenMP directives, see omp_hello_f77.f; for a C program, see omp_hello.c. The table below shows how to compile your program. Any compiler flags accepted by the ifort and icc compilers can be used together with the OpenMP flag.

 Language  Command example, Intel  Command example, GNU  Command example, PGI 
 C  icc -openmp program.c -o program  gcc -fopenmp program.c -o program  pgcc -mp program.c -o program
 C++  icpc -openmp program.cpp -o program  g++ -fopenmp program.cpp -o program  pgCC -mp program.cpp -o program
 Fortran 77  ifort -openmp program.f -o program  -  pgf77 -mp program.f -o program
 Fortran 90  ifort -openmp program.f90 -o program  gfortran -fopenmp program.f90 -o program  pgf90 -mp program.f90 -o program
 Fortran 95  ifort -openmp program.f90 -o program  gfortran -fopenmp program.f90 -o program  pgf95 -mp program.f90 -o program

Compiling a Fortran 90 program gives the following output when successful (note that the compiler module is loaded first - in this example Intel):

	$ module load intel
	$ ifort -openmp omp_hello_90.f90 -o omp_hello
	omp_hello_90.f90(4): (col. 9) remark: OpenMP DEFINED REGION WAS PARALLELIZED.

A compilation of the example C program gives the following output when successful (also Intel):

	$ icc -openmp omp_hello.c -o omp_hello
	omp_hello.c(15): (col. 1) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
	$

Note that in general, neither GNU nor PGI compilers will output anything for a successful compilation.

Compiling MPI Programs

Compilers for C, C++, and several versions of Fortran are available. For a Fortran 77 program with MPI calls, see hello77.f; for a C program, see hello.c. The table below shows how to compile your program. Any compiler flags accepted by the ifort and icc compilers can be used with mpif77 and mpicc.

 Language  Command Example (Intel, GNU, PGI) 
 C  mpicc program.c -o program
 C++  mpiCC program.C -o program
 Fortran 77  mpif77 program.f -o program
 Fortran 90  mpif90 program.f90 -o program

Compiling a Fortran 90 MPI program gives no output when successful:

	$ module load mpich2-intel
	$ mpif90 hello.f90 -o hello
	$ 

A compilation of a C MPI program gives no output when successful:

	$ mpicc hello.c -o hello
	$ 

Note that, in general, none of the Intel, GNU, or PGI compilers will output anything for a successful compilation.

Compiling Hybrid Programs

Compilers for C, C++, and several versions of Fortran are available. For a hybrid C++ program that uses both OpenMP and MPI, see hybrid.cpp. The table below shows how to compile your hybrid (OpenMP/MPI) program. Any compiler flags accepted by the ifort and icc compilers can be used together with the OpenMP flag.

 Language  Command example, Intel  Command example, GNU  Command example, PGI 
 C  mpicc -openmp program.c -o program  mpicc -fopenmp program.c -o program  mpicc -mp program.c -o program
 C++  mpiCC -openmp program.C -o program  mpiCC -fopenmp program.C -o program  mpiCC -mp program.C -o program
 Fortran 77  mpif77 -openmp program.f -o program  -  mpif77 -mp program.f -o program
 Fortran 90  mpif90 -openmp program.f90 -o program  mpif90 -fopenmp program.f90 -o program  mpif90 -mp program.f90 -o program

Example: (Compiling the C++ program mentioned above)

	$ module load mpich2-intel
	$ mpiCC -openmp hybrid.cpp -o hybrid
	hybrid.cpp(73): (col. 30) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
	hybrid.cpp(73): (col. 30) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
	hybrid.cpp(34): (col. 3) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
	hybrid.cpp(25): (col. 1) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
	$ 
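The choice of MPI module, wrapper, and OpenMP flag in the table above can be captured in a short lookup. The sketch below only prints the commands and does not run a compiler; hybrid_build_cmds is a hypothetical helper name, the module names are those used in the compiler sections of this guide, and the mapping from file suffix to wrapper is an assumption based on the examples shown.

```shell
#!/bin/sh
# Print the steps to build a hybrid MPI/OpenMP program, without running them.
# Usage: hybrid_build_cmds FAMILY SOURCE   (FAMILY: intel, gnu, or pgi)
# hybrid_build_cmds is a hypothetical helper name.
hybrid_build_cmds() {
    case $1 in
        intel) mod=mpich2-intel; flag=-openmp ;;
        gnu)   mod=mpich2-gcc;   flag=-fopenmp ;;
        pgi)   mod=mpich2-pgi;   flag=-mp ;;
        *)     echo "unknown compiler family: $1" >&2; return 1 ;;
    esac
    case $2 in
        *.c)       wrapper=mpicc ;;
        *.cpp|*.C) wrapper=mpiCC ;;
        *.f)       wrapper=mpif77 ;;
        *.f90)     wrapper=mpif90 ;;
        *)         echo "unrecognized source suffix: $2" >&2; return 1 ;;
    esac
    # The table lists no GNU command for Fortran 77.
    if [ "$1" = gnu ] && [ "$wrapper" = mpif77 ]; then
        echo "not listed in the table: GNU Fortran 77" >&2
        return 1
    fi
    echo "module load $mod"
    echo "$wrapper $flag $2 -o ${2%.*}"
}

hybrid_build_cmds intel hybrid.cpp
```

For the Intel C++ case, this prints the same two commands used in the example above.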

Provided Libraries

Some libraries are preinstalled for use on Radon. These may include parallel communications libraries as well as mathematical libraries. More detailed documentation on the libraries available on Radon follows.

MPICH Library

MPICH2 (and MPICH) is available for some compiler combinations on Radon. Refer to the compilers section for an overview of how to link in MPICH2 support.

Intel Math Kernel Library (MKL)

Intel Math Kernel Library (MKL) contains ScaLAPACK, LAPACK, Sparse Solver, BLAS, Sparse BLAS, CBLAS, GMP, FFTs, DFTs, VSL, VML, and Interval Arithmetic routines. MKL can be found in the directory "/opt/intel/mkl/9.1" and it is divided into the following subdirectory structure:

  • lib/32 – Libraries for 32-bit Applications
    • libmkl_ia32.a – Optimized Kernels (BLAS, CBLAS, Sparse BLAS, GMP, FFTs, DFTs, VML, VSL, Interval Arithmetic)
    • libmkl_lapack.a – LAPACK Routines
    • libmkl_lapack95.a – LAPACK95 Interface (libmkl_lapack.a also required)
    • libmkl_solver.a – Sparse Solver Routines
    • libguide.a – Threading Library for Static Linking
  • lib/em64t – Libraries for Intel EM64T Applications
    • libmkl_em64t.a – Optimized Kernels (BLAS, CBLAS, Sparse BLAS, GMP, FFTs, DFTs, VML, VSL, Interval Arithmetic)
    • libmkl_lapack.a – LAPACK Routines
    • libmkl_lapack95.a – LAPACK95 Interface (libmkl_lapack.a also required)
    • libmkl_solver.a – Sparse Solver Routines
    • libguide.a – Threading Library for Static Linking
  • lib/64 – Libraries for Itanium 2 Applications
    • libmkl_ipf.a – Optimized Kernels (BLAS, CBLAS, Sparse BLAS, GMP, FFTs, DFTs, VML, VSL, Interval Arithmetic)
    • libmkl_lapack.a – LAPACK Routines
    • libmkl_lapack95.a – LAPACK95 Interface (libmkl_lapack.a also required)
    • libmkl_solver.a – Sparse Solver Routines
    • libguide.a – Threading Library for Static Linking

Here are some example combinations of linking options:

  (static linking of LAPACK and Kernels)
$ <fortran_compiler> myprogram.f -L${MKLPATH} -lmkl_lapack -lmkl_ia32 -lguide -lpthread

  (static linking of Fortran-95 LAPACK Interface and Kernels)
$ <fortran_compiler> myprogram.f95 -L${MKLPATH} -lmkl_lapack95 -lmkl_lapack -lmkl_ia32 -lguide -lpthread

  (static linking of BLAS, Sparse BLAS, GMP, VML/VSL, Interval Arithmetic, and FFT/DFT)
$ <c_compiler> myprogram.c -L${MKLPATH} -lmkl_ia32 -lguide -lpthread -lm

  (dynamic linking of BLAS or FFTs)
$ <c_compiler> myprogram.c -L${MKLPATH} -lmkl -lguide -lpthread

It is recommended that you use dynamic linking of libguide. If so, ensure LD_LIBRARY_PATH is defined such that the correct version of libguide is found and used at run time. If you use static linking of libguide (discouraged), then:

  • If you use the Intel compilers, link in the libguide version that comes with the compiler (use the -openmp option).
  • If you do not use the Intel compilers, link in the libguide version that comes with the Intel MKL above.
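The "module show intel" output shown earlier sets FC and LINK_LAPACK, so a link line can be assembled from those variables instead of typing the MKL paths by hand. The following sketch only prints the command; lapack_link_cmd is a hypothetical helper name, and the values assigned below imitate what the module would set.

```shell
#!/bin/sh
# Print a link command that uses the variables set by the "intel" module
# (FC and LINK_LAPACK appear in the "module show intel" output).
# lapack_link_cmd is a hypothetical helper; it does not run the compiler.
lapack_link_cmd() {
    src=$1
    if [ -z "${LINK_LAPACK:-}" ]; then
        echo "LINK_LAPACK is not set; load the intel module first" >&2
        return 1
    fi
    echo "${FC:-ifort} $src -o ${src%.*} $LINK_LAPACK"
}

# Values below mimic what the module would set, for illustration only.
FC=ifort
LINK_LAPACK="-L/opt/intel/mkl/9.1/lib/em64t -lmkl_lapack -lmkl_em64t -lmkl -lguide -lpthread"
lapack_link_cmd myprogram.f90
```

Using the module-provided variables keeps build scripts working when the MKL installation path changes.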

Using cpp with Fortran

If the source file ends with .F, .fpp, or .FPP, it is automatically preprocessed by cpp before it is compiled. If you want to use the C preprocessor with source files that do not end with .F, use the following compiler option to specify the filename suffix:

GNU: -x f77-cpp-input

  Note that the preprocessing is not extended to the contents of 
  files included by the "INCLUDE" directive - the #include 
  preprocessor directive must be used instead.
  

For example, to preprocess source files that end with .f:

    gfortran -x f77-cpp-input program.f

Intel: -cpp

  To tell the compiler to link using C++ runtime libraries 
  included with gcc/icc, use -cxxlib -gcc/-cxxlib -icc.
  

For example, to preprocess source files that end with .f:

    ifort -cpp program.f

Generally, it is best to rename the file from <name>.f to <name>.F. The preprocessor will then be run automatically when the file is compiled.

A good page to consult on combining C/C++ and Fortran is "Using C/C++ and Fortran together".

Calling Fortran from C/C++

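When calling your own Fortran routines from C/C++ with XL Fortran, you should not append an underscore (_) after the name. The GNU and Intel Fortran compilers on Linux, by contrast, append a trailing underscore to external routine names by default, so C/C++ code built with them must call the routine with the underscore (for example, a Fortran subroutine mysub is called as mysub_).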

A complete list of routines is in the XL Fortran Language Reference Manual.

Here are some links to pages that discuss how to use Fortran from C/C++:

Running Jobs on Radon

There are a number of different compilers and programs installed on the RCAC systems. To access them, use module load <program>. To see the available modules, type module avail. To read more about the "module" command, look here.

There are two methods for submitting jobs to the Radon community cluster. First, you may submit jobs directly to a queue on Radon. These jobs may be serial, message-passing, or shared-memory in nature. You use the Portable Batch System (PBS) to submit jobs to a queue. Secondly, the Radon cluster is a part of BoilerGrid. You may submit serial jobs to BoilerGrid and specifically request that the serial jobs be run on the resources on Radon.

Running Jobs via PBS

Radon uses PBS version 9.x. The newer versions have a few minor differences from the older versions (before 8.0).

Differences are mainly:

  • Use of "cpp" no longer supported
  • Hosts in $PBS_NODEFILE ordered differently

The Portable Batch System (PBS) is a richly featured workload management system providing job scheduling and job management interface on computing resources, including Linux clusters. With PBS, a user requests resources and submits a job to a queue. A description of Radon's queues follows further down.

Note that you should never run big, long, multi-threaded, or CPU-intensive jobs on the front-end host. The front-end hosts are community-owned and running anything but the smallest test-job will slow them down for everyone. Use PBS to submit the job as a job submission file (called a job script in the official manual) or run an interactive PBS job session.

Radon PBS Tips

  • PBS Queues. Always use qstat -Q to determine which queues are available. There will usually be queues which are available to everyone with an account on that system. On most systems these are called either "standby" or "workq".
  • Any program that is installed can be run interactively. You must still use the "module load <program>" command to access it.
  • Programs which open a display can also be run interactively. Just use the -v DISPLAY option to qsub.
  • You can see which nodes you are using on one of the cluster machines with the command: cat $PBS_NODEFILE
  • When running OpenMP programs, you need the processors to be on the same node to get the advantage of shared memory.
  • The order of the processors is random. There is no way to tell which processor will do what and in which order in a parallel program.
  • Remember that ncpus can not be larger than the number of processors on each node on the machine in question.

Radon PBS Queues

To see a list of queues, type qstat -Q.

There are a total of 2 queues, but 'workq' is the default one to which everyone has access. It has a default wall time limit of 30 minutes, and maximum wall time limit of 336 hours.

No priorities are currently enforced on the ITaP/RCAC Linux clusters. All users have an equal chance at resources (note that this is not to say that all jobs have an equal chance at resources).

The nodes in Radon are partitioned into subclusters. By default, user jobs will be given nodes from the same subcluster (meaning that all machines will share an Ethernet switch and will have the highest possible interconnect bandwidth). If it is necessary to run jobs that span multiple subclusters, or to run a job that uses more than the number of nodes in any one subcluster, please contact rcac-help@purdue.edu.

The command qstat -Q will give you a list of all the queues. If you use qstat -Q -f it gives you more information about the various queues, including number of jobs, users and walltime.

Queues and some information about them:

Name of queue   Node Count   Default walltime   Max walltime
workq           240/240      00:30:00           336:00:00
radon_hold      N/A          00:30:00           N/A

Node Count is the number of accessible nodes for the queue. It is given as max/default.

Note that the number of processors that can be used for a certain queue is the total number of processors that are available to that queue. It does not mean that every submitter to that queue can get "Node Count" processors. For example, if a queue has a total of 32 accessible processors and one user has requested 16 processors and another has asked for 12 processors, then another submitter to that queue can only get 4 processors as long as the others are in use. If this user asks for more than 4 processors, their job will wait in the queue until enough processors are available.
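
The arithmetic in this example can be sketched in shell. The function name procs_remaining is hypothetical, and the numbers are the ones used in the paragraph above. PBS does this accounting itself, so the snippet is only an illustration.

```shell
# Hypothetical helper: processors still free in a queue.
# $1 = total processors accessible to the queue;
# remaining arguments = processors already requested by other jobs.
procs_remaining() {
    remaining=$1
    shift
    for used in "$@"; do
        remaining=$(( remaining - used ))
    done
    echo "$remaining"
}

# Numbers from the example: 32 in total, 16 and 12 already in use.
procs_remaining 32 16 12    # prints 4
```

A request for more than the printed number of processors would wait in the queue until other jobs finish.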

Radon PBS Submission Script

A job submission file (job script) can contain any of the commands that you would otherwise issue yourself from the command line. You can, for example, both compile and run a program and also set any necessary environment values. The results from compiling or running your programs can usually be seen after the job has run. They will show up in your directory as the files <script_name>.e<job number> and <script_name>.o<job number>. The first file will contain any errors that were reported (hopefully none), and the second file will contain any results that your program would otherwise have printed to the screen. If the program is supposed to write its results to a file, this will of course still happen. The job number is a number which PBS gives to every job. It is reported when the job is submitted.

It may take quite a while before the job finishes running. How long depends on, among other things, the number of nodes you have requested, how large the program is, which queue you are running it in, and how many other people are using the system at the same time.

A job submission file may consist of PBS directives, comments and executable statements. A PBS directive provides a way of specifying job attributes in addition to the command line options. For example:

	#PBS -N Job_name
	#PBS -l select=4:mem=320kb,walltime=10:30
	#PBS -m be
	#
	step1 arg1 arg2
	step2 arg3 arg4

The -N Job_name option replaces script_name in the names of the error and output files. The qsub command scans the lines of the script file for directives. An initial line in the script that begins with the characters "#!" or the character ":" will be ignored and scanning will start with the next line. Scanning will continue until the first executable line, that is, a line that is not blank, not a directive line, and not a line whose first non-white space character is "#". If directives occur on subsequent lines, they will be ignored.
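
The scanning rule just described can be sketched with a small shell function. This is only an approximation of the rule as stated in this guide, not qsub's actual parser, and the name list_pbs_directives is hypothetical. It prints the directive lines that would be honored in a given script.

```shell
# Hypothetical helper: print the "#PBS" lines that fall within the
# directive-scanning region of the job script named in $1.
list_pbs_directives() {
    awk '
        NR == 1 && (/^#!/ || /^:/) { next }         # initial "#!" or ":" line is ignored
        /^#PBS/                    { print; next }  # directive line
        /^[ \t]*$/                 { next }         # blank line
        /^[ \t]*#/                 { next }         # ordinary comment line
                                   { exit }         # first executable line ends the scan
    ' "$1"
}

# Small demonstration with a throwaway script.
demo=$(mktemp)
cat > "$demo" <<'EOF'
#!/bin/sh
#PBS -N Job_name
# PBS -m be
step1 arg1 arg2
#PBS -l walltime=10:30
EOF
list_pbs_directives "$demo"    # prints only: #PBS -N Job_name
rm -f "$demo"
```

In the demonstration, the line starting with "# PBS" is an ordinary comment because of the space after "#", and the walltime directive is ignored because it follows an executable line.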

The remainder of the directive line consists of the options to qsub in the same syntax as they appear on the command line. The option character is to be preceded with the "-" character.

If an option is present in both a directive and on the command line, that option and its argument, if any, will be ignored in the directive. The command line takes precedence.

If an option is present in a directive and not on the command line, that option and its argument, if any, will be processed as if it had occurred on the command line.

How you run a program depends on whether it is a serial program, an OpenMP program, or an MPI program. There is no difference in how to run the program for the various compilers.

Important: You must 'module load' the same compiler (and MPICH2 if needed) that you used for compiling. Note that it is not necessary to load the standard compiler if you have loaded the corresponding compiler with the MPICH2 libraries included.

PBS Job Submission

The command to submit the job submission file is the following:

	qsub -q standby -l select=4,walltime=1:00 run_hello

This example submits a job to the queue 'standby' and requests 4 chunks (by default, one processor each) with a walltime of 1 minute. The job submission file is called run_hello. The names of the queues differ across the RCAC systems. You can list them with the command qstat -Q or look at the section 'Radon PBS Queues'.
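
To make the walltime notation concrete, here is a minimal sketch that converts a walltime string to seconds. It assumes the format [[hh:]mm:]ss, which matches the values used in this guide (1:00 for one minute, 336:00:00 for the maximum on workq). The function name walltime_to_seconds is hypothetical.

```shell
# Hypothetical helper: convert a walltime of the form [[hh:]mm:]ss to seconds.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ t = 0; for (i = 1; i <= NF; i++) t = t * 60 + $i; print t }'
}

walltime_to_seconds 1:00         # prints 60
walltime_to_seconds 00:30:00     # prints 1800
walltime_to_seconds 336:00:00    # prints 1209600
```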

Some useful options for the qsub command include the following (in the list below, note that a chunk is defined as a set of resources that are to be allocated as a unit):

  • -q <name>: tells which queue you want the job to run in. A list of available queues can be seen using the command qstat -Q. If none is chosen, the job is submitted to the default queue at the default batch server.
  • -l select=[N:]chunk[+[N:]chunk ...], where N specifies how many of that chunk, and a chunk is of the form: resource_name=value[:resource_name=value ...]
    • Job-wide resource_name=value requests are of the form: -l resource_name=value[,resource_name=value ...]. The most important resource names are: node (required), ncpus (how many processors), mpiprocs (how many processes).
    • The place statement has this form: -l place=[ arrangement ][: sharing ][: grouping] where
      • arrangement is one of free | pack | scatter
      • sharing is one of excl | shared
      • grouping can have only one instance of group=resource
      and where
      • free: Place job on any node(s). Only good if you have a job that does not need much memory, so you do not mind it sharing the node with others. Will most likely give you access quicker than the other options.
      • pack: You will get processors on one node only - all jobs will be placed on one node. Good for OpenMP.
      • scatter: The chunks with any MPI processes will be spread out across as many of the nodes as possible, attempting to put only one process on each. A chunk with no MPI processes may be taken from the same node as another chunk.
      • excl: Only this job uses the nodes chosen.
      • shared: This job can share the nodes chosen.
      • group=resource: Chunks will be grouped according to a resource. All nodes in the group must have a common value for the resource, which can be either the built-in resource host or a site-defined node-level resource.
      Note that nodes can have sharing attributes that override job placement requests.
  • -I: Job is to be run interactively.
  • -v variable_list: Expands the list of environment variables that are exported to the job. This can also be environment variables from the qsub command environment which are made available to the job when it executes. The variable_list is a comma separated list of strings of the form variable or variable=value. These variables and their values are passed to the job.
  • -V: Declares that all environment variables in the qsub command's environment are to be exported to the batch job.

    Note that ncpus can not be larger than the number of processors on each node on the machine in question.

    PBS sets a number of environment variables that are then available to the job. They include:

    • PBS_O_HOST: the name of the host upon which the qsub command is running
    • PBS_O_QUEUE: the name of the original queue to which the job was submitted
    • PBS_O_SYSTEM: the operating system name given by uname -s on the host on which qsub is running
    • PBS_O_WORKDIR: the absolute path of the current working directory of the qsub command
    • PBS_ENVIRONMENT: set to PBS_BATCH to indicate the job is a batch job, or to PBS_INTERACTIVE to indicate the job is a PBS interactive job, see -I option
    • PBS_JOBID: the job identifier assigned to the job by the batch system
    • PBS_JOBNAME: the job name supplied by the user
    • PBS_NODEFILE: the name of the file containing the list of nodes assigned to the job
    • PBS_QUEUE: the name of the queue from which the job is executed

    If you wish to interrupt qsub prior to job start (before you get a command-line prompt), this can be done by typing control-C. It will then query if the user wishes to exit. If the user responds "yes", qsub exits and that job is aborted.

    Instead of using a job submission file, qsub also accepts commands from standard input - the keyboard. To use this option, avoid giving a script operand or give the single character "-". When the script is being read from Standard Input, qsub will copy the file to a temporary file. This temporary file is passed to the library interface routine pbs_submit. The temporary file is removed by qsub after pbs_submit returns or upon the receipt of a signal which would cause qsub to terminate.

    Once an interactive job has started execution, input to and output from the job pass through qsub. Keyboard-generated interrupts are passed to the job. Entries beginning with the tilde ('~') character and containing special sequences are escaped by qsub. The recognized escape sequences are:

    	~.      Qsub terminates execution. The batch job is
    	        also terminated.
    	
    	~susp   Suspend the qsub program if running under the C shell.
    	        "susp" is the suspend character, usually CNTL-Z.
    	
    	~asusp  Suspend the input half of qsub (terminal to job),
    	        but allow output to continue to be displayed. Only
    	        works under the C shell. "asusp" is the auxiliary
    	        suspend character, usually CNTL-Y.
    

    Note: The following warning applies for users of the c-shell, csh. If the job is executed under the csh and a .logout file exists in the home directory in which the job executes, the exit status of the job is that of the .logout script, not the job submission file. This may impact any interjob dependencies. To preserve the job exit status, either remove the .logout file or place the following line as the first line in the .logout file:

    	set EXITVAL = $status
    

    and the following line as the last executable line in .logout

    	exit $EXITVAL
    

    PBS Job Status

    The command qstat -a shows the current jobs (queued, running, or held) and their IDs.

    Example (run on Steele):

    $ qstat -a 
    
    steele-adm.rcac.purdue.edu:
                                                                Req'd  Req'd   Elap
    Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
    --------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
    77025.steele-ad user123  standby  hello         --    1   8    --  00:05 Q   --
    115505.steele-a user456  ncn      job4         5601   1   1    --  600:0 R 575:0
    ...
    189479.steele-a user456  standby  AR4b          --    5  40    --  04:00 H   --
    189481.steele-a user789  standby  STDIN        1415   1   1    --  00:30 R 00:07
    189483.steele-a user789  standby  STDIN        1758   1   1    --  00:30 R 00:07
    189484.steele-a user456  standby  AR4b          --    5  40    --  04:00 H   --
    189485.steele-a user456  standby  AR4b          --    5  40    --  04:00 Q   --
    189486.steele-a user123  tg_workq STDIN         --    1   1    --  12:00 Q   --
    189490.steele-a user456  standby  job7        26655   1   8    --  04:00 R 00:06
    189491.steele-a user123  standby  job11         --    1   8    --  04:00 Q   --
    $ 
    

    Where 'Q' = Queued, 'R' = Running, and 'H' = Held.
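
When the listing is long, a quick summary of the state column can help. The sketch below counts jobs per state from text in the qstat -a layout shown above. It assumes the state letter is the tenth whitespace-separated field, as in that listing, and the function name count_job_states is hypothetical. On a cluster you would pipe the real command into it (qstat -a | count_job_states). Here it reads a few sample lines so that it runs anywhere.

```shell
# Hypothetical helper: count queued, running, and held jobs in
# "qstat -a" style output read from standard input.
count_job_states() {
    awk '
        $1 ~ /^[0-9]+\./ { count[$10]++ }   # job lines start with "<number>."
        END { printf "Q=%d R=%d H=%d\n", count["Q"], count["R"], count["H"] }
    '
}

count_job_states <<'EOF'
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
77025.steele-ad user123  standby  hello         --    1   8    --  00:05 Q   --
189479.steele-a user456  standby  AR4b          --    5  40    --  04:00 H   --
189490.steele-a user456  standby  job7        26655   1   8    --  04:00 R 00:06
189491.steele-a user123  standby  job11         --    1   8    --  04:00 Q   --
EOF
# prints: Q=2 R=1 H=1
```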

    The list can be very long, making it difficult to find your own runs. If that is the case, use the following command to ask for jobs submitted by a specific user:

    $ qstat -a -u user123
    
    steele-adm.rcac.purdue.edu: 
                                                                Req'd  Req'd   Elap
    Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
    --------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
    182792.steele-a user123  standby  job1        28422   1   4    --  23:00 R 20:19
    185841.steele-a user123  standby  job2        24445   1   4    --  23:00 R 20:19
    185844.steele-a user123  standby  job3        12999   1   4    --  23:00 R 20:18
    185847.steele-a user123  standby  job4        13151   1   4    --  23:00 R 20:18
    $ 
    

    PBS Job Cancellation

    To stop a job before it finishes, use:

    qdel <job id>
    

    You get the job id from the qstat -a or qstat -a -u [username] command.

    PBS Interactive Jobs

    To use the PBS queue interactively, you have to use the -I option. A command to submit such a job looks like this:

    	qsub -I -q standby -l select=2:ncpus=2
    

    where the options used mean the following:

    • -q <name>: the queue you want the job to run in. A list of available queues can be seen using the command qstat -Q. If none is chosen, the job is submitted to the default queue at the default batch server.
    • -l select=[N:]chunk[+[N:]chunk ...], where N specifies how many of that chunk, and a chunk is of the form: resource_name=value[:resource_name=value ...]

      Job-wide resource_name=value requests are of the form: -l resource_name=value[,resource_name=value ...]. An example of a resource_name is ncpus, which is the number of CPUs.

      The place statement has this form: -l place=[ arrangement ][: sharing ][: grouping]

      where

      - arrangement is one of free | pack | scatter
      - sharing is one of excl | shared
      - grouping can have only one instance of group=resource

      and where

      free: Place job on any node(s). Only good if you have a job that does not need much memory, so you do not mind it sharing the node with others. Will most likely give you access quicker than the other options.
      pack: You will get processors on one node only - all jobs will be placed on one node. Good for OpenMP.
      scatter: The chunks with any MPI processes will be spread out across as many of the nodes as possible, attempting to put only one process on each. A chunk with no MPI processes may be taken from the same node as another chunk.
      excl: Only this job uses the nodes chosen.
      shared: This job can share the nodes chosen.
      group=resource: Chunks will be grouped according to a resource. All nodes in the group must have a common value for the resource, which can be either the built-in resource host or a site-defined node-level resource.

      Note that nodes can have sharing attributes that override job placement requests.

    • -I: Job is to be run interactively.

    As mentioned, the -I option must be specified for the job to be interactive. After opening an interactive session, we may run programs in the normal way. For running serial programs, you should only ask for one "chunk". Parallel programs can be run with the preferred number of nodes, which should be specified with -l select=<# nodes> when qsub is started.

    Note that ncpus can not be larger than the number of processors on each node on the machine in question.

    To open a display when running interactively, start the interactive session with the following command:

    	qsub -I -q <queue> -l select=<number of nodes>:ncpus=<number of processors> -v DISPLAY
    

    To end an interactive job, just type exit. If you wish to interrupt qsub prior to job start, this can be done by typing control-C. It will then query if the user wishes to exit. If the user responds "yes", qsub exits and the job is aborted.

    To see which nodes your job is using:

    	cat $PBS_NODEFILE
    

    It is strongly suggested that you only use an interactive session for developmental tasks (such as debugging). Use a PBS job submission file when running the finished program.

    PBS Examples

    A large part of submitting a job involves understanding how to request computing resources. This section contains examples of submitting PBS jobs, both using a batch script and interactively. There will be separate examples for MPI and OpenMP jobs. Note that the sections 'batch' and 'interactive' have some examples which might also be relevant for, say, MPI and OpenMP.

    PBS Batch Examples

    This simple example submits the script 'run_hello' to the 'standby' queue on Steele and requests 4 nodes.

    	-bash-3.00$ qsub -q standby -l select=4,walltime=1:00 run_hello
    	99.steele-adm.rcac.purdue.edu
    	-bash-3.00$ 
    

    Running ls in your directory will now show two new files:

    	-bash-3.00$ ls
    	hello                            run_hello
    	hello.c                          run_hello.e99
    	hello.out                        run_hello.o99
    	-bash-3.00$  
    

    If everything went well, then the file 'run_hello.e99' will be empty, since it contains any error messages your program gave while running. The file 'run_hello.o99' contains the output from your program.

    Compiling through job submission files

    If you want to do more than just run a program (say, you want to compile an MPI/C program), you first need to load the compiler you wish to use. This must be done in the job submission file. To load a compiler, use module load <compiler>. To load a compiler with MPICH2 included, use module load mpich2-<compiler>.

    Here is an example of a job submission file that would work if you wanted to compile an MPI/C program with the Intel compiler:

    Tcsh:

    	module load mpich2-intel
    	cd $PBS_O_WORKDIR
    	mpicc program.c -o program
    

    Bash:

    	source /etc/profile
    	module load mpich2-intel
    	cd $PBS_O_WORKDIR
    	mpicc program.c -o program
    

    It is necessary to include the 'source /etc/profile' under bash/ksh, to be able to use the 'module load' command.

    The command to submit a job is the following:

    	qsub -q standby -l select=4,walltime=1:00 run_program 
    

    where the options used mean the following:

    • -q <name>: tells which queue you want the job to run in. A list of available queues can be seen using the command qstat -Q.
    • -l select: tells the job how many "chunks" (CPUs) you want to use (4 in the example) and
    • walltime=hh:mm:ss defines how much wall clock time it has (in the example it is set to 1 minute).

    Submitting this script now gives the following result (it will take a while before the job is completed):

    	-bash-3.00$ qsub -q standby -l select=4,walltime=1:00 run_hello
    	106361.steele-adm.rcac.purdue.edu
    	-bash-3.00$ 
    

    Running 'ls' in your directory will now show two new files:

    	bash-2.05a$ ls
    	hello                            run_hello
    	hello.c                          run_hello.e106361
    	hello.out                        run_hello.o106361
    	bash-2.05a$ 
    

    If everything went well, then the file 'run_hello.e106361' will be empty, since it contains any error-messages your program gave while running. The file 'run_hello.o106361' contains the output from your program.

    Getting the environment variables through a job submission file

    If you would like to see the value of the environment variables, then you can make a job submission file like this - called env.job.

    	# Ask for four chunks, 1 processor in each.
    	#PBS -l select=4:ncpus=1,walltime=00:01:30
    	
    	# Change to directory where job was submitted.
    	cd "$PBS_O_WORKDIR"
    	
    	# Load for run-time the same module used for compilation:
    	module load gcc
    	
    	# Show details, especially nodes.
    	# These variables hold plain values, so print them with echo.
    	# The results appear in the output file.
    	echo "$PBS_O_HOST"
    	echo "$PBS_O_QUEUE"
    	echo "$PBS_O_SYSTEM"
    	echo "$PBS_O_WORKDIR"
    	echo "$PBS_ENVIRONMENT"
    	echo "$PBS_JOBID"
    	echo "$PBS_JOBNAME"
    	echo "$PBS_QUEUE"
    	# PBS_NODEFILE names a file that lists the assigned nodes, so use cat.
    	cat "$PBS_NODEFILE"
    

    Then submit it with

    	qsub env.job
    

    Note that ncpus can not be larger than the number of processors on each node on the machine in question.

    PBS Multiple Node Examples

    This section gives various examples of requesting multiple nodes and ways of allocating the processors on these nodes (as many as possible on as few nodes as possible, scattered, random...) In many of the examples I use an interactive session (-I), because this makes it easier to show the different nodes allocated (using the cat $PBS_NODEFILE command). To submit a job submission file instead, leave out -I and add the name of the job submission file at the end.

    4 chunks with 1 processor each (PBS may place several chunks on the same node):

    	$ qsub -q standby -I -l select=4:ncpus=1
    	qsub: waiting for job 106336.steele-adm.rcac.purdue.edu to start
    	qsub: job 106336.steele-adm.rcac.purdue.edu ready
    	
    	$ cat $PBS_NODEFILE
    	steele-a109
    	steele-a136
    	steele-a136
    	steele-a136
    	$ exit
    	logout
    	
    	qsub: job 106336.steele-adm.rcac.purdue.edu completed
    	$ 
    

    Another attempt can give different nodes or the same:

    	$ qsub -q standby -I -l select=4:ncpus=1
    	qsub: waiting for job 453659.steele-adm.rcac.purdue.edu to start
    	qsub: job 453659.steele-adm.rcac.purdue.edu ready
    	
    	$ cat $PBS_NODEFILE
    	steele-a187
    	steele-a188
    	steele-a208
    	steele-a209
    	$ exit
    	logout
    	
    	qsub: job 453659.steele-adm.rcac.purdue.edu completed
    	$ 
    

    2 chunks of 2 processors each, placed anywhere. (Since "free" is the default, the chunks are placed on any one or two available nodes, possibly sharing them with other jobs. They are not guaranteed to be on a single node ("pack"), spread across as many nodes as possible ("scatter"), or on nodes not shared with any other job ("excl")):

    	$ qsub -q standby -I -l select=2:ncpus=2
    	qsub: waiting for job 106355.steele-adm.rcac.purdue.edu to start
    	qsub: job 106355.steele-adm.rcac.purdue.edu ready
    	
    	$ cat $PBS_NODEFILE
    	steele-a117
    	steele-a117
    	$ exit
    	logout
    	
    	qsub: job 106355.steele-adm.rcac.purdue.edu completed
    	$ 
    

    4 chunks with 1 processor each, placed anywhere:

    	$ qsub -q standby -I -l select=4:ncpus=1 -l place=free
    	qsub: waiting for job 106356.steele-adm.rcac.purdue.edu to start
    	qsub: job 106356.steele-adm.rcac.purdue.edu ready
    	
    	$ cat $PBS_NODEFILE
    	steele-a097
    	steele-a117
    	steele-a117
    	steele-a117
    	$ exit
    	logout
    	
    	qsub: job 106356.steele-adm.rcac.purdue.edu completed
    	$ 
    

    8 processors across 4 nodes, packed on as few nodes as possible:

    	$ qsub -q standby -I -l select=4:ncpus=8 -l place=pack
    	qsub: waiting for job 106356.steele-adm.rcac.purdue.edu to start
    	qsub: job 106356.steele-adm.rcac.purdue.edu ready
    	
    	$ cat $PBS_NODEFILE
    	steele-a097
    	steele-a117
    	steele-a117
    	steele-a117
    	$ exit
    	logout
    	
    	qsub: job 106356.steele-adm.rcac.purdue.edu completed
    	$ 
    

    Four processors for a four-rank message-passing program:

    	$ qsub -I -l select=4
    	qsub: waiting for job 439422.steele-adm.rcac.purdue.edu to start
    	qsub: job 439422.steele-adm.rcac.purdue.edu ready
    	
    	$ cat $PBS_NODEFILE
    	steele-a100
    	steele-a171
    	steele-a171
    	steele-a205
    	$ exit
    	logout
    	qsub: job 439422.steele-adm.rcac.purdue.edu completed
    	
    	$
    

    Notice that PBS placed two of the ranks on the same node. When these two ranks exchange messages, the MPI library may be able to use shared memory rather than the network. This placement is the same as with place=free, which is the default.
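
To see at a glance how the ranks are distributed, the node file can be summarized per node. The following is a minimal sketch; the function name ranks_per_node is hypothetical. Inside a job you would call it as ranks_per_node "$PBS_NODEFILE". The demonstration uses a temporary file with the node names from the listing above so that it runs outside of PBS.

```shell
# Hypothetical helper: print "<node> <count>" for each node in the
# node file named in $1.
ranks_per_node() {
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Demonstration with the node list from the example above.
nodefile=$(mktemp)
printf '%s\n' steele-a100 steele-a171 steele-a171 steele-a205 > "$nodefile"
ranks_per_node "$nodefile"
# prints:
#   steele-a100 1
#   steele-a171 2
#   steele-a205 1
rm -f "$nodefile"
```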

    To request that the four ranks be on separate nodes:

    	$ qsub -I -l select=4,place=scatter
    	qsub: waiting for job 439423.steele-adm.rcac.purdue.edu to start
    	qsub: job 439423.steele-adm.rcac.purdue.edu ready
    
    	$ cat $PBS_NODEFILE
    	steele-a100
    	steele-a171
    	steele-a205
    	steele-a206
    	$ exit
    	logout
    	qsub: job 439423.steele-adm.rcac.purdue.edu completed
    
    	$
    

    To request that four ranks be on a single node, we may pack them. We would want this if each rank does not use a lot of memory:

    	$ qsub -I -l select=4,place=pack
    	qsub: waiting for job 439600.steele-adm.rcac.purdue.edu to start
    	qsub: job 439600.steele-adm.rcac.purdue.edu ready
    	
    	$ cat $PBS_NODEFILE
    	steele-a101
    	steele-a101
    	steele-a101
    	steele-a101
    	$ exit
    	logout
    	qsub: job 439600.steele-adm.rcac.purdue.edu completed
    	
    	$
    

    The next examples use the program 'intro'.

    place=scatter: PBS must attempt to place only 1 MPI rank on a node. Here, 8 MPI ranks on 8 nodes. (Note: by default, the node on which your processors are allocated may be shared by other jobs. To request exclusive access to nodes, you must either use "ncpus=" to request all of their processors or use the "place=excl" option):

    Script:

    	#PBS -l select=8,place=scatter,walltime=00:01:30
    	mpiexec -n 8 ./intro
    

    Output:

    	R:0   Number of MPI ranks = 8
    	
    	R: 0     namelen:27  name:steele-a136.rcac.purdue.edu
    	
    	R: 1     namelen:27  name:steele-a204.rcac.purdue.edu
    	
    	R: 2     namelen:27  name:steele-a244.rcac.purdue.edu
    	
    	R: 3     namelen:27  name:steele-a258.rcac.purdue.edu
    	
    	R: 4     namelen:27  name:steele-a259.rcac.purdue.edu
    	
    	R: 5     namelen:27  name:steele-a268.rcac.purdue.edu
    	
    	R: 6     namelen:27  name:steele-a270.rcac.purdue.edu
    	
    	R: 7     namelen:27  name:steele-a323.rcac.purdue.edu
    

    place=free: PBS is free to place MPI ranks anywhere. Here, 8 MPI ranks on 3 nodes; nodes have 2 or 4 ranks.

    Script:

    	#PBS -l select=8,place=free,walltime=00:01:30
    	mpiexec -n 8 ./intro
    

    Output:

    	R:0   Number of MPI ranks = 8
    	
    	R: 0     namelen:27  name:steele-a113.rcac.purdue.edu
    	
    	R: 1     namelen:27  name:steele-a113.rcac.purdue.edu
    	
    	R: 2     namelen:27  name:steele-a115.rcac.purdue.edu
    	
    	R: 3     namelen:27  name:steele-a115.rcac.purdue.edu
    	
    	R: 4     namelen:27  name:steele-a136.rcac.purdue.edu
    	
    	R: 5     namelen:27  name:steele-a136.rcac.purdue.edu
    	
    	R: 6     namelen:27  name:steele-a136.rcac.purdue.edu
    	
    	R: 7     namelen:27  name:steele-a136.rcac.purdue.edu
    

    place=pack: all chunks will be taken from one host. All 8 MPI ranks are packed on a single node.

    Script:

    	#PBS -l select=8,place=pack,walltime=00:01:30
    	mpiexec -n 8 ./intro
    

    Output:

    	R:0   Number of MPI ranks = 8
    	
    	R: 0     namelen:27  name:steele-a474.rcac.purdue.edu
    	
    	R: 1     namelen:27  name:steele-a474.rcac.purdue.edu
    	
    	R: 2     namelen:27  name:steele-a474.rcac.purdue.edu
    	
    	R: 3     namelen:27  name:steele-a474.rcac.purdue.edu
    	
    	R: 4     namelen:27  name:steele-a474.rcac.purdue.edu
    	
    	R: 5     namelen:27  name:steele-a474.rcac.purdue.edu
    	
    	R: 6     namelen:27  name:steele-a474.rcac.purdue.edu
    	
    	R: 7     namelen:27  name:steele-a474.rcac.purdue.edu
    

    Using the default placement (no place= specified). Here, 8 MPI ranks on 4 nodes; nodes have 1, 2, or 3 ranks. The distribution of MPI ranks on nodes looks like place=free.

    Script:

    	#PBS -l select=8,walltime=00:01:30
    	mpiexec -n 8 ./intro
    

    Output:

    	R:0   Number of MPI ranks = 8
    	
    	R: 0     namelen:27  name:steele-a136.rcac.purdue.edu
    	
    	R: 1     namelen:27  name:steele-a136.rcac.purdue.edu
    	
    	R: 2     namelen:27  name:steele-a136.rcac.purdue.edu
    	
    	R: 3     namelen:27  name:steele-a144.rcac.purdue.edu
    	
    	R: 4     namelen:27  name:steele-a144.rcac.purdue.edu
    	
    	R: 5     namelen:27  name:steele-a176.rcac.purdue.edu
    	
    	R: 6     namelen:27  name:steele-a176.rcac.purdue.edu
    	
    	R: 7     namelen:27  name:steele-a181.rcac.purdue.edu
    

    place=pack with more chunks than one node can hold. This job never ran: PBS cannot pack all 16 chunks onto a single 8-core node of Steele.

    Script:

    	#PBS -l select=16,place=pack,walltime=00:01:30
    	mpiexec -n 16 ./intro
    

    By default, the node on which your processors are allocated may be shared by other jobs. To request exclusive access to nodes, you must either use "ncpus=" to request all of their processors or use the "place=excl" option:

    	qsub -I -l select=1 -l place=excl
    

    To explicitly ask to share a node with other jobs:

    	qsub -I -l select=1 -l place=shared
    

    Note that ncpus cannot be larger than the number of processors on each node of the machine in question.

    PBS Specific Types of Nodes Examples

    We can request that a job be run on specific nodes. This can be done with the PBSPro 8.0 and newer "select=" syntax. It is useful for selecting nodes based on various quantities.

    The select statement can also express whether the processors you request must share nodes. For example, on a machine whose nodes have two processors each, a job that needs 4 processors can either take both processors on each of two nodes (the default) or accept 4 processors wherever they are free, possibly on 4 different nodes. If your job does not need its processors to be on the same nodes, say so, since you might otherwise wait a long time for whole nodes to become free.

    Example: asking for only nodes with at least 2 GB of memory (running interactively on 4 nodes):

    	qsub -I -q standby -l select=4:mem=2gb
    

    Example of how this is done:

    	-bash-3.00$ qsub -I -q standby -l select=4:mem=2gb
    	qsub: waiting for job 499333.steele-adm.rcac.purdue.edu to start
    	qsub: job 499333.steele-adm.rcac.purdue.edu ready
    	
    	-bash-3.00$ cat $PBS_NODEFILE
    	steele-a177
    	steele-a181
    	steele-a183
    	steele-a189
    	-bash-3.00$ 
    

    PBS Interactive Job Examples

    Running an interactive PBS job

    Interactive jobs can be started with a time constraint (walltime=hh:mm:ss) or without one.

    Note that running an interactive job without time constraints means that you will keep the nodes allocated for the default time limit for that queue. If this is shorter than the time you need, your job will not finish. If, on the other hand, it is longer than what you need, you are keeping those nodes from other people's usage. Therefore, use this with caution.

    	$ qsub -I -q standby -l select=2:ncpus=2
    	qsub: waiting for job 100.steele-adm.rcac.purdue.edu to start
    	qsub: job 100.steele-adm.rcac.purdue.edu ready
    	
    	$ 
    

    We then need to change to the directory where the program is located and run it as usual. To run an MPI program, remember to type module load mpich2-intel, module load mpich2-gcc, or module load mpich2-pgi first. The program can then be run with mpirun or mpiexec.

    Running an interactive PBS job and opening a display

    	-bash-3.00$ qsub -I -q standby -l select=2:ncpus=2 -v DISPLAY
    	qsub: waiting for job 301.venice-adm.rcac.purdue.edu to start
    	qsub: job 301.venice-adm.rcac.purdue.edu ready
    	
    	-bash-3.00$  
    

    You can then run any installed program that needs to open a display. You may need to run module load <program> before you can start it. Again, always check module avail to see which programs can be accessed this way.

    Note that ncpus cannot be larger than the number of processors on each node of the machine in question.

    Serial PBS Example

    There are two ways to run a serial program under PBS: as a batch job and interactively. For long jobs, batch submission is preferred. Once a program has been compiled, it is run the same way whether it was written in Fortran, C, or C++.

    Batch submission

    Suppose that we want to run the C program 'hello.c' - where the executable is called 'hello'. Make a script and call it something meaningful, like run_hello. The script should then contain the following:

    	#!/bin/bash
    	cd $PBS_O_WORKDIR
    	./hello
    

    Since PBS will always start in your home directory, you should either do a cd $PBS_O_WORKDIR (which returns you to the directory you submitted the script from) or give the full path to the program.

    The command to submit the job is the following:

    	qsub -q standby -l select=1,walltime=1:00 run_hello
    

    Here I am using the queue 'standby' on Steele, 1 node, and a walltime of one minute. The job submission file is called run_hello. If you want to use the default queue, you do not need to ask for it explicitly.

    Submitting this script gives the following result. It will take a while before the job completes:

    	$ qsub -q standby -l select=1,walltime=1:00 run_hello
    	91.steele-adm.rcac.purdue.edu
    	$ 
    

    Doing a 'ls' in your directory will now show two new files:

    	$ ls
    	hello                            run_hello
    	hello.c                          run_hello.e91
    	hello.out                        run_hello.o91
    	$ 
    

    If everything went well, then the file 'run_hello.e91' will be empty, since it contains any error-messages your program gave while running. The file 'run_hello.o91' contains the output from your program. In this case the output is:

    	$ cat run_hello.o91
    	Hello World!
    	$ 
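
    The check described above (an empty error file, then a look at the output file) can be scripted. The following is a minimal sketch rather than part of the original guide: check_job is a hypothetical helper name, and it assumes the <script>.e<jobid> and <script>.o<jobid> file naming shown in the listing.

```shell
# check_job NAME JOBID
# Print NAME.oJOBID if NAME.eJOBID exists and is empty.
# Returns 2 if the files are not there yet, 1 if the error file has content.
check_job() {
    err="$1.e$2"
    out="$1.o$2"
    if [ ! -e "$err" ] || [ ! -e "$out" ]; then
        echo "job $2 has not produced $err and $out yet" >&2
        return 2
    fi
    if [ -s "$err" ]; then
        echo "errors reported in $err:" >&2
        cat "$err" >&2
        return 1
    fi
    cat "$out"
}
```

    For the job above, check_job run_hello 91 would print the contents of run_hello.o91 once the job has finished without errors.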
    

    Interactively

    To use the PBS queue interactively, you must first give a command like the one below. Remember to type 'cd $PBS_O_WORKDIR' (or the path to the working directory), since you will have been returned to your home directory upon start of the interactive job.

    $ qsub -I -q standby -l select=1
    qsub: waiting for job 189639.steele-adm.rcac.purdue.edu to start
    qsub: job 189639.steele-adm.rcac.purdue.edu ready
    
    $ 
    

    Where we are running in the queue 'standby' on Steele, and asking for 1 node.

    We can now run the job the same way a serial job is normally run. Remember, interactive sessions are mostly for testing purposes, and longer jobs should always be submitted using a job submission file.

    	$ ./hello
    	Hello World!
    	$ 
    

    OpenMP PBS Example

    The OpenMP implementations consist of parallelization directives and libraries. Using directives, you can distribute the work of the application over several processors. The OpenMP runtime library chooses a default number of threads to run in parallel on the platform where the program is being run; you can override it with OMP_NUM_THREADS. If you are running the program on a system with only one processor, you will not see any speedup. In fact, the program may run slower due to the overhead in the synchronization code generated by the compiler. For best performance, the number of threads should typically be equal to the number of processors you will be using. Remember to include <omp.h> in C programs that call OpenMP library functions.

    To run an OpenMP program, set the environment variable OMP_NUM_THREADS to the desired number of threads, then enter the executable name and any necessary arguments at the shell prompt as with a serial program. If the OpenMP program is other than a short test program, then it should be run either as a batch job (submitted with a job submission file), or as an interactive PBS job.

    Setting OMP_NUM_THREADS

    In Csh:

    	setenv OMP_NUM_THREADS <number of threads>
    

    In Bash:

    	export OMP_NUM_THREADS=<number of threads>
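
    In a Bash job script, one way to keep the thread count from exceeding the processors you requested is to cap it before exporting. This is a sketch under assumptions: cap_threads is a hypothetical helper, and the numbers 8 and 2 stand in for the thread count you would like and the ncpus value you passed to qsub.

```shell
# cap_threads WANTED NCPUS: print WANTED, but never more than NCPUS.
cap_threads() {
    if [ "$1" -le "$2" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

# Example: ask for 8 threads on a chunk that was requested with ncpus=2.
OMP_NUM_THREADS=$(cap_threads 8 2)
export OMP_NUM_THREADS
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"   # prints OMP_NUM_THREADS=2
```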
    

    To see which compilers are available for OpenMP, see the section about compiling on Radon.

    To learn how to compile an OpenMP program, see the section Compiling an OpenMP program.

    You should also set the environment variable PARALLEL to 1. This variable must be set or else any timers used by the program will return incorrect timings (see the etime man page for more details).

    Note: To access the Intel, PGI, or GCC compilers, you must first load the one you need with the corresponding command:

    	module load intel
    	module load pgi
    	module load gcc
    

    Running your program


    There are two ways of running your OpenMP programs. If it is a longer program you should always use the PBS queue. This can either be done by submitting a job using a job submission file or interactively.

    There is no difference in how you run a Fortran program, a C program or a C++ program, when they have been compiled.

    Submitting the job to PBS using a script


    Let us say that we want to run the C program example 'omp_hello.c'. Make a script and call it run_omp_hello. The script should then contain something like the following:

    	cd $PBS_O_WORKDIR
    	./omp_hello
    

    Where 'PBS_O_WORKDIR' is a PBS-environment variable, which contains the name of the directory you were in when the job was submitted. You should then submit the job from the directory where your program is located or give the absolute path to it in the script.

    If you need to use the 'module load' command (for example, to load compilers), then you would make a job submission file like the following:

    	#!/bin/bash
    	source /etc/profile
    	module load intel
    	cd $PBS_O_WORKDIR
    	./omp_hello
    

    It is necessary to include the 'source /etc/profile' under bash/ksh, if you want to be able to use the 'module load intel' or other 'module load' commands.

    A script to just run the (already compiled) program 'omp_hello.c', would look like this (given that the compiled program was named omp_hello):

    	cd $PBS_O_WORKDIR
    	./omp_hello
    

    The command to submit a job is the following:

    	qsub -q standby -l select=2:ncpus=2,walltime=1:00 run_omp_hello 
    

    Where the options used mean the following:

    • -q <name>: tells which queue you want the job to run in (here I have chosen standby). A list of available queues can be seen using the command qstat -Q.
    • -l select: tells the job how many nodes (chunks) you want to use (2 in the example) and
    • ncpus: the number of processors requested on each of those nodes. Defaults to 1; in this example it is 2, so in total we get 2 nodes, each with 2 processors. Note that the threads of an OpenMP program all run on a single node, so select=1 is enough for a pure OpenMP job.
    • walltime=hh:mm:ss defines how much wall clock time it has (in the example it is set to 1 minute).
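
    The hh:mm:ss form can be produced from a number of seconds with shell arithmetic. A small sketch; walltime_from_seconds is a hypothetical helper name, not a PBS command:

```shell
# walltime_from_seconds N: print N seconds in the hh:mm:ss form used by walltime=.
walltime_from_seconds() {
    s=$1
    printf '%02d:%02d:%02d\n' $((s / 3600)) $((s % 3600 / 60)) $((s % 60))
}

walltime_from_seconds 90     # prints 00:01:30
walltime_from_seconds 480    # prints 00:08:00
```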

    Submitting this script now gives the following result:

    	-bash-3.00$ qsub -q standby -l select=2:ncpus=2,walltime=1:00 run_omp_hello
    	98410.steele-adm.rcac.purdue.edu
    	-bash-3.00$ 
    

    Doing a 'ls' in your directory (after a while) will now show two new files:

    	-bash-3.00$ ls
    	hello                            run_omp_hello
    	hello.c                          run_omp_hello.e98410
    	hello.out                        run_omp_hello.o98410
    	-bash-3.00$ 
    

    If everything went well, then the file 'run_omp_hello.e98410' will be empty, since it contains any error-messages your program gave while running. The file 'run_omp_hello.o98410' contains the output from your program. In this case the output is:

    	-bash-3.00$ less run_omp_hello.o98410 
    	Thread 1 says: Hello World
    	Thread 0 says: Hello World
    	Thread 0 reports: the number of threads are 2
    	-bash-3.00$ 
    

    To request a resource to run an eight-thread shared-memory job:

    	$ qsub -I -l select=1:ncpus=8
    	qsub: waiting for job 439569.steele-adm.rcac.purdue.edu to start
    	qsub: job 439569.steele-adm.rcac.purdue.edu ready
    	
    	$ cat $PBS_NODEFILE
    	steele-a548
    	$ exit
    	logout
    	
    	qsub: job 439569.steele-adm.rcac.purdue.edu completed
    	$
    

    Submitting interactively

    Use qstat -Q to see which queues are available on a given machine.

    Example:

    	qsub -I -q standby -l select=4,walltime=8:00
    

    Where the options used mean the following:

    • -I means that we want to run the job interactively.
    • -q <name>: tells which queue you want the job to run in (here I have chosen standby - it is the default, so it is not really necessary). A list of available queues can be seen using the command qstat -Q.
    • -l select: tells the job how many nodes you want to use (4 in the example) and
    • walltime=hh:mm:ss defines how much wall clock time it has (in the example it is set to 8 minutes).

    You can also just start up an interactive job without time constraints:

    	qsub -I -q standby -l select=4:ncpus=4
    

    Here we ask for 4 nodes with 4 processors on each node; remember to check that the nodes of the given machine have as many processors as you ask for. To end the job, type exit.

    Note that running an interactive job without time constraints means that you will keep the nodes allocated for the default time limit for that queue. If this is shorter than the time you need, your job will not finish. If, on the other hand, it is longer than what you need, you are keeping those nodes from other people's usage. Therefore, use this with caution.

    Running the above PBS command gives:

    	-bash-3.00$ qsub -I -q standby -l select=4:ncpus=4
    	qsub: waiting for job 98362.steele-adm.rcac.purdue.edu to start
    	qsub: job 98362.steele-adm.rcac.purdue.edu ready
    	
    	-bash-3.00$ 
    

    To return to the directory where the program is located:

    	cd $PBS_O_WORKDIR
    

    To run a program we just use ./program.

    Example of running the compiled version of omp_hello.c:

    	-bash-3.00$ qsub -I -q standby -l select=1:ncpus=4
    	qsub: waiting for job 98377.steele-adm.rcac.purdue.edu to start
    	qsub: job 98377.steele-adm.rcac.purdue.edu ready
    	
    	-bash-3.00$ cd openMP
    	-bash-3.00$ ./omp_hello
    	Thread 0 says: Hello World
    	Thread 0 reports: the number of threads are 4
    	Thread 2 says: Hello World
    	Thread 1 says: Hello World
    	Thread 3 says: Hello World
    	-bash-3.00$ 
    

    Example of running the compiled version of omp_hello_77.f:

    	-bash-3.00$ ./omp_hello_77
    	 Thread           0 says: Hello World
    	 Thread           2 says: Hello World
    	 Thread           1 says: Hello World
    	 Thread           3 says: Hello World
    	 Thread           0 reports: the number of threads are           4
    	-bash-3.00$ 
    

    Note that the order in which the threads print is not deterministic. It can differ from run to run, and a parallel program should not rely on it.

    To see which nodes you are using:

    	cat $PBS_NODEFILE
    

    To stop running interactively, just type exit.

    ./program

    If you have a very short test program and are just using ./program, you need to tell the machine how many threads you want. Do this by setting the environment variable OMP_NUM_THREADS, as described at the beginning of this section. You can see the current value with echo $OMP_NUM_THREADS.

    Then you just run the program.

    Here is an example of running the program omp_hello.c, for 2 threads (on Radon), then 1 thread:

    	$ ./omp_hello
    	Thread 0 says: Hello World
    	Thread 0 reports: the number of threads are 2
    	Thread 1 says: Hello World
    	$ export OMP_NUM_THREADS=1
    	$ ./omp_hello
    	Thread 0 says: Hello World
    	Thread 0 reports: the number of threads are 1
    	$ 
    

    MPI PBS Example

    The path to MPICH2 or MPICH is probably not set up on your account. Try typing mpicc at the prompt. If you get 'mpicc: command not found', then you need to either use the 'module load' command or add the path to the compilers you use to your setup file (.cshrc, .tcshrc, .bashrc, .login, or .profile). The compilers and MPICH2/MPICH can be found in /opt. They have to be loaded before you can run an MPI program.

    Using 'module load': The easiest (and preferred) way to access the compilers with MPICH2 included is module load mpich2-<compiler>, where <compiler> is one of intel, gcc, or pgi. The ordinary compilers are loaded with module load <compiler>. Use module avail to see all the possibilities.

    Example (loading Intel compilers with MPICH2 included):

    	$ module load mpich2-intel
    	$ 
    

    To learn how to compile an MPI program, see the section Compiling an MPI program.

    Running your program

    To run your programs, you will need to use the PBS queue. This can either be done interactively or by submitting a job using a job submission file.

    There is no difference in how you run a Fortran program, a C program or a C++ program, when they have been compiled.

    Submitting the job to PBS using a script

    Let us say that we want to run the C program example 'hello.c'. Make a script and call it something meaningful, like run_hello. The script should then contain something like the following:

    	#!/bin/bash
    	source /etc/profile
    	module load mpich2-intel
    	mpirun -np 4 -machinefile $PBS_NODEFILE mpi/hello
    

    It is necessary to include 'source /etc/profile' under bash/ksh to be able to use 'module load mpich2-<compiler>', where <compiler> is intel, gcc, or pgi. Last, the program is run with either mpiexec or mpirun. Since PBS always starts in your home directory, you should either give the path to the program (here mpi/hello, relative to the home directory) or add cd $PBS_O_WORKDIR before running the program, which puts you in the directory you were in when you issued the qsub command.
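
    Putting these pieces together, a job submission file might look like the one written by the sketch below. Assumptions: write_job_script is a hypothetical helper; the queue, chunk count, and program name are taken from the surrounding example; and the #PBS lines carry the same options that are otherwise passed to qsub on the command line, in the directive style used in the placement examples of this guide.

```shell
# write_job_script FILE: write a sketch MPI job submission file to FILE.
# The quoted 'EOF' keeps $PBS_O_WORKDIR unexpanded until the job runs.
write_job_script() {
    cat > "$1" <<'EOF'
#!/bin/bash
#PBS -q workq
#PBS -l select=4,walltime=00:01:00
source /etc/profile
module load mpich2-intel
cd "$PBS_O_WORKDIR"
mpiexec -n 4 ./hello
EOF
}
```

    With the options stored in the file, running write_job_script run_hello and then qsub run_hello from the directory that holds the program should be enough to submit the job.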

    To submit a job:

    	qsub -q workq -l select=4,walltime=1:00 run_hello
    

    Where the options used mean the following:

    • -q <name>: tells which queue you want the job to run in (here I have chosen workq). A list of available queues can be seen using the command qstat -Q.
    • -l select: tells the job how many nodes you want to use (4 in the example) and
    • walltime=hh:mm:ss defines how much wall clock time it has (in the example it is set to 1 minute).

    Submitting this script now gives the following result (it will take a while before the job is completed):

    	user123@radon-fe00:~/mpi$ qsub -q workq -l select=4,walltime=1:00 run_hello 
    	119452.radon-adm.rcac.purdue.edu
    	user123@radon-fe00:~/mpi$ 
    

    Doing a 'ls' in your directory will now show two new files:

    	user123@radon-fe00:~/mpi$ ls
    	hello                            run_hello
    	hello.c                          run_hello.e119452
    	hello.out                        run_hello.o119452
    	user123@radon-fe00:~/mpi$ 
    

    If everything went well, then the file 'run_hello.e119452' will be empty, since it contains any error-messages your program gave while running. The file 'run_hello.o119452' contains the output from your program. In this case the output is:

    	user123@radon-fe00:~/mpi$ less run_hello.o119452 
    	Processor 2 of 4: Hello World!
    	Processor 3 of 4: Hello World!
    	Processor 1 of 4: Hello World!
    	Processor 0 of 4: Hello World!
    	user123@radon-fe00:~/mpi$ 
    

    Mpiexec is a replacement program for the script mpirun, which is part of the MPICH2 (and MPICH) package. It is used to initialize a parallel job from within a PBS batch or interactive environment. Mpiexec uses the task manager library of PBS to spawn copies of the executable on the nodes in a PBS allocation. There are several reasons to use mpiexec rather than a script (mpirun) or an external daemon (mpd); they are listed under Notes below.

    Running interactively

    Example (run on the queue 'workq' on Radon):

    	user123@radon-fe00:~/mpi$ qsub -I -q workq -l select=2:ncpus=2,walltime=8:00
    	qsub: waiting for job 119450.radon-adm.rcac.purdue.edu to start
    	qsub: job 119450.radon-adm.rcac.purdue.edu ready
    	
    	user123@radon-b002:~$ 
    

    Where the options used mean the following:

    • -I means that we want to run the job interactively.
    • -q <name>: tells which queue you want the job to run in (here I run on workq). A list of available queues can be seen using the command qstat -Q.
    • -l select: tells the job how many nodes you want to use (2 in the example) and
    • ncpus is how many processors you want to use on each node. Here we want 2 - remember to not ask for more than is available on the given machine, and
    • walltime=hh:mm:ss defines how much wall clock time it has (in the example it is set to 8 minutes).

    You can also just start up an interactive job without time constraints:

    	qsub -I -q workq -l select=4
    

    Here we ask for 4 nodes and 1 processor on each node. To end the job, type exit.

    Note that running an interactive job without time constraints means that you will keep the nodes allocated for the default time limit for that queue. If this is shorter than the time you need, your job will not finish. If, on the other hand, it is longer than what you need, you are keeping those nodes from other people's usage. Therefore, use this with caution.

    Running the above PBS command gives:

    	user123@radon-fe00:~/mpi$ qsub -I -q workq -l select=4                    
    	qsub: waiting for job 119451.radon-adm.rcac.purdue.edu to start
    	qsub: job 119451.radon-adm.rcac.purdue.edu ready
    	
    	user123@radon-b002:~$ 
    

    We then need to change to the directory where our program is located. To run a program we use mpirun or mpiexec. You cannot just start the program with ./program, since it will then run as a single process.

    mpirun: To run a program with mpirun, issue the following command. If mpirun is not already on your path, first do 'module load mpich2-<compiler>', source the file that sets the path, or give the full path to mpirun. On Radon, mpirun needs '-machinefile $PBS_NODEFILE'; mpiexec does not:

    	mpirun -np <number of tasks> -machinefile $PBS_NODEFILE program
    

    Running this on our program 'hello', for 4 tasks results in:

    	user123@radon-b002:~/mpi$ mpirun -np 4 -machinefile $PBS_NODEFILE hello
    	Processor 2 of 4: Hello World!
    	Processor 1 of 4: Hello World!
    	Processor 3 of 4: Hello World!
    	Processor 0 of 4: Hello World!
    	user123@radon-b002:~/mpi$
    

    Note that the order in which the processes print is not deterministic. It can differ from run to run, and a parallel program should not rely on it.

    mpiexec: To run with mpiexec, use the following command:

    	mpiexec -n <number of tasks> program
    

    It is not necessary to give the -n <number of tasks>, unless you wish to use a different amount than what you asked for when the job was originally started.

    	user123@radon-b002:~/mpi$ mpiexec hello
    	Processor 2 of 4: Hello World!
    	Processor 0 of 4: Hello World!
    	Processor 3 of 4: Hello World!
    	Processor 1 of 4: Hello World!
    	user123@radon-b002:~/mpi$ mpiexec -n 2 hello
    	Processor 1 of 2: Hello World!
    	Processor 0 of 2: Hello World!
    	user123@radon-b002:~/mpi$ 
    

    To see which nodes you are using:

    	cat $PBS_NODEFILE
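
    For larger jobs it can be easier to read a per-node summary than the raw list. A minimal sketch, assuming the node file holds one hostname per line and repeats a hostname once for each MPI rank placed on that host; ranks_per_node is a hypothetical helper name:

```shell
# ranks_per_node FILE: print "<count> <hostname>" for each distinct host in FILE.
ranks_per_node() {
    sort "$1" | uniq -c | awk '{ print $1, $2 }'
}

# PBS sets PBS_NODEFILE inside a job; outside a job we only print a hint.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    ranks_per_node "$PBS_NODEFILE"
else
    echo "PBS_NODEFILE is not set; run this inside a PBS job" >&2
fi
```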
    

    Notes

    • Starting tasks with the TM interface is much faster than invoking a separate rsh or ssh once for each process.
    • Resources used by the spawned processes are accounted correctly with mpiexec, and reported in the PBS logs, because all the processes of a parallel job remain under the control of PBS, unlike when using startup scripts such as mpirun.
    • Tasks that exceed their assigned limits of CPU time, wallclock time, memory usage, or disk space are killed cleanly by PBS. It is quite hard for processes to escape control of the resource manager when using mpiexec.
    • You can use mpiexec to enforce a security policy. If all jobs are required to startup using mpiexec and the PBS execution environment, it is not necessary to enable rsh or ssh access to the compute nodes in the cluster.
    • PBS Queues. Always use qstat -Q to determine which queues are available. The queues that are usually available to everyone are called standby and workq.

    • When using MPI_File_write, the output to the file will be written in binary. To view it, use: od -d 'output.file'.

    • MPI_File_write will not automatically delete the old contents of an output file, so you may have to remove the file before writing new data to it, unless you want to keep the old data.

    • You can see which nodes you are using on one of the cluster machines with the command: cat $PBS_NODEFILE
    • Running a program on the Linux clusters with ./program is a bad idea, since it will run as a single process. So unless that is what you want (which is rarely the case for an MPI program), you should use mpirun or mpiexec.
    • The order in which processes run and print is not deterministic. There is no way to tell which process will do what, and in which order, in a parallel program.
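
    A remark on the od -d command mentioned in the notes above: od -d prints unsigned 2-byte units, so each 4-byte MPI_INT shows up as two numbers. As a sketch, assuming the file holds 4-byte integers written in the byte order of the machine that reads it (dump_ints is a hypothetical helper name), od can be asked for 4-byte signed decimals instead:

```shell
# dump_ints FILE: print the 4-byte signed integers stored in FILE, one per line.
# -An drops the offset column, -v disables collapsing of repeated lines.
dump_ints() {
    od -An -v -t d4 "$1" | awk '{ for (i = 1; i <= NF; i++) print $i }'
}
```

    For example, dump_ints output.file lists every integer in the file in host byte order.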

    Common mistakes

    • For C programs, remember to write MPI_Send and MPI_Recv and NOT MPI_SEND and MPI_RECV. The compiler will not complain, but during the execution of the program you will get a very confusing error:

               Invalid datatype (0) in MPI_Send, task 0...

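    • In C programs, MPI_Send, MPI_Scatter, and similar calls take the address of the send/receive buffer. For a single value you must put & in front of the variable name. For a statically declared array, the array name and & applied to it yield the same address, so either form works. Example: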

      int array[5000];
      int subarray[5000/4];
      .
      .
      .
      MPI_Scatter(&array,sendcount,MPI_INT,&subarray,recvcount,MPI_INT,\
      0,MPI_COMM_WORLD);
      

    • In C programs, when using dynamically allocated arrays, they are really pointers. Therefore, when using the MPI_Send/MPI_Scatter, you must NOT put a & in front of the send/receive buffer name. Example:

      int *array;
      int *subarray;
      .
      .
      .
      array = (int *)malloc(array_size*sizeof(int));
      subarray = (int *)malloc(subarray_size*sizeof(int));
      .
      .
      .
      MPI_Scatter(array,sendcount,MPI_INT,subarray,recvcount,MPI_INT,0,\
      MPI_COMM_WORLD);
      

    • Remember that each MPI process has its own copy of every variable, including globally declared ones. A value set on one processor is visible to the others only if it is sent, broadcast, or scattered to them, or if every processor sets it itself.

    • Also, remember that memory must be allocated on every processor that uses it.

    C/MPI programming examples

    Most of the programs below are my answers to the exercises in the online "Introduction to MPI" course at NCSA.

    • hello.c: Each processor prints "Hello World" to the screen.

    • ping.c: Sending a message between processors 0 and 1. The program reports the values of inmsg and outmsg before sending and then after. Try running this program with 2, 3, and 4 processors and see that their order is random and that only processors 0 and 1 exchange a value.

    • ArrayFindSerial.c: This is a serial program to find a target value in an array and report back the indices where that target value was found. The target value and the array are read from a file (data.dat below) and printed to an output file 'out.dat'. This serial program should just be run as ./program. It is included for comparison with the following parallel versions of the same program.

    • ArrayFindParallelChapter4.c: This is the first parallel version of the above serial program. I use the blocking MPI_Send and MPI_Recv. The master processor reads the array in from the file. Then it divides it into three equal parts and sends one to each slave processor. These (3) slave processors search through their subarrays and find the indices where the target value is found. These indices are then converted to global indices and sent to the master processor. The master processor then assembles these and writes them to a file. The program is meant to run on 4 task processes. I use a static size array, which is initialized in the beginning of the program.

    • ArrayFindParallelChapter5.c: This version still uses the blocking MPI_Send and MPI_Recv, but since I am now sending back both the index where the target value is found and the average of the index and the target value, I need to use a derived data type. This is first created and then used. The program runs on 4 task processes.

    • ArrayFindParallelChapter6.c: As in the last version, I create a derived data type, but I now use MPI_Bcast, MPI_Scatter, MPI_Barrier and MPI_Irecv. The program runs on 4 task processes.

    • ArrayFindParallelChapter7.c: I still use the derived data type and MPI_Bcast and MPI_Scatter, but I now also find the "neighbour processors" and have them send the first value of their subarray to each neighbour processor. The program uses 4 task processes.

    • ArrayFindParallelChapter8.c: Still using MPI_Bcast and MPI_Scatter. To accomplish finding the neighbour processor I this time create a topological ring, which is a much easier way of finding them.

    • ArrayFindParallelChapter9.c: Back to using the blocking MPI_Send and MPI_Recv again. This makes it easier to accomplish the goal of this program, which is to allow each of the 3 slave processors to write to the output file. The master processor (0) still reads from the input file.

    • ArrayFindParallelChapter12.c: I again use MPI_Bcast and MPI_Scatter. The arrays are no longer static and I allocate them dynamically when I know the length of the arrays. This value is read from the input file (b.dat).

    • data.dat: A file with the target value and array used by the 'ArrayFind' programs. The first value is the target value. All values are integers and separated by space.

    • b.dat: The input file used by the program 'ArrayFindParallelChapter12.c'. The value on the first line is the number of elements in the array, the next line holds the target value to be found, and the following lines are the elements in the array. The values are all integers and they are separated by space.

    Diagnostic Error messages from MPI

    Click here and go to chapter 5 (p. 121) to see what the diagnostic error messages from MPI mean.

    Extra examples of MPI programs

    To see a few other examples of running MPI programs go here.

    PBS Hybrid Code Examples

    To request a resource to run a hybrid job which uses two ranks for message-passing and eight threads of shared memory processing within each rank:

    	$ qsub -I -l select=2:mpiprocs=1:ncpus=8
    	qsub: waiting for job 439516.steele-adm.rcac.purdue.edu to start
    	qsub: job 439516.steele-adm.rcac.purdue.edu ready
    		
    	$ cat $PBS_NODEFILE
    	steele-a187
    	steele-a188
    	$ exit
    	logout
    		
    	qsub: job 439516.steele-adm.rcac.purdue.edu completed	
    	$
    

    If a hybrid program uses a lot of memory and we cannot have eight threads on a node, we can request more nodes and fewer threads on each node. The argument scatter will attempt to place only one rank on a node:

    	$ qsub -I -l select=4:mpiprocs=1:ncpus=4,place=scatter
    	qsub: waiting for job 439593.steele-adm.rcac.purdue.edu to start
    	qsub: job 439593.steele-adm.rcac.purdue.edu ready
    	
    	$ cat $PBS_NODEFILE
    	steele-a121
    	steele-a122
    	steele-a125
    	steele-a132
    	$ exit
    	logout
    	
    	qsub: job 439593.steele-adm.rcac.purdue.edu completed
    	$
    

    The following two examples are using this hybrid code.

    Example: request 2 MPI ranks, each with 8 OpenMP threads, by asking for 2 chunks that each hold 1 MPI rank and 8 processors.

    Job submission file:

    	#PBS -l select=2:mpiprocs=1:ncpus=8,walltime=0:30
    	mpiexec -n 2 ./h
    

    Output:

    steele-a484
    steele-a485
    
    name:steele-a484.rcac.purdue.edu   M_ID:0  M_N:2
    name:steele-a485.rcac.purdue.edu   M_ID:1  M_N:2
    name:steele-a484.rcac.purdue.edu   M_ID:0  O_ID:0  O_P:8  O_T:1
    name:steele-a485.rcac.purdue.edu   M_ID:1  O_ID:0  O_P:8  O_T:1
    
    parallel loop:   name:steele-a484.rcac.purdue.edu M_ID:0 O_ID:0 i: 0
    parallel loop:   name:steele-a484.rcac.purdue.edu M_ID:0 O_ID:0 i: 1
    parallel loop:   name:steele-a484.rcac.purdue.edu M_ID:0 O_ID:1 i: 2
    parallel loop:   name:steele-a484.rcac.purdue.edu M_ID:0 O_ID:1 i: 3
    parallel loop:   name:steele-a484.rcac.purdue.edu M_ID:0 O_ID:2 i: 4
    parallel loop:   name:steele-a484.rcac.purdue.edu M_ID:0 O_ID:2 i: 5
    parallel loop:   name:steele-a484.rcac.purdue.edu M_ID:0 O_ID:3 i: 6
    parallel loop:   name:steele-a484.rcac.purdue.edu M_ID:0 O_ID:3 i: 7
    parallel loop:   name:steele-a484.rcac.purdue.edu M_ID:0 O_ID:4 i: 8
    parallel loop:   name:steele-a484.rcac.purdue.edu M_ID:0 O_ID:4 i: 9
    parallel loop:   name:steele-a484.rcac.purdue.edu M_ID:0 O_ID:5 i:10
    parallel loop:   name:steele-a484.rcac.purdue.edu M_ID:0 O_ID:6 i:12
    parallel loop:   name:steele-a484.rcac.purdue.edu M_ID:0 O_ID:6 i:13
    parallel loop:   name:steele-a484.rcac.purdue.edu M_ID:0 O_ID:5 i:11
    parallel loop:   name:steele-a484.rcac.purdue.edu M_ID:0 O_ID:7 i:14
    parallel loop:   name:steele-a484.rcac.purdue.edu M_ID:0 O_ID:7 i:15
    
    parallel loop:   name:steele-a485.rcac.purdue.edu M_ID:1 O_ID:0 i: 0
    parallel loop:   name:steele-a485.rcac.purdue.edu M_ID:1 O_ID:0 i: 1
    parallel loop:   name:steele-a485.rcac.purdue.edu M_ID:1 O_ID:1 i: 2
    parallel loop:   name:steele-a485.rcac.purdue.edu M_ID:1 O_ID:1 i: 3
    parallel loop:   name:steele-a485.rcac.purdue.edu M_ID:1 O_ID:2 i: 4
    parallel loop:   name:steele-a485.rcac.purdue.edu M_ID:1 O_ID:2 i: 5
    parallel loop:   name:steele-a485.rcac.purdue.edu M_ID:1 O_ID:3 i: 6
    parallel loop:   name:steele-a485.rcac.purdue.edu M_ID:1 O_ID:3 i: 7
    parallel loop:   name:steele-a485.rcac.purdue.edu M_ID:1 O_ID:4 i: 8
    parallel loop:   name:steele-a485.rcac.purdue.edu M_ID:1 O_ID:4 i: 9
    parallel loop:   name:steele-a485.rcac.purdue.edu M_ID:1 O_ID:5 i:10
    parallel loop:   name:steele-a485.rcac.purdue.edu M_ID:1 O_ID:5 i:11
    parallel loop:   name:steele-a485.rcac.purdue.edu M_ID:1 O_ID:6 i:12
    parallel loop:   name:steele-a485.rcac.purdue.edu M_ID:1 O_ID:6 i:13
    parallel loop:   name:steele-a485.rcac.purdue.edu M_ID:1 O_ID:7 i:14
    parallel loop:   name:steele-a485.rcac.purdue.edu M_ID:1 O_ID:7 i:15
    
    second serial region   name:steele-a484.rcac.purdue.edu M_ID:0 O_ID:0 i=999
    second serial region   name:steele-a485.rcac.purdue.edu M_ID:1 O_ID:0 i=999
     
    parallel region:       name:steele-a484.rcac.purdue.edu M_ID=0 O_ID=0
    parallel region:       name:steele-a484.rcac.purdue.edu M_ID=0 O_ID=1
    parallel region:       name:steele-a484.rcac.purdue.edu M_ID=0 O_ID=2
    parallel region:       name:steele-a484.rcac.purdue.edu M_ID=0 O_ID=3
    parallel region:       name:steele-a484.rcac.purdue.edu M_ID=0 O_ID=4
    parallel region:       name:steele-a484.rcac.purdue.edu M_ID=0 O_ID=6
    parallel region:       name:steele-a484.rcac.purdue.edu M_ID=0 O_ID=5
    parallel region:       name:steele-a484.rcac.purdue.edu M_ID=0 O_ID=7
    
    parallel region:       name:steele-a485.rcac.purdue.edu M_ID=1 O_ID=0
    parallel region:       name:steele-a485.rcac.purdue.edu M_ID=1 O_ID=1
    parallel region:       name:steele-a485.rcac.purdue.edu M_ID=1 O_ID=2
    parallel region:       name:steele-a485.rcac.purdue.edu M_ID=1 O_ID=3
    parallel region:       name:steele-a485.rcac.purdue.edu M_ID=1 O_ID=4
    parallel region:       name:steele-a485.rcac.purdue.edu M_ID=1 O_ID=5
    parallel region:       name:steele-a485.rcac.purdue.edu M_ID=1 O_ID=6
    parallel region:       name:steele-a485.rcac.purdue.edu M_ID=1 O_ID=7
    
    third serial region    name:steele-a484.rcac.purdue.edu M_ID:0 O_ID:0 i=999
    third serial region    name:steele-a485.rcac.purdue.edu M_ID:1 O_ID:0 i=999
    
    name:steele-a484.rcac.purdue.edu M_ID:0 O_ID:0   Exits
    name:steele-a485.rcac.purdue.edu M_ID:1 O_ID:0   Exits
    

    Example: request 4 chunks, each with 1 MPI rank and 4 OpenMP threads.

    Job submission file:

    	#PBS -l select=4:mpiprocs=1:ncpus=4,walltime=0:30
    	mpiexec -n 4 ./h
    

    Output:

    steele-a355
    steele-a002
    steele-a027
    steele-a028
    
    name:steele-a355.rcac.purdue.edu   M_ID:0  M_N:4
    name:steele-a355.rcac.purdue.edu   M_ID:0  O_ID:0  O_P:8  O_T:1
    name:steele-a002.rcac.purdue.edu   M_ID:1  M_N:4
    name:steele-a002.rcac.purdue.edu   M_ID:1  O_ID:0  O_P:8  O_T:1
    name:steele-a027.rcac.purdue.edu   M_ID:2  M_N:4
    name:steele-a027.rcac.purdue.edu   M_ID:2  O_ID:0  O_P:8  O_T:1
    name:steele-a028.rcac.purdue.edu   M_ID:3  M_N:4
    name:steele-a028.rcac.purdue.edu   M_ID:3  O_ID:0  O_P:8  O_T:1
    
    parallel loop:   name:steele-a355.rcac.purdue.edu M_ID:0 O_ID:0 i: 0
    parallel loop:   name:steele-a355.rcac.purdue.edu M_ID:0 O_ID:0 i: 1
    parallel loop:   name:steele-a355.rcac.purdue.edu M_ID:0 O_ID:0 i: 2
    parallel loop:   name:steele-a355.rcac.purdue.edu M_ID:0 O_ID:0 i: 3
    parallel loop:   name:steele-a355.rcac.purdue.edu M_ID:0 O_ID:1 i: 4
    parallel loop:   name:steele-a355.rcac.purdue.edu M_ID:0 O_ID:1 i: 5
    parallel loop:   name:steele-a355.rcac.purdue.edu M_ID:0 O_ID:1 i: 6
    parallel loop:   name:steele-a355.rcac.purdue.edu M_ID:0 O_ID:1 i: 7
    parallel loop:   name:steele-a355.rcac.purdue.edu M_ID:0 O_ID:2 i: 8
    parallel loop:   name:steele-a355.rcac.purdue.edu M_ID:0 O_ID:2 i: 9
    parallel loop:   name:steele-a355.rcac.purdue.edu M_ID:0 O_ID:2 i:10
    parallel loop:   name:steele-a355.rcac.purdue.edu M_ID:0 O_ID:2 i:11
    parallel loop:   name:steele-a355.rcac.purdue.edu M_ID:0 O_ID:3 i:12
    parallel loop:   name:steele-a355.rcac.purdue.edu M_ID:0 O_ID:3 i:13
    parallel loop:   name:steele-a355.rcac.purdue.edu M_ID:0 O_ID:3 i:14
    parallel loop:   name:steele-a355.rcac.purdue.edu M_ID:0 O_ID:3 i:15
    
    parallel loop:   name:steele-a002.rcac.purdue.edu M_ID:1 O_ID:0 i: 0
    parallel loop:   name:steele-a002.rcac.purdue.edu M_ID:1 O_ID:0 i: 1
    parallel loop:   name:steele-a002.rcac.purdue.edu M_ID:1 O_ID:0 i: 2
    parallel loop:   name:steele-a002.rcac.purdue.edu M_ID:1 O_ID:0 i: 3
    parallel loop:   name:steele-a002.rcac.purdue.edu M_ID:1 O_ID:1 i: 4
    parallel loop:   name:steele-a002.rcac.purdue.edu M_ID:1 O_ID:1 i: 5
    parallel loop:   name:steele-a002.rcac.purdue.edu M_ID:1 O_ID:1 i: 6
    parallel loop:   name:steele-a002.rcac.purdue.edu M_ID:1 O_ID:1 i: 7
    parallel loop:   name:steele-a002.rcac.purdue.edu M_ID:1 O_ID:2 i: 8
    parallel loop:   name:steele-a002.rcac.purdue.edu M_ID:1 O_ID:2 i: 9
    parallel loop:   name:steele-a002.rcac.purdue.edu M_ID:1 O_ID:2 i:10
    parallel loop:   name:steele-a002.rcac.purdue.edu M_ID:1 O_ID:2 i:11
    parallel loop:   name:steele-a002.rcac.purdue.edu M_ID:1 O_ID:3 i:12
    parallel loop:   name:steele-a002.rcac.purdue.edu M_ID:1 O_ID:3 i:13
    parallel loop:   name:steele-a002.rcac.purdue.edu M_ID:1 O_ID:3 i:14
    parallel loop:   name:steele-a002.rcac.purdue.edu M_ID:1 O_ID:3 i:15
    
    parallel loop:   name:steele-a028.rcac.purdue.edu M_ID:3 O_ID:0 i: 0
    parallel loop:   name:steele-a028.rcac.purdue.edu M_ID:3 O_ID:0 i: 1
    parallel loop:   name:steele-a028.rcac.purdue.edu M_ID:3 O_ID:0 i: 2
    parallel loop:   name:steele-a028.rcac.purdue.edu M_ID:3 O_ID:0 i: 3
    parallel loop:   name:steele-a028.rcac.purdue.edu M_ID:3 O_ID:1 i: 4
    parallel loop:   name:steele-a028.rcac.purdue.edu M_ID:3 O_ID:1 i: 5
    parallel loop:   name:steele-a028.rcac.purdue.edu M_ID:3 O_ID:1 i: 6
    parallel loop:   name:steele-a028.rcac.purdue.edu M_ID:3 O_ID:1 i: 7
    parallel loop:   name:steele-a028.rcac.purdue.edu M_ID:3 O_ID:2 i: 8
    parallel loop:   name:steele-a028.rcac.purdue.edu M_ID:3 O_ID:2 i: 9
    parallel loop:   name:steele-a028.rcac.purdue.edu M_ID:3 O_ID:2 i:10
    parallel loop:   name:steele-a028.rcac.purdue.edu M_ID:3 O_ID:2 i:11
    parallel loop:   name:steele-a028.rcac.purdue.edu M_ID:3 O_ID:3 i:12
    parallel loop:   name:steele-a028.rcac.purdue.edu M_ID:3 O_ID:3 i:13
    parallel loop:   name:steele-a028.rcac.purdue.edu M_ID:3 O_ID:3 i:14
    parallel loop:   name:steele-a028.rcac.purdue.edu M_ID:3 O_ID:3 i:15
    
    parallel loop:   name:steele-a027.rcac.purdue.edu M_ID:2 O_ID:0 i: 0
    parallel loop:   name:steele-a027.rcac.purdue.edu M_ID:2 O_ID:0 i: 1
    parallel loop:   name:steele-a027.rcac.purdue.edu M_ID:2 O_ID:0 i: 2
    parallel loop:   name:steele-a027.rcac.purdue.edu M_ID:2 O_ID:0 i: 3
    parallel loop:   name:steele-a027.rcac.purdue.edu M_ID:2 O_ID:1 i: 4
    parallel loop:   name:steele-a027.rcac.purdue.edu M_ID:2 O_ID:1 i: 5
    parallel loop:   name:steele-a027.rcac.purdue.edu M_ID:2 O_ID:1 i: 6
    parallel loop:   name:steele-a027.rcac.purdue.edu M_ID:2 O_ID:1 i: 7
    parallel loop:   name:steele-a027.rcac.purdue.edu M_ID:2 O_ID:2 i: 8
    parallel loop:   name:steele-a027.rcac.purdue.edu M_ID:2 O_ID:2 i: 9
    parallel loop:   name:steele-a027.rcac.purdue.edu M_ID:2 O_ID:2 i:10
    parallel loop:   name:steele-a027.rcac.purdue.edu M_ID:2 O_ID:2 i:11
    parallel loop:   name:steele-a027.rcac.purdue.edu M_ID:2 O_ID:3 i:12
    parallel loop:   name:steele-a027.rcac.purdue.edu M_ID:2 O_ID:3 i:13
    parallel loop:   name:steele-a027.rcac.purdue.edu M_ID:2 O_ID:3 i:14
    parallel loop:   name:steele-a027.rcac.purdue.edu M_ID:2 O_ID:3 i:15
    
    second serial region   name:steele-a355.rcac.purdue.edu M_ID:0 O_ID:0 i=999
    second serial region   name:steele-a002.rcac.purdue.edu M_ID:1 O_ID:0 i=999
    second serial region   name:steele-a027.rcac.purdue.edu M_ID:2 O_ID:0 i=999
    second serial region   name:steele-a028.rcac.purdue.edu M_ID:3 O_ID:0 i=999
     
    parallel region:       name:steele-a355.rcac.purdue.edu M_ID=0 O_ID=0
    parallel region:       name:steele-a355.rcac.purdue.edu M_ID=0 O_ID=1
    parallel region:       name:steele-a355.rcac.purdue.edu M_ID=0 O_ID=2
    parallel region:       name:steele-a355.rcac.purdue.edu M_ID=0 O_ID=3
    parallel region:       name:steele-a002.rcac.purdue.edu M_ID=1 O_ID=0
    parallel region:       name:steele-a002.rcac.purdue.edu M_ID=1 O_ID=1
    parallel region:       name:steele-a002.rcac.purdue.edu M_ID=1 O_ID=2
    parallel region:       name:steele-a002.rcac.purdue.edu M_ID=1 O_ID=3
    parallel region:       name:steele-a027.rcac.purdue.edu M_ID=2 O_ID=0
    parallel region:       name:steele-a027.rcac.purdue.edu M_ID=2 O_ID=1
    parallel region:       name:steele-a027.rcac.purdue.edu M_ID=2 O_ID=2
    parallel region:       name:steele-a027.rcac.purdue.edu M_ID=2 O_ID=3
    parallel region:       name:steele-a028.rcac.purdue.edu M_ID=3 O_ID=0
    parallel region:       name:steele-a028.rcac.purdue.edu M_ID=3 O_ID=1
    parallel region:       name:steele-a028.rcac.purdue.edu M_ID=3 O_ID=2
    parallel region:       name:steele-a028.rcac.purdue.edu M_ID=3 O_ID=3
    
    third serial region    name:steele-a355.rcac.purdue.edu M_ID:0 O_ID:0 i=999
    third serial region    name:steele-a002.rcac.purdue.edu M_ID:1 O_ID:0 i=999
    third serial region    name:steele-a027.rcac.purdue.edu M_ID:2 O_ID:0 i=999
    third serial region    name:steele-a028.rcac.purdue.edu M_ID:3 O_ID:0 i=999
    
    name:steele-a355.rcac.purdue.edu M_ID:0 O_ID:0   Exits
    name:steele-a002.rcac.purdue.edu M_ID:1 O_ID:0   Exits
    name:steele-a027.rcac.purdue.edu M_ID:2 O_ID:0   Exits
    name:steele-a028.rcac.purdue.edu M_ID:3 O_ID:0   Exits
    

    Gaussian PBS Example

    Batch submission

    The following module command will add "subg03" to your search path:

    	module load gaussian03
    

    The "ncpus=" specification should be used as in the following. It does not affect the way the job runs, but it is needed in order to make the #tasks entry in the qstat output appear as expected.

    Examples of typical PBS job submissions

    Submit job using 4 processors on a single node

    	subg03 myjob -l select=1:ncpus=4:mpiprocs=4 -q standby
    

    Submit job using 4 processors on each of 2 nodes

    	subg03 myjob -l select=2:ncpus=4:mpiprocs=4,place=scatter -q standby
    

    Submit job using 8 processors on a single node

    	subg03 myjob -l select=1:ncpus=8:mpiprocs=8 -q standby
    

    Submit job using 8 processors on each of 2 nodes

    	subg03 myjob -l select=2:ncpus=8:mpiprocs=8,place=scatter -q standby
    

    Interactive submission

    For example, to use 4 processors on a single node:

    	qsub -I -l select=1:ncpus=4:mpiprocs=4 -q standby
    	module load gaussian03
    	cd $PBS_O_WORKDIR
    	rung03 myjob
    

    Maple PBS Example

    Batch submission

    #!/bin/bash
    # PBS directives must come before the first command in the script
    #PBS -l select=1:ncpus=1
    #PBS -N Maple
    #PBS -m bea
    #PBS -M  username@emailaddress.domain
    
    source /etc/profile           #necessary to be able to use module commands
    module load maple             #load the maple module
    cd $PBS_O_WORKDIR
    maple -q maplescript.mpl
    

    Then submit the script:

    qsub -l select=1 full_path_to/maple.script
    

    Interactive submission

    Note that '-v DISPLAY' exports your current DISPLAY environment variable to the job.

    	qsub -I -q standby -l select=1 -v DISPLAY
    

    Then

    	module load maple/11.0
    	
    	or
    	
    	module load maple
    

    Mathematica PBS Example

    Batch submission

    #!/bin/bash
    # PBS directives must come before the first command in the script
    #PBS -l walltime=4:00:00
    #PBS -N Mathematica
    #PBS -m bea
    #PBS -V       ##This is necessary to inherit environment variables
    #PBS -M  username@emailaddress.domain
    
    source /etc/profile           #necessary to be able to use module commands
    module load mathematica             #load the mathematica module
    cd $PBS_O_WORKDIR
    math < math.m > output.dat
    

    Then submit the script:

    qsub -l select=1 full_path_to/mathematica.script
    

    Interactive submission

    Note that '-v DISPLAY' exports your current DISPLAY environment variable to the job.

    	qsub -I -q standby -l select=1 -v DISPLAY
    

    Then

    Mathematica 5.2 (default):

    	module load mathematica/5.2
    	
    	or 
    	
    	module load mathematica
    

    Mathematica 6.0:

    	module load mathematica/6.0
    

    Matlab PBS Example

    Batch submission

    #!/bin/bash
    # PBS directives must come before the first command in the script
    #PBS -N Matlab
    #PBS -m bea
    #PBS -M  username@emailaddress.domain
    
    source /etc/profile           #necessary to be able to use module commands
    module load matlab             #load the matlab module
    # Start Matlab and feed it the commands up to the closing EOF line.
    # The closing EOF must stand alone on its line.
    matlab -nodesktop << EOF
    a = 10;                        % Matlab commands
    b = 20;
    c = 30;
    d = sqrt((a + b + c)/pi);
    exit
    EOF
    

    OR

    #!/bin/bash
    # PBS directives must come before the first command in the script
    #PBS -N Matlab
    #PBS -m bea
    #PBS -M  username@emailaddress.domain
    
    source /etc/profile           #necessary to be able to use module commands
    module load matlab             #load the matlab module
    cd $PBS_O_WORKDIR
    
    unset DISPLAY
      ### Call MATLAB with the appropriate input and output,
      ### and make it immune to hangups and quits using "nohup".
    nohup matlab < matlabscript.m > output
    

    Then submit the script:

    qsub -l select=1 full_path_to/matlab.script
    

    Interactive submission

    Note that '-v DISPLAY' exports your current DISPLAY environment variable to the job.

    	qsub -I -q standby -l select=1 -v DISPLAY
    

    Then

    	module load matlab/7.5
    	
    	or
    	
    	module load matlab
    

    R PBS Example

    Batch submission

    #!/bin/bash
    # PBS directives must come before the first command in the script
    #PBS -N R
    #PBS -m bea
    #PBS -M  username@emailaddress.domain
    
    source /etc/profile           #necessary to be able to use module commands
    module load R             #load the R module
    cd $PBS_O_WORKDIR
    R
    

    Then submit the script:

    qsub -l select=1 full_path_to/R.script
    

    Interactive submission

    Note that '-v DISPLAY' exports your current DISPLAY environment variable to the job.

    	qsub -I -q standby -l select=1 -v DISPLAY
    

    Then

    	module load R/2.6.2
    	
    	or 
    	
    	module load R
    

    SAS PBS Example

    Batch submission

    #!/bin/bash
    # PBS directives must come before the first command in the script
    #PBS -N SAS
    #PBS -m bea
    #PBS -M  username@emailaddress.domain
    
    source /etc/profile           #necessary to be able to use module commands
    module load sas             #load the sas module
    cd $PBS_O_WORKDIR
    
    sas file.sas 
    

    Then submit the script:

    qsub -l select=1 full_path_to/sas.script
    

    Interactive submission

    Note that '-v DISPLAY' exports your current DISPLAY environment variable to the job.

    	qsub -I -q standby -l select=1 -v DISPLAY
    

    Then

    	module load sas/8.2.0
    	
    	or 
    	
    	module load sas
    

    Running Jobs via Condor

    Condor allows users to run jobs on systems that would otherwise be idle, for as long as those systems are not needed by their primary users. Condor is one of several distributed computing systems RCAC makes available. Most RCAC resources, in addition to being available through normal means, are part of BoilerGrid and can be used via Condor. If a primary user needs a machine, the Condor job is immediately checkpointed, migrated, or both, and the resource is made available. Thus, shorter jobs will have a better completion rate via Condor than longer jobs. Even though jobs may have to be restarted elsewhere, BoilerGrid can offer a vast amount of computational resources to users. Nearly all RCAC systems are part of BoilerGrid, as are large numbers of lab machines at the West Lafayette and other Purdue campuses. BoilerGrid is one of the largest Condor pools in the world. Some machines at other institutions are also part of a larger Condor federation known as DiaGrid and can be used as well. For more information, refer to the BoilerGrid documentation.

    Radon Condor Tips

    Here is a short list of the steps to get ready to run on Condor:

    • Code Preparation: To run under Condor, a job must be able to run as a background batch job. Because Condor runs the program unattended and in the background, the program cannot do interactive input and output. Condor can redirect console output (stdout and stderr) and keyboard input (stdin) to and from files for you. Create any files that contain the keystrokes needed for program input, and make certain the program runs correctly with those files.
    • The Condor Universe: Condor has more than one runtime environment (called a universe) from which to choose. The most used ones are:

      • the standard universe, which lets a job running under Condor handle system calls by returning them to the machine where the job was submitted. It also provides the mechanisms necessary to take a checkpoint and migrate a partially completed job, should the machine on which the job is executing become unavailable. To use the standard universe, you must relink the program with the Condor library using the condor_compile command. To read more about compiling for Condor, look at man condor_compile or in the longer manual.
      • the vanilla universe, which provides a way to run jobs that cannot be relinked. There is no way to take a checkpoint or migrate a job executed under the vanilla universe. For access to input and output files, jobs must either use a shared file system or use Condor's File Transfer mechanism.

      Choose a universe under which to run the Condor program, and relink the program if necessary.
    • Submit description file: To control the details of a job submission, you use a submit description file. The file contains information about the job such as what executable to run, the files to use for keyboard and screen data, the platform type required to run the program, which universe you wish to use, and where to send e-mail when the job completes. You can also tell Condor how many times to run a program; it is simple to run the same program multiple times with multiple data sets. The submit description file is where the requirements and rank commands are defined.
      Write a submit description file to go with the job. Look at this example for guidance.
    • Submit the Job: Submit the program to Condor with the condor_submit command.

    Once the job is submitted, Condor will do the rest. You can monitor the job's progress with the commands condor_q and condor_status. You may modify the order in which Condor will run your jobs with the command condor_prio. If desired, Condor can even record in a log file every time your job is checkpointed and/or migrated to a different machine.

    When your program completes, Condor will tell you (by e-mail, if preferred) the exit status of your program and various statistics about its performance, including time used and I/O performed. If you are using a log file for the job (which is recommended), the exit status will be recorded in the log file. You can remove a job from the queue prematurely with the command condor_rm.

    • It is difficult to be allowed to run many (200 or more) jobs in the standard Condor universe, because of how the Purdue pool is set up.
    • Don't queue up many thousands of jobs at once. Use DAGMan to divide your jobs into reasonably sized chunks (500 jobs or so).
    • When a submit node is heavily used, don't run condor_q constantly. The condor_schedd is single-threaded, and it schedules work in the same thread that answers your queue listing.
    • Long jobs should run in the standard universe, not in the vanilla universe. A vanilla job cannot be checkpointed, so each eviction restarts it from the beginning and a long job may never finish.
    • The standard universe is the most desirable because of checkpoint availability, which makes it good for longer jobs, but sub-processes are not possible. Scripts can be used as executables. It is also necessary to link with the Condor run-time library (use of Intel compilers is not possible), and only static linking works.
    • The vanilla universe is the only possibility for Windows machines. It only has preemption by suspension or eviction and is thus bad for long jobs but OK for short jobs. (Eviction is when the owner of the cluster bumps your job; the job then restarts.) Vanilla jobs can use Intel compilers (and may run 30%-40% faster). Thus the vanilla universe may even be faster for somewhat longer jobs, because the speed gain may outweigh the advantage of checkpoint availability.
    • Generally, if your job runs for less than 1/2 hour, there is almost no eviction. If it runs for less than 1 hour, there will still be only a few evictions.
    • Purdue has both a scavenging/preempting and a scheduling system. Remember that the Condor pool is very heterogeneous in both processor versions and OS versions and types (Linux of different varieties and some Windows).
    • It is a good idea to use static linking regardless of universe, since you will never know which version of SUN libraries etc. you will find on a given machine.
    • Why no middleware (like Mycluster at TACC)? Middleware can be easier for the user, since it does Condor (and other stuff) 'behind the scenes'. Middleware packages are also schedulers and will not start a job until it is guaranteed to run to completion (no eviction). However, it has a lot of job restarts and thus much overhead on many jobs. Therefore, for a large number of jobs 'pure' Condor is better.
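
    The advice above to split a large workload into chunks of about 500 jobs can be sketched as a small shell function. The name chunk_ranges is hypothetical, and the sketch only computes which job indices fall into each chunk. How each chunk is then handed to DAGMan is not shown here and depends on your workflow.

```shell
#!/bin/bash
# chunk_ranges TOTAL SIZE
# Hypothetical helper: prints one "first last" pair of 0-based job indices
# per chunk, so that no chunk holds more than SIZE jobs.
chunk_ranges() {
    local total=$1 size=$2 first last
    # A chunk size of zero or less would never advance the loop.
    ((size > 0)) || return 1
    for ((first = 0; first < total; first += size)); do
        last=$((first + size - 1))
        if ((last >= total)); then
            last=$((total - 1))
        fi
        printf '%d %d\n' "$first" "$last"
    done
}

# Split 1200 jobs into chunks of at most 500.
chunk_ranges 1200 500
```

    For 1200 jobs this prints three ranges: 0 499, 500 999, and 1000 1199.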


    Limitations on Jobs which can be Checkpointed

    Condor is able to schedule and run any type of process, but it has some limitations on which jobs it can transparently checkpoint and migrate:

    • Multi-process jobs are not allowed. This includes system calls such as fork(), exec(), and system().
    • Interprocess communication is not allowed. This includes pipes, semaphores, and shared memory.
    • Network communication must be brief. A job may make network connections using system calls such as socket(), but a network connection left open for long periods will delay checkpointing and migration.
    • Sending or receiving the SIGUSR2 or SIGTSTP signals is not allowed. Condor reserves these signals for its own use. Sending or receiving all other signals is allowed.
    • Alarms, timers, and sleeping are not allowed. This includes system calls such as alarm(), getitimer(), and sleep().
    • Multiple kernel-level threads are not allowed. However, multiple user-level threads are allowed.
    • Memory mapped files are not allowed. This includes system calls such as mmap() and munmap().
    • File locks are allowed, but not retained between checkpoints.
    • All files must be opened read-only or write-only. A file opened for both reading and writing will cause trouble if a job must be rolled back to an old checkpoint image. For compatibility reasons, a file opened for both reading and writing will result in a warning but not an error.
    • A fair amount of disk space must be available on the submitting machine for storing a job's checkpoint images. A checkpoint image is approximately equal to the virtual memory consumed by a job while it runs. If disk space is short, a special checkpoint server can be designated for storing all the checkpoint images for a pool.
    • On Digital Unix (OSF/1), HP-UX, and Linux, your job must be statically linked. Dynamic linking is allowed on all other platforms.

    Note: these limitations only apply to jobs which Condor has been asked to transparently checkpoint. If job checkpointing is not desired, the limitations above do not apply.

    Note: Jobs Need to be Re-linked to get Checkpointing and Remote System Calls: Although typically no source code changes are required, Condor requires that the jobs be re-linked with the Condor libraries to take advantage of checkpointing and remote system calls. This often precludes commercial software binaries from taking advantage of these services because commercial packages rarely make their object code available. Condor's other services are still available for these commercial packages.

    Choosing a Condor Universe


    Condor allows several types of jobs, but the most used are "standard" and "vanilla". Standard jobs can be checkpointed and migrated from system to system transparently by Condor - jobs can be moved from node to node without restarting. However, for a code to be submitted as a standard job it must be recompiled using various Condor-specific compiler options and libraries. An application must also conform to a few other restrictions in order to run in the standard universe.

    Those programs that cannot be recompiled can be submitted as vanilla jobs. Virtually any non-parallel program can be submitted. Vanilla jobs cannot be checkpointed. If a node ceases to be idle any running vanilla jobs may be suspended or killed (to be restarted elsewhere).

    Under Windows, only vanilla jobs are allowed.

    A universe in Condor defines an execution environment. Condor Version 6.8.0 supports several different universes for user jobs:

    • Standard: The standard universe provides migration and reliability, but has some restrictions on the programs that can be run.
    • Vanilla: The vanilla universe provides fewer services, but has very few restrictions.
    • PVM: The PVM universe is for programs written to the Parallel Virtual Machine interface. See section 2.9 of the Condor manual for more about PVM and Condor.
    • MPI: The MPI universe is for programs written to the MPICH interface. See section 2.10 of the Condor manual for more about MPI and Condor. The MPI universe has been superseded by the Parallel universe.
    • Globus or Grid: The Globus or Grid universe allows users to submit jobs using Condor's interface. These jobs are submitted for execution on grid resources. For Globus jobs, see http://www.globus.org for more information.
    • Java: The Java universe allows users to run jobs written for the Java Virtual Machine (JVM).
    • Scheduler: The scheduler universe allows users to submit lightweight jobs to be spawned by the condor_schedd on the submit host itself.
    • Local: The local universe allows a Condor job to be submitted and executed with different assumptions for the execution conditions of the job.
    • Parallel: The Parallel universe is for programs that require multiple machines for one job. See section 2.10 for more about the Parallel universe.

    The Universe attribute is specified in the submit description file. If a universe is not specified, the default is standard.

    See chapter 2.4.1 of the Condor manual for more details about the different universes.

    Radon Condor Submission Script

    Example 1: Here is the simplest possible submit description file. It queues one copy of the program hello (which has first been built with condor_compile) for execution by Condor. No platform is specified, so Condor uses its default, which is to run the job on a machine that has the same architecture and operating system as the machine from which it was submitted.
    No input, output, and error commands are given in the submit description file, so the files stdin, stdout, and stderr will all refer to /dev/null. The program may produce output by explicitly opening a file and writing to it. A log file, hello.log, will also be produced. This log-file will contain events the job had during its lifetime inside of Condor, such as any possible errors. When the job finishes, its exit conditions will also be noted in the log file. It is recommended that you always have a log file so you know what happened to your jobs.
    If your program only returns output to the screen (like the hello.c program below does), then you should include Output = hello.out or something like it, somewhere before Queue. Otherwise you will not see the output.
      ####################                                                    
      # 
      # Example 1                                                            
      # Simple condor job description file                                    
      #                                                                       
      ####################                                                    
                                                                              
      Executable     = hello
      Log            = hello.log
      Queue
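
    As noted above, a program that only prints to the screen needs an Output line before Queue. The sketch below writes such a variant of Example 1 from the shell. The function name write_hello_submit and the file name hello.sub are assumptions made for illustration; the commands inside the generated file are the ones discussed in this section.

```shell
#!/bin/bash
# write_hello_submit FILE
# Hypothetical helper: writes a submit description file like Example 1,
# plus an Output line so that whatever the program prints to stdout is
# stored in hello.out.
write_hello_submit() {
    local target=$1
    cat > "$target" <<'EOF'
Executable     = hello
Output         = hello.out
Log            = hello.log
Queue
EOF
}

# hello.sub is an assumed file name.
write_hello_submit hello.sub
```

    The generated file is then submitted with condor_submit hello.sub.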
    
    Example 2: In this example (from the Condor manual), we queue two copies of the program mathematica. The first copy will run in directory run_1, and the second will run in directory run_2. For both queued copies, stdin will be test.data, stdout will be loop.out, and stderr will be loop.error. Two sets of files will be written, because each copy writes its files to its own directory. This is a convenient way to organize data if you have a large group of Condor jobs to run. The example file shows program submission of mathematica as a vanilla universe job. This may be necessary if the source and/or object code to program mathematica is not available.
      ####################     
      #                       
      # Example 2: demonstrate use of multiple     
      # directories for data organization.      
      #                                        
      ####################                    
                                             
      Executable     = mathematica          
      Universe = vanilla                   
      input   = test.data                
      output  = loop.out                
      error   = loop.error             
      Log     = loop.log                                                    
                                      
      Initialdir     = run_1         
      Queue                         
                                   
      Initialdir     = run_2      
      Queue
    
    Example 3: In this example (also from the Condor manual), the submit description file queues 150 runs of program foo, which has been compiled and linked for Silicon Graphics workstations running IRIX 6.5. This job requires Condor to run the program on machines which have at least 32 megabytes of physical memory, and expresses a preference to run the program on machines with at least 64 megabytes, if such machines are available. It also advises Condor that it will use up to 28 megabytes of memory when running. Each of the 150 runs of the program is given its own process number, starting with process number 0. So, files stdin, stdout, and stderr will refer to in.0, out.0, and err.0 for the first run of the program, in.1, out.1, and err.1 for the second run of the program, and so forth. A log file containing entries about when and where Condor runs, checkpoints, and migrates processes for the 150 queued programs will be written into file foo.log.

      ####################                    
      #
      # Example 3: Show off some fancy features including
      # use of pre-defined macros and logging.
      #
      ####################                                                    
    
      Executable     = foo                                                    
      Requirements   = Memory >= 32 && OpSys == "IRIX65" && Arch =="SGI"     
      Rank		 = Memory >= 64
      Image_Size     = 28 Meg                                                 
    
      Error   = err.$(Process)                                                
      Input   = in.$(Process)                                                 
      Output  = out.$(Process)                                                
      Log = foo.log
    
      Queue 150
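
    The file naming that $(Process) produces in Example 3 can be pictured with a short Python sketch. This only illustrates the naming pattern described above and is not Condor's own code; the function name expand_process is hypothetical.

```python
def expand_process(template, count):
    """Return the file names produced by replacing $(Process) with 0..count-1."""
    return [template.replace("$(Process)", str(n)) for n in range(count)]


# The three per-run files from Example 3, shown for the first two of 150 runs.
for name in ("in.$(Process)", "out.$(Process)", "err.$(Process)"):
    print(expand_process(name, 150)[:2])
```

    Each of the 150 runs therefore reads and writes its own set of files, while all runs share the single log file foo.log.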
    
    

    Condor Job Submission

    To submit a job to Condor for execution, you must use the condor_submit command. This command takes the submit description file as an argument. As described above, this file contains the commands and keywords used to direct the queuing of jobs - the name of the executable to run, which universe to run in, any requirements and rank information, how many times to run the program, any command line arguments, etc. Based on this information, condor_submit will create a job ClassAd to use for matching with a machine ClassAd. When this has been done, Condor can queue the job for running on that machine.

    There are many advantages to the submit description file. For example, you may want to run the same program many times, each time with a different input data set (say, 500 times with 500 different input data sets). It is easy to tell Condor to do this. Just arrange your data files so that each run reads its own input and writes its own output. Each individual run may have its own initial working directory, stdin, stdout, stderr, command-line arguments, and shell environment. A program that directly opens its own files will read the file names to use either from stdin or from the command line. A program that opens a static filename every time will need to use a separate subdirectory for the output of each run.
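
    For the static-filename case, a submit description file with one working directory per run can be generated rather than typed by hand. The Python sketch below is only an illustration modelled on Example 2 above. The function name build_submit_description is hypothetical, and the sketch assumes that the directories run_1, run_2, ... already exist and that each holds its own test.data.

```python
def build_submit_description(executable, runs):
    """Build submit description text that queues one job per run directory.

    Each run gets its own Initialdir (run_1, run_2, ...), so a program that
    always opens the same file names does not overwrite another run's files.
    """
    lines = [
        f"Executable = {executable}",
        "Universe   = vanilla",
        "Input      = test.data",
        "Output     = loop.out",
        "Error      = loop.error",
        "Log        = loop.log",
        "",
    ]
    for n in range(1, runs + 1):
        lines.append(f"Initialdir = run_{n}")
        lines.append("Queue")
        lines.append("")
    return "\n".join(lines)


# Print the description for two runs; redirect it to a file to use it.
print(build_submit_description("mathematica", 2))
```

    The printed text can be saved to a file and passed to condor_submit in the usual way.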

    Write a submit description file and submit it:
    	condor_submit file
    

    Example:
    	condor_submit run_hello (my submit description file is called run_hello). 
    

    See the condor_submit manual page for a more complete description of how to use it.

    Condor Job Status

    To see the status of the machines in the pool, type condor_status. (To see the status of your jobs, use condor_q, described below.)

    Condor allocates resources by matching the submitted jobs with the machines. It does this by matching ClassAds. Condor's ClassAds are analogous to the classified advertising section of the newspaper. Sellers and buyers advertise specifics about what they have to sell or want to buy. Both buyers and sellers have some constraints which must be satisfied, like buyers only being able to pay a certain sum of money or sellers asking for no less than a certain price. Sellers and buyers both want to rank requests to their own advantage; for example, the seller would give a higher rank to a higher price offer. In Condor, users submitting jobs can be thought of as buyers of compute resources and machine owners are sellers.

    All the machines in a Condor pool advertise their attributes. These could be available RAM memory, CPU type, CPU speed, virtual memory size, current load average, or other static and dynamic properties. This machine ClassAd also advertises under what conditions it is willing to run a Condor job and what type of job it would prefer.
    The owners who allow their machines to be part of the Condor pool may set individual terms and preferences - perhaps specifying that their machines may only be used to run jobs at night, or that they give a higher rank to jobs submitted by their own department.

    A very useful program for finding out which machines and architectures are available is condor_all. Note that even though it is located in the "official" Condor directory, /opt/condor/bin, it is a locally (Purdue) developed tool. It is handy for finding out how many machines of a certain architecture are available, which is useful when writing the submit description file. The program is in the default path on tg_login, and it is installed on most RCAC resources, where it can be run as /opt/condor/bin/condor_all.

    Just as the machines have requirements and preferences, the same is true for the users submitting a job. The users specify a ClassAd with their requirements and preferences when they submit a job. This ClassAd includes the type of machine you wish to use - you would perhaps like to use the machine with the fastest floating point performance available, and you thus want Condor to rank the available machines based upon their floating point performance.
    Another example could be that your job requires a machine with a minimum of, say, 4 GB of RAM and you thus only want Condor to consider machines which fulfill this requirement.
    Sometimes, the user may be ready to use any machine available and this too can be communicated to Condor through the job ClassAd.

    Condor's job then is to read all the machine ClassAds and all the user job ClassAds and match them up. Condor makes certain that all requirements in both ClassAds are satisfied, if possible.
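
    The two-sided matching described above can be pictured with a toy model in Python. This is a deliberately simplified sketch, not Condor's matchmaking algorithm: it ignores user priorities and the machine's own rank, and the machine names, attribute values, and function names are all made up for illustration.

```python
def matches(job, machine):
    """A match needs both sides' requirements to hold against the other ad."""
    return job["requirements"](machine) and machine["requirements"](job)


def best_machine(job, machines):
    """Among acceptable machines, pick the one the job ranks highest."""
    candidates = [m for m in machines if matches(job, m)]
    if not candidates:
        return None
    return max(candidates, key=job["rank"])


# Hypothetical machine ads: memory in megabytes, plus each owner's terms.
machines = [
    {"name": "node-a", "memory": 2048, "kflops": 900000,
     "requirements": lambda job: True},
    {"name": "node-b", "memory": 4096, "kflops": 700000,
     "requirements": lambda job: job["owner"] == "user123"},
    {"name": "node-c", "memory": 4096, "kflops": 800000,
     "requirements": lambda job: job["owner"] != "user123"},
]

job = {
    "owner": "user123",
    "requirements": lambda m: m["memory"] >= 4096,  # needs 4 GB of RAM
    "rank": lambda m: m["kflops"],                  # prefers faster machines
}

print(best_machine(job, machines)["name"])
```

    Here node-a is rejected by the job (too little memory) and node-c rejects the job (its owner's terms exclude user123), so node-b is the only machine on which both sets of requirements are satisfied.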

    To get a feel for what a machine ClassAd does, try typing the command condor_status. This will give you a summary of the information in the resource ClassAds in your Condor pool. To see an example of running this command, click here. The list was generated by running condor_status on radon, August 25th 2006.

    Some options can be given to the condor_status command, for example:

    • -available shows only machines which are willing to run jobs now.
    • -run shows only machines which are currently running jobs.
    • -l lists the machine ClassAds for all machines in the pool.

    A more complete list of options can be seen by running man condor_status or by looking in the Condor Manual at the University of Wisconsin. You can go directly to that manual's page about the condor_status command by clicking here.

    It is usually not a good idea to just use the command condor_status -l without giving a machine, since this will list the full machine ClassAds for all the machines in your pool and that could be a little overwhelming. Usually, what you would do is give the command for a specific machine. Click here to see an example for the machine yb-048.rcac.purdue.edu.

    As can be seen from the example, there are quite a few attributes. Some of them are used by Condor for scheduling. Other attributes are for information purposes. An important point is that any of the attributes in a machine ad can be used at job submission time as part of a request or preference on what machine to use. Additional attributes can easily be added. For example, your site administrator can add a physical location attribute to your machine ClassAds.

    Affecting the job's execution


    Placing a job on hold

    To place a job in the queue on hold, use the command condor_hold. A job that is in the hold state remains there until later released for execution by the command condor_release.

    See the manual page for more information.

    Changing the priority of jobs

    In addition to the priorities assigned to each user, Condor also provides each user with the capability of assigning priorities to each submitted job. These job priorities are local to each queue and can be any integer value, with higher values meaning better priority.
    The default priority of a job is 0, but it can be changed using the condor_prio command. Example: to change the priority of a job to -15:

    	user123@radon:~$ condor_q user123
    	
    	
    	-- Submitter: radon.rcac.purdue.edu : <128.210.9.35:35407> : radon.rcac.purdue.edu
    	 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
    	260187.0   user123         8/30 13:59   0+00:00:00 I  0   19.5 hello             
    	
    	1 jobs; 1 idle, 0 running, 0 held
    	user123@radon:~$ condor_prio -p -15 260187.0
    	user123@radon:~$ condor_q user123
    	
    	-- Submitter: radon.rcac.purdue.edu : <128.210.9.35:35407> : radon.rcac.purdue.edu
    	 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
    	260187.0   user123         8/30 13:59   0+00:00:03 R  -15 19.5 hello             
    	
    	1 jobs; 0 idle, 1 running, 0 held
    	user123@radon:~$
    

    Note these job priorities are different from the user priorities assigned by Condor. Job priorities do not impact user priorities and are only a mechanism for the user to identify the relative importance of jobs among all the jobs submitted by the user to that specific queue.

    6. Managing a Job


    In this section we look at commands that apply to a job after it has been submitted. The first part looks at how the job is monitored. The commands will be discussed briefly; for a more detailed description, you should look at the man pages for the commands referred to. This can be done either by typing man <command>, or by looking in the online, official manual, chapter 9.
    The last part of this section looks at ways to affect the job's execution after it has been submitted. This can (among other things) be done by changing the job priority.

    As soon as the job has been submitted, Condor will start looking for resources to run it. By typing condor_status -submitters, you will get a list of the users who have submitted jobs. An example of this can be seen below:

    	-bash-3.00$ condor_status -submitters
    	
    	Name                 Machine      Running IdleJobs HeldJobs
    	
    	user4@rcac.purdue. hamlet.rca         0        0        5
    	user5@r            hamlet.rca       792        0        0
    	user6@rcac.purdue. hamlet.rca         0        0        0
    	user7@rcac.purdue. hamlet.rca         0        0        1
    	user8@rcac.purdue.edu hamlet.rca         0        0      228
    	user5@r            lear.rcac.         0        0        1
    	user9@rcac.purdue.e radon.rcac         0        6        2
    	user10@rcac.purdue radon.rcac         0        0        1
    	user7@rcac.purdue radon.rcac         0        1        2
    	user6@rcac.purdue.e radon.rcac         0        0        2
    	user5@r             radon.rcac       882        0        0
    	user11@rcac.purdue  radon.rcac         0        0        1
    	user12@rcac.purdue.e radon.rcac         0        0        5
    	user13@rcac.purdue.  radon.rcac         0      186        2
    	user14@rcac.purdue   radon.rcac      1000        0        0
    	user15@rcac.p        steele-fe0         0      220        1
    	user16@rcac.purdue.ed steele-fe0         0        0        1
    	user17@rcac.purdue steele-fe0         0    37472        1
    	tg_user1@rcac.     tg-gatekee         0        1        0
    	tg_user2@rcac.purdue.e tg-login64         0        1        0
    	
    	                           RunningJobs           IdleJobs           HeldJobs
    	
    	user9@rcac.purdue.e                 0                  6                  2
    	user7@rcac.purdue                   0                  0                  1
    	user15@rcac.p                        0                220                  1
    	tg_user1@rcac.                       0                  1                  0
    	user12@rcac.purdue.                  0                  0                  5
    	user7@rcac.purdue                    0                  1                  2
    	user6@rcac.purdue.e                  0                  0                  2
    	user11@rcac.purdue.ed                0                  0                  1
    	user5@r                           1674                  0                  1
    	user17@rcac.purdue                   0              37472                  1
    	tg_user1@rcac.purdue.e               0                  1                  0
    	user6@rcac.purdue.                   0                  0                  0
    	user16@rcac.purdue                   0                  0                  1
    	user12@rcac.purdue.e                 0                  0                  5
    	user13@rcac.purdue.                  0                186                  2
    	user14@rcac.purdue                1000                  0                  0
    	user10@rcac.purdue.                  0                  0                  1
    	user8@rcac.purdue.edu                0                  0                228
    	
    	               Total              2674              37887                253
    	-bash-3.00$ 
    

    Checking on the progress of jobs

    To check on the status of your jobs, use the command condor_q. This command will display the status of all the queued jobs, not just your own.

    That is, however, not the only way of tracking the progress of your jobs. Another way of doing this is through the user log. In your submit description file, you can specify a log command (by adding Log = <name>.log somewhere before the Queue command). When you have done this, the progress of the job may be followed by viewing the log file. Various events such as execution commencement, checkpoint, eviction and termination are logged in the file, together with the time at which each event occurred.
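
    A user log can also be read by a small script. The Python sketch below assumes that event header lines look like the ones in the hello.log listing shown in the Simple Condor Example section of this guide; the regular expression and the function name parse_events are my own assumptions based on those lines, not an official description of the log format.

```python
import re

# Event header: three-digit event code, (cluster.proc.subproc), date, time, text.
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$"
)


def parse_events(log_text):
    """Return (code, job_id, date, time, message) for each event header line."""
    events = []
    for line in log_text.splitlines():
        m = EVENT_RE.match(line.strip())
        if m:
            code, cluster, proc, _sub, date, time, message = m.groups()
            events.append((code, f"{int(cluster)}.{int(proc)}", date, time, message))
    return events


# Sample lines copied from the hello.log listing in this guide.
sample = """\
000 (260182.000.000) 08/29 16:21:31 Job submitted from host: <128.210.9.35:35407>
...
001 (260182.000.000) 08/29 16:22:42 Job executing on host: <128.211.131.51:32780>
...
005 (260182.000.000) 08/29 16:22:42 Job terminated.
        (1) Normal termination (return value 13)
...
"""

for event in parse_events(sample):
    print(event)
```

    Detail lines such as the termination status and usage figures do not start with an event code, so this sketch skips them.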

    As soon as your job begins executing, Condor will start up a condor_shadow process on the submit machine. The shadow process is the mechanism by which a remotely executing job can access the environment from which it was submitted, such as input and output files.
    It is normal for a machine which has submitted hundreds of jobs to have hundreds of shadows running on it. Since the text segments of all these processes are the same, the load on the submit machine is usually not significant. If, however, you notice degraded performance, you can limit the number of jobs that can run simultaneously through the MAX_JOBS_RUNNING configuration parameter. Please talk to your system administrator about the necessary configuration change.

    To find all the machines which are running your jobs, use the command condor_status. Example: say you wish to find all the machines which run jobs submitted by user123@rcac.purdue.edu. You would then type condor_status -constraint 'RemoteUser == "user123@rcac.purdue.edu"'.

    	user123@radon:~$ condor_status -constraint 'RemoteUser == "user123@rcac.purdue.edu"'
    	
    	Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime
    	
    	ba-005.rcac.p LINUX       INTEL  Claimed    Busy       1.000   502  0+00:24:44
    	ba-006.rcac.p LINUX       INTEL  Claimed    Busy       0.990   502  0+00:20:22
    	ba-007.rcac.p LINUX       INTEL  Claimed    Busy       1.000   502  0+00:23:16
    	ba-008.rcac.p LINUX       INTEL  Claimed    Busy       1.000   502  0+00:30:20
    	...
    

    If you want to find all the machines that are running any job at all, then type: condor_status -run.
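
    The RemoteUser constraint shown above contains double quotes inside single quotes, which is easy to get wrong when the command is assembled by a script. The Python sketch below builds the same command line as an argument list; the function name status_for_user is hypothetical, and the sketch only constructs the command without running it.

```python
def status_for_user(user):
    """Build the condor_status command line for machines running a user's jobs."""
    constraint = f'RemoteUser == "{user}"'
    # Passing a list avoids shell quoting problems with the embedded quotes.
    return ["condor_status", "-constraint", constraint]


argv = status_for_user("user123@rcac.purdue.edu")
print(argv)
# To actually run it on a machine with Condor installed, one could use:
#   import subprocess; subprocess.run(argv, check=True)
```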

    Condor Job Cancellation

    Removing a job from the queue

    The command condor_rm can be used at any time to remove a job from the queue. If the job has already started running, then the job will be killed without a checkpoint, and its queue entry will be removed. Use condor_q to get the ID of the job. Here is an example:

    Queue of jobs before:

    	user123@radon:~$ condor_q
    	
    	Submitter: radon.rcac.purdue.edu : <128.210.9.35:35407> : radon.rcac.purdue.edu
    	 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
    	...
    	260076.7   nice-user.user1 8/18 00:05   0+00:21:47 I  0   29.3 startfah.sh -oneun
    	260076.9   nice-user.user1 8/18 00:05   0+01:40:44 I  0   136.7 startfah.sh -oneun
    	260185.0   user123              8/30 13:01   0+00:00:00 R  0   19.5 hello             
    	...
    

    Queue of jobs after:

    	user123@radon:~$ condor_rm 260185.0
    	Job 260185.0 marked for removal
    	user123@radon:~$ condor_q
    	
    	Submitter: radon.rcac.purdue.edu : <128.210.9.35:35407> : radon.rcac.purdue.edu
    	 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
    	...
    	260076.7   nice-user.user1 8/18 00:05   0+00:21:47 I  0   29.3 startfah.sh -oneun
    	260076.9   nice-user.user1 8/18 00:05   0+01:40:44 I  0   136.7 startfah.sh -oneun
    	...
    

    Condor Examples

    These instructions are very short and are merely meant to let you run a small example immediately. Read the rest of the sections, and perhaps the Condor manual, for more details on how to use the many possibilities in Condor.

    Simple Condor Example

    Compiling
    	condor_compile <compiler> <program>.<extension> -o <program name>
    

    Example:
    	condor_compile gcc hello.c -o hello 
    

    Submitting

    It is very simple to submit the job to Condor, when the submit description file has been written. At the command-prompt, just type condor_submit <job-name>, where job-name is the name of the submit description file.

    Example: Here I am using a very simple submit description file, namely:

    	Executable   = hello
    	Log          = hello.log
    	Output       = hello.out
    	Queue 
    

    Here hello is a C program which was first compiled with the command condor_compile gcc hello.c -o hello. I have named this submit description file 'run_hello'. In the following, I am running on radon:

    	user123@radon:~$ condor_submit run_hello 
    	Submitting job(s).
    	Logging submit event(s).
    	1 job(s) submitted to cluster 260182.
    	user123@radon:~$ 
    

    It may take a (sometimes long) while before the job starts and finishes running, depending on how many others are using the machines, your rank, the requirements you have given for the job, etc. The progress can be checked with the command condor_q. When the job has completed, I have the two files hello.log and hello.out in my directory - just as I asked for in the submit description file. You should always use a log file.
    The contents of the files are:

    hello.log:
    	000 (260182.000.000) 08/29 16:21:31 Job submitted from host: <128.210.9.35:35407>
    	...
    	001 (260182.000.000) 08/29 16:22:42 Job executing on host: <128.211.131.51:32780>
    	...
    	005 (260182.000.000) 08/29 16:22:42 Job terminated.
    	        (1) Normal termination (return value 13)
    	                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
    	                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
    	                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
    	                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    	        830  -  Run Bytes Sent By Job
    	        13490672  -  Run Bytes Received By Job
    	        830  -  Total Bytes Sent By Job
    	        13490672  -  Total Bytes Received By Job
    	...
    

    and

    hello.out:
    	Hello World!
    

    which is the output the program would otherwise have written to the screen. You will also receive an email notification unless you specify otherwise.

    Requirements and Rank

    It is important to list the correct requirements and rank commands in the submit description file. This way you can ensure that your program is run on the machine that best fits your requirements.

    The requirements and rank must be specified as valid Condor ClassAd expressions. There are, however, default values set by the condor_submit program, which are used if none are defined in the submit description file. The ClassAd expressions are intuitive and reminiscent of C. It is possible to write quite elaborate expressions with ClassAds. See chapter 4.1 in the Condor manual for a complete description.

    All of the commands in the submit description file are case insensitive, except for the ClassAd attribute string values. ClassAd attribute names are case insensitive, but ClassAd string values are case preserving.
    Note that the comparison operators (<, >, <=, >=, and ==) compare strings case insensitively. The special comparison operators =?= and =!= compare strings case sensitively.
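
    The difference between the two kinds of string comparison can be modelled in a few lines of Python. This is only a sketch of the case-sensitivity rule stated above; it does not model how these operators treat UNDEFINED values, and the function names are hypothetical.

```python
def eq(a, b):
    """Model of ==: string comparison ignores case."""
    return a.lower() == b.lower()


def meta_eq(a, b):
    """Model of =?=: string comparison is case sensitive."""
    return a == b


print(eq("LINUX", "linux"))       # True
print(meta_eq("LINUX", "linux"))  # False
print(meta_eq("LINUX", "LINUX"))  # True
```

    In a submit description file this means that OpSys == "linux" and OpSys == "LINUX" select the same machines, whereas OpSys =?= "linux" would only match a machine that advertises exactly that spelling.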

    The allowed ClassAd attributes vary from machine to machine. To see all of the machine ClassAd attributes for all machines in the Condor pool, run the command condor_status -l. If there are any jobs in the queue, you can see the job ClassAds with the command condor_q -l.

    Requesting Specific Architectures

    When Condor is considering a match between a job and a machine, the rank is used to choose a match from among all machines that satisfy the job's requirements and are available to the user, after accounting for the user's priority and the machine's rank of the job. The rank expressions, simple or complex, define a numerical value that expresses preferences.

    The job's rank expression evaluates to one of three values:

    • UNDEFINED
    • ERROR
    • a floating point value

    If rank evaluates to a floating point value, the best match will be the one with the largest, positive value. If no rank is given in the submit description file, then Condor substitutes a default value of 0.0 when considering machines to match. If the job's rank of a given machine evaluates to UNDEFINED or ERROR, this same value of 0.0 is used. Therefore, the machine is still considered for a match, but has no rank above any other.

    A boolean expression evaluates to the numerical value of 1.0 if true, and 0.0 if false.

    Here are some examples of rank expressions from the Condor manual:

    • For a job that desires the machine with the most available memory:
    	Rank = memory
    

    • For a job that prefers to run on a friend's machine on Saturdays and Sundays:
    	Rank = ( (clockday == 0) || (clockday == 6) ) && (machine == "friend.cs.wisc.edu")
    

    • For a job that prefers to run on one of three specific machines:
    	Rank = (machine == "friend1.cs.wisc.edu") ||
    	            (machine == "friend2.cs.wisc.edu") ||
    	            (machine == "friend3.cs.wisc.edu")
    

    • For a job that wants the machine with the best floating point performance (on Linpack benchmarks):
    	Rank = kflops
    

    This last example may give problems, since not all machines have the kflops attribute defined. For machines where this attribute is not defined, Rank will evaluate to the value UNDEFINED, and Condor will use a default rank of the machine of 0.0. The rank attribute will only rank machines where the attribute is defined. Therefore, the machine with the highest floating point performance may not be the one given the highest rank.

    Thus, before writing a rank expression, it is always wise to check whether it will lead to the expected ranking of machines. Use the command condor_status -constraint <name> to see a list of machines that fit a certain constraint. For example, to see which machines in the pool have kflops defined, use condor_status -constraint kflops.
    Alternatively, to see a list of machines where kflops is not defined, use condor_status -constraint "kflops=?=undefined".
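
    The effect of an undefined attribute on ranking can be illustrated with a toy Python model. It follows the rule described above (a rank that evaluates to UNDEFINED is treated as 0.0), but it is a sketch rather than Condor's implementation; the machine names and kflops values are invented.

```python
UNDEFINED = object()  # stand-in for the ClassAd UNDEFINED value


def job_rank(machine, attribute):
    """Rank a machine by one attribute; UNDEFINED or missing counts as 0.0."""
    value = machine.get(attribute, UNDEFINED)
    if value is UNDEFINED:
        return 0.0
    return float(value)


# Hypothetical pool: the first machine never advertised kflops.
machines = [
    {"name": "fast-but-unbenchmarked"},
    {"name": "slow", "kflops": 500000},
    {"name": "medium", "kflops": 800000},
]

ranked = sorted(machines, key=lambda m: job_rank(m, "kflops"), reverse=True)
print([m["name"] for m in ranked])
```

    The machine without a kflops attribute ends up last no matter how fast it really is, which is exactly the pitfall to check for before relying on Rank = kflops.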

    • For a job that prefers specific machines in a specific order:
    	Rank = ((machine == "friend1.cs.wisc.edu")*3) +
    	            ((machine == "friend2.cs.wisc.edu")*2) +
    	             (machine == "friend3.cs.wisc.edu")
    

    Example: If the machine being ranked is "friend1.cs.wisc.edu", then the expression
    	(machine == "friend1.cs.wisc.edu")
    

    is true, and gives the value 1.0. The expressions

    	(machine == "friend2.cs.wisc.edu")
    
    and
    	(machine == "friend3.cs.wisc.edu")
    

    are false, and give the value 0.0. Therefore, rank evaluates to the value 3.0. In this way, machine "friend1.cs.wisc.edu" is ranked higher than machine "friend2.cs.wisc.edu", machine "friend2.cs.wisc.edu" is ranked higher than machine "friend3.cs.wisc.edu", and all three of these machines are ranked higher than others.
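
    The arithmetic of this worked example can be checked with a short Python sketch. It mirrors the rank expression above, using 1.0 for true and 0.0 for false. Note that Python's == is case sensitive, unlike the ClassAd == operator, so the sketch assumes host names are written exactly as shown.

```python
def rank(machine_name):
    """Rank = (friend1)*3 + (friend2)*2 + (friend3), with true = 1.0, false = 0.0."""
    return (
        float(machine_name == "friend1.cs.wisc.edu") * 3
        + float(machine_name == "friend2.cs.wisc.edu") * 2
        + float(machine_name == "friend3.cs.wisc.edu")
    )


for name in ("friend1.cs.wisc.edu", "friend2.cs.wisc.edu",
             "friend3.cs.wisc.edu", "other.cs.wisc.edu"):
    print(name, rank(name))
```

    The four machines receive ranks of 3.0, 2.0, 1.0, and 0.0 respectively, which gives the preference order described above.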

    Execution on Differing Architectures


    It is possible to allow Condor to choose between a perhaps larger pool of machines for a job, if executables are available for all the different platforms. This is done by making changes to the submit description file.

    Example:
    Cross submission. An executable is available for one platform, but the submission is done from a different platform. Given the correct executable, the requirements command in the submit description file specifies the target architecture. Here, an executable compiled for a Sun 4, submitted from an Intel architecture running Linux would add the requirement

    	requirements = Arch == "SUN4x" && OpSys == "SOLARIS251"
    

    Without this requirement, condor_submit will assume that the program is to be executed on a machine with the same platform as the machine where the job is submitted.
    Cross submission works for both standard and vanilla universes. To see the architecture and OS for the machines in the pool, type the command condor_status.

    Click here to see some examples (from the Condor manual) showing how cross submission works in the vanilla universe and here for an example for the standard universe.

    Prioritizing Resource Preferences

    Machine attributes:

    Here follows a description of some of the common machine attributes. For a longer, more complete listing of attributes, look here.

    • Activity: String which describes Condor job activity on the machine. Can have one of the following values:
      • "Idle": There is no job activity
      • "Busy": A job is busy running
      • "Suspended": A job is currently suspended
      • "Vacating": A job is currently checkpointing
      • "Killing": A job is currently being killed
      • "Benchmarking": The startd is running benchmarks
    • Arch: String with the architecture of the machine.
    • ClockDay: The day of the week, where 0 = Sunday, 1 = Monday, ... , 6 = Saturday.
    • ClockMin: The number of minutes passed since midnight.
    • ConsoleIdle: The number of seconds since activity on the system console keyboard or console mouse has last been detected.
    • Cpus: Number of CPUs in this machine.
    • CurrentRank: A float which represents this machine owner's affinity for running the Condor job which it is currently hosting. If not currently hosting a Condor job, CurrentRank is 0.0. When a machine is claimed, the attribute's value is computed by evaluating the machine's Rank expression with respect to the current job's ClassAd.
    • Disk: The amount of disk space on this machine available for the job in Kbytes.
    • EnteredCurrentActivity: Time at which the machine entered the current Activity. On all platforms (including NT), this is measured in the number of integer seconds since the Unix epoch (00:00:00 UTC, Jan 1, 1970).
    • FileSystemDomain: A "domain" name configured by the Condor administrator which describes a cluster of machines which all access the same, uniformly-mounted, networked file systems, usually via NFS or AFS. This is useful for Vanilla universe jobs which require remote file access.
    • KeyboardIdle: The number of seconds since activity on any keyboard or mouse associated with this machine has last been detected.
    • KFlops: Relative floating point performance as determined via a Linpack benchmark.
    • LoadAvg: A floating point number with the machine's current load average.
    • Machine: A string with the machine's fully qualified hostname.
    • Memory: The amount of RAM in megabytes.
    • Name: The name of this resource; typically the same value as the Machine attribute, but could be customized by the site administrator. On SMP machines, the condor_startd will divide the CPUs up into separate virtual machines, each with a unique name. These names will be of the form "vm#@full.hostname", for example, "vm1@vulture.cs.wisc.edu", which signifies virtual machine 1 from vulture.cs.wisc.edu.
    • OpSys: String describing the operating system running on this machine.
    • Requirements: A boolean, which when evaluated within the context of the machine ClassAd and a job ClassAd, must evaluate to TRUE before Condor will allow the job to use this machine.
    • MaxJobRetirementTime: An expression giving the maximum time in seconds that the startd will wait for the job to finish before kicking it off if it needs to do so.
    • StartdIpAddr: String with the IP and port address of the condor_startd daemon which is publishing this machine ClassAd.
    • State: String which publishes the machine's Condor state. Can be:
      • "Owner": The machine owner is using the machine, and it is unavailable to Condor.
      • "Unclaimed": The machine is available to run Condor jobs, but a good match is either not available or not yet found.
      • "Matched": The Condor central manager has found a good match for this resource, but a Condor scheduler has not yet claimed it.
      • "Claimed": The machine is claimed by a remote condor_schedd and is probably running a job.
      • "Preempting": A Condor job is being preempted (possibly via checkpointing) in order to clear the machine for either a higher priority job or because the machine owner wants the machine back.
    • VirtualMachineID: For SMP machines, the integer that identifies the VM. The value will be X for the VM with name="vmX@full.hostname". For non-SMP machines with one virtual machine, the value will be 1.
    • VirtualMemory: The amount of currently available virtual memory (swap space) expressed in Kbytes.
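
    The ClockDay and ClockMin attributes listed above are plain integers, which makes them easy to use in time-based policies such as "only run jobs at night". The Python sketch below shows how the two values can be read. The helper names and the night window (before 08:00 or from 18:00 onward) are assumptions made for illustration, not a policy configured on Radon.

```python
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]


def describe_clock(clock_day, clock_min):
    """Turn ClockDay (0 = Sunday) and ClockMin (minutes since midnight) into text."""
    hours, minutes = divmod(clock_min, 60)
    return f"{DAYS[clock_day]} {hours:02d}:{minutes:02d}"


def is_night(clock_min):
    """Hypothetical 'night' window: before 08:00 or from 18:00 onward."""
    return clock_min < 8 * 60 or clock_min >= 18 * 60


print(describe_clock(6, 1290))  # Saturday 21:30
print(is_night(1290))           # True
```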

    Job attributes:

    • Args: String representing the arguments passed to the job.
    • CkptArch: String describing the architecture of the machine this job executed on at the time it last produced a checkpoint. If the job has never produced a checkpoint, this attribute is undefined.
    • CkptOpSys: String describing the operating system of the machine this job executed on at the time it last produced a checkpoint. If the job has never produced a checkpoint, this attribute is undefined.
    • ClusterId: Integer cluster identifier for this job. A cluster is a group of jobs that were submitted together. Each job has its own unique job identifier within the cluster, but shares a common cluster identifier. The value changes each time a job or set of jobs are queued for execution under Condor.
    • CompletionDate: The time when the job completed, or the value 0 if the job has not yet completed. Measured in the number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970).
    • CurrentHosts: The number of hosts in the claimed state, due to this job.
    • EnteredCurrentStatus: An integer containing the epoch time of when the job entered into its current status. So, for example, if the job is on hold, the ClassAd expression CurrentTime - EnteredCurrentStatus will equal the number of seconds that the job has been on hold.
    • ImageSize: Estimate of the memory image size of the job in Kbytes. The initial estimate may be specified in the job submit file. Otherwise, the initial value is equal to the size of the executable. When the job checkpoints, the ImageSize attribute is set to the size of the checkpoint file (since the checkpoint file contains the job's memory image). A vanilla universe job's ImageSize is recomputed internally every 15 seconds.
    • JobPrio: Integer priority for this job, set by condor_submit or condor_prio. The default value is 0. The higher the number, the better the priority.
    • JobStartDate: Time at which the job first began running. Measured in the number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970).
    • JobStatus: Integer which indicates the current status of the job.
      • 0: Unexpanded (the job has never run)
      • 1: Idle
      • 2: Running
      • 3: Removed
      • 4: Completed
      • 5: Held
    • JobUniverse: Integer which indicates the job universe.
      • 1: standard
      • 4: PVM
      • 5: vanilla
      • 7: scheduler
      • 8: MPI
      • 9: grid
      • 10: java
    • LastMatchTime: An integer containing the epoch time when the job was last successfully matched with a resource (gatekeeper) Ad.
    • LastRejMatchReason: If, at any point in the past, this job failed to match with a resource ad, this attribute will contain a string with a human-readable message about why the match failed.
    • LastRejMatchTime: An integer containing the epoch time when Condor-G last tried to find a match for the job, but failed to do so.
    • MaxHosts: The maximum number of hosts that this job would like to claim. As long as CurrentHosts is the same as MaxHosts, no more hosts are negotiated for.
    • MaxJobRetirementTime: Maximum time in seconds to let this job run uninterrupted before kicking it off when it is being preempted. This can only decrease the amount of time from what the corresponding startd expression allows.
    • MinHosts: The minimum number of hosts that must be in the claimed state for this job, before the job may enter the running state.
    • NumGlobusSubmits: An integer that is incremented each time the condor_gridmanager receives confirmation of a successful job submission into Globus.
    • Owner: String describing the user who submitted this job.
    • ProcId: Integer process identifier for this job. Within a cluster of many jobs, each job has the same ClusterId, but will have a unique ProcId. Within a cluster, assignment of a ProcId value will start with the value 0. The job (process) identifier described here is unrelated to operating system PIDs.
    • RemoteIwd: The path to the directory in which a job is to be executed on a remote machine.
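
    As a minimal sketch of how the integer-valued attributes above can be read, the following Python snippet translates JobStatus and JobUniverse codes into the names listed above and repeats the CurrentTime - EnteredCurrentStatus arithmetic. The dictionary-based job ad, the function name describe_job, and the sample values are assumptions made for illustration; they are not part of Condor. The "ClusterId.ProcId" form follows the usual way Condor displays job identifiers.

```python
# Status codes as listed under the JobStatus attribute above.
JOB_STATUS = {
    0: "Unexpanded",
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
}

# Universe codes as listed under the JobUniverse attribute above.
JOB_UNIVERSE = {
    1: "standard",
    4: "PVM",
    5: "vanilla",
    7: "scheduler",
    8: "MPI",
    9: "grid",
    10: "java",
}


def describe_job(ad, current_time):
    """Summarize a job ad given as a plain dict (hypothetical input format)."""
    job_id = "%d.%d" % (ad["ClusterId"], ad["ProcId"])
    status = JOB_STATUS.get(ad["JobStatus"], "unknown")
    universe = JOB_UNIVERSE.get(ad["JobUniverse"], "unknown")
    # Same arithmetic as the ClassAd expression CurrentTime - EnteredCurrentStatus.
    seconds_in_status = current_time - ad["EnteredCurrentStatus"]
    return "%s %s (%s) for %d s" % (job_id, status, universe, seconds_in_status)


# Made-up sample values, for illustration only.
example_ad = {
    "ClusterId": 42,
    "ProcId": 0,
    "JobStatus": 5,
    "JobUniverse": 5,
    "EnteredCurrentStatus": 1000,
}
print(describe_job(example_ad, current_time=1600))
```

    With the sample values, the job has been in the Held state for 600 seconds.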

    Flocking to Other Grids

    The idea of grid computing is to be able to use resources that span many administrative domains. Even though a Condor pool usually contains machines owned by many different people, collaborating researchers from different organizations will often not consider it feasible to combine all their computers into one large Condor pool. They will therefore have to use grid computing.

    Condor has its own mechanisms for grid computing, but is also able to interact with other grid systems. The usual way for Condor to submit jobs from one pool to another is via flocking.
    Flocking is enabled by configuration within each of the pools. Jobs migrate from one pool to another based on the availability of machines to execute jobs. If the local Condor pool currently does not have any available machines to run a job, the job will flock to another pool. This is not something the user needs to think about - nothing needs to be added or changed in the submit description file.

    To learn more about flocking, Condor-C jobs, glidein, and running Condor alongside other middleware such as Globus, see section 5 of the official Condor manual. Glidein is a mechanism by which one or more grid resources (remote machines) temporarily join a local Condor pool; the program condor_glidein is used to add a machine to a Condor pool.

    To set up flocking, first send the DNS hostname of your Condor central manager (condor_negotiator and condor_collector) to condor-admin@rcac.purdue.edu. RCAC Condor administrators will then allow your Condor pool access to RCAC pools.

    Then, the locations where a job can be executed ('FLOCK_TO'), as well as the locations it can be submitted from ('FLOCK_FROM'), must be identified.
    These variables are set in Part 2 of the condor_config file, located in <path to>/condor/etc/. At Purdue, these variables should be set to:

    	FLOCK_FROM = *.rcac.purdue.edu
    
    and
    	FLOCK_TO = albatross.rcac.purdue.edu,emu.rcac.purdue.edu,egret.rcac.purdue.edu,flamingo.rcac.purdue.edu
    

    Also, the variables 'FLOCK_COLLECTOR_HOSTS', 'FLOCK_NEGOTIATOR_HOSTS', and 'HOSTALLOW_NEGOTIATOR_SCHEDD' should be set (the settings below assume that the condor_collector and condor_negotiator daemons are running on the same machine):

    	FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
    
    	FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
    
    	HOSTALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
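
    The settings above refer to one another through $(NAME) references. The Python sketch below is a simplified model of that substitution, written only to show what HOSTALLOW_NEGOTIATOR_SCHEDD ends up containing; it is not Condor's own configuration parser. The function name expand, the shortened FLOCK_TO list, and the COLLECTOR_HOST value mycm.example.edu are assumptions made for illustration.

```python
import re


def expand(name, config):
    """Recursively replace $(NAME) references with the values they point to.

    A simplified model of macro substitution, not Condor's real parser.
    """
    def substitute(match):
        return expand(match.group(1), config)

    return re.sub(r"\$\((\w+)\)", substitute, config.get(name, ""))


config = {
    # Hypothetical local central manager, for illustration only.
    "COLLECTOR_HOST": "mycm.example.edu",
    # Shortened FLOCK_TO list to keep the output readable.
    "FLOCK_TO": "albatross.rcac.purdue.edu,emu.rcac.purdue.edu",
    "FLOCK_COLLECTOR_HOSTS": "$(FLOCK_TO)",
    "FLOCK_NEGOTIATOR_HOSTS": "$(FLOCK_TO)",
    "HOSTALLOW_NEGOTIATOR_SCHEDD": "$(COLLECTOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)",
}

print(expand("HOSTALLOW_NEGOTIATOR_SCHEDD", config))
```

    In this model the local central manager and every pool named in FLOCK_TO end up in the list of hosts allowed to negotiate with the local scheduler.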
    

    The configuration macros that must be set in the pool receiving the jobs (pool B) are the ones that authorize jobs from the submitting machine (machine A) to flock to pool B.

    The variables below build on 'FLOCK_FROM' and should keep their default values:

    	HOSTALLOW_WRITE_COLLECTOR = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
    	
    	HOSTALLOW_WRITE_STARTD    = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
    	
    	HOSTALLOW_READ_COLLECTOR  = $(HOSTALLOW_READ), $(FLOCK_FROM)
    	
    	HOSTALLOW_READ_STARTD     = $(HOSTALLOW_READ), $(FLOCK_FROM)
    

    Run "condor_reconfig" on your condor servers, and your pool should be configured to flock to and from RCAC.

    Jobs will always try to run locally and only flock to another pool when there are no machines available in the current pool.
    In the past, all jobs using flocking were standard universe jobs. This is no longer so, and it is possible to submit jobs in other universes, but it is necessary to take into account the location of input, output and error files. Since machines in separate pools don't usually have a shared file system, the user needs to use file transfer mechanisms. See section 2.5.4 in the official Condor manual.
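
    The "local first, then flock" behavior described above can be pictured with the following Python sketch. It is a toy model of the decision only, under the assumption that remote pools are tried in the order given in FLOCK_TO; the function name choose_pool and the idle-machine counts are made up for illustration, and real matchmaking is done by the Condor daemons.

```python
def choose_pool(local_idle, flock_to, remote_idle):
    """Pick where a job would run in this toy model.

    local_idle  -- number of available machines in the local pool
    flock_to    -- remote pools, in the order given in FLOCK_TO
    remote_idle -- dict mapping pool name to number of available machines
    """
    if local_idle > 0:
        return "local"
    for pool in flock_to:
        if remote_idle.get(pool, 0) > 0:
            return pool
    return None  # the job stays idle in the queue


flock_to = ["albatross.rcac.purdue.edu", "emu.rcac.purdue.edu"]
# Made-up availability numbers, for illustration only.
remote_idle = {"albatross.rcac.purdue.edu": 0, "emu.rcac.purdue.edu": 12}

print(choose_pool(3, flock_to, remote_idle))
print(choose_pool(0, flock_to, remote_idle))
```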

    Condor-C Job submission

    Job submission is done the same way for Condor-C jobs as for all other Condor jobs. The only thing to remember is that the universe must be 'grid'. There should also be an entry 'grid_resource' in the submit description file, which specifies the remote condor_schedd daemon to which the job should be submitted. The value of 'grid_resource' consists of three fields: 1) the grid type (condor), 2) the name of the remote condor_schedd daemon (the same as the condor_schedd ClassAd attribute Name on the remote machine), and 3) the name of the remote pool's condor_collector. Here is an example submit description file:

    	Universe      = grid
    	Executable    = myjob
    	Output        = myoutput
    	Error         = myerror
    	Log           = mylog
    	
    	grid_resource = condor joe@remotemachine.example.com remotecentralmanager.example.com
    	+remote_jobuniverse = 5
    	+remote_requirements = True
    	+remote_ShouldTransferFiles = "YES"
    	+remote_WhenToTransferOutput = "ON_EXIT"
    	
    	Queue
    

    The remote machine needs to know the attributes of the job. In the submit description file these are specified with the '+' syntax, followed by the string remote_.
    As a minimum, these must be the job's universe and the job's requirements. Most likely there will also be other attributes specific to the job's universe (on the remote pool).

    Note: attributes set with '+' are inserted directly into the job's ClassAd. Specify attributes as they must appear in the job's ClassAd, not the submit description file.
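
    To make the three fields of 'grid_resource' and the '+remote_' attributes concrete, here is a small Python sketch that picks them out of a submit description file. It is an illustration only and handles just simple "name = value" lines; the function name parse_submit and the dictionary keys are assumptions, not part of Condor.

```python
SUBMIT_TEXT = """\
Universe      = grid
Executable    = myjob
grid_resource = condor joe@remotemachine.example.com remotecentralmanager.example.com
+remote_jobuniverse = 5
+remote_requirements = True
Queue
"""


def parse_submit(text):
    """Return (grid_resource fields, attributes forwarded with +remote_)."""
    grid_resource = None
    remote_attrs = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skips commands such as Queue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.lower() == "grid_resource":
            # Three whitespace-separated fields, as described above.
            grid_type, schedd_name, collector = value.split()
            grid_resource = {
                "grid_type": grid_type,
                "schedd_name": schedd_name,
                "collector": collector,
            }
        elif key.startswith("+remote_"):
            remote_attrs[key[len("+remote_"):]] = value
    return grid_resource, remote_attrs


resource, attrs = parse_submit(SUBMIT_TEXT)
print(resource)
print(attrs)
```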

    See section 5.3.1.2 in the official Condor manual for more information and examples.

    Participating when your department already runs Condor


    If you operate Condor already, then you're most of the way there. Condor pools can be joined with Condor's "flocking" mechanism. To set up flocking with RCAC (the parameters below are set in the file 'condor_config'):

    1. Send the DNS hostname of your Condor central manager (condor_negotiator and condor_collector) to condor-admin@rcac.purdue.edu. RCAC Condor administrators will then allow your condor pool access to RCAC pools.
    2. Set the following variables in your condor_config file:

      • FLOCK_TO = albatross.rcac.purdue.edu, emu.rcac.purdue.edu, egret.rcac.purdue.edu, flamingo.rcac.purdue.edu
      • FLOCK_FROM = *.rcac.purdue.edu

    3. Run "condor_reconfig" on your condor servers, and your pool should be configured to flock to and from RCAC.

    Installing Condor

    If your department operates a Condor pool, or has idle workstations, clusters, or labs that could provide computing cycles to Condor, then the Rosen Center for Advanced Computing would like to work with you in order to use Condor to create a campus-wide flock of systems with which to advance scientific discovery.

    Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor, Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.

    While providing functionality similar to that of a more traditional batch queueing system, Condor's novel architecture allows it to succeed in areas where traditional scheduling systems fail. Condor can be used to manage a cluster of dedicated compute nodes (such as a "Beowulf" cluster). In addition, unique mechanisms enable Condor to effectively harness wasted CPU power from otherwise idle desktop workstations. For instance, Condor can be configured to only use desktop machines where the keyboard and mouse are idle. Should Condor detect that a machine is no longer available (such as a key press detected), in many circumstances Condor is able to transparently produce a checkpoint and migrate a job to a different machine which would otherwise be idle. Condor does not require a shared file system across machines - if no shared file system is available, Condor can transfer the job's data files on behalf of the user, or Condor may be able to transparently redirect all the job's I/O requests back to the submit machine. As a result, Condor can be used to seamlessly combine all of an organization's computational power into one resource.

    To get more information about Condor, go to the Condor Homepage and read the official Condor Manual.

    Another good page with information is the Condor tutorials and slides from the 'Condor Boot Camp' that was held at Purdue University.

    Installation


    The installation of Condor depends on your platform and on whether you want to set up a personal Condor, make a new Condor pool, or join an existing pool.

    You can also find some information about Condor and joining Purdue's Condor pool here.

    The first step to installing Condor is to download it. This can also be done in more than one way. The instructions below should cover most variations.

    It may be a good idea to join the condor-world mailing list. Traffic on this list is kept to an absolute minimum. It is only used to announce new releases of Condor. To subscribe, send a message to majordomo@cs.wisc.edu with the body: subscribe condor-world. Another useful mailing list is condor-users. It has quite a lot of traffic and can be used for discussing problems you may have with Condor. To join this list, send an email to majordomo@cs.wisc.edu with the body: subscribe condor-users.

    Windows


    First, download the newest version of Condor MSI from http://www.cs.wisc.edu/condor/downloads/. Click on the latest stable release and fill in name, email, and organization. Then click "I agree".

    On the page you are taken to, there will be a long list of Condor binaries for various OS's. Go down nearly to the bottom and you will find the binary for Windows 2000/XP. Download the MSI-file.

    Then, run the MSI and answer the questions it asks. You will need to have administrator rights to install Condor.

    1. "New or existing pool":
      • Check the box to join an existing pool, at the hostname: egret.rcac.purdue.edu. (If you instead create a new Condor pool, your machine becomes the Central Manager of that pool: it is set up first, and you get to choose a name for the pool. The pool can then be added to RCAC's pools by flocking, but this is not the recommended way of participating if your department doesn't already have a pool.)
    2. "Execute and Submit Behavior" (these options configure how Condor behaves on this machine):
      • Check the "submit jobs" box only if users may want to submit jobs from this machine; otherwise leave it unchecked.
      • RCAC recommends selecting "run jobs when keyboard is idle for 15 mins", as well as "Keep job in memory".
      • If these choices aren't optimal for your environment, customize as required.
    3. "Accounting Domain":
      • Put your own DNS domain here.
    4. "Email settings":
      • Input smtp.purdue.edu, and your own email address.
    5. "Java Setting":
      • The installer should find the path to a JVM.
    6. "Host permissions":
      • Set both read and write to *.purdue.edu
    7. Select where you want it to live on C: (C:\Condor by default, but your environment's standard location is fine)

    Submitting jobs:

    You will need to open the file condor_config and make a few changes to it - setting up CONDOR_HOST and such. If you are using flocking, then you will need to set these parameters as specified under "Participating when your department already runs Condor". See here for an example of the condor_config file.

    After making changes to the condor_config file, you need to run condor_reconfig and store your credentials (stash your password). To do this:

    • Click Start -> Run...
    • Type cmd. Click 'OK'.
    • CD to your Condor directory.
    • Type condor_reconfig -> Enter.
    • Type condor_store_cred add -> Enter.

    You should now be able to submit jobs to Condor, if you enabled that during setup.

    To submit the job, type "condor_submit <full_path_to>\<submit description file>".

    Remember: you can only submit 'vanilla' jobs from Windows (add "Universe = vanilla" to your submit description file).

    There is more installation advice in section 6.2.10 of the Condor manual.

    Linux/Unix


    To install Condor you need to have root access. If you just want to try out a small personal version of Condor on your own machine, then that is possible without root access. This is described further down.

    Java: if you want to be able to run Java jobs with Condor, then you need to have Java installed before installing Condor. If Java is not installed, you can download it from, for example, http://java.sun.com/j2se/1.4.2/download.html.

    Normal Condor install


    To install a normal, full version of Condor you need to have root access.

    There are two ways of downloading and installing: you can download the appropriate binaries directly from http://www.cs.wisc.edu/condor/downloads/, or you can use the "automated" download and install. Both options are described in more detail below.

    After installing Condor, you must then set the JAVA and JAVA_MAXHEAP_ARGUMENT variables in the condor_config file.

    Downloading from http://www.cs.wisc.edu/condor/downloads/

    1. Start out by going to http://www.cs.wisc.edu/condor/downloads/. Click on 'Latest Stable Release'. On the page you are taken to, go to the bottom and fill in name, email, and organization. Choose any mailing lists you may want to subscribe to. Then click "I agree".
    2. On the page you are taken to, there will be a long list of Condor binaries for various OS's. Choose the one that fits your OS best and download it. For Linux, you should choose the dynamically linked version unless you experience problems.
    3. If the version you downloaded is gzipped/tar'ed, then use tar -xzvf <condorbinary>.tar.gz to unpack it.
    4. Contact condor-admin@rcac.purdue.edu before proceeding, and your systems will be added to Condor's allowed hosts lists.
    5. CD into the directory that was created by unpacking your files.
    6. Run the perl script 'condor_configure' to configure and install Condor. There are a number of options which will be written out when you just run 'condor_configure'. If no type or "central manager" is specified, Condor will install as 'personal Condor' and that is probably not what you want. Therefore, start by running 'condor_configure --central-manager=egret.rcac.purdue.edu'. This sets the central manager appropriately.
    7. Thereafter you should run 'condor_configure -install', which tells Condor to install. You can use 'condor_configure -install-dir=<path>' to specify the installation directory.

    There are a number of parameters which should be set in the 'condor_config' file - located in <path/to/Condor_install_directory/etc/>. Look here to see an example. I have marked the places which may need to be changed in your file. Below follows a list of the parameters and what they should be set to (the FLOCK_TO and FLOCK_FROM should only be set if you are using flocking):

    • Under PART 1
      • CONDOR_HOST = egret.rcac.purdue.edu
      • LOCAL_CONFIG_FILE = <path/to/local condor config file/condor_config.local>
      • CONDOR_ADMIN = <Your Condor administrator@purdue.edu>
      • UID_DOMAIN = rcac.purdue.edu
      • FILESYSTEM_DOMAIN = rcac.purdue.edu
      • COLLECTOR_NAME = egret.rcac.purdue.edu
    • Under PART 2
      • FLOCK_FROM = *.rcac.purdue.edu
      • FLOCK_TO = albatross.rcac.purdue.edu,emu.rcac.purdue.edu,egret.rcac.purdue.edu,flamingo.rcac.purdue.edu
      • HOSTALLOW_READ = *.purdue.edu
      • HOSTALLOW_WRITE = *.purdue.edu
    • Under PART 4
      • JAVA = <path/to/java_install_directory>
      • JAVA_MAXHEAP_ARGUMENT = -Xmx (if you are using Sun JVM)

    After everything is installed and configured, you should run "condor_reconfig" on your condor servers. Then you should add the paths to your .cshrc or .bash:

    .cshrc:

    	setenv CONDOR_CONFIG <path/to/Condor_install_directory>/etc/condor_config
    	set path=(<path/to/Condor_install_directory>/bin $path)
    	set path=(<path/to/Condor_install_directory>/sbin $path)
    

    .bash:

    	export CONDOR_CONFIG=<path/to/Condor_install_directory>/etc/condor_config
    	export PATH=<path/to/Condor_install_directory>/bin:${PATH}
    	export PATH=<path/to/Condor_install_directory>/sbin:${PATH}
    

    To check if everything is set up properly, type:

    	echo $CONDOR_CONFIG
    	echo $path
    	which condor_master
    	which condor_submit
    

    These should give the expected results: the first two should repeat the paths that you set (in bash, use echo $PATH rather than echo $path), and the last two should show that the system can find the Condor /bin and /sbin directories as well as the programs in them.
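
    The same checks can be scripted. The Python sketch below looks for the CONDOR_CONFIG environment variable and for condor_master and condor_submit on the search path; the function name check_condor_setup and the wording of the messages are assumptions made for illustration. An empty result means that all three checks passed.

```python
import os
import shutil


def check_condor_setup(environ=None, which=shutil.which):
    """Return a list of problems found; an empty list means the checks passed."""
    environ = os.environ if environ is None else environ
    problems = []
    if not environ.get("CONDOR_CONFIG"):
        problems.append("CONDOR_CONFIG is not set")
    # condor_master lives in sbin and condor_submit in bin, so finding both
    # shows that both directories are on the search path.
    for tool in ("condor_master", "condor_submit"):
        if which(tool) is None:
            problems.append("%s not found on PATH" % tool)
    return problems


for problem in check_condor_setup():
    print(problem)
```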

    Starting Condor:

    After having confirmed that everything is set up correctly, you are ready to start Condor. This is done by typing (as root/with root permission):

    	condor_master
    

    To check that everything is running, type ps x. This should give something like this:

    	armadillo:~> ps x
    	  PID TTY      STAT   TIME COMMAND
    	 4568 ?        S      0:06 condor_schedd
    	 9738 ?        Ss    14:02 condor_master
    	 9739 ?        Ss     1:32 condor_collector -f
    	 9740 ?        Ss     0:35 condor_negotiator -f
    	 9741 ?        Ss     0:12 condor_schedd -f
    	 9742 ?        Ss    12:23 condor_startd -f
    	13093 ?        S      0:00 sshd: bbrydsx@pts/4
    	13095 pts/4    Ss     0:00 -tcsh
    	13123 pts/4    R+     0:00 ps x
    	armadillo:~> 
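
    A quick way to compare such a listing against the daemons you expect is sketched below in Python. The list of expected daemons is taken from the sample output above and is an assumption: a machine that is not the central manager would not be expected to run condor_collector or condor_negotiator, so pass a different list in that case. The function name missing_daemons and the shortened sample text are made up for illustration.

```python
# Daemon names taken from the sample output above (an assumption; adjust
# the list to the role of the machine being checked).
EXPECTED_DAEMONS = (
    "condor_master",
    "condor_collector",
    "condor_negotiator",
    "condor_schedd",
    "condor_startd",
)


def missing_daemons(ps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemons that do not appear in `ps x` output."""
    running = set()
    for line in ps_output.splitlines():
        fields = line.split(None, 4)  # PID, TTY, STAT, TIME, COMMAND
        if len(fields) == 5:
            running.add(fields[4].split()[0])
    return [name for name in expected if name not in running]


# Shortened, made-up listing in which two daemons are absent.
SAMPLE = """\
  PID TTY      STAT   TIME COMMAND
 9738 ?        Ss    14:02 condor_master
 9741 ?        Ss     0:12 condor_schedd -f
 9742 ?        Ss    12:23 condor_startd -f
13123 pts/4    R+     0:00 ps x
"""

print(missing_daemons(SAMPLE))
```

    To check a live system, the text passed in could come from running ps x and capturing its output.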
    

    This installation will need to be usable on every host on which you would like to run Condor jobs, so a shared filesystem or your file distribution method of choice will need to be used to get Condor around your network.

    The installation will use reasonable defaults, but if you'd like to further customize policies for starting, suspending, and preempting jobs on your execute nodes, consult the Condor manual for details.

    "Automated" download and installation

    This is probably the easiest way to install Condor on Linux/Unix. You need root access to install this way. The steps below install Condor via the Virtual Data Toolkit (VDT): http://vdt.cs.wisc.edu/releases/1.3.11/.

    1. First, type 'wget http://physics.bu.edu/pacman/sample_cache/tarballs/pacman-3.16.1.tar.gz'
    2. Then unpack the file you just downloaded: type 'tar xzf pacman-3.16.1.tar.gz'
    3. cd pacman-3.16.1
    4. source setup.sh
    5. cd ..
    6. mkdir vdt
    7. cd vdt
    8. (as root!) Type 'pacman -get http://vdt.cs.wisc.edu/vdt_1311_cache:Condor'. Pacman will download condor and requisites, and will prompt for a number of answers:
      • Do you agree to the licenses? [y/n] y
      • Where would you like to install CA files? (n, do not install)
      • Would you like to enable the Condor batch system to run automatically? (y)
      • Do you want the EDG CRL update daemon to be installed? (n)
    9. You should now edit the condor/etc/condor_config:
      • After the line "## What machine is your central manager?" add CONDOR_HOST = egret.rcac.purdue.edu
      • There may be some other parameters that need changing. Look at this file to see where. I have marked the possible changes in red. The possible changes are also shown in the following list (the FLOCK_TO and FLOCK_FROM should only be set if you are using flocking):
        • Under PART 1
          • CONDOR_HOST = egret.rcac.purdue.edu
          • LOCAL_CONFIG_FILE = <path/to/local condor config file/condor_config.local>
          • CONDOR_ADMIN = <Your Condor administrator@purdue.edu>
          • UID_DOMAIN = rcac.purdue.edu
          • FILESYSTEM_DOMAIN = rcac.purdue.edu
          • COLLECTOR_NAME = egret.rcac.purdue.edu
        • Under PART 2
          • FLOCK_FROM = *.rcac.purdue.edu
          • FLOCK_TO = albatross.rcac.purdue.edu,emu.rcac.purdue.edu,egret.rcac.purdue.edu,flamingo.rcac.purdue.edu
          • HOSTALLOW_READ = *.purdue.edu
          • HOSTALLOW_WRITE = *.purdue.edu
        • Under PART 4
          • JAVA = <path/to/java_install_directory>
          • JAVA_MAXHEAP_ARGUMENT = -Xmx (if you are using Sun JVM)

    After everything is installed and configured, run "condor_reconfig" on your condor servers. Then you should add the paths to your .cshrc or .bash:

    .cshrc:

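    	setenv CONDOR_CONFIG <path/to/Condor_install_directory>/etc/condor_config
    	set path=(<path/to/Condor_install_directory>/bin $path)
    	set path=(<path/to/Condor_install_directory>/sbin $path)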
    

    .bash:

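    	export CONDOR_CONFIG=<path/to/Condor_install_directory>/etc/condor_config
    	export PATH=<path/to/Condor_install_directory>/bin:${PATH}
    	export PATH=<path/to/Condor_install_directory>/sbin:${PATH}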
    

    To check if everything is set up properly, type:

    	echo $CONDOR_CONFIG
    	echo $path
    	which condor_master
    	which condor_submit
    

    These should give the expected results: the first two should repeat the paths that you set (in bash, use echo $PATH rather than echo $path), and the last two should show that the system can find the Condor /bin and /sbin directories as well as the programs in them.

    Starting Condor:

    After having confirmed that everything is set up correctly, you are ready to start Condor. This is done by typing (as root/with root permission):

    	condor_master
    

    To check that everything is running, type ps x. This should give something like this:

    	armadillo:~> ps x
    	  PID TTY      STAT   TIME COMMAND
    	 4568 ?        S      0:06 condor_schedd
    	 9738 ?        Ss    14:02 condor_master
    	 9739 ?        Ss     1:32 condor_collector -f
    	 9740 ?        Ss     0:35 condor_negotiator -f
    	 9741 ?        Ss     0:12 condor_schedd -f
    	 9742 ?        Ss    12:23 condor_startd -f
    	13093 ?        S      0:00 sshd: bbrydsx@pts/4
    	13095 pts/4    Ss     0:00 -tcsh
    	13123 pts/4    R+     0:00 ps x
    	armadillo:~> 
    

    This installation will need to be usable on every host on which you would like to run Condor jobs, so a shared filesystem or your file distribution method of choice will need to be used to get Condor around your network.

    The installation will use reasonable defaults, but if you'd like to further customize policies for starting, suspending, and preempting jobs on your execute nodes, consult the Condor manual for details.

    Personal Condor install


    A Personal Condor installation does not require root access and can be used to try out Condor on your own Linux/Unix machine. It is also possible to connect to other machines using the Condor 'flocking' option. Here is a short description of how to install a personal Condor.

    1. Start out by going to http://www.cs.wisc.edu/condor/downloads/. Click on 'Latest Stable Release'. On the page you are taken to, go to the bottom and fill in name, email, and organization. Choose any mailing lists you may want to subscribe to. Then click "I agree".
    2. On the page you are taken to, there will be a long list of Condor binaries for various OS's. Choose the one that fits your OS best and download it. For Linux, you should choose the dynamically linked version unless you experience problems.
    3. If the version you downloaded is gzipped/tar'ed, then use tar -xzvf <condorbinary>.tar.gz to unpack it.
    4. If you want to use the flocking option, then contact condor-admin@rcac.purdue.edu before proceeding, and your system will be added to Condor's allowed hosts lists.
    5. CD into the directory that was created by unpacking your files.
    6. Run the perl script 'condor_configure' to configure and install Condor. There are a number of options, which are listed when you run 'condor_configure' without arguments. Since you want to install a 'personal Condor', you will be the central manager yourself. Thus, just run 'condor_configure --install'. You can use 'condor_configure -install-dir=<path>' to specify the installation directory.

    There are a number of parameters which may need to be set in the 'condor_config' file - located in <path/to/Condor_install_directory/etc/>. The flocking parameters in particular will need to be set if you wish to use flocking. Look here to see an example. I have marked the places which may need to be changed in your file - most of the parameters you need not bother with, since they are only relevant if you are configuring a full Condor version that is part of the RCAC pools. Below follows a list of the parameters and what they should be set to (the FLOCK_TO and FLOCK_FROM are the important ones if you wish to use flocking):

    1. Under PART 1
      • CONDOR_HOST = <anything>, since you are your own central manager
      • LOCAL_CONFIG_FILE = <path/to/local condor config file/condor_config.local>
      • CONDOR_ADMIN = <Your email@purdue.edu>
    2. Under PART 2
      • FLOCK_FROM = *.rcac.purdue.edu Only if using flocking
      • FLOCK_TO = albatross.rcac.purdue.edu,emu.rcac.purdue.edu,egret.rcac.purdue.edu,flamingo.rcac.purdue.edu Only if using flocking
    3. Under PART 4
      • JAVA = <path/to/java_install_directory>
      • JAVA_MAXHEAP_ARGUMENT = -Xmx (if you are using Sun JVM)

    After everything is installed and configured, you should add the paths to your .cshrc or .bash:

    .cshrc:

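    	setenv CONDOR_CONFIG <path/to/Condor_install_directory>/etc/condor_config
    	set path=(<path/to/Condor_install_directory>/bin $path)
    	set path=(<path/to/Condor_install_directory>/sbin $path)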
    

    .bash:

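    	export CONDOR_CONFIG=<path/to/Condor_install_directory>/etc/condor_config
    	export PATH=<path/to/Condor_install_directory>/bin:${PATH}
    	export PATH=<path/to/Condor_install_directory>/sbin:${PATH}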
    

    To check if everything is set up properly, type:

    	echo $CONDOR_CONFIG
    	echo $path
    	which condor_master
    	which condor_submit
    

    These should give the expected results: the first two should repeat the paths that you set (in bash, use echo $PATH rather than echo $path), and the last two should show that the system can find the Condor /bin and /sbin directories as well as the programs in them.

    Starting Condor:

    After having confirmed that everything is set up correctly, you are ready to start Condor. As noted above, a Personal Condor does not require root access, so this is done by typing (as your own user):

    	condor_master
    
    To check that everything is running, type ps x. This should give something like this:

    	armadillo:~> ps x
    	  PID TTY      STAT   TIME COMMAND
    	 4568 ?        S      0:06 condor_schedd
    	 9738 ?        Ss    14:02 condor_master
    	 9739 ?        Ss     1:32 condor_collector -f
    	 9740 ?        Ss     0:35 condor_negotiator -f
    	 9741 ?        Ss     0:12 condor_schedd -f
    	 9742 ?        Ss    12:23 condor_startd -f
    	13093 ?        S      0:00 sshd: bbrydsx@pts/4
    	13095 pts/4    Ss     0:00 -tcsh
    	13123 pts/4    R+     0:00 ps x
    	armadillo:~> 
    
    The installation will use reasonable defaults, but if you'd like to further customize policies for starting, suspending, and preempting jobs on your execute nodes, consult the Condor manual for details.

    Mac


    The Macintosh port of Condor is more accurately a port of Condor to Darwin, the BSD core of OS X. Condor uses the Carbon library only to detect keyboard activity, and it does not use Cocoa at all. Condor on the Macintosh is a relatively new port, and it is not yet well-integrated into the Macintosh environment.

    Condor on the Macintosh has a few shortcomings:

    • Users connected to the Macintosh via ssh are not noticed for console activity.
    • The memory size of threaded programs is reported incorrectly.
    • No Macintosh-based installer is provided.
    • The example start up scripts do not follow Macintosh conventions.
    • Kerberos is not supported.

    Download and installation:

    1. First, download the newest version of Condor from http://www.cs.wisc.edu/condor/downloads/. Click on the latest stable release and fill in name, email, and organization. Then click "I agree".
    2. On the page you are taken to, there will be a long list of Condor binaries for various OS's. The very first one is the binary for Mac. Download the .tar.gz-file.
    3. Before you do anything further, you should contact condor-admin@rcac.purdue.edu, and your system will be added to Condor's allowed hosts lists.
    4. Then, unpack the downloaded file: tar -xzvf <condorbinary>.tar.gz.
    5. CD into the directory which was just created.
    6. Run the perl script condor_configure (you will need perl). You will probably want the option 'condor_configure -install', which tells Condor to install. To set the central manager first, run 'condor_configure --central-manager=egret.rcac.purdue.edu' before installing. You can use 'condor_configure -install-dir=<path>' to specify the installation directory.

    After installation you need to make some changes to the condor_config file. It is located in the directory <condor_install_directory>/etc/. The parameters should be set to (the FLOCK_TO and FLOCK_FROM should only be set if you are using flocking):

    • Under PART 1
      • CONDOR_HOST = egret.rcac.purdue.edu
      • LOCAL_CONFIG_FILE = <path/to/local condor config file/condor_config.local>
      • CONDOR_ADMIN = <Your Condor administrator@purdue.edu>
      • UID_DOMAIN = rcac.purdue.edu
      • FILESYSTEM_DOMAIN = rcac.purdue.edu
      • COLLECTOR_NAME = egret.rcac.purdue.edu
    • Under PART 2
      • FLOCK_FROM = *.rcac.purdue.edu
      • FLOCK_TO = albatross.rcac.purdue.edu,emu.rcac.purdue.edu,egret.rcac.purdue.edu,flamingo.rcac.purdue.edu
      • HOSTALLOW_READ = *.purdue.edu
      • HOSTALLOW_WRITE = *.purdue.edu
    • Under PART 4
      • JAVA = <path/to/java_install_directory>
      • JAVA_MAXHEAP_ARGUMENT = -Xmx (if you are using Sun JVM)

    After making changes to the condor_config file, you need to run condor_reconfig.
    You should now be able to submit jobs to Condor. This is done by first writing and compiling your program and then making a 'submit description file'. See here for an example.

    To submit the job, type "condor_submit <full_path_to>/<submit description file>".

    Radon Frequently Asked Questions (FAQ)

    There are currently no FAQs for Radon.