BoilerGrid is a large, high-throughput, distributed computing system provided by RCAC and using the Condor system developed by the Condor Project at the University of Wisconsin. BoilerGrid provides a means for users to run programs on large numbers of otherwise idle computers in various locations, including both high-performance resources momentarily under-utilized and desktop lab machines not currently in use. Whenever a local user or scheduled job needs a given machine, the Condor job is stopped and sent to another Condor node as soon as possible. Because this model limits the ability to accomplish parallel processing and communications, RCAC decided to limit access to smaller, serial jobs. Condor jobs can be submitted from most of the RCAC systems (Gray, Pete, Prospero, Radon, Rossmann, Steele, Venice). You may also install Condor on your own desktop machine, and submit from that.
BoilerGrid scavenges cycles from nearly all RCAC systems, including community clusters, specialized systems, and the recycled cluster. BoilerGrid also uses idle time of machines in student labs on the Purdue West Lafayette campus, the Purdue Calumet campus and the University of Notre Dame. Whenever the normal scheduling system on these machines sends a job to a node, Condor preempts or (if possible) checkpoints its work, then immediately surrenders the node to the scheduled job.
BoilerGrid currently consists of over 20,000 processors. Of these, about 10,500 are Linux/x86_64, approximately 600 are Linux/Intel (ia32), and approximately 11,000 are WinNT51/Intel. There are also small numbers of Itanium Linux, Solaris and Mac OSX nodes. Memory on compute nodes ranges from 512 MB to 32 GB, and most processors run at 3 GHz or faster. With a total of over 60 TFLOPS available, BoilerGrid can provide large numbers of cycles in a short amount of time. All shared areas and software packages available on the RCAC systems are available on Condor. Condor is designed for high-throughput computing and is excellent for parameter sweeps, Monte Carlo simulations, or nearly any serial application.
| Owner | Arch/OS | Processors |
|---|---|---|
| ITaP - RCAC | x86_64/Linux | ~10500 |
| ITaP - RCAC | Intel/Linux | ~660 |
| ITaP - Envision Center | Intel/Linux | 48 |
| ITaP - Teaching & Learning | Intel/WinNTXX | ~9300 |
| Purdue Calumet | Intel/WinNT51 | ~250 |
| Notre Dame CSE | Intel/Linux, Sun4u/Solaris28, PPC/OSX, x86_64/Linux | ~230 |
| Purdue Biology, Libraries, & other ITaP | Intel/Linux, Intel/WinNT51 | 187 |
BoilerGrid currently runs the latest stable release of Condor: 7.0.1. BoilerGrid status may be monitored using CondorView.
Purdue faculty, staff, and students with the approval of their advisor may request access to BoilerGrid using the online Research Computing Account Request Form. However, if you have an account on Radon or any of the RCAC Community Clusters (Steele, Venice, Rossmann, Prospero, Pete), you have automatically been given access to BoilerGrid already.
To issue jobs on BoilerGrid, users may log in to the front-end host condor.rcac.purdue.edu via SSH, or submit using "condor_submit" directly from Radon or any of the Community Clusters (Steele, Venice, Rossmann, Prospero, Pete).
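As a rough illustration of what is handed to "condor_submit", the sketch below writes a minimal Condor submit description file for a serial parameter sweep. The function name write_sweep_submit, the program name, and the output file names are hypothetical, and the vanilla universe (which does not checkpoint) is an assumption; adapt all of them to your own job.

```shell
#!/bin/bash
# Write a minimal Condor submit description file for a serial sweep.
# Usage: write_sweep_submit <executable> <number_of_jobs> <submit_file>
# All file names below are illustrative assumptions, not site defaults.
write_sweep_submit() {
    local exe=$1 njobs=$2 outfile=$3
    # \$(Process) is escaped so that Condor, not the shell, expands it.
    # Condor replaces it with the job index 0 .. njobs-1.
    cat > "$outfile" <<EOF
universe   = vanilla
executable = $exe
arguments  = \$(Process)
output     = sweep.\$(Process).out
error      = sweep.\$(Process).err
log        = sweep.log
queue $njobs
EOF
}
```

For example, "write_sweep_submit ./myprogram 100 sweep.submit" followed by "condor_submit sweep.submit" would queue 100 independent runs, each receiving its own index as its only argument.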
All access to the RCAC systems must be through secure (encrypted) connections. Standard telnet and FTP are not supported. SSH, SCP, and SFTP may be used instead.
Secure Shell or SSH is a way of establishing a secure channel between a local and a remote computer. It uses public-key cryptography to authenticate the remote computer and (optionally) to allow the remote computer to authenticate the user. It is usually used to log in to a remote machine and execute commands similar to telnet, but it also supports tunneling and forwarding of X11 or arbitrary TCP connections. The associated SFTP and SCP protocols may be used to transfer files. There are many SSH clients available, depending on the operating system you use.
Linux / Solaris / AIX / HP-UX / Unix:
Microsoft Windows:
Mac OS X:
SSH can be used in conjunction with many different means of authentication. One popular authentication method is called Public Key Authentication (PKA). PKA is a method of establishing your identity to a remote computer using related sets of encryption data called keys. PKA is a more secure alternative to traditional password-based authentication with which you are probably familiar.
To employ PKA via SSH, you manually generate a keypair (also called SSH keys) in the location from where you wish to initiate a connection to a remote machine. This keypair consists of two text files, one which is called a private key and one which is called a public key. You keep the private key file confidential on your local machine or local home directory (hence the name "private" key). You then login to a remote machine (if possible) and append the corresponding public key text to the end of a specific file, or have a system administrator do so on your behalf. In future login attempts, the public and private keys are compared to verify your identity, which then grants you access to the remote machine.
As a user, you can create, maintain, and employ as many keypairs as you wish. If you connect to a computational resource from your work laptop, your work desktop, and your home desktop, you can create and employ keypairs on each. You can also create multiple keypairs on a single local machine to serve different purposes, such as establishing access to different remote machines, or establishing different types of access to a single remote machine. In short, PKA via SSH offers a secure but flexible means of identifying yourself to all kinds of computational resources.
When you create a keypair, you are prompted to provide a passphrase for the private key. This passphrase differs from a password in a number of ways. First, a passphrase is, as the name implies, a phrase. It can include most types of characters, including spaces, and has no limits on length. Second, this passphrase is not transmitted to the remote machine for verification. It is used only to unlock your local private key, and it applies only to that one private key.
Perhaps you are wondering why you would need a private key passphrase at all when using PKA. If the private key is kept secure, why the need for a passphrase just to use it? Indeed, if the location of your private keys were always completely secure, a passphrase might not be needed. In reality, a number of situations could arise in which someone may improperly gain access to your private key files. In these situations, a passphrase offers another level of security for you, the user who created the keypair.
Think of the private key/passphrase combination as being analogous to your ATM card/PIN combination. The ATM card itself is the object that grants access to your important accounts, and as such, should be kept secure at all times—just as a private key should. But if you ever lose your wallet or your ATM card is stolen, you are glad that your PIN exists to offer you another level of protection. The same is true for a private key passphrase.
When you create a keypair, you should always provide a corresponding private key passphrase. For security purposes, avoid using phrases that would be guessed by automated programs (e.g. phrases that consist solely of words in English-language dictionaries). This passphrase can never be recovered if forgotten, so make note of it. There are only limited situations when the use of a non-passphrase-protected private key is warranted—conducting automated file backups is one such situation. If you need to use a non-passphrase-protected private key to conduct automated backups to Fortress, see the No-Passphrase SSH Keys section.
SSH supports tunneling of X11 (X-Windows), so X11 applications may be run on the machine you are using to issue jobs to BoilerGrid. However, running an X11 application via Condor is not possible.
If you have received a default password as part of the process of obtaining your account, you should change it immediately when you log on for the first time. This can be done from any terminal/SSH session with the command "passwd". You will have the same password on all RCAC systems. If you change your password on any one RCAC system, it will change on all RCAC systems.
If you already have a Purdue career account, then you will initially be given the same userid and password as your career account. There is no need to change your career account password because you have received an account on RCAC systems.
There is not currently any requirement regarding how often you must change your password within RCAC, but for security reasons changing a password every six months, preferably every three months, is good practice.
All passwords should:
Never share your password with another user or make your password known to anyone else. Systems staff will NEVER ask for your password, by email or otherwise.
There is no local mail delivery available on BoilerGrid. All email sent to BoilerGrid will be forwarded to mail.rcac.purdue.edu for delivery.
When your account is activated, your default shell will probably be set to tcsh—an enhanced version of the Berkeley UNIX C shell (csh). The tcsh shell is completely compatible with the standard csh, and all csh commands and scripts work unedited with tcsh. For more details on tcsh, enter "man tcsh" while logged in.
The other popular shell is GNU Bourne-Again SHell (bash), which is completely compatible with the Bourne shell (sh). For more details on bash, enter "man bash" while logged in.
To change your shell temporarily or to try out another shell, just type the shell name as a command ("bash", "tcsh", "ksh"). This will run the new shell as a subshell. To return to your original shell, simply type exit.
To permanently change your login shell, use the command chsh:
$ chsh -s bash
(or)
$ chsh -s tcsh
To see a list of all available shells:
$ chsh -l
The next time you log on, you will start in the new shell. However, you may switch back at any time.
File storage options on RCAC systems include home directories, scratch file systems, /tmp, and long-term or permanent storage. Each of these has different performance and intended uses, and some vary from system to system as well. Home directories and long-term storage are backed up nightly, but scratch and /tmp are not and may occasionally be purged without warning. Below is more detail about each of these storage options.
Your home directory is the default directory you are placed in when you log in.
You should use this space for storing files you want to keep long term such as source code, scripts, input data sets, etc. It should also be used for files you want to keep and which you use often. The home directory will physically reside on the BlueArc NFS Server. You can find the path to your home directory by logging in, and typing pwd:
$ pwd
/home/ba01/u103/myusername
The second component of the reply indicates the name of the host where your home directory physically resides. In this example, the home directory is on the RCAC home directory file server named "ba01" under area "u103". This will vary from person to person. Remember, you can always check where your home directory is located by doing a pwd command in your home directory.
Regardless of its physical location, your home directory and its contents are available on almost all the RCAC front-end hosts and their nodes via the Network File System (NFS). The only exception is Black.
Note that your home directory has a quota capping the size and/or number of files you may store within. For more information, refer to the Storage Quotas / Limits Section.
Scratch directories are provided by RCAC and are intended for short-term file storage only.
Backups are not performed on scratch directories. In the event of a disk crash or file purge, files in scratch directories cannot be recovered. Please be sure to copy any important files to more permanent storage.
All files stored in RCAC scratch directories older than 90 days will be automatically removed (purged). Owners of these files will be notified one week before removal via email. For more information, please refer to our Scratch File Purging Policy.
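To see which files may be approaching the purge threshold, something like the following sketch can help. It assumes the purge is based on file modification time, which this guide does not state (the Scratch File Purging Policy is the authority), and the function name list_old_files is hypothetical.

```shell
#!/bin/bash
# List regular files under a directory whose modification time is more
# than <days> days in the past.
# Usage: list_old_files <directory> <days>
# Assumption: age is judged by modification time (find -mtime).
list_old_files() {
    local dir=$1 days=$2
    find "$dir" -type f -mtime +"$days" -print
}
```

Running "list_old_files $RCAC_SCRATCH 80" would list files already older than 80 days, leaving some margin to copy them elsewhere before they reach 90 days.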
RCAC scratch directories are provided by a central BlueArc server and are accessible from most RCAC systems. There are two primary scratch file systems: scratch95 and scratch96. A scratch directory already exists for all BoilerGrid users. Your RCAC scratch directory is located under scratch95 or scratch96 within a subdirectory by the first letter of your username.
To find the path to your RCAC scratch directory, run myscratch:
$ myscratch
/scratch/scratch96/m/myusername
The variable $RCAC_SCRATCH is also set to your RCAC scratch directory path. Use this variable in any scripts. Your actual scratch directory path may change without warning, but this variable will remain current.
$ echo $RCAC_SCRATCH
/scratch/scratch96/m/myusername
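One way to follow this advice in a script is to build every scratch path from the variable and to stop early if it is missing. The sketch below is only an illustration; the function name scratch_workdir and the project name are hypothetical.

```shell
#!/bin/bash
# Print the path of a per-project working directory under scratch,
# creating it if needed. Returns non-zero if RCAC_SCRATCH is unset
# or empty, so that no path is ever built from an empty prefix.
# Usage: scratch_workdir <project_name>
scratch_workdir() {
    local project=$1
    if [ -z "${RCAC_SCRATCH:-}" ]; then
        echo "RCAC_SCRATCH is not set" >&2
        return 1
    fi
    mkdir -p "$RCAC_SCRATCH/$project" && printf '%s\n' "$RCAC_SCRATCH/$project"
}
```

A job script could then use "workdir=$(scratch_workdir myproject) || exit 1" and refer only to "$workdir" afterwards, so a change in the underlying scratch path requires no edits.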
To find the path to someone else's RCAC scratch directory, use the command findscratch:
$ findscratch someuser
/scratch/scratch95/s/someuser
Note that your RCAC scratch directory has a quota capping the size and/or number of files you may store within. For more information, refer to the Storage Quotas / Limits Section.
The /tmp directory is intended for temporary files that are used during the execution of a process or job or while you examine files created by your jobs. Used properly, /tmp may provide faster local storage to an active process than any other storage option. However, do not use it for longer-term storage or critical results.
Files stored in /tmp are not backed up and are removed whenever space is low or whenever the system is rebooted. In the event of a loss, files in /tmp cannot be recovered, so use it only for files that can be recreated relatively easily.
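A common pattern for using /tmp responsibly is to give each run its own private directory and to remove it when the run finishes. The following is a minimal sketch of that idea; the function name with_tmp_workdir and the WORKDIR variable are hypothetical names chosen for this example.

```shell
#!/bin/bash
# Run a command with a private working directory under /tmp (or under
# $TMPDIR if set), then remove that directory whether or not the
# command succeeded. The command sees the path in $WORKDIR.
# Usage: with_tmp_workdir <command> [args...]
with_tmp_workdir() {
    local workdir status
    workdir=$(mktemp -d "${TMPDIR:-/tmp}/job.XXXXXX") || return 1
    WORKDIR=$workdir "$@"
    status=$?
    rm -rf "$workdir"
    # Pass the command's exit status back to the caller.
    return $status
}
```

Any results worth keeping should be copied out of "$WORKDIR" by the command itself before it returns, since the directory is deleted immediately afterwards.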
Long-term Storage or Permanent Storage is available to RCAC users on the DXUL/UniTree archival storage system, commonly referred to as "Fortress". DXUL (DiskXtender for Unix and Linux) and UniTree are a software package that manages a hierarchical storage system. Program files, data files and any other files which are not used often, but which must be saved, can be put in permanent storage. Fortress currently has a 1.2 PB capacity. However, since two copies are retained for every file, the usable capacity is only 600 TB.
Recently used files smaller than 0.5 MB have their primary copy stored on low-cost disks, but the second copy is on tape or optical disks. This provides a rapid restore time to the disk cache. However, the large latency to access a larger file (usually involving a copy from a tape cartridge) makes it unsuitable for use as active storage.
In addition to poor performance, these two uses can cause severe problems with the system itself:
Do not use Fortress as a second home directory. Instead, use tar or some similar archive tool to combine all the smaller files you wish to store into a single large file first.
For active data storage you should use either local storage or a scratch file system. You may then copy any results you wish to archive to Fortress when computation is complete.
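The advice to combine small files before archiving can be sketched as follows. This is only an illustration; the function name bundle_dir is hypothetical, and it performs just a basic readability check on the resulting archive.

```shell
#!/bin/bash
# Bundle a directory into one gzip-compressed tar archive, then confirm
# the archive can be listed back.
# Usage: bundle_dir <directory> <archive.tar.gz>
bundle_dir() {
    local dir=$1 archive=$2
    # -C changes into the parent directory first, so that archive
    # members are stored relative to it rather than with a full path.
    tar czf "$archive" -C "$(dirname "$dir")" "$(basename "$dir")" || return 1
    tar tzf "$archive" > /dev/null
}
```

For example, "bundle_dir $RCAC_SCRATCH/myproject/results results.tar.gz" produces a single file that can then be copied to Fortress in one transfer.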
Fortress is directly accessible (via FTP, SSH, SCP, SFTP, and NFS) from all RCAC systems, as well as most systems in ECN and CS and from several other major servers on campus. To access Fortress in any way other than NFS, you must login to fortress.rcac.purdue.edu. RCAC has more information about Fortress, including how to obtain a Fortress account and how to access your files on Fortress.
There are a variety of ways to manually transfer files to your Fortress home directory for long-term storage.
You can use an SCP client to interactively transfer individual files and directories to Fortress. More information on SCP can be found in the File Transfer - SCP section of this guide.
You can use an SFTP client to interactively transfer individual files and directories to Fortress. More information on SFTP can be found in the File Transfer - SFTP section of this guide.
In the absence of NFS access to Fortress, you must login to fortress.rcac.purdue.edu to transfer files to long-term storage. There are limited situations where the use of a password or a passphrase-protected authentication keypair becomes impractical, and running scripted file backups to Fortress happens to be one of them. When you attempt to establish a connection to Fortress, you will literally be prompted to input a password or a local private key passphrase. Any time a script or automated process needs to establish the connection, it is unable to respond to such a request. To enable truly automated transfer of files to Fortress, you need to employ public key authentication via SSH with a non-passphrase-protected private key. For a conceptual overview of public key authentication, see the SSH Keys section of this guide.
Now, if your home directory is compromised and an attacker obtains your non-passphrase-protected private key, the attacker will be able to masquerade as you on machines that contain the corresponding public key. Luckily, certain usage restrictions can be customized for each keypair you employ. For example, you could create a non-passphrase-protected keypair and later specify that this public key shall only be used to run a file-backup script, and additionally, is only valid when connecting from a specific machine. Then, if the non-protected private key were to be compromised, the attacker would be saddened to realize that he could only run your file-backup script repeatedly.
It is very important to place a passphrase on all of your generated keypairs. Only use non-protected keypairs when absolutely necessary.
Here is how to set up a non-passphrase-protected keypair for use with automated backup scripts to Fortress from BoilerGrid.
$ ssh-keygen -t rsa -N "" -f ~/.ssh/mykeypairname
The ssh-keygen command should have created the following files:
$ ls ~/.ssh/mykey*
mykeypairname  mykeypairname.pub
The first file is the private key. The second file is the public key counterpart.
from="*.rcac.purdue.edu",no-port-forwarding,no-agent-forwarding,no-X11-forwarding,no-pty
This tells SSH to only allow connections from RCAC resources, to disable a number of forwarding functions, and to not allow interactive shell commands, respectively.
$ scp ~/.ssh/mykeypairname.pub myusername@fortress.rcac.purdue.edu:~/
$ ssh myusername@fortress.rcac.purdue.edu
$ cd ~/.ssh/
$ chmod 600 ~/.ssh/authorized_keys
If it does not exist, create it:
$ touch ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys
$ cat ~/mykeypairname.pub >> ~/.ssh/authorized_keys
$ cat ~/.ssh/authorized_keys
from="*.rcac.purdue.edu",no-port-forwarding,no-agent-forwarding,no-X11-forwarding,no-pty ssh-rsa AABBB3NzaC1yc2EABBABIwAAAIEA3SXgmvos4jFLVFLRrh6YrN3s8FuBOUTCJ0NIsc+FtFrSGD2bVV6yMCgpdgz9RZS7U5uTJOW2VBWsJSb6cjjnA2WJzDcS0bEU3lw+TJszv2sEfl/CwF6dyj2U2k5VrXIpdosZVKyjoqzQXhFicIRv1/ykdO8xp+qcgc09NbcyGhs= myusername@resource.rcac.purdue.edu
$ rm ~/mykeypairname.pub
$ exit
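The restriction prefix used in this procedure must end up on the same line as the public key, separated from it by a single space. As a small sketch of that step, the helper below prints such a line from an options string and a public key file; the function name restricted_key_line is hypothetical.

```shell
#!/bin/bash
# Print an authorized_keys entry: the restriction options, one space,
# then the public key text exactly as stored in the key file.
# Usage: restricted_key_line <options> <public_key_file>
restricted_key_line() {
    local options=$1 pubkey_file=$2
    printf '%s %s\n' "$options" "$(cat "$pubkey_file")"
}
```

The output could be appended to "authorized_keys" with ">>" in place of the unrestricted public key. The options string must not contain spaces outside of quoted values.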
If you have followed the instructions in the No-Passphrase SSH Keys section to employ an unprotected SSH keypair between BoilerGrid and Fortress, you can automate the backup process using backup scripts. Because of the restrictions you placed upon the public key, you cannot use this keypair to log on to an interactive SSH session on Fortress, but you can use it to send files from your BoilerGrid home directory to Fortress via SCP, or to run local scripts that employ SCP.
Since you can have multiple private keys on BoilerGrid (and similarly, multiple public keys in any given "authorized_keys" file on Fortress), you always need to specify which keypair you intend to employ for a log-in attempt to Fortress. The most consistent way to do this is with SSH's "-o" flag. This passes options to configure SSH and can be used with all programs that use SSH for providing a secure connection (e.g. SCP, SFTP, and RSYNC).
To test automated SCP authentication from BoilerGrid to Fortress, use the following command:
$ scp -o IdentityFile=~/.ssh/mykeypairname ./mylocalfile myusername@fortress.rcac.purdue.edu:~/myremotefile
If this works (i.e. you are not prompted for a passphrase or login password), you can move on to implementing a script using SCP commands like the one above.
While only you can ultimately decide the best approach for your automated backup process, the example scripts below show, in general, how to employ backup scripts on BoilerGrid using SCP commands and public key authentication via SSH. The following bash script, named "fortress_backup_script_scp", uses SCP to recursively copy two directories on a user's BoilerGrid home directory to the user's Fortress home directory:
#!/usr/local/bin/bash
# A script to use SCP to copy
# whole directories to Fortress
# Define some parameters
user=myusername
remotehost=fortress.rcac.purdue.edu
idfile=~/.ssh/mykeypairname
# Manually populate an array of directories on the
# local machine we wish to back up on Fortress
localdir[0]=~/mydir2backup
localdir[1]=~/mydir2backup_also
# Get the number of directories to be backed up
numdirs=${#localdir[*]}
count=1
# Loop over every entry in the "localdir" array to
# copy each directory recursively to a folder of
# the same name in our home directory on Fortress.
printf "\n>> Starting Secure Copy backup to Fortress\n"
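for dir in "${localdir[@]}"
do
    # Pass values as printf arguments so that a "%" or "\" in a
    # directory name is not interpreted as a format directive.
    printf ">> Copying directory %s to Fortress (%s of %s)\n" "$dir" "$count" "$numdirs"
    # Quote the variables so paths containing spaces survive intact.
    scp -r -o IdentityFile="$idfile" "$dir" "$user@$remotehost:~/"
    count=$((count + 1))
done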
printf ">> Done...\n\n"
The output for this script is as follows:
$ ./fortress_backup_script_scp
>> Starting Secure Copy backup to Fortress
>> Copying directory /home/ba01/u100/myusername/mydir2backup to Fortress (1 of 2)
bigfile2.tar.gz 100% 121MB 30.3MB/s 00:04
bigfile1.tar.gz 100% 121MB 40.5MB/s 00:03
>> Copying directory /home/ba01/u100/myusername/mydir2backup_also to Fortress (2 of 2)
bigfile4.tar.gz 100% 121MB 40.5MB/s 00:03
bigfile3.tar.gz 100% 121MB 40.5MB/s 00:03
>> Done...
By using these techniques, you can automate your file backups to Fortress safely and efficiently.
If you have followed the instructions in the No-Passphrase SSH Keys section to employ an unprotected SSH keypair between BoilerGrid and Fortress, you can automate the backup process using backup scripts. Because of the restrictions you placed upon the public key, you cannot use this keypair to log on to an interactive SSH session on Fortress, but you can use it to send files from your BoilerGrid home directory to Fortress via SFTP or to run local scripts that employ SFTP.
Since you can have multiple private keys on BoilerGrid (and similarly, multiple public keys in any given "authorized_keys" file on Fortress), you always need to specify which keypair you intend to employ for a log-in attempt to Fortress. The most consistent way to do this is with SSH's "-o" flag. This passes options to configure SSH and can be used with all programs that use SSH for providing a secure connection (e.g. SCP, SFTP, and RSYNC).
To test automated SFTP authentication from BoilerGrid to Fortress, use the following command:
$ sftp -o IdentityFile=~/.ssh/mykeypairname myusername@fortress.rcac.purdue.edu
sftp> bye
$
If this works (i.e. you are not prompted for a passphrase or login password), you can move on to implementing a script using SFTP commands like the one above.
While only you can ultimately decide the best approach for your automated backup process, the example scripts below show, in general, how to employ backup scripts on BoilerGrid using SFTP commands and public key authentication via SSH. The following bash script, named "fortress_backup_script_sftp", uses SFTP commands to navigate through Fortress directories, and pushes files from the user's BoilerGrid home directory when needed.
#!/usr/local/bin/bash
# A script to use SFTP to push files to
# Fortress for backup.
# Set up some parameters
user=myusername
remotehost=fortress.rcac.purdue.edu
idfile=~/.ssh/mykeypairname
printf "\n>> Starting Secure FTP backup session to Fortress\n"
# Invoke SFTP mode, specifying the correct private key,
# and forcing batch file input from a "here-document"
# (i.e. the rest of this script).
sftp -o IdentityFile=$idfile -b - $user@$remotehost << EOF

cd ./mydir2backup
lcd ./mydir2backup

put -P ./bigfile1.tar.gz
put -P ./bigfile2.tar.gz

cd ../mydir2backup_also
lcd ../mydir2backup_also

put -P ./bigfile3.tar.gz
put -P ./bigfile4.tar.gz

bye
EOF
# Now we are back to the bash shell...
printf ">> Done...\n\n"
The output for this script is as follows:
$ ./fortress_backup_script_sftp
>> Starting Secure FTP backup session to Fortress
sftp>
sftp> cd ./mydir2backup
sftp> lcd ./mydir2backup
sftp>
sftp> put -P ./bigfile1.tar.gz
Uploading ./bigfile1.tar.gz to /archive/fortress/home/myusername/mydir2backup/bigfile1.tar.gz
sftp> put -P ./bigfile2.tar.gz
Uploading ./bigfile2.tar.gz to /archive/fortress/home/myusername/mydir2backup/bigfile2.tar.gz
sftp>
sftp> cd ../mydir2backup_also
sftp> lcd ../mydir2backup_also
sftp>
sftp> put -P ./bigfile3.tar.gz
Uploading ./bigfile3.tar.gz to /archive/fortress/home/myusername/mydir2backup_also/bigfile3.tar.gz
sftp> put -P ./bigfile4.tar.gz
Uploading ./bigfile4.tar.gz to /archive/fortress/home/myusername/mydir2backup_also/bigfile4.tar.gz
sftp>
sftp> bye
>> Done...
$
By using these techniques, you can automate your file backups to Fortress safely and efficiently.
There are many environment variables related to storage locations and paths which are automatically set for you upon log in, or may be changed if necessary.
Use environment variables instead of actual paths whenever possible to avoid problems if the specific paths to any of these change. Some of the environment variables you should have are:
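Environment variable names are conventionally all uppercase, and you reference a variable's value by prefixing its name with the dollar sign ($). These may be used on the command line or in any scripts in place of and in combination with hard-coded values. When a variable is followed directly by other text, enclose its name in braces so the shell knows where the name ends:
$ ls $HOME
...
$ ls $RCAC_SCRATCH/myproject
...
$ ls $RCAC_SCRATCH/myproject/${HOSTNAME}_data
...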
You may find the value of any environment variable by using the echo command:
$ echo $RCAC_SCRATCH
/scratch/scratch95/m/myusername
$ echo $SHELL
/usr/local/bin/tcsh
You may list the values of all environment variables using the env command:
$ env
USER=myusername
HOME=/home/ba01/u101/myusername
RCAC_SCRATCH=/scratch/scratch95/m/myusername
SHELL=/usr/local/bin/tcsh
...
You may create or overwrite an environment variable using either export or setenv, depending on your shell:
(for bash and sh)
$ export VARIABLE=value
(for tcsh and csh)
% setenv VARIABLE value
Your disk usage is limited on RCAC systems. However, each filesystem (scratch, home directory, etc.) may have a different limit. If you exceed the soft limit or quota, you will see warnings whenever writing to the disk that you are over quota, but the write will still succeed. If you exceed the hard limit or limit, your write will fail until you either remove other files or your quota is increased. Generally, RCAC systems do not impose a soft limit—only a hard limit.
You may find out what your current quota is by using the quota command:
$ quota
Disk quotas for user myusername (uid 12345):
Filesystem blocks quota limit grace files quota limit grace
ba01:/u103 2346272 0 5000000 17508 0 65535
The columns are as follows:
You may also see the disk usage of any given directory by using du:
$ du -hs
1.1G .
$ du -hs $HOME
138M /home/ba01/u103/myusername
This can be very helpful in figuring out where your largest files or directories are, so that you may clean out unneeded large files and avoid hitting your quota.
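To narrow down where the space is going, du can be combined with sort. The sketch below is one possible approach; the function name largest_dirs is hypothetical, and it relies on the -mindepth and -maxdepth options of find, which are common extensions rather than part of every find implementation.

```shell
#!/bin/bash
# Print the N largest immediate subdirectories of a directory, largest
# first, with sizes in kilobytes.
# Usage: largest_dirs <directory> [count]     (count defaults to 10)
largest_dirs() {
    local dir=$1 count=${2:-10}
    find "$dir" -mindepth 1 -maxdepth 1 -type d -exec du -sk {} + \
        | sort -rn \
        | head -n "$count"
}
```

For example, "largest_dirs $HOME 5" lists the five largest directories directly under your home directory.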
If you find you need additional disk space on an RCAC account, please first consider archiving and compressing old files and moving them to long-term storage. If this option does not resolve the issue, you may send an email to rcac-help@purdue.edu and request additional space.
There are several options for archiving and compressing groups of files or directories on RCAC systems. All of the following tools are provided:
(compress file somefile.c)
$ zip somefile.zip somefile.c
(extract contents of somefile.zip)
$ unzip somefile.zip
(compress all files in a directory into one archive file)
$ zip -r somefile.zip somedirectory/
(compress all ".c" files in current directory into one archive file)
$ zip -r somefile.zip . -i \*.c
(archive file somefile.c)
$ tar cvf somefile.tar somefile.c
(archive and compress file somefile.c)
$ tar czvf somefile.tar.gz somefile.c
(list contents of archive somefile.tar)
$ tar tvf somefile.tar
(extract contents of somefile.tar)
$ tar xvf somefile.tar
(extract contents of gzipped archive somefile.tar.gz)
$ tar xzvf somefile.tar.gz
(archive and compress all files in a directory into one archive file)
$ tar czvf somefile.tar.gz somedirectory/
(archive and compress all ".c" files in current directory into one archive file)
$ tar czvf somefile.tar.gz *.c
(compress file somefile - also removes uncompressed file)
$ gzip somefile
(uncompress file somefile.gz - also removes compressed file)
$ gunzip somefile.gz
(compress file somefile - also removes uncompressed file)
$ bzip2 somefile
(uncompress file somefile.bz2 - also removes compressed file)
$ bunzip2 somefile.bz2
Windows users can work with these same formats using some of the following software:
There are a variety of ways to transfer data to and from RCAC systems. Which you should use depends on several factors, including the ease of use for you personally, connection speed and bandwidth, and the size and number of files to be transferred.
FTP (File Transfer Protocol) is a simple data transfer mechanism. FTP was not designed to provide secure communications, and so FTP is no longer supported on any RCAC systems. However, most modern FTP clients also support SFTP or SCP, which are similar, secure protocols for file transfer. Use one of the other methods described here instead of FTP.
SCP (Secure CoPy) is a simple way of transferring files between two machines that use the SSH (Secure SHell) protocol. You may use SCP to connect to any system where you have SSH (log-in) access. SCP is available as a protocol choice in some graphical file transfer programs and also as a command line program on most Linux, Unix, and Mac OS X systems. SCP can copy single files, but will also recursively copy directory contents if given a directory name.
Command-line usage:
(to a remote system from local)
$ scp sourcefilename myusername@hostname:somedirectory/destinationfilename
(from a remote system to local)
$ scp myusername@hostname:somedirectory/sourcefilename destinationfilename
(recursive directory copy to a remote system from local)
$ scp -r sourcedirectory/ myusername@hostname:somedirectory/
Linux / Solaris / AIX / HP-UX / Unix:
Microsoft Windows:
Mac OS X:
SFTP (Secure File Transfer Protocol) is a reliable way of transferring files between two machines. You may use SFTP to connect to most RCAC systems. SFTP is available as a protocol choice in some graphical file transfer programs and also as a command-line program on most Linux, Unix, and Mac OS X systems. SFTP has more features than SCP and allows for other operations on remote files, remote directory listing, and resuming interrupted transfers. Command-line SFTP cannot recursively copy directory contents; to do so, try using SCP or a graphical SFTP client.
Command-line usage:
$ sftp -B buffersize myusername@hostname
(to a remote system from local)
sftp> put sourcefile somedir/destinationfile
sftp> put -P sourcefile somedir/
(from a remote system to local)
sftp> get sourcefile somedir/destinationfile
sftp> get -P sourcefile somedir/
sftp> exit
Linux / Solaris / AIX / HP-UX / Unix:
Microsoft Windows:
Mac OS X:
LFTP is a command-line file-transfer program for Linux and Unix systems. It supports SFTP, HTTP, and HTTPS file-transfers. LFTP has additional features not provided by SFTP such as bandwidth throttling, transfer queues, and parallel transfers. It may be used interactively or scripted.
LFTP with parallel transfers can be much faster than SCP or SFTP, so its use is encouraged when possible.
LFTP is provided only on some RCAC systems. However, it is simply a client, so it is not needed on the remote machine involved in a transfer (the remote system need only support SFTP).
Interactive usage:
$ lftp myusername@hostname
(transfer all ".dat" files from remote system to local)
lftp :~> mget *.dat
(transfer "filename.dat" file from local system to remote)
lftp :~> put filename.dat
(transfer a directory and all contents from remote
system to local, using 5 connections in parallel)
lftp :~> mirror --parallel=5 remotedirectory localdirectory/
(transfer a directory and all contents from local
system to remote, using 8 connections in parallel)
lftp :~> mirror -R --parallel=8 localdirectory remotedirectory/
Batch usage:
(specify all actions on command line)
$ lftp myusername@hostname -e "mget *.dat"
(specify all actions in the script file "mytransfer.lftp")
$ lftp myusername@hostname -f mytransfer.lftp
GridFTP is a fast method of transferring large files that uses Globus authentication credentials (x509 certificates). GridFTP is available on some RCAC resources, but only to users who are members of a Grid project, such as TeraGrid, NorthWest Indiana Computational Grid (NWICG), or Open Science Grid (OSG). Note that not all grids may access all RCAC resources.
For more information about how to use GridFTP, consult documentation for your participating grid.
#!/bin/bash
source /etc/profile
module load mpich-intel
mpirun -np 4 mpi/hello
Example (works for ksh and bash):
#!/bin/ksh
. /etc/profile
module load mpich-intel
mpirun -np 4 mpi/hello
-bash-3.00$ module avail
-------------------- /opt/modules/versions --------------------
3.1.6 3.2.3
--------------- /opt/modules/3.1.6/modulefiles ----------------
dot module-cvs module-info modules null use.own
------------------ /opt/modules/modulefiles -------------------
R/2.3.1                             mpich22-1.2.7p1-gcc/4.1.1
R/2.4.1                             mpich22-1.2.7p1-intel/8.1(default)
gaussian03/D.01                     mpich22-1.2.7p1-intel/9.0.027
gcc/4.1.1                           mpich22-1.2.7p1-intel/9.1.045
gcc/4.1.2                           mpich22-1.2.7p1-pgi/6.0-5
ghostscript/8.51                    mpich22-1.2.7p1-pgi/6.1-4
gsl/1.8                             mpich22-1.2.7p1-pgi/6.2-3(default)
gsl/1.9                             mpich22-gcc/4.1.1
hdf4-gcc/3.4.6                      mpich22-intel/8.1(default)
hdf5-gcc/3.4.6                      mpich22-intel/9.0.027
intel/8.1                           mpich22-intel/9.1.045
intel/9.0.027                       mpich22-pgi/6.0-5
intel/9.1.045                       mpich22-pgi/6.1-4
intel32/8.1(default)                mpich22-pgi/6.2-3(default)
intel32/9.0.027                     mpich64-1.2.7p1-gcc/4.1.1
intel32/9.1.045                     mpich64-1.2.7p1-intel/8.1
intel64/8.1                         mpich64-1.2.7p1-intel/9.0.027
intel64/9.0.027                     mpich64-1.2.7p1-intel/9.1.045
intel64/9.1.045                     mpich64-1.2.7p1-pgi/6.0-5
java/1.6.0                          mpich64-1.2.7p1-pgi/6.1-4
maple/10.0                          mpich64-1.2.7p1-pgi/6.2-3(default)
maple/100                           mpich64-gcc/4.1.1
maple/11.0(default)                 mpich64-intel/8.1
mathematica/51                      mpich64-intel/9.0.027
mathematica/52                      mpich64-intel/9.1.045
matlab/7.3                          mpich64-pgi/6.0-5
matlab/7.4                          mpich64-pgi/6.1-4
mpich-gcc/4.1.1                     mpich64-pgi/6.2-3(default)
mpich-intel/8.1                     netcdf-gcc/3.4.6
mpich-intel/9.0.027                 netcdf-gcc/4.1.1
mpich-intel/9.1.045                 netcdf-intel64/9.1.045
mpich-pgi/6.0-5                     netcdf-pgi64/6.2-3
mpich-pgi/6.1-4                     pgi/6.0-5
mpich-pgi/6.2-3(default)            pgi/6.1-4
mpich1-1.0.5-gcc32/4.1.1            pgi/6.2-3(default)
mpich1-1.0.5-gcc64/4.1.1            pgi/7.0-2
mpich1-1.0.5-intel32/8.1(default)   pgi32/6.0-5
mpich1-1.0.5-intel32/9.1.045        pgi32/6.1-4
mpich1-1.0.5-intel64/9.1.045        pgi32/6.2-3(default)
mpich1-1.0.5-pgi32/6.2-3(default)   pgi32/7.0-2
mpich1-1.0.5-pgi64/6.2-3(default)   pgi64/6.0-5
mpich1-gcc/4.1.1                    pgi64/6.1-4
mpich1-gcc32/4.1.1                  pgi64/6.2-3(default)
mpich1-gcc64/4.1.1                  pgi64/7.0-2
mpich1-intel/9.1.045                python/2.5
mpich1-intel32/8.1(default)         subversion/1.4.2
mpich1-intel32/9.1.045              subversion/1.4.3
mpich1-intel64/9.1.045              totalview/8.0.1-0
mpich1-pgi/6.2-3(default)           totalview/8.1.0-0
mpich1-pgi32/6.2-3(default)
mpich1-pgi64/6.2-3(default)
-bash-3.00$
If you load the generic name of a module, you will get the default version. To load a specific version, load the module using its full specification.
-bash-3.00$ module load intel
Example, loading the Intel compilers, 64-bit, version 8.1:
-bash-3.00$ module load intel64/8.1
Example, loading the Intel MPI compilers, default version:
-bash-3.00$ module load mpich-intel
-bash-3.00$ module unload intel
-bash-3.00$
-bash-3.00$ module show intel
-------------------------------------------------------------------
/opt/modules/modulefiles/intel/9.1.045:

module-whatis invoke Intel 9.1 Compilers
prepend-path PATH /opt/intel/cce/9.1.045/bin
prepend-path PATH /opt/intel/fce/9.1.040/bin
prepend-path PATH /opt/intel/idbe/9.1.045/bin
prepend-path LD_LIBRARY_PATH /opt/intel/cce/9.1.045/lib
prepend-path LD_LIBRARY_PATH /opt/intel/fce/9.1.040/lib
prepend-path LD_LIBRARY_PATH /opt/intel/idbe/9.1.045/lib
prepend-path LD_LIBRARY_PATH /opt/intel/mkl/9.0/lib/em64t
setenv CC icc
setenv CXX icpc
setenv FC ifort
setenv F90 ifort
setenv LAPACK_INCLUDE -I/opt/intel/mkl/9.0/include
setenv LINK_LAPACK -L/opt/intel/mkl/9.0/lib/em64t \
    -lmkl_lapack64 -lmkl_em64t -lmkl -lguide -lpthread
setenv LINK_LAPACK_STATIC -L/opt/intel/mkl/9.0/lib/em64t \
    -lmkl_lapack -lmkl_em64t -lguide -lpthread
-------------------------------------------------------------------
-bash-3.00$
-bash-3.00$ module list
No Modulefiles Currently Loaded.
-bash-3.00$ module load intel
-bash-3.00$ module list
Currently Loaded Modulefiles:
  1) intel/9.1.045
-bash-3.00$ module unload intel
-bash-3.00$ module list
No Modulefiles Currently Loaded.
-bash-3.00$
Remember, modulefiles can use conditional statements. Thus the effect a modulefile will have on the environment may change depending upon the current state of the environment. Environment variables are unset when unloading a modulefile. Thus, it is possible to load a modulefile and then unload it without having the environment variables return to their prior state.
The compilers available on the Community Clusters (Steele, Venice, Rossmann, Prospero, Pete) and the Recycled Cluster (Radon) can all be used to compile code for Condor. They are: Intel, GNU, and PGI compilers for Fortran 77, Fortran 90, Fortran 95 (only PGI), C, and C++. The compilers can produce general-purpose and architecture-specific optimizations to improve performance. These include loop-level optimizations, inter-procedural analysis and cache optimizations. The compilers support automatic and user-directed parallelization of Fortran, C, and C++ applications for multiprocessing execution.
Aside from these compilers, the command condor_compile can be used to compile code that is to be relinked with the Condor libraries for submission into Condor's Standard Universe. The Condor libraries provide the program with additional support, such as the capability to checkpoint, which is required in Condor's Standard Universe mode of operation. condor_compile requires access to the source or object code of the program to be submitted; if source or object code for the program is not available (for example, if you have only an executable binary, or if the program is a shell script), then the program must be submitted into Condor's Vanilla Universe.
The command for condor_compile is:
condor_compile [compiler] -O -o myprogram.condor file1.suffix file2.suffix ...
where [compiler] is one of
cc (the system C compiler)
CC (the system C++ compiler)
f77 (the system FORTRAN compiler)
gcc (the GNU C compiler)
g++ (the GNU C++ compiler)
g77 (the GNU FORTRAN compiler)
f90 (the system FORTRAN 90 compiler)
Notice from the preceding example that only the GNU compilers are compatible with condor_compile and thus the Standard Universe. To load the GNU compilers, type module load gcc.
Compiler commands for the ordinary compilers
This section looks at just compiling with the standard C/C++ and Fortran compilers, as opposed to compiling with condor_compile. When you are not compiling with condor_compile you will have to submit to the Vanilla Universe, not the Standard Universe. One reason for choosing not to use condor_compile and the Standard Universe is that we might wish to take advantage of the features of a compiler which is not compatible with condor_compile. To load the Intel compilers, type module load intel; to load the GNU compilers, type module load gcc; and to load the PGI compilers type module load pgi.
Example: To load the Intel C/C++/Fortran compilers
-bash-3.00$ module load intel
-bash-3.00$
There are several versions installed for most of the compilers. Typing module avail will show all the modules possible to load.
The table below shows the different compiler names.
Note about reading the table: for brevity, the commands for the Intel, GNU, and PGI compilers are collected in a single column. For example, the entry for Fortran 77 is ifort/g77/pgf77, which means that you should use ifort to compile with Intel, g77 to compile with GNU, and pgf77 to compile with PGI. A '-' means that no compiler from that vendor is available for that language.
| Language | Serial compiler (Intel/GNU/PGI) |
|---|---|
| Fortran 77 | ifort/g77/pgf77 |
| Fortran 90 | ifort/gfortran/pgf90 |
| Fortran 95 | ifort/gfortran/pgf95 |
| C | icc/gcc/pgcc |
| C++ | icpc/g++/pgCC |
Here is a list of man pages for the various compilers. They can also be seen by issuing the command man <compiler> (only when said compiler has been loaded with the module load command). They will contain various useful information, such as compiler/linker options.
This table shows some examples on how to compile a program. Note that not all compilers are available on all systems.
| Program Type | Intel | GNU | PGI |
|---|---|---|---|
| Fortran 77 serial program | ifort program.f -o program | g77 program.f -o program | pgf77 program.f -o program |
| Fortran 90 serial program | ifort program.f90 -o program | gfortran program.f90 -o program | pgf90 program.f90 -o program |
| Fortran 95 serial program | ifort program.f95 -o program | gfortran program.f95 -o program | pgf95 program.f95 -o program |
| C serial program | icc program.c -o program | gcc program.c -o program | pgcc program.c -o program |
| C++ serial program (1) | icpc program.cpp -o program | g++ program.cpp -o program | pgCC program.cpp -o program |
(1) the suffix of a C++ file may be .C, .c, .cc, .cpp, .cxx, or .c++.
Here is a list of "getting started" guides on the various compilers etc.:
Only serial programs can be run over Condor. There is no support for MPI, for OpenMP, or for hybrid MPI/OpenMP programs.
Some libraries are preinstalled for use on BoilerGrid. These may include mathematical libraries. More detailed documentation on the libraries available on BoilerGrid follows.
There is currently no support for MPICH through Condor.
Intel Math Kernel Library (MKL) contains ScaLAPACK, LAPACK, Sparse Solver, BLAS, Sparse BLAS, CBLAS, GMP, FFTs, DFTs, VSL, VML, and Interval Arithmetic routines. MKL can be found in the directory "/opt/intel/mkl/9.1" and it is divided into the following subdirectory structure:
Here are some example combinations of linking options:
(static linking of LAPACK and Kernels)
$ <fortran_compiler> myprogram.f -L${MKLPATH} -lmkl_lapack -lmkl_ia32 -lguide -lpthread
(static linking of Fortran-95 LAPACK Interface and Kernels)
$ <fortran_compiler> myprogram.f95 -L${MKLPATH} -lmkl_lapack95 -lmkl_lapack -lmkl_ia32 -lguide -lpthread
(static linking of BLAS, Sparse BLAS, GMP, VML/VSL, Interval Arithmetic, and FFT/DFT)
$ <c_compiler> myprogram.c -L${MKLPATH} -lmkl_ia32 -lguide -lpthread -lm
(dynamic linking of BLAS or FFTs)
$ <c_compiler> myprogram.c -L${MKLPATH} -lmkl -lguide -lpthread
It is recommended that you use dynamic linking of libguide. If so, ensure LD_LIBRARY_PATH is defined such that the correct version of libguide is found and used at run time. If you use static linking of libguide (discouraged), then:
Here is some more documentation from other sources on the Intel MKL:
If the source file ends with .F, .fpp, or .FPP, it is automatically preprocessed by cpp before it is compiled. If you want to use the C preprocessor with source files that do not end with .F, use the following compiler option to specify the filename suffix:
GNU: -x f77-cpp-input
Note that the preprocessing is not extended to the contents of files included by the "INCLUDE" directive - the #include preprocessor directive must be used instead.
For example, to preprocess source files that end with .f:
gfortran -x f77-cpp-input program.f
Intel: -cpp
To tell the compiler to link using the C++ runtime libraries included with gcc or icc, use -cxxlib-gcc or -cxxlib-icc, respectively.
For example, to preprocess source files that end with .f:
ifort -cpp program.f
Generally, it is best to rename the file from <name>.f to <name>.F. The preprocessor will then be run automatically when the file is compiled.
A good page to look at for combining C/C++ and Fortran is Using C/C++ and Fortran together.
When calling your own Fortran routines from C/C++, you should not append an underscore (_) after the name.
A complete list of routines is in the XL Fortran Language Reference Manual.
Here are some links to pages that discuss how to use Fortran from C/C++:
Only serial jobs can currently be submitted to BoilerGrid.
Condor is one of several distributed computing resources RCAC provides. Like other similar resources, Condor provides a framework for running programs on otherwise idle computers. While this has serious limitations for parallel jobs and codes with serious I/O or memory requirements, Condor can provide a large quantity of cycles for researchers who need to run hundreds of smaller jobs.
Condor is a specialized batch system for managing compute-intensive jobs. Condor provides a queuing mechanism, scheduling policy, priority scheme, and resource classifications. Users submit their jobs to Condor, which then puts these jobs in a queue, runs them, and reports back with the results.
In some ways, Condor is different from other batch systems. Most batch systems operate only on dedicated machines or compute servers. Condor, by contrast, can both schedule jobs on dedicated machines and effectively utilize non-dedicated machines to run jobs. On non-dedicated machines, it only runs jobs when the machine is not currently being used (no keyboard activity, no load average, no active telnet users, etc). In this way, Condor effectively harnesses otherwise idle machines throughout a pool of machines.
Currently we use Condor to collect idle cycles on all i386-compatible Linux-based computational resources which RCAC maintains. Condor also runs on the Linux Clusters, both community (Steele) and recycled (Radon). All of these resources are scheduled with PBS; however, when a node is not running a PBS job, it is free to execute Condor jobs. When PBS elects to run a new job on such a node, any active Condor job is immediately checkpointed (if possible) and removed from the node. Condor jobs can be submitted from most of the RCAC systems (Gray, Pete, Prospero, Radon, Rossmann, Steele, Venice). You may also install Condor on your own desktop machine, and submit from that.
See here for a more detailed description of the resources in the Condor pools.
The status of the Condor pools can always be monitored with CondorView.
Look here for Condor tutorials and slides from the 'Condor Boot Camp' that was held at Purdue University. They are very good and useful.
Most of the information in this manual is taken from either the Condor Version 6.8.0 Manual or the man pages for the various commands.
Here is a short list of the steps to get ready to run on Condor. They are followed by a short example for the C program 'hello.c':
#include <stdio.h>

int main (void) {
    printf("Hello World!\n");
    return 0;
}
condor_compile gcc hello.c -o hello
(Intel)
module load intel
icc hello.c -o hello
(GNU)
module load gcc
gcc hello.c -o hello
(PGI)
module load pgi
pgcc hello.c -o hello
# FILENAME: hello.sub
Executable = hello
Universe = Vanilla
Output = hello.out
Error = hello.err
Log = hello.log
Queue
With the job submission file, you can access other features of Condor. For example, you can specify how many times to run a program with multiple data sets. You can specify requirements for a platform type and apply ranks to those requirements. You can specify which data files are to be transferred to the machine running the job. You can name the destination of Condor's email sent when a job completes or cancel that email.
radon:~/condor_running$ condor_submit hello.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1100744.
radon:~/condor_running$
radon:~/condor_running$ condor_q <username>

-- Submitter: radon.rcac.purdue.edu : <128.211.157.42:56939> : radon.rcac.purdue.edu
 ID         OWNER        SUBMITTED     RUN_TIME   ST PRI SIZE CMD
1100744.0   <username>   2/17 15:36   0+00:00:00  I  0   0.0  hello

1 jobs; 1 idle, 0 running, 0 held
radon:~/condor_running$
radon:~/condor_running$ less hello.log
000 (1100744.000.000) 02/17 15:36:51 Job submitted from host: <128.211.157.42:56939>
...
001 (1100744.000.000) 02/17 15:41:49 Job executing on host: <172.18.16.12:57321>
...
005 (1100744.000.000) 02/17 15:41:53 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    0 - Run Bytes Sent By Job
    0 - Run Bytes Received By Job
    0 - Total Bytes Sent By Job
    0 - Total Bytes Received By Job
...
hello.out
radon:~/condor_running$ less hello.out
Hello World!
Condor allows several types of jobs, but the most used are "Standard" and "Vanilla". Standard jobs can be checkpointed and migrated from system to system transparently by Condor - jobs can be moved from node to node without restarting. However, for a code to be submitted as a Standard job it must be recompiled using various Condor-specific compiler options and libraries. An application must also conform to a few other restrictions in order to run in the Standard Universe.
Those programs that cannot be recompiled can be submitted as Vanilla jobs. Virtually any non-parallel program can be submitted. Vanilla jobs cannot be checkpointed. If a node ceases to be idle, any running Vanilla jobs may be suspended or killed (to be restarted elsewhere).
Under Windows, only Vanilla jobs are allowed.
A universe in Condor defines an execution environment. Condor Version 6.8.0 supports several different universes for user jobs. The most used on BoilerGrid are "Standard", "Vanilla", and "Globus" or "Grid". There are other universes. See chapter 2.4.1 of the Condor manual for more details about the different universes.
The Universe attribute is specified in the job submission file. If a universe is not specified, the default is Standard.
Example 1
Here is the simplest possible job submission file. It will put one copy of the program hello (which has first been created by condor_compile) in the queue for execution by Condor. There is no definition of platform, so Condor will just use its default, which is to run the job on a machine which has the same architecture and operating system as the machine from which it was submitted.
No input, output, and error commands are given in the job submission file, so the files stdin, stdout, and stderr will all refer to /dev/null (a.k.a. the null device. It is a special file that discards all data written to it, but reports that the write operation succeeded. It provides no data to any process that reads from it - returning EOF). The program may produce output by explicitly opening a file and writing to it. A log file, hello.log, will also be produced. This log file will contain events the job had during its lifetime inside of Condor, such as any possible errors. When the job finishes, its exit conditions will also be noted in the log file. It is recommended that you always have a log file so you know what happened to your jobs.
If your program only returns output to the screen (as the hello.c program does), then you should include Output = hello.out or something like it somewhere before Queue. Otherwise you will not see the output.
The default universe is the Standard Universe. This is what will be assumed if you do not explicitly choose a universe. This is a problem if you have not compiled with condor_compile and thus included the Condor libraries. If you have a program that is compiled with the normal compilers and thus does not have the Condor libraries linked in, then you should run in the Vanilla Universe.
####################
#
# Example 1
# Simple condor job description file
#
####################
Executable = hello
Log = hello.log
Queue
Example 2
In this example (from the Condor manual), we queue two copies of the program Mathematica. The first copy will run in directory run_1, and the second will run in directory run_2. For both queued copies, stdin will be test.data, stdout will be loop.out, and stderr will be loop.error. There will be two sets of files written, as the files are each written to their own directories. This is a convenient way to organize data if you have a large group of Condor jobs to run. The example file shows program submission of Mathematica as a Vanilla Universe job. This may be necessary if the source and/or object code to program Mathematica is not available.
Condor recommends using a single log file.
####################
#
# Example 2: demonstrate use of multiple
# directories for data organization.
#
####################
Executable = mathematica
Universe = Vanilla
Input = test.data
Output = loop.out
Error = loop.error
Log = loop.log
Initialdir = run_1
Queue
Initialdir = run_2
Queue
Example 3
In this example (also from the Condor manual), the job submission file queues 150 runs of program foo which has been compiled and linked for Silicon Graphics workstations running IRIX 6.5. This job requires Condor to run the program on machines which have at least 32 megabytes of physical memory, and expresses a preference to run the program on machines with at least 64 megabytes, if such machines are available. It also advises Condor that it will use up to 28 megabytes of memory when running. Each of the 150 runs of the program is given its own process number, starting with process number 0. So, files stdin, stdout, and stderr will refer to in.0, out.0, and err.0 for the first run of the program, in.1, out.1, and err.1 for the second run of the program, and so forth. A log file containing entries about when and where Condor runs, checkpoints, and migrates processes for the 150 queued programs will be written into file foo.log.
####################
#
# Example 3: Show off some fancy features including
# use of pre-defined macros and logging.
#
####################
Executable = foo
Requirements = Memory >= 32 && OpSys == "IRIX65" && Arch =="SGI"
Rank = Memory >= 64
Image_Size = 28 Meg
Error = err.$(Process)
Input = in.$(Process)
Output = out.$(Process)
Log = foo.log
Queue 150
To submit a job to Condor for execution, you must use the condor_submit command. This command takes as an argument the job submission file. As described above, this file contains the commands and keywords used to direct the queuing of jobs - the name of the executable to run, which universe to run in, any requirements and rank info, how many times to run the program, any command line arguments, etc. Based on this information, condor_submit will create a job ClassAd to use for matching with a machine ClassAd. When this has been done, Condor can queue the job for running on that machine.
One of the many advantages of using a job description file involves running the same program many times, each time with a different input data set (say, 500 times with 500 different input data sets). It is then easy to tell Condor to do this. Just arrange your data files accordingly so that each run reads its own input, and each run writes its own output. Each individual run may have its own initial working directory, stdin, stdout, stderr, command line arguments, and shell environment. A program that directly opens its own files will read the file names to use either from stdin or from the command line. A program that opens a static filename every time will need to use a separate subdirectory for the output of each run.
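As a rough illustration of the one-directory-per-run layout described above, the following shell sketch prepares five run directories and writes a submit file that starts one job in each. The names myprogram, input.dat, output.dat, error.log, sweep.log, and sweep.sub are hypothetical placeholders, as is the count of five; $(Process) is the same Condor macro used in Example 3. It assumes the Vanilla Universe and a file system shared between the submit and execute machines.

```shell
#!/bin/sh
# Sketch: one directory per run, each holding its own input file.
# All file and directory names here are hypothetical placeholders.
set -e

NRUNS=5

i=0
while [ "$i" -lt "$NRUNS" ]; do
    mkdir -p "run_$i"
    # Stand-in for a real input data set; replace with your own data.
    echo "parameter = $i" > "run_$i/input.dat"
    i=$((i + 1))
done

# Write the job submission file. The quoted 'EOF' keeps the shell from
# expanding $(Process), so Condor sees the macro unchanged.
cat > sweep.sub <<'EOF'
# FILENAME: sweep.sub
Executable = myprogram
Universe   = Vanilla
Initialdir = run_$(Process)
Input      = input.dat
Output     = output.dat
Error      = error.log
Log        = sweep.log
EOF

# Queue one job per run directory.
echo "Queue $NRUNS" >> sweep.sub

echo "Prepared $NRUNS run directories and sweep.sub"
```

The resulting file would then be submitted with condor_submit sweep.sub. Because every run uses the same file names inside its own directory, the program itself does not need to know which run it is.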
Write a job submission file and submit it:
condor_submit file
Example (my job submission file is called run_hello):
condor_submit run_hello
See condor_submit in the manual pages, for a more complete description of how to use it.
To just see the status of a job, type condor_q. Since there will often be many jobs scheduled at the same time, using condor_q <username> will limit the output to those jobs scheduled by <username>.
Example:
radon-fe00:~/condor_running$ condor_q <username>

-- Submitter: radon-fe00.rcac.purdue.edu : <128.211.157.42:42163> : radon-fe00.rcac.purdue.edu
 ID         OWNER        SUBMITTED     RUN_TIME   ST PRI SIZE CMD
1100900.0   <username>   2/20 15:13   0+00:00:00  I  0   0.0  Hello

1 jobs; 1 idle, 0 running, 0 held
radon-fe00:~/condor_running$
Placing a job on hold
There are many reasons to put a job on hold. One is that you may not have enough space to hold all the results at the same time, so that results must be moved somewhere else as they are produced. You could queue all jobs but put them on hold immediately, then release a few jobs at a time (this can be scripted using the -constraint option to condor_release), move the results as they are produced, and then release some more. To place a job in the queue on hold, use the command condor_hold <jobid>.
Releasing a held job
A job that is in the hold state remains there until later released for execution by the command condor_release. If a running job is placed on hold, it is killed and put back in the queue (assuming Vanilla jobs without checkpointing). A job still in the queue will stay there until released. Jobs can be held manually or may get held by the Condor Scheduler for various reasons (unable to write to your directory, etc.). To release held jobs, use:
condor_release jobid or condor_release user
Example
radon-fe00:~/condor_running$ condor_q <username>

-- Submitter: radon-fe00.rcac.purdue.edu : <128.211.157.42:42163> : radon-fe00.rcac.purdue.edu
 ID         OWNER        SUBMITTED     RUN_TIME   ST PRI SIZE CMD
1101790.0   <username>   2/24 14:53   0+00:00:00  I  0   0.0  Hello

1 jobs; 1 idle, 0 running, 0 held
radon-fe00:~/condor_running$ condor_hold 1101790.0
Job 1101790.0 held
radon-fe00:~/condor_running$ condor_q <username>

-- Submitter: radon-fe00.rcac.purdue.edu : <128.211.157.42:42163> : radon-fe00.rcac.purdue.edu
 ID         OWNER        SUBMITTED     RUN_TIME   ST PRI SIZE CMD
1101790.0   <username>   2/24 14:53   0+00:00:00  H  0   0.0  Hello

1 jobs; 0 idle, 0 running, 1 held
radon-fe00:~/condor_running$ condor_release 1101790.0
Job 1101790.0 released
radon-fe00:~/condor_running$ condor_q <username>

-- Submitter: radon-fe00.rcac.purdue.edu : <128.211.157.42:42163> : radon-fe00.rcac.purdue.edu
 ID         OWNER        SUBMITTED     RUN_TIME   ST PRI SIZE CMD
1101790.0   <username>   2/24 14:53   0+00:00:00  I  0   0.0  Hello

1 jobs; 1 idle, 0 running, 0 held
radon-fe00:~/condor_running$
See the manual page for more information.
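The hold-and-release workflow described above (queue everything on hold, then release a few jobs at a time) might be scripted along the following lines. This is only a sketch: the cluster number, job count, and batch size are hypothetical values, and ClusterId and ProcId are assumed to be the job ClassAd attributes that identify a job as <cluster>.<process>. By default the script runs in a dry-run mode that only prints the condor_release commands and saves them to the hypothetical file release_plan.txt for review.

```shell
#!/bin/sh
# Sketch: release held jobs of one cluster a few at a time.
# CLUSTER, TOTAL, and BATCH are hypothetical values; with DRY_RUN=1
# (the default here) the condor_release commands are only printed.
set -e

CLUSTER=${CLUSTER:-1101790}
TOTAL=${TOTAL:-6}
BATCH=${BATCH:-2}
DRY_RUN=${DRY_RUN:-1}

first=0
while [ "$first" -lt "$TOTAL" ]; do
    last=$((first + BATCH - 1))
    constraint="ClusterId == $CLUSTER && ProcId >= $first && ProcId <= $last"
    if [ "$DRY_RUN" = 1 ]; then
        echo "condor_release -constraint '$constraint'"
    else
        condor_release -constraint "$constraint"
        # Wait for this batch to finish and move its results elsewhere
        # here before the loop releases the next batch.
    fi
    first=$((last + 1))
done | tee release_plan.txt
```

Setting DRY_RUN=0 in the environment would run the commands for real; in that case the placeholder comment needs to be replaced with whatever waiting and copying your workflow requires, since otherwise all batches are released back to back.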
Checking on the progress of jobs
To check on the status of your jobs, use the command condor_q. This command will display the status of all the queued jobs, not just your own.
That is, however, not the only way of tracking the progress of your jobs. Another way of doing this is through the user log. In your job submission file, you can specify a log command (by adding Log = <name>.log somewhere before the Queue command). When you have done this, the progress of the job may be followed by viewing the log file. Various events such as execution commencement, checkpoint, eviction and termination are logged in the file. Also logged is the time at which the event occurred.
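A minimal sketch of following jobs through the user log from the shell. It assumes the log format shown in the hello.log listing, where each event line starts with a three-digit code followed by the job ID in parentheses (000 for submission, 001 for execution, 005 for termination). The file sample.log and the helper count_events are made up for the illustration.

```shell
#!/bin/sh
# Sketch: count user-log events by their leading three-digit code.
# sample.log is a made-up stand-in for a real Condor user log.
set -e

cat > sample.log <<'EOF'
000 (1100744.000.000) 02/17 15:36:51 Job submitted from host: <128.211.157.42:56939>
...
001 (1100744.000.000) 02/17 15:41:49 Job executing on host: <172.18.16.12:57321>
...
005 (1100744.000.000) 02/17 15:41:53 Job terminated.
    (1) Normal termination (return value 0)
...
EOF

# count_events CODE FILE: print how many lines in FILE start with "CODE (".
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_events() {
    grep -c "^$1 (" "$2" || true
}

submitted=$(count_events 000 sample.log)
executing=$(count_events 001 sample.log)
terminated=$(count_events 005 sample.log)

echo "submitted=$submitted executing=$executing terminated=$terminated"
```

With a single shared log for many jobs, comparing the number of 000 and 005 events gives a quick estimate of how many jobs are still outstanding.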
As soon as your job begins executing, Condor will start up a condor_shadow process on the submit machine. The shadow process is the mechanism by which a remotely executing job can access the environment from which it was submitted, such as input and output files. It is normal for a machine which has submitted hundreds of jobs to have hundreds of shadows running on the machine. Since the text segments of all these processes are the same, the load on the submit machine is usually not significant. If, however, you notice degraded performance, you can limit the number of jobs that can run simultaneously through the MAX_JOBS_RUNNING configuration parameter. Please talk to your system administrator for the necessary configuration change.
To find all the machines which are running your jobs, use the command condor_status. Example: say you wish to find all the machines which run jobs submitted by user123@rcac.purdue.edu. You would then type condor_status -constraint 'RemoteUser == "user123@rcac.purdue.edu"'.
user123@radon-fe00:~$ condor_status -constraint 'RemoteUser == "user123@rcac.purdue.edu"'

Name           OpSys  Arch   State    Activity  LoadAv  Mem  ActvtyTime

ba-005.rcac.p  LINUX  INTEL  Claimed  Busy      1.000   502  0+00:24:44
ba-006.rcac.p  LINUX  INTEL  Claimed  Busy      0.990   502  0+00:20:22
ba-007.rcac.p  LINUX  INTEL  Claimed  Busy      1.000   502  0+00:23:16
ba-008.rcac.p  LINUX  INTEL  Claimed  Busy      1.000   502  0+00:30:20
...
Condor allocates resources by matching the submitted jobs with the machines. It does this by matching ClassAds. Condor's ClassAds are analogous to the classified advertising section of the newspaper. Sellers/buyers advertise specifics about what they have to sell/want to buy. Both buyers and sellers have some constraints which must be satisfied, like buyers only being able to pay a certain sum of money or sellers asking for no less than a certain price. Sellers and buyers both want to rank requests to their own advantage, for example, the seller would give a higher rank to a higher price offer. In Condor, users submitting jobs can be thought of as buyers of compute resources and machine owners are sellers.
All the machines in a Condor pool advertise their attributes. These could be available RAM memory, CPU type, CPU speed, virtual memory size, current load average, or other static and dynamic properties. This machine ClassAd also advertises under what conditions it is willing to run a Condor job and what type of job it would prefer.
The owners who allow their machines to be part of the Condor pool may set individual terms and preferences - maybe specifying that their machines may be used to run jobs only at night or that they have a preference/higher rank for running jobs submitted by their own department.
A very useful program for finding out which machines and architectures are out there is the program condor_all. It should be noted that even though it is located in the "official" Condor directory, /opt/condor/bin, it is a locally (Purdue) developed tool. It is very handy for finding out how many machines of a certain architecture are available, which is useful when writing the job submission file. The program is in the default path on tg_login, and it is also installed on most RCAC resources, where it can be run as /opt/condor/bin/condor_all.
Just as the machines have requirements and preferences, the same is true for the users submitting a job. The users specify a ClassAd with their requirements and preferences when they submit a job. This ClassAd includes the type of machine you wish to use - you would perhaps like to use the machine with the fastest floating-point performance available and you thus want Condor to rank the available machines based upon their floating-point performance.
Another example could be that your job requires a machine with a minimum of, say, 4 GB of RAM and you thus only want Condor to consider machines which fulfill this requirement.
Sometimes, the user may be ready to use any machine available and this too can be communicated to Condor through the job ClassAd.
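To make the two examples above concrete, here is a hedged sketch of a submit file that requires at least 4 GB of memory and ranks the matching machines by floating-point speed. It assumes the machine ClassAd attributes Memory (reported in megabytes, as in Example 3) and KFlops (a relative floating-point benchmark figure); the file name bigmem.sub and the executable myprogram are hypothetical.

```shell
#!/bin/sh
# Sketch: write a submit file with a requirement and a preference.
# bigmem.sub and myprogram are hypothetical names.
set -e

cat > bigmem.sub <<'EOF'
# FILENAME: bigmem.sub
Executable   = myprogram
Universe     = Vanilla
# Only consider machines with at least 4 GB of RAM (Memory is in MB).
Requirements = Memory >= 4096
# Among matching machines, prefer higher floating-point performance.
Rank         = KFlops
Output       = bigmem.out
Error        = bigmem.err
Log          = bigmem.log
Queue
EOF

echo "Wrote bigmem.sub"
```

Leaving out both the Requirements and Rank lines corresponds to the last case above, where the job accepts any machine that Condor's defaults allow.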
Condor's job then is to read all the machine ClassAds and all the user job ClassAds and match them up. Condor makes certain that all requirements in both ClassAds are satisfied, if possible.
To get a feel for what a machine ClassAd does, try typing the command condor_status. This will give you a summary of the information in the resource ClassAds in your Condor pool. Click here to see an example of running this command. The list was generated by running condor_status on Radon, August 25, 2006.
There are many attributes. Some of them are used by Condor for scheduling. Other attributes are for information purposes. An important point is that any of the attributes in a machine ad can be utilized at job submission time as part of a request or preference on what machine to use. Additional attributes can be easily added. For example, your site administrator can add a physical location attribute to your machine ClassAds.
Removing a job from the queue
The command condor_rm can be used at any time to remove a job from the queue. If the job has already started running, then the job will be killed without a checkpoint, and its queue entry is removed. Use condor_q to get the ID of the job. Here is an example:
Queue of jobs before:
user123@radon-fe00:~$ condor_q
Submitter: radon-fe00.rcac.purdue.edu : <128.210.9.35:35407> : radon-fe00.rcac.purdue.edu
 ID         OWNER            SUBMITTED    RUN_TIME   ST PRI SIZE  CMD
...
260076.7    nice-user.user1  8/18 00:05   0+00:21:47 I  0   29.3  startfah.sh -oneun
260076.9    nice-user.user1  8/18 00:05   0+01:40:44 I  0   136.7 startfah.sh -oneun
260185.0    user123          8/30 13:01   0+00:00:00 R  0   19.5  hello
...
Queue of jobs after:
user123@radon-fe00:~$ condor_rm 260185.0
Job 260185.0 marked for removal
user123@radon-fe00:~$ condor_q
Submitter: radon-fe00.rcac.purdue.edu : <128.210.9.35:35407> : radon-fe00.rcac.purdue.edu
 ID         OWNER            SUBMITTED    RUN_TIME   ST PRI SIZE  CMD
...
260076.7    nice-user.user1  8/18 00:05   0+00:21:47 I  0   29.3  startfah.sh -oneun
260076.9    nice-user.user1  8/18 00:05   0+01:40:44 I  0   136.7 startfah.sh -oneun
...
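When the queue is long, picking out your own job IDs by eye is error-prone. The Python sketch below extracts them from condor_q-style text. It assumes the column layout seen in the listings above (job ID first, owner second); the function name job_ids_for_owner is hypothetical, and real condor_q output may differ between Condor versions.

```python
def job_ids_for_owner(condor_q_text, owner):
    """Collect job IDs (cluster.process) from rows owned by `owner`."""
    ids = []
    for line in condor_q_text.splitlines():
        fields = line.split()
        # A job row starts with an ID such as 260185.0 followed by the owner.
        if len(fields) >= 2 and fields[1] == owner:
            cluster, dot, proc = fields[0].partition(".")
            if dot and cluster.isdigit() and proc.isdigit():
                ids.append(fields[0])
    return ids


# Sample text copied from the queue listing in this section.
sample = """\
 ID         OWNER            SUBMITTED    RUN_TIME   ST PRI SIZE  CMD
260076.7    nice-user.user1  8/18 00:05   0+00:21:47 I  0   29.3  startfah.sh -oneun
260076.9    nice-user.user1  8/18 00:05   0+01:40:44 I  0   136.7 startfah.sh -oneun
260185.0    user123          8/30 13:01   0+00:00:00 R  0   19.5  hello
"""

ids = job_ids_for_owner(sample, "user123")
print(ids)
```

Each ID returned this way can then be passed to condor_rm as shown in the example above.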
DAG means Directed Acyclic Graph and is a way of submitting many jobs at the same time. DAGMan is the Directed Acyclic Graph Manager. In short, DAGMan lets you submit complex sequences of jobs as long as they can be expressed as a directed acyclic graph. For example, you may wish to run a large parameter sweep, but before the sweep runs you need to prepare your data. After the sweep runs, you need to collate the results. This might look like this, assuming you want to sweep over five parameters:
DAGMan has many abilities, such as throttling jobs, recovery from failures, and more. More information about DAGMan can be found in the Condor manual.
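As a rough illustration of the prepare, sweep, collate pattern described above, the following Python sketch generates the text of a DAG input file with one preparation job, five sweep jobs, and one collation job. The job names and submit file names (prepare.submit, sweep0.submit, collate.submit, and so on) are hypothetical; only the JOB and PARENT ... CHILD keywords come from DAGMan itself.

```python
def sweep_dag(n_params):
    """Build DAG text: prepare -> n_params sweep jobs -> collate."""
    sweep_names = ["sweep%d" % i for i in range(n_params)]
    lines = ["JOB prepare prepare.submit"]
    lines += ["JOB %s %s.submit" % (name, name) for name in sweep_names]
    lines.append("JOB collate collate.submit")
    # All sweep jobs wait for the preparation step...
    lines.append("PARENT prepare CHILD " + " ".join(sweep_names))
    # ...and the collation step waits for every sweep job.
    lines.append("PARENT " + " ".join(sweep_names) + " CHILD collate")
    return "\n".join(lines) + "\n"


dag_text = sweep_dag(5)
print(dag_text)
```

Writing dag_text to a file and creating the matching submit files would give a DAG that can be handed to condor_submit_dag.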
As an example, consider the DAGMan submit file dagfile.dag:
Job 1 testrun1.submit
Job 2 testrun2.submit
Job 3 testrun3.submit
PARENT 1 CHILD 2
PARENT 2 CHILD 3
We then need a Condor submit file for each part. For example, 'testrun1.submit' could look like this (in the example below the DAG is submitted across the grid; for a local submission, just change the universe and remove globusscheduler):
Universe = globus
Executable = testrun1
Transfer_Executable = true
Globusscheduler = tg-login1.ncsa.teragrid.org/jobmanager
Output = testrun1.out
Error = testrun1.error
Log = testrun1.log
Queue
The files testrun2.submit and testrun3.submit would be similar, with just 1 changed to 2 or 3. Create the files testrun1, testrun2, and testrun3. Your directory should contain the following files:
cu12:~/dagtest238% ls
dagfile.dag  testrun1.submit  testrun2.submit  testrun3.submit  testrun1  testrun2  testrun3
cu12:~/dagtest239%
To submit this DAG, give the command:
condor_submit_dag dagfile.dag
This gives the output:
cu12:~/dagtest239% condor_submit_dag dagfile.dag
Checking all your submit files for log file names.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor   : dagfile.dag.condor.sub
Log of DAGMan debugging messages         : dagfile.dag.dagman.out
Log of Condor library debug messages     : dagfile.dag.lib.out
Log of the life of condor_dagman itself  : dagfile.dag.dagman.log
Condor Log file for all jobs of this DAG : /u/ncsa/user123/dagtest/testrun1.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 58.
-----------------------------------------------------------------------
cu12:~/dagtest240%
Just as for the ordinary condor_submit, the status of the job can be checked with condor_q and a job can be removed with condor_rm.
There are some examples of using DAG here and here.
The manual page for Condor DAG can be found here.
Click here to see another example of submitting a simple DAG.
Click here to see an example of a somewhat more complex DAG.
Click here to see an example of handling a DAG that fails.
These instructions are very short and merely meant to give you the ability to run a small example immediately. Read the rest of the sections, and maybe the Condor manual for more details on how to use other features of Condor.
Compiling
condor_compile <compiler> <program>.<extension> -o <program name>
Example:
condor_compile gcc hello.c -o hello
Submitting
It is very simple to submit the job to Condor, when the job submission file has been written. At the command prompt, just type condor_submit <job-name>, where job-name is the name of the job submission file.
Example: Here I am using a very simple job submission file, namely:
Executable = hello
Log = hello.log
Output = hello.out
Queue
Here hello is a C program which has already been compiled with the command condor_compile gcc hello.c -o hello. I have named this job submission file 'run_hello'. In the following, I am running on Radon:
user123@radon-fe00:~$ condor_submit run_hello
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 260182.
user123@radon-fe00:~$
It may take a (sometimes long) while before the job starts and finishes running, depending on how many others are using the machines, your rank, the requirements you have given for the job, etc. The progress of the job can be checked with the command condor_q. When the job has completed, I have the two files hello.log and hello.out in my directory, just as I asked for in the job submission file. You should always use a log file.
The contents of the files are:
hello.log:
000 (260182.000.000) 08/29 16:21:31 Job submitted from host: <128.210.9.35:35407>
...
001 (260182.000.000) 08/29 16:22:42 Job executing on host: <128.211.131.51:32780>
...
005 (260182.000.000) 08/29 16:22:42 Job terminated.
        (1) Normal termination (return value 13)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        830  -  Run Bytes Sent By Job
        13490672  -  Run Bytes Received By Job
        830  -  Total Bytes Sent By Job
        13490672  -  Total Bytes Received By Job
...
and
hello.out:
Hello World!
which was the output the program would otherwise have written to the screen. You will also receive an email, unless otherwise specified.
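The log file can also be read by a script, for example to find out whether a job has finished and with which return value. The Python sketch below is a minimal example that assumes the log layout shown in hello.log above: event 005 marks termination, and a line of three dots ends each event. The function name exit_code_from_log is hypothetical, and other Condor versions may format the log differently.

```python
import re


def exit_code_from_log(log_text):
    """Return the return value of a normally terminated job, else None."""
    in_termination_event = False
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        if line.startswith("005 "):
            # Event 005 is the "Job terminated." event in the log above.
            in_termination_event = True
        elif in_termination_event:
            m = re.search(r"Normal termination \(return value (\d+)\)", line)
            if m:
                return int(m.group(1))
            if line == "...":
                # The event ended without a normal-termination line.
                in_termination_event = False
    return None


# Shortened copy of the hello.log contents shown in this section.
sample_log = """\
000 (260182.000.000) 08/29 16:21:31 Job submitted from host: <128.210.9.35:35407>
...
001 (260182.000.000) 08/29 16:22:42 Job executing on host: <128.211.131.51:32780>
...
005 (260182.000.000) 08/29 16:22:42 Job terminated.
        (1) Normal termination (return value 13)
...
"""

print(exit_code_from_log(sample_log))
```

For a job that is still running, the log has no 005 event yet, so the function returns None.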
Click here for another simple example.
Click here for an example of submitting a Standard Universe job.
Click here for an example of submitting a Java Universe job.
The examples for commercial software in this section use executable files supplied by the manufacturer. Since we cannot re-link these executable files with condor_compile, we have no choice but to use Condor's Vanilla Universe. Also note that the examples are only meant to run locally, at RCAC. We cannot guarantee that the applications will be installed elsewhere. Some commercial software is available on all RCAC resources, while others are available on only a few resources. To ensure that your job, when submitted to BoilerGrid, lands on a node which has the software that you want to execute, use the appropriate 'Requirements' directive, as shown in the examples below.
Note that with the exception of R, the following commercial software is licensed for use only by Purdue Affiliates.
The following examples were all tested by submitting to Condor from Radon.
Linux:
Note that Gaussian jobs can run for a long time, so DON'T run them on the front-end. Also remember that Gaussian can generate large files.
This example shows how to submit a small Gaussian program to BoilerGrid (Purdue's Condor pool). The name of this example is water.com (rename 'water.com.txt' to 'water.com' to run or change in submit file). The program can be seen here: water.com.
The first step is to discover which versions of Gaussian are available and to choose one of them. In this example we will use gaussian03/E.01.
Next we must discover the path and executable file name of that version. Note that it may vary on different systems, so this must be done for each different machine you submit from. Repeat the following exercise:
radon:~$ module avail gaussian
------------------------------------------ /opt/modules-3.1.6/modulefiles ------------------------------------------
gaussian/C.02            gaussian/E.01            gaussian03/D.01(default)
gaussian/D.01(default)   gaussian03/C.02          gaussian03/E.01
radon:~$
radon:~$ module show gaussian03/E.01
-------------------------------------------------------------------
/opt/modules-3.1.6/modulefiles/gaussian03/E.01:
module-whatis    invoke Gaussian 03, Revision E.01
prepend-path     PATH /apps/recycled/g03-E.01
-------------------------------------------------------------------
radon:~$ module load gaussian03/E.01
radon:~$ module list
Currently Loaded Modulefiles:
  1) gaussian03/E.01
radon:~$ ls /apps/recycled/g03-E.01/g*
/apps/recycled/g03-E.01/g03         /apps/recycled/g03-E.01/gau-machine  /apps/recycled/g03-E.01/ghelp
/apps/recycled/g03-E.01/gau-cpp     /apps/recycled/g03-E.01/gauopt       /apps/recycled/g03-E.01/ghelp.hlp
/apps/recycled/g03-E.01/gau-fsplit  /apps/recycled/g03-E.01/gautraj
radon:~$
The relevant executable and its path appear in the listing above. On this machine, the full path to the gaussian03/E.01 executable is /apps/recycled/g03-E.01/g03.
The job submission file specifies the nature of the submission. We apply the path and name of the executable that we found in the exercise above (remember to change to the one you found yourself) to the appropriate point in the job submission file. This submit file runs a gaussian job using a shared filesystem:
# FILENAME: water.sub
Notification = Never
Executable = /apps/recycled/g03-E.01/g03
Universe = Vanilla
Requirements = ( ( ARCH == "X86_64" || ARCH == "INTEL") && OPSYS == "LINUX" )
Env = "GAUSS_EXEDIR=/apps/recycled/g03-E.01 GAUSS_SCRDIR=/tmp GMAIN=/apps/recycled/g03-E.01 LD_LIBRARY_PATH=/apps/recycled/g03-E.01"
Error = water.err
Log = water.log
Arguments = water.com
Queue
Testrun:
radon:~/condor_running/gaussian$ condor_submit water.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1100657.
radon:~/condor_running/gaussian$
To test for job completion, use condor_q <your_username>. After the job completes, view the results in the file water.log. Notice that this is the log file. It contains both the Gaussian output and the log data from Condor.
To learn more about Gaussian, click here: http://www.gaussian.com/.
Machines with Maple installed advertise that fact with the "HAS_MAPLE" ClassAd.
For the Maple example, assuming that we require a specific version of Maple (Maple 11), we will run a small program, maple_input:
And then the submit file, maple.submit for Condor, using file transfer:
Universe = Vanilla
Requirements = ( HAS_MAPLE == TRUE && MAPLE_VERSION == "11" && ( ARCH == "INTEL" || ARCH == "x86_64" ) && OPSYS == "LINUX" && FileSystemDomain == "rcac.purdue.edu" )
Executable = maple.sh
Error = maple.err
Log = maple.log
Output = maple.out
Queue
We also need a small script, which will do the module load and run Maple on the input file (remember to change the HOME path to your own home directory):
#!/bin/sh
export HOME=/home/ba01/u103/bbrydsx
source /etc/profile
module load maple
maple < maple_input
You can now submit it:
-bash-3.00$ condor_submit maple.submit Vanilla
Submitting job(s)
Logging submit event(s).
1 job(s) submitted to cluster 10210.
-bash-3.00$
And do a condor_q:
-bash-3.00$ condor_q
-- Submitter: tg-login64.rcac.purdue.edu : <128.211.143.238:32773> : tg-login64.rcac.purdue.edu
 ID       OWNER    SUBMITTED    RUN_TIME   ST PRI SIZE CMD
6558.0    user123  1/4  16:07   0+00:00:00 H  0   9.8  condor_dagman -f -
...
10210.0   user123  2/19 14:30   0+00:00:00 R  0   9.8  maple maple_input
212 jobs; 202 idle, 1 running, 9 held
-bash-3.00$
When the program finishes running, the output can be seen in maple.out.
To learn more about Maple, click here: http://www.maplesoft.com/.
Linux:
Mathematica (command line) is installed on RCAC Linux clusters. The executable is not located in the same place on each machine you can submit from, but you can find the correct path like this (the example below was run on Radon):
radon:~$ module show mathematica
-------------------------------------------------------------------
/opt/modules-3.1.6/modulefiles/mathematica/5.2:
module-whatis    invoke Mathematica 5.2
prepend-path     PATH /opt/Wolfram/Mathematica/5.2/Executables
-------------------------------------------------------------------
radon:~$ module load mathematica
radon:~$ which math
/opt/Wolfram/Mathematica/5.2/Executables/math
radon:~$
I am using a small test program, mathematica_input which finds the roots of a third degree equation.
Create a shell script to run the Mathematica executable:
#!/bin/sh
export HOME=<path to your home directory>
source /etc/profile
module load mathematica
which math
math < $1
And then the submit file for Condor, using file transfer. The ClusterName comparisons in the requirements ensure that the job will only run on one of the clusters which have Mathematica installed:
Universe = vanilla
Requirements = ( ( ARCH == "X86_64" || ARCH == "INTEL") && OPSYS == "LINUX" && FileSystemDomain == "rcac.purdue.edu" && ( ClusterName == "Steele" || ClusterName == "Venice" || ClusterName == "Recycled" || ClusterName == "Pete" || ClusterName == "Prospero" || ClusterName == "Rossmann" ) )
Executable = mathematica_radon.sh
#Executable = /opt/Wolfram/Mathematica/5.2/Executables/math
Arguments = mathematica_input
Input = mathematica_input
Error = mathematica_r.err
Log = mathematica_r.log
Output = mathematica_r.out
Queue
Testrun:
radon:~$ condor_submit mathematica.submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1102962.
radon:~$
A little while later, after the program has returned (check with condor_q), you can see the returned result:
radon:~$ less mathematica_r.out
Mathematica 5.2 for Linux x86 (64 bit)
Copyright 1988-2005 Wolfram Research, Inc.
-- Terminal graphics initialized --
In[1]:=
2 3
Out[1]= 1 + 3 x + 3 x + x
In[2]:=
Out[2]= {{x -> -1}, {x -> -1}, {x -> -1}}
In[3]:=
radon:~$
To learn more about Mathematica, click here: http://www.wolfram.com/.
Machines with Matlab installed advertise that fact with the "HAS_MATLAB" ClassAd.
The examples below will use the simple .m file "fact.m":
function fact=fact(n)
% This function calculates n!
% Currently calculates 15!, change the value below for other integers.
n=15;
fact=1;
for i=1: n
    fact = fact*i;
end
^Z % The last line is Ctrl-Z
Linux and Windows (same example works for both):
You need to write a submit file that will execute a shell script which runs Matlab on fact.m, using Condor to transfer the input and output files. As written, the Requirements expression in the example below restricts the job to Linux machines in the rcac.purdue.edu file system domain that advertise Matlab capability.
matlab.submit:
Executable = matlab.sh
Universe = Vanilla
Requirements = ( ( HAS_MATLAB == True ) && ( ARCH == "INTEL" || ARCH == "x86_64" ) && OPSYS == "LINUX" && FileSystemDomain == "rcac.purdue.edu" )
Log = mat.log
Output = mat.out
Error = mat.err
Queue
matlab.sh (remember to change the HOME path to your own; you can find it by changing to your top-most directory and typing pwd):
#!/bin/sh
export HOME=/home/ba01/u103/bbrydsx
source /etc/profile
module load matlab
matlab -nodisplay -nojvm -nosplash -r fact
To submit you do the following:
condor_submit matlab.submit
Example:
-bash-3.00$ condor_submit matlab.submit Vanilla
Submitting job(s)
Logging submit event(s).
1 job(s) submitted to cluster 9586.
-bash-3.00$
You can use condor_q to check on the status of your job:
-bash-3.00$ condor_q
-- Submitter: tg-login64.rcac.purdue.edu : <128.211.143.238:32773> : tg-login64.rcac.purdue.edu
 ID      OWNER    SUBMITTED    RUN_TIME   ST PRI SIZE CMD
6558.0   user123  1/4  16:07   0+00:00:00 H  0   9.8  condor_dagman -f -
7080.0   user123  1/8  12:19   0+11:24:32 H  0   9.8  testrun1
7114.0   user123  1/8  16:31   0+00:02:20 H  0   9.8  testrun1
7386.0   user123  1/10 14:07   0+11:55:35 H  0   9.8  testrun1
7543.0   user123  1/11 15:34   0+00:00:00 I  0   9.8  hello
9545.0   user123  2/6  16:38   0+00:00:00 H  0   0.0  remote-testscript
9545.1   user123  2/6  16:38   0+00:00:00 H  0   0.0  remote-testscript
9546.0   user123  2/6  16:55   0+00:00:00 H  0   9.8  a.out
9588.0   user123  2/16 13:27   0+00:00:00 R  0   9.8  matlab -nodisplay
9 jobs; 1 idle, 1 running, 7 held
-bash-3.00$
When the job returns (and is thus no longer shown in the above queue), you can get the answer from your output file, which in this case was called 'mat.out'.
Arguments = $$(MATLAB_ARGS) -r fact
Universe = Vanilla
Getenv = True
Requirements = ( HAS_MATLAB == True ) && ( Arch=="Intel") && ( OpSys=="WINNT51" ) && ( UidDomain=="ics.purdue.edu" )
should_transfer_files = YES
transfer_executable = false
when_to_transfer_output = ON_EXIT
Input = fact.m
Log = mat-win.log
error=mat-win.err
Output = mat-win.out
Queue
condor_submit matlab.submit
To learn more about Matlab, click here: http://www.mathworks.com/.
Linux:
Machines with R installed advertise that fact with the "HAS_R" ClassAd.
This example: R_input was found on http://www.mayin.org/ajayshah/KB/R/index.html, where other R examples can be found.
R_input:
# Goal: To do sorting.
#
# The approach here needs to be explained. If `i' is a vector of
# integers, then the data frame D[i,] picks up rows from D based
# on the values found in `i'.
#
# The order() function makes an integer vector which is a correct
# ordering for the purpose of sorting.
D <- data.frame(x=c(1,2,3,1), y=c(7,19,2,2))
D
# Sort on x
indexes <- order(D$x)
D[indexes,]
# Print out sorted dataset, sorted in reverse by y
D[rev(order(D$y)),]
The following submit file (r.submit) will run R on any machine that advertises it, using Condor's file transfer. Remember to change 'initialdir' to your value.
Universe = Vanilla
Executable = $$(R_EXE)
Requirements = ( HAS_R == TRUE && ( ARCH == "INTEL" || ARCH == "x86_64" ) && OPSYS == "LINUX" && FileSystemDomain == "rcac.purdue.edu" )
initialdir = /home/ba01/u103/bbrydsx/condor_running/R
arguments = $$(R_ARGS)
should_transfer_files = YES
transfer_executable = false
when_to_transfer_output = ON_EXIT
input = R_input
output = R.out
error = err.$(Process)
log = R.log
Queue
Testrun:
-bash-3.00$ condor_submit r.submit
Submitting job(s)
Logging submit event(s).
1 job(s) submitted to cluster 11.
-bash-3.00$
Check with condor_q if the program has finished running. When that is the case, you can look at the output: R.out.
To learn more about R, click here: http://www.r-project.org/.
Machines with SAS installed advertise that fact with the "HAS_SAS" ClassAd.
The submit file below will run SAS on any node that advertises it, using Condor's file transfer.
# SAS needs the environment variable $HOME set to
# *your* home directory and initialdir set to
# the path to your SAS input file.
Universe = vanilla
Executable = /apps/01/sas820/sas
Requirements = ( ( ARCH == "X86_64"|| ARCH == "INTEL") && OPSYS == "LINUX" )
initialdir = /home/ba01/u103/user123/condor_running/sas
log = SAS.log
environment = HOME=/home/ba01/u103/user123
arguments = -nonews -stdio
input = SAS_input
output = SAS_output
error = err.$(Process)
Queue
The SAS example, SAS_input was found on - Advanced Log-Linear Models Using SAS. Many other examples can be found on "SAS Online Samples".
Example showing this run:
-bash-3.00$ condor_submit sas.submit Vanilla
Submitting job(s)
Logging submit event(s).
1 job(s) submitted to cluster 10005.
-bash-3.00$
condor_q then gives:
-bash-3.00$ condor_q
-- Submitter: tg-login64.rcac.purdue.edu : <128.211.143.238:32773> : tg-login64.rcac.purdue.edu
 ID      OWNER    SUBMITTED    RUN_TIME   ST PRI SIZE CMD
9545.0   user123  2/6  16:38   0+00:00:00 H  0   0.0  remote-testscript
9545.1   user123  2/6  16:38   0+00:00:00 H  0   0.0  remote-testscript
9546.0   user123  2/6  16:55   0+00:00:00 H  0   9.8  a.out
0005.0   user123  2/19 13:50   0+00:00:00 R  0   9.8  sas -nonews -stdio
4 jobs; 0 idle, 1 running, 3 held
-bash-3.00$
When the program stops running, it returns the file SAS_output.
To learn more about SAS, click here: http://www.sas.com/technologies/analytics/statistics/stat/.
The only machines with Octave installed are Steele and its nodes. This means we can use the requirement ClusterName == "Steele" to make sure we land on a machine that can run the job.
In this example I will use a very simple octave script file 'test.m':
clear;
a=1;
b=2;
c=a+b
d=b-a
The script to execute Octave (octave.sh):
#!/bin/sh
export HOME=/home/ba01/u103/bbrydsx
source /etc/profile
module load octave
octave -q $1
The submit file (octave.submit):
Universe = vanilla
Executable = octave.sh
Arguments = test.m
Should_transfer_files = YES
When_to_transfer_output = ON_EXIT
Input = test.m
Output = test.out
Error = test.err
Log = test.log
Requirements = ( ClusterName == "Steele" && ( ARCH == "INTEL" || ARCH == "X86_64" ))
Queue
Submitting the job:
user123@radon-fe00:~/condor_running/octave$ condor_submit octave.submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1104940.
user123@radon-fe00:~/condor_running/octave$
Then, after a while, you will get the files 'test.out' and 'test.err' in your directory. The first will contain the output, and the other is (hopefully) empty, as it contains the errors.
Note: when executing Octave scripts, use the following forms.
For scripts:
% octave -qf 'octavescript.m("inputfile.dat")'
For functions:
octave --silent --eval 'octavescript.m("inputfile.dat")'
To learn more about Octave, click here: http://www.gnu.org/software/octave/.
Perl is installed everywhere, and in the same location. This means we can give the absolute path to it in the submission file.
In this example I will use a simple perl script 'numbers.pl':
#!/usr/bin/perl -w
print 255, " is 255 in decimal. \n"; # decimal
print 0377, " is 377 in octal. \n"; # octal
print 0b11111111, " is 11111111 in binary. \n"; # binary
print 0xFF, " is FF in hexadecimal. \n"; # hexadecimal
The submit file (perl.submit):
Universe = vanilla
Requirements = ( ( ARCH == "X86_64" || ARCH == "INTEL") && OPSYS == "LINUX" && FileSystemDomain == "rcac.purdue.edu" )
Executable = /usr/local/bin/perl
Arguments = numbers.pl
Input = numbers.pl
Error = perl.err
Log = perl.log
Output = perl.out
Queue
Submitting the job:
user123@radon-fe00:~/condor_running/perl$ condor_submit perl.submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1104943.
user123@radon-fe00:~/condor_running/perl$
Then, after a while, you will get the files 'perl.out' and 'perl.err' in your directory. The first will contain the output, and the other is (hopefully) empty, as it contains the errors.
perl.out:
255 is 255 in decimal.
255 is 377 in octal.
255 is 11111111 in binary.
255 is FF in hexadecimal.
To learn more about Perl, click here: http://www.perl.org/.
Python is installed everywhere, and in the same location. This means we can give the absolute path to it in the submission file.
In this example I will use a simple python script 'test.py':
#!/usr/local/bin/python
import string, sys
rodents = 4
print "Python says: YUM...there are", rodents, "delicious little rodents here..."
The submit file (python.submit):
Universe = vanilla
Requirements = ( ( ARCH == "X86_64" || ARCH == "INTEL") && OPSYS == "LINUX" && FileSystemDomain == "rcac.purdue.edu" )
Executable = /usr/bin/python
Arguments = test.py
Input = test.py
Error = test.err
Log = test.log
Output = test.out
Queue
Submitting the job:
user123@radon-fe00:~/condor_running/python$ condor_submit python.submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1104946.
user123@radon-fe00:~/condor_running/python$
Then, after a while, you will get the files 'test.out' and 'test.err' (the names given in the submit file) in your directory. The first will contain the output, and the other is (hopefully) empty, as it contains the errors.
test.out:
Python says: YUM...there are 4 delicious little rodents here...
To learn more about Python, click here: http://www.python.org/.
Condor is able to schedule and run any type of process, but Condor's Standard Universe does have some limitations on any jobs that it checkpoints and migrates:
Note: these limitations only apply to jobs which Condor has been asked to transparently checkpoint. If job checkpointing is not desired (Vanilla Universe), the limitations above do not apply.
Note: Jobs need to be re-linked to get checkpointing and remote system calls. Although typically no source code changes are required, Condor requires that the jobs be re-linked with the Condor libraries to take advantage of checkpointing and remote system calls. This often precludes commercial software binaries from taking advantage of these services because commercial packages rarely make their object code available. Condor's other services are still available for these commercial packages.
It is important to list the correct requirements and rank commands in the job submission file. This way you can assure that your program is run on the machine that best fits your requirements.
These requirements and rank must be specified as valid Condor ClassAd expressions. There are, however, default values set by the condor_submit program, which are used if none are defined in the job submission file. The ClassAd expressions are intuitive and reminiscent of C. It is possible to write quite elaborate expressions with ClassAds. Check out chapter 4.1 in the Condor manual for a complete description.
All of the commands in the job submission file are case insensitive, except for the ClassAd attribute string values. ClassAd attribute names are case insensitive, but ClassAd string values are case preserving.
Note that the comparison operators (<, >, <=, >=, and ==) compare strings case insensitively. The special comparison operators =?= and =!= compare strings case sensitively.
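The difference between the two kinds of string comparison can be mimicked in Python. This is only an analogy under simplifying assumptions: the two helper functions are hypothetical stand-ins for the == and =?= operators applied to strings, and they do not model how the real operators treat UNDEFINED or non-string values.

```python
def classad_eq(a, b):
    """Stand-in for '==' on two strings: case-insensitive."""
    return a.lower() == b.lower()


def classad_meta_eq(a, b):
    """Stand-in for '=?=' on two strings: case-sensitive."""
    return a == b


# "x86_64" and "X86_64" are equal under ==, but not under =?=.
print(classad_eq("x86_64", "X86_64"))
print(classad_meta_eq("x86_64", "X86_64"))
```

This is why a requirement such as ARCH == "x86_64" can match machines that advertise the value as "X86_64".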
The allowed ClassAd attributes vary from machine to machine. To see all of the machine ClassAd attributes for all machines in the rcac.purdue.edu Condor pool, run the command condor_status -l. If there are any jobs in the queue, you can see the job ClassAds with the command condor_q -l. There is another useful command, which is local to Purdue. It is called condor_all. You can run condor_all -l to see all machine ClassAd attributes for all machines in BoilerGrid.
When Condor is considering a match between a job and a machine, the rank is used to choose a match from among all machines that satisfy the job's requirements and are available to the user, after accounting for the user's priority and the machine's rank of the job. The rank expressions, simple or complex, define a numerical value that expresses preferences.
The job's rank expression evaluates to one of three values:
If rank evaluates to a floating point value, the best match will be the one with the largest, positive value. If no rank is given in the job submission file, then Condor substitutes a default value of 0.0 when considering machines to match. If the job's rank of a given machine evaluates to UNDEFINED or ERROR, this same value of 0.0 is used. Therefore, the machine is still considered for a match, but has no rank above any other.
A boolean expression evaluates to the numerical value of 1.0 if true, and 0.0 if false.
Here are some examples of rank expressions from the Condor manual:
Rank = memory
Rank = ( (clockday == 0) || (clockday == 6) ) && (machine == "friend.rcac.purdue.edu")
Rank = (machine == "friend1.rcac.purdue.edu") || (machine == "friend2.rcac.purdue.edu") || (machine == "friend3.rcac.purdue.edu")
Rank = kflops
This last example may cause problems, since not all machines have the kflops attribute defined. For machines where this attribute is not defined, Rank will evaluate to the value UNDEFINED, and Condor will use a default rank of 0.0 for that machine. The rank expression will only rank machines where the attribute is defined. Therefore, the machine with the highest floating-point performance may not be the one given the highest rank.
Thus, before writing a rank expression, it is always wise to check whether evaluating it will lead to the expected ranking of machines (use the command condor_status -constraint <name> to see a list of machines that fit a certain constraint). For example, to see which machines in the pool have kflops defined, use condor_status -constraint kflops.
Alternatively, to see a list of machines where kflops is not defined, use condor_status -constraint "kflops=?=undefined".
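The effect of an undefined attribute on ranking can be illustrated with a small Python sketch. The machine names and kflops values below are invented, and UNDEFINED is modeled simply as a missing dictionary key; the point is only that a machine which does not advertise kflops falls back to rank 0.0, however fast it really is.

```python
UNDEFINED = None  # stand-in for the ClassAd value UNDEFINED


def effective_rank(machine, attribute):
    """Model 'Rank = <attribute>', using 0.0 when the attribute is undefined."""
    value = machine.get(attribute, UNDEFINED)
    return 0.0 if value is UNDEFINED else float(value)


# Hypothetical machine ads; only two of them advertise kflops.
pool = [
    {"Name": "slow", "kflops": 500000},
    {"Name": "fast", "kflops": 900000},
    {"Name": "fastest-but-silent"},  # no kflops attribute advertised
]

ranked = sorted(pool, key=lambda m: effective_rank(m, "kflops"), reverse=True)
print([m["Name"] for m in ranked])
```

The machine without a kflops attribute ends up last in the ordering, which is the pitfall described above.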
Rank = ((machine == "friend1.rcac.purdue.edu")*3) + ((machine == "friend2.rcac.purdue.edu")*2) + (machine == "friend3.rcac.purdue.edu")
Example: If the machine being ranked is "friend1.rcac.purdue.edu", then the expression
(machine == "friend1.rcac.purdue.edu")
is true, and gives the value 1.0. The expressions
(machine == "friend2.rcac.purdue.edu")
and
(machine == "friend3.rcac.purdue.edu")
are false, and give the value 0.0. Therefore, rank evaluates to the value 3.0. In this way, machine "friend1.rcac.purdue.edu" is ranked higher than machine "friend2.rcac.purdue.edu", machine "friend2.rcac.purdue.edu" is ranked higher than machine "friend3.rcac.purdue.edu", and all three of these machines are ranked higher than others.
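The arithmetic in this example can be checked with a short Python sketch that treats a true comparison as 1.0 and a false one as 0.0, as described above. The host other.rcac.purdue.edu is a hypothetical machine added only to show the rank of a machine that is not on the list.

```python
def rank(machine):
    """Weighted rank: friend1 scores 3, friend2 scores 2, friend3 scores 1."""
    return (float(machine == "friend1.rcac.purdue.edu") * 3
            + float(machine == "friend2.rcac.purdue.edu") * 2
            + float(machine == "friend3.rcac.purdue.edu"))


for name in ["friend1.rcac.purdue.edu", "friend2.rcac.purdue.edu",
             "friend3.rcac.purdue.edu", "other.rcac.purdue.edu"]:
    print(name, rank(name))
```

At most one of the three comparisons can be true for a given machine, so the possible ranks are 3.0, 2.0, 1.0, and 0.0 for any machine not named in the expression.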
To make sure your job runs only on RCAC machines, add this to your submit file:
Requirements = ( FileSystemDomain == "rcac.purdue.edu" )
To make sure your job runs only on RCAC Linux machines, but either INTEL or X86_64 architecture, add this to your submit file:
Requirements = ( ( ARCH == "X86_64" || ARCH == "INTEL") && OPSYS == "LINUX" && FileSystemDomain == "rcac.purdue.edu" )
To make sure your job runs only on specific RCAC clusters, add this to your submit file (remove the clusters you DON'T want to run on from the example below):
Requirements = ( ( ARCH == "X86_64" || ARCH == "INTEL") && OPSYS == "LINUX" && FileSystemDomain == "rcac.purdue.edu" && ( ClusterName == "Steele" || ClusterName == "Venice" || ClusterName == "Recycled" || ClusterName == "Pete" || ClusterName == "Prospero" || ClusterName == "Rossmann" ) )
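Long Requirements lines are easy to mistype, so it can help to generate them. The Python sketch below builds such a line from a list of cluster names, on the assumption that every alternative has to be written as its own comparison against ClusterName. The function name cluster_requirements is hypothetical; the attribute names and values are the ones used in the examples above.

```python
def cluster_requirements(clusters):
    """Build a Requirements expression limited to the given clusters."""
    if not clusters:
        raise ValueError("need at least one cluster name")
    # One explicit comparison per cluster, joined with ||.
    cluster_clause = " || ".join('ClusterName == "%s"' % c for c in clusters)
    return ('( ( ARCH == "X86_64" || ARCH == "INTEL" ) && OPSYS == "LINUX" '
            '&& FileSystemDomain == "rcac.purdue.edu" && ( %s ) )'
            % cluster_clause)


print("Requirements = " + cluster_requirements(["Steele", "Rossmann"]))
```

The printed line can be pasted into a submit file in place of a hand-written Requirements command.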
It is possible to allow Condor to choose between a perhaps larger pool of machines for a job, if executables are available for all the different platforms. This is done by making changes to the job submission file.
Example:
Cross submission. An executable is available for one platform, but the submission is done from a different platform. Given the correct executable, the requirements command in the job submission file specifies the target architecture. Here, a job whose executable was compiled for an X86_64 architecture running Linux would add the requirement:
requirements = Arch == "X86_64" && OpSys == "LINUX"
Without this requirement, condor_submit will assume that the program is to be executed on a machine with the same platform as the machine where the job is submitted.
Cross submission works for both Standard and Vanilla Universes. To see the architecture and OS for the machines in the pool, type the command condor_status.
Click here to see some examples (from the Condor manual) showing how cross submission works in the Vanilla Universe and here for an example for the Standard Universe.
Machine attributes:
Here follows a description of some of the common machine attributes. For a more complete listing of attributes, look here.
Job attributes:
Changing the priority of jobs
In addition to the priorities assigned to each user, Condor also provides each user with the capability of assigning priorities to each submitted job. These job priorities are local to each queue and can be any integer value, with higher values meaning better priority.
The default priority of a job is 0, but can be changed using the condor_prio command. Example: to change the priority of a job to -15:
user123@radon-fe00:~$ condor_q user123

-- Submitter: radon-fe00.rcac.purdue.edu : <128.210.9.35:35407> : radon-fe00.rcac.purdue.edu
 ID        OWNER      SUBMITTED     RUN_TIME    ST PRI SIZE CMD
 260187.0  user123    8/30 13:59    0+00:00:00  I  0   19.5 hello

1 jobs; 1 idle, 0 running, 0 held
user123@radon-fe00:~$ condor_prio -p -15 260187.0
user123@radon-fe00:~$ condor_q user123

-- Submitter: radon-fe00.rcac.purdue.edu : <128.210.9.35:35407> : radon-fe00.rcac.purdue.edu
 ID        OWNER      SUBMITTED     RUN_TIME    ST PRI SIZE CMD
 260187.0  user123    8/30 13:59    0+00:00:03  R  -15 19.5 hello

1 jobs; 0 idle, 1 running, 0 held
user123@radon-fe00:~$
Note these job priorities are different from the user priorities assigned by Condor. Job priorities do not impact user priorities and are only a mechanism for the user to identify the relative importance of jobs among all the jobs submitted by the user to that specific queue.
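As a rough mental model of job priorities, the following Python sketch orders one user's jobs in one queue so that higher priority values come first. It is a simplification for illustration, not Condor's scheduler: the Job type and run_order function are hypothetical, and keeping submission order for jobs with equal priority is an assumption of the sketch.

```python
from typing import NamedTuple


class Job(NamedTuple):
    job_id: str
    priority: int = 0  # the default job priority is 0


def run_order(jobs):
    """Return the jobs with higher priority values first.

    sorted() is stable, so jobs with equal priority keep the order in
    which they appear in the input (assumed here to be submission order).
    """
    return sorted(jobs, key=lambda job: -job.priority)


queue = [Job("260187.0", -15), Job("260188.0"), Job("260189.0", 5)]
print([job.job_id for job in run_order(queue)])
# prints ['260189.0', '260188.0', '260187.0']
```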
The idea of grid computing is to be able to use resources that span many administrative domains. Even though a Condor pool usually contains machines owned by many different people, collaborating researchers from different organizations often do not consider it feasible to combine all their computers in one large Condor pool. They will therefore have to use grid computing.
Condor has its own mechanisms for grid computing and can also interact with other grid systems. The usual way for Condor to submit jobs from one pool to another is via flocking. Flocking is enabled by configuration within each of the pools. Jobs migrate from one pool to another based on the availability of machines to execute jobs. If the local Condor pool currently does not have any available machines to run a job, the job will flock to another pool. This is not something the user needs to think about - nothing needs to be added or changed in the job submission file.
To learn more about flocking, Condor-C jobs, glidein (a mechanism by which one or more grid resources (remote machines) temporarily join a local Condor pool), the program condor_glidein (used to add a machine to a Condor pool), and running Condor alongside other middleware such as Globus, see section 5 of the official Condor manual.
To set up flocking, first send the DNS hostname of your Condor central manager (condor_negotiator and condor_collector) to condor-admin@rcac.purdue.edu. RCAC Condor administrators will then allow your Condor pool access to BoilerGrid. Then, the locations where your job can be executed, as well as where it can be submitted from, must be identified with the variables 'FLOCK_TO' and 'FLOCK_FROM', respectively. These variables are set in Part 2 of the condor_config file, located in <path to>/condor/etc/. At Purdue, these variables should be set to:
FLOCK_FROM = *.rcac.purdue.edu
and
FLOCK_TO = albatross.rcac.purdue.edu,emu.rcac.purdue.edu,egret.rcac.purdue.edu,flamingo.rcac.purdue.edu
Also, the variables 'FLOCK_COLLECTOR_HOSTS', 'FLOCK_NEGOTIATOR_HOSTS', and 'HOSTALLOW_NEGOTIATOR_SCHEDD' should be set (the settings below assume that condor_collector and condor_negotiator daemons are running on the same machine):
FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
HOSTALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
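If you maintain several submit machines, it can help to generate these lines rather than type them. The Python sketch below is an illustration only: the function name flocking_config is hypothetical, the variable names and default values are the ones listed above, and the result still has to be merged into condor_config by hand.

```python
def flocking_config(flock_to, flock_from="*.rcac.purdue.edu"):
    """Render the flocking settings described above as condor_config text.

    flock_to is a list of central manager host names; as in the text,
    the collector and negotiator are assumed to run on the same machine.
    """
    if not flock_to:
        raise ValueError("flock_to must name at least one host")
    lines = [
        f"FLOCK_FROM = {flock_from}",
        "FLOCK_TO = " + ",".join(flock_to),
        "FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)",
        "FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)",
        "HOSTALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)",
    ]
    return "\n".join(lines) + "\n"


print(flocking_config([
    "albatross.rcac.purdue.edu",
    "emu.rcac.purdue.edu",
    "egret.rcac.purdue.edu",
    "flamingo.rcac.purdue.edu",
]))
```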
The configuration macros that must be set in pool B are ones that authorize jobs from machine A to flock to pool B.
Using the 'FLOCK_FROM' variable, the variables below should keep their default values:
HOSTALLOW_WRITE_COLLECTOR = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
HOSTALLOW_WRITE_STARTD = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
HOSTALLOW_READ_COLLECTOR = $(HOSTALLOW_READ), $(FLOCK_FROM)
HOSTALLOW_READ_STARTD = $(HOSTALLOW_READ), $(FLOCK_FROM)
Run "condor_reconfig" on your condor servers, and your pool should be configured to flock to and from BoilerGrid.
Jobs will always try to run locally and only flock to another pool when no machine is available in the current pool.
In the past, all jobs using flocking were Standard Universe jobs. This is no longer so and it is possible to submit jobs to other universes, but it is necessary to take into account the location of input, output and error files. Since machines in separate pools do not usually have a shared file system, the user needs to use file transfer mechanisms. See section 2.5.4 in the official Condor manual.
Condor-C Job submission
Job submission is done the same way for Condor-C jobs as for all other Condor jobs. The only thing to remember is that the universe must be 'grid'. There should also be an entry 'grid_resource' in the job submission file, which specifies the remote condor_schedd daemon to which the job should be submitted. The value of 'grid_resource' consists of three fields: 1) the grid type (condor), 2) the name of the remote condor_schedd daemon (the same as the condor_schedd ClassAd attribute Name on the remote machine), and 3) the name of the remote pool's condor_collector. Here is an example job submission file:
Universe = grid
Executable = myjob
Output = myoutput
Error = myerror
Log = mylog
grid_resource = condor joe@remotemachine.example.com remotecentralmanager.example.com
+remote_jobuniverse = 5
+remote_requirements = True
+remote_ShouldTransferFiles = "YES"
+remote_WhenToTransferOutput = "ON_EXIT"
Queue
The remote machine needs to know the attributes of the job. In the job submission file these are specified with the '+' syntax, followed by the string remote_. As a minimum, these must be the job's universe and the job's requirements. Most likely there will also be other attributes specific to the job's universe (on the remote pool).
Note: attributes set with '+' are inserted directly into the job's ClassAd. Specify attributes as they must appear in the job's ClassAd, not the job submission file.
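To make the three grid_resource fields and the '+remote_' attributes concrete, here is a small Python sketch that renders a Condor-C submit file like the example above. It is a convenience illustration, not part of Condor: the function name condor_c_submit and its parameters are hypothetical, and the attribute values are simply the ones from the example.

```python
def condor_c_submit(executable, remote_schedd, remote_collector, remote_universe=5):
    """Render a Condor-C job submission file as text.

    grid_resource has three fields: the grid type (condor), the name of
    the remote condor_schedd, and the remote pool's condor_collector.
    remote_universe defaults to 5, the value used in the example above.
    """
    lines = [
        "Universe = grid",
        f"Executable = {executable}",
        "Output = myoutput",
        "Error = myerror",
        "Log = mylog",
        f"grid_resource = condor {remote_schedd} {remote_collector}",
        # '+' attributes are written as they must appear in the job's ClassAd.
        f"+remote_jobuniverse = {remote_universe}",
        "+remote_requirements = True",
        '+remote_ShouldTransferFiles = "YES"',
        '+remote_WhenToTransferOutput = "ON_EXIT"',
        "Queue",
    ]
    return "\n".join(lines) + "\n"


print(condor_c_submit(
    "myjob",
    "joe@remotemachine.example.com",
    "remotecentralmanager.example.com",
))
```

The output can be saved to a file and passed to condor_submit like any other job submission file.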
See section 5.3.1.2 in the official Condor manual for more information and examples.
Participating when your department already runs Condor
If you operate Condor already, then you are most of the way there. Condor pools can be joined with Condor's "flocking" mechanism. To set up flocking with BoilerGrid (the parameters below are set in the file 'condor_config'):
If your department operates a Condor pool, or has idle workstations, clusters, or labs that could provide computing cycles to Condor, then the Rosen Center for Advanced Computing would like to help you join these machines to BoilerGrid, creating a campus-wide flock of systems with which to advance scientific discovery.
Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor, Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.
While providing functionality similar to that of a more traditional batch queueing system, Condor's novel architecture allows it to succeed in areas where traditional scheduling systems fail. Condor can be used to manage a cluster of dedicated compute nodes (such as a "Beowulf" cluster). In addition, unique mechanisms enable Condor to effectively harness wasted CPU power from otherwise idle desktop workstations. For instance, Condor can be configured to only use desktop machines where the keyboard and mouse are idle. Should Condor detect that a machine is no longer available (such as a key press detected), in many circumstances Condor is able to transparently produce a checkpoint and migrate a job to a different machine which would otherwise be idle. Condor does not require a shared file system across machines - if no shared file system is available, Condor can transfer the job's data files on behalf of the user, or Condor may be able to transparently redirect all the job's I/O requests back to the submit machine. As a result, Condor can be used to seamlessly combine all of an organization's computational power into one resource. Condor is designed for high-throughput computing and is excellent for parameter sweeps, Monte Carlo simulation, or nearly any serial application. Some classes of parallel jobs (master-worker) may be run effectively via Condor as well.
To get more information about Condor, go to the Condor Homepage and read the official Condor Manual.
Another good page with information is the Condor tutorials and slides from the 'Condor Boot Camp' that was held at Purdue University.
There is also a good, shorter tutorial of how to install Condor on the Boilergrid pages.
The installation of Condor depends on your platform and on whether you want to set up a personal Condor, make a new Condor pool, or join an existing pool.
You can also find some information about Condor and joining BoilerGrid here.
The first step to installing Condor is to download it. This can also be done in more than one way. The instructions below should cover most variations.
It may be a good idea to join the condor-world mailing list. Traffic on this list is kept to an absolute minimum. It is only used to announce new releases of Condor. To subscribe, send a message to majordomo@cs.wisc.edu with the body: subscribe condor-world. Another useful mailing list is condor-users. It has quite a lot of traffic and can be used for discussing problems you may have with Condor. To join this list, send an email to majordomo@cs.wisc.edu with the body: subscribe condor-users.
First, download the newest version of Condor MSI from http://www.cs.wisc.edu/condor/downloads/. Click on the latest stable release and fill in name, email, and organization. Then click "I agree".
On the page you are taken to, there will be a long list of Condor binaries for various OS's. Go down nearly to the bottom and you will find the binary for Windows 2000/XP. Download the MSI-file.
Then, run the MSI and answer the questions it asks. You will need to have administrator rights to install Condor.
Submitting jobs:
You will need to open the file condor_config and make a few changes to it - setting up CONDOR_HOST and such. If you are using flocking, then you will need to set these parameters as specified under "Participating when your department already runs Condor". See here for an example of the condor_config file.
After making changes to the condor_config file, you need to run condor_reconfig and add credentials for you (stash password). To do this:
You should now be able to submit jobs to Condor, if you enabled that during setup.
To submit the job, type condor_submit <full_path_to>\<job submission file>.
Remember: you can only submit 'Vanilla' jobs from Windows (add "Universe = Vanilla" to your job submission file).
There is more installation advice in section 6.2.10 of the Condor manual.
To install Condor you need to have root access. If you just want to try out a small personal version of Condor on your own machine, then that is possible without root access. This is described further down.
Java: if you want to be able to run Java with Condor, then you need to have Java installed before installing Condor. If Java is not installed, you could, for example, download it from http://java.sun.com/j2se/1.4.2/download.html.
To install a normal, full version of Condor you need to have root access.
There are two ways of downloading and installing: you can download the appropriate binaries directly from http://www.cs.wisc.edu/condor/downloads/, or you can use the "automated" download and install. Below I will go into more detail for both options.
After installing Condor, you must then set the JAVA and JAVA_MAXHEAP_ARGUMENT parameters in the condor_config file.
Downloading from http://www.cs.wisc.edu/condor/downloads/
There are a number of parameters which should be set in the condor_config file - located in <path/to/Condor_install_directory/etc/>. Look here to see an example. I have marked the places which may need to be changed in your file. Below follows a list of the parameters and what they should be set to (the FLOCK_TO and FLOCK_FROM should only be set if you are using flocking):
After everything is installed and configured, you should run condor_reconfig on your Condor servers. Then you should add the paths to your .cshrc or .bashrc:
.cshrc:
setenv CONDOR_CONFIG <path/to/Condor_install_directory>/etc/condor_config
set path=(<path/to/Condor_install_directory>/bin $path)
set path=(<path/to/Condor_install_directory>/sbin $path)
.bashrc:
export CONDOR_CONFIG=<path/to/Condor_install_directory>/etc/condor_config
export PATH=<path/to/Condor_install_directory>/bin:${PATH}
export PATH=<path/to/Condor_install_directory>/sbin:${PATH}
To check if everything is set up properly, type (in bash, use echo $PATH instead of echo $path):
echo $CONDOR_CONFIG
echo $path
which condor_master
which condor_submit
These should give the expected results: the first two should repeat the paths that you set, and the last two should show that the system can indeed find both the Condor /bin and the Condor /sbin directories as well as the files in them.
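The same checks can be scripted. The Python sketch below is a hedged illustration rather than an official tool: the function name check_condor_setup is hypothetical, and it only verifies that CONDOR_CONFIG is set and that condor_master and condor_submit can be found on the search path, which is what the commands above check by eye.

```python
import os
import shutil


def check_condor_setup(environ=os.environ, which=shutil.which):
    """Return a list of problems found; an empty list means all checks passed.

    environ and which are parameters so the check can be exercised
    without a real Condor installation.
    """
    problems = []
    if not environ.get("CONDOR_CONFIG"):
        problems.append("CONDOR_CONFIG is not set")
    # condor_submit lives in bin and condor_master in sbin, so finding
    # both means both directories are on the search path.
    for tool in ("condor_master", "condor_submit"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


for problem in check_condor_setup():
    print(problem)
```

If nothing is printed, the environment matches what the manual checks above expect.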
Starting Condor:
After confirming that everything is set up correctly, you are ready to start Condor. This is done by typing (as root/with root permission):
condor_master
To check that everything is running, type ps x. This should give something like this:
armadillo:~> ps x
  PID TTY      STAT   TIME COMMAND
 4568 ?        S      0:06 condor_schedd
 9738 ?        Ss    14:02 condor_master
 9739 ?        Ss     1:32 condor_collector -f
 9740 ?        Ss     0:35 condor_negotiator -f
 9741 ?        Ss     0:12 condor_schedd -f
 9742 ?        Ss    12:23 condor_startd -f
13093 ?        S      0:00 sshd: bbrydsx@pts/4
13095 pts/4    Ss     0:00 -tcsh
13123 pts/4    R+     0:00 ps x
armadillo:~>
This installation must be usable on every host where you want to run Condor jobs, so use a shared file system or your file distribution method of choice to make Condor available across your network.
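Scanning the ps x listing by eye gets tedious on several machines. The Python sketch below parses such a listing and reports which daemons are absent. It is an illustration under stated assumptions: the function name missing_daemons is hypothetical, the expected list is taken from the sample output above (a machine that is not the central manager would not run condor_collector or condor_negotiator), and the command name is assumed to be the fifth column.

```python
import os

# Daemons seen in the sample `ps x` output above.
EXPECTED_DAEMONS = (
    "condor_master",
    "condor_collector",
    "condor_negotiator",
    "condor_schedd",
    "condor_startd",
)


def missing_daemons(ps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemons that do not appear in `ps x` output."""
    running = set()
    for line in ps_output.splitlines():
        fields = line.split()
        # Columns are: PID TTY STAT TIME COMMAND [arguments...]
        if len(fields) >= 5:
            running.add(os.path.basename(fields[4]))
    return [name for name in expected if name not in running]


sample = """\
  PID TTY      STAT   TIME COMMAND
 9738 ?        Ss    14:02 condor_master
 9739 ?        Ss     1:32 condor_collector -f
 9741 ?        Ss     0:12 condor_schedd -f
"""
print(missing_daemons(sample))
# prints ['condor_negotiator', 'condor_startd']
```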
The installation will use reasonable defaults, but if you would like to further customize policies for starting, suspending, and preempting jobs on your execute nodes, consult the Condor manual for details. Another place to look for a walk-through of the above, is here: http://www.rcac.purdue.edu/boilergrid/condorTutorials/install_condor.cfm.
"Automated" download and installation
This is probably the easiest way to install Condor from Linux/Unix. You need root access to install this way. In the following I will explain how to install Condor via the Virtual Data Toolkit (VDT): http://vdt.cs.wisc.edu/releases/1.3.11/.
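After everything is installed and configured, run condor_reconfig on your Condor servers. Then you should add the paths to your .cshrc or .bashrc:
.cshrc:
setenv CONDOR_CONFIG <path/to/Condor_install_directory>/etc/condor_config
set path=(<path/to/Condor_install_directory>/bin $path)
set path=(<path/to/Condor_install_directory>/sbin $path)
.bashrc:
export CONDOR_CONFIG=<path/to/Condor_install_directory>/etc/condor_config
export PATH=<path/to/Condor_install_directory>/bin:${PATH}
export PATH=<path/to/Condor_install_directory>/sbin:${PATH}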
To check if everything is set up properly, type (in bash, use echo $PATH instead of echo $path):
echo $CONDOR_CONFIG
echo $path
which condor_master
which condor_submit
These should give the expected results: the first two should repeat the paths that you set, and the last two should show that the system can indeed find both the Condor /bin and the Condor /sbin directories as well as the files in them.
Starting Condor:
After confirming that everything is set up correctly, you are ready to start Condor. This is done by typing (as root/with root permission):
condor_master
To check that everything is running, type ps x. This should give something like this:
armadillo:~> ps x
  PID TTY      STAT   TIME COMMAND
 4568 ?        S      0:06 condor_schedd
 9738 ?        Ss    14:02 condor_master
 9739 ?        Ss     1:32 condor_collector -f
 9740 ?        Ss     0:35 condor_negotiator -f
 9741 ?        Ss     0:12 condor_schedd -f
 9742 ?        Ss    12:23 condor_startd -f
13093 ?        S      0:00 sshd: bbrydsx@pts/4
13095 pts/4    Ss     0:00 -tcsh
13123 pts/4    R+     0:00 ps x
armadillo:~>
This installation must be usable on every host where you want to run Condor jobs, so use a shared file system or your file distribution method of choice to make Condor available across your network.
The installation will use reasonable defaults, but if you would like to further customize policies for starting, suspending, and preempting jobs on your execute nodes, consult the Condor manual for details.
A Personal Condor installation does not require root access and can be used to try out Condor on your own Linux/Unix machine. It is also possible to connect to other machines using the Condor 'flocking' option. Here is a short description of how to install a personal Condor.
There are a number of parameters which may need to be set in the 'condor_config' file - located in
After everything is installed and configured, you should add the paths to your .cshrc or .bashrc:
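.cshrc:
setenv CONDOR_CONFIG <path/to/Condor_install_directory>/etc/condor_config
set path=(<path/to/Condor_install_directory>/bin $path)
set path=(<path/to/Condor_install_directory>/sbin $path)
.bashrc:
export CONDOR_CONFIG=<path/to/Condor_install_directory>/etc/condor_config
export PATH=<path/to/Condor_install_directory>/bin:${PATH}
export PATH=<path/to/Condor_install_directory>/sbin:${PATH}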
To check if everything is set up properly, type (in bash, use echo $PATH instead of echo $path):
echo $CONDOR_CONFIG
echo $path
which condor_master
which condor_submit
These should give the expected results: the first two should repeat the paths that you set, and the last two should show that the system can indeed find both the Condor /bin and the Condor /sbin directories as well as the files in them.
Starting Condor:
After confirming that everything is set up correctly, you are ready to start Condor. Because a Personal Condor does not require root access, this is done by typing as your own user:
condor_master
To check that everything is running, type ps x. This should give something like this:
armadillo:~> ps x
  PID TTY      STAT   TIME COMMAND
 4568 ?        S      0:06 condor_schedd
 9738 ?        Ss    14:02 condor_master
 9739 ?        Ss     1:32 condor_collector -f
 9740 ?        Ss     0:35 condor_negotiator -f
 9741 ?        Ss     0:12 condor_schedd -f
 9742 ?        Ss    12:23 condor_startd -f
13093 ?        S      0:00 sshd: bbrydsx@pts/4
13095 pts/4    Ss     0:00 -tcsh
13123 pts/4    R+     0:00 ps x
armadillo:~>
The installation will use reasonable defaults, but if you would like to further customize policies for starting, suspending, and preempting jobs on your execute nodes, consult the Condor manual for details.
The Macintosh port of Condor is more accurately a port of Condor to Darwin, the BSD core of OS X. Condor uses the Carbon library only to detect keyboard activity, and it does not use Cocoa at all. Condor on the Macintosh is a relatively new port, and it is not yet well-integrated into the Macintosh environment.
Condor on the Macintosh has a few shortcomings:
Download and installation:
After installation you need to make some changes to the condor_config file. It is located in the directory <condor_install_directory>/etc/. The parameters should be set to (the FLOCK_TO and FLOCK_FROM should only be set if you are using flocking):
After making changes to the condor_config file, you need to run condor_reconfig.
You should now be able to submit jobs to Condor. This is done by first writing and compiling your program and then making a 'job submission file'. See here for an example.
To submit the job, type condor_submit <full_path_to>/<job submission file>.
There are currently no FAQs for BoilerGrid.