This document follows certain typesetting and naming conventions:
$ example This is an example of commands and output.
BoilerGrid is a large, high-throughput, distributed computing system operated by the Rosen Center for Advanced Computing (RCAC) and using the Condor system developed by the Condor Project at the University of Wisconsin. BoilerGrid provides a way for you to run programs on large numbers of otherwise idle computers in various locations, including any temporarily under-utilized high-performance cluster resources as well as any computer lab desktop machines not currently in use. Whenever a local user or scheduled job needs a machine back, Condor stops its job and sends it to another Condor node as soon as possible. Because this model limits the ability to do parallel processing and communications, BoilerGrid is only appropriate for relatively quick serial jobs.
If you have a desktop computer on the Purdue West Lafayette campus, please consider donating your desktop's idle time to BoilerGrid! The process is easy and allows other Purdue researchers to use otherwise wasted cycles when your computer is doing nothing. More information on joining BoilerGrid is available on the Join BoilerGrid page.
BoilerGrid scavenges cycles from nearly all RCAC systems, including all the RCAC-maintained clusters and specialized systems. BoilerGrid also uses idle time of machines in student labs on the Purdue West Lafayette campus. Through the larger consortium DiaGrid, BoilerGrid may also send jobs to machines at other institutions, including the University of Wisconsin, the University of Louisville, Indiana University, the University of Notre Dame, Indiana State University, the Purdue Calumet and North Central campuses, and the Indiana University – Purdue University Fort Wayne campus. Whenever the primary scheduling system on any of these machines needs a compute node back or a user sits down and starts to use a desktop computer, Condor will stop its job and, if possible, checkpoint its work. Condor then immediately tries to restart this job on some other available compute node in BoilerGrid.
A recent snapshot of BoilerGrid found 36,524 total processor cores. Of these, there were 29,111 Linux/x86_64, 98 Linux/Intel (ia32), 385 WinNT51/Intel, and 6,925 WinNT61/Intel. There are also small numbers of Itanium Linux, Solaris, and Intel OSX nodes. Memory on compute nodes ranges from 512 MB to 192 GB, and most processors run at 2 GHz or faster. With a total of over 60 TFLOPS available, BoilerGrid can provide large numbers of cycles in a short amount of time. Condor offers high-throughput computing and is excellent for parameter sweeps, Monte Carlo simulations, or nearly any serial application.
| Owner | Arch/OS | Processor Cores |
|---|---|---|
| ITaP - RCAC | x86_64/Linux | 30,717 |
| ITaP - RCAC | Intel/Linux | 29 |
| ITaP - Envision Center | Intel/Linux | 48 |
| ITaP - Teaching & Learning | Intel/WinNTXX | ~9,300 |
| Purdue Calumet | x86_64/Linux | 998 |
| Notre Dame CSE | Intel/Linux, Intel/OSX, Sun4u/Solaris210, x86_64/Linux | 1,213 |
| Purdue Biology, Libraries & some ITaP | Intel/Linux, Intel/WinNT51 | 187 |
BoilerGrid currently uses Condor 7.4.1. You can check on the overall status of BoilerGrid using CondorView.
All Purdue faculty and staff, as well as students who have the approval of their advisor, may request access to BoilerGrid. However, if you have an account on Radon or any of the RCAC Community Clusters (Hansen, Rossmann, Coates, Steele, or Miner), then you already have access to BoilerGrid. Refer to the RCAC Accounts / Access page for more details on how to request access.
To submit jobs on BoilerGrid, log in to the submission host condor.rcac.purdue.edu via SSH. This submission host is actually three front-end hosts: condor-fe00, condor-fe01, and condor-fe02. The login process randomly assigns one of these three front-ends to each login to condor.rcac.purdue.edu. While the three front-end hosts are identical, each has its own Condor queue. When you submit jobs to the Condor queue from the front-end named condor-fe00, you will not see those jobs on the Condor queue while logged in to either condor-fe01 or condor-fe02. To ensure that you always see the same Condor queue, log in to the same front-end.
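Because each front-end keeps its own Condor queue, it can help to note which front-end you were on when you submitted. The following is only a minimal sketch, not an RCAC-provided tool; the function name `note_submit_host` and the default log file `~/.condor_submit_hosts` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: remember which front-end a submission came from,
# so you know where to log in to see that queue again.
# Usage: note_submit_host LABEL [LOGFILE]
note_submit_host() {
    label=$1
    logfile=${2:-$HOME/.condor_submit_hosts}
    # One line per submission: timestamp, host name, and your own label.
    printf '%s %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$(uname -n)" "$label" >> "$logfile"
}
```

Calling `note_submit_host my_sweep` right after `condor_submit` leaves a record of the host name. If that record names condor-fe00, for example, the jobs will appear in `condor_q` only while you are logged in to condor-fe00.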
Each front-end host has its own /tmp. Sharing data in /tmp during subsequent sessions may fail. RCAC advises using scratch storage for multisession, shared data instead.
You may also submit jobs to BoilerGrid from Radon or any of the RCAC Community Clusters (Hansen, Rossmann, Coates, Steele, or Miner). These clusters also have multiple front-end hosts.
All access to the RCAC systems must be through secure (encrypted) connections. RCAC systems do not support telnet and FTP. Use SSH, SCP, and SFTP instead.
Secure Shell or SSH is a way of establishing a secure channel between a local and a remote computer. It uses public-key cryptography to authenticate the remote computer and (optionally) to allow the remote computer to authenticate the user. Its usual function involves logging in to a remote machine and executing commands similar to telnet, but it also supports tunneling and forwarding of X11 or arbitrary TCP connections. The associated SFTP and SCP protocols can transfer files. There are many SSH clients available, depending on the operating system you use.
Linux / Solaris / AIX / HP-UX / Unix:
Microsoft Windows:
Mac OS X:
SSH works with many different means of authentication. One popular authentication method is Public Key Authentication (PKA). PKA is a method of establishing your identity to a remote computer using related sets of encryption data called keys. PKA is a more secure alternative to traditional password-based authentication with which you are probably familiar.
To employ PKA via SSH, you manually generate a keypair (also called SSH keys) in the location from where you wish to initiate a connection to a remote machine. This keypair consists of two text files: private key and public key. You keep the private key file confidential on your local machine or local home directory (hence the name "private" key). You then log in to a remote machine (if possible) and append the corresponding public key text to the end of a specific file, or have a system administrator do so on your behalf. In future login attempts, PKA compares the public and private keys to verify your identity; only then do you have access to the remote machine.
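The "append the public key" step can be sketched as a small shell function. This is an illustration under stated assumptions rather than an RCAC procedure: it assumes the remote machine runs OpenSSH, whose default key file is `~/.ssh/authorized_keys`, and the function name `install_public_key` is hypothetical.

```shell
#!/bin/sh
# Sketch: append a public key to an authorized-keys file exactly once.
# Assumes OpenSSH conventions (~/.ssh/authorized_keys by default).
# Usage: install_public_key PUBKEY_FILE [AUTHORIZED_KEYS_FILE]
install_public_key() {
    pubkey_file=$1
    auth_file=${2:-$HOME/.ssh/authorized_keys}
    auth_dir=$(dirname "$auth_file")

    # With default settings, OpenSSH may refuse key files that other
    # users can write to, so keep the permissions tight.
    mkdir -p "$auth_dir" && chmod 700 "$auth_dir"
    touch "$auth_file" && chmod 600 "$auth_file"

    # Append only if this exact key line is not already present.
    if ! grep -qxF -f "$pubkey_file" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
}
```

On the local machine, a keypair would typically be created first with `ssh-keygen -t rsa`, which prompts for the passphrase. The resulting `.pub` file is the one to copy to the remote machine and pass to a function like this; the private key file never leaves the local machine.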
As a user, you can create, maintain, and employ as many keypairs as you wish. If you connect to a computational resource from your work laptop, your work desktop, and your home desktop, you can create and employ keypairs on each. You can also create multiple keypairs on a single local machine to serve different purposes, such as establishing access to different remote machines or establishing different types of access to a single remote machine. In short, PKA via SSH offers a secure but flexible means of identifying yourself to all kinds of computational resources.
Creating a keypair prompts you to provide a passphrase for the private key. This passphrase differs from a password in two ways. First, a passphrase is, as the name implies, a phrase. It can include most types of characters, including spaces, and has no limit on length. Second, the remote machine never receives this passphrase for verification. Its only purpose is to unlock your local private key, and each passphrase applies to one particular private key.
Perhaps you are wondering why you would need a private key passphrase at all when using PKA. If the private key remains secure, why the need for a passphrase just to use it? Indeed, if the location of your private keys were always completely secure, a passphrase might not be necessary. In reality, a number of situations could arise in which someone may improperly gain access to your private key files. In these situations, a passphrase offers another level of security for you, the user who created the keypair.
Think of the private key/passphrase combination as being analogous to your ATM card/PIN combination. The ATM card itself is the object that grants access to your important accounts, and as such, should remain secure at all times—just as a private key should. But if you ever lose your wallet or someone steals your ATM card, you are glad that your PIN exists to offer another level of protection. The same is true for a private key passphrase.
When you create a keypair, you should always provide a corresponding private key passphrase. For security purposes, avoid using phrases which automated programs can discover (e.g. phrases that consist solely of words in English-language dictionaries). This passphrase is not recoverable if forgotten, so make note of it. Only a few situations warrant using a non-passphrase-protected private key—conducting automated file backups is one such situation. If you need to use a non-passphrase-protected private key to conduct automated backups to Fortress, see the No-Passphrase SSH Keys section.
SSH supports tunneling of X11 (X-Windows), so you may run X11 applications on the machine you are using to issue jobs to BoilerGrid. However, running an X11 application via Condor is not possible.
If you have received a default password as part of the process of obtaining your account, you should change it immediately when you log on for the first time. Change your password from any terminal/SSH session with the command passwd. You will have the same password on all RCAC systems. If you change your password on any one RCAC system, it will change on all RCAC systems.
If you already have a Purdue career account, then you will initially receive the same username and password as your career account. Receiving an account on RCAC systems does not require you to change your career account password.
There is not currently any requirement regarding how often you must change your password within RCAC, but for security reasons changing a password every six months, preferably every three months, is good practice.
A password should employ all of the following features:
Never share your password with another user or make your password known to anyone else. Systems staff will NEVER ask for your password, by email or otherwise.
There is no local email delivery available on BoilerGrid. BoilerGrid forwards all email which it receives to mail.rcac.purdue.edu for delivery.
Your shell is the program that generates your command-line prompt and processes commands. On RCAC systems, several common shell choices are available:
| Name | Description | Path |
|---|---|---|
| bash | A Bourne-shell (sh) compatible shell with many newer advanced features as well. Bash is one of the most common shells in use today. | /bin/bash |
| tcsh | An advanced variant on csh with all the features of modern shells. Tcsh is probably the second most popular shell in use today. | /bin/tcsh |
| zsh | An advanced shell which incorporates all the functionality of bash, tcsh, and ksh combined, usually with identical syntax. In spite of this, zsh is not in common use. | /bin/zsh |
| csh | The original C-style shell. Because tcsh offers all the functionality of csh and more, use csh only when you have specific csh-only scripts. | /bin/csh |
| ksh | Korn shell, which was an early Bourne-shell compatible shell with some additional features. Unless you are already an adept ksh user, you would probably prefer bash. | /bin/ksh |
To find out what shell you are running right now, simply use the ps command:
    $ ps
      PID TTY          TIME CMD
    30181 pts/27   00:00:00 bash
    30273 pts/27   00:00:00 ps
To use a different shell on a one-time or trial basis, simply type the shell name as a command. To return to your original shell, type exit:
    $ ps
      PID TTY          TIME CMD
    30181 pts/27   00:00:00 bash
    30273 pts/27   00:00:00 ps
    $ tcsh
    % ps
      PID TTY          TIME CMD
    30181 pts/27   00:00:00 bash
    30313 pts/27   00:00:00 tcsh
    30315 pts/27   00:00:00 ps
    % exit
    $
To permanently change your default login shell, use the command chsh:
    $ chsh
    Changing login shell for myusername on *all* ACMAINT hosts.
    Enter existing password: **********
    Old shell: nologin
    New shell [nologin]: /bin/tcsh
    Changed 'loginShell' to '/bin/tcsh' for login 'myusername' on host(s) 'host123.rcac.purdue.edu host234.rcac.purdue.edu ...'.
    Connection to data.rcac.purdue.edu closed.
There is a propagation delay which may last up to two hours. After the change has taken effect, your next login will start in your new shell. Moreover, you may change your shell again at any time by rerunning chsh.
File storage options on RCAC systems include long-term storage (home directories, Fortress) and short-term storage (scratch directories, /tmp directory). Each option has different performance and intended uses, and some options vary from system to system as well. RCAC backs up home directories nightly. RCAC does not back up short-term storage and may occasionally purge files from scratch and /tmp directories without warning. More details about each storage option appear below.
RCAC provides home directories for long-term file storage. Each user ID has one home directory. You should use your home directory for storing important program files, scripts, input data sets, critical results, and frequently used files. You should store infrequently used files on Fortress. Your home directory becomes your current working directory, by default, when you log in.
RCAC backs up your home directory nightly. For additional security, you should store another copy of your home directory on more permanent storage.
Your home directory will physically reside on the BlueArc NFS Server. To find the path to your home directory, first log in then immediately enter the following:
    $ pwd
    /autohome/u103/myusername
Or from any subdirectory:
    $ echo $HOME
    /home/ba01/u103/myusername
The replies indicate the name of the host where your home directory physically resides. In this example, the home directory is on the RCAC home directory file server named "ba01" under area "u103". This will vary from person to person.
Regardless of its physical location, your home directory and its contents are available on almost all RCAC front-end hosts and compute nodes via the Network File System (NFS). The only exception is Black.
Your home directory has a quota capping the size and/or number of files you may store within. For more information, refer to the Storage Quotas / Limits Section.
Only files which RCAC has backed up overnight are recoverable. If you lose a file the same day you created it, it is NOT recoverable.
To recover files lost from your home directory, use the flost command:
$ flost
This will ask you some questions about when you lost your file. If you lost it recently, flost will direct you to a place where you can recover your file yourself immediately. If you lost the file some time ago, flost will help you note all the necessary information for RCAC staff to restore your file from tape backups.
RCAC provides scratch directories for short-term file storage only. Each file system domain has at least one scratch directory. Each user ID may access one scratch directory in a file system domain. The quota of your scratch directory is several times greater than the quota of your home directory. You should use your scratch directory for storing large temporary input files which your job reads or for writing large temporary output files which you may examine after execution of your job. You should use your home directory and Fortress for longer-term storage or for holding critical results.
Users of all RCAC's major clusters have access to a scratch directory.
RCAC does not perform backups for scratch directories. In the event of a disk crash or file purge, files in scratch directories are not recoverable. You should copy any important files to more permanent storage.
RCAC automatically removes (purges) from RCAC scratch directories all files stored for more than 90 days. Owners of these files receive a notice one week before removal via email. For more information, please refer to RCAC's Scratch File Purging Policy.
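To see which scratch files are getting close to the 90-day limit before the notice arrives, something like the following sketch may help. It assumes that file modification time approximates the age the purge policy uses, which the text above does not specify; the function name `list_old_files` is hypothetical.

```shell
#!/bin/sh
# Sketch: list files under a directory that have not been modified recently.
# Assumption: modification time approximates the age the purge policy uses.
# Usage: list_old_files [DIR] [DAYS]
list_old_files() {
    dir=${1:-$RCAC_SCRATCH}
    days=${2:-83}    # 83 = the 90-day limit minus the one-week notice
    # find counts whole 24-hour periods, so "+83" matches files whose
    # last modification was at least 84 full days ago.
    find "$dir" -type f -mtime +"$days" -print
}
```

Running `list_old_files` with no arguments checks `$RCAC_SCRATCH`; any file it prints is a candidate for copying to your home directory or Fortress.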
To find the path to your scratch directory:
$ findscratch
The response from command findscratch depends on your submission host. You may see one of the following paths:
    /scratch/scratch95/m/myusername
    /scratch/scratch96/m/myusername
    /scratch/lustreA/m/myusername
    /scratch/miner/m/myusername
The value of variable $RCAC_SCRATCH is the path of your scratch directory. Use this variable in any scripts. Your actual scratch directory path may change without warning, but this variable will remain current.
$ echo $RCAC_SCRATCH
The response will be one of the previously listed paths.
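As an illustration of using the variable rather than a hard-coded path, the sketch below creates a per-job working directory under the scratch path. The function name `make_job_dir` and the fallback to `/tmp` are assumptions made so the sketch also runs where `RCAC_SCRATCH` is not set.

```shell
#!/bin/sh
# Sketch: create a per-job working directory under the scratch path.
# Falls back to /tmp only so the sketch also runs where RCAC_SCRATCH is unset.
# Usage: make_job_dir JOB_NAME   (prints the directory path)
make_job_dir() {
    base=${RCAC_SCRATCH:-/tmp}
    job_dir=$base/$1
    mkdir -p "$job_dir" && printf '%s\n' "$job_dir"
}
```

A script could then use `cd "$(make_job_dir run42)"` and would keep working even if the actual scratch path changes.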
Your scratch directory on one RCAC computational resource may be shared with some other RCAC computational resources and be distinct from the scratch directories of others. All submission hosts on all computational resources are able to access the scratch directories of all other computational resources. However, compute nodes are only able to access the scratch directory allocated to that specific computational resource. RCAC may change which computational resources share scratch storage with other computational resources as needs dictate. For more information about which computational resources share scratch volumes, please see the Network Storage Resource Page.
All BoilerGrid jobs submitted from a submission host of an RCAC computational resource will have their Condor filesystem domain set such that these jobs will stay on RCAC compute nodes which have access to the scratch directory of the submission host unless you specify file transfer (which would eliminate any need for this). This will ensure that non-file-transfer jobs will always run on nodes which can access the scratch directory you had where you submitted the jobs. If you have no need of this scratch directory and want these jobs to run on systems which do not have access to it, you will need to explicitly set the file system domain of your jobs.
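The file-transfer option mentioned above might look like the following sketch, which writes a Condor submit description file from a shell function. The submit commands are standard Condor ones, but the function name `write_submit_file` and the file names `job.sub`, `job.out`, `job.err`, and `job.log` are hypothetical. The sketch does not show how to set the file system domain explicitly.

```shell
#!/bin/sh
# Sketch: write a Condor submit description file that uses file transfer,
# so the job does not depend on a shared scratch directory.
# Usage: write_submit_file EXECUTABLE INPUT_FILE [SUBMIT_FILE]
write_submit_file() {
    exe=$1
    input=$2
    out=${3:-job.sub}
    cat > "$out" <<EOF
universe                = vanilla
executable              = $exe
transfer_input_files    = $input
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = job.out
error                   = job.err
log                     = job.log
queue
EOF
}
```

After `write_submit_file ./myprogram input.dat`, the job would be submitted with `condor_submit job.sub`. Condor then copies the executable and input file to the execute machine and returns the output files when the job exits.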
To find the path to someone else's RCAC scratch directory:
    $ findscratch someusername
    /scratch/scratch95/s/someusername
Your RCAC scratch directory has a quota capping the size and number of files you may store in it. For more information, refer to the Storage Quotas / Limits Section.
RCAC provides /tmp directories for short-term file storage only. Each front-end and compute node has a /tmp directory. Your program may write temporary data to the /tmp directory of the compute node on which it is running. That data is available for as long as your program is active. Once your program terminates, that temporary data is no longer available. When used properly, /tmp may provide faster local storage to an active process than any other storage option. You should use your home directory and Fortress for longer-term storage or for holding critical results.
RCAC does not perform backups for the /tmp directory and removes files from /tmp whenever space is low or whenever the system needs a reboot. In the event of a disk crash or file purge, files in /tmp are not recoverable. You should copy any important files to more permanent storage.
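One way to combine fast local /tmp storage with the advice to copy important files elsewhere is sketched below. The function name `run_in_tmp` and the `job.XXXXXX` directory template are hypothetical; this is an illustration, not an RCAC-supplied wrapper.

```shell
#!/bin/sh
# Sketch: do work in node-local /tmp, then copy results somewhere permanent.
# Usage: run_in_tmp DEST_DIR COMMAND [ARGS...]
run_in_tmp() {
    dest=$1
    shift
    work=$(mktemp -d "${TMPDIR:-/tmp}/job.XXXXXX") || return 1
    # Run the command inside the temporary directory, in a subshell so
    # the caller's working directory is unchanged.
    ( cd "$work" && "$@" )
    status=$?
    # Save whatever the command wrote, then clean up the temporary directory.
    mkdir -p "$dest" && cp -R "$work"/. "$dest"/
    rm -rf "$work"
    return $status
}
```

For example, `run_in_tmp "$HOME/results" ./myprogram` runs the hypothetical `myprogram` with /tmp as its working directory and leaves its output files in `$HOME/results`.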
Long-term Storage or Permanent Storage is available to RCAC users on the High Performance Storage System (HPSS), an archival storage system, commonly referred to as "Fortress". HPSS is a software package that manages a hierarchical storage system. Program files, data files and any other files which are not used often, but which must be saved, can be put in permanent storage. Fortress currently has a 1.2 PB capacity.
Files smaller than 100 MB have their primary copy stored on low-cost disks (disk cache), but the second copy (backup of disk cache) is on tape or optical disks. This provides a rapid restore time to the disk cache. However, the large latency to access a larger file (usually involving a copy from a tape cartridge) makes it unsuitable for direct use by any processes or jobs, even where possible. The primary and secondary copies of larger files are stored on separate tape cartridges in the Quantum (ADIC, Advanced Digital Information Corporation) tape library.
To ensure optimal performance for all users, and to keep the Fortress system healthy, please remember the following tips:
Fortress writes two copies of every file, either to two tapes or to disk and a tape, to protect against medium errors. Unfortunately, Fortress does not automatically switch to the alternate copy when it has trouble accessing the primary. If it seems to be taking an extraordinary amount of time to retrieve a file (hours), please either email rcac-help@purdue.edu or call ITaP Customer Service at 765-494-4000. We can then investigate why it is taking so long. If it is an error on the primary copy, we will instruct Fortress to switch to the alternate copy as the primary and recreate a new alternate copy.
For more information about Fortress, how it works, user guides, and how to obtain an account:
There are a variety of ways to manually transfer files to your Fortress home directory for long-term storage.
HSI, the Hierarchical Storage Interface, is the preferred method of transferring files to and from Fortress. HSI is designed to be a friendly interface for users of the High Performance Storage System (HPSS). It provides a familiar Unix-style environment for working within HPSS while automatically taking advantage of high-speed, parallel file transfers without requiring any special user knowledge.
HSI is already provided on all RCAC systems as the command hsi. You may download HSI for the following platforms as well:
Any machines using HSI or HTAR must have all firewalls (local and departmental) configured to allow open access from the following IP addresses:
If you are unsure of how to modify your firewall settings, please consult with your department's IT support or the documentation for your operating system. Access to Fortress is restricted to on-campus networks. If you need to directly access Fortress from off-campus, please use the Purdue VPN service before connecting.
Interactive usage:
    $ hsi
    *************************************************************************
    * Purdue University
    * High Performance Storage System (HPSS)
    *************************************************************************
    * This is the Purdue Data Archive, Fortress. For further information
    * see http://www.rcac.purdue.edu/userinfo/resources/fortress/
    *
    * If you are having problems with HPSS, please call IT/Operational
    * Services at 49-44000 or send E-mail to dxul-help@purdue.edu.
    *
    *************************************************************************
    Username: myusername UID: 12345 Acct: 12345(12345) Copies: 1 Firewall: off [hsi.3.5.8 Wed Sep 21 17:31:14 EDT 2011]
    [Fortress HSI]/home/myusername->put data1.fits
    put 'data1.fits' : '/home/myusername/data1.fits' ( 1024000000 bytes, 250138.1 KBS (cos=11))
    [Fortress HSI]/home/myusername->lcd /tmp
    [Fortress HSI]/home/myusername->get data1.fits
    get '/tmp/data1.fits' : '/home/myusername/data1.fits' (2011/10/04 16:28:50 1024000000 bytes, 325844.9 KBS )
    [Fortress HSI]/home/myusername->quit
Batch transfer file:
    put data1.fits
    put data2.fits
    put data3.fits
    put data4.fits
    put data5.fits
    put data6.fits
    put data7.fits
    put data8.fits
    put data9.fits
Batch usage:
    $ hsi < my_batch_transfer_file
    *************************************************************************
    * Purdue University
    * High Performance Storage System (HPSS)
    *************************************************************************
    * This is the Purdue Data Archive, Fortress. For further information
    * see http://www.rcac.purdue.edu/userinfo/resources/fortress/
    *
    * If you are having problems with HPSS, please call IT/Operational
    * Services at 49-44000 or send E-mail to dxul-help@purdue.edu.
    *
    *************************************************************************
    Username: myusername UID: 12345 Acct: 12345(12345) Copies: 1 Firewall: off [hsi.3.5.8 Wed Sep 21 17:31:14 EDT 2011]
    put 'data1.fits' : '/home/myusername/data1.fits' ( 1024000000 bytes, 250200.7 KBS (cos=11))
    put 'data2.fits' : '/home/myusername/data2.fits' ( 1024000000 bytes, 258893.4 KBS (cos=11))
    put 'data3.fits' : '/home/myusername/data3.fits' ( 1024000000 bytes, 222819.7 KBS (cos=11))
    put 'data4.fits' : '/home/myusername/data4.fits' ( 1024000000 bytes, 224311.9 KBS (cos=11))
    put 'data5.fits' : '/home/myusername/data5.fits' ( 1024000000 bytes, 323707.3 KBS (cos=11))
    put 'data6.fits' : '/home/myusername/data6.fits' ( 1024000000 bytes, 320322.9 KBS (cos=11))
    put 'data7.fits' : '/home/myusername/data7.fits' ( 1024000000 bytes, 253192.6 KBS (cos=11))
    put 'data8.fits' : '/home/myusername/data8.fits' ( 1024000000 bytes, 253056.2 KBS (cos=11))
    put 'data9.fits' : '/home/myusername/data9.fits' ( 1024000000 bytes, 323218.9 KBS (cos=11))
    EOF detected on TTY - ending HSI session
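A batch transfer file like the one above can be generated rather than typed by hand. The following is a minimal sketch; the function name `make_hsi_batch` is hypothetical, and it assumes file names without spaces or other characters that HSI would need quoted.

```shell
#!/bin/sh
# Sketch: write one "put" line per file into an HSI batch transfer file.
# Assumes file names contain no spaces or shell-special characters.
# Usage: make_hsi_batch BATCH_FILE FILE...
make_hsi_batch() {
    batch=$1
    shift
    : > "$batch"                      # start with an empty batch file
    for f in "$@"; do
        [ -f "$f" ] || continue       # skip names that are not regular files
        printf 'put %s\n' "$f" >> "$batch"
    done
}
```

Running `make_hsi_batch my_batch_transfer_file data*.fits` produces the batch file, which can then be passed to HSI with `hsi < my_batch_transfer_file` as shown above.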
For more information about HSI:
HTAR (short for "HPSS TAR") is a utility program that writes TAR-compatible archive files directly onto Fortress, without having to first create a local file. Its command line was originally based on the AIX tar program, with a number of extensions added to provide extra features.
HTAR is already provided on all RCAC systems as the command htar. You may download HTAR for the following platforms as well:
Any machines using HSI or HTAR must have all firewalls (local and departmental) configured to allow open access from the following IP addresses:
If you are unsure of how to modify your firewall settings, please consult with your department's IT support or the documentation for your operating system. Access to Fortress is restricted to on-campus networks. If you need to directly access Fortress from off-campus, please use the Purdue VPN service before connecting.
Usage:
    (Create a tar archive on Fortress named data.tar including all files with the extension ".fits".)
    $ htar -cvf data.tar *.fits
    HTAR: a data1.fits
    HTAR: a data2.fits
    HTAR: a data3.fits
    HTAR: a data4.fits
    HTAR: a data5.fits
    HTAR: a data6.fits
    HTAR: a data7.fits
    HTAR: a data8.fits
    HTAR: a data9.fits
    HTAR: a /tmp/HTAR_CF_CHK_17953_1317760775
    HTAR Create complete for data.tar. 9,216,006,144 bytes written for 9 member files, max threads: 3 Transfer time: 29.622 seconds (311.121 MB/s)
    HTAR: HTAR SUCCESSFUL

    (Unpack a tar archive on Fortress named data.tar into a scratch directory for use in a batch job.)
    $ cd $RCAC_SCRATCH/job_dir
    $ htar -xvf data.tar
    HTAR: x data1.fits, 1024000000 bytes, 2000001 media blocks
    HTAR: x data2.fits, 1024000000 bytes, 2000001 media blocks
    HTAR: x data3.fits, 1024000000 bytes, 2000001 media blocks
    HTAR: x data4.fits, 1024000000 bytes, 2000001 media blocks
    HTAR: x data5.fits, 1024000000 bytes, 2000001 media blocks
    HTAR: x data6.fits, 1024000000 bytes, 2000001 media blocks
    HTAR: x data7.fits, 1024000000 bytes, 2000001 media blocks
    HTAR: x data8.fits, 1024000000 bytes, 2000001 media blocks
    HTAR: x data9.fits, 1024000000 bytes, 2000001 media blocks
    HTAR: Extract complete for data.tar, 9 files. total bytes read: 9,216,004,608 in 33.914 seconds (271.749 MB/s )
    HTAR: HTAR SUCCESSFUL

    (Look at the contents of the data.tar HTAR archive on Fortress.)
    $ htar -tvf data.tar
    HTAR: -rw-r--r-- myusername/pucc 1024000000 2011-10-04 16:30 data1.fits
    HTAR: -rw-r--r-- myusername/pucc 1024000000 2011-10-04 16:35 data2.fits
    HTAR: -rw-r--r-- myusername/pucc 1024000000 2011-10-04 16:35 data3.fits
    HTAR: -rw-r--r-- myusername/pucc 1024000000 2011-10-04 16:35 data4.fits
    HTAR: -rw-r--r-- myusername/pucc 1024000000 2011-10-04 16:35 data5.fits
    HTAR: -rw-r--r-- myusername/pucc 1024000000 2011-10-04 16:35 data6.fits
    HTAR: -rw-r--r-- myusername/pucc 1024000000 2011-10-04 16:35 data7.fits
    HTAR: -rw-r--r-- myusername/pucc 1024000000 2011-10-04 16:35 data8.fits
    HTAR: -rw-r--r-- myusername/pucc 1024000000 2011-10-04 16:35 data9.fits
    HTAR: -rw------- myusername/pucc 256 2011-10-04 16:39 /tmp/HTAR_CF_CHK_17953_1317760775
    HTAR: Listing complete for data.tar, 10 files 10 total objects
    HTAR: HTAR SUCCESSFUL

    (Unpack a single file, "data7.fits", from the tar archive on Fortress named data.tar into a scratch directory.)
    $ htar -xvf data.tar data7.fits
    HTAR: x data7.fits, 1024000000 bytes, 2000001 media blocks
    HTAR: Extract complete for data.tar, 1 files. total bytes read: 1,024,000,512 in 3.642 seconds (281.166 MB/s )
    HTAR: HTAR SUCCESSFUL
For more information about HTAR:
Fortress does NOT support SCP.
Fortress does NOT support SFTP.
If you are using an RCAC cluster front-end system, your Fortress home directory is available as /archive/fortress/home/myusername. While your Fortress home directory can be accessed via NFS in this way, this is only provided as a convenience and should not be used on a regular basis as it is extremely slow. Instead, use the HSI command to get a fast, parallelized, UNIX-like interface to your Fortress home directory.
There are many environment variables related to storage locations and paths. Logging in automatically sets these environment variables. You may change the variables at any time.
Use environment variables instead of actual paths whenever possible to avoid problems if the specific paths to any of these change. Some of the environment variables you should have are:
| Name | Description |
|---|---|
| USER | your username |
| HOME | path to your home directory |
| PWD | path to your current directory |
| RCAC_SCRATCH | path to scratch filesystem |
| PATH | all directories searched for commands/applications |
| HOSTNAME | name of the machine you are on |
| SHELL | your current shell (bash, tcsh, csh, ksh) |
| SSH_CLIENT | your local client's IP address |
| TERM | type of terminal or terminal emulator being used |
By convention, environment variable names are all uppercase. Use them on the command line or in any scripts in place of and in combination with hard-coded values:
    $ ls $HOME
    ...
    $ ls $RCAC_SCRATCH/myproject
    ...
To find the value of any environment variable:
    $ echo $RCAC_SCRATCH
    /scratch/scratch95/m/myusername
    $ echo $SHELL
    /bin/tcsh
To list the values of all environment variables:
    $ env
    USER=myusername
    HOME=/home/ba01/u101/myusername
    RCAC_SCRATCH=/scratch/scratch95/m/myusername
    SHELL=/bin/tcsh
    ...
You may create or overwrite an environment variable. To pass (export) the value of a variable in either bash or ksh:
$ export VARIABLE=value
To assign a value to an environment variable in either tcsh or csh:
% setenv VARIABLE value
RCAC limits your disk usage on RCAC systems. Each filesystem (home directory, scratch directory, etc.) may have a different limit. RCAC does not implement soft limits (soft quotas); there is only a hard limit, and any write that would exceed it fails. If that happens, either remove other files or ask RCAC about increasing your limit.
To discover the current quotas of your home and scratch directories:
    $ myquota
    Type      Filesystem             Size     Limit   Use     Files     Limit   Use
    ==============================================================================
    home      u105                  4.5GB     9.5GB   47%    10,258    65,535   15%
    scratch   /scratch/scratch95/     8KB   476.8GB    0%         2   100,000    0%
The columns are as follows:
If you find that you reached your quota in either your home directory or your scratch file directory, obtain estimates of your disk usage. Find the top-level directories which have a high disk usage, then study the subdirectories to discover where the heaviest usage lies.
To see in a human-readable format an estimate of the disk usage of your top-level directories in your home directory:
    $ du -h --max-depth=1 $HOME
    32K     /home/ba01/u105/myusername/mysubdirectory_1
    529M    /home/ba01/u105/myusername/mysubdirectory_2
    608K    /home/ba01/u105/myusername/mysubdirectory_3
The second directory is the largest of the three, so apply command du to it.
To see in a human-readable format an estimate of the disk usage of your top-level directories in your scratch file directory:
    $ du -h --max-depth=1 $RCAC_SCRATCH
    160K    /scratch/scratch95/m/myusername
This strategy can be very helpful in figuring out the location of your largest usage. Move unneeded files and directories to alternate long-term storage to free space in your home and scratch directories.
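The same search can be made a little quicker by sorting the `du` output so the heaviest directories come first. This is a small sketch using the GNU `du` option shown above; the function name `largest_subdirs` is hypothetical.

```shell
#!/bin/sh
# Sketch: rank a directory and its immediate subdirectories by disk usage.
# Sizes are reported in kilobytes so that sort can compare them numerically.
# Usage: largest_subdirs [DIR] [COUNT]
largest_subdirs() {
    dir=${1:-$HOME}
    count=${2:-10}
    du -k --max-depth=1 "$dir" | sort -rn | head -n "$count"
}
```

The first line of the output is the total for the directory itself, followed by its subdirectories from largest to smallest. Repeating the call on the largest subdirectory narrows down where the usage lies.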
If you find you need additional disk space on an RCAC account, please first consider archiving and compressing old files and moving them to long-term storage. If this option does not resolve the issue, you may send an email to rcac-help@purdue.edu and request additional space.
There are several options for archiving and compressing groups of files or directories on RCAC systems. RCAC provides the following tools:
(compress file somefile.c)
$ zip somefile.zip somefile.c
(extract contents of somefile.zip)
$ unzip somefile.zip
(compress all files in a directory into one archive file)
$ zip -r somefile.zip somedirectory/
(compress all ".c" files in current directory into one archive file)
$ zip -r somefile.zip . -i \*.c

(archive file somefile.c)
$ tar cvf somefile.tar somefile.c
(archive and compress file somefile.c)
$ tar czvf somefile.tar.gz somefile.c
(list contents of archive somefile.tar)
$ tar tvf somefile.tar
(extract contents of somefile.tar)
$ tar xvf somefile.tar
(extract contents of gzipped archive somefile.tar.gz)
$ tar xzvf somefile.tar.gz
(archive and compress all files in a directory into one archive file)
$ tar czvf somefile.tar.gz somedirectory/
(archive and compress all ".c" files in current directory into one archive file)
$ tar czvf somefile.tar.gz *.c

(compress file somefile - also removes uncompressed file)
$ gzip somefile
(uncompress file somefile.gz - also removes compressed file)
$ gunzip somefile.gz

(compress file somefile - also removes uncompressed file)
$ bzip2 somefile
(uncompress file somefile.bz2 - also removes compressed file)
$ bunzip2 somefile.bz2
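Before deleting originals to free quota, it can be worth confirming that an archive is readable and complete. The sketch below is one way to do that with the tar commands shown above; it works in a temporary directory, and the file contents are placeholders.

```shell
# Work in a throwaway directory so nothing in your account is touched.
work=$(mktemp -d)
cd "$work" || exit 1
mkdir somedirectory
echo "int main(void) { return 0; }" > somedirectory/somefile.c

# Archive and compress the directory, then list the archive to verify it.
tar czf somefile.tar.gz somedirectory/
listing=$(tar tzf somefile.tar.gz)
echo "$listing"

# Extract into a separate directory and compare with the original.
mkdir restored
tar xzf somefile.tar.gz -C restored
if cmp -s somedirectory/somefile.c restored/somedirectory/somefile.c; then
    status="match"
else
    status="differ"
fi
echo "restored copy: $status"

cd / && rm -rf "$work"
```

Only once the listing and the comparison look right would you remove the uncompressed originals.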
Windows users can work with these same formats using some of the following software:
There are a variety of ways to transfer data to and from RCAC systems. Which you should use depends on several factors, including the ease of use for you personally, connection speed and bandwidth, and the size and number of files which you intend to transfer.
FTP (File Transfer Protocol) is a simple data transfer mechanism. FTP does not provide secure communications, so RCAC no longer supports FTP on any RCAC systems. However, most modern FTP clients support either SFTP or SCP, which are similar, secure protocols for file transfer. Try using one of the other methods described here instead of FTP.
SCP (Secure CoPy) is a simple way of transferring files between two machines that use the SSH (Secure SHell) protocol. You may use SCP to connect to any system where you have SSH (login) access. SCP is available as a protocol choice in some graphical file transfer programs and also as a command line program on most Linux, Unix, and Mac OS X systems. SCP can copy single files, but will also recursively copy directory contents if given a directory name.
Command-line usage:
(to a remote system from local)
$ scp sourcefilename myusername@hostname:somedirectory/destinationfilename
(from a remote system to local)
$ scp myusername@hostname:somedirectory/sourcefilename destinationfilename
(recursive directory copy to a remote system from local)
$ scp -r sourcedirectory/ myusername@hostname:somedirectory/
Linux / Solaris / AIX / HP-UX / Unix:
Microsoft Windows:
Mac OS X:
SFTP (Secure File Transfer Protocol) is a reliable way of transferring files between two machines. You may use SFTP to connect to most RCAC systems. SFTP is available as a protocol choice in some graphical file transfer programs and also as a command-line program on most Linux, Unix, and Mac OS X systems. SFTP has more features than SCP and allows for other operations on remote files, remote directory listing, and resuming interrupted transfers. Command-line SFTP cannot recursively copy directory contents; to do so, use SCP or a graphical SFTP client.
Command-line usage:
$ sftp -B buffersize myusername@hostname
(to a remote system from local)
sftp> put sourcefile somedir/destinationfile
sftp> put -P sourcefile somedir/
(from a remote system to local)
sftp> get sourcefile somedir/destinationfile
sftp> get -P sourcefile somedir/
sftp> exit
Linux / Solaris / AIX / HP-UX / Unix:
Microsoft Windows:
Mac OS X:
LFTP is a command-line file-transfer program for Linux and Unix systems. It supports SFTP, HTTP, and HTTPS file transfers. LFTP has additional features not provided by SFTP, such as bandwidth throttling, transfer queues, and parallel transfers. You may use it interactively or from a script.
LFTP with parallel transfers can be much faster than SCP or SFTP, so RCAC encourages its use when possible.
LFTP is available only on some RCAC systems. However, it is simply a client, so the remote machine involved in a transfer does not need it (the remote system need only support SFTP).
Interactive usage:
$ lftp myusername@hostname
(transfer all ".dat" files from remote system to local)
lftp :~> mget *.dat
(transfer "filename.dat" file from local system to remote)
lftp :~> put filename.dat
(transfer a directory and all contents from remote
system to local, using 5 connections in parallel)
lftp :~> mirror --parallel=5 remotedirectory localdirectory/
(transfer a directory and all contents from local
system to remote, using 8 connections in parallel)
lftp :~> mirror -R --parallel=8 localdirectory remotedirectory/
Batch usage:
(specify all actions on command line)
$ lftp myusername@hostname -e "mget *.dat"
(specify all actions in the script file "mytransfer.lftp")
$ lftp myusername@hostname -f mytransfer.lftp
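As an illustration of what the script file might contain, the sketch below writes a hypothetical mytransfer.lftp using the commands from the interactive examples, followed by bye to end the session. The directory names are placeholders, and the transfer itself is not run here.

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)" || exit 1

# Write a hypothetical script file; each line is an lftp command
# that could also be typed at the interactive prompt.
cat > mytransfer.lftp <<'EOF'
mget *.dat
mirror --parallel=5 remotedirectory localdirectory/
bye
EOF

# The transfer would then be started with (not run here):
#   lftp myusername@hostname -f mytransfer.lftp
cat mytransfer.lftp
```

Keeping the commands in a file makes a transfer repeatable and easy to review before it runs.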
GridFTP is a fast method of transferring large files that uses Globus authentication credentials (x509 certificates). GridFTP is available on some RCAC resources, but only to users who are members of a Grid project, such as TeraGrid, NorthWest Indiana Computational Grid (NWICG), or Open Science Grid (OSG). However, not all grids may access all RCAC resources.
For more information about how to use GridFTP, consult documentation for your participating grid.
SMB (Server Message Block), also known as CIFS, is an easy to use file transfer protocol that is useful for transferring files between RCAC systems and a desktop or laptop. You may use SMB to connect to your home, scratch, and fortress storage directories. The SMB protocol is available on Windows, Linux, and Mac OS X. It is primarily used as a graphical means of transfer but it can also be used over the command line.
Windows:
Mac OS X:
Linux:
$ smbclient //samba.rcac.purdue.edu/myusername -U myusername -W onepurdue
The compilers available on Radon and the Community Clusters (Hansen, Rossmann, Coates, Steele, and Miner) are able to compile code for Condor. Compilers are available for Fortran 77, Fortran 90, Fortran 95, C, and C++. The compilers can produce general-purpose and architecture-specific optimizations to improve performance. These include loop-level optimizations, inter-procedural analysis and cache optimizations. While the compilers support automatic and user-directed parallelization of Fortran, C, and C++ applications for multiprocessing execution, BoilerGrid allows only serial jobs.
To see the available compilers, choose one of the following entries:
$ module avail intel
$ module avail gcc
$ module avail pgi
Using statically linked libraries, regardless of the chosen Condor universe, is good practice because you cannot rely on which versions of dynamic libraries are available on the machines selected to run your job. With static libraries, Condor sends the same libraries to all machines. On the other hand, because the Condor flock consists of a mix of machine architectures, your job may land on a machine that is so different from, or so much older than, the machine on which you built your executable file that it fails to execute an instruction in the statically linked library. In a parameter sweep, this leads to the confusing situation in which some runs of the sweep complete successfully while others fail. In this case, consider using the corresponding dynamic library on the selected machine, or use ClassAds to select compute nodes known to run your job successfully or to exclude compute nodes known to fail. Otherwise, use static linkage if at all possible. For the Standard Universe, the condor_compile command specifies static linkage as part of its arguments to the linker; the condor_compile command exhibits its arguments in the "LINKING FOR" message. For jobs destined for the Vanilla Universe, use your compiler's command-line option for selecting statically linked libraries.
A serial program is a single process whose steps execute as a sequential stream of instructions on one computer. Compilers capable of serial programming are available for C, C++, and versions of Fortran.
Here are a few sample serial programs:
With the GNU compilers only, the command condor_compile compiles source code and relinks it with the Condor libraries for submission into Condor's Standard Universe. The Condor libraries provide the program with additional support, such as the capability to preempt with checkpointing, which is a feature of Condor's Standard Universe mode of operation. The command condor_compile requires the source or object code of a computer program as well as a compatible compiler.
To use condor_compile and the Standard Universe, first load a compatible compiler (in this case the default GNU compiler):
$ module load gcc
Next, choose one of the following entries:
$ condor_compile gfortran myprogram.f -o myprogram
$ condor_compile gfortran myprogram.f90 -o myprogram
$ condor_compile gfortran myprogram.f95 -o myprogram
$ condor_compile gcc myprogram.c -o myprogram
$ condor_compile g++ myprogram.cpp -o myprogram
When neither source nor object code of a computer program is available (i.e. only an executable binary or a shell script) or when you wish to take advantage of features of a compiler which is not compatible with Condor's condor_compile and Standard Universe, you must compile without condor_compile and submit your executable file to Condor's Vanilla Universe. This section looks at just compiling with the standard C/C++ and Fortran compilers, as opposed to compiling with condor_compile.
The following table illustrates how to compile a serial program with statically linked libraries. Note that not all compilers are available on all systems.
| Language | Intel Compiler | GNU Compiler | PGI Compiler |
|---|---|---|---|
| (load the compiler first) | $ module load intel | $ module load gcc | $ module load pgi |
| Fortran 77 | $ ifort -static myprogram.f -o myprogram | $ gfortran -static myprogram.f -o myprogram | $ pgf77 -Bstatic myprogram.f -o myprogram |
| Fortran 90 | $ ifort -static myprogram.f90 -o myprogram | $ gfortran -static myprogram.f90 -o myprogram | $ pgf90 -Bstatic myprogram.f90 -o myprogram |
| Fortran 95 | $ ifort -static myprogram.f90 -o myprogram | $ gfortran -static myprogram.f95 -o myprogram | $ pgf95 -Bstatic myprogram.f95 -o myprogram |
| C | $ icc -static myprogram.c -o myprogram | $ gcc -static myprogram.c -o myprogram | $ pgcc -Bstatic myprogram.c -o myprogram |
| C++ ¹ | $ icc -static myprogram.cpp -o myprogram | $ g++ -static myprogram.cpp -o myprogram | $ pgCC -Bstatic myprogram.cpp -o myprogram |

¹ The suffix of a C++ file may be .C, .c, .cc, .cpp, .cxx, or .c++.
The Intel, GNU and PGI compilers will not output anything for a successful compilation. Also, the Intel compiler does not recognize the suffix ".f95".
An older version of the GNU compiler will be in your path by default. Do NOT use this version. Instead, load a newer version using the command module load gcc.
More information on compiler options is available in the official man pages on the Web. Also, the command man mycompiler displays man pages (only after using module load to load the appropriate compiler).
Here is some more documentation from other sources on the various compilers:
BoilerGrid allows only serial programs to run via Condor. There is no support for MPI or OpenMP.
BoilerGrid has a few preinstalled libraries, including mathematical libraries. More detailed documentation on the libraries available on BoilerGrid follows.
There is currently no support for MPICH through Condor.
Intel Math Kernel Library (MKL) contains ScaLAPACK, LAPACK, Sparse Solver, BLAS, Sparse BLAS, CBLAS, GMP, FFTs, DFTs, VSL, VML, and Interval Arithmetic routines. MKL resides in the directory /opt/intel/mkl/9.1, and it has the following subdirectory structure:
Here are some example combinations of linking options:
(static linking of LAPACK and Kernels)
$ myfortrancompiler myprogram.f -L${MKLPATH} -lmkl_lapack -lmkl_ia32 -lguide -lpthread
(static linking of Fortran-95 LAPACK Interface and Kernels)
$ myfortrancompiler myprogram.f95 -L${MKLPATH} -lmkl_lapack95 -lmkl_lapack -lmkl_ia32 -lguide -lpthread
(static linking of BLAS, Sparse BLAS, GMP, VML/VSL, Interval Arithmetic, and FFT/DFT)
$ myccompiler myprogram.c -L${MKLPATH} -lmkl_ia32 -lguide -lpthread -lm
(dynamic linking of BLAS or FFTs)
$ myccompiler myprogram.c -L${MKLPATH} -lmkl -lguide -lpthread
RCAC recommends that you use dynamic linking of libguide. If so, define LD_LIBRARY_PATH such that you are using the correct version of libguide at run time. If you use static linking of libguide (discouraged), then:
Here is some more documentation from other sources on the Intel MKL:
You may write different parts of a computing application in different programming languages. For example, an application might incorporate older, legacy code which performs numerical calculations written in Fortran. Systems functions might use C. A newer, main program which binds together all older code might use C++ to take advantage of the object orientation. This section illustrates a few simple examples.
For more information about mixing programming languages:
If the source file ends with .F, .fpp, or .FPP, cpp automatically preprocesses the source code before compilation. If you want to use the C preprocessor with source files that do not end with .F, use the following compiler option to specify the filename suffix:
$ gfortran -x f77-cpp-input myprogram.f
$ ... -cxxlib -gcc/-cxxlib -icc
For example, to preprocess source files that end with .f:
$ ifort -cpp myprogram.f
Generally, it is advisable to rename your file from myprogram.f to myprogram.F. The preprocessor then automatically runs when you compile the file.
For more information on combining C/C++ and Fortran:
A C language program calls routines written in Fortran 90, C, and C++. The routines change the value of a character argument. To understand what makes this example work, you must be aware of a few simple issues.
To discover how the chosen Fortran compiler handles the names of routines, apply the Linux command nm to the object file: nm filename.o. The Fortran compilers used in this example append an underscore after the name of a routine. The C program calls the Fortran routine with the underscore character.
Fortran uses pass-by-reference while C uses pass-by-value. Therefore, to pass a value from a Fortran routine to a C program requires the argument in the call to the Fortran routine to be a pointer (ampersand "&"). To pass a value from a C++ routine to a C program, the C++ routine may use the pass-by-reference syntax (ampersand "&") of C++ while the C program again specifies a pointer (ampersand "&") in the call to the C++ routine.
The C++ compiler must know at the time of compiling the C++ routine that the C program will invoke the C++ routine with the C-style interface rather than the C++ interface.
The following files of source code illustrate these technical details:
Separately compile each source code file with the appropriate compiler into an object (.o) file. Then link the object files into a single executable file (a.out):
| C Main Program | Intel | GNU | PGI |
|---|---|---|---|
| Load compiler | $ module load intel | $ module load gcc | $ module load pgi |
| Compile C main program | $ icc -c main.c | $ gcc -c main.c | $ pgcc -c main.c |
| Compile Fortran 90 routine | $ ifort -c f90.f90 | $ gfortran -c f90.f90 | (compiled in the link step) |
| Compile C routine | $ icc -c c.c | $ gcc -c c.c | $ pgcc -c c.c |
| Compile C++ routine | $ icc -c cpp.cpp | $ g++ -c cpp.cpp | $ pgCC -c cpp.cpp |
| Link | $ icc -lstdc++ main.o f90.o c.o cpp.o | $ gcc -lstdc++ main.o f90.o c.o cpp.o | $ pgf90 -Mnomain main.o c.o cpp.o f90.f90 |
The results show that each routine successfully returns a different character to the main program:
$ a.out
main(), initial value: chr=X
main(), after function subr_f_(): chr=f
main(), after function func_c(): chr=c
main(), after function func_cpp(): chr=+
Exit main.c
A C++ language program calls routines written in Fortran 90, C, and C++. The routines change the value of a character argument. To understand what makes this example work, you must be aware of a few simple issues.
To discover how the chosen Fortran compiler handles the names of routines, apply the Linux command nm to the object file: nm filename.o. The Fortran compilers used in this example append an underscore after the name of a routine. The C++ program calls the Fortran routine with the underscore character.
Fortran uses pass-by-reference while C++ uses pass-by-value. Therefore, to pass a value from a Fortran routine to a C++ program requires the argument in the call to the Fortran routine to be a pointer (ampersand "&"). To pass a value from a C routine to a C++ program, the C routine must declare a parameter as a pointer (asterisk "*") while the C++ program again specifies a pointer (ampersand "&") in the call to the C routine.
The C++ compiler must know at the time of compiling the C++ program that the C++ program will invoke the Fortran and C routines with the C-style interface rather than the C++ interface.
The following files of source code illustrate these technical details:
Separately compile each source code file with the appropriate compiler into an object (.o) file. Then link the object files into a single executable file (a.out):
| C++ Main Program | Intel | GNU | PGI |
|---|---|---|---|
| Load compiler | $ module load intel | $ module load gcc | $ module load pgi |
| Compile C++ main program | $ icc -c main.cpp | $ g++ -c main.cpp | $ pgCC -c main.cpp |
| Compile Fortran 90 routine | $ ifort -c f90.f90 | $ gfortran -c f90.f90 | $ pgf90 -c f90.f90 |
| Compile C routine | $ icc -c c.c | $ gcc -c c.c | $ pgcc -c c.c |
| Compile C++ routine | $ icc -c cpp.cpp | $ g++ -c cpp.cpp | $ pgCC -c cpp.cpp |
| Link | $ icc -lstdc++ main.o f90.o c.o cpp.o | $ g++ main.o f90.o c.o cpp.o | $ pgCC -L../lib main.o c.o cpp.o f90.o -pgf90libs |
The results show that each routine successfully returns a different character to the main program:
$ a.out
main(), initial value: chr=X
main(), after function subr_f_(): chr=f
main(), after function func_c(): chr=c
main(), after function func_cpp(): chr=+
Exit main.cpp
A Fortran language program calls routines written in Fortran 90, C, and C++. The routines change the value of a character argument. To understand what makes this example work, you must be aware of a few simple issues.
To discover how the chosen Fortran compiler handles the names of routines, apply the Linux command nm to the object file: nm filename.o. The Fortran compilers used in this example append an underscore after the name of a routine, so the definitions of the C and C++ routines must include the underscore. The Fortran program calls these routines without the underscore character in the Fortran source code.
Fortran uses pass-by-reference while C uses pass-by-value. Therefore, to pass a value from a C routine to a Fortran program requires the parameter of the C routine to be a pointer (asterisk "*") in the C routine's definition. To pass a value from a C++ routine to a Fortran program, the C++ routine may use the pass-by-reference syntax (ampersand "&") of C++ in its definition.
The C++ compiler must know at the time of compiling the C++ routine that the Fortran program will invoke the C++ routine with the C-style interface rather than the C++ interface.
The following files of source code illustrate these technical details:
Separately compile each source code file with the appropriate compiler into an object (.o) file. Then link the object files into a single executable file (a.out):
| Fortran 90 Main Program | Intel | GNU | PGI |
|---|---|---|---|
| Load compiler | $ module load intel | $ module load gcc | $ module load pgi |
| Compile Fortran 90 main program | $ ifort -c main.f90 | $ gfortran -c main.f90 | $ pgf90 -c main.f90 |
| Compile Fortran 90 routine | $ ifort -c f90.f90 | $ gfortran -c f90.f90 | $ pgf90 -c f90.f90 |
| Compile C routine | $ icc -c c.c | $ gcc -c c.c | $ pgcc -c c.c |
| Compile C++ routine | $ icc -c cpp.cpp | $ g++ -c cpp.cpp | $ pgCC -c cpp.cpp |
| Link | $ ifort -lstdc++ main.o f90.o c.o cpp.o | $ gfortran -lstdc++ main.o c.o cpp.o f90.o | $ pgf90 main.o c.o cpp.o f90.o |
The results show that each routine successfully returns a different character to the main program:
$ a.out
main(), initial value: chr=X
main(), after function subr_f(): chr=f
main(), after function subr_c(): chr=c
main(), after function func_cpp(): chr=+
Exit mixlang
You may use Condor to submit jobs to BoilerGrid. Condor performs job scheduling. Jobs may be serial only. You may use only the batch mode for developing and running your program. BoilerGrid does not offer an interactive mode to run your jobs.
Condor is one of several distributed computing resources RCAC provides. Like other similar resources, Condor provides a framework for running programs on otherwise idle computers. While this imposes serious limitations on parallel jobs and codes with large I/O or memory requirements, Condor can provide a large quantity of cycles for researchers who need to run hundreds of smaller jobs.
Condor is a specialized batch system for managing compute-intensive jobs. Condor provides a queuing mechanism, scheduling policy, priority scheme, and resource classifications. Users submit their jobs to Condor, which then puts these jobs in a queue, runs them, and reports back with the results.
In some ways, Condor is different from other batch systems. They usually only operate on dedicated machines/compute servers. Instead, Condor can both schedule jobs on dedicated machines and effectively utilize non-dedicated machines to run jobs. It only runs jobs on machines which are currently idle (no keyboard activity, no load average, no active telnet users, etc). In this way, Condor effectively harnesses otherwise idle machines throughout a pool of machines.
Currently, RCAC uses Condor to utilize idle cycles on all RCAC computational resources, including all Linux cluster nodes as well as some other servers and workstations. While RCAC uses PBS to schedule the resources of the Linux clusters, Condor schedules jobs on compute nodes when the nodes are not running PBS jobs. When PBS elects to run a new job on a node which is currently running Condor-scheduled jobs, Condor preempts all jobs running on that node to make room for the PBS-scheduled job. You may submit Condor jobs from most of the RCAC systems (Hansen, Rossmann, Coates, Steele, Miner, or Radon).
For more information:
A Universe in Condor defines an execution environment. Condor supports several different Universes for user jobs. The most used on BoilerGrid are "Standard", "Vanilla", and "Globus" (or "Grid"). There are other Universes. See Chapter 2.4.1 of the Condor Manual for more details about the different Universes.
Job submission files specify the Condor Universe through the universe command. The default Universe is Vanilla (not Standard). Windows compute nodes accept only Vanilla Universe jobs.
You will need to determine the appropriate Universe for your jobs. Here are some more details about how the Universes differ:
Vanilla Universe
The Vanilla Universe is the default (except where the configuration variable DEFAULT_UNIVERSE defines it otherwise). It is an execution environment for jobs which you did not re-link with the Condor libraries. It provides fewer services, but has very few restrictions. Preemption with either suspension or eviction (without checkpointing) is a signature of the Vanilla Universe. If a compute node which is running one or more Vanilla jobs ceases to be idle, Condor will either suspend or evict those jobs. Condor may restart a suspended job on the same compute node; Condor will restart evicted jobs on other compute nodes. When re-linking a computer program to the Condor libraries is impossible or when you wish to use a compiler which is incompatible with condor_compile, use the Vanilla Universe.
Virtually any non-parallel program can use the Vanilla Universe. Shell scripts may be executables. It is the only possibility for Windows machines. You may use compilers which are incompatible with condor_compile. For example, Intel compilers may run 30–40% faster than compatible compilers and may even be faster for somewhat longer jobs, because the speed gain may be bigger than the advantage from checkpointing in the Standard Universe. Preemption with suspension or eviction is, in general, bad for long jobs, but OK for short jobs. A long job may never finish because repeated preemptions with restarts can prevent completion.
Static linkage of libraries for Vanilla Universe jobs sends the same collection of libraries to all compute nodes, which eliminates the chance of running a job with different, older libraries that may be installed on some compute nodes. There is still the risk that some compute nodes are sufficiently out of sync with the submission host that they are unable to run the newer libraries. RCAC recommends using static linkage if at all possible.
Standard Universe
The Standard Universe supports transparent job preemption with checkpointing, remote system calls, and migration from compute node to compute node without restarting. Specifying the Standard Universe in your job submission file tells Condor that you previously re-linked your job via condor_compile with the Condor libraries while using various Condor-specific compiler options and libraries. The Standard Universe is desirable because of its preemption with checkpointing. If possible, use the Standard Universe for long jobs. Long jobs are less likely to finish in the Vanilla Universe.
There are a few restrictions on programs. There is no possibility of sub-processes. Shell scripts may not be executables. You may not use incompatible compilers, for example Intel compilers. All Standard Universe executables should be statically linked, since there is no guarantee that the dynamic libraries on all machines in the flock will be the same version. That way Condor will send the same executable file to all machines. There is also the risk that your job may land on a system that is not even the same version as your build system. The condor_compile command specifies static linkage as part of its arguments to the linker; condor_compile displays these arguments in the 'LINKING FOR' message. This command not only forces a static link but also fills in a number of wrappers for standard C library routines to make, among other things, remote file access work.
Globus (or Grid) Universe
The Globus or Grid Universe forwards the job to an external job management system. You use the grid_resource command to apply additional specifications of the Grid Universe. The Globus or Grid Universe allows users to submit jobs using Condor's interface. These jobs execute on grid resources. For Globus jobs, see http://www.globus.org for more information.
Here is the simplest possible job submission file. It will queue one copy of the program hello for execution by Condor. Condor will use its default universe and the default platform, which means to run the job on a compute node which has the same architecture and operating system as the submission host.
No input, output, or error commands appear in the job submission file, so stdin, stdout, and stderr will all refer to /dev/null, the null device. This special file discards all data written to it while reporting that the write operation succeeded, and it returns EOF to any process that reads from it. The program may still produce output by explicitly opening a file and writing to it. This job writes a log file, hello.log, which records the events of the job's lifetime inside Condor, including any errors. When the job finishes, the log file also records its exit conditions. Condor recommends a log file so that you know what happened to your jobs.
If your program only returns output to the screen (like the hello.c program below does), then you should include Output = hello.out or something like it somewhere before Queue. Otherwise you will not see the output.
If you do not explicitly choose a universe, Condor uses the default universe: Vanilla Universe.
####################
#
# Example 1
# Simple Condor job description file
#
####################
executable = hello
log = hello.log
queue
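For a program that prints to the screen, a variant of Example 1 with an output command placed before queue might look like the sketch below. It only writes the submission file; the file name hello.sub and the error file hello.err are hypothetical choices, and the submission step itself is shown as a comment rather than run.

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)" || exit 1

# Write a hypothetical submission file "hello.sub" that captures
# stdout and stderr instead of sending them to /dev/null.
cat > hello.sub <<'EOF'
# Example 1 with stdout and stderr captured
executable = hello
output     = hello.out
error      = hello.err
log        = hello.log
queue
EOF

# Submission would then be (not run here):
#   condor_submit hello.sub
cat hello.sub
```

The output and error commands must appear before queue, since queue submits the job with whatever settings precede it.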
This example (from the Condor Manual) queues two copies of the program Mathematica. The first copy will run in directory run_1, and the second will run in directory run_2. For both queued copies, stdin will be file test.data, stdout will be file loop.out, and stderr will be file loop.error. Because each copy runs in its own directory, the job writes two sets of files in separate directories. This is a convenient way to organize data if you have a large group of Condor jobs to run. The example file shows program submission of Mathematica as a Vanilla Universe job, since neither the source nor object code to program Mathematica is available for relinking to the Condor libraries.
Condor recommends using a single log file.
####################
#
# Example 2
# Demonstrate use of multiple directories for data organization
#
####################
universe = VANILLA
executable = mathematica
input = test.data
output = loop.out
error = loop.error
log = loop.log

initialdir = run_1
queue

initialdir = run_2
queue
In this example (also from the Condor Manual), the job submission file queues 150 runs of program foo which you compiled and linked for Silicon Graphics workstations running IRIX 6.5. This job requires Condor to run the program on machines which have greater than 32 megabytes of physical memory, and expresses a preference to run the program on machines with more than 64 megabytes, if such machines are available. It also advises Condor that it will use up to 28 megabytes of memory when running. Each of the 150 runs of the program receives its own process number, starting with process number 0. So, files stdin, stdout, and stderr will refer to in.0, out.0, and err.0 for the first run of the program; in.1, out.1, and err.1 for the second run of the program; and so forth. A log file foo.log will contain entries about when and where Condor runs, checkpoints, and migrates processes for the 150 queued runs of the program.
####################
#
# Example 3
# Show off some fancy features including use of pre-defined macros and logging
#
####################
executable = foo
requirements = Memory >= 32 && OpSys == "IRIX65" && Arch == "SGI"
rank = Memory >= 64
image_Size = 28 Meg
error = err.$(Process)
input = in.$(Process)
output = out.$(Process)
log = foo.log
queue 150
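Because each run reads its own in.$(Process) file, those input files must exist before submission. The sketch below shows one way to generate them with a loop; the file contents (a single parameter line) are a hypothetical placeholder, and the run count is reduced from 150 to keep the example short.

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)" || exit 1

runs=5   # Example 3 queues 150 runs; a small count keeps the sketch short
i=0
while [ "$i" -lt "$runs" ]; do
    # Hypothetical input content: one parameter value per run.
    # Process numbers start at 0, so the files are in.0, in.1, ...
    echo "parameter=$i" > "in.$i"
    i=$((i + 1))
done

count=$(ls in.* | wc -l | tr -d ' ')
echo "created $count input files"
```

The value of runs should match the number given to queue, so that every process number has a matching input file.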
Once you have a job submission file, you may submit this script to Condor using the condor_submit command. As described above, a job submission file contains the commands and keywords which specify the type of compute node on which you wish to run your job. Condor will find an available processor core and run your job there, or leave your job in a queue until one becomes available.
You may submit jobs to BoilerGrid from any BoilerGrid submission host, including all RCAC cluster front-ends.
To submit a job submission file:
$ condor_submit myjobsubmissionfile
For more information about job submission:
To check on the progress of your jobs, view the Condor queue on the host from which you submitted the jobs.
You must make certain that you logged in to the same submission host (…-fe00, …-fe01, …-fe02, etc.) from which you submitted your jobs, or you will not see them in the queue.
To view the status of all jobs in the Condor queue of your login host:
$ condor_q
To see only your own jobs, specify your own username as an argument:
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:42163> : condor.rcac.purdue.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1100900.0 myusername 2/20 15:13 0+00:00:00 I 0 0.0 Hello
1 jobs; 1 idle, 0 running, 0 held
Secondly, you may check on the status of your jobs through their log files. In your job submission file, you can specify a log command (log = myjob.log) at any point prior to the queue command. The main events during the processing of the job will appear in this log file: submittal, execution commencement, preemption, checkpoint, eviction, and termination.
Thirdly, as soon as your job begins executing, Condor will start a condor_shadow process on the submission host. This shadow process is the mechanism by which the remotely executing jobs can access the environment of the submit host, such as input and output files. There is a shadow process started on the submit host for each job. However, the load on the submit host from this is usually not significant. If you notice degraded performance, you can limit the number of jobs that can run simultaneously using the MAX_JOBS_RUNNING configuration parameter. Please contact RCAC for help with this if you notice poor performance.
To list all the compute nodes which are running your jobs:
$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'RemoteUser=="myusername@rcac.purdue.edu"'
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
ba-005.rcac.p LINUX INTEL Claimed Busy 1.000 502 0+00:24:44
ba-006.rcac.p LINUX INTEL Claimed Busy 0.990 502 0+00:20:22
ba-007.rcac.p LINUX INTEL Claimed Busy 1.000 502 0+00:23:16
ba-008.rcac.p LINUX INTEL Claimed Busy 1.000 502 0+00:30:20
...
For more information about monitoring your job:
The command condor_rm removes a job from the queue. If the job has already started running, then Condor kills the job and removes its queue entry. Use condor_q to get the ID of the job.
Queue of jobs before removal:
$ condor_q
Submitter: condor.rcac.purdue.edu : <128.211.157.86:35407> : condor.rcac.purdue.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
...
260076.7 nice-user 8/18 00:05 0+00:21:47 I 0 29.3 startfah.sh -oneun
260076.9 nice-user 8/18 00:05 0+01:40:44 I 0 136.7 startfah.sh -oneun
260185.0 myusername 8/30 13:01 0+00:00:00 R 0 19.5 hello
...
Remove a job:
$ condor_rm 260185.0
Job 260185.0 marked for removal
Queue of jobs after removal:
$ condor_q
Submitter: condor.rcac.purdue.edu : <128.211.157.86:35407> : condor.rcac.purdue.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
...
260076.7 nice-user 8/18 00:05 0+00:21:47 I 0 29.3 startfah.sh -oneun
260076.9 nice-user 8/18 00:05 0+01:40:44 I 0 136.7 startfah.sh -oneun
...
For more information about removing your job:
This section offers a quick overview of the steps involved in preparing and submitting a simple Condor job.
Prepare the Code
The "Hello World" program below is a simple program which displays the text "hello, world":
/* FILENAME: hello.c */
#include <stdio.h>
int main (void) {
printf("hello, world\n");
return 0;
}
The two most commonly used Condor Universes are Standard and Vanilla. The "Hello World" program above will run in either universe.
Vanilla Universe
Compile the "Hello World" program normally using any available compiler:
$ module load intel
$ icc -static hello.c -o hello
$ module load gcc
$ gcc -static hello.c -o hello
$ module load pgi
$ pgcc -Bstatic hello.c -o hello
Standard Universe
Relink the "Hello World" program with the Condor library using the condor_compile command and a compatible compiler:
$ module load gcc
$ condor_compile gcc hello.c -o hello
Prepare the Job Submission File
Your job submission file defines how to run the job via Condor. It specifies the executable file, the chosen universe, a file containing standard input (not used in this example), files which will receive standard output and standard error, and the Condor log file, as well as many other possible parameters. The queue directive specifies how many executions of the job are to occur. Usually this is just once, as here:
Vanilla Universe
# FILENAME: hello.sub
executable = hello
universe = vanilla
output = hello.out
error = hello.err
log = hello.log
queue
Standard Universe
# FILENAME: hello.sub
executable = hello
universe = standard
output = hello.out
error = hello.err
log = hello.log
queue
$ condor_submit hello.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1100744.
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:56939> : condor.rcac.purdue.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1100744.0 myusername 2/17 15:36 0+00:00:00 I 0 0.0 hello
1 jobs; 1 idle, 0 running, 0 held
$ condor_rm 1100744
View the Results
When the "Hello World" program completes, its output will appear in the file hello.out. The exit status of your program and various statistics about its performance, including time used and I/O performed, will appear in the log file hello.log. To view the output file:
$ less hello.out
hello, world
To view the log file:
$ less hello.log
000 (1100744.000.000) 02/17 15:36:51 Job submitted from host: <128.211.157.86:56939>
...
001 (1100744.000.000) 02/17 15:41:49 Job executing on host: <128.211.157.10:57321>
...
005 (1100744.000.000) 02/17 15:41:53 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
1018 - Run Bytes Sent By Job
5429958 - Run Bytes Received By Job
1018 - Total Bytes Sent By Job
5429958 - Total Bytes Received By Job
...
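If you need to follow many jobs, the event lines of a log like the one above can be picked out with a short script. The Python sketch below is an illustration only: it assumes that every event line has the layout shown in this listing (a three-digit event code, the job ID in parentheses, a date, a time, and a message), and the name parse_events is hypothetical.

```python
import re

# Layout assumed from the listing above:
# 000 (1100744.000.000) 02/17 15:36:51 Job submitted from host: <...>
EVENT_LINE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$"
)


def parse_events(log_text):
    """Return one dictionary per event line; detail lines are skipped."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_LINE.match(line.strip())
        if match is None:
            continue
        code, cluster, process, _subprocess, date, time, message = match.groups()
        events.append({
            "code": code,
            "job": "%d.%d" % (int(cluster), int(process)),
            "date": date,
            "time": time,
            "message": message,
        })
    return events


if __name__ == "__main__":
    sample = (
        "000 (1100744.000.000) 02/17 15:36:51 Job submitted from host: <128.211.157.86:56939>\n"
        "...\n"
        "005 (1100744.000.000) 02/17 15:41:53 Job terminated.\n"
        "(1) Normal termination (return value 0)\n"
    )
    for event in parse_events(sample):
        print(event["code"], event["job"], event["time"], event["message"])
```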
There are many reasons to put a job on hold. For example, if you do not have enough space to hold all the results at the same time and need to move those results somewhere else, you could queue all jobs and put them on hold immediately. You could then release a few jobs at a time (using the -constraint option of condor_release, which can be scripted), move the results as they appear, and release some more jobs. Besides holds that you place manually, the Condor scheduler can hold jobs for various reasons (for example, when it is unable to write to your directory).
Any job in the hold state will remain in the hold state until released. A job in the queue may be placed on hold. A currently running, Standard Universe job receives a hard kill signal (preemption without checkpointing), and Condor returns the job to the queue; when released, this Standard Universe job continues its execution using the most recent checkpoint available. A currently running, Vanilla Universe job receives a hard kill signal (preemption without checkpointing), and Condor returns the job to the queue; when released, this Vanilla Universe job restarts at the beginning.
To hold a job:
$ condor_hold myjobid
To view the state of the held job (column "ST" shows "H"):
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:42163> : condor.rcac.purdue.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1101790.0 myusername 2/24 14:53 0+00:00:00 H 0 0.0 Hello
1 jobs; 0 idle, 0 running, 1 held
For more information about holding your job:
A job that is in the hold state remains there until later released for execution.
To release a held job:
$ condor_release myjobid
The state of the released job is now "Idle", "I":
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:42163> : condor.rcac.purdue.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1101790.0 myusername 2/24 14:53 0+00:00:00 I 0 0.0 Hello
1 jobs; 1 idle, 0 running, 0 held
To release all held jobs of a single user:
$ condor_release myusername
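Releasing held jobs a few at a time, as suggested in the discussion of holding jobs, can be scripted. The Python sketch below only builds the condor_release command lines and does not run them. It assumes the held jobs belong to a single cluster and selects them with the job ClassAd attributes ClusterId and ProcId; the helper names batch_ranges and release_command and the batch size are hypothetical.

```python
def batch_ranges(total_jobs, batch_size):
    """Yield inclusive (first, last) process numbers covering 0..total_jobs-1."""
    for first in range(0, total_jobs, batch_size):
        yield first, min(first + batch_size, total_jobs) - 1


def release_command(cluster_id, first_proc, last_proc):
    """Build the argument list that releases one batch of held jobs."""
    constraint = "ClusterId == %d && ProcId >= %d && ProcId <= %d" % (
        cluster_id, first_proc, last_proc)
    return ["condor_release", "-constraint", constraint]


if __name__ == "__main__":
    # Example: 150 held jobs in cluster 1101790, released 60 at a time.
    for first, last in batch_ranges(150, 60):
        print(" ".join(release_command(1101790, first, last)))
```

Each list could be passed to subprocess.run once the results of the previous batch have been moved away.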
For more information about releasing your job:
Condor attempts to start jobs by matching submitted jobs with available compute nodes on the basis of ClassAds. Condor's ClassAds are analogous to the classified advertising section of the newspaper. Both sellers and buyers advertise details about what they have to sell or want to buy. Both buyers and sellers have some requirements which absolutely must be satisfied, such as the right type of item, and some other criteria by which they will prefer certain offers over others, such as a better price. The same is true in Condor, but between users submitting jobs and compute nodes advertising available resources. Condor uses ClassAds to make the best matches between these two groups.
By default, your Condor jobs will seek an available compute node with the same values for the ClassAds Arch and OpSys as the host from which you submitted your job. The submission process assumes that in most cases your jobs will require the same combination of chip architecture and operating system as the host from which you submitted them. You can remove or alter this restriction as shown in the examples in the "Requiring Specific Architectures or Operating Systems" section.
Some applications may require even more specific capabilities. Using ClassAds, you may specify a set of requirements so that only a subset of available compute nodes become candidates to run your job. There are many ClassAds available for you to use in your job requirements. You may also use ClassAds to indicate a preference for certain nodes over others (but not as an absolute requirement) by using the rank command. The following examples illustrate how to discover current ClassAds and how to estimate the number of compute nodes which will match job requirements based on ClassAds.
To save a detailed report of all the ClassAds of all processor cores in BoilerGrid in the file myfile:
$ condor_status -pool boilergrid.rcac.purdue.edu -long > myfile
You may use any of the ClassAds which appear in this list to view a subset of BoilerGrid. For example, to save a listing of all user ID domains or all file system domains in the file myfile:
$ condor_status -pool boilergrid.rcac.purdue.edu -format "%s\n" UidDomain > myfile
$ condor_status -pool boilergrid.rcac.purdue.edu -format "%s\n" FileSystemDomain > myfile
To list all platforms (architectures and operating systems) and the number of processor cores of each platform on BoilerGrid:
$ condor_status -pool boilergrid.rcac.purdue.edu -total
Total Owner Claimed Unclaimed Matched Preempting Backfill
INTEL/LINUX 64 13 5 46 0 0 0
INTEL/OSX 2 0 0 2 0 0 0
INTEL/WINNT51 345 29 2 314 0 0 0
INTEL/WINNT61 4683 150 13 4520 0 0 0
SUN4u/SOLARIS210 3 2 0 1 0 0 0
X86_64/LINUX 31395 22617 4734 4035 2 2 5
Total 36492 22811 4754 8918 2 2 5
Condor uses the name "INTEL" to indicate x86_32 (32-bit Intel-compatible) architecture.
The total number of processor cores on BoilerGrid is 36,492. The predominant platform of BoilerGrid is x86_64/Linux, with 31,395 processor cores. The values in this table are approximate because the counts change as compute nodes are taken out of service for repair.
To see how many compute nodes have a given ClassAd value, add the ClassAd value as a constraint.
To see only how many 64-bit, Intel-compatible, Linux (x86_64/Linux) platforms are currently in BoilerGrid:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX")'
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 31395 22740 4688 3957 3 2 5
Total 31395 22740 4688 3957 3 2 5
To see how many 64-bit, Intel-compatible, Linux (x86_64/Linux) platforms are currently in BoilerGrid and advertise MATLAB as installed:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (HAS_MATLAB == TRUE)'
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 24659 12008 1557 11094 0 0 0
Total 24659 12008 1557 11094 0 0 0
You may specify numeric constraints with other relational operators. To discover how many compute nodes have at least 16 GB of memory:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint 'TotalMemory >= 16046'
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 26093 18007 3330 4753 3 0 0
Total 26093 18007 3330 4753 3 0 0
ClassAd string values are case-sensitive. ClassAd attribute names are case-insensitive. The comparison operators (<, >, <=, >=, and ==) compare strings case-insensitively. The special comparison operators =?= and =!= compare strings case-sensitively. ClassAd expressions are similar to C boolean expressions and can be quite elaborate.
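The difference between the two families of string comparison can be pictured with a small Python model. This is a sketch of the rule stated above for string operands only; it is not Condor code, it ignores how ClassAds treat undefined values, and the function names are hypothetical.

```python
def classad_equal(left, right):
    """Model of == on two strings: the comparison ignores case."""
    return left.lower() == right.lower()


def classad_is(left, right):
    """Model of =?= on two strings: the comparison respects case."""
    return left == right


def classad_isnt(left, right):
    """Model of =!= on two strings: the negation of =?=."""
    return left != right


if __name__ == "__main__":
    for a, b in (("LINUX", "LINUX"), ("LINUX", "linux")):
        print(a, b, classad_equal(a, b), classad_is(a, b), classad_isnt(a, b))
```

Under this model, a constraint written as OpSys == "linux" matches a node advertising "LINUX", while OpSys =?= "linux" does not.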
For more information about ClassAds, requirements, and rank:
Increasing the throughput of your jobs may not come from maximizing the number of candidate compute nodes but rather from limiting the candidate compute nodes to the set which can access the shared scratch file system of the front-end. This limitation is useful in the case of a large input data file since it avoids both using Condor's file transfer mechanism and running the risk of preemptions preventing job completion.
The following table shows the current list of scratch directories:
| Cluster | Scratch Directory | File System Domain |
|---|---|---|
| condor.rcac.purdue.edu, Radon, Steele | /scratch/scratch95/m/myusername, /scratch/scratch96/m/myusername | bluearc.rcac.purdue.edu |
| Coates, Rossmann | /scratch/lustreA/m/myusername | lustrea.rcac.purdue.edu |
| Hansen | /scratch/lustreC/m/myusername | lustrec.rcac.purdue.edu |
| Miner | /scratch/miner/m/myusername | miner.rcac.purdue.edu |
To discover your scratch file directory, log in to your submission host and enter either of the following commands:
$ findscratch
$ echo $RCAC_SCRATCH
The response will be one of the following paths:
/scratch/scratch95/m/myusername
/scratch/scratch96/m/myusername
/scratch/lustreA/m/myusername
/scratch/lustreC/m/myusername
/scratch/miner/m/myusername
To see which shared scratch file system a specific cluster can access, search on the ClassAd attribute ClusterName:
$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'ClusterName=="Radon"' -format "%s\n" FileSystemDomain >myfile
To see which shared scratch file systems other clusters use, modify the preceding example with other cluster names: Hansen, Rossmann, Coates, Steele, or Miner.
To see which clusters can access a given shared scratch file system, search on the ClassAd attribute FileSystemDomain:
$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'FileSystemDomain == "bluearc.rcac.purdue.edu"' -format "%s\n" ClusterName >myfile
$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'FileSystemDomain == "lustrea.rcac.purdue.edu"' -format "%s\n" ClusterName >myfile
$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'FileSystemDomain == "lustrec.rcac.purdue.edu"' -format "%s\n" ClusterName >myfile
$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'FileSystemDomain == "miner.rcac.purdue.edu"' -format "%s\n" ClusterName >myfile
Using logical operators, you may combine ClassAd constraints. For example, to see how many x86_64 processor cores running Linux have access to the BlueArc shared scratch file system:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (FileSystemDomain == "bluearc.rcac.purdue.edu")'
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 9232 5515 1431 2286 0 0 0
Total 9232 5515 1431 2286 0 0 0
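When you try many combinations, it can help to assemble the constraint string programmatically. The Python sketch below joins equality tests with the && operator in the same style as the command above. The helper names build_constraint and status_command are hypothetical, and the sketch handles only string, boolean, and numeric equality tests.

```python
def build_constraint(**attributes):
    """Join attribute == value tests with && (string values get double quotes)."""
    parts = []
    for name, value in attributes.items():
        if isinstance(value, bool):
            literal = "TRUE" if value else "FALSE"
        elif isinstance(value, str):
            literal = '"%s"' % value
        else:
            literal = str(value)
        parts.append("(%s == %s)" % (name, literal))
    return " && ".join(parts)


def status_command(pool, constraint):
    """Build the condor_status argument list that totals the matching cores."""
    return ["condor_status", "-pool", pool, "-total", "-constraint", constraint]


if __name__ == "__main__":
    expression = build_constraint(
        Arch="X86_64", OpSys="LINUX",
        FileSystemDomain="bluearc.rcac.purdue.edu")
    print(status_command("boilergrid.rcac.purdue.edu", expression))
```

Passing the list to subprocess.run avoids shell quoting problems with the nested single and double quotes.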
Here is a brief description of some of the common ClassAds and attributes available in Condor. For a more complete listing, see the Job Submission Chapter of the Condor Users' Manual.
Long-running computer programs which are executing in the Condor environment face risks that can prevent job completion, for example power loss, overflow of dynamic memory or disk storage, and preemption. Overflow means that a computer program allocates too much dynamic memory or writes too much data to the disk (remote or local) serving the program. Preemption occurs when a higher priority job needs the compute node. It involves either temporarily interrupting a Condor job with the intention of resuming that job from the point of preemption at a later time and often on a different compute node (checkpointing), stopping the job but keeping it on the compute node (checkpointing followed by suspension), or restarting the job from the beginning on a different compute node (eviction).
Checkpointing is a technique for inserting fault tolerance into computing systems. It basically consists of storing a snapshot of the current state of an application and later using it to resume the execution. Preemption, in turn, frees a CPU so that another job can run; this is how Condor scavenges unused computing cycles without preventing higher-priority work. With checkpointing and suspension, a job has a chance to finish. Eviction may cause a job never to finish if the job's run time is significantly longer than the mean time between preemptions or between power failures. Restarting a job from the beginning can be exceedingly wasteful. Condor handles preemption somewhat differently on various compute nodes in BoilerGrid because the owners of each compute node may specify how they want preemption handled. However, a few general principles are true for all.
BoilerGrid offers a heterogeneous collection of compute nodes. These compute nodes do not serve Condor alone. The majority are Linux systems also running the Portable Batch System (PBS). Many are Windows desktop machines. Architecture, performance, memory and disk space vary broadly.
For all compute nodes running PBS, when a PBS-scheduled job needs a compute node, Condor evicts any Condor jobs running on that node at the time. This is known as preemption. When Condor preempts a Standard Universe job, it checkpoints the job, immediately removes it, and starts seeking another compute node to run it, where it will resume the job from the point of preemption. When Condor preempts a Vanilla Universe job, Condor immediately evicts the job and starts seeking another compute node to run it, where it will restart the job at the beginning.
On compute nodes running Windows, preemption in the Condor environment occurs when a user touches the mouse or keyboard. On some nodes, Condor places the job in suspension and waits a finite amount of time to see whether it can restart the job on the same compute node. Perhaps the user needs only a few minutes to check email. If the compute node is still unavailable, Condor either checkpoints (Standard Universe) or evicts (Vanilla Universe) the job and moves it to another Windows compute node in BoilerGrid. On other nodes, Condor immediately checkpoints or evicts the job.
To take advantage of checkpointing and remote system calls of Condor's Standard Universe, you must re-link your program with the Condor libraries. Typically, re-linking requires no change to the source code. Not all applications may take advantage of Condor's Standard Universe. Re-linking precludes commercial software binaries from taking advantage of these services because commercial vendors rarely make their source or object code available. Re-linking precludes applications which must be run from a script. Re-linking precludes using compilers which are incompatible with Condor. An incompatible compiler might yield more efficient code which reduces run time and the likelihood of eviction. Such applications must use Condor's Vanilla Universe. Unless a Vanilla job is self-checkpointing, eviction means that all work is lost.
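A self-checkpointing Vanilla job saves its own progress at intervals and, when restarted, picks up from the last saved state. The Python sketch below shows the general pattern on a toy summation. It is not Condor's checkpoint mechanism: the file name state.json, the function names, and the stop_after parameter (which simulates an eviction) are all assumptions made for illustration. A real job would also need its checkpoint file on storage that survives eviction, such as a shared scratch file system.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_i": 0, "total": 0}


def save_state(path, state):
    """Write the state to a temporary file, then rename it into place."""
    temporary = path + ".tmp"
    with open(temporary, "w") as handle:
        json.dump(state, handle)
    os.replace(temporary, path)  # avoids leaving a half-written checkpoint


def run_sum(path, n, stop_after=None):
    """Sum 0..n-1, checkpointing after every step; None means interrupted."""
    state = load_state(path)
    steps = 0
    while state["next_i"] < n:
        if stop_after is not None and steps >= stop_after:
            return None  # simulated eviction
        state["total"] += state["next_i"]
        state["next_i"] += 1
        save_state(path, state)
        steps += 1
    return state["total"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        checkpoint = os.path.join(workdir, "state.json")
        print(run_sum(checkpoint, 10, stop_after=4))  # interrupted run
        print(run_sum(checkpoint, 10))                # resumed run
```

Checkpointing after every step keeps the sketch simple; a real job would checkpoint less often to limit I/O.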
Jobs running for long periods on BoilerGrid have a high probability of being preempted. This risk can warrant a significant retooling of a job to customize the match between the characteristics of a job's computation and the compute nodes of BoilerGrid in order to maximize throughput. Debugging a computer program and recoding a working program to improve performance are the usual tasks of a programmer. Condor may require additional retooling of that program so that it is able to reach completion.
Condor is able to schedule and run any type of process, but Condor's Standard Universe does have some limitations on any jobs that it checkpoints and migrates:
These limitations apply only to Standard Universe jobs. They do not apply to Vanilla Universe jobs.
To submit jobs successfully to BoilerGrid and to achieve maximum throughput in Condor's computing environment, you must understand the architecture of BoilerGrid and how to request resources which are appropriate to your application. The following examples show how to discover the resources of BoilerGrid. They also explain standard input and output, command-line arguments, file input and output, Standard and Vanilla universe jobs, shared file systems, parameter sweeps, DAG Manager, job requirements and ranks, and how to run commercial and third-party software. You may wish to look here for an example that is most similar to your application and modify that example for your jobs. You may also refer to the Condor Manual for more details.
The job submission file must contain one executable command and at least one queue command. All other commands of the job submission file have default actions. Condor's job submission parser ignores blank lines and single-line comments beginning with a pound sign ("#"). There is no block (multi-line) comment in a job submission file. In some cases, a single-line comment may appear on a command line.
# FILENAME: myjob.sub
executable = myprogram
queue # place one copy of the job in the Condor queue
This job submission file uses Condor's default universe, Vanilla. Because this Vanilla job does not turn on Condor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job. Also by default, the number of jobs queued is one.
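The minimal rules for a job submission file (one executable command, at least one queue command, blank lines and "#" comment lines ignored) can be checked before submission with a few lines of Python. This is a rough sketch and not Condor's parser: it looks only at the first word of each command line, and the function names are hypothetical.

```python
def command_keywords(submit_text):
    """Return the lower-cased keyword of each command line, skipping
    blank lines and lines that begin with a pound sign."""
    keywords = []
    for raw_line in submit_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        words = line.split("=", 1)[0].split()
        if words:
            keywords.append(words[0].lower())
    return keywords


def is_minimal_submit_file(submit_text):
    """True if there is exactly one executable and at least one queue."""
    keywords = command_keywords(submit_text)
    return keywords.count("executable") == 1 and keywords.count("queue") >= 1


if __name__ == "__main__":
    text = (
        "# FILENAME: myjob.sub\n"
        "executable = myprogram\n"
        "\n"
        "queue # place one copy of the job in the Condor queue\n"
    )
    print(command_keywords(text), is_minimal_submit_file(text))
```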
This job submission file may appear to be useless because it lacks the standard input, standard output, standard error, and a common log file; however, it will correctly process a program which reads and writes formatted files. Here is an example of file I/O. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
To submit this job to Condor:
$ condor_submit myjob.sub
Condor manages a batch environment. When Condor manages the execution of a computer program, that program cannot offer an interactive experience with a terminal. All input normally read from the keyboard (standard input) must be prepared in a file ahead of execution. All output normally written to the screen (standard output and standard error) appears in files where you may view it after execution. Also, Condor records in a common log file the main events of running a job.
Here is an example of standard I/O. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
Prepare a job submission file with an appropriate filename, here named myjob.sub:
# FILENAME: myjob.sub
executable = myprogram
# Standard I/O files, Condor log file
input = mydata.in
output = mydata.out
error = mydata.err
log = mydata.log
# queue one job
queue
This job submission file uses Condor's default universe, Vanilla. Because this Vanilla job does not turn on Condor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job. Also by default, the number of jobs queued is one.
This submission specifies that there exists a file, mydata.in, which contains all text which the program would otherwise read from the keyboard, standard input. It also specifies the names of three files which will receive standard output, standard error, and Condor's log entries. These three output files need not preexist, but they can. Condor will overwrite standard output and standard error but will append to the log file during subsequent submissions.
To submit this job to Condor:
$ condor_submit myjob.sub
Condor allows the specification of command-line arguments in the job submission file. There are two permissible formats for specifying arguments. The old syntax has arguments delimited (separated) by space characters. To use double quotes, escape with a backslash (i.e. put a backslash in front of each double quote). For example:
arguments = arg1 \"arg2\" 'arg3'
yields the following arguments:
arg1 "arg2" 'arg3'
The new syntax supports uniform quoting of spaces within arguments. A pair of double quotes surrounds the entire argument list. To include a literal double quote, simply repeat it. White space (spaces, tabs) separates arguments. To include literal white space in an argument, surround the argument with a pair of single quotes. To include a literal single quote within a single-quoted argument, repeat the single quote.
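To make the new-syntax rules concrete, the Python sketch below splits an arguments value into individual arguments by applying exactly the rules listed above. It is an illustrative model written for this guide, not Condor's own parser, and the name parse_new_syntax is hypothetical.

```python
def parse_new_syntax(value):
    """Split a new-syntax arguments value into a list of arguments."""
    value = value.strip()
    if len(value) < 2 or not (value.startswith('"') and value.endswith('"')):
        raise ValueError("the argument list must be wrapped in double quotes")
    body = value[1:-1]
    arguments = []
    current = []
    in_argument = False   # True once the current argument has started
    in_single = False     # True while inside a single-quoted section
    i = 0
    while i < len(body):
        char = body[i]
        following = body[i + 1:i + 2]
        if char == '"':
            # A literal double quote must be written twice.
            if following != '"':
                raise ValueError("a literal double quote must be repeated")
            current.append('"')
            in_argument = True
            i += 2
            continue
        if in_single:
            if char == "'" and following == "'":
                current.append("'")   # repeated single quote is a literal
                i += 2
                continue
            if char == "'":
                in_single = False     # closing single quote
            else:
                current.append(char)  # white space is kept inside quotes
        elif char in " \t":
            if in_argument:
                arguments.append("".join(current))
                current = []
                in_argument = False
        elif char == "'":
            in_single = True
            in_argument = True
        else:
            current.append(char)
            in_argument = True
        i += 1
    if in_single:
        raise ValueError("unterminated single quote")
    if in_argument:
        arguments.append("".join(current))
    return arguments


if __name__ == "__main__":
    print(parse_new_syntax('"arg9 ""arg10"" \'arg with literal \'\' and spaces\'"'))
```

The value in the demonstration is the new-syntax arguments line used in this section's job submission file.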
Here is a simple program which will display command-line arguments specified in a job submission file. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
Prepare a job submission file with an appropriate filename, here named myjob.sub. Run with command-line arguments in either the old or new syntax:
# FILENAME: myjob.sub
universe = VANILLA
executable = myprogram
# Old Syntax
# arguments = arg1 arg2 arg3 \"arg4\" 'arg5' 'arg with spaces' arg6 arg7_with_spaces arg8
# New Syntax
arguments = "arg9 ""arg10"" 'arg with literal '' and spaces'"
# Condor Macros
# arguments = $(Cluster) $(Process)
# standard I/O files, Condor log file
output = myprogram.out
error = myprogram.err
log = myprogram.log
# queue one job
queue
To submit this job to Condor:
$ condor_submit myjob.sub
View command-line arguments submitted in the old syntax:
*** MAIN START ***
Number of command line arguments: 12
command line argument, argv[0]: condor_exec.746418.0
command line argument, argv[1]: arg1
command line argument, argv[2]: arg2
command line argument, argv[3]: arg3
command line argument, argv[4]: "arg4"
command line argument, argv[5]: 'arg5'
command line argument, argv[6]: 'arg
command line argument, argv[7]: with
command line argument, argv[8]: spaces'
command line argument, argv[9]: arg6
command line argument, argv[10]: arg7_with_spaces
command line argument, argv[11]: arg8
*** MAIN STOP ***
The old syntax requires simulating spaces in arguments with the underscore character. Then, user code can replace the underscores with spaces to achieve an argument with spaces.
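The Python sketch below models the old-syntax behavior seen in the output above: arguments split at every space, single quotes are ordinary characters, and a backslash-escaped double quote becomes a literal double quote. It then shows the underscore workaround on the receiving side. Both function names are hypothetical, and the model covers only the cases that appear in this example.

```python
def split_old_syntax(value):
    """Split on white space; turn each \\" into a literal double quote."""
    return [token.replace('\\"', '"') for token in value.split()]


def restore_spaces(argument):
    """Undo the underscore convention inside the user's own program."""
    return argument.replace("_", " ")


if __name__ == "__main__":
    line = r"""arg1 arg2 arg3 \"arg4\" 'arg5' 'arg with spaces' arg6 arg7_with_spaces arg8"""
    arguments = split_old_syntax(line)
    print(len(arguments), arguments)
    print(restore_spaces(arguments[9]))
```

The list holds eleven entries; together with argv[0] that accounts for the twelve command-line arguments reported by the program.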
View command-line arguments submitted in the new syntax:
*** MAIN START ***
Number of command line arguments: 4
command line argument, argv[0]: condor_exec.341964.0
command line argument, argv[1]: arg9
command line argument, argv[2]: "arg10"
command line argument, argv[3]: arg with literal ' and spaces
*** MAIN STOP ***
The array element argv[0] holds Condor's name for a job.
Two Condor macros are useful as command-line arguments, $(Cluster) and $(Process):
*** MAIN START ***
Number of command line arguments: 3
command line argument, argv[0]: condor_exec.341965.0
command line argument, argv[1]: 341965
command line argument, argv[2]: 0
*** MAIN STOP ***
Condor is able to manage a computer program which reads and writes formatted data files.
Here is an example of formatted file I/O. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
Prepare a job submission file with an appropriate filename, here named myjob.sub. This example combines formatted file I/O with standard output:
# FILENAME: myjob.sub
executable = myprogram
# Standard I/O files, Condor log file
output = mydata.out
error = mydata.err
log = mydata.log
# queue one job
queue
This job submission file uses Condor's default universe, Vanilla. Because this Vanilla job does not turn on Condor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job. Also by default, the number of jobs queued is one.
This submission specifies that there exists a formatted input file, myinputdata, a name which appears in the source code only. The result is a formatted output file, myoutputdata, a name which also appears in the source code only. This submission also specifies the names of three files which will receive standard output, standard error, and Condor's log entries. These three output files need not preexist, but they can. Condor will overwrite standard output and standard error but append to the log file during subsequent submissions.
To submit this job to Condor:
$ condor_submit myjob.sub
The Standard Universe is an execution environment of Condor. Jobs using the Standard Universe enjoy two advantages. The first advantage is that a job with a higher priority may preempt a Condor job without loss of completed work. Condor can checkpoint the job and move (migrate) the job to a different compute node which would otherwise be idle. Condor restarts the job on the new compute node at precisely the point of preemption. The Standard Universe tells Condor that you re-linked your job via condor_compile with the Condor libraries, and therefore your job supports checkpointing. Condor transfers the executable and checkpoint files automatically, when needed.
The second advantage of Condor's Standard Universe is that remote system calls handle access to files (input and output). For example, Condor intercepts a call to read a record of a data file. Condor sends the read operation to the user's current working directory on the submission host which performs the read operation. Condor then sends the desired record to the compute node which processes the record. A similar process occurs with write operations. Therefore, the existence of a shared file system is not relevant. This feature maximizes the number of machines which can run a job. Compute nodes across an entire enterprise can run a job, including compute nodes in different administrative domains.
This section illustrates how to submit a small job to the Standard Universe of BoilerGrid. This example, myprogram.c, displays the name of the host which runs the job. To compile this program for the Standard Universe, see Compiling Serial Programs.
Prepare a job submission file which selects the Standard Universe, names the compiled C program as the executable, and tells Condor to transfer the compiled program to the chosen compute node. Give the file an appropriate filename, here myjob.sub:
# FILENAME: myjob.sub
universe = STANDARD
# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable = myprogram
# Standard I/O files, Condor log file
output = mydata.out
error = mydata.err
log = mydata.log
# queue one job
queue
Submit this job to Condor:
$ condor_submit myjob.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 341956.
View job status:
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35556> : condor.rcac.purdue.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
341956.0 myusername 10/22 11:18 0+00:00:00 I 0 7.3 myjob
Place the job on hold to study the submission:
$ condor_hold 341956
Cluster 341956 held.
Obtain the requirements of this job:
$ condor_q myusername -attributes requirements -long
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35556> : condor.rcac.purdue.edu
requirements = (Arch == "X86_64") && (OpSys == "LINUX") && ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED)) && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize)
Job requirements reflect the Standard Universe (preemption with checkpointing). This job requires a processor core which runs the Linux operating system on the x86_64 architecture and has the ability to checkpoint the job at preemption. The requirements exclude any mention of the shared file system since a shared file system is not relevant to a Standard Universe job. Running a Standard Universe job does not limit the job to the processor cores which use the same shared file system that the submission host uses. The job may land either on a processor core that uses the same shared file system or not; in either case, the remote I/O of the Standard Universe handles the job's file I/O. The ClassAds Disk and Memory ensure that the chosen core has sufficient resources to hold the disk and memory footprint of the job.
To see how many compute nodes of BoilerGrid are able to satisfy this job's requirements:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED))'
              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
X86_64/LINUX  33118  27602     2878       2596       42           0         0
Total         33118  27602     2878       2596       42           0         0
The report shows that 33,118 processor cores are candidates for running the job. Using Condor's Standard Universe with its remote file I/O maximizes the number of candidate cores available to the job. This large pool of candidate cores has the potential of offering high throughput. Some subset of the candidate processor cores, possibly none, will be available to accept the job at any given moment.
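To put the report in perspective, the summary row can be parsed to see what share of the candidate cores was unclaimed at that moment. The sketch below assumes the row format shown above; the function name is hypothetical, and treating the Unclaimed column as a rough indicator of immediately available cores is an assumption of this example.

```python
# Parse one summary row of the `condor_status -total` report shown above.

COLUMNS = ["Total", "Owner", "Claimed", "Unclaimed", "Matched", "Preempting", "Backfill"]


def parse_totals_row(row):
    """Split a row such as 'X86_64/LINUX 33118 27602 ...' into a label and counts."""
    fields = row.split()
    label = fields[0]
    numbers = [int(n) for n in fields[1:1 + len(COLUMNS)]]
    return label, dict(zip(COLUMNS, numbers))


label, counts = parse_totals_row("X86_64/LINUX 33118 27602 2878 2596 42 0 0")
share = counts["Unclaimed"] / counts["Total"]
print(f"{label}: {counts['Unclaimed']} of {counts['Total']} cores unclaimed ({share:.1%})")
# prints: X86_64/LINUX: 2596 of 33118 cores unclaimed (7.8%)
```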
Release the job from the queue:
$ condor_release 341956
Cluster 341956 released.
View results in the file for all standard output, here named mydata.out:
*** MAIN START ***
hostname = cms-100.rcac.purdue.edu
domainname = (none)
*** MAIN STOP ***
The output shows the name of the processor core which ran the job. While this job ran on a processor core which resides on the same shared file system used by the submission host, another submission which forced the job onto a core of another shared file system also ran successfully because the remote I/O of the Standard Universe handled the reading and writing of records.
View the log file, mydata.log:
000 (341956.000.000) 10/22 11:42:22 Job submitted from host: <128.211.157.86:35556>
...
012 (341956.000.000) 10/22 11:42:57 Job was held.
via condor_hold (by user myusername)
Code 1 Subcode 0
...
001 (341956.000.000) 10/22 11:43:57 Job executing on host: <128.211.157.10:52556>
...
005 (341956.000.000) 10/22 11:43:57 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
1110 - Run Bytes Sent By Job
5431033 - Run Bytes Received By Job
1110 - Total Bytes Sent By Job
5431033 - Total Bytes Received By Job
The log file records the main events related to the processing of this job. In this example, the log records the number of bytes read and written between the submission host and the compute node via the remote I/O of the Standard Universe.
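A log in this layout is easy to summarize with a short script. The following is a minimal sketch that assumes the line format shown in the excerpt above (a three-digit event code, the job ID in parentheses, a date and a time, then the event text); the function name and regular expressions are illustrative and not part of Condor.

```python
import re

# Event header lines look like:
#   001 (341956.000.000) 10/22 11:43:57 Job executing on host: <...>
EVENT = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.\d+\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<text>.*)$"
)
# Byte-count lines look like:
#   1110 - Run Bytes Sent By Job
BYTES = re.compile(r"^\s*(?P<count>\d+)\s+-\s+(?P<label>(?:Run|Total) Bytes .*)$")


def summarize_log(text):
    """Return (events, byte_counts) extracted from the text of a job log."""
    events, byte_counts = [], {}
    for line in text.splitlines():
        m = EVENT.match(line)
        if m:
            events.append((m["code"], int(m["cluster"]), m["text"].strip()))
            continue
        m = BYTES.match(line)
        if m:
            byte_counts[m["label"].strip()] = int(m["count"])
    return events, byte_counts


sample = """\
000 (341956.000.000) 10/22 11:42:22 Job submitted from host: <128.211.157.86:35556>
...
005 (341956.000.000) 10/22 11:43:57 Job terminated.
(1) Normal termination (return value 0)
1110 - Run Bytes Sent By Job
5431033 - Run Bytes Received By Job
"""

events, byte_counts = summarize_log(sample)
for code, cluster, text in events:
    print(code, cluster, text)
print(byte_counts)
```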
The Standard Universe maximizes throughput with its ability to checkpoint jobs and to intercept remote system calls. The latter avoids requiring the submission host and the compute node to share a file system. The process of re-linking a job with Condor's libraries involves including both Condor's libraries and the user's libraries as static libraries. The danger of this effort to maximize throughput is that a Condor flock is a heterogeneous collection of old and new compute nodes, so a job can land on a compute node that is unable to run the job. When this happens, the user must consider how to avoid compute nodes which are unable to run a job to a successful completion.
The Vanilla Universe is an execution environment of Condor. The Vanilla Universe tells Condor that you did not re-link your job via condor_compile with the Condor libraries, and therefore your job does not support checkpointing or remote system calls. Such jobs include an executable binary of a commercial application, a shell script, or a program which relies on features of a compiler that is not compatible with Condor's condor_compile command.
For jobs submitted under the Vanilla Universe, the existence of a shared file system is relevant since access to files (input and output) involves either a shared file system or Condor's file transfer mechanism, not the remote system calls of the Standard Universe.
This section illustrates how to submit a small job to the Vanilla Universe of BoilerGrid from a submission host which has a shared file system, with Condor's file transfer mechanism left off (the default). This example, myprogram.c, displays the name of the host which runs the job, the path name of the current working directory, and the contents of that directory. The Vanilla Universe allows using Linux commands to obtain the information. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
Prepare a job submission file, here named myjob.sub, which specifies the Vanilla Universe, names the compiled C program as the executable, tells Condor to transfer the executable to the chosen compute node, and leaves Condor's file transfer mechanism off (the default):
# FILENAME: myjob.sub
universe = VANILLA
# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable = myprogram
# Condor's file transfer mechanism is off, by default.
# Standard I/O files, Condor log file
output = mydata.out
error = mydata.err
log = mydata.log
# queue one job
queue
Submit this job to Condor:
$ condor_submit myjob.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 746407.
View job status:
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
ID         OWNER        SUBMITTED     RUN_TIME   ST PRI SIZE CMD
746407.0   myusername   10/25 10:04   0+00:00:00 I  0   0.0  myjob
Place this job on hold to study the submission:
$ condor_hold 746407
Cluster 746407 held.
Obtain the requirements of this job:
$ condor_q myusername -attributes requirements -long
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (TARGET.FileSystemDomain == MY.FileSystemDomain)
Job requirements reflect both the Vanilla Universe (preemption without checkpointing) and the shared file system. This job requires a compute node which runs the Linux operating system on the x86_64 architecture and, more importantly, which shares the same FileSystemDomain as the submission host (both the TARGET and MY shared file system must be the same). So, this submission limits running the job to the processor cores which use the same shared file system that the submission host uses. The ClassAds Disk and Memory ensure that the chosen core has sufficient resources to hold the disk and memory footprint of the job.
To see how many compute nodes in various file system domains of BoilerGrid are able to satisfy this job's requirements:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (FileSystemDomain == "bluearc.rcac.purdue.edu")'
              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
X86_64/LINUX   9924   7466     1579        878        0           1         0
Total          9924   7466     1579        878        0           1         0
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (FileSystemDomain == "lustrea.rcac.purdue.edu")'
              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
X86_64/LINUX  18784  16717     1438        628        0           1         0
Total         18784  16717     1438        628        0           1         0
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (FileSystemDomain == "miner.rcac.purdue.edu")'
              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
X86_64/LINUX   1006    156      760         90        0           0         0
Total          1006    156      760         90        0           0         0
The reports show that 9,924, 18,784, and 1,006 processor cores are candidates in the bluearc, lustrea, and miner file system domains, respectively; this job may run only on the cores in the domain which the submission host uses. While the number of candidate processor cores which are able to run this job is much smaller than the number of x86_64 cores running Linux on BoilerGrid, using the shared file system is the preferred method in many situations. Some subset of the candidate processor cores, possibly none, will be available to accept the job at any given moment.
Release the job from the queue:
$ condor_release 746407
Cluster 746407 released.
View results in the file for all standard output, here named mydata.out:
cms-100.rcac.purdue.edu
(none)
/autohome/u105/myhomedirectory/Condor/vanilla_w_sfs
total 288
-rw-r--r-- 1 myusername itap 1508 Oct 25 10:39 commands
-rw-r--r-- 1 myusername itap    0 Oct 25 10:46 mydata.err
-rw-r--r-- 1 myusername itap  467 Oct 25 10:46 mydata.log
-rw-r--r-- 1 myusername itap   71 Oct 25 10:46 mydata.out
-rwxr-xr-x 1 myusername itap 6863 Oct 25 10:14 myprogram
-rw-r----- 1 myusername itap  376 Oct 25 09:14 myprogram.c
-rw-r----- 1 myusername itap  199 Oct 25 10:04 myprogram.sub
-rwxr----- 1 myusername itap   70 Oct 25 09:19 run
-rwxr--r-- 1 myusername itap  216 Oct 25 09:14 tally
-rw-r----- 1 myusername itap  952 Oct 25 09:14 tmp
-rw-r--r-- 1 myusername itap    0 Oct 25 10:19 tmp1
*** MAIN START ***
*** MAIN STOP ***
The output shows the name of the compute node which ran the job. This job ran on a compute node which uses the same shared file system which the submission host uses. The fact that the current working directory is the user's home directory on the submission host proves that this job used the shared file system for file I/O.
View the log file, mydata.log:
000 (746407.000.000) 10/25 10:04:51 Job submitted from host: <128.211.157.86:60481>
...
012 (746407.000.000) 10/25 10:05:15 Job was held.
via condor_hold (by user myusername)
Code 1 Subcode 0
...
009 (746406.000.000) 10/25 10:22:07 Job was aborted by the user.
via condor_rm (by user myusername)
...
013 (746407.000.000) 10/25 10:44:07 Job was released.
via condor_release (by user myusername)
...
001 (746407.000.000) 10/25 10:46:47 Job executing on host: <128.211.157.10:57108>
...
005 (746407.000.000) 10/25 10:46:47 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
The log file records the main events related to the processing of this job. In this example, the log records that the number of bytes read and written between the submission host and the compute node is zero, another indication that this job used the shared file system for file I/O.
The Vanilla Universe is available for jobs which cannot take advantage of the Standard Universe because you cannot re-link them with Condor's libraries. You cannot re-link an executable binary, a shell script, or a program which you compiled with an incompatible compiler. If a shared file system is available, the Vanilla job may use it for file I/O by keeping Condor's file transfer mechanism turned off. Keeping the file transfer mechanism off excludes compatible compute nodes which do not share a file system with the submission host.
The Vanilla Universe is an execution environment of Condor. The Vanilla Universe tells Condor that you did not re-link your job via condor_compile with the Condor libraries, and therefore your job does not support checkpointing or remote system calls. Such jobs include an executable binary of a commercial application, a shell script, or a program which relies on features of a compiler that is not compatible with Condor's condor_compile command.
For jobs submitted under the Vanilla Universe, the existence of a shared file system is relevant since access to files (input and output) involves either a shared file system or Condor's file transfer mechanism, not the remote system calls of the Standard Universe.
This section illustrates how to submit a small job to the Vanilla Universe of BoilerGrid from a submission host which has a shared file system with Condor's file transfer mechanism turned on "if needed". Condor transfers files only when it matches the job with a compute node which uses a different FileSystemDomain from the one which the submission host uses. If Condor matches the job with a compute node which uses the same FileSystemDomain which the submission host uses, Condor does not transfer files and relies on the shared file system instead. This example, myprogram.c, displays the name of the host which runs the job, the path name of the current working directory, and the contents of that directory. The Vanilla Universe allows using Linux commands to obtain the information. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
Prepare a job submission file, here named myjob.sub, which specifies the Vanilla Universe, names the compiled C program as the executable, tells Condor to transfer the executable to the chosen compute node, and turns Condor's file transfer mechanism on if needed:
# FILENAME: myjob.sub
universe = VANILLA
# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable = myprogram
# Condor's file transfer mechanism is turned on only when needed.
should_transfer_files = IF_NEEDED
# Let Condor handle output file(s).
when_to_transfer_output = ON_EXIT
# Standard I/O files, Condor log file
output = mydata.out
error = mydata.err
log = mydata.log
# queue one job
queue
Submit this job to Condor:
$ condor_submit myjob.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 746408.
View job status:
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
ID         OWNER        SUBMITTED     RUN_TIME   ST PRI SIZE CMD
746408.0   myusername   10/25 10:04   0+00:00:00 I  0   0.0  myjob
Place this job on hold to study the submission:
$ condor_hold 746408
Cluster 746408 held.
Obtain the requirements of this job:
$ condor_q myusername -attributes requirements -long
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && ((HasFileTransfer) || (TARGET.FileSystemDomain == MY.FileSystemDomain))
The requirements reflect both the Vanilla Universe (preemption without checkpointing) and Condor's file transfer mechanism turned on only if needed. This job requires a processor core which runs the Linux operating system on the x86_64 architecture, but, more importantly, the processor core chosen to run this job need not share the same FileSystemDomain which the submission host uses (the TARGET and MY values of FileSystemDomain need not be equal). The ClassAd of this job states that the chosen core must either have the file transfer capability or share a file system with the submission host. So, this submission does not limit running the job to the processor cores which use the same shared file system which the submission host uses. The ClassAds Disk and Memory ensure that the chosen core has sufficient resources to hold the disk and memory footprint of the job.
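The file-related clause of these requirements, together with the "if needed" behavior described in this section, can be sketched as a small decision function. This is a simplified model rather than Condor's matchmaking code: it ignores the Arch, OpSys, Disk, and Memory clauses, represents a machine as a plain dictionary, and uses a hypothetical function name.

```python
def match_and_transfer(machine, submit_domain):
    """Model `should_transfer_files = IF_NEEDED` for one candidate machine.

    Returns (matches, transfers_files).
    """
    shares_fs = machine.get("FileSystemDomain") == submit_domain
    has_transfer = bool(machine.get("HasFileTransfer", False))
    # (HasFileTransfer) || (TARGET.FileSystemDomain == MY.FileSystemDomain)
    matches = has_transfer or shares_fs
    # Files move only when the match cannot rely on a shared file system.
    transfers = matches and not shares_fs
    return matches, transfers


submit_domain = "bluearc.rcac.purdue.edu"
candidates = [
    {"FileSystemDomain": "bluearc.rcac.purdue.edu", "HasFileTransfer": True},
    {"FileSystemDomain": "lustrea.rcac.purdue.edu", "HasFileTransfer": True},
    {"FileSystemDomain": "lustrea.rcac.purdue.edu", "HasFileTransfer": False},
]
for machine in candidates:
    print(machine["FileSystemDomain"], match_and_transfer(machine, submit_domain))
```

The first candidate matches and uses the shared file system, the second matches and transfers files, and the third does not match at all.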
To see how many compute nodes of BoilerGrid are able to satisfy this job's requirements:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && ((HasFileTransfer) || (FileSystemDomain == "bluearc.rcac.purdue.edu"))'
              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
X86_64/LINUX  32074  20806     4287       6976        5           0         0
Total         32074  20806     4287       6976        5           0         0
The report shows that 32,074 processor cores are candidates for running this job. Using Condor's Vanilla Universe with its file transfer mechanism turned on only if needed maximizes the number of candidate cores available to the job. This large pool of candidate cores has the potential of offering high throughput. On the other hand, large data files involved in the file transfer may diminish throughput. Some subset of the candidate processor cores, possibly none, will be available to accept the job at any given moment.
Release the job from the queue:
$ condor_release 746408
Cluster 746408 released.
View results in the file for all standard output, here named mydata.out:
cms-100.rcac.purdue.edu
(none)
/autohome/u105/myhomedirectory/Condor/vanilla_w_sfs_ftm
total 284
-rw-r--r-- 1 myusername itap 1508 Oct 25 10:39 commands
-rw-r--r-- 1 myusername itap    0 Oct 25 10:46 mydata.err
-rw-r--r-- 1 myusername itap  467 Oct 25 10:46 mydata.log
-rw-r--r-- 1 myusername itap   71 Oct 25 10:46 mydata.out
-rwxr-xr-x 1 myusername itap 6863 Oct 25 10:14 myprogram
-rw-r----- 1 myusername itap  376 Oct 25 09:14 myprogram.c
-rw-r----- 1 myusername itap  199 Oct 25 10:04 myjob.sub
-rwxr----- 1 myusername itap   70 Oct 25 09:19 run
-rwxr--r-- 1 myusername itap  216 Oct 25 09:14 tally
-rw-r----- 1 myusername itap  952 Oct 25 09:14 tmp
-rw-r--r-- 1 myusername itap    0 Oct 25 10:19 tmp1
*** MAIN START ***
*** MAIN STOP ***
The output shows the name of the processor core which ran the job. This job ran on a processor core which uses the same shared file system which the submission host uses. The fact that the current working directory is the user's home directory on the submission host proves that this job used the shared file system for file I/O.
View the log file, mydata.log:
000 (746408.000.000) 10/25 10:04:51 Job submitted from host: <128.211.157.86:60481>
...
012 (746408.000.000) 10/25 10:05:15 Job was held.
via condor_hold (by user myusername)
Code 1 Subcode 0
...
013 (746408.000.000) 10/25 10:44:07 Job was released.
via condor_release (by user myusername)
...
001 (746408.000.000) 10/25 10:46:47 Job executing on host: <128.211.157.10:57108>
...
005 (746408.000.000) 10/25 10:46:47 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
The log file records the main events related to the processing of this job. In this example, the log records that the number of bytes read and written between the submission host and the compute node is zero, another indication that this job used the shared file system for file I/O.
To see Condor's file transfer mechanism at work, repeat the example above but force the job to a compute node which does not share a file system with the submission host.
Modify the job submission file of the previous example to send the job to a processor core which uses a different shared file system:
# FILENAME: myjob.sub
universe = VANILLA
# A core on the Rossmann cluster uses a different shared file system.
requirements = ClusterName == "Rossmann"
# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable = myprogram
# Condor's file transfer mechanism is turned on only when needed. This submission needs the transfer mechanism.
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
# Standard I/O files, Condor log file
output = mydata.out
error = mydata.err
log = mydata.log
# queue one job
queue
View results in the file for all standard output, here named mydata.out:
cms-100.rcac.purdue.edu
(none)
/var/condor/execute/dir_11554
total 12
-rwxr-xr-x 1 myusername itap 6863 Oct 25 12:28 condor_exec.exe
-rw-r--r-- 1 myusername itap    0 Oct 25 12:31 mydata.err
-rw-r--r-- 1 myusername itap   67 Oct 25 12:32 mydata.out
*** MAIN START ***
*** MAIN STOP ***
This output file exhibits a temporary directory on the processor core which Condor chose to run the job, rather than the user's home directory, another indication that this job used Condor's file transfer mechanism for file I/O.
View the log file, mydata.log:
000 (746411.000.000) 10/25 12:08:12 Job submitted from host: <128.211.157.86:60481>
...
001 (746411.000.000) 10/25 12:31:59 Job executing on host: <128.211.157.10:51871>
...
005 (746411.000.000) 10/25 12:32:00 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
278 - Run Bytes Sent By Job
6863 - Run Bytes Received By Job
278 - Total Bytes Sent By Job
6863 - Total Bytes Received By Job
The log file records the main events related to the processing of this job. In this example, the log records the number of bytes transferred between the submission host and the compute node via the file transfer mechanism, another indication that Condor's file transfer mechanism was used.
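The byte counts in the two IF_NEEDED logs above can be compared directly. The sketch below copies the numbers from those logs; the function name is hypothetical. The 6,863 bytes received by the second job equal the size of the executable in the directory listings, which suggests, although the logs do not state it, that the executable was the only file sent to the compute node.

```python
# Byte counts copied from the two mydata.log excerpts above
# (cluster 746408: same file system domain; cluster 746411: forced to Rossmann).
RUNS = {
    746408: {"sent": 0, "received": 0},
    746411: {"sent": 278, "received": 6863},
}
EXECUTABLE_SIZE = 6863  # bytes, from the listings of myprogram and condor_exec.exe


def used_file_transfer(run):
    """For a Vanilla Universe job, a nonzero count means data moved through Condor."""
    return run["sent"] > 0 or run["received"] > 0


for cluster, run in RUNS.items():
    mode = "file transfer mechanism" if used_file_transfer(run) else "shared file system"
    note = " (received bytes equal the executable size)" if run["received"] == EXECUTABLE_SIZE else ""
    print(f"{cluster}: {mode}{note}")
```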
The Vanilla Universe is available for jobs which cannot take advantage of the Standard Universe because you cannot re-link them with Condor's libraries. You cannot re-link an executable binary, a shell script, or a program which you compiled with an incompatible compiler. If a shared file system is available and Condor's file transfer mechanism is suitable for the job, the Vanilla job may use either for file I/O by specifying that the submission uses the mechanism only "if needed." While this method can maximize throughput, the size of any file which you intend to transfer must be reasonable. The file must fit in the available disk space of the compute node. The amount of time needed to transfer the file cannot be so great that higher priority jobs constantly preempt the Condor job during file transfer.
The Vanilla Universe is an execution environment of Condor. The Vanilla Universe tells Condor that you did not re-link your job via condor_compile with the Condor libraries, and therefore your job does not support checkpointing or remote system calls. Such jobs include an executable binary of a commercial application, a shell script, or a program which relies on features of a compiler that is not compatible with Condor's condor_compile command.
For jobs submitted under the Vanilla Universe, the existence of a shared file system is relevant since access to files (input and output) involves either a shared file system or Condor's file transfer mechanism, not the remote system calls of the Standard Universe.
This section illustrates how to submit a small job to the Vanilla Universe of BoilerGrid from a submission host which lacks a shared file system, with Condor's file transfer mechanism turned on. No matter which processor core Condor chooses to run the job, Condor transfers files. This example, myprogram.c, displays the name of the host which runs the job, the path name of the current working directory, and the contents of that directory. The Vanilla Universe allows using Linux commands to obtain the information. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
Prepare a job submission file, here named myjob.sub, which specifies the Vanilla Universe, names the compiled C program as the executable, tells Condor to transfer the executable to the chosen compute node, and turns Condor's file transfer mechanism on:
# FILENAME: myjob.sub
universe = VANILLA
# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable = myprogram
# Turn on Condor's file transfer mechanism.
should_transfer_files = YES
# Let Condor handle output files.
when_to_transfer_output = ON_EXIT
# Standard I/O files, Condor log file
output = mydata.out
error = mydata.err
log = mydata.log
# queue one job
queue
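The Vanilla Universe submit files in this guide differ mainly in their file-transfer lines, so they lend themselves to being generated by a script. The helper below is a hypothetical convenience, not a Condor tool; it emits only submit keywords which already appear in this guide and assumes the same file names (mydata.out, mydata.err, mydata.log).

```python
def render_submit(universe, executable, transfer_mode=None, input_files=()):
    """Build the text of a submit description file like the ones in this guide.

    transfer_mode: None (leave file transfer off, the default), "IF_NEEDED", or "YES".
    """
    if transfer_mode not in (None, "IF_NEEDED", "YES"):
        raise ValueError(f"unsupported transfer mode: {transfer_mode!r}")
    lines = [
        f"universe = {universe}",
        "transfer_executable = TRUE",
        f"executable = {executable}",
    ]
    if transfer_mode is not None:
        lines.append(f"should_transfer_files = {transfer_mode}")
        lines.append("when_to_transfer_output = ON_EXIT")
    if input_files:
        lines.append("transfer_input_files = " + ", ".join(input_files))
    lines += [
        "output = mydata.out",
        "error = mydata.err",
        "log = mydata.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Reproduce the settings of the submit file above (comments omitted).
print(render_submit("VANILLA", "myprogram", transfer_mode="YES"))
```

Writing the returned text to myjob.sub and passing that file to condor_submit would then follow the same steps as the hand-written file.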
Submit this job to Condor:
$ condor_submit myjob.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 341960.
View job status:
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35556> : condor.rcac.purdue.edu
ID         OWNER        SUBMITTED     RUN_TIME   ST PRI SIZE CMD
341960.0   myusername   10/25 15:02   0+00:00:00 I  0   0.0  myjob
Place this job on hold to study the submission:
$ condor_hold 341960
Cluster 341960 held.
Obtain the requirements of this job:
$ condor_q myusername -attributes requirements -long
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35556> : condor.rcac.purdue.edu
requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (HasFileTransfer)
Job requirements reflect both the Vanilla Universe (preemption without checkpointing) and Condor's file transfer mechanism turned on. This job requires a processor core which runs the Linux operating system on the x86_64 architecture, but more importantly the processor core chosen to run this job can reside on a cluster which lacks a shared file system. The ClassAd of this job states that the chosen core must have the file transfer capability. The ClassAds Disk and Memory ensure that the chosen core has sufficient resources to hold the disk and memory footprint of the job.
To see how many compute nodes of BoilerGrid are able to satisfy this job's requirements:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (HasFileTransfer)'
              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
X86_64/LINUX  33068  20690     3850       8520        8           0         0
Total         33068  20690     3850       8520        8           0         0
This report shows that 33,068 processor cores are candidates for running this job. Using Condor's Vanilla Universe with its file transfer mechanism turned on maximizes the number of candidate cores available to the job. This large pool of candidate cores has the potential of offering high throughput. On the other hand, large data files involved in the file transfer may diminish throughput. Some subset of the candidate processor cores, possibly none, will be available to accept the job at any given moment.
Release the job from the queue:
$ condor_release 341960
Cluster 341960 released.
View results in the file for all standard output, here named mydata.out:
cms-100.rcac.purdue.edu
(none)
/var/condor/execute/dir_13374
total 12
-rwxr-xr-x 1 myusername itap 6863 Oct 25 15:47 condor_exec.exe
-rw-r--r-- 1 myusername itap    0 Oct 25 15:50 mydata.err
-rw-r--r-- 1 myusername itap   61 Oct 25 15:50 mydata.out
*** MAIN START ***
*** MAIN STOP ***
The output shows the name of the processor core which ran the job. This job ran on a processor core which shares a file system with the submission host. Despite this, the current working directory is a temporary directory on the compute node; therefore, this job used the file transfer mechanism for file I/O.
View the log file, mydata.log:
000 (341960.000.000) 10/25 15:03:18 Job submitted from host: <128.211.157.86:35556>
...
012 (341960.000.000) 10/25 15:03:35 Job was held.
via condor_hold (by user myusername)
Code 1 Subcode 0
...
013 (341960.000.000) 10/25 15:48:00 Job was released.
via condor_release (by user myusername)
...
001 (341960.000.000) 10/25 15:50:46 Job executing on host: <128.211.157.10:33047>
...
005 (341960.000.000) 10/25 15:50:46 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
272 - Run Bytes Sent By Job
6863 - Run Bytes Received By Job
272 - Total Bytes Sent By Job
6863 - Total Bytes Received By Job
The log file records the main events related to the processing of this job. In this example, the log records the number of bytes transferred between the submission host and the compute node via Condor's file transfer mechanism.
The Vanilla Universe is available for jobs which cannot take advantage of the Standard Universe because you cannot re-link them with Condor's libraries. You cannot re-link an executable binary, a shell script, or a program which you compiled with an incompatible compiler. If a shared file system is not available and Condor's file transfer mechanism is suitable for the job, you may turn on the file transfer mechanism, and the Vanilla job will transfer your files. The size of any file which you intend to transfer must be reasonable. The file must fit in the available disk space of the compute node. The amount of time needed to transfer the file cannot be so great that higher priority jobs constantly preempt the Condor job during file transfer.
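Whether a set of files is "reasonable" to transfer can be estimated before submitting. The sketch below is a rough budget check under stated assumptions: the disk figure is taken to be in KiB (as we understand Condor's Disk attribute to be), the bandwidth is a number you supply yourself, and the function name, file name, and sample values are hypothetical.

```python
import os
import tempfile


def transfer_budget(paths, disk_kib, bytes_per_second):
    """Return (total_bytes, fits_on_disk, estimated_seconds) for the given files."""
    total = sum(os.path.getsize(p) for p in paths)
    fits = total <= disk_kib * 1024          # assumed unit: KiB
    seconds = total / bytes_per_second       # ignores protocol overhead and contention
    return total, fits, seconds


if __name__ == "__main__":
    # Demo with a throwaway file; replace the path list with your own input files.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "input.dat")
        with open(path, "wb") as f:
            f.write(b"x" * 4096)
        print(transfer_budget([path], disk_kib=1_000_000, bytes_per_second=10_000_000))
```

A transfer time that approaches the typical interval between preemptions is a sign that the job may never finish staging its files.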
Some applications process data stored in a large input data file. The size of this file may be so large that it cannot fit within the quota of a home directory. This file might reside on Fortress or some other external storage medium. The way to process this file on BoilerGrid is to copy it to your scratch directory where a job running on a compute node of BoilerGrid may access it.
The job may run in either the Standard or Vanilla Universe. If the universe is Standard, then the job will use the remote file I/O of the Standard Universe. If the universe is Vanilla, then the job will use either the shared file system or Condor's file transfer mechanism, depending on the compute node which Condor chose to run the job.
This section illustrates how to submit a small job which reads a data file which resides on the scratch file system. The example uses the Vanilla Universe with Condor's file transfer mechanism turned on "if needed". Condor transfers files only when it matches the job with a compute node which uses a different FileSystemDomain from the one which the submission host uses. If Condor matches the job with a compute node which uses the same FileSystemDomain which the submission host uses, Condor does not transfer files and relies on the shared file system instead. This example, myprogram.c, displays the name of the compute node which runs the job, the path name of the current working directory, the contents of that directory, and copies the contents of an input scratch file to an output scratch file. The Vanilla Universe allows using Linux commands to obtain system information. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
Prepare a scratch file directory with a large input data file:
$ ls -l $RCAC_SCRATCH
total 32
-rw-r----- 1 myusername itap 27 Jun  8 10:41 biginputdatafile
Prepare a job submission file, here named myjob.sub, which specifies the Vanilla Universe, names the compiled C program as the executable, tells Condor to transfer the executable to the chosen compute node, turns Condor's file transfer mechanism on if needed, and lists the input file(s):
# FILENAME: myjob.sub
universe = VANILLA
# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable = myprogram
# Condor's file transfer mechanism is turned on only when needed.
should_transfer_files = IF_NEEDED
# Let Condor handle output file(s).
when_to_transfer_output = ON_EXIT
# List input data file(s) to be read or transferred from the
# initial directory, if needed.
transfer_input_files = biginputdatafile
# Standard I/O files, Condor log file
# Find these files in the initial directory.
output = mydata.out
error = mydata.err
log = mydata.log
# queue one job
queue
Submit this job to Condor while specifying that all files, except the executable, are located relative to the specified initial directory, namely your scratch directory:
$ condor_submit -append initialdir=$RCAC_SCRATCH myjob.sub
Submitting job(s).
1 job(s) submitted to cluster 1421563.
View job status:
$ condor_q myusername
-- Submitter: condor-fe00.rcac.purdue.edu : <128.211.157.87:40924> : condor-fe00.rcac.purdue.edu
ID         OWNER        SUBMITTED     RUN_TIME   ST PRI SIZE CMD
1421563.0  myusername   6/9 11:49     0+00:00:00 I  0   0.0  myprogram
1 jobs; 1 idle, 0 running, 0 held
View four new files in the scratch file directory, including bigoutputdatafile:
$ ls -l $RCAC_SCRATCH
total 128
-rw-r----- 1 myusername itap  27 Jun  8 10:41 biginputdatafile
-rw-r--r-- 1 myusername itap  41 Jun  9 11:49 bigoutputdatafile
-rw-r--r-- 1 myusername itap   0 Jun  9 11:49 mydata.err
-rw-r--r-- 1 myusername itap 648 Jun  9 11:49 mydata.log
-rw-r--r-- 1 myusername itap 632 Jun  9 11:49 mydata.out
View results in the file for all standard output, here named mydata.out:
steele-d037.rcac.purdue.edu /usr/rmt_share/scratch95/m/myusername total 96 -rw-r----- 1 myusername itap 27 Jun 8 10:41 biginputdatafile -rw-r--r-- 1 myusername itap 0 Jun 9 11:49 mydata.err -rw-r--r-- 1 myusername itap 204 Jun 9 11:49 mydata.log -rw-r--r-- 1 myusername itap 59 Jun 9 11:49 mydata.out total 128 -rw-r----- 1 myusername itap 27 Jun 8 10:41 biginputdatafile -rw-r--r-- 1 myusername itap 41 Jun 9 11:49 bigoutputdatafile -rw-r--r-- 1 myusername itap 0 Jun 9 11:49 mydata.err -rw-r--r-- 1 myusername itap 204 Jun 9 11:49 mydata.log -rw-r--r-- 1 myusername itap 274 Jun 9 11:49 mydata.out *** MAIN START *** scratch file system: textfromscratchfile *** MAIN STOP ***
The output shows the name of the compute node which Condor chose to run the job, the path of the current working directory (the user's scratch file directory), before-and-after listings of the content of the current working directory, and output from the application. This job ran on a processor core which uses the same shared file system as the submission host. The fact that the current working directory is the user's scratch directory on the submission host shows that this job used the shared file system for file I/O. The output scratch file named bigoutputdatafile, the primary output of this program, appears in the second listing of the current working directory.
The second line of the output file shows the path of the scratch directory. In this case, the submission host was one of the front-ends of condor.rcac.purdue.edu and the compute node was one which uses the same file system domain as the submission host. This same path will appear in the output when the submission host is either Radon or Steele and the compute node is one of the nodes of Radon or Steele. If the submission host is either Rossmann or Coates and the compute node is one of the nodes of Rossmann or Coates, then the path will be /scratch/lustreA/m/myusername. If the submission host is Hansen and the compute node is one of the nodes of Hansen, then the path will be /scratch/lustreC/m/myusername.
View the log file, mydata.log:
000 (1421563.000.000) 06/09 11:49:23 Job submitted from host: <128.211.157.87:40924>
...
001 (1421563.000.000) 06/09 11:49:47 Job executing on host: <172.18.30.47:37393?PrivNet=condor.ccb.purdue.edu>
...
005 (1421563.000.000) 06/09 11:49:47 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
The log file records the main events related to the processing of this job. In this example, the log records that the number of bytes read and written between the submission host and the compute node is zero, another indication that this job used the shared file system for file I/O.
To see Condor's file transfer mechanism at work, repeat the example above but force the job to a compute node which does not share a file system with the submission host.
Resubmit this job to Condor while specifying that a compute node on the Miner cluster is to run the job. Either command works:
$ condor_submit -append initialdir=$RCAC_SCRATCH -append requirements=ClusterName==\"Miner\" myjob.sub
$ condor_submit -append initialdir=$RCAC_SCRATCH -append 'requirements=ClusterName=="Miner"' myjob.sub
View results in the file for all standard output, here named mydata.out:
miner-a141.rcac.purdue.edu
/var/condor/execute/dir_24010
total 20
-rw-r----- 1 myusername itap 27 Jun 9 13:39 biginputdatafile
-rwxr-xr-x 1 myusername itap 8574 Jun 9 13:39 condor_exec.exe
-rw-r--r-- 1 myusername itap 0 Jun 9 13:42 mydata.err
-rw-r--r-- 1 myusername itap 57 Jun 9 13:42 mydata.out
total 24
-rw-r----- 1 myusername itap 27 Jun 9 13:39 biginputdatafile
-rw-r--r-- 1 myusername itap 42 Jun 9 13:42 bigoutputdatafile
-rwxr-xr-x 1 myusername itap 8574 Jun 9 13:39 condor_exec.exe
-rw-r--r-- 1 myusername itap 0 Jun 9 13:42 mydata.err
-rw-r--r-- 1 myusername itap 281 Jun 9 13:42 mydata.out
*** MAIN START ***
scratch file system: textfromscratchfile
*** MAIN STOP ***
The output shows the name of the compute node which Condor chose to run the job, the path of the current working directory (a temporary directory on the compute node, rather than the user's scratch file directory), before-and-after listings of the content of the current working directory, and output from the application. This job ran on a processor core whose shared file system differs from the one used by the submission host. The fact that the current working directory is a temporary directory on the compute node and that the file named "biginputdatafile" appears in this temporary directory shows that this job used Condor's file transfer mechanism for file I/O. The output scratch file named bigoutputdatafile, the primary output of this program, appears in the second listing of the current working directory. Condor transferred all output files (mydata.out, mydata.log, mydata.err, and bigoutputdatafile) to the scratch directory.
View the log file, mydata.log:
000 (1421565.000.000) 06/09 13:42:30 Job submitted from host: <128.211.157.87:40924>
...
001 (1421565.000.000) 06/09 13:42:56 Job executing on host: <172.18.32.151:43952?PrivNet=condor.ccb.purdue.edu>
...
005 (1421565.000.000) 06/09 13:42:56 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
690 - Run Bytes Sent By Job
8601 - Run Bytes Received By Job
690 - Total Bytes Sent By Job
8601 - Total Bytes Received By Job
The log file records the main events related to the processing of this job. In this example, the log records the number of bytes transferred between the submission host and the compute node via the file transfer mechanism, another indication that Condor's file transfer mechanism was used.
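The byte counters in the log can also be checked from the command line. The following is a minimal sketch, assuming the log follows the layout shown above; the function names total_bytes_sent and used_file_transfer are hypothetical helpers and are not part of Condor:

```shell
#!/bin/sh
# Hypothetical helper: print the "Total Bytes Sent By Job" value
# found in the last termination record of a Condor job log.
total_bytes_sent() {
    # $1 is the log file name; the counter is the first field of the line.
    # If no such line exists, n is unset and "n + 0" prints 0.
    awk '/Total Bytes Sent By Job/ { n = $1 } END { print n + 0 }' "$1"
}

# Hypothetical helper: succeed when the counter is nonzero, which
# suggests that the file transfer mechanism moved data for this job.
used_file_transfer() {
    [ "$(total_bytes_sent "$1")" -gt 0 ]
}
```

For example, `used_file_transfer mydata.log && echo "file transfer was used"` prints a message only when the log reports a nonzero byte count.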
While this method can maximize throughput, the size of any scratch file which you intend to transfer must be reasonable. The size of the scratch file must fit on the available disk space of the compute node. The amount of time needed to transfer the scratch file cannot be so great that higher priority jobs constantly preempt the Condor job during file transfer. If the scratch file cannot fit on the available disk space of the chosen compute node or the file transfer time is so great that preemption prevents completion, consider limiting the pool of candidate compute nodes to those which share a file system with the submission host (should_transfer_files = NO) or consider using the Standard Universe with its remote file I/O.
Some applications write a large amount of intermediate data to a temporary file during an early part of the process then read that data for further processing during a later part of the process. The size of this file may be so large that it cannot fit within the quota of a home directory or that it requires too much I/O activity between the compute node and either the home directory or the scratch file directory. The way to process this intermediate file on BoilerGrid is to use the /tmp directory of the compute node which runs the job. Used properly, /tmp may provide faster local storage to an active process than any other storage option.
The job may run in the Vanilla Universe. When preemption occurs, a Vanilla job restarts at the beginning, and it rebuilds the intermediate data file from the beginning. Condor's Standard Universe is not applicable since checkpointing does not include any file in /tmp.
This section illustrates how to submit a small job which first writes then reads an intermediate data file which resides on the /tmp directory. This example, myprogram.c, displays the contents of the /tmp directory before and after processing. Linux commands access system information. To compile this program, see Compiling Serial Programs.
Prepare a job submission file, here named myjob.sub, which specifies the Vanilla Universe, names the compiled C program as the executable, has Condor transfer that executable to the chosen compute node, and turns on Condor's file transfer mechanism only if needed:
# FILENAME: myjob.sub
universe = VANILLA
# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable = myprogram
# Condor's file transfer mechanism is turned on only when needed.
should_transfer_files = IF_NEEDED
# Let Condor handle output file(s).
when_to_transfer_output = ON_EXIT
# Standard I/O files, Condor log file
output = mydata.out
error = mydata.err
log = mydata.log
# queue one job
queue
Submit this job to Condor:
$ condor_submit myjob.sub
Submitting job(s).
1 job(s) submitted to cluster 346033.
View job status:
$ condor_q myusername
-- Submitter: condor-fe00.rcac.purdue.edu : <128.211.157.87:40924> : condor-fe00.rcac.purdue.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
346033.0 myusername 6/16 15:05 0+00:00:00 I 0 0.0 myprogram
1 jobs; 1 idle, 0 running, 0 held
View results in the file for all standard output, here named mydata.out:
-rw-r--r-- 1 kes itap 12 Jun 16 15:12 /tmp/mytmpfile
*** MAIN START ***
/tmp file data: abcdefghijk
*** MAIN STOP ***
The output verifies the existence of the intermediate data file in the /tmp directory.
View results in the file for all standard error, here named mydata.err:
ls: /tmp/mytmpfile: No such file or directory
The results in the error file verify that the intermediate data file does not exist at the start of processing.
View the log file, mydata.log:
000 (346033.000.000) 06/16 15:05:25 Job submitted from host: <128.211.158.38:40666>
...
001 (346033.000.000) 06/16 15:12:00 Job executing on host: <172.18.22.85:54211?PrivNet=condor.ccb.purdue.edu>
...
005 (346033.000.000) 06/16 15:12:01 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
The log file records the main events related to the processing of this job. In this example, the log records that the number of bytes read and written between the submission host and the compute node is zero, an indication that this job used the shared file system for file I/O.
While the /tmp directory can provide faster local storage to an active process than other storage options, you never know how much storage is available in the /tmp directory of the compute node chosen to run your job. If an intermediate data file consistently fails to fit in the /tmp directories of a set of compute nodes, consider limiting the pool of candidate compute nodes to those which can handle your intermediate data file.
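A job can check the free space in /tmp before it writes its intermediate file. The following is a minimal sketch, assuming a POSIX shell wrapper around the application; the function names free_kb and run_with_tmpfile are hypothetical, and the echo/cat pair only stands in for the real write and read phases:

```shell
#!/bin/sh
# Print the free space, in kilobytes, of the file system holding directory $1.
free_kb() {
    # df -P prints one POSIX-format line per file system;
    # the fourth field of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Hypothetical wrapper: proceed only if /tmp has at least $1 KB free.
run_with_tmpfile() {
    need_kb=$1
    if [ "$(free_kb /tmp)" -lt "$need_kb" ]; then
        echo "not enough space in /tmp" >&2
        return 1
    fi
    # mktemp creates a uniquely named file, avoiding collisions with other jobs.
    tmpfile=$(mktemp /tmp/mytmpfile.XXXXXX) || return 1
    echo "abcdefghijk" > "$tmpfile"    # write the intermediate data
    cat "$tmpfile"                     # read it back for further processing
    rm -f "$tmpfile"                   # clean up so later jobs find space
}
```

Calling `run_with_tmpfile 1024`, for instance, would refuse to run on a node whose /tmp has less than 1 MB free. Removing the file at the end matters because other jobs share the same /tmp directory.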
A parameter sweep is a computation applied to several unique sets of data. Computing the average test score for each class in a school is an example of a simple parameter sweep.
A parameter sweep requires that each queued run of a computer program reads a unique set of input parameters. This example illustrates how to implement a unique command line argument, a unique standard input, and a unique formatted input data file. The program writes to a unique standard output and to a unique formatted output file.
Condor provides a pre-defined macro $(Process) which supports the specification of unique command-line arguments and input/output files so that each queued run of a job sees a unique set of data. This macro appears in the job submission file, myprogram.sub:
# FILENAME: myprogram.sub
universe = VANILLA
executable = myprogram
# Processes 0,1,2
# command line argument
arguments = $(Process)
# Standard I/O files, Condor log file
input = mydata.in.$(Process)
output = mydata.out.$(Process)
error = mydata.err.$(Process)
log = mydata.log
# queue 3 jobs in 1 cluster
queue 3
This job submission file specifies a parameter sweep of three queued runs. Each process has a unique ID: 0, 1, and 2. Process 0 of the parameter sweep expects a standard input file named "mydata.in.0"; process 1, "mydata.in.1"; and process 2, "mydata.in.2". The sweep will generate similarly named files for standard output and error. Condor advises using a single log file in a submission. In addition, the sweep expects to find formatted input data files with the same process number used as a suffix: i_mydata.0, i_mydata.1, i_mydata.2. Each copy of the computer program used in this sweep finds its unique process number in its command-line argument and appends that process number to the generic names "i_mydata." and "o_mydata." to make unique formatted data file names. All files reside in the user's current working directory; hence, data file names must be unique. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
This job submission file uses Condor's default universe, Vanilla. Because this Vanilla job does not turn on Condor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job.
To submit the executable to Condor:
$ condor_submit myprogram.sub
For this submission, command condor_q will show a single cluster number, three unique process numbers, and three unique command-line arguments:
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
746419.0 myusername 10/28 10:57 0+00:00:00 I 0 0.0 myprogram 0
746419.1 myusername 10/28 10:57 0+00:00:00 I 0 0.0 myprogram 1
746419.2 myusername 10/28 10:57 0+00:00:00 I 0 0.0 myprogram 2
View the standard input file for process 0, mydata.in.0:
textfromstandardinput:process0
View the formatted input file for process 0, i_mydata.0:
textfromformattedinput:process0
View the standard output file for process 0, mydata.out.0:
*** MAIN START ***
program name: condor_exec.exe
command line argument: 0
standard input/output: textfromstandardinput:process0
formatted input/output: textfromformattedinput:process0
*** MAIN STOP ***
View the formatted output file for process 0, o_mydata.0:
textfromformattedinput:process0
Processes 1 and 2 have similar input and output files.
The single log file records the major events of the submission of the three queued runs of this parameter sweep:
000 (746419.000.000) 10/28 10:57:43 Job submitted from host: <128.211.157.86:60481>
...
000 (746419.001.000) 10/28 10:57:43 Job submitted from host: <128.211.157.86:60481>
...
000 (746419.002.000) 10/28 10:57:43 Job submitted from host: <128.211.157.86:60481>
...
001 (746419.001.000) 10/28 11:02:14 Job executing on host: <128.211.157.10:44836>
...
005 (746419.001.000) 10/28 11:02:14 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
950 - Run Bytes Sent By Job
9800 - Run Bytes Received By Job
950 - Total Bytes Sent By Job
9800 - Total Bytes Received By Job
...
001 (746419.000.000) 10/28 11:02:15 Job executing on host: <128.211.157.10:44836>
...
005 (746419.000.000) 10/28 11:02:15 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
950 - Run Bytes Sent By Job
9800 - Run Bytes Received By Job
950 - Total Bytes Sent By Job
9800 - Total Bytes Received By Job
...
001 (746419.002.000) 10/28 11:02:17 Job executing on host: <128.211.157.10:44836>
...
005 (746419.002.000) 10/28 11:02:17 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
950 - Run Bytes Sent By Job
9800 - Run Bytes Received By Job
950 - Total Bytes Sent By Job
9800 - Total Bytes Received By Job
Condor's parameter sweep offers huge potential. Simply adding a large number to the queue command in a job submission file yields an enormous amount of computing. The catch is that the effort needed to prepare the unique input files, either standard input files or formatted input files, can be great. This effort can be minimal when the input data comes from some data collector operating in the field. This effort can be enormous when you must enter each unique dataset from the keyboard.
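When the input data can be generated, a short script may prepare the per-process files. The following is a minimal sketch, assuming the file names used in this example; the function name make_sweep_inputs is hypothetical, and the one-line contents only stand in for real data:

```shell
#!/bin/sh
# Hypothetical helper: create the per-process input files for $1 queued runs.
make_sweep_inputs() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        # Standard input file for process i
        echo "textfromstandardinput:process$i" > "mydata.in.$i"
        # Formatted input file for process i
        echo "textfromformattedinput:process$i" > "i_mydata.$i"
        i=$((i + 1))
    done
}
```

Running `make_sweep_inputs 3` in the submission directory before condor_submit would create the six input files expected by a "queue 3" submission.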
A parameter sweep is a computation applied to several unique sets of data. Computing the average test score for each class in a school is an example of a simple parameter sweep.
A parameter sweep requires that each queued run of a computer program reads a unique set of input parameters. This example illustrates how to implement a unique command line argument, a unique standard input, and a unique formatted input data file. The program writes to a unique standard output and to a unique formatted output file.
Condor provides a pre-defined macro $(Process) which supports the specification of unique command-line arguments so that each queued run of a job sees a unique set of data.
Also, Condor provides an "initial directory" which supports the specification of unique input/output files so that each queued run of a job sees a unique set of data. Command initialdir specifies a generic directory name which becomes unique after appending the process number of a queued run of a parameter sweep. Each initial directory is actually a subdirectory of the user's current working directory. Each initial directory holds the unique standard input and formatted input files of a queued run of a parameter sweep; each initial directory receives the unique standard output, error and log files plus any unique formatted output files generated by a queued run of a parameter sweep. Since data files of each run of a sweep reside in a separate directory, identical file names may be used; they need not be modified with a process number. Both macro and command appear in the job submission file, myprogram.sub:
# FILENAME: myprogram.sub
universe = VANILLA
executable = myprogram
# Processes 0,1,2
# command line argument
arguments = $(Process)
initialdir = mydatadirectory.$(Process)
# Standard I/O files, Condor log file
input = mydata.in
output = mydata.out
error = mydata.err
log = mydata.log
# queue 3 jobs in 1 cluster
queue 3
This job submission file specifies a parameter sweep of three queued runs. Each process has a unique ID: 0, 1, and 2. Process 0 of the parameter sweep expects a standard input file named "mydata.in" to reside in the initial directory named "mydatadirectory.0"; process 1 expects "mydata.in" in "mydatadirectory.1"; and process 2 expects "mydata.in" in "mydatadirectory.2". The sweep will generate identically named files for standard output, error, and log in the initial directories. In addition, the sweep expects to find in each initial directory a formatted input data file with the same name: myinputdata. Each copy of the computer program used in this sweep finds its unique process number in its command-line argument and finds its unique formatted input data file in its own initial directory. The program does not append its process number to the names of formatted data files. All files reside in unique subdirectories of the user's current working directory; hence, data file names may be identical. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
This job submission file uses Condor's default universe, Vanilla. Because this Vanilla job does not turn on Condor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job.
To submit the executable to Condor:
$ condor_submit myprogram.sub
For this submission, command condor_q will show a single cluster number, three unique process numbers, and three unique command-line arguments:
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
746420.0 myusername 10/28 12:28 0+00:00:00 I 0 0.0 myprogram 0
746420.1 myusername 10/28 12:28 0+00:00:00 I 0 0.0 myprogram 1
746420.2 myusername 10/28 12:28 0+00:00:00 I 0 0.0 myprogram 2
View the standard input file for process 0, mydata.in, in the initial directory mydatadirectory.0:
textfromstandardinput:process0
View the formatted input file for process 0, myinputdata, in the initial directory mydatadirectory.0:
textfromformattedinput:process0
View the standard output file for process 0, mydata.out, in the initial directory mydatadirectory.0:
*** MAIN START ***
program name: condor_exec.exe
command line argument: 0
standard input/output: textfromstandardinput:process0
formatted input/output: textfromformattedinput:process0
*** MAIN STOP ***
View the formatted output file for process 0, myoutputdata, in the initial directory mydatadirectory.0:
textfromformattedinput:process0
Each log file, mydata.log, records the major events of one queued run of this parameter sweep. View the log file for process 0, mydata.log, in the initial directory mydatadirectory.0:
000 (746420.000.000) 10/28 12:28:35 Job submitted from host: <128.211.157.86:60481>
...
001 (746420.000.000) 10/28 12:33:48 Job executing on host: <128.211.157.10:34460>
...
005 (746420.000.000) 10/28 12:33:49 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
909 - Run Bytes Sent By Job
9800 - Run Bytes Received By Job
909 - Total Bytes Sent By Job
9800 - Total Bytes Received By Job
Processes 1 and 2 have similar input, output and log files and formatted input/output files residing in their respective initial directories.
Condor's parameter sweep offers huge potential. Simply adding a large number to the queue command in a job submission file yields an enormous amount of computing. The catch is that the effort needed to prepare the unique input files, either standard input files or formatted input files, can be great. This effort is minimal when the input data comes from some data collector operating in the field. This effort can be overwhelming when you must enter each unique dataset from the keyboard.
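The initial directories must exist, with their input files in place, before submission. The following is a minimal sketch, assuming the directory and file names used in this example; the function name make_sweep_dirs is hypothetical, and the one-line contents only stand in for real data:

```shell
#!/bin/sh
# Hypothetical helper: create one initial directory per queued run,
# each holding identically named input files.
make_sweep_dirs() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        dir="mydatadirectory.$i"
        mkdir -p "$dir"
        # Standard input file for process i
        echo "textfromstandardinput:process$i" > "$dir/mydata.in"
        # Formatted input file for process i
        echo "textfromformattedinput:process$i" > "$dir/myinputdata"
        i=$((i + 1))
    done
}
```

Running `make_sweep_dirs 3` in the submission directory would create mydatadirectory.0 through mydatadirectory.2, matching "queue 3" in the submission file.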
A parameter sweep is a computation applied to several unique sets of data. Computing the average test score for each class in a school is an example of a simple parameter sweep.
A parameter sweep requires that each queued run of a computer program reads a unique set of input parameters. This example illustrates how to implement a parameter sweep on a single large file. Each queued run of the job reads a different portion of the same file.
Condor provides a pre-defined macro $(Process) which supports the specification of unique command-line arguments and input/output files so that each queued run of a job sees a unique set of data. This macro appears in the job submission file, myprogram.sub:
# FILENAME: myprogram.sub
universe = VANILLA
executable = myprogram
# Processes 0,1,2
arguments = $(Process)
# There is a single formatted input data file, myinputdata.
# Standard I/O files, Condor log file
output = mydata.out.$(Process)
error = mydata.err.$(Process)
log = mydata.log
# queue 3 jobs in 1 cluster
queue 3
This job submission file specifies a parameter sweep of three queued runs. Each process has a unique ID: 0, 1, and 2. Each queued run of this job will read a different portion of the data file. Process 0 of the parameter sweep writes a standard output file named "mydata.out.0"; process 1, "mydata.out.1"; and process 2, "mydata.out.2". The sweep will generate similarly named files for standard error. Condor advises using a single log file in a submission to record the major events of the sweep. In addition, the sweep expects to find a single formatted input data file, myinputdata. Each copy of the computer program used in this sweep finds its unique process number in its command-line argument and uses that number to determine where in the single input data file it is to start reading records. All files reside in the user's current working directory; hence, data file names must be unique. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
This job submission file uses Condor's default universe, Vanilla. Because this Vanilla job does not turn on Condor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job.
To submit the executable to Condor:
$ condor_submit myprogram.sub
For this submission, command condor_q will show a single cluster number, three unique process numbers, and three unique command-line arguments:
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
746421.0 myusername 10/29 10:57 0+00:00:00 I 0 0.0 myprogram 0
746421.1 myusername 10/29 10:57 0+00:00:00 I 0 0.0 myprogram 1
746421.2 myusername 10/29 10:57 0+00:00:00 I 0 0.0 myprogram 2
View the single formatted input file, myinputdata:
AAAAAAAAAA
BBBBBBBBBB
CCCCCCCCCC
:
ZZZZZZZZZZ
0000000000
1111111111
2222222222
3333333333
View the standard output file for process 0, mydata.out.0:
*** MAIN START ***
program name: condor_exec.exe
command line argument: 0
current file position: 0
rtn_val = 0
starting file position: 0
line 1: AAAAAAAAAA
line 2: BBBBBBBBBB
line 3: CCCCCCCCCC
line 4: DDDDDDDDDD
line 5: EEEEEEEEEE
line 6: FFFFFFFFFF
line 7: GGGGGGGGGG
line 8: HHHHHHHHHH
line 9: IIIIIIIIII
line 10: JJJJJJJJJJ
*** MAIN STOP ***
View the standard output file for process 1, mydata.out.1:
*** MAIN START ***
program name: condor_exec.exe
command line argument: 1
current file position: 0
rtn_val = 0
starting file position: 110
line 11: KKKKKKKKKK
line 12: LLLLLLLLLL
line 13: MMMMMMMMMM
line 14: NNNNNNNNNN
line 15: OOOOOOOOOO
line 16: PPPPPPPPPP
line 17: QQQQQQQQQQ
line 18: RRRRRRRRRR
line 19: SSSSSSSSSS
line 20: TTTTTTTTTT
rtn_val = 0
starting file position: 0
line 0: AAAAAAAAAA
rtn_val = 0
starting file position: 220
line 21: UUUUUUUUUU
*** MAIN STOP ***
After reading its own portion, process 1 also performs two additional random file accesses, at file positions 0 and 220.
View the standard output file for process 2, mydata.out.2:
*** MAIN START ***
program name: condor_exec.exe
command line argument: 2
current file position: 0
rtn_val = 0
starting file position: 220
line 21: UUUUUUUUUU
line 22: VVVVVVVVVV
line 23: WWWWWWWWWW
line 24: XXXXXXXXXX
line 25: YYYYYYYYYY
line 26: ZZZZZZZZZZ
line 27: 0000000000
line 28: 1111111111
line 29: 2222222222
line 30: 3333333333
*** MAIN STOP ***
Condor's parameter sweep, when applied to a single, large data file, offers a huge potential. Simply adding a large number to the queue command in a job submission file applies several compute servers to the data processing.
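The starting file positions in the output above (0, 110, 220) are consistent with fixed-length records of 11 bytes (10 characters plus a newline) and 10 records per queued run. The following is a minimal shell sketch of that arithmetic, under those assumptions; the names start_offset and read_portion are hypothetical and this is not the code of myprogram.c:

```shell
#!/bin/sh
RECORD_BYTES=11      # assumed: 10 characters plus a newline per record
RECORDS_PER_RUN=10   # assumed: each queued run reads 10 records

# Print the starting byte offset for process number $1.
start_offset() {
    echo $(( $1 * RECORDS_PER_RUN * RECORD_BYTES ))
}

# Print the portion of file $2 that belongs to process number $1.
read_portion() {
    offset=$(start_offset "$1")
    # tail -c +K starts output at byte K (1-based), so add 1 to the offset.
    tail -c "+$((offset + 1))" "$2" | head -n "$RECORDS_PER_RUN"
}
```

For example, `read_portion 1 myinputdata` would print the ten records beginning with KKKKKKKKKK. Fixed-length records are what make this direct seek possible; with variable-length records, each run would have to scan the file from the beginning.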
To review, Condor is unable to transfer a subdirectory of data files to a compute server. While the submit command transfer_input_files allows paths when specifying which input files to transfer, Condor places all transferred files in a single, flat directory where the executable and standard input file reside - the temporary working directory on the compute server. Therefore, the executing program must access input files without paths.
A similar situation exists for output files. If the program creates output files during execution, it must create them within the temporary working directory. Condor transfers back all new and modified files within the temporary working directory - the output files. To transfer back only a subset of these files, use the submit command transfer_output_files. Condor does not support the transfer of output files that exist but that do not reside within the temporary working directory on the compute server.
This restriction need not deter the user with a subdirectory of input and output files. The user simply makes an archive file of the subdirectory structure with the tar utility and tells Condor to transfer the tar file. The application may then un-tar the archive before reading the input files. The application may also write to output files which reside within the subdirectory. The final step of the application archives those files which the job made or modified. Condor will see the archive file as an output file and transfer the archive from the compute server to the user's working directory on the submission host. Finally, the user extracts the output files from the archive.
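The unpack, run, and repack steps can also be performed by a small wrapper script instead of by the application itself. The following is a minimal sketch, assuming the archive and subdirectory names used in this section; the function name run_with_archive is hypothetical, and for simplicity it archives the whole subdirectory rather than only the new or modified files:

```shell
#!/bin/sh
# Hypothetical wrapper around an application that works inside mysubdirectory.
# The arguments are the command to run after the input archive is unpacked.
run_with_archive() {
    # Unpack the subdirectory structure that Condor transferred as one file.
    tar xf myarchive.i.tar || return 1
    # Run the application; it reads and writes files inside mysubdirectory.
    "$@" || return 1
    # Archive the subdirectory so that Condor sees one new output file
    # in the temporary working directory.
    tar cf myarchive.o.tar mysubdirectory
}
```

With this approach the wrapper script, not the compiled program, would be named as the executable in the job submission file, and the compiled program would have to be transferred as an additional input file.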
The computer program, myprogram.c, reads a formatted data file and writes a formatted data file. This example assumes that there exists a formatted input file, myinputdata, in a subdirectory named mysubdirectory. The result is a formatted output file, myoutputdata, in the same subdirectory. The program uses the tar utility to extract the subdirectory structure on the compute server. After the program writes the output file, it uses the tar utility again to archive the output files of the subdirectory only. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
This example assumes that the current working directory has a subdirectory containing a formatted input file. The tar utility prepares the archive of input files:
tar cf myarchive.i.tar mysubdirectory
Prepare a job submission file, myprogram.sub. Specify the Vanilla Universe and the file transfer mechanism as "on":
# FILENAME: myprogram.sub
universe = VANILLA
executable = myprogram
# Specify the archive as the input data file.
transfer_input_files = myarchive.i.tar
# Turn on file transfer mechanism.
should_transfer_files = YES
# Let Condor handle output file(s): myarchive.o.tar.
when_to_transfer_output = ON_EXIT
# Standard output files, Condor log file
output = mydata.out
error = mydata.err
log = mydata.log
# queue one job
queue
To submit the executable to Condor:
$ condor_submit myprogram.sub
The standard output file, mydata.out, shows the evolution of the current working directory on the compute server. Initially, it shows that Condor transferred the tar file which contains the archived subdirectory of input data file(s). After extraction, the subdirectory, mysubdirectory, and its formatted input file, myinputdata, are visible. After processing, the formatted output file, myoutputdata, is visible:
total 24
-rwxr-xr-x 1 myusername itap 8708 Nov 12 15:27 condor_exec.exe
-rw-r--r-- 1 myusername itap 10240 Nov 12 15:27 myarchive.i.tar
-rw-r--r-- 1 myusername itap 0 Nov 12 15:30 mydata.err
-rw-r--r-- 1 myusername itap 0 Nov 12 15:30 mydata.out
total 32
-rwxr-xr-x 1 myusername itap 8708 Nov 12 15:27 condor_exec.exe
-rw-r--r-- 1 myusername itap 10240 Nov 12 15:27 myarchive.i.tar
-rw-r--r-- 1 myusername itap 0 Nov 12 15:30 mydata.err
-rw-r--r-- 1 myusername itap 227 Nov 12 15:30 mydata.out
drwxr-x--- 3 myusername itap 4096 Feb 14 2008 mysubdirectory
total 8
drwx------ 2 myusername itap 4096 Feb 14 2008 ..
-rw-r--r-- 1 myusername itap 19 Jul 12 2007 myinputdata
total 12
drwx------ 2 myusername itap 4096 Feb 14 2008 ..
-rw-r--r-- 1 myusername itap 19 Jul 12 2007 myinputdata
-rw-r--r-- 1 myusername itap 28 Nov 12 15:30 myoutputdata
*** MAIN START ***
formatted input/output: textinsubdirectory
*** MAIN STOP ***
At job completion, Condor sees file myarchive.o.tar as an output file which it will transfer to the submission host. After the transfer, the user then extracts the output file(s) from this archive:
tar xf myarchive.o.tar mysubdirectory/myoutputdata
View the log file, mydata.log:
000 (342352.000.000) 11/12 15:29:31 Job submitted from host: <128.211.157.86:47933>
...
001 (342352.000.000) 11/12 15:30:55 Job executing on host: <128.211.157.10:59987?PrivNet=condor.ccb.purdue.edu>
...
005 (342352.000.000) 11/12 15:30:56 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
11094 - Run Bytes Sent By Job
18948 - Run Bytes Received By Job
11094 - Total Bytes Sent By Job
18948 - Total Bytes Received By Job
The log file records the main events related to the processing of this job. The log shows the number of bytes transferred between the submission host and the compute server via Condor's file transfer mechanism.
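The byte counts can also be pulled out of the log with standard text tools. The following is a minimal sketch, assuming the log layout shown above; the sample lines are copied from this example's log into a temporary file, and the variable names are illustrative only.

```shell
#!/bin/sh
# Sketch: report the total bytes moved by Condor's file transfer mechanism.
# The sample log lines are copied from the mydata.log excerpt above;
# with a real job, point "log" at the actual mydata.log instead.
log=$(mktemp)
cat > "$log" <<'EOF'
005 (342352.000.000) 11/12 15:30:56 Job terminated.
(1) Normal termination (return value 0)
11094 - Run Bytes Sent By Job
18948 - Run Bytes Received By Job
11094 - Total Bytes Sent By Job
18948 - Total Bytes Received By Job
EOF

# The first field of each "Total Bytes" line is the byte count.
sent=$(awk '/Total Bytes Sent By Job/ {print $1}' "$log")
received=$(awk '/Total Bytes Received By Job/ {print $1}' "$log")

echo "bytes sent by job:     $sent"
echo "bytes received by job: $received"
rm -f "$log"
```

Because awk splits on any run of whitespace, the same commands should also work if the real log indents these lines.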
Some applications require compute nodes with a certain minimum amount of memory. These applications may also perform better when even more memory is available on the compute node.
This section illustrates how to submit a small job to a BoilerGrid compute node with at least 16 GB of memory (requirements) and to prefer compute nodes with even more memory (rank), if available. This example, myprogram.c, displays the name of the host which runs the job, the path name of the current working directory, and the contents of that directory.
Prepare a job submission file with an appropriate filename, here named myjob.sub:
# FILENAME: myjob.sub

universe = VANILLA

# Require a compute node with at least 16 GB of memory.
# A 16 GB compute node reports TotalMemory as 16046 MB.
requirements = TotalMemory >= 16046

# Prefer a compute node with more than 16 GB, if available.
rank = TotalMemory

# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable = myprogram

# Turn on Condor's file transfer mechanism only when needed.
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT

# Standard I/O files, Condor log file
output = myjob.out
error = myjob.err
log = myjob.log

# queue one job
queue
The ClassAd attribute TotalMemory gives the amount of memory on a compute node, in megabytes. To change this example to request at least 32 GB of total memory, replace "16046" with "32192". For at least 48 GB, use "48297".
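As a small illustration of the substitution described above, the sketch below maps a memory size in GB to the matching threshold and prints the two submit-file lines. The function name memory_threshold_mb is hypothetical, and only the three sizes quoted in the text are supported; other sizes are rejected rather than guessed.

```shell
#!/bin/sh
# Sketch: map a memory size in GB to the TotalMemory threshold (in MB)
# quoted in the text above. Only the three documented sizes are known.
memory_threshold_mb() {
    case "$1" in
        16) echo 16046 ;;
        32) echo 32192 ;;
        48) echo 48297 ;;
        *)  echo "no documented threshold for $1 GB" >&2; return 1 ;;
    esac
}

gb=32
threshold=$(memory_threshold_mb "$gb") || exit 1

# These two lines can replace the requirements and rank lines of myjob.sub.
requirements_line="requirements = TotalMemory >= $threshold"
rank_line="rank = TotalMemory"
echo "$requirements_line"
echo "$rank_line"
```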
This example assumes that all compute nodes have a definition for the attribute TotalMemory. To see how many compute nodes in BoilerGrid do not have the attribute TotalMemory defined:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint 'TotalMemory =?= undefined'
There is no output since all compute nodes of BoilerGrid do have this attribute defined.
Before submitting your job, you may wish to verify that enough compute nodes satisfy your requirements and that those same compute nodes define the attributes used in the rank expression. To see how many compute nodes satisfy your requirements:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint 'TotalMemory >= 16046'
               Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
X86_64/LINUX   26093  18007     3330       4753        3           0         0
Total          26093  18007     3330       4753        3           0         0
There are 26,093 compute nodes with at least 16 GB of memory.
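The same summary can be checked by a script before submitting. This is only a sketch: the summary text is the sample output shown above rather than a live query, and the value of needed is a hypothetical figure chosen for illustration.

```shell
#!/bin/sh
# Sketch: decide whether enough unclaimed nodes match before submitting.
# "summary" holds the sample output shown above; in practice it would be
# captured from the condor_status command itself.
summary='Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 26093 18007 3330 4753 3 0 0
Total 26093 18007 3330 4753 3 0 0'

needed=100   # hypothetical number of nodes the job set needs

# In the final "Total" row, field 2 is Total and field 5 is Unclaimed.
# The header row also starts with "Total", so require a numeric field 2.
unclaimed=$(printf '%s\n' "$summary" |
    awk '$1 == "Total" && $2 ~ /^[0-9]+$/ {print $5}')

if [ "$unclaimed" -ge "$needed" ]; then
    verdict="enough unclaimed nodes ($unclaimed >= $needed)"
else
    verdict="too few unclaimed nodes ($unclaimed < $needed)"
fi
echo "$verdict"
```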
View results in the file for all standard output, here named myjob.out:
cms-100.rcac.purdue.edu (none)
/autohome/u105/myusername/condor/Introduction/memory
total 224
-rw-r--r-- 1 myusername itap 1508 Mar 11 14:38 README
-rw-r--r-- 1 myusername itap 0 Mar 11 15:36 myjob.err
-rw-r--r-- 1 myusername itap 791 Mar 11 15:36 myjob.log
-rw-r--r-- 1 myusername itap 77 Mar 11 15:36 myjob.out
-rw-r----- 1 myusername itap 663 Mar 11 15:20 myjob.sub
-rwxr-xr-x 1 myusername itap 6939 Mar 11 14:38 myprogram
-rw-r----- 1 myusername itap 488 Mar 11 14:40 myprogram.c
-rwxr----- 1 myusername itap 58 Mar 11 14:38 run
*** MAIN START ***
*** MAIN STOP ***
This job happened to run on compute node cms-100. This compute node has 8 processor cores. To verify that cms-100 has at least 16 GB of memory:
$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'Machine=="cms-100.rcac.purdue.edu"' -format "%s\n" TotalMemory
16046
16046
16046
16046
16046
16046
16046
16046
You compile a computer program to run on a specific combination of chip architecture and operating system. This combination is a platform. BoilerGrid contains compute nodes of many different platforms, so you must often specify the platform your program requires to ensure that your job runs on the correct platform. The predominant platform on BoilerGrid is 64-bit Linux ("X86_64/LINUX"). To see a list of all platforms available on BoilerGrid:
$ condor_status -pool boilergrid.rcac.purdue.edu -total
                   Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
INTEL/LINUX          114     18        0         60        0           0        36
INTEL/OSX              2      0        0          2        0           0         0
INTEL/WINNT51        334      8        0        326        0           0         0
INTEL/WINNT61       6299    982        0       5317        0           0         0
SUN4u/SOLARIS210       3      0        0          3        0           0         0
X86_64/LINUX       30170  19460     4559       6150        0           0         1
Total              36922  20468     4559      11858        0           0        37
The name "INTEL" as used on BoilerGrid means 32-bit Intel-compatible hardware, and it makes no distinction between Intel and AMD CPUs. The name "X86_64" is a vendor-neutral term to refer to 64-bit architecture from either Intel or AMD. The name "WINNT51" means Windows XP, and "WINNT61" means Windows 7.
By default, Condor sends a job to a compute node whose architecture and operating system match the platform of the host from which you submitted your job. However, you may also submit jobs to compute nodes whose platform differs from that of the submission host. For example, you may compile a program to run on a Windows machine and submit the executable file to BoilerGrid from one of BoilerGrid's Linux submission hosts by specifying that the job requires a Windows compute node:
executable = myprogram.exe
requirements = (ARCH == "INTEL") && ((OPSYS == "WINNT51") || (OPSYS == "WINNT61"))
It is possible to allow Condor to use a larger pool of compute nodes for a job if executables are available for multiple platforms. You need only take care to not reference any absolute paths within your job submission that are specific to one platform or installation. You can often use some existing ClassAd variables instead of fixed paths to make non-platform-specific submission files.
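One way to do this is the $$() substitution described in the condor_submit documentation, which fills in attribute values from the matched compute node. The sketch below writes such a submit file; the file name myjob.multi.sub and the per-platform executable names (for example myprogram.LINUX.X86_64) are hypothetical, and it assumes you have built one executable for each platform listed in requirements.

```shell
#!/bin/sh
# Sketch: write a submit file that accepts two platforms and lets Condor
# pick the matching executable. File and executable names are hypothetical.
workdir=$(mktemp -d)
subfile="$workdir/myjob.multi.sub"

# The quoted 'EOF' keeps the shell from expanding $$(...) itself.
cat > "$subfile" <<'EOF'
# FILENAME: myjob.multi.sub
universe     = VANILLA
# Condor replaces $$(OpSys) and $$(Arch) with the values of the matched
# compute node, e.g. myprogram.LINUX.X86_64 or myprogram.LINUX.INTEL.
executable   = myprogram.$$(OpSys).$$(Arch)
requirements = (OpSys == "LINUX") && ((Arch == "X86_64") || (Arch == "INTEL"))
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
output = myjob.out
error  = myjob.err
log    = myjob.log
queue
EOF

cat "$subfile"
```

Because the executable name is derived from the matched node, the submit file itself contains no platform-specific path.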
RCAC resources include several clusters. Currently, the clusters include the following:
Radon
Steele
Coates
Rossmann
Miner
This section illustrates how to apply Condor ClassAds to submit a small job to a node which resides on some subset of this collection of RCAC resources. These examples execute a simple shell script which displays the name of the compute node which ran the job.
Prepare a shell script with an appropriate filename, here named myjob.sh:
#!/bin/sh
# FILENAME: myjob.sh

hostname
Change the permissions of the shell script to allow execution by the owner (you):
$ chmod u+x myjob.sh
Prepare a job submission file with an appropriate filename, here named myjob.sub. The requirements expression restricts the job to compute nodes which reside on either of two clusters. This example also specifies the shell script myjob.sh as the executable, transfers the shell script to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same file system is not available on the compute node):
# FILENAME: myjob.sub

universe = VANILLA

# Require a compute node of either the Steele or Coates cluster.
# Attribute name is not case sensitive; attribute value is.
requirements = (CLUSTERNAME=="Steele") || (clustername=="Coates")

# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable = myjob.sh

# Turn on Condor's file transfer mechanism only when needed.
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT

# Standard I/O files, Condor log file
output = myjob.out
error = myjob.err
log = myjob.log

# queue one job
queue
Submit the job:
$ condor_submit myjob.sub
View job status:
$ condor_q myusername
View results in the file for all standard output, here named myjob.out:
coates-d020.rcac.purdue.edu
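When the list of acceptable clusters grows, the requirements line can be generated instead of typed. The helper below is a sketch only: the function name cluster_requirements is hypothetical, and it assumes that the ClusterName value of each cluster is spelled exactly as in the list of clusters above (attribute values are case sensitive).

```shell
#!/bin/sh
# Sketch: build a requirements expression that accepts any of several
# clusters. The helper name cluster_requirements is hypothetical.
cluster_requirements() {
    expr_text=""
    for name in "$@"; do
        clause="(ClusterName==\"$name\")"
        if [ -z "$expr_text" ]; then
            expr_text="$clause"
        else
            expr_text="$expr_text || $clause"
        fi
    done
    printf 'requirements = %s\n' "$expr_text"
}

line=$(cluster_requirements Steele Coates Rossmann)
echo "$line"
```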
Prepare a job submission file with an appropriate filename, here named myjob.sub. The ClassAd requires a specific compute node. This example also specifies the shell script myjob.sh as the executable, transfers the shell script to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same file system is not available on the compute node):
# FILENAME: myjob.sub

universe = VANILLA

# Require a specific compute node.
requirements = Machine=="miner-a500.rcac.purdue.edu"

# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable = myjob.sh

# Turn on Condor's file transfer mechanism only when needed.
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT

# Standard I/O files, Condor log file
output = myjob.out
error = myjob.err
log = myjob.log

# queue one job
queue
Submit the job:
$ condor_submit myjob.sub
View job status:
$ condor_q myusername
View results in the file for all standard output, here named myjob.out:
miner-a500.rcac.purdue.edu
When you discover that a compute node is consistently available and consistently fails to run your job, you may exclude that node from the set of candidate nodes.
Prepare a job submission file with an appropriate filename, here named myjob.sub. The ClassAd excludes one specific compute node of a chosen cluster. This example also specifies the shell script myjob.sh as the executable, transfers the shell script to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same file system is not available on the compute node):
# FILENAME: myjob.sub

universe = VANILLA

# Exclude a specific compute node.
requirements = ClusterName=="Miner" && Machine!="miner-a500.rcac.purdue.edu"

# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable = myjob.sh

# Turn on Condor's file transfer mechanism only when needed.
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT

# Standard I/O files, Condor log file
output = myjob.out
error = myjob.err
log = myjob.log

# queue one job
queue
Submit the job:
$ condor_submit myjob.sub
View job status:
$ condor_q myusername
View results in the file for all standard output, here named myjob.out:
miner-a502.rcac.purdue.edu
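If more than one node misbehaves, the same pattern extends to a list of exclusions. The following sketch builds such an expression; the helper name exclude_nodes is hypothetical, and miner-a501 is a made-up node name used only for illustration.

```shell
#!/bin/sh
# Sketch: build a requirements expression that stays on one cluster but
# skips several nodes. The helper name exclude_nodes is hypothetical, and
# miner-a501 is a made-up node name used only for illustration.
exclude_nodes() {
    cluster=$1
    shift
    expr_text="ClusterName==\"$cluster\""
    for host in "$@"; do
        expr_text="$expr_text && Machine!=\"$host\""
    done
    printf 'requirements = %s\n' "$expr_text"
}

line=$(exclude_nodes Miner miner-a500.rcac.purdue.edu miner-a501.rcac.purdue.edu)
echo "$line"
```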
Condor schedules individual programs to run on unused compute servers, but it does not schedule a sequence of programs; Condor does not handle dependencies. Instead, the Directed Acyclic Graph Manager (DAGMan), a meta-scheduler which can handle dependencies, submits programs to Condor in a sequence specified by a directed acyclic graph (DAG). A DAG can represent a sequence of computations. Nodes (vertices) of the DAG represent executable programs; edges (arcs) identify the dependencies between programs.
This example is a linear DAG which represents three ordered executions named "A", "B", and "C". Program A must finish before program B may begin; B must finish before C may begin.

The DAG submission file, myprogram.dag, describes the DAG and specifies job submission files which control the execution of individual programs at each node of the DAG and under Condor's control:
# FILENAME: myprogram.dag

# Specify the nodes (job submission files) of a DAG.
JOB A myprogram.A.sub
JOB B myprogram.B.sub
JOB C myprogram.C.sub

# Specify command-line arguments as macro definitions.
VARS A nodename="A"
VARS B nodename="B"
VARS C nodename="C"

# Specify the edges (dependencies, order of execution) of a DAG.
PARENT A CHILD B
PARENT B CHILD C
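For longer chains, a DAG submission file of this shape can be generated rather than written by hand. The sketch below reproduces the JOB, VARS, and PARENT/CHILD lines for an ordered list of node names; the output file name generated.dag is hypothetical, and the script assumes one job submission file per node named myprogram.NODE.sub, as in this example.

```shell
#!/bin/sh
# Sketch: generate a linear DAG submission file for an ordered list of
# node names. The output file name generated.dag is hypothetical.
workdir=$(mktemp -d)
dagfile="$workdir/generated.dag"
nodes="A B C"

{
    echo "# Specify the nodes (job submission files) of a DAG."
    for n in $nodes; do
        echo "JOB $n myprogram.$n.sub"
    done
    echo "# Specify command-line arguments as macro definitions."
    for n in $nodes; do
        echo "VARS $n nodename=\"$n\""
    done
    echo "# Specify the edges (dependencies, order of execution) of a DAG."
    prev=""
    for n in $nodes; do
        # Every node except the first depends on its predecessor.
        if [ -n "$prev" ]; then
            echo "PARENT $prev CHILD $n"
        fi
        prev=$n
    done
} > "$dagfile"

cat "$dagfile"
```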
View the job submission file, myprogram.A.sub, for the first node of the DAG:
universe = VANILLA
executable = myprogram
arguments = $(Cluster) $(Process) $(nodename)
output = myprogram.A.out
error = myprogram.A.err
log = myprogram.log
queue
View the job submission file, myprogram.B.sub, for the second node of the DAG:
universe = VANILLA
executable = myprogram
arguments = $(Cluster) $(Process) $(nodename)
output = myprogram.B.out
error = myprogram.B.err
log = myprogram.log
queue
View the job submission file, myprogram.C.sub, for the third node of the DAG:
universe = VANILLA
executable = myprogram
arguments = $(Cluster) $(Process) $(nodename)
output = myprogram.C.out
error = myprogram.C.err
log = myprogram.log
queue
While DAGMan can execute a different program for each node of the DAG, this example uses a single executable file for simplicity. The short program myprogram.c displays its command-line arguments, which originate in the DAG submission file and are forwarded through the job submission files. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
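The listing of myprogram.c is not reproduced in this section. As a hedged stand-in, the shell sketch below prints the same kinds of fields that appear in the output files of this example. It is not the original program: the function name show_args is hypothetical, and the one-field-per-line layout is an assumption.

```shell
#!/bin/sh
# Sketch: a shell stand-in for myprogram that prints its three arguments
# (cluster number, process number, node name). The function name
# show_args and the exact line layout are assumptions.
show_args() {
    echo "*** MAIN START ***"
    echo "program name: $0"
    echo "cluster number: $1"
    echo "process number: $2"
    echo "node name: $3"
    echo "*** MAIN STOP ***"
}

# Condor would supply these values through the submit-file line
#   arguments = $(Cluster) $(Process) $(nodename)
report=$(show_args 746894 0 A)
echo "$report"
```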
To submit the DAG to Condor:
$ condor_submit_dag -force myprogram.dag
The argument -force tells Condor to overwrite the files that it produces. This is a useful convenience when the DAG's output files already exist from a preceding run and you no longer need that earlier output. DAGMan appends to, rather than overwrites, the file dagman.out.
The command condor_rm removes a DAG from the job queue.
Command condor_q shows the sequence of execution:
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:38552> : condor.rcac.purdue.edu
ID        OWNER       SUBMITTED   RUN_TIME    ST PRI SIZE CMD
746893.0  myusername  11/9 10:00  0+00:00:08  R  0   7.3  condor_dagman

$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:38552> : condor.rcac.purdue.edu
ID        OWNER       SUBMITTED   RUN_TIME    ST PRI SIZE CMD
746893.0  myusername  11/9 10:00  0+00:01:42  R  0   7.3  condor_dagman
746894.0  myusername  11/9 10:00  0+00:00:00  I  0   0.0  myprogram 746894 0 A

$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:38552> : condor.rcac.purdue.edu
ID        OWNER       SUBMITTED   RUN_TIME    ST PRI SIZE CMD
746893.0  myusername  11/9 10:00  0+00:18:48  R  0   7.3  condor_dagman
746897.0  myusername  11/9 10:15  0+00:00:00  I  0   0.0  myprogram 746897 0 B

$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:38552> : condor.rcac.purdue.edu
ID        OWNER       SUBMITTED   RUN_TIME    ST PRI SIZE CMD
746893.0  myusername  11/9 10:00  0+00:21:28  R  0   7.3  condor_dagman
746900.0  myusername  11/9 10:21  0+00:00:00  I  0   0.0  myprogram 746900 0 C
This report shows that DAGMan has its own cluster number. Each node of a DAG has its own cluster number. These cluster numbers are not necessarily in sequence since other users are submitting jobs to Condor.
View the output file of the first node of the DAG, myprogram.A.out:
*** MAIN START ***
program name: condor_exec.exe
cluster number: 746894
process number: 0
node name: A
*** MAIN STOP ***
View the output file of the second node of the DAG, myprogram.B.out:
*** MAIN START ***
program name: condor_exec.exe
cluster number: 746897
process number: 0
node name: B
*** MAIN STOP ***
View the output file of the third node of the DAG, myprogram.C.out:
*** MAIN START ***
program name: condor_exec.exe
cluster number: 746900
process number: 0
node name: C
*** MAIN STOP ***
Each execution of the single program sees a unique node name: A, B, C.
The common log file records the execution of the three nodes of the DAG, myprogram.log:
000 (746894.000.000) 11/09 10:00:37 Job submitted from host: <128.211.157.86:38552>
DAG Node: A
...
001 (746894.000.000) 11/09 10:15:09 Job executing on host: <128.211.157.10:59600>
...
005 (746894.000.000) 11/09 10:15:09 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
000 (746897.000.000) 11/09 10:15:18 Job submitted from host: <128.211.157.86:38552>
DAG Node: B
...
001 (746897.000.000) 11/09 10:21:24 Job executing on host: <128.211.157.10:52773>
...
005 (746897.000.000) 11/09 10:21:24 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
000 (746900.000.000) 11/09 10:21:36 Job submitted from host: <128.211.157.86:38552>
DAG Node: C
...
001 (746900.000.000) 11/09 10:26:06 Job executing on host: <128.211.157.10:59600>
...
005 (746900.000.000) 11/09 10:26:06 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
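The ordering of the three nodes can be read off the log with a short script. This is a sketch under the assumption that the log uses the event layout shown above (event 000 for submission, 005 for termination, and a "DAG Node:" line after each submission); the sample lines are a condensed copy of the log above.

```shell
#!/bin/sh
# Sketch: list when each DAG node was submitted and when it terminated.
# The sample lines are condensed from the myprogram.log excerpt above.
log=$(mktemp)
cat > "$log" <<'EOF'
000 (746894.000.000) 11/09 10:00:37 Job submitted from host: <128.211.157.86:38552>
DAG Node: A
...
005 (746894.000.000) 11/09 10:15:09 Job terminated.
...
000 (746897.000.000) 11/09 10:15:18 Job submitted from host: <128.211.157.86:38552>
DAG Node: B
...
005 (746897.000.000) 11/09 10:21:24 Job terminated.
...
000 (746900.000.000) 11/09 10:21:36 Job submitted from host: <128.211.157.86:38552>
DAG Node: C
...
005 (746900.000.000) 11/09 10:26:06 Job terminated.
EOF

# Field 2 is the job ID and field 4 the time of day.
timeline=$(awk '
    $1 == "000"                  { id = $2; submitted[id] = $4 }
    $1 == "DAG" && $2 == "Node:" { node[id] = $3; order[++n] = id }
    $1 == "005"                  { finished[$2] = $4 }
    END {
        for (i = 1; i <= n; i++) {
            id = order[i]
            printf "%s submitted %s terminated %s\n", node[id], submitted[id], finished[id]
        }
    }' "$log")
echo "$timeline"
rm -f "$log"
```

In the resulting list, each node's submission time follows the termination time of its parent, which is the dependency the DAG specifies.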
A linear DAG may include a parameter sweep. The following diagram illustrates a three-step linear DAG with the middle process being a parameter sweep which applies a single computer program to unique data sets. The first and third steps might perform data preparation and collation, respectively:

This example is a linear DAG which represents three ordered executions named "A", "B", and "C". Program A must finish before any run of program B used in the parameter sweep may begin; all runs of program B must finish before C may begin.
The DAG submission file, myprogram.dag, describes the DAG and specifies job submission files which control the execution of individual programs at each node of the DAG and under Condor's control. Notice that this DAG submission file is identical to a linear DAG submission file:
# FILENAME: myprogram.dag

# Specify the nodes (job submission files) of a DAG.
JOB A myprogram.A.sub
JOB B myprogram.B.sub
JOB C myprogram.C.sub

# Specify command-line arguments as macro definitions.
VARS A nodename="A"
VARS B nodename="B"
VARS C nodename="C"

# Specify the edges (dependencies, order of execution) of a DAG.
PARENT A CHILD B
PARENT B CHILD C
View the job submission file, myprogram.A.sub, for the first node of the DAG:
universe = VANILLA
executable = myprogram
arguments = $(Cluster) $(Process) $(nodename)
output = myprogram.A.out
error = myprogram.A.err
log = myprogram.log
queue
View the job submission file, myprogram.B.sub, for the second node, the parameter sweep, of the DAG. Command queue submits three copies of myprogram to Condor:
universe = VANILLA
executable = myprogram
arguments = $(Cluster) $(Process) $(nodename)
output = myprogram.B.out.$(Process)
error = myprogram.B.err.$(Process)
log = myprogram.log
queue 3
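Since Condor numbers the processes of one cluster from 0 upward, the names of the sweep's output files are predictable. A later step, such as a collation program at node C, could build the list as in this sketch; the variable names are illustrative, and count must match the number given to queue.

```shell
#!/bin/sh
# Sketch: list the output files that "queue 3" produces for node B.
# Condor numbers the processes of one cluster from 0 to count-1.
count=3
names=""
i=0
while [ "$i" -lt "$count" ]; do
    names="$names myprogram.B.out.$i"
    i=$((i + 1))
done
names=${names# }   # drop the leading space
echo "$names"
```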
View the job submission file, myprogram.C.sub, for the third node of the DAG:
universe = VANILLA
executable = myprogram
arguments = $(Cluster) $(Process) $(nodename)
output = myprogram.C.out
error = myprogram.C.err
log = myprogram.log
queue
While DAGMan can execute a different program for each node of the DAG, this example uses a single executable file for simplicity. The short program myprogram.c displays its command-line arguments, which originate in the DAG submission file and are forwarded through the job submission files. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
To submit the DAG to Condor:
$ condor_submit_dag -force myprogram.dag
The argument -force tells Condor to overwrite the files that it produces. This is a useful convenience when the DAG's output files already exist from a preceding run and you no longer need the earlier output. DAGMan appends to, rather than overwrites, the file dagman.out.
Command condor_rm is able to remove a DAG from the job queue.
Three timely submissions of condor_q caught the three steps of the DAG, including the parameter sweep of the middle step:
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:58916> : condor.rcac.purdue.edu
ID        OWNER       SUBMITTED    RUN_TIME    ST PRI SIZE CMD
746911.0  myusername  11/10 08:30  0+00:00:19  R  0   7.3  condor_dagman
746912.0  myusername  11/10 08:30  0+00:00:00  I  0   0.0  myprogram 746912 0 A

$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:58916> : condor.rcac.purdue.edu
ID        OWNER       SUBMITTED    RUN_TIME    ST PRI SIZE CMD
746911.0  myusername  11/10 08:30  0+00:02:25  R  0   7.3  condor_dagman
746913.0  myusername  11/10 08:32  0+00:00:00  I  0   0.0  myprogram 746913 0 B
746913.1  myusername  11/10 08:32  0+00:00:00  I  0   0.0  myprogram 746913 1 B
746913.2  myusername  11/10 08:32  0+00:00:00  I  0   0.0  myprogram 746913 2 B

$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:58916> : condor.rcac.purdue.edu
ID        OWNER       SUBMITTED    RUN_TIME    ST PRI SIZE CMD
746911.0  myusername  11/10 08:30  0+00:14:55  R  0   7.3  condor_dagman
746914.0  myusername  11/10 08:41  0+00:00:00  I  0   0.0  myprogram 746914 0 C
This report shows that DAGMan has its own cluster number. Each node of a DAG has its own cluster number. These cluster numbers are not necessarily in sequence since other users are submitting jobs to Condor. In addition, each process in the parameter sweep has its own process number, and they are in sequence.
View the output file of the first node of the DAG, myprogram.A.out:
*** MAIN START ***
program name: condor_exec.exe
cluster number: 746912
process number: 0
node name: A
*** MAIN STOP ***
View the three output files of the three processes of the parameter sweep that is the second node of the DAG, myprogram.B.out.$(Process):
*** MAIN START ***
program name: condor_exec.exe
cluster number: 746913
process number: 0
node name: B
*** MAIN STOP ***

*** MAIN START ***
program name: condor_exec.exe
cluster number: 746913
process number: 1
node name: B
*** MAIN STOP ***

*** MAIN START ***
program name: condor_exec.exe
cluster number: 746913
process number: 2
node name: B
*** MAIN STOP ***
View the output file of the third node of the DAG, myprogram.C.out:
*** MAIN START ***
program name: condor_exec.exe
cluster number: 746914
process number: 0
node name: C
*** MAIN STOP ***
Each execution of the single program sees a unique node name: A, B, C. In the parameter sweep, all runs of the single program see the same node name, B; however, each copy sees a unique process number.
The common log file records the execution of the three nodes of the DAG, myprogram.log:
000 (746912.000.000) 11/10 08:30:43 Job submitted from host: <128.211.157.86:58916>
DAG Node: A
...
001 (746912.000.000) 11/10 08:32:36 Job executing on host: <128.211.157.10:37230>
...
005 (746912.000.000) 11/10 08:32:36 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
000 (746913.000.000) 11/10 08:32:44 Job submitted from host: <128.211.157.86:58916>
DAG Node: B
...
000 (746913.001.000) 11/10 08:32:44 Job submitted from host: <128.211.157.86:58916>
DAG Node: B
...
000 (746913.002.000) 11/10 08:32:44 Job submitted from host: <128.211.157.86:58916>
DAG Node: B
...
001 (746913.000.000) 11/10 08:41:12 Job executing on host: <128.211.157.10:34460>
...
001 (746913.001.000) 11/10 08:41:12 Job executing on host: <128.211.157.10:36048>
...
005 (746913.000.000) 11/10 08:41:12 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
005 (746913.001.000) 11/10 08:41:12 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
001 (746913.002.000) 11/10 08:41:12 Job executing on host: <128.211.157.10:34460>
...
005 (746913.002.000) 11/10 08:41:13 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
000 (746914.000.000) 11/10 08:41:26 Job submitted from host: <128.211.157.86:58916>
DAG Node: C
...
001 (746914.000.000) 11/10 08:48:03 Job executing on host: <128.211.157.10:40848>
...
005 (746914.000.000) 11/10 08:48:04 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
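The rule that node A must finish before any run of node B begins can be checked against the log. The sketch below uses four event lines copied from the log above and compares timestamps as text, which assumes that all events fall within the same year and that the log zero-pads its dates and times as shown.

```shell
#!/bin/sh
# Sketch: confirm that every run of node B started after node A ended.
# The sample lines are copied from the myprogram.log excerpt above.
log=$(mktemp)
cat > "$log" <<'EOF'
005 (746912.000.000) 11/10 08:32:36 Job terminated.
001 (746913.000.000) 11/10 08:41:12 Job executing on host: <128.211.157.10:34460>
001 (746913.001.000) 11/10 08:41:12 Job executing on host: <128.211.157.10:36048>
001 (746913.002.000) 11/10 08:41:12 Job executing on host: <128.211.157.10:34460>
EOF

parent=746912   # cluster number of node A
child=746913    # cluster number of node B

result=$(awk -v parent="$parent" -v child="$child" '
    {
        # Field 2 looks like (746913.001.000); the cluster is the first part.
        split($2, part, /[(.]/)
        cluster = part[2]
        stamp = $3 " " $4
    }
    # Latest termination (event 005) of the parent cluster.
    $1 == "005" && cluster == parent && stamp > last_end { last_end = stamp }
    # Earliest execution start (event 001) of the child cluster.
    $1 == "001" && cluster == child && (first_start == "" || stamp < first_start) {
        first_start = stamp
    }
    END {
        if (first_start > last_end) print "ordered"; else print "overlap"
    }' "$log")
echo "$result"
rm -f "$log"
```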
Every node of a DAG may be a parameter sweep. This means that each run of an entire DAG can process a unique set of input data. This is the logical extension of a single program used in a parameter sweep. The disadvantage of this method is the interdependence among copies of the DAG.
This example is a linear DAG which represents three ordered executions named "A", "B", and "C". This DAG runs as a parameter sweep. The interdependence among the runs of this DAG means that all runs of the program associated with node A must finish before any run of the program associated with node B may begin; all runs of the program associated with node B must finish before any run of the program associated with node C may begin. If one run of node A is delayed because its executable file landed on a slow compute node, then all runs of the parameter sweep wait, not just the delayed run.

The DAG submission file, myprogram.dag, describes the DAG and specifies job submission files which control the execution of individual programs at each node of the DAG and under Condor's control. Notice that this DAG submission file is identical to a linear DAG submission file:
# FILENAME: myprogram.dag

# Specify the nodes (job submission files) of a DAG.
JOB A myprogram.A.sub
JOB B myprogram.B.sub
JOB C myprogram.C.sub

# Specify command-line arguments as macro definitions.
VARS A nodename="A"
VARS B nodename="B"
VARS C nodename="C"

# Specify the edges (dependencies, order of execution) of a DAG.
PARENT A CHILD B
PARENT B CHILD C
View the job submission file, myprogram.A.sub, for the first node of the DAG:
universe = VANILLA
executable = myprogram
arguments = $(Cluster) $(Process) $(nodename)
output = myprogram.A.out.$(Process)
error = myprogram.A.err.$(Process)
log = myprogram.log
# queue 3 runs
queue 3
View the job submission file, myprogram.B.sub, for the second node of the DAG:
universe = VANILLA
executable = myprogram
arguments = $(Cluster) $(Process) $(nodename)
output = myprogram.B.out.$(Process)
error = myprogram.B.err.$(Process)
log = myprogram.log
# queue 3 runs
queue 3
View the job submission file, myprogram.C.sub, for the third node of the DAG:
universe = VANILLA
executable = myprogram
arguments = $(Cluster) $(Process) $(nodename)
output = myprogram.C.out.$(Process)
error = myprogram.C.err.$(Process)
log = myprogram.log
# queue 3 runs
queue 3
For each node of the DAG, command queue submits three copies of myprogram to Condor.
While DAGMan can execute a different program for each node of the DAG, this example uses a single executable file for simplicity. The short program myprogram.c displays its command-line arguments, which originate in the DAG submission file and are forwarded through the job submission files. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
To submit the DAG to Condor:
$ condor_submit_dag -force myprogram.dag
The argument -force requires Condor to overwrite the files that it produces. This is a useful convenience when the DAG's output files already exist from a preceding run, and you no longer need the earlier output. DAGMan appends, not overwrites, the file dagman.out.
Command condor_rm is able to remove a DAG from the job queue.
Three timely submissions of condor_q caught the parameter sweeps of the three steps of the DAG:
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:58916> : condor.rcac.purdue.edu
ID        OWNER       SUBMITTED    RUN_TIME    ST PRI SIZE CMD
746924.0  myusername  11/11 14:28  0+00:00:22  R  0   7.3  condor_dagman
746925.0  myusername  11/11 14:28  0+00:00:00  I  0   0.0  myprogram 746925 0 A
746925.1  myusername  11/11 14:28  0+00:00:00  I  0   0.0  myprogram 746925 1 A
746925.2  myusername  11/11 14:28  0+00:00:00  I  0   0.0  myprogram 746925 2 A

$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:58916> : condor.rcac.purdue.edu
ID        OWNER       SUBMITTED    RUN_TIME    ST PRI SIZE CMD
746924.0  myusername  11/11 14:28  0+00:04:51  R  0   7.3  condor_dagman
746926.0  myusername  11/11 14:32  0+00:00:00  I  0   0.0  myprogram 746926 0 B
746926.1  myusername  11/11 14:32  0+00:00:00  I  0   0.0  myprogram 746926 1 B
746926.2  myusername  11/11 14:32  0+00:00:00  I  0   0.0  myprogram 746926 2 B

$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:58916> : condor.rcac.purdue.edu
ID        OWNER       SUBMITTED    RUN_TIME    ST PRI SIZE CMD
746924.0  myusername  11/11 14:28  0+00:09:55  R  0   7.3  condor_dagman
746927.0  myusername  11/11 14:37  0+00:00:00  I  0   0.0  myprogram 746927 0 C
746927.1  myusername  11/11 14:37  0+00:00:00  I  0   0.0  myprogram 746927 1 C
746927.2  myusername  11/11 14:37  0+00:00:00  I  0   0.0  myprogram 746927 2 C
This report shows that DAGMan has its own cluster number. Each node of the DAG has its own cluster number. These cluster numbers are not necessarily in sequence since other users are submitting jobs to Condor. In addition, since each node is a parameter sweep, each process in the parameter sweep has its own process number, and they are in sequence.
View the three output files of the zero-th run of the parameter sweep of the DAG: myprogram.A.out.0, myprogram.B.out.0, and myprogram.C.out.0:
*** MAIN START ***
program name: condor_exec.exe
cluster number: 746925
process number: 0
node name: A
*** MAIN STOP ***

*** MAIN START ***
program name: condor_exec.exe
cluster number: 746926
process number: 0
node name: B
*** MAIN STOP ***

*** MAIN START ***
program name: condor_exec.exe
cluster number: 746927
process number: 0
node name: C
*** MAIN STOP ***
Similar sets of output files exist for the other two runs of the parameter sweep. Each execution of the single program sees a unique pair of node name (A, B, C) and process number (0, 1, 2).
The common log file records the execution of the three runs of the parameter sweep. In particular, it shows that all runs of node B start only after all runs of node A reach completion.
A single use of condor_submit_dag may execute several independent DAGs. Each independent DAG has its own DAG submission file. The names of these DAG submission files appear as command-line arguments of condor_submit_dag, as in the following:
condor_submit_dag -force mydagsubmissionfile1 mydagsubmissionfile2 ... mydagsubmissionfileN
This example consists of two independent linear DAGs: one represents three ordered executions named "A", "B", and "C", and the other represents two ordered executions named "D" and "E". While each sequence must execute in the order specified by its own DAG, there is no dependency between the two sequences. In other words, the execution of step E depends only on the completion of step D, not on steps A, B, or C.

Here are the two independent DAG submission files, myprogram.dag.1 and myprogram.dag.2:
# FILENAME: myprogram.dag.1

# Specify the nodes (job submission files) of a DAG.
JOB A myprogram.dag1.A.sub
JOB B myprogram.dag1.B.sub
JOB C myprogram.dag1.C.sub

# Specify command-line arguments as macro definitions.
VARS A nodename="A"
VARS B nodename="B"
VARS C nodename="C"

# Specify the edges (dependencies, order of execution) of a DAG.
PARENT A CHILD B
PARENT B CHILD C
# FILENAME: myprogram.dag.2 # Specify the nodes (job submission files) of a DAG. JOB D p_00156.dag2.D.sub JOB E p_00156.dag2.E.sub # Specify command-line arguments as macro definitions. VARS D nodename="D" VARS E nodename="E" # Specify the edges (dependencies, order of execution) of a DAG. PARENT D CHILD E
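Because both DAG submission files follow the same pattern, a small shell function could generate them. The following is only a minimal sketch and not part of Condor or BoilerGrid: the function name make_linear_dag and the use of a scratch directory are assumptions made for illustration, while the JOB, VARS, and PARENT ... CHILD lines mirror the two files shown above.

```shell
#!/bin/sh
# Sketch only: make_linear_dag is a hypothetical helper, not a Condor command.
# Usage: make_linear_dag DAGFILE SUBMIT_PREFIX NODE [NODE ...]
make_linear_dag() {
    dagfile=$1
    prefix=$2
    shift 2
    {
        echo "# FILENAME: $dagfile"
        # Nodes: one job submission file per node.
        for node in "$@"; do
            echo "JOB $node $prefix.$node.sub"
        done
        # Macro definitions forwarded to the job submission files.
        for node in "$@"; do
            echo "VARS $node nodename=\"$node\""
        done
        # Edges: each node is the parent of the node that follows it.
        parent=
        for node in "$@"; do
            if [ -n "$parent" ]; then
                echo "PARENT $parent CHILD $node"
            fi
            parent=$node
        done
    } > "$dagfile"
}

# Work in a scratch directory so that no existing file is overwritten.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

make_linear_dag myprogram.dag.1 myprogram.dag1 A B C
make_linear_dag myprogram.dag.2 myprogram.dag2 D E
cat myprogram.dag.2
```

The job submission files named on the JOB lines must still exist before the DAGs are submitted; this sketch writes only the DAG submission files.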
View the three job submission files of DAG 1:
# FILENAME: myprogram.dag1.A.sub
universe = VANILLA
executable = myprogram
arguments = $(Cluster) $(Process) $(nodename)
output = myprogram.dag1.A.out
error = myprogram.dag1.A.err
log = myprogram.dag1.log
queue

# FILENAME: myprogram.dag1.B.sub
universe = VANILLA
executable = myprogram
arguments = $(Cluster) $(Process) $(nodename)
output = myprogram.dag1.B.out
error = myprogram.dag1.B.err
log = myprogram.dag1.log
queue

# FILENAME: myprogram.dag1.C.sub
universe = VANILLA
executable = myprogram
arguments = $(Cluster) $(Process) $(nodename)
output = myprogram.dag1.C.out
error = myprogram.dag1.C.err
log = myprogram.dag1.log
queue
View the two job submission files of DAG 2:
# FILENAME: myprogram.dag2.D.sub
universe = VANILLA
executable = myprogram
arguments = $(Cluster) $(Process) $(nodename)
output = myprogram.dag2.D.out
error = myprogram.dag2.D.err
log = myprogram.dag2.log
queue

# FILENAME: myprogram.dag2.E.sub
universe = VANILLA
executable = myprogram
arguments = $(Cluster) $(Process) $(nodename)
output = myprogram.dag2.E.out
error = myprogram.dag2.E.err
log = myprogram.dag2.log
queue
While DAGMan can execute a different program for each node of the DAG, this example uses a single executable file to remain simple. Here is a short program, myprogram.c, that displays the command-line arguments which originally appear in the DAG submission file and are forwarded to the job submission files. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
To submit the independent DAGs to Condor:
$ condor_submit_dag -force myprogram.dag.1 myprogram.dag.2
The argument -force requires Condor to overwrite the files that it produces. This is a useful convenience when the DAG's output files already exist from a preceding run and you no longer need the earlier output. DAGMan appends to the file dagman.out rather than overwriting it.
The command condor_rm can remove a DAG from the job queue.
Command condor_q shows the start of the two independent DAGs:
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:38552> : condor.rcac.purdue.edu
 ID        OWNER       SUBMITTED    RUN_TIME    ST PRI SIZE CMD
746918.0   myusername  11/10 11:06  0+00:00:32  R  0   7.3  condor_dagman
746919.0   myusername  11/10 11:06  0+00:00:00  I  0   0.0  myprogram.dag1 74691
746920.0   myusername  11/10 11:06  0+00:00:00  I  0   0.0  myprogram.dag2 74692
This report shows that DAGMan has its own cluster number. Each independent DAG has its own set of cluster numbers. These cluster numbers are not necessarily in sequence since other users are submitting jobs to Condor.
View the output file of the first node of DAG 1, myprogram.dag1.A.out:
*** MAIN START ***
program name: condor_exec.exe
cluster number: 746894
process number: 0
node name: A
*** MAIN STOP ***
Similarly named output files exist for the other four nodes.
This example ran with each independent DAG having its own log file. Here is the log file for DAG 2, myprogram.dag2.log:
000 (746920.000.000) 11/10 11:06:17 Job submitted from host: <128.211.157.86:58916>
DAG Node: 1.D
...
001 (746920.000.000) 11/10 11:12:00 Job executing on host: <128.211.157.10:42201>
...
005 (746920.000.000) 11/10 11:12:00 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
000 (746922.000.000) 11/10 11:12:08 Job submitted from host: <128.211.157.86:58916>
DAG Node: 1.E
...
001 (746922.000.000) 11/10 11:18:38 Job executing on host: <128.211.157.10:49358>
...
005 (746922.000.000) 11/10 11:18:38 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
The text "DAG Node: 1.D" refers to step D of the second independent DAG listed as a command-line argument of condor_submit_dag.
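To see at a glance which cluster belongs to which DAG node, the submission events in a log file can be paired with their "DAG Node" lines using a short awk filter. This is only a minimal sketch: the function name list_dag_nodes is hypothetical, and the sample log written here reuses just the two submission events from the log shown above.

```shell
#!/bin/sh
# Sketch only: list_dag_nodes is a hypothetical helper, not a Condor command.
# Print the job ID of each submission event (code 000) with its DAG node label.
list_dag_nodes() {
    awk '/^000 / { id = $2 }
         /DAG Node:/ { print id, $3 }' "$1"
}

# Sample log holding only the submission events from myprogram.dag2.log.
samplelog=$(mktemp)
cat > "$samplelog" <<'EOF'
000 (746920.000.000) 11/10 11:06:17 Job submitted from host: <128.211.157.86:58916>
    DAG Node: 1.D
...
000 (746922.000.000) 11/10 11:12:08 Job submitted from host: <128.211.157.86:58916>
    DAG Node: 1.E
...
EOF

list_dag_nodes "$samplelog"
# Prints:
# (746920.000.000) 1.D
# (746922.000.000) 1.E
```

The number before the period in each label identifies the DAG by its position on the condor_submit_dag command line, counting from 0, which is why both nodes of the second DAG carry the prefix 1.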
Finally, this example could be reshaped into a parameter sweep, but the need to list the names of separate DAG submission files as command-line arguments of condor_submit_dag is very inconvenient for large sweeps.
The Condor keyword SCRIPT specifies optional processing that occurs either before a job within a DAG starts its execution or after a job within a DAG completes its execution. A PRE script performs processing before a job starts its execution under Condor; a POST script performs processing after a job completes its execution under Condor. A node in the DAG includes the job together with PRE and/or POST scripts. These scripts run on the submission host, not on a compute node.
A common use of a PRE script places files in a staging area for a cluster of jobs to use; a common use of a POST script cleans up or removes files once that cluster of jobs reaches completion. An example might use a PRE script to transfer needed files from long-term storage; the corresponding POST script might return the processed files to long-term storage. In another example about staging files, a PRE script might archive a subdirectory structure of files in preparation for transferring that archive as a single input file to the compute node, while the POST script might extract output files from the archive which Condor transferred from the compute node to the submission host after job completion.
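The archive-based staging described above might look like the following sketch. It is only an illustration under stated assumptions: every directory and file name is hypothetical, the two steps would normally live in separate PRE and POST scripts, and a single archive stands in for both the input archive and the returned output archive so that the sketch can run on its own.

```shell
#!/bin/sh
# Sketch only: all directory and file names below are hypothetical.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# Stand-in for a subdirectory structure of input files.
mkdir -p inputs/run1
echo "sample data" > inputs/run1/data.txt

# PRE script step: pack the subdirectory into a single archive that
# Condor could then transfer as one input file.
tar -cf inputs.tar inputs

# POST script step: unpack an archive that Condor transferred back.
# Here inputs.tar stands in for the returned output archive.
mkdir -p results
tar -xf inputs.tar -C results

ls results/inputs/run1
```

Packing many small files into one archive keeps the job submission file simple, because the job then names a single input file instead of a whole directory tree.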
The following flowchart illustrates a DAG with PRE and POST scripts:

The DAG submission file, myprogram.dag, describes the DAG and specifies job submission files which control the execution of individual programs at each node of the DAG and under Condor's control. It also specifies the PRE and POST scripts:
# FILENAME: myprogram.dag
# Specify the nodes (job submission files) of a DAG.
JOB A myprogram.A.sub
JOB B myprogram.B.sub
# Specify PRE and POST scripts.
SCRIPT PRE A myprogram_preA.scr
SCRIPT POST A myprogram_pstA.scr
SCRIPT PRE B myprogram_preB.scr
SCRIPT POST B myprogram_pstB.scr
# Specify command-line arguments as macro definitions.
VARS A nodename="A"
VARS B nodename="B"
# Specify the edges (dependencies, order of execution) of a DAG.
PARENT A CHILD B
View the job submission file, myprogram.A.sub, for the first node of the DAG:
universe = VANILLA
executable = myprogram
arguments = $(Cluster) $(Process) $(nodename)
output = myprogram.A.out
error = myprogram.A.err
log = myprogram.log
queue
View the job submission file, myprogram.B.sub, for the second node of the DAG:
universe = VANILLA
executable = myprogram
arguments = $(Cluster) $(Process) $(nodename)
output = myprogram.B.out
error = myprogram.B.err
log = myprogram.log
queue
The four PRE and POST scripts write a short message to a common output file:
#!/bin/sh
# FILENAME: myprogram_preA.scr
echo "before node A" >>myprogram.lst
/bin/hostname >>myprogram.lst

#!/bin/sh
# FILENAME: myprogram_pstA.scr
echo "after node A" >>myprogram.lst
/bin/hostname >>myprogram.lst

#!/bin/sh
# FILENAME: myprogram_preB.scr
echo "before node B" >>myprogram.lst
/bin/hostname >>myprogram.lst

#!/bin/sh
# FILENAME: myprogram_pstB.scr
echo "after node B" >>myprogram.lst
/bin/hostname >>myprogram.lst
While DAGMan can execute a different program for each node of the DAG, this example uses a single executable file to remain simple. Here is a short program, myprogram.c, that displays the command-line arguments which originally appear in the DAG submission file and are forwarded to the job submission files. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
To submit the DAG to Condor:
$ condor_submit_dag -force myprogram.dag
The argument -force requires Condor to overwrite the files that it produces. This is a useful convenience when the DAG's output files already exist from a preceding run and you no longer need the earlier output. DAGMan appends to the file dagman.out rather than overwriting it.
The command condor_rm can remove a DAG from the job queue.
View the output file of the first node of the DAG, myprogram.A.out:
*** MAIN START ***
program name: condor_exec.exe
cluster number: 746948
process number: 0
node name: A
*** MAIN STOP ***
View the output file of the second node of the DAG, myprogram.B.out:
*** MAIN START ***
program name: condor_exec.exe
cluster number: 746949
process number: 0
node name: B
*** MAIN STOP ***
Each execution of the single program sees a unique node name: A, B.
View the common output file, myprogram.lst, of the four PRE and POST scripts. Output shows that the submission host itself executed the PRE and POST scripts:
before node A
condor.rcac.purdue.edu
after node A
condor.rcac.purdue.edu
before node B
condor.rcac.purdue.edu
after node B
condor.rcac.purdue.edu
You may assign a priority to each of your jobs within a specific Condor queue (on a specific submission host). A priority value can be any integer, where higher values mean higher priority. Condor will generally attempt to assign a compute node to your highest-priority job first. However, this does not necessarily mean that a higher-priority job will get a compute node before a lower-priority job. An available compute node may match the requirements of a lower-priority job but not the requirements of a higher-priority job. Even once started, a higher-priority job may not finish before lower-priority jobs, because it might have a longer run time or be preempted and have to restart more often.
Job priorities are user-specific and queue-specific and will not affect which user's jobs run first—only which jobs of yours start before which other jobs of yours. The default job priority is 0.
Job priorities can be useful when, for example, you have submitted many jobs at the default priority and only afterward realize that you would prefer to see the results of another job first. You may submit this new, urgent job and give it a higher priority so that Condor will try to find a compute node for it before finding compute nodes for your other jobs. This works only if you submit the new job to the same queue (on the same submission host) as your other jobs, because job priorities are queue-specific.
First submit a job to the Condor queue at the default priority (0). To raise this job's priority to 5:
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35407> : condor.rcac.purdue.edu
 ID        OWNER       SUBMITTED    RUN_TIME    ST PRI SIZE CMD
260187.0   myusername  8/30 13:59   0+00:00:00  I  0   19.5 hello
1 jobs; 1 idle, 0 running, 0 held
$ condor_prio -p 5 260187.0
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35407> : condor.rcac.purdue.edu
 ID        OWNER       SUBMITTED    RUN_TIME    ST PRI SIZE CMD
260187.0   myusername  8/30 13:59   0+00:00:03  I  5   19.5 hello
1 jobs; 0 idle, 1 running, 0 held
Several commercial and third-party software packages are installed on RCAC resources and available through BoilerGrid.
Running one of these applications on BoilerGrid through Condor can be tricky. Ideally, to achieve high throughput, you want to maximize the number of candidate compute nodes that can execute your job. However, because BoilerGrid consists of many different types of systems, not every compute node has a given application installed or can run your job effectively. You need to specify enough requirements to ensure that your job only tries to run on systems that are capable of running it, but not make the requirements so specific that they limit your jobs to too few systems. You must carefully balance these two considerations. There is no single, fool-proof method for executing these applications, but a few general comments may help to guide you.
These applications are executable files supplied by the manufacturer. Since these executable files cannot be relinked with condor_compile, you must use the Vanilla Universe when submitting jobs that use them to Condor. By default, the Vanilla Universe will transfer to the compute node whatever file appears on the job's executable command. If you intend to use an executable that is already available on the compute nodes you are using, then specify transfer_executable = FALSE in your job submission file to avoid needlessly copying the manufacturer's executable from the submission host to the compute node with every attempted run. If you specify a shell script of your own creation as the Condor executable, you will want to leave this value as the default (TRUE).
Be aware there are several potential reasons a given application may not be available on all compute nodes in BoilerGrid. The application examples below all run on 64-bit Linux platforms, but those applications may or may not be available on 32-bit Linux or Windows platforms. Owners of some compute nodes may have agreed to include their nodes in BoilerGrid, but they may not have installed some applications. Commercial software licenses may or may not allow some compute nodes to have an application installed.
Ideally, your job submission would specify exactly the subset of compute nodes that have the application installed. To that end, many compute nodes explicitly advertise their applications. They advertise through ClassAd attributes such as HAS_MATLAB or HAS_MAPLE. Using such a ClassAd attribute may exclude some compute nodes that do have the application installed but that do not advertise this fact. However, this method is relatively robust. Unfortunately, only a few applications currently appear in an explicit advertisement. If you need to use an application which no node explicitly advertises through a ClassAd, you may find that you need to restrict the set of potential compute nodes in other ways.
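Before relying on an advertised attribute, it can help to check how many compute nodes advertise it. The sketch below only prints condor_status command lines, in the same form as the queries this guide uses for Maple and MATLAB, so that you can paste them on a submission host. The function name status_cmd is hypothetical, and the restriction to 64-bit Linux is an assumption carried over from those queries.

```shell
#!/bin/sh
# Sketch only: status_cmd is a hypothetical helper, not a Condor command.
# Print the condor_status command line that counts the 64-bit Linux
# compute nodes advertising the given ClassAd attribute.
status_cmd() {
    constraint="(Arch == \"X86_64\") && (OpSys == \"LINUX\") && ($1==True)"
    printf "condor_status -pool %s -total -constraint '%s'\n" \
        boilergrid.rcac.purdue.edu "$constraint"
}

# One query per application attribute named in this guide.
for attr in HAS_MATLAB HAS_MAPLE; do
    status_cmd "$attr"
done
```

Comparing the totals for different attributes gives a rough idea of how much a given requirement narrows the set of candidate compute nodes.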
The examples in the next few sections follow the guidelines described above. Start with the example that most closely resembles your computing goal. After a successful run of a simple job, you may modify your approach to attempt to maximize the number of candidate compute nodes without also including nodes that fail to run your submission.
RCAC tested the examples in the next few sections on some RCAC resources of BoilerGrid but not recently, so you may find some differences. If you need assistance, please contact RCAC.
With the exception of Octave and R, which are free software, only Purdue affiliates may use the following licensed software.
Maple is a general-purpose computer algebra system. This section illustrates how to submit a small Maple job to BoilerGrid. This Maple example differentiates, integrates, and finds the roots of polynomials.
Prepare a Maple input file with an appropriate filename, here named myjob.in:
# FILENAME: myjob.in
# Differentiate with respect to x.
diff( 2*x^3,x );
# Integrate with respect to x.
int( 3*x^2*sin(x)+x,x );
# Solve for x.
solve( 3*x^2+2*x-1,x );
Use the ClassAd attribute "HAS_MAPLE" to discover how many compute nodes advertise Maple:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (HAS_MAPLE==True)'
             Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 25360 16406    3983      4970       0          1        0
       Total 25360 16406    3983      4970       0          1        0
Prepare a job submission file with an appropriate filename, here named myjob.sub. This example uses the ClassAd attribute HAS_MAPLE to locate a compute node, the ClassAd attribute MAPLE_EXE for the path to Maple on the compute node, and Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):
# FILENAME: myjob.sub
universe = VANILLA
requirements = (HAS_MAPLE==TRUE)
# Run the executable already installed on the compute node.
transfer_executable = FALSE
executable = /$$(MAPLE_EXE)
# Use the -q option to suppress startup messages.
# arguments = -q
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
input = myjob.in
output = myjob.out
error = myjob.err
log = myjob.log
queue
Submit the job:
$ condor_submit myjob.sub
View job status:
$ condor_q myusername
If there are any problems with your job, you may need to cancel it or view its full job requirements. You may also wish to investigate more job submission options. The appropriate sections of this user guide cover these topics in more detail.
View results in the file for all standard output, here named myjob.out:
|\^/| Maple 13 (X86 64 LINUX)
._|\| |/|_. Copyright (c) Maplesoft, a division of Waterloo Maple Inc. 2009
\ MAPLE / All rights reserved. Maple is a trademark of
<____ ____> Waterloo Maple Inc.
| Type ? for help.
# FILENAME: myjob.in
>
# Differentiate wrt x.
> diff(2*x^3,x );
                                  2
                               6 x
>
# Integrate wrt x.
> int(3*x^2*sin(x)+x,x );
                                                      2
                 2                                   x
             -3 x  cos(x) + 6 cos(x) + 6 x sin(x) + ----
                                                     2
>
# Solve for x.
> solve(3*x^2+2*x-1,x );
1/3, -1
> quit
memory used=3.0MB, alloc=2.8MB, time=0.04
Any output written to standard error will appear in myjob.err.
You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:
000 (342479.000.000) 01/29 10:11:24 Job submitted from host: <128.211.157.86:60004>
...
001 (342479.000.000) 01/29 10:11:53 Job executing on host: <128.211.157.10:53997?PrivNet=condor.ccb.purdue.edu>
...
005 (342479.000.000) 01/29 10:11:57 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
Mathematica implements numeric and symbolic mathematics. This section illustrates how to submit a small Mathematica job to BoilerGrid. This Mathematica example finds the three roots of a third-degree polynomial.
Prepare a Mathematica input file with an appropriate filename, here named myjob.in:
(* FILENAME: myjob.in *)
(* Find three roots. *)
p=x^3+3*x^2+3*x+1
Solve[p==0]
Quit
Prepare a shell script with an appropriate filename, here named myjob.sh, to run the non-graphical version of Mathematica:
#!/bin/sh
# FILENAME: myjob.sh
module load mathematica
# For additional information about your job, uncomment the following commands:
# hostname
# module list
# which math
math
Change the permissions of the shell script to allow execution by the owner (you):
$ chmod u+x myjob.sh
Prepare a job submission file with an appropriate filename, here named myjob.sub. This example specifies the shell script myjob.sh as the executable, transfers the shell script and the user's current environment variables to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):
# FILENAME: myjob.sub
universe = VANILLA
# Copy my shell environment variables to the compute node.
# The compute node requires this to find the module command.
getenv = TRUE
# Transfer the "executable" script myjob.sh to the compute node.
transfer_executable = TRUE
executable = myjob.sh
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
input = myjob.in
output = myjob.out
error = myjob.err
log = myjob.log
queue
Submit the job:
$ condor_submit myjob.sub
View job status:
$ condor_q myusername
If there are any problems with your job, you may need to cancel it or view its full job requirements. You may also wish to investigate more job submission options. The appropriate sections of this user guide cover these topics in more detail.
View results in the file for all standard output, here named myjob.out:
Mathematica 5.2 for Linux x86 (64 bit)
Copyright 1988-2005 Wolfram Research, Inc.
-- Motif graphics initialized --
In[1]:=
In[2]:=
                     2    3
Out[2]= 1 + 3 x + 3 x  + x
In[3]:=
Out[3]= {{x -> -1}, {x -> -1}, {x -> -1}}
In[4]:=
Any output written to standard error will appear in myjob.err.
You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:
000 (342433.000.000) 12/16 14:12:29 Job submitted from host: <128.211.157.86:41603>
...
001 (342433.000.000) 12/16 14:31:33 Job executing on host: <128.211.157.10:60202>
...
005 (342433.000.000) 12/16 14:31:39 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
MATLAB (an acronym for MATrix LABoratory) is a computing environment and a fourth-generation programming language supporting algorithm development, data analysis and visualization, and numeric and symbolic computation. The MATLAB interpreter is the part of MATLAB which reads M-files and MEX-files and executes MATLAB statements.
This section illustrates how to submit a small MATLAB job to BoilerGrid. This MATLAB example computes the inverse of a matrix. This example, when executed, uses the MATLAB interpreter, so it requires and checks out a MATLAB license.
Prepare a MATLAB M-file with an appropriate filename, here named myjob.m:
% FILENAME: myjob.m
% Display name of compute node which ran this job.
[c name] = system('hostname');
fprintf('\n\nhostname:%s\n', name)
% Invert matrix A.
A = [1 2 3; 4 5 6; 7 8 0]
inv(A)
quit
Use the ClassAd attribute "HAS_MATLAB" to discover how many compute nodes advertise MATLAB:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (HAS_MATLAB==True)'
             Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 25458 19339    2101      4018       0          0        0
       Total 25458 19339    2101      4018       0          0        0
Prepare a job submission file with an appropriate filename, here named myjob.sub. This example uses the ClassAd attribute HAS_MATLAB to locate a compute node, the ClassAd attribute MATLAB_EXE for the path to the MATLAB interpreter on the compute node, and Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):
# FILENAME: myjob.sub
universe = VANILLA
requirements = (HAS_MATLAB==TRUE)
# Run the executable already installed on the compute node.
transfer_executable = FALSE
executable = /$$(MATLAB_EXE)
# arguments = -nodisplay -nosplash -nojvm
# -nodisplay: turn off graphics
# -nosplash: start MATLAB without the splash screen
# -nojvm: start MATLAB without the Java Virtual Machine
arguments = $$(MATLAB_ARGS)
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
input = myjob.m
output = myjob.out
error = myjob.err
log = myjob.log
queue
Submit the job:
$ condor_submit myjob.sub
View job status:
$ condor_q myusername
If there are any problems with your job, you may need to cancel it or view its full job requirements. You may also wish to investigate more job submission options. The appropriate sections of this user guide cover these topics in more detail.
View results in the file for all standard output, here named myjob.out:
< M A T L A B (R) >
Copyright 1984-2010 The MathWorks, Inc.
Version 7.10.0.499 (R2010a) 64-bit (glnxa64)
February 5, 2010
----------------------------------------------------------
Your MATLAB license will expire in 48 days.
Please contact your system administrator or
The MathWorks to renew this license.
----------------------------------------------------------
To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.
>> >> >> >> >>
hostname:hansen-b066.rcac.purdue.edu
>> >> >>
A =
1 2 3
4 5 6
7 8 0
>>
ans =
-1.7778 0.8889 -0.1111
1.5556 -0.7778 0.2222
-0.1111 0.2222 -0.1111
>>
Any output written to standard error will appear in myjob.err.
You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:
000 (746973.000.000) 12/15 15:01:11 Job submitted from host: <128.211.157.86:49400>
...
001 (746973.000.000) 12/15 15:01:17 Job executing on host: <128.211.157.10:53612?PrivNet=condor.ccb.purdue.edu>
...
005 (746973.000.000) 12/15 15:02:09 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:02, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:02, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
The MATLAB Compiler translates an M-file into an executable file. A compiled version of an M-file can substantially improve performance of MATLAB code, especially for statements like for and while.
This section illustrates how to submit a small, compiled MATLAB job to BoilerGrid. This MATLAB example computes the inverse of a matrix. This example, when executed, does not use the MATLAB interpreter, so it neither requires nor checks out a MATLAB license.
Prepare a MATLAB M-file with an appropriate filename, here named myjob.m:
% FILENAME: myjob.m
% Display name of compute node which ran this job.
[c name] = system('hostname');
fprintf('\n\nhostname:%s\n', name)
% Invert matrix A.
A = [1 2 3; 4 5 6; 7 8 0]
inv(A)
quit
To access the MATLAB Compiler mcc, load a MATLAB module. The MATLAB Compiler depends on shared libraries from GCC Version 4.2.3. This version is not available on most of BoilerGrid, but GCC Version 4.2.4 is compatible:
$ module load matlab
$ module load gcc/4.2.4
To compile MATLAB source code into a stand-alone executable, use the macro option -m:
$ mcc -m -v -R -nojvm myjob.m
A few new files appear after the compilation:
mccExcludedFiles.log
myjob
myjob.prj
myjob_main.c
myjob_mcc_component_data.c
readme.txt
run_myjob.sh
The name of the stand-alone executable file is myjob. The name of the shell script to run this executable file is run_myjob.sh.
Prepare the job submission file with an appropriate filename, here named myjob.sub. This example specifies the MATLAB shell script run_myjob.sh as the executable, transfers the shell script to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):
# FILENAME: myjob.sub
universe = VANILLA
requirements = (HAS_MATLAB==TRUE)
# Transfer "executable" shell script run_myjob.sh to the compute node.
transfer_executable = TRUE
executable = run_myjob.sh
# Pass the MATLAB root directory as an argument to the shell script.
arguments = $$(MATLAB_ROOT)
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
#
# The file "myjob" is the compiled version of "myjob.m".
input = myjob
output = myjob.out
error = myjob.err
log = myjob.log
queue
Submit the job:
$ condor_submit myjob.sub
View job status:
$ condor_q myusername
If there are any problems with your job, you may need to cancel it or view its full job requirements. You may also wish to investigate more job submission options. The appropriate sections of this user guide cover these topics in more detail.
View results in the file for all standard output, here named myjob.out:
------------------------------------------
Setting up environment variables
---
LD_LIBRARY_PATH is .:/apps/rhel5/MATLAB_R2010a/runtime/glnxa64:/apps/rhel5/MATLAB_R2010a/bin/glnxa64:/apps/rhel5/MATLAB_R2010a/sys/os/glnxa
64:/apps/rhel5/MATLAB_R2010a/sys/java/jre/glnxa64/jre/lib/amd64/native_threads:/apps/rhel5/MATLAB_R2010a/sys/java/jre/glnxa64/jre/lib/amd64
/server:/apps/rhel5/MATLAB_R2010a/sys/java/jre/glnxa64/jre/lib/amd64/client:/apps/rhel5/MATLAB_R2010a/sys/java/jre/glnxa64/jre/lib/amd64
Warning: No display specified. You will not be able to display graphics on the screen.
hostname:hansen-b066.rcac.purdue.edu
A =
1 2 3
4 5 6
7 8 0
ans =
-1.7778 0.8889 -0.1111
1.5556 -0.7778 0.2222
-0.1111 0.2222 -0.1111
Any output written to standard error will appear in myjob.err.
You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:
000 (746999.000.000) 01/03 11:49:13 Job submitted from host: <128.211.157.86:49400>
...
001 (746999.000.000) 01/03 13:14:00 Job executing on host: <128.211.157.10:33910?PrivNet=condor.ccb.purdue.edu>
...
006 (746999.000.000) 01/03 13:14:09 Image size of job updated: 81956
...
005 (746999.000.000) 01/03 13:14:49 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:01, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:01, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
MEX stands for "MATLAB Executable". A MEX-file offers a way for MATLAB code to call functions written in C, C++, or Fortran as though these external functions were built-in MATLAB functions. You may wish to use a MEX-file if you would like to call an existing C, C++, or Fortran function directly from MATLAB rather than reimplementing that code as a MATLAB function. Also, by implementing performance-critical routines in C, C++, or Fortran rather than MATLAB, you may be able to substantially improve performance over MATLAB source code, especially for statements like for and while.
This section illustrates how to submit a small MATLAB job with a MEX-file to BoilerGrid. This MATLAB example calls a C function which adds two matrices. This example, when executed, uses the MATLAB interpreter, so it requires and checks out a MATLAB license.
Prepare a complicated and time-consuming computation in the form of a C, C++, or Fortran function. In this example, the computation is a C function which adds two matrices:
/* Computational Routine */
void matrixSum (double *a, double *b, double *c, int n) {
int i;
/* Matrix (component-wise) addition. */
for (i = 0; i<n; i++) {
c[i] = a[i] + b[i];
}
}
Combine the computational routine with a MEX-file, which contains the necessary external function interface of MATLAB. In the computational routine, change int to mwSize. The name of the file is matrixSum.c:
/***********************************************************
* FILENAME: matrixSum.c
*
* Adds two MxN arrays (inMatrix).
* Outputs one MxN array (outMatrix).
*
* The calling syntax is:
*
* outMatrix = matrixSum (inMatrix, inMatrix)
*
* This is a MEX-file for MATLAB.
*
**********************************************************/
#include "mex.h"
/* Computational Routine */
void matrixSum (double *a, double *b, double *c, mwSize n) {
mwSize i;
/* Component-wise addition. */
for (i = 0; i<n; i++) {
c[i] = a[i] + b[i];
}
}
/* Gateway Function */
void mexFunction (int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[]) {
double *inMatrix_a; /* mxn input matrix */
double *inMatrix_b; /* mxn input matrix */
mwSize nrows_a,ncols_a; /* size of matrix a */
mwSize nrows_b,ncols_b; /* size of matrix b */
double *outMatrix_c; /* mxn output matrix */
/* Check for proper number of arguments */
if(nrhs!=2) {
mexErrMsgIdAndTxt("MyToolbox:matrixSum:nrhs","Two inputs required.");
}
if(nlhs!=1) {
mexErrMsgIdAndTxt("MyToolbox:matrixSum:nlhs","One output required.");
}
/* Get dimensions of the first input matrix */
nrows_a = mxGetM(prhs[0]);
ncols_a = mxGetN(prhs[0]);
/* Get dimensions of the second input matrix */
nrows_b = mxGetM(prhs[1]);
ncols_b = mxGetN(prhs[1]);
/* Check for equal number of rows. */
if(nrows_a != nrows_b) {
mexErrMsgIdAndTxt("MyToolbox:matrixSum:notEqual","Unequal number of rows.");
}
/* Check for equal number of columns. */
if(ncols_a != ncols_b) {
mexErrMsgIdAndTxt("MyToolbox:matrixSum:notEqual","Unequal number of columns.");
}
/* Make a pointer to the real data in the first input matrix */
inMatrix_a = mxGetPr(prhs[0]);
/* Make a pointer to the real data in the second input matrix */
inMatrix_b = mxGetPr(prhs[1]);
/* Make the output matrix */
plhs[0] = mxCreateDoubleMatrix(nrows_a,ncols_a,mxREAL);
/* Make a pointer to the real data in the output matrix */
outMatrix_c = mxGetPr(plhs[0]);
/* Call the computational routine */
matrixSum(inMatrix_a,inMatrix_b,outMatrix_c,nrows_a*ncols_a);
}
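The reason for changing int to mwSize is that the element count nrows*ncols of a large matrix may not fit in an int. The sketch below illustrates the concern in plain C. It uses size_t as a stand-in for mwSize, which is an assumption made only for this illustration (the actual definition of mwSize depends on how the MEX-file is built), and the names count_t, elementCount, and fitsInInt are hypothetical:

```c
#include <limits.h>
#include <stddef.h>

/* Stand-in for mwSize in this sketch (assumption: an unsigned size type). */
typedef size_t count_t;

/* Total number of elements in an nrows-by-ncols matrix. */
count_t elementCount (count_t nrows, count_t ncols) {
    return nrows * ncols;
}

/* Returns 1 if the count could be held in a plain int, 0 otherwise. */
int fitsInInt (count_t n) {
    return n <= (count_t) INT_MAX;
}
```

With a 32-bit int, the 2-by-3 matrices of this example fit easily, while a 50000-by-50000 matrix has 2.5 billion elements and would overflow an int loop counter.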
To access the MATLAB utility mex, load a MATLAB module. The MATLAB Compiler, mcc, depends on shared libraries from GCC Version 4.2.3. This version is not available on most of BoilerGrid, but GCC Version 4.2.4 is compatible:
$ module load matlab
$ module load gcc/4.2.4
To compile matrixSum.c into a MEX-file:
$ mex matrixSum.c
The name of the MATLAB-callable MEX-file is matrixSum.mexa64.
Prepare a MATLAB M-file with an appropriate filename, here named myjob.m:
% FILENAME: myjob.m
% Call the separately compiled and dynamically linked MEX-file.
A = [1,1,1;1,1,1]
B = [2,2,2;2,2,2]
C = matrixSum(A,B)
quit
Prepare a job submission file with an appropriate filename, here named myjob.sub. This example uses the ClassAd attribute HAS_MATLAB to locate a compute node, the ClassAd attribute MATLAB_EXE for the path to the MATLAB interpreter on the compute node, and Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):
# FILENAME: myjob.sub
universe = VANILLA
requirements = (HAS_MATLAB==TRUE)
# Run the executable already installed on the compute node.
transfer_executable = FALSE
executable = /$$(MATLAB_EXE)
# -nodesktop: run MATLAB in text mode
# -nodisplay: turn off graphics
# -nosplash: start MATLAB without the splash screen
# -nojvm: start MATLAB without the Java virtual machine
# arguments = -nodesktop -nodisplay -nosplash -nojvm
arguments = -nodesktop $$(MATLAB_ARGS)
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
input = myjob.m
output = myjob.out
error = myjob.err
log = myjob.log
#
# Transfer MEX-file matrixSum.mexa64 to the compute node.
transfer_input_files = matrixSum.mexa64
queue
Submit the job:
$ condor_submit myjob.sub
View job status:
$ condor_q myusername
If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics:
View results in the file for all standard output, here named myjob.out:
< M A T L A B (R) >
Copyright 1984-2010 The MathWorks, Inc.
Version 7.10.0.499 (R2010a) 64-bit (glnxa64)
February 5, 2010
To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.
>>
A =
1 1 1
1 1 1
B =
2 2 2
2 2 2
C =
3 3 3
3 3 3
Any output written to standard error will appear in myjob.err.
You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:
000 (342465.000.000) 01/11 09:30:17 Job submitted from host: <128.211.157.86:60004>
...
001 (342465.000.000) 01/11 09:30:21 Job executing on host: <128.211.157.10:36464?PrivNet=condor.ccb.purdue.edu>
...
005 (342465.000.000) 01/11 09:31:11 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:01, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:01, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
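Each event in myjob.log begins with a header line holding a three-digit event code, a job identifier in parentheses, and a date and time. As a rough sketch, assuming the layout seen in the sample log above and assuming the three dotted numbers are the cluster, process, and subprocess IDs of the job, a small C helper could pick these fields apart. The names LogEvent and parseEventHeader are hypothetical and are not part of Condor:

```c
#include <stdio.h>

/* Fields of one event header line, as laid out in the sample log. */
typedef struct {
    int code;                    /* event code, e.g. 0, 1, or 5 in the sample */
    int cluster, proc, subproc;  /* assumed meaning of the dotted job ID */
    int month, day;              /* date, written as MM/DD */
    int hour, minute, second;    /* time, written as HH:MM:SS */
} LogEvent;

/* Parses a line such as
 *   005 (342465.000.000) 01/11 09:31:11 Job terminated.
 * Returns 1 if all nine header fields were read, 0 otherwise
 * (separator lines and usage lines return 0). */
int parseEventHeader (const char *line, LogEvent *ev) {
    int n = sscanf(line, "%3d (%d.%d.%d) %d/%d %d:%d:%d",
                   &ev->code, &ev->cluster, &ev->proc, &ev->subproc,
                   &ev->month, &ev->day,
                   &ev->hour, &ev->minute, &ev->second);
    return n == 9;
}
```

In the sample log, code 000 marks submission, 001 the start of execution, and 005 termination, so a script built around such a helper could, for instance, report how long a job waited between the first two events.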
For more information about the MATLAB MEX-file:
A stand-alone MATLAB program is a C, C++, or Fortran program which calls user-written M-files and the same libraries which MATLAB uses. A stand-alone program has access to MATLAB objects, such as the array and matrix classes, as well as all the MATLAB algorithms. If you would like to implement performance-critical routines in C, C++, or Fortran and still call select MATLAB functions, a stand-alone MATLAB program may be a good option. This offers the possibility for substantially improved performance over MATLAB source code, especially for statements like for and while, while still allowing use of specialized MATLAB functions where useful.
This section illustrates how to submit a small stand-alone MATLAB program to BoilerGrid. This C example calls a compiled MATLAB script which ranks magic squares and another compiled MATLAB script which displays the ranks. This example, when executed, does not use the MATLAB interpreter, so it neither requires nor checks out a MATLAB license.
Prepare a MATLAB function which returns a vector of the ranks of the magic squares from 1 to n. Use an appropriate filename, here named mrank.m:
% FILENAME: mrank.m
function r = mrank(n)
r = zeros(n,1);
for k = 1:n
r(k) = rank(magic(k));
end
Prepare a second MATLAB function which displays a vector, the return value of function mrank. Use an appropriate filename, like printmatrix.m:
% FILENAME: printmatrix.m
function printmatrix(A)
disp(A)
end
Prepare a C source file with a main function and the necessary external function interface and give it an appropriate filename, here named myprogram.c. In a C program, you must invoke a MATLAB function by its "mangled" name. The C program invokes the MATLAB function mrank using the name mlfMrank and the MATLAB function printmatrix using the name mlfPrintmatrix. All MATLAB function names must be modified in this manner when called from outside MATLAB:
/* FILENAME: myprogram.c */
#include <stdio.h>
#include <math.h>
#include "Pkg.h"
int main (const int argc, char ** argv) {
mxArray *N; /* matrix containing n */
mxArray *R; /* result matrix */
int n=12; /* integer parameter (hard-coded here, not read from the command line) */
printf("Enter myprogram.c\n");
PkgInitialize(); /* call Pkg initialization */
/* Create a 1-by-1 matrix containing n. */
N = mxCreateDoubleMatrix(1, 1, mxREAL);
*mxGetPr(N) = n;
/* Call mlfMrank, the compiled version of mrank.m. */
mlfMrank(1,&R,N);
/* Print the results. */
mlfPrintmatrix(R);
/* Free the matrices allocated during this computation. */
mxDestroyArray(N);
mxDestroyArray(R);
PkgTerminate(); /* call Pkg termination */
printf("Exit myprogram.c\n");
return 0;
}
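The two mangled names used in myprogram.c follow a visible pattern: the prefix mlf, then the MATLAB function name with its first letter capitalized. The sketch below encodes only that pattern, as inferred from these two examples. It is an assumption that the pattern holds for other function names, so treat the generated header Pkg.h as the authority. The helper mlfName is hypothetical and is not part of the MATLAB Compiler:

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the "mlf" form of a MATLAB function name
 * into out, following the pattern seen in this guide
 * (mrank -> mlfMrank, printmatrix -> mlfPrintmatrix).
 * Returns 0 on success, -1 if the name is empty or out is too small. */
int mlfName (const char *matlabName, char *out, size_t outSize) {
    size_t len = strlen(matlabName);
    /* Need room for "mlf" (3), the name itself, and the terminating NUL. */
    if (len == 0 || len + 4 > outSize) {
        return -1;
    }
    /* Prefix, capitalized first letter, then the rest of the name. */
    snprintf(out, outSize, "mlf%c%s",
             toupper((unsigned char) matlabName[0]), matlabName + 1);
    return 0;
}
```

Called with "mrank" and a sufficiently large buffer, the helper produces "mlfMrank", which matches the name used in the listing above.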
To access the MATLAB Compiler mcc, load a MATLAB module. The MATLAB Compiler, mcc, depends on shared libraries from GCC Version 4.2.3. This version is not available on most of BoilerGrid, but GCC Version 4.2.4 is compatible:
$ module load matlab
$ module load gcc/4.2.4
To compile the stand-alone MATLAB program:
$ mcc -W lib:Pkg -T link:exe myprogram.c mrank printmatrix libmmfile.mlib -v
Several new files and one subdirectory appear after the compilation:
Pkg.c Pkg.ctf Pkg.exports Pkg.h Pkg.prj Pkg_mcc_component_data.c Pkg_mcr mccExcludedFiles.log myprogram readme.txt
The name of the compiled, stand-alone MATLAB program is myprogram.
Prepare a shell script which will run the stand-alone MATLAB program with an appropriate filename, here named myjob.sh:
#!/bin/sh
# FILENAME: myjob.sh
# A stand-alone program does not use the MATLAB interpreter,
# but it does need a shared library which comes from a
# compatible version of GCC.
module load gcc/4.2.4
# For additional information about your job submission, uncomment the following commands.
# hostname
# module list
# which gcc
myprogram
Change the permissions of the shell script to allow execution by the owner (you):
$ chmod u+x myjob.sh
Prepare a job submission file with an appropriate filename, here named myjob.sub. This example specifies the shell script myjob.sh as the executable, transfers the shell script to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):
# FILENAME: myjob.sub
universe = VANILLA
# Transfer the "executable" shell script myjob.sh to the compute node.
transfer_executable = TRUE
executable = myjob.sh
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
#
# Transfer the compiled code to the compute node. The file myprogram
# is the compiled version of myprogram.c, mrank.m, and printmatrix.m.
input = myprogram
output = myjob.out
error = myjob.err
log = myjob.log
queue
Submit the job:
$ condor_submit myjob.sub
View job status:
$ condor_q myusername
If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics:
View results in the file for all standard output, here named myjob.out:
Enter myprogram.c
1
2
3
3
5
5
7
3
9
7
11
3
Exit myprogram.c
View the standard error file, here named myjob.err:
pure virtual method called
You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:
000 (342501.000.000) 02/03 11:35:14 Job submitted from host: <128.211.157.86:60004>
...
001 (342501.000.000) 02/03 11:36:00 Job executing on host: <128.211.157.10:44356?PrivNet=condor.ccb.purdue.edu>
...
005 (342501.000.000) 02/03 11:36:08 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
For more information about the MATLAB stand-alone programs:
GNU Octave is a high-level, interpreted, programming language for numerical computations. The Octave interpreter is the part of Octave which reads M-files, oct-files, and MEX-files and executes Octave statements. Octave is a structured language (similar to C) and mostly compatible with MATLAB. You may use Octave to avoid the need for a MATLAB license, both during development and as a deployed application. By doing so, you may be able to run your application on more systems or more easily distribute it to others.
This section illustrates how to submit a small Octave job to BoilerGrid. This Octave example computes the inverse of a matrix.
Prepare an Octave-compatible M-file with an appropriate filename, here named myjob.m:
% FILENAME: myjob.m
% Invert matrix A.
A = [1 2 3; 4 5 6; 7 8 0]
inv(A)
quit
Prepare a shell script with an appropriate filename, here named myjob.sh:
#!/bin/sh
# FILENAME: myjob.sh
module load octave
# For additional information about your job submission, uncomment the following commands.
# hostname
# module list
# which octave
# Use the -q option to suppress startup messages.
# octave -q
octave
Change the permissions of the shell script to allow execution by the owner (you):
$ chmod u+x myjob.sh
Prepare a job submission file with an appropriate filename, here named myjob.sub. This example specifies the shell script myjob.sh as the executable, transfers the shell script and the user's current shell environment to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):
# FILENAME: myjob.sub
universe = VANILLA
# Copy my shell environment variables to the compute node.
# The compute node requires this to find the module command.
getenv = TRUE
# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable = myjob.sh
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
input = myjob.m
output = myjob.out
error = myjob.err
log = myjob.log
queue
Submit the job:
$ condor_submit myjob.sub
View job status:
$ condor_q myusername
If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics:
View results in the file for all standard output, here named myjob.out:
A =
1 2 3
4 5 6
7 8 0
ans =
-1.77778 0.88889 -0.11111
1.55556 -0.77778 0.22222
-0.11111 0.22222 -0.11111
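The values in this output can be checked by hand. The determinant of A is 27, so the inverse is the adjugate of A divided by 27, for example -48/27 = -1.77778 in the top-left corner. The sketch below carries out the same arithmetic in plain C using the adjugate formula for a 3-by-3 matrix. It is an independent check and not how Octave computes inv. The function name invert3x3 is hypothetical:

```c
/* Inverts a 3x3 matrix m into inv using the adjugate formula.
 * Returns 0 on success, -1 if the determinant is exactly zero. */
int invert3x3 (double m[3][3], double inv[3][3]) {
    /* Cofactors of the first row, also used for the determinant. */
    double c00 = m[1][1]*m[2][2] - m[1][2]*m[2][1];
    double c01 = m[1][2]*m[2][0] - m[1][0]*m[2][2];
    double c02 = m[1][0]*m[2][1] - m[1][1]*m[2][0];
    double det = m[0][0]*c00 + m[0][1]*c01 + m[0][2]*c02;
    if (det == 0.0) {
        return -1;
    }
    /* Each entry of the inverse is a transposed cofactor over det. */
    inv[0][0] = c00 / det;
    inv[0][1] = (m[0][2]*m[2][1] - m[0][1]*m[2][2]) / det;
    inv[0][2] = (m[0][1]*m[1][2] - m[0][2]*m[1][1]) / det;
    inv[1][0] = c01 / det;
    inv[1][1] = (m[0][0]*m[2][2] - m[0][2]*m[2][0]) / det;
    inv[1][2] = (m[0][2]*m[1][0] - m[0][0]*m[1][2]) / det;
    inv[2][0] = c02 / det;
    inv[2][1] = (m[0][1]*m[2][0] - m[0][0]*m[2][1]) / det;
    inv[2][2] = (m[0][0]*m[1][1] - m[0][1]*m[1][0]) / det;
    return 0;
}
```

For the matrix A of this example, the first row of the result is -48/27, 24/27, and -3/27, which agrees with the first row printed by Octave.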
You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:
000 (746978.000.000) 12/17 12:49:44 Job submitted from host: <128.211.157.86:49400>
...
001 (746978.000.000) 12/17 12:57:58 Job executing on host: <128.211.157.10:54256?PrivNet=condor.ccb.purdue.edu>
...
005 (746978.000.000) 12/17 12:58:12 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
Any output written to standard error will appear in myjob.err.
For more information about Octave:
Octave does not offer a compiler to translate an M-file into an executable file for additional speed or distribution. You may wish to consider recoding an M-file as either an oct-file or a stand-alone program.
An oct-file is an "Octave Executable". It offers a way for Octave code to call functions written in C, C++, or Fortran as though these external functions were built-in Octave functions. You may wish to use an oct-file if you would like to call an existing C, C++, or Fortran function directly from Octave rather than reimplementing that code as an Octave function. Also, by implementing performance-critical routines in C, C++, or Fortran rather than Octave, you may be able to substantially improve performance over Octave source code, especially for statements like for and while.
This section illustrates how to submit a small Octave job with an oct-file to BoilerGrid. This Octave example calls a C function which adds two matrices.
Prepare a complicated and time-consuming computation in the form of a C, C++, or Fortran function. In this example, the computation is a C function which adds two matrices:
/* Computational Routine */
void matrixSum (double *a, double *b, double *c, int n) {
int i;
/* Component-wise addition. */
for (i=0; i<n; i++) {
c[i] = a[i] + b[i];
}
}
Combine the computational routine with an oct-file, which contains the necessary external function interface of Octave. The name of the file is matrixSum.cc:
/***********************************************************
* FILENAME: matrixSum.cc
*
* Adds two MxN arrays (inMatrix).
* Outputs one MxN array (outMatrix).
*
* The calling syntax is:
*
* outMatrix = matrixSum (inMatrix, inMatrix)
*
* This is an oct-file for Octave.
*
**********************************************************/
#include <octave/oct.h>
/* Computational Routine */
void matrixSum (double *a, double *b, double *c, int n) {
int i;
/* Component-wise addition. */
for (i=0; i<n; i++) {
c[i] = a[i] + b[i];
}
}
/* Gateway Function */
DEFUN_DLD (matrixSum, args, nargout, "matrixSum: A + B") {
NDArray inMatrix_a; /* mxn input matrix */
NDArray inMatrix_b; /* mxn input matrix */
int nrows_a,ncols_a; /* size of matrix a */
int nrows_b,ncols_b; /* size of matrix b */
NDArray outMatrix_c; /* mxn output matrix */
/* Check for proper number of input arguments */
if (args.length() != 2) {
printf("matrixSum: two inputs required.");
exit(-1);
}
/* Check for proper number of output arguments */
if (nargout != 1) {
printf("matrixSum: one output required.");
exit(-1);
}
/* Check that both input matrices are real matrices. */
if (!args(0).is_real_matrix()) {
printf("matrixSum: expecting LHS (arg 1) to be a real matrix");
exit(-1);
}
if (!args(1).is_real_matrix()) {
printf("matrixSum: expecting RHS (arg 2) to be a real matrix");
exit(-1);
}
/* Get dimensions of the first input matrix */
nrows_a = args(0).rows();
ncols_a = args(0).columns();
/* Get dimensions of the second input matrix */
nrows_b = args(1).rows();
ncols_b = args(1).columns();
/* Check for equal number of rows. */
if(nrows_a != nrows_b) {
printf("matrixSum: unequal number of rows.");
exit(-1);
}
/* Check for equal number of columns. */
if(ncols_a != ncols_b) {
printf("matrixSum: unequal number of columns.");
exit(-1);
}
/* Get the first input matrix as an NDArray */
inMatrix_a = args(0).array_value();
/* Get the second input matrix as an NDArray */
inMatrix_b = args(1).array_value();
/* Construct output matrix as a copy of the first input matrix. */
outMatrix_c = args(0).array_value();
/* Call the computational routine. */
double* ptr_a = inMatrix_a.fortran_vec();
double* ptr_b = inMatrix_b.fortran_vec();
double* ptr_c = outMatrix_c.fortran_vec();
matrixSum(ptr_a,ptr_b,ptr_c,nrows_a*ncols_a);
return octave_value(outMatrix_c);
}
To access the Octave utility mkoctfile, load an Octave module. Loading Octave also loads a compatible GCC:
$ module load octave
To compile matrixSum.cc into an oct-file:
$ mkoctfile matrixSum.cc
Two new files appear after the compilation:
matrixSum.o matrixSum.oct
The name of the Octave-callable oct-file is matrixSum.oct.
Prepare an Octave-compatible M-file with an appropriate filename, here named myjob.m:
% FILENAME: myjob.m
% Call the separately compiled and dynamically linked oct-file.
A = [1,1,1;1,1,1]
B = [2,2,2;2,2,2]
C = matrixSum(A,B)
quit
Prepare a shell script with an appropriate filename, here named myjob.sh:
#!/bin/sh
# FILENAME: myjob.sh
module load octave
# For additional information about your job submission,
# uncomment the following commands.
# hostname
# module list
# which octave
# Use the -q option to suppress startup messages.
# octave -q
octave
Change the permissions of the shell script to allow execution by the owner (you):
$ chmod u+x myjob.sh
Prepare a job submission file with an appropriate filename, here named myjob.sub. This example specifies the shell script myjob.sh as the executable, transfers the shell script and the user's current shell environment to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):
# FILENAME: myjob.sub
universe = VANILLA
# Copy my shell environment variables to the compute node.
# The compute node requires this to find the module command.
getenv = TRUE
# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable = myjob.sh
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
input = myjob.m
output = myjob.out
error = myjob.err
log = myjob.log
#
# Transfer oct-file matrixSum.oct to the compute node.
transfer_input_files = matrixSum.oct
queue
Submit the job:
$ condor_submit myjob.sub
View job status:
$ condor_q myusername
If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics:
View results in the file for all standard output, here named myjob.out:
A =
1 1 1
1 1 1
B =
2 2 2
2 2 2
C =
3 3 3
3 3 3
You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:
000 (747006.000.000) 01/13 08:09:56 Job submitted from host: <128.211.157.86:42696>
...
001 (747006.000.000) 01/13 08:10:22 Job executing on host: <128.211.157.10:35807?PrivNet=condor.ccb.purdue.edu>
...
006 (747006.000.000) 01/13 08:10:31 Image size of job updated: 99404
...
005 (747006.000.000) 01/13 08:10:40 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
Any output written to standard error will appear in myjob.err.
For more information about Octave oct-files:
A stand-alone program is a C, C++, or Fortran program which calls user-written oct-files and the same libraries that Octave uses. A stand-alone program has access to Octave objects, such as the array and matrix classes, as well as all the Octave algorithms. If you would like to implement performance-critical routines in C, C++, or Fortran and still call select Octave functions, a stand-alone Octave program may be a good option. This offers the possibility for substantially improved performance over Octave source code, especially for statements like for and while.
This section illustrates how to submit a small stand-alone program which calls Octave to BoilerGrid. This C++ example uses class Matrix and calls an Octave script which prints a message.
Prepare a C++ function file with the necessary external function interface and with an appropriate filename, here named hello.cc:
// FILENAME: hello.cc
#include <iostream>
#include <octave/oct.h>
#include <octave/octave.h>
#include <octave/parse.h>
#include <octave/toplev.h> /* do_octave_atexit */
int main (const int argc, char ** argv) {
const char * argvv [] = {"" /* name of program, not relevant */, "--silent"};
octave_main (2, (char **) argvv, true /* embedded */);
/* Display the start of this program. */
std::cout << "hello.cc: hello, world" << std::endl;
/* Invoke hello.m */
const octave_value_list result = feval ("hello");
/* Define an Octave Matrix. */
int n = 2;
Matrix a_matrix = Matrix (1,2);
a_matrix (0,0) = 888;
a_matrix (0,1) = 999;
std::cout << "hello.cc: " << a_matrix;
do_octave_atexit ();
}
Prepare an Octave-compatible M-file with an appropriate filename, here named hello.m:
% FILENAME: hello.m
disp('hello.m : hello, world')
To access the Octave utility mkoctfile, load an Octave module. Loading Octave also loads a compatible GCC:
$ module load octave
To compile the stand-alone Octave program:
$ mkoctfile --link-stand-alone hello.cc -o hello
Two new files appear after the compilation:
hello hello.o
The name of the stand-alone program is hello.
Prepare a shell script with an appropriate filename, here named myjob.sh:
#!/bin/sh
# FILENAME: myjob.sh
# A stand-alone program does not use the Octave interpreter,
# but it does need a shared library which comes from GCC.
module load gcc
# For additional information about your job submission,
# uncomment the following commands.
# hostname
# module list
# which gcc
hello
Change the permissions of the shell script to allow execution by the owner (you):
$ chmod u+x myjob.sh
Prepare the job submission file with an appropriate filename, here named myjob.sub. This example specifies a shell script myjob.sh as the executable, transfers the shell script and the user's current shell environment to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):
# FILENAME: myjob.sub
universe = VANILLA
# Copy my shell environment variables to the compute node.
# The compute node requires this to find the module command.
getenv = TRUE
# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable = myjob.sh
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
#
# Transfer the compiled code to the compute node. The file hello
# is the compiled version of hello.cc.
input = hello
output = myjob.out
error = myjob.err
log = myjob.log
queue
Submit the job:
$ condor_submit myjob.sub
View job status:
$ condor_q myusername
If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics:
View results in the file for all standard output, here named myjob.out:
hello.cc: hello, world
hello.m: hello, world
hello.cc: 888 999
You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:
000 (747012.000.000) 01/14 08:19:57 Job submitted from host: <128.211.157.86:42696>
...
001 (747012.000.000) 01/14 08:20:36 Job executing on host: <128.211.157.10:53050?PrivNet=condor.ccb.purdue.edu>
...
005 (747012.000.000) 01/14 08:20:37 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
Any output written to standard error will appear in myjob.err.
For more information about the Octave stand-alone program:
MEX stands for "MATLAB Executable". A MEX-file offers a way for MATLAB code to call functions written in C, C++, or Fortran as though these external functions were built-in MATLAB functions. You may wish to use a MEX-file if you would like to call an existing C, C++, or Fortran function directly from MATLAB rather than reimplementing that code as a MATLAB function. Also, by implementing performance-critical routines in C, C++, or Fortran rather than MATLAB, you may be able to substantially improve performance over MATLAB source code, especially for statements like for and while.
Octave includes an interface which can link compiled legacy MEX-files. This interface allows sharing code between Octave and MATLAB users. In Octave, an oct-file will always perform better than a MEX-file, so you should write new code using the oct-file interface, if possible. However, you may test a new MEX-file in Octave then use it in a MATLAB application.
This section illustrates how to submit a small Octave job with a MEX-file to BoilerGrid. This Octave example calls a C function which adds two matrices.
Prepare a complicated and time-consuming computation in the form of a C, C++, or Fortran function. In this example, the computation is a C function which adds two matrices:
/* Computational Routine */
void matrixSum (double *a, double *b, double *c, int n) {
int i;
/* Component-wise addition. */
for (i=0; i<n; i++) {
c[i] = a[i] + b[i];
}
}
Combine the computational routine with a MEX-file, which contains the necessary external function interface of Octave. In the computational routine, change int to mwSize. The name of the file is matrixSum.c:
/*************************************************************
* FILENAME: matrixSum.c
*
* Adds two MxN arrays (inMatrix).
* Outputs one MxN array (outMatrix).
*
* The calling syntax is:
*
* outMatrix = matrixSum(inMatrix, inMatrix)
*
* This is a MEX-file which Octave will execute.
*
**************************************************************/
#include "mex.h"
/* Computational Routine */
void matrixSum (double *a, double *b, double *c, mwSize n) {
mwSize i;
/* Component-wise addition. */
for (i=0; i<n; i++) {
c[i] = a[i] + b[i];
}
}
/* Gateway Function */
void mexFunction (int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[]) {
double *inMatrix_a; /* mxn input matrix */
double *inMatrix_b; /* mxn input matrix */
mwSize nrows_a,ncols_a; /* size of matrix a */
mwSize nrows_b,ncols_b; /* size of matrix b */
double *outMatrix_c; /* mxn output matrix */
/* Check for proper number of arguments */
if(nrhs!=2) {
mexErrMsgIdAndTxt("MyToolbox:matrixSum:nrhs","Two inputs required.");
}
if(nlhs!=1) {
mexErrMsgIdAndTxt("MyToolbox:matrixSum:nlhs","One output required.");
}
/* Get dimensions of the first input matrix */
nrows_a = mxGetM(prhs[0]);
ncols_a = mxGetN(prhs[0]);
/* Get dimensions of the second input matrix */
nrows_b = mxGetM(prhs[1]);
ncols_b = mxGetN(prhs[1]);
/* Check for equal number of rows. */
if(nrows_a != nrows_b) {
mexErrMsgIdAndTxt("MyToolbox:matrixSum:notEqual","Unequal number of rows.");
}
/* Check for equal number of columns. */
if(ncols_a != ncols_b) {
mexErrMsgIdAndTxt("MyToolbox:matrixSum:notEqual","Unequal number of columns.");
}
/* Make a pointer to the real data in the first input matrix */
inMatrix_a = mxGetPr(prhs[0]);
/* Make a pointer to the real data in the second input matrix */
inMatrix_b = mxGetPr(prhs[1]);
/* Make the output matrix */
plhs[0] = mxCreateDoubleMatrix(nrows_a,ncols_a,mxREAL);
/* Make a pointer to the real data in the output matrix */
outMatrix_c = mxGetPr(plhs[0]);
/* Call the computational routine */
matrixSum(inMatrix_a,inMatrix_b,outMatrix_c,nrows_a*ncols_a);
}
To access the Octave utility mkoctfile, load an Octave module. Loading Octave also loads a compatible GCC:
$ module load octave
To compile matrixSum.c into a MEX-file:
$ mkoctfile --mex matrixSum.c
Two new files appear after the compilation:
matrixSum.mex matrixSum.o
The name of the Octave-callable MEX-file is matrixSum.mex.
Prepare an Octave-compatible M-file with an appropriate filename, here named myjob.m:
% FILENAME: myjob.m
% Call the separately compiled and dynamically linked MEX-file.
A = [1,1,1;1,1,1]
B = [2,2,2;2,2,2]
C = matrixSum(A,B)
quit
Prepare a shell script with an appropriate filename, here named myjob.sh:
#!/bin/sh
# FILENAME: myjob.sh
module load octave
# For additional information about your job submission,
# uncomment the following commands.
# hostname
# module list
# which octave
# Use the -q option to suppress startup messages.
# octave -q
octave
Change the permissions of the shell script to allow execution by the owner (you):
$ chmod u+x myjob.sh
Prepare a job submission file with an appropriate filename, here named myjob.sub. This example specifies the shell script myjob.sh as the executable, transfers the shell script and the user's current shell environment to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):
# FILENAME: myjob.sub
universe = VANILLA
# Copy my shell environment variables to the compute node.
# The compute node requires this to find the module command.
getenv = TRUE
# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable = myjob.sh
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
input = myjob.m
output = myjob.out
error = myjob.err
log = myjob.log
#
# Transfer the MEX-file matrixSum.mex to the compute node.
transfer_input_files = matrixSum.mex
queue
Submit the job:
$ condor_submit myjob.sub
View job status:
$ condor_q myusername
If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics.
View results in the file for all standard output, here named myjob.out:
A =
   1   1   1
   1   1   1
B =
   2   2   2
   2   2   2
C =
   3   3   3
   3   3   3
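The result above can be cross-checked without Octave. The following is a minimal sketch in Python 3, assuming that matrixSum adds its two inputs element by element, as the output suggests; matrix_sum is a hypothetical helper name and is not part of the MEX-file.

```python
def matrix_sum(a, b):
    """Element-wise sum of two equally shaped matrices given as lists of rows."""
    same_rows = len(a) == len(b)
    if not same_rows or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    # Pair up rows, then pair up the entries within each row.
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


A = [[1, 1, 1], [1, 1, 1]]
B = [[2, 2, 2], [2, 2, 2]]
C = matrix_sum(A, B)
for row in C:
    print(row)
```

Each row of C should match the corresponding row printed in myjob.out.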
You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:
000 (342475.000.000) 01/19 10:03:55 Job submitted from host: <128.211.157.86:60004>
...
001 (342475.000.000) 01/19 10:06:01 Job executing on host: <128.211.157.10:33917?PrivNet=condor.ccb.purdue.edu>
...
006 (342475.000.000) 01/19 10:06:10 Image size of job updated: 99616
...
005 (342475.000.000) 01/19 10:06:14 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
Any output written to standard error will appear in myjob.err.
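When many jobs are submitted, reading each log file by eye becomes tedious. The following is a minimal sketch in Python 3 that extracts the event codes, the return value, and the byte counters from log text in the format shown above. The function name summarize_log and the regular expressions are assumptions based only on this sample, so other Condor versions or event types may need adjustments.

```python
import re

# Event header: three-digit event code, (cluster.proc.subproc), date, time, message.
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\S+) (\S+) (.*)$")
# Exit status line that follows a "Job terminated." event.
RETURN_RE = re.compile(r"Normal termination \(return value (-?\d+)\)")
# Counter lines such as "0 - Run Bytes Sent By Job".
BYTES_RE = re.compile(r"^\s*(\d+)\s+-\s+(.+ Bytes .+ By Job)\s*$")


def summarize_log(text):
    """Return the event codes seen, the return value, and the byte counters."""
    summary = {"events": [], "return_value": None, "bytes": {}}
    for line in text.splitlines():
        header = EVENT_RE.match(line)
        if header:
            summary["events"].append(header.group(1))
            continue
        status = RETURN_RE.search(line)
        if status:
            summary["return_value"] = int(status.group(1))
            continue
        counter = BYTES_RE.match(line)
        if counter:
            summary["bytes"][counter.group(2)] = int(counter.group(1))
    return summary
```

To use it, read the log file and pass its contents, for example summarize_log(open("myjob.log").read()). A return value of 0 together with event code 005 corresponds to the normal termination shown above.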
For more information about the Octave-compatible MEX-file:
Perl is a high-level, general-purpose, interpreted, dynamic programming language offering powerful text processing features. This section illustrates how to submit a small Perl job to BoilerGrid. This Perl example prints a single line of text.
Prepare a Perl input file with an appropriate filename, here named myjob.in:
# FILENAME: myjob.in
print "hello, world\n"
The absolute path of Perl is the same on all Linux platforms. This allows using the absolute path to Perl in the job submission file. Also, consider including both 32-bit and 64-bit Linux platforms as candidates to run the job.
To discover the number of Linux platforms which can run Perl:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(OpSys == "LINUX")'
              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
INTEL/LINUX      92     12        4         52        0           0        24
X86_64/LINUX  28876  19310     3480       6085        1           0         0
Total         28968  19322     3484       6137        1           0        24
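The totals table can also be read by a script, for example to decide whether enough unclaimed slots exist before submitting a large batch. The following is a minimal sketch in Python 3, assuming the whitespace-separated layout shown above, in which the first column (the platform label) has no heading. The names parse_totals and unclaimed_fraction are hypothetical helpers, not Condor tools.

```python
def parse_totals(text):
    """Parse the table printed by `condor_status -total` into nested dicts."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header = rows[0]  # column names; the row label column has no heading
    table = {}
    for fields in rows[1:]:
        label = fields[0]
        counts = [int(value) for value in fields[1:]]
        table[label] = dict(zip(header, counts))
    return table


def unclaimed_fraction(table):
    """Fraction of all listed slots that are currently unclaimed."""
    totals = table["Total"]
    return totals["Unclaimed"] / totals["Total"]
```

With the sample output above, parse_totals(text)["Total"]["Unclaimed"] is 6137, which is roughly a fifth of the 28968 slots listed.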
Prepare the job submission file with an appropriate filename, here named myjob.sub. This example specifies the absolute path of the executable file, selects candidate compute nodes among the 32-bit and 64-bit Linux platforms, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):
# FILENAME: myjob.sub
universe = VANILLA
requirements = ((Arch=="x86_64") || (Arch=="INTEL")) && (OpSys=="LINUX")
# Run the executable already installed on the compute node.
transfer_executable = FALSE
executable = /usr/bin/perl
# Use the -w option to issue warnings.
arguments = -w
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
input = myjob.in
output = myjob.out
error = myjob.err
log = myjob.log
queue
Submit the job:
$ condor_submit myjob.sub
View job status:
$ condor_q myusername
If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics.
View results in the file for all standard output, here named myjob.out:
hello, world
You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:
000 (746987.000.000) 12/20 13:53:59 Job submitted from host: <128.211.157.86:49400>
...
001 (746987.000.000) 12/20 13:55:52 Job executing on host: <128.211.157.10:50656?PrivNet=condor.ccb.purdue.edu>
...
005 (746987.000.000) 12/20 13:55:52 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
Any output written to standard error will appear in myjob.err.
For more information about Perl:
Python is a high-level, general-purpose, interpreted, dynamic programming language offering powerful text processing features. This section illustrates how to submit a small Python job to BoilerGrid. This Python example prints a single line of text.
Prepare a Python input file with an appropriate filename, here named myjob.in:
#!/usr/bin/python
# FILENAME: myjob.in
import string, sys
print "hello, world"
The absolute path of Python is the same on all Linux platforms. This allows using the absolute path to Python in the job submission file. Also, consider including both 32-bit and 64-bit Linux platforms as candidates to run the job.
To discover the number of Linux platforms which can run Python:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(OpSys == "LINUX")'
              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
INTEL/LINUX      92     12        4         52        0           0        24
X86_64/LINUX  28876  19310     3480       6085        1           0         0
Total         28968  19322     3484       6137        1           0        24
Prepare the job submission file with an appropriate filename, here named myjob.sub. This example specifies the absolute path of the executable file, selects candidate compute nodes among the 32-bit and 64-bit Linux platforms, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):
# FILENAME: myjob.sub
universe = VANILLA
requirements = ((Arch=="x86_64") || (Arch=="INTEL")) && (OpSys=="LINUX")
# Run the executable already installed on the compute node.
transfer_executable = FALSE
executable = /usr/bin/python
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
input = myjob.in
output = myjob.out
error = myjob.err
log = myjob.log
queue
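Before submitting, it can help to check locally how the input, output, and error entries of the submission file map onto the interpreter's standard streams. The following is a minimal sketch, assuming Python 3 on the submit host. The helpers dry_run and demo are hypothetical names, not part of Condor, and the run happens on the local machine only; it imitates the stream redirection, not Condor's scheduling or file transfer.

```python
import os
import subprocess
import sys
import tempfile


def dry_run(input_path, output_path, error_path, argv=None):
    """Run an interpreter with stdin, stdout, and stderr redirected to files.

    argv defaults to the Python interpreter running this sketch; pass
    ["/usr/bin/python"] or ["/usr/bin/perl", "-w"] to mirror a submission file.
    """
    if argv is None:
        argv = [sys.executable]
    with open(input_path, "rb") as stdin, \
         open(output_path, "wb") as stdout, \
         open(error_path, "wb") as stderr:
        completed = subprocess.run(argv, stdin=stdin, stdout=stdout, stderr=stderr)
    return completed.returncode


def demo():
    """Write a one-line script, run it through dry_run, and return its stdout."""
    with tempfile.TemporaryDirectory() as workdir:
        paths = {name: os.path.join(workdir, "myjob." + name)
                 for name in ("in", "out", "err")}
        with open(paths["in"], "w") as script:
            script.write('print("hello, world")\n')
        dry_run(paths["in"], paths["out"], paths["err"])
        with open(paths["out"]) as captured:
            return captured.read()
```

For the files in this section, a call such as dry_run("myjob.in", "myjob.out", "myjob.err", argv=["/usr/bin/python"]) would feed myjob.in to the interpreter on standard input, which is how the input entry delivers the script in the submission file above.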
Submit the job:
$ condor_submit myjob.sub
View job status:
$ condor_q myusername
If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics.
View results in the file for all standard output, here named myjob.out:
hello, world
You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:
000 (342437.000.000) 12/21 10:13:04 Job submitted from host: <128.211.157.86:41603>
...
001 (342437.000.000) 12/21 10:14:43 Job executing on host: <128.211.157.10:34840?PrivNet=condor.ccb.purdue.edu>
...
005 (342437.000.000) 12/21 10:14:43 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
Any output written to standard error will appear in myjob.err.
For more information about Python:
R, a GNU project, is a language and environment for statistics and graphics. It is an open source version of the S programming language. This section illustrates how to submit a small R job to BoilerGrid. This R example computes a Pythagorean triple.
Prepare an R input file with an appropriate filename, here named myjob.in:
# FILENAME: myjob.in

# Compute a Pythagorean triple.
a = 3
b = 4
c = sqrt(a*a + b*b)
c # display result
Use the ClassAd attribute "HAS_R" to discover how many compute nodes advertise R:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch=="X86_64") && (OpSys=="LINUX") && (HAS_R==True)'
              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
X86_64/LINUX  25416  16524     1781       7111        0           0         0
Total         25416  16524     1781       7111        0           0         0
The absolute path of R is not the same on all clusters. The ClassAd attribute R_EXE handles this discrepancy. To see the three values of R_EXE:
$ condor_status -pool boilergrid.rcac.purdue.edu -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (HAS_R==True)' -format "%s\n" R_EXE > myfile
The three values of ClassAd attribute R_EXE:
/apps/rhel5/R-2.10.0/bin/R
/apps/steele/R-2.9.0/bin/R
/apps/coates/R-2.9.0/bin/R
The three values include two different versions of R. There is a chance that different versions will run your job during your project.
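The condor_status command above writes one R_EXE line per matching slot, so the output file holds many repeated paths. The following is a minimal sketch in Python 3 for reducing that file to its distinct paths and R versions. The names count_paths and versions are hypothetical helpers, and the version pattern assumes paths of the form .../R-x.y.z/bin/R as in the three values shown.

```python
import re
from collections import Counter

# Matches the version component in paths such as /apps/rhel5/R-2.10.0/bin/R.
VERSION_RE = re.compile(r"/R-(\d+(?:\.\d+)*)/")


def count_paths(lines):
    """Count how many slots advertise each R_EXE path, ignoring blank lines."""
    return Counter(line.strip() for line in lines if line.strip())


def versions(paths):
    """Return the set of R version strings found in the given paths."""
    found = set()
    for path in paths:
        match = VERSION_RE.search(path)
        if match:
            found.add(match.group(1))
    return found
```

For example, opening the output file and calling count_paths on the file object gives a count per distinct path, and versions(counts) then lists the R versions a job might encounter.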
The existence of cluster-specific paths suggests using the ClassAd attribute R_EXE rather than an absolute path; however, R also requires that a shared library be loaded. So, this method uses module load in a shell script.
Prepare a shell script with an appropriate filename, here named myjob.sh:
#!/bin/sh
# FILENAME: myjob.sh
module load R
# For additional information about your job submission, uncomment the following commands.
# hostname
# module list
# which R
# --vanilla:
# --no-save: do not save datasets at the end of an R session
R --vanilla --no-save
Change the permissions of the shell script to allow execution by the owner (you):
$ chmod u+x myjob.sh
Prepare a job submission file with an appropriate filename, here named myjob.sub. This example uses the ClassAd attribute HAS_R to locate a compute node, specifies the shell script myjob.sh as the executable, transfers the shell script and the user's current shell environment to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):
# FILENAME: myjob.sub
universe = VANILLA
requirements = (HAS_R==TRUE)
# Copy my shell environment variables to the compute node.
# The compute node requires this to find the module command.
getenv = TRUE
# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable = myjob.sh
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
input = myjob.in
output = myjob.out
error = myjob.err
log = myjob.log
queue
Submit the job:
$ condor_submit myjob.sub
View job status:
$ condor_q myusername
If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics.
View results in the file for all standard output, here named myjob.out:
R version 2.9.0 (2009-04-17)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> # FILENAME: myjob.in
>
> # Compute a Pythagorean triple.
> a = 3
> b = 4
> c = sqrt(a*a + b*b)
> c # display result
[1] 5
>
You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:
000 (747004.000.000) 01/06 14:16:52 Job submitted from host: <128.211.157.86:49041>
...
001 (747004.000.000) 01/06 14:18:30 Job executing on host: <128.211.157.10:45461?PrivNet=condor.ccb.purdue.edu>
...
005 (747004.000.000) 01/06 14:18:35 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
Any output written to standard error will appear in myjob.err.
For more information about R:
SAS is an integrated system supporting statistical analysis, report generation, business planning, and forecasting. This section illustrates how to submit a small SAS job to BoilerGrid. This SAS example displays a small dataset.
Prepare a SAS input file with an appropriate filename, here named myjob.sas:
* FILENAME: myjob.sas

/* Display a small dataset. */
TITLE 'Display a Small Dataset';
DATA grades;
INPUT name $ midterm final;
DATALINES;
Anne 61 64
Bob 71 71
Carla 86 80
David 79 77
Edwardo 73 73
Fannie 81 81
;
PROC PRINT data=grades;
RUN;
Prepare a shell script with an appropriate filename, here named myjob.sh:
#!/bin/sh
# FILENAME: myjob.sh
module load sas
# For additional information about your job submission, uncomment the following commands.
# hostname
# module list
# which sas
# -stdio: run SAS in batch mode:
#   read SAS input from stdin
#   write SAS output to stdout
#   write SAS log to stderr
# -nonews: do not display SAS news
sas -stdio -nonews
The SAS command-line option -stdio uses standard I/O in the normal fashion. Using this option sends the SAS log file to stderr and avoids any conflict between the SAS log file and the Condor log file.
Change the permissions of the shell script to allow execution by the owner (you):
$ chmod u+x myjob.sh
Prepare a job submission file with an appropriate filename, here named myjob.sub. This example specifies the shell script myjob.sh as the executable, transfers the shell script and the user's current shell environment to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):
# FILENAME: myjob.sub
universe = VANILLA
# Copy my shell environment variables to the compute node.
# The compute node requires this to find the module command.
getenv = TRUE
# SAS needs the environment variable HOME set to your home directory.
environment = HOME=myhomedirectory
# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable = myjob.sh
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
input = myjob.sas
output = myjob.out
error = myjob.err
log = myjob.log
queue
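The placeholder myhomedirectory in the environment entry must be replaced with your actual home directory. The following is a minimal sketch in Python 3 that fills in the placeholder and returns the text of the submission file. TEMPLATE and render_submit are hypothetical names, and the sketch assumes a home directory path without spaces or quote characters, which would need additional quoting in the environment entry.

```python
import os

# Same entries as the submission file above, with HOME left as a placeholder.
TEMPLATE = """\
# FILENAME: myjob.sub
universe = VANILLA
getenv = TRUE
environment = HOME={home}
transfer_executable = TRUE
executable = myjob.sh
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
input = myjob.sas
output = myjob.out
error = myjob.err
log = myjob.log
queue
"""


def render_submit(home=None):
    """Return the submission file text with HOME set to the given directory."""
    if home is None:
        # Default to the home directory of the user running this sketch.
        home = os.path.expanduser("~")
    return TEMPLATE.format(home=home)
```

Writing the result of render_submit() to myjob.sub produces a file equivalent to the one above, with the home directory of the submitting user filled in.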
Submit the job:
$ condor_submit myjob.sub
View job status:
$ condor_q myusername
If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics.
View results in the file for all standard output, here named myjob.out:
                 The SAS System        11:22 Wednesday, January 5, 2011   1
Obs    name       midterm    final
  1    Anne          61        64
  2    Bob           71        71
  3    Carla         86        80
  4    David         79        77
  5    Edwardo       73        73
  6    Fannie        81        81
You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:
000 (747003.000.000) 01/05 11:21:35 Job submitted from host: <128.211.157.86:49041>
...
001 (747003.000.000) 01/05 11:22:02 Job executing on host: <128.211.157.10:46641?PrivNet=condor.ccb.purdue.edu>
...
005 (747003.000.000) 01/05 11:22:04 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
View the SAS log in the standard error file, here named myjob.err:
1 The SAS System 11:22 Wednesday, January 5, 2011
NOTE: Copyright (c) 2002-2008 by SAS Institute Inc., Cary, NC, USA.
NOTE: SAS (r) Proprietary Software 9.2 (TS2M0)
Licensed to PURDUE UNIVERSITY - T&R, Site 70063312.
NOTE: This session is executing on the Linux 2.6.18-194.17.1.el5rcac2 (LINUX) platform.
NOTE: SAS initialization used:
real time 0.06 seconds
cpu time 0.02 seconds
1 * FILENAME: myjob.in
2
3 /* Display a small dataset. */
4 TITLE 'Display a Small Dataset';
5 DATA grades;
6 INPUT name $ midterm final;
7 DATALINES;
NOTE: The data set WORK.GRADES has 6 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
14 ;
15 PROC PRINT data=grades;
16 RUN;
NOTE: There were 6 observations read from the data set WORK.GRADES.
NOTE: The PROCEDURE PRINT printed page 1.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.68 seconds
cpu time 0.03 seconds
NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
NOTE: The SAS System used:
real time 0.77 seconds
cpu time 0.06 seconds
For more information about SAS:
Even though a Condor pool usually contains machines owned by many different people, collaborating researchers from different organizations often do not consider it feasible to combine all of their computers into a single Condor pool. The solution is to create multiple Condor pools and allow flocking between them. Jobs may then flock (migrate) from one pool to another based on the availability of compute nodes. If your local Condor pool does not have any available machines to run your job, the job may flock to another pool. You do not need to do anything special to enable this for your jobs; it happens automatically.
If you would like to learn more about how this works, see the Grid Computing Chapter of the Condor Users' Manual.
There are currently no FAQs for BoilerGrid.