BoilerGrid - Complete User Guide

Conventions Used in this Document

This document follows certain typesetting and naming conventions:

  • Colored, underlined text indicates a link.
  • Colored, bold text highlights something of particular importance.
  • Italicized text notes the first use of a key concept or term.
  • Bold, fixed-width font text indicates a command or command argument that you type verbatim.
  • Examples of commands and output as you would see them on the command line will appear in colored blocks of fixed-width text such as this:
    $ example
    This is an example of commands and output.
    
  • All command line shell prompts appear as a single dollar sign ("$"). Your actual shell prompt may differ.
  • All examples work with bash or ksh shells. Where different, changes needed for tcsh or csh shell users appear in example comments.
  • All names that begin with "my" illustrate examples that you replace with an appropriate name. These include "myusername", "myfilename", "mydirectory", "myjobid", etc.
  • The term "processor core" or "core" throughout this guide refers to the individual CPU cores on a processor chip. All RCAC systems schedule jobs on the basis of these processor cores, and not the physical processor chips. For example, no distinction would be made between a dual-processor, single-core machine and a single-processor, dual-core machine, as both contain a total of two processor cores.
  • The term "compute node" is a synonym for processor chip.

Overview of BoilerGrid

BoilerGrid is a large, high-throughput, distributed computing system operated by the Rosen Center for Advanced Computing (RCAC) and using the Condor system developed by the Condor Project at the University of Wisconsin. BoilerGrid provides a way for you to run programs on large numbers of otherwise idle computers in various locations, including any temporarily under-utilized high-performance cluster resources as well as any computer lab desktop machines not currently in use. Whenever a local user or scheduled job needs a machine back, Condor stops its job and sends it to another Condor node as soon as possible. Because this model limits the ability to do parallel processing and communications, BoilerGrid is only appropriate for relatively quick serial jobs.

How to Join BoilerGrid

If you have a desktop computer on the Purdue West Lafayette campus, please consider donating your desktop's idle time to BoilerGrid! The process is easy and allows other Purdue researchers to use otherwise wasted cycles when your computer is doing nothing. More information on joining BoilerGrid is available on the Join BoilerGrid page.

Detailed Hardware Specification

BoilerGrid scavenges cycles from nearly all RCAC systems, including all the RCAC-maintained clusters and specialized systems. BoilerGrid also uses idle time of machines in student labs on the Purdue West Lafayette campus. Through the larger consortium DiaGrid, BoilerGrid may also send jobs to machines at other institutions, including the University of Wisconsin, the University of Louisville, Indiana University, the University of Notre Dame, Indiana State University, the Purdue Calumet and North Central campuses, and the Indiana University – Purdue University Fort Wayne campus. Whenever the primary scheduling system on any of these machines needs a compute node back or a user sits down and starts to use a desktop computer, Condor will stop its job and, if possible, checkpoint its work. Condor then immediately tries to restart this job on some other available compute node in BoilerGrid.

A recent snapshot of BoilerGrid found 36,524 total processor cores. Of these, there were 29,111 Linux/x86_64, 98 Linux/Intel (ia32), 385 WinNT51/Intel, and 6,925 WinNT61/Intel. There are also small numbers of Itanium Linux, Solaris, and Intel OSX nodes. Memory on compute nodes ranges from 512 MB to 192 GB, and most processors run at 2 GHz or faster. With a total of over 60 TFLOPS available, BoilerGrid can provide large numbers of cycles in a short amount of time. Condor offers high-throughput computing and is excellent for parameter sweeps, Monte Carlo simulations, or nearly any serial application.

Owner                                  Arch/OS                                                 Processor Cores
ITaP - RCAC                            x86_64/Linux                                            30,717
ITaP - RCAC                            Intel/Linux                                             29
ITaP - Envision Center                 Intel/Linux                                             48
ITaP - Teaching & Learning             Intel/WinNTXX                                           ~9,300
Purdue Calumet                         x86_64/Linux                                            998
Notre Dame CSE                         Intel/Linux, Intel/OSX, Sun4u/Solaris210, x86_64/Linux  1,213
Purdue Biology, Libraries & some ITaP  Intel/Linux, Intel/WinNT51                              187

BoilerGrid currently uses Condor 7.4.1. You can check on the overall status of BoilerGrid using CondorView.

Accounts on BoilerGrid

Obtaining an Account

All Purdue faculty and staff, and students with the approval of their advisor, may request access to BoilerGrid. However, if you have an account on Radon or any of the RCAC Community Clusters (Hansen, Rossmann, Coates, Steele, or Miner), then you already have access to BoilerGrid. Refer to the RCAC Accounts / Access page for more details on how to request access.

Login / SSH

To submit jobs on BoilerGrid, log in to the submission host condor.rcac.purdue.edu via SSH. This submission host is actually three front-end hosts: condor-fe00, condor-fe01, and condor-fe02. The login process randomly assigns one of these three front-ends to each login to condor.rcac.purdue.edu. While the three front-end hosts are identical, each has its own Condor queue. When you submit jobs to the Condor queue from the front-end named condor-fe00, you will not see those jobs on the Condor queue while logged in to either condor-fe01 or condor-fe02. To ensure that you always see the same Condor queue, log in to the same front-end.
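Because each front-end keeps its own Condor queue, it can help to note which host a session landed on before submitting anything. The following is a minimal sketch that assumes only the standard `uname` utility; the variable name `frontend` is illustrative and not part of any RCAC tooling.

```shell
# Record which host this session is running on. On BoilerGrid this would
# be one of condor-fe00, condor-fe01, or condor-fe02.
frontend=$(uname -n)

# Strip any domain suffix so the short host name is easy to compare.
frontend=${frontend%%.*}

echo "Jobs submitted from this session go to the Condor queue on: $frontend"
```

If a later session reports a different host name, jobs queued earlier will not appear there. This guide does not give the exact host name to use when logging in to one specific front-end, so confirm that with RCAC before scripting it.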

Each front-end host has its own /tmp. Because a later login may land on a different front-end, files you leave in /tmp during one session may not be visible in the next. RCAC advises using scratch storage instead for data shared across sessions.

You may also submit jobs to BoilerGrid from Radon or any of the RCAC Community Clusters (Hansen, Rossmann, Coates, Steele, or Miner). These clusters also have multiple front-end hosts.

SSH Client Software

All access to the RCAC systems must be through secure (encrypted) connections. RCAC systems do not support telnet and FTP. Use SSH, SCP, and SFTP instead.

Secure Shell or SSH is a way of establishing a secure channel between a local and a remote computer. It uses public-key cryptography to authenticate the remote computer and (optionally) to allow the remote computer to authenticate the user. Its usual function involves logging in to a remote machine and executing commands similar to telnet, but it also supports tunneling and forwarding of X11 or arbitrary TCP connections. The associated SFTP and SCP protocols can transfer files. There are many SSH clients available, depending on the operating system you use.

Linux / Solaris / AIX / HP-UX / Unix:

  • "ssh", "sftp", and "scp" are pre-installed. Log in using ssh myusername@servername.

Microsoft Windows:

Mac OS X:

  • "ssh", "sftp", and "scp" are pre-installed. You may start a local terminal window from "Applications->Utilities". Log in using ssh myusername@servername.
  • MacSSH
  • MacSFTP
  • NiftyTelnet 1.1 SSH

SSH Keys

SSH works with many different means of authentication. One popular authentication method is Public Key Authentication (PKA). PKA is a method of establishing your identity to a remote computer using related sets of encryption data called keys. PKA is a more secure alternative to traditional password-based authentication with which you are probably familiar.

To employ PKA via SSH, you manually generate a keypair (also called SSH keys) on the machine from which you wish to initiate connections to a remote machine. This keypair consists of two text files: a private key and a public key. You keep the private key file confidential on your local machine or local home directory (hence the name "private" key). You then log in to a remote machine (if possible) and append the corresponding public key text to the end of a specific file, or have a system administrator do so on your behalf. In future login attempts, PKA compares the public and private keys to verify your identity; only then do you have access to the remote machine.

As a user, you can create, maintain, and employ as many keypairs as you wish. If you connect to a computational resource from your work laptop, your work desktop, and your home desktop, you can create and employ keypairs on each. You can also create multiple keypairs on a single local machine to serve different purposes, such as establishing access to different remote machines or establishing different types of access to a single remote machine. In short, PKA via SSH offers a secure but flexible means of identifying yourself to all kinds of computational resources.

Passphrases and SSH Keys

Creating a keypair prompts you to provide a passphrase for the private key. This passphrase differs from a password in two ways. First, a passphrase is, as the name implies, a phrase: it can include most types of characters, including spaces, and has no limit on length. Second, the remote machine never receives this passphrase for verification. Its only purpose is to unlock your local private key, and it applies to that one private key alone.

Perhaps you are wondering why you would need a private key passphrase at all when using PKA. If the private key remains secure, why the need for a passphrase just to use it? Indeed, if the location of your private keys were always completely secure, a passphrase might not be necessary. In reality, a number of situations could arise in which someone may improperly gain access to your private key files. In these situations, a passphrase offers another level of security for you, the user who created the keypair.

Think of the private key/passphrase combination as being analogous to your ATM card/PIN combination. The ATM card itself is the object that grants access to your important accounts, and as such, should remain secure at all times—just as a private key should. But if you ever lose your wallet or someone steals your ATM card, you are glad that your PIN exists to offer another level of protection. The same is true for a private key passphrase.

When you create a keypair, you should always provide a corresponding private key passphrase. For security purposes, avoid using phrases which automated programs can discover (e.g. phrases that consist solely of words in English-language dictionaries). This passphrase is not recoverable if forgotten, so make note of it. Only a few situations warrant using a non-passphrase-protected private key—conducting automated file backups is one such situation. If you need to use a non-passphrase-protected private key to conduct automated backups to Fortress, see the No-Passphrase SSH Keys section.
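The keypair workflow described above can be sketched with the OpenSSH `ssh-keygen` tool. This is an illustration under stated assumptions, not an RCAC procedure: the key name `id_rsa_demo` is hypothetical, the key is written to a throwaway directory, and a local file stands in for the remote `~/.ssh/authorized_keys` file that OpenSSH servers normally read public keys from.

```shell
# Work in a throwaway directory so this demonstration never touches ~/.ssh.
keydir=$(mktemp -d)
keyfile="$keydir/id_rsa_demo"          # hypothetical key name

if command -v ssh-keygen >/dev/null 2>&1; then
    # -N supplies the passphrase non-interactively for this demo only.
    # In real use, omit -N so ssh-keygen prompts for the passphrase and
    # it never appears in your shell history.
    ssh-keygen -q -t rsa -b 4096 -N 'replace this demo passphrase' -f "$keyfile"

    # The private key stays local; only the .pub file is shared.
    # Here a local stand-in file plays the role of the remote
    # authorized_keys file, to which the public key is appended.
    cat "$keyfile.pub" >> "$keydir/authorized_keys"
    chmod 600 "$keydir/authorized_keys"
    ls "$keydir"
else
    echo "ssh-keygen not found; install an OpenSSH client first"
fi
```

Appending with `>>` rather than overwriting matters on a real remote host, because the file may already hold public keys from your other machines.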

SSH X11 Forwarding

SSH supports tunneling of X11 (X-Windows), so you may run X11 applications on the machine you are using to issue jobs to BoilerGrid. However, running an X11 application via Condor is not possible.

Passwords

If you have received a default password as part of the process of obtaining your account, you should change it immediately when you log on for the first time. Change your password from any terminal/SSH session with the command passwd. You will have the same password on all RCAC systems. If you change your password on any one RCAC system, it will change on all RCAC systems.

If you already have a Purdue career account, then you will initially receive the same username and password as your career account. Receiving an account on RCAC systems does not, by itself, require you to change your career account password.

There is not currently any requirement regarding how often you must change your password within RCAC, but for security reasons changing a password every six months, preferably every three months, is good practice.

A password should employ all of the following features:

  • Something you have never used as a password before, on this or any other system.
  • Easy for you to remember and difficult for others to guess.
  • At least eight characters long.
  • A combination of uppercase and lowercase letters, numbers, and symbols.
TIP: A recommended password is an abbreviation of a sentence or song lyric: "The dog Samson ate 4 new slippers!" = "TdSa4ns!"

Never share your password with another user or make your password known to anyone else. Systems staff will NEVER ask for your password, by email or otherwise.

Email

There is no local email delivery available on BoilerGrid. BoilerGrid forwards all email which it receives to mail.rcac.purdue.edu for delivery.

Login Shell

Your shell is the program that generates your command-line prompt and processes commands. On RCAC systems, several common shell choices are available:

Name Description Path
bash A Bourne-shell (sh) compatible shell with many newer advanced features as well. Bash is one of the most common shells in use today. /bin/bash
tcsh An advanced variant on csh with all the features of modern shells. Tcsh is probably the second most popular shell in use today. /bin/tcsh
zsh An advanced shell which incorporates all the functionality of bash, tcsh, and ksh combined, usually with identical syntax. In spite of this, zsh is not in common use. /bin/zsh
csh The original C-style shell. Because tcsh offers all the functionality of csh and more, use csh only when you have specific csh-only scripts. /bin/csh
ksh Korn shell, which was an early Bourne-shell compatible shell with some additional features. Unless you are already an adept ksh user, you would probably prefer bash. /bin/ksh

To find out what shell you are running right now, simply use the ps command:

$ ps
  PID TTY          TIME CMD
30181 pts/27   00:00:00 bash
30273 pts/27   00:00:00 ps

To use a different shell on a one-time or trial basis, simply type the shell name as a command. To return to your original shell, type exit:

$ ps
  PID TTY          TIME CMD
30181 pts/27   00:00:00 bash
30273 pts/27   00:00:00 ps

$ tcsh
% ps
  PID TTY          TIME CMD
30181 pts/27   00:00:00 bash
30313 pts/27   00:00:00 tcsh
30315 pts/27   00:00:00 ps

% exit
$

To permanently change your default login shell, use the command chsh:

$ chsh

Changing login shell for myusername on *all* ACMAINT hosts.
Enter existing password: **********
Old shell: nologin
New shell [nologin]: /bin/tcsh

Changed 'loginShell' to '/bin/tcsh' for login 'myusername' on host(s) 'host123.rcac.purdue.edu host234.rcac.purdue.edu ...'.
Connection to data.rcac.purdue.edu closed.

There is a propagation delay which may last up to two hours. After the change has taken effect, your next login will start in your new shell. Moreover, you may change your shell again at any time by rerunning chsh.
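Because of the propagation delay, it may be useful to check whether a new login has picked up the change. The sketch below compares the requested shell with the value of `$SHELL`, which normally holds the login shell of the current session. The function name `shell_change_status` is hypothetical.

```shell
# Report whether the login shell matches the one requested through chsh.
# Arguments: the requested shell path, then the shell currently in effect.
shell_change_status() {
    if [ "$2" = "$1" ]; then
        echo "login shell is $1"
    else
        echo "login shell is still $2; the change to $1 may not have propagated yet"
    fi
}

# /bin/tcsh is only an example of a requested shell.
shell_change_status /bin/tcsh "${SHELL:-unknown}"
```

Run the check in a fresh login session, since a session that was already open keeps the value of `$SHELL` it started with.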

File Storage and Transfer for BoilerGrid

Storage Options

File storage options on RCAC systems include long-term storage (home directories, Fortress) and short-term storage (scratch directories, /tmp directory). Each option has different performance and intended uses, and some options vary from system to system as well. RCAC backs up home directories nightly. RCAC does not back up short-term storage and may occasionally purge files from scratch and /tmp directories without warning. More details about each storage option appear below.

Home Directories

RCAC provides home directories for long-term file storage. Each user ID has one home directory. You should use your home directory for storing important program files, scripts, input data sets, critical results, and frequently used files. You should store infrequently used files on Fortress. Your home directory becomes your current working directory, by default, when you log in.

RCAC backs up your home directory nightly. For additional security, you should store another copy of your home directory on more permanent storage.

Your home directory physically resides on the BlueArc NFS server. To find the path to your home directory, log in and immediately enter the following:

$ pwd
/autohome/u103/myusername

Or from any subdirectory:

$ echo $HOME
/home/ba01/u103/myusername

The replies indicate the name of the host where your home directory physically resides. In this example, the home directory is on the RCAC home directory file server named "ba01" under area "u103". This will vary from person to person.

Regardless of its physical location, your home directory and its contents are available on almost all RCAC front-end hosts and compute nodes via the Network File System (NFS). The only exception is Black.

Your home directory has a quota capping the size and/or number of files you may store within. For more information, refer to the Storage Quotas / Limits Section.

Lost Home Directory File Recovery

Only files which RCAC has backed up overnight are recoverable. If you lose a file the same day you created it, it is NOT recoverable.

To recover files lost from your home directory, use the flost command:

$ flost

This will ask you some questions about when you lost your file. If you lost it recently, flost will direct you to a place where you can recover your file yourself immediately. If you lost the file some time ago, flost will help you note all the necessary information for RCAC staff to restore your file from tape backups.

Scratch Directories

RCAC provides scratch directories for short-term file storage only. Each file system domain has at least one scratch directory. Each user ID may access one scratch directory in a file system domain. The quota of your scratch directory is several times greater than the quota of your home directory. You should use your scratch directory for storing large temporary input files which your job reads or for writing large temporary output files which you may examine after execution of your job. You should use your home directory and Fortress for longer-term storage or for holding critical results.

Users of all RCAC's major clusters have access to a scratch directory.

RCAC does not perform backups for scratch directories. In the event of a disk crash or file purge, files in scratch directories are not recoverable. You should copy any important files to more permanent storage.

RCAC automatically removes (purges) from RCAC scratch directories all files stored for more than 90 days. Owners of these files receive a notice one week before removal via email. For more information, please refer to RCAC's Scratch File Purging Policy.
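Rather than waiting for the warning email, you could periodically list files that are approaching the 90-day limit. The sketch below is one way to do so with the standard `find` utility. It assumes the purge clock follows file modification time, which this guide does not state; the Scratch File Purging Policy is the authority on that. The function name `purge_candidates` is hypothetical.

```shell
# List regular files under a directory whose modification time is more
# than a given number of days old.
# Assumption: the purge is based on modification time.
purge_candidates() {
    find "$1" -type f -mtime +"$2" -print
}

# Demonstration on a throwaway directory instead of real scratch space.
purge_demo=$(mktemp -d)
touch -t 200001010000 "$purge_demo/old_results.dat"   # dated 1 January 2000
touch "$purge_demo/fresh_results.dat"                 # dated now

# 83 days leaves roughly a week of margin before the 90-day limit.
purge_candidates "$purge_demo" 83
```

Against real scratch space the call would be `purge_candidates "$RCAC_SCRATCH" 83`. In the demonstration only `old_results.dat` is listed.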

To find the path to your scratch directory:

$ findscratch

The response from command findscratch depends on your submission host. You may see one of the following paths:

/scratch/scratch95/m/myusername
/scratch/scratch96/m/myusername
/scratch/lustreA/m/myusername
/scratch/miner/m/myusername

The value of variable $RCAC_SCRATCH is the path of your scratch directory. Use this variable in any scripts. Your actual scratch directory path may change without warning, but this variable will remain current.

$ echo $RCAC_SCRATCH

The response will be one of the previously listed paths.
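In a script, referring to the variable instead of a literal path might look like the following sketch. The directory name `myproject` is a placeholder. The fallback to a temporary directory exists only so the sketch also runs on machines where `RCAC_SCRATCH` is not defined.

```shell
# Resolve the scratch location from the environment rather than from a
# hard-coded path. The mktemp fallback is for trying this sketch outside
# RCAC systems only.
scratch=${RCAC_SCRATCH:-$(mktemp -d)}

# "myproject" is a placeholder directory name.
workdir="$scratch/myproject"
mkdir -p "$workdir"

# Write large temporary output under scratch, not under the home directory.
echo "large temporary output" > "$workdir/output.dat"
ls -l "$workdir"
```

If RCAC later moves your scratch directory, a script written this way keeps working without changes.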

A scratch directory may be shared among several RCAC computational resources or may belong to just one. All submission hosts on all computational resources are able to access the scratch directories of all other computational resources. However, compute nodes are only able to access the scratch directory allocated to that specific computational resource. RCAC may change which computational resources share scratch storage as needs dictate. For more information about which computational resources share scratch volumes, please see the Network Storage Resource Page.

When you submit BoilerGrid jobs from a submission host of an RCAC computational resource, Condor sets their filesystem domain so that the jobs stay on RCAC compute nodes which can access that submission host's scratch directory. This ensures that jobs which do not use file transfer always run on nodes that can reach the scratch directory available where you submitted them. Jobs which specify file transfer are not restricted in this way, because they do not depend on the shared scratch directory. If you have no need of this scratch directory and want your non-file-transfer jobs to run on systems which cannot access it, you must explicitly set the file system domain of your jobs.
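As an illustration of the file-transfer alternative, the sketch below writes a small Condor submit description file that asks Condor to move input and output files itself. The commands shown are standard Condor submit commands, but the file and program names (`myjob.sub`, `myprogram`, `myinput.dat`) are placeholders, and this guide does not say whether BoilerGrid needs further settings.

```shell
# Write a Condor submit description file that requests file transfer
# instead of relying on a shared scratch directory.
# All file and program names below are placeholders.
subdir=$(mktemp -d)
cat > "$subdir/myjob.sub" <<'EOF'
universe                = vanilla
executable              = myprogram
transfer_input_files    = myinput.dat
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = myjob.out
error                   = myjob.err
log                     = myjob.log
queue
EOF

cat "$subdir/myjob.sub"

# Submitting is left as a comment because it needs a Condor submission host:
#   condor_submit myjob.sub
```

With `should_transfer_files = YES`, the job carries its own input to the execute machine, so it no longer depends on that machine seeing your scratch directory.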

To find the path to someone else's RCAC scratch directory:

$ findscratch someusername
/scratch/scratch95/s/someusername

Your RCAC scratch directory has a quota capping the size and number of files you may store in it. For more information, refer to the Storage Quotas / Limits Section.

/tmp Directory

RCAC provides /tmp directories for short-term file storage only. Each front-end and compute node has a /tmp directory. Your program may write temporary data to the /tmp directory of the compute node on which it is running. That data is available for as long as your program is active. Once your program terminates, that temporary data is no longer available. When used properly, /tmp may provide faster local storage to an active process than any other storage option. You should use your home directory and Fortress for longer-term storage or for holding critical results.

RCAC does not perform backups for the /tmp directory and removes files from /tmp whenever space is low or whenever the system needs a reboot. In the event of a disk crash or file purge, files in /tmp are not recoverable. You should copy any important files to more permanent storage.
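A typical pattern for using /tmp safely is to create a private working directory, write intermediate data there, and copy anything worth keeping elsewhere before the program ends. The sketch below assumes the common `mktemp` utility. A second temporary directory stands in for a permanent location such as your home directory, so the sketch is safe to run anywhere.

```shell
# Create a private working directory under /tmp (or $TMPDIR if it is set).
workdir=$(mktemp -d "${TMPDIR:-/tmp}/myjob.XXXXXX")

# Stand-in for a permanent location such as a home directory.
permanent=$(mktemp -d)

# Write intermediate data to fast local storage while the program runs.
echo "intermediate result" > "$workdir/partial.dat"

# Copy anything worth keeping before the program ends, then clean up.
cp "$workdir/partial.dat" "$permanent/"
rm -rf "$workdir"

ls "$permanent"
```

Removing the working directory yourself also helps keep /tmp from filling up on shared nodes.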

Long-Term Storage

Long-term Storage or Permanent Storage is available to RCAC users on the High Performance Storage System (HPSS), an archival storage system, commonly referred to as "Fortress". HPSS is a software package that manages a hierarchical storage system. Program files, data files and any other files which are not used often, but which must be saved, can be put in permanent storage. Fortress currently has a 1.2 PB capacity.

Files smaller than 100 MB have their primary copy stored on low-cost disks (disk cache), but the second copy (backup of disk cache) is on tape or optical disks. This provides a rapid restore time to the disk cache. However, the large latency to access a larger file (usually involving a copy from a tape cartridge) makes it unsuitable for direct use by any processes or jobs, even where possible. The primary and secondary copies of larger files are stored on separate tape cartridges in the Quantum (ADIC, Advanced Digital Information Corporation) tape library.

To ensure optimal performance for all users, and to keep the Fortress system healthy, please remember the following tips:

  • Fortress operates most effectively with large files (1 GB or larger). If your data consists of smaller files, use HTAR to create archives directly in Fortress.
  • When working with files on cluster head nodes, use your home directory or a scratch file system, rather than editing or computing on files directly in Fortress. Copy any data you wish to archive to Fortress after computation is complete.
  • The HPSS software does not handle sparse files (files with empty space) in an optimal manner. Therefore, if you must copy a sparse file into HPSS, use HSI rather than the cp or mv commands.
  • Due to the sparse files issue, the rsync command should not be used to copy data into Fortress through NFS, as this may cause problems with the system.
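To apply the first tip, you might first check how many files in a directory fall below the 1 GB mark. The sketch below uses the standard `find` utility with a size given in bytes. The function name `small_files` is hypothetical, and the demonstration runs on tiny placeholder files in a throwaway directory.

```shell
# Print regular files under a directory that are smaller than 1 GB
# (1073741824 bytes); these are candidates for bundling into one archive.
small_files() {
    find "$1" -type f -size -1073741824c -print
}

# Demonstration on a throwaway directory with two tiny placeholder files.
bundle_demo=$(mktemp -d)
echo "a" > "$bundle_demo/data1.fits"
echo "b" > "$bundle_demo/data2.fits"

# tr removes the padding that some wc implementations add.
count=$(small_files "$bundle_demo" | wc -l | tr -d ' ')
echo "$count small file(s) found; consider one HTAR archive instead of many uploads"
```

A directory with many such files is usually better stored as a single HTAR archive than as individual uploads.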

Fortress writes two copies of every file either to two tapes, or to disk and a tape, to protect against medium errors. Unfortunately, Fortress does not automatically switch to the alternate copy when it has trouble accessing the primary. If it seems to be taking an extraordinary amount of time to retrieve a file (hours), please either email rcac-help@purdue.edu or call ITaP Customer Service at 765-494-4000. We can then investigate why it is taking so long. If it is an error on the primary copy, we will instruct Fortress to switch to the alternate copy as the primary and recreate a new alternate copy.

For more information about Fortress, how it works, user guides, and how to obtain an account:

Manual File Transfer to Long-Term Storage

There are a variety of ways to manually transfer files to your Fortress home directory for long-term storage.

HSI

HSI, the Hierarchical Storage Interface, is the preferred method of transferring files to and from Fortress. HSI is designed to be a friendly interface for users of the High Performance Storage System (HPSS). It provides a familiar Unix-style environment for working within HPSS while automatically taking advantage of high-speed, parallel file transfers without requiring any special user knowledge.

HSI is already provided on all RCAC systems as the command hsi. You may download HSI for the following platforms as well:

Any machines using HSI or HTAR must have all firewalls (local and departmental) configured to allow open access from the following IP addresses:

  • 128.211.158.46
  • 128.211.158.121
  • 128.211.158.122

If you are unsure of how to modify your firewall settings, please consult with your department's IT support or the documentation for your operating system. Access to Fortress is restricted to on-campus networks. If you need to directly access Fortress from off-campus, please use the Purdue VPN service before connecting.

Interactive usage:

$ hsi

*************************************************************************
*                    Purdue University 
*                  High Performance Storage System (HPSS)
*************************************************************************
* This is the Purdue Data Archive, Fortress.  For further information 
* see http://www.rcac.purdue.edu/userinfo/resources/fortress/
*  
*   If you are having problems with HPSS, please call IT/Operational
*   Services at 49-44000 or send E-mail to dxul-help@purdue.edu.
*
*************************************************************************
Username: myusername  UID: 12345  Acct: 12345(12345) Copies: 1 Firewall: off [hsi.3.5.8 Wed Sep 21 17:31:14 EDT 2011] 

[Fortress HSI]/home/myusername->put data1.fits
put  'data1.fits' : '/home/myusername/data1.fits' ( 1024000000 bytes, 250138.1 KBS (cos=11))

[Fortress HSI]/home/myusername->lcd /tmp

[Fortress HSI]/home/myusername->get data1.fits
get  '/tmp/data1.fits' : '/home/myusername/data1.fits' (2011/10/04 16:28:50 1024000000 bytes, 325844.9 KBS )

[Fortress HSI]/home/myusername->quit

Batch transfer file:

put data1.fits 
put data2.fits 
put data3.fits 
put data4.fits 
put data5.fits 
put data6.fits 
put data7.fits 
put data8.fits 
put data9.fits

Batch usage:

$ hsi < my_batch_transfer_file
*************************************************************************
*                    Purdue University 
*                  High Performance Storage System (HPSS)
*************************************************************************
* This is the Purdue Data Archive, Fortress.  For further information 
* see http://www.rcac.purdue.edu/userinfo/resources/fortress/
*  
*   If you are having problems with HPSS, please call IT/Operational
*   Services at 49-44000 or send E-mail to dxul-help@purdue.edu.
*
*************************************************************************
Username: myusername  UID: 12345  Acct: 12345(12345) Copies: 1 Firewall: off [hsi.3.5.8 Wed Sep 21 17:31:14 EDT 2011] 
put  'data1.fits' : '/home/myusername/data1.fits' ( 1024000000 bytes, 250200.7 KBS (cos=11))
put  'data2.fits' : '/home/myusername/data2.fits' ( 1024000000 bytes, 258893.4 KBS (cos=11))
put  'data3.fits' : '/home/myusername/data3.fits' ( 1024000000 bytes, 222819.7 KBS (cos=11))
put  'data4.fits' : '/home/myusername/data4.fits' ( 1024000000 bytes, 224311.9 KBS (cos=11))
put  'data5.fits' : '/home/myusername/data5.fits' ( 1024000000 bytes, 323707.3 KBS (cos=11))
put  'data6.fits' : '/home/myusername/data6.fits' ( 1024000000 bytes, 320322.9 KBS (cos=11))
put  'data7.fits' : '/home/myusername/data7.fits' ( 1024000000 bytes, 253192.6 KBS (cos=11))
put  'data8.fits' : '/home/myusername/data8.fits' ( 1024000000 bytes, 253056.2 KBS (cos=11))
put  'data9.fits' : '/home/myusername/data9.fits' ( 1024000000 bytes, 323218.9 KBS (cos=11))
EOF detected on TTY - ending HSI session
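A batch transfer file like the one above does not have to be typed by hand. The sketch below generates one `put` line per `.fits` file in a directory. The function name `make_hsi_batch` is hypothetical, and the demonstration uses empty placeholder files in a throwaway directory.

```shell
# Write one "put" line per matching file into an HSI batch transfer file.
# Arguments: directory holding the data files, then the batch file to write.
make_hsi_batch() {
    : > "$2"                               # start with an empty batch file
    for f in "$1"/*.fits; do
        [ -e "$f" ] || continue            # skip if nothing matched
        echo "put $(basename "$f")" >> "$2"
    done
}

# Demonstration with empty placeholder files in a throwaway directory.
datadir=$(mktemp -d)
touch "$datadir/data1.fits" "$datadir/data2.fits" "$datadir/data3.fits"

make_hsi_batch "$datadir" "$datadir/my_batch_transfer_file"
cat "$datadir/my_batch_transfer_file"
```

Because the generated lines use bare file names, change into the data directory before running `hsi < my_batch_transfer_file`.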

For more information about HSI:

HTAR

HTAR (short for "HPSS TAR") is a utility program that writes TAR-compatible archive files directly onto Fortress, without having to first create a local file. Its command line was originally based on the AIX tar program, with a number of extensions added to provide extra features.

HTAR is already provided on all RCAC systems as the command htar. You may download HTAR for the following platforms as well:

Any machines using HSI or HTAR must have all firewalls (local and departmental) configured to allow open access from the following IP addresses:

  • 128.211.158.46
  • 128.211.158.121
  • 128.211.158.122

If you are unsure of how to modify your firewall settings, please consult with your department's IT support or the documentation for your operating system. Access to Fortress is restricted to on-campus networks. If you need to directly access Fortress from off-campus, please use the Purdue VPN service before connecting.

Usage:

  (Create a tar archive on Fortress named data.tar including all files with the extension ".fits".)
$ htar -cvf data.tar *.fits
HTAR: a   data1.fits                                      
HTAR: a   data2.fits
HTAR: a   data3.fits
HTAR: a   data4.fits
HTAR: a   data5.fits
HTAR: a   data6.fits
HTAR: a   data7.fits
HTAR: a   data8.fits
HTAR: a   data9.fits
HTAR: a   /tmp/HTAR_CF_CHK_17953_1317760775
HTAR Create complete for data.tar. 9,216,006,144 bytes written for 9 member files, max threads: 3 Transfer time: 29.622 seconds (311.121 MB/s)
HTAR: HTAR SUCCESSFUL   

  (Unpack a tar archive on Fortress named data.tar into a scratch directory for use in a batch job.)
$ cd $RCAC_SCRATCH/job_dir
$ htar -xvf data.tar 
HTAR: x data1.fits, 1024000000 bytes, 2000001 media blocks
HTAR: x data2.fits, 1024000000 bytes, 2000001 media blocks
HTAR: x data3.fits, 1024000000 bytes, 2000001 media blocks
HTAR: x data4.fits, 1024000000 bytes, 2000001 media blocks
HTAR: x data5.fits, 1024000000 bytes, 2000001 media blocks
HTAR: x data6.fits, 1024000000 bytes, 2000001 media blocks
HTAR: x data7.fits, 1024000000 bytes, 2000001 media blocks
HTAR: x data8.fits, 1024000000 bytes, 2000001 media blocks
HTAR: x data9.fits, 1024000000 bytes, 2000001 media blocks
HTAR: Extract complete for data.tar, 9 files. total bytes read: 9,216,004,608 in 33.914 seconds (271.749 MB/s )
HTAR: HTAR SUCCESSFUL

  (Look at the contents of the data.tar HTAR archive on Fortress.)
$ htar -tvf data.tar
HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:30  data1.fits
HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:35  data2.fits
HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:35  data3.fits
HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:35  data4.fits
HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:35  data5.fits
HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:35  data6.fits
HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:35  data7.fits
HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:35  data8.fits
HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:35  data9.fits
HTAR: -rw-------  myusername/pucc        256 2011-10-04 16:39  /tmp/HTAR_CF_CHK_17953_1317760775
HTAR: Listing complete for data.tar, 10 files 10 total objects
HTAR: HTAR SUCCESSFUL

  (Unpack a single file, "data7.fits", from the tar archive on Fortress named data.tar into a scratch directory.)
$ htar -xvf data.tar data7.fits
HTAR: x data7.fits, 1024000000 bytes, 2000001 media blocks
HTAR: Extract complete for data.tar, 1 files. total bytes read: 1,024,000,512 in 3.642 seconds (281.166 MB/s )
HTAR: HTAR SUCCESSFUL

For more information about HTAR:

SCP

Fortress does NOT support SCP.

SFTP

Fortress does NOT support SFTP.

NFS

If you are using an RCAC cluster front-end system, your Fortress home directory is available as /archive/fortress/home/myusername. While your Fortress home directory can be accessed via NFS in this way, this is only provided as a convenience and should not be used on a regular basis as it is extremely slow. Instead, use the HSI command to get a fast, parallelized, UNIX-like interface to your Fortress home directory.

Environment Variables

There are many environment variables related to storage locations and paths. Logging in automatically sets these environment variables. You may change the variables at any time.

Use environment variables instead of actual paths whenever possible to avoid problems if the specific paths to any of these change. Some of the environment variables you should have are:

Name Description
USER your username
HOME path to your home directory
PWD path to your current directory
RCAC_SCRATCH path to scratch filesystem
PATH all directories searched for commands/applications
HOSTNAME name of the machine you are on
SHELL your current shell (bash, tcsh, csh, ksh)
SSH_CLIENT your local client's IP address
TERM type of terminal or terminal emulator being used

By convention, environment variable names are all uppercase. Use them on the command line or in any scripts in place of and in combination with hard-coded values:

$ ls $HOME
...

$ ls $RCAC_SCRATCH/myproject
...

To find the value of any environment variable:

$ echo $RCAC_SCRATCH
/scratch/scratch95/m/myusername

$ echo $SHELL
/bin/tcsh

To list the values of all environment variables:

$ env
USER=myusername
HOME=/home/ba01/u101/myusername
RCAC_SCRATCH=/scratch/scratch95/m/myusername
SHELL=/bin/tcsh
...

You may create or overwrite an environment variable. To set and export a variable in either bash or ksh:

$ export VARIABLE=value

To assign a value to an environment variable in either tcsh or csh:

% setenv VARIABLE value
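
As a minimal sketch of using these variables in a script, the following function builds a project path under the scratch filesystem and stops early if RCAC_SCRATCH is not set. The function name job_dir_path and the directory name myproject are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: print the path of a project directory under scratch.
# Returns an error if RCAC_SCRATCH is unset or empty (for example, when the
# script runs somewhere other than an RCAC system).
job_dir_path() {
    if [ -z "${RCAC_SCRATCH:-}" ]; then
        echo "RCAC_SCRATCH is not set" >&2
        return 1
    fi
    printf '%s/%s\n' "$RCAC_SCRATCH" "$1"
}
```

A script could then change to the directory without hard-coding the scratch path:

$ cd "$(job_dir_path myproject)"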

Storage Quotas / Limits

RCAC limits your disk usage on RCAC systems. Each filesystem (home directory, scratch directory, etc.) may have a different limit. RCAC does not implement a soft limit; there is only a hard limit, and any write that would exceed it will fail. If this happens, either remove other files or ask RCAC about increasing your limit.

Checking Quota Usage

To discover the current quotas of your home and scratch directories:

$ myquota
Type        Filesystem          Size    Limit  Use         Files    Limit  Use
==============================================================================
home        u105               4.5GB    9.5GB  47%        10,258   65,535  15%
scratch     /scratch/scratch95/    8KB  476.8GB   0%             2  100,000   0%

The columns are as follows:

  1. Type: indicates home or scratch directory.
  2. Filesystem: name of storage option.
  3. Size: sum of file sizes in bytes.
  4. Limit: allowed maximum on sum of file sizes in bytes.
  5. Use: percentage of file-size limit currently in use.
  6. Files: number of files and directories (not the size).
  7. Limit: allowed maximum on number of files and directories. It is possible, though unlikely, to reach this limit and not the file-size limit if you create a large number of very small files.
  8. Use: percentage of file-number limit currently in use.
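
The columns described above can also be checked from a script. The following is a sketch only: it assumes the eight-column layout shown in the sample output, and the function name quota_warnings and the default threshold of 90 percent are hypothetical.

```shell
# Hypothetical filter: read myquota-style output on stdin and print the type
# (column 1) of every row whose size usage (column 5) or file-count usage
# (column 8) is at or above a threshold percentage. Default threshold: 90.
quota_warnings() {
    awk -v limit="${1:-90}" '
        # Only data rows have eight fields with percentages in columns 5 and 8;
        # the header and the separator line do not match and are skipped.
        NF == 8 && $5 ~ /%$/ && $8 ~ /%$/ {
            size_use = $5; sub(/%/, "", size_use)
            file_use = $8; sub(/%/, "", file_use)
            if (size_use + 0 >= limit + 0 || file_use + 0 >= limit + 0)
                print $1
        }'
}
```

For example, to list the storage types that are at least 80 percent full:

$ myquota | quota_warnings 80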

If you find that you have reached your quota in either your home directory or your scratch directory, obtain estimates of your disk usage. Find the top-level directories which have a high disk usage, then study the subdirectories to discover where the heaviest usage lies.

To see in a human-readable format an estimate of the disk usage of your top-level directories in your home directory:

$ du -h --max-depth=1 $HOME
32K     /home/ba01/u105/myusername/mysubdirectory_1
529M    /home/ba01/u105/myusername/mysubdirectory_2
608K    /home/ba01/u105/myusername/mysubdirectory_3

The second directory is the largest of the three, so run du on it next to examine its subdirectories.

To see in a human-readable format an estimate of the disk usage of your top-level directories in your scratch file directory:

$ du -h --max-depth=1 $RCAC_SCRATCH
160K    /scratch/scratch95/m/myusername

This strategy can be very helpful in figuring out the location of your largest usage. Move unneeded files and directories to alternate long-term storage to free space in your home and scratch directories.
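
The search for the heaviest usage can be shortened by sorting the du output. This is a minimal sketch that assumes GNU du, as in the examples above; the function name largest_dirs is hypothetical.

```shell
# Hypothetical helper: list a directory and its immediate subdirectories,
# largest first, with sizes in kilobytes. The first argument is the directory
# to examine; the optional second argument limits the number of lines
# printed (default 10). The first line is the total for the directory itself.
largest_dirs() {
    du -k --max-depth=1 "$1" | sort -rn | head -n "${2:-10}"
}
```

For example, to see the five largest entries under your home directory:

$ largest_dirs $HOME 5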

Requesting Quota Increase

If you find you need additional disk space on an RCAC account, please first consider archiving and compressing old files and moving them to long-term storage. If this option does not resolve the issue, you may send an email to rcac-help@purdue.edu and request additional space.

Archive and Compression

There are several options for archiving and compressing groups of files or directories on RCAC systems. RCAC provides the following tools:

  • zip   (more information)
    Simple compression and file packaging utility.
    Examples:
      (compress file somefile.c)
    $ zip somefile.zip somefile.c
    
      (extract contents of somefile.zip)
    $ unzip somefile.zip
    
      (compress all files in a directory into one archive file)
    $ zip -r somefile.zip somedirectory/
    
      (compress all ".c" files in current directory and its subdirectories into one archive file)
    $ zip -r somefile.zip . -i \*.c
    
  • tar   (more information)
    Saves many files together into a single archive file, and restores individual files from the archive. Includes automatic archive compression/decompression options and special features for incremental and full backups.
    Examples:
      (archive file somefile.c)
    $ tar cvf somefile.tar somefile.c
    
      (archive and compress file somefile.c)
    $ tar czvf somefile.tar.gz somefile.c
    
      (list contents of archive somefile.tar)
    $ tar tvf somefile.tar
    
      (extract contents of somefile.tar)
    $ tar xvf somefile.tar
    
      (extract contents of gzipped archive somefile.tar.gz)
    $ tar xzvf somefile.tar.gz
    
      (archive and compress all files in a directory into one archive file)
    $ tar czvf somefile.tar.gz somedirectory/
    
      (archive and compress all ".c" files in current directory into one archive file)
    $ tar czvf somefile.tar.gz *.c 
    
  • gzip   (more information)
    Compression utility designed as a replacement for compress, with much better compression and no patented algorithms. The standard compression system for all GNU software.
    Examples:
      (compress file somefile - also removes uncompressed file)
    $ gzip somefile
    
      (uncompress file somefile.gz - also removes compressed file)
    $ gunzip somefile.gz
    
  • bzip2   (more information)
    Strong, lossless data compressor based on the Burrows-Wheeler transform. Also available as a library.
    Examples:
      (compress file somefile - also removes uncompressed file)
    $ bzip2 somefile
    
      (uncompress file somefile.bz2 - also removes compressed file)
    $ bunzip2 somefile.bz2
    
  • compress   (more information)
    Adaptive Lempel-Ziv compressor. Not often used today.
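
When archiving a directory to free space, it is prudent to confirm that the archive is readable before removing the original. The following sketch combines the tar commands shown above for that purpose; the function name archive_dir is hypothetical, and the variables it sets are not local to the function.

```shell
# Hypothetical helper: pack a directory into a gzipped tar archive named
# after it, and remove the original only if the archive can be listed in full.
archive_dir() {
    dir="${1%/}"                      # strip a trailing slash, if any
    archive="$dir.tar.gz"
    tar czf "$archive" "$dir" || return 1
    # Listing the archive makes tar read and decompress every member.
    if tar tzf "$archive" > /dev/null; then
        rm -rf "$dir"
    else
        echo "archive check failed; keeping $dir" >&2
        return 1
    fi
}
```

For example:

$ archive_dir somedirectory/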

Windows users can work with these same formats using some of the following software:

  • 7-Zip
    Free Windows software package that can handle all the above formats.
  • WinZip
    Commercial Windows software package that can handle all the above formats.
  • WinRAR
    Commercial Windows software package that can handle all the above formats.

File Transfer

There are a variety of ways to transfer data to and from RCAC systems. Which you should use depends on several factors, including the ease of use for you personally, connection speed and bandwidth, and the size and number of files which you intend to transfer.

FTP

FTP (File Transfer Protocol) is a simple data transfer mechanism. FTP does not provide secure communications, so RCAC no longer supports FTP on any RCAC systems. However, most modern FTP clients support either SFTP or SCP, which are similar, secure protocols for file transfer. Try using one of the other methods described here instead of FTP.

SCP

SCP (Secure CoPy) is a simple way of transferring files between two machines that use the SSH (Secure SHell) protocol. You may use SCP to connect to any system where you have SSH (login) access. SCP is available as a protocol choice in some graphical file transfer programs and also as a command-line program on most Linux, Unix, and Mac OS X systems. SCP can copy single files, and will also recursively copy directory contents if given the -r option and a directory name.

Command-line usage:

  (to a remote system from local)
$ scp sourcefilename myusername@hostname:somedirectory/destinationfilename

  (from a remote system to local)
$ scp myusername@hostname:somedirectory/sourcefilename destinationfilename

  (recursive directory copy to a remote system from local)
$ scp -r sourcedirectory/ myusername@hostname:somedirectory/

Linux / Solaris / AIX / HP-UX / Unix:

  • The "scp" command-line program should already be installed.

Microsoft Windows:

  • WinSCP is a full-featured and free graphical SCP and SFTP client.
  • PuTTY also offers "pscp.exe", which is an extremely small program and a basic SCP client.
  • Secure FX is a commercial SCP and SFTP client which is freely available to Purdue students, faculty, and staff with a Purdue career account.

Mac OS X:

  • The "scp" command-line program should already be installed. You may start a local terminal window from "Applications->Utilities".

SFTP

SFTP (Secure File Transfer Protocol) is a reliable way of transferring files between two machines. You may use SFTP to connect to most RCAC systems. SFTP is available as a protocol choice in some graphical file transfer programs and also as a command-line program on most Linux, Unix, and Mac OS X systems. SFTP has more features than SCP and allows for other operations on remote files, remote directory listing, and resuming interrupted transfers. Command-line SFTP cannot recursively copy directory contents; to do so, try using SCP or a graphical SFTP client.

Command-line usage:

$ sftp -B buffersize myusername@hostname

      (to a remote system from local)
sftp> put sourcefile somedir/destinationfile
sftp> put -P sourcefile somedir/

      (from a remote system to local)
sftp> get sourcefile somedir/destinationfile
sftp> get -P sourcefile somedir/

sftp> exit
  • -B: optional, specify buffer size for transfer; larger may increase speed, but costs memory
  • -P: optional, preserve file attributes and permissions

Linux / Solaris / AIX / HP-UX / Unix:

  • The "sftp" command line program should already be installed.

Microsoft Windows:

  • WinSCP is a full-featured and free graphical SFTP and SCP client.
  • PuTTY also offers "psftp.exe", which is an extremely small program and a basic SFTP client.
  • Secure FX is a commercial SFTP and SCP client which is freely available to Purdue students, faculty, and staff with a Purdue career account.

Mac OS X:

  • The "sftp" command-line program should already be installed. You may start a local terminal window from "Applications->Utilities".
  • MacSFTP is a free graphical SFTP client for Macs.

LFTP

LFTP is a command-line file-transfer program for Linux and Unix systems. It supports SFTP, HTTP, and HTTPS file transfers. LFTP has additional features not provided by SFTP, such as bandwidth throttling, transfer queues, and parallel transfers. You may use it interactively or in scripts.

LFTP with parallel transfers can be much faster than SCP or SFTP, so RCAC encourages its use when possible.

LFTP is available only on some RCAC systems. However, it is simply a client, so the remote machine involved in a transfer does not need it (the remote system need only support SFTP).

Interactive usage:

$ lftp myusername@hostname

         (transfer all ".dat" files from remote system to local)
lftp :~> mget *.dat

         (transfer "filename.dat" file from local system to remote)
lftp :~> put filename.dat

         (transfer a directory and all contents from remote
          system to local, using 5 connections in parallel)
lftp :~> mirror --parallel=5 remotedirectory localdirectory/

         (transfer a directory and all contents from local
          system to remote, using 8 connections in parallel)
lftp :~> mirror -R --parallel=8 localdirectory remotedirectory/

Batch usage:

  (specify all actions on command line)
$ lftp myusername@hostname -e "mget *.dat"

  (specify all actions in the script file "mytransfer.lftp")
$ lftp myusername@hostname -f mytransfer.lftp
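
The script file passed with -f holds the same commands you would type interactively, one per line. As a sketch, the following function writes such a file; the function name write_lftp_script is hypothetical, and the directory names are the placeholders used in the interactive examples above.

```shell
# Hypothetical helper: write an LFTP script file (default "mytransfer.lftp").
# The quoted 'EOF' keeps the shell from expanding "*.dat" while writing.
write_lftp_script() {
    cat > "${1:-mytransfer.lftp}" <<'EOF'
mirror --parallel=5 remotedirectory localdirectory/
mget *.dat
exit
EOF
}
```

After running write_lftp_script, the resulting file can be used with the batch command shown above.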

GridFTP

GridFTP is a fast method of transferring large files that uses Globus authentication credentials (x509 certificates). GridFTP is available on some RCAC resources, but only to users who are members of a Grid project, such as TeraGrid, NorthWest Indiana Computational Grid (NWICG), or Open Science Grid (OSG). However, not all grids may access all RCAC resources.

For more information about how to use GridFTP, consult documentation for your participating grid.

Windows Network Drive / SMB

SMB (Server Message Block), also known as CIFS, is an easy-to-use file transfer protocol that is useful for transferring files between RCAC systems and a desktop or laptop. You may use SMB to connect to your home, scratch, and Fortress storage directories. The SMB protocol is available on Windows, Linux, and Mac OS X. It is primarily used as a graphical means of transfer, but it can also be used from the command line.

Windows:

  • Click Windows menu > Computer (Vista/7) or Start > My Computer (XP)
  • Click Map Network Drive in the top bar (Vista/7) or Tools > Map Network Drive (XP)
  • In the folder location enter the following information and click Finish:

    • To access your home directory, enter \\samba.rcac.purdue.edu\myusername where myusername is your career account name.
    • To access your scratch storage, enter the following:

      • For Steele or Radon scratch, enter \\samba.rcac.purdue.edu\scratch9N\m\myusername where m is the first letter of your username, myusername is your career account name, and N is the number of your scratch drive (N can be 5, 6, 8, or 9.) You need to know beforehand which scratch drive your directory is on, for example scratch95.
      • For Coates or Rossmann scratch, enter \\samba.rcac.purdue.edu\lustreA\m\myusername where m is the first letter of your username and myusername is your career account name.
      • For Carter, Hansen, or WinHPC scratch, enter \\samba.rcac.purdue.edu\lustreC\m\myusername where m is the first letter of your username and myusername is your career account name.

    • To access Fortress long-term storage, enter \\fortress-smb.rcac.purdue.edu\myusername where myusername is your career account name.

  • You may be prompted for login information. Enter your username as onepurdue\myusername and your account password. Without the onepurdue prefix, the login will fail.
  • Your home, scratch, or Fortress directory should now be mounted as a drive in the Computer window.

Mac OS X:

  • In the Finder, click Go > Connect to Server (or the Command-K shortcut)
  • In the Server Address enter the following information and click Connect:

    • To access your home directory, enter smb://samba.rcac.purdue.edu/myusername where myusername is your career account name.
    • To access your scratch storage, enter the following:

      • For Steele or Radon scratch, enter smb://samba.rcac.purdue.edu/scratch9N/m/myusername where m is the first letter of your username, myusername is your career account name, and N is the number of your scratch drive (N can be 5, 6, 8, or 9.) You need to know beforehand which scratch drive your directory is on, for example scratch95.
      • For Coates or Rossmann scratch, enter smb://samba.rcac.purdue.edu/lustreA/m/myusername where m is the first letter of your username and myusername is your career account name.
      • For Carter, Hansen, or WinHPC scratch, enter smb://samba.rcac.purdue.edu/lustreC/m/myusername where m is the first letter of your username and myusername is your career account name.

    • To access Fortress long-term storage, enter smb://fortress-smb.rcac.purdue.edu/myusername where myusername is your career account name.

  • You may be prompted for login information. Enter your username and password, and enter onepurdue as the Domain. Without the domain, the login will fail.

Linux:

  • There are several graphical methods to connect in Linux, depending on your desktop environment. Once you find out how to connect to a network server in your desktop environment, choose the Samba/SMB protocol and adapt the information from the Mac OS X section to connect.
  • If you would like access via Samba on the command line, you may install smbclient, which gives you FTP-like access and can be used as shown below. SCP or SFTP is recommended over smbclient for this purpose. For the server paths you can connect to, see the Mac OS X instructions.
    smbclient //samba.rcac.purdue.edu/myusername -U myusername -W onepurdue
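
The scratch paths above follow a fixed pattern, which can be sketched as a small function. This example covers only the Steele or Radon form; the function name scratch_share is hypothetical, and its variables are not local to the function.

```shell
# Hypothetical helper: print the SMB path of a Steele or Radon scratch
# directory, given a username and the scratch drive digit N (5, 6, 8, or 9).
scratch_share() {
    user="$1"
    n="$2"
    first=$(printf '%s' "$user" | cut -c1)   # first letter of the username
    case "$n" in
        5|6|8|9) ;;
        *) echo "N must be 5, 6, 8, or 9" >&2; return 1 ;;
    esac
    printf '//samba.rcac.purdue.edu/scratch9%s/%s/%s\n' "$n" "$first" "$user"
}
```

The result can be passed to smbclient with the options shown above:

$ smbclient "$(scratch_share myusername 5)" -U myusername -W onepurdue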

Applications on BoilerGrid

Compiling Source Code on BoilerGrid

Provided Compilers

The compilers available on Radon and the Community Clusters (Hansen, Rossmann, Coates, Steele, and Miner) are able to compile code for Condor. Compilers are available for Fortran 77, Fortran 90, Fortran 95, C, and C++. The compilers can produce general-purpose and architecture-specific optimizations to improve performance. These include loop-level optimizations, inter-procedural analysis and cache optimizations. While the compilers support automatic and user-directed parallelization of Fortran, C, and C++ applications for multiprocessing execution, BoilerGrid allows only serial jobs.

To see the available compilers, choose one of the following entries:

$ module avail intel
$ module avail gcc
$ module avail pgi 

Statically Linked Libraries

Using statically linked libraries is good practice regardless of the chosen Condor universe, because you cannot rely on which versions of dynamic libraries are available on the machines selected to run your job. With static libraries, Condor sends the same libraries to all machines. So, use static linkage if at all possible.

Static linkage is not a complete safeguard. Because the Condor flock consists of a mix of machine architectures, your job may land on a machine that is so different from, or so much older than, the machine on which you built your executable file that the job fails to execute an instruction in the statically linked library. In a parameter sweep, this leads to the confusing situation of some runs completing successfully while others fail. In this case, consider using the corresponding dynamic library on the selected machine, or use ClassAds to select compute nodes known to run your job successfully or to exclude compute nodes known to fail.

For the Standard Universe, the condor_compile command specifies static linkage as part of its arguments to the linker; it displays these arguments in the "LINKING FOR" message. For jobs destined for the Vanilla Universe, use your compiler's command-line option for selecting statically linked libraries.
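
Before submitting a job, you may want to confirm that an executable file really is statically linked. The following sketch assumes the ldd command from the GNU C library, which reports "not a dynamic executable" or "statically linked" for such files; the function name is_static is hypothetical.

```shell
# Hypothetical check: succeed (exit status 0) if the given executable file
# is statically linked, judging by the message that ldd prints for it.
is_static() {
    ldd "$1" 2>&1 | grep -Eq 'not a dynamic executable|statically linked'
}
```

For example:

$ is_static myprogram && echo "myprogram is statically linked"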

Compiling Serial Programs

A serial program is a single process whose steps execute as a sequential stream of instructions on one computer. Compilers capable of serial programming are available for C, C++, and versions of Fortran.

Here are a few sample serial programs:

Standard Universe

With the GNU compilers only, the command condor_compile compiles source code and relinks it with the Condor libraries for submission into Condor's Standard Universe. The Condor libraries provide the program with additional support, such as the capability to preempt with checkpointing, which is a feature of Condor's Standard Universe mode of operation. The command condor_compile requires the source or object code of a computer program as well as a compatible compiler.

To use condor_compile and the Standard Universe, first load a compatible compiler (in this case the default GNU compiler):

$ module load gcc

Next, choose one of the following entries:

$ condor_compile gfortran myprogram.f -o myprogram
$ condor_compile gfortran myprogram.f90 -o myprogram
$ condor_compile gfortran myprogram.f95 -o myprogram
$ condor_compile gcc myprogram.c -o myprogram
$ condor_compile g++ myprogram.cpp -o myprogram

Vanilla Universe

When neither source nor object code of a computer program is available (i.e. only an executable binary or a shell script) or when you wish to take advantage of features of a compiler which is not compatible with Condor's condor_compile and Standard Universe, you must compile without condor_compile and submit your executable file to Condor's Vanilla Universe. This section looks at just compiling with the standard C/C++ and Fortran compilers, as opposed to compiling with condor_compile.

The following table illustrates how to compile a serial program with statically linked libraries. Note that not all compilers are available on all systems.

Language    Intel Compiler                              GNU Compiler                                   PGI Compiler

Fortran 77  $ module load intel                         $ module load gcc                              $ module load pgi
            $ ifort -static myprogram.f -o myprogram    $ gfortran -static myprogram.f -o myprogram    $ pgf77 -Bstatic myprogram.f -o myprogram

Fortran 90  $ module load intel                         $ module load gcc                              $ module load pgi
            $ ifort -static myprogram.f90 -o myprogram  $ gfortran -static myprogram.f90 -o myprogram  $ pgf90 -Bstatic myprogram.f90 -o myprogram

Fortran 95  $ module load intel                         $ module load gcc                              $ module load pgi
            $ ifort -static myprogram.f90 -o myprogram  $ gfortran -static myprogram.f95 -o myprogram  $ pgf95 -Bstatic myprogram.f95 -o myprogram

C           $ module load intel                         $ module load gcc                              $ module load pgi
            $ icc -static myprogram.c -o myprogram      $ gcc -static myprogram.c -o myprogram         $ pgcc -Bstatic myprogram.c -o myprogram

C++ ¹       $ module load intel                         $ module load gcc                              $ module load pgi
            $ icc -static myprogram.cpp -o myprogram    $ g++ -static myprogram.cpp -o myprogram       $ pgCC -Bstatic myprogram.cpp -o myprogram

¹  The suffix of a C++ file may be .C, .c, .cc, .cpp, .cxx, or .c++.

The Intel, GNU and PGI compilers will not output anything for a successful compilation. Also, the Intel compiler does not recognize the suffix ".f95".
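
Because a successful compilation prints nothing, a script should rely on the exit status of the compiler rather than on its output. A minimal sketch follows; the function name checked_build is hypothetical.

```shell
# Hypothetical wrapper: run a compile command given as arguments and report
# the result based on its exit status.
checked_build() {
    rc=0
    "$@" || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "build succeeded: $*"
    else
        echo "build failed with exit status $rc: $*" >&2
        return "$rc"
    fi
}
```

For example:

$ checked_build gcc -static myprogram.c -o myprogram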

An older version of the GNU compiler will be in your path by default. Do NOT use this version. Instead, load a newer version using the command module load gcc.

More information on compiler options is available in the official man pages on the Web. Also, the command man mycompiler displays man pages (only after using module load to load the appropriate compiler).

Here is some more documentation from other sources on the various compilers:

Compiling MPI Programs

BoilerGrid allows only serial programs to run via Condor. There is no support for MPI.

Compiling OpenMP Programs

BoilerGrid allows only serial programs to run via Condor. There is no support for OpenMP.

Compiling Hybrid Programs

BoilerGrid allows only serial programs to run via Condor. There is no support for MPI or OpenMP.

Provided Libraries

BoilerGrid has a few preinstalled libraries, including mathematical libraries. More detailed documentation on the libraries available on BoilerGrid follows.

MPICH Library

There is currently no support for MPICH through Condor.

Intel Math Kernel Library (MKL)

Intel Math Kernel Library (MKL) contains ScaLAPACK, LAPACK, Sparse Solver, BLAS, Sparse BLAS, CBLAS, GMP, FFTs, DFTs, VSL, VML, and Interval Arithmetic routines. MKL resides in the directory /opt/intel/mkl/9.1, and it has the following subdirectory structure:

  • lib/32    Libraries for 32-bit Applications
    • libmkl_ia32.a    Optimized Kernels (BLAS, CBLAS, Sparse BLAS, GMP, FFTs, DFTs, VML, VSL, Interval Arithmetic)
    • libmkl_lapack.a    LAPACK Routines
    • libmkl_lapack95.a    LAPACK95 Interface (libmkl_lapack.a also required)
    • libmkl_solver.a    Sparse Solver Routines
    • libguide.a    Threading Library for Static Linking
  • lib/em64t    Libraries for Intel EM64T Applications
    • libmkl_em64t.a    Optimized Kernels (BLAS, CBLAS, Sparse BLAS, GMP, FFTs, DFTs, VML, VSL, Interval Arithmetic)
    • libmkl_lapack.a    LAPACK Routines
    • libmkl_lapack95.a    LAPACK95 Interface (libmkl_lapack.a also required)
    • libmkl_solver.a    Sparse Solver Routines
    • libguide.a    Threading Library for Static Linking

Here are some example combinations of linking options:

  (static linking of LAPACK and Kernels)
$ myfortrancompiler myprogram.f -L${MKLPATH} -lmkl_lapack -lmkl_ia32 -lguide -lpthread

  (static linking of Fortran-95 LAPACK Interface and Kernels)
$ myfortrancompiler myprogram.f95 -L${MKLPATH} -lmkl_lapack95 -lmkl_lapack -lmkl_ia32 -lguide -lpthread

  (static linking of BLAS, Sparse BLAS, GMP, VML/VSL, Interval Arithmetic, and FFT/DFT)
$ myccompiler myprogram.c -L${MKLPATH} -lmkl_ia32 -lguide -lpthread -lm

  (dynamic linking of BLAS or FFTs)
$ myccompiler myprogram.c -L${MKLPATH} -lmkl -lguide -lpthread

RCAC recommends that you use dynamic linking of libguide. If so, define LD_LIBRARY_PATH such that you are using the correct version of libguide at run time. If you use static linking of libguide (discouraged), then:

  • If you use the Intel compilers, link in the libguide version that comes with the compiler (use the -openmp option).
  • If you do not use the Intel compilers, link in the libguide version that comes with the Intel MKL above.

Here is some more documentation from other sources on the Intel MKL:

Mixing Fortran, C, and C++ Code on Unix

You may write different parts of a computing application in different programming languages. For example, an application might incorporate older, legacy code which performs numerical calculations written in Fortran. Systems functions might use C. A newer, main program which binds together all older code might use C++ to take advantage of the object orientation. This section illustrates a few simple examples.

For more information about mixing programming languages:

Using cpp with Fortran

If the source file ends with .F, .fpp, or .FPP, cpp automatically preprocesses the source code before compilation. If you want to use the C preprocessor with source files that do not end with .F, use the following compiler option to specify the filename suffix:

  • GNU Compilers: -x f77-cpp-input
    Note that preprocessing does not extend to the contents of files included by an "INCLUDE" directive. You must use the #include preprocessor directive instead.
    For example, to preprocess source files that end with .f:
    $ gfortran -x f77-cpp-input myprogram.f
    
  • Intel Compilers: -cpp
    To tell the compiler to link using C++ runtime libraries included with gcc/icc:
    $ ... -cxxlib-gcc   (or -cxxlib-icc)
    
    For example, to preprocess source files that end with .f:
    $ ifort -cpp myprogram.f
    

Generally, it is advisable to rename your file from myprogram.f to myprogram.F. The preprocessor then automatically runs when you compile the file.

For more information on combining C/C++ and Fortran:

C Program Calling Subroutines in Fortran, C, and C++

A C language program calls routines written in Fortran 90, C, and C++. The routines change the value of a character argument. To understand what makes this example work, you must be aware of a few simple issues.

To discover how the chosen Fortran compiler handles the names of routines, apply the Linux command nm to the object file: nm filename.o. The Fortran compilers used in this example append an underscore after the name of a routine. The C program calls the Fortran routine with the underscore character.
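
The nm check can be sketched as a small filter over the nm output. The function name fortran_symbol is hypothetical; subr_f is the Fortran routine named in the program output in this section.

```shell
# Hypothetical filter: read nm output on stdin and print the symbols that
# match a routine name, with or without a trailing underscore, ignoring
# letter case.
fortran_symbol() {
    awk -v name="$1" '
        {
            sym = tolower($NF)          # the symbol name is the last field
            if (sym == tolower(name) || sym == tolower(name) "_")
                print $NF
        }'
}
```

For example, after compiling f90.f90 into f90.o:

$ nm f90.o | fortran_symbol subr_f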

Fortran uses pass-by-reference while C uses pass-by-value. Therefore, for a Fortran routine to return a value to a C program, the C program must pass the address of the argument (address-of operator "&") in its call to the Fortran routine. For a C++ routine to return a value to a C program, the C++ routine may use the pass-by-reference syntax (ampersand "&") of C++ while the C program again passes an address (ampersand "&") in its call to the C++ routine.

The C++ compiler must know at the time of compiling the C++ routine that the C program will invoke the C++ routine with the C-style interface rather than the C++ interface. In C++ source code, an extern "C" declaration specifies this.

The following files of source code illustrate these technical details:

Separately compile each source code file with the appropriate compiler into an object (.o) file. Then link the object files into a single executable file (a.out):

C Main Program

Intel Compiler                            GNU Compiler                                PGI Compiler
$ module load intel                       $ module load gcc                           $ module load pgi
$ icc -c main.c                           $ gcc -c main.c                             $ pgcc -c main.c
$ ifort -c f90.f90                        $ gfortran -c f90.f90                       $ pgcc -c c.c
$ icc -c c.c                              $ gcc -c c.c                                $ pgCC -c cpp.cpp
$ icc -c cpp.cpp                          $ g++ -c cpp.cpp                            $ pgf90 -Mnomain main.o c.o cpp.o f90.f90
$ icc -lstdc++ main.o f90.o c.o cpp.o     $ gcc -lstdc++ main.o f90.o c.o cpp.o

The results show that each routine successfully returns a different character to the main program:

$ a.out
main(), initial value:               chr=X
main(), after function subr_f_():    chr=f
main(), after function func_c():     chr=c
main(), after function func_cpp():   chr=+
Exit main.c

C++ Program Calling Subroutines in Fortran, C, and C++

A C++ language program calls routines written in Fortran 90, C, and C++. The routines change the value of a character argument. To understand what makes this example work, you must be aware of a few simple issues.

To discover how the chosen Fortran compiler handles the names of routines, apply the Linux command nm to the object file: nm filename.o. The Fortran compilers used in this example append an underscore after the name of a routine. The C++ program calls the Fortran routine with the underscore character.

Fortran uses pass-by-reference while C++ uses pass-by-value by default. Therefore, for a Fortran routine to return a value to a C++ program, the C++ program must pass the address of the argument (address-of operator "&") in its call to the Fortran routine. For a C routine to return a value to a C++ program, the C routine must declare its parameter as a pointer (asterisk "*") while the C++ program again passes an address (ampersand "&") in its call to the C routine.

The C++ compiler must know at the time of compiling the C++ program that the C++ program will invoke the Fortran and C routines with the C-style interface rather than the C++ interface. In C++ source code, an extern "C" declaration specifies this.

The following files of source code illustrate these technical details:

Separately compile each source code file with the appropriate compiler into an object (.o) file. Then link the object files into a single executable file (a.out):

C++ Main Program

Intel Compiler                            GNU Compiler                                PGI Compiler
$ module load intel                       $ module load gcc                           $ module load pgi
$ icc -c main.cpp                         $ g++ -c main.cpp                           $ pgCC -c main.cpp
$ ifort -c f90.f90                        $ gfortran -c f90.f90                       $ pgf90 -c f90.f90
$ icc -c c.c                              $ gcc -c c.c                                $ pgcc -c c.c
$ icc -c cpp.cpp                          $ g++ -c cpp.cpp                            $ pgCC -c cpp.cpp
$ icc -lstdc++ main.o f90.o c.o cpp.o     $ g++ main.o f90.o c.o cpp.o                $ pgCC -L../lib main.o c.o cpp.o f90.o -pgf90libs

The results show that each routine successfully returns a different character to the main program:

$ a.out
main(), initial value:               chr=X
main(), after function subr_f_():    chr=f
main(), after function func_c():     chr=c
main(), after function func_cpp():   chr=+
Exit main.cpp

Fortran Program Calling Subroutines in Fortran, C, and C++

A Fortran language program calls routines written in Fortran 90, C, and C++. The routines change the value of a character argument. To understand what makes this example work, you must be aware of a few simple issues.

To discover how the chosen Fortran compiler handles the names of routines, apply the Linux command nm to the object file: nm filename.o. The Fortran compilers used in this example append an underscore after the name of a routine, so the definitions of the C and C++ routines must include the underscore. The Fortran program calls these routines without the underscore character in the Fortran source code.

Fortran uses pass-by-reference while C uses pass-by-value. Therefore, for a C routine to return a value to a Fortran program, the C routine's definition must declare the parameter as a pointer (asterisk "*"). For a C++ routine to return a value to a Fortran program, the C++ routine may use the pass-by-reference syntax (ampersand "&") of C++ in its definition.

The C++ compiler must know at the time of compiling the C++ routine that the Fortran program will invoke the C++ routine with the C-style interface rather than the C++ interface.

The following files of source code illustrate these technical details:

Separately compile each source code file with the appropriate compiler into an object (.o) file. Then link the object files into a single executable file (a.out):

Fortran 90 Main Program

Intel compilers:
$ module load intel
$ ifort -c main.f90
$ ifort -c f90.f90
$ icc -c c.c
$ icc -c cpp.cpp
$ ifort -lstdc++ main.o f90.o c.o cpp.o

GNU compilers:
$ module load gcc
$ gfortran -c main.f90
$ gfortran -c f90.f90
$ gcc -c c.c
$ g++ -c cpp.cpp
$ gfortran -lstdc++ main.o c.o cpp.o f90.o

PGI compilers:
$ module load pgi
$ pgf90 -c main.f90
$ pgf90 -c f90.f90
$ pgcc -c c.c
$ pgCC -c cpp.cpp
$ pgf90 main.o c.o cpp.o f90.o

The results show that each routine successfully returns a different character to the main program:

$ a.out
 main(), initial value:               chr=X
 main(), after function subr_f():     chr=f
 main(), after function subr_c():     chr=c
 main(), after function func_cpp():   chr=+
 Exit mixlang

Running Jobs on BoilerGrid

You may use Condor to submit jobs to BoilerGrid. Condor performs job scheduling. Jobs may be serial only. You may use only the batch mode for developing and running your program. BoilerGrid does not offer an interactive mode to run your jobs.

Running Jobs via Condor

Condor is one of several distributed computing resources RCAC provides. Like other similar resources, Condor provides a framework for running programs on otherwise idle computers. While this imposes serious limitations on parallel jobs and codes with large I/O or memory requirements, Condor can provide a large quantity of cycles for researchers who need to run hundreds of smaller jobs.

Condor is a specialized batch system for managing compute-intensive jobs. Condor provides a queuing mechanism, scheduling policy, priority scheme, and resource classifications. Users submit their jobs to Condor, which then puts these jobs in a queue, runs them, and reports back with the results.

In some ways, Condor is different from other batch systems. Other batch systems usually operate only on dedicated machines or compute servers. Condor can both schedule jobs on dedicated machines and effectively utilize non-dedicated machines to run jobs. On non-dedicated machines, it runs jobs only while the machine is idle (no keyboard activity, no load average, no active telnet users, etc.). In this way, Condor effectively harnesses otherwise idle machines throughout a pool of machines.

Currently, RCAC uses Condor to utilize idle cycles on all RCAC computational resources, including all Linux cluster nodes as well as some other servers and workstations. While RCAC uses PBS to schedule the resources of the Linux clusters, Condor schedules jobs on compute nodes when the nodes are not running PBS jobs. When PBS elects to run a new job on a node which is currently running Condor-scheduled jobs, Condor preempts all jobs running on that node to make room for the PBS-scheduled job. You may submit Condor jobs from most of the RCAC systems (Hansen, Rossmann, Coates, Steele, Miner, or Radon).

For more information:

Tips

  • Do not queue thousands of jobs at once. Submit fewer jobs at a time or use DAGMan to divide your jobs into reasonably sized chunks (fewer than 500 jobs per set).
  • Never run condor_q repeatedly on a heavily used submit node. The condor_schedd is single-threaded and schedules work in the same thread that you are using to list the queue. This actually takes resources away from the scheduler and is counter-productive.
  • Long jobs should run in the Standard Universe, not in the Vanilla Universe, since they will likely never finish in Vanilla.
  • Vanilla Universe can use Intel compilers (may run 30–40% faster). Using Intel compilers under Vanilla may ultimately provide better throughput than checkpointing jobs in the Standard Universe using a different compiler because the speed gained from using the Intel compilers may be greater than the advantage of checkpointing.
  • Prefer statically linked libraries over dynamically linked libraries.
  • Generally, if your jobs run in less than 1/2 hour, they will seldom be evicted. If they take 1/2 hour to 1 hour, there will usually still only be a few evictions.
  • Purdue has both a scavenging/preempting and a scheduling system. Remember that the Condor pool is very heterogeneous, both regarding processor versions and OS versions/types (both Linux of different varieties and some Windows).
  • At Purdue, RCAC disabled all automatic email notification using the notification Condor submission command. Setting this in a submission file will have no effect.
  • Why no middleware (like Mycluster at TACC)? Middleware can be easier for the user, since it uses Condor (and other schedulers) "behind the scenes". Middleware packages are also schedulers themselves and will not start a job until they can guarantee to run it to completion (no eviction). However, Condor restarts jobs often, which adds considerable overhead when there are many jobs. For a large number of jobs, using Condor without any middleware is a better approach.
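The first tip above, keeping each set under 500 jobs, is easy to script. The following Python sketch is only an illustration and not an RCAC tool: the function name chunk_jobs, the limit of 499, and the sample workload of 1200 parameter values are all assumptions made for the example.

```python
def chunk_jobs(job_args, max_per_set=499):
    """Split a list of per-job arguments into sets of at most max_per_set."""
    if max_per_set < 1:
        raise ValueError("max_per_set must be at least 1")
    return [job_args[i:i + max_per_set]
            for i in range(0, len(job_args), max_per_set)]


# Hypothetical workload: 1200 parameter values, one job per value.
sets = chunk_jobs(list(range(1200)))
print([len(s) for s in sets])  # [499, 499, 202]
```

Each resulting set could then become its own job submission file, submitted only after the previous set has drained from the queue.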

Choosing a Condor Universe

A Universe in Condor defines an execution environment. Condor supports several different Universes for user jobs. The most used on BoilerGrid are "Standard", "Vanilla", and "Globus" (or "Grid"). There are other Universes. See Chapter 2.4.1 of the Condor Manual for more details about the different Universes.

Job submission files specify the Condor Universe through the universe command. The default Universe is Vanilla (not Standard). Windows compute nodes accept only Vanilla Universe jobs.

You will need to determine the appropriate Universe for your jobs. Here are some more details about how the Universes differ:

  • Vanilla Universe

    The Vanilla Universe is the default (except where the configuration variable DEFAULT_UNIVERSE defines it otherwise). It is an execution environment for jobs which you did not re-link with the Condor libraries. It provides fewer services, but has very few restrictions. Preemption with either suspension or eviction (without checkpointing) is a signature of the Vanilla Universe. If a compute node which is running one or more Vanilla jobs ceases to be idle, Condor will either suspend or evict those jobs. Condor may restart a suspended job on the same compute node; Condor will restart evicted jobs on other compute nodes. When re-linking a computer program to the Condor libraries is impossible or when you wish to use a compiler which is incompatible with condor_compile, use the Vanilla Universe.

    Virtually any non-parallel program can use the Vanilla Universe. Shell scripts may be executables. It is the only possibility for Windows machines. You may use compilers which are incompatible with condor_compile. For example, Intel compilers may run 30–40% faster than compatible compilers and may even be faster for somewhat longer jobs, because the speed gain may be bigger than the advantage from checkpointing in the Standard Universe. Preemption with suspension or eviction is, in general, bad for long jobs, but OK for short jobs. A long job may never finish because repeated preemptions with restarts can prevent completion.

    Static linkage of libraries for Vanilla Universe jobs eliminates the chance of running a job with different, older libraries which may be available on some compute nodes since it sends the same collection of libraries to all compute nodes. There is the risk that some compute nodes are sufficiently out of synch with the submission host that they are unable to run the newer libraries. RCAC recommends using static linkage if at all possible.

  • Standard Universe

    The Standard Universe supports transparent job preemption with checkpointing, remote system calls, and migration from compute node to compute node without restarting. Specifying the Standard Universe in your job submission file tells Condor that you previously re-linked your job via condor_compile with the Condor libraries while using various Condor-specific compiler options and libraries. The Standard Universe is desirable because of its preemption with checkpointing. If possible, use the Standard Universe for long jobs. Long jobs are less likely to finish in the Vanilla Universe.

    There are a few restrictions on programs. There is no possibility of sub-processes. Shell scripts may not be executables. You may not use incompatible compilers, for example Intel compilers. All Standard Universe executables should be statically linked since there is no guarantee that the dynamic libraries on all machines in the flock will be the same version. That way Condor will send the same executable file to all machines. There is also the risk that your job may land on a system whose version differs from that of your build system. The condor_compile command specifies static linkage as part of its arguments to the linker; condor_compile displays these arguments in the 'LINKING FOR' message. This command not only forces a static link but also fills in a number of wrappers for standard C library routines to make, among other things, remote file access work.

  • Globus (or Grid) Universe

    The Globus or Grid Universe forwards the job to an external job management system. You use the grid_resource command to apply additional specifications of the Grid Universe. The Globus or Grid Universe allows users to submit jobs using Condor's interface. These jobs execute on grid resources. For Globus jobs, see http://www.globus.org for more information.

Job Submission File

Example 1

Here is the simplest possible job submission file. It will queue one copy of the program hello for execution by Condor. Condor will use its default universe and the default platform, which means to run the job on a compute node which has the same architecture and operating system as the submission host.

No input, output, and error commands appear in the job submission file, so the files stdin, stdout, and stderr will all refer to /dev/null, the null device. This special file discards all data written to it while reporting that the write operation succeeded, and it returns EOF to any process that reads from it. The program may still produce output by explicitly opening a file and writing to it. Condor writes a log file, hello.log, for this job. The log records the events of the job's lifetime inside Condor, including any errors, and notes the job's exit conditions when it finishes. Condor recommends a log file so that you know what happened to your jobs.

If your program only returns output to the screen (like the hello.c program below does), then you should include Output = hello.out or something like it somewhere before Queue. Otherwise you will not see the output.

If you do not explicitly choose a universe, Condor uses the default universe: Vanilla Universe.

####################
#
# Example 1
# Simple Condor job description file
#
####################

executable     = hello
log            = hello.log
queue

Example 2

This example (from the Condor Manual) queues two copies of the program Mathematica. The first copy will run in directory run_1, and the second will run in directory run_2. For both queued copies, stdin will be file test.data, stdout will be file loop.out, and stderr will be file loop.error. This job writes two sets of files in separate directories. This is a convenient way to organize data if you have a large group of Condor jobs to run. The example file submits Mathematica as a Vanilla Universe job, since neither the source nor the object code of Mathematica is available for relinking to the Condor libraries.

Condor recommends using a single log file.

####################
#
# Example 2
# Demonstrate use of multiple directories for data organization
#
####################

universe   = VANILLA
executable = mathematica
input      = test.data
output     = loop.out
error      = loop.error
log        = loop.log

initialdir = run_1
queue

initialdir = run_2
queue

Example 3

In this example (also from the Condor Manual), the job submission file queues 150 runs of program foo which you compiled and linked for Silicon Graphics workstations running IRIX 6.5. This job requires Condor to run the program on machines which have at least 32 megabytes of physical memory, and expresses a preference to run the program on machines with at least 64 megabytes, if such machines are available. It also advises Condor that it will use up to 28 megabytes of memory when running. Each of the 150 runs of the program receives its own process number, starting with process number 0. So, files stdin, stdout, and stderr will refer to in.0, out.0, and err.0 for the first run of the program; in.1, out.1, and err.1 for the second run of the program; and so forth. A log file foo.log will contain entries about when and where Condor runs, checkpoints, and migrates processes for the 150 queued runs of the program.

####################
#
# Example 3
# Show off some fancy features including use of pre-defined macros and logging
#
####################

executable   = foo
requirements = Memory >= 32 && OpSys == "IRIX65" && Arch == "SGI"
rank         = Memory >= 64
image_size   = 28 Meg

error   = err.$(Process)
input   = in.$(Process)
output  = out.$(Process)
log     = foo.log

queue 150
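Before submitting a file like Example 3, it can help to list the file names that the $(Process) macro will produce and confirm that every input file exists. The Python sketch below only mimics the substitution for illustration; the helper name expand_process is an assumption and is not part of Condor.

```python
import os


def expand_process(template, count):
    """Return the names obtained by replacing $(Process) with 0..count-1."""
    return [template.replace("$(Process)", str(n)) for n in range(count)]


# File name pattern and run count taken from Example 3.
inputs = expand_process("in.$(Process)", 150)
print(inputs[0], inputs[1], inputs[-1])  # in.0 in.1 in.149

# Report any input files that are not yet present in the current directory.
missing = [name for name in inputs if not os.path.exists(name)]
print(len(missing), "input file(s) missing")
```

Running such a check in the submission directory before condor_submit avoids queuing runs that would fail immediately for lack of input.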

Job Submission

Once you have a job submission file, you may submit this file to Condor using the condor_submit command. As described above, a job submission file contains the commands and keywords which specify the type of compute node on which you wish to run your job. Condor will find an available processor core and run your job there, or leave your job in a queue until one becomes available.

You may submit jobs to BoilerGrid from any BoilerGrid submission host, including all RCAC cluster front-ends.

To submit a job submission file:

$ condor_submit myjobsubmissionfile

For more information about job submission:

Job Status

To check on the progress of your jobs, view the Condor queue on the host from which you submitted the jobs.

You must make certain that you are logged in to the same submission host (…-fe00, …-fe01, …-fe02, etc.) from which you submitted your jobs, or you will not see them in the queue.

To view the status of all jobs in the Condor queue of your login host:

$ condor_q

To see only your own jobs, specify your own username as an argument:

$ condor_q myusername

-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:42163> : condor.rcac.purdue.edu
 ID         OWNER        SUBMITTED    RUN_TIME   ST PRI SIZE CMD
1100900.0   myusername   2/20 15:13   0+00:00:00 I  0   0.0  Hello

1 jobs; 1 idle, 0 running, 0 held
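Because running condor_q repeatedly loads the scheduler, one option is to save its output once and summarize it offline. The Python sketch below tallies the "ST" column from saved output. It assumes the column layout shown in the sample above (job ID first, state in the sixth whitespace-separated field); the function name tally_states is hypothetical.

```python
from collections import Counter

# Saved condor_q output, copied from the sample above.
SAMPLE = """\
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:42163> : condor.rcac.purdue.edu
 ID         OWNER        SUBMITTED    RUN_TIME   ST PRI SIZE CMD
1100900.0   myusername   2/20 15:13   0+00:00:00 I  0   0.0  Hello

1 jobs; 1 idle, 0 running, 0 held
"""


def tally_states(text):
    """Count jobs per state code (I, R, H, ...) in saved condor_q output."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue
        job_id = fields[0]
        # Job lines start with a cluster.process ID such as 1100900.0.
        if "." in job_id and job_id.replace(".", "", 1).isdigit():
            counts[fields[5]] += 1
    return counts


print(dict(tally_states(SAMPLE)))  # {'I': 1}
```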

Secondly, you may check on the status of your jobs through their log files. In your job submission file, you can specify a log command (log = myjob.log) at any point prior to the queue command. The main events during the processing of the job will appear in this log file: submittal, execution commencement, preemption, checkpoint, eviction, and termination.

Thirdly, as soon as your job begins executing, Condor will start a condor_shadow process on the submission host. This shadow process is the mechanism by which the remotely executing jobs can access the environment of the submit host, such as input and output files. There is a shadow process started on the submit host for each job. However, the load on the submit host from this is usually not significant. If you notice degraded performance, you can limit the number of jobs that can run simultaneously using the MAX_JOBS_RUNNING configuration parameter. Please contact RCAC for help with this if you notice poor performance.

To list all the compute nodes which are running your jobs:

$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'RemoteUser=="myusername@rcac.purdue.edu"'

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime
ba-005.rcac.p LINUX       INTEL  Claimed    Busy       1.000   502  0+00:24:44
ba-006.rcac.p LINUX       INTEL  Claimed    Busy       0.990   502  0+00:20:22
ba-007.rcac.p LINUX       INTEL  Claimed    Busy       1.000   502  0+00:23:16
ba-008.rcac.p LINUX       INTEL  Claimed    Busy       1.000   502  0+00:30:20
...

For more information about monitoring your job:

Job Cancellation

The command condor_rm removes a job from the queue. If the job has already started running, then Condor kills the job and removes its queue entry. Use condor_q to get the ID of the job.

Queue of jobs before removal:

$ condor_q
	
Submitter: condor.rcac.purdue.edu : <128.211.157.86:35407> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
...
260076.7   nice-user       8/18 00:05   0+00:21:47 I  0   29.3 startfah.sh -oneun
260076.9   nice-user       8/18 00:05   0+01:40:44 I  0   136.7 startfah.sh -oneun
260185.0   myusername      8/30 13:01   0+00:00:00 R  0   19.5 hello
...

Remove a job:

$ condor_rm 260185.0
Job 260185.0 marked for removal

Queue of jobs after removal:

$ condor_q
	
Submitter: condor.rcac.purdue.edu : <128.211.157.86:35407> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
...
260076.7   nice-user       8/18 00:05   0+00:21:47 I  0   29.3 startfah.sh -oneun
260076.9   nice-user       8/18 00:05   0+01:40:44 I  0   136.7 startfah.sh -oneun
...

For more information about removing your job:

Workflow Summary

This section offers a quick overview of the steps involved in preparing and submitting a simple Condor job.

  1. Prepare the Code

    The "Hello World" program below is a simple program which displays the text "hello, world":
    /* FILENAME: hello.c */
    #include <stdio.h>		
    int main (void) {		
        printf("hello, world\n");		
        return 0;
    }
    
  2. Choose the Condor Universe

    The two most commonly used Condor Universes are Standard and Vanilla. The "Hello World" program above will run in either universe.

    • Vanilla Universe

      Compile the "Hello World" program normally using any available compiler:
      $ module load intel
      $ icc -static hello.c -o hello
      
      $ module load gcc
      $ gcc -static hello.c -o hello
      	
      $ module load pgi
      $ pgcc -Bstatic hello.c -o hello
      
    • Standard Universe

      Relink the "Hello World" program with the Condor library using the condor_compile command and a compatible compiler:
      $ module load gcc	
      $ condor_compile gcc hello.c -o hello
      
  3. Prepare the Job Submission File

    Your job submission file defines how to run the job via Condor. It specifies the executable file, the chosen universe, a file containing standard input (not used in this example), files which will receive standard output and standard error, and the Condor log file, as well as many other possible parameters. The queue directive specifies how many executions of the job are to occur. Usually this is just once, as here:

    • Vanilla Universe

      # FILENAME: hello.sub
      executable = hello
      universe   = vanilla
      output     = hello.out
      error      = hello.err
      log        = hello.log
      queue
      
    • Standard Universe

      # FILENAME: hello.sub
      executable = hello
      universe   = standard
      output     = hello.out
      error      = hello.err
      log        = hello.log
      queue
      
  4. Submit the Job

    To run the "Hello World" program, use the condor_submit command to submit the job submission file to Condor:
    $ condor_submit hello.sub
    Submitting job(s).
    Logging submit event(s).
    1 job(s) submitted to cluster 1100744.
    
  5. Monitor the Job

    Once you submit the job, Condor will manage its execution. You can monitor the job's progress with the condor_q command:
    $ condor_q myusername
    
    
    -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:56939> : condor.rcac.purdue.edu
     ID      OWNER              SUBMITTED     RUN_TIME  ST PRI SIZE CMD
    1100744.0  myusername  2/17 15:36  0+00:00:00  I  0   0.0  hello
    	
    1 jobs; 1 idle, 0 running, 0 held
    
  6. Remove the Job

    If you discover an error in your job while waiting for the results, you can remove the job from the queue with the condor_rm command:
    $ condor_rm 1100744
    
  7. View the Results

    When the "Hello World" program completes, its output will appear in the file hello.out. The exit status of your program and various statistics about its performance, including time used and I/O performed, will appear in the log file hello.log. To view the output file:
    $ less hello.out
    hello, world
    

    A log of Condor activity during your job's run will appear in the file hello.log. This log may report zero bytes transferred for some Vanilla Universe jobs, as the compute node may have been able to directly access your files through a shared filesystem without needing to transfer them to the compute node. To view the log file:
    $ less hello.log
    000 (1100744.000.000) 02/17 15:36:51 Job submitted from host: <128.211.157.86:56939>
    ...
    001 (1100744.000.000) 02/17 15:41:49 Job executing on host: <128.211.157.10:57321>
    ...
    005 (1100744.000.000) 02/17 15:41:53 Job terminated.
            (1) Normal termination (return value 0)
                    Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                    Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                    Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                    Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
            1018  -  Run Bytes Sent By Job
            5429958  -  Run Bytes Received By Job
            1018  -  Total Bytes Sent By Job
            5429958  -  Total Bytes Received By Job
    ...
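The log excerpt above has a regular shape: each event starts with a three-digit event code, the job ID in parentheses, a timestamp, and a message. The Python sketch below extracts those fields from such an excerpt. The regular expression is inferred from the sample lines only and is an assumption, not a complete description of the Condor log format; the name parse_events is hypothetical.

```python
import re

# Pattern inferred from the sample: code, (cluster.proc.subproc), date time, message.
EVENT = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$"
)

SAMPLE_LOG = """\
000 (1100744.000.000) 02/17 15:36:51 Job submitted from host: <128.211.157.86:56939>
...
001 (1100744.000.000) 02/17 15:41:49 Job executing on host: <128.211.157.10:57321>
...
005 (1100744.000.000) 02/17 15:41:53 Job terminated.
        (1) Normal termination (return value 0)
...
"""


def parse_events(text):
    """Return (event_code, job_id, timestamp, message) for each event line."""
    events = []
    for line in text.splitlines():
        m = EVENT.match(line.strip())
        if m:
            code, cluster, proc, when, message = m.groups()
            events.append((code, f"{int(cluster)}.{int(proc)}", when, message))
    return events


for event in parse_events(SAMPLE_LOG):
    print(event)
```

In the sample, codes 000, 001, and 005 accompany the submitted, executing, and terminated messages, so a job whose log lacks a 005 event has not yet terminated.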
    

Job Hold

There are many reasons to put a job on hold. For example, if you do not have enough space to hold all the results at the same time and need to move them elsewhere as they arrive, you could queue all jobs and put them on hold immediately. Then release a few jobs at a time (condor_release accepts a -constraint option, so you can script this step), move the results as they appear, and release some more jobs. Besides holds that you place manually, the Condor scheduler can hold jobs for various reasons (for example, when it is unable to write to your directory).

Any job in the hold state will remain in the hold state until released. A job in the queue may be placed on hold. A currently running Standard Universe job receives a hard kill signal (preemption without checkpointing), and Condor returns the job to the queue; when released, this Standard Universe job continues its execution using the most recent checkpoint available. A currently running Vanilla Universe job receives a hard kill signal (preemption without checkpointing), and Condor returns the job to the queue; when released, this Vanilla Universe job restarts at the beginning.

To hold a job:

$ condor_hold myjobid

To view the state, column "ST", of the held job, "H":

$ condor_q myusername

-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:42163> : condor.rcac.purdue.edu
 ID         OWNER       SUBMITTED    RUN_TIME   ST PRI SIZE CMD
1101790.0   myusername   2/24 14:53   0+00:00:00 H  0   0.0  Hello

1 jobs; 0 idle, 0 running, 1 held

For more information about holding your job:

Job Release

A job that is in the hold state remains there until later released for execution.

To release a held job:

$ condor_release myjobid

The state of the released job is now "Idle", "I":

$ condor_q myusername

-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:42163> : condor.rcac.purdue.edu
 ID         OWNER              SUBMITTED    RUN_TIME   ST PRI SIZE CMD
1101790.0   myusername   2/24 14:53   0+00:00:00 I  0   0.0  Hello

1 jobs; 1 idle, 0 running, 0 held

To release all held jobs of a single user:

$ condor_release myusername
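Releasing held jobs a few at a time, as described under Job Hold, can be scripted with the -constraint option of condor_release. The Python sketch below only builds the command lines and does not run them. It assumes the jobs share one cluster and are selected by the job ClassAd attributes ClusterId and ProcId; the function name release_commands, the count of 25 jobs, and the batch size of 10 are illustrative.

```python
def release_commands(cluster_id, total_procs, batch_size):
    """Build one condor_release command per batch of process numbers."""
    commands = []
    for start in range(0, total_procs, batch_size):
        stop = min(start + batch_size, total_procs)
        constraint = (
            f"ClusterId == {cluster_id} && ProcId >= {start} && ProcId < {stop}"
        )
        commands.append(["condor_release", "-constraint", constraint])
    return commands


# Hypothetical case: 25 held jobs in cluster 1101790, released 10 at a time.
for cmd in release_commands(1101790, 25, 10):
    print(cmd)
```

Each list is in the form that subprocess.run accepts, so a script could run one command, wait until the results have been moved, and then run the next.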

For more information about releasing your job:

Compute Nodes and ClassAds

Condor attempts to start jobs by matching submitted jobs with available compute nodes on the basis of ClassAds. Condor's ClassAds are analogous to the classified advertising section of the newspaper. Both sellers and buyers advertise details about what they have to sell or want to buy. Both buyers and sellers have some requirements which absolutely must be satisfied, such as the right type of item, and some other criteria by which they will prefer certain offers over others, such as a better price. The same is true in Condor, but between users submitting jobs and compute nodes advertising available resources. Condor uses ClassAds to make the best matches between these two groups.

By default, your Condor jobs will seek an available compute node with the same values for the ClassAds Arch and OpSys as the host from which you submitted your job. The submission process assumes that in most cases your jobs will require the same combination of chip architecture and operating system as the host from which you submitted them. You can remove or alter this restriction as shown in the examples in the "Requiring Specific Architectures or Operating Systems" section.

Some applications may require even more specific capabilities. Using ClassAds, you may specify a set of requirements so that only a subset of available compute nodes become candidates to run your job. There are many ClassAds available for you to use in your job requirements. You may also use ClassAds to indicate a preference for certain nodes over others (but not as an absolute requirement) by using the rank command. The following examples illustrate how to discover current ClassAds and how to estimate the number of compute nodes which will match job requirements based on ClassAds.
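The difference between a requirement and a rank can be seen in a toy model. The Python sketch below is not Condor's matchmaking algorithm, which is two-sided and also weighs user priorities; it only shows requirements acting as a filter and rank acting as an ordering. The machine names and memory values are invented for the example.

```python
# Invented machine ClassAds, reduced to a few attributes.
machines = [
    {"Name": "node-a", "Arch": "X86_64", "OpSys": "LINUX", "Memory": 502},
    {"Name": "node-b", "Arch": "X86_64", "OpSys": "LINUX", "Memory": 2048},
    {"Name": "node-c", "Arch": "INTEL", "OpSys": "WINNT61", "Memory": 4096},
]


def requirements(ad):
    """Absolute condition: a machine failing this is never a candidate."""
    return ad["Arch"] == "X86_64" and ad["OpSys"] == "LINUX"


def rank(ad):
    """Preference: among candidates, a higher value is preferred."""
    return float(ad["Memory"])


def match(ads, requirements, rank):
    """Filter by requirements, then order candidates by descending rank."""
    candidates = [ad for ad in ads if requirements(ad)]
    return sorted(candidates, key=rank, reverse=True)


print([ad["Name"] for ad in match(machines, requirements, rank)])
# ['node-b', 'node-a']
```

Here node-c has the most memory but is never considered, because it fails the requirements; rank only orders the machines that pass.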

To save a detailed report of all the ClassAds of all processor cores in BoilerGrid in the file myfile:

$ condor_status -pool boilergrid.rcac.purdue.edu -long > myfile

You may use any of the ClassAds which appear in this list to view a subset of BoilerGrid. For example, to save a listing of all user ID domains or all file system domains in the file myfile:

$ condor_status -pool boilergrid.rcac.purdue.edu -format "%s\n" UidDomain > myfile

$ condor_status -pool boilergrid.rcac.purdue.edu -format "%s\n" FileSystemDomain > myfile

To list all platforms (architectures and operating systems) and the number of processor cores of each platform on BoilerGrid:

$ condor_status -pool boilergrid.rcac.purdue.edu -total

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

         INTEL/LINUX    64    13       5        46       0          0        0
           INTEL/OSX     2     0       0         2       0          0        0
       INTEL/WINNT51   345    29       2       314       0          0        0
       INTEL/WINNT61  4683   150      13      4520       0          0        0
    SUN4u/SOLARIS210     3     2       0         1       0          0        0
        X86_64/LINUX 31395 22617    4734      4035       2          2        5

               Total 36492 22811    4754      8918       2          2        5

Condor uses the name "INTEL" to indicate x86_32 (32-bit Intel-compatible) architecture.

The total number of processor cores on BoilerGrid is 36,492. The predominant platform of BoilerGrid is x86_64/Linux with 31,395 processor cores. The values in this table are approximate and change over time, since compute nodes are periodically out of service for repair.

To see how many compute nodes have a given ClassAd value, add the ClassAd value as a constraint.

To see only how many 64-bit, Intel-compatible, Linux (x86_64/Linux) platforms are currently in BoilerGrid:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX")'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX 31395 22740    4688      3957       3          2        5

               Total 31395 22740    4688      3957       3          2        5

To see how many 64-bit, Intel-compatible, Linux (x86_64/Linux) platforms are currently in BoilerGrid and advertise MATLAB as installed:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (HAS_MATLAB == TRUE)'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX 24659 12008    1557     11094       0          0        0
               Total 24659 12008    1557     11094       0          0        0

You may specify numeric constraints with other relational operators. To discover how many compute nodes have at least 16 GB of memory:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint 'TotalMemory >= 16046'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX 26093 18007    3330      4753       3          0        0
               Total 26093 18007    3330      4753       3          0        0

ClassAd string values are case-sensitive. ClassAd attribute names are case-insensitive. The comparison operators (<, >, <=, >=, and ==) compare strings case-insensitively. The special comparison operators =?= and =!= compare strings case-sensitively. ClassAd expressions are similar to C boolean expressions and can be quite elaborate.
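The two kinds of string comparison described above can be modeled in a few lines. The Python sketch below is a simplified illustration for string operands only; the real =?= operator also distinguishes types and undefined values, which this model ignores. The function names are hypothetical.

```python
def classad_eq(a, b):
    """Model of == on strings: the comparison ignores case."""
    return a.lower() == b.lower()


def classad_is(a, b):
    """Model of =?= on strings: the comparison respects case."""
    return a == b


print(classad_eq("LINUX", "linux"))  # True
print(classad_is("LINUX", "linux"))  # False
print(classad_is("LINUX", "LINUX"))  # True
```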

For more information about ClassAds, requirements, and rank:

Shared Scratch File Systems

Increasing the throughput of your jobs may not come from maximizing the number of candidate compute nodes but rather from limiting the candidate compute nodes to the set which can access the shared scratch file system of the front-end. This limitation is useful in the case of a large input data file since it avoids both using Condor's file transfer mechanism and running the risk of preemptions preventing job completion.

The following table shows the current list of scratch directories:

Cluster                    Scratch Directory                  File System Domain
condor.rcac.purdue.edu,    /scratch/scratch95/m/myusername    bluearc.rcac.purdue.edu
Radon, Steele              /scratch/scratch96/m/myusername
Coates, Rossmann           /scratch/lustreA/m/myusername      lustrea.rcac.purdue.edu
Hansen                     /scratch/lustreC/m/myusername      lustrec.rcac.purdue.edu
Miner                      /scratch/miner/m/myusername        miner.rcac.purdue.edu

To discover your scratch file directory, log in to your submission host and enter either of the following commands:

$ findscratch
$ echo $RCAC_SCRATCH

The response will be one of the following paths:

/scratch/scratch95/m/myusername
/scratch/scratch96/m/myusername
/scratch/lustreA/m/myusername
/scratch/lustreC/m/myusername
/scratch/miner/m/myusername

To see which shared scratch file system a specific cluster can access, search on the ClassAd attribute ClusterName:

$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'ClusterName=="Radon"' -format "%s\n" FileSystemDomain >myfile

To see which shared scratch file systems other clusters use, modify the preceding example with other cluster names: Hansen, Rossmann, Coates, Steele, or Miner.

To see which clusters can access a given shared scratch file system, search on the ClassAd attribute FileSystemDomain:

$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'FileSystemDomain == "bluearc.rcac.purdue.edu"' -format "%s\n" ClusterName >myfile
$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'FileSystemDomain == "lustrea.rcac.purdue.edu"' -format "%s\n" ClusterName >myfile
$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'FileSystemDomain == "lustrec.rcac.purdue.edu"' -format "%s\n" ClusterName >myfile
$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'FileSystemDomain == "miner.rcac.purdue.edu"' -format "%s\n" ClusterName >myfile

Using logical operators, you may combine ClassAd constraints. For example, to see how many x86_64 processor cores running Linux have access to the BlueArc shared scratch file system:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (FileSystemDomain == "bluearc.rcac.purdue.edu")'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX  9232  5515    1431      2286       0          0        0
               Total  9232  5515    1431      2286       0          0        0
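
When many such queries are needed, it may help to assemble the constraint and read the totals in a script. The Python sketch below only builds the constraint string and parses one row of the report shown above; it does not call Condor, and the function names build_constraint and parse_totals are hypothetical.

```python
def build_constraint(attrs):
    """Join attribute == "value" tests with the ClassAd && operator."""
    return " && ".join('({} == "{}")'.format(k, v) for k, v in attrs.items())


def parse_totals(line):
    """Split one row of `condor_status -total` output into named integer counts."""
    columns = ["Total", "Owner", "Claimed", "Unclaimed",
               "Matched", "Preempting", "Backfill"]
    fields = line.split()
    # fields[0] is the row label (for example X86_64/LINUX); the rest are counts.
    return dict(zip(columns, (int(x) for x in fields[1:])))


constraint = build_constraint({
    "Arch": "X86_64",
    "OpSys": "LINUX",
    "FileSystemDomain": "bluearc.rcac.purdue.edu",
})
row = parse_totals(
    "X86_64/LINUX  9232  5515    1431      2286       0          0        0")
unclaimed_share = row["Unclaimed"] / row["Total"]
```

For the report above, unclaimed_share works out to roughly a quarter of the cores that can reach the BlueArc scratch file system.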

List of Common ClassAds

Here is a brief description of some of the common ClassAds and attributes available in Condor. For a more complete listing, see the Job Submission Chapter of the Condor Users' Manual.

Machine Attributes

  • Activity: String which describes Condor job activity on the machine. Can have one of the following values:
    • "Idle": There is no job activity
    • "Busy": A job is busy running
    • "Suspended": A job is currently suspended
    • "Vacating": A job is currently checkpointing
    • "Killing": A job is currently being killed
    • "Benchmarking": The startd is running benchmarks
  • Arch: String with the architecture of the machine.
  • ClockDay: The day of the week, where 0 = Sunday, 1 = Monday, ... , 6 = Saturday.
  • ClockMin: The number of minutes passed since midnight.
  • ConsoleIdle: The number of seconds since activity on the system console keyboard or console mouse has last been detected.
  • Cpus: Number of CPUs in this machine.
  • CurrentRank: A float which represents this machine owner's affinity for running the Condor job which it is currently hosting. If not currently hosting a Condor job, CurrentRank is 0.0. When a machine is claimed, the attribute's value is computed by evaluating the machine's Rank expression with respect to the current job's ClassAd.
  • Disk: The amount of disk space on this machine available for the job in Kbytes.
  • EnteredCurrentActivity: Time at which the machine entered the current Activity. On all platforms (including NT), this is measured in the number of integer seconds since the Unix epoch (00:00:00 UTC, Jan 1, 1970).
  • FileSystemDomain: A "domain" name configured by the Condor administrator which describes a cluster of machines which all access the same, uniformly-mounted, networked file systems usually via NFS or AFS. This is useful for Vanilla Universe jobs which require remote file access.
  • KeyboardIdle: The number of seconds since activity on any keyboard or mouse associated with this machine has last been detected.
  • KFlops: Relative floating point performance as determined via a Linpack benchmark.
  • LoadAvg: A floating point number with the machine's current load average.
  • Machine: A string with the machine's fully qualified hostname.
  • Memory: The amount of RAM in megabytes.
  • Name: The name of this resource. Typically the same value as the Machine attribute, but could be customized by the site administrator. On SMP machines, the condor_startd will divide the CPUs up into separate virtual machines, each with a unique name. These names will be of the form "vm#@full.hostname", for example, "vm1@vulture.cs.wisc.edu", which signifies virtual machine 1 from vulture.cs.wisc.edu.
  • OpSys: String describing the operating system running on this machine.
  • Requirements: A boolean, which when evaluated within the context of the machine ClassAd and a job ClassAd, must evaluate to TRUE before Condor will allow the job to use this machine.
  • MaxJobRetirementTime: An expression giving the maximum time in seconds that the startd will wait for the job to finish before kicking it off if it needs to do so.
  • StartdIpAddr: String with the IP and port address of the condor_startd daemon which is publishing this machine ClassAd.
  • State: String which publishes the machine's Condor state. Can be:
    • "Owner": The machine owner is using the machine, and it is unavailable to Condor.
    • "Unclaimed": The machine is available to run Condor jobs, but a good match is either not available or not yet found.
    • "Matched": The Condor central manager has found a good match for this resource, but a Condor scheduler has not yet claimed it.
    • "Claimed": The machine is claimed by a remote condor_schedd and is probably running a job.
    • "Preempting": A Condor job is being preempted (possibly via checkpointing) in order to clear the machine for either a higher priority job or because the machine owner wants the machine back.
  • VirtualMachineID: For SMP machines, the integer that identifies the VM. The value will be X for the VM with name="vmX@full.hostname". For non-SMP machines with one virtual machine, the value will be 1.
  • VirtualMemory: The amount of currently available virtual memory (swap space) expressed in Kbytes.
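
The ClockDay and ClockMin conventions are easy to get wrong by one. As a rough illustration, the Python sketch below derives both values from a timestamp; the helper names and the "weekday evening" policy are hypothetical and only show how the two attributes combine.

```python
from datetime import datetime


def clock_attributes(moment):
    """Return (ClockDay, ClockMin) for a datetime, using the conventions above."""
    # datetime.weekday() counts Monday as 0; ClockDay counts Sunday as 0.
    clock_day = (moment.weekday() + 1) % 7
    clock_min = moment.hour * 60 + moment.minute
    return clock_day, clock_min


def is_weekday_evening(moment):
    """Hypothetical policy: Monday through Friday, 18:00 or later."""
    day, minute = clock_attributes(moment)
    return 1 <= day <= 5 and minute >= 18 * 60
```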

Job Attributes

  • Args: String representing the arguments passed to the job.
  • CkptArch: String describing the architecture of the machine this job executed on at the time it last produced a checkpoint. If the job has never produced a checkpoint, this attribute is undefined.
  • CkptOpSys: String describing the operating system of the machine this job executed on at the time it last produced a checkpoint. If the job has never produced a checkpoint, this attribute is undefined.
  • ClusterId: Integer cluster identifier for this job. A cluster is a group of jobs that were submitted together. Each job has its own unique job identifier within the cluster, but shares a common cluster identifier. The value changes each time a job or set of jobs is queued for execution under Condor.
  • CompletionDate: The time when the job completed, or the value 0 if the job has not yet completed. Measured in the number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970).
  • CurrentHosts: The number of hosts in the claimed state, due to this job.
  • EnteredCurrentStatus: An integer containing the epoch time of when the job entered into its current status. For example, if the job is on hold, the ClassAd expression CurrentTime - EnteredCurrentStatus will equal the number of seconds that the job has been on hold.
  • ImageSize: Estimate of the memory image size of the job in Kbytes. The initial estimate may be specified in the job submit file. Otherwise, the initial value is equal to the size of the executable. When the job checkpoints, the ImageSize attribute is set to the size of the checkpoint file (since the checkpoint file contains the job's memory image). A Vanilla Universe job's ImageSize is recomputed internally every 15 seconds.
  • JobPrio: Integer priority for this job, set by condor_submit or condor_prio. The default value is 0. The higher the number, the better the priority.
  • JobStartDate: Time at which the job first began running. Measured in the number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970).
  • JobStatus: Integer which indicates the current status of the job.
    • 0: Unexpanded (the job has never run)
    • 1: Idle
    • 2: Running
    • 3: Removed
    • 4: Completed
    • 5: Held
  • JobUniverse: Integer which indicates the job universe.
    • 1: Standard
    • 4: PVM
    • 5: Vanilla
    • 7: Scheduler
    • 8: MPI
    • 9: Grid
    • 10: Java
  • LastMatchTime: An integer containing the epoch time when the job was last successfully matched with a resource (gatekeeper) Ad.
  • LastRejMatchReason: If, at any point in the past, this job failed to match with a resource ad, this attribute will contain a string with a human-readable message about why the match failed.
  • LastRejMatchTime: An integer containing the epoch time when Condor-G last tried to find a match for the job, but failed to do so.
  • MaxHosts: The maximum number of hosts that this job would like to claim. As long as CurrentHosts is the same as MaxHosts, no more hosts are negotiated for.
  • MaxJobRetirementTime: Maximum time in seconds to let this job run uninterrupted before kicking it off when it is being preempted. This can only decrease the amount of time from what the corresponding startd expression allows.
  • MinHosts: The minimum number of hosts that must be in the claimed state for this job, before the job may enter the running state.
  • NumGlobusSubmits: An integer that is incremented each time the condor_gridmanager receives confirmation of a successful job submission into Globus.
  • Owner: String describing the user who submitted this job.
  • ProcId: Integer process identifier for this job. Within a cluster of many jobs, each job has the same ClusterId, but will have a unique ProcId. Within a cluster, assignment of a ProcId value will start with the value 0. The job (process) identifier described here is unrelated to operating system PIDs.
  • RemoteIwd: The path to the directory in which a job is to be executed on a remote machine.
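
The integer codes for JobStatus and JobUniverse are easier to read once decoded. The Python sketch below assumes a job ClassAd has already been loaded into a plain dict (how it is obtained is outside this example); the function name describe_job is hypothetical.

```python
import time

# Code tables copied from the JobStatus and JobUniverse lists above.
JOB_STATUS = {0: "Unexpanded", 1: "Idle", 2: "Running",
              3: "Removed", 4: "Completed", 5: "Held"}
JOB_UNIVERSE = {1: "Standard", 4: "PVM", 5: "Vanilla", 7: "Scheduler",
                8: "MPI", 9: "Grid", 10: "Java"}


def describe_job(ad, now=None):
    """Summarize a job ClassAd given as a plain dict of attribute values."""
    now = int(time.time()) if now is None else now
    return "{}.{} {} {} for {} s".format(
        ad["ClusterId"], ad["ProcId"],
        JOB_UNIVERSE.get(ad["JobUniverse"], "Unknown"),
        JOB_STATUS.get(ad["JobStatus"], "Unknown"),
        # Same arithmetic as CurrentTime - EnteredCurrentStatus.
        now - ad["EnteredCurrentStatus"])
```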

Preemption (checkpointing, suspension, eviction)

Long-running computer programs which are executing in the Condor environment face risks that can prevent job completion, for example power loss, overflow of dynamic memory or disk storage, and preemption. Overflow means that a computer program allocates too much dynamic memory or writes too much data to the disk (remote or local) serving the program. Preemption occurs when a higher priority job needs the compute node. It involves either temporarily interrupting a Condor job with the intention of resuming that job from the point of preemption at a later time and often on a different compute node (checkpointing), stopping the job but keeping it on the compute node (checkpointing followed by suspension), or restarting the job from the beginning on a different compute node (eviction).

Checkpointing is a technique for adding fault tolerance to computing systems. It consists of storing a snapshot of the current state of an application and later using that snapshot to resume execution, possibly on a different compute node. Because a checkpointed job can give up its processor core without losing completed work, Condor can scavenge unused computing cycles without holding back higher-priority work. With checkpointing and suspension, a job has a chance to finish. Eviction may cause a job never to finish if the job's run time is significantly longer than the mean time between preemptions or between power failures. Restarting a job from the beginning can be exceedingly wasteful. Condor handles preemption somewhat differently on various compute nodes in BoilerGrid because the owners of each compute node may specify how they want preemption handled. However, a few general principles are true for all.
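
The claim that eviction may keep a long job from ever finishing can be made concrete with a simple model. The Python sketch below assumes that preemptions arrive at random with a constant average rate (an exponential model, which is an assumption of this example and not a measured property of BoilerGrid) and that restarting costs no extra time. The function names are hypothetical.

```python
import math


def completion_probability(run_time, mean_between_preemptions):
    """Chance that a single attempt runs to completion without preemption."""
    return math.exp(-run_time / mean_between_preemptions)


def expected_time_with_restarts(run_time, mean_between_preemptions):
    """Expected wall time to finish when every preemption restarts the job.

    Under the exponential assumption this is M * (exp(T / M) - 1), where
    T is the run time and M is the mean time between preemptions.
    """
    m = mean_between_preemptions
    return m * (math.exp(run_time / m) - 1.0)
```

Under these assumptions, a 1-hour job on nodes preempted every 4 hours on average needs about 1.14 hours in expectation, while a 12-hour job needs about 76 hours. The exponential growth is why restarting from the beginning becomes wasteful as run time exceeds the mean time between preemptions.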

BoilerGrid offers a heterogeneous collection of compute nodes. These compute nodes are not dedicated to Condor. The majority are Linux systems that also run the Portable Batch System (PBS); many others are Windows desktop machines. Architecture, performance, memory, and disk space vary broadly.

For all compute nodes running PBS, when a PBS-scheduled job needs a compute node, Condor evicts any Condor jobs running on that node at the time. This is known as preemption. When Condor preempts a Standard Universe job, it checkpoints the job, immediately removes it, and starts seeking another compute node to run it, where it will resume the job from the point of preemption. When Condor preempts a Vanilla Universe job, Condor immediately evicts the job and starts seeking another compute node to run it, where it will restart the job at the beginning.

In the case of Windows-running compute nodes, preemption in the Condor environment occurs when a user touches the mouse or keyboard. On some nodes, Condor places the job in suspension and waits a finite amount of time to see whether it can restart the job on the same compute node. Perhaps the user needs only a few minutes to check email. If the compute node is still unavailable, Condor either checkpoints (Standard Universe) or evicts (Vanilla Universe) the job and moves it to another Windows compute node in the BoilerGrid. On other nodes, Condor immediately checkpoints or evicts the job.

To take advantage of the checkpointing and remote system calls of Condor's Standard Universe, you must re-link your program with the Condor libraries. Typically, re-linking requires no change to the source code. However, not all applications can use the Standard Universe. Commercial software binaries cannot be re-linked because vendors rarely make their source or object code available. Applications which must be run from a script cannot be re-linked, nor can programs built with compilers which are incompatible with Condor, even though such a compiler might yield more efficient code which reduces run time and the likelihood of eviction. Such applications must use Condor's Vanilla Universe. Unless a Vanilla job is self-checkpointing, eviction means that all work is lost.
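
A self-checkpointing Vanilla job saves its own progress and looks for that saved progress each time it starts. The Python sketch below is one minimal way to do this, not a Condor facility: the state file name, the JSON format, and the stop_after parameter (which simulates an eviction for demonstration) are all assumptions of this example. The state file must live somewhere that survives eviction, such as a shared scratch file system.

```python
import json
import os


def run_with_checkpoints(total_steps, state_path, stop_after=None):
    """Sum the integers 0..total_steps-1, saving progress after every step.

    stop_after simulates an eviction by returning early after that many steps.
    """
    state = {"next_step": 0, "partial_sum": 0}
    if os.path.exists(state_path):
        with open(state_path) as handle:
            state = json.load(handle)  # resume from the last saved step

    steps_done = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and steps_done >= stop_after:
            return None  # simulated eviction: progress is already on disk
        state["partial_sum"] += state["next_step"]
        state["next_step"] += 1
        steps_done += 1
        # Write to a temporary file, then rename, so an eviction in the
        # middle of a write cannot corrupt the previous checkpoint.
        tmp_path = state_path + ".tmp"
        with open(tmp_path, "w") as handle:
            json.dump(state, handle)
        os.replace(tmp_path, state_path)
    return state["partial_sum"]
```

A real job would checkpoint less often than every step, trading a little repeated work after eviction for less time spent writing state.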

Jobs running for long periods on BoilerGrid have a high probability of being preempted. This risk can warrant significant retooling of a job so that the characteristics of its computation match the compute nodes of BoilerGrid and throughput is maximized. Debugging a computer program and recoding a working program to improve performance are the usual tasks of a programmer. Condor may require additional retooling of that program so that it is able to reach completion.

Condor is able to schedule and run any type of process, but Condor's Standard Universe does have some limitations on any jobs that it checkpoints and migrates:

  • Multi-process jobs are not allowed. This includes system calls such as fork(), exec(), and system().
  • Interprocess communication is not allowed. This includes pipes, semaphores, and shared memory.
  • Network communication must be brief. A job may make network connections using system calls such as socket(), but a network connection left open for long periods will delay checkpointing and migration.
  • Sending or receiving the SIGUSR2 or SIGTSTP signals is not allowed. Condor reserves these signals for its own use. Sending or receiving all other signals is allowed.
  • Alarms, timers, and sleeping are not allowed. This includes system calls such as alarm(), getitimer(), and sleep().
  • Multiple kernel-level threads are not allowed. However, multiple user-level threads are allowed.
  • Memory mapped files are not allowed. This includes system calls such as mmap() and munmap().
  • File locks are allowed, but not retained between checkpoints.
  • All files must be opened read-only or write-only. A file opened for both reading and writing will cause trouble if a job must be rolled back to an old checkpoint image. For compatibility reasons, a file opened for both reading and writing will result in a warning but not an error.
  • A fair amount of disk space must be available on the submitting machine for storing a job's checkpoint images. A checkpoint image is approximately equal to the virtual memory consumed by a job while it runs. If disk space is short, a special checkpoint server can be designated for storing all the checkpoint images for a pool.
  • On Digital Unix (OSF/1), HP-UX, and Linux, your job must be statically linked. Dynamic linking is allowed on all other platforms.

These limitations apply only to Standard Universe jobs. They do not apply to Vanilla Universe jobs.

Examples

To submit jobs successfully to BoilerGrid and to achieve maximum throughput in Condor's computing environment, you must understand the architecture of BoilerGrid and how to request resources which are appropriate to your application. The following examples show how to discover the resources of BoilerGrid. They also explain standard input and output, command-line arguments, file input and output, Standard and Vanilla universe jobs, shared file systems, parameter sweeps, DAG Manager, job requirements and ranks, and how to run commercial and third-party software. You may wish to look here for an example that is most similar to your application and modify that example for your jobs. You may also refer to the Condor Manual for more details.

Simplest Job Submission File

The job submission file must contain one executable command and at least one queue command. All other commands of the job submission file have default actions. Condor's job submission parser ignores blank lines and single-line comments beginning with a pound sign ("#"). There is no block (multi-line) comment in a job submission file. In some cases, a single-line comment may appear on a command line.

# FILENAME: myjob.sub

executable = myprogram
queue    # place one copy of the job in the Condor queue

This job submission file uses Condor's default universe, Vanilla. Because this Vanilla job does not turn on Condor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job. Also by default, the number of jobs queued is one.

This job submission file may appear to be useless because it lacks the standard input, standard output, standard error, and a common log file; however, it will correctly process a program which reads and writes formatted files. Here is an example of file I/O. To compile this program for the Vanilla Universe, see Compiling Serial Programs.

To submit this job to Condor:

$ condor_submit myjob.sub

Standard Input/Output

Condor manages a batch environment. When Condor manages the execution of a computer program, that program cannot offer an interactive experience with a terminal. All input normally read from the keyboard (standard input) must be prepared in a file ahead of execution. All output normally written to the screen (standard output and standard error) appears in files where you may view it after execution. Also, Condor records in a common log file the main events of running a job.

Here is an example of standard I/O. To compile this program for the Vanilla Universe, see Compiling Serial Programs.

Prepare a job submission file with an appropriate filename, here named myjob.sub:

# FILENAME: myjob.sub

executable = myprogram

# Standard I/O files, Condor log file
input  = mydata.in
output = mydata.out
error  = mydata.err
log    = mydata.log

# queue one job
queue

This job submission file uses Condor's default universe, Vanilla. Because this Vanilla job does not turn on Condor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job. Also by default, the number of jobs queued is one.

This submission specifies that there exists a file, mydata.in, which contains all text which the program would otherwise read from the keyboard, standard input. It also specifies the names of three files which will receive standard output, standard error, and Condor's log entries. These three output files need not preexist, but they can. Condor will overwrite standard output and standard error but will append to the log file during subsequent submissions.
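
As a rough illustration of the kind of program this submission drives, the Python sketch below reads numbers from standard input, writes results to standard output, and reports problems on standard error. It is a stand-in written for this example, not the guide's own sample program, and the function name process is hypothetical.

```python
import sys


def process(stdin, stdout, stderr):
    """Double each number read from stdin; report unreadable lines on stderr."""
    for line_number, line in enumerate(stdin, start=1):
        text = line.strip()
        if not text:
            continue  # skip blank lines
        try:
            stdout.write("{}\n".format(2 * float(text)))
        except ValueError:
            stderr.write("line {}: not a number: {}\n".format(line_number, text))
```

Ending the script with the call process(sys.stdin, sys.stdout, sys.stderr) makes it usable as a batch program: under the submission file above, the program would then read mydata.in and its two output streams would land in mydata.out and mydata.err without the program naming any of those files.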

To submit this job to Condor:

$ condor_submit myjob.sub

Command Line Arguments

Condor allows the specification of command-line arguments in the job submission file. There are two permissible formats for specifying arguments. The old syntax has arguments delimited (separated) by space characters. To use double quotes, escape with a backslash (i.e. put a backslash in front of each double quote). For example:

arguments = arg1 \"arg2\" 'arg3'

yields the following arguments:

arg1
"arg2"
'arg3'

The new syntax supports uniform quoting of spaces within arguments. A pair of double quotes surrounds the entire argument list. To include a literal double quote, simply repeat it. White space (spaces, tabs) separates arguments. To include literal white space in an argument, surround the argument with a pair of single quotes. To include a literal single quote within a single-quoted argument, repeat the single quote.
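
The quoting rules of the new syntax can be checked by hand with a small parser. The Python sketch below implements only the rules stated in the preceding paragraph; it is an illustration written for this guide, not Condor's own parser, and the function name parse_new_syntax is hypothetical.

```python
def parse_new_syntax(value):
    """Split a new-syntax arguments value into a list of arguments."""
    if len(value) < 2 or value[0] != '"' or value[-1] != '"':
        raise ValueError("the argument list must be enclosed in double quotes")
    body = value[1:-1]
    args, current = [], []
    in_arg = in_single = False
    i = 0
    while i < len(body):
        ch, pair = body[i], body[i:i + 2]
        if pair == '""':                    # doubled double quote: literal "
            current.append('"')
            in_arg = True
            i += 2
        elif ch == '"':
            raise ValueError("a literal double quote must be doubled")
        elif in_single and pair == "''":    # doubled single quote: literal '
            current.append("'")
            i += 2
        elif ch == "'":                     # opens or closes a quoted section
            in_single = not in_single
            in_arg = True
            i += 1
        elif ch in " \t" and not in_single:  # white space ends an argument
            if in_arg:
                args.append("".join(current))
                current, in_arg = [], False
            i += 1
        else:
            current.append(ch)
            in_arg = True
            i += 1
    if in_single:
        raise ValueError("unterminated single quote")
    if in_arg:
        args.append("".join(current))
    return args


# The text to the right of "arguments =" in a submission file.
EXAMPLE = "\"arg9 \"\"arg10\"\" 'arg with literal '' and spaces'\""
```

Calling parse_new_syntax(EXAMPLE) yields three arguments: arg9, "arg10" (with its double quotes), and arg with literal ' and spaces.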

Here is a simple program which will display command-line arguments specified in a job submission file. To compile this program for the Vanilla Universe, see Compiling Serial Programs.

Prepare a job submission file with an appropriate filename, here named myjob.sub. Run with command-line arguments in either the old or new syntax:

# FILENAME: myjob.sub

universe = VANILLA

executable = myprogram
# Old Syntax
# arguments = arg1 arg2 arg3 \"arg4\" 'arg5' 'arg with spaces' arg6 arg7_with_spaces arg8

# New Syntax
arguments = "arg9 ""arg10"" 'arg with literal '' and spaces'"

# Condor Macros
# arguments = $(Cluster) $(Process)

# standard I/O files, Condor log file
output = myprogram.out
error  = myprogram.err
log    = myprogram.log

# queue one job
queue

To submit this job to Condor:

$ condor_submit myjob.sub

View command-line arguments submitted in the old syntax:

***  MAIN START  ***

Number of command line arguments: 12

command line argument, argv[0]: condor_exec.746418.0
command line argument, argv[1]: arg1
command line argument, argv[2]: arg2
command line argument, argv[3]: arg3
command line argument, argv[4]: "arg4"
command line argument, argv[5]: 'arg5'
command line argument, argv[6]: 'arg
command line argument, argv[7]: with
command line argument, argv[8]: spaces'
command line argument, argv[9]: arg6
command line argument, argv[10]: arg7_with_spaces
command line argument, argv[11]: arg8

***  MAIN STOP  ***

The old syntax requires simulating spaces in arguments with the underscore character. Then, user code can replace the underscores with spaces to achieve an argument with spaces.

View command-line arguments submitted in the new syntax:

***  MAIN START  ***

Number of command line arguments: 4

command line argument, argv[0]: condor_exec.341964.0
command line argument, argv[1]: arg9
command line argument, argv[2]: "arg10"
command line argument, argv[3]: arg with literal ' and spaces

***  MAIN STOP  ***

The array element argv[0] holds Condor's name for a job.

Two Condor macros are useful as command-line arguments, $(Cluster) and $(Process):

***  MAIN START  ***

Number of command line arguments: 3

command line argument, argv[0]: condor_exec.341965.0
command line argument, argv[1]: 341965
command line argument, argv[2]: 0

***  MAIN STOP  ***
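
One common use of $(Cluster) and $(Process) is to let each queued job pick its own parameter and output name. The Python sketch below assumes the submission file passes the two macros as the first and second arguments, as in the commented "Condor Macros" line of the example above; the function name, the parameter list, and the output naming scheme are hypothetical.

```python
def select_parameter(argv, values):
    """Pick this job's parameter from the $(Process) value passed as argv[2]."""
    cluster_id = int(argv[1])   # value of $(Cluster)
    process_id = int(argv[2])   # value of $(Process), starting at 0
    value = values[process_id]
    # A per-job name keeps jobs of one cluster from overwriting each other.
    output_name = "result.{}.{}.out".format(cluster_id, process_id)
    return value, output_name
```

With the arguments shown in the output above, the job would select the first entry of the list and write to result.341965.0.out.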

File Input/Output

Condor is able to manage a computer program which reads and writes formatted data files.

Here is an example of formatted file I/O. To compile this program for the Vanilla Universe, see Compiling Serial Programs.

Prepare a job submission file with an appropriate filename, here named myjob.sub. This example combines formatted file I/O with standard output:

# FILENAME: myjob.sub

executable = myprogram

# Standard I/O files, Condor log file
output = mydata.out
error  = mydata.err
log    = mydata.log

# queue one job
queue

This job submission file uses Condor's default universe, Vanilla. Because this Vanilla job does not turn on Condor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job. Also by default, the number of jobs queued is one.

This submission specifies that there exists a formatted input file, myinputdata, a name which appears in the source code only. The result is a formatted output file, myoutputdata, a name which also appears in the source code only. This submission also specifies the names of three files which will receive standard output, standard error, and Condor's log entries. These three output files need not preexist, but they can. Condor will overwrite standard output and standard error but append to the log file during subsequent submissions.
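
To make the point about file names concrete, the Python sketch below hard-codes myinputdata and myoutputdata as defaults in the source, so neither name needs to appear in the submission file. It is a stand-in written for this example, not the guide's own sample program; the function name transform and the data layout (whitespace-separated numbers) are assumptions.

```python
def transform(input_path="myinputdata", output_path="myoutputdata"):
    """Read whitespace-separated numbers; write one formatted sum per line."""
    with open(input_path) as source, open(output_path, "w") as sink:
        for line in source:
            numbers = [float(token) for token in line.split()]
            if numbers:
                sink.write("{:12.4f}\n".format(sum(numbers)))
    # This message goes to standard output, which the submission file
    # above directs to mydata.out.
    print("wrote", output_path)
```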

To submit this job to Condor:

$ condor_submit myjob.sub

Standard Universe Job

The Standard Universe is an execution environment of Condor. Jobs using the Standard Universe enjoy two advantages. A job with a higher priority may preempt a Condor job without loss of completed work. Condor can checkpoint the job and move (migrate) the job to a different compute node which would otherwise be idle. Condor restarts the job on the new compute node at precisely the point of preemption. The Standard Universe tells Condor that you re-linked your job via condor_compile with the Condor libraries, and therefore your job supports checkpointing. Condor transfers the executable and checkpoint files automatically, when needed.

The second advantage of Condor's Standard Universe is that remote system calls handle access to files (input and output). For example, Condor intercepts a call to read a record of a data file. Condor sends the read operation to the user's current working directory on the submission host which performs the read operation. Condor then sends the desired record to the compute node which processes the record. A similar process occurs with write operations. Therefore, the existence of a shared file system is not relevant. This feature maximizes the number of machines which can run a job. Compute nodes across an entire enterprise can run a job, including compute nodes in different administrative domains.

This section illustrates how to submit a small job to the Standard Universe of BoilerGrid. This example, myprogram.c, displays the name of the host which runs the job. To compile this program for the Standard Universe, see Compiling Serial Programs.

Prepare a job submission file that selects the Standard Universe, names the compiled C program as the executable, and has Condor transfer the compiled program to the chosen compute node. Give the file an appropriate filename, here myjob.sub:

# FILENAME:  myjob.sub

universe = STANDARD

# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable          = myprogram

# Standard I/O files, Condor log file
output = mydata.out
error  = mydata.err
log    = mydata.log

# queue one job
queue

Submit this job to Condor:

$ condor_submit myjob.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 341956.

View job status:

$ condor_q myusername

-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35556> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
341956.0   myusername     10/22 11:18   0+00:00:00 I  0   7.3  myjob

Place the job on hold to study the submission:

$ condor_hold 341956
Cluster 341956 held.

Obtain the requirements of this job:

$ condor_q myusername -attributes requirements -long

-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35556> : condor.rcac.purdue.edu
requirements = (Arch == "X86_64") && (OpSys == "LINUX") && ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED)) && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize)

Job requirements reflect the Standard Universe (preemption with checkpointing). This job requires a processor core which runs the Linux operating system on the x86_64 architecture and has the ability to checkpoint the job at preemption. The requirements exclude any mention of the shared file system since a shared file system is not relevant to a Standard Universe job. Running a Standard Universe job does not limit the job to the processor cores which use the same shared file system that the submission host uses. The job may land either on a processor core that uses the same shared file system or not; in either case, the remote I/O of the Standard Universe handles the job's file I/O. The ClassAds Disk and Memory ensure that the chosen core has sufficient resources to hold the disk and memory footprint of the job.
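
The requirements expression can be read as a predicate over one machine ClassAd and one job ClassAd. The Python sketch below restates it with plain dicts to show what each clause tests. It is a simplification: real ClassAd evaluation has its own rules for undefined values, and using None in place of UNDEFINED is an assumption of this example, as is the function name.

```python
UNDEFINED = None  # stand-in for the ClassAd UNDEFINED value


def standard_job_matches(machine, job):
    """Simplified check of the Standard Universe requirements shown above."""
    ckpt_arch = job.get("CkptArch", UNDEFINED)
    ckpt_opsys = job.get("CkptOpSys", UNDEFINED)
    return (
        machine["Arch"] == "X86_64"
        and machine["OpSys"] == "LINUX"
        # A job that has checkpointed must resume on the same platform.
        and (ckpt_arch is UNDEFINED or ckpt_arch == machine["Arch"])
        and (ckpt_opsys is UNDEFINED or ckpt_opsys == machine["OpSys"])
        # Disk and DiskUsage are both in Kbytes.
        and machine["Disk"] >= job["DiskUsage"]
        # Memory is in megabytes and ImageSize in Kbytes, hence the * 1024.
        and machine["Memory"] * 1024 >= job["ImageSize"]
    )
```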

To see how many compute nodes of BoilerGrid are able to satisfy this job's requirements:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED))'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX 33118 27602    2878      2596      42          0        0
               Total 33118 27602    2878      2596      42          0        0

The report shows that 33,118 processor cores are candidates for running the job. Using Condor's Standard Universe with its remote file I/O maximizes the number of candidate cores available to the job. This large pool of candidate cores has the potential of offering high throughput. Some subset of the candidate processor cores, possibly none, will be available to accept the job at any given moment.

Release the job from hold:

$ condor_release 341956
Cluster 341956 released.

View results in the file for all standard output, here named mydata.out:

***  MAIN START  ***

hostname = cms-100.rcac.purdue.edu
domainname = (none)

***  MAIN  STOP  ***

The output shows the name of the host which ran the job. While this job ran on a host which resides on the same shared file system used by the submission host, another submission which forced the job onto a host of another shared file system also ran successfully because the remote I/O of the Standard Universe handled the reading and writing of records.

View the log file, mydata.log:

000 (341956.000.000) 10/22 11:42:22 Job submitted from host: <128.211.157.86:35556>
...
012 (341956.000.000) 10/22 11:42:57 Job was held.
    via condor_hold (by user myusername)
    Code 1 Subcode 0
...
001 (341956.000.000) 10/22 11:43:57 Job executing on host: <128.211.157.10:52556>
...
005 (341956.000.000) 10/22 11:43:57 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    1110  -  Run Bytes Sent By Job
    5431033  -  Run Bytes Received By Job
    1110  -  Total Bytes Sent By Job
    5431033  -  Total Bytes Received By Job

The log file records the main events related to the processing of this job. In this example, the log records the number of bytes read and written between the submission host and the compute node via the remote I/O of the Standard Universe.
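
Because each event in the log begins with a header line of the same shape, the main events can be pulled out with a short script. The Python sketch below is based only on the header lines visible in the example log above; the regular expression and the function name parse_events are assumptions of this example, and other log formats may need adjustments.

```python
import re

# Header shape: 3-digit event code, (cluster.proc.subproc), date, time, message.
EVENT_LINE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d) (\d\d:\d\d:\d\d) (.*)$")


def parse_events(log_text):
    """Return (code, cluster, proc, date, time, message) per event header."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_LINE.match(line)
        if match:
            code, cluster, proc, _sub, date, clock, message = match.groups()
            events.append((code, int(cluster), int(proc), date, clock, message))
    return events
```

Indented detail lines and the "..." separators do not match the pattern and are skipped, so the result holds one tuple per event.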

The Standard Universe maximizes throughput with its ability to checkpoint jobs and to intercept remote system calls. The latter avoids requiring the submission host and the compute node to share a file system. The process of re-linking a job with Condor's libraries involves including both Condor's libraries and the user's libraries as static libraries. The danger of this effort to maximize throughput is that a Condor flock is a heterogeneous collection of old and new compute nodes, so a job can land on a compute node that is unable to run the job. When this happens, the user must consider how to avoid compute nodes which are unable to run a job to a successful completion.

Vanilla Universe Job (with shared file system)

The Vanilla Universe is an execution environment of Condor. The Vanilla Universe tells Condor that you did not re-link your job via condor_compile with the Condor libraries, and therefore your job does not support checkpointing or remote system calls. Such jobs include an executable binary of a commercial application, shell script, or a program which is to take advantage of features of a compiler which is not compatible with Condor's condor_compile command.

For jobs submitted under the Vanilla Universe, the existence of a shared file system is relevant since access to files (input and output) involves either a shared file system or Condor's file transfer mechanism, not the remote system calls of the Standard Universe.

This section illustrates how to submit a small job to the Vanilla Universe of BoilerGrid from a submission host which has a shared file system with Condor's file transfer mechanism turned off, by default. This example, myprogram.c, displays the name of the host which runs the job, the path name of the current working directory, and the contents of that directory. The Vanilla Universe allows using Linux commands to obtain the information. To compile this program for the Vanilla Universe, see Compiling Serial Programs.

Prepare a job submission file, here named myjob.sub, which specifies the Vanilla Universe, names the compiled C program as the executable, has Condor transfer that executable to the chosen compute node, and leaves Condor's file transfer mechanism off (the default):

# FILENAME:  myjob.sub

universe = VANILLA

# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable          = myprogram

# Condor's file transfer mechanism is off, by default.

# Standard I/O files, Condor log file
output = mydata.out
error  = mydata.err
log    = mydata.log

# queue one job
queue

Submit this job to Condor:

$ condor_submit myjob.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 746407.

View job status:

$ condor_q myusername

-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
746407.0   myusername     10/25 10:04   0+00:00:00 I  0   0.0  myjob

Place this job on hold to study the submission:

$ condor_hold 746407
Cluster 746407 held.

Obtain the requirements of this job:

$ condor_q myusername -attributes requirements -long


-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (TARGET.FileSystemDomain == MY.FileSystemDomain)

Job requirements reflect both the Vanilla Universe (preemption without checkpointing) and the shared file system. This job requires a compute node which runs the Linux operating system on the x86_64 architecture and, more importantly, which shares the same FileSystemDomain as the submission host (both the TARGET and MY shared file system must be the same). So, this submission limits running the job to the processor cores which use the same shared file system that the submission host uses. The ClassAds Disk and Memory ensure that the chosen core has sufficient resources to hold the disk and memory footprint of the job.
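
The Disk and Memory part of this expression can be illustrated with a small Python sketch. It assumes the usual Condor units (Memory in megabytes; ImageSize, Disk, and DiskUsage in kilobytes), which would explain why the expression multiplies Memory by 1024. The dictionaries and their values are hypothetical stand-ins for ClassAds, not data from BoilerGrid.

```python
# Hypothetical stand-ins for a machine ClassAd and two job ClassAds.
# Assumed units: Memory in MB; ImageSize, Disk, and DiskUsage in KB.
machine = {"Disk": 2_000_000, "Memory": 2048}
small_job = {"DiskUsage": 10, "ImageSize": 7_000}
huge_job = {"DiskUsage": 10, "ImageSize": 4_000_000}


def has_room(machine, job):
    """Mirror (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize)."""
    disk_ok = machine["Disk"] >= job["DiskUsage"]
    memory_ok = machine["Memory"] * 1024 >= job["ImageSize"]  # MB -> KB
    return disk_ok and memory_ok


print(has_room(machine, small_job))
print(has_room(machine, huge_job))
```

In this sketch the second job fails the check only because its memory image is larger than the memory the hypothetical machine advertises.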

To see how many compute nodes in various file system domains of BoilerGrid are able to satisfy this job's requirements:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (FileSystemDomain == "bluearc.rcac.purdue.edu")'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX  9924  7466    1579       878       0          1        0
               Total  9924  7466    1579       878       0          1        0


$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (FileSystemDomain == "lustrea.rcac.purdue.edu")'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX 18784 16717    1438       628       0          1        0
               Total 18784 16717    1438       628       0          1        0


$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (FileSystemDomain == "miner.rcac.purdue.edu")'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX  1006   156     760        90       0          0        0
               Total  1006   156     760        90       0          0        0


The reports show that 9,924, 18,784, and 1,006 processor cores satisfy these constraints in the bluearc, lustrea, and miner file system domains, respectively. Because the job requires the FileSystemDomain of the submission host, only the cores in that one domain are candidates for running it. While the number of candidate processor cores which are able to run this job is much less than the number of x86_64 cores running Linux on BoilerGrid, using the shared file system is the preferred method in many situations. Some subset of the candidate processor cores, possibly none, will be available to accept the job at any given moment.
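
To compare such reports without reading them by eye, the totals can be parsed with a few lines of Python. This is a minimal sketch, assuming the column layout shown in the first report above; the function name parse_totals is hypothetical. Cores in the Unclaimed column are, roughly, the ones free to accept a job at that moment.

```python
REPORT = """
                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX  9924  7466    1579       878       0          1        0
               Total  9924  7466    1579       878       0          1        0
"""


def parse_totals(report):
    """Parse `condor_status -total` text into {row_label: {column: count}}."""
    lines = [line for line in report.splitlines() if line.strip()]
    columns = lines[0].split()
    table = {}
    for line in lines[1:]:
        fields = line.split()
        table[fields[0]] = dict(zip(columns, (int(x) for x in fields[1:])))
    return table


totals = parse_totals(REPORT)["Total"]
share = totals["Unclaimed"] / totals["Total"]
print(f"{totals['Unclaimed']} of {totals['Total']} cores unclaimed ({share:.1%})")
```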

Release the job from the queue:

$ condor_release 746407
Cluster 746407 released.

View results in the file for all standard output, here named mydata.out:

cms-100.rcac.purdue.edu
(none)
/autohome/u105/myhomedirectory/Condor/vanilla_w_sfs
total 288
-rw-r--r-- 1 myusername itap 1508 Oct 25 10:39 commands
-rw-r--r-- 1 myusername itap    0 Oct 25 10:46 mydata.err
-rw-r--r-- 1 myusername itap  467 Oct 25 10:46 mydata.log
-rw-r--r-- 1 myusername itap   71 Oct 25 10:46 mydata.out
-rwxr-xr-x 1 myusername itap 6863 Oct 25 10:14 myprogram
-rw-r----- 1 myusername itap  376 Oct 25 09:14 myprogram.c
-rw-r----- 1 myusername itap  199 Oct 25 10:04 myprogram.sub
-rwxr----- 1 myusername itap   70 Oct 25 09:19 run
-rwxr--r-- 1 myusername itap  216 Oct 25 09:14 tally
-rw-r----- 1 myusername itap  952 Oct 25 09:14 tmp
-rw-r--r-- 1 myusername itap    0 Oct 25 10:19 tmp1
***  MAIN START  ***


***  MAIN  STOP  ***

The output shows the name of the compute node which ran the job. This job ran on a compute node which uses the same shared file system which the submission host uses. The fact that the current working directory is the user's home directory on the submission host proves that this job used the shared file system for file I/O.

View the log file, mydata.log:

000 (746407.000.000) 10/25 10:04:51 Job submitted from host: <128.211.157.86:60481>
...
012 (746407.000.000) 10/25 10:05:15 Job was held.
    via condor_hold (by user myusername)
    Code 1 Subcode 0
...
009 (746406.000.000) 10/25 10:22:07 Job was aborted by the user.
    via condor_rm (by user myusername)
...
013 (746407.000.000) 10/25 10:44:07 Job was released.
    via condor_release (by user myusername)
...
001 (746407.000.000) 10/25 10:46:47 Job executing on host: <128.211.157.10:57108>
...
005 (746407.000.000) 10/25 10:46:47 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job

The log file records the main events related to the processing of this job. In this example, the log records that the number of bytes read and written between the submission host and the compute node is zero, another indication that this job used the shared file system for file I/O.

The Vanilla Universe is available for jobs which cannot take advantage of the Standard Universe because you cannot re-link them with Condor's libraries. You cannot re-link an executable binary, a shell script, or a program which you compiled with an incompatible compiler. If a shared file system is available, the Vanilla job may use it for file I/O by keeping Condor's file transfer mechanism turned off. Keeping the file transfer mechanism off excludes compatible compute nodes which do not share a file system with the submission host.

Vanilla Universe Job (either shared file system or file transfer mechanism)

The Vanilla Universe is an execution environment of Condor. The Vanilla Universe tells Condor that you did not re-link a job via condor_compile with the Condor libraries and therefore your job does not support checkpointing or remote system calls. Such jobs include an executable binary of a commercial application, shell script, or a program which is to take advantage of features of a compiler which is not compatible with Condor's condor_compile command.

For jobs submitted under the Vanilla Universe, the existence of a shared file system matters because access to input and output files involves either a shared file system or Condor's file transfer mechanism, not the remote system calls of the Standard Universe.

This section illustrates how to submit a small job to the Vanilla Universe of BoilerGrid from a submission host which has a shared file system with Condor's file transfer mechanism turned on "if needed". Condor transfers files only when it matches the job with a compute node which uses a different FileSystemDomain from the one which the submission host uses. If Condor matches the job with a compute node which uses the same FileSystemDomain which the submission host uses, Condor does not transfer files and relies on the shared file system instead. This example, myprogram.c, displays the name of the host which runs the job, the path name of the current working directory, and the contents of that directory. The Vanilla Universe allows using Linux commands to obtain the information. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
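
The "if needed" rule described above can be sketched in Python. The function below is a hypothetical model of the decision, not Condor's implementation; it takes the value of should_transfer_files (YES, NO, or IF_NEEDED) and the FileSystemDomain of each side.

```python
def transfers_files(should_transfer_files, submit_domain, node_domain):
    """Model whether Condor would transfer files for a matched compute node."""
    mode = should_transfer_files.upper()
    if mode == "YES":
        return True
    if mode == "NO":
        return False
    if mode == "IF_NEEDED":
        # Transfer only when the two sides do not share a file system.
        return submit_domain != node_domain
    raise ValueError(f"unexpected should_transfer_files value: {mode}")


SUBMIT = "bluearc.rcac.purdue.edu"
print(transfers_files("IF_NEEDED", SUBMIT, "bluearc.rcac.purdue.edu"))
print(transfers_files("IF_NEEDED", SUBMIT, "lustrea.rcac.purdue.edu"))
```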

Prepare a job submission file, here named myjob.sub, which specifies the Vanilla Universe, names the compiled C program as the executable, has Condor transfer that executable to the chosen compute node, and turns on Condor's file transfer mechanism only if needed:

# FILENAME:  myjob.sub

universe = VANILLA

# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable          = myprogram

# Condor's file transfer mechanism is turned on only when needed.
should_transfer_files = IF_NEEDED

# Let Condor handle output file(s).
when_to_transfer_output = ON_EXIT

# Standard I/O files, Condor log file
output = mydata.out
error  = mydata.err
log    = mydata.log

# queue one job
queue

Submit this job to Condor:

$ condor_submit myjob.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 746408.

View job status:

$ condor_q myusername

-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
746408.0   myusername     10/25 10:04   0+00:00:00 I  0   0.0  myjob

Place this job on hold to study the submission:

$ condor_hold 746408
Cluster 746408 held.

Obtain the requirements of this job:

$ condor_q myusername -attributes requirements -long


-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && ((HasFileTransfer) || (TARGET.FileSystemDomain == MY.FileSystemDomain))

The requirements reflect both the Vanilla Universe (preemption without checkpointing) and Condor's file transfer mechanism turned on only if needed. This job requires a processor core which runs the Linux operating system on the x86_64 architecture, but, more importantly, the processor core chosen to run this job need not share the same FileSystemDomain which the submission host uses (both the TARGET and MY shared file system need not be equal). The ClassAd of this job states that the chosen core must either have the file transfer capability or share a file system with the submission host. So, this submission does not limit running the job to the processor cores which use the same shared file system which the submission host uses. The ClassAds Disk and Memory ensure that the chosen core has sufficient resources to hold the disk and memory footprint of the job.

To see how many compute nodes of BoilerGrid are able to satisfy this job's requirements:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && ((HasFileTransfer) || (FileSystemDomain == "bluearc.rcac.purdue.edu"))'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX 32074 20806    4287      6976       5          0        0
               Total 32074 20806    4287      6976       5          0        0

The report shows that 32,074 processor cores are candidates for running this job. Using Condor's Vanilla Universe with its file transfer mechanism turned on only if needed maximizes the number of candidate cores available to the job. This large pool of candidate cores has the potential of offering high throughput. On the other hand, large data files involved in the file transfer may diminish throughput. Some subset of the candidate processor cores, possibly none, will be available to accept the job at any given moment.

Release the job from the queue:

$ condor_release 746408
Cluster 746408 released.

View results in the file for all standard output, here named mydata.out:

cms-100.rcac.purdue.edu
(none)
/autohome/u105/myhomedirectory/Condor/vanilla_w_sfs_ftm
total 284
-rw-r--r-- 1 myusername itap 1508 Oct 25 10:39 commands
-rw-r--r-- 1 myusername itap    0 Oct 25 10:46 mydata.err
-rw-r--r-- 1 myusername itap  467 Oct 25 10:46 mydata.log
-rw-r--r-- 1 myusername itap   71 Oct 25 10:46 mydata.out
-rwxr-xr-x 1 myusername itap 6863 Oct 25 10:14 myprogram
-rw-r----- 1 myusername itap  376 Oct 25 09:14 myprogram.c
-rw-r----- 1 myusername itap  199 Oct 25 10:04 myjob.sub
-rwxr----- 1 myusername itap   70 Oct 25 09:19 run
-rwxr--r-- 1 myusername itap  216 Oct 25 09:14 tally
-rw-r----- 1 myusername itap  952 Oct 25 09:14 tmp
-rw-r--r-- 1 myusername itap    0 Oct 25 10:19 tmp1
***  MAIN START  ***


***  MAIN  STOP  ***

The output shows the name of the compute node which ran the job. This job ran on a compute node which uses the same shared file system which the submission host uses. The fact that the current working directory is the user's home directory on the submission host proves that this job used the shared file system for file I/O.

View the log file, mydata.log:

000 (746408.000.000) 10/25 10:04:51 Job submitted from host: <128.211.157.86:60481>
...
012 (746408.000.000) 10/25 10:05:15 Job was held.
    via condor_hold (by user myusername)
    Code 1 Subcode 0
...
013 (746408.000.000) 10/25 10:44:07 Job was released.
    via condor_release (by user myusername)
...
001 (746408.000.000) 10/25 10:46:47 Job executing on host: <128.211.157.10:57108>
...
005 (746408.000.000) 10/25 10:46:47 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job

The log file records the main events related to the processing of this job. In this example, the log records that the number of bytes read and written between the submission host and the compute node is zero, another indication that this job used the shared file system for file I/O.

To see Condor's file transfer mechanism at work, repeat the example above but force the job to a compute node which does not share a file system with the submission host.

Modify the job submission file of the previous example to send the job to a processor core which uses a different shared file system:

# FILENAME:  myjob.sub

universe = VANILLA

# A core on the Rossmann cluster uses a different shared file system.
requirements = ClusterName == "Rossmann"

# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable          = myprogram

# Condor's file transfer mechanism is turned on only when needed. This submission needs the transfer mechanism.
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT


# Standard I/O files, Condor log file
output = mydata.out
error  = mydata.err
log    = mydata.log

# queue one job
queue

View results in the file for all standard output, here named mydata.out:

cms-100.rcac.purdue.edu
(none)
/var/condor/execute/dir_11554
total 12
-rwxr-xr-x 1 myusername itap 6863 Oct 25 12:28 condor_exec.exe
-rw-r--r-- 1 myusername itap    0 Oct 25 12:31 mydata.err
-rw-r--r-- 1 myusername itap   67 Oct 25 12:32 mydata.out
***  MAIN START  ***


***  MAIN  STOP  ***

This output file exhibits a temporary directory on the processor core which Condor chose to run the job, rather than the user's home directory, another indication that this job used Condor's file transfer mechanism for file I/O.

View the log file, mydata.log:

000 (746411.000.000) 10/25 12:08:12 Job submitted from host: <128.211.157.86:60481>
...
001 (746411.000.000) 10/25 12:31:59 Job executing on host: <128.211.157.10:51871>
...
005 (746411.000.000) 10/25 12:32:00 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    278  -  Run Bytes Sent By Job
    6863  -  Run Bytes Received By Job
    278  -  Total Bytes Sent By Job
    6863  -  Total Bytes Received By Job

The log file records the main events related to the processing of this job. In this example, the log records the number of bytes transferred between the submission host and the compute node via the file transfer mechanism, another indication that Condor's file transfer mechanism was used.
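
The same check can be automated. The Python sketch below, with hypothetical function names, reads the "Total Bytes" counters from a log excerpt and reports whether any bytes moved. It assumes the log holds a single termination event; a log shared by several jobs would need per-job handling. In the excerpt, the 6863 bytes received match the size of condor_exec.exe in the directory listing above.

```python
import re

TRANSFER_LOG = """\
005 (746411.000.000) 10/25 12:32:00 Job terminated.
    (1) Normal termination (return value 0)
    278  -  Run Bytes Sent By Job
    6863  -  Run Bytes Received By Job
    278  -  Total Bytes Sent By Job
    6863  -  Total Bytes Received By Job
"""


def total_bytes(log_text):
    """Return (sent, received) from the 'Total Bytes ... By Job' lines."""
    sent = re.search(r"(\d+)\s+-\s+Total Bytes Sent By Job", log_text)
    received = re.search(r"(\d+)\s+-\s+Total Bytes Received By Job", log_text)
    if sent is None or received is None:
        raise ValueError("no termination event with byte counters found")
    return int(sent.group(1)), int(received.group(1))


def used_file_transfer(log_text):
    """Treat any nonzero byte counter as evidence of file transfer."""
    sent, received = total_bytes(log_text)
    return sent > 0 or received > 0


print(total_bytes(TRANSFER_LOG), used_file_transfer(TRANSFER_LOG))
```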

The Vanilla Universe is available for jobs which cannot take advantage of the Standard Universe because you cannot re-link them with Condor's libraries. You cannot re-link an executable binary, a shell script, or a program which you compiled with an incompatible compiler. If a shared file system is available and Condor's file transfer mechanism is suitable for the job, the Vanilla job may use either for file I/O by specifying that the submission uses the mechanism only "if needed." While this method can maximize throughput, any file which you intend to transfer must be of reasonable size: it must fit within the available disk space of the compute node, and the time needed to transfer it must not be so great that higher priority jobs constantly preempt the Condor job during file transfer.

Vanilla Universe Job (without shared file system)

The Vanilla Universe is an execution environment of Condor. The Vanilla Universe tells Condor that you did not re-link your job via condor_compile with the Condor libraries, and therefore your job does not support checkpointing or remote system calls. Such jobs include an executable binary of a commercial application, shell script, or a program which is to take advantage of features of a compiler which is not compatible with Condor's condor_compile command.

For jobs submitted under the Vanilla Universe, the existence of a shared file system matters because access to input and output files involves either a shared file system or Condor's file transfer mechanism, not the remote system calls of the Standard Universe.

This section illustrates how to submit a small job to the Vanilla Universe of BoilerGrid from a submission host which lacks a shared file system with Condor's file transfer mechanism turned on. No matter which processor core Condor chooses to run the job, Condor transfers files. This example, myprogram.c, displays the name of the host which runs the job, the path name of the current working directory, and the contents of that directory. The Vanilla Universe allows using Linux commands to obtain the information. To compile this program for the Vanilla Universe, see Compiling Serial Programs.

Prepare a job submission file, here named myjob.sub, which specifies the Vanilla Universe, names the compiled C program as the executable, has Condor transfer that executable to the chosen compute node, and turns on Condor's file transfer mechanism:

# FILENAME:  myjob.sub

universe = VANILLA

# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable          = myprogram

# Turn on Condor's file transfer mechanism.
should_transfer_files   = YES

# Let Condor handle output files.
when_to_transfer_output = ON_EXIT

# Standard I/O files, Condor log file
output = mydata.out
error  = mydata.err
log    = mydata.log

# queue one job
queue

Submit this job to Condor:

$ condor_submit myjob.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 341960.

View job status:

$ condor_q myusername

-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35556> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
341960.0   myusername     10/25 15:02   0+00:00:00 I  0   0.0  myjob

Place this job on hold to study the submission:

$ condor_hold 341960
Cluster 341960 held.

Obtain the requirements of this job:

$ condor_q myusername -attributes requirements -long


-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35556> : condor.rcac.purdue.edu
requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (HasFileTransfer)

Job requirements reflect both the Vanilla Universe (preemption without checkpointing) and Condor's file transfer mechanism turned on. This job requires a processor core which runs the Linux operating system on the x86_64 architecture, but more importantly the processor core chosen to run this job can reside on a cluster which lacks a shared file system. The ClassAd of this job states that the chosen core must have the file transfer capability. The ClassAds Disk and Memory ensure that the chosen core has sufficient resources to hold the disk and memory footprint of the job.
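
The file system clauses of the three Vanilla Universe submissions in this guide can be compared side by side. The Python sketch below applies each clause to a small, entirely hypothetical pool of machines; the node names and the mode label "off" are illustrative only.

```python
SUBMIT_DOMAIN = "bluearc.rcac.purdue.edu"

# Hypothetical machine ClassAds, reduced to the two relevant attributes.
POOL = [
    {"Name": "node-a", "FileSystemDomain": "bluearc.rcac.purdue.edu", "HasFileTransfer": True},
    {"Name": "node-b", "FileSystemDomain": "lustrea.rcac.purdue.edu", "HasFileTransfer": True},
    {"Name": "node-c", "FileSystemDomain": "bluearc.rcac.purdue.edu", "HasFileTransfer": False},
]


def matches(mode, machine):
    """Apply the file system clause which each submission mode generates."""
    same_fs = machine["FileSystemDomain"] == SUBMIT_DOMAIN
    can_transfer = machine["HasFileTransfer"]
    if mode == "off":          # file transfer off: shared file system required
        return same_fs
    if mode == "IF_NEEDED":    # (HasFileTransfer) || (same FileSystemDomain)
        return can_transfer or same_fs
    if mode == "YES":          # (HasFileTransfer)
        return can_transfer
    raise ValueError(mode)


def candidates(mode):
    return [m["Name"] for m in POOL if matches(mode, m)]


for mode in ("off", "IF_NEEDED", "YES"):
    print(mode, candidates(mode))
```

In this toy pool, IF_NEEDED accepts every machine which either of the other two modes accepts.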

To see how many compute nodes of BoilerGrid are able to satisfy this job's requirements:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (HasFileTransfer)'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX 33068 20690    3850      8520       8          0        0
               Total 33068 20690    3850      8520       8          0        0

This report shows that 33,068 processor cores are candidates for running this job. Using Condor's Vanilla Universe with its file transfer mechanism turned on maximizes the number of candidate cores available to the job. This large pool of candidate cores has the potential of offering high throughput. On the other hand, large data files involved in the file transfer may diminish throughput. Some subset of the candidate processor cores, possibly none, will be available to accept the job at any given moment.

Release the job from the queue:

$ condor_release 341960
Cluster 341960 released.

View results in the file for all standard output, here named mydata.out:

cms-100.rcac.purdue.edu
(none)
/var/condor/execute/dir_13374
total 12
-rwxr-xr-x 1 myusername itap 6863 Oct 25 15:47 condor_exec.exe
-rw-r--r-- 1 myusername itap    0 Oct 25 15:50 mydata.err
-rw-r--r-- 1 myusername itap   61 Oct 25 15:50 mydata.out
***  MAIN START  ***


***  MAIN  STOP  ***

The output shows the name of the compute node which ran the job. This job ran on a compute node which shares a file system with the submission host. Despite this, the current working directory is a temporary directory on the compute node; therefore, this job used the file transfer mechanism for file I/O.

View the log file, mydata.log:

000 (341960.000.000) 10/25 15:03:18 Job submitted from host: <128.211.157.86:35556>
...
012 (341960.000.000) 10/25 15:03:35 Job was held.
    via condor_hold (by user myusername)
    Code 1 Subcode 0
...
013 (341960.000.000) 10/25 15:48:00 Job was released.
    via condor_release (by user myusername)
...
001 (341960.000.000) 10/25 15:50:46 Job executing on host: <128.211.157.10:33047>
...
005 (341960.000.000) 10/25 15:50:46 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    272  -  Run Bytes Sent By Job
    6863  -  Run Bytes Received By Job
    272  -  Total Bytes Sent By Job
    6863  -  Total Bytes Received By Job

The log file records the main events related to the processing of this job. In this example, the log records the number of bytes transferred between the submission host and the compute node via Condor's file transfer mechanism.
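
The event lines of such a log follow a regular pattern: a three-digit event code, the job ID in parentheses, a date, a time, and a message. As a rough sketch, the Python below extracts those header lines from an abridged copy of the log above and measures how long the job waited between release and execution. It assumes all events fall on the same day; the function name parse_events is hypothetical.

```python
import re

HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.\d+\.\d+\) \d\d/\d\d (\d\d):(\d\d):(\d\d) (.*)$"
)

LOG = """\
000 (341960.000.000) 10/25 15:03:18 Job submitted from host: <128.211.157.86:35556>
...
012 (341960.000.000) 10/25 15:03:35 Job was held.
    via condor_hold (by user myusername)
...
013 (341960.000.000) 10/25 15:48:00 Job was released.
...
001 (341960.000.000) 10/25 15:50:46 Job executing on host: <128.211.157.10:33047>
...
005 (341960.000.000) 10/25 15:50:46 Job terminated.
"""


def parse_events(log_text):
    """Return (event_code, seconds_since_midnight, message) per event header."""
    events = []
    for line in log_text.splitlines():
        match = HEADER.match(line)
        if match:
            code, _cluster, hh, mm, ss, message = match.groups()
            seconds = int(hh) * 3600 + int(mm) * 60 + int(ss)
            events.append((code, seconds, message))
    return events


events = parse_events(LOG)
released = next(t for code, t, _ in events if code == "013")
started = next(t for code, t, _ in events if code == "001")
print([code for code, _, _ in events])
print(started - released, "seconds between release and execution")
```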

The Vanilla Universe is available for jobs which cannot take advantage of the Standard Universe because you cannot re-link them with Condor's libraries. You cannot re-link an executable binary, a shell script, or a program which you compiled with an incompatible compiler. If a shared file system is not available and Condor's file transfer mechanism is suitable for the job, you may turn on the file transfer mechanism, and the Vanilla job will transfer your files. Any file which you intend to transfer must be of reasonable size: it must fit within the available disk space of the compute node, and the time needed to transfer it must not be so great that higher priority jobs constantly preempt the Condor job during file transfer.

Scratch File

Some applications process data stored in a large input data file. The size of this file may be so large that it cannot fit within the quota of a home directory. This file might reside on Fortress or some other external storage medium. The way to process this file on BoilerGrid is to copy it to your scratch directory where a job running on a compute node of BoilerGrid may access it.

The job may run in either the Standard or Vanilla Universe. If the universe is Standard, then the job will use the remote file I/O of the Standard Universe. If the universe is Vanilla, then the job will use either the shared file system or Condor's file transfer mechanism, depending on the compute node which Condor chose to run the job.

This section illustrates how to submit a small job which reads a data file which resides on the scratch file system. The example uses the Vanilla Universe with Condor's file transfer mechanism turned on "if needed". Condor transfers files only when it matches the job with a compute node which uses a different FileSystemDomain from the one which the submission host uses. If Condor matches the job with a compute node which uses the same FileSystemDomain which the submission host uses, Condor does not transfer files and relies on the shared file system instead. This example, myprogram.c, displays the name of the compute node which runs the job, the path name of the current working directory, and the contents of that directory, and then copies the contents of an input scratch file to an output scratch file. The Vanilla Universe allows using Linux commands to obtain system information. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
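
For readers who want a feel for what such a job does, here is a hypothetical Python stand-in for the steps just described. It is not the guide's myprogram.c, and running it as a Vanilla Universe executable would assume that a Python interpreter is available on the compute node. The file names are the ones used in this example.

```python
import os
import shutil
import socket


def describe_and_copy(src, dst):
    """Print host, working directory, and listings; copy src to dst in between."""
    print(socket.gethostname())
    cwd = os.getcwd()
    print(cwd)
    print(sorted(os.listdir(cwd)))   # listing before the copy
    shutil.copyfile(src, dst)
    print(sorted(os.listdir(cwd)))   # listing after the copy


if __name__ == "__main__":
    # Only act when the input file is present in the working directory.
    if os.path.exists("biginputdatafile"):
        describe_and_copy("biginputdatafile", "bigoutputdatafile")
```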

Prepare a scratch file directory with a large input data file:

$ ls -l $RCAC_SCRATCH
total 32
-rw-r----- 1 myusername itap   27 Jun  8 10:41 biginputdatafile

Prepare a job submission file, here named myjob.sub, which specifies the Vanilla Universe, names the compiled C program as the executable, has Condor transfer that executable to the chosen compute node, turns on Condor's file transfer mechanism only if needed, and lists the input file(s):

# FILENAME:  myjob.sub

universe = VANILLA

# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable          = myprogram

# Condor's file transfer mechanism is turned on only when needed.
should_transfer_files = IF_NEEDED

# Let Condor handle output file(s).
when_to_transfer_output = ON_EXIT

# List input data file(s) to be read or transferred from the
# initial directory, if needed.
transfer_input_files = biginputdatafile

# Standard I/O files, Condor log file
# Find these files in the initial directory.
output = mydata.out
error  = mydata.err
log    = mydata.log

# queue one job
queue

Submit this job to Condor while specifying that all files, except the executable, are located relative to the specified initial directory, namely your scratch directory:

$ condor_submit -append initialdir=$RCAC_SCRATCH myjob.sub
Submitting job(s).
1 job(s) submitted to cluster 1421563.

View job status:

$ condor_q myusername

-- Submitter: condor-fe00.rcac.purdue.edu : <128.211.157.87:40924> : condor-fe00.rcac.purdue.edu
 ID         OWNER          SUBMITTED     RUN_TIME   ST PRI SIZE CMD
1421563.0   myusername     6/9  11:49    0+00:00:00 I  0   0.0  myprogram

1 jobs; 1 idle, 0 running, 0 held

View four new files in the scratch file directory, including bigoutputdatafile:

$ ls -l $RCAC_SCRATCH
total 128
-rw-r----- 1 myusername itap  27 Jun  8 10:41 biginputdatafile
-rw-r--r-- 1 myusername itap  41 Jun  9 11:49 bigoutputdatafile
-rw-r--r-- 1 myusername itap   0 Jun  9 11:49 mydata.err
-rw-r--r-- 1 myusername itap 648 Jun  9 11:49 mydata.log
-rw-r--r-- 1 myusername itap 632 Jun  9 11:49 mydata.out

View results in the file for all standard output, here named mydata.out:

steele-d037.rcac.purdue.edu
/usr/rmt_share/scratch95/m/myusername
total 96
-rw-r----- 1 myusername itap  27 Jun  8 10:41 biginputdatafile
-rw-r--r-- 1 myusername itap   0 Jun  9 11:49 mydata.err
-rw-r--r-- 1 myusername itap 204 Jun  9 11:49 mydata.log
-rw-r--r-- 1 myusername itap  59 Jun  9 11:49 mydata.out
total 128
-rw-r----- 1 myusername itap  27 Jun  8 10:41 biginputdatafile
-rw-r--r-- 1 myusername itap  41 Jun  9 11:49 bigoutputdatafile
-rw-r--r-- 1 myusername itap   0 Jun  9 11:49 mydata.err
-rw-r--r-- 1 myusername itap 204 Jun  9 11:49 mydata.log
-rw-r--r-- 1 myusername itap 274 Jun  9 11:49 mydata.out
***  MAIN START  ***

scratch file system: textfromscratchfile

***  MAIN  STOP  ***

The output shows the name of the compute node which Condor chose to run the job, the path of the current working directory (the user's scratch file directory), before-and-after listings of the content of the current working directory, and output from the application. This job ran on a processor core which uses the same shared file system which the submission host uses. The fact that the current working directory is the user's scratch directory on the submission host proves that this job used the shared file system for file I/O. The output scratch file named bigoutputdatafile, the primary output of this program, appears in the second listing of the current working directory.

The second line of the output file shows the path of the scratch directory. In this case, the submission host was one of the front-ends of condor.rcac.purdue.edu and the compute node was one which uses the same file system domain as the submission host. This same path will appear in the output when the submission host is either Radon or Steele and the compute node is one of the nodes of Radon or Steele. If the submission host is either Rossmann or Coates and the compute node is one of the nodes of Rossmann or Coates, then the path will be /scratch/lustreA/m/myusername. If the submission host is Hansen and the compute node is one of the nodes of Hansen, then the path will be /scratch/lustreC/m/myusername.

View the log file, mydata.log:

000 (1421563.000.000) 06/09 11:49:23 Job submitted from host: <128.211.157.87:40924>
...
001 (1421563.000.000) 06/09 11:49:47 Job executing on host: <172.18.30.47:37393?PrivNet=condor.ccb.purdue.edu>
...
005 (1421563.000.000) 06/09 11:49:47 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...

The log file records the main events related to the processing of this job. In this example, the log records that the number of bytes read and written between the submission host and the compute node is zero, another indication that this job used the shared file system for file I/O.

To see Condor's file transfer mechanism at work, repeat the example above but force the job to a compute node which does not share a file system with the submission host.

Resubmit this job to Condor while specifying that a compute node on the Miner cluster is to run the job. Either command works:

$ condor_submit -append initialdir=$RCAC_SCRATCH -append requirements=ClusterName==\"Miner\" myjob.sub

$ condor_submit -append initialdir=$RCAC_SCRATCH -append 'requirements=ClusterName=="Miner"' myjob.sub
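
The two forms differ only in how the shell is told to pass the double quotes through to condor_submit. The following sketch runs no Condor command; it stores the argument as the shell would deliver it in each case so that you can confirm the two are identical. The variable names are illustrative only.

```shell
# Form 1: escape each double quote with a backslash.
arg_escaped=requirements=ClusterName==\"Miner\"

# Form 2: wrap the whole argument in single quotes.
arg_quoted='requirements=ClusterName=="Miner"'

# Both variables now hold the same string, double quotes included.
printf '%s\n' "$arg_escaped"
printf '%s\n' "$arg_quoted"
```

Without either form of quoting, the shell would strip the double quotes before condor_submit ever saw them.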

View results in the file for all standard output, here named mydata.out:

miner-a141.rcac.purdue.edu
/var/condor/execute/dir_24010
total 20
-rw-r----- 1 myusername itap   27 Jun  9 13:39 biginputdatafile
-rwxr-xr-x 1 myusername itap 8574 Jun  9 13:39 condor_exec.exe
-rw-r--r-- 1 myusername itap    0 Jun  9 13:42 mydata.err
-rw-r--r-- 1 myusername itap   57 Jun  9 13:42 mydata.out
total 24
-rw-r----- 1 myusername itap   27 Jun  9 13:39 biginputdatafile
-rw-r--r-- 1 myusername itap   42 Jun  9 13:42 bigoutputdatafile
-rwxr-xr-x 1 myusername itap 8574 Jun  9 13:39 condor_exec.exe
-rw-r--r-- 1 myusername itap    0 Jun  9 13:42 mydata.err
-rw-r--r-- 1 myusername itap  281 Jun  9 13:42 mydata.out
***  MAIN START  ***

scratch file system: textfromscratchfile

***  MAIN  STOP  ***

The output shows the name of the compute node which Condor chose to run the job, the path of the current working directory (a temporary directory on the compute node, rather than the user's scratch file directory), before-and-after listings of the content of the current working directory, and output from the application. This job ran on a processor core whose shared file system differs from the one which the submission host uses. The fact that the current working directory is a temporary directory on the compute node and that the file named "biginputdatafile" appears in this temporary directory shows that this job used Condor's file transfer mechanism for file I/O. The output scratch file named bigoutputdatafile, the primary output of this program, appears in the second listing of the current working directory. Condor transferred all output files (mydata.out, mydata.log, mydata.err, and bigoutputdatafile) to the scratch directory.

View the log file, mydata.log:

000 (1421565.000.000) 06/09 13:42:30 Job submitted from host: <128.211.157.87:40924>
...
001 (1421565.000.000) 06/09 13:42:56 Job executing on host: <172.18.32.151:43952?PrivNet=condor.ccb.purdue.edu>
...
005 (1421565.000.000) 06/09 13:42:56 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    690  -  Run Bytes Sent By Job
    8601  -  Run Bytes Received By Job
    690  -  Total Bytes Sent By Job
    8601  -  Total Bytes Received By Job

The log file records the main events related to the processing of this job. In this example, the log records the number of bytes transferred between the submission host and the compute node via the file transfer mechanism, another indication that Condor's file transfer mechanism was used.
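
The byte counts can also be read from the log with a short script. The following is a minimal sketch, assuming the log uses the "Total Bytes Sent By Job" and "Total Bytes Received By Job" lines shown above; the function name total_bytes is hypothetical and not part of Condor.

```shell
# total_bytes LOGFILE
# Prints "sent received", summed over every "Total Bytes" line in the log.
total_bytes() {
    awk '
        /Total Bytes Sent By Job/     { sent += $1 }
        /Total Bytes Received By Job/ { recv += $1 }
        END { printf "%d %d\n", sent, recv }
    ' "$1"
}

# Usage: report whether the job recorded in mydata.log moved any data.
if [ -f mydata.log ]; then
    total_bytes mydata.log | while read -r sent recv; do
        if [ "$sent" -eq 0 ] && [ "$recv" -eq 0 ]; then
            echo "no bytes transferred: shared file system"
        else
            echo "sent $sent, received $recv bytes: file transfer"
        fi
    done
fi
```

Because the counts are summed, a log which holds several jobs yields the totals for all of them.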

While this method can maximize throughput, any scratch file which you intend to transfer must be of reasonable size. The scratch file must fit within the available disk space of the compute node. The transfer must also be short enough that higher-priority jobs do not repeatedly preempt the Condor job during file transfer. If the scratch file cannot fit on the available disk space of the chosen compute node or the file transfer time is so great that preemption prevents completion, consider limiting the pool of candidate compute nodes to those which share a file system with the submission host (should_transfer_files = NO) or consider using the Standard Universe with its remote file I/O.

/tmp File

Some applications write a large amount of intermediate data to a temporary file during an early part of the process then read that data for further processing during a later part of the process. The size of this file may be so large that it cannot fit within the quota of a home directory or that it requires too much I/O activity between the compute node and either the home directory or the scratch file directory. The way to process this intermediate file on BoilerGrid is to use the /tmp directory of the compute node which runs the job. Used properly, /tmp may provide faster local storage to an active process than any other storage option.

The job may run in the Vanilla Universe. When preemption occurs, a Vanilla job restarts at the beginning, and it rebuilds the intermediate data file from the beginning. Condor's Standard Universe is not applicable since checkpointing does not include any file in /tmp.

This section illustrates how to submit a small job which first writes then reads an intermediate data file which resides on the /tmp directory. This example, myprogram.c, displays the contents of the /tmp directory before and after processing. Linux commands access system information. To compile this program, see Compiling Serial Programs.
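
The write-then-read pattern can be sketched at the shell level. The following is not the myprogram.c used in this section; it is a minimal shell analog, assuming the mktemp utility is available on the compute node. It uses a unique file name rather than a fixed one so that two jobs on the same node cannot collide, and it removes the file on exit.

```shell
# Create a uniquely named intermediate file in /tmp; "mytmpfile" is only a prefix.
tmpfile=$(mktemp /tmp/mytmpfile.XXXXXX) || exit 1

# Remove the file when the script exits, so the node's /tmp is left clean.
trap 'rm -f "$tmpfile"' EXIT

# Early phase: write intermediate data to local storage.
printf 'abcdefghijk\n' > "$tmpfile"
ls -l "$tmpfile"

# Later phase: read the intermediate data back for further processing.
tmpdata=$(cat "$tmpfile")
echo "/tmp file data:  $tmpdata"
```

Cleaning up matters on a shared pool because a later job on the same compute node draws on the same /tmp space.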

Prepare a job submission file, here named myjob.sub, which specifies the Vanilla Universe, names the compiled C program as the executable, tells Condor to transfer that executable to the chosen compute node, and turns on Condor's file transfer mechanism only if needed:

# FILENAME:  myjob.sub

universe = VANILLA

# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable          = myprogram

# Condor's file transfer mechanism is turned on only when needed.
should_transfer_files = IF_NEEDED

# Let Condor handle output file(s).
when_to_transfer_output = ON_EXIT

# Standard I/O files, Condor log file
output = mydata.out
error  = mydata.err
log    = mydata.log

# queue one job
queue

Submit this job to Condor:

$ condor_submit myjob.sub
Submitting job(s).
1 job(s) submitted to cluster 346033.

View job status:

$ condor_q myusername

-- Submitter: condor-fe00.rcac.purdue.edu : <128.211.157.87:40924> : condor-fe00.rcac.purdue.edu
 ID         OWNER          SUBMITTED     RUN_TIME   ST PRI SIZE CMD
346033.0   myusername     6/16  15:05    0+00:00:00 I  0   0.0  myprogram

1 jobs; 1 idle, 0 running, 0 held

View results in the file for all standard output, here named mydata.out:

-rw-r--r-- 1 myusername itap 12 Jun 16 15:12 /tmp/mytmpfile
***  MAIN START  ***

/tmp file data:  abcdefghijk

***  MAIN  STOP  ***

The output verifies the existence of the intermediate data file in the /tmp directory.

View results in the file for all standard error, here named mydata.err:

ls: /tmp/mytmpfile: No such file or directory

The results in the error file verify that the intermediate data file does not exist at the start of processing.

View the log file, mydata.log:

000 (346033.000.000) 06/16 15:05:25 Job submitted from host: <128.211.158.38:40666>
...
001 (346033.000.000) 06/16 15:12:00 Job executing on host: <172.18.22.85:54211?PrivNet=condor.ccb.purdue.edu>
...
005 (346033.000.000) 06/16 15:12:01 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...

The log file records the main events related to the processing of this job. In this example, the log records that the number of bytes transferred between the submission host and the compute node is zero, an indication that this job used the shared file system for file I/O.

While the /tmp directory can provide faster local storage to an active process than other storage options, you never know how much storage is available in the /tmp directory of the compute node chosen to run your job. If an intermediate data file consistently fails to fit in the /tmp directories of a set of compute nodes, consider limiting the pool of candidate compute nodes to those which can handle your intermediate data file.
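
A job can check the free space in /tmp on the compute node before it writes its intermediate file. The following is a minimal sketch, assuming the POSIX form of df (df -Pk) is available on the node; the function names and the 1 GB figure are hypothetical.

```shell
# tmp_free_kb [DIR]
# Prints the free space, in 1024-byte blocks, of the file system holding DIR.
tmp_free_kb() {
    df -Pk "${1:-/tmp}" | awk 'NR == 2 { print $4 }'
}

# has_tmp_space NEEDED_KB
# Succeeds when /tmp has at least NEEDED_KB free.
has_tmp_space() {
    [ "$(tmp_free_kb /tmp)" -ge "$1" ]
}

# Usage, with a hypothetical need of 1048576 KB (1 GB):
if has_tmp_space 1048576; then
    echo "/tmp has room for the intermediate file"
else
    echo "/tmp is too small on $(hostname)" >&2
fi
```

A check of this kind runs inside the job, so it reports on the compute node which Condor chose, not on the submission host.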

Parameter Sweep

A parameter sweep is a computation applied to several unique sets of data. Computing the average test score for each class in a school is an example of a simple parameter sweep.

A parameter sweep requires that each queued run of a computer program reads a unique set of input parameters. This example illustrates how to implement a unique command line argument, a unique standard input, and a unique formatted input data file. The program writes to a unique standard output and to a unique formatted output file.

Condor provides a pre-defined macro $(Process) which supports the specification of unique command-line arguments and input/output files so that each queued run of a job sees a unique set of data. This macro appears in the job submission file, myprogram.sub:

# FILENAME:  myprogram.sub

universe = VANILLA

executable = myprogram
# Processes 0,1,2
# command line argument
arguments  = $(Process)

# Standard I/O files, Condor log file
input  = mydata.in.$(Process)
output = mydata.out.$(Process)
error  = mydata.err.$(Process)
log    = mydata.log

# queue 3 jobs in 1 cluster
queue 3

This job submission file specifies a parameter sweep of three queued runs. Each process has a unique ID: 0, 1, and 2. Process 0 of the parameter sweep expects a standard input file named "mydata.in.0"; process 1, "mydata.in.1"; and process 2, "mydata.in.2". The sweep will generate similarly named files for standard output and error. Condor advises using a single log file in a submission. In addition, the sweep expects to find formatted input data files with the same process number used as a suffix: i_mydata.0, i_mydata.1, i_mydata.2. Each copy of the computer program used in this sweep finds its unique process number in its command-line argument and appends that unique process number to the generic names "i_mydata." and "o_mydata." to make unique formatted data file names. All files reside in the user's current working directory; hence, data file names must be unique. To compile this program for the Vanilla Universe, see Compiling Serial Programs.

This job submission file uses Condor's default universe, Vanilla. Because this Vanilla job does not turn on Condor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job.

To submit the executable to Condor:

$ condor_submit myprogram.sub

For this submission, command condor_q will show a single cluster number, three unique process numbers, and three unique command-line arguments:

$ condor_q myusername

-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
746419.0   myusername     10/28 10:57   0+00:00:00 I  0   0.0  myprogram 0
746419.1   myusername     10/28 10:57   0+00:00:00 I  0   0.0  myprogram 1
746419.2   myusername     10/28 10:57   0+00:00:00 I  0   0.0  myprogram 2

View the standard input file for process 0, mydata.in.0:

textfromstandardinput:process0

View the formatted input file for process 0, i_mydata.0:

textfromformattedinput:process0

View the standard output file for process 0, mydata.out.0:

***  MAIN START  ***

program name:          condor_exec.exe
command line argument: 0
standard input/output: textfromstandardinput:process0
formatted input/output: textfromformattedinput:process0

***  MAIN  STOP  ***

View the formatted output file for process 0, o_mydata.0:

textfromformattedinput:process0

Processes 1 and 2 have similar input and output files.

The single log file records the major events of the submission of the three queued runs of this parameter sweep:

000 (746419.000.000) 10/28 10:57:43 Job submitted from host: <128.211.157.86:60481>
...
000 (746419.001.000) 10/28 10:57:43 Job submitted from host: <128.211.157.86:60481>
...
000 (746419.002.000) 10/28 10:57:43 Job submitted from host: <128.211.157.86:60481>
...
001 (746419.001.000) 10/28 11:02:14 Job executing on host: <128.211.157.10:44836>
...
005 (746419.001.000) 10/28 11:02:14 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    950  -  Run Bytes Sent By Job
    9800  -  Run Bytes Received By Job
    950  -  Total Bytes Sent By Job
    9800  -  Total Bytes Received By Job
...
001 (746419.000.000) 10/28 11:02:15 Job executing on host: <128.211.157.10:44836>
...
005 (746419.000.000) 10/28 11:02:15 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    950  -  Run Bytes Sent By Job
    9800  -  Run Bytes Received By Job
    950  -  Total Bytes Sent By Job
    9800  -  Total Bytes Received By Job
...
001 (746419.002.000) 10/28 11:02:17 Job executing on host: <128.211.157.10:44836>
...
005 (746419.002.000) 10/28 11:02:17 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    950  -  Run Bytes Sent By Job
    9800  -  Run Bytes Received By Job
    950  -  Total Bytes Sent By Job
    9800  -  Total Bytes Received By Job

Condor's parameter sweep offers a huge potential. Simply adding a large number to the queue command in a job submission file yields an enormous amount of computing. The catch is the effort needed to prepare the unique input files, either standard input files or formatted input files. This effort can be minimal when the input data comes from some data collector operating in the field. This effort can be enormous when you must enter each unique dataset from the keyboard.
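
When the input data can be generated, a short loop prepares the files. The following is a minimal sketch which writes the standard input and formatted input files for this example, using the file names and sample contents shown in this section; the function name make_sweep_inputs is hypothetical.

```shell
# make_sweep_inputs COUNT
# Creates mydata.in.N and i_mydata.N for N = 0 .. COUNT-1 in the current directory.
make_sweep_inputs() {
    n=0
    while [ "$n" -lt "$1" ]; do
        printf 'textfromstandardinput:process%s\n' "$n" > "mydata.in.$n"
        printf 'textfromformattedinput:process%s\n' "$n" > "i_mydata.$n"
        n=$((n + 1))
    done
}

# Usage: three runs, matching "queue 3" in myprogram.sub.
make_sweep_inputs 3
```

The count passed to the function should match the number given to the queue command, since each queued run looks for the files which carry its own process number.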

Parameter Sweep - Initial Directory

A parameter sweep is a computation applied to several unique sets of data. Computing the average test score for each class in a school is an example of a simple parameter sweep.

A parameter sweep requires that each queued run of a computer program reads a unique set of input parameters. This example illustrates how to implement a unique command line argument, a unique standard input, and a unique formatted input data file. The program writes to a unique standard output and to a unique formatted output file.

Condor provides a pre-defined macro $(Process) which supports the specification of unique command-line arguments so that each queued run of a job sees a unique set of data.

Also, Condor provides an "initial directory" which supports the specification of unique input/output files so that each queued run of a job sees a unique set of data. Command initialdir specifies a generic directory name which becomes unique after appending the process number of a queued run of a parameter sweep. Each initial directory is actually a subdirectory of the user's current working directory. Each initial directory holds the unique standard input and formatted input files of a queued run of a parameter sweep; each initial directory receives the unique standard output, error and log files plus any unique formatted output files generated by a queued run of a parameter sweep. Since data files of each run of a sweep reside in a separate directory, identical file names may be used; they need not be modified with a process number. Both macro and command appear in the job submission file, myprogram.sub:

# FILENAME:  myprogram.sub

universe = VANILLA

executable = myprogram
# Processes 0,1,2
# command line argument
arguments  = $(Process)

initialdir = mydatadirectory.$(Process)

# Standard I/O files, Condor log file
input          = mydata.in
output         = mydata.out
error          = mydata.err
log            = mydata.log

# queue 3 jobs in 1 cluster
queue 3

This job submission file specifies a parameter sweep of three queued runs. Each process has a unique ID: 0, 1, and 2. Process 0 of the parameter sweep expects a standard input file named "mydata.in" to reside in the initial directory named "mydatadirectory.0"; process 1 expects "mydata.in" in "mydatadirectory.1"; and process 2 expects "mydata.in" in "mydatadirectory.2". The sweep will generate identically named files for standard output, error, and log in the initial directories. In addition, the sweep expects to find in each initial directory a formatted input data file with the same name: myinputdata. Each copy of the computer program used in this sweep finds its unique process number in its command-line argument and finds its unique formatted input data file in its own initial directory. The program does not append its process number to the names of formatted data files. All files reside in unique subdirectories of the user's current working directory; hence, data file names may be identical. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
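
Because each run reads its input from its own initial directory, those directories and their input files must exist before you submit the job. The following is a minimal sketch which prepares them, assuming the directory name given by initialdir = mydatadirectory.$(Process) in the submission file and the sample contents used in this section; the function name make_sweep_dirs is hypothetical.

```shell
# make_sweep_dirs COUNT
# Creates mydatadirectory.N, each holding mydata.in and myinputdata,
# for N = 0 .. COUNT-1 under the current directory.
make_sweep_dirs() {
    n=0
    while [ "$n" -lt "$1" ]; do
        dir="mydatadirectory.$n"
        mkdir -p "$dir"
        printf 'textfromstandardinput:process%s\n' "$n" > "$dir/mydata.in"
        printf 'textfromformattedinput:process%s\n' "$n" > "$dir/myinputdata"
        n=$((n + 1))
    done
}

# Usage: three runs, matching "queue 3" in myprogram.sub.
make_sweep_dirs 3
```

Only the directory name carries the process number; the file names inside every directory are the same.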

This job submission file uses Condor's default universe, Vanilla. Because this Vanilla job does not turn on Condor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job.

To submit the executable to Condor:

$ condor_submit myprogram.sub

For this submission, command condor_q will show a single cluster number, three unique process numbers, and three unique command-line arguments:

$ condor_q myusername

-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
746420.0   myusername     10/28 12:28   0+00:00:00 I  0   0.0  myprogram 0
746420.1   myusername     10/28 12:28   0+00:00:00 I  0   0.0  myprogram 1
746420.2   myusername     10/28 12:28   0+00:00:00 I  0   0.0  myprogram 2

View the standard input file for process 0, mydata.in, in the initial directory mydatadirectory.0:

textfromstandardinput:process0

View the formatted input file for process 0, myinputdata, in the initial directory mydatadirectory.0:

textfromformattedinput:process0

View the standard output file for process 0, mydata.out, in the initial directory mydatadirectory.0:

***  MAIN START  ***

program name:          condor_exec.exe
command line argument: 0
standard input/output: textfromstandardinput:process0
formatted input/output: textfromformattedinput:process0

***  MAIN  STOP  ***

View the formatted output file for process 0, myoutputdata, in the initial directory mydatadirectory.0:

textfromformattedinput:process0

Each initial directory holds its own log file, mydata.log, which records the major events of the one queued run which used that directory. View the log file for process 0, mydata.log, in the initial directory mydatadirectory.0:

000 (746420.000.000) 10/28 12:28:35 Job submitted from host: <128.211.157.86:60481>
...
001 (746420.000.000) 10/28 12:33:48 Job executing on host: <128.211.157.10:34460>
...
005 (746420.000.000) 10/28 12:33:49 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    909  -  Run Bytes Sent By Job
    9800  -  Run Bytes Received By Job
    909  -  Total Bytes Sent By Job
    9800  -  Total Bytes Received By Job

Processes 1 and 2 have similar input, output and log files and formatted input/output files residing in their respective initial directories.

Condor's parameter sweep offers a huge potential. Simply adding a large number to the queue command in a job submission file yields an enormous amount of computing. The catch is that the effort needed to prepare the unique input files, either standard input files or formatted input files, can be great. This effort is minimal when the input data comes from some data collector operating in the field. This effort can be overwhelming when you must enter each unique dataset from the keyboard.

Parameter Sweep - Single Data File

A parameter sweep is a computation applied to several unique sets of data. Computing the average test score for each class in a school is an example of a simple parameter sweep.

A parameter sweep requires that each queued run of a computer program reads a unique set of input parameters. This example illustrates how to implement a parameter sweep on a single large file. Each queued run of the job reads a different portion of the same file.

Condor provides a pre-defined macro $(Process) which supports the specification of unique command-line arguments and input/output files so that each queued run of a job sees a unique set of data. This macro appears in the job submission file, myprogram.sub:

# FILENAME:  myprogram.sub

universe = VANILLA

executable = myprogram
# Processes 0,1,2
arguments  = $(Process)

# There is a single formatted input data file, myinputdata.

# Standard I/O files, Condor log file
output = mydata.out.$(Process)
error  = mydata.err.$(Process)
log    = mydata.log

# queue 3 jobs in 1 cluster
queue 3

This job submission file specifies a parameter sweep of three queued runs. Each process has a unique ID: 0, 1, and 2. Each queued run of this job will read a different portion of the data file. Process 0 of the parameter sweep writes a standard output file named "mydata.out.0"; process 1, "mydata.out.1"; and process 2, "mydata.out.2". The sweep will generate similarly named files for standard error. Condor advises using a single log file in a submission to record the major events of the sweep. In addition, the sweep expects to find a single formatted input data file, myinputdata. Each copy of the computer program used in this sweep finds its unique process number in its command-line argument and uses that number to determine where in the single input data file it is to start reading records. All files reside in the user's current working directory; hence, data file names must be unique. To compile this program for the Vanilla Universe, see Compiling Serial Programs.

This job submission file uses Condor's default universe, Vanilla. Because this Vanilla job does not turn on Condor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job.

To submit the executable to Condor:

$ condor_submit myprogram.sub

For this submission, command condor_q will show a single cluster number, three unique process numbers, and three unique command-line arguments:

$ condor_q myusername

-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
746421.0   myusername     10/29 10:57   0+00:00:00 I  0   0.0  myprogram 0
746421.1   myusername     10/29 10:57   0+00:00:00 I  0   0.0  myprogram 1
746421.2   myusername     10/29 10:57   0+00:00:00 I  0   0.0  myprogram 2

View the single formatted input file, myinputdata:

AAAAAAAAAA
BBBBBBBBBB
CCCCCCCCCC
    :
ZZZZZZZZZZ
0000000000
1111111111
2222222222
3333333333

View the standard output file for process 0, mydata.out.0:

***  MAIN START  ***

program name:          condor_exec.exe
command line argument: 0
current file position:   0
rtn_val = 0
starting file position:   0
line 1:   AAAAAAAAAA
line 2:   BBBBBBBBBB
line 3:   CCCCCCCCCC
line 4:   DDDDDDDDDD
line 5:   EEEEEEEEEE
line 6:   FFFFFFFFFF
line 7:   GGGGGGGGGG
line 8:   HHHHHHHHHH
line 9:   IIIIIIIIII
line 10:   JJJJJJJJJJ

***  MAIN  STOP  ***

View the standard output file for process 1, mydata.out.1:

***  MAIN START  ***

program name:          condor_exec.exe
command line argument: 1
current file position:   0
rtn_val = 0
starting file position:   110
line 11:   KKKKKKKKKK
line 12:   LLLLLLLLLL
line 13:   MMMMMMMMMM
line 14:   NNNNNNNNNN
line 15:   OOOOOOOOOO
line 16:   PPPPPPPPPP
line 17:   QQQQQQQQQQ
line 18:   RRRRRRRRRR
line 19:   SSSSSSSSSS
line 20:   TTTTTTTTTT
rtn_val = 0
starting file position:   0
line 0:   AAAAAAAAAA
rtn_val = 0
starting file position:   220
line 21:   UUUUUUUUUU

***  MAIN  STOP  ***

Process 1 also practices additional random file accesses.

View the standard output file for process 2, mydata.out.2:

***  MAIN START  ***

program name:          condor_exec.exe
command line argument: 2
current file position:   0
rtn_val = 0
starting file position:   220
line 21:   UUUUUUUUUU
line 22:   VVVVVVVVVV
line 23:   WWWWWWWWWW
line 24:   XXXXXXXXXX
line 25:   YYYYYYYYYY
line 26:   ZZZZZZZZZZ
line 27:   0000000000
line 28:   1111111111
line 29:   2222222222
line 30:   3333333333

***  MAIN  STOP  ***

Condor's parameter sweep, when applied to a single, large data file, offers a huge potential. Simply adding a large number to the queue command in a job submission file applies several compute servers to the data processing.
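
The starting file positions in the output above (0, 110, and 220) follow from the process number. The following is a minimal shell sketch of that arithmetic, not the myprogram.c used in this section. It assumes, as the output suggests, that each record is 10 characters plus a newline (11 bytes) and that each process reads 10 records; the function and variable names are hypothetical.

```shell
# Assumed layout, inferred from the output shown in this section.
record_bytes=11
records_per_process=10

# start_offset PROCESS
# Prints the byte position at which PROCESS begins reading.
start_offset() {
    echo $(( $1 * records_per_process * record_bytes ))
}

# read_portion FILE PROCESS
# Prints the records of FILE which belong to PROCESS.
read_portion() {
    # tail -c +K starts at byte K, counting from 1, hence the "+ 1".
    tail -c "+$(( $(start_offset "$2") + 1 ))" "$1" | head -n "$records_per_process"
}

# Usage: show the portion for process 1 when the data file is present.
if [ -f myinputdata ]; then
    read_portion myinputdata 1
fi
```

This arithmetic works only when every record has the same length; with variable-length records, each run would have to count newlines from the start of the file.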

Transfer a Subdirectory

To review, Condor is unable to transfer a subdirectory of data files to a compute server. While the submit command transfer_input_files allows paths when specifying which input files to transfer, Condor places all transferred files in a single, flat directory where the executable and standard input file reside - the temporary working directory on the compute server. Therefore, the executing program must access input files without paths.

A similar situation exists for output files. If the program creates output files during execution, it must create them within the temporary working directory. Condor transfers back all new and modified files within the temporary working directory - the output files. To transfer back only a subset of these files, use the submit command transfer_output_files. Condor does not support the transfer of output files that exist but that do not reside within the temporary working directory on the compute server.

This restriction need not deter the user who has a subdirectory of input and output files. The user simply makes an archive file of the subdirectory structure with the tar utility and tells Condor to transfer the tar file. The application may then un-tar the archive before reading the input files. The application may also write to output files which reside within the subdirectory. As its final step, the application archives those files which the job made or modified. Condor will see the archive file as an output file and transfer the archive from the compute server to the user's working directory on the submission host. Finally, the user extracts the output files from the archive.

The computer program, myprogram.c, reads a formatted data file and writes a formatted data file. This example assumes that there exists a formatted input file, myinputdata, in a subdirectory named mysubdirectory. The result is a formatted output file, myoutputdata, in the same subdirectory. The program uses the tar utility to extract the subdirectory structure on the compute server. After the program writes the output file, it then uses the tar utility again to archive the subdirectory of output files only. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
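
The sequence of steps which the job performs on the compute server can be sketched in shell. The following is not the myprogram.c used in this section; it is a minimal analog in which a simple copy stands in for the real computation. The archive names follow this section, the file names myinputdata and myoutputdata follow the directory listings in this section, and the function name run_with_archive is hypothetical.

```shell
# run_with_archive
# Performs the job's steps in its temporary working directory.
run_with_archive() {
    # 1. Unpack the input archive which Condor transferred.
    tar xf myarchive.i.tar || return 1

    # 2. Stand-in for the real computation: copy the input to the output.
    cat mysubdirectory/myinputdata > mysubdirectory/myoutputdata

    # 3. Archive only the output file. The new tar file sits in the
    #    working directory, so Condor treats it as an output file.
    tar cf myarchive.o.tar mysubdirectory/myoutputdata
}

# Usage: run the steps when the input archive is present.
if [ -f myarchive.i.tar ]; then
    run_with_archive
fi
```

Archiving only the output file keeps the return transfer small, since the input data already resides on the submission host.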

This example assumes that the current working directory has a subdirectory containing a formatted input file. The tar utility prepares the archive of input files:

$ tar cf myarchive.i.tar mysubdirectory

Prepare a job submission file, myprogram.sub. Specify the Vanilla Universe and the file transfer mechanism as "on":

# FILENAME:  myprogram.sub

universe = VANILLA

executable = myprogram

# Specify the archive as the input data file.
transfer_input_files = myarchive.i.tar

# Turn on file transfer mechanism.
should_transfer_files   = YES

# Let Condor handle output file(s): myarchive.o.tar.
when_to_transfer_output = ON_EXIT

# Standard output files, Condor log file
output = mydata.out
error  = mydata.err
log    = mydata.log

# queue one job
queue

To submit the executable to Condor:

$ condor_submit myprogram.sub

The standard output file, mydata.out, shows the evolution of the current working directory on the compute server. Initially, it shows that Condor transferred the tar file which contains the archived subdirectory of input data file(s). After extraction, the subdirectory, mysubdirectory, and its formatted input file, myinputdata, are visible. After processing, the formatted output file, myoutputdata, is visible:

total 24
-rwxr-xr-x 1 myusername itap  8708 Nov 12 15:27 condor_exec.exe
-rw-r--r-- 1 myusername itap 10240 Nov 12 15:27 myarchive.i.tar
-rw-r--r-- 1 myusername itap     0 Nov 12 15:30 mydata.err
-rw-r--r-- 1 myusername itap     0 Nov 12 15:30 mydata.out
total 32
-rwxr-xr-x 1 myusername itap  8708 Nov 12 15:27 condor_exec.exe
-rw-r--r-- 1 myusername itap 10240 Nov 12 15:27 myarchive.i.tar
-rw-r--r-- 1 myusername itap     0 Nov 12 15:30 mydata.err
-rw-r--r-- 1 myusername itap   227 Nov 12 15:30 mydata.out
drwxr-x--- 3 myusername itap  4096 Feb 14  2008 mysubdirectory
total 8
drwx------ 2 myusername itap 4096 Feb 14  2008  ..
-rw-r--r-- 1 myusername itap   19 Jul 12  2007 myinputdata
total 12
drwx------ 2 myusername itap 4096 Feb 14  2008  ..
-rw-r--r-- 1 myusername itap   19 Jul 12  2007 myinputdata
-rw-r--r-- 1 myusername itap   28 Nov 12 15:30 myoutputdata
***  MAIN START  ***

formatted input/output: textinsubdirectory

***  MAIN  STOP  ***

At job completion, Condor sees file myarchive.o.tar as an output file which it will transfer to the submission host. After the transfer, the user then extracts the output file(s) from this archive:

$ tar xf myarchive.o.tar mysubdirectory/myoutputdata

View the log file, mydata.log:

000 (342352.000.000) 11/12 15:29:31 Job submitted from host: <128.211.157.86:47933>
...
001 (342352.000.000) 11/12 15:30:55 Job executing on host: <128.211.157.10:59987?PrivNet=condor.ccb.purdue.edu>
...
005 (342352.000.000) 11/12 15:30:56 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    11094  -  Run Bytes Sent By Job
    18948  -  Run Bytes Received By Job
    11094  -  Total Bytes Sent By Job
    18948  -  Total Bytes Received By Job

The log file records the main events related to the processing of this job. The log shows the number of bytes transferred between the submission host and the compute server via Condor's file transfer mechanism.

Requiring Specific Amounts of Memory

Some applications require compute nodes with a certain minimum amount of memory. These applications may also perform better when even more memory is available on the compute node.

This section illustrates how to submit a small job to a BoilerGrid compute node with at least 16 GB of memory (requirements) and to prefer compute nodes with even more memory (rank), if available. This example, myprogram.c, displays the name of the host which runs the job, the path name of the current working directory, and the contents of that directory.

Prepare a job submission file with an appropriate filename, here named myjob.sub:

# FILENAME:  myjob.sub

universe = VANILLA

# Require a compute node with at least 16 GB of memory.
# Compute nodes with 16 GB report TotalMemory as 16046 (megabytes).
requirements = TotalMemory >= 16046

# Prefer a compute node with more than 16 GB, if available.
rank = TotalMemory

# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable          = myprogram

# Turn on Condor's file transfer mechanism only when needed.
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT

# Standard I/O files, Condor log file
output = myjob.out
error  = myjob.err
log    = myjob.log

# queue one job
queue

The ClassAd attribute TotalMemory specifies the amount of memory on a compute node, in units of megabytes. To change this example to request at least 32 GB of total memory, replace "16046" with "32192". For at least 48 GB, use "48297".
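
If you switch between thresholds often, you could generate the submission file from a small shell function instead of editing it by hand. This is only a sketch: the function name write_memory_job and the file name myjob32.sub are hypothetical, and the generated file is a shortened version of the example above.

```shell
# Hypothetical helper: print a submission file that requires at least
# the given number of megabytes of TotalMemory.
write_memory_job() {
    min_mb=$1
    cat <<EOF
universe     = VANILLA
requirements = TotalMemory >= $min_mb
rank         = TotalMemory
executable   = myprogram
output       = myjob.out
error        = myjob.err
log          = myjob.log
queue
EOF
}

# 32192 is the value this guide gives for "at least 32 GB".
write_memory_job 32192 > myjob32.sub
grep '^requirements' myjob32.sub
```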

This example assumes that all compute nodes have a definition for the attribute TotalMemory. To see how many compute nodes in BoilerGrid do not have the attribute TotalMemory defined:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint 'TotalMemory =?= undefined'

There is no output since all compute nodes of BoilerGrid do have this attribute defined.

Before submitting your job, you may wish to verify that a sufficient number of compute nodes satisfy your requirements and that those same compute nodes define the attributes used in the rank command. To see how many compute nodes satisfy your requirements:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint 'TotalMemory >= 16046'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX 26093 18007    3330      4753       3          0        0
               Total 26093 18007    3330      4753       3          0        0

There are 26,093 compute nodes with at least 16 GB of memory.
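
For a script that needs a single number from this report, the sketch below extracts the Unclaimed count from the summary row with awk. The sample text is copied from the output above; in practice you would pipe the condor_status command into awk instead. The variable names are illustrative.

```shell
# Sample output copied from above; in practice, pipe condor_status into awk.
status_output='                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX 26093 18007    3330      4753       3          0        0
               Total 26093 18007    3330      4753       3          0        0'

# Both the header row and the summary row begin with "Total"; only the
# summary row has a number in field 2. Unclaimed is field 5 on that row.
unclaimed=$(printf '%s\n' "$status_output" |
    awk '$1 == "Total" && $2 ~ /^[0-9]+$/ {print $5}')

echo "unclaimed: $unclaimed"
```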

View results in the file for all standard output, here named myjob.out:

cms-100.rcac.purdue.edu
(none)
/autohome/u105/myusername/condor/Introduction/memory
total 224
-rw-r--r-- 1 myusername itap 1508 Mar 11 14:38 README
-rw-r--r-- 1 myusername itap    0 Mar 11 15:36 myjob.err
-rw-r--r-- 1 myusername itap  791 Mar 11 15:36 myjob.log
-rw-r--r-- 1 myusername itap   77 Mar 11 15:36 myjob.out
-rw-r----- 1 myusername itap  663 Mar 11 15:20 myjob.sub
-rwxr-xr-x 1 myusername itap 6939 Mar 11 14:38 myprogram
-rw-r----- 1 myusername itap  488 Mar 11 14:40 myprogram.c
-rwxr----- 1 myusername itap   58 Mar 11 14:38 run
***  MAIN START  ***


***  MAIN  STOP  ***

This job happened to run on compute node cms-100. This compute node has 8 processor cores. To verify that cms-100 has at least 16 GB of memory:

$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'Machine=="cms-100.rcac.purdue.edu"' -format "%s\n" TotalMemory

16046
16046
16046
16046
16046
16046
16046
16046
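
The value appears eight times because each of the eight processor cores reports its own TotalMemory. As a small sketch, the following fragment condenses such output to a core count and the distinct memory value; the sample text is copied from the output above.

```shell
# Sample output copied from above: one TotalMemory value per processor core.
per_core_memory='16046
16046
16046
16046
16046
16046
16046
16046'

# Count the lines (cores) and collapse the repeated values.
cores=$(printf '%s\n' "$per_core_memory" | awk 'END {print NR}')
distinct=$(printf '%s\n' "$per_core_memory" | sort -u)

echo "$cores cores, each reporting TotalMemory = $distinct MB"
```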

Requiring Specific Architectures or Operating Systems

You compile a computer program to run on a specific combination of chip architecture and operating system. This combination is a platform. BoilerGrid contains compute nodes of many different platforms, so you must often specify the platform your program requires to ensure that your job runs on the correct platform. The predominant platform on BoilerGrid is 64-bit Linux ("X86_64/LINUX"). To see a list of all platforms available on BoilerGrid:

$ condor_status -pool boilergrid.rcac.purdue.edu -total

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

         INTEL/LINUX   114    18       0        60       0          0       36
           INTEL/OSX     2     0       0         2       0          0        0
       INTEL/WINNT51   334     8       0       326       0          0        0
       INTEL/WINNT61  6299   982       0      5317       0          0        0
    SUN4u/SOLARIS210     3     0       0         3       0          0        0
        X86_64/LINUX 30170 19460    4559      6150       0          0        1

               Total 36922 20468    4559     11858       0          0       37

The name "INTEL" as used on BoilerGrid means 32-bit Intel-compatible hardware, and it makes no distinction between Intel and AMD CPUs. The name "X86_64" is a vendor-neutral term to refer to 64-bit architecture from either Intel or AMD. The name "WINNT51" means Windows XP, and "WINNT61" means Windows 7.

By default, Condor sends a job to a compute node whose architecture and operating system match the platform of the host from which you submitted the job. However, you may also submit jobs to compute nodes whose platform differs from that of the submission host. For example, you may compile a program to run on a Windows machine and submit the executable file to BoilerGrid from one of BoilerGrid's Linux submission hosts by specifying that the job requires a Windows compute node:

executable   = myprogram.exe
requirements = (ARCH == "INTEL") && ((OPSYS == "WINNT51") || (OPSYS == "WINNT61"))

It is possible to allow Condor to use a larger pool of compute nodes for a job if executables are available for multiple platforms. You need only take care to not reference any absolute paths within your job submission that are specific to one platform or installation. You can often use some existing ClassAd variables instead of fixed paths to make non-platform-specific submission files.
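
One way to do this is Condor's $$() substitution, which fills in attribute values from the compute node that the job matches. The sketch below writes such a submission file from the shell. It assumes that you have built one executable per platform and named each after its platform; the file names myjob.multi.sub, myprogram.LINUX.INTEL, and myprogram.LINUX.X86_64 are hypothetical.

```shell
# Write a hypothetical multi-platform submission file. The quoted 'EOF'
# keeps the shell from expanding $$(...), so Condor sees it verbatim.
cat > myjob.multi.sub <<'EOF'
universe     = VANILLA

# Assumes two builds exist: myprogram.LINUX.INTEL and myprogram.LINUX.X86_64.
executable   = myprogram.$$(OpSys).$$(Arch)
requirements = (OpSys == "LINUX") && ((Arch == "INTEL") || (Arch == "X86_64"))

output       = myjob.out
error        = myjob.err
log          = myjob.log
queue
EOF

grep '^executable' myjob.multi.sub
```

Because the executable name is resolved only after a compute node is chosen, the same submission file can serve both 32-bit and 64-bit Linux nodes.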

Requiring Specific Clusters or Compute Nodes

RCAC resources include several clusters. Currently, the clusters include the following:

Radon
Steele
Coates
Rossmann
Miner

This section illustrates how to apply Condor ClassAds to submit a small job to a node which resides on some subset of this collection of RCAC resources. These examples execute a simple shell script which displays the name of the compute node which ran the job.

Prepare a shell script with an appropriate filename, here named myjob.sh:

#!/bin/sh
# FILENAME:  myjob.sh

hostname

Change the permissions of the shell script to allow execution by the owner (you):

$ chmod u+x myjob.sh

Executing only on a node of one or more specific RCAC clusters

Prepare a job submission file with an appropriate filename, here named myjob.sub. The ClassAd requires that the chosen compute node should reside on either of two clusters. This example also specifies the shell script myjob.sh as the executable, transfers the shell script to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same file system is not available on the compute node):

# FILENAME:  myjob.sub

universe = VANILLA

# Require a compute node of either the Steele or Coates cluster.
# Attribute name is not case sensitive; attribute value is.
requirements = (CLUSTERNAME=="Steele") || (clustername=="Coates")

# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable          = myjob.sh

# Turn on Condor's file transfer mechanism only when needed.
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT

# Standard I/O files, Condor log file
output = myjob.out
error  = myjob.err
log    = myjob.log

# queue one job
queue

Submit the job:

$ condor_submit myjob.sub

View job status:

$ condor_q myusername

View results in the file for all standard output, here named myjob.out:

coates-d020.rcac.purdue.edu
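
To accept more than two clusters, the requirements expression simply gains another "||" clause per cluster. As an illustrative sketch (the function name cluster_requirements is hypothetical), a shell function can assemble the line from a list of cluster names:

```shell
# Hypothetical helper: join cluster names into one requirements expression.
cluster_requirements() {
    expr_text=""
    for name in "$@"; do
        clause="(ClusterName==\"$name\")"
        if [ -z "$expr_text" ]; then
            expr_text=$clause
        else
            expr_text="$expr_text || $clause"
        fi
    done
    printf 'requirements = %s\n' "$expr_text"
}

# Print a line that could be pasted into myjob.sub.
cluster_requirements Steele Coates Rossmann
```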

Executing only on one specific compute node

Prepare a job submission file with an appropriate filename, here named myjob.sub. The ClassAd requires a specific compute node. This example also specifies the shell script myjob.sh as the executable, transfers the shell script to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same file system is not available on the compute node):

# FILENAME:  myjob.sub

universe = VANILLA

# Require a specific compute node.
requirements = Machine=="miner-a500.rcac.purdue.edu"

# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable          = myjob.sh

# Turn on Condor's file transfer mechanism only when needed.
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT

# Standard I/O files, Condor log file
output = myjob.out
error  = myjob.err
log    = myjob.log

# queue one job
queue

Submit the job:

$ condor_submit myjob.sub

View job status:

$ condor_q myusername

View results in the file for all standard output, here named myjob.out:

miner-a500.rcac.purdue.edu

Executing on any compute node of a cluster except one

When you discover that a compute node is consistently available and consistently fails to run your job, you may exclude that node from the set of candidate nodes.

Prepare a job submission file with an appropriate filename, here named myjob.sub. The ClassAd excludes one specific compute node of a chosen cluster. This example also specifies the shell script myjob.sh as the executable, transfers the shell script to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same file system is not available on the compute node):

# FILENAME:  myjob.sub

universe = VANILLA

# Exclude a specific compute node.
requirements = ClusterName=="Miner" && Machine!="miner-a500.rcac.purdue.edu"

# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable          = myjob.sh

# Turn on Condor's file transfer mechanism only when needed.
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT

# Standard I/O files, Condor log file
output = myjob.out
error  = myjob.err
log    = myjob.log

# queue one job
queue

Submit the job:

$ condor_submit myjob.sub

View job status:

$ condor_q myusername

View results in the file for all standard output, here named myjob.out:

miner-a502.rcac.purdue.edu
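
If more than one compute node misbehaves, each excluded node adds another "&&" clause. The following sketch builds such an expression; the function name exclude_nodes is hypothetical, and miner-a501 is used purely as an example of a second host to exclude.

```shell
# Hypothetical helper: require a cluster but exclude the listed hosts.
exclude_nodes() {
    cluster=$1
    shift
    expr_text="ClusterName==\"$cluster\""
    for host in "$@"; do
        expr_text="$expr_text && Machine!=\"$host\""
    done
    printf 'requirements = %s\n' "$expr_text"
}

# Print a line that could be pasted into myjob.sub.
exclude_nodes Miner miner-a500.rcac.purdue.edu miner-a501.rcac.purdue.edu
```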

DAGMan - Linear DAG

Condor schedules individual programs to run on unused compute servers, but it does not schedule a sequence of programs; Condor does not handle dependencies. Instead, the Directed Acyclic Graph Manager (DAGMan), a meta-scheduler which can handle dependencies, submits programs to Condor in a sequence specified by a directed acyclic graph (DAG). A DAG can represent a sequence of computations. Nodes (vertices) of the DAG represent executable programs; edges (arcs) identify the dependencies between programs.

This example is a linear DAG which represents three ordered executions named "A", "B", and "C". Program A must finish before program B may begin; B must finish before C may begin.

Diagram of Linear DAG

The DAG submission file, myprogram.dag, describes the DAG and specifies job submission files which control the execution of individual programs at each node of the DAG and under Condor's control:

# FILENAME:  myprogram.dag

# Specify the nodes (job submission files) of a DAG.
JOB A myprogram.A.sub
JOB B myprogram.B.sub
JOB C myprogram.C.sub

# Specify command-line arguments as macro definitions.
VARS A nodename="A"
VARS B nodename="B"
VARS C nodename="C"

# Specify the edges (dependencies, order of execution) of a DAG.
PARENT A CHILD B
PARENT B CHILD C
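
For longer chains, writing the DAG submission file by hand becomes tedious. As a sketch only, the shell function below prints the same three groups of lines for any ordered list of node names; the function name make_linear_dag is hypothetical, and it assumes the job submission files follow the myprogram.NODE.sub naming used here.

```shell
# Hypothetical helper: print a linear DAG for the node names given in order.
make_linear_dag() {
    # Nodes: one job submission file per node.
    for node in "$@"; do
        printf 'JOB %s myprogram.%s.sub\n' "$node" "$node"
    done
    # Macro definitions: pass each node its own name.
    for node in "$@"; do
        printf 'VARS %s nodename="%s"\n' "$node" "$node"
    done
    # Edges: each node depends on the one before it.
    prev=""
    for node in "$@"; do
        if [ -n "$prev" ]; then
            printf 'PARENT %s CHILD %s\n' "$prev" "$node"
        fi
        prev=$node
    done
}

make_linear_dag A B C
```

Redirecting the output to a file, for example "make_linear_dag A B C > myprogram.dag", produces the same JOB, VARS, and PARENT lines as the DAG submission file shown above.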

View the job submission file, myprogram.A.sub, for the first node of the DAG:

universe   = VANILLA
executable = myprogram
arguments  = $(Cluster) $(Process) $(nodename)
output     = myprogram.A.out
error      = myprogram.A.err
log        = myprogram.log
queue

View the job submission file, myprogram.B.sub, for the second node of the DAG:

universe   = VANILLA
executable = myprogram
arguments  = $(Cluster) $(Process) $(nodename)
output     = myprogram.B.out
error      = myprogram.B.err
log        = myprogram.log
queue

View the job submission file, myprogram.C.sub, for the third node of the DAG:

universe   = VANILLA
executable = myprogram
arguments  = $(Cluster) $(Process) $(nodename)
output     = myprogram.C.out
error      = myprogram.C.err
log        = myprogram.log
queue

While DAGMan can execute a different program for each node of the DAG, this example uses a single executable file to remain simple. Here is a short program, myprogram.c, that displays the command-line arguments which originally appear in the DAG submission file and are forwarded to the job submission files. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
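
To see how the arguments reach the program without compiling anything, you could use a shell stand-in. The sketch below is not the guide's myprogram.c; it is a hypothetical shell function that prints its arguments in the same layout as the output files shown later in this section.

```shell
# Stand-in for myprogram (not the guide's myprogram.c): print the
# arguments that the job submission file passes on the command line.
report_arguments() {
    echo "***  MAIN START  ***"
    echo
    echo "program name:     $1"
    echo "cluster number:   $2"
    echo "process number:   $3"
    echo "node name:        $4"
    echo
    echo "***  MAIN  STOP  ***"
}

# Condor would supply the last three values through
# $(Cluster), $(Process), and $(nodename).
report_arguments condor_exec.exe 746894 0 A
```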

To submit the DAG to Condor:

$ condor_submit_dag -force myprogram.dag

The argument -force requires Condor to overwrite the files that it produces. This is a useful convenience when the DAG's output files already exist from a preceding run and you no longer need that earlier output. DAGMan appends to, rather than overwrites, the file dagman.out.

The command condor_rm can remove a DAG from the job queue.

Command condor_q shows the sequence of execution:

$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:38552> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
746893.0 myusername       11/9  10:00   0+00:00:08 R  0   7.3  condor_dagman

$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:38552> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
746893.0 myusername       11/9  10:00   0+00:01:42 R  0   7.3  condor_dagman
746894.0 myusername       11/9  10:00   0+00:00:00 I  0   0.0  myprogram 746894 0 A

$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:38552> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
746893.0 myusername       11/9  10:00   0+00:18:48 R  0   7.3  condor_dagman
746897.0 myusername       11/9  10:15   0+00:00:00 I  0   0.0  myprogram 746897 0 B

$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:38552> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
746893.0 myusername       11/9  10:00   0+00:21:28 R  0   7.3  condor_dagman
746900.0 myusername       11/9  10:21   0+00:00:00 I  0   0.0  myprogram 746900 0 C

This report shows that DAGMan has its own cluster number. Each node of a DAG has its own cluster number. These cluster numbers are not necessarily in sequence since other users are submitting jobs to Condor.

View the output file of the first node of the DAG, myprogram.A.out:

***  MAIN START  ***

program name:     condor_exec.exe
cluster number:   746894
process number:   0
node name:        A

***  MAIN  STOP  ***

View the output file of the second node of the DAG, myprogram.B.out:

***  MAIN START  ***

program name:     condor_exec.exe
cluster number:   746897
process number:   0
node name:        B

***  MAIN  STOP  ***

View the output file of the third node of the DAG, myprogram.C.out:

***  MAIN START  ***

program name:     condor_exec.exe
cluster number:   746900
process number:   0
node name:        C

***  MAIN  STOP  ***

Each execution of the single program sees a unique node name: A, B, C.

The common log file records the execution of the three nodes of the DAG, myprogram.log:

000 (746894.000.000) 11/09 10:00:37 Job submitted from host: <128.211.157.86:38552>
    DAG Node: A
...
001 (746894.000.000) 11/09 10:15:09 Job executing on host: <128.211.157.10:59600>
...
005 (746894.000.000) 11/09 10:15:09 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...
000 (746897.000.000) 11/09 10:15:18 Job submitted from host: <128.211.157.86:38552>
    DAG Node: B
...
001 (746897.000.000) 11/09 10:21:24 Job executing on host: <128.211.157.10:52773>
...
005 (746897.000.000) 11/09 10:21:24 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...
000 (746900.000.000) 11/09 10:21:36 Job submitted from host: <128.211.157.86:38552>
    DAG Node: C
...
001 (746900.000.000) 11/09 10:26:06 Job executing on host: <128.211.157.10:59600>
...
005 (746900.000.000) 11/09 10:26:06 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...
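
The log also shows how long each node waited in the queue before it started. As a rough sketch, the awk program below computes the seconds between the "submitted" (000) and "executing" (001) events of each job. The sample lines are copied from the log above, and the calculation assumes both events fall on the same day.

```shell
# Copy the relevant events from the log above into a temporary file
# (stand-in for myprogram.log).
logfile=$(mktemp)
cat > "$logfile" <<'EOF'
000 (746894.000.000) 11/09 10:00:37 Job submitted from host: <128.211.157.86:38552>
    DAG Node: A
...
001 (746894.000.000) 11/09 10:15:09 Job executing on host: <128.211.157.10:59600>
...
000 (746897.000.000) 11/09 10:15:18 Job submitted from host: <128.211.157.86:38552>
    DAG Node: B
...
001 (746897.000.000) 11/09 10:21:24 Job executing on host: <128.211.157.10:52773>
...
EOF

# Field 1 is the event code, field 2 the job ID, field 4 the time of day.
waits=$(awk '
    function seconds(hms,   t) { split(hms, t, ":"); return t[1] * 3600 + t[2] * 60 + t[3] }
    $1 == "000" { submitted[$2] = seconds($4) }
    $1 == "001" { print $2, seconds($4) - submitted[$2] }
' "$logfile")

echo "$waits"
rm -f "$logfile"
```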

DAGMan - Parameter Sweep

A linear DAG may include a parameter sweep. The following diagram illustrates a three-step linear DAG with the middle process being a parameter sweep which applies a single computer program to unique data sets. The first and third steps might perform data preparation and collation, respectively:

Diagram of Parameter Sweep DAG

This example is a linear DAG which represents three ordered executions named "A", "B", and "C". Program A must finish before any run of program B used in the parameter sweep may begin; all runs of program B must finish before C may begin.

The DAG submission file, myprogram.dag, describes the DAG and specifies job submission files which control the execution of individual programs at each node of the DAG and under Condor's control. Notice that this DAG submission file is identical to a linear DAG submission file:

# FILENAME:  myprogram.dag

# Specify the nodes (job submission files) of a DAG.
JOB A myprogram.A.sub
JOB B myprogram.B.sub
JOB C myprogram.C.sub

# Specify command-line arguments as macro definitions.
VARS A nodename="A"
VARS B nodename="B"
VARS C nodename="C"

# Specify the edges (dependencies, order of execution) of a DAG.
PARENT A CHILD B
PARENT B CHILD C

View the job submission file, myprogram.A.sub, for the first node of the DAG:

universe   = VANILLA
executable = myprogram
arguments  = $(Cluster) $(Process) $(nodename)
output     = myprogram.A.out
error      = myprogram.A.err
log        = myprogram.log
queue

View the job submission file, myprogram.B.sub, for the second node, the parameter sweep, of the DAG. Command queue submits three copies of myprogram to Condor:

universe   = VANILLA
executable = myprogram
arguments  = $(Cluster) $(Process) $(nodename)
output     = myprogram.B.out.$(Process)
error      = myprogram.B.err.$(Process)
log        = myprogram.log
queue 3

View the job submission file, myprogram.C.sub, for the third node of the DAG:

universe   = VANILLA
executable = myprogram
arguments  = $(Cluster) $(Process) $(nodename)
output     = myprogram.C.out
error      = myprogram.C.err
log        = myprogram.log
queue

While DAGMan can execute a different program for each node of the DAG, this example uses a single executable file to remain simple. Here is a short program, myprogram.c, that displays the command-line arguments which originally appear in the DAG submission file and are forwarded to the job submission files. To compile this program for the Vanilla Universe, see Compiling Serial Programs.

To submit the DAG to Condor:

$ condor_submit_dag -force myprogram.dag

The argument -force requires Condor to overwrite the files that it produces. This is a useful convenience when the DAG's output files already exist from a preceding run and you no longer need the earlier output. DAGMan appends to, rather than overwrites, the file dagman.out.

The command condor_rm can remove a DAG from the job queue.

Three timely submissions of condor_q caught the three steps of the DAG, including the parameter sweep of the middle step:

$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:58916> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
746911.0 myusername       11/10 08:30   0+00:00:19 R  0   7.3  condor_dagman
746912.0 myusername       11/10 08:30   0+00:00:00 I  0   0.0  myprogram 746912 0 A

$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:58916> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
746911.0 myusername       11/10 08:30   0+00:02:25 R  0   7.3  condor_dagman
746913.0 myusername       11/10 08:32   0+00:00:00 I  0   0.0  myprogram 746913 0 B
746913.1 myusername       11/10 08:32   0+00:00:00 I  0   0.0  myprogram 746913 1 B
746913.2 myusername       11/10 08:32   0+00:00:00 I  0   0.0  myprogram 746913 2 B

$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:58916> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
746911.0 myusername       11/10 08:30   0+00:14:55 R  0   7.3  condor_dagman
746914.0 myusername       11/10 08:41   0+00:00:00 I  0   0.0  myprogram 746914 0 C

This report shows that DAGMan has its own cluster number. Each node of a DAG has its own cluster number. These cluster numbers are not necessarily in sequence since other users are submitting jobs to Condor. In addition, each process in the parameter sweep has its own process number, and they are in sequence.

View the output file of the first node of the DAG, myprogram.A.out:

***  MAIN START  ***

program name:   condor_exec.exe
cluster number: 746912
process number: 0
node name:      A

***  MAIN  STOP  ***

View the three output files of the three processes of the parameter sweep that is the second node of the DAG, myprogram.B.out.$(Process):

***  MAIN START  ***

program name:   condor_exec.exe
cluster number: 746913
process number: 0
node name:      B

***  MAIN  STOP  ***

***  MAIN START  ***

program name:   condor_exec.exe
cluster number: 746913
process number: 1
node name:      B

***  MAIN  STOP  ***

***  MAIN START  ***

program name:   condor_exec.exe
cluster number: 746913
process number: 2
node name:      B

***  MAIN  STOP  ***

View the output file of the third node of the DAG, myprogram.C.out:

***  MAIN START  ***

program name:   condor_exec.exe
cluster number: 746914
process number: 0
node name:      C

***  MAIN  STOP  ***

Each execution of the single program sees a unique node name: A, B, C. In the parameter sweep, all runs of the single program see the same node name, B; however, each copy sees a unique process number.
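
The unique process number is what lets each copy work on its own data set. The sketch below writes a variant of the sweep's job submission file in which every copy receives a different input file. It rests on several assumptions that are not part of this example: that input files named mydata.0, mydata.1, and mydata.2 exist in the submission directory, that myprogram accepts a data file name as a fourth argument, and the file name myprogram.B.data.sub itself.

```shell
# Write a hypothetical variant of myprogram.B.sub. The quoted 'EOF'
# keeps the shell from expanding $(...), so Condor sees it verbatim.
cat > myprogram.B.data.sub <<'EOF'
universe   = VANILLA
executable = myprogram

# Assumption: myprogram accepts a data file name as a fourth argument.
arguments  = $(Cluster) $(Process) $(nodename) mydata.$(Process)

# Assumption: mydata.0, mydata.1, and mydata.2 exist in this directory.
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
transfer_input_files    = mydata.$(Process)

output     = myprogram.B.out.$(Process)
error      = myprogram.B.err.$(Process)
log        = myprogram.log
queue 3
EOF

grep '^transfer_input_files' myprogram.B.data.sub
```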

The common log file records the execution of the three nodes of the DAG, myprogram.log:


000 (746912.000.000) 11/10 08:30:43 Job submitted from host: <128.211.157.86:58916>
    DAG Node: A
...
001 (746912.000.000) 11/10 08:32:36 Job executing on host: <128.211.157.10:37230>
...
005 (746912.000.000) 11/10 08:32:36 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...
000 (746913.000.000) 11/10 08:32:44 Job submitted from host: <128.211.157.86:58916>
    DAG Node: B
...
000 (746913.001.000) 11/10 08:32:44 Job submitted from host: <128.211.157.86:58916>
    DAG Node: B
...
000 (746913.002.000) 11/10 08:32:44 Job submitted from host: <128.211.157.86:58916>
    DAG Node: B
...
001 (746913.000.000) 11/10 08:41:12 Job executing on host: <128.211.157.10:34460>
...
001 (746913.001.000) 11/10 08:41:12 Job executing on host: <128.211.157.10:36048>
...
005 (746913.000.000) 11/10 08:41:12 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...
005 (746913.001.000) 11/10 08:41:12 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...
001 (746913.002.000) 11/10 08:41:12 Job executing on host: <128.211.157.10:34460>
...
005 (746913.002.000) 11/10 08:41:13 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...
000 (746914.000.000) 11/10 08:41:26 Job submitted from host: <128.211.157.86:58916>
    DAG Node: C
...
001 (746914.000.000) 11/10 08:48:03 Job executing on host: <128.211.157.10:40848>
...
005 (746914.000.000) 11/10 08:48:04 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...

DAGMan - Another Parameter Sweep

Every node of a DAG may be a parameter sweep. This means that each run of the entire DAG can process a unique set of input data. This is the logical extension of a single program used in a parameter sweep. The disadvantage of this method is the interdependence among the copies of the DAG.

This example is a linear DAG which represents three ordered executions named "A", "B", and "C". This DAG runs as a parameter sweep. The interdependence among the runs of this DAG means that all runs of the program associated with node A must finish before any run of the program associated with node B may begin; likewise, all runs of the program associated with node B must finish before any run of the program associated with node C may begin. If one of the runs of node A experiences a delay because the executable file landed on a slow compute node, then all runs of the parameter sweep wait, not just the run which experiences the delay.

Diagram of Parameter Sweep DAG

The DAG submission file, myprogram.dag, describes the DAG and specifies job submission files which control the execution of individual programs at each node of the DAG and under Condor's control. Notice that this DAG submission file is identical to a linear DAG submission file:

# FILENAME:  myprogram.dag

# Specify the nodes (job submission files) of a DAG.
JOB A myprogram.A.sub
JOB B myprogram.B.sub
JOB C myprogram.C.sub

# Specify command-line arguments as macro definitions.
VARS A nodename="A"
VARS B nodename="B"
VARS C nodename="C"

# Specify the edges (dependencies, order of execution) of a DAG.
PARENT A CHILD B
PARENT B CHILD C

View the job submission file, myprogram.A.sub, for the first node of the DAG:

universe     = VANILLA
executable   = myprogram
arguments    = $(Cluster) $(Process) $(nodename)
output       = myprogram.A.out.$(Process)
error        = myprogram.A.err.$(Process)
log          = myprogram.log
# queue 3 runs
queue 3

View the job submission file, myprogram.B.sub, for the second node of the DAG:

universe     = VANILLA
executable   = myprogram
arguments    = $(Cluster) $(Process) $(nodename)
output       = myprogram.B.out.$(Process)
error        = myprogram.B.err.$(Process)
log          = myprogram.log
# queue 3 runs
queue 3

View the job submission file, myprogram.C.sub, for the third node of the DAG:

universe     = VANILLA
executable   = myprogram
arguments    = $(Cluster) $(Process) $(nodename)
output       = myprogram.C.out.$(Process)
error        = myprogram.C.err.$(Process)
log          = myprogram.log
# queue 3 runs
queue 3

For each node of the DAG, command queue submits three copies of myprogram to Condor.

While DAGMan can execute a different program for each node of the DAG, this example uses a single executable file to remain simple. Here is a short program, myprogram.c, that displays the command-line arguments which originally appear in the DAG submission file and are forwarded to the job submission files. To compile this program for the Vanilla Universe, see Compiling Serial Programs.

To submit the DAG to Condor:

$ condor_submit_dag -force myprogram.dag

The argument -force requires Condor to overwrite the files that it produces. This is a useful convenience when the DAG's output files already exist from a preceding run and you no longer need the earlier output. DAGMan appends to, rather than overwrites, the file dagman.out.

The command condor_rm can remove a DAG from the job queue.

Three timely submissions of condor_q caught the parameter sweeps of the three steps of the DAG:

$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:58916> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
746924.0 myusername       11/11 14:28   0+00:00:22 R  0   7.3  condor_dagman
746925.0 myusername       11/11 14:28   0+00:00:00 I  0   0.0  myprogram 746925 0 A
746925.1 myusername       11/11 14:28   0+00:00:00 I  0   0.0  myprogram 746925 1 A
746925.2 myusername       11/11 14:28   0+00:00:00 I  0   0.0  myprogram 746925 2 A

$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:58916> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
746924.0 myusername       11/11 14:28   0+00:04:51 R  0   7.3  condor_dagman
746926.0 myusername       11/11 14:32   0+00:00:00 I  0   0.0  myprogram 746926 0 B
746926.1 myusername       11/11 14:32   0+00:00:00 I  0   0.0  myprogram 746926 1 B
746926.2 myusername       11/11 14:32   0+00:00:00 I  0   0.0  myprogram 746926 2 B

$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:58916> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
746924.0 myusername       11/11 14:28   0+00:09:55 R  0   7.3  condor_dagman
746927.0 myusername       11/11 14:37   0+00:00:00 I  0   0.0  myprogram 746927 0 C
746927.1 myusername       11/11 14:37   0+00:00:00 I  0   0.0  myprogram 746927 1 C
746927.2 myusername       11/11 14:37   0+00:00:00 I  0   0.0  myprogram 746927 2 C

This report shows that DAGMan has its own cluster number. Each node of the DAG has its own cluster number. These cluster numbers are not necessarily in sequence since other users are submitting jobs to Condor. In addition, since each node is a parameter sweep, each process in the parameter sweep has its own process number, and they are in sequence.

View the three output files of the zero-th run of the parameter sweep of the DAG: myprogram.A.out.0, myprogram.B.out.0, and myprogram.C.out.0:

***  MAIN START  ***

program name:   condor_exec.exe
cluster number: 746925
process number: 0
node name:      A

***  MAIN  STOP  ***

***  MAIN START  ***

program name:   condor_exec.exe
cluster number: 746926
process number: 0
node name:      B

***  MAIN  STOP  ***

***  MAIN START  ***

program name:   condor_exec.exe
cluster number: 746927
process number: 0
node name:      C

***  MAIN  STOP  ***

Similar sets of output files exist for the other two runs of the parameter sweep. Each execution of the single program sees a unique pair of node name (A, B, C) and process number (0, 1, 2).
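To make the pairing of node names and process numbers concrete, this short Python sketch lists every output filename that the example would produce. It assumes the naming pattern myprogram.NODE.out.PROCESS seen in the three filenames above; the variable names are illustrative.

```python
from itertools import product

node_names = ["A", "B", "C"]   # one DAG node per name
process_numbers = range(3)     # three runs in each parameter sweep

# One output file per (node name, process number) pair.
output_files = [
    f"myprogram.{node}.out.{process}"
    for node, process in product(node_names, process_numbers)
]
for name in output_files:
    print(name)
```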

The common log file records the execution of the three runs of the parameter sweep. In particular, it shows that all runs of node B start only after all runs of node A reach completion.


DAGMan - Multiple, Independent DAGs

A single use of condor_submit_dag may execute several independent DAGs. Each independent DAG has its own DAG submission file. The names of these DAG submission files appear as command-line arguments of condor_submit_dag, as in the following:

condor_submit_dag -force mydagsubmissionfile1 mydagsubmissionfile2 ... mydagsubmissionfileN

This example consists of two independent linear DAGs: one represents three ordered executions named "A", "B", and "C", and the other represents two ordered executions named "D" and "E". While each sequence must execute in the order specified by its own DAG, there is no dependency between the two sequences. In other words, the execution of step E depends only on the completion of step D, not on the completion of step A, B, or C.

Diagram of Multiple Independent DAGs

Here are the two independent DAG submission files, myprogram.dag.1 and myprogram.dag.2:

# FILENAME:  myprogram.dag.1

# Specify the nodes (job submission files) of a DAG.
JOB A myprogram.dag1.A.sub
JOB B myprogram.dag1.B.sub
JOB C myprogram.dag1.C.sub

# Specify command-line arguments as macro definitions.
VARS A nodename="A"
VARS B nodename="B"
VARS C nodename="C"

# Specify the edges (dependencies, order of execution) of a DAG.
PARENT A CHILD B
PARENT B CHILD C

# FILENAME:  myprogram.dag.2

# Specify the nodes (job submission files) of a DAG.
JOB D myprogram.dag2.D.sub
JOB E myprogram.dag2.E.sub

# Specify command-line arguments as macro definitions.
VARS D nodename="D"
VARS E nodename="E"

# Specify the edges (dependencies, order of execution) of a DAG.
PARENT D CHILD E

View the three job submission files of DAG 1:

# FILENAME:  myprogram.dag1.A.sub
universe   = VANILLA
executable = myprogram
arguments  = $(Cluster) $(Process) $(nodename)
output     = myprogram.dag1.A.out
error      = myprogram.dag1.A.err
log        = myprogram.dag1.log
queue

# FILENAME:  myprogram.dag1.B.sub
universe   = VANILLA
executable = myprogram
arguments  = $(Cluster) $(Process) $(nodename)
output     = myprogram.dag1.B.out
error      = myprogram.dag1.B.err
log        = myprogram.dag1.log
queue

# FILENAME:  myprogram.dag1.C.sub
universe   = VANILLA
executable = myprogram
arguments  = $(Cluster) $(Process) $(nodename)
output     = myprogram.dag1.C.out
error      = myprogram.dag1.C.err
log        = myprogram.dag1.log
queue

View the two job submission files of DAG 2:

# FILENAME:  myprogram.dag2.D.sub
universe   = VANILLA
executable = myprogram
arguments  = $(Cluster) $(Process) $(nodename)
output     = myprogram.dag2.D.out
error      = myprogram.dag2.D.err
log        = myprogram.dag2.log
queue

# FILENAME:  myprogram.dag2.E.sub
universe   = VANILLA
executable = myprogram
arguments  = $(Cluster) $(Process) $(nodename)
output     = myprogram.dag2.E.out
error      = myprogram.dag2.E.err
log        = myprogram.dag2.log
queue

While DAGMan can execute a different program for each node of the DAG, this example uses a single executable file to remain simple. The short program myprogram.c displays the command-line arguments, which originate in the DAG submission file and are forwarded to the job submission files. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
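The source of myprogram.c is not reproduced in this section. As a stand-in only, the following Python sketch formats the three forwarded arguments in the same layout as the sample output files shown for this example. It is not the original program; the function name format_report is hypothetical and the label spacing is copied from the sample output.

```python
def format_report(program, cluster, process, nodename):
    """Return a report laid out like the sample output of this example."""
    lines = [
        "***  MAIN START  ***",
        "",
        f"program name:     {program}",
        f"cluster number:   {cluster}",
        f"process number:   {process}",
        f"node name:        {nodename}",
        "",
        "***  MAIN  STOP  ***",
    ]
    return "\n".join(lines)


# The arguments arrive as $(Cluster) $(Process) $(nodename) in the job submission file.
report = format_report("condor_exec.exe", "746894", "0", "A")
print(report)
```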

To submit the independent DAGs to Condor:

$ condor_submit_dag -force myprogram.dag.1 myprogram.dag.2

The argument -force tells Condor to overwrite the files that it produces. This is convenient when the DAG's output files already exist from a preceding run and you no longer need the earlier output. DAGMan appends to the file dagman.out rather than overwriting it.

The command condor_rm can remove a DAG from the job queue.

Command condor_q shows the start of the two independent DAGs:

$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:38552> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
746918.0 myusername       11/10 11:06   0+00:00:32 R  0   7.3  condor_dagman
746919.0 myusername       11/10 11:06   0+00:00:00 I  0   0.0  myprogram.dag1 74691
746920.0 myusername       11/10 11:06   0+00:00:00 I  0   0.0  myprogram.dag2 74692

This report shows that DAGMan has its own cluster number. Each independent DAG has its own set of cluster numbers. These cluster numbers are not necessarily in sequence since other users are submitting jobs to Condor.

View the output file of the first node of DAG 1, myprogram.dag1.A.out:

***  MAIN START  ***

program name:     condor_exec.exe
cluster number:   746894
process number:   0
node name:        A

***  MAIN  STOP  ***

Similarly named output files exist for the other four nodes.

This example ran with each independent DAG having its own log file. Here is the log file for DAG 2, myprogram.dag2.log:

000 (746920.000.000) 11/10 11:06:17 Job submitted from host: <128.211.157.86:58916>
    DAG Node: 1.D
...
001 (746920.000.000) 11/10 11:12:00 Job executing on host: <128.211.157.10:42201>
...
005 (746920.000.000) 11/10 11:12:00 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...
000 (746922.000.000) 11/10 11:12:08 Job submitted from host: <128.211.157.86:58916>
    DAG Node: 1.E
...
001 (746922.000.000) 11/10 11:18:38 Job executing on host: <128.211.157.10:49358>
...
005 (746922.000.000) 11/10 11:18:38 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...

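The text "DAG Node: 1.D" refers to step D of the second independent DAG listed as a command-line argument of condor_submit_dag. DAGMan numbers the independent DAGs starting from zero, so the prefix "1" identifies the second DAG.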

Finally, this example could be reshaped into a parameter sweep, but the need to list the names of separate DAG submission files as command-line arguments of condor_submit_dag is very inconvenient for large sweeps.
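One way to reduce that inconvenience is to generate the DAG submission files and the command line with a script. The following Python sketch builds the text of each linear DAG and the matching condor_submit_dag argument list. It only assembles strings and neither writes files nor runs Condor. The helper name linear_dag_text is hypothetical, and the file naming follows the pattern used in this example.

```python
def linear_dag_text(dag_number, node_names):
    """Return DAG submission file text for a linear chain of nodes."""
    lines = [f"# FILENAME:  myprogram.dag.{dag_number}", ""]
    for name in node_names:
        lines.append(f"JOB {name} myprogram.dag{dag_number}.{name}.sub")
    for name in node_names:
        lines.append(f'VARS {name} nodename="{name}"')
    # Each node is the parent of the node that follows it.
    for parent, child in zip(node_names, node_names[1:]):
        lines.append(f"PARENT {parent} CHILD {child}")
    return "\n".join(lines) + "\n"


dag_nodes = {1: ["A", "B", "C"], 2: ["D", "E"]}
dag_files = {
    f"myprogram.dag.{number}": linear_dag_text(number, nodes)
    for number, nodes in dag_nodes.items()
}
command = ["condor_submit_dag", "-force", *dag_files]
print(" ".join(command))
```

Writing each entry of dag_files to disk and passing command to a process launcher would complete the submission; both steps are left out here so that the sketch stays free of side effects.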


DAGMan - Pre and Post Scripts

The Condor keyword SCRIPT specifies optional processing that occurs either before a job within a DAG starts its execution or after a job within a DAG completes its execution. A PRE script performs processing before a job starts its execution under Condor; a POST script performs processing after a job completes its execution under Condor. A node in the DAG includes the job together with PRE and/or POST scripts. These scripts run on the submission host, not on a compute node.

A common use of a PRE script places files in a staging area for a cluster of jobs to use; a common use of a POST script cleans up or removes files once that cluster of jobs reaches completion. An example might use a PRE script to transfer needed files from long-term storage; the corresponding POST script might return the processed files to long-term storage. In another example about staging files, a PRE script might archive a subdirectory structure of files in preparation for transferring that archive as a single input file to the compute node, while the POST script might extract output files from the archive which Condor transferred from the compute node to the submission host after job completion.
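The archive-based staging described above can be sketched in Python with the standard tarfile module. This is an illustration under stated assumptions rather than a recommended implementation: the function names pre_archive and post_extract and the file names are hypothetical, and a real PRE or POST script would be a separate executable named on a SCRIPT line.

```python
import tarfile
import tempfile
from pathlib import Path


def pre_archive(source_dir, archive_path):
    """PRE step: pack a directory tree into one gzip-compressed tar file."""
    with tarfile.open(str(archive_path), "w:gz") as tar:
        tar.add(str(source_dir), arcname=Path(source_dir).name)


def post_extract(archive_path, dest_dir):
    """POST step: unpack an archive returned from the compute node."""
    with tarfile.open(str(archive_path), "r:gz") as tar:
        tar.extractall(str(dest_dir))


# Round trip in a scratch directory to show both steps.
with tempfile.TemporaryDirectory() as scratch:
    scratch = Path(scratch)
    inputs = scratch / "inputs"
    (inputs / "sub").mkdir(parents=True)
    (inputs / "sub" / "data.txt").write_text("sample input\n")

    archive = scratch / "inputs.tar.gz"
    pre_archive(inputs, archive)

    restored = scratch / "restored"
    restored.mkdir()
    post_extract(archive, restored)
    round_trip = (restored / "inputs" / "sub" / "data.txt").read_text()

print(round_trip, end="")
```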

The following flowchart illustrates a DAG with PRE and POST scripts:

Diagram of Pre/Post-Processing DAG

The DAG submission file, myprogram.dag, describes the DAG and specifies job submission files which control the execution of individual programs at each node of the DAG and under Condor's control. It also specifies the PRE and POST scripts:

# FILENAME:  myprogram.dag

# Specify the nodes (job submission files) of a DAG.
JOB A myprogram.A.sub
JOB B myprogram.B.sub

# Specify PRE and POST scripts.
SCRIPT PRE  A myprogram_preA.scr
SCRIPT POST A myprogram_pstA.scr
SCRIPT PRE  B myprogram_preB.scr
SCRIPT POST B myprogram_pstB.scr

# Specify command-line arguments as macro definitions.
VARS A nodename="A"
VARS B nodename="B"

# Specify the edges (dependencies, order of execution) of a DAG.
PARENT A CHILD B

View the job submission file, myprogram.A.sub, for the first node of the DAG:

universe   = VANILLA
executable = myprogram
arguments  = $(Cluster) $(Process) $(nodename)
output     = myprogram.A.out
error      = myprogram.A.err
log        = myprogram.log
queue

View the job submission file, myprogram.B.sub, for the second node of the DAG:

universe   = VANILLA
executable = myprogram
arguments  = $(Cluster) $(Process) $(nodename)
output     = myprogram.B.out
error      = myprogram.B.err
log        = myprogram.log
queue

The four PRE and POST scripts write a short message to a common output file:

#!/bin/sh
# FILENAME:  myprogram_preA.scr
echo "before node A" >>myprogram.lst
/bin/hostname >>myprogram.lst

#!/bin/sh
# FILENAME:  myprogram_pstA.scr
echo "after node A" >>myprogram.lst
/bin/hostname >>myprogram.lst

#!/bin/sh
# FILENAME:  myprogram_preB.scr
echo "before node B" >>myprogram.lst
/bin/hostname >>myprogram.lst

#!/bin/sh
# FILENAME:  myprogram_pstB.scr
echo "after node B" >>myprogram.lst
/bin/hostname >>myprogram.lst

While DAGMan can execute a different program for each node of the DAG, this example uses a single executable file to remain simple. The short program myprogram.c displays the command-line arguments, which originate in the DAG submission file and are forwarded to the job submission files. To compile this program for the Vanilla Universe, see Compiling Serial Programs.

To submit the DAG to Condor:

$ condor_submit_dag -force myprogram.dag

The argument -force tells Condor to overwrite the files that it produces. This is convenient when the DAG's output files already exist from a preceding run and you no longer need the earlier output. DAGMan appends to the file dagman.out rather than overwriting it.

The command condor_rm can remove a DAG from the job queue.

View the output file of the first node of the DAG, myprogram.A.out:

***  MAIN START  ***

program name:     condor_exec.exe
cluster number:   746948
process number:   0
node name:        A

***  MAIN  STOP  ***

View the output file of the second node of the DAG, myprogram.B.out:

***  MAIN START  ***

program name:     condor_exec.exe
cluster number:   746949
process number:   0
node name:        B

***  MAIN  STOP  ***

Each execution of the single program sees a unique node name: A, B.

View the common output file, myprogram.lst, of the four PRE and POST scripts. The output shows that the submission host itself executed the PRE and POST scripts:

before node A
condor.rcac.purdue.edu
after node A
condor.rcac.purdue.edu
before node B
condor.rcac.purdue.edu
after node B
condor.rcac.purdue.edu
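The order of the messages above follows from the DAG: within a node the PRE script runs first, then the job, then the POST script, and a child node starts only after its parent node completes. The following Python sketch derives that order from the text of a DAG submission file. It is a simplified reader and not DAGMan's own parser: it handles only upper-case JOB, SCRIPT, and PARENT ... CHILD lines, and the function name node_steps is hypothetical. It relies on graphlib, which requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

dag_text = """\
JOB A myprogram.A.sub
JOB B myprogram.B.sub
SCRIPT PRE  A myprogram_preA.scr
SCRIPT POST A myprogram_pstA.scr
SCRIPT PRE  B myprogram_preB.scr
SCRIPT POST B myprogram_pstB.scr
PARENT A CHILD B
"""


def node_steps(text):
    """List PRE script, job file, and POST script in a valid run order."""
    jobs, scripts, parents = {}, {}, {}
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue
        if words[0] == "JOB":
            jobs[words[1]] = words[2]
            parents.setdefault(words[1], set())
        elif words[0] == "SCRIPT":
            # SCRIPT <PRE|POST> <node> <script file>
            scripts[(words[2], words[1])] = words[3]
        elif words[0] == "PARENT":
            split = words.index("CHILD")
            for child in words[split + 1:]:
                parents.setdefault(child, set()).update(words[1:split])
    steps = []
    # static_order() yields every parent before any of its children.
    for node in TopologicalSorter(parents).static_order():
        if (node, "PRE") in scripts:
            steps.append(scripts[(node, "PRE")])
        steps.append(jobs[node])
        if (node, "POST") in scripts:
            steps.append(scripts[(node, "POST")])
    return steps


steps = node_steps(dag_text)
for step in steps:
    print(step)
```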


Job Priority

You may assign a priority to each of your jobs within a specific Condor queue (on a specific submission host). A priority value can be any integer, and higher values mean higher priority. Condor generally attempts to assign a compute node to your highest priority job first. However, a higher priority job will not necessarily get a compute node before a lower priority job: an available compute node may match the requirements of a lower priority job but not those of a higher priority job. Even once started, a higher priority job may not finish before lower priority jobs, because it might have a longer run time or be preempted and have to restart more often.

Job priorities are user-specific and queue-specific. They do not affect which user's jobs run first, only which of your jobs start before your other jobs. The default job priority is 0.
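The interaction between priority and requirements can be pictured with a toy model. The Python sketch below is not Condor's matchmaking algorithm; it only illustrates why a lower priority job can start first when the available compute node does not satisfy the requirements of the higher priority job. The job IDs, the attribute sets, and the function name pick_job are all hypothetical.

```python
def pick_job(idle_jobs, node_attributes):
    """Return the highest priority idle job that the node can run, or None."""
    candidates = [
        job for job in idle_jobs
        if job["requires"] <= node_attributes  # every requirement is met
    ]
    if not candidates:
        return None
    # max() keeps the earliest listed job among those of equal priority.
    return max(candidates, key=lambda job: job["priority"])


idle_jobs = [
    {"id": "1.0", "priority": 5, "requires": {"HAS_MATLAB"}},
    {"id": "2.0", "priority": 0, "requires": set()},
]

# A node without MATLAB can only take the priority 0 job.
on_plain_node = pick_job(idle_jobs, set())
# A node that advertises MATLAB takes the priority 5 job.
on_matlab_node = pick_job(idle_jobs, {"HAS_MATLAB"})
print(on_plain_node["id"], on_matlab_node["id"])
```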

Job priorities are useful, for example, when you have submitted many jobs with the default priority and only afterward realize that you would prefer to see the results of another job first. You may submit this new, urgent job and give it a higher priority so that Condor tries to find a compute node for it before finding compute nodes for your other jobs. This works only if you submit the new job to the same queue (on the same submission host) as your other jobs, because job priorities are queue-specific.

First submit a job to the Condor queue at the default priority (0). To raise this job's priority to 5:

$ condor_q myusername

-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35407> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
260187.0   myusername      8/30 13:59   0+00:00:00 I  0   19.5 hello

1 jobs; 1 idle, 0 running, 0 held

$ condor_prio -p 5 260187.0
$ condor_q myusername

-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35407> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
260187.0   myusername      8/30 13:59   0+00:00:03 R  5   19.5 hello

1 jobs; 0 idle, 1 running, 0 held


Commercial and Third-Party Applications

Several commercial and third-party software packages are installed on RCAC resources and available through BoilerGrid.

Running one of these applications on BoilerGrid through Condor can be tricky. Ideally, to achieve high throughput, you want to maximize the number of candidate compute nodes that can execute your job. However, because BoilerGrid consists of many different types of systems, not all compute nodes have a given application or may be able to run your job effectively. You need to specify enough requirements to ensure that your job submission only tries to run on systems that are capable of running your job, but not be so specific that the requirements limit the running of your jobs to too few systems. You must carefully balance these two considerations. There is no single, fool-proof method for executing these applications, but a few general comments may help to guide you.

These applications are executable files supplied by the manufacturer. Since these executable files cannot be relinked with condor_compile, you must use the Vanilla Universe when submitting jobs that use them to Condor. By default, the Vanilla Universe transfers to the compute node whatever file the job's executable command names. If you intend to use an executable that is already available on the compute nodes you are using, then specify transfer_executable = FALSE in your job submission file to avoid needlessly copying the manufacturer's executable from the submission host to the compute node with every attempted run. If you specify a shell script of your own creation as the Condor executable, leave this value at the default (TRUE).
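The choice between the two settings can be sketched as a small Python helper that produces the text of a job submission file. The helper name submit_description and its parameters are hypothetical; the submit commands themselves are the ones used in the examples of this guide.

```python
def submit_description(executable, installed_on_node, stem="myjob"):
    """Return the text of a Vanilla Universe job submission file."""
    # Copy the executable only when the compute node does not already have it.
    transfer = "FALSE" if installed_on_node else "TRUE"
    lines = [
        "universe     = VANILLA",
        f"transfer_executable = {transfer}",
        f"executable          = {executable}",
        "should_transfer_files   = IF_NEEDED",
        "when_to_transfer_output = ON_EXIT",
        f"input  = {stem}.in",
        f"output = {stem}.out",
        f"error  = {stem}.err",
        f"log    = {stem}.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Manufacturer's executable already present on the compute node.
vendor_job = submit_description("/$$(MAPLE_EXE)", installed_on_node=True)
# Your own shell script, which Condor must copy to the compute node.
script_job = submit_description("myjob.sh", installed_on_node=False)
print(vendor_job)
```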

Be aware there are several potential reasons a given application may not be available on all compute nodes in BoilerGrid. The application examples below all run on 64-bit Linux platforms, but those applications may or may not be available on 32-bit Linux or Windows platforms. Owners of some compute nodes may have agreed to include their nodes in BoilerGrid, but they may not have installed some applications. Commercial software licenses may or may not allow some compute nodes to have an application installed.

Ideally, your job submission would specify exactly the subset of compute nodes that have the application installed. To that end, many compute nodes explicitly advertise their applications. They advertise through ClassAd attributes such as HAS_MATLAB or HAS_MAPLE. Using such a ClassAd attribute may exclude some compute nodes that do have the application installed but that do not advertise this fact. However, this method is relatively robust. Unfortunately, only a few applications currently appear in an explicit advertisement. If you need to use an application which no node explicitly advertises through a ClassAd, you may find that you need to restrict the set of potential compute nodes in other ways.
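A requirements expression built from such attributes is a conjunction of simple tests. As a sketch, the Python function below assembles one from an architecture, an operating system, and a list of advertised attributes, so that you can loosen or tighten the expression by changing its inputs. The function name build_requirements is hypothetical; the attribute names are those shown in this guide.

```python
def build_requirements(arch=None, opsys=None, flags=()):
    """Join the given tests into one ClassAd requirements expression."""
    terms = []
    if arch:
        terms.append(f'(Arch == "{arch}")')
    if opsys:
        terms.append(f'(OpSys == "{opsys}")')
    # Each flag names an attribute that a node must advertise as TRUE.
    terms.extend(f"({flag} == TRUE)" for flag in flags)
    return " && ".join(terms)


strict = build_requirements("X86_64", "LINUX", ["HAS_MAPLE"])
loose = build_requirements(flags=["HAS_MAPLE"])
print("requirements =", strict)
print("requirements =", loose)
```

The stricter expression matches fewer compute nodes but is less likely to land on a node that cannot run the job, which is the balance described above.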

The examples in the next few sections follow the guidelines described above. Start with the example that most closely resembles your computing goal. After a successful run of a simple job, you may modify your approach to attempt to maximize the number of candidate compute nodes without also including nodes that fail to run your submission.

RCAC tested the examples in the next few sections on some RCAC resources of BoilerGrid but not recently, so you may find some differences. If you need assistance, please contact RCAC.

With the exception of Octave and R, which are free software, only Purdue affiliates may use the following licensed software.

Maple

Maple is a general-purpose computer algebra system. This section illustrates how to submit a small Maple job to BoilerGrid. This Maple example differentiates, integrates, and finds the roots of polynomials.

Prepare a Maple input file with an appropriate filename, here named myjob.in:

# FILENAME:  myjob.in

# Differentiate with respect to x.
diff( 2*x^3,x );

# Integrate with respect to x.
int( 3*x^2*sin(x)+x,x );

# Solve for x.
solve( 3*x^2+2*x-1,x );

Use the ClassAd attribute "HAS_MAPLE" to discover how many compute nodes advertise Maple:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (HAS_MAPLE==True)'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX 25360 16406    3983      4970       0          1        0
               Total 25360 16406    3983      4970       0          1        0
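To turn a summary like this into a quick estimate of how many matching slots are currently free, you can read the Total row programmatically. The Python sketch below parses the table text shown above; the function name parse_totals is hypothetical, and the sketch assumes the column layout of this particular report.

```python
status_text = """\
                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX 25360 16406    3983      4970       0          1        0
               Total 25360 16406    3983      4970       0          1        0
"""


def parse_totals(text):
    """Return the final Total row as a mapping of column name to count."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, total_row = rows[0], rows[-1]
    # The data row begins with a label, which the header row lacks.
    return dict(zip(header, (int(value) for value in total_row[1:])))


totals = parse_totals(status_text)
unclaimed_share = totals["Unclaimed"] / totals["Total"]
print(f"{totals['Unclaimed']} of {totals['Total']} slots unclaimed "
      f"({unclaimed_share:.1%})")
```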

Prepare a job submission file with an appropriate filename, here named myjob.sub. This example uses the ClassAd attribute HAS_MAPLE to locate a compute node, the ClassAd attribute MAPLE_EXE for the path to Maple on the compute node, and Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):

# FILENAME:  myjob.sub

universe     = VANILLA
requirements = (HAS_MAPLE==TRUE)

# Run the executable already installed on the compute node.
transfer_executable = FALSE
executable          = /$$(MAPLE_EXE)

# Use the -q option to suppress startup messages.
# arguments = -q

should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
input  = myjob.in
output = myjob.out
error  = myjob.err
log    = myjob.log

queue

Submit the job:

$ condor_submit myjob.sub

View job status:

$ condor_q myusername

If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics.

View results in the file for all standard output, here named myjob.out:

    |\^/|     Maple 13 (X86 64 LINUX)
._|\|   |/|_. Copyright (c) Maplesoft, a division of Waterloo Maple Inc. 2009
 \  MAPLE  /  All rights reserved. Maple is a trademark of
 <____ ____>  Waterloo Maple Inc.
      |       Type ? for help.
# FILENAME:  myjob.in
>
# Differentiate wrt x.
> diff(2*x^3,x );
                                         2
                                      6 x

>
# Integrate wrt x.
> int(3*x^2*sin(x)+x,x );
                                                           2
                      2                                   x
                  -3 x  cos(x) + 6 cos(x) + 6 x sin(x) + ----
                                                          2

>
# Solve for x.
> solve(3*x^2+2*x-1,x );
                                    1/3, -1

> quit
memory used=3.0MB, alloc=2.8MB, time=0.04
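If you want an independent check of results like these, a few lines of Python suffice. The sketch below recomputes the two roots exactly with the quadratic formula and confirms numerically that the derivative of the reported antiderivative equals the integrand at one sample point. The sample point and step size are arbitrary choices, and math.isqrt requires Python 3.8 or later.

```python
from fractions import Fraction
from math import cos, isqrt, sin

# Roots of 3*x^2 + 2*x - 1 by the quadratic formula, in exact arithmetic.
a, b, c = 3, 2, -1
discriminant = b * b - 4 * a * c
root = isqrt(discriminant)
assert root * root == discriminant  # perfect square, so the roots are rational
roots = sorted(Fraction(-b + sign * root, 2 * a) for sign in (1, -1))


def integrand(x):
    return 3 * x**2 * sin(x) + x


def antiderivative(x):
    # Expression reported by Maple in the output above.
    return -3 * x**2 * cos(x) + 6 * cos(x) + 6 * x * sin(x) + x**2 / 2


# Central difference approximation of the derivative at a sample point.
x0, h = 1.3, 1e-5
slope = (antiderivative(x0 + h) - antiderivative(x0 - h)) / (2 * h)

print(roots)
print(abs(slope - integrand(x0)) < 1e-6)
```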

Any output written to standard error will appear in myjob.err.

You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:

000 (342479.000.000) 01/29 10:11:24 Job submitted from host: <128.211.157.86:60004>
...
001 (342479.000.000) 01/29 10:11:53 Job executing on host: <128.211.157.10:53997?PrivNet=condor.ccb.purdue.edu>
...
005 (342479.000.000) 01/29 10:11:57 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
... 
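The event lines of a log like this one also tell you how long the job waited and how long it ran. The following Python sketch extracts the timestamps of the submit (000), execute (001), and terminate (005) events and computes both intervals. The log carries no year, so the sketch assumes one; it also assumes a single job that executed once, and the function name event_times is hypothetical.

```python
import re
from datetime import datetime

EVENT = re.compile(r"^(\d{3}) \(\d+\.\d+\.\d+\) (\d\d/\d\d \d\d:\d\d:\d\d) ")

log_text = """\
000 (342479.000.000) 01/29 10:11:24 Job submitted from host: <128.211.157.86:60004>
...
001 (342479.000.000) 01/29 10:11:53 Job executing on host: <128.211.157.10:53997?PrivNet=condor.ccb.purdue.edu>
...
005 (342479.000.000) 01/29 10:11:57 Job terminated.
    (1) Normal termination (return value 0)
...
"""


def event_times(text, year=2000):
    """Map each event code to its timestamp (the year is an assumption)."""
    times = {}
    for line in text.splitlines():
        match = EVENT.match(line)
        if match:
            code, stamp = match.groups()
            times[code] = datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")
    return times


times = event_times(log_text)
wait_seconds = (times["001"] - times["000"]).total_seconds()
run_seconds = (times["005"] - times["001"]).total_seconds()
print(f"waited {wait_seconds:.0f} s, ran {run_seconds:.0f} s")
```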


Mathematica

Mathematica implements numeric and symbolic mathematics. This section illustrates how to submit a small Mathematica job to BoilerGrid. This Mathematica example finds the three roots of a third-degree polynomial.

Prepare a Mathematica input file with an appropriate filename, here named myjob.in:

(* FILENAME:  myjob.in *)

(* Find three roots. *)
p=x^3+3*x^2+3*x+1
Solve[p==0]
Quit

Prepare a shell script with an appropriate filename, here named myjob.sh, to run the non-graphical version of Mathematica:

#!/bin/sh
# FILENAME:  myjob.sh

module load mathematica

# For additional information about your job, uncomment the following commands:
# hostname
# module list
# which math

math

Change the permissions of the shell script to allow execution by the owner (you):

$ chmod u+x myjob.sh

Prepare a job submission file with an appropriate filename, here named myjob.sub. This example specifies the shell script myjob.sh as the executable, transfers the shell script and the user's current environment variables to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):

# FILENAME:  myjob.sub

universe     = VANILLA

# Copy my shell environment variables to the compute node.
# The compute node requires this to find the module command.
getenv = TRUE

# Transfer the "executable" script myjob.sh to the compute node.
transfer_executable = TRUE
executable          = myjob.sh

should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
input  = myjob.in
output = myjob.out
error  = myjob.err
log    = myjob.log

queue

Submit the job:

$ condor_submit myjob.sub

View job status:

$ condor_q myusername

If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics.

View results in the file for all standard output, here named myjob.out:

Mathematica 5.2 for Linux x86 (64 bit)
Copyright 1988-2005 Wolfram Research, Inc.
 -- Motif graphics initialized --

In[1]:=
In[2]:=
                     2    3
Out[2]= 1 + 3 x + 3 x  + x

In[3]:=
Out[3]= {{x -> -1}, {x -> -1}, {x -> -1}}

In[4]:=
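Mathematica lists the root -1 three times because the polynomial equals (x + 1)^3, so -1 is a root of multiplicity three. As a cross-check, the Python sketch below counts how many successive derivatives of the polynomial vanish at -1. The function names are hypothetical, and the coefficient list is ordered from the highest power down.

```python
def evaluate(coeffs, x):
    """Evaluate a polynomial by Horner's rule (highest power first)."""
    value = 0
    for coefficient in coeffs:
        value = value * x + coefficient
    return value


def derivative(coeffs):
    """Return the coefficients of the derivative polynomial."""
    degree = len(coeffs) - 1
    return [c * (degree - i) for i, c in enumerate(coeffs[:-1])]


def multiplicity(coeffs, x):
    """Count how many times x is a root of the polynomial."""
    count = 0
    while coeffs and evaluate(coeffs, x) == 0:
        count += 1
        coeffs = derivative(coeffs)
    return count


p = [1, 3, 3, 1]  # x^3 + 3*x^2 + 3*x + 1
print(multiplicity(p, -1))
```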

Any output written to standard error will appear in myjob.err.

You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:

000 (342433.000.000) 12/16 14:12:29 Job submitted from host: <128.211.157.86:41603>
...
001 (342433.000.000) 12/16 14:31:33 Job executing on host: <128.211.157.10:60202>
...
005 (342433.000.000) 12/16 14:31:39 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job


MATLAB (Interpreting an M-file)

MATLAB (an acronym for MATrix LABoratory) is a computing environment and a fourth-generation programming language supporting algorithm development, data analysis and visualization, and numeric and symbolic computation. The MATLAB interpreter is the part of MATLAB which reads M-files and MEX-files and executes MATLAB statements.

This section illustrates how to submit a small MATLAB job to BoilerGrid. This MATLAB example computes the inverse of a matrix. This example, when executed, uses the MATLAB interpreter, so it requires and checks out a MATLAB license.

Prepare a MATLAB M-file with an appropriate filename, here named myjob.m:

% FILENAME:  myjob.m

% Display name of compute node which ran this job.
[c name] = system('hostname');
fprintf('\n\nhostname:%s\n', name)

% Invert matrix A.
A = [1 2 3; 4 5 6; 7 8 0]
inv(A)

quit

Use the ClassAd attribute "HAS_MATLAB" to discover how many compute nodes advertise MATLAB:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (HAS_MATLAB==True)'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX 25458 19339    2101      4018       0          0        0
               Total 25458 19339    2101      4018       0          0        0

Prepare a job submission file with an appropriate filename, here named myjob.sub. This example uses the ClassAd attribute HAS_MATLAB to locate a compute node, the ClassAd attribute MATLAB_EXE for the path to the MATLAB interpreter on the compute node, and Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):

# FILENAME:  myjob.sub

universe     = VANILLA
requirements = (HAS_MATLAB==TRUE)

# Run the executable already installed on the compute node.
transfer_executable = FALSE
executable          = /$$(MATLAB_EXE)

# arguments = -nodisplay -nosplash -nojvm
# -nodisplay: turn off graphics
# -nosplash:  start MATLAB without the splash screen
# -nojvm:     start MATLAB without the Java virtual machine
arguments = $$(MATLAB_ARGS)

should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
input  = myjob.m
output = myjob.out
error  = myjob.err
log    = myjob.log

queue

Submit the job:

$ condor_submit myjob.sub

View job status:

$ condor_q myusername

If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics.

View results in the file for all standard output, here named myjob.out:

                            < M A T L A B (R) >
                  Copyright 1984-2010 The MathWorks, Inc.
                Version 7.10.0.499 (R2010a) 64-bit (glnxa64)
                              February 5, 2010

    ----------------------------------------------------------
        Your MATLAB license will expire in 48 days.
        Please contact your system administrator or
        The MathWorks to renew this license.
    ----------------------------------------------------------

  To get started, type one of these: helpwin, helpdesk, or demo.
  For product information, visit www.mathworks.com.

>> >> >> >> >>

hostname:hansen-b066.rcac.purdue.edu

>> >> >>
A =

     1     2     3
     4     5     6
     7     8     0

>>
ans =

   -1.7778    0.8889   -0.1111
    1.5556   -0.7778    0.2222
   -0.1111    0.2222   -0.1111

>>
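The four-decimal values printed by MATLAB are rounded. To see the exact entries, the Python sketch below inverts the same matrix with rational arithmetic, using cofactors and the determinant. It is written for 3-by-3 integer matrices only, and the function name inverse_3x3 is hypothetical.

```python
from fractions import Fraction


def inverse_3x3(m):
    """Invert a 3-by-3 integer matrix exactly by the cofactor method."""
    def minor(i, j):
        # Determinant of the 2-by-2 matrix left after dropping row i, column j.
        rows = [row for k, row in enumerate(m) if k != i]
        (a, b), (c, d) = [
            [value for l, value in enumerate(row) if l != j] for row in rows
        ]
        return a * d - b * c

    cofactors = [
        [(-1) ** (i + j) * minor(i, j) for j in range(3)] for i in range(3)
    ]
    determinant = sum(m[0][j] * cofactors[0][j] for j in range(3))
    # The inverse is the transposed cofactor matrix divided by the determinant.
    return [
        [Fraction(cofactors[j][i], determinant) for j in range(3)]
        for i in range(3)
    ]


A = [[1, 2, 3], [4, 5, 6], [7, 8, 0]]
A_inv = inverse_3x3(A)
for row in A_inv:
    print([f"{float(value):.4f}" for value in row])
```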

Any output written to standard error will appear in myjob.err.

You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:

000 (746973.000.000) 12/15 15:01:11 Job submitted from host: <128.211.157.86:49400>
...
001 (746973.000.000) 12/15 15:01:17 Job executing on host: <128.211.157.10:53612?PrivNet=condor.ccb.purdue.edu>
...
005 (746973.000.000) 12/15 15:02:09 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:02, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:02, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job


MATLAB Compiler (Compiling an M-file)

The MATLAB Compiler translates an M-file into an executable file. A compiled version of an M-file can substantially improve performance of MATLAB code, especially for statements like for and while.

This section illustrates how to submit a small, compiled MATLAB job to BoilerGrid. This MATLAB example computes the inverse of a matrix. This example, when executed, does not use the MATLAB interpreter, so it neither requires nor checks out a MATLAB license.

Prepare a MATLAB M-file with an appropriate filename, here named myjob.m:

% FILENAME:  myjob.m

% Display name of compute node which ran this job.
[c name] = system('hostname');
fprintf('\n\nhostname:%s\n', name)

% Invert matrix A.
A = [1 2 3; 4 5 6; 7 8 0]
inv(A)

quit

To access the MATLAB Compiler mcc, load a MATLAB module. The MATLAB Compiler depends on shared libraries from GCC Version 4.2.3. This version is not available on most of BoilerGrid, but GCC Version 4.2.4 is compatible:

$ module load matlab
$ module load gcc/4.2.4

To compile MATLAB source code into a stand-alone executable, use the macro option -m:

$ mcc -m -v -R -nojvm myjob.m

A few new files appear after the compilation:

mccExcludedFiles.log
myjob
myjob.prj
myjob_main.c
myjob_mcc_component_data.c
readme.txt
run_myjob.sh

The name of the stand-alone executable file is myjob. The name of the shell script to run this executable file is run_myjob.sh.

Prepare the job submission file with an appropriate filename, here named myjob.sub. This example specifies the MATLAB shell script run_myjob.sh as the executable, transfers the shell script to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):

# FILENAME:  myjob.sub

universe     = VANILLA
requirements = (HAS_MATLAB==TRUE)

# Transfer "executable" shell script run_myjob.sh to the compute node.
transfer_executable = TRUE
executable          = run_myjob.sh

# Pass the MATLAB root directory as an argument to the shell script.
arguments = $$(MATLAB_ROOT)

should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
#
# The file "myjob" is the compiled version of "myjob.m".
input  = myjob
output = myjob.out
error  = myjob.err
log    = myjob.log

queue

Submit the job:

$ condor_submit myjob.sub

View job status:

$ condor_q myusername

If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics.

View results in the file for all standard output, here named myjob.out:

------------------------------------------
Setting up environment variables
---
LD_LIBRARY_PATH is .:/apps/rhel5/MATLAB_R2010a/runtime/glnxa64:/apps/rhel5/MATLAB_R2010a/bin/glnxa64:/apps/rhel5/MATLAB_R2010a/sys/os/glnxa64:/apps/rhel5/MATLAB_R2010a/sys/java/jre/glnxa64/jre/lib/amd64/native_threads:/apps/rhel5/MATLAB_R2010a/sys/java/jre/glnxa64/jre/lib/amd64/server:/apps/rhel5/MATLAB_R2010a/sys/java/jre/glnxa64/jre/lib/amd64/client:/apps/rhel5/MATLAB_R2010a/sys/java/jre/glnxa64/jre/lib/amd64
Warning: No display specified.  You will not be able to display graphics on the screen.

hostname:hansen-b066.rcac.purdue.edu

A =

     1     2     3
     4     5     6
     7     8     0


ans =

   -1.7778    0.8889   -0.1111
    1.5556   -0.7778    0.2222
   -0.1111    0.2222   -0.1111
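The inverse printed above can be cross-checked by hand with the adjugate formula for a 3x3 matrix. The following C function is a minimal sketch of that check, not part of the MATLAB example itself; the name invert3x3 is hypothetical and chosen only for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the guide's example): invert a 3x3
 * matrix with the adjugate formula.  Returns 0 on success, or -1 if the
 * determinant is exactly zero. */
int invert3x3 (double m[3][3], double inv[3][3]) {
    double det;

    assert(m != NULL && inv != NULL);

    /* Determinant by cofactor expansion along the first row. */
    det = m[0][0] * (m[1][1]*m[2][2] - m[1][2]*m[2][1])
        - m[0][1] * (m[1][0]*m[2][2] - m[1][2]*m[2][0])
        + m[0][2] * (m[1][0]*m[2][1] - m[1][1]*m[2][0]);
    if (det == 0.0) {
        return -1;
    }

    /* Each entry is a transposed cofactor divided by the determinant. */
    inv[0][0] = (m[1][1]*m[2][2] - m[1][2]*m[2][1]) / det;
    inv[0][1] = (m[0][2]*m[2][1] - m[0][1]*m[2][2]) / det;
    inv[0][2] = (m[0][1]*m[1][2] - m[0][2]*m[1][1]) / det;
    inv[1][0] = (m[1][2]*m[2][0] - m[1][0]*m[2][2]) / det;
    inv[1][1] = (m[0][0]*m[2][2] - m[0][2]*m[2][0]) / det;
    inv[1][2] = (m[0][2]*m[1][0] - m[0][0]*m[1][2]) / det;
    inv[2][0] = (m[1][0]*m[2][1] - m[1][1]*m[2][0]) / det;
    inv[2][1] = (m[0][1]*m[2][0] - m[0][0]*m[2][1]) / det;
    inv[2][2] = (m[0][0]*m[1][1] - m[0][1]*m[1][0]) / det;
    return 0;
}
```

For A = [1 2 3; 4 5 6; 7 8 0] the determinant is 27, so the first entry of the inverse is -48/27, which rounds to the -1.7778 shown in myjob.out.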

Any output written to standard error will appear in myjob.err.

You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:

000 (746999.000.000) 01/03 11:49:13 Job submitted from host: <128.211.157.86:49400>
...
001 (746999.000.000) 01/03 13:14:00 Job executing on host: <128.211.157.10:33910?PrivNet=condor.ccb.purdue.edu>
...
006 (746999.000.000) 01/03 13:14:09 Image size of job updated: 81956
...
005 (746999.000.000) 01/03 13:14:49 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:01, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:01, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
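If you post-process many log files, it may help to extract the job's return value automatically. The following C function is a minimal sketch which assumes that the termination line has exactly the format shown in the sample log above; the name parseReturnValue is hypothetical, and other Condor versions may format this line differently:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: if "line" is a "Normal termination" line in the
 * format of the sample log above, store the return value in *value and
 * return 1.  Otherwise return 0 and leave *value unchanged. */
int parseReturnValue (const char *line, int *value) {
    assert(line != NULL && value != NULL);

    /* The leading blank in the format skips the indentation of the line. */
    return sscanf(line, " (1) Normal termination (return value %d", value) == 1;
}
```

A caller would read myjob.log line by line, pass each line to this function, and treat any nonzero return value of the job as a failure.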

For more information about the MATLAB Compiler:

MATLAB Executable (MEX-file)

MEX stands for "MATLAB Executable". A MEX-file offers a way for MATLAB code to call functions written in C, C++, or Fortran as though these external functions were built-in MATLAB functions. You may wish to use a MEX-file if you would like to call an existing C, C++, or Fortran function directly from MATLAB rather than reimplementing that code as a MATLAB function. Also, by implementing performance-critical routines in C, C++, or Fortran rather than MATLAB, you may be able to substantially improve performance over MATLAB source code, especially for statements like for and while.

This section illustrates how to submit a small MATLAB job with a MEX-file to BoilerGrid. This MATLAB example calls a C function which adds two matrices. This example, when executed, uses the MATLAB interpreter, so it requires and checks out a MATLAB license.

Prepare a complicated and time-consuming computation in the form of a C, C++, or Fortran function. In this example, the computation is a C function which adds two matrices:

/* Computational Routine */
void matrixSum (double *a, double *b, double *c, int n) {
    int i;

    /* Matrix (component-wise) addition. */
    for (i = 0; i<n; i++) {
        c[i] = a[i] + b[i];
    }
}
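Before adding the MATLAB interface, you may wish to test the computational routine in plain C, where errors are easier to isolate. The following is a minimal sketch of such a test. The routine matrixSum is the one given above; the helper allEqual is hypothetical and exists only for this check:

```c
#include <assert.h>
#include <stddef.h>

/* Computational routine, as given above. */
void matrixSum (double *a, double *b, double *c, int n) {
    int i;

    /* Matrix (component-wise) addition. */
    for (i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Hypothetical test helper: returns 1 if all n elements of v are equal
 * to "expected", otherwise returns 0. */
int allEqual (const double *v, int n, double expected) {
    int i;

    assert(v != NULL);
    for (i = 0; i < n; i++) {
        if (v[i] != expected) {
            return 0;
        }
    }
    return 1;
}
```

To use it, fill two six-element arrays with the values 1 and 2 (the 2x3 matrices used in myjob.m later in this section), call matrixSum with n set to 6, and confirm that allEqual reports 3 for every element of the result.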

Combine the computational routine with a MEX-file, which contains the necessary external function interface of MATLAB. In the computational routine, change int to mwSize. The name of the file is matrixSum.c:

/***********************************************************
 * FILENAME:  matrixSum.c
 *
 * Adds two MxN arrays (inMatrix).
 * Outputs one MxN array (outMatrix).
 *
 * The calling syntax is:
 *
 *      outMatrix = matrixSum (inMatrix, inMatrix)
 *
 * This is a MEX-file for MATLAB.
 *
 **********************************************************/

#include "mex.h"

/* Computational Routine */
void matrixSum (double *a, double *b, double *c, mwSize n) {
    mwSize i;

    /* Component-wise addition. */
    for (i = 0; i<n; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Gateway Function */
void mexFunction (int nlhs, mxArray *plhs[],
                  int nrhs, const mxArray *prhs[]) {
    double *inMatrix_a;               /* mxn input matrix  */
    double *inMatrix_b;               /* mxn input matrix  */
    mwSize nrows_a,ncols_a;           /* size of matrix a  */
    mwSize nrows_b,ncols_b;           /* size of matrix b  */
    double *outMatrix_c;              /* mxn output matrix */

    /* Check for proper number of arguments */
    if(nrhs!=2) {
        mexErrMsgIdAndTxt("MyToolbox:matrixSum:nrhs","Two inputs required.");
    }
    if(nlhs!=1) {
        mexErrMsgIdAndTxt("MyToolbox:matrixSum:nlhs","One output required.");
    }

    /* Get dimensions of the first input matrix */
    nrows_a = mxGetM(prhs[0]);
    ncols_a = mxGetN(prhs[0]);
    /* Get dimensions of the second input matrix */
    nrows_b = mxGetM(prhs[1]);
    ncols_b = mxGetN(prhs[1]);

    /* Check for equal number of rows. */
    if(nrows_a != nrows_b) {
        mexErrMsgIdAndTxt("MyToolbox:matrixSum:notEqual","Unequal number of rows.");
    }
    /* Check for equal number of columns. */
    if(ncols_a != ncols_b) {
        mexErrMsgIdAndTxt("MyToolbox:matrixSum:notEqual","Unequal number of columns.");
    }

    /* Make a pointer to the real data in the first input matrix  */
    inMatrix_a = mxGetPr(prhs[0]);
    /* Make a pointer to the real data in the second input matrix  */
    inMatrix_b = mxGetPr(prhs[1]);

    /* Make the output matrix */
    plhs[0] = mxCreateDoubleMatrix(nrows_a,ncols_a,mxREAL);

    /* Make a pointer to the real data in the output matrix */
    outMatrix_c = mxGetPr(plhs[0]);

    /* Call the computational routine */
    matrixSum(inMatrix_a,inMatrix_b,outMatrix_c,nrows_a*ncols_a);
}
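The gateway function can pass nrows_a*ncols_a as a single element count because MATLAB stores a matrix as one contiguous array in column-major order, and mxGetPr returns a pointer to the first element of that array. The following C sketch illustrates the indexing; the helper names colMajorIndex and elementAt are hypothetical and not part of the MEX API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: offset of element (row, col), both zero-based,
 * in a column-major array which has nrows rows. */
size_t colMajorIndex (size_t row, size_t col, size_t nrows) {
    assert(row < nrows);
    return row + col * nrows;
}

/* Hypothetical helper: read element (row, col) from column-major data. */
double elementAt (const double *data, size_t nrows, size_t row, size_t col) {
    return data[colMajorIndex(row, col, nrows)];
}
```

For example, the 2x3 matrix [1 2 3; 4 5 6] is laid out in memory as 1, 4, 2, 5, 3, 6, so the element in the second row and third column sits at offset 5. Because component-wise addition treats every element alike, matrixSum never needs to know the row and column structure.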

To access the MATLAB utility mex, load a MATLAB module. The MATLAB Compiler, mcc, depends on shared libraries from GCC Version 4.2.3. This version is not available on most of BoilerGrid, but GCC Version 4.2.4 is compatible:

$ module load matlab
$ module load gcc/4.2.4

To compile matrixSum.c into a MEX-file:

$ mex matrixSum.c

The name of the MATLAB-callable MEX-file is matrixSum.mexa64.

Prepare a MATLAB M-file with an appropriate filename, here named myjob.m:

% FILENAME:  myjob.m

% Call the separately compiled and dynamically linked MEX-file.
A = [1,1,1;1,1,1]
B = [2,2,2;2,2,2]
C = matrixSum(A,B)

quit

Prepare a job submission file with an appropriate filename, here named myjob.sub. This example uses the ClassAd attribute HAS_MATLAB to locate a compute node, the ClassAd attribute MATLAB_EXE for the path to the MATLAB interpreter on the compute node, and Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):

# FILENAME:  myjob.sub

universe     = VANILLA
requirements = (HAS_MATLAB==TRUE)

# Run the executable already installed on the compute node.
transfer_executable = FALSE
executable          = /$$(MATLAB_EXE)

# -nodesktop: run MATLAB in text mode
# -nodisplay: turn off graphics
# -nosplash:  start MATLAB without the splash screen
# -nojvm:     start MATLAB without the Java virtual machine
# arguments = -nodesktop -nodisplay -nosplash -nojvm
arguments = -nodesktop $$(MATLAB_ARGS)

should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
input  = myjob.m
output = myjob.out
error  = myjob.err
log    = myjob.log
#
# Transfer MEX-file matrixSum.mexa64 to the compute node.
transfer_input_files = matrixSum.mexa64

queue

Submit the job:

$ condor_submit myjob.sub

View job status:

$ condor_q myusername

If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics.

View results in the file for all standard output, here named myjob.out:


                            < M A T L A B (R) >
                  Copyright 1984-2010 The MathWorks, Inc.
                Version 7.10.0.499 (R2010a) 64-bit (glnxa64)
                              February 5, 2010


  To get started, type one of these: helpwin, helpdesk, or demo.
  For product information, visit www.mathworks.com.

>>
A =

     1     1     1
     1     1     1


B =

     2     2     2
     2     2     2


C =

     3     3     3
     3     3     3

Any output written to standard error will appear in myjob.err.

You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:

000 (342465.000.000) 01/11 09:30:17 Job submitted from host: <128.211.157.86:60004>
...
001 (342465.000.000) 01/11 09:30:21 Job executing on host: <128.211.157.10:36464?PrivNet=condor.ccb.purdue.edu>
...
005 (342465.000.000) 01/11 09:31:11 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:01, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:01, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job

For more information about the MATLAB MEX-file:

MATLAB Standalone Program

A stand-alone MATLAB program is a C, C++, or Fortran program which calls user-written M-files and the same libraries which MATLAB uses. A stand-alone program has access to MATLAB objects, such as the array and matrix classes, as well as all the MATLAB algorithms. If you would like to implement performance-critical routines in C, C++, or Fortran and still call select MATLAB functions, a stand-alone MATLAB program may be a good option. This offers the possibility for substantially improved performance over MATLAB source code, especially for statements like for and while, while still allowing use of specialized MATLAB functions where useful.

This section illustrates how to submit a small stand-alone MATLAB program to BoilerGrid. This C example calls a compiled MATLAB script which ranks magic squares and another compiled MATLAB script which displays the ranks. This example, when executed, does not use the MATLAB interpreter, so it neither requires nor checks out a MATLAB license.

Prepare a MATLAB function which returns a vector of the ranks of the magic squares from 1 to n. Use an appropriate filename, here named mrank.m:

% FILENAME:  mrank.m

function r = mrank(n)
r = zeros(n,1);
for k = 1:n
    r(k) = rank(magic(k));
end

Prepare a second MATLAB function which displays a vector, the return value of function mrank. Use an appropriate filename, here named printmatrix.m:

% FILENAME:  printmatrix.m

function printmatrix(A)
    disp(A)
end

Prepare a C source file with a main function and the necessary external function interface and give it an appropriate filename, here named myprogram.c. In a C program, you must use "mangled" MATLAB function names when invoking MATLAB functions. The C program invokes the MATLAB function mrank using the name mlfMrank and the MATLAB function printmatrix using the name mlfPrintmatrix. All MATLAB function names must be modified in this manner when called from outside MATLAB:

/* FILENAME:  myprogram.c */

#include <stdio.h>
#include <math.h>
#include "Pkg.h"

int main (const int argc, char ** argv) {

    mxArray *N;   /* matrix containing n                  */
    mxArray *R;   /* result matrix                        */
    int n=12;     /* number of magic squares to rank      */

    printf("Enter myprogram.c\n");

    PkgInitialize();     /* call Pkg initialization */

    /* Create a 1-by-1 matrix containing n. */
    N = mxCreateDoubleMatrix(1, 1, mxREAL);
    *mxGetPr(N) = n;

    /* Call mlfMrank, the compiled version of mrank.m. */
    mlfMrank(1,&R,N);

    /* Print the results. */
    mlfPrintmatrix(R);

    /* Free the matrices allocated during this computation. */
    mxDestroyArray(N);
    mxDestroyArray(R);

    PkgTerminate();     /* call Pkg termination */

    printf("Exit myprogram.c\n");
    return 0;
}
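The naming pattern used in myprogram.c (mrank becomes mlfMrank, printmatrix becomes mlfPrintmatrix) can be sketched as a small C function. This is only an illustration inferred from those two names: the helper mangleName is hypothetical, and the MATLAB Compiler may treat other function names differently, so check the generated header file Pkg.h for the actual names:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: build the name used to call a compiled MATLAB
 * function from C, following the pattern mrank -> mlfMrank.
 * Returns 0 on success, or -1 if "name" is empty or "out" is too small. */
int mangleName (const char *name, char *out, size_t outSize) {
    size_t len;

    assert(name != NULL && out != NULL);

    len = strlen(name);
    /* Space needed: "mlf" (3) + name (len) + terminating null (1). */
    if (len == 0 || len + 4 > outSize) {
        return -1;
    }

    strcpy(out, "mlf");
    strcat(out, name);
    /* Capitalize the first letter of the original function name. */
    out[3] = (char) toupper((unsigned char) name[0]);
    return 0;
}
```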

To access the MATLAB Compiler mcc, load a MATLAB module. The MATLAB Compiler, mcc, depends on shared libraries from GCC Version 4.2.3. This version is not available on most of BoilerGrid, but GCC Version 4.2.4 is compatible:

$ module load matlab
$ module load gcc/4.2.4

To compile the stand-alone MATLAB program:

$ mcc -W lib:Pkg -T link:exe myprogram.c mrank printmatrix libmmfile.mlib -v

Several new files and one subdirectory appear after the compilation:

Pkg.c
Pkg.ctf
Pkg.exports
Pkg.h
Pkg.prj
Pkg_mcc_component_data.c
Pkg_mcr
mccExcludedFiles.log
myprogram
readme.txt

The name of the compiled, stand-alone MATLAB program is myprogram.

Prepare a shell script which will run the stand-alone MATLAB program. Give it an appropriate filename, here named myjob.sh:

#!/bin/sh
# FILENAME:  myjob.sh

# A stand-alone program does not use the MATLAB interpreter,
# but it does need a shared library which comes from a 
# compatible version of GCC.
module load gcc/4.2.4

# For additional information about your job submission, uncomment the following commands.
# hostname
# module list
# which gcc

myprogram

Change the permissions of the shell script to allow execution by the owner (you):

$ chmod u+x myjob.sh

Prepare a job submission file with an appropriate filename, here named myjob.sub. This example specifies the shell script myjob.sh as the executable, transfers the shell script to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):

# FILENAME:  myjob.sub

universe     = VANILLA

# Transfer the "executable" shell script myjob.sh to the compute node.
transfer_executable = TRUE
executable          = myjob.sh

should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
#
# Transfer the compiled code to the compute node.  The file myprogram
# is the compiled version of myprogram.c, mrank.m, and printmatrix.m.
input  = myprogram
output = myjob.out
error  = myjob.err
log    = myjob.log

queue

Submit the job:

$ condor_submit myjob.sub

View job status:

$ condor_q myusername

If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics.

View results in the file for all standard output, here named myjob.out:

Enter myprogram.c
     1
     2
     3
     3
     5
     5
     7
     3
     9
     7
    11
     3

Exit myprogram.c

View the standard error file, here named myjob.err:

pure virtual method called

You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:

000 (342501.000.000) 02/03 11:35:14 Job submitted from host: <128.211.157.86:60004>
...
001 (342501.000.000) 02/03 11:36:00 Job executing on host: <128.211.157.10:44356?PrivNet=condor.ccb.purdue.edu>
...
005 (342501.000.000) 02/03 11:36:08 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job

For more information about the MATLAB stand-alone programs:

Octave (Interpreting an M-file)

GNU Octave is a high-level, interpreted programming language for numerical computations. The Octave interpreter is the part of Octave which reads M-files, oct-files, and MEX-files and executes Octave statements. Octave is a structured language (similar to C) and mostly compatible with MATLAB. You may use Octave to avoid the need for a MATLAB license, both during development and as a deployed application. By doing so, you may be able to run your application on more systems or more easily distribute it to others.

This section illustrates how to submit a small Octave job to BoilerGrid. This Octave example computes the inverse of a matrix.

Prepare an Octave-compatible M-file with an appropriate filename, here named myjob.m:

% FILENAME:  myjob.m

% Invert matrix A.
A = [1 2 3; 4 5 6; 7 8 0]
inv(A)

quit

Prepare a shell script with an appropriate filename, here named myjob.sh:

#!/bin/sh
# FILENAME:  myjob.sh

module load octave

# For additional information about your job submission, uncomment the following commands.
# hostname
# module list
# which octave

# Use the -q option to suppress startup messages.
# octave -q
octave

Change the permissions of the shell script to allow execution by the owner (you):

$ chmod u+x myjob.sh

Prepare a job submission file with an appropriate filename, here named myjob.sub. This example specifies the shell script myjob.sh as the executable, transfers the shell script and the user's current shell environment to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):

# FILENAME:  myjob.sub

universe     = VANILLA

# Copy my shell environment variables to the compute node.
# The compute node requires this to find the module command.
getenv = TRUE

# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable          = myjob.sh

should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
input  = myjob.m
output = myjob.out
error  = myjob.err
log    = myjob.log

queue

Submit the job:

$ condor_submit myjob.sub

View job status:

$ condor_q myusername

If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics.

View results in the file for all standard output, here named myjob.out:

A =

   1   2   3
   4   5   6
   7   8   0

ans =

  -1.77778   0.88889  -0.11111
   1.55556  -0.77778   0.22222
  -0.11111   0.22222  -0.11111

You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:

000 (746978.000.000) 12/17 12:49:44 Job submitted from host: <128.211.157.86:49400>
...
001 (746978.000.000) 12/17 12:57:58 Job executing on host: <128.211.157.10:54256?PrivNet=condor.ccb.purdue.edu>
...
005 (746978.000.000) 12/17 12:58:12 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...

Any output written to standard error will appear in myjob.err.

For more information about Octave:

Octave Compiler (Compiling an M-file)

Octave does not offer a compiler to translate an M-file into an executable file for additional speed or distribution. You may wish to consider recoding an M-file as either an oct-file or a stand-alone program.

Octave Executable (Oct-file)

An oct-file is an "Octave Executable". It offers a way for Octave code to call functions written in C, C++, or Fortran as though these external functions were built-in Octave functions. You may wish to use an oct-file if you would like to call an existing C, C++, or Fortran function directly from Octave rather than reimplementing that code as an Octave function. Also, by implementing performance-critical routines in C, C++, or Fortran rather than Octave, you may be able to substantially improve performance over Octave source code, especially for statements like for and while.

This section illustrates how to submit a small Octave job with an oct-file to BoilerGrid. This Octave example calls a C function which adds two matrices.

Prepare a complicated and time-consuming computation in the form of a C, C++, or Fortran function. In this example, the computation is a C function which adds two matrices:

/* Computational Routine */
void matrixSum (double *a, double *b, double *c, int n) {
    int i;

    /* Component-wise addition. */
    for (i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }
}

Combine the computational routine with an oct-file, which contains the necessary external function interface of Octave. The name of the file is matrixSum.cc:

/***********************************************************
 * FILENAME:  matrixSum.cc
 *
 * Adds two MxN arrays (inMatrix).
 * Outputs one MxN array (outMatrix).
 *
 * The calling syntax is:
 *
 *      outMatrix = matrixSum (inMatrix, inMatrix)
 *
 * This is an oct-file for Octave.
 *
 **********************************************************/

#include <octave/oct.h>

/* Computational Routine */
void matrixSum (double *a, double *b, double *c, int n) {
    int i;

    /* Component-wise addition. */
    for (i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Gateway Function */
DEFUN_DLD (matrixSum, args, nargout, "matrixSum: A + B") {

    NDArray inMatrix_a;                /* mxn input matrix   */
    NDArray inMatrix_b;                /* mxn input matrix   */
    int nrows_a,ncols_a;               /* size of matrix a   */
    int nrows_b,ncols_b;               /* size of matrix b   */
    NDArray outMatrix_c;               /* mxn output matrix  */

    /* Check for proper number of input arguments */
    if (args.length() != 2) {
       printf("matrixSum:  two inputs required.");
       exit(-1);
    }
    /* Check for proper number of output arguments */
    if (nargout != 1) {
       printf("matrixSum:  one output required.");
       exit(-1);
    }

    /* Check that both input matrices are real matrices. */
    if (!args(0).is_real_matrix()) {
       printf("matrixSum:  expecting LHS (arg 1) to be a real matrix");
       exit(-1);
    }
    if (!args(1).is_real_matrix()) {
       printf("matrixSum:  expecting RHS (arg 2) to be a real matrix");
       exit(-1);
    }

    /* Get dimensions of the first input matrix */
    nrows_a = args(0).rows();
    ncols_a = args(0).columns();
    /* Get dimensions of the second input matrix */
    nrows_b = args(1).rows();
    ncols_b = args(1).columns();

    /* Check for equal number of rows. */
    if(nrows_a != nrows_b) {
       printf("matrixSum:  unequal number of rows.");
       exit(-1);
    }
    /* Check for equal number of columns. */
    if(ncols_a != ncols_b) {
       printf("matrixSum:  unequal number of columns.");
       exit(-1);
    }

    /* Get the first input matrix as an NDArray. */
    inMatrix_a = args(0).array_value();
    /* Get the second input matrix as an NDArray. */
    inMatrix_b = args(1).array_value();

    /* Construct output matrix as a copy of the first input matrix. */
    outMatrix_c = args(0).array_value();

    /* Call the computational routine. */
    double* ptr_a = inMatrix_a.fortran_vec();
    double* ptr_b = inMatrix_b.fortran_vec();
    double* ptr_c = outMatrix_c.fortran_vec(); 
    matrixSum(ptr_a,ptr_b,ptr_c,nrows_a*ncols_a);

    return octave_value(outMatrix_c);
}

To access the Octave utility mkoctfile, load an Octave module. Loading Octave also loads a compatible GCC:

$ module load octave

To compile matrixSum.cc into an oct-file:

$ mkoctfile matrixSum.cc

Two new files appear after the compilation:

matrixSum.o
matrixSum.oct

The name of the Octave-callable oct-file is matrixSum.oct.

Prepare an Octave-compatible M-file with an appropriate filename, here named myjob.m:

% FILENAME:  myjob.m

% Call the separately compiled and dynamically linked oct-file.
A = [1,1,1;1,1,1]
B = [2,2,2;2,2,2]
C = matrixSum(A,B)

quit

Prepare a shell script with an appropriate filename, here named myjob.sh:

#!/bin/sh
# FILENAME:  myjob.sh

module load octave

# For additional information about your job submission,
# uncomment the following commands.
# hostname
# module list
# which octave

# Use the -q option to suppress startup messages.
# octave -q
octave

Change the permissions of the shell script to allow execution by the owner (you):

$ chmod u+x myjob.sh

Prepare a job submission file with an appropriate filename, here named myjob.sub. This example specifies the shell script myjob.sh as the executable, transfers the shell script and the user's current shell environment to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):

# FILENAME:  myjob.sub

universe     = VANILLA

# Copy my shell environment variables to the compute node.
# The compute node requires this to find the module command.
getenv = TRUE

# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable          = myjob.sh

should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
input  = myjob.m
output = myjob.out
error  = myjob.err
log    = myjob.log
#
# Transfer oct-file matrixSum.oct to the compute node.
transfer_input_files = matrixSum.oct

queue

Submit the job:

$ condor_submit myjob.sub

View job status:

$ condor_q myusername

If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics.

View results in the file for all standard output, here named myjob.out:

A =

   1   1   1
   1   1   1

B =

   2   2   2
   2   2   2

C =

   3   3   3
   3   3   3

You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:

000 (747006.000.000) 01/13 08:09:56 Job submitted from host: <128.211.157.86:42696>
...
001 (747006.000.000) 01/13 08:10:22 Job executing on host: <128.211.157.10:35807?PrivNet=condor.ccb.purdue.edu>
...
006 (747006.000.000) 01/13 08:10:31 Image size of job updated: 99404
...
005 (747006.000.000) 01/13 08:10:40 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...

Any output written to standard error will appear in myjob.err.

For more information about Octave oct-files:

Octave Standalone Program

A stand-alone program is a C, C++, or Fortran program which calls user-written oct-files and the same libraries that Octave uses. A stand-alone program has access to Octave objects, such as the array and matrix classes, as well as all the Octave algorithms. If you would like to implement performance-critical routines in C, C++, or Fortran and still call select Octave functions, a stand-alone Octave program may be a good option. This offers the possibility for substantially improved performance over Octave source code, especially for statements like for and while.

This section illustrates how to submit to BoilerGrid a small stand-alone program which calls Octave. This C++ example uses class Matrix and calls an Octave script which prints a message.

Prepare a C++ function file with the necessary external function interface and with an appropriate filename, here named hello.cc:

// FILENAME:  hello.cc

#include <iostream>
#include <octave/oct.h>
#include <octave/octave.h>
#include <octave/parse.h>
#include <octave/toplev.h> /* do_octave_atexit */

int main (const int argc, char ** argv) {

    const char * argvv [] = {"" /* name of program, not relevant */, "--silent"};
    octave_main (2, (char **) argvv, true /* embedded */);

    /* Display the start of this program. */
    std::cout << "hello.cc:   hello, world" << std::endl;

    /* Invoke hello.m */
    const octave_value_list result = feval ("hello");

    /* Define an Octave Matrix. */
    int n = 2;
    Matrix a_matrix = Matrix (1,2);
    a_matrix (0,0) = 888;
    a_matrix (0,1) = 999;
    std::cout << "hello.cc:   " << a_matrix;

    do_octave_atexit ();

}

Prepare an Octave-compatible M-file with an appropriate filename, here named hello.m:

% FILENAME:  hello.m

disp('hello.m :   hello, world')

To access the Octave utility mkoctfile, load an Octave module. Loading Octave also loads a compatible GCC:

$ module load octave

To compile the stand-alone Octave program:

$ mkoctfile --link-stand-alone hello.cc -o hello

Two new files appear after the compilation:

hello
hello.o

The name of the stand-alone program is hello.

Prepare a shell script with an appropriate filename, here named myjob.sh:

#!/bin/sh
# FILENAME:  myjob.sh

# A stand-alone program does not use the Octave interpreter,
# but it does need a shared library which comes from GCC.
module load gcc

# For additional information about your job submission,
# uncomment the following commands.
# hostname
# module list
# which gcc

hello

Change the permissions of the shell script to allow execution by the owner (you):

$ chmod u+x myjob.sh

Prepare the job submission file with an appropriate filename, here named myjob.sub. This example specifies a shell script myjob.sh as the executable, transfers the shell script and the user's current shell environment to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):

# FILENAME:  myjob.sub

universe     = VANILLA

# Copy my shell environment variables to the compute node.
# The compute node requires this to find the module command.
getenv = TRUE

# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable          = myjob.sh

should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
#
# Transfer the compiled code to the compute node.  The file hello
# is the compiled version of hello.cc.
input  = hello
output = myjob.out
error  = myjob.err
log    = myjob.log

queue

Submit the job:

$ condor_submit myjob.sub

View job status:

$ condor_q myusername

If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics.

View results in the file for all standard output, here named myjob.out:

hello.cc:   hello, world
hello.m :   hello, world
hello.cc:    888 999

You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:

000 (747012.000.000) 01/14 08:19:57 Job submitted from host: <128.211.157.86:42696>
...
001 (747012.000.000) 01/14 08:20:36 Job executing on host: <128.211.157.10:53050?PrivNet=condor.ccb.purdue.edu>
...
005 (747012.000.000) 01/14 08:20:37 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...

Any output written to standard error will appear in myjob.err.
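The event lines in a log like the one above follow a regular pattern: a three-digit event code, the job ID in parentheses, a date and time, and a message. As a rough illustration only (this is not a Condor tool), the following Python 3 sketch pulls the event codes and the return value out of such text. The function name summarize_log and the regular expressions are assumptions made for this example; the sample text is copied from the excerpt above with the usage lines omitted.

```python
import re

# Sample text copied from the myjob.log excerpt above (usage lines omitted).
SAMPLE_LOG = """\
000 (747012.000.000) 01/14 08:19:57 Job submitted from host: <128.211.157.86:42696>
...
001 (747012.000.000) 01/14 08:20:36 Job executing on host: <128.211.157.10:53050?PrivNet=condor.ccb.purdue.edu>
...
005 (747012.000.000) 01/14 08:20:37 Job terminated.
    (1) Normal termination (return value 0)
...
"""

# Assumed patterns, written to fit the excerpt above.
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)
RETURN_RE = re.compile(r"Normal termination \(return value (-?\d+)\)")


def summarize_log(text):
    """Return ([(event code, message), ...], return value or None)."""
    events = []
    return_value = None
    for line in text.splitlines():
        match = EVENT_RE.match(line)
        if match:
            events.append((match.group(1), match.group(6)))
            continue
        match = RETURN_RE.search(line)
        if match:
            return_value = int(match.group(1))
    return events, return_value


if __name__ == "__main__":
    events, return_value = summarize_log(SAMPLE_LOG)
    for code, message in events:
        print(code, message)
    print("return value:", return_value)
```

To try this on a real log, read the contents of myjob.log into a string and pass it to summarize_log. A return value of None would mean that the text contained no "Normal termination" line.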

For more information about the Octave stand-alone program:

Octave (MEX-file)

MEX stands for "MATLAB Executable". A MEX-file offers a way for MATLAB code to call functions written in C, C++, or Fortran as though these external functions were built-in MATLAB functions. You may wish to use a MEX-file if you would like to call an existing C, C++, or Fortran function directly from MATLAB rather than reimplementing that code as a MATLAB function. Also, by implementing performance-critical routines in C, C++, or Fortran rather than MATLAB, you may be able to substantially improve performance over MATLAB source code, especially for loop constructs such as for and while.

Octave includes an interface which can link compiled legacy MEX-files. This interface allows sharing code between Octave and MATLAB users. In Octave, an oct-file will always perform better than a MEX-file, so you should write new code using the oct-file interface, if possible. However, you may test a new MEX-file in Octave and then use it in a MATLAB application.

This section illustrates how to submit a small Octave job with a MEX-file to BoilerGrid. This Octave example calls a C function which adds two matrices.

Prepare a complicated and time-consuming computation in the form of a C, C++, or Fortran function. In this example, the computation is a C function which adds two matrices:

/* Computational Routine */
void matrixSum (double *a, double *b, double *c, int n) {
    int i;

    /* Component-wise addition. */
    for (i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }
}

Combine the computational routine with a gateway function, mexFunction, in a single MEX source file. The gateway function provides the external function interface which Octave requires. In the computational routine, change int to mwSize. The name of the file is matrixSum.c:

/*************************************************************
 * FILENAME:  matrixSum.c
 *
 * Adds two MxN arrays (inMatrix).
 * Outputs one MxN array (outMatrix).
 *
 * The calling syntax is:
 *
 *      outMatrix = matrixSum(inMatrix, inMatrix)
 *
 * This is a MEX-file which Octave will execute.
 *
 **************************************************************/

#include "mex.h"

/* Computational Routine */
void matrixSum (double *a, double *b, double *c, mwSize n) {
    mwSize i;

    /* Component-wise addition. */
    for (i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Gateway Function */
void mexFunction (int nlhs, mxArray *plhs[],
                  int nrhs, const mxArray *prhs[]) {

    double *inMatrix_a;               /* mxn input matrix  */
    double *inMatrix_b;               /* mxn input matrix  */
    mwSize nrows_a,ncols_a;           /* size of matrix a  */
    mwSize nrows_b,ncols_b;           /* size of matrix b  */
    double *outMatrix_c;              /* mxn output matrix */

    /* Check for proper number of arguments */
    if(nrhs!=2) {
        mexErrMsgIdAndTxt("MyToolbox:matrixSum:nrhs","Two inputs required.");
    }
    if(nlhs!=1) {
        mexErrMsgIdAndTxt("MyToolbox:matrixSum:nlhs","One output required.");
    }

    /* Get dimensions of the first input matrix */
    nrows_a = mxGetM(prhs[0]);
    ncols_a = mxGetN(prhs[0]);
    /* Get dimensions of the second input matrix */
    nrows_b = mxGetM(prhs[1]);
    ncols_b = mxGetN(prhs[1]);

    /* Check for equal number of rows. */
    if(nrows_a != nrows_b) {
        mexErrMsgIdAndTxt("MyToolbox:matrixSum:notEqual","Unequal number of rows.");
    }
    /* Check for equal number of columns. */
    if(ncols_a != ncols_b) {
        mexErrMsgIdAndTxt("MyToolbox:matrixSum:notEqual","Unequal number of columns.");
    }

    /* Make a pointer to the real data in the first input matrix  */
    inMatrix_a = mxGetPr(prhs[0]);
    /* Make a pointer to the real data in the second input matrix  */
    inMatrix_b = mxGetPr(prhs[1]);

    /* Make the output matrix */
    plhs[0] = mxCreateDoubleMatrix(nrows_a,ncols_a,mxREAL);

    /* Make a pointer to the real data in the output matrix */
    outMatrix_c = mxGetPr(plhs[0]);

    /* Call the computational routine */
    matrixSum(inMatrix_a,inMatrix_b,outMatrix_c,nrows_a*ncols_a);
}
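The gateway function above passes nrows_a*ncols_a to the computational routine because the data behind each matrix is one contiguous array, stored column by column, so a single loop can add the matrices element by element. The following Python 3 sketch mirrors that idea and the gateway's dimension checks. It is an illustration only, not part of the MEX interface; the helper names to_column_major, from_column_major, and matrix_sum are invented for this example.

```python
def to_column_major(rows):
    """Flatten a list-of-rows matrix column by column."""
    nrows, ncols = len(rows), len(rows[0])
    return [rows[r][c] for c in range(ncols) for r in range(nrows)]


def from_column_major(flat, nrows, ncols):
    """Rebuild a list-of-rows matrix from column-major data."""
    return [[flat[c * nrows + r] for c in range(ncols)] for r in range(nrows)]


def matrix_sum(a, b):
    """Mirror the gateway's checks, then add element by element."""
    if len(a) != len(b):
        raise ValueError("Unequal number of rows.")
    if len(a[0]) != len(b[0]):
        raise ValueError("Unequal number of columns.")
    nrows, ncols = len(a), len(a[0])
    flat_a, flat_b = to_column_major(a), to_column_major(b)
    # Same loop as the C routine: one pass over nrows*ncols elements.
    flat_c = [flat_a[i] + flat_b[i] for i in range(nrows * ncols)]
    return from_column_major(flat_c, nrows, ncols)


if __name__ == "__main__":
    A = [[1, 1, 1], [1, 1, 1]]
    B = [[2, 2, 2], [2, 2, 2]]
    print(matrix_sum(A, B))
```

Because addition is component-wise, the storage order does not change the result here; it matters only that both inputs and the output use the same order.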

To access the Octave utility mkoctfile, load an Octave module. Loading Octave also loads a compatible GCC:

$ module load octave

To compile matrixSum.c into a MEX-file:

$ mkoctfile --mex matrixSum.c

Two new files appear after the compilation:

matrixSum.mex
matrixSum.o

The name of the Octave-callable MEX-file is matrixSum.mex.

Prepare an Octave-compatible M-file with an appropriate filename, here named myjob.m:

% FILENAME:  myjob.m

% Call the separately compiled and dynamically linked MEX-file.
A = [1,1,1;1,1,1]
B = [2,2,2;2,2,2]
C = matrixSum(A,B)

quit

Prepare a shell script with an appropriate filename, here named myjob.sh:

#!/bin/sh
# FILENAME:  myjob.sh

module load octave

# For additional information about your job submission,
# uncomment the following commands.
# hostname
# module list
# which octave

# Use the -q option to suppress startup messages.
# octave -q
octave

Change the permissions of the shell script to allow execution by the owner (you):

$ chmod u+x myjob.sh

Prepare a job submission file with an appropriate filename, here named myjob.sub. This example specifies the shell script myjob.sh as the executable, transfers the shell script and the user's current shell environment to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):

# FILENAME:  myjob.sub

universe     = VANILLA

# Copy my shell environment variables to the compute node.
# The compute node requires this to find the module command.
getenv = TRUE

# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable          = myjob.sh

should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
input  = myjob.m
output = myjob.out
error  = myjob.err
log    = myjob.log
#
# Transfer the MEX-file matrixSum.mex to the compute node.
transfer_input_files = matrixSum.mex

queue
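The submission file above names three files that Condor would need to copy if the compute node cannot see the submitting file system: the executable, the standard input file, and the file listed in transfer_input_files. As a simplified illustration, the following Python 3 sketch reads "name = value" lines and lists those files. It is not how Condor itself parses submission files (the real syntax supports much more than this), and the function names parse_submit and files_to_stage are invented for this example.

```python
# Settings copied from the myjob.sub example above (most comments omitted).
SAMPLE_SUBMIT = """\
universe     = VANILLA
getenv = TRUE
transfer_executable = TRUE
executable          = myjob.sh
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
input  = myjob.m
output = myjob.out
error  = myjob.err
log    = myjob.log
# Transfer the MEX-file matrixSum.mex to the compute node.
transfer_input_files = matrixSum.mex
queue
"""


def parse_submit(text):
    """Collect 'name = value' lines into a dict; skip comments and 'queue'."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip().lower()] = value.strip()
    return settings


def files_to_stage(settings):
    """List the input-side files this simplified model would copy."""
    files = []
    if settings.get("transfer_executable", "true").lower() == "true":
        files.append(settings["executable"])
    if "input" in settings:
        files.append(settings["input"])
    extra = settings.get("transfer_input_files", "")
    files.extend(name.strip() for name in extra.split(",") if name.strip())
    return files


if __name__ == "__main__":
    print(files_to_stage(parse_submit(SAMPLE_SUBMIT)))
```

A check like this can catch a forgotten input file before you submit the job.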

Submit the job:

$ condor_submit myjob.sub

View job status:

$ condor_q myusername

If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics.

View results in the file for all standard output, here named myjob.out:

A =

   1   1   1
   1   1   1

B =

   2   2   2
   2   2   2

C =

   3   3   3
   3   3   3

You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:

000 (342475.000.000) 01/19 10:03:55 Job submitted from host: <128.211.157.86:60004>
...
001 (342475.000.000) 01/19 10:06:01 Job executing on host: <128.211.157.10:33917?PrivNet=condor.ccb.purdue.edu>
...
006 (342475.000.000) 01/19 10:06:10 Image size of job updated: 99616
...
005 (342475.000.000) 01/19 10:06:14 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...

Any output written to standard error will appear in myjob.err.
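The timestamps in the log above also show how long the job waited before it started and how long it ran. The following Python 3 sketch does that arithmetic. It is an illustration only: the function names are invented for this example, and because the log lines carry no year, the year used here is an assumption that matters only for parsing.

```python
from datetime import datetime

# Timestamps copied from the myjob.log excerpt above.
SUBMITTED = "01/19 10:03:55"
EXECUTING = "01/19 10:06:01"
TERMINATED = "01/19 10:06:14"

# The log lines carry no year; 2011 is an assumption made for this sketch.
ASSUMED_YEAR = 2011


def parse_stamp(stamp, year=ASSUMED_YEAR):
    """Convert an 'MM/DD HH:MM:SS' log timestamp into a datetime."""
    return datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")


def seconds_between(start, end):
    """Whole seconds elapsed from one log timestamp to another."""
    return int((parse_stamp(end) - parse_stamp(start)).total_seconds())


if __name__ == "__main__":
    print("waited:", seconds_between(SUBMITTED, EXECUTING), "seconds")
    print("ran:   ", seconds_between(EXECUTING, TERMINATED), "seconds")
```

For this job the wait before execution (126 seconds) was much longer than the run itself (13 seconds), which is common for very small jobs. A job that runs across a change of year would need more careful handling than this sketch provides.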

For more information about the Octave-compatible MEX-file:

Perl

Perl is a high-level, general-purpose, interpreted, dynamic programming language offering powerful text processing features. This section illustrates how to submit a small Perl job to BoilerGrid. This Perl example prints a single line of text.

Prepare a Perl input file with an appropriate filename, here named myjob.in:

# FILENAME:  myjob.in

print "hello, world\n"

The absolute path of Perl is the same on all Linux platforms. This allows using the absolute path to Perl in the job submission file. Also, consider including both 32-bit and 64-bit Linux platforms as candidates to run the job.

To discover the number of Linux platforms which can run Perl:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(OpSys == "LINUX")'

                  Total Owner Claimed Unclaimed Matched Preempting Backfill
         INTEL/LINUX    92    12       4        52       0          0       24
        X86_64/LINUX 28876 19310    3480      6085       1          0        0
               Total 28968 19322    3484      6137       1          0       24
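To put the totals above in perspective, it can help to compute what share of the Linux slots was unclaimed when the command ran. The following Python 3 sketch parses the table text and does that arithmetic. It is an illustration only; the function names parse_totals and unclaimed_fraction are invented for this example, and the sample text is copied from the output above.

```python
# Table text copied from the condor_status output above.
SAMPLE_TOTALS = """\
                  Total Owner Claimed Unclaimed Matched Preempting Backfill
         INTEL/LINUX    92    12       4        52       0          0       24
        X86_64/LINUX 28876 19310    3480      6085       1          0        0
               Total 28968 19322    3484      6137       1          0       24
"""


def parse_totals(text):
    """Return {row name: {column name: count}} for the totals table."""
    lines = [line for line in text.splitlines() if line.strip()]
    columns = lines[0].split()
    table = {}
    for line in lines[1:]:
        fields = line.split()
        table[fields[0]] = dict(zip(columns, map(int, fields[1:])))
    return table


def unclaimed_fraction(table):
    """Share of all slots in the 'Total' row that were unclaimed."""
    total_row = table["Total"]
    return total_row["Unclaimed"] / total_row["Total"]


if __name__ == "__main__":
    table = parse_totals(SAMPLE_TOTALS)
    print(f"unclaimed: {unclaimed_fraction(table):.1%}")
```

In this snapshot roughly one slot in five was unclaimed. The numbers change from moment to moment, so treat any single reading as an estimate.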

Prepare the job submission file with an appropriate filename, here named myjob.sub. This example specifies the absolute path of the executable file, selects candidate compute nodes among the 32-bit and 64-bit Linux platforms, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):

# FILENAME:  myjob.sub

universe     = VANILLA
requirements = ((Arch=="x86_64") || (Arch=="INTEL")) && (OpSys=="LINUX")

# Run the executable already installed on the compute node.
transfer_executable = FALSE
executable          = /usr/bin/perl

# Use the -w option to issue warnings.
arguments = -w
	
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
input  = myjob.in
output = myjob.out
error  = myjob.err
log    = myjob.log

queue
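The requirements line in the submission file above acts as a filter on the attributes which each compute node advertises. The following Python 3 sketch imitates that filter for a few machines. It is an illustration only, not Condor's matchmaking: the machine names and the third machine's operating system value are hypothetical, and the function names are invented for this example. The comparison ignores case, which is how the ClassAd == operator treats strings; this is why "x86_64" in the submission file matches the "X86_64" shown by condor_status.

```python
def str_eq(left, right):
    """Compare two strings ignoring case, like the ClassAd == operator."""
    return left.lower() == right.lower()


def meets_requirements(ad):
    """((Arch=="x86_64") || (Arch=="INTEL")) && (OpSys=="LINUX")"""
    arch_ok = str_eq(ad["Arch"], "x86_64") or str_eq(ad["Arch"], "INTEL")
    return arch_ok and str_eq(ad["OpSys"], "LINUX")


# Hypothetical machine descriptions, invented for this example.
MACHINES = [
    {"Name": "node-a", "Arch": "X86_64", "OpSys": "LINUX"},
    {"Name": "node-b", "Arch": "INTEL", "OpSys": "LINUX"},
    {"Name": "node-c", "Arch": "INTEL", "OpSys": "WINDOWS"},
]


if __name__ == "__main__":
    for machine in MACHINES:
        verdict = "candidate" if meets_requirements(machine) else "excluded"
        print(machine["Name"], verdict)
```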

Submit the job:

$ condor_submit myjob.sub

View job status:

$ condor_q myusername

If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics.

View results in the file for all standard output, here named myjob.out:

hello, world

You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:

000 (746987.000.000) 12/20 13:53:59 Job submitted from host: <128.211.157.86:49400>
...
001 (746987.000.000) 12/20 13:55:52 Job executing on host: <128.211.157.10:50656?PrivNet=condor.ccb.purdue.edu>
...
005 (746987.000.000) 12/20 13:55:52 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...

Any output written to standard error will appear in myjob.err.

For more information about Perl:

Python

Python is a general-purpose, interpreted, dynamic programming language offering powerful text processing features. This section illustrates how to submit a small Python job to BoilerGrid. This Python example prints a single line of text.

Prepare a Python input file with an appropriate filename, here named myjob.in:

#!/usr/bin/python
# FILENAME:  myjob.in

# The parenthesized form works with both Python 2 and Python 3.
print("hello, world")

The absolute path of Python is the same on all Linux platforms. This allows using the absolute path to Python in the job submission file. Also, consider including both 32-bit and 64-bit Linux platforms as candidates to run the job.

To discover the number of Linux platforms which can run Python:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(OpSys == "LINUX")'

                  Total Owner Claimed Unclaimed Matched Preempting Backfill
         INTEL/LINUX    92    12       4        52       0          0       24
        X86_64/LINUX 28876 19310    3480      6085       1          0        0
               Total 28968 19322    3484      6137       1          0       24

Prepare the job submission file with an appropriate filename, here named myjob.sub. This example specifies the absolute path of the executable file, selects candidate compute nodes among the 32-bit and 64-bit Linux platforms, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):

# FILENAME:  myjob.sub

universe     = VANILLA
requirements = ((Arch=="x86_64") || (Arch=="INTEL")) && (OpSys=="LINUX")

# Run the executable already installed on the compute node.
transfer_executable = FALSE
executable          = /usr/bin/python

should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
input  = myjob.in
output = myjob.out
error  = myjob.err
log    = myjob.log

queue

Submit the job:

$ condor_submit myjob.sub

View job status:

$ condor_q myusername

If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics.

View results in the file for all standard output, here named myjob.out:

hello, world

You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:

000 (342437.000.000) 12/21 10:13:04 Job submitted from host: <128.211.157.86:41603>
...
001 (342437.000.000) 12/21 10:14:43 Job executing on host: <128.211.157.10:34840?PrivNet=condor.ccb.purdue.edu>
...
005 (342437.000.000) 12/21 10:14:43 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...

Any output written to standard error will appear in myjob.err.

For more information about Python:

R

R, a GNU project, is a language and environment for statistics and graphics. It is an open source version of the S programming language. This section illustrates how to submit a small R job to BoilerGrid. This R example computes a Pythagorean triple.

Prepare an R input file with an appropriate filename, here named myjob.in:

# FILENAME:  myjob.in

# Compute a Pythagorean triple.
a = 3
b = 4
c = sqrt(a*a + b*b)
c     # display result

Use the ClassAd attribute "HAS_R" to discover how many compute nodes advertise R:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch=="X86_64") && (OpSys=="LINUX") && (HAS_R==True)'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX 25416 16524    1781      7111       0          0        0
               Total 25416 16524    1781      7111       0          0        0

The absolute path of R is not the same on all clusters. The ClassAd attribute R_EXE handles this discrepancy. To list the value of R_EXE advertised by each of these compute nodes and save the list in the file myfile:

$ condor_status -pool boilergrid.rcac.purdue.edu -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (HAS_R==True)' -format "%s\n" R_EXE > myfile

The file contains three distinct values of the ClassAd attribute R_EXE:

/apps/rhel5/R-2.10.0/bin/R
/apps/steele/R-2.9.0/bin/R
/apps/coates/R-2.9.0/bin/R
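The file produced by the condor_status command above holds one line for every matching compute node, so the same path appears many times. The following Python 3 sketch reduces such a list to its distinct values and picks the R version out of each path. It is an illustration only: the function names are invented for this example, and the sample lines (including how often each path repeats) are made up to stand in for the contents of myfile.

```python
import re
from collections import Counter

# Hypothetical stand-in for the contents of myfile; the repeats are invented.
SAMPLE_LINES = [
    "/apps/rhel5/R-2.10.0/bin/R",
    "/apps/steele/R-2.9.0/bin/R",
    "/apps/steele/R-2.9.0/bin/R",
    "/apps/coates/R-2.9.0/bin/R",
    "/apps/rhel5/R-2.10.0/bin/R",
]


def count_values(lines):
    """Count how many times each non-empty line occurs."""
    return Counter(line.strip() for line in lines if line.strip())


def r_versions(paths):
    """Collect the version numbers embedded in paths such as .../R-2.9.0/bin/R."""
    versions = set()
    for path in paths:
        match = re.search(r"/R-(\d+(?:\.\d+)*)/", path)
        if match:
            versions.add(match.group(1))
    return versions


if __name__ == "__main__":
    counts = count_values(SAMPLE_LINES)
    for path, count in sorted(counts.items()):
        print(count, path)
    print("versions:", sorted(r_versions(counts)))
```

To use the real data, open myfile and pass the file object to count_values in place of SAMPLE_LINES.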

The three values include two different versions of R. Over the course of a project, your jobs may therefore run under different versions of R.

The existence of cluster-specific paths suggests using the ClassAd attribute R_EXE rather than an absolute path; however, R also requires a shared library to be loaded. So, this method uses module load in a shell script.

Prepare a shell script with an appropriate filename, here named myjob.sh:

#!/bin/sh
# FILENAME:  myjob.sh

module load R

# For additional information about your job submission, uncomment the following commands. 
# hostname
# module list
# which R

# --vanilla: combine --no-save, --no-restore, --no-site-file,
#            --no-init-file, and --no-environ
# --no-save: do not save datasets at the end of an R session
R --vanilla --no-save

Change the permissions of the shell script to allow execution by the owner (you):

$ chmod u+x myjob.sh

Prepare a job submission file with an appropriate filename, here named myjob.sub. This example uses the ClassAd attribute HAS_R to locate a compute node, specifies the shell script myjob.sh as the executable, transfers the shell script and the user's current shell environment to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):

# FILENAME:  myjob.sub

universe     = VANILLA
requirements = (HAS_R==TRUE)
	                                        
# Copy my shell environment variables to the compute node.
# The compute node requires this to find the module command.
getenv = TRUE

# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable          = myjob.sh
	
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
input  = myjob.in
output = myjob.out
error  = myjob.err
log    = myjob.log

queue

Submit the job:

$ condor_submit myjob.sub

View job status:

$ condor_q myusername

If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics.

View results in the file for all standard output, here named myjob.out:

R version 2.9.0 (2009-04-17)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> # FILENAME:  myjob.in
>
> # Compute a Pythagorean triple.
> a = 3
> b = 4
> c = sqrt(a*a + b*b)
> c     # display result
[1] 5
>

You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:

000 (747004.000.000) 01/06 14:16:52 Job submitted from host: <128.211.157.86:49041>
...
001 (747004.000.000) 01/06 14:18:30 Job executing on host: <128.211.157.10:45461?PrivNet=condor.ccb.purdue.edu>
...
005 (747004.000.000) 01/06 14:18:35 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...

Any output written to standard error will appear in myjob.err.

For more information about R:

SAS

SAS is an integrated system supporting statistical analysis, report generation, business planning, and forecasting. This section illustrates how to submit a small SAS job to BoilerGrid. This SAS example displays a small dataset.

Prepare a SAS input file with an appropriate filename, here named myjob.sas:

* FILENAME:  myjob.sas

/* Display a small dataset. */
TITLE 'Display a Small Dataset';
DATA grades;
INPUT name $ midterm final;
DATALINES;
Anne     61 64
Bob      71 71
Carla    86 80
David    79 77
Edwardo  73 73
Fannie   81 81
;
PROC PRINT data=grades;
RUN;

Prepare a shell script with an appropriate filename, here named myjob.sh:

#!/bin/sh
# FILENAME:  myjob.sh

module load sas

# For additional information about your job submission, uncomment the following commands.
# hostname
# module list
# which sas

# -stdio:   run SAS in batch mode:
#              read SAS input from stdin
#              write SAS output to stdout
#              write SAS log to stderr
# -nonews:  do not display SAS news
sas -stdio -nonews

With the command-line option -stdio, SAS reads its program from standard input, writes procedure output to standard output, and writes the SAS log to standard error. Sending the SAS log to stderr avoids any conflict between the SAS log file and the Condor log file.
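The three standard streams are what connect a program run in this way to the input, output, and error files named in a job submission file. The following Python 3 sketch shows the same wiring on a small scale: it starts a child process with its standard input read from one file and its standard output and standard error written to two others. It is an illustration only. The child program is a stand-in written for this example (it is not SAS), and the function name run_with_redirects is likewise invented.

```python
import os
import subprocess
import sys
import tempfile

# Stand-in child program (not SAS): copies stdin to stdout, notes to stderr.
CHILD = (
    "import sys\n"
    "sys.stdout.write(sys.stdin.read())\n"
    "sys.stderr.write('log: done\\n')\n"
)


def run_with_redirects(workdir, program_text):
    """Run the child with stdin, stdout, and stderr tied to three files."""
    in_path = os.path.join(workdir, "myjob.sas")
    out_path = os.path.join(workdir, "myjob.out")
    err_path = os.path.join(workdir, "myjob.err")
    with open(in_path, "w") as handle:
        handle.write(program_text)
    with open(in_path) as stdin, \
            open(out_path, "w") as stdout, \
            open(err_path, "w") as stderr:
        subprocess.run([sys.executable, "-c", CHILD],
                       stdin=stdin, stdout=stdout, stderr=stderr, check=True)
    with open(out_path) as handle:
        output = handle.read()
    with open(err_path) as handle:
        errors = handle.read()
    return output, errors


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        output, errors = run_with_redirects(workdir, "PROC PRINT;\n")
    print("stdout file:", repr(output))
    print("stderr file:", repr(errors))
```

Keeping the program's results and its log on separate streams is what lets them end up in separate files.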

Change the permissions of the shell script to allow execution by the owner (you):

$ chmod u+x myjob.sh

Prepare a job submission file with an appropriate filename, here named myjob.sub. This example specifies the shell script myjob.sh as the executable, transfers the shell script and the user's current shell environment to the compute node, and uses Condor's file transfer mechanism to copy input and output files only if needed (if the same filesystem is not available on the compute node):

# FILENAME:  myjob.sub

universe     = VANILLA

# Copy my shell environment variables to the compute node.
# The compute node requires this to find the module command.
getenv = TRUE

# SAS needs the environment variable HOME set to your home directory.
environment = HOME=myhomedirectory

# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable          = myjob.sh

should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
input  = myjob.sas
output = myjob.out
error  = myjob.err
log    = myjob.log

queue

Submit the job:

$ condor_submit myjob.sub

View job status:

$ condor_q myusername

If there are any problems with your job, you may need to cancel your job or view the full job requirements. You may also wish to investigate more options on job submission. Please refer to the appropriate sections in this user guide for more details on these topics.

View results in the file for all standard output, here named myjob.out:

                                                           The SAS System                       11:22 Wednesday, January 5, 2011   1

                                                 Obs    name       midterm    final

                                                  1     Anne          61        64
                                                  2     Bob           71        71
                                                  3     Carla         86        80
                                                  4     David         79        77
                                                  5     Edwardo       73        73
                                                  6     Fannie        81        81

You may also view the Condor log file, here named myjob.log. The counts of bytes sent and received are zero because this job ran on a compute node which could directly access the shared file system holding the input file:

000 (747003.000.000) 01/05 11:21:35 Job submitted from host: <128.211.157.86:49041>
...
001 (747003.000.000) 01/05 11:22:02 Job executing on host: <128.211.157.10:46641?PrivNet=condor.ccb.purdue.edu>
...
005 (747003.000.000) 01/05 11:22:04 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...

View the SAS log in the standard error file, here named myjob.err:

1                                                          The SAS System                           11:22 Wednesday, January 5, 2011

NOTE: Copyright (c) 2002-2008 by SAS Institute Inc., Cary, NC, USA.
NOTE: SAS (r) Proprietary Software 9.2 (TS2M0)
      Licensed to PURDUE UNIVERSITY - T&R, Site 70063312.
NOTE: This session is executing on the Linux 2.6.18-194.17.1.el5rcac2 (LINUX) platform.



NOTE: SAS initialization used:
      real time           0.06 seconds
      cpu time            0.02 seconds

1    * FILENAME:  myjob.sas
2
3    /* Display a small dataset. */
4    TITLE 'Display a Small Dataset';
5    DATA grades;
6    INPUT name $ midterm final;
7    DATALINES;
NOTE: The data set WORK.GRADES has 6 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

14   ;
15   PROC PRINT data=grades;
16   RUN;
NOTE: There were 6 observations read from the data set WORK.GRADES.
NOTE: The PROCEDURE PRINT printed page 1.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.68 seconds
      cpu time            0.03 seconds

NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
NOTE: The SAS System used:
      real time           0.77 seconds
      cpu time            0.06 seconds

For more information about SAS:

Flocking to Other Grids

A Condor pool usually contains machines owned by many different people. Even so, collaborating researchers from different organizations often do not consider it feasible to combine all of their computers into a single Condor pool. The solution is to create multiple Condor pools and allow flocking between them. Jobs may then flock (migrate) from one pool to another based on the availability of compute nodes. If your local Condor pool has no machines available to run your job, the job may flock to another pool. You need do nothing special to enable this for your jobs; it happens automatically.

If you would like to learn more about how this works, see the Grid Computing Chapter of the Condor Users' Manual.
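As a mental model of flocking, picture an ordered list of pools with the local pool first: a job runs in the first pool that has a machine available. The following Python 3 sketch is a toy version of that idea only. It does not reflect how Condor actually negotiates between pools, and the pool names, the counts, and the function name choose_pool are all hypothetical.

```python
def choose_pool(pools):
    """Return the first pool name with an unclaimed machine, else None.

    pools is an ordered list of (name, unclaimed machine count) pairs,
    with the local pool listed first.
    """
    for name, unclaimed in pools:
        if unclaimed > 0:
            return name
    return None


# Hypothetical pools and counts, invented for this example.
POOLS = [
    ("local-pool", 0),
    ("partner-pool-1", 0),
    ("partner-pool-2", 37),
]


if __name__ == "__main__":
    print("job would run in:", choose_pool(POOLS))
```

In this toy model a job leaves the local pool only when the local pool has nothing to offer, which matches the description above.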

BoilerGrid Frequently Asked Questions (FAQ)

There are currently no FAQs for BoilerGrid.