BoilerGrid - Getting Started

Conventions Used in this Document

This document follows certain typesetting and naming conventions:

  • Colored, underlined text indicates a link.
  • Colored, bold text highlights something of particular importance.
  • Italicized text notes the first use of a key concept or term.
  • Bold, fixed-width font text indicates a command or command argument that you type verbatim.
  • Examples of commands and output as you would see them on the command line will appear in colored blocks of fixed-width text such as this:
    $ example
    This is an example of commands and output.
    
  • All command line shell prompts appear as a single dollar sign ("$"). Your actual shell prompt may differ.
  • All examples work with bash or ksh shells. Where different, changes needed for tcsh or csh shell users appear in example comments.
  • All names that begin with "my" illustrate examples that you replace with an appropriate name. These include "myusername", "myfilename", "mydirectory", "myjobid", etc.
  • The term "processor core" or "core" throughout this guide refers to the individual CPU cores on a processor chip.

Overview of BoilerGrid

BoilerGrid is a large, high-throughput, distributed computing system operated by ITaP and built on the HTCondor system developed by the HTCondor Project at the University of Wisconsin. BoilerGrid provides a way for you to run programs on large numbers of otherwise idle computers in various locations, including temporarily under-utilized high-performance cluster resources as well as computer lab desktop machines not currently in use. Whenever a local user or a scheduled job needs a machine back, HTCondor stops its job there and moves it to another HTCondor node as soon as possible. Because this model limits the ability to do parallel processing and communication, BoilerGrid is only appropriate for relatively quick serial jobs.

How to Join BoilerGrid

If you have a desktop computer on the Purdue West Lafayette campus, please consider donating your desktop's idle time to BoilerGrid! The process is easy and allows other Purdue researchers to use otherwise wasted cycles when your computer is doing nothing. More information on joining BoilerGrid is available on the Join BoilerGrid page.

Detailed Hardware Specification

BoilerGrid scavenges cycles from nearly all ITaP research systems, including all the ITaP-maintained research clusters and specialized systems. BoilerGrid also uses idle time of machines in student labs on the Purdue West Lafayette campus. Through the larger consortium DiaGrid, BoilerGrid may also send jobs to machines at other institutions, including the University of Wisconsin, the University of Louisville, Indiana University, the University of Notre Dame, Indiana State University, the Purdue Calumet and North Central campuses, and the Indiana University – Purdue University Fort Wayne campus. Whenever the primary scheduling system on any of these machines needs a compute node back or a user sits down and starts to use a desktop computer, HTCondor will stop its job and, if possible, checkpoint its work. HTCondor then immediately tries to restart this job on some other available compute node in BoilerGrid.

A recent snapshot of BoilerGrid found 36,524 total processor cores. Of these, there were 29,111 Linux/x86_64, 98 Linux/Intel (ia32), 385 WinNT51/Intel, and 6,925 WinNT61/Intel. There are also small numbers of Itanium Linux, Solaris, and Intel OSX nodes. Memory on compute nodes ranges from 512 MB to 192 GB, and most processors run at 2 GHz or faster. With a total of over 60 TFLOPS available, BoilerGrid can provide large numbers of cycles in a short amount of time. HTCondor offers high-throughput computing and is excellent for parameter sweeps, Monte Carlo simulations, or nearly any serial application.

Owner                                   Arch/OS                                                  Processor Cores
ITaP - Research Computing               x86_64/Linux                                                      30,717
ITaP - Research Computing               Intel/Linux                                                           29
ITaP - Envision Center                  Intel/Linux                                                           48
ITaP - Teaching & Learning              Intel/WinNTXX                                                     ~9,300
Purdue Calumet                          x86_64/Linux                                                         998
Notre Dame CSE                          Intel/Linux, Intel/OSX, Sun4u/Solaris210, x86_64/Linux             1,213
Purdue Biology, Libraries & some ITaP   Intel/Linux, Intel/WinNT51                                           187

BoilerGrid currently uses HTCondor 7.6.10. You can check on the overall status of BoilerGrid using CondorView.

Accounts on BoilerGrid

Obtaining an Account

All Purdue faculty, staff, and students with the approval of their advisor may request access to BoilerGrid. However, if you have an account on Radon or any of the ITaP Community Clusters (Carter, Hansen, Rossmann, Coates, Steele, and Peregrine 1), then you already have access to BoilerGrid. Refer to the Accounts / Access page for more details on how to request access.

Login / SSH

To submit jobs on BoilerGrid, log in to the submission host condor.rcac.purdue.edu via SSH. This submission host is actually three front-end hosts: condor-fe00, condor-fe01, and condor-fe02. The login process randomly assigns one of these three front-ends to each login to condor.rcac.purdue.edu. While the three front-end hosts are identical, each has its own HTCondor queue. When you submit jobs to the HTCondor queue from the front-end named condor-fe00, you will not see those jobs on the HTCondor queue while logged in to either condor-fe01 or condor-fe02. To ensure that you always see the same HTCondor queue, log in to the same front-end.

Each front-end host has its own /tmp directory. Because each login may land on a different front-end, data placed in /tmp during one session may not be visible in a later session. ITaP advises using scratch storage instead for data shared across multiple sessions.

You may also submit jobs to BoilerGrid from Radon or any of the ITaP Community Clusters (Carter, Hansen, Rossmann, Coates, Steele, and Peregrine 1). These clusters also have multiple front-end hosts.

SSH Client Software

Secure Shell or SSH is a way of establishing a secure (encrypted) connection between two computers. It uses public-key cryptography to authenticate the remote computer and (optionally) to allow the remote computer to authenticate the user. Its usual function involves logging in to a remote machine and executing commands, but it also supports tunneling and forwarding of X11 or arbitrary TCP connections. There are many SSH clients available for all operating systems.

Linux / Solaris / AIX / HP-UX / Unix:

  • The ssh command is pre-installed. Log in using ssh myusername@servername.

Microsoft Windows:

  • PuTTY is an extremely small download of a free, full-featured SSH client.
  • Secure CRT is a commercial SSH client which is freely available to Purdue students, faculty, and staff with a Purdue career account.

Mac OS X:

  • The ssh command is pre-installed. You may start a local terminal window from "Applications->Utilities". Log in using ssh myusername@servername.
  • MacSSH is another free SSH client.

SSH Keys

SSH works with many different means of authentication. One popular authentication method is Public Key Authentication (PKA). PKA is a method of establishing your identity to a remote computer using related sets of encryption data called keys. PKA is a more secure alternative to traditional password-based authentication with which you are probably familiar.

To employ PKA via SSH, you manually generate a keypair (also called SSH keys) in the location from where you wish to initiate a connection to a remote machine. This keypair consists of two text files: private key and public key. You keep the private key file confidential on your local machine or local home directory (hence the name "private" key). You then log in to a remote machine (if possible) and append the corresponding public key text to the end of a specific file, or have a system administrator do so on your behalf. In future login attempts, PKA compares the public and private keys to verify your identity; only then do you have access to the remote machine.
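
As a minimal sketch (assuming the OpenSSH tools ssh-keygen and ssh-copy-id are available on your local machine), generating a keypair and installing the public key on a remote host might look like this:

  (generate a keypair; accept the default file location and enter a passphrase when prompted)
$ ssh-keygen -t rsa

  (append the public key to ~/.ssh/authorized_keys on the remote host)
$ ssh-copy-id myusername@condor.rcac.purdue.edu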

As a user, you can create, maintain, and employ as many keypairs as you wish. If you connect to a computational resource from your work laptop, your work desktop, and your home desktop, you can create and employ keypairs on each. You can also create multiple keypairs on a single local machine to serve different purposes, such as establishing access to different remote machines or establishing different types of access to a single remote machine. In short, PKA via SSH offers a secure but flexible means of identifying yourself to all kinds of computational resources.

Passphrases and SSH Keys

Creating a keypair prompts you to provide a passphrase for the private key. This passphrase differs from a password in a number of ways. First, a passphrase is, as the name implies, a phrase: it can include most types of characters, including spaces, and has no limit on length. Second, the remote machine never receives the passphrase for verification. Its only purpose is to unlock a particular local private key.

Perhaps you are wondering why you would need a private key passphrase at all when using PKA. If the private key remains secure, why the need for a passphrase just to use it? Indeed, if the location of your private keys were always completely secure, a passphrase might not be necessary. In reality, a number of situations could arise in which someone may improperly gain access to your private key files. In these situations, a passphrase offers another level of security for you, the user who created the keypair.

Think of the private key/passphrase combination as being analogous to your ATM card/PIN combination. The ATM card itself is the object that grants access to your important accounts, and as such, should remain secure at all times—just as a private key should. But if you ever lose your wallet or someone steals your ATM card, you are glad that your PIN exists to offer another level of protection. The same is true for a private key passphrase.

When you create a keypair, you should always provide a corresponding private key passphrase. For security purposes, avoid using phrases which automated programs can discover (e.g. phrases that consist solely of words in English-language dictionaries). This passphrase is not recoverable if forgotten, so make note of it. Only a few situations warrant using a non-passphrase-protected private key—conducting automated file backups is one such situation. If you need to use a non-passphrase-protected private key to conduct automated backups to Fortress, see the No-Passphrase SSH Keys section.

Passwords

If you have received a default password as part of the process of obtaining your account, you should change it before you log onto BoilerGrid for the first time. Change your password from the SecurePurdue website. You will have the same password on all ITaP systems such as BoilerGrid, Purdue email, or Blackboard.

Passwords may need to be changed periodically in accordance with Purdue security policies. Passwords must follow the guidelines described on the SecurePurdue webpage, and ITaP also recommends following those guidelines when selecting a strong password.

ITaP staff will NEVER ask for your password, by email or otherwise.

Never share your password with another user or make your password known to anyone else.

File Storage and Transfer for BoilerGrid

Storage Options

File storage options on ITaP research systems include long-term storage (home directories, Fortress) and short-term storage (scratch directories, /tmp directory). Each option has different performance and intended uses, and some options vary from system to system as well. ITaP provides daily snapshots of home directories for a limited time for accidental deletion recovery. ITaP does not back up scratch directories or temporary storage and regularly purges old files from scratch and /tmp directories. More details about each storage option appear below.

Home Directories

ITaP provides home directories for long-term file storage. Each user ID has one home directory. You should use your home directory for storing important program files, scripts, input data sets, critical results, and frequently used files. You should store infrequently used files on Fortress. Your home directory becomes your current working directory, by default, when you log in.

ITaP provides daily snapshots of your home directory for a limited period of time in the event of accidental deletion. For additional security, you should store another copy of your files on more permanent storage, such as the Fortress HPSS Archive.

Your home directory physically resides within the Isilon storage system at Purdue. To find the path to your home directory, first log in then immediately enter the following:

$ pwd
/home/myusername

Or from any subdirectory:

$ echo $HOME
/home/myusername

Your home directory and its contents are available on all ITaP research front-end hosts and compute nodes via the Network File System (NFS).

Your home directory has a quota capping the size and/or number of files you may store within. For more information, refer to the Storage Quotas / Limits Section.

Lost Home Directory File Recovery

Only files which have been snap-shotted overnight are recoverable. If you lose a file the same day you created it, it is NOT recoverable.

To recover files lost from your home directory, use the flost command:

$ flost

Scratch Directories

ITaP provides scratch directories for short-term file storage only. Each file system domain has at least one scratch directory. Each user ID may access one scratch directory in a file system domain. The quota of your scratch directory is several times greater than the quota of your home directory. You should use your scratch directory for storing large temporary input files which your job reads or for writing large temporary output files which you may examine after execution of your job. You should use your home directory and Fortress for longer-term storage or for holding critical results.

Users of all ITaP research clusters have access to a scratch directory.

ITaP does not perform backups for scratch directories. In the event of a disk crash or file purge, files in scratch directories are not recoverable. You should copy any important files to more permanent storage.

ITaP automatically removes (purges) from scratch directories all files stored for more than 90 days. Owners of these files receive a notice one week before removal via email. For more information, please refer to our Scratch File Purging Policy.

To find the path to your scratch directory:

$ findscratch

The response from command findscratch depends on your submission host. You may see one of the following paths:

/scratch/scratch95/m/myusername
/scratch/scratch96/m/myusername
/scratch/lustreA/m/myusername
/scratch/miner/m/myusername

The value of variable $RCAC_SCRATCH is the path of your scratch directory. Use this variable in any scripts. Your actual scratch directory path may change without warning, but this variable will remain current.

$ echo $RCAC_SCRATCH

The response will be one of the previously listed paths.

Your scratch directory may reside in the same location as, and be shared with, some other ITaP research resources, while remaining distinct from and unshared with others. All submission hosts on all computational resources can access the scratch directories of all other computational resources; however, compute nodes can only access the scratch directory allocated to their own computational resource. ITaP may change which computational resources share scratch storage as needs dictate. For more information about which computational resources share scratch volumes, please see the Network Storage Resource Page.

All BoilerGrid jobs submitted from a submission host of an ITaP research resource have their HTCondor file system domain set so that, unless you enable HTCondor's file transfer mechanism (which removes the need for a shared file system), they run only on ITaP compute nodes which can access the scratch directory of that submission host. This ensures that non-file-transfer jobs always run on nodes which can access the scratch directory you had where you submitted the jobs. If you have no need of this scratch directory and want your jobs to run on systems which do not have access to it, you must explicitly set the file system domain of your jobs.
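
As a sketch, one way to make a job independent of the submission host's scratch file system is to enable HTCondor's file transfer mechanism in the job submission file (the file names here are placeholders following this guide's "my" convention):

# FILENAME: myjob.sub (excerpt) - a sketch only
# Turn on HTCondor's file transfer mechanism so the job no longer
# requires the submission host's shared scratch file system.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = mydata.in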

To find the path to someone else's scratch directory:

$ findscratch someusername
/scratch/scratch95/s/someusername

Your scratch directory has a quota capping the size and number of files you may store in it. For more information, refer to the Storage Quotas / Limits Section.

/tmp Directory

ITaP provides /tmp directories for short-term file storage only. Each front-end and compute node has a /tmp directory. Your program may write temporary data to the /tmp directory of the compute node on which it is running. That data is available for as long as your program is active. Once your program terminates, that temporary data is no longer available. When used properly, /tmp may provide faster local storage to an active process than any other storage option. You should use your home directory and Fortress for longer-term storage or for holding critical results.

ITaP does not perform backups for the /tmp directory and removes files from /tmp whenever space is low or whenever the system needs a reboot. In the event of a disk crash or file purge, files in /tmp are not recoverable. You should copy any important files to more permanent storage.

Long-Term Storage

Long-term Storage or Permanent Storage is available to ITaP research users on the High Performance Storage System (HPSS), an archival storage system, commonly referred to as "Fortress". HPSS is a software package that manages a hierarchical storage system. Program files, data files and any other files which are not used often, but which must be saved, can be put in permanent storage. Fortress currently has a 1.2 PB capacity.

Files smaller than 100 MB have their primary copy stored on low-cost disks (disk cache), with a second copy (a backup of the disk cache) on tape or optical disk; this allows rapid restores to the disk cache. However, the large latency involved in accessing a larger file (which usually requires a copy from a tape cartridge) makes Fortress unsuitable for direct use by running processes or jobs, even where such use is possible. The primary and secondary copies of larger files are stored on separate tape cartridges in the Quantum (ADIC, Advanced Digital Information Corporation) tape library.

To ensure optimal performance for all users, and to keep the Fortress system healthy, please remember the following tips:

  • Fortress operates most effectively with large files - 1GB or larger. If your data consists of smaller files, use HTAR to create archives directly in Fortress (see the sketch after this list).
  • When working with files on cluster head nodes, use your home directory or a scratch file system, rather than editing or computing on files directly in Fortress. Copy any data you wish to archive to Fortress after computation is complete.
  • The HPSS software does not handle sparse files (files with empty space) in an optimal manner. Therefore, if you must copy a sparse file into HPSS, use HSI rather than the cp or mv commands.
  • Due to the sparse files issue, the rsync command should not be used to copy data into Fortress through NFS, as this may cause problems with the system.
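
As a hedged sketch of the HTAR and HSI usage mentioned above (the archive, directory, and file names are placeholders following this guide's "my" convention):

  (bundle the many small files in mydirectory into a single archive created directly in Fortress)
$ htar -cvf myarchive.tar mydirectory/

  (copy a single file into Fortress with HSI)
$ hsi put myfilename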

Fortress writes two copies of every file either to two tapes, or to disk and a tape, to protect against medium errors. Unfortunately, Fortress does not automatically switch to the alternate copy when it has trouble accessing the primary. If it seems to be taking an extraordinary amount of time to retrieve a file (hours), please either email rcac-help@purdue.edu or call ITaP Customer Service at 765-49-4400. We can then investigate why it is taking so long. If it is an error on the primary copy, we will instruct Fortress to switch to the alternate copy as the primary and recreate a new alternate copy.

For more information about Fortress, how it works, user guides, and how to obtain an account:

Environment Variables

There are many environment variables related to storage locations and paths. Logging in automatically sets these environment variables. You may change the variables at any time.

Use environment variables instead of actual paths whenever possible to avoid problems if the specific paths to any of these change. Some of the environment variables you should have are:

Name Description
USER your username
HOME path to your home directory
PWD path to your current directory
RCAC_SCRATCH path to scratch filesystem
PATH all directories searched for commands/applications
HOSTNAME name of the machine you are on
SHELL your current shell (bash, tcsh, csh, ksh)
SSH_CLIENT your local client's IP address
TERM type of terminal or terminal emulator being used

By convention, environment variable names are all uppercase. Use them on the command line or in any scripts in place of and in combination with hard-coded values:

$ ls $HOME
...

$ ls $RCAC_SCRATCH/myproject
...

To find the value of any environment variable:

$ echo $RCAC_SCRATCH
/scratch/scratch95/m/myusername

$ echo $SHELL
/bin/tcsh

To list the values of all environment variables:

$ env
USER=myusername
HOME=/home/myusername
RCAC_SCRATCH=/scratch/scratch95/m/myusername
SHELL=/bin/tcsh
...

You may create or overwrite an environment variable. To pass (export) the value of a variable in either bash or ksh:

$ export VARIABLE=value

To assign a value to an environment variable in either tcsh or csh:

% setenv VARIABLE value

Storage Quotas / Limits

ITaP imposes some limits on your disk usage on research systems. ITaP implements a quota on each filesystem. Each filesystem (home directory, scratch directory, etc.) may have a different limit. If you exceed the quota, you will not be able to save new files or new data to the filesystem until you delete or move data to long-term storage.

Checking Quota Usage

To check the current quotas of your home and scratch directories use the myquota command:

$ myquota
Type        Filesystem              Size      Limit   Use     Files     Limit   Use
====================================================================================
home        extensible             5.0GB     10.0GB   50%         -         -     -
scratch     /scratch/scratch95/      8KB    476.8GB    0%         2   100,000    0%

The columns are as follows:

  1. Type: indicates home or scratch directory.
  2. Filesystem: name of storage option.
  3. Size: sum of file sizes in bytes.
  4. Limit: allowed maximum on sum of file sizes in bytes.
  5. Use: percentage of file-size limit currently in use.
  6. Files: number of files and directories (not the size).
  7. Limit: allowed maximum on number of files and directories. It is possible, though unlikely, to reach this limit and not the file-size limit if you create a large number of very small files.
  8. Use: percentage of file-number limit currently in use.

If you find that you reached your quota in either your home directory or your scratch file directory, obtain estimates of your disk usage. Find the top-level directories which have a high disk usage, then study the subdirectories to discover where the heaviest usage lies.

To see in a human-readable format an estimate of the disk usage of your top-level directories in your home directory:

$ du -h --max-depth=1 $HOME
32K     /home/myusername/mysubdirectory_1
529M    /home/myusername/mysubdirectory_2
608K    /home/myusername/mysubdirectory_3

The second directory is the largest of the three, so run du on it next to see which of its subdirectories holds the most data (see the example below).
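
For example, to examine the largest directory from the listing above:

$ du -h --max-depth=1 $HOME/mysubdirectory_2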

To see in a human-readable format an estimate of the disk usage of your top-level directories in your scratch file directory:

$ du -h --max-depth=1 $RCAC_SCRATCH
160K    /scratch/scratch95/m/myusername

This strategy can be very helpful in figuring out the location of your largest usage. Move unneeded files and directories to long-term storage to free space in your home and scratch directories.

Increasing Your Storage Quota

Home Directory

If you find you need additional disk space in your home directory, please first consider archiving and compressing old files and moving them to long-term storage on the Fortress HPSS Archive. If you are unable to do so, you may go to the BoilerBackpack Quota Management site and use the sliders there to increase the amount of space allocated to your research home directory vs. other storage options, up to a maximum of 100GB.

Scratch Directory

If you find you need additional disk space in your scratch directory, please first consider archiving and compressing old files and moving them to long-term storage on the Fortress HPSS Archive. If you are unable to do so, you may ask for a quota increase at rcac-help@purdue.edu. Quota requests up to 2TB and 200,000 files on LustreA or LustreC can be processed quickly.

Archive and Compression

There are several options for archiving and compressing groups of files or directories on ITaP research systems. The most commonly used options are:

  • tar   (more information)
    Saves many files together into a single archive file, and restores individual files from the archive. Includes automatic archive compression/decompression options and special features for incremental and full backups.
    Examples:
      (list contents of archive somefile.tar)
    $ tar tvf somefile.tar
    
      (extract contents of somefile.tar)
    $ tar xvf somefile.tar
    
      (extract contents of gzipped archive somefile.tar.gz)
    $ tar xzvf somefile.tar.gz
    
      (extract contents of bzip2 archive somefile.tar.bz2)
    $ tar xjvf somefile.tar.bz2
    
      (archive all ".c" files in current directory into one archive file)
    $ tar cvf somefile.tar *.c 
    
      (archive and gzip-compress all files in a directory into one archive file)
    $ tar czvf somefile.tar.gz somedirectory/
    
      (archive and bzip2-compress all files in a directory into one archive file)
    $ tar cjvf somefile.tar.bz2 somedirectory/
    
    
    Other arguments for tar can be explored by using the man tar command.
  • gzip   (more information)
    The standard compression system for all GNU software.
    Examples:
      (compress file somefile - also removes uncompressed file)
    $ gzip somefile
    
      (uncompress file somefile.gz - also removes compressed file)
    $ gunzip somefile.gz
    
  • bzip2   (more information)
    Strong, lossless data compressor based on the Burrows-Wheeler transform. Stronger compression than gzip.
    Examples:
      (compress file somefile - also removes uncompressed file)
    $ bzip2 somefile
    
      (uncompress file somefile.bz2 - also removes compressed file)
    $ bunzip2 somefile.bz2
    

There are several other, less commonly used, options available as well:

  • zip
  • 7zip
  • xz

File Transfer

There are a variety of ways to transfer data to and from ITaP research systems. Which you should use depends on several factors, including personal ease of use, connection speed and bandwidth, and the size and number of files to be transferred. For more details on file transfer methods and applications, refer to the BoilerGrid Complete User Guide.
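
For example, from a Linux or Mac OS X command line, a simple sketch using scp (replace the "my" names as usual) might be:

  (copy a local file to your home directory on the submission host)
$ scp myfilename myusername@condor.rcac.purdue.edu:

  (copy a file from the submission host back to the current local directory)
$ scp myusername@condor.rcac.purdue.edu:myfilename .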

Applications on BoilerGrid

Compiling Source Code on BoilerGrid

Provided Compilers

The compilers available on Radon and the Community Clusters (Hansen, Rossmann, Coates, Steele, and Miner) are able to compile code for HTCondor. Compilers are available for Fortran 77, Fortran 90, Fortran 95, C, and C++. The compilers can produce general-purpose and architecture-specific optimizations to improve performance. These include loop-level optimizations, inter-procedural analysis and cache optimizations. While the compilers support automatic and user-directed parallelization of Fortran, C, and C++ applications for multiprocessing execution, BoilerGrid allows only serial jobs.

To see the available compilers, choose one of the following entries:

$ module avail intel
$ module avail gcc
$ module avail pgi 

Statically Linked Libraries

Using statically linked libraries, regardless of the chosen HTCondor universe, is good practice; you cannot rely on which versions of dynamic libraries are available on the machines selected to run your job. With static libraries, HTCondor sends the same libraries to all machines. On the other hand, because the HTCondor flock consists of a mix of machine architectures, there is also the possibility that your job will land on a machine so different from, or so much older than, the machine on which you built your executable that an instruction in the statically linked library fails to execute. In a parameter sweep, this leads to the confusing situation of some runs of the sweep completing successfully while others fail. In that case, consider using the corresponding dynamic library on the selected machine, or use ClassAds to select compute nodes known to run your job successfully or to exclude compute nodes known to fail. So, use static linkage if at all possible. For the Standard Universe, the condor_compile command specifies static linkage as part of its arguments to the linker; condor_compile displays these arguments in its "LINKING FOR" message. For jobs destined for the Vanilla Universe, use your compiler's command-line option for selecting statically linked libraries.
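
As a sketch of the ClassAd approach mentioned above, a job submission file can exclude a compute node known to fail for your executable (the node name below is a placeholder following this guide's "my" convention):

# FILENAME: myjob.sub (excerpt) - a sketch only
# Exclude a compute node known to fail for this statically linked executable.
requirements = (Machine != "mynodename.rcac.purdue.edu")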

Running Jobs on BoilerGrid

You may use HTCondor to submit jobs to BoilerGrid. HTCondor performs job scheduling. Jobs may be serial only. You may use only the batch mode for developing and running your program. BoilerGrid does not offer an interactive mode to run your jobs.

Running Jobs via HTCondor

HTCondor is one of several distributed computing resources ITaP provides. Like other similar resources, HTCondor provides a framework for running programs on otherwise idle computers. While this imposes serious limitations on parallel jobs and codes with large I/O or memory requirements, HTCondor can provide a large quantity of cycles for researchers who need to run hundreds of smaller jobs.

HTCondor is a specialized batch system for managing compute-intensive jobs. HTCondor provides a queuing mechanism, scheduling policy, priority scheme, and resource classifications. Users submit their jobs to HTCondor, which then puts these jobs in a queue, runs them, and reports back with the results.

In some ways, HTCondor differs from other batch systems, which usually operate only on dedicated machines or compute servers. HTCondor can both schedule jobs on dedicated machines and effectively utilize non-dedicated machines to run jobs. It only runs jobs on machines which are currently idle (no keyboard activity, low load average, no active telnet users, etc.). In this way, HTCondor effectively harnesses otherwise idle machines throughout a pool of machines.

Currently, ITaP uses HTCondor to utilize idle cycles on all ITaP research resources, including all Linux cluster nodes as well as some other servers and workstations. While ITaP uses PBS to schedule the resources of the Linux clusters, HTCondor schedules jobs on compute nodes when the nodes are not running PBS jobs. When PBS elects to run a new job on a node which is currently running HTCondor-scheduled jobs, HTCondor preempts all jobs running on that node to make room for the PBS-scheduled job. You may submit HTCondor jobs from any ITaP research system.

For more information:

Tips

  • Do not queue up thousands of jobs at once. Submit fewer jobs at a time, or use DAGMan to divide your jobs into reasonably-sized chunks (fewer than 500 jobs per set); see the sketch after this list.
  • Never run condor_q repeatedly on a heavily used submit node. The condor_schedd is single-threaded and schedules work in the same thread that you are using to list the queue. This actually takes resources away from the scheduler and is counter-productive.
  • Long jobs should run in the Standard Universe, not in the Vanilla Universe, since they will likely never finish in Vanilla.
  • Vanilla Universe can use Intel compilers (may run 30–40% faster). Using Intel compilers under Vanilla may ultimately provide better throughput than checkpointing jobs in the Standard Universe using a different compiler because the speed gained from using the Intel compilers may be greater than the advantage of checkpointing.
  • Prefer statically linked libraries over dynamically linked libraries.
  • Generally, if your jobs run in less than 1/2 hour, they will seldom be evicted. If they take 1/2 hour to 1 hour, there will usually still only be a few evictions.
  • Purdue has both a scavenging/preempting and a scheduling system. Remember that the HTCondor pool is very heterogeneous, both regarding processor versions and OS versions/types (both Linux of different varieties and some Windows).
  • At Purdue, ITaP has disabled all automatic email notification using the notification HTCondor submission command. Setting this in a submission file will have no effect.
  • Why no middleware (like MyCluster at TACC)? Middleware can be easier for the user, since it uses HTCondor (and other schedulers) "behind the scenes". However, middleware systems are themselves schedulers and will not start a job until they can guarantee it will run to completion (no eviction). Because HTCondor jobs are frequently preempted and restarted, obtaining such guarantees adds considerable overhead across many jobs. For a large number of jobs, using HTCondor without any middleware is a better approach.
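
As a minimal sketch of the DAGMan approach mentioned in the first tip above (the file names are placeholders; each submission file would queue fewer than 500 jobs):

# FILENAME: mydag.dag - a sketch only
JOB  set1  myjobset1.sub
JOB  set2  myjobset2.sub
PARENT set1 CHILD set2

Submit the DAG with:

$ condor_submit_dag mydag.dag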

Job Submission File

Example 1

Here is the simplest possible job submission file. It will queue one copy of the program hello for execution by HTCondor. HTCondor will use its default universe and the default platform, which means to run the job on a compute node which has the same architecture and operating system as the submission host.

No input, output, or error commands appear in the job submission file, so stdin, stdout, and stderr all refer to /dev/null (the null device: a special file that discards all data written to it while reporting that the write succeeded, and that provides no data, only EOF, to any process reading from it). The program may still produce output by explicitly opening a file and writing to it. This job writes to a log file, hello.log. The log file records the events of the job's lifetime inside HTCondor, such as any errors, and notes the job's exit conditions when it finishes. HTCondor recommends using a log file so that you know what happened to your jobs.

If your program only returns output to the screen (like the hello.c program below does), then you should include Output = hello.out or something like it somewhere before Queue. Otherwise you will not see the output.

If you do not explicitly choose a universe, HTCondor uses the default universe: Vanilla Universe.

####################
#
# Example 1
# Simple HTCondor job description file
#
####################

executable     = hello
log            = hello.log
queue

Example 2

This example (from the HTCondor Manual) queues two copies of the program Mathematica. The first copy will run in directory run_1, and the second will run in directory run_2. For both queued copies, stdin will be file test.data, stdout will be file loop.out, and stderr will be file loop.error. This job writes two sets of files in separate directories, which is a convenient way to organize data if you have a large group of HTCondor jobs to run. The example submits Mathematica as a Vanilla Universe job, since neither the source nor the object code of Mathematica is available for relinking with the HTCondor libraries.

HTCondor recommends using a single log file.

####################
#
# Example 2
# Demonstrate use of multiple directories for data organization
#
####################

universe   = VANILLA
executable = mathematica
input      = test.data
output     = loop.out
error      = loop.error
log        = loop.log

initialdir = run_1
queue

initialdir = run_2
queue

Example 3

In this example (also from the HTCondor Manual), the job submission file queues 150 runs of program foo which you compiled and linked for Silicon Graphics workstations running IRIX 6.5. This job requires HTCondor to run the program on machines which have greater than 32 megabytes of physical memory, and expresses a preference to run the program on machines with more than 64 megabytes, if such machines are available. It also advises HTCondor that it will use up to 28 megabytes of memory when running. Each of the 150 runs of the program receives its own process number, starting with process number 0. So, files stdin, stdout, and stderr will refer to in.0, out.0, and err.0 for the first run of the program; in.1, out.1, and err.1 for the second run of the program; and so forth. A log file foo.log will contain entries about when and where HTCondor runs, checkpoints, and migrates processes for the 150 queued runs of the program.

####################
#
# Example 3
# Show off some fancy features including use of pre-defined macros and logging
#
####################

executable   = foo
requirements = Memory >= 32 && OpSys == "IRIX65" && Arch == "SGI"
rank         = Memory >= 64
image_size   = 28 Meg

error   = err.$(Process)
input   = in.$(Process)
output  = out.$(Process)
log     = foo.log

queue 150

Job Submission

Once you have a job submission file, you may submit this script to HTCondor using the condor_submit command. As described above, a job submission file contains the commands and keywords which specify the type of compute node on which you wish to run your job. HTCondor will find an available processor core and run your job there, or leave your job in a queue until one becomes available.

You may submit jobs to BoilerGrid from any BoilerGrid submission host, including all ITaP research cluster front-ends.

To submit a job submission file:

$ condor_submit myjobsubmissionfile

For more information about job submission:

Job Status

There are several ways to check on the progress of your jobs. First, view the HTCondor queue on the host from which you submitted the jobs.

You must make certain that you logged in to the same submission host (…-fe00, …-fe01, …-fe02, etc.) from which you submitted your jobs, or you will not see them in the queue.

To view the status of all jobs in the HTCondor queue of your login host:

$ condor_q

To see only your own jobs, specify your own username as an argument:

$ condor_q myusername

-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:42163> : condor.rcac.purdue.edu
 ID         OWNER        SUBMITTED    RUN_TIME   ST PRI SIZE CMD
1100900.0   myusername   2/20 15:13   0+00:00:00 I  0   0.0  Hello

1 jobs; 1 idle, 0 running, 0 held

Secondly, you may check on the status of your jobs through their log files. In your job submission file, you can specify a log command (log = myjob.log) at any point prior to the queue command. The main events during the processing of the job will appear in this log file: submittal, execution commencement, preemption, checkpoint, eviction, and termination.

Thirdly, as soon as your job begins executing, HTCondor will start a condor_shadow process on the submission host. This shadow process is the mechanism by which the remotely executing jobs can access the environment of the submit host, such as input and output files. There is a shadow process started on the submit host for each job. However, the load on the submit host from this is usually not significant. If you notice degraded performance, you can limit the number of jobs that can run simultaneously using the MAX_JOBS_RUNNING configuration parameter. Please contact us for help with this if you notice poor performance.

To list all the compute nodes which are running your jobs:

$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'RemoteUser=="myusername@rcac.purdue.edu"'

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime
ba-005.rcac.p LINUX       INTEL  Claimed    Busy       1.000   502  0+00:24:44
ba-006.rcac.p LINUX       INTEL  Claimed    Busy       0.990   502  0+00:20:22
ba-007.rcac.p LINUX       INTEL  Claimed    Busy       1.000   502  0+00:23:16
ba-008.rcac.p LINUX       INTEL  Claimed    Busy       1.000   502  0+00:30:20
...

For more information about monitoring your job:

Job Cancellation

The command condor_rm removes a job from the queue. If the job has already started running, then HTCondor kills the job and removes its queue entry. Use condor_q to get the ID of the job.

Queue of jobs before removal:

$ condor_q
	
Submitter: condor.rcac.purdue.edu : <128.211.157.86:35407> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
...
260076.7   nice-user       8/18 00:05   0+00:21:47 I  0   29.3 startfah.sh -oneun
260076.9   nice-user       8/18 00:05   0+01:40:44 I  0   136.7 startfah.sh -oneun
260185.0   myusername      8/30 13:01   0+00:00:00 R  0   19.5 hello
...

Remove a job:

$ condor_rm 260185.0
Job 260185.0 marked for removal

Queue of jobs after removal:

$ condor_q
	
Submitter: condor.rcac.purdue.edu : <128.211.157.86:35407> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
...
260076.7   nice-user       8/18 00:05   0+00:21:47 I  0   29.3 startfah.sh -oneun
260076.9   nice-user       8/18 00:05   0+01:40:44 I  0   136.7 startfah.sh -oneun
...

For more information about removing your job:

Workflow Summary

This section offers a quick overview of the steps involved in preparing and submitting a simple HTCondor job.

  1. Prepare the Code

    The "Hello World" program below is a simple program which displays the text "hello, world":
    /* FILENAME: hello.c */
    #include <stdio.h>		
    int main (void) {		
        printf("hello, world\n");		
        return 0;
    }
    
  2. Choose the HTCondor Universe

    The two most commonly used HTCondor Universes are Standard and Vanilla. The "Hello World" program above will run in either universe.

    • Vanilla Universe

      Compile the "Hello World" program normally using any available compiler:
      $ module load intel
      $ icc -static hello.c -o hello
      
      $ module load gcc
      $ gcc -static hello.c -o hello
      	
      $ module load pgi
      $ pgcc -Bstatic hello.c -o hello
      
    • Standard Universe

      Relink the "Hello World" program with the HTCondor library using the condor_compile command and a compatible compiler:
      $ module load gcc	
      $ condor_compile gcc hello.c -o hello
      
  3. Prepare the Job Submission File

    Your job submission file defines how to run the job via HTCondor. It specifies the executable file, the chosen universe, a file containing standard input (not used in this example), files which will receive standard output and standard error, and the HTCondor log file, as well as many other possible parameters. The queue directive specifies how many executions of the job are to occur. Usually this is just once, as here:

    • Vanilla Universe

      # FILENAME: hello.sub
      executable = hello
      universe   = vanilla
      output     = hello.out
      error      = hello.err
      log        = hello.log
      queue
      
    • Standard Universe

      # FILENAME: hello.sub
      executable = hello
      universe   = standard
      output     = hello.out
      error      = hello.err
      log        = hello.log
      queue
      
  4. Submit the Job

    To run the "Hello World" program, use the condor_submit command to submit the job submission file to HTCondor:
    $ condor_submit hello.sub
    Submitting job(s).
    Logging submit event(s).
    1 job(s) submitted to cluster 1100744.
    
  5. Monitor the Job

    Once you submit the job, HTCondor will manage its execution. You can monitor the job's progress with the condor_q command:
    $ condor_q myusername
    
    
    -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:56939> : condor.rcac.purdue.edu
     ID      OWNER              SUBMITTED     RUN_TIME  ST PRI SIZE CMD
    1100744.0  myusername  2/17 15:36  0+00:00:00  I  0   0.0  hello
    	
    1 jobs; 1 idle, 0 running, 0 held
    
  6. Remove the Job

    If you discover an error in your job while waiting for the results, you can remove the job from the queue with the condor_rm command:
    $ condor_rm 1100744
    
  7. View the Results

    When the "Hello World" program completes, its output will appear in the file hello.out. The exit status of your program and various statistics about its performance, including time used and I/O performed, will appear in the log file hello.log. To view the output file:
    $ less hello.out
    hello, world
    

    A log of HTCondor activity during your job's run will appear in the file hello.log. This log may report zero bytes transferred for some Vanilla Universe jobs, as the compute node may have been able to directly access your files through a shared filesystem without needing to transfer them to the compute node. To view the log file:
    $ less hello.log
    000 (1100744.000.000) 02/17 15:36:51 Job submitted from host: <128.211.157.86:56939>
    ...
    001 (1100744.000.000) 02/17 15:41:49 Job executing on host: <128.211.157.10:57321>
    ...
    005 (1100744.000.000) 02/17 15:41:53 Job terminated.
            (1) Normal termination (return value 0)
                    Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                    Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                    Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                    Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
            1018  -  Run Bytes Sent By Job
            5429958  -  Run Bytes Received By Job
            1018  -  Total Bytes Sent By Job
            5429958  -  Total Bytes Received By Job
    ...
    

Compute Nodes and ClassAds

HTCondor attempts to start jobs by matching submitted jobs with available compute nodes on the basis of ClassAds. HTCondor's ClassAds are analogous to the classified advertising section of the newspaper. Both sellers and buyers advertise details about what they have to sell or want to buy. Both buyers and sellers have some requirements which absolutely must be satisfied, such as the right type of item, and some other criteria by which they will prefer certain offers over others, such as a better price. The same is true in HTCondor, but between users submitting jobs and compute nodes advertising available resources. HTCondor uses ClassAds to make the best matches between these two groups.

By default, your HTCondor jobs will seek an available compute node with the same values for the ClassAds Arch and OpSys as the host from which you submitted your job. The submission process assumes that in most cases your jobs will require the same combination of chip architecture and operating system to run as the host from which you submitted it. You can remove or alter this restriction by looking at the examples in the "Requiring Specific Architectures or Operating Systems" section.

Some applications may require even more specific capabilities. Using ClassAds, you may specify a set of requirements so that only a subset of available compute nodes become candidates to run your job. There are many ClassAds available for you to use in your job requirements. You may also use ClassAds to indicate a preference for certain nodes over others (but not as an absolute requirement) by using the rank command. The following examples illustrate how to discover current ClassAds and how to estimate the number of compute nodes which will match job requirements based on ClassAds.

To save a detailed report of all the ClassAds of all processor cores in BoilerGrid in the file myfile:

$ condor_status -pool boilergrid.rcac.purdue.edu -long > myfile

You may use any of the ClassAds which appear in this list to view a subset of BoilerGrid. For example, to save a listing of all user ID domains or all file system domains in the file myfile:

$ condor_status -pool boilergrid.rcac.purdue.edu -format "%s\n" UidDomain > myfile

$ condor_status -pool boilergrid.rcac.purdue.edu -format "%s\n" FileSystemDomain > myfile

To list all platforms (architectures and operating systems) and the number of processor cores of each platform on BoilerGrid:

$ condor_status -pool boilergrid.rcac.purdue.edu -total

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

         INTEL/LINUX    64    13       5        46       0          0        0
           INTEL/OSX     2     0       0         2       0          0        0
       INTEL/WINNT51   345    29       2       314       0          0        0
       INTEL/WINNT61  4683   150      13      4520       0          0        0
    SUN4u/SOLARIS210     3     2       0         1       0          0        0
        X86_64/LINUX 31395 22617    4734      4035       2          2        5

               Total 36492 22811    4754      8918       2          2        5

HTCondor uses the name "INTEL" to indicate x86_32 (32-bit Intel-compatible) architecture.

In this snapshot, BoilerGrid has 36,492 processor cores in total, and the predominant platform is x86_64/Linux with 31,395 processor cores. The values in this table are approximations, since compute nodes regularly go in and out of service (for repairs, for example).

To see how many compute nodes have a given ClassAd value, add the ClassAd value as a constraint.

To see only how many 64-bit, Intel-compatible, Linux (x86_64/Linux) platforms are currently in BoilerGrid:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX")'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX 31395 22740    4688      3957       3          2        5

               Total 31395 22740    4688      3957       3          2        5

To see how many 64-bit, Intel-compatible, Linux (x86_64/Linux) platforms are currently in BoilerGrid and advertise MATLAB as installed:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (HAS_MATLAB == TRUE)'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX 24659 12008    1557     11094       0          0        0
               Total 24659 12008    1557     11094       0          0        0

You may specify numeric constraints with other relational operators. To discover how many compute nodes have at least 16 GB of memory:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint 'TotalMemory >= 16046'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX 26093 18007    3330      4753       3          0        0
               Total 26093 18007    3330      4753       3          0        0

ClassAd string values are case-sensitive. ClassAd attribute names are case-insensitive. The comparison operators (<, >, <=, >=, and ==) compare strings case-insensitively. The special comparison operators =?= and =!= compare strings case-sensitively. ClassAd expressions are similar to C boolean expressions and can be quite elaborate.
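
For illustration, here is a sketch of a more elaborate requirements expression in a job submission file, combining ClassAds shown earlier in this section (the specific values are examples only; adjust them for your job):

# FILENAME: myjob.sub (excerpt) - a sketch only
requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (TotalMemory >= 16046) && (HAS_MATLAB =?= TRUE)
rank         = Memory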

For more information about ClassAds, requirements, and rank:

Shared Scratch File Systems

Increasing the throughput of your jobs may not come from maximizing the number of candidate compute nodes but rather from limiting the candidate compute nodes to the set which can access the shared scratch file system of the front-end. This limitation is useful in the case of a large input data file since it avoids both using HTCondor's file transfer mechanism and running the risk of preemptions preventing job completion.
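
As a sketch, this restriction can also be expressed explicitly in a job submission file (the domain name here is one example taken from the table below; use the domain of your own submission host):

# FILENAME: myjob.sub (excerpt) - a sketch only
requirements = (FileSystemDomain == "bluearc.rcac.purdue.edu")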

The following table shows the current list of scratch directories:

Cluster                                  Scratch Directory                  File System Domain
condor.rcac.purdue.edu, Radon, Steele    /scratch/scratch95/m/myusername    bluearc.rcac.purdue.edu
                                         /scratch/scratch96/m/myusername
Coates, Rossmann                         /scratch/lustreA/m/myusername      lustrea.rcac.purdue.edu
Hansen                                   /scratch/lustreC/m/myusername      lustrec.rcac.purdue.edu
Miner                                    /scratch/miner/m/myusername        miner.rcac.purdue.edu

To discover your scratch file directory, log in to your submission host and enter either of the following commands:

$ findscratch
$ echo $RCAC_SCRATCH

The response will be one of the following paths:

/scratch/scratch95/m/myusername
/scratch/scratch96/m/myusername
/scratch/lustreA/m/myusername
/scratch/lustreC/m/myusername
/scratch/miner/m/myusername

To see which shared scratch file system a specific cluster can access, search on the ClassAd attribute ClusterName:

$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'ClusterName=="Radon"' -format "%s\n" FileSystemDomain >myfile

To see which shared scratch file systems other clusters use, modify the preceding example with other cluster names: Hansen, Rossmann, Coates, Steele, or Miner.

To see which clusters can access a given shared scratch file system, search on the ClassAd attribute FileSystemDomain:

$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'FileSystemDomain == "bluearc.rcac.purdue.edu"' -format "%s\n" ClusterName >myfile
$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'FileSystemDomain == "lustrea.rcac.purdue.edu"' -format "%s\n" ClusterName >myfile
$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'FileSystemDomain == "lustrec.rcac.purdue.edu"' -format "%s\n" ClusterName >myfile
$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'FileSystemDomain == "miner.rcac.purdue.edu"' -format "%s\n" ClusterName >myfile

Using logical operators, you may combine ClassAd constraints. For example, to see how many x86_64 processor cores running Linux have access to the BlueArc shared scratch file system:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (FileSystemDomain == "bluearc.rcac.purdue.edu")'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX  9232  5515    1431      2286       0          0        0
               Total  9232  5515    1431      2286       0          0        0

Examples

To submit jobs successfully to BoilerGrid and to achieve maximum throughput in HTCondor's computing environment, you must understand the architecture of BoilerGrid and how to request resources which are appropriate to your application. The following examples show how to discover the resources of BoilerGrid. They also explain standard input and output, command-line arguments, file input and output, Standard and Vanilla universe jobs, shared file systems, parameter sweeps, DAG Manager, job requirements and ranks, and how to run commercial and third-party software. You may wish to look here for an example that is most similar to your application and modify that example for your jobs. You may also refer to the HTCondor Manual for more details.

Simplest Job Submission File

The job submission file must contain one executable command and at least one queue command. All other commands of the job submission file have default actions. HTCondor's job submission parser ignores blank lines and single-line comments beginning with a pound sign ("#"). There is no block (multi-line) comment in a job submission file. In some cases, a single-line comment may appear on a command line.

# FILENAME: myjob.sub

executable = myprogram
queue    # place one copy of the job in the HTCondor queue

This job submission file uses HTCondor's default universe, Vanilla. Because this Vanilla job does not turn on HTCondor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job. Also by default, the number of jobs queued is one.

This job submission file may appear to be useless because it lacks the standard input, standard output, standard error, and a common log file; however, it will correctly process a program which reads and writes formatted files. Here is an example of file I/O. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
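
The linked file-I/O example is not reproduced in this document. As a rough sketch only, a program suitable for this submission file might read a formatted input file and write a formatted output file, for example:

/* FILENAME: myprogram.c - a sketch only; the file names are placeholders
   following this guide's "my" convention. */
#include <stdio.h>

int main(void) {
    FILE *in  = fopen("mydata.in",  "r");   /* formatted input file  */
    FILE *out = fopen("mydata.out", "w");   /* formatted output file */
    double x;

    if (in == NULL || out == NULL)
        return 1;                           /* fail if either file cannot be opened */
    while (fscanf(in, "%lf", &x) == 1)      /* read one number at a time */
        fprintf(out, "%f\n", 2.0 * x);      /* write a transformed value */
    fclose(in);
    fclose(out);
    return 0;
}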

To submit this job to HTCondor:

$ condor_submit myjob.sub

Standard Input/Output

HTCondor manages a batch environment. When HTCondor manages the execution of a computer program, that program cannot offer an interactive experience with a terminal. All input normally read from the keyboard (standard input) must be prepared in a file ahead of execution. All output normally written to the screen (standard output and standard error) appear in files where you may view them after execution. Also, HTCondor records in a common log file the main events of running a job.

Here is an example of standard I/O. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
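
The linked standard-I/O example is not reproduced in this document. As a rough sketch only, such a program might read numbers from standard input and echo transformed values to standard output:

/* FILENAME: myprogram.c - a sketch only */
#include <stdio.h>

int main(void) {
    double x;

    while (scanf("%lf", &x) == 1)   /* read from standard input (mydata.in)  */
        printf("%f\n", 2.0 * x);    /* write to standard output (mydata.out) */
    return 0;
}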

Prepare a job submission file with an appropriate filename, here named myjob.sub:

# FILENAME: myjob.sub

executable = myprogram

# Standard I/O files, HTCondor log file
input  = mydata.in
output = mydata.out
error  = mydata.err
log    = mydata.log

# queue one job
queue

This job submission file uses HTCondor's default universe, Vanilla. Because this Vanilla job does not turn on HTCondor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job. Also by default, the number of jobs queued is one.

This submission specifies that there exists a file, mydata.in, which contains all text which the program would otherwise read from the keyboard, standard input. It also specifies the names of three files which will receive standard output, standard error, and HTCondor's log entries. These three output files need not exist beforehand, but they may. HTCondor will overwrite standard output and standard error but will append to the log file during subsequent submissions.

To submit this job to HTCondor:

$ condor_submit myjob.sub

Command Line Arguments

HTCondor allows the specification of command-line arguments in the job submission file. There are two permissible formats for specifying arguments. The old syntax has arguments delimited (separated) by space characters. To use double quotes, escape with a backslash (i.e. put a backslash in front of each double quote). For example:

arguments = arg1 \"arg2\" 'arg3'

yields the following arguments:

arg1
"arg2"
'arg3'

The new syntax supports uniform quoting of spaces within arguments. A pair of double quotes surrounds the entire argument list. To include a literal double quote, simply repeat it. White space (spaces, tabs) separates arguments. To include literal white space in an argument, surround the argument with a pair of single quotes. To include a literal single quote within a single-quoted argument, repeat the single quote.

Here is a simple program which will display command-line arguments specified in a job submission file. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
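The source of the referenced program is not reproduced in this guide. A sketch along the following lines would produce output in the format shown below (an illustration only; the actual myprogram.c may differ):

/* FILENAME: myprogram.c  (illustrative sketch only) */
#include <stdio.h>

int main(int argc, char *argv[])
{
    int i;

    printf("***  MAIN START  ***\n\n");
    printf("Number of command line arguments: %d\n\n", argc);

    /* Print every argument, including argv[0], HTCondor's name for the job. */
    for (i = 0; i < argc; i++)
        printf("command line argument, argv[%d]: %s\n", i, argv[i]);

    printf("\n***  MAIN STOP  ***\n");
    return 0;
}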

Prepare a job submission file with an appropriate filename, here named myjob.sub. Run with command-line arguments in either the old or new syntax:

# FILENAME: myjob.sub

universe = VANILLA

executable = myprogram
# Old Syntax
# arguments = arg1 arg2 arg3 \"arg4\" 'arg5' 'arg with spaces' arg6 arg7_with_spaces arg8

# New Syntax
arguments = "arg9 ""arg10"" 'arg with literal '' and spaces'"

# HTCondor Macros
# arguments = $(Cluster) $(Process)

# standard I/O files, HTCondor log file
output = myprogram.out
error  = myprogram.err
log    = myprogram.log

# queue one job
queue

To submit this job to HTCondor:

$ condor_submit myjob.sub

View command-line arguments submitted in the old syntax:

***  MAIN START  ***

Number of command line arguments: 12

command line argument, argv[0]: condor_exec.746418.0
command line argument, argv[1]: arg1
command line argument, argv[2]: arg2
command line argument, argv[3]: arg3
command line argument, argv[4]: "arg4"
command line argument, argv[5]: 'arg5'
command line argument, argv[6]: 'arg
command line argument, argv[7]: with
command line argument, argv[8]: spaces'
command line argument, argv[9]: arg6
command line argument, argv[10]: arg7_with_spaces
command line argument, argv[11]: arg8

***  MAIN STOP  ***

The old syntax requires simulating spaces in arguments with the underscore character. Then, user code can replace the underscores with spaces to achieve an argument with spaces.
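As an illustration of that technique (not part of the original example), user code could convert the underscores back into spaces as follows:

/* Illustrative sketch: restore spaces in old-syntax arguments. */
#include <stdio.h>

int main(int argc, char *argv[])
{
    int   i;
    char *p;

    for (i = 1; i < argc; i++) {
        /* Replace each underscore with a space, in place. */
        for (p = argv[i]; *p != '\0'; p++)
            if (*p == '_')
                *p = ' ';
        printf("argument %d: %s\n", i, argv[i]);
    }
    return 0;
}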

View command-line arguments submitted in the new syntax:

***  MAIN START  ***

Number of command line arguments: 4

command line argument, argv[0]: condor_exec.341964.0
command line argument, argv[1]: arg9
command line argument, argv[2]: "arg10"
command line argument, argv[3]: arg with literal ' and spaces

***  MAIN STOP  ***

The array element argv[0] holds HTCondor's name for a job.

Two HTCondor macros are useful as command-line arguments, $(Cluster) and $(Process):

***  MAIN START  ***

Number of command line arguments: 3

command line argument, argv[0]: condor_exec.341965.0
command line argument, argv[1]: 341965
command line argument, argv[2]: 0

***  MAIN STOP  ***

File Input/Output

HTCondor is able to manage a computer program which reads and writes formatted data files.

Here is an example of formatted file I/O. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
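The source of the referenced program is not reproduced in this guide. A minimal sketch of such a formatted-file program, using the file names myinputdata and myoutputdata described below, might look like this (an illustration only):

/* FILENAME: myprogram.c  (illustrative sketch only) */
#include <stdio.h>

int main(void)
{
    char  line[256];
    FILE *in, *out;

    printf("***  MAIN START  ***\n");

    /* The data file names appear in the source code only,
       not in the job submission file. */
    in = fopen("myinputdata", "r");
    if (in == NULL) {
        fprintf(stderr, "cannot open myinputdata\n");
        return 1;
    }
    out = fopen("myoutputdata", "w");
    if (out == NULL) {
        fprintf(stderr, "cannot open myoutputdata\n");
        fclose(in);
        return 1;
    }

    /* Copy each formatted record from the input file to the output file. */
    while (fgets(line, sizeof(line), in) != NULL)
        fputs(line, out);

    fclose(in);
    fclose(out);

    printf("***  MAIN  STOP  ***\n");
    return 0;
}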

Prepare a job submission file with an appropriate filename, here named myjob.sub. This example combines formatted file I/O with standard output:

# FILENAME: myjob.sub

executable = myprogram

# Standard I/O files, HTCondor log file
output = mydata.out
error  = mydata.err
log    = mydata.log

# queue one job
queue

This job submission file uses HTCondor's default universe, Vanilla. Because this Vanilla job does not turn on HTCondor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job. Also by default, the number of jobs queued is one.

This submission specifies that there exists a formatted input file, myinputdata, a name which appears in the source code only. The result is a formatted output file, myoutputdata, a name which also appears in the source code only. This submission also specifies the names of three files which will receive standard output, standard error, and HTCondor's log entries. These three output files need not exist beforehand, but they may. HTCondor will overwrite standard output and standard error but append to the log file during subsequent submissions.

To submit this job to HTCondor:

$ condor_submit myjob.sub

Standard Universe Job

The Standard Universe is an execution environment of HTCondor. Jobs using the Standard Universe enjoy two advantages. First, a job with a higher priority may preempt an HTCondor job without loss of completed work. HTCondor can checkpoint the job and move (migrate) it to a different compute node which would otherwise be idle. HTCondor restarts the job on the new compute node at precisely the point of preemption. The Standard Universe tells HTCondor that you re-linked your job via condor_compile with the HTCondor libraries, and therefore your job supports checkpointing. HTCondor transfers the executable and checkpoint files automatically, when needed.

The second advantage of HTCondor's Standard Universe is that remote system calls handle access to files (input and output). For example, HTCondor intercepts a call to read a record of a data file. HTCondor sends the read operation to the user's current working directory on the submission host which performs the read operation. HTCondor then sends the desired record to the compute node which processes the record. A similar process occurs with write operations. Therefore, the existence of a shared file system is not relevant. This feature maximizes the number of machines which can run a job. Compute nodes across an entire enterprise can run a job, including compute nodes in different administrative domains.

This section illustrates how to submit a small job to the Standard Universe of BoilerGrid. This example, myprogram.c, displays the name of the host which runs the job. To compile this program for the Standard Universe, see Compiling Serial Programs.
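The source of myprogram.c is not reproduced in this guide. A minimal sketch consistent with the output shown later in this section might be (an illustration only):

/* FILENAME: myprogram.c  (illustrative sketch only) */
#define _GNU_SOURCE       /* for getdomainname() on glibc */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char hostname[256];
    char domainname[256];

    printf("***  MAIN START  ***\n\n");

    /* Report the machine which actually executes the job. */
    if (gethostname(hostname, sizeof(hostname)) == 0)
        printf("hostname = %s\n", hostname);
    if (getdomainname(domainname, sizeof(domainname)) == 0)
        printf("domainname = %s\n", domainname);

    printf("\n***  MAIN  STOP  ***\n");
    return 0;
}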

Prepare a job submission file with the Standard Universe, the compiled C program as the executable, HTCondor's transferring the compiled program to the chosen compute node, and an appropriate filename, here named myjob.sub:

# FILENAME:  myjob.sub

universe = STANDARD

# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable          = myprogram

# Standard I/O files, HTCondor log file
output = mydata.out
error  = mydata.err
log    = mydata.log

# queue one job
queue

Submit this job to HTCondor:

$ condor_submit myjob.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 341956.

View job status:

$ condor_q myusername

-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35556> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
341956.0   myusername     10/22 11:18   0+00:00:00 I  0   7.3  myjob

Place the job on hold to study the submission:

$ condor_hold 341956
Cluster 341956 held.

Obtain the requirements of this job:

$ condor_q myusername -attributes requirements -long

-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35556> : condor.rcac.purdue.edu
requirements = (Arch == "X86_64") && (OpSys == "LINUX") && ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED)) && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize)

Job requirements reflect the Standard Universe (preemption with checkpointing). This job requires a processor core which runs the Linux operating system on the x86_64 architecture and has the ability to checkpoint the job at preemption. The requirements exclude any mention of the shared file system since a shared file system is not relevant to a Standard Universe job. Running a Standard Universe job does not limit the job to the processor cores which use the same shared file system that the submission host uses. The job may land either on a processor core that uses the same shared file system or not; in either case, the remote I/O of the Standard Universe handles the job's file I/O. The ClassAds Disk and Memory ensure that the chosen core has sufficient resources to hold the disk and memory footprint of the job.

To see how many compute nodes of BoilerGrid are able to satisfy this job's requirements:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED))'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX 33118 27602    2878      2596      42          0        0
               Total 33118 27602    2878      2596      42          0        0

The report shows that 33,118 processor cores are candidates for running the job. Using HTCondor's Standard Universe with its remote file I/O maximizes the number of candidate cores available to the job. This large pool of candidate cores has the potential of offering high throughput. Some subset of the candidate processor cores, possibly none, will be available to accept the job at any given moment.

Release the job from the queue:

$ condor_release 341956
Cluster 341956 released.

View results in the file for all standard output, here named mydata.out:

***  MAIN START  ***

hostname = cms-100.rcac.purdue.edu
domainname = (none)

***  MAIN  STOP  ***

The output shows the name of the compute node which ran the job. While this job ran on a compute node which resides on the same shared file system used by the submission host, another submission which forced the job onto a node of another shared file system also ran successfully because the remote I/O of the Standard Universe handled the reading and writing of records.

View the log file, mydata.log:

000 (341956.000.000) 10/22 11:42:22 Job submitted from host: <128.211.157.86:35556>
...
012 (341956.000.000) 10/22 11:42:57 Job was held.
    via condor_hold (by user myusername)
    Code 1 Subcode 0
...
001 (341956.000.000) 10/22 11:43:57 Job executing on host: <128.211.157.10:52556>
...
005 (341956.000.000) 10/22 11:43:57 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    1110  -  Run Bytes Sent By Job
    5431033  -  Run Bytes Received By Job
    1110  -  Total Bytes Sent By Job
    5431033  -  Total Bytes Received By Job

The log file records the main events related to the processing of this job. In this example, the log records the number of bytes read and written between the submission host and the compute node via the remote I/O of the Standard Universe.

The Standard Universe maximizes throughput with its ability to checkpoint jobs and to intercept remote system calls. The latter avoids requiring the submission host and the compute node to share a file system. The process of re-linking a job with HTCondor's libraries involves including both HTCondor's libraries and the user's libraries as static libraries. The danger of this effort to maximize throughput is that an HTCondor flock is a heterogeneous collection of old and new compute nodes, so a job can land on a compute node that is unable to run it. When this happens, you must consider how to avoid compute nodes which cannot run the job to successful completion.

Vanilla Universe Job (with shared file system)

The Vanilla Universe is an execution environment of HTCondor. The Vanilla Universe tells HTCondor that you did not re-link your job via condor_compile with the HTCondor libraries, and therefore your job does not support checkpointing or remote system calls. Such jobs include executable binaries of commercial applications, shell scripts, and programs built with a compiler which is not compatible with HTCondor's condor_compile command.

For jobs submitted under the Vanilla Universe, the existence of a shared file system is relevant since access to files (input and output) involves either a shared file system or HTCondor's file transfer mechanism, not the remote system calls of the Standard Universe.

This section illustrates how to submit a small job to the Vanilla Universe of BoilerGrid from a submission host which has a shared file system, leaving HTCondor's file transfer mechanism turned off (the default). This example, myprogram.c, displays the name of the host which runs the job, the path name of the current working directory, and the contents of that directory. The Vanilla Universe allows using Linux commands to obtain this information. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
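The source of myprogram.c is not reproduced in this guide; the same program also appears in the following Vanilla Universe examples. A minimal sketch consistent with the output shown later in this section might be (an illustration only):

/* FILENAME: myprogram.c  (illustrative sketch only) */
#include <stdio.h>
#include <stdlib.h>

/* Run a Linux command; its output goes to the job's standard output. */
static void run(const char *cmd)
{
    if (system(cmd) != 0)
        fprintf(stderr, "command failed: %s\n", cmd);
}

int main(void)
{
    printf("***  MAIN START  ***\n\n");

    /* Because system() writes directly while printf() is buffered, the
       command output may appear before the MAIN START banner, as in the
       sample output later in this section. */
    run("hostname");
    run("domainname");
    run("pwd");
    run("ls -l");

    printf("\n***  MAIN  STOP  ***\n");
    return 0;
}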

Prepare a job submission file with the Vanilla Universe, the compiled C program as the executable, HTCondor's file transfer mechanism turned off by default, HTCondor's transferring the compiled program to the chosen compute node, and an appropriate filename, here named myjob.sub:

# FILENAME:  myjob.sub

universe = VANILLA

# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable          = myprogram

# HTCondor's file transfer mechanism is off, by default.

# Standard I/O files, HTCondor log file
output = mydata.out
error  = mydata.err
log    = mydata.log

# queue one job
queue

Submit this job to HTCondor:

$ condor_submit myjob.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 746407.

View job status:

$ condor_q myusername

-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
746407.0   myusername     10/25 10:04   0+00:00:00 I  0   0.0  myjob

Place this job on hold to study the submission:

$ condor_hold 746407
Cluster 746407 held.

Obtain the requirements of this job:

$ condor_q myusername -attributes requirements -long


-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (TARGET.FileSystemDomain == MY.FileSystemDomain)

Job requirements reflect both the Vanilla Universe (preemption without checkpointing) and the shared file system. This job requires a compute node which runs the Linux operating system on the x86_64 architecture and, more importantly, which shares the same FileSystemDomain as the submission host (both the TARGET and MY shared file system must be the same). So, this submission limits running the job to the processor cores which use the same shared file system that the submission host uses. The ClassAds Disk and Memory ensure that the chosen core has sufficient resources to hold the disk and memory footprint of the job.

To see how many compute nodes in various file system domains of BoilerGrid are able to satisfy this job's requirements:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (FileSystemDomain == "bluearc.rcac.purdue.edu")'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX  9924  7466    1579       878       0          1        0
               Total  9924  7466    1579       878       0          1        0


$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (FileSystemDomain == "lustrea.rcac.purdue.edu")'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX 18784 16717    1438       628       0          1        0
               Total 18784 16717    1438       628       0          1        0


$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (FileSystemDomain == "miner.rcac.purdue.edu")'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX  1006   156     760        90       0          0        0
               Total  1006   156     760        90       0          0        0


The reports show 9,924, 18,784, and 1,006 processor cores in these three file system domains; only the cores which share the submission host's FileSystemDomain are candidates for running this job. While the number of candidate processor cores is much smaller than the total number of x86_64 cores running Linux on BoilerGrid, using the shared file system is the preferred method in many situations. Some subset of the candidate processor cores, possibly none, will be available to accept the job at any given moment.

Release the job from the queue:

$ condor_release 746407
Cluster 746407 released.

View results in the file for all standard output, here named mydata.out:

cms-100.rcac.purdue.edu
(none)
/home/myhomedirectory/Condor/vanilla_w_sfs
total 288
-rw-r--r-- 1 myusername itap 1508 Oct 25 10:39 commands
-rw-r--r-- 1 myusername itap    0 Oct 25 10:46 mydata.err
-rw-r--r-- 1 myusername itap  467 Oct 25 10:46 mydata.log
-rw-r--r-- 1 myusername itap   71 Oct 25 10:46 mydata.out
-rwxr-xr-x 1 myusername itap 6863 Oct 25 10:14 myprogram
-rw-r----- 1 myusername itap  376 Oct 25 09:14 myprogram.c
-rw-r----- 1 myusername itap  199 Oct 25 10:04 myjob.sub
-rwxr----- 1 myusername itap   70 Oct 25 09:19 run
-rwxr--r-- 1 myusername itap  216 Oct 25 09:14 tally
-rw-r----- 1 myusername itap  952 Oct 25 09:14 tmp
-rw-r--r-- 1 myusername itap    0 Oct 25 10:19 tmp1
***  MAIN START  ***


***  MAIN  STOP  ***

The output shows the name of the compute node which ran the job. This job ran on a compute node which uses the same shared file system which the submission host uses. The fact that the current working directory is the user's home directory on the submission host proves that this job used the shared file system for file I/O.

View the log file, mydata.log:

000 (746407.000.000) 10/25 10:04:51 Job submitted from host: <128.211.157.86:60481>
...
012 (746407.000.000) 10/25 10:05:15 Job was held.
    via condor_hold (by user myusername)
    Code 1 Subcode 0
...
009 (746406.000.000) 10/25 10:22:07 Job was aborted by the user.
    via condor_rm (by user myusername)
...
013 (746407.000.000) 10/25 10:44:07 Job was released.
    via condor_release (by user myusername)
...
001 (746407.000.000) 10/25 10:46:47 Job executing on host: <128.211.157.10:57108>
...
005 (746407.000.000) 10/25 10:46:47 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job

The log file records the main events related to the processing of this job. In this example, the log records that the number of bytes read and written between the submission host and the compute node is zero, another indication that this job used the shared file system for file I/O.

The Vanilla Universe is available for jobs which cannot take advantage of the Standard Universe because you cannot re-link them with HTCondor's libraries: an executable binary, a shell script, or a program compiled with an incompatible compiler. If a shared file system is available, the Vanilla job may use it for file I/O by keeping HTCondor's file transfer mechanism turned off. Keeping the file transfer mechanism off, however, excludes compatible compute nodes which do not share a file system with the submission host.

Vanilla Universe Job (either shared file system or file transfer mechanism)

The Vanilla Universe is an execution environment of HTCondor. The Vanilla Universe tells HTCondor that you did not re-link your job via condor_compile with the HTCondor libraries, and therefore your job does not support checkpointing or remote system calls. Such jobs include executable binaries of commercial applications, shell scripts, and programs built with a compiler which is not compatible with HTCondor's condor_compile command.

For jobs submitted under the Vanilla Universe, the existence of a shared file system is relevant since access to files (input and output) involves either a shared file system or HTCondor's file transfer mechanism, not the remote system calls of the Standard Universe.

This section illustrates how to submit a small job to the Vanilla Universe of BoilerGrid from a submission host which has a shared file system with HTCondor's file transfer mechanism turned on "if needed". HTCondor transfers files only when it matches the job with a compute node which uses a different FileSystemDomain from the one which the submission host uses. If HTCondor matches the job with a compute node which uses the same FileSystemDomain which the submission host uses, HTCondor does not transfer files and relies on the shared file system instead. This example, myprogram.c, displays the name of the host which runs the job, the path name of the current working directory, and the contents of that directory. The Vanilla Universe allows using Linux commands to obtain the information. To compile this program for the Vanilla Universe, see Compiling Serial Programs.

Prepare a job submission file with the Vanilla Universe, the compiled C program as the executable, HTCondor's file transfer mechanism turned on if needed, HTCondor's transferring the compiled program to the chosen compute node, and an appropriate filename, here named myjob.sub:

# FILENAME:  myjob.sub

universe = VANILLA

# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable          = myprogram

# HTCondor's file transfer mechanism is turned on only when needed.
should_transfer_files = IF_NEEDED

# Let HTCondor handle output file(s).
when_to_transfer_output = ON_EXIT

# Standard I/O files, HTCondor log file
output = mydata.out
error  = mydata.err
log    = mydata.log

# queue one job
queue

Submit this job to HTCondor:

$ condor_submit myjob.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 746408.

View job status:

$ condor_q myusername

-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
746408.0   myusername     10/25 10:04   0+00:00:00 I  0   0.0  myjob

Place this job on hold to study the submission:

$ condor_hold 746408
Cluster 746408 held.

Obtain the requirements of this job:

$ condor_q myusername -attributes requirements -long


-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && ((HasFileTransfer) || (TARGET.FileSystemDomain == MY.FileSystemDomain))

The requirements reflect both the Vanilla Universe (preemption without checkpointing) and HTCondor's file transfer mechanism turned on only if needed. This job requires a processor core which runs the Linux operating system on the x86_64 architecture, but, more importantly, the processor core chosen to run this job need not share the same FileSystemDomain which the submission host uses (both the TARGET and MY shared file system need not be equal). The ClassAd of this job states that the chosen core must either have the file transfer capability or share a file system with the submission host. So, this submission does not limit running the job to the processor cores which use the same shared file system which the submission host uses. The ClassAds Disk and Memory ensure that the chosen core has sufficient resources to hold the disk and memory footprint of the job.

To see how many compute nodes of BoilerGrid are able to satisfy this job's requirements:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && ((HasFileTransfer) || (FileSystemDomain == "bluearc.rcac.purdue.edu"))'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX 32074 20806    4287      6976       5          0        0
               Total 32074 20806    4287      6976       5          0        0

The report shows that 32,074 processor cores are candidates for running this job. Using HTCondor's Vanilla Universe with its file transfer mechanism turned on only if needed maximizes the number of candidate cores available to the job. This large pool of candidate cores has the potential of offering high throughput. On the other hand, large data files involved in the file transfer may diminish throughput. Some subset of the candidate processor cores, possibly none, will be available to accept the job at any given moment.

Release the job from the queue:

$ condor_release 746408
Cluster 746408 released.

View results in the file for all standard output, here named mydata.out:

cms-100.rcac.purdue.edu
(none)
/home/myhomedirectory/HTCondor/vanilla_w_sfs_ftm
total 284
-rw-r--r-- 1 myusername itap 1508 Oct 25 10:39 commands
-rw-r--r-- 1 myusername itap    0 Oct 25 10:46 mydata.err
-rw-r--r-- 1 myusername itap  467 Oct 25 10:46 mydata.log
-rw-r--r-- 1 myusername itap   71 Oct 25 10:46 mydata.out
-rwxr-xr-x 1 myusername itap 6863 Oct 25 10:14 myprogram
-rw-r----- 1 myusername itap  376 Oct 25 09:14 myprogram.c
-rw-r----- 1 myusername itap  199 Oct 25 10:04 myjob.sub
-rwxr----- 1 myusername itap   70 Oct 25 09:19 run
-rwxr--r-- 1 myusername itap  216 Oct 25 09:14 tally
-rw-r----- 1 myusername itap  952 Oct 25 09:14 tmp
-rw-r--r-- 1 myusername itap    0 Oct 25 10:19 tmp1
***  MAIN START  ***


***  MAIN  STOP  ***

The output shows the name of the compute node which ran the job. This job ran on a compute node which uses the same shared file system which the submission host uses. The fact that the current working directory is the user's home directory on the submission host proves that this job used the shared file system for file I/O.

View the log file, mydata.log:

000 (746408.000.000) 10/25 10:04:51 Job submitted from host: <128.211.157.86:60481>
...
012 (746408.000.000) 10/25 10:05:15 Job was held.
    via condor_hold (by user myusername)
    Code 1 Subcode 0
...
013 (746408.000.000) 10/25 10:44:07 Job was released.
    via condor_release (by user myusername)
...
001 (746408.000.000) 10/25 10:46:47 Job executing on host: <128.211.157.10:57108>
...
005 (746408.000.000) 10/25 10:46:47 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job

The log file records the main events related to the processing of this job. In this example, the log records that the number of bytes read and written between the submission host and the compute node is zero, another indication that this job used the shared file system for file I/O.

To see HTCondor's file transfer mechanism at work, repeat the example above but force the job to a compute node which does not share a file system with the submission host.

Modify the job submission file of the previous example to send the job to a processor core which uses a different shared file system:

# FILENAME:  myjob.sub

universe = VANILLA

# A core on the Rossmann cluster uses a different shared file system.
requirements = ClusterName == "Rossmann"

# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable          = myprogram

# HTCondor's file transfer mechanism is turned on only when needed. This submission needs the transfer mechanism.
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT


# Standard I/O files, HTCondor log file
output = mydata.out
error  = mydata.err
log    = mydata.log

# queue one job
queue

View results in the file for all standard output, here named mydata.out:

cms-100.rcac.purdue.edu
(none)
/var/condor/execute/dir_11554
total 12
-rwxr-xr-x 1 myusername itap 6863 Oct 25 12:28 condor_exec.exe
-rw-r--r-- 1 myusername itap    0 Oct 25 12:31 mydata.err
-rw-r--r-- 1 myusername itap   67 Oct 25 12:32 mydata.out
***  MAIN START  ***


***  MAIN  STOP  ***

This output file exhibits a temporary directory on the compute node which HTCondor chose to run the job, rather than the user's home directory, another indication that this job used HTCondor's file transfer mechanism for file I/O.

View the log file, mydata.log:

000 (746411.000.000) 10/25 12:08:12 Job submitted from host: <128.211.157.86:60481>
...
001 (746411.000.000) 10/25 12:31:59 Job executing on host: <128.211.157.10:51871>
...
005 (746411.000.000) 10/25 12:32:00 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    278  -  Run Bytes Sent By Job
    6863  -  Run Bytes Received By Job
    278  -  Total Bytes Sent By Job
    6863  -  Total Bytes Received By Job

The log file records the main events related to the processing of this job. In this example, the log records the number of bytes transferred between the submission host and the compute node via the file transfer mechanism, another indication that HTCondor's file transfer mechanism was used.

The Vanilla Universe is available for jobs which cannot take advantage of the Standard Universe because you cannot re-link them with HTCondor's libraries: an executable binary, a shell script, or a program compiled with an incompatible compiler. If a shared file system is available and HTCondor's file transfer mechanism is suitable for the job, the Vanilla job may use either for file I/O by specifying that the submission uses the mechanism only "if needed." While this method can maximize throughput, the size of any file which you intend to transfer must be reasonable: it must fit in the available disk space of the compute node, and the time needed to transfer it cannot be so great that higher-priority jobs constantly preempt the HTCondor job during file transfer.

Vanilla Universe Job (without shared file system)

The Vanilla Universe is an execution environment of HTCondor. The Vanilla Universe tells HTCondor that you did not re-link your job via condor_compile with the HTCondor libraries, and therefore your job does not support checkpointing or remote system calls. Such jobs include executable binaries of commercial applications, shell scripts, and programs built with a compiler which is not compatible with HTCondor's condor_compile command.

For jobs submitted under the Vanilla Universe, the existence of a shared file system is relevant since access to files (input and output) involves either a shared file system or HTCondor's file transfer mechanism, not the remote system calls of the Standard Universe.

This section illustrates how to submit a small job to the Vanilla Universe of BoilerGrid from a submission host which lacks a shared file system with HTCondor's file transfer mechanism turned on. No matter which processor core HTCondor chooses to run the job, HTCondor transfers files. This example, myprogram.c, displays the name of the host which runs the job, the path name of the current working directory, and the contents of that directory. The Vanilla Universe allows using Linux commands to obtain the information. To compile this program for the Vanilla Universe, see Compiling Serial Programs.

Prepare a job submission file with the Vanilla Universe, the compiled C program as the executable, HTCondor's file transfer mechanism turned on, HTCondor's transferring the compiled program to the chosen compute node, and an appropriate filename, here named myjob.sub:

# FILENAME:  myjob.sub

universe = VANILLA

# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable          = myprogram

# Turn on HTCondor's file transfer mechanism.
should_transfer_files   = YES

# Let HTCondor handle output files.
when_to_transfer_output = ON_EXIT

# Standard I/O files, HTCondor log file
output = mydata.out
error  = mydata.err
log    = mydata.log

# queue one job
queue

Submit this job to HTCondor:

$ condor_submit myjob.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 341960.

View job status:

$ condor_q myusername

-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35556> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
341960.0   myusername     10/25 15:02   0+00:00:00 I  0   0.0  myjob

Place this job on hold to study the submission:

$ condor_hold 341960
Cluster 341960 held.

Obtain the requirements of this job:

$ condor_q myusername -attributes requirements -long


-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35556> : condor.rcac.purdue.edu
requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (HasFileTransfer)

Job requirements reflect both the Vanilla Universe (preemption without checkpointing) and HTCondor's file transfer mechanism turned on. This job requires a processor core which runs the Linux operating system on the x86_64 architecture, but more importantly the processor core chosen to run this job can reside on a cluster which lacks a shared file system. The ClassAd of this job states that the chosen core must have the file transfer capability. The ClassAds Disk and Memory ensure that the chosen core has sufficient resources to hold the disk and memory footprint of the job.

To see how many compute nodes of BoilerGrid are able to satisfy this job's requirements:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (HasFileTransfer)'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX 33068 20690    3850      8520       8          0        0
               Total 33068 20690    3850      8520       8          0        0

This report shows that 33,068 processor cores are candidates for running this job. Using HTCondor's Vanilla Universe with its file transfer mechanism turned on maximizes the number of candidate cores available to the job. This large pool of candidate cores has the potential of offering high throughput. On the other hand, large data files involved in the file transfer may diminish throughput. Some subset of the candidate processor cores, possibly none, will be available to accept the job at any given moment.

Release the job from the queue:

$ condor_release 341960
Cluster 341960 released.

View results in the file for all standard output, here named mydata.out:

cms-100.rcac.purdue.edu
(none)
/var/condor/execute/dir_13374
total 12
-rwxr-xr-x 1 myusername itap 6863 Oct 25 15:47 condor_exec.exe
-rw-r--r-- 1 myusername itap    0 Oct 25 15:50 mydata.err
-rw-r--r-- 1 myusername itap   61 Oct 25 15:50 mydata.out
***  MAIN START  ***


***  MAIN  STOP  ***

The output shows the name of the compute node which ran the job. This job ran on a compute node which shares a file system with the submission host. Despite this, the current working directory is a temporary directory on the compute node; therefore, this job used the file transfer mechanism for file I/O.

View the log file, mydata.log:

000 (341960.000.000) 10/25 15:03:18 Job submitted from host: <128.211.157.86:35556>
...
012 (341960.000.000) 10/25 15:03:35 Job was held.
    via condor_hold (by user myusername)
    Code 1 Subcode 0
...
013 (341960.000.000) 10/25 15:48:00 Job was released.
    via condor_release (by user myusername)
...
001 (341960.000.000) 10/25 15:50:46 Job executing on host: <128.211.157.10:33047>
...
005 (341960.000.000) 10/25 15:50:46 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    272  -  Run Bytes Sent By Job
    6863  -  Run Bytes Received By Job
    272  -  Total Bytes Sent By Job
    6863  -  Total Bytes Received By Job

The log file records the main events related to the processing of this job. In this example, the log records the number of bytes transferred between the submission host and the compute node via HTCondor's file transfer mechanism.

The Vanilla Universe is available for jobs which cannot take advantage of the Standard Universe because you cannot re-link them with HTCondor's libraries: an executable binary, a shell script, or a program compiled with an incompatible compiler. If a shared file system is not available and HTCondor's file transfer mechanism is suitable for the job, you may turn on the file transfer mechanism, and the Vanilla job will transfer your files. The size of any file which you intend to transfer must be reasonable: it must fit in the available disk space of the compute node, and the time needed to transfer it cannot be so great that higher-priority jobs constantly preempt the HTCondor job during file transfer.

Parameter Sweep

A parameter sweep is a computation applied to several unique sets of data. Computing the average test score for each class in a school is an example of a simple parameter sweep.

A parameter sweep requires that each queued run of a computer program reads a unique set of input parameters. This example illustrates how to implement a unique command line argument, a unique standard input, and a unique formatted input data file. The program writes to a unique standard output and to a unique formatted output file.

HTCondor provides a pre-defined macro $(Process) which supports the specification of unique command-line arguments and input/output files so that each queued run of a job sees a unique set of data. This macro appears in the job submission file, myprogram.sub:

# FILENAME:  myprogram.sub

universe = VANILLA

executable = myprogram
# Processes 0,1,2
# command line argument
arguments  = $(Process)

# Standard I/O files, HTCondor log file
input  = mydata.in.$(Process)
output = mydata.out.$(Process)
error  = mydata.err.$(Process)
log    = mydata.log

# queue 3 jobs in 1 cluster
queue 3

This job submission file specifies a parameter sweep of three queued runs. Each process has a unique ID: 0, 1, and 2. Process 0 of the parameter sweep expects a standard input file named "mydata.in.0"; process 1, "mydata.in.1"; and process 2, "mydata.in.2". The sweep will generate similarly named files for standard output and error. HTCondor advises using a single log file in a submission. In addition, the sweep expects to find formatted input data files with the same process number used as a suffix: i_mydata.0, i_mydata.1, i_mydata.2. Each copy of the computer program used in this sweep finds its unique process number in its command-line argument and appends that unique process number to the generic names "i_mydata." and "o_mydata." to make unique formatted data file names. All files reside in the user's current working directory; hence, data file names must be unique. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
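The source of the sweep program is not reproduced in this guide. A minimal sketch consistent with the description above and the sample output below might be (an illustration only):

/* FILENAME: myprogram.c  (illustrative sketch only) */
#include <stdio.h>

int main(int argc, char *argv[])
{
    char  line[256], infile[64], outfile[64];
    FILE *in, *out;

    printf("***  MAIN START  ***\n\n");
    printf("program name:          %s\n", argv[0]);

    if (argc < 2) {
        fprintf(stderr, "missing process number\n");
        return 1;
    }
    printf("command line argument: %s\n", argv[1]);

    /* Echo standard input (mydata.in.$(Process)) to standard output. */
    if (fgets(line, sizeof(line), stdin) != NULL)
        printf("standard input/output: %s", line);

    /* Build the unique formatted file names from the process number. */
    snprintf(infile,  sizeof(infile),  "i_mydata.%s", argv[1]);
    snprintf(outfile, sizeof(outfile), "o_mydata.%s", argv[1]);

    in  = fopen(infile, "r");
    out = fopen(outfile, "w");
    if (in != NULL && out != NULL && fgets(line, sizeof(line), in) != NULL) {
        printf("formatted input/output: %s", line);
        fputs(line, out);
    }
    if (in  != NULL) fclose(in);
    if (out != NULL) fclose(out);

    printf("\n***  MAIN  STOP  ***\n");
    return 0;
}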

This job submission file uses HTCondor's default universe, Vanilla. Because this Vanilla job does not turn on HTCondor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job.

To submit the executable to HTCondor:

$ condor_submit myprogram.sub

For this submission, command condor_q will show a single cluster number, three unique process numbers, and three unique command-line arguments:

$ condor_q myusername

-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
746419.0   myusername     10/28 10:57   0+00:00:00 I  0   0.0  myprogram 0
746419.1   myusername     10/28 10:57   0+00:00:00 I  0   0.0  myprogram 1
746419.2   myusername     10/28 10:57   0+00:00:00 I  0   0.0  myprogram 2

View the standard input file for process 0, mydata.in.0:

textfromstandardinput:process0

View the formatted input file for process 0, i_mydata.0:

textfromformattedinput:process0

View the standard output file for process 0, mydata.out.0:

***  MAIN START  ***

program name:          condor_exec.exe
command line argument: 0
standard input/output: textfromstandardinput:process0
formatted input/output: textfromformattedinput:process0

***  MAIN  STOP  ***

View the formatted output file for process 0, o_mydata.0:

textfromformattedinput:process0

Processes 1 and 2 have similar input and output files.

The single log file records the major events of the three queued runs of this parameter sweep:

000 (746419.000.000) 10/28 10:57:43 Job submitted from host: <128.211.157.86:60481>
...
000 (746419.001.000) 10/28 10:57:43 Job submitted from host: <128.211.157.86:60481>
...
000 (746419.002.000) 10/28 10:57:43 Job submitted from host: <128.211.157.86:60481>
...
001 (746419.001.000) 10/28 11:02:14 Job executing on host: <128.211.157.10:44836>
...
005 (746419.001.000) 10/28 11:02:14 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    950  -  Run Bytes Sent By Job
    9800  -  Run Bytes Received By Job
    950  -  Total Bytes Sent By Job
    9800  -  Total Bytes Received By Job
...
001 (746419.000.000) 10/28 11:02:15 Job executing on host: <128.211.157.10:44836>
...
005 (746419.000.000) 10/28 11:02:15 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    950  -  Run Bytes Sent By Job
    9800  -  Run Bytes Received By Job
    950  -  Total Bytes Sent By Job
    9800  -  Total Bytes Received By Job
...
001 (746419.002.000) 10/28 11:02:17 Job executing on host: <128.211.157.10:44836>
...
005 (746419.002.000) 10/28 11:02:17 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    950  -  Run Bytes Sent By Job
    9800  -  Run Bytes Received By Job
    950  -  Total Bytes Sent By Job
    9800  -  Total Bytes Received By Job

HTCondor's parameter sweep offers a huge potential. Simply adding a large number to the queue command in a job submission file yields an enormous amount of computing. The catch is that the effort needed to prepare the unique input files, either standard input files or formatted input files, can be great. This effort can be minimal when the input data comes from some data collector operating in the field. This effort can be enormous when you must enter each unique dataset from the keyboard.

Parameter Sweep - Initial Directory

A parameter sweep is a computation applied to several unique sets of data. Computing the average test score for each class in a school is an example of a simple parameter sweep.

A parameter sweep requires that each queued run of a computer program reads a unique set of input parameters. This example illustrates how to implement a unique command line argument, a unique standard input, and a unique formatted input data file. The program writes to a unique standard output and to a unique formatted output file.

HTCondor provides a pre-defined macro $(Process) which supports the specification of unique command-line arguments so that each queued run of a job sees a unique set of data.

Also, HTCondor provides an "initial directory" which supports the specification of unique input/output files so that each queued run of a job sees a unique set of data. Command initialdir specifies a generic directory name which becomes unique after appending the process number of a queued run of a parameter sweep. Each initial directory is actually a subdirectory of the user's current working directory. Each initial directory holds the unique standard input and formatted input files of a queued run of a parameter sweep; each initial directory receives the unique standard output, error and log files plus any unique formatted output files generated by a queued run of a parameter sweep. Since data files of each run of a sweep reside in a separate directory, identical file names may be used; they need not be modified with a process number. Both macro and command appear in the job submission file, myprogram.sub:

# FILENAME:  myprogram.sub

universe = VANILLA

executable = myprogram
# Processes 0,1,2
# command line argument
arguments  = $(Process)

initialdir = mydirectory.$(Process)

# Standard I/O files, HTCondor log file
input          = mydata.in
output         = mydata.out
error          = mydata.err
log            = mydata.log

# queue 3 jobs in 1 cluster
queue 3

This job submission file specifies a parameter sweep of three queued runs. Each process has a unique ID: 0, 1, and 2. Process 0 of the parameter sweep expects a standard input file named "mydata.in" to reside in the initial directory named "mydirectory.0"; for process 1, "mydata.in" resides in "mydirectory.1"; and for process 2, "mydata.in" resides in "mydirectory.2". The sweep will generate similarly named files for standard output, error, and log in the initial directories. In addition, the sweep expects to find in the initial directories formatted input data files with identical names: myinputdata. Each copy of the computer program used in this sweep finds its unique process number in its command-line argument and finds its unique formatted input data file in its own initial directory. The program does not append its unique process number to the generic names of formatted files to make unique formatted data file names. Because the data files of each run reside in unique subdirectories of the user's current working directory, the data file names may be identical. To compile this program for the Vanilla Universe, see Compiling Serial Programs.

This job submission file uses HTCondor's default universe, Vanilla. Because this Vanilla job does not turn on HTCondor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job.

To submit the executable to HTCondor:

$ condor_submit myprogram.sub

For this submission, command condor_q will show a single cluster number, three unique process numbers, and three unique command-line arguments:

$ condor_q myusername

-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
746420.0   myusername     10/28 12:28   0+00:00:00 I  0   0.0  myprogram 0
746420.1   myusername     10/28 12:28   0+00:00:00 I  0   0.0  myprogram 1
746420.2   myusername     10/28 12:28   0+00:00:00 I  0   0.0  myprogram 2

View the standard input file for process 0, mydata.in, in the initial directory mydirectory.0:

textfromstandardinput:process0

View the formatted input file for process 0, myinputdata, in the initial directory mydirectory.0:

textfromformattedinput:process0

View the standard output file for process 0, mydata.out, in the initial directory mydirectory.0:

***  MAIN START  ***

program name:          condor_exec.exe
command line argument: 0
standard input/output: textfromstandardinput:process0
formatted input/output: textfromformattedinput:process0

***  MAIN  STOP  ***

View the formatted output file for process 0, myoutputdata, in the initial directory mydirectory.0:

textfromformattedinput:process0

Each initial directory receives its own log file, mydata.log, which records the major events of that one queued run of the parameter sweep. View the log file for process 0 in the initial directory mydirectory.0:

000 (746420.000.000) 10/28 12:28:35 Job submitted from host: <128.211.157.86:60481>
...
001 (746420.000.000) 10/28 12:33:48 Job executing on host: <128.211.157.10:34460>
...
005 (746420.000.000) 10/28 12:33:49 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    909  -  Run Bytes Sent By Job
    9800  -  Run Bytes Received By Job
    909  -  Total Bytes Sent By Job
    9800  -  Total Bytes Received By Job

Processes 1 and 2 have similar input, output and log files and formatted input/output files residing in their respective initial directories.

HTCondor's parameter sweep offers a huge potential. Simply adding a large number to the queue command in a job submission file yields an enormous amount of computing. The catch is that the effort needed to prepare the unique input files, either standard input files or formatted input files, can be great. This effort is minimal when the input data comes from some data collector operating in the field. This effort can be overwhelming when you must enter each unique dataset from the keyboard.

Parameter Sweep - Single Data File

A parameter sweep is a computation applied to several unique sets of data. Computing the average test score for each class in a school is an example of a simple parameter sweep.

A parameter sweep requires that each queued run of a computer program reads a unique set of input parameters. This example illustrates how to implement a parameter sweep on a single large file. Each queued run of the job reads a different portion of the same file.

HTCondor provides a pre-defined macro $(Process) which supports the specification of unique command-line arguments and input/output files so that each queued run of a job sees a unique set of data. This macro appears in the job submission file, myprogram.sub:

# FILENAME:  myprogram.sub

universe = VANILLA

executable = myprogram
# Processes 0,1,2
arguments  = $(Process)

# There is a single formatted input data file, myinputdata.

# Standard I/O files, HTCondor log file
output = mydata.out.$(Process)
error  = mydata.err.$(Process)
log    = mydata.log

# queue 3 jobs in 1 cluster
queue 3

This job submission file specifies a parameter sweep of three queued runs. Each process has a unique ID: 0, 1, and 2. Each queued run of this job will read a different portion of the data file. Process 0 of the parameter sweep writes a standard output file named "mydata.out.0"; process 1, "mydata.out.1"; and process 2, "mydata.out.2". The sweep will generate similarly named files for standard error. HTCondor advises using a single log file in a submission to record the major events of the sweep. In addition, the sweep expects to find a single formatted input data file, myinputdata. Each copy of the computer program used in this sweep finds its unique process number in its command-line argument and uses that number to determine where in the single input data file it is to start reading records. All files reside in the user's current working directory; hence, data file names must be unique. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
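The source of the sweep program is not reproduced in this guide. A minimal sketch, assuming 11-byte records (10 characters plus a newline) and 10 records per queued run to match the sample output below, might be (an illustration only):

/* FILENAME: myprogram.c  (illustrative sketch only) */
#include <stdio.h>
#include <stdlib.h>

#define RECORD_LEN   11   /* 10 characters plus the newline      */
#define RECS_PER_JOB 10   /* records handled by each queued run  */

int main(int argc, char *argv[])
{
    char  line[64];
    FILE *in;
    int   process, i, rtn_val;
    long  offset;

    printf("***  MAIN START  ***\n\n");
    printf("program name:          %s\n", argv[0]);

    if (argc < 2) {
        fprintf(stderr, "missing process number\n");
        return 1;
    }
    process = atoi(argv[1]);
    printf("command line argument: %d\n", process);

    in = fopen("myinputdata", "r");
    if (in == NULL) {
        fprintf(stderr, "cannot open myinputdata\n");
        return 1;
    }
    printf("current file position:   %ld\n", ftell(in));

    /* Seek to this process's slice of the single shared input file. */
    offset  = (long)process * RECS_PER_JOB * RECORD_LEN;
    rtn_val = fseek(in, offset, SEEK_SET);
    printf("rtn_val = %d\n", rtn_val);
    printf("starting file position:   %ld\n", ftell(in));

    for (i = 0; i < RECS_PER_JOB; i++)
        if (fgets(line, sizeof(line), in) != NULL)
            printf("line %d:   %s", process * RECS_PER_JOB + i + 1, line);

    fclose(in);
    printf("\n***  MAIN  STOP  ***\n");
    return 0;
}

(The sample output for process 1 also demonstrates additional random accesses within the file, which this sketch omits.)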

This job submission file uses HTCondor's default universe, Vanilla. Because this Vanilla job does not turn on HTCondor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job.

To submit the executable to HTCondor:

$ condor_submit myprogram.sub

For this submission, command condor_q will show a single cluster number, three unique process numbers, and three unique command-line arguments:

$ condor_q myusername

-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
746421.0   myusername     10/29 10:57   0+00:00:00 I  0   0.0  myprogram 0
746421.1   myusername     10/29 10:57   0+00:00:00 I  0   0.0  myprogram 1
746421.2   myusername     10/29 10:57   0+00:00:00 I  0   0.0  myprogram 2

View the single formatted input file, myinputdata:

AAAAAAAAAA
BBBBBBBBBB
CCCCCCCCCC
    :
ZZZZZZZZZZ
0000000000
1111111111
2222222222
3333333333

View the standard output file for process 0, mydata.out.0:

***  MAIN START  ***

program name:          condor_exec.exe
command line argument: 0
current file position:   0
rtn_val = 0
starting file position:   0
line 1:   AAAAAAAAAA
line 2:   BBBBBBBBBB
line 3:   CCCCCCCCCC
line 4:   DDDDDDDDDD
line 5:   EEEEEEEEEE
line 6:   FFFFFFFFFF
line 7:   GGGGGGGGGG
line 8:   HHHHHHHHHH
line 9:   IIIIIIIIII
line 10:   JJJJJJJJJJ

***  MAIN  STOP  ***

View the standard output file for process 1, mydata.out.1:

***  MAIN START  ***

program name:          condor_exec.exe
command line argument: 1
current file position:   0
rtn_val = 0
starting file position:   110
line 11:   KKKKKKKKKK
line 12:   LLLLLLLLLL
line 13:   MMMMMMMMMM
line 14:   NNNNNNNNNN
line 15:   OOOOOOOOOO
line 16:   PPPPPPPPPP
line 17:   QQQQQQQQQQ
line 18:   RRRRRRRRRR
line 19:   SSSSSSSSSS
line 20:   TTTTTTTTTT
rtn_val = 0
starting file position:   0
line 0:   AAAAAAAAAA
rtn_val = 0
starting file position:   220
line 21:   UUUUUUUUUU

***  MAIN  STOP  ***

Process 1 also demonstrates additional random accesses within the file.

View the standard output file for process 2, mydata.out.2:

***  MAIN START  ***

program name:          condor_exec.exe
command line argument: 2
current file position:   0
rtn_val = 0
starting file position:   220
line 21:   UUUUUUUUUU
line 22:   VVVVVVVVVV
line 23:   WWWWWWWWWW
line 24:   XXXXXXXXXX
line 25:   YYYYYYYYYY
line 26:   ZZZZZZZZZZ
line 27:   0000000000
line 28:   1111111111
line 29:   2222222222
line 30:   3333333333

***  MAIN  STOP  ***

HTCondor's parameter sweep, when applied to a single, large data file, offers a huge potential. Simply adding a large number to the queue command in a job submission file applies several compute servers to the data processing.

Transfer a Subdirectory

To review, HTCondor is unable to transfer a subdirectory of data files to a compute server. While the submit command transfer_input_files allows paths when specifying which input files to transfer, HTCondor places all transferred files in a single, flat directory where the executable and standard input file reside - the temporary working directory on the compute server. Therefore, the executing program must access input files without paths.

A similar situation exists for output files. If the program creates output files during execution, it must create them within the temporary working directory. HTCondor transfers back all new and modified files within the temporary working directory - the output files. To transfer back only a subset of these files, use the submit command transfer_output_files. HTCondor does not support the transfer of output files that exist but that do not reside within the temporary working directory on the compute server.

This restriction need not deter the user with a subdirectory of input and output files. The user simply makes an archive file of the subdirectory structure with the tar utility and tells HTCondor to transfer the tar file. The application may then un-tar the archive before reading the input files. The application may also write output files which reside within the subdirectory. As its final step, the application archives the files which the job created or modified. HTCondor will see this archive as an output file and transfer it from the compute server to the user's working directory on the submission host. Finally, the user extracts the output files from the archive.

The computer program, myprogram.c, reads a formatted data file and writes a formatted data file. This example assumes that a formatted input file, myinputdata, exists in a subdirectory named mysubdirectory. The result is a formatted output file, myoutputdata, in the same subdirectory. The program uses the tar utility to extract the subdirectory structure on the compute server. After the program writes the output file, it uses the tar utility again to archive only the subdirectory's output file(s). To compile this program for the Vanilla Universe, see Compiling Serial Programs.
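
A minimal C sketch of this extract-process-archive pattern, using the file names of this example, might look like the following. It is not the original myprogram.c; the processing step is a placeholder:

/* Minimal sketch of the archive-in/archive-out pattern (not the original
 * myprogram.c).  File names match this example: myarchive.i.tar,
 * mysubdirectory/myinputdata, mysubdirectory/myoutputdata. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *in, *out;
    char buf[128];

    /* Extract the transferred archive into the temporary working directory. */
    if (system("tar xf myarchive.i.tar") != 0)
        return 1;

    /* Read the formatted input file from the subdirectory. */
    in = fopen("mysubdirectory/myinputdata", "r");
    if (in == NULL) { perror("myinputdata"); return 1; }
    fgets(buf, sizeof buf, in);
    fclose(in);

    /* Write the formatted output file into the same subdirectory. */
    out = fopen("mysubdirectory/myoutputdata", "w");
    if (out == NULL) { perror("myoutputdata"); return 1; }
    fprintf(out, "processed: %s", buf);   /* placeholder for real processing */
    fclose(out);

    /* Archive only the output file so HTCondor transfers the archive back. */
    return (system("tar cf myarchive.o.tar mysubdirectory/myoutputdata") == 0) ? 0 : 1;
}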

This example assumes that the current working directory has a subdirectory containing a formatted input file. The tar utility prepares the archive of input files:

$ tar cf myarchive.i.tar mysubdirectory

Prepare a job submission file, myprogram.sub. Specify the Vanilla Universe and the file transfer mechanism as "on":

# FILENAME:  myprogram.sub

universe = VANILLA

executable = myprogram

# Specify the archive as the input data file.
transfer_input_files = myarchive.i.tar

# Turn on file transfer mechanism.
should_transfer_files   = YES

# Let HTCondor handle output file(s): myarchive.o.tar.
when_to_transfer_output = ON_EXIT

# Standard output files, HTCondor log file
output = mydata.out
error  = mydata.err
log    = mydata.log

# queue one job
queue

To submit the executable to HTCondor:

$ condor_submit myprogram.sub

The standard output file, mydata.out, shows the evolution of the current working directory on the compute server. Initially, it shows that HTCondor transferred the tar file which contains the archived subdirectory of input data. After extraction, the subdirectory mysubdirectory and its formatted input file myinputdata are visible. After processing, the formatted output file myoutputdata is also visible:

total 24
-rwxr-xr-x 1 myusername itap  8708 Nov 12 15:27 condor_exec.exe
-rw-r--r-- 1 myusername itap 10240 Nov 12 15:27 myarchive.i.tar
-rw-r--r-- 1 myusername itap     0 Nov 12 15:30 mydata.err
-rw-r--r-- 1 myusername itap     0 Nov 12 15:30 mydata.out
total 32
-rwxr-xr-x 1 myusername itap  8708 Nov 12 15:27 condor_exec.exe
-rw-r--r-- 1 myusername itap 10240 Nov 12 15:27 myarchive.i.tar
-rw-r--r-- 1 myusername itap     0 Nov 12 15:30 mydata.err
-rw-r--r-- 1 myusername itap   227 Nov 12 15:30 mydata.out
drwxr-x--- 3 myusername itap  4096 Feb 14  2008 mysubdirectory
total 8
drwx------ 2 myusername itap 4096 Feb 14  2008  ..
-rw-r--r-- 1 myusername itap   19 Jul 12  2007 myinputdata
total 12
drwx------ 2 myusername itap 4096 Feb 14  2008  ..
-rw-r--r-- 1 myusername itap   19 Jul 12  2007 myinputdata
-rw-r--r-- 1 myusername itap   28 Nov 12 15:30 myoutputdata
***  MAIN START  ***

formatted input/output: textinsubdirectory

***  MAIN  STOP  ***

At job completion, HTCondor sees the file myarchive.o.tar as an output file and transfers it to the submission host. After the transfer, the user extracts the output file(s) from this archive:

$ tar xf myarchive.o.tar mysubdirectory/myoutputdata

View the log file, mydata.log:

000 (342352.000.000) 11/12 15:29:31 Job submitted from host: <128.211.157.86:47933>
...
001 (342352.000.000) 11/12 15:30:55 Job executing on host: <128.211.157.10:59987?PrivNet=condor.ccb.purdue.edu>
...
005 (342352.000.000) 11/12 15:30:56 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    11094  -  Run Bytes Sent By Job
    18948  -  Run Bytes Received By Job
    11094  -  Total Bytes Sent By Job
    18948  -  Total Bytes Received By Job

The log file records the main events related to the processing of this job. The log shows the number of bytes transferred between the submission host and the compute server via HTCondor's file transfer mechanism.

Requiring Specific Amounts of Memory

Some applications require compute nodes with a certain minimum amount of memory. These applications may also perform better when even more memory is available on the compute node.

This section illustrates how to submit a small job to a BoilerGrid compute node with at least 16 GB of memory (requirements) and to prefer compute nodes with even more memory (rank), if available. This example, myprogram.c, displays the name of the host which runs the job, the path name of the current working directory, and the contents of that directory.
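
A minimal C sketch of that behavior (it is not the original myprogram.c) might look like this:

/* Minimal sketch (not the original myprogram.c): print the host name and
 * current working directory, then list the directory contents. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <limits.h>

int main(void)
{
    char host[256];
    char cwd[PATH_MAX];

    printf("***  MAIN START  ***\n\n");

    if (gethostname(host, sizeof host) == 0)   /* name of the compute node */
        printf("%s\n", host);
    if (getcwd(cwd, sizeof cwd) != NULL)       /* temporary working directory */
        printf("%s\n", cwd);
    fflush(stdout);                            /* keep printf output ahead of ls */

    system("ls -l");                           /* contents of that directory */

    printf("\n***  MAIN  STOP  ***\n");
    return 0;
}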

Prepare a job submission file with an appropriate filename, here named myjob.sub:

# FILENAME:  myjob.sub

universe = VANILLA

# Require a compute node with at least 16 GB of memory.
# (Nodes with a nominal 16 GB report a TotalMemory of roughly 16046 MB.)
requirements = TotalMemory >= 16046

# Prefer a compute node with more than 16 GB, if available.
rank = TotalMemory

# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable          = myprogram

# Turn on HTCondor's file transfer mechanism only when needed.
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT

# Standard I/O files, HTCondor log file
output = myjob.out
error  = myjob.err
log    = myjob.log

# queue one job
queue

The ClassAd attribute TotalMemory gives the amount of memory on a compute node, in units of megabytes. To change this example to request at least 32 GB of total memory, replace "16046" with "32192". For at least 48 GB, use "48297".
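
For example, a hypothetical variant of the requirements line above for a 32 GB minimum would be:

# Require a compute node with at least 32 GB of memory.
requirements = TotalMemory >= 32192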

This example assumes that all compute nodes have a definition for the attribute TotalMemory. To see how many compute nodes in BoilerGrid do not have the attribute TotalMemory defined:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint 'TotalMemory =?= undefined'

There is no output since all compute nodes of BoilerGrid do have this attribute defined.

Before submitting your job, you may wish to verify that a sufficient number of compute nodes satisfy your requirements and that those same compute nodes define the attributes referenced in the rank command. To see how many compute nodes satisfy your requirements:

$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint 'TotalMemory >= 16046'

                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX 26093 18007    3330      4753       3          0        0
               Total 26093 18007    3330      4753       3          0        0

There are 26,093 compute nodes with at least 16 GB of memory.

View results in the file for all standard output, here named myjob.out:

cms-100.rcac.purdue.edu
(none)
/home/myusername/condor/Introduction/memory
total 224
-rw-r--r-- 1 myusername itap 1508 Mar 11 14:38 README
-rw-r--r-- 1 myusername itap    0 Mar 11 15:36 myjob.err
-rw-r--r-- 1 myusername itap  791 Mar 11 15:36 myjob.log
-rw-r--r-- 1 myusername itap   77 Mar 11 15:36 myjob.out
-rw-r----- 1 myusername itap  663 Mar 11 15:20 myjob.sub
-rwxr-xr-x 1 myusername itap 6939 Mar 11 14:38 myprogram
-rw-r----- 1 myusername itap  488 Mar 11 14:40 myprogram.c
-rwxr----- 1 myusername itap   58 Mar 11 14:38 run
***  MAIN START  ***


***  MAIN  STOP  ***

This job happened to run on compute node cms-100, which has 8 processor cores; condor_status reports one slot per core, so the node's TotalMemory value appears eight times below. To verify that cms-100 has at least 16 GB of memory:

$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'Machine=="cms-100.rcac.purdue.edu"' -format "%s\n" TotalMemory

16046
16046
16046
16046
16046
16046
16046
16046

For more information about requirements and rank:

Requiring Specific Architectures or Operating Systems

You compile a computer program to run on a specific combination of chip architecture and operating system; this combination is a platform. BoilerGrid contains compute nodes of many different platforms, so you must often specify which platform your program requires to ensure that your job runs on a compatible compute node. The predominant platform on BoilerGrid is 64-bit Linux ("X86_64/LINUX"). To see a list of all platforms available on BoilerGrid:

$ condor_status -pool boilergrid.rcac.purdue.edu -total

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

         INTEL/LINUX   114    18       0        60       0          0       36
           INTEL/OSX     2     0       0         2       0          0        0
       INTEL/WINNT51   334     8       0       326       0          0        0
       INTEL/WINNT61  6299   982       0      5317       0          0        0
    SUN4u/SOLARIS210     3     0       0         3       0          0        0
        X86_64/LINUX 30170 19460    4559      6150       0          0        1

               Total 36922 20468    4559     11858       0          0       37

The name "INTEL" as used on BoilerGrid means 32-bit Intel-compatible hardware, and it makes no distinction between Intel and AMD CPUs. The name "X86_64" is a vendor-neutral term to refer to 64-bit architecture from either Intel or AMD. The name "WINNT51" means Windows XP, and "WINNT61" means Windows 7.

By default, HTCondor sends a job to a compute node whose architecture and operating system match the platform of the host from which you submitted the job. You may, however, submit jobs to compute nodes whose platform differs from that of the submission host. For example, you may compile a program to run on a Windows machine and submit the executable file to BoilerGrid from one of BoilerGrid's Linux submission hosts by specifying that the job requires a Windows compute node:

executable   = myprogram.exe
requirements = (ARCH == "INTEL") && ((OPSYS == "WINNT51") || (OPSYS == "WINNT61"))

It is possible to let HTCondor use a larger pool of compute nodes for a job if executables are available for multiple platforms. You need only take care not to reference, within your job submission, any absolute paths that are specific to one platform or installation. You can often use existing ClassAd variables instead of fixed paths to make platform-independent submission files.
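
For example, HTCondor's $$() substitution macros can insert attributes of the matched machine into the submission at match time. The following sketch assumes that separately compiled executables named myprogram.LINUX.INTEL and myprogram.LINUX.X86_64 exist in the submission directory; the names and the requirements expression are illustrative and not part of this example's files:

# Hypothetical sketch: accept both 32-bit and 64-bit Linux nodes and let
# HTCondor substitute the matched node's OpSys and Arch into the
# executable name, e.g. myprogram.LINUX.X86_64.
requirements = ((Arch == "INTEL") || (Arch == "X86_64")) && (OpSys == "LINUX")
executable   = myprogram.$$(OpSys).$$(Arch)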

For more information about requirements and rank:

Requiring Specific Clusters or Compute Nodes

ITaP research resources include several clusters. Currently, these include the following:

Radon
Peregrine 1
Steele
Coates
Rossmann
Hansen
Carter

This section illustrates how to apply HTCondor ClassAds to submit a small job to a node in some subset of ITaP resources. These examples execute a simple shell script which displays the name of the compute node which ran the job.

Prepare a shell script with an appropriate filename, here named myjob.sh:

#!/bin/sh
# FILENAME:  myjob.sh

hostname

Change the permissions of the shell script to allow execution by the owner (you):

$ chmod u+x myjob.sh

Executing only on a node of one or more specific research clusters

Prepare a job submission file with an appropriate filename, here named myjob.sub. The ClassAd requires that the chosen compute node reside on either of two clusters. This example also specifies the shell script myjob.sh as the executable, transfers the shell script to the compute node, and uses HTCondor's file transfer mechanism to copy input and output files only if needed (if the same file system is not available on the compute node):

# FILENAME:  myjob.sub

universe = VANILLA

# Require a compute node of either the Steele or Coates cluster.
# Attribute name is not case sensitive; attribute value is.
requirements = (CLUSTERNAME=="Steele") || (clustername=="Coates")

# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable          = myjob.sh

# Turn on HTCondor's file transfer mechanism only when needed.
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT

# Standard I/O files, HTCondor log file
output = myjob.out
error  = myjob.err
log    = myjob.log

# queue one job
queue

Submit the job:

$ condor_submit myjob.sub

View job status:

$ condor_q myusername

View results in the file for all standard output, here named myjob.out:

coates-d020.rcac.purdue.edu

Executing only on one specific compute node

Prepare a job submission file with an appropriate filename, here named myjob.sub. The ClassAd requires a specific compute node. This example also specifies the shell script myjob.sh as the executable, transfers the shell script to the compute node, and uses HTCondor's file transfer mechanism to copy input and output files only if needed (if the same file system is not available on the compute node):

# FILENAME:  myjob.sub

universe = VANILLA

# Require a specific compute node.
requirements = Machine=="miner-a500.rcac.purdue.edu"

# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable          = myjob.sh

# Turn on HTCondor's file transfer mechanism only when needed.
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT

# Standard I/O files, HTCondor log file
output = myjob.out
error  = myjob.err
log    = myjob.log

# queue one job
queue

Submit the job:

$ condor_submit myjob.sub

View job status:

$ condor_q myusername

View results in the file for all standard output, here named myjob.out:

miner-a500.rcac.purdue.edu

Executing on any compute node of a cluster except one

If you discover a compute node that is consistently available but consistently fails to run your job, you may exclude that node from the set of candidate nodes.

Prepare a job submission file with an appropriate filename, here named myjob.sub. The ClassAd excludes one specific compute node of a chosen cluster. This example also specifies the shell script myjob.sh as the executable, transfers the shell script to the compute node, and uses HTCondor's file transfer mechanism to copy input and output files only if needed (if the same file system is not available on the compute node):

# FILENAME:  myjob.sub

universe = VANILLA

# Exclude a specific compute node.
requirements = ClusterName=="Miner" && Machine!="miner-a500.rcac.purdue.edu"

# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable          = myjob.sh

# Turn on HTCondor's file transfer mechanism only when needed.
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT

# Standard I/O files, HTCondor log file
output = myjob.out
error  = myjob.err
log    = myjob.log

# queue one job
queue

Submit the job:

$ condor_submit myjob.sub

View job status:

$ condor_q myusername

View results in the file for all standard output, here named myjob.out:

miner-a502.rcac.purdue.edu

For more information about requirements and rank:

BoilerGrid Frequently Asked Questions (FAQ)

There are currently no FAQs for BoilerGrid.