BoilerGrid - User Guide

  • 1  Conventions Used in this Document

    This document follows certain typesetting and naming conventions:

    • Colored, underlined text indicates a link.
    • Colored, bold text highlights something of particular importance.
    • Italicized text notes the first use of a key concept or term.
    • Bold, fixed-width font text indicates a command or command argument that you type verbatim.
    • Examples of commands and output as you would see them on the command line will appear in colored blocks of fixed-width text such as this:
      $ example
      This is an example of commands and output.
      
    • All command line shell prompts appear as a single dollar sign ("$"). Your actual shell prompt may differ.
    • All examples work with bash or ksh shells. Where different, changes needed for tcsh or csh shell users appear in example comments.
    • All names that begin with "my" illustrate examples that you replace with an appropriate name. These include "myusername", "myfilename", "mydirectory", "myjobid", etc.
    • The term "processor core" or "core" throughout this guide refers to the individual CPU cores on a processor chip.
  • 2  Overview of BoilerGrid

    BoilerGrid is a large, high-throughput, distributed computing system operated by ITaP and built on the HTCondor system developed by the HTCondor Project at the University of Wisconsin. BoilerGrid provides a way for you to run programs on large numbers of otherwise idle computers in various locations, including any temporarily under-utilized high-performance cluster resources as well as some desktop machines not currently in use.

    Whenever a local user or scheduled job needs a machine back, HTCondor stops its job and sends it to another HTCondor node as soon as possible. Because this model limits the ability to do parallel processing and communications, BoilerGrid is only appropriate for relatively quick serial jobs.

    • 2.1  Detailed Hardware Specification

      BoilerGrid scavenges cycles from many ITaP research systems. BoilerGrid also uses idle time of machines around the Purdue West Lafayette campus. Whenever the primary scheduling system on any of these machines needs a compute node back or a user sits down and starts to use a desktop computer, HTCondor will stop its job and, if possible, checkpoint its work. HTCondor then immediately tries to restart this job on some other available compute node in BoilerGrid.

      A recent snapshot of BoilerGrid found 36,524 total processor cores. Memory on compute nodes ranges from 512 MB to 192 GB, and most processors run at 2 GHz or faster. With a total of over 60 TFLOPS available, BoilerGrid can provide large numbers of cycles in a short amount of time. HTCondor offers high-throughput computing and is excellent for parameter sweeps, Monte Carlo simulations, or nearly any serial application that can run in one hour or less.

      BoilerGrid currently uses HTCondor 7.8.7.

  • 3  Accounts on BoilerGrid

    • 3.1  Obtaining an Account

      All Purdue faculty and staff, and students with the approval of their advisor, may request access to BoilerGrid. However, if you have an account on Radon or any of the ITaP Community Clusters (Carter, Hansen, Rossmann, Coates, Steele, and Peregrine 1), then you already have access to BoilerGrid. Refer to the Accounts / Access page for more details on how to request access.

    • 3.2  Login / SSH

      To submit jobs on BoilerGrid, log in to the submission host condor.rcac.purdue.edu via SSH. This submission host is actually three front-end hosts: condor-fe00, condor-fe01, and condor-fe02. The login process randomly assigns one of these three front-ends to each login to condor.rcac.purdue.edu. While the three front-end hosts are identical, each has its own HTCondor queue. When you submit jobs to the HTCondor queue from the front-end named condor-fe00, you will not see those jobs on the HTCondor queue while logged in to either condor-fe01 or condor-fe02. To ensure that you always see the same HTCondor queue, log in to the same front-end.
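
      For example, from a command-line SSH client (substitute your own username):

      $ ssh myusername@condor.rcac.purdue.edu

      To always land on the same front-end, you may also connect to one directly; the fully qualified front-end hostname shown here is an assumption based on the front-end names above:

      $ ssh myusername@condor-fe00.rcac.purdue.edu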

      Each front-end host has its own /tmp directory. Because each new login may land on a different front-end, data placed in /tmp during one session may not be available in a later session. ITaP advises using scratch space instead for data shared across sessions.

      • 3.2.1  SSH Client Software

        Secure Shell or SSH is a way of establishing a secure (encrypted) connection between two computers. It uses public-key cryptography to authenticate the remote computer and (optionally) to allow the remote computer to authenticate the user. Its usual function involves logging in to a remote machine and executing commands, but it also supports tunneling and forwarding of X11 or arbitrary TCP connections. There are many SSH clients available for all operating systems.

        Linux / Solaris / AIX / HP-UX / Unix:

        • The ssh command is pre-installed. Log in using ssh myusername@servername.

        Microsoft Windows:

        • PuTTY is an extremely small download of a free, full-featured SSH client.
        • Secure CRT is a commercial SSH client which is freely available to Purdue students, faculty, and staff with a Purdue career account.

        Mac OS X:

        • The ssh command is pre-installed. You may start a local terminal window from "Applications->Utilities". Log in using ssh myusername@servername.
      • 3.2.2  SSH Keys

        SSH works with many different means of authentication. One popular authentication method is Public Key Authentication (PKA). PKA is a method of establishing your identity to a remote computer using related sets of encryption data called keys. PKA is a more secure alternative to traditional password-based authentication with which you are probably familiar.

        To employ PKA via SSH, you manually generate a keypair (also called SSH keys) in the location from where you wish to initiate a connection to a remote machine. This keypair consists of two text files: private key and public key. You keep the private key file confidential on your local machine or local home directory (hence the name "private" key). You then log in to a remote machine (if possible) and append the corresponding public key text to the end of a specific file, or have a system administrator do so on your behalf. In future login attempts, PKA compares the public and private keys to verify your identity; only then do you have access to the remote machine.

        As a user, you can create, maintain, and employ as many keypairs as you wish. If you connect to a computational resource from your work laptop, your work desktop, and your home desktop, you can create and employ keypairs on each. You can also create multiple keypairs on a single local machine to serve different purposes, such as establishing access to different remote machines or establishing different types of access to a single remote machine. In short, PKA via SSH offers a secure but flexible means of identifying yourself to all kinds of computational resources.
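
        As a minimal sketch of this process from a Linux or Mac OS X client (the key type is illustrative, and ssh-copy-id is assumed to be available in your local OpenSSH installation):

          (generate the keypair; accept the default location and enter a passphrase when prompted)
        $ ssh-keygen -t rsa
          (append your public key to ~/.ssh/authorized_keys on the remote machine)
        $ ssh-copy-id myusername@condor.rcac.purdue.edu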

        Passphrases and SSH Keys

        Creating a keypair prompts you to provide a passphrase for the private key. This passphrase is different from a password in a number of ways. First, a passphrase is, as the name implies, a phrase. It can include most types of characters, including spaces, and has no limits on length. Secondly, the remote machine does not receive this passphrase for verification. Its purpose is only to allow the use of your local private key and is specific to a specific local private key.

        Perhaps you are wondering why you would need a private key passphrase at all when using PKA. If the private key remains secure, why the need for a passphrase just to use it? Indeed, if the location of your private keys were always completely secure, a passphrase might not be necessary. In reality, a number of situations could arise in which someone may improperly gain access to your private key files. In these situations, a passphrase offers another level of security for you, the user who created the keypair.

        Think of the private key/passphrase combination as being analogous to your ATM card/PIN combination. The ATM card itself is the object that grants access to your important accounts, and as such, should remain secure at all times—just as a private key should. But if you ever lose your wallet or someone steals your ATM card, you are glad that your PIN exists to offer another level of protection. The same is true for a private key passphrase.

        When you create a keypair, you should always provide a corresponding private key passphrase. For security purposes, avoid using phrases which automated programs can discover (e.g. phrases that consist solely of words in English-language dictionaries). This passphrase is not recoverable if forgotten, so make note of it. Only a few situations warrant using a non-passphrase-protected private key—conducting automated file backups is one such situation. If you need to use a non-passphrase-protected private key to conduct automated backups to Fortress, see the No-Passphrase SSH Keys section.

      • 3.2.3  SSH X11 Forwarding

        SSH supports tunneling of X11 (X-Windows), so you may run X11 applications on the machine you are using to issue jobs to BoilerGrid. However, running an X11 application via HTCondor is not possible.
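
        For example, to log in with X11 forwarding enabled and test it by launching a graphical editor (the -Y option requests trusted X11 forwarding; gedit is only an illustration):

        $ ssh -Y myusername@condor.rcac.purdue.edu
        $ gedit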

      • 3.2.4  Thinlinc Remote Desktop

        ITaP Research Computing provides ThinLinc as an alternative to running an X11 server directly on your computer. It allows you to run graphical applications or graphical interactive jobs directly on BoilerGrid through a persistent remote graphical desktop session.

        ThinLinc is a service that allows you to connect to a persistent remote graphical desktop session. This service works very well over a high-latency, low-bandwidth, or off-campus connection compared to running an X11 server locally. It is also very helpful for Windows users who do not have an easy-to-use local X11 server, as little to no setup is required on your computer.

        There are two ways to use ThinLinc: through the native client (preferred) or through a web browser.

        Installing the ThinLinc native client

        The native ThinLinc client offers the best experience, especially over off-campus connections, and is the recommended way to use ThinLinc. It is compatible with Windows, Mac OS X, and Linux.

        • Download the ThinLinc client from the ThinLinc website.
        • Start the ThinLinc client on your computer.
        • In the client's login window, use thinlinc.rcac.purdue.edu as the Server. Use your Purdue Career Account username and password.
        • Click the Connect button.
        • Continue to the following section on connecting to BoilerGrid from ThinLinc.

        Using ThinLinc through your web browser

        You can also access the ThinLinc service from your web browser instead of installing the native client. This option requires no setup and is a good choice on computers where you do not have privileges to install software. All that is required is an up-to-date web browser; older versions of Internet Explorer may not work.

        • Open a web browser and navigate to thinlinc.rcac.purdue.edu.
        • Log in with your Purdue Career Account username and password.
        • You may safely proceed past any warning messages from your browser.
        • Continue to the following section on connecting to BoilerGrid from ThinLinc.
          • Connecting to BoilerGrid from ThinLinc

            • Once logged in, you will be presented with a remote Linux desktop.
            • Open the terminal application on the remote desktop.
            • Log in to the submission host boilergrid.rcac.purdue.edu with X forwarding enabled using the following command:
              $ ssh -Y boilergrid.rcac.purdue.edu 
            • Once logged in to the BoilerGrid head node, you may use graphical editors, debuggers, software like Matlab, or run graphical interactive jobs. For example, to test the X forwarding connection issue the following command to launch the graphical editor gedit:
              $ gedit
            • This session will remain persistent even if you disconnect from the session. Any interactive jobs or applications you left running will continue running even if you are not connected to the session.

            Tips for using ThinLinc native client

            • To exit a full screen ThinLinc session press the F8 key on your keyboard (fn + F8 key for Mac users) and click to disconnect or exit full screen.
            • Full screen mode can be disabled when connecting to a session by clicking the Options button and disabling full screen mode from the Screen tab.
    • 3.3  Passwords

      If you have received a default password as part of the process of obtaining your account, you should change it before you log on to BoilerGrid for the first time. Change your password from the SecurePurdue website. You will have the same password on all ITaP systems such as BoilerGrid, Purdue email, or Blackboard.

      Passwords may need to be changed periodically in accordance with Purdue security policies. Passwords must follow the guidelines described on the SecurePurdue webpage, and ITaP recommends following those guidelines to select a strong password.

      ITaP staff will NEVER ask for your password, by email or otherwise.

      Never share your password with another user or make your password known to anyone else.

    • 3.4  Email

      There is no local email delivery available on BoilerGrid. BoilerGrid forwards all email which it receives to your career account email address.

    • 3.5  Login Shell

      Your shell is the program that generates your command-line prompt and processes commands. On ITaP research systems, several common shell choices are available:

      Name Description Path
      bash A Bourne-shell (sh) compatible shell with many newer advanced features as well. Bash is the default shell for new ITaP research system accounts. This is the most common shell in use on ITaP research systems. /bin/bash
      tcsh An advanced variant on csh with all the features of modern shells. Tcsh is the second most popular shell in use today. /bin/tcsh
      zsh An advanced shell which incorporates all the functionality of bash and tcsh combined, usually with identical syntax. /bin/zsh

      To find out what shell you are running right now, simply use the ps command:

      $ ps
        PID TTY          TIME CMD
      30181 pts/27   00:00:00 bash
      30273 pts/27   00:00:00 ps
      

      To use a different shell on a one-time or trial basis, simply type the shell name as a command. To return to your original shell, type exit:

      $ ps
        PID TTY          TIME CMD
      30181 pts/27   00:00:00 bash
      30273 pts/27   00:00:00 ps
      
      $ tcsh
      % ps
        PID TTY          TIME CMD
      30181 pts/27   00:00:00 bash
      30313 pts/27   00:00:00 tcsh
      30315 pts/27   00:00:00 ps
      
      % exit
      $
      

      To permanently change your default login shell, use the secure web form provided to change shells.

      There is a propagation delay which may last up to two hours before this change will take effect. Once propagated you will need to log out and log back in to start in your new shell.

  • 4  File Storage and Transfer for BoilerGrid

    • 4.1  Storage Options

      File storage options on ITaP research systems include long-term storage (home directories, Fortress) and short-term storage (scratch directories, /tmp directory). Each option has different performance and intended uses, and some options vary from system to system as well. ITaP provides daily snapshots of home directories for a limited time for accidental deletion recovery. ITaP does not back up scratch directories or temporary storage and regularly purges old files from scratch and /tmp directories. More details about each storage option appear below.

      • 4.1.1  Home Directory

        ITaP provides home directories for long-term file storage. Each user has one home directory. You should use your home directory for storing important program files, scripts, input data sets, critical results, and frequently used files. You should store infrequently used files on Fortress. Your home directory becomes your current working directory, by default, when you log in.

        ITaP provides daily snapshots of your home directory for a limited period of time in the event of accidental deletion. For additional security, you should store another copy of your files on more permanent storage, such as the Fortress HPSS Archive.

        Your home directory physically resides within the Isilon storage system at Purdue. To find the path to your home directory, first log in then immediately enter the following:

        $ pwd
        /home/myusername
        

        Or from any subdirectory:

        $ echo $HOME
        /home/myusername
        

        Your home directory and its contents are available on all ITaP research computing machines, including front-end hosts and compute nodes.

        Your home directory has a quota limiting the total size of files you may store within. For more information, refer to the Storage Quotas / Limits Section.

        • 4.1.1.1  Lost Home Directory File Recovery

          Only files which have been snap-shotted overnight are recoverable. If you lose a file the same day you created it, it is NOT recoverable.

          To recover files lost from your home directory, use the flost command:

          $ flost
          
      • 4.1.2  Scratch Space

        ITaP provides scratch directories for short-term file storage only. Each file system domain has at least one scratch directory. Each user ID may access one scratch directory in a file system domain. The quota of your scratch directory is several times greater than the quota of your home directory. You should use your scratch directory for storing large temporary input files which your job reads or for writing large temporary output files which you may examine after execution of your job. You should use your home directory and Fortress for longer-term storage or for holding critical results.

        Users of all ITaP research clusters have access to a scratch directory.

        ITaP does not perform backups for scratch directories. In the event of a disk crash or file purge, files in scratch directories are not recoverable. You should copy any important files to more permanent storage.

        ITaP automatically removes (purges) from scratch directories all files stored for more than 90 days. Owners of these files receive a notice one week before removal via email. For more information, please refer to our Scratch File Purging Policy.

        To find the path to your scratch directory:

        $ findscratch
        

        The response from command findscratch depends on your submission host. You may see one of the following paths:

        /scratch/radon/m/myusername
        /scratch/carter/m/myusername
        

        The value of variable $RCAC_SCRATCH is the path of your scratch directory. Use this variable in any scripts. Your actual scratch directory path may change without warning, but this variable will remain current.

        $ echo $RCAC_SCRATCH
        

        The response will be one of the previously listed paths.

        As there is no global scratch filesystem available on all cluster nodes, ITaP recommends the use of Condor's file-transfer capability to move data between the submit and execute nodes.
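
        As a hedged sketch of how this looks in a submit description file (the file, executable, and input names are illustrative, not a complete example for any particular job):

        $ cat mysubmitfile
        universe                = vanilla
        executable              = myprogram
        transfer_input_files    = myinputfile
        should_transfer_files   = YES
        when_to_transfer_output = ON_EXIT
        queue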

        Your scratch directory has a quota capping the size and number of files you may store in it. For more information, refer to the Storage Quotas / Limits Section.

      • 4.1.3  /tmp Directory

        ITaP provides /tmp directories for short-term file storage only. Each front-end and compute node has a /tmp directory. Your program may write temporary data to the /tmp directory of the compute node on which it is running. That data is available for as long as your program is active. Once your program terminates, that temporary data is no longer available. When used properly, /tmp may provide faster local storage to an active process than any other storage option. You should use your home directory and Fortress for longer-term storage or for holding critical results.

        ITaP does not perform backups for the /tmp directory and removes files from /tmp whenever space is low or whenever the system needs a reboot. In the event of a disk crash or file purge, files in /tmp are not recoverable. You should copy any important files to more permanent storage.
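
        For example, a running program might stage intermediate files in /tmp and copy only the final results to more permanent storage (the file and program names are illustrative):

        $ cp $RCAC_SCRATCH/myinputfile /tmp
        $ myprogram /tmp/myinputfile /tmp/myoutputfile
        $ cp /tmp/myoutputfile $RCAC_SCRATCH/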

      • 4.1.4  Long-Term Storage

        Long-term storage (permanent storage) is available to ITaP research users on the High Performance Storage System (HPSS), an archival storage system called Fortress. HPSS is a software package that manages a hierarchical storage system. Program files, data files, and any other files which are not used often, but which must be saved, can be put in permanent storage. Fortress currently has over 10 PB of capacity.

        Files smaller than 100 MB have their primary copy stored on low-cost disks (disk cache), while the second copy (a backup of the disk cache) is on tape or optical disks. This provides a rapid restore time from the disk cache. However, the large latency involved in accessing a larger file (usually a copy from a tape cartridge) makes Fortress unsuitable for direct use by any processes or jobs, even where that is possible. The primary and secondary copies of larger files are stored on separate tape cartridges in the tape library.

        To ensure optimal performance for all users, and to keep the Fortress system healthy, please remember the following tips:

        • Fortress operates most effectively with large files - 1GB or larger. If your data consists of smaller files, use HTAR to create archives directly in Fortress.
        • When working with files on cluster head nodes, use your home directory or a scratch file system, rather than editing or computing on files directly in Fortress. Copy any data you wish to archive to Fortress after computation is complete.

        Fortress writes two copies of every file, either to two tapes or to disk and a tape, to protect against medium errors. Unfortunately, Fortress does not automatically switch to the alternate copy when it has trouble accessing the primary. If it seems to be taking an extraordinary amount of time to retrieve a file (hours), please email rcac-help@purdue.edu. We can then investigate why it is taking so long. If it is an error on the primary copy, we will instruct Fortress to switch to the alternate copy as the primary and recreate a new alternate copy.

        For more information about Fortress, how it works, user guides, and how to obtain an account:

        • 4.1.4.1  Manual File Transfer to Long-Term Storage

          There are a variety of ways to manually transfer files to your Fortress home directory for long-term storage.

          • 4.1.4.1.1  HSI

            HSI, the Hierarchical Storage Interface, is the preferred method of transferring files between BoilerGrid and Fortress. HSI is designed to be a friendly interface for users of the High Performance Storage System (HPSS). It provides a familiar Unix-style environment for working within HPSS while automatically taking advantage of high-speed, parallel file transfers without requiring any special user knowledge.

            HSI is provided on all ITaP research systems as the command hsi. HSI is also available for Download for many operating systems.

            Interactive usage:

            $ hsi
            
            *************************************************************************
            *                    Purdue University
            *                  High Performance Storage System (HPSS)
            *************************************************************************
            * This is the Purdue Data Archive, Fortress.  For further information
            * see http://www.rcac.purdue.edu/storage/fortress/
            *
            *   If you are having problems with HPSS, please call IT/Operational
            *   Services at 49-44000 or send E-mail to rcac-help@purdue.edu.
            *
            *************************************************************************
            Username: myusername  UID: 12345  Acct: 12345(12345) Copies: 1 Firewall: off [hsi.3.5.8 Wed Sep 21 17:31:14 EDT 2011]
            
            [Fortress HSI]/home/myusername->put data1.fits
            put  'data1.fits' : '/home/myusername/data1.fits' ( 1024000000 bytes, 250138.1 KBS (cos=11))
            
            [Fortress HSI]/home/myusername->lcd /tmp
            
            [Fortress HSI]/home/myusername->get data1.fits
            get  '/tmp/data1.fits' : '/home/myusername/data1.fits' (2011/10/04 16:28:50 1024000000 bytes, 325844.9 KBS )
            
            [Fortress HSI]/home/myusername->quit
            

            Batch transfer file:

            put data1.fits
            put data2.fits
            put data3.fits
            put data4.fits
            put data5.fits
            put data6.fits
            put data7.fits
            put data8.fits
            put data9.fits
            

            Batch usage:

            $ hsi < my_batch_transfer_file
            *************************************************************************
            *                    Purdue University
            *                  High Performance Storage System (HPSS)
            *************************************************************************
            * This is the Purdue Data Archive, Fortress.  For further information
            * see http://www.rcac.purdue.edu/storage/fortress/
            *
            *   If you are having problems with HPSS, please call IT/Operational
            *   Services at 49-44000 or send E-mail to rcac-help@purdue.edu.
            *
            *************************************************************************
            Username: myusername  UID: 12345  Acct: 12345(12345) Copies: 1 Firewall: off [hsi.3.5.8 Wed Sep 21 17:31:14 EDT 2011]
            put  'data1.fits' : '/home/myusername/data1.fits' ( 1024000000 bytes, 250200.7 KBS (cos=11))
            put  'data2.fits' : '/home/myusername/data2.fits' ( 1024000000 bytes, 258893.4 KBS (cos=11))
            put  'data3.fits' : '/home/myusername/data3.fits' ( 1024000000 bytes, 222819.7 KBS (cos=11))
            put  'data4.fits' : '/home/myusername/data4.fits' ( 1024000000 bytes, 224311.9 KBS (cos=11))
            put  'data5.fits' : '/home/myusername/data5.fits' ( 1024000000 bytes, 323707.3 KBS (cos=11))
            put  'data6.fits' : '/home/myusername/data6.fits' ( 1024000000 bytes, 320322.9 KBS (cos=11))
            put  'data7.fits' : '/home/myusername/data7.fits' ( 1024000000 bytes, 253192.6 KBS (cos=11))
            put  'data8.fits' : '/home/myusername/data8.fits' ( 1024000000 bytes, 253056.2 KBS (cos=11))
            put  'data9.fits' : '/home/myusername/data9.fits' ( 1024000000 bytes, 323218.9 KBS (cos=11))
            EOF detected on TTY - ending HSI session
            

            For more information about HSI:

          • 4.1.4.1.2  HTAR

            HTAR (short for "HPSS TAR") is a utility program that writes TAR-compatible archive files directly into Fortress, without having to first create a local file. Its command line was originally based on the AIX tar program, with a number of extensions added to provide extra features.

            HTAR is provided on all ITaP research systems as the command htar. HTAR is also available for Download for many operating systems.

            Usage:

              (Create a tar archive in Fortress named data.tar including all files with the extension ".fits".)
            $ htar -cvf data.tar *.fits
            HTAR: a   data1.fits
            HTAR: a   data2.fits
            HTAR: a   data3.fits
            HTAR: a   data4.fits
            HTAR: a   data5.fits
            HTAR: a   data6.fits
            HTAR: a   data7.fits
            HTAR: a   data8.fits
            HTAR: a   data9.fits
            HTAR: a   /tmp/HTAR_CF_CHK_17953_1317760775
            HTAR Create complete for data.tar. 9,216,006,144 bytes written for 9 member files, max threads: 3 Transfer time: 29.622 seconds (311.121 MB/s)
            HTAR: HTAR SUCCESSFUL
            
              (Unpack a tar archive stored in Fortress named data.tar into a scratch directory for use in a batch job.)
            $ cd $RCAC_SCRATCH/job_dir
            $ htar -xvf data.tar
            HTAR: x data1.fits, 1024000000 bytes, 2000001 media blocks
            HTAR: x data2.fits, 1024000000 bytes, 2000001 media blocks
            HTAR: x data3.fits, 1024000000 bytes, 2000001 media blocks
            HTAR: x data4.fits, 1024000000 bytes, 2000001 media blocks
            HTAR: x data5.fits, 1024000000 bytes, 2000001 media blocks
            HTAR: x data6.fits, 1024000000 bytes, 2000001 media blocks
            HTAR: x data7.fits, 1024000000 bytes, 2000001 media blocks
            HTAR: x data8.fits, 1024000000 bytes, 2000001 media blocks
            HTAR: x data9.fits, 1024000000 bytes, 2000001 media blocks
            HTAR: Extract complete for data.tar, 9 files. total bytes read: 9,216,004,608 in 33.914 seconds (271.749 MB/s )
            HTAR: HTAR SUCCESSFUL
            
              (Look at the contents of the data.tar HTAR archive stored in Fortress.)
            $ htar -tvf data.tar
            HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:30  data1.fits
            HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:35  data2.fits
            HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:35  data3.fits
            HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:35  data4.fits
            HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:35  data5.fits
            HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:35  data6.fits
            HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:35  data7.fits
            HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:35  data8.fits
            HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:35  data9.fits
            HTAR: -rw-------  myusername/pucc        256 2011-10-04 16:39  /tmp/HTAR_CF_CHK_17953_1317760775
            HTAR: Listing complete for data.tar, 10 files 10 total objects
            HTAR: HTAR SUCCESSFUL
            
              (Unpack a single file, "data7.fits", from the tar archive stored in Fortress named data.tar into a scratch directory.)
            $ htar -xvf data.tar data7.fits
            HTAR: x data7.fits, 1024000000 bytes, 2000001 media blocks
            HTAR: Extract complete for data.tar, 1 files. total bytes read: 1,024,000,512 in 3.642 seconds (281.166 MB/s )
            HTAR: HTAR SUCCESSFUL
            

            For more information about HTAR:

          • 4.1.4.1.3  SCP

            Fortress does NOT support SCP.

          • 4.1.4.1.4  SFTP

            Fortress does NOT support SFTP.

    • 4.2  Environment Variables

      There are many environment variables related to storage locations and paths. Logging in automatically sets these environment variables. You may change the variables at any time.

      Use environment variables instead of actual paths whenever possible to avoid problems if the specific paths to any of these change. Some of the environment variables you should have are:

      Name Description
      USER your username
      HOME path to your home directory
      PWD path to your current directory
      RCAC_SCRATCH path to scratch filesystem
      PATH all directories searched for commands/applications
      HOSTNAME name of the machine you are on
      SHELL your current shell (bash, tcsh, csh, ksh)
      SSH_CLIENT your local client's IP address
      TERM type of terminal or terminal emulator being used

      By convention, environment variable names are all uppercase. Use them on the command line or in any scripts in place of and in combination with hard-coded values:

      $ ls $HOME
      ...
      
      $ ls $RCAC_SCRATCH/myproject
      ...
      

      To find the value of any environment variable:

      $ echo $RCAC_SCRATCH
      /scratch/scratch96/m/myusername
      
      $ echo $SHELL
      /bin/tcsh
      

      To list the values of all environment variables:

      $ env
      USER=myusername
      HOME=/home/myusername
      RCAC_SCRATCH=/scratch/scratch96/m/myusername
      SHELL=/bin/tcsh
      ...
      

      You may create or overwrite an environment variable. To pass (export) the value of a variable in either bash or ksh:

      $ export VARIABLE=value
      

      To assign a value to an environment variable in either tcsh or csh:

      % setenv VARIABLE value
      
    • 4.3  Storage Quotas / Limits

      ITaP imposes some limits on your disk usage on research systems. ITaP implements a quota on each filesystem. Each filesystem (home directory, scratch directory, etc.) may have a different limit. If you exceed the quota, you will not be able to save new files or new data to the filesystem until you delete or move data to long-term storage.

      • 4.3.1  Checking Quota Usage

        To check the current quotas of your home and scratch directories use the myquota command:

        $ myquota
        Type        Filesystem          Size    Limit  Use         Files    Limit  Use
        ==============================================================================
        home        extensible         5.0GB   10.0GB  50%             -        -   -
        scratch     /scratch/scratch96/    8KB  476.8GB   0%             2  100,000   0%
        

        The columns are as follows:

        1. Type: indicates home or scratch directory.
        2. Filesystem: name of storage option.
        3. Size: sum of file sizes in bytes.
        4. Limit: allowed maximum on sum of file sizes in bytes.
        5. Use: percentage of file-size limit currently in use.
        6. Files: number of files and directories (not the size).
        7. Limit: allowed maximum on number of files and directories. It is possible, though unlikely, to reach this limit and not the file-size limit if you create a large number of very small files.
        8. Use: percentage of file-number limit currently in use.

        If you find that you have reached your quota in either your home directory or your scratch directory, obtain estimates of your disk usage. Find the top-level directories which have a high disk usage, then study the subdirectories to discover where the heaviest usage lies.

        To see in a human-readable format an estimate of the disk usage of your top-level directories in your home directory:

        $ du -h --max-depth=1 $HOME
        32K /home/myusername/mysubdirectory_1
        529M    /home/myusername/mysubdirectory_2
        608K    /home/myusername/mysubdirectory_3
        

        The second directory is the largest of the three, so apply du to it to see which of its subdirectories account for the usage.
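
        For example, using the subdirectory name from the listing above:

        $ du -h --max-depth=1 $HOME/mysubdirectory_2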

        To see in a human-readable format an estimate of the disk usage of your top-level directories in your scratch file directory:

        $ du -h --max-depth=1 $RCAC_SCRATCH
        160K    /scratch/scratch96/m/myusername
        

        This strategy can be very helpful in figuring out the location of your largest usage. Move unneeded files and directories to long-term storage to free space in your home and scratch directories.

      • 4.3.2  Increasing Your Storage Quota

        Home Directory

        If you find you need additional disk space in your home directory, please first consider archiving and compressing old files and moving them to long-term storage on the Fortress HPSS Archive. If you are unable to do so, you may go to the BoilerBackpack Quota Management site and use the sliders there to increase the amount of space allocated to your research home directory vs. other storage options, up to a maximum of 100GB.

        Scratch Space

        If you find you need additional disk space in your scratch space, please first consider archiving and compressing old files and moving them to long-term storage on the Fortress HPSS Archive. If you are unable to do so, you may ask for a quota increase at rcac-help@purdue.edu. Quota requests up to 2TB and 200,000 files on LustreA or LustreC can be processed quickly.

    • 4.4  Archive and Compression

      There are several options for archiving and compressing groups of files or directories on ITaP research systems. The most commonly used options are:

      • tar   (more information)
        Saves many files together into a single archive file, and restores individual files from the archive. Includes automatic archive compression/decompression options and special features for incremental and full backups.
        Examples:
          (list contents of archive somefile.tar)
        $ tar tvf somefile.tar
        
          (extract contents of somefile.tar)
        $ tar xvf somefile.tar
        
          (extract contents of gzipped archive somefile.tar.gz)
        $ tar xzvf somefile.tar.gz
        
          (extract contents of bzip2 archive somefile.tar.bz2)
        $ tar xjvf somefile.tar.bz2
        
          (archive all ".c" files in current directory into one archive file)
        $ tar cvf somefile.tar *.c
        
          (archive and gzip-compress all files in a directory into one archive file)
        $ tar czvf somefile.tar.gz somedirectory/
        
          (archive and bzip2-compress all files in a directory into one archive file)
        $ tar cjvf somefile.tar.bz2 somedirectory/
        
        
        Other arguments for tar can be explored by using the man tar command.
      • gzip   (more information)
        The standard compression system for all GNU software.
        Examples:
          (compress file somefile - also removes uncompressed file)
        $ gzip somefile
        
          (uncompress file somefile.gz - also removes compressed file)
        $ gunzip somefile.gz
        
      • bzip2   (more information)
        Strong, lossless data compressor based on the Burrows-Wheeler transform. Stronger compression than gzip.
        Examples:
          (compress file somefile - also removes uncompressed file)
        $ bzip2 somefile
        
          (uncompress file somefile.bz2 - also removes compressed file)
        $ bunzip2 somefile.bz2
        

      There are several other, less commonly used, options available as well; brief examples appear below:

      • zip
      • 7zip
      • xz
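
      These examples are illustrative sketches; exact command names, particularly for 7zip, vary between installations:

        (create and extract a zip archive)
      $ zip -r somefile.zip somedirectory/
      $ unzip somefile.zip

        (create a 7zip archive; the command may be 7z or 7za depending on the installation)
      $ 7za a somefile.7z somedirectory/

        (compress and uncompress a file with xz - also removes the original file)
      $ xz somefile
      $ unxz somefile.xz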

    • 4.5  File Transfer

      There are a variety of ways to transfer data to and from ITaP research systems. Which you should use depends on several factors, including the ease of use for you personally, connection speed and bandwidth, and the size and number of files which you intend to transfer.

      • 4.5.1  FTP

        ITaP does not support FTP on any ITaP research systems because it does not allow for secure transmission of data. Try using one of the other methods described below instead of FTP.

      • 4.5.2  SCP

        SCP (Secure CoPy) is a simple way of transferring files between two machines that use the SSH protocol. SCP is available as a protocol choice in some graphical file transfer programs and also as a command line program on most Linux, Unix, and Mac OS X systems. SCP can copy single files, but will also recursively copy directory contents if given a directory name.

        Command-line usage:

          (to a remote system from local)
        $ scp sourcefilename myusername@hostname:somedirectory/destinationfilename
        
          (from a remote system to local)
        $ scp myusername@hostname:somedirectory/sourcefilename destinationfilename
        
          (recursive directory copy to a remote system from local)
        $ scp -r sourcedirectory/ myusername@hostname:somedirectory/
        

        Linux / Solaris / AIX / HP-UX / Unix:

        • You should have already installed the "scp" command-line program.

        Microsoft Windows:

        • WinSCP is a full-featured and free graphical SCP and SFTP client.
        • PuTTY also offers "pscp.exe", which is an extremely small program and a basic SCP client.
        • Secure FX is a commercial SCP and SFTP client which is freely available to Purdue students, faculty, and staff with a Purdue career account.

        Mac OS X:

        • You should have already installed the "scp" command-line program. You may start a local terminal window from "Applications->Utilities".
      • 4.5.3  SFTP

        SFTP (Secure File Transfer Protocol) is a reliable way of transferring files between two machines. SFTP is available as a protocol choice in some graphical file transfer programs and also as a command-line program on most Linux, Unix, and Mac OS X systems. SFTP has more features than SCP and allows for other operations on remote files, remote directory listing, and resuming interrupted transfers. Command-line SFTP cannot recursively copy directory contents; to do so, try using SCP or a graphical SFTP client.

        Command-line usage:

        $ sftp -B buffersize myusername@hostname
        
              (to a remote system from local)
        sftp> put sourcefile somedir/destinationfile
        sftp> put -P sourcefile somedir/
        
              (from a remote system to local)
        sftp> get sourcefile somedir/destinationfile
        sftp> get -P sourcefile somedir/
        
        sftp> exit
        
        • -B: optional, specify buffer size for transfer; larger may increase speed, but costs memory
        • -P: optional, preserve file attributes and permissions

        Linux / Solaris / AIX / HP-UX / Unix:

        • The "sftp" command line program should already be installed.

        Microsoft Windows:

        • WinSCP is a full-featured and free graphical SFTP and SCP client.
        • PuTTY also offers "psftp.exe", which is an extremely small program and a basic SFTP client.
        • Secure FX is a commercial SFTP and SCP client which is freely available to Purdue students, faculty, and staff with a Purdue career account.

        Mac OS X:

        • The "sftp" command-line program should already be installed. You may start a local terminal window from "Applications->Utilities".
        • MacSFTP is a free graphical SFTP client for Macs.
      • 4.5.4  Globus

        Globus, previously known as Globus Online, is a powerful and easy to use file transfer service that is useful for transferring files virtually anywhere. It works within ITaP's various research storage systems; it connects between ITaP and remote research sites running Globus; and it connects research systems to personal systems. You may use Globus to connect to your home, scratch, and Fortress storage directories. Since Globus is web-based, it works on any operating system that is connected to the internet. The Globus Personal client is available on Windows, Linux, and Mac OS X. It is primarily used as a graphical means of transfer but it can also be used over the command line.

        Globus Web:

        • Navigate to http://transfer.rcac.purdue.edu
        • Click "Proceed" to log in with your Purdue Career Account.
        • On your first login it will ask to make a connection to a Globus account. If you already have one - sign in to associate with your Career Account. Otherwise, click the link to create a new account.
        • Now you're at the main screen. Click "File Transfer", which will bring you to a two-endpoint interface.
        • Purdue's endpoint is named "purdue#rcac"; you can start typing "purdue" and it will autocomplete.
        • The paths to research storage are the same as they are when you're logged into the clusters, but are provided below for reference.
          • Home directory: /~/
          • Scratch directory: /scratch/scratch96/m/myusername where m is the first letter of your username and myusername is your career account name.
          • Research Data Depot directory: /depot/mygroupname where mygroupname is the name of your group.
          • Fortress long-term storage: /archive/fortress/home/myusername where myusername is your career account name.

        • For the second endpoint, you can choose any other Globus endpoint, such as another research site, or a Globus Personal endpoint, which will allow you to transfer to a personal workstation or laptop.

        Globus Personal Client setup:

        • On the endpoint page from earlier, click "Get Globus Connect Personal" or download it from here: Globus Connect Personal
        • Name this particular personal system and click "Generate Setup Key" on this page: Create Globus Personal endpoint
        • Copy the key and paste it into the setup box when installing the client for your system.
        • Your personal system is now available as an endpoint within the Globus transfer interface.

        Globus Command Line:

        For more information, please see Globus Support.

      • 4.5.5  Windows Network Drive / SMB

        SMB (Server Message Block), also known as CIFS, is an easy to use file transfer protocol that is useful for transferring files between ITaP research systems and a desktop or laptop. You may use SMB to connect to your home, scratch, and Fortress storage directories. The SMB protocol is available on Windows, Linux, and Mac OS X. It is primarily used as a graphical means of transfer but it can also be used over the command line.

        Note: to access BoilerGrid through SMB file sharing, you must be on a Purdue campus network or connected through VPN.

        Windows:

        • Windows 7: Click Windows menu > Computer, then click Map Network Drive in the top bar
        • Windows 8.1: Tap the Windows key, type computer, select This PC, click Computer > Map Network Drive in the top bar
        • In the folder location enter the following information and click Finish:

          • To access your home directory, enter \\samba.rcac.purdue.edu\myusername where myusername is your career account name.
          • To access your scratch space on BoilerGrid, enter \\samba.rcac.purdue.edu\scratch. Once mapped, you will be able to navigate to boilergrid\m\myusername where m is the first letter of your username and myusername is your career account name. You may also navigate to any of the other cluster scratch directories from this drive mapping.
          • To access your Fortress long-term storage home directory, enter \\fortress-smb.rcac.purdue.edu\myusername where myusername is your career account name.
          • To access a shared Fortress group storage directory, enter \\fortress-smb.rcac.purdue.edu\group\mygroupname where mygroupname is the name of the shared group space.

        • You may be prompted for login information. Enter your username as onepurdue\myusername and your account password. If you omit the onepurdue prefix, you will not be able to log in.
        • Your home, scratch, or Fortress directory should now be mounted as a drive in the Computer window.

        Mac OS X:

        • In the Finder, click Go > Connect to Server
        • In the Server Address enter the following information and click Connect:

          • To access your home directory, enter smb://samba.rcac.purdue.edu/myusername where myusername is your career account name.
          • To access your scratch space on BoilerGrid, enter smb://samba.rcac.purdue.edu/scratch. Once connected, you will be able to navigate to boilergrid/m/myusername where m is the first letter of your username and myusername is your career account name. You may also navigate to any of the other cluster scratch directories from this mount.
          • To access your Fortress long-term storage home directory, enter smb://fortress-smb.rcac.purdue.edu/myusername where myusername is your career account name.
          • To access a shared Fortress group storage directory, enter smb://fortress-smb.rcac.purdue.edu/group/mygroupname where mygroupname is the name of the shared group space.

        • You may be prompted for login information. Enter your username and password, and enter onepurdue for the domain; otherwise you will not be able to log in.

        Linux:

        • There are several graphical methods to connect in Linux depending on your desktop environment. Once you find out how to connect to a network server on your desktop environment, choose the Samba/SMB protocol and adapt the information from the Mac OS X section to connect.
        • If you would like command-line access via Samba, you may install smbclient, which provides FTP-like access and can be used as shown below. SCP or SFTP is recommended over this approach. For all the possible paths to connect to, adapt the Mac OS X instructions above.
          $ smbclient //samba.rcac.purdue.edu/myusername -U myusername -W onepurdue
  • 5  Applications on BoilerGrid

  • 6  Compiling Source Code on BoilerGrid

    • 6.1  Provided Compilers

      The compilers available on all research systems are able to compile code for HTCondor. Compilers are available for Fortran 77, Fortran 90, Fortran 95, C, and C++. The compilers can produce general-purpose and architecture-specific optimizations to improve performance. These include loop-level optimizations, inter-procedural analysis and cache optimizations. While the compilers support automatic and user-directed parallelization of Fortran, C, and C++ applications for multiprocessing execution, BoilerGrid allows only serial jobs.

      To see the available compilers, choose one of the following entries:

      $ module avail intel
      $ module avail gcc
      $ module avail pgi
      
    • 6.2  Statically Linked Libraries

      Using statically linked libraries, regardless of the chosen HTCondor universe, is good practice; you cannot rely on which versions of dynamic libraries are available on the machines selected to run your job. With static libraries, HTCondor will send the same libraries to all machines. On the other hand, because the HTCondor flock consists of a mix of machine architectures, there is also the possibility that your job will land on a machine so different from, or so much older than, the machine on which you built your executable file that your job fails to execute an instruction in the statically linked library. In a parameter sweep, this leads to the confusing situation of some runs of the sweep completing successfully while others fail. In that case, you must consider using the corresponding dynamic library on the selected machine, or using ClassAds to select compute nodes known to run your job successfully or to exclude compute nodes known to fail. So, use static linkage if at all possible.

      For the Standard Universe, the condor_compile command specifies static linkage as part of its arguments to the linker; the condor_compile command exhibits its arguments in the "LINKING FOR" message. For jobs destined for the Vanilla Universe, use your compiler's command-line option for selecting statically linked libraries.
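
      As an illustrative sketch only (the attribute values and hostname are assumptions, not site-specific recommendations), a requirements expression in the submit description file can steer a job toward, or away from, particular machines:

        (in the submit description file, before the queue statement)
      requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Machine != "mybadnode.rcac.purdue.edu")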

    • 6.3  Compiling Serial Programs

      A serial program is a single process whose steps execute as a sequential stream of instructions on one computer. Compilers capable of serial programming are available for C, C++, and versions of Fortran.

      Here are a few sample serial programs:

      Standard Universe

      With the GNU compilers only, the command condor_compile compiles source code and relinks it with the HTCondor libraries for submission into HTCondor's Standard Universe. The HTCondor libraries provide the program with additional support, such as the capability to preempt with checkpointing, which is a feature of HTCondor's Standard Universe mode of operation. The command condor_compile requires the source or object code of a computer program as well as a compatible compiler.

      To use condor_compile and the Standard Universe, first load a compatible compiler (in this case the default GNU compiler):

      $ module load gcc
      

      Next, choose one of the following entries:

      $ condor_compile gfortran myprogram.f -o myprogram
      $ condor_compile gfortran myprogram.f90 -o myprogram
      $ condor_compile gfortran myprogram.f95 -o myprogram
      $ condor_compile gcc myprogram.c -o myprogram
      $ condor_compile g++ myprogram.cpp -o myprogram
      

      Vanilla Universe

      When neither source nor object code of a computer program is available (i.e. only an executable binary or a shell script) or when you wish to take advantage of features of a compiler which is not compatible with HTCondor's condor_compile and Standard Universe, you must compile without condor_compile and submit your executable file to HTCondor's Vanilla Universe. This section looks at just compiling with the standard C/C++ and Fortran compilers, as opposed to compiling with condor_compile.

      The following table illustrates how to compile a serial program with statically linked libraries. Note that not all compilers are available on all systems.

      Language Intel Compiler GNU Compiler PGI Compiler
      Fortran 77
      $ module load intel
      $ ifort -static myprogram.f -o myprogram
      
      $ module load gcc
      $ gfortran -static myprogram.f -o myprogram
      
      $ module load pgi
      $ pgf77 -Bstatic myprogram.f -o myprogram
      
      Fortran 90
      $ module load intel
      $ ifort -static myprogram.f90 -o myprogram
      
      $ module load gcc
      $ gfortran -static myprogram.f90 -o myprogram
      
      $ module load pgi
      $ pgf90 -Bstatic myprogram.f90 -o myprogram
      
      Fortran 95
      $ module load intel
      $ ifort -static myprogram.f90 -o myprogram
      
      $ module load gcc
      $ gfortran -static myprogram.f95 -o myprogram
      
      $ module load pgi
      $ pgf95 -Bstatic myprogram.f95 -o myprogram
      
      C
      $ module load intel
      $ icc -static myprogram.c -o myprogram
      
      $ module load gcc
      $ gcc -static myprogram.c -o myprogram
      
      $ module load pgi
      $ pgcc -Bstatic myprogram.c -o myprogram
      
      C++ ¹
      $ module load intel
      $ icc -static myprogram.cpp -o myprogram
      
      $ module load gcc
      $ g++ -static myprogram.cpp -o myprogram
      
      $ module load pgi
      $ pgCC -Bstatic myprogram.cpp -o myprogram
      
      ¹  The suffix of a C++ file may be .C, .c, .cc, .cpp, .cxx, or .c++.

      The Intel, GNU and PGI compilers will not output anything for a successful compilation. Also, the Intel compiler does not recognize the suffix ".f95".

      An older version of the GNU compiler will be in your path by default. Do NOT use this version. Instead, load a newer version using the command module load gcc.

      More information on compiler options is available in the official man pages on the Web. Also, the command man mycompiler displays man pages locally (only after using module load to load the appropriate compiler).

    • 6.4  Compiling MPI Programs

      BoilerGrid allows only serial programs to run via HTCondor. There is no support for MPI.

    • 6.5  Compiling OpenMP Programs

      BoilerGrid allows only serial programs to run via HTCondor. There is no support for OpenMP.

    • 6.6  Compiling Hybrid Programs

      6.6  Compiling Hybrid Programs

      BoilerGrid allows only serial programs to run via HTCondor. There is no support for MPI or OpenMP.

    • 6.7  Provided Libraries

      6.7  Provided Libraries

      BoilerGrid has a few preinstalled libraries, including mathematical libraries. More detailed documentation on the libraries available on BoilerGrid follows.

      • 6.7.1  MPICH Library

        6.7.1  MPICH Library

        There is currently no support for MPICH through HTCondor.

      • 6.7.2  Intel Math Kernel Library (MKL)

        6.7.2  Intel Math Kernel Library (MKL)

        Intel Math Kernel Library (MKL) contains ScaLAPACK, LAPACK, Sparse Solver, BLAS, Sparse BLAS, CBLAS, GMP, FFTs, DFTs, VSL, VML, and Interval Arithmetic routines. MKL resides in the directory stored in the environment variable MKL_HOME, after loading a version of the Intel compiler with module.

        When you use module load to activate an Intel compiler, your shell environment gains several variables that help link applications with MKL. Here are some example combinations of simplified linking options:

        $ module load intel
        $ echo $LINK_LAPACK
        -L${MKL_HOME}/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
        
        $ echo $LINK_LAPACK95
        -L${MKL_HOME}/lib/intel64 -lmkl_lapack95_lp64 -lmkl_blas95_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
        

        ITaP recommends that you use the provided variables to define MKL linking options in your compiling procedures. The Intel compiler modules also provide two other environment variables, LINK_LAPACK_STATIC and LINK_LAPACK95_STATIC, which you may use if you need to link MKL statically.
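
        For example, assuming a Fortran source file myprogram.f90 that calls LAPACK routines (the file name is illustrative), you might compile and link against MKL using the provided variable:

        $ module load intel
        $ ifort myprogram.f90 -o myprogram $LINK_LAPACK
        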

        ITaP recommends that you use dynamic linking of libguide. If so, define LD_LIBRARY_PATH such that you are using the correct version of libguide at run time. If you use static linking of libguide (discouraged), then:

        • If you use the Intel compilers, link in the libguide version that comes with the compiler (use the -openmp option).
        • If you do not use the Intel compilers, link in the libguide version that comes with the Intel MKL above.

        Here is more documentation from other sources on the Intel MKL:

    • 6.8  Mixing Fortran, C, and C++ Code on Unix

      6.8  Mixing Fortran, C, and C++ Code on Unix

      You may write different parts of a computing application in different programming languages. For example, an application might incorporate older, legacy numerical code written in Fortran, system functions written in C, and a newer main program, written in C++ to take advantage of object orientation, which binds the older code together. This section illustrates a few simple examples.

      For more information about mixing programming languages:

    • 6.9  Using cpp with Fortran

      6.9  Using cpp with Fortran


      • 6.9.1  Using cpp with Fortran

        6.9.1  Using cpp with Fortran

        If the source file ends with .F, .fpp, or .FPP, cpp automatically preprocesses the source code before compilation. If you want to use the C preprocessor with source files that do not end with .F, use the following compiler option to specify the filename suffix:

        • GNU Compilers: -x f77-cpp-input
          Note that preprocessing does not extend to the contents of files included by an "INCLUDE" directive. You must use the #include preprocessor directive instead.
          For example, to preprocess source files that end with .f:
          $ gfortran -x f77-cpp-input myprogram.f
          
        • Intel Compilers: -cpp
          To tell the compiler to link using the C++ run-time libraries included with gcc or with icc, add one of these options:
          $ ... -cxxlib-gcc
          $ ... -cxxlib-icc
          
          For example, to preprocess source files that end with .f:
          $ ifort -cpp myprogram.f
          

        Generally, it is advisable to rename your file from myprogram.f to myprogram.F. The preprocessor then automatically runs when you compile the file.

        For more information on combining C/C++ and Fortran:

      • 6.9.2  C Program Calling Subroutines in Fortran, C, and C++

        6.9.2  C Program Calling Subroutines in Fortran, C, and C++

        A C language program calls routines written in Fortran 90, C, and C++. The routines change the value of a character argument. To understand what makes this example work, you must be aware of a few simple issues.

        To discover how the chosen Fortran compiler handles the names of routines, apply the Linux command nm to the object file: nm filename.o. The Fortran compilers used in this example append an underscore after the name of a routine. The C program calls the Fortran routine with the underscore character.

        Fortran uses pass-by-reference while C uses pass-by-value. Therefore, to pass a value from a Fortran routine to a C program requires the argument in the call to the Fortran routine to be a pointer (ampersand "&"). To pass a value from a C++ routine to a C program, the C++ routine may use the pass-by-reference syntax (ampersand "&") of C++ while the C program again specifies a pointer (ampersand "&") in the call to the C++ routine.

        The C++ compiler must know at the time of compiling the C++ routine that the C program will invoke the C++ routine with the C-style interface rather than the C++ interface.
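
        As a minimal sketch of these points (illustrative only; the routine names match those shown in the program output below), the declarations in the C main program might look like this:

        /* Fortran 90 subroutine: the Fortran compiler appended an underscore,
           and the argument must be passed by address.                         */
        extern void subr_f_(char *chr);

        /* Plain C routine: the caller passes the address of the character.    */
        extern void func_c(char *chr);

        /* C++ routine compiled with extern "C" so that the C program can call
           it through the C-style interface.                                   */
        extern void func_cpp(char *chr);
        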

        The following files of source code illustrate these technical details:

        Separately compile each source code file with the appropriate compiler into an object (.o) file. Then link the object files into a single executable file (a.out):

        Compiler Intel GNU PGI
        C Main Program
        $ module load intel
        $ icc -c main.c
        $ ifort -c f90.f90
        $ icc -c c.c
        $ icc -c cpp.cpp
        $ icc -lstdc++ main.o f90.o c.o cpp.o
        
        $ module load gcc
        $ gcc -c main.c
        $ gfortran -c f90.f90
        $ gcc -c c.c
        $ g++ -c cpp.cpp
        $ gcc -lstdc++ main.o f90.o c.o cpp.o
        
        $ module load pgi
        $ pgcc -c main.c
        $ pgcc -c c.c
        $ pgCC -c cpp.cpp
        $ pgf90 -Mnomain main.o c.o cpp.o f90.f90
        
        

        The results show that each routine successfully returns a different character to the main program:

        $ a.out
        main(), initial value:               chr=X
        main(), after function subr_f_():    chr=f
        main(), after function func_c():     chr=c
        main(), after function func_cpp():   chr=+
        Exit main.c
        
      • 6.9.3  C++ Program Calling Subroutines in Fortran, C, and C++

        6.9.3  C++ Program Calling Subroutines in Fortran, C, and C++

        A C++ language program calls routines written in Fortran 90, C, and C++. The routines change the value of a character argument. To understand what makes this example work, you must be aware of a few simple issues.

        To discover how the chosen Fortran compiler handles the names of routines, apply the Linux command nm to the object file: nm filename.o. The Fortran compilers used in this example append an underscore after the name of a routine. The C++ program calls the Fortran routine with the underscore character.

        Fortran uses pass-by-reference while C++ uses pass-by-value. Therefore, to pass a value from a Fortran routine to a C++ program requires the argument in the call to the Fortran routine to be a pointer (ampersand "&"). To pass a value from a C routine to a C++ program, the C routine must declare a parameter as a pointer (asterisk "*") while the C++ program again specifies a pointer (ampersand "&") in the call to the C routine.

        The C++ compiler must know at the time of compiling the C++ program that the C++ program will invoke the Fortran and C routines with the C-style interface rather than the C++ interface.

        The following files of source code illustrate these technical details:

        Separately compile each source code file with the appropriate compiler into an object (.o) file. Then link the object files into a single executable file (a.out):

        Compiler Intel GNU PGI
        C++ Main Program
        $ module load intel
        $ icc -c main.cpp
        $ ifort -c f90.f90
        $ icc -c c.c
        $ icc -c cpp.cpp
        $ icc -lstdc++ main.o f90.o c.o cpp.o
        
        $ module load gcc
        $ g++ -c main.cpp
        $ gfortran -c f90.f90
        $ gcc -c c.c
        $ g++ -c cpp.cpp
        $ g++ main.o f90.o c.o cpp.o
        
        $ module load pgi
        $ pgCC -c main.cpp
        $ pgf90 -c f90.f90
        $ pgcc -c c.c
        $ pgCC -c cpp.cpp
        $ pgCC -L../lib main.o c.o cpp.o f90.o -pgf90libs
        

        The results show that each routine successfully returns a different character to the main program:

        $ a.out
        main(), initial value:               chr=X
        main(), after function subr_f_():    chr=f
        main(), after function func_c():     chr=c
        main(), after function func_cpp():   chr=+
        Exit main.cpp
        
      • 6.9.4  Fortran Program Calling Subroutines in Fortran, C, and C++

        6.9.4  Fortran Program Calling Subroutines in Fortran, C, and C++

        A Fortran language program calls routines written in Fortran 90, C, and C++. The routines change the value of a character argument. To understand what makes this example work, you must be aware of a few simple issues.

        To discover how the chosen Fortran compiler handles the names of routines, apply the Linux command nm to the object file: nm filename.o. The Fortran compilers used in this example append an underscore after the name of a routine, so the definitions of the C and C++ routines must include the underscore. The Fortran program calls these routines without the underscore character in the Fortran source code.

        Fortran uses pass-by-reference while C uses pass-by-value. Therefore, to pass a value from a C routine to a Fortran program requires the parameter of the C routine to be a pointer (asterisk "*") in the C routine's definition. To pass a value from a C++ routine to a Fortran program, the C++ routine may use the pass-by-reference syntax (ampersand "&") of C++ in its definition.

        The C++ compiler must know at the time of compiling the C++ routine that the Fortran program will invoke the C++ routine with the C-style interface rather than the C++ interface.

        The following files of source code illustrate these technical details:

        Separately compile each source code file with the appropriate compiler into an object (.o) file. Then link the object files into a single executable file (a.out):

        Compiler Intel GNU PGI
        Fortran 90 Main Program
        $ module load intel
        $ ifort -c main.f90
        $ ifort -c f90.f90
        $ icc -c c.c
        $ icc -c cpp.cpp
        $ ifort -lstdc++ main.o f90.o c.o cpp.o
        
        $ module load gcc
        $ gfortran -c main.f90
        $ gfortran -c f90.f90
        $ gcc -c c.c
        $ g++ -c cpp.cpp
        $ gfortran -lstdc++ main.o c.o cpp.o f90.o
        
        $ module load pgi
        $ pgf90 -c main.f90
        $ pgf90 -c f90.f90
        $ pgcc -c c.c
        $ pgCC -c cpp.cpp
        $ pgf90 main.o c.o cpp.o f90.o
        

        The results show that each routine successfully returns a different character to the main program:

        $ a.out
         main(), initial value:               chr=X
         main(), after function subr_f():     chr=f
         main(), after function subr_c():     chr=c
         main(), after function func_cpp():   chr=+
         Exit mixlang
        
  • 7  Running Jobs on BoilerGrid

    7  Running Jobs on BoilerGrid

    You may use HTCondor to submit jobs to BoilerGrid. HTCondor performs job scheduling. Jobs may be serial only. You may use only the batch mode for developing and running your program. BoilerGrid does not offer an interactive mode to run your jobs.

    • 7.1  Running Jobs via HTCondor

      7.1  Running Jobs via HTCondor

      HTCondor is one of several distributed computing resources ITaP provides. Like other similar resources, HTCondor provides a framework for running programs on otherwise idle computers. While this imposes serious limitations on parallel jobs and codes with large I/O or memory requirements, HTCondor can provide a large quantity of cycles for researchers who need to run hundreds of smaller jobs that each execute for an hour or less.

      HTCondor is a specialized batch system for managing compute-intensive jobs. HTCondor provides a queuing mechanism, scheduling policy, priority scheme, and resource classifications. Users submit their jobs to HTCondor, which then puts these jobs in a queue, runs them, and reports back with the results.

      In some ways, HTCondor is different from other batch systems, which usually operate only on dedicated machines and compute servers. HTCondor, in contrast, can both schedule jobs on dedicated machines and effectively utilize non-dedicated machines to run jobs. It only runs jobs on machines which are currently idle (no keyboard activity, no load average, no active telnet users, etc.). In this way, HTCondor effectively harnesses otherwise idle machines throughout a pool of machines.

      Currently, ITaP uses HTCondor to utilize idle cycles on many ITaP research resources, including Linux cluster nodes as well as some other servers and workstations. While ITaP uses PBS to schedule the resources of the Linux clusters, HTCondor schedules jobs on compute nodes when the nodes are not running PBS jobs. When PBS elects to run a new job on a node which is currently running HTCondor-scheduled jobs, HTCondor preempts all jobs running on that node to make room for the PBS-scheduled job. You may submit HTCondor jobs from any ITaP research system.

      For more information:

      • 7.1.1  Tips

        7.1.1  Tips

        • There is no global shared filesystem. ITaP recommends using HTCondor's file transfer support for managing your jobs' data.
        • Do not queue up thousands of jobs in a queue. Submit fewer jobs at a time or use DAGMan to divide your jobs into reasonably-sized chunks (fewer than 500 jobs per set); a minimal DAGMan sketch appears after this list.
        • Never run condor_q repeatedly on a heavily used submit node. The condor_schedd is single-threaded and schedules work in the same thread that you are using to list the queue. This actually takes resources away from the scheduler and is counter-productive.
        • Long jobs should run in the Standard Universe, not in the Vanilla Universe, since they will likely never finish in Vanilla.
        • Vanilla Universe can use Intel compilers (may run 30–40% faster). Using Intel compilers under Vanilla may ultimately provide better throughput than checkpointing jobs in the Standard Universe using a different compiler because the speed gained from using the Intel compilers may be greater than the advantage of checkpointing.
        • Prefer statically linked libraries over dynamically linked libraries.
        • Generally, if your jobs run in less than 1/2 hour, they will seldom be evicted. If they take 1/2 hour to 1 hour, there will usually still only be a few evictions.
        • Purdue has both a scavenging/preempting and a scheduling system. Remember that the HTCondor pool is very heterogeneous, both regarding processor versions and OS versions/types (both Linux of different varieties and some Windows).
        • At Purdue, ITaP has disabled all automatic email notification using the notification HTCondor submission command. Setting this in a submission file will have no effect.
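
        As referenced in the DAGMan tip above, here is a minimal sketch of a DAG input file and its submission (the file names and job names are illustrative):

        # FILENAME: mydag.dag
        # Each JOB line names a set of jobs described by its own submit file.
        JOB  set1  set1.sub
        JOB  set2  set2.sub
        # Run set2 only after set1 completes.
        PARENT set1 CHILD set2
        

        Submit the DAG to HTCondor with:

        $ condor_submit_dag mydag.dag
        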
      • 7.1.2  Choosing a HTCondor Universe

        7.1.2  Choosing a HTCondor Universe

        A Universe in HTCondor defines an execution environment. HTCondor supports several different Universes for user jobs. The Universes most commonly used on BoilerGrid are "Standard", "Vanilla", and "Globus" (or "Grid"), but others exist. See Chapter 2.4.1 of the HTCondor Manual for more details about the different Universes.

        Job submission files specify the HTCondor Universe through the universe command. The default Universe is Vanilla (not Standard). Windows compute nodes accept only Vanilla Universe jobs.

        You will need to determine the appropriate Universe for your jobs. Here are some more details about how the Universes differ:

        • Vanilla Universe

          The Vanilla Universe is the default (except where the configuration variable DEFAULT_UNIVERSE defines it otherwise). It is an execution environment for jobs which you did not re-link with the HTCondor libraries. It provides fewer services, but has very few restrictions. Preemption with either suspension or eviction (without checkpointing) is a signature of the Vanilla Universe. If a compute node which is running one or more Vanilla jobs ceases to be idle, HTCondor will either suspend or evict those jobs. HTCondor may restart a suspended job on the same compute node; HTCondor will restart evicted jobs on other compute nodes. When re-linking a computer program to the HTCondor libraries is impossible or when you wish to use a compiler which is incompatible with condor_compile, use the Vanilla Universe.

          Virtually any non-parallel program can use the Vanilla Universe. Shell scripts may serve as executables. It is the only possibility for Windows machines. You may use compilers which are incompatible with condor_compile. For example, code built with the Intel compilers may run 30–40% faster than code from compatible compilers, and for somewhat longer jobs the speed gain may outweigh the advantage of checkpointing in the Standard Universe. Preemption with suspension or eviction is, in general, bad for long jobs but acceptable for short jobs. A long job may never finish because repeated preemptions with restarts can prevent completion.

          Static linkage of libraries for Vanilla Universe jobs eliminates the chance of running a job with different, older libraries which may be available on some compute nodes since it sends the same collection of libraries to all compute nodes. There is the risk that some compute nodes are sufficiently out of synch with the submission host that they are unable to run the newer libraries. ITaP recommends using static linkage if at all possible.

        • Standard Universe

          The Standard Universe supports transparent job preemption with checkpointing, remote system calls, and migration from compute node to compute node without restarting. Specifying the Standard Universe in your job submission file tells HTCondor that you previously re-linked your job via condor_compile with the HTCondor libraries while using various HTCondor-specific compiler options and libraries. The Standard Universe is desirable because of its preemption with checkpointing. If possible, use the Standard Universe for long jobs; long jobs are less likely to finish in the Vanilla Universe.

          There are a few restrictions on programs. Sub-processes are not possible. Shell scripts may not serve as executables. You may not use incompatible compilers, such as the Intel compilers. All Standard Universe executables should be statically linked, since there is no guarantee that the dynamic libraries on all machines in the flock are the same version; static linking lets HTCondor send the same executable file to all machines. There is also the risk that your job lands on a system whose operating system is not even the same version as your build system's. The condor_compile command specifies static linkage as part of its arguments to the linker; condor_compile displays these arguments in the 'LINKING FOR' message. This command not only forces a static link but also fills in a number of wrappers for standard C library routines to make, among other things, remote file access work.

        • Globus (or Grid) Universe

          The Globus or Grid Universe forwards the job to an external job management system. You use the grid_resource command to supply the additional specifications that the Grid Universe requires. The Globus or Grid Universe allows users to submit jobs through HTCondor's interface; these jobs execute on grid resources. For Globus jobs, see http://www.globus.org for more information.
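
          A Grid Universe job submission file might look like the following sketch (the gatekeeper host name is a placeholder, not an actual BoilerGrid resource):

          universe      = grid
          grid_resource = gt2 mygatekeeper.example.edu/jobmanager-pbs
          executable    = myprogram
          output        = myprogram.out
          error         = myprogram.err
          log           = myprogram.log
          queue
          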

      • 7.1.3  Job Submission File

        7.1.3  Job Submission File

        Example 1

        Here is the simplest possible job submission file. It will queue one copy of the program hello for execution by HTCondor. HTCondor will use its default universe and the default platform, which means it runs the job on a compute node which has the same architecture and operating system as the submission host.

        No input, output, or error commands appear in the job submission file, so the files stdin, stdout, and stderr will all refer to /dev/null (the null device, a special file that discards all data written to it while reporting that the write succeeded, and that returns EOF to any process that reads from it). The program may produce output by explicitly opening a file and writing to it. This job writes to a log file, hello.log. This log file will contain the events of the job's lifetime inside HTCondor, such as any possible errors. When the job finishes, its exit conditions will also be noted in the log file. HTCondor recommends a log file so that you know what happened to your jobs.

        If your program only returns output to the screen (like the hello.c program below does), then you should include Output = hello.out or something like it somewhere before Queue. Otherwise you will not see the output.

        If you do not explicitly choose a universe, HTCondor uses the default universe: Vanilla Universe.

        ####################
        #
        # Example 1
        # Simple HTCondor job description file
        #
        ####################
        
        executable     = hello
        log            = hello.log
        queue
        

        Example 2

        This example (from the HTCondor Manual), queues two copies of the program Mathematica. The first copy will run in directory run_1, and the second will run in directory run_2. For both queued copies, stdin will be file test.data, stdout will be file loop.out, and stderr will be file loop.error. This job writes two sets of files in separate directories. This is a convenient way to organize data if you have a large group of HTCondor jobs to run. The example file shows program submission of Mathematica as a Vanilla Universe job, since neither the source nor object code to program Mathematica is available for relinking to the HTCondor libraries.

        HTCondor recommends using a single log file.

        ####################
        #
        # Example 2
        # Demonstrate use of multiple directories for data organization
        #
        ####################
        
        universe   = VANILLA
        executable = mathematica
        input      = test.data
        output     = loop.out
        error      = loop.error
        log        = loop.log
        
        initialdir = run_1
        queue
        
        initialdir = run_2
        queue
        

        Example 3

        In this example (also from the HTCondor Manual), the job submission file queues 150 runs of program foo which you compiled and linked for Silicon Graphics workstations running IRIX 6.5. This job requires HTCondor to run the program on machines which have greater than 32 megabytes of physical memory, and expresses a preference to run the program on machines with more than 64 megabytes, if such machines are available. It also advises HTCondor that it will use up to 28 megabytes of memory when running. Each of the 150 runs of the program receives its own process number, starting with process number 0. So, files stdin, stdout, and stderr will refer to in.0, out.0, and err.0 for the first run of the program; in.1, out.1, and err.1 for the second run of the program; and so forth. A log file foo.log will contain entries about when and where HTCondor runs, checkpoints, and migrates processes for the 150 queued runs of the program.

        ####################
        #
        # Example 3
        # Show off some fancy features including use of pre-defined macros and logging
        #
        ####################
        
        executable   = foo
        requirements = Memory >= 32 && OpSys == "IRIX65" && Arch == "SGI"
        rank         = Memory >= 64
        image_size   = 28 Meg
        
        error   = err.$(Process)
        input   = in.$(Process)
        output  = out.$(Process)
        log     = foo.log
        
        queue 150
        
      • 7.1.4  Job Submission

        7.1.4  Job Submission

        Once you have a job submission file, you may submit this script to HTCondor using the condor_submit command. As described above, a job submission file contains the commands and keywords which specify the type of compute node on which you wish to run your job. HTCondor will find an available processor core and run your job there, or leave your job in a queue until one becomes available.

        You may submit jobs to BoilerGrid from any BoilerGrid submission host, including all ITaP research cluster front-ends.

        To submit a job submission file:

        $ condor_submit myjobsubmissionfile
        

        For more information about job submission:

      • 7.1.5  Job Status

        7.1.5  Job Status

        To check on the progress of your jobs, view the HTCondor queue on the host from which you submitted the jobs.

        You must make certain that you logged in to the same submission host (…-fe00, …-fe01, …-fe02, etc.) from which you submitted your jobs, or you will not see them in the queue.

        To view the status of all jobs in the HTCondor queue of your login host:

        $ condor_q
        

        To see only your own jobs, specify your own username as an argument:

        $ condor_q myusername
        
        -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:42163> : condor.rcac.purdue.edu
         ID         OWNER        SUBMITTED    RUN_TIME   ST PRI SIZE CMD
        1100900.0   myusername   2/20 15:13   0+00:00:00 I  0   0.0  Hello
        
        1 jobs; 1 idle, 0 running, 0 held
        

        Second, you may check on the status of your jobs through their log files. In your job submission file, you can specify a log command (log = myjob.log) at any point prior to the queue command. The main events during the processing of the job will appear in this log file: submittal, execution commencement, preemption, checkpoint, eviction, and termination.

        Third, as soon as your job begins executing, HTCondor will start a condor_shadow process on the submission host. This shadow process is the mechanism by which remotely executing jobs access the environment of the submission host, such as input and output files. HTCondor starts one shadow process on the submission host for each running job; however, the load on the submission host from this is usually not significant. If you notice degraded performance, you can limit the number of jobs that run simultaneously using the MAX_JOBS_RUNNING configuration parameter. Please contact us for help with this if you notice poor performance.

        To list all the compute nodes which are running your jobs:

        $ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'RemoteUser=="myusername@rcac.purdue.edu"'
        
        Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime
        ba-005.rcac.p LINUX       INTEL  Claimed    Busy       1.000   502  0+00:24:44
        ba-006.rcac.p LINUX       INTEL  Claimed    Busy       0.990   502  0+00:20:22
        ba-007.rcac.p LINUX       INTEL  Claimed    Busy       1.000   502  0+00:23:16
        ba-008.rcac.p LINUX       INTEL  Claimed    Busy       1.000   502  0+00:30:20
        ...
        

        For more information about monitoring your job:

      • 7.1.6  Job Cancellation

        7.1.6  Job Cancellation

        The command condor_rm removes a job from the queue. If the job has already started running, then HTCondor kills the job and removes its queue entry. Use condor_q to get the ID of the job.

        Queue of jobs before removal:

        $ condor_q
        	
        Submitter: condor.rcac.purdue.edu : <128.211.157.86:35407> : condor.rcac.purdue.edu
         ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
        ...
        260076.7   nice-user       8/18 00:05   0+00:21:47 I  0   29.3 startfah.sh -oneun
        260076.9   nice-user       8/18 00:05   0+01:40:44 I  0   136.7 startfah.sh -oneun
        260185.0   myusername      8/30 13:01   0+00:00:00 R  0   19.5 hello
        ...
        

        Remove a job:

        $ condor_rm 260185.0
        Job 260185.0 marked for removal
        

        Queue of jobs after removal:

        $ condor_q
        	
        Submitter: condor.rcac.purdue.edu : <128.211.157.86:35407> : condor.rcac.purdue.edu
         ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
        ...
        260076.7   nice-user       8/18 00:05   0+00:21:47 I  0   29.3 startfah.sh -oneun
        260076.9   nice-user       8/18 00:05   0+01:40:44 I  0   136.7 startfah.sh -oneun
        ...
        

        For more information about removing your job:

      • 7.1.7  Workflow Summary

        7.1.7  Workflow Summary

        This section offers a quick overview of the steps involved in preparing and submitting a simple HTCondor job.

        1. Prepare the Code

          The "Hello World" program below is a simple program which displays the text "hello, world":
          /* FILENAME: hello.c */
          #include <stdio.h>		
          int main (void) {		
              printf("hello, world\n");		
              return 0;
          }
          
        2. Choose the HTCondor Universe

          The two most commonly used HTCondor Universes are Standard and Vanilla. The "Hello World" program above will run in either universe.

          • Vanilla Universe

            Compile the "Hello World" program normally using any available compiler:
            $ module load intel
            $ icc -static hello.c -o hello
            
            $ module load gcc
            $ gcc -static hello.c -o hello
            	
            $ module load pgi
            $ pgcc -Bstatic hello.c -o hello
            
          • Standard Universe

            Relink the "Hello World" program with the HTCondor library using the condor_compile command and a compatible compiler:
            $ module load gcc	
            $ condor_compile gcc hello.c -o hello
            
        3. Prepare the Job Submission File

          Your job submission file defines how to run the job via HTCondor. It specifies the executable file, the chosen universe, a file containing standard input (not used in this example), files which will receive standard output and standard error, and the HTCondor log file, as well as many other possible parameters. The queue directive specifies how many executions of the job are to occur. Usually this is just once, as here:

          • Vanilla Universe

            # FILENAME: hello.sub
            executable = hello
            universe   = vanilla
            output     = hello.out
            error      = hello.err
            log        = hello.log
            queue
            
          • Standard Universe

            # FILENAME: hello.sub
            executable = hello
            universe   = standard
            output     = hello.out
            error      = hello.err
            log        = hello.log
            queue
            
        4. Submit the Job

          To run the "Hello World" program, use the condor_submit command to submit the job submission file to HTCondor:
          $ condor_submit hello.sub
          Submitting job(s).
          Logging submit event(s).
          1 job(s) submitted to cluster 1100744.
          
        5. Monitor the Job

          Once you submit the job, HTCondor will manage its execution. You can monitor the job's progress with the condor_q command:
          $ condor_q myusername
          
          
          -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:56939> : condor.rcac.purdue.edu
           ID      OWNER              SUBMITTED     RUN_TIME  ST PRI SIZE CMD
          1100744.0  myusername  2/17 15:36  0+00:00:00  I  0   0.0  hello
          	
          1 jobs; 1 idle, 0 running, 0 held
          
        6. Remove the Job

          If you discover an error in your job while waiting for the results, you can remove the job from the queue with the condor_rm command:
          $ condor_rm 1100744
          
        7. View the Results

          When the "Hello World" program completes, its output will appear in the file hello.out. The exit status of your program and various statistics about its performance, including time used and I/O performed, will appear in the log file hello.log. To view the output file:
          $ less hello.out
          hello, world
          

          A log of HTCondor activity during your job's run will appear in the file hello.log. This log may report zero bytes transferred for some Vanilla Universe jobs, as the compute node may have been able to directly access your files through a shared filesystem without needing to transfer them to the compute node. To view the log file:
          $ less hello.log
          000 (1100744.000.000) 02/17 15:36:51 Job submitted from host: <128.211.157.86:56939>
          ...
          001 (1100744.000.000) 02/17 15:41:49 Job executing on host: <128.211.157.10:57321>
          ...
          005 (1100744.000.000) 02/17 15:41:53 Job terminated.
                  (1) Normal termination (return value 0)
                          Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                          Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                          Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                          Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
                  1018  -  Run Bytes Sent By Job
                  5429958  -  Run Bytes Received By Job
                  1018  -  Total Bytes Sent By Job
                  5429958  -  Total Bytes Received By Job
          ...
          

      • 7.1.8  Job Hold

        7.1.8  Job Hold

        There are many reasons to put a job on hold. For example, if you do not have enough space to hold all the results at the same time but need to move those results somewhere else, you could queue all jobs and put them on hold immediately. Then release a few jobs at a time (with a -constraint argument to condor_release; this can be scripted), move the results as they appear, and then release some more jobs. In addition to jobs you hold manually, the HTCondor scheduler can hold jobs for various reasons (for example, if it is unable to write to your directory).
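
        For example, to release only the held jobs belonging to one cluster (the cluster ID here is illustrative):

        $ condor_release -constraint 'ClusterId == 1100900'
        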

        Any job in the hold state will remain in the hold state until released. A job in the queue may be placed on hold. A currently running Standard Universe job receives a hard kill signal, and HTCondor returns the job to the queue; when released, this Standard Universe job continues its execution using the most recent checkpoint available. A currently running Vanilla Universe job also receives a hard kill signal (preemption without checkpointing), and HTCondor returns the job to the queue; when released, this Vanilla Universe job restarts at the beginning.

        To hold a job:

        $ condor_hold myjobid
        

        To view the held job's state ("H" in the "ST" column):

        $ condor_q myusername
        
        -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:42163> : condor.rcac.purdue.edu
         ID         OWNER       SUBMITTED    RUN_TIME   ST PRI SIZE CMD
        1101790.0   myusername   2/24 14:53   0+00:00:00 H  0   0.0  Hello
        
        1 jobs; 0 idle, 0 running, 1 held
        

        For more information about holding your job:

      • 7.1.9  Job Release

        7.1.9  Job Release

        A job that is in the hold state remains there until later released for execution.

        To release a held job:

        $ condor_release myjobid
        

        The state of the released job is now "Idle", "I":

        $ condor_q myusername
        
        -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:42163> : condor.rcac.purdue.edu
         ID         OWNER              SUBMITTED    RUN_TIME   ST PRI SIZE CMD
        1101790.0   myusername   2/24 14:53   0+00:00:00 I  0   0.0  Hello
        
        1 jobs; 1 idle, 0 running, 0 held
        

        To release all held jobs of a single user:

        $ condor_release myusername
        

        For more information about releasing your job:

      • 7.1.10  Compute Nodes and ClassAds

        7.1.10  Compute Nodes and ClassAds

        HTCondor attempts to start jobs by matching submitted jobs with available compute nodes on the basis of ClassAds. HTCondor's ClassAds are analogous to the classified advertising section of the newspaper. Both sellers and buyers advertise details about what they have to sell or want to buy. Both buyers and sellers have some requirements which absolutely must be satisfied, such as the right type of item, and some other criteria by which they will prefer certain offers over others, such as a better price. The same is true in HTCondor, but between users submitting jobs and compute nodes advertising available resources. HTCondor uses ClassAds to make the best matches between these two groups.

        By default, your HTCondor jobs will seek an available compute node with the same values for the ClassAds Arch and OpSys as the host from which you submitted your job. The submission process assumes that in most cases your jobs will require the same combination of chip architecture and operating system to run as the host from which you submitted it. You can remove or alter this restriction by looking at the examples in the "Requiring Specific Architectures or Operating Systems" section.

        Some applications may require even more specific capabilities. Using ClassAds, you may specify a set of requirements so that only a subset of available compute nodes become candidates to run your job. There are many ClassAds available for you to use in your job requirements. You may also use ClassAds to indicate a preference for certain nodes over others (but not as an absolute requirement) by using the rank command. The following examples illustrate how to discover current ClassAds and how to estimate the number of compute nodes which will match job requirements based on ClassAds.

        To save a detailed report of all the ClassAds of all processor cores in BoilerGrid in the file myfile:

        $ condor_status -pool boilergrid.rcac.purdue.edu -long > myfile
        

        You may use any of the ClassAds which appear in this list to view a subset of BoilerGrid. For example, to save a listing of all user ID domains or all file system domains in the file myfile:

        $ condor_status -pool boilergrid.rcac.purdue.edu -format "%s\n" UidDomain > myfile
        
        $ condor_status -pool boilergrid.rcac.purdue.edu -format "%s\n" FileSystemDomain > myfile
        

        To list all platforms (architectures and operating systems) and the number of processor cores of each platform on BoilerGrid:

        $ condor_status -pool boilergrid.rcac.purdue.edu -total
        
                             Total Owner Claimed Unclaimed Matched Preempting Backfill
        
                 INTEL/LINUX    64    13       5        46       0          0        0
                   INTEL/OSX     2     0       0         2       0          0        0
               INTEL/WINNT51   345    29       2       314       0          0        0
               INTEL/WINNT61  4683   150      13      4520       0          0        0
                X86_64/LINUX 31395 22617    4734      4035       2          2        5
        
                       Total 36492 22811    4754      8918       2          2        5
        

        The total number of processor cores on BoilerGrid is 36,492. The predominant platform of BoilerGrid is x86_64/Linux, with 31,395 processor cores. The values in this table are approximations, since the set of available compute nodes changes constantly (for example, as nodes go down for repair).

        To see how many compute nodes have a given ClassAd value, add the ClassAd value as a constraint.

        To see only how many 64-bit, Intel-compatible, Linux (x86_64/Linux) platforms are currently in BoilerGrid:

        $ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX")'
        
                             Total Owner Claimed Unclaimed Matched Preempting Backfill
        
                X86_64/LINUX 31395 22740    4688      3957       3          2        5
        
                       Total 31395 22740    4688      3957       3          2        5
        

        You may specify numeric constraints with other relational operators. To discover how many compute nodes have at least 16 GB of memory:

        $ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint 'TotalMemory >= 16046'
        
                             Total Owner Claimed Unclaimed Matched Preempting Backfill
                X86_64/LINUX 26093 18007    3330      4753       3          0        0
                       Total 26093 18007    3330      4753       3          0        0
        

        ClassAd string values are case-sensitive. ClassAd attribute names are case-insensitive. The comparison operators (<, >, <=, >=, and ==) compare strings case-insensitively. The special comparison operators =?= and =!= compare strings case-sensitively. ClassAd expressions are similar to C boolean expressions and can be quite elaborate.
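
        For example, a job submission file might combine several ClassAd attributes into a requirements expression and state a preference with rank (the specific values are illustrative):

        requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Memory >= 1024)
        rank         = Memory
        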

        For more information about ClassAds, requirements, and rank:

      • 7.1.11  Shared Scratch File Systems

        7.1.11  Shared Scratch File Systems

        As there is no scratch space globally accessible on every compute node, ITaP recommends using HTCondor's file-transfer capability to transfer executables and data between the submission and execution nodes.
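
        A sketch of the submit-file commands that turn on HTCondor's file transfer mechanism (the input file name is illustrative):

        should_transfer_files   = YES
        when_to_transfer_output = ON_EXIT
        transfer_input_files    = mydata.in
        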

      • 7.1.12  List of Common ClassAds

        7.1.12  List of Common ClassAds

        Here is a brief description of some of the common ClassAds and attributes available in HTCondor. For a more complete listing, see the Job Submission Chapter of the HTCondor Users' Manual.

        Machine Attributes

        • Activity: String which describes HTCondor job activity on the machine. Can have one of the following values:
          • "Idle": There is no job activity
          • "Busy": A job is busy running
          • "Suspended": A job is currently suspended
          • "Vacating": A job is currently checkpointing
          • "Killing": A job is currently being killed
          • "Benchmarking": The startd is running benchmarks
        • Arch: String with the architecture of the machine.
        • ClockDay: The day of the week, where 0 = Sunday, 1 = Monday, ... , 6 = Saturday.
        • ClockMin: The number of minutes passed since midnight.
        • ConsoleIdle: The number of seconds since activity on the system console keyboard or console mouse has last been detected.
        • Cpus: Number of CPUs in this machine.
        • CurrentRank: A float which represents this machine owner's affinity for running the HTCondor job which it is currently hosting. If not currently hosting a HTCondor job, CurrentRank is 0.0. When a machine is claimed, the attribute's value is computed by evaluating the machine's Rank expression with respect to the current job's ClassAd.
        • Disk: The amount of disk space on this machine available for the job in Kbytes.
        • EnteredCurrentActivity: Time at which the machine entered the current Activity. On all platforms (including NT), this is measured in the number of integer seconds since the Unix epoch (00:00:00 UTC, Jan 1, 1970).
        • FileSystemDomain: A "domain" name configured by the HTCondor administrator which describes a cluster of machines which all access the same, uniformly-mounted, networked file systems usually via NFS or AFS. This is useful for Vanilla Universe jobs which require remote file access.
        • KeyboardIdle: The number of seconds since activity on any keyboard or mouse associated with this machine has last been detected.
        • KFlops: Relative floating point performance as determined via a Linpack benchmark.
        • LoadAvg: A floating point number with the machine's current load average.
        • Machine: A string with the machine's fully qualified hostname.
        • Memory: The amount of RAM in megabytes.
        • Name: The name of this resource. Typically the same value as the Machine attribute, but could be customized by the site administrator. On SMP machines, the condor_startd will divide the CPUs up into separate virtual machines, each with a unique name. These names will be of the form "vm#@full.hostname", for example, "vm1@vulture.cs.wisc.edu", which signifies virtual machine 1 from vulture.cs.wisc.edu.
        • OpSys: String describing the operating system running on this machine.
        • Requirements: A boolean, which when evaluated within the context of the machine ClassAd and a job ClassAd, must evaluate to TRUE before HTCondor will allow the job to use this machine.
        • MaxJobRetirementTime: An expression giving the maximum time in seconds that the startd will wait for the job to finish before kicking it off if it needs to do so.
        • StartdIpAddr: String with the IP and port address of the condor_startd daemon which is publishing this machine ClassAd.
        • State: String which publishes the machine's HTCondor state. Can be:
          • "Owner": The machine owner is using the machine, and it is unavailable to HTCondor.
          • "Unclaimed": The machine is available to run HTCondor jobs, but a good match is either not available or not yet found.
          • "Matched": The HTCondor central manager has found a good match for this resource, but a HTCondor scheduler has not yet claimed it.
          • "Claimed": The machine is claimed by a remote condor_ schedd and is probably running a job.
          • "Preempting": A HTCondor job is being preempted (possibly via checkpointing) in order to clear the machine for either a higher priority job or because the machine owner wants the machine back.
        • VirtualMachineID: For SMP machines, the integer that identifies the VM. The value will be X for the VM with name="vmX@full.hostname". For non-SMP machines with one virtual machine, the value will be 1.
        • VirtualMemory: The amount of currently available virtual memory (swap space) expressed in Kbytes.

        Job Attributes

        • Args: String representing the arguments passed to the job.
        • CkptArch: String describing the architecture of the machine this job executed on at the time it last produced a checkpoint. If the job has never produced a checkpoint, this attribute is undefined.
        • CkptOpSys: String describing the operating system of the machine this job executed on at the time it last produced a checkpoint. If the job has never produced a checkpoint, this attribute is undefined.
        • ClusterId: Integer cluster identifier for this job. A cluster is a group of jobs that were submitted together. Each job has its own unique job identifier within the cluster, but shares a common cluster identifier. The value changes each time a job or set of jobs are queued for execution under HTCondor.
        • CompletionDate: The time when the job completed, or the value 0 if the job has not yet completed. Measured in the number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970).
        • CurrentHosts: The number of hosts in the claimed state, due to this job.
        • EnteredCurrentStatus: An integer containing the epoch time of when the job entered into its current status. For example, if the job is on hold, the ClassAd expression CurrentTime - EnteredCurrentStatus will equal the number of seconds that the job has been on hold.
        • ImageSize: Estimate of the memory image size of the job in Kbytes. The initial estimate may be specified in the job submit file. Otherwise, the initial value is equal to the size of the executable. When the job checkpoints, the ImageSize attribute is set to the size of the checkpoint file (since the checkpoint file contains the job's memory image). A Vanilla Universe job's ImageSize is recomputed internally every 15 seconds.
        • JobPrio: Integer priority for this job, set by condor_submit or condor_prio. The default value is 0. The higher the number, the better the priority.
        • JobStartDate: Time at which the job first began running. Measured in the number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970).
        • JobStatus: Integer which indicates the current status of the job.
          • 0: Unexpanded (the job has never run)
          • 1: Idle
          • 2: Running
          • 3: Removed
          • 4: Completed
          • 5: Held
        • JobUniverse: Integer which indicates the job universe.
          • 1: Standard
          • 4: PVM
          • 5: Vanilla
          • 7: Scheduler
          • 8: MPI
          • 9: Grid
          • 10: Java
        • LastMatchTime: An integer containing the epoch time when the job was last successfully matched with a resource (gatekeeper) Ad.
        • LastRejMatchReason: If, at any point in the past, this job failed to match with a resource ad, this attribute will contain a string with a human-readable message about why the match failed.
        • LastRejMatchTime: An integer containing the epoch time when HTCondor-G last tried to find a match for the job, but failed to do so.
        • MaxHosts: The maximum number of hosts that this job would like to claim. As long as CurrentHosts is the same as MaxHosts, no more hosts are negotiated for.
        • MaxJobRetirementTime: Maximum time in seconds to let this job run uninterrupted before kicking it off when it is being preempted. This can only decrease the amount of time from what the corresponding startd expression allows.
        • MinHosts: The minimum number of hosts that must be in the claimed state for this job, before the job may enter the running state.
        • NumGlobusSubmits: An integer that is incremented each time the condor_gridmanager receives confirmation of a successful job submission into Globus.
        • Owner: String describing the user who submitted this job.
        • ProcId: Integer process identifier for this job. Within a cluster of many jobs, each job has the same ClusterId, but will have a unique ProcId. Within a cluster, assignment of a ProcId value will start with the value 0. The job (process) identifier described here is unrelated to operating system PIDs.
        • RemoteIwd: The path to the directory in which a job is to be executed on a remote machine.
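
        To print some of these job attributes for your own jobs, you can use condor_q with its -format option (a sketch; the attribute names are those listed above):

        $ condor_q myusername -format "%d." ClusterId -format "%d " ProcId -format "%d\n" JobStatus
        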
      • 7.1.13  Preemption (checkpointing, suspension, eviction)

        7.1.13  Preemption (checkpointing, suspension, eviction)

        Long-running computer programs which are executing in the HTCondor environment face risks that can prevent job completion, for example power loss, overflow of dynamic memory or disk storage, and preemption. Overflow means that a computer program allocates too much dynamic memory or writes too much data to the disk (remote or local) serving the program. Preemption occurs when a higher priority job needs the compute node. It involves either temporarily interrupting a HTCondor job with the intention of resuming that job from the point of preemption at a later time and often on a different compute node (checkpointing), stopping the job but keeping it on the compute node (checkpointing followed by suspension), or restarting the job from the beginning on a different compute node (eviction).

        Checkpointing is a technique for inserting fault tolerance into computing systems. It consists of storing a snapshot of the current state of an application and later using that snapshot to resume execution, so that the compute node can be freed for another job. This is how HTCondor scavenges unused computing cycles without preventing higher-priority work. With checkpointing and suspension, a job has a chance to finish. Eviction may cause a job never to finish if the job's run time is significantly longer than the mean time between preemptions or between power failures, and restarting a job from the beginning can be exceedingly wasteful. HTCondor handles preemption somewhat differently on various compute nodes in BoilerGrid because the owners of each compute node may specify how they want preemption handled. However, a few general principles are true for all.

        BoilerGrid offers a heterogeneous collection of compute nodes, and these nodes do not run HTCondor exclusively. The majority are Linux systems also running the Portable Batch System (PBS). Many are Windows desktop machines. Architecture, performance, memory, and disk space vary broadly.

        For all compute nodes running PBS, when a PBS-scheduled job needs a compute node, HTCondor evicts any HTCondor jobs running on that node at the time. This is known as preemption. When HTCondor preempts a Standard Universe job, it checkpoints the job, immediately removes it, and starts seeking another compute node to run it, where it will resume the job from the point of preemption. When HTCondor preempts a Vanilla Universe job, HTCondor immediately evicts the job and starts seeking another compute node to run it, where it will restart the job at the beginning.

        To take advantage of checkpointing and remote system calls in HTCondor's Standard Universe, you must re-link your program with the HTCondor libraries. Typically, re-linking requires no change to the source code. However, not all applications can take advantage of the Standard Universe. The re-linking requirement excludes commercial software binaries, because commercial vendors rarely make their source or object code available. It also excludes applications which must run from a script, and applications built with compilers which are incompatible with HTCondor (even though such a compiler might yield more efficient code, reducing run time and the likelihood of eviction). These applications must use HTCondor's Vanilla Universe. Unless a Vanilla job is self-checkpointing, eviction means that all work is lost.

        Jobs running for long periods on BoilerGrid have a high probability of being preempted. These risks can warrant significant retooling of a job to customize the match between the characteristics of the job's computation and the compute nodes of BoilerGrid, in order to maximize throughput. Debugging a computer program and recoding a working program to improve performance are the usual tasks of a programmer. HTCondor may require additional retooling of that program so that it is able to reach completion.

        HTCondor is able to schedule and run any type of process, but HTCondor's Standard Universe does have some limitations on any jobs that it checkpoints and migrates:

        • Multi-process jobs are not allowed. This includes system calls such as fork(), exec(), and system().
        • Interprocess communication is not allowed. This includes pipes, semaphores, and shared memory.
        • Network communication must be brief. A job may make network connections using system calls such as socket(), but a network connection left open for long periods will delay checkpointing and migration.
        • Sending or receiving the SIGUSR2 or SIGTSTP signals is not allowed. HTCondor reserves these signals for its own use. Sending or receiving all other signals is allowed.
        • Alarms, timers, and sleeping are not allowed. This includes system calls such as alarm(), getitimer(), and sleep().
        • Multiple kernel-level threads are not allowed. However, multiple user-level threads are allowed.
        • Memory mapped files are not allowed. This includes system calls such as mmap() and munmap().
        • File locks are allowed, but not retained between checkpoints.
        • All files must be opened read-only or write-only. A file opened for both reading and writing will cause trouble if a job must be rolled back to an old checkpoint image. For compatibility reasons, a file opened for both reading and writing will result in a warning but not an error.

        These limitations apply only to Standard Universe jobs. They do not apply to Vanilla Universe jobs.

      • 7.1.14  Examples

        7.1.14  Examples

        To submit jobs successfully to BoilerGrid and to achieve maximum throughput in HTCondor's computing environment, you must understand the architecture of BoilerGrid and how to request resources which are appropriate to your application. The following examples show how to discover the resources of BoilerGrid. They also explain standard input and output, command-line arguments, file input and output, Standard and Vanilla universe jobs, shared file systems, parameter sweeps, DAG Manager, job requirements and ranks, and how to run commercial and third-party software. You may wish to look here for an example that is most similar to your application and modify that example for your jobs. You may also refer to the HTCondor Manual for more details.

        • 7.1.14.1  Simplest Job Submission File

          7.1.14.1  Simplest Job Submission File

          The job submission file must contain one executable command and at least one queue command. All other commands of the job submission file have default actions. HTCondor's job submission parser ignores blank lines and single-line comments beginning with a pound sign ("#"). There is no block (multi-line) comment in a job submission file. In some cases, a single-line comment may appear on a command line.

          # FILENAME: myjob.sub
          
          executable = myprogram
          queue    # place one copy of the job in the HTCondor queue
          

          This job submission file uses HTCondor's default universe, Vanilla. Because this Vanilla job does not turn on HTCondor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job. Also by default, the number of jobs queued is one.

          This job submission file may appear to be useless because it lacks the standard input, standard output, standard error, and a common log file; however, it will correctly process a program which reads and writes formatted files. An example of file I/O is this program myprogram.c. To compile this program for the Vanilla Universe, see Compiling Serial Programs.

          To submit this job to HTCondor:

          $ condor_submit myjob.sub
          
        • 7.1.14.2  Standard Input/Output

          7.1.14.2  Standard Input/Output

          HTCondor manages a batch environment. When HTCondor manages the execution of a computer program, that program cannot offer an interactive experience with a terminal. All input normally read from the keyboard (standard input) must be prepared in a file ahead of execution. All output normally written to the screen (standard output and standard error) appear in files where you may view them after execution. Also, HTCondor records in a common log file the main events of running a job.

          There is an example of standard I/O here in the program myprogram.c. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
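          The linked source is not reproduced in this guide. As a rough, hypothetical sketch of such a program, the following C code reads a line from standard input, echoes it to standard output, and reports a problem on standard error; the exact behavior of the referenced myprogram.c may differ:

          /* Sketch of a batch-friendly program: all input comes from standard
           * input and all output goes to standard output or standard error. */
          #include <stdio.h>

          int main(void)
          {
              char line[256];

              printf("***  MAIN START  ***\n");

              /* Read what would otherwise be typed at the keyboard. */
              if (fgets(line, sizeof(line), stdin) != NULL)
                  printf("standard input: %s", line);
              else
                  fprintf(stderr, "no standard input available\n");

              printf("***  MAIN STOP  ***\n");
              return 0;
          }
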

          Prepare a job submission file with an appropriate filename, here named myjob.sub:

          # FILENAME: myjob.sub
          
          executable = myprogram
          
          # Standard I/O files, HTCondor log file
          input  = mydata.in
          output = mydata.out
          error  = mydata.err
          log    = mydata.log
          
          # queue one job
          queue
          

          This job submission file uses HTCondor's default universe, Vanilla. Because this Vanilla job does not turn on HTCondor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job. Also by default, the number of jobs queued is one.

          This submission specifies that there exists a file, mydata.in, which contains all text which the program would otherwise read from the keyboard, standard input. It also specifies the names of three files which will receive standard output, standard error, and HTCondor's log entries. These three output files need not preexist, but they can. HTCondor will overwrite standard output and standard error but will append to the log file during subsequent submissions.

          To submit this job to HTCondor:

          $ condor_submit myjob.sub
          
        • 7.1.14.3  Command Line Arguments

          7.1.14.3  Command Line Arguments

          HTCondor allows the specification of command-line arguments in the job submission file. There are two permissible formats for specifying arguments. The old syntax has arguments delimited (separated) by space characters. To use double quotes, escape with a backslash (i.e. put a backslash in front of each double quote). For example:

          arguments = arg1 \"arg2\" 'arg3'
          

          yields the following arguments:

          arg1
          "arg2"
          'arg3'
          

          The new syntax supports uniform quoting of spaces within arguments. A pair of double quotes surrounds the entire argument list. To include a literal double quote, simply repeat it. White space (spaces, tabs) separate arguments. To include literal white space in an argument, surround the argument with a pair of single quotes. To include a literal single quote within a single-quoted argument, repeat the single quote.

          A simple program which will display the command-line arguments specified in a job submission file is this program, myprogram.c. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
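          The linked source is not shown here; the following hypothetical C sketch illustrates the same idea by printing every element of argv. Its output format only approximates the listings shown below:

          /* Sketch of a program that displays its command-line arguments,
           * similar in spirit to the linked myprogram.c. */
          #include <stdio.h>

          int main(int argc, char *argv[])
          {
              printf("***  MAIN START  ***\n\n");
              printf("Number of command line arguments: %d\n\n", argc);

              for (int i = 0; i < argc; i++)
                  printf("command line argument, argv[%d]: %s\n", i, argv[i]);

              printf("\n***  MAIN STOP  ***\n");
              return 0;
          }
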

          Prepare a job submission file with an appropriate filename, here named myjob.sub. Run with command-line arguments in either the old or new syntax:

          # FILENAME: myjob.sub
          
          universe = VANILLA
          
          executable = myprogram
          # Old Syntax
          # arguments = arg1 arg2 arg3 \"arg4\" 'arg5' 'arg with spaces' arg6 arg7_with_spaces arg8
          
          # New Syntax
          arguments = "arg9 ""arg10"" 'arg with literal '' and spaces'"
          
          # HTCondor Macros
          # arguments = $(Cluster) $(Process)
          
          # standard I/O files, HTCondor log file
          output = myprogram.out
          error  = myprogram.err
          log    = myprogram.log
          
          # queue one job
          queue
          

          To submit this job to HTCondor:

          $ condor_submit myjob.sub
          

          View command-line arguments submitted in the old syntax:

          ***  MAIN START  ***
          
          Number of command line arguments: 12
          
          command line argument, argv[0]: condor_exec.746418.0
          command line argument, argv[1]: arg1
          command line argument, argv[2]: arg2
          command line argument, argv[3]: arg3
          command line argument, argv[4]: "arg4"
          command line argument, argv[5]: 'arg5'
          command line argument, argv[6]: 'arg
          command line argument, argv[7]: with
          command line argument, argv[8]: spaces'
          command line argument, argv[9]: arg6
          command line argument, argv[10]: arg7_with_spaces
          command line argument, argv[11]: arg8
          
          ***  MAIN STOP  ***
          

          Because the old syntax cannot express a space within an argument, a common workaround is to substitute underscores for the spaces; the user code can then replace the underscores with spaces to recover an argument containing spaces.

          View command-line arguments submitted in the new syntax:

          ***  MAIN START  ***
          
          Number of command line arguments: 4
          
          command line argument, argv[0]: condor_exec.341964.0
          command line argument, argv[1]: arg9
          command line argument, argv[2]: "arg10"
          command line argument, argv[3]: arg with literal ' and spaces
          
          ***  MAIN STOP  ***
          

          The array element argv[0] holds HTCondor's name for a job.

          Two HTCondor macros are useful as command-line arguments, $(Cluster) and $(Process):

          ***  MAIN START  ***
          
          Number of command line arguments: 3
          
          command line argument, argv[0]: condor_exec.341965.0
          command line argument, argv[1]: 341965
          command line argument, argv[2]: 0
          
          ***  MAIN STOP  ***
          
        • 7.1.14.4  File Input/Output

          7.1.14.4  File Input/Output

          HTCondor is able to manage a computer program which reads and writes formatted data files.

          An example of formatted file I/O is here in the program myprogram.c. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
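          The linked source is not shown here. As a hypothetical sketch of formatted file I/O, the following C code copies records from a formatted input file to a formatted output file; the names myinputdata and myoutputdata match those described in this section, but the processing itself is invented for illustration:

          /* Sketch of formatted file I/O: read myinputdata, write myoutputdata. */
          #include <stdio.h>

          int main(void)
          {
              char buffer[256];
              FILE *in  = fopen("myinputdata", "r");
              FILE *out = fopen("myoutputdata", "w");

              printf("***  MAIN START  ***\n");

              if (in == NULL || out == NULL) {
                  fprintf(stderr, "cannot open data files\n");
                  return 1;
              }

              /* Copy each formatted record from the input file to the output file. */
              while (fgets(buffer, sizeof(buffer), in) != NULL)
                  fputs(buffer, out);

              fclose(in);
              fclose(out);

              printf("***  MAIN STOP  ***\n");
              return 0;
          }
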

          Prepare a job submission file with an appropriate filename, here named myjob.sub. This example combines formatted file I/O with standard output:

          # FILENAME: myjob.sub
          
          executable = myprogram
          
          # Standard I/O files, HTCondor log file
          output = mydata.out
          error  = mydata.err
          log    = mydata.log
          
          # queue one job
          queue
          

          This job submission file uses HTCondor's default universe, Vanilla. Because this Vanilla job does not turn on HTCondor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job. Also by default, the number of jobs queued is one.

          This submission specifies that there exists a formatted input file, myinputdata, a name which appears in the source code only. The result is a formatted output file, myoutputdata, a name which also appears in the source code only. This submission also specifies the names of three files which will receive standard output, standard error, and HTCondor's log entries. These three output files need not preexist, but they can. HTCondor will overwrite standard output and standard error but append to the log file during subsequent submissions.

          To submit this job to HTCondor:

          $ condor_submit myjob.sub
          
        • 7.1.14.5  Standard Universe Job

          7.1.14.5  Standard Universe Job

          The Standard Universe is an execution environment of HTCondor. Jobs using the Standard Universe enjoy two advantages. The first is that a higher-priority job may preempt a Standard Universe job without loss of completed work: HTCondor checkpoints the job, moves (migrates) it to a different compute node which would otherwise be idle, and restarts it on the new compute node at precisely the point of preemption. The Standard Universe tells HTCondor that you re-linked your job via condor_compile with the HTCondor libraries, and therefore your job supports checkpointing. HTCondor transfers the executable and checkpoint files automatically, when needed.

          The second advantage of HTCondor's Standard Universe is that remote system calls handle access to files (input and output). For example, HTCondor intercepts a call to read a record of a data file, forwards the read operation to the submission host, which performs it in the user's current working directory, and then sends the desired record back to the compute node, which processes it. A similar process occurs for write operations. The existence of a shared file system is therefore not relevant. This feature maximizes the number of machines which can run a job; compute nodes across an entire enterprise can run it, including compute nodes in different administrative domains.

          This section illustrates how to submit a small job to the Standard Universe of BoilerGrid. This example, myprogram.c, displays the name of the host which runs the job. To compile this program for the Standard Universe, see Compiling Serial Programs.
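          The linked source is not shown here. A hypothetical C sketch of a program that reports the host which ran it might look like the following; getdomainname() is a glibc extension, and the real myprogram.c may obtain the information differently:

          /* Sketch of a program that reports which host ran the job. */
          #include <stdio.h>
          #include <unistd.h>

          int main(void)
          {
              char hostname[256];
              char domainname[256];

              printf("***  MAIN START  ***\n\n");

              if (gethostname(hostname, sizeof(hostname)) == 0)
                  printf("hostname = %s\n", hostname);

              /* getdomainname() is a BSD/glibc extension declared in unistd.h. */
              if (getdomainname(domainname, sizeof(domainname)) == 0)
                  printf("domainname = %s\n", domainname);

              printf("\n***  MAIN  STOP  ***\n");
              return 0;
          }
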

          Prepare a job submission file with the Standard Universe, the compiled C program as the executable, HTCondor's transferring the compiled program to the chosen compute node, and an appropriate filename, here named myjob.sub:

          # FILENAME:  myjob.sub
          
          universe = STANDARD
          
          # Transfer the "executable" myprogram to the compute node.
          transfer_executable = TRUE
          executable          = myprogram
          
          # Standard I/O files, HTCondor log file
          output = mydata.out
          error  = mydata.err
          log    = mydata.log
          
          # queue one job
          queue
          

          Submit this job to HTCondor:

          $ condor_submit myjob.sub
          Submitting job(s).
          Logging submit event(s).
          1 job(s) submitted to cluster 341956.
          

          View job status:

          $ condor_q myusername
          
          -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35556> : condor.rcac.purdue.edu
           ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
          341956.0   myusername     10/22 11:18   0+00:00:00 I  0   7.3  myjob
          

          Place the job on hold to study the submission:

          $ condor_hold 341956
          Cluster 341956 held.
          

          Obtain the requirements of this job:

          $ condor_q myusername -attributes requirements -long
          
          -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35556> : condor.rcac.purdue.edu
          requirements = (Arch == "X86_64") && (OpSys == "LINUX") && ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED)) && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize)
          

          Job requirements reflect the Standard Universe (preemption with checkpointing). This job requires a processor core which runs the Linux operating system on the x86_64 architecture and has the ability to checkpoint the job at preemption. The requirements exclude any mention of the shared file system since a shared file system is not relevant to a Standard Universe job. Running a Standard Universe job does not limit the job to the processor cores which use the same shared file system that the submission host uses. The job may land either on a processor core that uses the same shared file system or not; in either case, the remote I/O of the Standard Universe handles the job's file I/O. The ClassAds Disk and Memory ensure that the chosen core has sufficient resources to hold the disk and memory footprint of the job.

          To see how many compute nodes of BoilerGrid are able to satisfy this job's requirements:

          $ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED))'
          
                               Total Owner Claimed Unclaimed Matched Preempting Backfill
                  X86_64/LINUX 33118 27602    2878      2596      42          0        0
                         Total 33118 27602    2878      2596      42          0        0
          

          The report shows that 33,118 processor cores are candidates for running the job. Using HTCondor's Standard Universe with its remote file I/O maximizes the number of candidate cores available to the job. This large pool of candidate cores has the potential of offering high throughput. Some subset of the candidate processor cores, possibly none, will be available to accept the job at any given moment.

          Release the job from the queue:

          $ condor_release 341956
          Cluster 341956 released.
          

          View results in the file for all standard output, here named mydata.out:

          ***  MAIN START  ***
          
          hostname = cms-100.rcac.purdue.edu
          domainname = (none)
          
          ***  MAIN  STOP  ***
          

          The output shows the name of the compute node which ran the job. While this job happened to run on a node which shares a file system with the submission host, a submission which forced the job onto a node with a different shared file system also ran successfully, because the remote I/O of the Standard Universe handled the reading and writing of records.

          View the log file, mydata.log:

          000 (341956.000.000) 10/22 11:42:22 Job submitted from host: <128.211.157.86:35556>
          ...
          012 (341956.000.000) 10/22 11:42:57 Job was held.
              via condor_hold (by user myusername)
              Code 1 Subcode 0
          ...
          001 (341956.000.000) 10/22 11:43:57 Job executing on host: <128.211.157.10:52556>
          ...
          005 (341956.000.000) 10/22 11:43:57 Job terminated.
              (1) Normal termination (return value 0)
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
              1110  -  Run Bytes Sent By Job
              5431033  -  Run Bytes Received By Job
              1110  -  Total Bytes Sent By Job
              5431033  -  Total Bytes Received By Job
          

          The log file records the main events related to the processing of this job. In this example, the log records the number of bytes read and written between the submission host and the compute node via the remote I/O of the Standard Universe.

          The Standard Universe maximizes throughput with its ability to checkpoint jobs and to intercept remote system calls; the latter avoids requiring the submission host and the compute node to share a file system. Re-linking a job with HTCondor's libraries includes both HTCondor's libraries and the user's libraries as static libraries. The danger of this effort to maximize throughput is that an HTCondor flock is a heterogeneous collection of old and new compute nodes, so a job can land on a compute node that is unable to run it. When this happens, you must consider how to steer the job away from compute nodes which cannot run it to successful completion.

        • 7.1.14.6  Vanilla Universe Job (without shared file system)

          7.1.14.6  Vanilla Universe Job (without shared file system)

          The Vanilla Universe is an execution environment of HTCondor. The Vanilla Universe tells HTCondor that you did not re-link your job via condor_compile with the HTCondor libraries, and therefore your job does not support checkpointing or remote system calls. Such jobs include an executable binary of a commercial application, a shell script, or a program which takes advantage of features of a compiler that is not compatible with HTCondor's condor_compile command.

          For jobs submitted under the Vanilla Universe, the existence of a shared file system is relevant, since access to files (input and output) involves either a shared file system or HTCondor's file transfer mechanism, not the remote system calls of the Standard Universe.

          This section illustrates how to submit a small job to the Vanilla Universe of BoilerGrid with HTCondor's file transfer mechanism turned on, so the job does not depend on a shared file system between the submission host and the compute node. No matter which processor core HTCondor chooses to run the job, HTCondor transfers the files. This example, myprogram.c, displays the name of the host which runs the job, the path name of the current working directory, and the contents of that directory. The Vanilla Universe allows using Linux commands to obtain this information, as in the sketch below. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
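          The linked source is not shown here. The hypothetical C sketch below gathers the same information with Linux commands invoked through system(); the real myprogram.c may differ. Note that system() is acceptable only in the Vanilla Universe:

          /* Sketch for the Vanilla Universe: shell commands report the host,
           * working directory, and directory contents (system() is not
           * allowed in the Standard Universe). */
          #include <stdio.h>
          #include <stdlib.h>

          int main(void)
          {
              system("hostname");     /* name of the compute node        */
              system("domainname");   /* NIS domain name, often "(none)" */
              system("pwd");          /* temporary working directory     */
              system("ls -l");        /* contents of that directory      */

              /* printf output is buffered, so it may appear in mydata.out
               * after the output of the shell commands above. */
              printf("***  MAIN START  ***\n\n");
              /* ... real work would go here ... */
              printf("\n***  MAIN  STOP  ***\n");
              return 0;
          }
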

          Prepare a job submission file with the Vanilla Universe, the compiled C program as the executable, HTCondor's file transfer mechanism turned on, HTCondor's transferring the compiled program to the chosen compute node, and an appropriate filename, here named myjob.sub:

          # FILENAME:  myjob.sub
          
          universe = VANILLA
          
          # Transfer the "executable" myprogram to the compute node.
          transfer_executable = TRUE
          executable          = myprogram
          
          # Turn on HTCondor's file transfer mechanism.
          should_transfer_files   = YES
          
          # Let HTCondor handle output files.
          when_to_transfer_output = ON_EXIT
          
          # Standard I/O files, HTCondor log file
          output = mydata.out
          error  = mydata.err
          log    = mydata.log
          
          # queue one job
          queue
          

          Submit this job to HTCondor:

          $ condor_submit myjob.sub
          Submitting job(s).
          Logging submit event(s).
          1 job(s) submitted to cluster 341960.
          

          View job status:

          $ condor_q myusername
          
          -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35556> : condor.rcac.purdue.edu
           ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
          341960.0   myusername     10/25 15:02   0+00:00:00 I  0   0.0  myjob
          

          Place this job on hold to study the submission:

          $ condor_hold 341960
          Cluster 341960 held.
          

          Obtain the requirements of this job:

          $ condor_q myusername -attributes requirements -long
          
          
          -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35556> : condor.rcac.purdue.edu
          requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (HasFileTransfer)
          

          Job requirements reflect both the Vanilla Universe (preemption without checkpointing) and HTCondor's file transfer mechanism turned on. This job requires a processor core which runs the Linux operating system on the x86_64 architecture, but more importantly the processor core chosen to run this job can reside on a cluster which lacks a shared file system. The ClassAd of this job states that the chosen core must have the file transfer capability. The ClassAds Disk and Memory ensure that the chosen core has sufficient resources to hold the disk and memory footprint of the job.

          To see how many compute nodes of BoilerGrid are able to satisfy this job's requirements:

          $ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (HasFileTransfer)'
          
                               Total Owner Claimed Unclaimed Matched Preempting Backfill
                  X86_64/LINUX 33068 20690    3850      8520       8          0        0
                         Total 33068 20690    3850      8520       8          0        0
          

          This report shows that 33,068 processor cores are candidates for running this job. Using HTCondor's Vanilla Universe with its file transfer mechanism turned on maximizes the number of candidate cores available to the job. This large pool of candidate cores has the potential of offering high throughput. On the other hand, transferring large data files may diminish throughput. Some subset of the candidate processor cores, possibly none, will be available to accept the job at any given moment.

          Release the job from the queue:

          $ condor_release 341960
          Cluster 341960 released.
          

          View results in the file for all standard output, here named mydata.out:

          cms-100.rcac.purdue.edu
          (none)
          /var/condor/execute/dir_13374
          total 12
          -rwxr-xr-x 1 myusername itap 6863 Oct 25 15:47 condor_exec.exe
          -rw-r--r-- 1 myusername itap    0 Oct 25 15:50 mydata.err
          -rw-r--r-- 1 myusername itap   61 Oct 25 15:50 mydata.out
          ***  MAIN START  ***
          
          
          ***  MAIN  STOP  ***
          

          The output shows the name of the compute node which ran the job. This job ran on a node which shares a file system with the submission host. Despite this, the current working directory is a temporary directory on the compute node; therefore, this job used the file transfer mechanism for file I/O.

          View the log file, mydata.log:

          000 (341960.000.000) 10/25 15:03:18 Job submitted from host: <128.211.157.86:35556>
          ...
          012 (341960.000.000) 10/25 15:03:35 Job was held.
              via condor_hold (by user myusername)
              Code 1 Subcode 0
          ...
          013 (341960.000.000) 10/25 15:48:00 Job was released.
              via condor_release (by user myusername)
          ...
          001 (341960.000.000) 10/25 15:50:46 Job executing on host: <128.211.157.10:33047>
          ...
          005 (341960.000.000) 10/25 15:50:46 Job terminated.
              (1) Normal termination (return value 0)
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
              272  -  Run Bytes Sent By Job
              6863  -  Run Bytes Received By Job
              272  -  Total Bytes Sent By Job
              6863  -  Total Bytes Received By Job

          The log file records the main events related to the processing of this job. In this example, the log records the number of bytes transferred between the submission host and the compute node via HTCondor's file transfer mechanism.

          The Vanilla Universe is available for jobs which cannot take advantage of the Standard Universe because you cannot re-link them with HTCondor's libraries: an executable binary, a shell script, or a program compiled with an incompatible compiler. If a shared file system is not available and HTCondor's file transfer mechanism is suitable for the job, you may turn on the file transfer mechanism, and the Vanilla job will transfer your files. The size of any file which you intend to transfer must be reasonable: it must fit in the available disk space of the compute node, and the transfer time cannot be so long that higher-priority jobs constantly preempt the HTCondor job during the transfer.

        • 7.1.14.7  /tmp File

          7.1.14.7  /tmp File

          Some applications write a large amount of intermediate data to a temporary file during an early phase of processing and then read that data back during a later phase. This file may be too large to fit within the quota of a home directory, or it may require too much I/O activity between the compute node and either the home directory or the scratch file directory. The way to handle such an intermediate file on BoilerGrid is to use the /tmp directory of the compute node which runs the job. Used properly, /tmp may provide faster local storage to an active process than any other storage option.

          Such a job should run in the Vanilla Universe. When preemption occurs, a Vanilla job restarts at the beginning and rebuilds the intermediate data file from scratch. HTCondor's Standard Universe is not applicable, since a checkpoint does not include any file in /tmp.

          This section illustrates how to submit a small job which first writes and then reads an intermediate data file residing in the /tmp directory. This example, myprogram.c, displays the contents of the /tmp directory before and after processing, using Linux commands to access system information. To compile this program, see Compiling Serial Programs.
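          The linked source is not shown here. The hypothetical C sketch below follows the same pattern: it writes an intermediate file to /tmp, lists it before and after creation, and reads the data back. The file name /tmp/mytmpfile matches the output shown below, but the details are invented for illustration:

          /* Sketch of using node-local /tmp for an intermediate file. */
          #include <stdio.h>
          #include <stdlib.h>

          int main(void)
          {
              char buffer[64];
              FILE *fp;

              /* Show (on standard error) that the file does not yet exist. */
              system("ls -l /tmp/mytmpfile 1>&2");

              /* Early phase: write intermediate data to node-local storage. */
              fp = fopen("/tmp/mytmpfile", "w");
              if (fp == NULL)
                  return 1;
              fprintf(fp, "abcdefghijk\n");
              fclose(fp);

              /* Show (on standard output) that the file now exists. */
              system("ls -l /tmp/mytmpfile");

              printf("***  MAIN START  ***\n\n");

              /* Later phase: read the intermediate data back for processing. */
              fp = fopen("/tmp/mytmpfile", "r");
              if (fp != NULL && fgets(buffer, sizeof(buffer), fp) != NULL)
                  printf("/tmp file data:  %s", buffer);
              if (fp != NULL)
                  fclose(fp);

              /* Clean up so the next job finds a tidy /tmp. */
              remove("/tmp/mytmpfile");

              printf("\n***  MAIN  STOP  ***\n");
              return 0;
          }
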

          Prepare a job submission file with the Vanilla Universe, HTCondor's transferring the compiled program to the chosen compute node, the compiled C program specified as the executable, HTCondor's file transfer mechanism turned on if needed, and an appropriate filename, here named myjob.sub:

          # FILENAME:  myjob.sub
          
          universe = VANILLA
          
          # Transfer the "executable" myprogram to the compute node.
          transfer_executable = TRUE
          executable          = myprogram
          
          # HTCondor's file transfer mechanism is turned on only when needed.
          should_transfer_files = IF_NEEDED
          
          # Let HTCondor handle output file(s).
          when_to_transfer_output = ON_EXIT
          
          # Standard I/O files, HTCondor log file
          output = mydata.out
          error  = mydata.err
          log    = mydata.log
          
          # queue one job
          queue
          

          Submit this job to HTCondor:

          $ condor_submit myjob.sub
          Submitting job(s).
          1 job(s) submitted to cluster 346033.
          

          View job status:

          $ condor_q myusername
          
          -- Submitter: condor-fe00.rcac.purdue.edu : <128.211.157.87:40924> : condor-fe00.rcac.purdue.edu
           ID         OWNER          SUBMITTED     RUN_TIME   ST PRI SIZE CMD
          346033.0   myusername     6/16  15:05    0+00:00:00 I  0   0.0  myprogram
          
          1 jobs; 1 idle, 0 running, 0 held
          

          View results in the file for all standard output, here named mydata.out:

          -rw-r--r-- 1 myusername itap 12 Jun 16 15:12 /tmp/mytmpfile
          ***  MAIN START  ***
          
          /tmp file data:  abcdefghijk
          
          ***  MAIN  STOP  ***
          

          The output verifies the existence of the intermediate data file in the /tmp directory.

          View results in the file for all standard error, here named mydata.err:

          ls: /tmp/mytmpfile: No such file or directory
          

          The results in the error file verify that the intermediate data file does not exist at the start of processing.

          View the log file, mydata.log:

          000 (346033.000.000) 06/16 15:05:25 Job submitted from host: <128.211.158.38:40666>
          ...
          001 (346033.000.000) 06/16 15:12:00 Job executing on host: <172.18.22.85:54211?PrivNet=condor.ccb.purdue.edu>
          ...
          005 (346033.000.000) 06/16 15:12:01 Job terminated.
              (1) Normal termination (return value 0)
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
              0  -  Run Bytes Sent By Job
              0  -  Run Bytes Received By Job
              0  -  Total Bytes Sent By Job
              0  -  Total Bytes Received By Job
          ...
          

          The log file records the main events related to the processing of this job. In this example, the log records that the number of bytes read and written between the submission host and the compute node is zero, an indication that this job used the shared file system for file I/O.

          While the /tmp directory can provide faster local storage to an active process than other storage options, you never know how much storage is available in the /tmp directory of the compute node chosen to run your job. If an intermediate data file consistently fails to fit in the /tmp directories of a set of compute nodes, consider limiting the pool of candidate compute nodes to those which can handle your intermediate data file.

        • 7.1.14.8  Parameter Sweep

          7.1.14.8  Parameter Sweep

          A parameter sweep is a computation applied to several unique sets of data. Computing the average test score for each class in a school is an example of a simple parameter sweep.

          A parameter sweep requires that each queued run of a computer program reads a unique set of input parameters. This example illustrates how to implement a unique command line argument, a unique standard input, and a unique formatted input data file. The program writes to a unique standard output and to a unique formatted output file.

          HTCondor provides a pre-defined macro $(Process) which supports the specification of unique command-line arguments and input/output files so that each queued run of a job sees a unique set of data. This macro appears in the job submission file, myprogram.sub:

          # FILENAME:  myprogram.sub
          
          universe = VANILLA
          
          executable = myprogram
          # Processes 0,1,2
          # command line argument
          arguments  = $(Process)
          
          # Standard I/O files, HTCondor log file
          input  = mydata.in.$(Process)
          output = mydata.out.$(Process)
          error  = mydata.err.$(Process)
          log    = mydata.log
          
          # queue 3 jobs in 1 cluster
          queue 3
          

          This job submission file specifies a parameter sweep of three queued runs. Each process has a unique ID: 0, 1, and 2. Process 0 of the parameter sweep expects a standard input file named "mydata.in.0"; process 1, "mydata.in.1"; and process 2, "mydata.in.2". The sweep will generate similarly named files for standard output and error. HTCondor advises using a single log file in a submission. In addition, the sweep expects to find formatted input data files with the same process number used as a suffix: i_mydata.0, i_mydata.1, i_mydata.2. Each copy of the program myprogram.c used in this sweep finds its unique process number in its command-line argument and appends that process number to the generic names "i_mydata." and "o_mydata." to make unique formatted data file names, as the sketch below shows. All files reside in the user's current working directory; hence, data file names must be unique. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
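          The linked source is not shown here. The hypothetical C sketch below illustrates the naming scheme: it builds unique file names from the process number passed as its first command-line argument and then copies the unique input to the unique output. The actual myprogram.c may process the data differently:

          /* Sketch of building unique file names from the $(Process) number. */
          #include <stdio.h>

          int main(int argc, char *argv[])
          {
              char inname[64], outname[64], buffer[256];

              if (argc < 2) {
                  fprintf(stderr, "usage: %s process-number\n", argv[0]);
                  return 1;
              }

              /* Append the process number to the generic data file names. */
              snprintf(inname,  sizeof(inname),  "i_mydata.%s", argv[1]);
              snprintf(outname, sizeof(outname), "o_mydata.%s", argv[1]);

              FILE *in  = fopen(inname,  "r");
              FILE *out = fopen(outname, "w");
              if (in == NULL || out == NULL)
                  return 1;

              /* Copy the unique formatted input to the unique formatted output. */
              while (fgets(buffer, sizeof(buffer), in) != NULL)
                  fputs(buffer, out);

              fclose(in);
              fclose(out);
              return 0;
          }
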

          This job submission file uses HTCondor's default universe, Vanilla. Because this Vanilla job does not turn on HTCondor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job.

          To submit the executable to HTCondor:

          $ condor_submit myprogram.sub
          

          For this submission, command condor_q will show a single cluster number, three unique process numbers, and three unique command-line arguments:

          $ condor_q myusername
          
          -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
           ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
          746419.0   myusername     10/28 10:57   0+00:00:00 I  0   0.0  myprogram 0
          746419.1   myusername     10/28 10:57   0+00:00:00 I  0   0.0  myprogram 1
          746419.2   myusername     10/28 10:57   0+00:00:00 I  0   0.0  myprogram 2
          

          View the standard input file for process 0, mydata.in.0:

          textfromstandardinput:process0
          

          View the formatted input file for process 0, i_mydata.0:

          textfromformattedinput:process0
          

          View the standard output file for process 0, mydata.out.0:

          ***  MAIN START  ***
          
          program name:          condor_exec.exe
          command line argument: 0
          standard input/output: textfromstandardinput:process0
          formatted input/output: textfromformattedinput:process0
          
          ***  MAIN  STOP  ***
          

          View the formatted output file for process 0, o_mydata.0:

          textfromformattedinput:process0
          

          Processes 1 and 2 have similar input and output files.

          The single log file records the major events of the three queued runs of this parameter sweep:

          000 (746419.000.000) 10/28 10:57:43 Job submitted from host: <128.211.157.86:60481>
          ...
          000 (746419.001.000) 10/28 10:57:43 Job submitted from host: <128.211.157.86:60481>
          ...
          000 (746419.002.000) 10/28 10:57:43 Job submitted from host: <128.211.157.86:60481>
          ...
          001 (746419.001.000) 10/28 11:02:14 Job executing on host: <128.211.157.10:44836>
          ...
          005 (746419.001.000) 10/28 11:02:14 Job terminated.
              (1) Normal termination (return value 0)
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
              950  -  Run Bytes Sent By Job
              9800  -  Run Bytes Received By Job
              950  -  Total Bytes Sent By Job
              9800  -  Total Bytes Received By Job
          ...
          001 (746419.000.000) 10/28 11:02:15 Job executing on host: <128.211.157.10:44836>
          ...
          005 (746419.000.000) 10/28 11:02:15 Job terminated.
              (1) Normal termination (return value 0)
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
              950  -  Run Bytes Sent By Job
              9800  -  Run Bytes Received By Job
              950  -  Total Bytes Sent By Job
              9800  -  Total Bytes Received By Job
          ...
          001 (746419.002.000) 10/28 11:02:17 Job executing on host: <128.211.157.10:44836>
          ...
          005 (746419.002.000) 10/28 11:02:17 Job terminated.
              (1) Normal termination (return value 0)
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
              950  -  Run Bytes Sent By Job
              9800  -  Run Bytes Received By Job
              950  -  Total Bytes Sent By Job
              9800  -  Total Bytes Received By Job
          

          HTCondor's parameter sweep offers huge potential. Simply adding a large number to the queue command in a job submission file yields an enormous amount of computing. The catch is the effort needed to prepare the unique input files, whether standard input files or formatted input files. This effort can be minimal when the input data comes from some data collector operating in the field; it can be enormous when you must enter each unique dataset from the keyboard.

        • 7.1.14.9  Parameter Sweep - Initial Directory

          7.1.14.9  Parameter Sweep - Initial Directory

          A parameter sweep is a computation applied to several unique sets of data. Computing the average test score for each class in a school is an example of a simple parameter sweep.

          A parameter sweep requires that each queued run of a computer program reads a unique set of input parameters. This example illustrates how to implement a unique command line argument, a unique standard input, and a unique formatted input data file. The program writes to a unique standard output and to a unique formatted output file.

          HTCondor provides a pre-defined macro $(Process) which supports the specification of unique command-line arguments so that each queued run of a job sees a unique set of data.

          Also, HTCondor provides an "initial directory" which supports the specification of unique input/output files so that each queued run of a job sees a unique set of data. Command initialdir specifies a generic directory name which becomes unique after appending the process number of a queued run of a parameter sweep. Each initial directory is actually a subdirectory of the user's current working directory. Each initial directory holds the unique standard input and formatted input files of a queued run of a parameter sweep; each initial directory receives the unique standard output, error and log files plus any unique formatted output files generated by a queued run of a parameter sweep. Since data files of each run of a sweep reside in a separate directory, identical file names may be used; they need not be modified with a process number. Both macro and command appear in the job submission file, myprogram.sub:

          # FILENAME:  myprogram.sub
          
          universe = VANILLA
          
          executable = myprogram
          # Processes 0,1,2
          # command line argument
          arguments  = $(Process)
          
          initialdir = mydatadirectory.$(Process)
          
          # Standard I/O files, HTCondor log file
          input          = mydata.in
          output         = mydata.out
          error          = mydata.err
          log            = mydata.log
          
          # queue 3 jobs in 1 cluster
          queue 3
          

          This job submission file specifies a parameter sweep of three queued runs. Each process has a unique ID: 0, 1, and 2. Process 0 of the parameter sweep expects a standard input file named "mydata.in" to reside in the initial directory named "mydatadirectory.0"; for process 1, "mydata.in" resides in "mydatadirectory.1"; and for process 2, "mydata.in" resides in "mydatadirectory.2". The sweep will generate similarly named files for standard output, error, and log in the initial directories. In addition, the sweep expects to find in each initial directory a formatted input data file with the identical name myinputdata. Each copy of the program myprogram.c used in this sweep finds its unique process number in its command-line argument and finds its unique formatted input data file in its own initial directory. The program does not append its process number to the generic names of formatted files. Because the data files of each run of the sweep reside in separate directories, the file names may be identical; they need not be made unique with a process number. To compile this program for the Vanilla Universe, see Compiling Serial Programs.

          This job submission file uses HTCondor's default universe, Vanilla. Because this Vanilla job does not turn on HTCondor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job.

          To submit the executable to HTCondor:

          $ condor_submit myprogram.sub
          

          For this submission, command condor_q will show a single cluster number, three unique process numbers, and three unique command-line arguments:

          $ condor_q myusername
          
          -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
           ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
          746420.0   myusername     10/28 12:28   0+00:00:00 I  0   0.0  myprogram 0
          746420.1   myusername     10/28 12:28   0+00:00:00 I  0   0.0  myprogram 1
          746420.2   myusername     10/28 12:28   0+00:00:00 I  0   0.0  myprogram 2
          

          View the standard input file for process 0, mydata.in, in the initial directory mydatadirectory.0:

          textfromstandardinput:process0
          

          View the formatted input file for process 0, myinputdata, in the initial directory mydatadirectory.0:

          textfromformattedinput:process0
          

          View the standard output file for process 0, mydata.out, in the initial directory mydatadirectory.0:

          ***  MAIN START  ***
          
          program name:          condor_exec.exe
          command line argument: 0
          standard input/output: textfromstandardinput:process0
          formatted input/output: textfromformattedinput:process0
          
          ***  MAIN  STOP  ***
          

          View the formatted output file for process 0, myoutputdata, in the initial directory mydatadirectory.0:

          textfromformattedinput:process0
          

          Each initial directory's log file, mydata.log, records the major events of that process's queued run of this parameter sweep. View the log file for process 0, mydata.log, in the initial directory mydatadirectory.0:

          000 (746420.000.000) 10/28 12:28:35 Job submitted from host: <128.211.157.86:60481>
          ...
          001 (746420.000.000) 10/28 12:33:48 Job executing on host: <128.211.157.10:34460>
          ...
          005 (746420.000.000) 10/28 12:33:49 Job terminated.
              (1) Normal termination (return value 0)
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
              909  -  Run Bytes Sent By Job
              9800  -  Run Bytes Received By Job
              909  -  Total Bytes Sent By Job
              9800  -  Total Bytes Received By Job

          Processes 1 and 2 have similar input, output and log files and formatted input/output files residing in their respective initial directories.

          HTCondor's parameter sweep offers huge potential. Simply adding a large number to the queue command in a job submission file yields an enormous amount of computing. The catch is that the effort needed to prepare the unique input files, either standard input files or formatted input files, can be great. This effort is minimal when the input data comes from some data collector operating in the field; it can be overwhelming when you must enter each unique dataset from the keyboard.

        • 7.1.14.10  Parameter Sweep - Single Data File

          7.1.14.10  Parameter Sweep - Single Data File

          A parameter sweep is a computation applied to several unique sets of data. Computing the average test score for each class in a school is an example of a simple parameter sweep.

          A parameter sweep requires that each queued run of a computer program reads a unique set of input parameters. This example illustrates how to implement a parameter sweep over a single large file. Each queued run of the job reads a different portion of the same file.

          HTCondor provides a pre-defined macro $(Process) which supports the specification of unique command-line arguments and input/output files so that each queued run of a job sees a unique set of data. This macro appears in the job submission file, myprogram.sub:

          # FILENAME:  myprogram.sub
          
          universe = VANILLA
          
          executable = myprogram
          # Processes 0,1,2
          arguments  = $(Process)
          
          # There is a single formatted input data file, myinputdata.
          
          # Standard I/O files, HTCondor log file
          output = mydata.out.$(Process)
          error  = mydata.err.$(Process)
          log    = mydata.log
          
          # queue 3 jobs in 1 cluster
          queue 3
          

          This job submission file specifies a parameter sweep of three queued runs. Each process has a unique ID: 0, 1, and 2. Each queued run of this job will read a different portion of the data file. Process 0 of the parameter sweep writes a standard output file named "mydata.out.0"; process 1, "mydata.out.1"; and process 2, "mydata.out.2". The sweep will generate similarly named files for standard error. HTCondor advises using a single log file in a submission to record the major events of the sweep. In addition, the sweep expects to find a single formatted input data file, myinputdata. Each copy of the program myprogram.c used in this sweep finds its unique process number in its command-line argument and uses that number to determine where in the single input data file it is to start reading records. All files reside in the user's current working directory; hence, data file names must be unique. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
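          The linked source is not shown here. The hypothetical C sketch below illustrates one way to read a distinct slice of the shared file: it assumes fixed-length records of 11 bytes (10 characters plus a newline) and 10 records per process, which matches the sample data, and seeks to the portion selected by the process number. The actual myprogram.c may partition the file differently:

          /* Sketch of reading a distinct slice of one shared input file. */
          #include <stdio.h>
          #include <stdlib.h>

          #define RECORD_LEN   11L   /* bytes per record, including newline */
          #define RECS_PER_JOB 10L   /* records handled by each process     */

          int main(int argc, char *argv[])
          {
              char line[64];

              if (argc < 2)
                  return 1;

              long proc   = atol(argv[1]);                    /* $(Process) */
              long offset = proc * RECORD_LEN * RECS_PER_JOB;

              FILE *fp = fopen("myinputdata", "r");
              if (fp == NULL)
                  return 1;

              /* Seek to this process's portion of the shared file. */
              if (fseek(fp, offset, SEEK_SET) != 0)
                  return 1;
              printf("starting file position: %ld\n", offset);

              for (long i = 0; i < RECS_PER_JOB; i++) {
                  if (fgets(line, sizeof(line), fp) == NULL)
                      break;
                  printf("line %ld: %s", proc * RECS_PER_JOB + i + 1, line);
              }

              fclose(fp);
              return 0;
          }
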

          This job submission file uses HTCondor's default universe, Vanilla. Because this Vanilla job does not turn on HTCondor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job.

          To submit the executable to HTCondor:

          $ condor_submit myprogram.sub
          

          For this submission, command condor_q will show a single cluster number, three unique process numbers, and three unique command-line arguments:

          $ condor_q myusername
          
          -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
           ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
          746421.0   myusername     10/29 10:57   0+00:00:00 I  0   0.0  myprogram 0
          746421.1   myusername     10/29 10:57   0+00:00:00 I  0   0.0  myprogram 1
          746421.2   myusername     10/29 10:57   0+00:00:00 I  0   0.0  myprogram 2
          

          View the single formatted input file, myinputdata:

          AAAAAAAAAA
          BBBBBBBBBB
          CCCCCCCCCC
              :
          ZZZZZZZZZZ
          0000000000
          1111111111
          2222222222
          3333333333
          

          View the standard output file for process 0, mydata.out.0:

          ***  MAIN START  ***
          
          program name:          condor_exec.exe
          command line argument: 0
          current file position:   0
          rtn_val = 0
          starting file position:   0
          line 1:   AAAAAAAAAA
          line 2:   BBBBBBBBBB
          line 3:   CCCCCCCCCC
          line 4:   DDDDDDDDDD
          line 5:   EEEEEEEEEE
          line 6:   FFFFFFFFFF
          line 7:   GGGGGGGGGG
          line 8:   HHHHHHHHHH
          line 9:   IIIIIIIIII
          line 10:   JJJJJJJJJJ
          
          ***  MAIN  STOP  ***
          

          View the standard output file for process 1, mydata.out.1:

          ***  MAIN START  ***
          
          program name:          condor_exec.exe
          command line argument: 1
          current file position:   0
          rtn_val = 0
          starting file position:   110
          line 11:   KKKKKKKKKK
          line 12:   LLLLLLLLLL
          line 13:   MMMMMMMMMM
          line 14:   NNNNNNNNNN
          line 15:   OOOOOOOOOO
          line 16:   PPPPPPPPPP
          line 17:   QQQQQQQQQQ
          line 18:   RRRRRRRRRR
          line 19:   SSSSSSSSSS
          line 20:   TTTTTTTTTT
          rtn_val = 0
          starting file position:   0
          line 0:   AAAAAAAAAA
          rtn_val = 0
          starting file position:   220
          line 21:   UUUUUUUUUU
          
          ***  MAIN  STOP  ***
          

          Process 1 also performs additional random file accesses.

          View the standard output file for process 2, mydata.out.2:

          ***  MAIN START  ***
          
          program name:          condor_exec.exe
          command line argument: 2
          current file position:   0
          rtn_val = 0
          starting file position:   220
          line 21:   UUUUUUUUUU
          line 22:   VVVVVVVVVV
          line 23:   WWWWWWWWWW
          line 24:   XXXXXXXXXX
          line 25:   YYYYYYYYYY
          line 26:   ZZZZZZZZZZ
          line 27:   0000000000
          line 28:   1111111111
          line 29:   2222222222
          line 30:   3333333333
          
          ***  MAIN  STOP  ***
          

          HTCondor's parameter sweep, when applied to a single, large data file, offers huge potential. Simply adding a large number to the queue command in a job submission file applies several compute servers to the data processing.

        • 7.1.14.11  Transfer a Subdirectory

          7.1.14.11  Transfer a Subdirectory

          To review, HTCondor is unable to transfer a subdirectory of data files to a compute server. While the submit command transfer_input_files allows paths when specifying which input files to transfer, HTCondor places all transferred files in a single, flat directory where the executable and standard input file reside - the temporary working directory on the compute server. Therefore, the executing program must access input files without paths.

          A similar situation exists for output files. If the program creates output files during execution, it must create them within the temporary working directory. HTCondor transfers back all new and modified files within the temporary working directory - the output files. To transfer back only a subset of these files, use the submit command transfer_output_files. HTCondor does not support the transfer of output files that exist but that do not reside within the temporary working directory on the compute server.

          This restriction need not deter the user with a subdirectory of input and output files. The user simply makes an archive of the subdirectory structure with the tar utility and tells HTCondor to transfer the tar file. The application then un-tars the archive before reading the input files, and it may write output files within the subdirectory. As its final step, the application archives the files which it made or modified. HTCondor sees this archive as an output file and transfers it from the compute server to the user's working directory on the submission host. Finally, the user extracts the output files from the archive.

          The computer program, myprogram.c, reads a formatted data file and writes a formatted data file. This example assumes that there exists a formatted input file, myinputdata, in a subdirectory named mysubdirectory. The result is a formatted output file, myoutputdata, in the same subdirectory. The program uses the tar utility to extract the subdirectory structure on the compute server. After the program writes the output file, it uses the tar utility again to archive the subdirectory's output files only. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
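          The linked source is not shown here. The hypothetical C sketch below follows the same pattern, with the tar utility invoked through system(): it extracts the input archive, copies the input file to the output file inside the subdirectory, and archives only the output file for transfer back. The real myprogram.c may differ in detail:

          /* Sketch of handling a subdirectory through tar archives. */
          #include <stdio.h>
          #include <stdlib.h>

          int main(void)
          {
              char buffer[256];

              /* Unpack the transferred input archive into the temporary
               * working directory on the compute server. */
              if (system("tar xf myarchive.i.tar") != 0)
                  return 1;

              /* Read the input file and write the output file, both inside
               * the extracted subdirectory. */
              FILE *in  = fopen("mysubdirectory/myinputdata", "r");
              FILE *out = fopen("mysubdirectory/myoutputdata", "w");
              if (in == NULL || out == NULL)
                  return 1;
              while (fgets(buffer, sizeof(buffer), in) != NULL)
                  fputs(buffer, out);
              fclose(in);
              fclose(out);

              /* Archive only the output file so HTCondor transfers it back. */
              if (system("tar cf myarchive.o.tar mysubdirectory/myoutputdata") != 0)
                  return 1;

              return 0;
          }
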

          This example assumes that the current working directory has a subdirectory containing a formatted input file. The tar utility prepares the archive of input files:

          tar cf myarchive.i.tar mysubdirectory
          

          Prepare a job submission file, myprogram.sub. Specify the Vanilla Universe and the file transfer mechanism as "on":

          # FILENAME:  myprogram.sub
          
          universe = VANILLA
          
          executable = myprogram
          
          # Specify the archive as the input data file.
          transfer_input_files = myarchive.i.tar
          
          # Turn on file transfer mechanism.
          should_transfer_files   = YES
          
          # Let HTCondor handle output file(s): myarchive.o.tar.
          when_to_transfer_output = ON_EXIT
          
          # Standard output files, HTCondor log file
          output = mydata.out
          error  = mydata.err
          log    = mydata.log
          
          # queue one job
          queue
          

          To submit the executable to HTCondor:

          $ condor_submit myprogram.sub
          

          The standard output file, mydata.out, shows the evolution of the current working directory on the compute server. Initially, it shows that HTCondor transferred the tar file which contains the archived subdirectory of input data file(s). After extraction, the subdirectory mysubdirectory and its formatted input file myinputdata are visible. After processing, the formatted output file myoutputdata is visible:

          total 24
          -rwxr-xr-x 1 myusername itap  8708 Nov 12 15:27 condor_exec.exe
          -rw-r--r-- 1 myusername itap 10240 Nov 12 15:27 myarchive.i.tar
          -rw-r--r-- 1 myusername itap     0 Nov 12 15:30 mydata.err
          -rw-r--r-- 1 myusername itap     0 Nov 12 15:30 mydata.out
          total 32
          -rwxr-xr-x 1 myusername itap  8708 Nov 12 15:27 condor_exec.exe
          -rw-r--r-- 1 myusername itap 10240 Nov 12 15:27 myarchive.i.tar
          -rw-r--r-- 1 myusername itap     0 Nov 12 15:30 mydata.err
          -rw-r--r-- 1 myusername itap   227 Nov 12 15:30 mydata.out
          drwxr-x--- 3 myusername itap  4096 Feb 14  2008 mysubdirectory
          total 8
          drwx------ 2 myusername itap 4096 Feb 14  2008  ..
          -rw-r--r-- 1 myusername itap   19 Jul 12  2007 myinputdata
          total 12
          drwx------ 2 myusername itap 4096 Feb 14  2008  ..
          -rw-r--r-- 1 myusername itap   19 Jul 12  2007 myinputdata
          -rw-r--r-- 1 myusername itap   28 Nov 12 15:30 myoutputdata
          ***  MAIN START  ***
          
          formatted input/output: textinsubdirectory
          
          ***  MAIN  STOP  ***
          

          At job completion, HTCondor sees the file myarchive.o.tar as a new output file and transfers it to the submission host. After the transfer, the user extracts the output file(s) from this archive:

          tar xf myarchive.o.tar mysubdirectory/myoutputdata
          

          View the log file, mydata.log:

          000 (342352.000.000) 11/12 15:29:31 Job submitted from host: <128.211.157.86:47933>
          ...
          001 (342352.000.000) 11/12 15:30:55 Job executing on host: <128.211.157.10:59987?PrivNet=condor.ccb.purdue.edu>
          ...
          005 (342352.000.000) 11/12 15:30:56 Job terminated.
              (1) Normal termination (return value 0)
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
              11094  -  Run Bytes Sent By Job
              18948  -  Run Bytes Received By Job
              11094  -  Total Bytes Sent By Job
              18948  -  Total Bytes Received By Job
          

          The log file records the main events related to the processing of this job. The log shows the number of bytes transferred between the submission host and the compute server via HTCondor's file transfer mechanism.

        • 7.1.14.12  Requiring Specific Amounts of Memory

          7.1.14.12  Requiring Specific Amounts of Memory

          Some applications require compute nodes with a certain minimum amount of memory. These applications may also perform better when even more memory is available on the compute node.

          This section illustrates how to submit a small job to a BoilerGrid compute node with at least 16 GB of memory (requirements) and to prefer compute nodes with even more memory (rank), if available. This example, myprogram.c, displays the name of the host which runs the job, the path name of the current working directory, and the contents of that directory.

          Prepare a job submission file with an appropriate filename, here named myjob.sub:

          # FILENAME:  myjob.sub
          
          universe = VANILLA
          
          # Require a compute node with at least 16 GB of memory.
          # 16 GB == 16046 MB;
          requirements = TotalMemory >= 16046
          
          # Prefer a compute node with more than 16 GB, if available.
          rank = TotalMemory
          
          # Transfer the "executable" myprogram to the compute node.
          transfer_executable = TRUE
          executable          = myprogram
          
          # Turn on HTCondor's file transfer mechanism only when needed.
          should_transfer_files   = IF_NEEDED
          when_to_transfer_output = ON_EXIT
          
          # Standard I/O files, HTCondor log file
          output = myprogram.out
          error  = myprogram.err
          log    = myprogram.log
          
          # queue one job
          queue
          

          The ClassAd attribute TotalMemory reports the amount of memory on a compute node, in megabytes. The thresholds in this example sit slightly below the nominal sizes because nodes typically advertise a little less memory than is physically installed. To change this example to request at least 32 GB of total memory, replace "16046" with "32192"; for at least 48 GB, use "48297".
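
          Because advertised values vary from node to node, it can help to survey what the pool actually reports before choosing a threshold. The following command is a sketch built from condor_status and standard shell utilities; the counts it prints will change over time:

          $ condor_status -pool boilergrid.rcac.purdue.edu -format "%d\n" TotalMemory | sort -n | uniq -c

          Each line of output shows how many slots advertise a given TotalMemory value, in megabytes.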

          This example assumes that all compute nodes have a definition for the attribute TotalMemory. To see how many compute nodes in BoilerGrid do not have the attribute TotalMemory defined:

          $ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint 'TotalMemory =?= undefined'
          

          There is no output since all compute nodes of BoilerGrid do have this attribute defined.

          Before submitting your job, you may wish to verify that there are a sufficient number of compute nodes which will satisfy your requirements and that those same compute nodes define the preferred ClassAds expressed in the rank command. To see how many compute nodes satisfy your requirements:

          $ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint 'TotalMemory >= 16046'
          
                               Total Owner Claimed Unclaimed Matched Preempting Backfill
                  X86_64/LINUX 26093 18007    3330      4753       3          0        0
                         Total 26093 18007    3330      4753       3          0        0
          

          There are 26,093 slots (one per processor core) with at least 16 GB of memory.

          View results in the file for all standard output, here named myjob.out:

          cms-100.rcac.purdue.edu
          (none)
          /home/myusername/condor/Introduction/memory
          total 224
          -rw-r--r-- 1 myusername itap 1508 Mar 11 14:38 README
          -rw-r--r-- 1 myusername itap    0 Mar 11 15:36 myjob.err
          -rw-r--r-- 1 myusername itap  791 Mar 11 15:36 myjob.log
          -rw-r--r-- 1 myusername itap   77 Mar 11 15:36 myjob.out
          -rw-r----- 1 myusername itap  663 Mar 11 15:20 myjob.sub
          -rwxr-xr-x 1 myusername itap 6939 Mar 11 14:38 myprogram
          -rw-r----- 1 myusername itap  488 Mar 11 14:40 myprogram.c
          -rwxr----- 1 myusername itap   58 Mar 11 14:38 run
          ***  MAIN START  ***
          
          
          ***  MAIN  STOP  ***
          

          This job happened to run on compute node cms-100, which has 8 processor cores. To verify that cms-100 has at least 16 GB of memory (condor_status reports TotalMemory once for each of the node's eight slots, so the value appears eight times):

          $ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'Machine=="cms-100.rcac.purdue.edu"' -format "%s\n" TotalMemory
          
          16046
          16046
          16046
          16046
          16046
          16046
          16046
          16046
          

          For more information about requirements and rank, see the HTCondor manual.

        • 7.1.14.13  Requiring Specific Architectures or Operating Systems

          7.1.14.13  Requiring Specific Architectures or Operating Systems

          You compile a computer program to run on a specific combination of chip architecture and operating system; this combination is a platform. BoilerGrid contains compute nodes of many different platforms, so you must often specify the platform your program requires. The predominant platform on BoilerGrid is 64-bit Linux ("X86_64/LINUX"). To see a list of all platforms available on BoilerGrid:

          $ condor_status -pool boilergrid.rcac.purdue.edu -total
          
                               Total Owner Claimed Unclaimed Matched Preempting Backfill
          
                   INTEL/LINUX   114    18       0        60       0          0       36
                     INTEL/OSX     2     0       0         2       0          0        0
                 INTEL/WINNT51   334     8       0       326       0          0        0
                 INTEL/WINNT61  6299   982       0      5317       0          0        0
              SUN4u/SOLARIS210     3     0       0         3       0          0        0
                  X86_64/LINUX 30170 19460    4559      6150       0          0        1
          
                         Total 36922 20468    4559     11858       0          0       37
          

          The name "INTEL" as used on BoilerGrid means 32-bit Intel-compatible hardware, and it makes no distinction between Intel and AMD CPUs. The name "X86_64" is a vendor-neutral term to refer to 64-bit architecture from either Intel or AMD. The name "WINNT51" means Windows XP, and "WINNT61" means Windows 7.

          By default, HTCondor will send a job to a compute node whose architecture and operating system match the platform of the submission host. However, you may also submit jobs to compute nodes whose platform differs from that of the submission host. For example, you may compile a program for Windows and submit the executable file to BoilerGrid from one of BoilerGrid's Linux submission hosts by specifying that the job requires a Windows compute node:

          executable   = myprogram.exe
          requirements = (ARCH == "INTEL") && ((OPSYS == "WINNT51") || (OPSYS == "WINNT61"))
          

          You can allow HTCondor to draw on a larger pool of compute nodes for a job if executables are available for multiple platforms. Take care not to reference any absolute paths within your job submission that are specific to one platform or installation; you can often use existing ClassAd variables instead of fixed paths to make platform-neutral submission files, as sketched below.
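
          As a sketch only (the file names here are hypothetical and not part of this guide's examples), the $$() substitution macro lets a single submission file select the executable that matches whichever platform HTCondor chooses, provided you have compiled one executable per platform:

          # Hypothetical fragment: HTCondor substitutes the matched node's OpSys and Arch.
          executable   = myprogram.$$(OpSys).$$(Arch)
          requirements = ((ARCH == "X86_64") && (OPSYS == "LINUX")) || ((ARCH == "INTEL") && (OPSYS == "WINNT61"))

          With this fragment, the submission host must hold executables named, for example, myprogram.LINUX.X86_64 and myprogram.WINNT61.INTEL.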

          For more information about requirements and rank, see the HTCondor manual.

        • 7.1.14.14  Requiring Specific Clusters or Compute Nodes

          7.1.14.14  Requiring Specific Clusters or Compute Nodes

          ITaP research resources include several clusters. Currently, the clusters include the following:

          Radon
          Peregrine 1
          Coates
          CMS
          

          This section illustrates how to apply HTCondor ClassAds to submit a small job to a node in some subset of ITaP resources. These examples execute a simple shell script which displays the name of the compute node which ran the job.

          Prepare a shell script with an appropriate filename, here named myjob.sh:

          #!/bin/sh
          # FILENAME:  myjob.sh
          
          hostname
          

          Change the permissions of the shell script to allow execution by the owner (you):

          $ chmod u+x myjob.sh
          

          Executing only on a node of one or more specific research clusters

          Prepare a job submission file with an appropriate filename, here named myjob.sub. The ClassAd requires that the chosen compute node should reside on either of two clusters. This example also specifies the shell script myjob.sh as the executable, transfers the shell script to the compute node, and uses HTCondor's file transfer mechanism to copy input and output files only if needed (if the same file system is not available on the compute node):

          # FILENAME:  myjob.sub
          
          universe = VANILLA
          
          # Require a compute node of either the Steele or Coates cluster.
          # Attribute name is not case sensitive; attribute value is.
          requirements = (CLUSTERNAME=="Steele") || (clustername=="Coates")
          
          # Transfer the "executable" myjob.sh to the compute node.
          transfer_executable = TRUE
          executable          = myjob.sh
          
          # Turn on HTCondor's file transfer mechanism only when needed.
          should_transfer_files   = IF_NEEDED
          when_to_transfer_output = ON_EXIT
          
          # Standard I/O files, HTCondor log file
          output = myjob.out
          error  = myjob.err
          log    = myjob.log
          
          # queue one job
          queue
          

          Submit the job:

          $ condor_submit myjob.sub
          

          View job status:

          $ condor_q myusername
          

          View results in the file for all standard output, here named myjob.out:

          coates-d020.rcac.purdue.edu
          

          Executing only on one specific compute node

          Prepare a job submission file with an appropriate filename, here named myjob.sub. The ClassAd requires a specific compute node. This example also specifies the shell script myjob.sh as the executable, transfers the shell script to the compute node, and uses HTCondor's file transfer mechanism to copy input and output files only if needed (if the same file system is not available on the compute node):

          # FILENAME:  myjob.sub
          
          universe = VANILLA
          
          # Require a specific compute node.
          requirements = Machine=="miner-a500.rcac.purdue.edu"
          
          # Transfer the "executable" myjob.sh to the compute node.
          transfer_executable = TRUE
          executable          = myjob.sh
          
          # Turn on HTCondor's file transfer mechanism only when needed.
          should_transfer_files   = IF_NEEDED
          when_to_transfer_output = ON_EXIT
          
          # Standard I/O files, HTCondor log file
          output = myjob.out
          error  = myjob.err
          log    = myjob.log
          
          # queue one job
          queue
          

          Submit the job:

          $ condor_submit myjob.sub
          

          View job status:

          $ condor_q myusername
          

          View results in the file for all standard output, here named myjob.out:

          miner-a500.rcac.purdue.edu
          

          Executing on any compute node of a cluster except one

          When you discover that a compute node is consistently available and consistently fails to run your job, you may exclude that node from the set of candidate nodes.

          Prepare a job submission file with an appropriate filename, here named myjob.sub. The ClassAd excludes one specific compute node of a chosen cluster. This example also specifies the shell script myjob.sh as the executable, transfers the shell script to the compute node, and uses HTCondor's file transfer mechanism to copy input and output files only if needed (if the same file system is not available on the compute node):

          # FILENAME:  myjob.sub
          
          universe = VANILLA
          
          # Exclude a specific compute node.
          requirements = ClusterName=="Miner" && Machine!="miner-a500.rcac.purdue.edu"
          
          # Transfer the "executable" myjob.sh to the compute node.
          transfer_executable = TRUE
          executable          = myjob.sh
          
          # Turn on HTCondor's file transfer mechanism only when needed.
          should_transfer_files   = IF_NEEDED
          when_to_transfer_output = ON_EXIT
          
          # Standard I/O files, HTCondor log file
          output = myjob.out
          error  = myjob.err
          log    = myjob.log
          
          # queue one job
          queue
          

          Submit the job:

          $ condor_submit myjob.sub
          

          View job status:

          $ condor_q myusername
          

          View results in the file for all standard output, here named myjob.out:

          miner-a502.rcac.purdue.edu
          

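          If several similarly named compute nodes must be excluded (or selected) at once, the ClassAd function regexp() can express the whole set in a single requirements line. This is a sketch with a hypothetical pattern, not part of the example above:

          # Exclude every node of the cluster whose name matches the pattern.
          requirements = ClusterName=="Miner" && !regexp("^miner-a50", Machine)
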
          For more information about requirements and rank, see the HTCondor manual.

        • 7.1.14.15  DAGMan - Linear DAG

          7.1.14.15  DAGMan - Linear DAG

          HTCondor schedules individual programs to run on unused compute servers, but it does not schedule a sequence of programs; HTCondor does not handle dependencies. Instead, the Directed Acyclic Graph Manager (DAGMan), a meta-scheduler which can handle dependencies, submits programs to HTCondor in a sequence specified by a directed acyclic graph (DAG). A DAG can represent a sequence of computations. Nodes (vertices) of the DAG represent executable programs; edges (arcs) identify the dependencies between programs.

          This example is a linear DAG which represents three ordered executions named "A", "B", and "C". Program A must finish before program B may begin; B must finish before C may begin.

          Diagram of Linear DAG

          The DAG submission file, myprogram.dag, describes the DAG and specifies the job submission files which control, under HTCondor, the execution of the individual program at each node of the DAG:

          # FILENAME:  myprogram.dag
          
          # Specify the nodes (job submission files) of a DAG.
          JOB A myprogram.A.sub
          JOB B myprogram.B.sub
          JOB C myprogram.C.sub
          
          # Specify command-line arguments as macro definitions.
          VARS A nodename="A"
          VARS B nodename="B"
          VARS C nodename="C"
          
          # Specify the edges (dependencies, order of execution) of a DAG.
          PARENT A CHILD B
          PARENT B CHILD C
          

          View the job submission file, myprogram.A.sub, for the first node of the DAG:

          universe   = VANILLA
          executable = myprogram
          arguments  = $(Cluster) $(Process) $(nodename)
          output     = myprogram.A.out
          error      = myprogram.A.err
          log        = myprogram.log
          queue
          

          View the job submission file, myprogram.B.sub, for the second node of the DAG:

          universe   = VANILLA
          executable = myprogram
          arguments  = $(Cluster) $(Process) $(nodename)
          output     = myprogram.B.out
          error      = myprogram.B.err
          log        = myprogram.log
          queue
          

          View the job submission file, myprogram.C.sub, for the third node of the DAG:

          universe   = VANILLA
          executable = myprogram
          arguments  = $(Cluster) $(Process) $(nodename)
          output     = myprogram.C.out
          error      = myprogram.C.err
          log        = myprogram.log
          queue
          

          While DAGMan can execute a different program for each node of the DAG, this example uses a single executable file to remain simple. The short program, myprogram.c, displays the command-line arguments which originate in the DAG submission file and are forwarded to the job submission files. To compile this program for the Vanilla Universe, see Compiling Serial Programs.

          To submit the DAG to HTCondor:

          $ condor_submit_dag -force myprogram.dag
          

          The argument -force requires HTCondor to overwrite the files that it produces. This is a useful convenience when the DAG's output files already exist from a preceding run and you no longer need that earlier output. DAGMan appends to, rather than overwrites, the file dagman.out.

          Command condor_rm is able to remove a DAG from the job queue.
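
          Removing the condor_dagman job itself removes the DAG's node jobs along with it. As a sketch, using the placeholder myjobid for the DAGMan job's cluster number that condor_q displays:

          $ condor_rm myjobid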

          Command condor_q shows the sequence of execution:

          $ condor_q myusername
          -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:38552> : condor.rcac.purdue.edu
           ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
          746893.0 myusername       11/9  10:00   0+00:00:08 R  0   7.3  condor_dagman
          
          $ condor_q myusername
          -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:38552> : condor.rcac.purdue.edu
           ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
          746893.0 myusername       11/9  10:00   0+00:01:42 R  0   7.3  condor_dagman
          746894.0 myusername       11/9  10:00   0+00:00:00 I  0   0.0  myprogram 746894 0 A
          
          $ condor_q myusername
          -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:38552> : condor.rcac.purdue.edu
           ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
          746893.0 myusername       11/9  10:00   0+00:18:48 R  0   7.3  condor_dagman
          746897.0 myusername       11/9  10:15   0+00:00:00 I  0   0.0  myprogram 746897 0 B
          
          $ condor_q myusername
          -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:38552> : condor.rcac.purdue.edu
           ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
          746893.0 myusername       11/9  10:00   0+00:21:28 R  0   7.3  condor_dagman
          746900.0 myusername       11/9  10:21   0+00:00:00 I  0   0.0  myprogram 746900 0 C
          

          This report shows that DAGMan has its own cluster number. Each node of a DAG has its own cluster number. These cluster numbers are not necessarily in sequence since other users are submitting jobs to HTCondor.
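
          To list the node jobs grouped beneath their DAGMan job, recent HTCondor versions of condor_q also accept the -dag option; a sketch of its use:

          $ condor_q -dag myusername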

          View the output file of the first node of the DAG, myprogram.A.out:

          ***  MAIN START  ***
          
          program name:     condor_exec.exe
          cluster number:   746894
          process number:   0
          node name:        A
          
          ***  MAIN  STOP  ***
          

          View the output file of the second node of the DAG, myprogram.B.out:

          ***  MAIN START  ***
          
          program name:     condor_exec.exe
          cluster number:   746897
          process number:   0
          node name:        B
          
          ***  MAIN  STOP  ***
          

          View the output file of the third node of the DAG, myprogram.C.out:

          ***  MAIN START  ***
          
          program name:     condor_exec.exe
          cluster number:   746900
          process number:   0
          node name:        C
          
          ***  MAIN  STOP  ***
          

          Each execution of the single program sees a unique node name: A, B, C.

          The common log file records the execution of the three nodes of the DAG, myprogram.log:

          000 (746894.000.000) 11/09 10:00:37 Job submitted from host: <128.211.157.86:38552>
              DAG Node: A
          ...
          001 (746894.000.000) 11/09 10:15:09 Job executing on host: <128.211.157.10:59600>
          ...
          005 (746894.000.000) 11/09 10:15:09 Job terminated.
              (1) Normal termination (return value 0)
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
              0  -  Run Bytes Sent By Job
              0  -  Run Bytes Received By Job
              0  -  Total Bytes Sent By Job
              0  -  Total Bytes Received By Job
          ...
          000 (746897.000.000) 11/09 10:15:18 Job submitted from host: <128.211.157.86:38552>
              DAG Node: B
          ...
          001 (746897.000.000) 11/09 10:21:24 Job executing on host: <128.211.157.10:52773>
          ...
          005 (746897.000.000) 11/09 10:21:24 Job terminated.
              (1) Normal termination (return value 0)
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
              0  -  Run Bytes Sent By Job
              0  -  Run Bytes Received By Job
              0  -  Total Bytes Sent By Job
              0  -  Total Bytes Received By Job
          ...
          000 (746900.000.000) 11/09 10:21:36 Job submitted from host: <128.211.157.86:38552>
              DAG Node: C
          ...
          001 (746900.000.000) 11/09 10:26:06 Job executing on host: <128.211.157.10:59600>
          ...
          005 (746900.000.000) 11/09 10:26:06 Job terminated.
              (1) Normal termination (return value 0)
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
              0  -  Run Bytes Sent By Job
              0  -  Run Bytes Received By Job
              0  -  Total Bytes Sent By Job
              0  -  Total Bytes Received By Job
          ...
          

          For more information about DAGMan, see the HTCondor manual.

        • 7.1.14.16  DAGMan - Parameter Sweep

          7.1.14.16  DAGMan - Parameter Sweep

          A linear DAG may include a parameter sweep. The following diagram illustrates a three-step linear DAG with the middle process being a parameter sweep which applies a single computer program to unique data sets. The first and third steps might perform data preparation and collation, respectively:

          Diagram of Parameter Sweep DAG

          This example is a linear DAG which represents three ordered executions named "A", "B", and "C". Program A must finish before any run of program B used in the parameter sweep may begin; all runs of program B must finish before C may begin.

          The DAG submission file, myprogram.dag, describes the DAG and specifies the job submission files which control, under HTCondor, the execution of the individual program at each node of the DAG. Notice that this DAG submission file is identical to a linear DAG submission file:

          # FILENAME:  myprogram.dag
          
          # Specify the nodes (job submission files) of a DAG.
          JOB A myprogram.A.sub
          JOB B myprogram.B.sub
          JOB C myprogram.C.sub
          
          # Specify command-line arguments as macro definitions.
          VARS A nodename="A"
          VARS B nodename="B"
          VARS C nodename="C"
          
          # Specify the edges (dependencies, order of execution) of a DAG.
          PARENT A CHILD B
          PARENT B CHILD C
          

          View the job submission file, myprogram.A.sub, for the first node of the DAG:

          universe   = VANILLA
          executable = myprogram
          arguments  = $(Cluster) $(Process) $(nodename)
          output     = myprogram.A.out
          error      = myprogram.A.err
          log        = myprogram.log
          queue
          

          View the job submission file, myprogram.B.sub, for the second node, the parameter sweep, of the DAG. Command queue submits three copies of myprogram to HTCondor:

          universe   = VANILLA
          executable = myprogram
          arguments  = $(Cluster) $(Process) $(nodename)
          output     = myprogram.B.out.$(Process)
          error      = myprogram.B.err.$(Process)
          log        = myprogram.log
          queue 3
          

          View the job submission file, myprogram.C.sub, for the third node of the DAG:

          universe   = VANILLA
          executable = myprogram
          arguments  = $(Cluster) $(Process) $(nodename)
          output     = myprogram.C.out
          error      = myprogram.C.err
          log        = myprogram.log
          queue
          

          While DAGMan can execute a different program for each node of the DAG, this example uses a single executable file to remain simple. The short program, myprogram.c, displays the command-line arguments which originate in the DAG submission file and are forwarded to the job submission files. To compile this program for the Vanilla Universe, see Compiling Serial Programs.

          To submit the DAG to HTCondor:

          $ condor_submit_dag -force myprogram.dag
          

          The argument -force requires HTCondor to overwrite the files that it produces. This is a useful convenience when the DAG's output files already exist from a preceding run and you no longer need the earlier output. DAGMan appends to, rather than overwrites, the file dagman.out.

          Command condor_rm is able to remove a DAG from the job queue.

          Three timely submissions of condor_q caught the three steps of the DAG, including the parameter sweep of the middle step:

          $ condor_q myusername
          -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:58916> : condor.rcac.purdue.edu
           ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
          746911.0 myusername       11/10 08:30   0+00:00:19 R  0   7.3  condor_dagman
          746912.0 myusername       11/10 08:30   0+00:00:00 I  0   0.0  myprogram 746912 0 A
          
          $ condor_q myusername
          -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:58916> : condor.rcac.purdue.edu
           ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
          746911.0 myusername       11/10 08:30   0+00:02:25 R  0   7.3  condor_dagman
          746913.0 myusername       11/10 08:32   0+00:00:00 I  0   0.0  myprogram 746913 0 B
          746913.1 myusername       11/10 08:32   0+00:00:00 I  0   0.0  myprogram 746913 1 B
          746913.2 myusername       11/10 08:32   0+00:00:00 I  0   0.0  myprogram 746913 2 B
          
          $ condor_q myusername
          -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:58916> : condor.rcac.purdue.edu
           ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
          746911.0 myusername       11/10 08:30   0+00:14:55 R  0   7.3  condor_dagman
          746914.0 myusername       11/10 08:41   0+00:00:00 I  0   0.0  myprogram 746914 0 C
          

          This report shows that DAGMan has its own cluster number. Each node of a DAG has its own cluster number. These cluster numbers are not necessarily in sequence since other users are submitting jobs to HTCondor. In addition, each process in the parameter sweep has its own process number, and they are in sequence.

          View the output file of the first node of the DAG, myprogram.A.out:

          ***  MAIN START  ***
          
          program name:   condor_exec.exe
          cluster number: 746912
          process number: 0
          node name:      A
          
          ***  MAIN  STOP  ***
          

          View the three output files of the three processes of the parameter sweep that is the second node of the DAG, myprogram.B.out.$(Process):

          ***  MAIN START  ***
          
          program name:   condor_exec.exe
          cluster number: 746913
          process number: 0
          node name:      B
          
          ***  MAIN  STOP  ***
          
          ***  MAIN START  ***
          
          program name:   condor_exec.exe
          cluster number: 746913
          process number: 1
          node name:      B
          
          ***  MAIN  STOP  ***
          
          ***  MAIN START  ***
          
          program name:   condor_exec.exe
          cluster number: 746913
          process number: 2
          node name:      B
          
          ***  MAIN  STOP  ***
          

          View the output file of the third node of the DAG, myprogram.C.out:

          ***  MAIN START  ***
          
          program name:   condor_exec.exe
          cluster number: 746914
          process number: 0
          node name:      C
          
          ***  MAIN  STOP  ***
          

          Each execution of the single program sees a unique node name: A, B, C. In the parameter sweep, all runs of the single program see the same node name, B; however, each copy sees a unique process number.
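
          The same $(Process) macro that numbers the output files can also give each copy of the sweep its own input data. The fragment below is a sketch only; the input files mydata.0, mydata.1, and mydata.2 are hypothetical and not part of this example:

          # Hypothetical fragment for myprogram.B.sub: one standard-input file per sweep process.
          input  = mydata.$(Process)
          output = myprogram.B.out.$(Process)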

          The common log file records the execution of the three nodes of the DAG, myprogram.log:

          000 (746912.000.000) 11/10 08:30:43 Job submitted from host: <128.211.157.86:58916>
              DAG Node: A
          ...
          001 (746912.000.000) 11/10 08:32:36 Job executing on host: <128.211.157.10:37230>
          ...
          005 (746912.000.000) 11/10 08:32:36 Job terminated.
              (1) Normal termination (return value 0)
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
              0  -  Run Bytes Sent By Job
              0  -  Run Bytes Received By Job
              0  -  Total Bytes Sent By Job
              0  -  Total Bytes Received By Job
          ...
          000 (746913.000.000) 11/10 08:32:44 Job submitted from host: <128.211.157.86:58916>
              DAG Node: B
          ...
          000 (746913.001.000) 11/10 08:32:44 Job submitted from host: <128.211.157.86:58916>
              DAG Node: B
          ...
          000 (746913.002.000) 11/10 08:32:44 Job submitted from host: <128.211.157.86:58916>
              DAG Node: B
          ...
          001 (746913.000.000) 11/10 08:41:12 Job executing on host: <128.211.157.10:34460>
          ...
          001 (746913.001.000) 11/10 08:41:12 Job executing on host: <128.211.157.10:36048>
          ...
          005 (746913.000.000) 11/10 08:41:12 Job terminated.
              (1) Normal termination (return value 0)
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
              0  -  Run Bytes Sent By Job
              0  -  Run Bytes Received By Job
              0  -  Total Bytes Sent By Job
              0  -  Total Bytes Received By Job
          ...
          005 (746913.001.000) 11/10 08:41:12 Job terminated.
              (1) Normal termination (return value 0)
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
              0  -  Run Bytes Sent By Job
              0  -  Run Bytes Received By Job
              0  -  Total Bytes Sent By Job
              0  -  Total Bytes Received By Job
          ...
          001 (746913.002.000) 11/10 08:41:12 Job executing on host: <128.211.157.10:34460>
          ...
          005 (746913.002.000) 11/10 08:41:13 Job terminated.
              (1) Normal termination (return value 0)
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
              0  -  Run Bytes Sent By Job
              0  -  Run Bytes Received By Job
              0  -  Total Bytes Sent By Job
              0  -  Total Bytes Received By Job
          ...
          000 (746914.000.000) 11/10 08:41:26 Job submitted from host: <128.211.157.86:58916>
              DAG Node: C
          ...
          001 (746914.000.000) 11/10 08:48:03 Job executing on host: <128.211.157.10:40848>
          ...
          005 (746914.000.000) 11/10 08:48:04 Job terminated.
              (1) Normal termination (return value 0)
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
              0  -  Run Bytes Sent By Job
              0  -  Run Bytes Received By Job
              0  -  Total Bytes Sent By Job
              0  -  Total Bytes Received By Job
          ...
          

          For more information about DAGMan, see the HTCondor manual.

        • 7.1.14.17  DAGMan - Another Parameter Sweep

          7.1.14.17  DAGMan - Another Parameter Sweep

          Every node of a DAG may be a parameter sweep. This means that each run of the entire DAG can process a unique set of input data; it is the logical extension of using a single program in a parameter sweep. The disadvantage of this method is the interdependence among the parallel runs of the DAG.

          This example is a linear DAG which represents three ordered executions named "A", "B", and "C". This DAG runs as a parameter sweep. The interdependence among the runs of this DAG means that all runs of program associated with node A must finish before any run of program associated with node B may begin; all runs of the program associated with node B must finish before any run of the program associated with node C may begin. If one of the runs of DAG Node A experiences a delay because the executable file landed on a slow compute node, then all runs of the parameter sweep wait, not just the run which experiences the delay.

          Diagram of Parameter Sweep DAG

          The DAG submission file, myprogram.dag, describes the DAG and specifies the job submission files which control, under HTCondor, the execution of the individual program at each node of the DAG. Notice that this DAG submission file is identical to a linear DAG submission file:

          # FILENAME:  myprogram.dag
          
          # Specify the nodes (job submission files) of a DAG.
          JOB A myprogram.A.sub
          JOB B myprogram.B.sub
          JOB C myprogram.C.sub
          
          # Specify command-line arguments as macro definitions.
          VARS A nodename="A"
          VARS B nodename="B"
          VARS C nodename="C"
          
          # Specify the edges (dependencies, order of execution) of a DAG.
          PARENT A CHILD B
          PARENT B CHILD C
          

          View the job submission file, myprogram.A.sub, for the first node of the DAG:

          universe     = VANILLA
          executable   = myprogram
          arguments    = $(Cluster) $(Process) $(nodename)
          output       = myprogram.A.out.$(Process)
          error        = myprogram.A.err.$(Process)
          log          = myprogram.log
          queue 3      # queue 3 runs
          

          View the job submission file, myprogram.B.sub, for the second node of the DAG:

          universe     = VANILLA
          executable   = myprogram
          arguments    = $(Cluster) $(Process) $(nodename)
          output       = myprogram.B.out.$(Process)
          error        = myprogram.B.err.$(Process)
          log          = myprogram.log
          queue 3      # queue 3 runs
          

          View the job submission file, myprogram.C.sub, for the third node of the DAG:

          universe     = VANILLA
          executable   = myprogram
          arguments    = $(Cluster) $(Process) $(nodename)
          output       = myprogram.C.out.$(Process)
          error        = myprogram.C.err.$(Process)
          log          = myprogram.log
          queue 3      # queue 3 runs
          

          For each node of the DAG, command queue submits three copies of myprogram to HTCondor.

          While DAGMan can execute a different program for each node of the DAG, this example uses a single executable file to remain simple. The short program, myprogram.c, displays the command-line arguments which originate in the DAG submission file and are forwarded to the job submission files. To compile this program for the Vanilla Universe, see Compiling Serial Programs.

          To submit the DAG to HTCondor:

          $ condor_submit_dag -force myprogram.dag
          

          The argument -force requires HTCondor to overwrite the files that it produces. This is a useful convenience when the DAG's output files already exist from a preceding run and you no longer need the earlier output. DAGMan appends to, rather than overwrites, the file dagman.out.

          Command condor_rm is able to remove a DAG from the job queue.

          Three timely submissions of condor_q caught the parameter sweeps of the three steps of the DAG:

          $ condor_q myusername
          -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:58916> : condor.rcac.purdue.edu
           ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
          746924.0 myusername       11/11 14:28   0+00:00:22 R  0   7.3  condor_dagman
          746925.0 myusername       11/11 14:28   0+00:00:00 I  0   0.0  myprogram 746925 0 A
          746925.1 myusername       11/11 14:28   0+00:00:00 I  0   0.0  myprogram 746925 1 A
          746925.2 myusername       11/11 14:28   0+00:00:00 I  0   0.0  myprogram 746925 2 A
          
          $ condor_q myusername
          -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:58916> : condor.rcac.purdue.edu
           ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
          746924.0 myusername       11/11 14:28   0+00:04:51 R  0   7.3  condor_dagman
          746926.0 myusername       11/11 14:32   0+00:00:00 I  0   0.0  myprogram 746926 0 B
          746926.1 myusername       11/11 14:32   0+00:00:00 I  0   0.0  myprogram 746926 1 B
          746926.2 myusername       11/11 14:32   0+00:00:00 I  0   0.0  myprogram 746926 2 B
          
          $ condor_q myusername
          -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:58916> : condor.rcac.purdue.edu
           ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
          746924.0 myusername       11/11 14:28   0+00:09:55 R  0   7.3  condor_dagman
          746927.0 myusername       11/11 14:37   0+00:00:00 I  0   0.0  myprogram 746927 0 C
          746927.1 myusername       11/11 14:37   0+00:00:00 I  0   0.0  myprogram 746927 1 C
          746927.2 myusername       11/11 14:37   0+00:00:00 I  0   0.0  myprogram 746927 2 C
          

          This report shows that DAGMan has its own cluster number. Each node of the DAG has its own cluster number. These cluster numbers are not necessarily in sequence since other users are submitting jobs to HTCondor. In addition, since each node is a parameter sweep, each process in the parameter sweep has its own process number, and they are in sequence.

          View the three output files of the zero-th run of the parameter sweep of the DAG: myprogram.A.out.0, myprogram.B.out.0, and myprogram.C.out.0:

          ***  MAIN START  ***
          
          program name:   condor_exec.exe
          cluster number: 746925
          process number: 0
          node name:      A
          
          ***  MAIN  STOP  ***
          
          ***  MAIN START  ***
          
          program name:   condor_exec.exe
          cluster number: 746926
          process number: 0
          node name:      B
          
          ***  MAIN  STOP  ***
          
          ***  MAIN START  ***
          
          program name:   condor_exec.exe
          cluster number: 746927
          process number: 0
          node name:      C
          
          ***  MAIN  STOP  ***
          

          Similar sets of output files exist for the other two runs of the parameter sweep. Each execution of the single program sees a unique pair of node name (A, B, C) and process number (0, 1, 2).

          The common log file, myprogram.log, records the execution of the three runs of the parameter sweep. In particular, it shows that all runs of node B start only after all runs of node A reach completion.

          For more information about DAGMan, see the HTCondor manual.

        • 7.1.14.18  DAGMan - Multiple, Independent DAGs

          7.1.14.18  DAGMan - Multiple, Independent DAGs

          A single use of condor_submit_dag may execute several independent DAGs. Each independent DAG has its own DAG submission file. The names of these DAG submission files appear as command-line arguments of condor_submit_dag, as in the following:

          condor_submit_dag -force mydagsubmissionfile1 mydagsubmissionfile2 ... mydagsubmissionfileN
          

          This example is two independent linear DAGs: one represents three ordered executions named "A", "B", and "C"; the other represents two ordered executions named "D" and "E". While each sequence must execute in the order specified by its respective DAG, there is no dependency between the two sequences; they are independent. In other words, the execution of step E depends only on the completion of step D, not on the completion of steps A, B, or C.

          Diagram of Multiple Independent DAGs

          Here are the two independent DAG submission files, myprogram.dag.1 and myprogram.dag.2:

          # FILENAME:  myprogram.dag.1
          
          # Specify the nodes (job submission files) of a DAG.
          JOB A myprogram.dag1.A.sub
          JOB B myprogram.dag1.B.sub
          JOB C myprogram.dag1.C.sub
          
          # Specify command-line arguments as macro definitions.
          VARS A nodename="A"
          VARS B nodename="B"
          VARS C nodename="C"
          
          # Specify the edges (dependencies, order of execution) of a DAG.
          PARENT A CHILD B
          PARENT B CHILD C
          

          # FILENAME:  myprogram.dag.2
          
          # Specify the nodes (job submission files) of a DAG.
          JOB D myprogram.dag2.D.sub
          JOB E myprogram.dag2.E.sub
          
          # Specify command-line arguments as macro definitions.
          VARS D nodename="D"
          VARS E nodename="E"
          
          # Specify the edges (dependencies, order of execution) of a DAG.
          PARENT D CHILD E
          

          View the three job submission files of DAG 1:

          # FILENAME:  myprogram.dag1.A.sub
          universe   = VANILLA
          executable = myprogram
          arguments  = $(Cluster) $(Process) $(nodename)
          output     = myprogram.dag1.A.out
          error      = myprogram.dag1.A.err
          log        = myprogram.dag1.log
          queue
          

          # FILENAME:  myprogram.dag1.B.sub
          universe   = VANILLA
          executable = myprogram
          arguments  = $(Cluster) $(Process) $(nodename)
          output     = myprogram.dag1.B.out
          error      = myprogram.dag1.B.err
          log        = myprogram.dag1.log
          queue
          

          # FILENAME:  myprogram.dag1.C.sub
          universe   = VANILLA
          executable = myprogram
          arguments  = $(Cluster) $(Process) $(nodename)
          output     = myprogram.dag1.C.out
          error      = myprogram.dag1.C.err
          log        = myprogram.dag1.log
          queue
          

          View the two job submission files of DAG 2:

          # FILENAME:  myprogram.dag2.D.sub
          universe   = VANILLA
          executable = myprogram
          arguments  = $(Cluster) $(Process) $(nodename)
          output     = myprogram.dag2.D.out
          error      = myprogram.dag2.D.err
          log        = myprogram.dag2.log
          queue
          

          # FILENAME:  myprogram.dag2.E.sub
          universe   = VANILLA
          executable = myprogram
          arguments  = $(Cluster) $(Process) $(nodename)
          output     = myprogram.dag2.E.out
          error      = myprogram.dag2.E.err
          log        = myprogram.dag2.log
          queue
          

          While DAGMan can execute a different program for each node of the DAG, this example uses a single executable file to remain simple. The short program, myprogram.c, displays the command-line arguments which originate in the DAG submission file and are forwarded to the job submission files. To compile this program for the Vanilla Universe, see Compiling Serial Programs.

          To submit the independent DAGs to HTCondor:

          $ condor_submit_dag -force myprogram.dag.1 myprogram.dag.2
          

          The argument -force requires HTCondor to overwrite the files that it produces. This is a useful convenience when the DAG's output files already exist from a preceding run and you no longer need the earlier output. DAGMan appends to, rather than overwrites, the file dagman.out.

          Command condor_rm is able to remove a DAG from the job queue.

          Command condor_q shows the start of the two independent DAGs:

          $ condor_q myusername
          -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:38552> : condor.rcac.purdue.edu
           ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
          746918.0 myusername       11/10 11:06   0+00:00:32 R  0   7.3  condor_dagman
          746919.0 myusername       11/10 11:06   0+00:00:00 I  0   0.0  myprogram.dag1 74691
          746920.0 myusername       11/10 11:06   0+00:00:00 I  0   0.0  myprogram.dag2 74692
          

          This report shows that DAGMan has its own cluster number. Each independent DAG has its own set of cluster numbers. These cluster numbers are not necessarily in sequence since other users are submitting jobs to HTCondor.

          View the output file of the first node of DAG 1, myprogram.dag1.A.out:

          ***  MAIN START  ***
          
          program name:     condor_exec.exe
          cluster number:   746919
          process number:   0
          node name:        A
          
          ***  MAIN  STOP  ***
          

          Similarly named output files exist for the other four nodes.

          This example ran with each independent DAG having its own log file. Here is the log file for DAG 2, myprogram.dag2.log:

          000 (746920.000.000) 11/10 11:06:17 Job submitted from host: <128.211.157.86:58916>
              DAG Node: 1.D
          ...
          001 (746920.000.000) 11/10 11:12:00 Job executing on host: <128.211.157.10:42201>
          ...
          005 (746920.000.000) 11/10 11:12:00 Job terminated.
              (1) Normal termination (return value 0)
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
              0  -  Run Bytes Sent By Job
              0  -  Run Bytes Received By Job
              0  -  Total Bytes Sent By Job
              0  -  Total Bytes Received By Job
          ...
          000 (746922.000.000) 11/10 11:12:08 Job submitted from host: <128.211.157.86:58916>
              DAG Node: 1.E
          ...
          001 (746922.000.000) 11/10 11:18:38 Job executing on host: <128.211.157.10:49358>
          ...
          005 (746922.000.000) 11/10 11:18:38 Job terminated.
              (1) Normal termination (return value 0)
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
              0  -  Run Bytes Sent By Job
              0  -  Run Bytes Received By Job
              0  -  Total Bytes Sent By Job
              0  -  Total Bytes Received By Job
          ...
          

          The text "DAG Node: 1.D" refers to step D of the second independent DAG listed as a command-line argument of condor_submit_dag.

          Finally, this example could be reshaped into a parameter sweep, but the need to list the names of separate DAG submission files as command-line arguments of condor_submit_dag is very inconvenient for large sweeps.
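
          If you do take this route for a larger sweep, the shell can build the list of file names for you. A sketch, assuming fifty DAG submission files named myprogram.dag.1 through myprogram.dag.50 and a bash shell (brace expansion):

          $ condor_submit_dag -force myprogram.dag.{1..50}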

          For more information about DAGMan, see the HTCondor manual.

        • 7.1.14.19  DAGMan - Pre and Post Scripts

          7.1.14.19  DAGMan - Pre and Post Scripts

          The HTCondor keyword SCRIPT specifies optional processing that occurs either before a job within a DAG starts its execution or after a job within a DAG completes its execution. A PRE script performs processing before a job starts its execution under HTCondor; a POST script performs processing after a job completes its execution under HTCondor. A node in the DAG includes the job together with PRE and/or POST scripts. These scripts run on the submission host, not on a compute node.

          A common use of a PRE script places files in a staging area for a cluster of jobs to use; a common use of a POST script cleans up or removes files once that cluster of jobs reaches completion. An example might use a PRE script to transfer needed files from long-term storage; the corresponding POST script might return the processed files to long-term storage. In another example about staging files, a PRE script might archive a subdirectory structure of files in preparation for transferring that archive as a single input file to the compute node, while the POST script might extract output files from the archive which HTCondor transferred from the compute node to the submission host after job completion.
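
          As a sketch of the staging idea just described (both scripts are hypothetical and simply reuse the archive names from the earlier subdirectory example), a PRE script might build the input archive on the submission host and the corresponding POST script might unpack the returned output archive:

          #!/bin/sh
          # FILENAME:  myprogram_pre.scr   (hypothetical PRE script; runs on the submission host)
          tar cf myarchive.i.tar mysubdirectory

          #!/bin/sh
          # FILENAME:  myprogram_pst.scr   (hypothetical POST script; runs on the submission host)
          tar xf myarchive.o.tar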

          The following flowchart illustrates a DAG with PRE and POST scripts:

          Diagram of Pre/Post-Processing DAG

          The DAG submission file, myprogram.dag, describes the DAG and specifies the job submission files which control, under HTCondor, the execution of the individual program at each node of the DAG. It also specifies the PRE and POST scripts:

          # FILENAME:  myprogram.dag
          
          # Specify the nodes (job submission files) of a DAG.
          JOB A myprogram.A.sub
          JOB B myprogram.B.sub
          
          # Specify PRE and POST scripts.
          SCRIPT PRE  A myprogram_preA.scr
          SCRIPT POST A myprogram_pstA.scr
          SCRIPT PRE  B myprogram_preB.scr
          SCRIPT POST B myprogram_pstB.scr
          
          # Specify command-line arguments as macro definitions.
          VARS A nodename="A"
          VARS B nodename="B"
          
          # Specify the edges (dependencies, order of execution) of a DAG.
          PARENT A CHILD B
          

          View the job submission file, myprogram.A.sub, for the first node of the DAG:

          universe   = VANILLA
          executable = myprogram
          arguments  = $(Cluster) $(Process) $(nodename)
          output     = myprogram.A.out
          error      = myprogram.A.err
          log        = myprogram.log
          queue
          

          View the job submission file, myprogram.B.sub, for the second node of the DAG:

          universe   = VANILLA
          executable = myprogram
          arguments  = $(Cluster) $(Process) $(nodename)
          output     = myprogram.B.out
          error      = myprogram.B.err
          log        = myprogram.log
          queue
          

          The four PRE and POST scripts write a short message to a common output file:

          #!/bin/sh
          # FILENAME:  myprogram_preA.scr
          echo "before node A" >>myprogram.lst
          /bin/hostname >>myprogram.lst
          

          #!/bin/sh
          # FILENAME:  myprogram_pstA.scr
          echo "after node A" >>myprogram.lst
          /bin/hostname >>myprogram.lst
          

          #!/bin/sh
          # FILENAME:  myprogram_preB.scr
          echo "before node B" >>myprogram.lst
          /bin/hostname >>myprogram.lst
          

          #!/bin/sh
          # FILENAME:  myprogram_pstB.scr
          echo "after node B" >>myprogram.lst
          /bin/hostname >>myprogram.lst
          

          While DAGMan can execute a different program for each node of the DAG, this example uses a single executable file to remain simple. The short program, myprogram.c, displays the command-line arguments which originate in the DAG submission file and are forwarded to the job submission files. To compile this program for the Vanilla Universe, see Compiling Serial Programs.

          To submit the DAG to HTCondor:

          $ condor_submit_dag -force myprogram.dag
          

          The argument -force requires HTCondor to overwrite the files that it produces. This is a useful convenience when the DAG's output files already exist from a preceding run and you no longer need the earlier output. DAGMan appends to, rather than overwrites, the file dagman.out.

          Command condor_rm is able to remove a DAG from the job queue.

          View the output file of the first node of the DAG, myprogram.A.out:

          ***  MAIN START  ***
          
          program name:     condor_exec.exe
          cluster number:   746948
          process number:   0
          node name:        A
          
          ***  MAIN  STOP  ***
          

          View the output file of the second node of the DAG, myprogram.B.out:

          ***  MAIN START  ***
          
          program name:     condor_exec.exe
          cluster number:   746949
          process number:   0
          node name:        B
          
          ***  MAIN  STOP  ***
          

          Each execution of the single program sees a different node name: A for the first node and B for the second.

          View the common output file, myprogram.lst, of the four PRE and POST scripts. The output shows that the submission host itself executed the PRE and POST scripts:

          before node A
          condor.rcac.purdue.edu
          after node A
          condor.rcac.purdue.edu
          before node B
          condor.rcac.purdue.edu
          after node B
          condor.rcac.purdue.edu
          

          For more information about DAGMan, see the DAGMan section of the HTCondor Users' Manual.

        • 7.1.14.20  Job Priority

          7.1.14.20  Job Priority

          You may assign a priority to each of your jobs within a specific HTCondor queue (on a specific submission host). A priority value can be any integer; higher values mean higher priority. HTCondor generally attempts to assign a compute node to your highest-priority job first. However, this does not guarantee that a higher-priority job will get a compute node before a lower-priority job: an available compute node may match the requirements of a lower-priority job but not those of a higher-priority job. Even once started, a higher-priority job may not finish before lower-priority jobs, because it might have a longer run time or be preempted and restarted more often.

          Job priorities are user-specific and queue-specific; they do not affect which user's jobs run first, only the order in which your own jobs start. The default job priority is 0.

          Job priorities are useful, for example, when you have submitted many jobs with the default priority and only afterward realize that you would prefer to see the results of another job first. You may submit the new, urgent job with a higher priority so that HTCondor tries to find a compute node for it before finding compute nodes for your other jobs. This works only if you submit the new job to the same queue (on the same submission host) as your other jobs, because job priorities are queue-specific.

          For example, first submit a job to the HTCondor queue at the default priority (0), then raise its priority to 5 with condor_prio:

          $ condor_q myusername
          
          -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35407> : condor.rcac.purdue.edu
           ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
          260187.0   myusername      8/30 13:59   0+00:00:00 I  0   19.5 hello
          
          1 jobs; 1 idle, 0 running, 0 held
          
          $ condor_prio -p 5 260187.0
          $ condor_q myusername
          
          -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35407> : condor.rcac.purdue.edu
           ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
          260187.0   myusername      8/30 13:59   0+00:00:03 I  5   19.5 hello
          
          1 jobs; 1 idle, 0 running, 0 held
          
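
          You can also set a priority when you first submit a job. As a minimal sketch (the file and output names here are placeholders, not part of the example above), add the priority command to the job submission file:

          # FILENAME:  myprogram.sub  (illustrative sketch)
          universe   = VANILLA
          executable = hello
          output     = hello.out
          error      = hello.err
          log        = hello.log
          # Start this job at priority 5 instead of the default 0.
          priority   = 5
          queue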

          For more information about job priority, see the condor_prio manual page and the HTCondor Users' Manual.

      • 7.1.15  Flocking to Other Grids

        7.1.15  Flocking to Other Grids

        Although an HTCondor pool usually contains machines owned by many different people, collaborating researchers from different organizations often do not find it feasible to combine all of their computers into a single HTCondor pool. The solution is to create multiple HTCondor pools and allow flocking between them. Jobs may then flock (migrate) from one pool to another based on the availability of compute nodes. If your local HTCondor pool does not have any available machines to run your job, the job may flock to another pool. You need do nothing special to enable flocking for your jobs; it happens automatically.

        If you would like to learn more about how this works, see the Grid Computing Chapter of the HTCondor Users' Manual.