This document follows certain typesetting and naming conventions:
$ example This is an example of commands and output.
BoilerGrid is a large, high-throughput, distributed computing system operated by ITaP, and using the HTCondor system developed by the HTCondor Project at the University of Wisconsin. BoilerGrid provides a way for you to run programs on large numbers of otherwise idle computers in various locations, including any temporarily under-utilized high-performance cluster resources as well as any computer lab desktop machines not currently in use. Whenever a local user or scheduled job needs a machine back, HTCondor stops its job and sends it to another HTCondor node as soon as possible. Because this model limits the ability to do parallel processing and communications, BoilerGrid is only appropriate for relatively quick serial jobs.
If you have a desktop computer on the Purdue West Lafayette campus, please consider donating your desktop's idle time to BoilerGrid! The process is easy and allows other Purdue researchers to use otherwise wasted cycles when your computer is doing nothing. More information on joining BoilerGrid is available on the Join BoilerGrid page.
BoilerGrid scavenges cycles from nearly all ITaP research systems, including all the ITaP-maintained research clusters and specialized systems. BoilerGrid also uses idle time of machines in student labs on the Purdue West Lafayette campus. Through the larger consortium DiaGrid, BoilerGrid may also send jobs to machines at other institutions, including the University of Wisconsin, the University of Louisville, Indiana University, the University of Notre Dame, Indiana State University, the Purdue Calumet and North Central campuses, and the Indiana University – Purdue University Fort Wayne campus. Whenever the primary scheduling system on any of these machines needs a compute node back or a user sits down and starts to use a desktop computer, HTCondor will stop its job and, if possible, checkpoint its work. HTCondor then immediately tries to restart this job on some other available compute node in BoilerGrid.
A recent snapshot of BoilerGrid found 36,524 total processor cores. Of these, there were 29,111 Linux/x86_64, 98 Linux/Intel (ia32), 385 WinNT51/Intel, and 6,925 WinNT61/Intel. There are also small numbers of Itanium Linux, Solaris, and Intel OSX nodes. Memory on compute nodes ranges from 512 MB to 192 GB, and most processors run at 2 GHz or faster. With a total of over 60 TFLOPS available, BoilerGrid can provide large numbers of cycles in a short amount of time. HTCondor offers high-throughput computing and is excellent for parameter sweeps, Monte Carlo simulations, or nearly any serial application.
| Owner | Arch/OS | Processor Cores |
|---|---|---|
| ITaP - Research Computing | x86_64/Linux | 30,717 |
| ITaP - Research Computing | Intel/Linux | 29 |
| ITaP - Envision Center | Intel/Linux | 48 |
| ITaP - Teaching & Learning | Intel/WinNTXX | ~9,300 |
| Purdue Calumet | x86_64/Linux | 998 |
| Notre Dame CSE | Intel/Linux, Intel/OSX, Sun4u/Solaris210, x86_64/Linux | 1,213 |
| Purdue Biology, Libraries & some ITaP | Intel/Linux, Intel/WinNT51 | 187 |
BoilerGrid currently uses Condor 7.4.1. You can check on the overall status of BoilerGrid using CondorView.
All Purdue faculty and staff, as well as students with the approval of their advisor, may request access to BoilerGrid. However, if you have an account on Radon or any of the ITaP Community Clusters (Carter, Hansen, Rossmann, Coates, Steele, and Peregrine 1), you already have access to BoilerGrid. Refer to the Accounts / Access page for more details on how to request access.
To submit jobs on BoilerGrid, log in to the submission host condor.rcac.purdue.edu via SSH. This submission host is actually three front-end hosts: condor-fe00, condor-fe01, and condor-fe02. The login process randomly assigns one of these three front-ends to each login to condor.rcac.purdue.edu. While the three front-end hosts are identical, each has its own HTCondor queue. When you submit jobs to the HTCondor queue from the front-end named condor-fe00, you will not see those jobs on the HTCondor queue while logged in to either condor-fe01 or condor-fe02. To ensure that you always see the same HTCondor queue, log in to the same front-end.
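Because each front-end keeps its own queue, it can help to note which host accepted a submission. The following is a minimal sketch, not an ITaP-provided tool: the helper names `note_frontend` and `last_frontend` and the record file are hypothetical, and `uname -n` simply prints the name of the host you are logged in to.

```shell
# note_frontend FILE: append a timestamp and the current host name to FILE,
# so you can later tell which front-end holds the queue for a submission.
# (Helper names and the record file are illustrative, not part of HTCondor.)
note_frontend() {
    record_file=$1
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$(uname -n)" >> "$record_file"
}

# last_frontend FILE: print the host name from the most recent record.
# Each record has three fields: date, time, host name.
last_frontend() {
    tail -n 1 "$1" | awk '{ print $3 }'
}
```

Calling `note_frontend` right before each submission and `last_frontend` on the same file later shows which front-end to return to when checking the queue.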
Each front-end host has its own /tmp. Sharing data in /tmp during subsequent sessions may fail. ITaP advises using scratch storage for multisession, shared data instead.
You may also submit jobs to BoilerGrid from Radon or any of the ITaP Community Clusters (Carter, Hansen, Rossmann, Coates, Steele, and Peregrine 1). These clusters also have multiple front-end hosts.
Secure Shell or SSH is a way of establishing a secure (encrypted) connection between two computers. It uses public-key cryptography to authenticate the remote computer and (optionally) to allow the remote computer to authenticate the user. Its usual function involves logging in to a remote machine and executing commands, but it also supports tunneling and forwarding of X11 or arbitrary TCP connections. There are many SSH clients available for all operating systems.
Linux / Solaris / AIX / HP-UX / Unix:
Microsoft Windows:
Mac OS X:
SSH works with many different means of authentication. One popular authentication method is Public Key Authentication (PKA). PKA is a method of establishing your identity to a remote computer using related sets of encryption data called keys. PKA is a more secure alternative to traditional password-based authentication with which you are probably familiar.
To employ PKA via SSH, you manually generate a keypair (also called SSH keys) in the location from where you wish to initiate a connection to a remote machine. This keypair consists of two text files: private key and public key. You keep the private key file confidential on your local machine or local home directory (hence the name "private" key). You then log in to a remote machine (if possible) and append the corresponding public key text to the end of a specific file, or have a system administrator do so on your behalf. In future login attempts, PKA compares the public and private keys to verify your identity; only then do you have access to the remote machine.
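The append step can be sketched as follows. This assumes an OpenSSH server, where the file in question is `~/.ssh/authorized_keys`; the function name `install_public_key` and its arguments are hypothetical, and a given site may manage keys differently.

```shell
# install_public_key PUBKEY_FILE [SSH_DIR]
# Append one public key to SSH_DIR/authorized_keys (default: ~/.ssh) on the
# machine you want to log in to, skipping the append if the key is already
# present. OpenSSH may refuse to use these files when their permissions are
# too open, so the directory and file modes are tightened as well.
install_public_key() {
    pubkey_file=$1
    ssh_dir=${2:-$HOME/.ssh}
    auth_file=$ssh_dir/authorized_keys

    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
    touch "$auth_file" && chmod 600 "$auth_file" || return 1

    # -x: match the whole line, -F: treat the key as a fixed string.
    if ! grep -qxF -- "$(cat "$pubkey_file")" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
}
```

The keypair itself is typically created on the local machine with `ssh-keygen`, which prompts for the passphrase; only the public key file is ever copied to the remote machine.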
As a user, you can create, maintain, and employ as many keypairs as you wish. If you connect to a computational resource from your work laptop, your work desktop, and your home desktop, you can create and employ keypairs on each. You can also create multiple keypairs on a single local machine to serve different purposes, such as establishing access to different remote machines or establishing different types of access to a single remote machine. In short, PKA via SSH offers a secure but flexible means of identifying yourself to all kinds of computational resources.
Creating a keypair prompts you to provide a passphrase for the private key. This passphrase differs from a password in two ways. First, a passphrase is, as the name implies, a phrase. It can include most types of characters, including spaces, and has no limit on length. Second, the remote machine never receives this passphrase for verification. Its only purpose is to unlock your local private key, and each passphrase applies to one specific private key.
Perhaps you are wondering why you would need a private key passphrase at all when using PKA. If the private key remains secure, why the need for a passphrase just to use it? Indeed, if the location of your private keys were always completely secure, a passphrase might not be necessary. In reality, a number of situations could arise in which someone may improperly gain access to your private key files. In these situations, a passphrase offers another level of security for you, the user who created the keypair.
Think of the private key/passphrase combination as being analogous to your ATM card/PIN combination. The ATM card itself is the object that grants access to your important accounts, and as such, should remain secure at all times—just as a private key should. But if you ever lose your wallet or someone steals your ATM card, you are glad that your PIN exists to offer another level of protection. The same is true for a private key passphrase.
When you create a keypair, you should always provide a corresponding private key passphrase. For security purposes, avoid using phrases which automated programs can discover (e.g. phrases that consist solely of words in English-language dictionaries). This passphrase is not recoverable if forgotten, so make note of it. Only a few situations warrant using a non-passphrase-protected private key—conducting automated file backups is one such situation. If you need to use a non-passphrase-protected private key to conduct automated backups to Fortress, see the No-Passphrase SSH Keys section.
If you have received a default password as part of the process of obtaining your account, you should change it immediately when you log on for the first time. Change your password from any terminal/SSH session with the command passwd. You will have the same password on all ITaP systems. If you change your password on any one ITaP system, it will change on all ITaP systems.
If you already have a Purdue career account, then you will initially receive the same username and password as your career account. There is no need to change your career account password because you have received an account on ITaP systems.
There is currently no requirement regarding how often you must change your password for ITaP research systems. For security reasons, however, changing a password every six months, and preferably every three months, is good practice. Other systems on campus linked to your career account do require periodic changes.
A password should employ all of the following features:
Never share your password with another user or make your password known to anyone else. Systems staff will NEVER ask for your password, by email or otherwise.
File storage options on ITaP research systems include long-term storage (home directories, Fortress) and short-term storage (scratch directories, /tmp directory). Each option has different performance and intended uses, and some options vary from system to system as well. ITaP provides daily snapshots of home directories for a limited time for accidental deletion recovery. ITaP does not back up short-term storage and regularly purges old files from scratch and /tmp directories. More details about each storage option appear below.
ITaP provides home directories for long-term file storage. Each user ID has one home directory. You should use your home directory for storing important program files, scripts, input data sets, critical results, and frequently used files. You should store infrequently used files on Fortress. Your home directory becomes your current working directory, by default, when you log in.
ITaP provides daily snapshots of your home directory for a limited period of time in the event of accidental deletion. For additional security, you should store another copy of your files on more permanent storage, such as the Fortress HPSS Archive.
Your home directory physically resides within the Isilon storage system at Purdue. To find the path to your home directory, first log in then immediately enter the following:
$ pwd
/home/myusername
Or from any subdirectory:
$ echo $HOME
/home/myusername
Your home directory and its contents are available on all ITaP research front-end hosts and compute nodes via the Network File System (NFS).
Your home directory has a quota capping the size and/or number of files you may store within. For more information, refer to the Storage Quotas / Limits Section.
Only files which have been snap-shotted overnight are recoverable. If you lose a file the same day you created it, it is NOT recoverable.
To recover files lost from your home directory, use the flost command:
$ flost
ITaP provides scratch directories for short-term file storage only. Each file system domain has at least one scratch directory. Each user ID may access one scratch directory in a file system domain. The quota of your scratch directory is several times greater than the quota of your home directory. You should use your scratch directory for storing large temporary input files which your job reads or for writing large temporary output files which you may examine after execution of your job. You should use your home directory and Fortress for longer-term storage or for holding critical results.
Users of all ITaP research clusters have access to a scratch directory.
ITaP does not perform backups for scratch directories. In the event of a disk crash or file purge, files in scratch directories are not recoverable. You should copy any important files to more permanent storage.
ITaP automatically removes (purges) from scratch directories all files stored for more than 90 days. Owners of these files receive a notice one week before removal via email. For more information, please refer to our Scratch File Purging Policy.
To find the path to your scratch directory:
$ findscratch
The response from command findscratch depends on your submission host. You may see one of the following paths:
/scratch/scratch95/m/myusername
/scratch/scratch96/m/myusername
/scratch/lustreA/m/myusername
/scratch/miner/m/myusername
The value of variable $RCAC_SCRATCH is the path of your scratch directory. Use this variable in any scripts. Your actual scratch directory path may change without warning, but this variable will remain current.
$ echo $RCAC_SCRATCH
The response will be one of the previously listed paths.
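In a script, the variable can be checked before use so that a missing value fails clearly instead of writing to an unexpected place. A minimal sketch, assuming a hypothetical helper name `scratch_dir_for` and a one-directory-per-project layout:

```shell
# scratch_dir_for PROJECT: print (and create) a project directory under the
# scratch path held in $RCAC_SCRATCH, failing clearly when it is not set.
# (Function name and directory layout are illustrative.)
scratch_dir_for() {
    if [ -z "${RCAC_SCRATCH:-}" ]; then
        echo "RCAC_SCRATCH is not set; are you on an ITaP research system?" >&2
        return 1
    fi
    project_dir=$RCAC_SCRATCH/$1
    mkdir -p "$project_dir" && printf '%s\n' "$project_dir"
}
```

A job script could then use `workdir=$(scratch_dir_for myproject) || exit 1` and keep working unchanged if the underlying scratch path moves.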
Your scratch directory on one ITaP research resource may be shared with some ITaP research resources and distinct from others. All submission hosts on all computational resources are able to access the scratch directories of all other computational resources. However, compute nodes are only able to access the scratch directory allocated to that specific computational resource. ITaP may change which computational resources share scratch storage with other computational resources as needs dictate. For more information about which computational resources share scratch volumes, please see the Network Storage Resource Page.
All BoilerGrid jobs submitted from a submission host of an ITaP research resource have their HTCondor file system domain set so that they stay on ITaP compute nodes which can access the scratch directory of that submission host, unless you specify file transfer, which removes the need for a shared scratch directory. This ensures that jobs which do not use file transfer always run on nodes that can reach the scratch directory available where you submitted them. If you have no need of this scratch directory and want your jobs to run on systems which cannot access it, you must explicitly set the file system domain of your jobs.
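A job that carries its own files does not depend on the scratch directory of the submission host. The following is a minimal sketch, assuming a hypothetical program `myprogram` and input file `input.dat`. The commands `should_transfer_files`, `when_to_transfer_output`, and `transfer_input_files` are standard HTCondor submit commands, but their behavior should be checked against the manual for the version installed on BoilerGrid.

```shell
# Write a submit description file that asks HTCondor to copy the input to
# the execute node and copy output back when the job exits.
# (myprogram, input.dat, and all file names are placeholders.)
cat > transfer_job.sub <<'EOF'
universe                = vanilla
executable              = myprogram
transfer_input_files    = input.dat
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = myprogram.out
error                   = myprogram.err
log                     = myprogram.log
queue
EOF

# Print a reminder of the submit command when HTCondor is available here.
if command -v condor_submit >/dev/null 2>&1; then
    echo "Submit with: condor_submit transfer_job.sub"
fi
```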
To find the path to someone else's scratch directory:
$ findscratch someusername
/scratch/scratch95/s/someusername
Your scratch directory has a quota capping the size and number of files you may store in it. For more information, refer to the Storage Quotas / Limits Section.
ITaP provides /tmp directories for short-term file storage only. Each front-end and compute node has a /tmp directory. Your program may write temporary data to the /tmp directory of the compute node on which it is running. That data is available for as long as your program is active. Once your program terminates, that temporary data is no longer available. When used properly, /tmp may provide faster local storage to an active process than any other storage option. You should use your home directory and Fortress for longer-term storage or for holding critical results.
ITaP does not perform backups for the /tmp directory and removes files from /tmp whenever space is low or whenever the system needs a reboot. In the event of a disk crash or file purge, files in /tmp are not recoverable. You should copy any important files to more permanent storage.
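One way to follow this advice is to do the work in a private temporary directory and copy the results out before it disappears. A minimal sketch, with a hypothetical helper name `run_in_local_tmp`; it is not an ITaP-provided command:

```shell
# run_in_local_tmp RESULT_DIR COMMAND [ARGS...]
# Run COMMAND inside a private directory under /tmp (or $TMPDIR), then copy
# whatever it left there to RESULT_DIR and remove the temporary directory.
# (Function name is illustrative.)
run_in_local_tmp() {
    result_dir=$1
    shift
    work_dir=$(mktemp -d "${TMPDIR:-/tmp}/job.XXXXXX") || return 1

    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$work_dir" && "$@" )
    status=$?

    # Copy results to longer-lived storage, then clean up the local copy.
    mkdir -p "$result_dir" && cp -R "$work_dir"/. "$result_dir"/
    rm -rf "$work_dir"
    return $status
}
```

The function returns the exit status of the command, so a caller can still detect a failed run after the results have been copied.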
Long-term Storage or Permanent Storage is available to ITaP research users on the High Performance Storage System (HPSS), an archival storage system, commonly referred to as "Fortress". HPSS is a software package that manages a hierarchical storage system. Program files, data files and any other files which are not used often, but which must be saved, can be put in permanent storage. Fortress currently has a 1.2 PB capacity.
Files smaller than 100 MB have their primary copy stored on low-cost disks (disk cache), while the second copy (the backup of the disk cache) is on tape or optical disks. This provides a rapid restore time to the disk cache. Larger files take much longer to access, usually because they must be copied from a tape cartridge, which makes Fortress unsuitable for direct use by any processes or jobs, even where such use is possible. The primary and secondary copies of larger files are stored on separate tape cartridges in the Quantum (ADIC, Advanced Digital Information Corporation) tape library.
To ensure optimal performance for all users, and to keep the Fortress system healthy, please remember the following tips:
Fortress writes two copies of every file either to two tapes, or to disk and a tape, to protect against medium errors. Unfortunately, Fortress does not automatically switch to the alternate copy when it has trouble accessing the primary. If it seems to be taking an extraordinary amount of time to retrieve a file (hours), please either email rcac-help@purdue.edu or call ITaP Customer Service at 765-49-4400. We can then investigate why it is taking so long. If it is an error on the primary copy, we will instruct Fortress to switch to the alternate copy as the primary and recreate a new alternate copy.
For more information about Fortress, how it works, user guides, and how to obtain an account:
There are many environment variables related to storage locations and paths. Logging in automatically sets these environment variables. You may change the variables at any time.
Use environment variables instead of actual paths whenever possible to avoid problems if the specific paths to any of these change. Some of the environment variables you should have are:
| Name | Description |
|---|---|
| USER | your username |
| HOME | path to your home directory |
| PWD | path to your current directory |
| RCAC_SCRATCH | path to scratch filesystem |
| PATH | all directories searched for commands/applications |
| HOSTNAME | name of the machine you are on |
| SHELL | your current shell (bash, tcsh, csh, ksh) |
| SSH_CLIENT | your local client's IP address |
| TERM | type of terminal or terminal emulator being used |
By convention, environment variable names are all uppercase. Use them on the command line or in any scripts in place of and in combination with hard-coded values:
$ ls $HOME
...
$ ls $RCAC_SCRATCH/myproject
...
To find the value of any environment variable:
$ echo $RCAC_SCRATCH
/scratch/scratch95/m/myusername
$ echo $SHELL
/bin/tcsh
To list the values of all environment variables:
$ env
USER=myusername
HOME=/home/myusername
RCAC_SCRATCH=/scratch/scratch95/m/myusername
SHELL=/bin/tcsh
...
You may create or overwrite an environment variable. To pass (export) the value of a variable in either bash or ksh:
$ export VARIABLE=value
To assign a value to an environment variable in either tcsh or csh:
% setenv VARIABLE value
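The difference between assigning and exporting in bash or ksh shows up as soon as another program is started. A small sketch with arbitrary variable names (`LOCAL_ONLY` and `SHARED` are placeholders):

```shell
# A plain assignment stays in the current shell; an exported variable is
# also visible to programs started from it. (Variable names are arbitrary.)
LOCAL_ONLY=abc
export SHARED=xyz

# The child shell sees SHARED but not LOCAL_ONLY.
sh -c 'echo "LOCAL_ONLY=${LOCAL_ONLY:-<unset>} SHARED=${SHARED:-<unset>}"'
```

In tcsh or csh, `set` plays the role of the plain assignment and `setenv` the role of `export`.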
ITaP imposes some limits on your disk usage on research systems. Each filesystem (home directory, scratch directory, etc.) may have a different limit. ITaP does not implement a soft limit; there is only a hard limit (quota). If you exceed it, your write will fail. You can then either remove files you no longer need, move them to the Fortress HPSS Archive, or ask us about increasing your quota.
To discover the current quotas of your home and scratch directories:
$ myquota
Type        Filesystem            Size    Limit   Use   Files    Limit   Use
==============================================================================
home        extensible           5.0GB   10.0GB   50%       -        -     -
scratch     /scratch/scratch95/    8KB  476.8GB    0%       2  100,000    0%
The columns are as follows:
If you find that you reached your quota in either your home directory or your scratch file directory, obtain estimates of your disk usage. Find the top-level directories which have a high disk usage, then study the subdirectories to discover where the heaviest usage lies.
To see in a human-readable format an estimate of the disk usage of your top-level directories in your home directory:
$ du -h --max-depth=1 $HOME
32K /home/myusername/mysubdirectory_1
529M /home/myusername/mysubdirectory_2
608K /home/myusername/mysubdirectory_3
The second directory is the largest of the three, so apply command du to it.
To see in a human-readable format an estimate of the disk usage of your top-level directories in your scratch file directory:
$ du -h --max-depth=1 $RCAC_SCRATCH
160K /scratch/scratch95/m/myusername
This strategy can be very helpful in figuring out the location of your largest usage. Move unneeded files and directories to alternate long-term storage to free space in your home and scratch directories.
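The search can be made quicker by sorting the subdirectories by size. A minimal sketch, assuming a hypothetical helper name `largest_dirs`; it uses only the standard `du -s` and `du -k` options and skips hidden directories:

```shell
# largest_dirs DIR [COUNT]: list the COUNT (default 5) largest immediate
# subdirectories of DIR, biggest first, with sizes in kilobytes.
# Hidden directories (names starting with a dot) are not matched by "*".
# (Function name is illustrative.)
largest_dirs() {
    dir=$1
    count=${2:-5}
    for sub in "$dir"/*/; do
        [ -d "$sub" ] && du -sk "$sub"
    done | sort -rn | head -n "$count"
}
```

Running `largest_dirs "$HOME"` and then repeating the call on the first directory it reports narrows down where the heaviest usage lies.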
If you find you need additional disk space in your home directory, please first consider archiving and compressing old files and moving them to long-term storage on the Fortress HPSS Archive. If you are unable to do so, you may go to the BoilerBackpack Quota Management site and use the sliders there to increase the amount of space allocated to your research home directory vs. other storage options, up to a maximum of 100GB.
There are several options for archiving and compressing groups of files or directories on ITaP research systems. ITaP provides the following tools:
(extract contents of somefile.zip)
$ unzip somefile.zip
(compress file somefile.c)
$ zip somefile.zip somefile.c
(compress all files in a directory into one archive file)
$ zip -r somefile.zip somedirectory/
(compress all ".c" files in current directory into one archive file)
$ zip -r somefile.zip . -i \*.c
(extract contents of somefile.7z)
$ 7za e somefile.7z
(compress file somefile.c)
$ 7za a somefile.7z somefile.c
(compress all files in a directory into one archive file)
$ 7za a somefile.7z somedirectory/
(compress all ".c" files in current directory into one archive file)
$ 7za a somefile.7z *.c
(list contents of archive somefile.tar)
$ tar tvf somefile.tar
(extract contents of somefile.tar)
$ tar xvf somefile.tar
(extract contents of gzipped archive somefile.tar.gz)
$ tar xzvf somefile.tar.gz
(extract contents of bzip2 archive somefile.tar.bz2)
$ tar xjvf somefile.tar.bz2
(extract contents of xz archive somefile.tar.xz)
$ tar xJvf somefile.tar.xz
(archive file somefile.c)
$ tar cvf somefile.tar somefile.c
(archive all ".c" files in current directory into one archive file)
$ tar cvf somefile.tar *.c
(archive all files in a directory into one archive file)
$ tar cvf somefile.tar somedirectory/
(archive and gzip-compress all files in a directory into one archive file)
$ tar czvf somefile.tar.gz somedirectory/
(archive and bzip2-compress all files in a directory into one archive file)
$ tar cjvf somefile.tar.bz2 somedirectory/
(archive and xz-compress all files in a directory into one archive file)
$ tar cJvf somefile.tar.xz somedirectory/
(compress file somefile - also removes uncompressed file)
$ gzip somefile
(uncompress file somefile.gz - also removes compressed file)
$ gunzip somefile.gz
(compress file somefile - also removes uncompressed file)
$ bzip2 somefile
(uncompress file somefile.bz2 - also removes compressed file)
$ bunzip2 somefile.bz2
(compress file somefile - also removes uncompressed file)
$ xz somefile
(uncompress file somefile.xz - also removes compressed file)
$ unxz somefile.xz
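Before removing original files or moving an archive to long-term storage, it is worth confirming that the archive can be read back. A minimal sketch using the same `tar` options as above, with a hypothetical helper name `archive_dir`:

```shell
# archive_dir DIR: create DIR.tar.gz next to DIR, then list the archive to
# confirm it is readable. Nothing is deleted; that decision stays with you.
# (Function name is illustrative.)
archive_dir() {
    dir=${1%/}                       # drop a trailing slash, if any
    archive=$dir.tar.gz
    parent=$(dirname "$dir")
    name=$(basename "$dir")

    # -C makes tar store paths relative to the parent directory.
    tar czf "$archive" -C "$parent" "$name" || return 1
    tar tzf "$archive" >/dev/null || return 1
    printf '%s\n' "$archive"
}
```

The function prints the archive path only when both the create and the list step succeed, so a script can safely chain further steps on its output.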
Windows users can work with these same formats using some of the following software:
There are a variety of ways to transfer data to and from ITaP research systems. Which you should use depends on several factors, including the ease of use for you personally, connection speed and bandwidth, the size and number of files to be transferred. For more details on file transfer methods and applications, refer to the BoilerGrid Complete User Guide.
The compilers available on Radon and the Community Clusters (Hansen, Rossmann, Coates, Steele, and Miner) are able to compile code for HTCondor. Compilers are available for Fortran 77, Fortran 90, Fortran 95, C, and C++. The compilers can produce general-purpose and architecture-specific optimizations to improve performance. These include loop-level optimizations, inter-procedural analysis and cache optimizations. While the compilers support automatic and user-directed parallelization of Fortran, C, and C++ applications for multiprocessing execution, BoilerGrid allows only serial jobs.
To see the available compilers, choose one of the following entries:
$ module avail intel $ module avail gcc $ module avail pgi
Using statically linked libraries, regardless of the chosen HTCondor universe, is good practice, because you cannot rely on which versions of dynamic libraries are available on the machines selected to run your job. With static libraries, HTCondor sends the same libraries to all machines. On the other hand, because the HTCondor flock consists of a mix of machine architectures, your job may land on a machine so different from, or so much older than, the machine on which you built your executable that it fails to execute an instruction in the statically linked library. In a parameter sweep, this leads to the confusing situation in which some runs complete successfully while others fail. In this case, consider using the corresponding dynamic library on the selected machine, or use ClassAds either to select compute nodes known to run your job successfully or to exclude compute nodes known to fail. Even so, use static linkage if at all possible. For the Standard Universe, the condor_compile command specifies static linkage as part of its arguments to the linker and shows those arguments in the "LINKING FOR" message. For jobs destined for the Vanilla Universe, use your compiler's command-line option for selecting statically linked libraries.
You may use HTCondor to submit jobs to BoilerGrid. HTCondor performs job scheduling. Jobs may be serial only. You may use only the batch mode for developing and running your program. BoilerGrid does not offer an interactive mode to run your jobs.
HTCondor is one of several distributed computing resources ITaP provides. Like other similar resources, HTCondor provides a framework for running programs on otherwise idle computers. While this imposes serious limitations on parallel jobs and codes with large I/O or memory requirements, HTCondor can provide a large quantity of cycles for researchers who need to run hundreds of smaller jobs.
HTCondor is a specialized batch system for managing compute-intensive jobs. HTCondor provides a queuing mechanism, scheduling policy, priority scheme, and resource classifications. Users submit their jobs to HTCondor, which then puts these jobs in a queue, runs them, and reports back with the results.
In some ways, HTCondor is different from other batch systems, which usually operate only on dedicated machines or compute servers. HTCondor can both schedule jobs on dedicated machines and effectively utilize non-dedicated machines to run jobs. It runs jobs only on machines which are currently idle (no keyboard activity, no load average, no active telnet users, etc.). In this way, HTCondor effectively harnesses otherwise idle machines throughout a pool of machines.
Currently, ITaP uses HTCondor to utilize idle cycles on all ITaP research resources, including all Linux cluster nodes as well as some other servers and workstations. While ITaP uses PBS to schedule the resources of the Linux clusters, HTCondor schedules jobs on compute nodes when the nodes are not running PBS jobs. When PBS elects to run a new job on a node which is currently running HTCondor-scheduled jobs, HTCondor preempts all jobs running on that node to make room for the PBS-scheduled job. You may submit HTCondor jobs from any ITaP research system.
For more information:
Here is the simplest possible job submission file. It will queue one copy of the program hello for execution by HTCondor. HTCondor will use its default universe and the default platform, which means the job runs on a compute node with the same architecture and operating system as the submission host.
No input, output, or error commands appear in the job submission file, so stdin, stdout, and stderr all refer to /dev/null, the null device. This special file discards all data written to it but reports that the write operation succeeded, and it returns EOF to any process that reads from it. The program may produce output by explicitly opening a file and writing to it. This job writes to a log file, hello.log. This log file will contain events the job had during its lifetime inside of HTCondor, such as any possible errors. When the job finishes, its exit conditions will also be noted in the log file. HTCondor recommends a log file so that you know what happened to your jobs.
If your program only returns output to the screen (like the hello.c program below does), then you should include Output = hello.out or something like it somewhere before Queue. Otherwise you will not see the output.
If you do not explicitly choose a universe, HTCondor uses the default universe: Vanilla Universe.
####################
#
# Example 1
# Simple HTCondor job description file
#
####################
executable = hello
log        = hello.log
queue
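For a program that only prints to the screen, a variant of Example 1 with output and error files could be written from the shell as follows. This is a sketch: the file names `hello.sub`, `hello.out`, and `hello.err` are arbitrary choices, not names HTCondor requires.

```shell
# Write a variant of Example 1 that also captures screen output and errors.
# (The file names hello.sub, hello.out, and hello.err are arbitrary.)
cat > hello.sub <<'EOF'
executable = hello
output     = hello.out
error      = hello.err
log        = hello.log
queue
EOF
```

After the job finishes, anything the program printed appears in hello.out, while hello.log still holds the HTCondor events for the job.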
This example (from the HTCondor Manual), queues two copies of the program Mathematica. The first copy will run in directory run_1, and the second will run in directory run_2. For both queued copies, stdin will be file test.data, stdout will be file loop.out, and stderr will be file loop.error. This job writes two sets of files in separate directories. This is a convenient way to organize data if you have a large group of HTCondor jobs to run. The example file shows program submission of Mathematica as a Vanilla Universe job, since neither the source nor object code to program Mathematica is available for relinking to the HTCondor libraries.
HTCondor recommends using a single log file.
####################
#
# Example 2
# Demonstrate use of multiple directories for data organization
#
####################
universe   = VANILLA
executable = mathematica
input      = test.data
output     = loop.out
error      = loop.error
log        = loop.log
initialdir = run_1
queue
initialdir = run_2
queue
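With initialdir, relative input, output, and error paths are resolved inside each run directory, so the directories and their input files need to exist before submission. A minimal preparation sketch, assuming a master copy of test.data in the current directory (that location is an assumption, not something Example 2 states):

```shell
# Create run_1 and run_2, each with its own copy of test.data, because
# relative input/output/error paths are resolved inside initialdir.
# (Keeping a master test.data in the current directory is an assumption.)
for n in 1 2; do
    mkdir -p "run_$n"
    if [ -f test.data ]; then
        cp test.data "run_$n/test.data"
    else
        echo "test.data not found; add run_$n/test.data before submitting" >&2
    fi
done
```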
In this example (also from the HTCondor Manual), the job submission file queues 150 runs of program foo which you compiled and linked for Silicon Graphics workstations running IRIX 6.5. This job requires HTCondor to run the program on machines which have at least 32 megabytes of physical memory, and expresses a preference to run the program on machines with at least 64 megabytes, if such machines are available. It also advises HTCondor that it will use up to 28 megabytes of memory when running. Each of the 150 runs of the program receives its own process number, starting with process number 0. So, files stdin, stdout, and stderr will refer to in.0, out.0, and err.0 for the first run of the program; in.1, out.1, and err.1 for the second run of the program; and so forth. A log file foo.log will contain entries about when and where HTCondor runs, checkpoints, and migrates processes for the 150 queued runs of the program.
####################
#
# Example 3
# Show off some fancy features including use of pre-defined macros and logging
#
####################
executable   = foo
requirements = Memory >= 32 && OpSys == "IRIX65" && Arch == "SGI"
rank         = Memory >= 64
image_Size   = 28 Meg
error        = err.$(Process)
input        = in.$(Process)
output       = out.$(Process)
log          = foo.log
queue 150
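The per-process file naming lends itself to generating the inputs of a parameter sweep with a short loop. A minimal sketch, assuming a hypothetical helper name `make_sweep`, the placeholder program name foo, and an invented one-line input format; the platform requirements of Example 3 are left out:

```shell
# make_sweep N: create input files in.0 .. in.(N-1) and a submit description
# file that queues N runs, one per input. $(Process) is expanded by HTCondor,
# not by the shell, so the heredoc delimiter is quoted to keep it literal.
# (Function name, program name foo, and the "parameter = " line format are
# placeholders.)
make_sweep() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        printf 'parameter = %s\n' "$i" > "in.$i"
        i=$((i + 1))
    done

    cat > sweep.sub <<'EOF'
executable = foo
input      = in.$(Process)
output     = out.$(Process)
error      = err.$(Process)
log        = foo.log
EOF
    printf 'queue %s\n' "$n" >> sweep.sub
}
```

Calling `make_sweep 150` in an empty directory produces the 150 input files and a sweep.sub ending in `queue 150`, ready for condor_submit.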
Once you have a job submission file, you may submit this script to HTCondor using the condor_submit command. As described above, a job submission file contains the commands and keywords which specify the type of compute node on which you wish to run your job. HTCondor will find an available processor core and run your job there, or leave your job in a queue until one becomes available.
You may submit jobs to BoilerGrid from any BoilerGrid submission host, including all ITaP research cluster front-ends.
To submit a job submission file:
$ condor_submit myjobsubmissionfile
To check on the progress of your jobs, view the HTCondor queue on the host from which you submitted the jobs.
You must make certain that you logged in to the same submission host (…-fe00, …-fe01, …-fe02, etc.) from which you submitted your jobs, or you will not see them in the queue.
To view the status of all jobs in the HTCondor queue of your login host:
$ condor_q
To see only your own jobs, specify your own username as an argument:
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:42163> : condor.rcac.purdue.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1100900.0 myusername 2/20 15:13 0+00:00:00 I 0 0.0 Hello
1 jobs; 1 idle, 0 running, 0 held
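When you check the queue repeatedly, the summary line at the end of the listing is often all you need. The shell sketch below extracts the counts from a summary line in the format shown in this listing. The helper name queue_counts is hypothetical, and other HTCondor versions may print the summary differently, in which case the pattern would need adjusting.

```shell
#!/bin/sh
# Extract the job counts from the summary line that ends condor_q output.
# queue_counts is a hypothetical helper; it reads condor_q output on stdin
# and assumes the "N jobs; N idle, N running, N held" format shown above.
queue_counts() {
    sed -n 's/^\([0-9][0-9]*\) jobs; \([0-9][0-9]*\) idle, \([0-9][0-9]*\) running, \([0-9][0-9]*\) held.*$/jobs=\1 idle=\2 running=\3 held=\4/p'
}

# Demonstration using the summary line from the listing above.
printf '%s\n' '1 jobs; 1 idle, 0 running, 0 held' | queue_counts
# prints: jobs=1 idle=1 running=0 held=0
```

On a submission host, the equivalent call would be condor_q myusername | queue_counts.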
Secondly, you may check on the status of your jobs through their log files. In your job submission file, you can specify a log command (log = myjob.log) at any point prior to the queue command. The main events during the processing of the job will appear in this log file: submittal, execution commencement, preemption, checkpoint, eviction, and termination.
Thirdly, as soon as your job begins executing, HTCondor will start a condor_shadow process on the submission host. This shadow process is the mechanism by which the remotely executing jobs can access the environment of the submit host, such as input and output files. There is a shadow process started on the submit host for each job. However, the load on the submit host from this is usually not significant. If you notice degraded performance, you can limit the number of jobs that can run simultaneously using the MAX_JOBS_RUNNING configuration parameter. Please contact us for help with this if you notice poor performance.
To list all the compute nodes which are running your jobs:
$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'RemoteUser=="myusername@rcac.purdue.edu"'
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
ba-005.rcac.p LINUX INTEL Claimed Busy 1.000 502 0+00:24:44
ba-006.rcac.p LINUX INTEL Claimed Busy 0.990 502 0+00:20:22
ba-007.rcac.p LINUX INTEL Claimed Busy 1.000 502 0+00:23:16
ba-008.rcac.p LINUX INTEL Claimed Busy 1.000 502 0+00:30:20
...
The command condor_rm removes a job from the queue. If the job has already started running, then HTCondor kills the job and removes its queue entry. Use condor_q to get the ID of the job.
Queue of jobs before removal:
$ condor_q
Submitter: condor.rcac.purdue.edu : <128.211.157.86:35407> : condor.rcac.purdue.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
...
260076.7 nice-user 8/18 00:05 0+00:21:47 I 0 29.3 startfah.sh -oneun
260076.9 nice-user 8/18 00:05 0+01:40:44 I 0 136.7 startfah.sh -oneun
260185.0 myusername 8/30 13:01 0+00:00:00 R 0 19.5 hello
...
Remove a job:
$ condor_rm 260185.0
Job 260185.0 marked for removal
Queue of jobs after removal:
$ condor_q
Submitter: condor.rcac.purdue.edu : <128.211.157.86:35407> : condor.rcac.purdue.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
...
260076.7 nice-user 8/18 00:05 0+00:21:47 I 0 29.3 startfah.sh -oneun
260076.9 nice-user 8/18 00:05 0+01:40:44 I 0 136.7 startfah.sh -oneun
...
This section offers a quick overview of the steps involved in preparing and submitting a simple HTCondor job.
Prepare the Code
The "Hello World" program below is a simple program which displays the text "hello, world":
/* FILENAME: hello.c */
#include <stdio.h>
int main (void) {
printf("hello, world\n");
return 0;
}
The two most commonly used HTCondor Universes are Standard and Vanilla. The "Hello World" program above will run in either universe.
Vanilla Universe
Compile the "Hello World" program normally using any available compiler:
$ module load intel
$ icc -static hello.c -o hello
$ module load gcc
$ gcc -static hello.c -o hello
$ module load pgi
$ pgcc -Bstatic hello.c -o hello
Standard Universe
Relink the "Hello World" program with the HTCondor library using the condor_compile command and a compatible compiler:
$ module load gcc
$ condor_compile gcc hello.c -o hello
Prepare the Job Submission File
Your job submission file defines how to run the job via HTCondor. It specifies the executable file, the chosen universe, a file containing standard input (not used in this example), files which will receive standard output and standard error, and the HTCondor log file, as well as many other possible parameters. The queue directive specifies how many executions of the job are to occur. Usually this is just once, as here:
Vanilla Universe
# FILENAME: hello.sub
executable = hello
universe = vanilla
output = hello.out
error = hello.err
log = hello.log
queue
Standard Universe
# FILENAME: hello.sub
executable = hello
universe = standard
output = hello.out
error = hello.err
log = hello.log
queue
Submit the job:
$ condor_submit hello.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1100744.
View the status of the job in the queue:
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:56939> : condor.rcac.purdue.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1100744.0 myusername 2/17 15:36 0+00:00:00 I 0 0.0 hello
1 jobs; 1 idle, 0 running, 0 held
If necessary, remove the job from the queue:
$ condor_rm 1100744
View the Results
When the "Hello World" program completes, its output will appear in the file hello.out. The exit status of your program and various statistics about its performance, including time used and I/O performed, will appear in the log file hello.log. To view the output file:
$ less hello.out
hello, world
To view the log file:
$ less hello.log
000 (1100744.000.000) 02/17 15:36:51 Job submitted from host: <128.211.157.86:56939>
...
001 (1100744.000.000) 02/17 15:41:49 Job executing on host: <128.211.157.10:57321>
...
005 (1100744.000.000) 02/17 15:41:53 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
1018 - Run Bytes Sent By Job
5429958 - Run Bytes Received By Job
1018 - Total Bytes Sent By Job
5429958 - Total Bytes Received By Job
...
HTCondor attempts to start jobs by matching submitted jobs with available compute nodes on the basis of ClassAds. HTCondor's ClassAds are analogous to the classified advertising section of the newspaper. Both sellers and buyers advertise details about what they have to sell or want to buy. Both buyers and sellers have some requirements which absolutely must be satisfied, such as the right type of item, and some other criteria by which they will prefer certain offers over others, such as a better price. The same is true in HTCondor, but between users submitting jobs and compute nodes advertising available resources. HTCondor uses ClassAds to make the best matches between these two groups.
By default, your HTCondor jobs will seek an available compute node with the same values for the ClassAds Arch and OpSys as the host from which you submitted them. The submission process assumes that in most cases your jobs will require the same combination of chip architecture and operating system as the submission host. To remove or alter this restriction, see the examples in the "Requiring Specific Architectures or Operating Systems" section.
Some applications may require even more specific capabilities. Using ClassAds, you may specify a set of requirements so that only a subset of available compute nodes become candidates to run your job. There are many ClassAds available for you to use in your job requirements. You may also use ClassAds to indicate a preference for certain nodes over others (but not as an absolute requirement) by using the rank command. The following examples illustrate how to discover current ClassAds and how to estimate the number of compute nodes which will match job requirements based on ClassAds.
To save a detailed report of all the ClassAds of all processor cores in BoilerGrid in the file myfile:
$ condor_status -pool boilergrid.rcac.purdue.edu -long > myfile
You may use any of the ClassAds which appear in this list to view a subset of BoilerGrid. For example, to save a listing of all user ID domains or all file system domains in the file myfile:
$ condor_status -pool boilergrid.rcac.purdue.edu -format "%s\n" UidDomain > myfile
$ condor_status -pool boilergrid.rcac.purdue.edu -format "%s\n" FileSystemDomain > myfile
To list all platforms (architectures and operating systems) and the number of processor cores of each platform on BoilerGrid:
$ condor_status -pool boilergrid.rcac.purdue.edu -total
Total Owner Claimed Unclaimed Matched Preempting Backfill
INTEL/LINUX 64 13 5 46 0 0 0
INTEL/OSX 2 0 0 2 0 0 0
INTEL/WINNT51 345 29 2 314 0 0 0
INTEL/WINNT61 4683 150 13 4520 0 0 0
SUN4u/SOLARIS210 3 2 0 1 0 0 0
X86_64/LINUX 31395 22617 4734 4035 2 2 5
Total 36492 22811 4754 8918 2 2 5
HTCondor uses the name "INTEL" to indicate x86_32 (32-bit Intel-compatible) architecture.
The total number of processor cores on BoilerGrid is 36,492. The predominant platform of BoilerGrid is x86_64/Linux, with 31,395 processor cores. The values in this table are approximate because the number of cores reporting to the pool changes as compute nodes go out of service for repair.
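To see at a glance how the pool divides among platforms, you can post-process the -total report. The shell sketch below computes each platform's share of the total core count. The helper name platform_share is hypothetical; it assumes the report layout shown in this section, where each platform row begins with an ARCH/OPSYS label followed by the total.

```shell
#!/bin/sh
# Print each platform and its percentage of all cores in the report.
# platform_share is a hypothetical helper; it reads the output of
# "condor_status -total" on stdin.
platform_share() {
    awk '
        # Platform rows start with ARCH/OPSYS followed by a numeric total.
        # The header row and the final Total row contain no slash and are skipped.
        $1 ~ /\// && $2 ~ /^[0-9]+$/ { name[++n] = $1; cores[n] = $2; sum += $2 }
        END {
            if (sum == 0) exit 1
            for (i = 1; i <= n; i++)
                printf "%s %.1f%%\n", name[i], 100 * cores[i] / sum
        }
    '
}
```

A typical call would be condor_status -pool boilergrid.rcac.purdue.edu -total | platform_share. Applied to the table above, the x86_64/Linux row works out to about 86.0% of the pool.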
To see how many compute nodes have a given ClassAd value, add the ClassAd value as a constraint.
To see only how many 64-bit, Intel-compatible, Linux (x86_64/Linux) platforms are currently in BoilerGrid:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX")'
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 31395 22740 4688 3957 3 2 5
Total 31395 22740 4688 3957 3 2 5
To see how many 64-bit, Intel-compatible, Linux (x86_64/Linux) platforms are currently in BoilerGrid and advertise MATLAB as installed:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (HAS_MATLAB == TRUE)'
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 24659 12008 1557 11094 0 0 0
Total 24659 12008 1557 11094 0 0 0
You may specify numeric constraints with other relational operators. To discover how many compute nodes have at least 16 GB of memory:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint 'TotalMemory >= 16046'
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 26093 18007 3330 4753 3 0 0
Total 26093 18007 3330 4753 3 0 0
ClassAd string values are case-sensitive. ClassAd attribute names are case-insensitive. The comparison operators (<, >, <=, >=, and ==) compare strings case-insensitively. The special comparison operators =?= and =!= compare strings case-sensitively. ClassAd expressions are similar to C boolean expressions and can be quite elaborate.
Increasing the throughput of your jobs may not come from maximizing the number of candidate compute nodes but rather from limiting the candidate compute nodes to the set which can access the shared scratch file system of the front-end. This limitation is useful in the case of a large input data file since it avoids both using HTCondor's file transfer mechanism and running the risk of preemptions preventing job completion.
The following table shows the current list of scratch directories:
| Cluster | Scratch Directory | File System Domain |
|---|---|---|
| condor.rcac.purdue.edu, Radon, Steele | /scratch/scratch95/m/myusername, /scratch/scratch96/m/myusername | bluearc.rcac.purdue.edu |
| Coates, Rossmann | /scratch/lustreA/m/myusername | lustrea.rcac.purdue.edu |
| Hansen | /scratch/lustreC/m/myusername | lustrec.rcac.purdue.edu |
| Miner | /scratch/miner/m/myusername | miner.rcac.purdue.edu |
To discover your scratch file directory, log in to your submission host and enter either of the following commands:
$ findscratch
$ echo $RCAC_SCRATCH
The response will be one of the following paths:
/scratch/scratch95/m/myusername
/scratch/scratch96/m/myusername
/scratch/lustreA/m/myusername
/scratch/lustreC/m/myusername
/scratch/miner/m/myusername
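Once you know your scratch directory, the table of scratch directories gives the matching file system domain. The shell sketch below encodes that lookup and builds a constraint string from it. Both helper names (scratch_to_domain and domain_constraint) are hypothetical, and the mapping is only as current as the table in this section.

```shell
#!/bin/sh
# Map a scratch directory to its file system domain, per the table above.
# scratch_to_domain is a hypothetical helper, not an ITaP or HTCondor tool.
scratch_to_domain() {
    case $1 in
        /scratch/scratch95/*|/scratch/scratch96/*) echo bluearc.rcac.purdue.edu ;;
        /scratch/lustreA/*) echo lustrea.rcac.purdue.edu ;;
        /scratch/lustreC/*) echo lustrec.rcac.purdue.edu ;;
        /scratch/miner/*)   echo miner.rcac.purdue.edu ;;
        *) echo "unknown scratch path: $1" >&2; return 1 ;;
    esac
}

# Build a ClassAd constraint that selects nodes in the same domain.
domain_constraint() {
    domain=$(scratch_to_domain "$1") || return 1
    printf 'FileSystemDomain == "%s"\n' "$domain"
}

# Demonstration with one of the paths listed above.
domain_constraint /scratch/lustreA/m/myusername
# prints: FileSystemDomain == "lustrea.rcac.purdue.edu"
```

The resulting string can be passed to condor_status, for example: condor_status -pool boilergrid.rcac.purdue.edu -total -constraint "$(domain_constraint "$RCAC_SCRATCH")".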
To see which shared scratch file system a specific cluster can access, search on the ClassAd attribute ClusterName:
$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'ClusterName=="Radon"' -format "%s\n" FileSystemDomain >myfile
To see which shared scratch file systems other clusters use, modify the preceding example with other cluster names: Hansen, Rossmann, Coates, Steele, or Miner.
To see which clusters can access a given shared scratch file system, search on the ClassAd attribute FileSystemDomain:
$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'FileSystemDomain == "bluearc.rcac.purdue.edu"' -format "%s\n" ClusterName >myfile
$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'FileSystemDomain == "lustrea.rcac.purdue.edu"' -format "%s\n" ClusterName >myfile
$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'FileSystemDomain == "lustrec.rcac.purdue.edu"' -format "%s\n" ClusterName >myfile
$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'FileSystemDomain == "miner.rcac.purdue.edu"' -format "%s\n" ClusterName >myfile
Using logical operators, you may combine ClassAd constraints. For example, to see how many x86_64 processor cores running Linux have access to the BlueArc shared scratch file system:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (FileSystemDomain == "bluearc.rcac.purdue.edu")'
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 9232 5515 1431 2286 0 0 0
Total 9232 5515 1431 2286 0 0 0
To submit jobs successfully to BoilerGrid and to achieve maximum throughput in HTCondor's computing environment, you must understand the architecture of BoilerGrid and how to request resources which are appropriate to your application. The following examples show how to discover the resources of BoilerGrid. They also explain standard input and output, command-line arguments, file input and output, Standard and Vanilla universe jobs, shared file systems, parameter sweeps, DAG Manager, job requirements and ranks, and how to run commercial and third-party software. You may wish to look here for an example that is most similar to your application and modify that example for your jobs. You may also refer to the HTCondor Manual for more details.
The job submission file must contain one executable command and at least one queue command. All other commands of the job submission file have default actions. HTCondor's job submission parser ignores blank lines and single-line comments beginning with a pound sign ("#"). There is no block (multi-line) comment in a job submission file. In some cases, a single-line comment may appear on a command line.
# FILENAME: myjob.sub
executable = myprogram
queue # place one copy of the job in the HTCondor queue
This job submission file uses HTCondor's default universe, Vanilla. Because this Vanilla job does not turn on HTCondor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job. Also by default, the number of jobs queued is one.
This job submission file may appear to be useless because it lacks the standard input, standard output, standard error, and a common log file; however, it will correctly process a program which reads and writes formatted files. Here is an example of file I/O. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
To submit this job to HTCondor:
$ condor_submit myjob.sub
HTCondor manages a batch environment. When HTCondor manages the execution of a computer program, that program cannot offer an interactive experience with a terminal. All input normally read from the keyboard (standard input) must be prepared in a file ahead of execution. All output normally written to the screen (standard output and standard error) appears in files where you may view it after execution. Also, HTCondor records in a common log file the main events of running a job.
Here is an example of standard I/O. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
Prepare a job submission file with an appropriate filename, here named myjob.sub:
# FILENAME: myjob.sub
executable = myprogram
# Standard I/O files, HTCondor log file
input = mydata.in
output = mydata.out
error = mydata.err
log = mydata.log
# queue one job
queue
This job submission file uses HTCondor's default universe, Vanilla. Because this Vanilla job does not turn on HTCondor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job. Also by default, the number of jobs queued is one.
This submission specifies that there exists a file, mydata.in, which contains all text which the program would otherwise read from the keyboard (standard input). It also specifies the names of three files which will receive standard output, standard error, and HTCondor's log entries. These three output files need not preexist, but they can. HTCondor will overwrite standard output and standard error but will append to the log file during subsequent submissions.
To submit this job to HTCondor:
$ condor_submit myjob.sub
HTCondor allows the specification of command-line arguments in the job submission file. There are two permissible formats for specifying arguments. The old syntax has arguments delimited (separated) by space characters. To use double quotes, escape with a backslash (i.e. put a backslash in front of each double quote). For example:
arguments = arg1 \"arg2\" 'arg3'
yields the following arguments:
arg1 "arg2" 'arg3'
The new syntax supports uniform quoting of spaces within arguments. A pair of double quotes surrounds the entire argument list. To include a literal double quote, simply repeat it. White space (spaces, tabs) separates arguments. To include literal white space in an argument, surround the argument with a pair of single quotes. To include a literal single quote within a single-quoted argument, repeat the single quote.
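The quoting rules of the new syntax are mechanical enough to automate. The shell sketch below applies them to a list of plain arguments and prints the resulting arguments line. The helper name condor_new_args is hypothetical and not part of HTCondor; the sketch handles only the cases described above (literal double quotes, white space, and literal single quotes).

```shell
#!/bin/sh
# Build an "arguments" line in the new syntax from plain shell arguments.
# condor_new_args is a hypothetical helper, not an HTCondor command.
condor_new_args() {
    list=""
    for arg in "$@"; do
        # Repeat every literal double quote and every literal single quote.
        esc=$(printf '%s' "$arg" | sed -e 's/"/""/g' -e "s/'/''/g")
        # An argument with white space or a single quote goes inside single quotes.
        if printf '%s' "$arg" | grep -q "[[:space:]']"; then
            esc="'$esc'"
        fi
        # Separate arguments with one space.
        list="${list:+$list }$esc"
    done
    # A pair of double quotes surrounds the entire argument list.
    printf 'arguments = "%s"\n' "$list"
}

# Demonstration with three arguments: arg9, "arg10", and one with spaces.
condor_new_args arg9 '"arg10"' "arg with literal ' and spaces"
# prints: arguments = "arg9 ""arg10"" 'arg with literal '' and spaces'"
```

Generating the line this way avoids hand-counting quotes when an argument list changes often.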
Here is a simple program which will display command-line arguments specified in a job submission file. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
Prepare a job submission file with an appropriate filename, here named myjob.sub. Run with command-line arguments in either the old or new syntax:
# FILENAME: myjob.sub
universe = VANILLA
executable = myprogram
# Old Syntax
# arguments = arg1 arg2 arg3 \"arg4\" 'arg5' 'arg with spaces' arg6 arg7_with_spaces arg8
# New Syntax
arguments = "arg9 ""arg10"" 'arg with literal '' and spaces'"
# HTCondor Macros
# arguments = $(Cluster) $(Process)
# standard I/O files, HTCondor log file
output = myprogram.out
error = myprogram.err
log = myprogram.log
# queue one job
queue
To submit this job to HTCondor:
$ condor_submit myjob.sub
View command-line arguments submitted in the old syntax:
*** MAIN START ***
Number of command line arguments: 12
command line argument, argv[0]: condor_exec.746418.0
command line argument, argv[1]: arg1
command line argument, argv[2]: arg2
command line argument, argv[3]: arg3
command line argument, argv[4]: "arg4"
command line argument, argv[5]: 'arg5'
command line argument, argv[6]: 'arg
command line argument, argv[7]: with
command line argument, argv[8]: spaces'
command line argument, argv[9]: arg6
command line argument, argv[10]: arg7_with_spaces
command line argument, argv[11]: arg8
*** MAIN STOP ***
The old syntax requires simulating spaces in arguments with the underscore character. Then, user code can replace the underscores with spaces to achieve an argument with spaces.
View command-line arguments submitted in the new syntax:
*** MAIN START ***
Number of command line arguments: 4
command line argument, argv[0]: condor_exec.341964.0
command line argument, argv[1]: arg9
command line argument, argv[2]: "arg10"
command line argument, argv[3]: arg with literal ' and spaces
*** MAIN STOP ***
The array element argv[0] holds HTCondor's name for a job.
Two HTCondor macros are useful as command-line arguments, $(Cluster) and $(Process):
*** MAIN START ***
Number of command line arguments: 3
command line argument, argv[0]: condor_exec.341965.0
command line argument, argv[1]: 341965
command line argument, argv[2]: 0
*** MAIN STOP ***
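Because each queued job receives its own $(Process) number, passing that number as an argument is a simple way to drive a parameter sweep. The shell sketch below is a hypothetical wrapper script (the name sweep.sh, the start value, and the step are all illustrative assumptions) which a job submission file could name as its executable, together with arguments = $(Process) and a queue count. It converts the process number into a parameter value.

```shell
#!/bin/sh
# sweep.sh: hypothetical wrapper that turns a process number into a parameter.
# Assumed submit-file lines:  arguments = $(Process)   and   queue 10

START=100   # parameter value for process 0 (illustrative)
STEP=25     # increment from one process to the next (illustrative)

# Compute the parameter value for process number $1.
param_for_process() {
    echo $((START + $1 * STEP))
}

# Use the first command-line argument when given, otherwise process 0.
process=${1:-0}
value=$(param_for_process "$process")
echo "process $process uses parameter $value"
# A real wrapper would start the program here, for example:
#   ./myprogram "$value"
```

With these illustrative settings, process 0 would use 100, process 1 would use 125, and so on, so queue 10 would cover the values 100 through 325.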
HTCondor is able to manage a computer program which reads and writes formatted data files.
Here is an example of formatted file I/O. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
Prepare a job submission file with an appropriate filename, here named myjob.sub. This example combines formatted file I/O with standard output:
# FILENAME: myjob.sub
executable = myprogram
# Standard I/O files, HTCondor log file
output = mydata.out
error = mydata.err
log = mydata.log
# queue one job
queue
This job submission file uses HTCondor's default universe, Vanilla. Because this Vanilla job does not turn on HTCondor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job. Also by default, the number of jobs queued is one.
This submission assumes that there exists a formatted input file, myinputdata, a name which appears in the source code only. The result is a formatted output file, myoutputdata, a name which also appears in the source code only. This submission also specifies the names of three files which will receive standard output, standard error, and HTCondor's log entries. These three output files need not preexist, but they can. HTCondor will overwrite standard output and standard error but will append to the log file during subsequent submissions.
To submit this job to HTCondor:
$ condor_submit myjob.sub
The Standard Universe is an execution environment of HTCondor. Jobs using the Standard Universe enjoy two advantages. The first advantage is checkpointing: a job with a higher priority may preempt an HTCondor job without loss of completed work. HTCondor can checkpoint the job and move (migrate) it to a different compute node which would otherwise be idle. HTCondor restarts the job on the new compute node at precisely the point of preemption. The Standard Universe tells HTCondor that you re-linked your job via condor_compile with the HTCondor libraries, and therefore your job supports checkpointing. HTCondor transfers the executable and checkpoint files automatically, when needed.
The second advantage of HTCondor's Standard Universe is that remote system calls handle access to files (input and output). For example, HTCondor intercepts a call to read a record of a data file. HTCondor sends the read operation to the user's current working directory on the submission host which performs the read operation. HTCondor then sends the desired record to the compute node which processes the record. A similar process occurs with write operations. Therefore, the existence of a shared file system is not relevant. This feature maximizes the number of machines which can run a job. Compute nodes across an entire enterprise can run a job, including compute nodes in different administrative domains.
This section illustrates how to submit a small job to the Standard Universe of BoilerGrid. This example, myprogram.c, displays the name of the host which runs the job. To compile this program for the Standard Universe, see Compiling Serial Programs.
Prepare a job submission file with the Standard Universe, the compiled C program as the executable, HTCondor's transferring the compiled program to the chosen compute node, and an appropriate filename, here named myjob.sub:
# FILENAME: myjob.sub
universe = STANDARD
# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable = myprogram
# Standard I/O files, HTCondor log file
output = mydata.out
error = mydata.err
log = mydata.log
# queue one job
queue
Submit this job to HTCondor:
$ condor_submit myjob.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 341956.
View job status:
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35556> : condor.rcac.purdue.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
341956.0 myusername 10/22 11:18 0+00:00:00 I 0 7.3 myjob
Place the job on hold to study the submission:
$ condor_hold 341956
Cluster 341956 held.
Obtain the requirements of this job:
$ condor_q myusername -attributes requirements -long
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35556> : condor.rcac.purdue.edu
requirements = (Arch == "X86_64") && (OpSys == "LINUX") && ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED)) && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize)
Job requirements reflect the Standard Universe (preemption with checkpointing). This job requires a processor core which runs the Linux operating system on the x86_64 architecture and has the ability to checkpoint the job at preemption. The requirements exclude any mention of the shared file system since a shared file system is not relevant to a Standard Universe job. Running a Standard Universe job does not limit the job to the processor cores which use the same shared file system that the submission host uses. The job may land either on a processor core that uses the same shared file system or not; in either case, the remote I/O of the Standard Universe handles the job's file I/O. The ClassAds Disk and Memory ensure that the chosen core has sufficient resources to hold the disk and memory footprint of the job.
To see how many compute nodes of BoilerGrid are able to satisfy this job's requirements:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED))'
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 33118 27602 2878 2596 42 0 0
Total 33118 27602 2878 2596 42 0 0
The report shows that 33,118 processor cores are candidates for running the job. Using HTCondor's Standard Universe with its remote file I/O maximizes the number of candidate cores available to the job. This large pool of candidate cores has the potential of offering high throughput. Some subset of the candidate processor cores, possibly none, will be available to accept the job at any given moment.
Release the job from hold:
$ condor_release 341956
Cluster 341956 released.
View results in the file for all standard output, here named mydata.out:
*** MAIN START ***
hostname = cms-100.rcac.purdue.edu
domainname = (none)
*** MAIN STOP ***
The output shows the name of the processor core which ran the job. While this job ran on a processor core which resides on the same shared file system used by the submission host, another submission which forced the job onto a core of another shared file system also ran successfully because the remote I/O of the Standard Universe handled the reading and writing of records.
View the log file, mydata.log:
000 (341956.000.000) 10/22 11:42:22 Job submitted from host: <128.211.157.86:35556>
...
012 (341956.000.000) 10/22 11:42:57 Job was held.
via condor_hold (by user myusername)
Code 1 Subcode 0
...
001 (341956.000.000) 10/22 11:43:57 Job executing on host: <128.211.157.10:52556>
...
005 (341956.000.000) 10/22 11:43:57 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
1110 - Run Bytes Sent By Job
5431033 - Run Bytes Received By Job
1110 - Total Bytes Sent By Job
5431033 - Total Bytes Received By Job
The log file records the main events related to the processing of this job. In this example, the log records the number of bytes read and written between the submission host and the compute node via the remote I/O of the Standard Universe.
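Each event in the log begins with a three-digit code followed by the job ID in parentheses; the logs shown in this document contain the codes 000 (submitted), 001 (executing), 005 (terminated), and 012 (held). When one log file covers many jobs, counting events by code gives a quick progress summary. The shell sketch below does this; the helper name log_event_counts is hypothetical, and it assumes the log layout shown above.

```shell
#!/bin/sh
# Count the events in an HTCondor job log, grouped by event code.
# log_event_counts is a hypothetical helper; it reads the log on stdin.
log_event_counts() {
    # Event lines start with a three-digit code, a space, and "(".
    # Detail lines and "..." separators do not match and are ignored.
    awk '/^[0-9][0-9][0-9] \(/ { count[$1]++ }
         END { for (code in count) print code, count[code] }' | sort
}
```

For example, log_event_counts < mydata.log prints one line per event code with the number of occurrences; comparing the count for code 005 with the number of queued jobs shows how many jobs have terminated.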
The Standard Universe maximizes throughput with its ability to checkpoint jobs and to intercept remote system calls. The latter avoids requiring the submission host and the compute node to share a file system. The process of re-linking a job with HTCondor's libraries involves including both HTCondor's libraries and the user's libraries as static libraries. The danger of this effort to maximize throughput is that an HTCondor flock is a heterogeneous collection of old and new compute nodes, so a job can land on a compute node that is unable to run the job. When this happens, the user must consider how to avoid compute nodes which are unable to run a job to a successful completion.
The Vanilla Universe is an execution environment of HTCondor. The Vanilla Universe tells HTCondor that you did not re-link your job via condor_compile with the HTCondor libraries, and therefore your job does not support checkpointing or remote system calls. Such jobs include an executable binary of a commercial application, shell script, or a program which is to take advantage of features of a compiler which is not compatible with HTCondor's condor_compile command.
For jobs submitted under the Vanilla Universe, the existence of a shared file system is relevant since access to files (input and output) involves either a shared file system or HTCondor's file transfer mechanism, not the remote system calls of the Standard Universe.
This section illustrates how to submit a small job to the Vanilla Universe of BoilerGrid from a submission host which has a shared file system with HTCondor's file transfer mechanism turned off, by default. This example, myprogram.c, displays the name of the host which runs the job, the path name of the current working directory, and the contents of that directory. The Vanilla Universe allows using Linux commands to obtain the information. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
Prepare a job submission file with the Vanilla Universe, the compiled C program as the executable, HTCondor's file transfer mechanism turned off by default, HTCondor's transferring the compiled program to the chosen compute node, and an appropriate filename, here named myjob.sub:
# FILENAME: myjob.sub
universe = VANILLA
# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable = myprogram
# HTCondor's file transfer mechanism is off, by default.
# Standard I/O files, HTCondor log file
output = mydata.out
error = mydata.err
log = mydata.log
# queue one job
queue
Submit this job to HTCondor:
$ condor_submit myjob.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 746407.
View job status:
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
746407.0 myusername 10/25 10:04 0+00:00:00 I 0 0.0 myjob
Place this job on hold to study the submission:
$ condor_hold 746407
Cluster 746407 held.
Obtain the requirements of this job:
$ condor_q myusername -attributes requirements -long
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (TARGET.FileSystemDomain == MY.FileSystemDomain)
Job requirements reflect both the Vanilla Universe (preemption without checkpointing) and the shared file system. This job requires a compute node which runs the Linux operating system on the x86_64 architecture and, more importantly, which shares the same FileSystemDomain as the submission host (both the TARGET and MY shared file system must be the same). So, this submission limits running the job to the processor cores which use the same shared file system that the submission host uses. The ClassAds Disk and Memory ensure that the chosen core has sufficient resources to hold the disk and memory footprint of the job.
To see how many compute nodes in various file system domains of BoilerGrid are able to satisfy this job's requirements:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (FileSystemDomain == "bluearc.rcac.purdue.edu")'
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 9924 7466 1579 878 0 1 0
Total 9924 7466 1579 878 0 1 0
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (FileSystemDomain == "lustrea.rcac.purdue.edu")'
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 18784 16717 1438 628 0 1 0
Total 18784 16717 1438 628 0 1 0
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (FileSystemDomain == "miner.rcac.purdue.edu")'
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 1006 156 760 90 0 0 0
Total 1006 156 760 90 0 0 0
The reports show 9,924, 18,784, and 1,006 candidate processor cores in the three file system domains queried. Because this job must share a file system with the submission host, only the cores in the submission host's own FileSystemDomain can actually run it. While the number of candidate processor cores which are able to run this job is much smaller than the number of x86_64 cores running Linux on BoilerGrid, using the shared file system is the preferred method in many situations. Some subset of the candidate processor cores, possibly none, will be available to accept the job at any given moment.
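The three reports are easier to compare once each totals row is split into named columns. The Python sketch below is an illustration only: it assumes the rows were copied by hand from the condor_status output above, and the name parse_totals_row is hypothetical, not part of HTCondor.

```python
# Column order as printed in the header of "condor_status -total".
COLUMNS = ["Total", "Owner", "Claimed", "Unclaimed",
           "Matched", "Preempting", "Backfill"]


def parse_totals_row(line):
    """Split one totals row into (label, {column: count})."""
    fields = line.split()
    label = fields[0]
    counts = [int(field) for field in fields[1:]]
    if len(counts) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} counts, got {len(counts)}")
    return label, dict(zip(COLUMNS, counts))


# Rows copied from the three reports above, keyed by FileSystemDomain.
reports = {
    "bluearc.rcac.purdue.edu": "X86_64/LINUX 9924 7466 1579 878 0 1 0",
    "lustrea.rcac.purdue.edu": "X86_64/LINUX 18784 16717 1438 628 0 1 0",
    "miner.rcac.purdue.edu": "X86_64/LINUX 1006 156 760 90 0 0 0",
}

parsed = {}
for domain, row in reports.items():
    label, counts = parse_totals_row(row)
    parsed[domain] = counts
    share = 100.0 * counts["Unclaimed"] / counts["Total"]
    print(f"{domain}: {counts['Unclaimed']} of {counts['Total']} "
          f"cores unclaimed ({share:.1f}%)")
```

The Unclaimed column is the closest indication of how many candidate cores could accept the job at the moment the report was taken; the count changes from minute to minute.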
Release the job from hold:
$ condor_release 746407
Cluster 746407 released.
View results in the file for all standard output, here named mydata.out:
cms-100.rcac.purdue.edu (none)
/home/myhomedirectory/Condor/vanilla_w_sfs
total 288
-rw-r--r-- 1 myusername itap 1508 Oct 25 10:39 commands
-rw-r--r-- 1 myusername itap    0 Oct 25 10:46 mydata.err
-rw-r--r-- 1 myusername itap  467 Oct 25 10:46 mydata.log
-rw-r--r-- 1 myusername itap   71 Oct 25 10:46 mydata.out
-rwxr-xr-x 1 myusername itap 6863 Oct 25 10:14 myprogram
-rw-r----- 1 myusername itap  376 Oct 25 09:14 myprogram.c
-rw-r----- 1 myusername itap  199 Oct 25 10:04 myprogram.sub
-rwxr----- 1 myusername itap   70 Oct 25 09:19 run
-rwxr--r-- 1 myusername itap  216 Oct 25 09:14 tally
-rw-r----- 1 myusername itap  952 Oct 25 09:14 tmp
-rw-r--r-- 1 myusername itap    0 Oct 25 10:19 tmp1
*** MAIN START ***
*** MAIN STOP ***
The output shows the name of the compute node which ran the job. This job ran on a compute node which uses the same shared file system which the submission host uses. The fact that the current working directory is the user's home directory on the submission host proves that this job used the shared file system for file I/O.
View the log file, mydata.log:
000 (746407.000.000) 10/25 10:04:51 Job submitted from host: <128.211.157.86:60481>
...
012 (746407.000.000) 10/25 10:05:15 Job was held.
via condor_hold (by user myusername)
Code 1 Subcode 0
...
009 (746406.000.000) 10/25 10:22:07 Job was aborted by the user.
via condor_rm (by user myusername)
...
013 (746407.000.000) 10/25 10:44:07 Job was released.
via condor_release (by user myusername)
...
001 (746407.000.000) 10/25 10:46:47 Job executing on host: <128.211.157.10:57108>
...
005 (746407.000.000) 10/25 10:46:47 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
The log file records the main events related to the processing of this job. In this example, the log records that the number of bytes read and written between the submission host and the compute node is zero, another indication that this job used the shared file system for file I/O.
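Checking the byte counters by eye becomes tedious for longer logs. As a rough sketch, the Python below pulls the event lines and byte counters out of log text laid out like the excerpt above. It assumes a log that describes a single job and the line layout shown in this example; summarize_log and the two regular expressions are illustrative names, not an HTCondor API.

```python
import re

# Event header, e.g. "005 (746407.000.000) 10/25 10:46:47 Job terminated."
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (\S+ \S+) (.*)$")
# Byte counter, e.g. "0 - Total Bytes Sent By Job"
BYTES_RE = re.compile(r"^(\d+)\s+-\s+(.*Bytes.*)$")


def summarize_log(text):
    """Return (events, byte_counts) for a log that describes one job."""
    events = []
    byte_counts = {}
    for raw in text.splitlines():
        line = raw.strip()
        event = EVENT_RE.match(line)
        if event:
            code, cluster, process, stamp, message = event.groups()
            job_id = f"{int(cluster)}.{int(process)}"
            events.append((code, job_id, stamp, message))
            continue
        counter = BYTES_RE.match(line)
        if counter:
            byte_counts[counter.group(2)] = int(counter.group(1))
    return events, byte_counts


# Abbreviated copy of the log shown above.
SAMPLE_LOG = """\
000 (746407.000.000) 10/25 10:04:51 Job submitted from host: <128.211.157.86:60481>
...
001 (746407.000.000) 10/25 10:46:47 Job executing on host: <128.211.157.10:57108>
...
005 (746407.000.000) 10/25 10:46:47 Job terminated.
(1) Normal termination (return value 0)
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
"""

events, byte_counts = summarize_log(SAMPLE_LOG)
for code, job_id, stamp, message in events:
    print(code, job_id, stamp, message)

# Non-zero counters would indicate that HTCondor moved files for the job.
used_transfer = any(count > 0 for count in byte_counts.values())
print("bytes moved by HTCondor:", used_transfer)
```

For this job every counter is zero, which matches the conclusion that file I/O went through the shared file system.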
The Vanilla Universe is available for jobs which cannot take advantage of the Standard Universe because you cannot re-link them with HTCondor's libraries. You cannot re-link an executable binary, a shell script, or a program which you compiled with an incompatible compiler. If a shared file system is available, the Vanilla job may use it for file I/O by keeping HTCondor's file transfer mechanism turned off. Keeping the file transfer mechanism off excludes compatible compute nodes which do not share a file system with the submission host.
The Vanilla Universe is an execution environment of HTCondor. The Vanilla Universe tells HTCondor that you did not re-link a job via condor_compile with the HTCondor libraries and therefore your job does not support checkpointing or remote system calls. Such jobs include an executable binary of a commercial application, shell script, or a program which is to take advantage of features of a compiler which is not compatible with HTCondor's condor_compile command.
For jobs submitted under the Vanilla Universe, the existence of a shared file system is relevant since access to files (input and output) involves either a shared file system or HTCondor's file transfer mechanism, not the remote system calls of the Standard Universe.
This section illustrates how to submit a small job to the Vanilla Universe of BoilerGrid from a submission host which has a shared file system with HTCondor's file transfer mechanism turned on "if needed". HTCondor transfers files only when it matches the job with a compute node which uses a different FileSystemDomain from the one which the submission host uses. If HTCondor matches the job with a compute node which uses the same FileSystemDomain which the submission host uses, HTCondor does not transfer files and relies on the shared file system instead. This example, myprogram.c, displays the name of the host which runs the job, the path name of the current working directory, and the contents of that directory. The Vanilla Universe allows using Linux commands to obtain the information. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
Prepare a job submission file with the Vanilla Universe, the compiled C program as the executable, HTCondor's file transfer mechanism turned on if needed, HTCondor's transferring the compiled program to the chosen compute node, and an appropriate filename, here named myjob.sub:
# FILENAME: myjob.sub

universe = VANILLA

# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable = myprogram

# HTCondor's file transfer mechanism is turned on only when needed.
should_transfer_files = IF_NEEDED
# Let HTCondor handle output file(s).
when_to_transfer_output = ON_EXIT

# Standard I/O files, HTCondor log file
output = mydata.out
error = mydata.err
log = mydata.log

# queue one job
queue
Submit this job to HTCondor:
$ condor_submit myjob.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 746408.
View job status:
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
 ID        OWNER        SUBMITTED     RUN_TIME    ST PRI SIZE CMD
746408.0   myusername   10/25 10:04   0+00:00:00  I  0   0.0  myjob
Place this job on hold to study the submission:
$ condor_hold 746408
Cluster 746408 held.
Obtain the requirements of this job:
$ condor_q myusername -attributes requirements -long
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && ((HasFileTransfer) || (TARGET.FileSystemDomain == MY.FileSystemDomain))
The requirements reflect both the Vanilla Universe (preemption without checkpointing) and HTCondor's file transfer mechanism turned on only if needed. This job requires a processor core which runs the Linux operating system on the x86_64 architecture, but, more importantly, the processor core chosen to run this job need not share the same FileSystemDomain which the submission host uses (both the TARGET and MY shared file system need not be equal). The ClassAd of this job states that the chosen core must either have the file transfer capability or share a file system with the submission host. So, this submission does not limit running the job to the processor cores which use the same shared file system which the submission host uses. The ClassAds Disk and Memory ensure that the chosen core has sufficient resources to hold the disk and memory footprint of the job.
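The difference between the two requirements expressions seen so far can be made concrete with a toy model. The Python below is only a sketch: it is not HTCondor's ClassAd evaluator, the machine and job dictionaries are invented for illustration, and the second file system domain is a made-up name.

```python
def base_ok(machine, job):
    """Clauses shared by both submissions: architecture, OS, disk, memory."""
    return (
        machine["Arch"] == "X86_64"
        and machine["OpSys"] == "LINUX"
        and machine["Disk"] >= job["DiskUsage"]
        # Mirrors (Memory * 1024) >= ImageSize from the requirements above.
        and machine["Memory"] * 1024 >= job["ImageSize"]
    )


def matches(machine, job, should_transfer_files="NO"):
    """Evaluate the file-access clause for the two submissions shown so far."""
    same_domain = machine["FileSystemDomain"] == job["FileSystemDomain"]
    if should_transfer_files == "NO":
        file_clause = same_domain
    elif should_transfer_files == "IF_NEEDED":
        file_clause = machine["HasFileTransfer"] or same_domain
    else:
        raise ValueError("this sketch models only NO and IF_NEEDED")
    return base_ok(machine, job) and file_clause


# Hypothetical job and machines; the numbers are placeholders.
job = {"DiskUsage": 10, "ImageSize": 100,
       "FileSystemDomain": "bluearc.rcac.purdue.edu"}
local = {"Arch": "X86_64", "OpSys": "LINUX", "Disk": 1000000, "Memory": 2048,
         "HasFileTransfer": True,
         "FileSystemDomain": "bluearc.rcac.purdue.edu"}
remote = dict(local, FileSystemDomain="elsewhere.example.edu")

for name, machine in (("local", local), ("remote", remote)):
    for mode in ("NO", "IF_NEEDED"):
        print(name, mode, matches(machine, job, mode))
```

In this model the machine in the other domain is rejected when file transfer is off and accepted with IF_NEEDED, which is why the second submission reaches a larger pool of cores.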
To see how many compute nodes of BoilerGrid are able to satisfy this job's requirements:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && ((HasFileTransfer) || (FileSystemDomain == "bluearc.rcac.purdue.edu"))'
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 32074 20806 4287 6976 5 0 0
Total 32074 20806 4287 6976 5 0 0
The report shows that 32,074 processor cores are candidates for running this job. Using HTCondor's Vanilla Universe with its file transfer mechanism turned on only if needed maximizes the number of candidate cores available to the job. This large pool of candidate cores has the potential of offering high throughput. On the other hand, large data files involved in the file transfer may diminish throughput. Some subset of the candidate processor cores, possibly none, will be available to accept the job at any given moment.
Release the job from hold:
$ condor_release 746408
Cluster 746408 released.
View results in the file for all standard output, here named mydata.out:
cms-100.rcac.purdue.edu (none)
/home/myhomedirectory/HTCondor/vanilla_w_sfs_ftm
total 284
-rw-r--r-- 1 myusername itap 1508 Oct 25 10:39 commands
-rw-r--r-- 1 myusername itap    0 Oct 25 10:46 mydata.err
-rw-r--r-- 1 myusername itap  467 Oct 25 10:46 mydata.log
-rw-r--r-- 1 myusername itap   71 Oct 25 10:46 mydata.out
-rwxr-xr-x 1 myusername itap 6863 Oct 25 10:14 myprogram
-rw-r----- 1 myusername itap  376 Oct 25 09:14 myprogram.c
-rw-r----- 1 myusername itap  199 Oct 25 10:04 myjob.sub
-rwxr----- 1 myusername itap   70 Oct 25 09:19 run
-rwxr--r-- 1 myusername itap  216 Oct 25 09:14 tally
-rw-r----- 1 myusername itap  952 Oct 25 09:14 tmp
-rw-r--r-- 1 myusername itap    0 Oct 25 10:19 tmp1
*** MAIN START ***
*** MAIN STOP ***
The output shows the name of the processor core which ran the job. This job ran on a processor core which uses the same shared file system which the submission host uses. The fact that the current working directory is the user's home directory on the submission host proves that this job used the shared file system for file I/O.
View the log file, mydata.log:
000 (746408.000.000) 10/25 10:04:51 Job submitted from host: <128.211.157.86:60481>
...
012 (746408.000.000) 10/25 10:05:15 Job was held.
via condor_hold (by user myusername)
Code 1 Subcode 0
...
013 (746408.000.000) 10/25 10:44:07 Job was released.
via condor_release (by user myusername)
...
001 (746408.000.000) 10/25 10:46:47 Job executing on host: <128.211.157.10:57108>
...
005 (746408.000.000) 10/25 10:46:47 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
The log file records the main events related to the processing of this job. In this example, the log records that the number of bytes read and written between the submission host and the compute node is zero, another indication that this job used the shared file system for file I/O.
To see HTCondor's file transfer mechanism at work, repeat the example above but force the job to a compute node which does not share a file system with the submission host.
Modify the job submission file of the previous example to send the job to a processor core which uses a different shared file system:
# FILENAME: myjob.sub

universe = VANILLA

# A core on the Rossmann cluster uses a different shared file system.
requirements = ClusterName == "Rossmann"

# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable = myprogram

# HTCondor's file transfer mechanism is turned on only when needed.
# This submission needs the transfer mechanism.
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT

# Standard I/O files, HTCondor log file
output = mydata.out
error = mydata.err
log = mydata.log

# queue one job
queue
View results in the file for all standard output, here named mydata.out:
cms-100.rcac.purdue.edu (none)
/var/condor/execute/dir_11554
total 12
-rwxr-xr-x 1 myusername itap 6863 Oct 25 12:28 condor_exec.exe
-rw-r--r-- 1 myusername itap    0 Oct 25 12:31 mydata.err
-rw-r--r-- 1 myusername itap   67 Oct 25 12:32 mydata.out
*** MAIN START ***
*** MAIN STOP ***
This output file shows a temporary directory on the processor core which HTCondor chose to run the job, rather than the user's home directory, an indication that this job used HTCondor's file transfer mechanism for file I/O.
View the log file, mydata.log:
000 (746411.000.000) 10/25 12:08:12 Job submitted from host: <128.211.157.86:60481>
...
001 (746411.000.000) 10/25 12:31:59 Job executing on host: <128.211.157.10:51871>
...
005 (746411.000.000) 10/25 12:32:00 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
278 - Run Bytes Sent By Job
6863 - Run Bytes Received By Job
278 - Total Bytes Sent By Job
6863 - Total Bytes Received By Job
The log file records the main events related to the processing of this job. In this example, the log records the number of bytes transferred between the submission host and the compute node via the file transfer mechanism, another indication that HTCondor's file transfer mechanism was used.
The Vanilla Universe is available for jobs which cannot take advantage of the Standard Universe because you cannot re-link them with HTCondor's libraries. You cannot re-link an executable binary, a shell script, or a program which you compiled with an incompatible compiler. If a shared file system is available and HTCondor's file transfer mechanism is suitable for the job, the Vanilla job may use either for file I/O by specifying that the submission uses the mechanism only "if needed." While this method can maximize throughput, the size of any file which you intend to transfer must be reasonable. The size of the file must fit on the available disk space of the compute node. The amount of time needed to transfer the file cannot be so great that higher priority jobs constantly preempt the HTCondor job during file transfer.
The Vanilla Universe is an execution environment of HTCondor. The Vanilla Universe tells HTCondor that you did not re-link your job via condor_compile with the HTCondor libraries, and therefore your job does not support checkpointing or remote system calls. Such jobs include an executable binary of a commercial application, shell script, or a program which is to take advantage of features of a compiler which is not compatible with HTCondor's condor_compile command.
For jobs submitted under the Vanilla Universe, the existence of a shared file system is relevant since access to files (input and output) involves either a shared file system or HTCondor's file transfer mechanism, not the remote system calls of the Standard Universe.
This section illustrates how to submit a small job to the Vanilla Universe of BoilerGrid from a submission host which lacks a shared file system with HTCondor's file transfer mechanism turned on. No matter which processor core HTCondor chooses to run the job, HTCondor transfers files. This example, myprogram.c, displays the name of the host which runs the job, the path name of the current working directory, and the contents of that directory. The Vanilla Universe allows using Linux commands to obtain the information. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
Prepare a job submission file with the Vanilla Universe, the compiled C program as the executable, HTCondor's file transfer mechanism turned on, HTCondor's transferring the compiled program to the chosen compute node, and an appropriate filename, here named myjob.sub:
# FILENAME: myjob.sub

universe = VANILLA

# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable = myprogram

# Turn on HTCondor's file transfer mechanism.
should_transfer_files = YES
# Let HTCondor handle output files.
when_to_transfer_output = ON_EXIT

# Standard I/O files, HTCondor log file
output = mydata.out
error = mydata.err
log = mydata.log

# queue one job
queue
Submit this job to HTCondor:
$ condor_submit myjob.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 341960.
View job status:
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35556> : condor.rcac.purdue.edu
 ID        OWNER        SUBMITTED     RUN_TIME    ST PRI SIZE CMD
341960.0   myusername   10/25 15:02   0+00:00:00  I  0   0.0  myjob
Place this job on hold to study the submission:
$ condor_hold 341960
Cluster 341960 held.
Obtain the requirements of this job:
$ condor_q myusername -attributes requirements -long
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:35556> : condor.rcac.purdue.edu
requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (HasFileTransfer)
Job requirements reflect both the Vanilla Universe (preemption without checkpointing) and HTCondor's file transfer mechanism turned on. This job requires a processor core which runs the Linux operating system on the x86_64 architecture, but more importantly the processor core chosen to run this job can reside on a cluster which lacks a shared file system. The ClassAd of this job states that the chosen core must have the file transfer capability. The ClassAds Disk and Memory ensure that the chosen core has sufficient resources to hold the disk and memory footprint of the job.
To see how many compute nodes of BoilerGrid are able to satisfy this job's requirements:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint '(Arch == "X86_64") && (OpSys == "LINUX") && (HasFileTransfer)'
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 33068 20690 3850 8520 8 0 0
Total 33068 20690 3850 8520 8 0 0
This report shows that 33,068 processor cores are candidates for running this job. Using HTCondor's Vanilla Universe with its file transfer mechanism turned on maximizes the number of candidate cores available to the job. This large pool of candidate cores has the potential of offering high throughput. On the other hand, large data files involved in the file transfer may diminish throughput. Some subset of the candidate processor cores, possibly none, will be available to accept the job at any given moment.
Release the job from hold:
$ condor_release 341960
Cluster 341960 released.
View results in the file for all standard output, here named mydata.out:
cms-100.rcac.purdue.edu (none)
/var/condor/execute/dir_13374
total 12
-rwxr-xr-x 1 myusername itap 6863 Oct 25 15:47 condor_exec.exe
-rw-r--r-- 1 myusername itap    0 Oct 25 15:50 mydata.err
-rw-r--r-- 1 myusername itap   61 Oct 25 15:50 mydata.out
*** MAIN START ***
*** MAIN STOP ***
The output shows the name of the processor core which ran the job. This job ran on a processor core which shares a file system with the submission host. Despite this, the current working directory is a temporary directory on the compute node; therefore, this job used the file transfer mechanism for file I/O.
View the log file, mydata.log:
000 (341960.000.000) 10/25 15:03:18 Job submitted from host: <128.211.157.86:35556>
...
012 (341960.000.000) 10/25 15:03:35 Job was held.
via condor_hold (by user myusername)
Code 1 Subcode 0
...
013 (341960.000.000) 10/25 15:48:00 Job was released.
via condor_release (by user myusername)
...
001 (341960.000.000) 10/25 15:50:46 Job executing on host: <128.211.157.10:33047>
...
005 (341960.000.000) 10/25 15:50:46 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
272 - Run Bytes Sent By Job
6863 - Run Bytes Received By Job
272 - Total Bytes Sent By Job
6863 - Total Bytes Received By Job
The log file records the main events related to the processing of this job. In this example, the log records the number of bytes transferred between the submission host and the compute node via HTCondor's file transfer mechanism.
The Vanilla Universe is available for jobs which cannot take advantage of the Standard Universe because you cannot re-link them with HTCondor's libraries. You cannot re-link an executable binary, a shell script, or a program which you compiled with an incompatible compiler. If a shared file system is not available and HTCondor's file transfer mechanism is suitable for the job, you may turn on the file transfer mechanism, and the Vanilla job will transfer your files. The size of any file which you intend to transfer must be reasonable. The size of the file must fit on the available disk space of the compute node. The amount of time needed to transfer the file cannot be so great that higher priority jobs constantly preempt the HTCondor job during file transfer.
A parameter sweep is a computation applied to several unique sets of data. Computing the average test score for each class in a school is an example of a simple parameter sweep.
A parameter sweep requires that each queued run of a computer program reads a unique set of input parameters. This example illustrates how to implement a unique command line argument, a unique standard input, and a unique formatted input data file. The program writes to a unique standard output and to a unique formatted output file.
HTCondor provides a pre-defined macro $(Process) which supports the specification of unique command-line arguments and input/output files so that each queued run of a job sees a unique set of data. This macro appears in the job submission file, myprogram.sub:
# FILENAME: myprogram.sub

universe = VANILLA

executable = myprogram

# Processes 0,1,2
# command line argument
arguments = $(Process)

# Standard I/O files, HTCondor log file
input = mydata.in.$(Process)
output = mydata.out.$(Process)
error = mydata.err.$(Process)
log = mydata.log

# queue 3 jobs in 1 cluster
queue 3
This job submission file specifies a parameter sweep of three queued runs. Each process has a unique ID: 0, 1, and 2. Process 0 of the parameter sweep expects a standard input file named "mydata.in.0"; process 1, "mydata.in.1"; and process 2, "mydata.in.2". The sweep will generate similarly named files for standard output and error. HTCondor advises using a single log file in a submission. In addition, the sweep expects to find formatted input data files with the same process number used as a suffix: i_mydata.0, i_mydata.1, i_mydata.2. Each copy of the computer program used in this sweep finds its unique process number in its command-line argument and appends that unique process number to the generic names "i_mydata." and "o_mydata." to make unique formatted data file names. All files reside in the user's current working directory; hence, data file names must be unique. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
This job submission file uses HTCondor's default universe, Vanilla. Because this Vanilla job does not turn on HTCondor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job.
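Preparing the numbered input files by hand quickly becomes tedious as the queue count grows. The following Python sketch generates the files this sweep expects, using the naming pattern described above and placeholder contents in the style of the sample files in this example. The function name write_sweep_inputs is hypothetical, and the demonstration writes into a temporary directory; in practice you would pass the directory from which you submit the job.

```python
import tempfile
from pathlib import Path


def write_sweep_inputs(directory, count):
    """Create mydata.in.N and i_mydata.N for N = 0 .. count-1."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    created = []
    for process in range(count):
        # Standard input file named by the "input" line of the submit file.
        stdin_file = directory / f"mydata.in.{process}"
        stdin_file.write_text(f"textfromstandardinput:process{process}\n")
        # Formatted input file which the program opens by name.
        data_file = directory / f"i_mydata.{process}"
        data_file.write_text(f"textfromformattedinput:process{process}\n")
        created.extend([stdin_file, data_file])
    return created


# Demonstration in a scratch directory; the count must equal the number
# given to the queue command (3 in this example).
with tempfile.TemporaryDirectory() as scratch:
    created = write_sweep_inputs(scratch, 3)
    names = sorted(path.name for path in created)
    first_stdin = (Path(scratch) / "mydata.in.0").read_text()

print(names)
```

Real sweeps would replace the placeholder text with the actual parameters for each run.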
To submit the executable to HTCondor:
$ condor_submit myprogram.sub
For this submission, command condor_q will show a single cluster number, three unique process numbers, and three unique command-line arguments:
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
 ID        OWNER        SUBMITTED     RUN_TIME    ST PRI SIZE CMD
746419.0   myusername   10/28 10:57   0+00:00:00  I  0   0.0  myprogram 0
746419.1   myusername   10/28 10:57   0+00:00:00  I  0   0.0  myprogram 1
746419.2   myusername   10/28 10:57   0+00:00:00  I  0   0.0  myprogram 2
View the standard input file for process 0, mydata.in.0:
textfromstandardinput:process0
View the formatted input file for process 0, i_mydata.0:
textfromformattedinput:process0
View the standard output file for process 0, mydata.out.0:
*** MAIN START ***
program name: condor_exec.exe
command line argument: 0
standard input/output: textfromstandardinput:process0
formatted input/output: textfromformattedinput:process0
*** MAIN STOP ***
View the formatted output file for process 0, o_mydata.0:
textfromformattedinput:process0
Processes 1 and 2 have similar input and output files.
The single log file records the major events of the submission of the three queued runs of this parameter sweep:
000 (746419.000.000) 10/28 10:57:43 Job submitted from host: <128.211.157.86:60481>
...
000 (746419.001.000) 10/28 10:57:43 Job submitted from host: <128.211.157.86:60481>
...
000 (746419.002.000) 10/28 10:57:43 Job submitted from host: <128.211.157.86:60481>
...
001 (746419.001.000) 10/28 11:02:14 Job executing on host: <128.211.157.10:44836>
...
005 (746419.001.000) 10/28 11:02:14 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
950 - Run Bytes Sent By Job
9800 - Run Bytes Received By Job
950 - Total Bytes Sent By Job
9800 - Total Bytes Received By Job
...
001 (746419.000.000) 10/28 11:02:15 Job executing on host: <128.211.157.10:44836>
...
005 (746419.000.000) 10/28 11:02:15 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
950 - Run Bytes Sent By Job
9800 - Run Bytes Received By Job
950 - Total Bytes Sent By Job
9800 - Total Bytes Received By Job
...
001 (746419.002.000) 10/28 11:02:17 Job executing on host: <128.211.157.10:44836>
...
005 (746419.002.000) 10/28 11:02:17 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
950 - Run Bytes Sent By Job
9800 - Run Bytes Received By Job
950 - Total Bytes Sent By Job
9800 - Total Bytes Received By Job
HTCondor's parameter sweep offers a huge potential. Simply adding a large number to the queue command in a job submission file yields an enormous amount of computing. The catch is that the effort needed to prepare the unique input files, either standard input files or formatted input files, can be great. This effort can be minimal when the input data comes from some data collector operating in the field. This effort can be enormous when you must enter each unique dataset from the keyboard.
A parameter sweep is a computation applied to several unique sets of data. Computing the average test score for each class in a school is an example of a simple parameter sweep.
A parameter sweep requires that each queued run of a computer program reads a unique set of input parameters. This example illustrates how to implement a unique command line argument, a unique standard input, and a unique formatted input data file. The program writes to a unique standard output and to a unique formatted output file.
HTCondor provides a pre-defined macro $(Process) which supports the specification of unique command-line arguments so that each queued run of a job sees a unique set of data.
Also, HTCondor provides an "initial directory" which supports the specification of unique input/output files so that each queued run of a job sees a unique set of data. Command initialdir specifies a generic directory name which becomes unique after appending the process number of a queued run of a parameter sweep. Each initial directory is actually a subdirectory of the user's current working directory. Each initial directory holds the unique standard input and formatted input files of a queued run of a parameter sweep; each initial directory receives the unique standard output, error and log files plus any unique formatted output files generated by a queued run of a parameter sweep. Since data files of each run of a sweep reside in a separate directory, identical file names may be used; they need not be modified with a process number. Both macro and command appear in the job submission file, myprogram.sub:
# FILENAME: myprogram.sub

universe = VANILLA

executable = myprogram

# Processes 0,1,2
# command line argument
arguments = $(Process)
initialdir = mydatadirectory.$(Process)

# Standard I/O files, HTCondor log file
input = mydata.in
output = mydata.out
error = mydata.err
log = mydata.log

# queue 3 jobs in 1 cluster
queue 3
This job submission file specifies a parameter sweep of three queued runs. Each process has a unique ID: 0, 1, and 2. Process 0 of the parameter sweep expects a standard input file named "mydata.in" to reside in the initial directory named "mydatadirectory.0"; for process 1, "mydata.in" resides in "mydatadirectory.1"; and for process 2, "mydata.in" resides in "mydatadirectory.2". The sweep will generate similarly named files for standard output, error, and log in the initial directories. In addition, the sweep expects to find in the initial directories formatted input data files with identical names: myinputdata. Each copy of the computer program used in this sweep finds its unique process number in its command-line argument and finds its unique formatted input data file in its own initial directory. The program does not append its unique process number to the generic names of formatted files to make unique formatted data file names. All files reside in unique subdirectories of the user's current working directory; hence, data file names must be identical. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
This job submission file uses HTCondor's default universe, Vanilla. Because this Vanilla job does not turn on HTCondor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job.
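Because each queued run reads its input from its own initial directory, those directories and their input files need to be in place before submission. The Python sketch below prepares that layout. The function name prepare_initial_dirs is hypothetical, the file contents are placeholders in the style of the sample files, and the prefix argument is assumed to match the name given on the initialdir line of the submission file.

```python
import tempfile
from pathlib import Path


def prepare_initial_dirs(base, count, prefix="mydatadirectory"):
    """Create prefix.N under base, each holding mydata.in and myinputdata."""
    base = Path(base)
    run_dirs = []
    for process in range(count):
        run_dir = base / f"{prefix}.{process}"
        run_dir.mkdir(parents=True, exist_ok=True)
        # The file names are identical in every directory; only the
        # directory name carries the process number.
        (run_dir / "mydata.in").write_text(
            f"textfromstandardinput:process{process}\n")
        (run_dir / "myinputdata").write_text(
            f"textfromformattedinput:process{process}\n")
        run_dirs.append(run_dir)
    return run_dirs


# Demonstration in a scratch directory; in practice base would be the
# directory which holds myprogram.sub.
with tempfile.TemporaryDirectory() as scratch:
    run_dirs = prepare_initial_dirs(scratch, 3)
    layout = sorted(
        path.relative_to(scratch).as_posix()
        for path in Path(scratch).rglob("*")
        if path.is_file()
    )

for entry in layout:
    print(entry)
```

Compared with the suffix-based sweep, this layout keeps every run's input, output, and log files together, which can make a large sweep easier to inspect and clean up.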
To submit the executable to HTCondor:
$ condor_submit myprogram.sub
For this submission, command condor_q will show a single cluster number, three unique process numbers, and three unique command-line arguments:
$ condor_q myusername -- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 746420.0 myusername 10/28 12:28 0+00:00:00 I 0 0.0 myprogram 0 746420.1 myusername 10/28 12:28 0+00:00:00 I 0 0.0 myprogram 1 746420.2 myusername 10/28 12:28 0+00:00:00 I 0 0.0 myprogram 2
View the standard input file for process 0, mydata.in, in the initial directory mydatadirectory.0:
textfromstandardinput:process0
View the formatted input file for process 0, myinputdata, in the initial directory mydatadirectory.0:
textfromformattedinput:process0
View the standard output file for process 0, mydata.out, in the initial directory mydatadirectory.0:
*** MAIN START ***
program name: condor_exec.exe
command line argument: 0
standard input/output: textfromstandardinput:process0
formatted input/output: textfromformattedinput:process0
*** MAIN STOP ***
View the formatted output file for process 0, myoutputdata, in the initial directory mydatadirectory.0:
textfromformattedinput:process0
The log file, mydata.log, records the major events of one queued run of this parameter sweep. View the log file for process 0, mydata.log, in the initial directory mydatadirectory.0:
000 (746420.000.000) 10/28 12:28:35 Job submitted from host: <128.211.157.86:60481>
...
001 (746420.000.000) 10/28 12:33:48 Job executing on host: <128.211.157.10:34460>
...
005 (746420.000.000) 10/28 12:33:49 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
909 - Run Bytes Sent By Job
9800 - Run Bytes Received By Job
909 - Total Bytes Sent By Job
9800 - Total Bytes Received By Job
Processes 1 and 2 have similar input, output and log files and formatted input/output files residing in their respective initial directories.
HTCondor's parameter sweep offers huge potential. Simply giving the queue command in a job submission file a large number yields an enormous amount of computing. The catch is that the effort needed to prepare the unique input files, either standard input files or formatted input files, can be great. This effort is minimal when the input data comes from some data collector operating in the field. It can be overwhelming when you must enter each unique dataset from the keyboard.
A parameter sweep is a computation applied to several unique sets of data. Computing the average test score for each class in a school is an example of a simple parameter sweep.
A parameter sweep requires that each queued run of a computer program read a unique set of input parameters. This example illustrates how to implement a parameter sweep on a single large file. Each queued run of the job reads a different portion of the same file.
HTCondor provides a pre-defined macro $(Process) which supports the specification of unique command-line arguments and input/output files so that each queued run of a job sees a unique set of data. This macro appears in the job submission file, myprogram.sub:
# FILENAME: myprogram.sub
universe = VANILLA
executable = myprogram
# Processes 0,1,2
arguments = $(Process)
# There is a single formatted input data file, myinputdata.
# Standard I/O files, HTCondor log file
output = mydata.out.$(Process)
error = mydata.err.$(Process)
log = mydata.log
# queue 3 jobs in 1 cluster
queue 3
This job submission file specifies a parameter sweep of three queued runs. Each process has a unique ID: 0, 1, and 2. Each queued run of this job will read a different portion of the data file. Process 0 of the parameter sweep writes a standard output file named "mydata.out.0"; process 1, "mydata.out.1"; and process 2, "mydata.out.2". The sweep will generate similarly named files for standard error. HTCondor advises using a single log file in a submission to record the major events of the sweep. In addition, the sweep expects to find a single formatted input data file, myinputdata. Each copy of the computer program used in this sweep finds its unique process number in its command-line argument and uses that number to determine where in the single input data file it is to start reading records. All files reside in the user's current working directory; hence, data file names must be unique. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
This job submission file uses HTCondor's default universe, Vanilla. Because this Vanilla job does not turn on HTCondor's file transfer mechanism, this job will (by default) use the shared file system of the submission host from which you submitted the job.
To submit the executable to HTCondor:
$ condor_submit myprogram.sub
For this submission, command condor_q will show a single cluster number, three unique process numbers, and three unique command-line arguments:
$ condor_q myusername
-- Submitter: condor.rcac.purdue.edu : <128.211.157.86:60481> : condor.rcac.purdue.edu
ID        OWNER       SUBMITTED    RUN_TIME   ST PRI SIZE CMD
746421.0  myusername  10/29 10:57  0+00:00:00 I  0   0.0  myprogram 0
746421.1  myusername  10/29 10:57  0+00:00:00 I  0   0.0  myprogram 1
746421.2  myusername  10/29 10:57  0+00:00:00 I  0   0.0  myprogram 2
View the single formatted input file, myinputdata:
AAAAAAAAAA
BBBBBBBBBB
CCCCCCCCCC
:
ZZZZZZZZZZ
0000000000
1111111111
2222222222
3333333333
View the standard output file for process 0, mydata.out.0:
*** MAIN START ***
program name: condor_exec.exe
command line argument: 0
current file position: 0
rtn_val = 0
starting file position: 0
line 1: AAAAAAAAAA
line 2: BBBBBBBBBB
line 3: CCCCCCCCCC
line 4: DDDDDDDDDD
line 5: EEEEEEEEEE
line 6: FFFFFFFFFF
line 7: GGGGGGGGGG
line 8: HHHHHHHHHH
line 9: IIIIIIIIII
line 10: JJJJJJJJJJ
*** MAIN STOP ***
View the standard output file for process 1, mydata.out.1:
*** MAIN START ***
program name: condor_exec.exe
command line argument: 1
current file position: 0
rtn_val = 0
starting file position: 110
line 11: KKKKKKKKKK
line 12: LLLLLLLLLL
line 13: MMMMMMMMMM
line 14: NNNNNNNNNN
line 15: OOOOOOOOOO
line 16: PPPPPPPPPP
line 17: QQQQQQQQQQ
line 18: RRRRRRRRRR
line 19: SSSSSSSSSS
line 20: TTTTTTTTTT
rtn_val = 0
starting file position: 0
line 0: AAAAAAAAAA
rtn_val = 0
starting file position: 220
line 21: UUUUUUUUUU
*** MAIN STOP ***
Process 1 also performs additional random file accesses.
View the standard output file for process 2, mydata.out.2:
*** MAIN START ***
program name: condor_exec.exe
command line argument: 2
current file position: 0
rtn_val = 0
starting file position: 220
line 21: UUUUUUUUUU
line 22: VVVVVVVVVV
line 23: WWWWWWWWWW
line 24: XXXXXXXXXX
line 25: YYYYYYYYYY
line 26: ZZZZZZZZZZ
line 27: 0000000000
line 28: 1111111111
line 29: 2222222222
line 30: 3333333333
*** MAIN STOP ***
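The starting file positions reported above (0, 110, 220) are consistent with fixed-length records of 11 bytes (10 characters plus a newline) and 10 records per process. The sketch below reproduces that arithmetic in shell. It is not the source of myprogram.c; the record length, the records-per-process count, and the helper names start_offset and read_portion are assumptions inferred from the sample output.

```shell
#!/bin/sh
# Sketch: print the records that one process of the sweep would read.
# Assumptions: 11-byte records (10 characters plus newline) and
# 10 records per process, inferred from the sample output.
RECORD_BYTES=11
RECORDS_PER_PROCESS=10

start_offset() {
    # Byte offset at which process $1 starts reading.
    echo $(( $1 * RECORDS_PER_PROCESS * RECORD_BYTES ))
}

read_portion() {
    # $1 = process number, $2 = data file
    offset=$(start_offset "$1")
    # tail -c +N begins at byte N counted from 1, so add 1 to the offset.
    tail -c "+$((offset + 1))" "$2" | head -n "$RECORDS_PER_PROCESS"
}

# Build a stand-in for myinputdata: records A-Z followed by 0-3.
datafile=$(mktemp)
for c in A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3; do
    echo "$c$c$c$c$c$c$c$c$c$c"
done > "$datafile"

start_offset 1                          # prints 110
read_portion 1 "$datafile" | head -n 1  # prints KKKKKKKKKK
```

Because every process computes its own offset from $(Process), no process needs to know how many others are running.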
HTCondor's parameter sweep, when applied to a single, large data file, offers huge potential. Simply giving the queue command in a job submission file a large number applies many compute servers to the data processing.
To review, HTCondor is unable to transfer a subdirectory of data files to a compute server. While the submit command transfer_input_files allows paths when specifying which input files to transfer, HTCondor places all transferred files in a single, flat directory where the executable and standard input file reside - the temporary working directory on the compute server. Therefore, the executing program must access input files without paths.
A similar situation exists for output files. If the program creates output files during execution, it must create them within the temporary working directory. HTCondor transfers back all new and modified files within the temporary working directory - the output files. To transfer back only a subset of these files, use the submit command transfer_output_files. HTCondor does not support the transfer of output files that exist but that do not reside within the temporary working directory on the compute server.
This restriction need not deter the user with a subdirectory of input and output files. The user simply makes an archive file of the subdirectory structure with the tar utility and tells HTCondor to transfer the tar file. The application may then un-tar the archive before reading the input files. The application may also write to output files which reside within the subdirectory. The final step of the application archives those files which the job made or modified. HTCondor will see the archive file as an output file and transfer the archive from the compute server to the user's working directory on the submission host. Finally, the user extracts the output files from the archive.
The computer program, myprogram.c, reads a formatted data file and writes a formatted data file. This example assumes that there exists a formatted input file, i_00110, in a subdirectory named mysubdirectory. The result is a formatted output file, o_00110, in the same subdirectory. The program uses the tar utility to extract the subdirectory structure on the compute server. After the program writes the output file, it uses the tar utility again to archive only the output files of the subdirectory. To compile this program for the Vanilla Universe, see Compiling Serial Programs.
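The archive round trip can be rehearsed locally before submitting anything. The sketch below imitates the three stages (submission host, compute server, submission host) in scratch directories. It uses the file names myinputdata and myoutputdata that appear in the sample listing for this example. The scratch directory names and the plain copy standing in for the real processing are assumptions.

```shell
#!/bin/sh
# Sketch of the tar round trip; the "processing" step is a stand-in.
workdir=$(mktemp -d)

# Submission host: archive the input subdirectory.
mkdir -p "$workdir/submit/mysubdirectory"
cd "$workdir/submit" || exit 1
echo "textinsubdirectory" > mysubdirectory/myinputdata
tar cf myarchive.i.tar mysubdirectory

# Compute server: a flat working directory that receives the archive.
mkdir "$workdir/execute"
cp myarchive.i.tar "$workdir/execute/"
cd "$workdir/execute" || exit 1
tar xf myarchive.i.tar                                      # restore the subdirectory
cp mysubdirectory/myinputdata mysubdirectory/myoutputdata   # stand-in for real work
tar cf myarchive.o.tar mysubdirectory/myoutputdata          # archive outputs only

# Submission host again: extract the returned output file.
cp myarchive.o.tar "$workdir/submit/"
cd "$workdir/submit" || exit 1
tar xf myarchive.o.tar mysubdirectory/myoutputdata
cat mysubdirectory/myoutputdata
```

In a real job, HTCondor's file transfer mechanism performs the two copy steps between the submission host and the compute server.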
This example assumes that the current working directory has a subdirectory containing a formatted input file. The tar utility prepares the archive of input files:
$ tar cf myarchive.i.tar mysubdirectory
Prepare a job submission file, myprogram.sub. Specify the Vanilla Universe and the file transfer mechanism as "on":
# FILENAME: myprogram.sub
universe = VANILLA
executable = myprogram
# Specify the archive as the input data file.
transfer_input_files = myarchive.i.tar
# Turn on file transfer mechanism.
should_transfer_files = YES
# Let HTCondor handle output file(s): myarchive.o.tar.
when_to_transfer_output = ON_EXIT
# Standard output files, HTCondor log file
output = mydata.out
error = mydata.err
log = mydata.log
# queue one job
queue
To submit the executable to HTCondor:
$ condor_submit myprogram.sub
The standard output file, mydata.out, shows the evolution of the current working directory on the compute server. Initially, it shows that HTCondor transferred the tar file which contains the archived subdirectory of input data file(s). After extraction, the subdirectory with its formatted input file(s), mysubdirectory and myinputdata, are visible. After processing, the formatted output file(s), myoutputdata, is visible:
total 24
-rwxr-xr-x 1 myusername itap  8708 Nov 12 15:27 condor_exec.exe
-rw-r--r-- 1 myusername itap 10240 Nov 12 15:27 myarchive.i.tar
-rw-r--r-- 1 myusername itap     0 Nov 12 15:30 mydata.err
-rw-r--r-- 1 myusername itap     0 Nov 12 15:30 mydata.out
total 32
-rwxr-xr-x 1 myusername itap  8708 Nov 12 15:27 condor_exec.exe
-rw-r--r-- 1 myusername itap 10240 Nov 12 15:27 myarchive.i.tar
-rw-r--r-- 1 myusername itap     0 Nov 12 15:30 mydata.err
-rw-r--r-- 1 myusername itap   227 Nov 12 15:30 mydata.out
drwxr-x--- 3 myusername itap  4096 Feb 14 2008 mysubdirectory
total 8
drwx------ 2 myusername itap  4096 Feb 14 2008 ..
-rw-r--r-- 1 myusername itap    19 Jul 12 2007 myinputdata
total 12
drwx------ 2 myusername itap  4096 Feb 14 2008 ..
-rw-r--r-- 1 myusername itap    19 Jul 12 2007 myinputdata
-rw-r--r-- 1 myusername itap    28 Nov 12 15:30 myoutputdata
*** MAIN START ***
formatted input/output: textinsubdirectory
*** MAIN STOP ***
At job completion, HTCondor sees file myarchive.o.tar as an output file which it will transfer to the submission host. After the transfer, the user then extracts the output file(s) from this archive:
$ tar xf myarchive.o.tar mysubdirectory/myoutputdata
View the log file, mydata.log:
000 (342352.000.000) 11/12 15:29:31 Job submitted from host: <128.211.157.86:47933>
...
001 (342352.000.000) 11/12 15:30:55 Job executing on host: <128.211.157.10:59987?PrivNet=condor.ccb.purdue.edu>
...
005 (342352.000.000) 11/12 15:30:56 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
11094 - Run Bytes Sent By Job
18948 - Run Bytes Received By Job
11094 - Total Bytes Sent By Job
18948 - Total Bytes Received By Job
The log file records the main events related to the processing of this job. The log shows the number of bytes transferred between the submission host and the compute server via HTCondor's file transfer mechanism.
Some applications require compute nodes with a certain minimum amount of memory. These applications may also perform better when even more memory is available on the compute node.
This section illustrates how to submit a small job to a BoilerGrid compute node with at least 16 GB of memory (requirements) and to prefer compute nodes with even more memory (rank), if available. This example, myprogram.c, displays the name of the host which runs the job, the path name of the current working directory, and the contents of that directory.
Prepare a job submission file with an appropriate filename, here named myjob.sub:
# FILENAME: myjob.sub
universe = VANILLA
# Require a compute node with at least 16 GB of memory.
# A 16 GB node reports TotalMemory as 16046 (MB).
requirements = TotalMemory >= 16046
# Prefer a compute node with more than 16 GB, if available.
rank = TotalMemory
# Transfer the "executable" myprogram to the compute node.
transfer_executable = TRUE
executable = myprogram
# Turn on HTCondor's file transfer mechanism only when needed.
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
# Standard I/O files, HTCondor log file
output = myjob.out
error = myjob.err
log = myjob.log
# queue one job
queue
The ClassAd TotalMemory specifies the amount of memory on a compute node. The amount of memory is in units of megabytes. To change this example to request at least 32 GB of total memory, replace "16046" with "32192". For at least 48 GB, use "48297".
This example assumes that all compute nodes have a definition for the attribute TotalMemory. To see how many compute nodes in BoilerGrid do not have the attribute TotalMemory defined:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint 'TotalMemory =?= undefined'
There is no output since all compute nodes of BoilerGrid do have this attribute defined.
Before submitting your job, you may wish to verify that there are a sufficient number of compute nodes which will satisfy your requirements and that those same compute nodes define the preferred ClassAds expressed in the rank command. To see how many compute nodes satisfy your requirements:
$ condor_status -pool boilergrid.rcac.purdue.edu -total -constraint 'TotalMemory >= 16046'
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 26093 18007 3330 4753 3 0 0
Total 26093 18007 3330 4753 3 0 0
There are 26,093 compute nodes with at least 16 GB of memory.
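When you run this check from a script, you can extract a single count from the summary row instead of reading the table by eye. The sketch below pulls the Unclaimed count out of text in the format shown above. The helper name unclaimed_total is hypothetical and the column order is assumed from the listing. The sample text is copied from that listing so that the sketch runs without access to BoilerGrid.

```shell
#!/bin/sh
# Sketch: read `condor_status -total` style output on standard input
# and print the Unclaimed count from the final "Total" row.
# Assumed column order, taken from the listing above:
# Total Owner Claimed Unclaimed Matched Preempting Backfill
unclaimed_total() {
    # The header row also begins with "Total", so require a numeric
    # second field to select the summary row.
    awk '$1 == "Total" && $2 ~ /^[0-9]+$/ { print $5 }'
}

# Sample text copied from the listing above.
sample='Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 26093 18007 3330 4753 3 0 0
Total 26093 18007 3330 4753 3 0 0'

printf '%s\n' "$sample" | unclaimed_total    # prints 4753
```

On a submission host you would pipe the real condor_status command shown above into unclaimed_total instead of the sample text.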
View results in the file for all standard output, here named myjob.out:
cms-100.rcac.purdue.edu (none)
/home/myusername/condor/Introduction/memory
total 224
-rw-r--r-- 1 myusername itap 1508 Mar 11 14:38 README
-rw-r--r-- 1 myusername itap    0 Mar 11 15:36 myjob.err
-rw-r--r-- 1 myusername itap  791 Mar 11 15:36 myjob.log
-rw-r--r-- 1 myusername itap   77 Mar 11 15:36 myjob.out
-rw-r----- 1 myusername itap  663 Mar 11 15:20 myjob.sub
-rwxr-xr-x 1 myusername itap 6939 Mar 11 14:38 myprogram
-rw-r----- 1 myusername itap  488 Mar 11 14:40 myprogram.c
-rwxr----- 1 myusername itap   58 Mar 11 14:38 run
*** MAIN START ***
*** MAIN STOP ***
This job happened to run on compute node cms-100. This compute node has 8 processor cores. To verify that cms-100 has at least 16 GB of memory:
$ condor_status -pool boilergrid.rcac.purdue.edu -constraint 'Machine=="cms-100.rcac.purdue.edu"' -format "%s\n" TotalMemory
16046
16046
16046
16046
16046
16046
16046
16046
For more information about requirements and rank:
You compile a computer program to run on a specific combination of chip architecture and operating system. This combination is a platform. BoilerGrid contains compute nodes of many different platforms, so you must often specify the platform your program requires to ensure that your job runs on the correct platform. The predominant platform on BoilerGrid is 64-bit Linux ("X86_64/LINUX"). To see a list of all platforms available on BoilerGrid:
$ condor_status -pool boilergrid.rcac.purdue.edu -total
Total Owner Claimed Unclaimed Matched Preempting Backfill
INTEL/LINUX 114 18 0 60 0 0 36
INTEL/OSX 2 0 0 2 0 0 0
INTEL/WINNT51 334 8 0 326 0 0 0
INTEL/WINNT61 6299 982 0 5317 0 0 0
SUN4u/SOLARIS210 3 0 0 3 0 0 0
X86_64/LINUX 30170 19460 4559 6150 0 0 1
Total 36922 20468 4559 11858 0 0 37
The name "INTEL" as used on BoilerGrid means 32-bit Intel-compatible hardware, and it makes no distinction between Intel and AMD CPUs. The name "X86_64" is a vendor-neutral term to refer to 64-bit architecture from either Intel or AMD. The name "WINNT51" means Windows XP, and "WINNT61" means Windows 7.
By default, HTCondor will send a job to a compute node whose architecture and operating system match the platform of the host from which you submitted your job. You may also submit jobs to compute nodes whose platform differs from that of the submission host. For example, you may compile a program to run on a Windows machine and submit the executable file to BoilerGrid from one of BoilerGrid's Linux submission hosts by specifying that the job requires a Windows compute node:
executable = myprogram.exe
requirements = (ARCH == "INTEL") && ((OPSYS == "WINNT51") || (OPSYS == "WINNT61"))
It is possible to allow HTCondor to use a larger pool of compute nodes for a job if executables are available for multiple platforms. You need only take care to not reference any absolute paths within your job submission that are specific to one platform or installation. You can often use some existing ClassAd variables instead of fixed paths to make non-platform-specific submission files.
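One way to do this is HTCondor's $$(attribute) substitution, which fills in a value from the ClassAd of the compute node that the job matches. The sketch below writes a submission file that selects an executable by operating system and architecture. The per-platform executable names (for example myprogram.LINUX.X86_64) and the choice of platforms are assumptions for illustration. You would need to build and name one executable per platform yourself.

```shell
#!/bin/sh
# Sketch: write a submission file whose executable name depends on the
# platform of the matched compute node. The executable naming scheme
# myprogram.<OpSys>.<Arch> is hypothetical.
subfile=$(mktemp)
# The quoted 'EOF' keeps the shell from expanding $$ and $(...).
cat > "$subfile" <<'EOF'
# FILENAME: myprogram.sub
universe = VANILLA
# $$(OpSys) and $$(Arch) come from the matched machine's ClassAd.
executable = myprogram.$$(OpSys).$$(Arch)
requirements = (OPSYS == "LINUX") && ((ARCH == "X86_64") || (ARCH == "INTEL"))
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
output = myprogram.out
error = myprogram.err
log = myprogram.log
queue
EOF
cat "$subfile"
```

With this file, a job matched to a 64-bit Linux node would look for myprogram.LINUX.X86_64, and one matched to a 32-bit Linux node would look for myprogram.LINUX.INTEL.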
ITaP research resources include several clusters. Currently, the clusters include the following:
Radon
Peregrine 1
Steele
Coates
Rossmann
Hansen
Carter
This section illustrates how to apply HTCondor ClassAds to submit a small job to a node in some subset of ITaP resources. These examples execute a simple shell script which displays the name of the compute node which ran the job.
Prepare a shell script with an appropriate filename, here named myjob.sh:
#!/bin/sh
# FILENAME: myjob.sh
hostname
Change the permissions of the shell script to allow execution by the owner (you):
$ chmod u+x myjob.sh
Prepare a job submission file with an appropriate filename, here named myjob.sub. The ClassAd requires that the chosen compute node should reside on either of two clusters. This example also specifies the shell script myjob.sh as the executable, transfers the shell script to the compute node, and uses HTCondor's file transfer mechanism to copy input and output files only if needed (if the same file system is not available on the compute node):
# FILENAME: myjob.sub
universe = VANILLA
# Require a compute node of either the Steele or Coates cluster.
# Attribute name is not case sensitive; attribute value is.
requirements = (CLUSTERNAME=="Steele") || (clustername=="Coates")
# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable = myjob.sh
# Turn on HTCondor's file transfer mechanism only when needed.
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
# Standard I/O files, HTCondor log file
output = myjob.out
error = myjob.err
log = myjob.log
# queue one job
queue
Submit the job:
$ condor_submit myjob.sub
View job status:
$ condor_q myusername
View results in the file for all standard output, here named myjob.out:
coates-d020.rcac.purdue.edu
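If the list of acceptable clusters changes often, you can generate the requirements line rather than editing it by hand. The following sketch builds the same kind of expression from a list of cluster names. The helper name cluster_requirements is hypothetical, and the sketch assumes the ClusterName attribute used in the submission file above.

```shell
#!/bin/sh
# Sketch: build a requirements line that accepts any of the clusters
# named as arguments. The helper name is hypothetical.
cluster_requirements() {
    expr_text=""
    for name in "$@"; do
        term="(ClusterName==\"$name\")"
        if [ -z "$expr_text" ]; then
            expr_text=$term
        else
            expr_text="$expr_text || $term"
        fi
    done
    printf 'requirements = %s\n' "$expr_text"
}

cluster_requirements Steele Coates
# prints: requirements = (ClusterName=="Steele") || (ClusterName=="Coates")
```

You can paste the printed line into myjob.sub in place of the hand-written requirements line.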
Prepare a job submission file with an appropriate filename, here named myjob.sub. The ClassAd requires a specific compute node. This example also specifies the shell script myjob.sh as the executable, transfers the shell script to the compute node, and uses HTCondor's file transfer mechanism to copy input and output files only if needed (if the same file system is not available on the compute node):
# FILENAME: myjob.sub
universe = VANILLA
# Require a specific compute node.
requirements = Machine=="miner-a500.rcac.purdue.edu"
# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable = myjob.sh
# Turn on HTCondor's file transfer mechanism only when needed.
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
# Standard I/O files, HTCondor log file
output = myjob.out
error = myjob.err
log = myjob.log
# queue one job
queue
Submit the job:
$ condor_submit myjob.sub
View job status:
$ condor_q myusername
View results in the file for all standard output, here named myjob.out:
miner-a500.rcac.purdue.edu
When you discover that a compute node is consistently available and consistently fails to run your job, you may exclude that node from the set of candidate nodes.
Prepare a job submission file with an appropriate filename, here named myjob.sub. The ClassAd excludes one specific compute node of a chosen cluster. This example also specifies the shell script myjob.sh as the executable, transfers the shell script to the compute node, and uses HTCondor's file transfer mechanism to copy input and output files only if needed (if the same file system is not available on the compute node):
# FILENAME: myjob.sub
universe = VANILLA
# Exclude a specific compute node.
requirements = ClusterName=="Miner" && Machine!="miner-a500.rcac.purdue.edu"
# Transfer the "executable" myjob.sh to the compute node.
transfer_executable = TRUE
executable = myjob.sh
# Turn on HTCondor's file transfer mechanism only when needed.
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
# Standard I/O files, HTCondor log file
output = myjob.out
error = myjob.err
log = myjob.log
# queue one job
queue
Submit the job:
$ condor_submit myjob.sub
View job status:
$ condor_q myusername
View results in the file for all standard output, here named myjob.out:
miner-a502.rcac.purdue.edu
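When more than one node misbehaves, the same pattern extends by adding one Machine!= term per node. The sketch below generates such a line. The helper name exclude_requirements and the second host name in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch: build a requirements line that selects one cluster and
# excludes every host named after it. The helper name is hypothetical.
exclude_requirements() {
    cluster=$1
    shift
    expr_text="ClusterName==\"$cluster\""
    for host in "$@"; do
        expr_text="$expr_text && Machine!=\"$host\""
    done
    printf 'requirements = %s\n' "$expr_text"
}

exclude_requirements Miner miner-a500.rcac.purdue.edu
# prints: requirements = ClusterName=="Miner" && Machine!="miner-a500.rcac.purdue.edu"
```

To exclude a second node, append another host name (for example a hypothetical miner-a501.rcac.purdue.edu) to the argument list.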
There are currently no FAQs for BoilerGrid.