Condor Boot Camp at Purdue
Lecture Materials
- Using Condor (Powerpoint) (PDF)
- Administrating Condor (Powerpoint) (PDF) (Handout)
- Condor Tutorial
Other Materials
Implementing an Industrial-Strength Academic Cyberinfrastructure at Purdue University
Implementing a Central Quill Database in a Large Condor Installation (Condor Week 2008)
BoilerGrid for cyro-EM image processing (Condor Week 2008)
- Introduction
- Short and quick instructions
- Allocation of resources
- Getting ready to run
- Submitting Jobs
- Requirements and Rank
- List of Attributes
- Job/machine matching
- Submitting Jobs Using a Shared File System
- Submitting Jobs Without a Shared File System
- Execution on Differing Architectures
- Grid Computing
- Examples for submitting a job
- Condor DAG
- Running commercial packages in Condor - on Linux
- Running commercial packages in Condor - on Windows
- Multiple jobs in one Condor file
- Managing a Job
- Log files and job completion
- Condor Universes
- Limitations on Jobs which can be Checkpointed
- Notes, problems and errors
- Condor test
- Why does the job not run?
- Why will my vanilla jobs only run on the machine where I submitted them from?
- My job starts but exits right away with signal 9
- Why aren't any or all of my jobs running?
- Why might my job be preempted (evicted)?
- Why does the time output from condor_status appear as [?????] ?
- Where are my missing files?
- Useful tips and comments
- References
- Examples
1. Introduction
Condor is one of several distributed computing resources RCAC provides. Like other similar resources, Condor provides a framework for running programs on otherwise idle computers. While this has serious limitations for parallel jobs and for codes with heavy I/O or memory requirements, Condor can provide a large quantity of cycles for researchers who need to run hundreds of smaller jobs.
Condor is a specialized batch system for managing compute-intensive jobs. Condor provides a queuing mechanism, scheduling policy, priority scheme, and resource classifications. Users submit their compute jobs to Condor, which then puts these jobs in a queue, runs them, and reports back with the results.
In some ways, Condor is different from other batch systems, which usually operate only on dedicated machines or compute servers. Condor can both schedule jobs on dedicated machines and effectively utilize non-dedicated machines to run jobs. It only runs compute jobs on machines which are currently not being used (no keyboard activity, no load average, no active telnet users, etc.). This way, Condor effectively harnesses otherwise idle machines throughout a pool of machines.
See here for a more detailed description of the resources in the Condor Pools.
The status of the Condor pools can always be monitored with CondorView.
Look here for Condor tutorials and slides from the 'Condor Boot Camp' that was held at Purdue University.
Most of the information in this manual is taken from either the Condor Version 7.0.0 Manual or the man pages for the various commands.
2. Short and quick instructions
These instructions are very short and merely meant to give you the ability to run a small example immediately. Read the rest of the sections, and possibly the Condor manual, for more details on how to use the many possibilities in Condor.
Compiling
condor_compile <compiler> <program>.<extension> -o <program name>
Example:
condor_compile gcc hello.c -o hello
Running
Write a submit description file and submit it:
condor_submit file
Example:
condor_submit run_hello (my submit description file is called run_hello).
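As a minimal sketch of what a submit description file such as run_hello might contain, the following Python snippet builds one for the hello program compiled above. The output, error and log file names are assumptions chosen for illustration, and the standard universe is assumed because the program was relinked with condor_compile.

```python
# Build a minimal submit description file for the relinked "hello" program.
# The output/error/log file names are illustrative assumptions.
def make_submit_description(executable, universe="standard"):
    lines = [
        f"universe   = {universe}",
        f"executable = {executable}",
        f"output     = {executable}.out",
        f"error      = {executable}.err",
        f"log        = {executable}.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"

submit_text = make_submit_description("hello")
print(submit_text)
# To use it, save the text as run_hello and run: condor_submit run_hello
```

The final queue line is what tells Condor to place one instance of the job in the queue.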
3. Allocation of resources
Condor allocates resources by matching the submitted jobs with the machines. It does this by matching ClassAds. Condor's ClassAds are analogous to the classified advertising section of the newspaper. Sellers and buyers advertise specifics about what they have to sell or want to buy. Both buyers and sellers have some constraints which must be satisfied, like buyers only being able to pay a certain sum of money or sellers asking for no less than a certain price. Sellers and buyers both want to rank requests to their own advantage; for example, the seller would give a higher rank to a higher price offer. In Condor, users submitting jobs can be thought of as buyers of compute resources and machine owners are sellers.
All the machines in a Condor pool advertise their attributes. These could be available RAM memory, CPU type, CPU speed, virtual memory size, current load average, or other static and dynamic properties. This machine ClassAd also advertises under what conditions it is willing to run a Condor job and what type of job it would prefer.
The owners who allow their machines to be part of the Condor pool may set individual terms and preferences, for example specifying that their machines may only be used to run jobs at night or that they have a preference (higher rank) for running jobs submitted by their own department.
A very useful program for finding out which machines and architectures are out there is condor_all. It should be noted that even though it is located in the "official" Condor directory, /opt/condor/bin, it is a locally (Purdue) developed tool. It is very handy for finding out how many machines of a certain architecture are available, which is useful when writing the submit description file.
Just as the machines have requirements and preferences, the same is true for the users submitting a job. The users specify a ClassAd with their requirements and preferences when they submit a job. This ClassAd includes the type of machine you wish to use - you would perhaps like to use the machine with the fastest floating point performance available and you thus want Condor to rank the available machines based upon their floating point performance.
Another example could be that your job requires a machine with a minimum of, say, 4 GB of RAM and you thus only want Condor to consider machines which fulfill this requirement.
Sometimes, the user may be ready to use any machine available and this too can be communicated to Condor through the job ClassAd.
Condor's job then is to read all the machine ClassAds and all the user job ClassAds and match them up. Condor makes certain that all requirements in both ClassAds are satisfied, if possible.
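The two-sided nature of the match can be sketched with a toy model in Python. This is not Condor's real ClassAd evaluator; the ads are plain dictionaries, the requirements are ordinary Python functions, and the machine names, owner names and attribute values are all hypothetical.

```python
# Toy model of two-sided ClassAd matching (not Condor's real evaluator).
# Each ad has attributes plus a requirements function that inspects the other ad.
def matches(job_ad, machine_ad):
    """A match needs BOTH sets of requirements to hold."""
    return (job_ad["requirements"](machine_ad)
            and machine_ad["requirements"](job_ad))

job = {
    "Owner": "alice",  # hypothetical user
    # The job needs at least 4 GB of RAM (Memory is advertised in megabytes).
    "requirements": lambda m: m["Memory"] >= 4096 and m["Arch"] == "X86_64",
}
machines = [
    {"Name": "big", "Memory": 8192, "Arch": "X86_64",
     "requirements": lambda j: True},
    {"Name": "small", "Memory": 1024, "Arch": "X86_64",
     "requirements": lambda j: True},
    {"Name": "picky", "Memory": 8192, "Arch": "X86_64",
     "requirements": lambda j: j["Owner"] == "bob"},
]
matched = [m["Name"] for m in machines if matches(job, m)]
print(matched)  # ['big']
```

The machine "small" fails the job's requirements, while "picky" satisfies the job but rejects it through its own requirements, so only "big" is a valid match.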
To get a feel for what a machine ClassAd does, try typing the command condor_status. This will give you a summary of the information in the resource ClassAds in your Condor pool. To see an example of running this command, click here.
Some options can be given to the condor_status command, for example:
- -available shows only machines which are willing to run jobs now.
- -run shows only machines which are currently running jobs.
- -l lists the machine ClassAds for all machines in the pool.
A more complete list of options can be seen by running man condor_status or by looking in the Condor Manual at the University of Wisconsin. You can go directly to that manual page
Beware that running condor_status -l will produce a great deal of output.
As can be seen from the example, there are a large number of attributes. Some of them are used by Condor for scheduling. Other attributes are for information purposes. An important point is that any of the attributes in a machine ad can be utilized at job submission time as part of a request or preference on what machine to use. Additional attributes can be easily added. For example, your site administrator can add a physical location attribute to your machine ClassAds.
4. Getting ready to run
There is not much to learn before you can start using Condor effectively. Here is a short list of the steps:
- Code Preparation To get a job to run under Condor, it must be able to run as a background batch job. Since Condor runs the program unattended and in the background, it will not be able to do interactive input and output. Condor can redirect console output (stdout and stderr) and keyboard input (stdin) to and from files for you. Create any needed files that contain the proper keystrokes needed for program input. Make certain the program will run correctly with the files.
- The Condor Universe Condor has more than one runtime environment (called a universe) from which to choose. The most used ones are:
- the standard universe, which allows a job running under Condor to handle system calls by returning them to the machine where the job was submitted. It also provides the mechanisms necessary to take a checkpoint and migrate a partially completed job, should the machine on which the job is executing become unavailable. To use the standard universe, it is necessary to relink the program with the Condor library using the condor_compile command. To read more about compiling for Condor, look at man condor_compile or in the longer manual.
- the vanilla universe, which provides a way to run jobs that cannot be relinked. There is no way to take a checkpoint or migrate a job executed under the vanilla universe. For access to input and output files, jobs must either use a shared file system, or use Condor's File Transfer mechanism.
Choose a universe under which to run the Condor program, and re-link the program if necessary.
- Submit description file To control the details of a job submission, you use a submit description file. The file contains information about the job such as what executable to run, the files to use for keyboard and screen data, the platform type required to run the program, which universe you wish to use, and where to send e-mail when the job completes. You can also tell Condor how many times to run a program; it is simple to run the same program multiple times with multiple data sets. The submit description file is where the requirements and rank commands are defined.
Write a submit description file to go with the job. Look at this example for guidance.
- Submit the Job Submit the program to Condor with the condor_submit command.
Once the job is submitted, Condor will do the rest. You can monitor the job's progress with the commands condor_q and condor_status. You may modify the order in which Condor will run your jobs with the command condor_prio. If desired, Condor can even inform you in a log file every time your job is checkpointed and/or migrated to a different machine.
When your program completes, Condor will tell you (by e-mail, if preferred) the exit status of your program and various statistics about its performance, including time used and I/O performed. If you are using a log file for the job (which is recommended), the exit status will be recorded in the log file. You can remove a job from the queue prematurely with the command condor_rm.
Compiling: To compile a program for Condor, you can use the command:
condor_compile <compiler> <program.extension> -o <program name>
Example: To compile the C program hello.c with the compiler gcc, just type the following:
condor_compile gcc hello.c -o hello
5. Submitting Jobs
To submit a job to Condor for execution, you must use the condor_submit command. This command takes as an argument the submit description file. As described above, this file contains the commands and keywords used to direct the queuing of jobs: the name of the executable to run, which universe to run in, any requirements and rank info, how many times to run the program, any command line arguments, etc. Based on this information, condor_submit will create a job ClassAd to use for matching with a machine ClassAd. When this has been done, Condor can queue the job for running on that machine.
There are many advantages to the submit description file. One example could be if you want to run the same program many times, each time with a different input data set (say, 500 times with 500 different input data sets). It is then easy to tell Condor to do this. Just arrange your data files accordingly so that each run reads its own input, and each run writes its own output. Each individual run may have its own initial working directory, stdin, stdout, stderr, command-line arguments, and shell environment. A program that directly opens its own files will read the file names to use either from stdin or from the command line. A program that opens a static filename every time will need to use a separate subdirectory for the output of each run.
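The 500-run case can be sketched as follows. The Python snippet below builds a submit description file that uses Condor's $(Process) macro, which takes the values 0 through 499 for the queued jobs, to give every run its own input and output files. The program name analyze and the in.N, out.N and err.N naming scheme are assumptions for illustration, as is the choice of the vanilla universe with a shared file system.

```python
# Sketch of a submit description file for many runs with separate data files.
# $(Process) is expanded by Condor to 0, 1, ..., runs-1 for the queued jobs.
# The program name and the in.N/out.N/err.N naming are assumptions.
def make_sweep_submit(executable, runs):
    return "\n".join([
        "universe   = vanilla",
        f"executable = {executable}",
        "input      = in.$(Process)",
        "output     = out.$(Process)",
        "error      = err.$(Process)",
        f"log        = {executable}.log",
        f"queue {runs}",
    ]) + "\n"

sweep_text = make_sweep_submit("analyze", 500)
print(sweep_text)
```

With this layout, run number 7 reads in.7 and writes out.7 and err.7, so no two runs touch the same file.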
See condor_submit in the manual pages, for a more complete description of how to use it.
Look here for examples of submit description files.
5.1 Requirements and Rank
It is important to list the correct requirements and rank commands in the submit description file. This way you can ensure that your program is run on the machine that best fits your requirements.
The requirements and rank must be specified as valid Condor ClassAd expressions. There are, however, default values set by the condor_submit program, which are used if none are defined in the submit description file. The ClassAd expressions are intuitive and reminiscent of C. It is possible to write quite elaborate expressions with ClassAds. Check out chapter 4.1 in the Condor manual for a complete description.
All of the commands in the submit description file are case insensitive, except for the ClassAd attribute string values. ClassAds attribute names are case insensitive, but ClassAd string values are case preserving.
Note that the comparison operators (<, >, <=, >=, and ==) compare strings case insensitively. The special comparison operators =?= and =!= compare strings case sensitively.
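The difference between the two kinds of string comparison can be illustrated with a small Python model. This is only a sketch of the string behaviour described above, not an implementation of ClassAd semantics; the function names eq and meta_eq are made up for this example.

```python
# Toy model of ClassAd string comparison (illustration only).
def eq(a, b):
    """Models ==: strings compare case-insensitively."""
    return a.lower() == b.lower()

def meta_eq(a, b):
    """Models =?=: strings compare case-sensitively."""
    return a == b

print(eq("LINUX", "linux"))       # True
print(meta_eq("LINUX", "linux"))  # False
print(meta_eq("LINUX", "LINUX"))  # True
```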
The allowed ClassAd attributes vary from machine to machine. To see all of the machine ClassAd attributes for all machines in the Condor pool, run the command condor_status -l. If there are any jobs in the queue, you can see the job ClassAds with the command condor_q -l.
5.2 List of Attributes
Machine attributes:
Here follows a description of some of the common machine attributes. Please see the example under section 3. Allocation of resources for an example of the attributes for one of the machines on radon. For a longer, more complete listing of attributes, look here.
- Activity: String which describes Condor job activity on the machine. Can have one of the following values:
- "Idle": There is no job activity
- "Busy": A job is busy running
- "Suspended": A job is currently suspended
- "Vacating": A job is currently checkpointing
- "Killing": A job is currently being killed
- "Benchmarking": The startd is running benchmarks
- Arch: String with the architecture of the machine.
- ClockDay: The day of the week, where 0 = Sunday, 1 = Monday, ... , 6 = Saturday.
- ClockMin: The number of minutes passed since midnight.
- ConsoleIdle: The number of seconds since activity on the system console keyboard or console mouse has last been detected.
- Cpus: Number of CPUs in this machine.
- CurrentRank: A float which represents this machine owner's affinity for running the Condor job which it is currently hosting. If not currently hosting a Condor job, CurrentRank is 0.0. When a machine is claimed, the attribute's value is computed by evaluating the machine's Rank expression with respect to the current job's ClassAd.
- Disk: The amount of disk space on this machine available for the job in Kbytes.
- EnteredCurrentActivity: Time at which the machine entered the current Activity. On all platforms (including NT), this is measured in the number of integer seconds since the Unix epoch (00:00:00 UTC, Jan 1, 1970).
- FileSystemDomain: A "domain" name configured by the Condor administrator which describes a cluster of machines which all access the same, uniformly-mounted, networked file systems usually via NFS or AFS. This is useful for Vanilla universe jobs which require remote file access.
- KeyboardIdle: The number of seconds since activity on any keyboard or mouse associated with this machine has last been detected.
- KFlops: Relative floating point performance as determined via a Linpack benchmark.
- LoadAvg: A floating point number with the machine's current load average.
- Machine: A string with the machine's fully qualified hostname.
- Memory: The amount of RAM in megabytes.
- Name: The name of this resource; typically the same value as the Machine attribute, but could be customized by the site administrator. On SMP machines, the condor_startd will divide the CPUs up into separate virtual machines, each with a unique name. These names will be of the form "vm#@full.hostname", for example, "vm1@vulture.cs.wisc.edu", which signifies virtual machine 1 from vulture.cs.wisc.edu.
- OpSys: String describing the operating system running on this machine.
- Requirements: A boolean, which when evaluated within the context of the machine ClassAd and a job ClassAd, must evaluate to TRUE before Condor will allow the job to use this machine.
- MaxJobRetirementTime: An expression giving the maximum time in seconds that the startd will wait for the job to finish before kicking it off if it needs to do so.
- StartdIpAddr: String with the IP and port address of the condor_startd daemon which is publishing this machine ClassAd.
- State: String which publishes the machine's Condor state. Can be:
- "Owner": The machine owner is using the machine, and it is unavailable to Condor.
- "Unclaimed": The machine is available to run Condor jobs, but a good match is either not available or not yet found.
- "Matched": The Condor central manager has found a good match for this resource, but a Condor scheduler has not yet claimed it.
- "Claimed": The machine is claimed by a remote condor_schedd and is probably running a job.
- "Preempting": A Condor job is being preempted (possibly via checkpointing) in order to clear the machine for either a higher priority job or because the machine owner wants the machine back.
- VirtualMachineID: For SMP machines, the integer that identifies the VM. The value will be X for the VM with name="vmX@full.hostname". For non-SMP machines with one virtual machine, the value will be 1.
- VirtualMemory: The amount of currently available virtual memory (swap space) expressed in Kbytes.
Job attributes:
- Args: String representing the arguments passed to the job.
- CkptArch: String describing the architecture of the machine this job executed on at the time it last produced a checkpoint. If the job has never produced a checkpoint, this attribute is undefined.
- CkptOpSys: String describing the operating system of the machine this job executed on at the time it last produced a checkpoint. If the job has never produced a checkpoint, this attribute is undefined.
- ClusterId: Integer cluster identifier for this job. A cluster is a group of jobs that were submitted together. Each job has its own unique job identifier within the cluster, but shares a common cluster identifier. The value changes each time a job or set of jobs are queued for execution under Condor.
- CompletionDate: The time when the job completed, or the value 0 if the job has not yet completed. Measured in the number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970).
- CurrentHosts: The number of hosts in the claimed state, due to this job.
- EnteredCurrentStatus: An integer containing the epoch time of when the job entered into its current status. For example, if the job is on hold, the ClassAd expression CurrentTime - EnteredCurrentStatus will equal the number of seconds that the job has been on hold.
- ImageSize: Estimate of the memory image size of the job in Kbytes. The initial estimate may be specified in the job submit file. Otherwise, the initial value is equal to the size of the executable. When the job checkpoints, the ImageSize attribute is set to the size of the checkpoint file (since the checkpoint file contains the job's memory image). A vanilla universe job's ImageSize is recomputed internally every 15 seconds.
- JobPrio: Integer priority for this job, set by condor_submit or condor_prio. The default value is 0. The higher the number, the better the priority.
- JobStartDate: Time at which the job first began running. Measured in the number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970).
- JobStatus: Integer which indicates the current status of the job.
- 0: Unexpanded (the job has never run)
- 1: Idle
- 2: Running
- 3: Removed
- 4: Completed
- 5: Held
- JobUniverse: Integer which indicates the job universe.
- 1: standard
- 4: PVM
- 5: vanilla
- 7: scheduler
- 8: MPI
- 9: grid
- 10: java
- LastMatchTime: An integer containing the epoch time when the job was last successfully matched with a resource (gatekeeper) Ad.
- LastRejMatchReason: If, at any point in the past, this job failed to match with a resource ad, this attribute will contain a string with a human-readable message about why the match failed.
- LastRejMatchTime: An integer containing the epoch time when Condor-G last tried to find a match for the job, but failed to do so.
- MaxHosts: The maximum number of hosts that this job would like to claim. As long as CurrentHosts is the same as MaxHosts, no more hosts are negotiated for.
- MaxJobRetirementTime: Maximum time in seconds to let this job run uninterrupted before kicking it off when it is being preempted. This can only decrease the amount of time from what the corresponding startd expression allows.
- MinHosts: The minimum number of hosts that must be in the claimed state for this job, before the job may enter the running state.
- NumGlobusSubmits: An integer that is incremented each time the condor_gridmanager receives confirmation of a successful job submission into Globus.
- Owner: String describing the user who submitted this job.
- ProcId: Integer process identifier for this job. Within a cluster of many jobs, each job has the same ClusterId, but will have a unique ProcId. Within a cluster, assignment of a ProcId value will start with the value 0. The job (process) identifier described here is unrelated to operating system PIDs.
- RemoteIwd: The path to the directory in which a job is to be executed on a remote machine.
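As a sketch of how a few of these job attributes fit together, the Python snippet below takes a job ad represented as a dictionary, forms the job identifier from ClusterId and ProcId, decodes JobStatus using the codes listed above, and evaluates CurrentTime - EnteredCurrentStatus. The helper name describe_job and the sample attribute values are hypothetical.

```python
import time

# JobStatus codes as listed above.
JOB_STATUS = {0: "Unexpanded", 1: "Idle", 2: "Running",
              3: "Removed", 4: "Completed", 5: "Held"}

def describe_job(ad, now=None):
    """Return 'cluster.proc status for N s' from a job ad given as a dict."""
    now = int(time.time()) if now is None else now
    job_id = f"{ad['ClusterId']}.{ad['ProcId']}"
    status = JOB_STATUS.get(ad["JobStatus"], "Unknown")
    # Equivalent of the expression CurrentTime - EnteredCurrentStatus.
    seconds = now - ad["EnteredCurrentStatus"]
    return f"{job_id} {status} for {seconds} s"

# Hypothetical job ad values, for illustration only.
ad = {"ClusterId": 42, "ProcId": 0, "JobStatus": 5,
      "EnteredCurrentStatus": 1_200_000_000}
print(describe_job(ad, now=1_200_000_090))  # 42.0 Held for 90 s
```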
5.3 Job/machine matching
When Condor is considering a match between a job and a machine, the rank is used to choose a match from among all machines that satisfy the job's requirements and are available to the user, after accounting for the user's priority and the machine's rank of the job. The rank expressions, simple or complex, define a numerical value that expresses preferences.
The job's rank expression evaluates to one of three values:
- UNDEFINED
- ERROR
- a floating point value
If rank evaluates to a floating point value, the best match will be the one with the largest, positive value. If no rank is given in the submit description file, then Condor substitutes a default value of 0.0 when considering machines to match. If the job's rank of a given machine evaluates to UNDEFINED or ERROR, this same value of 0.0 is used. Therefore, the machine is still considered for a match, but has no rank above any other.
A boolean expression evaluates to the numerical value of 1.0 if true, and 0.0 if false.
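The rules above can be summarised in a short Python sketch. It is a toy model rather than Condor's evaluator: UNDEFINED and ERROR are represented by placeholder objects, machine ads are dictionaries, and the machine names and KFlops values are hypothetical.

```python
# Toy model of how a job's rank value is interpreted (not Condor's evaluator).
UNDEFINED = object()
ERROR = object()

def effective_rank(value):
    """Map a rank result to the number used for comparing machines."""
    if value is UNDEFINED or value is ERROR:
        return 0.0
    if isinstance(value, bool):  # checked before the generic numeric case
        return 1.0 if value else 0.0
    return float(value)

def best_machine(machines, rank):
    """Pick the machine with the largest rank; rank(m) may return UNDEFINED."""
    return max(machines, key=lambda m: effective_rank(rank(m)))

def rank_kflops(machine):
    """Models 'Rank = kflops' for a machine ad given as a dict."""
    return machine.get("KFlops", UNDEFINED)

machines = [
    {"Name": "a", "KFlops": 900_000},
    {"Name": "b"},  # KFlops not advertised, so rank is UNDEFINED
    {"Name": "c", "KFlops": 1_200_000},
]
print(best_machine(machines, rank_kflops)["Name"])  # c
```

Machine "b" is still a candidate, but because its rank falls back to 0.0 it can never be preferred over a machine that advertises a positive value.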
Here are some examples of rank expressions from the Condor manual:
- For a job that desires the machine with the most available memory:
Rank = memory
- For a job that prefers to run on a friend's machine on Saturdays and Sundays:
Rank = ( (clockday == 0) || (clockday == 6) ) && (machine == "friend.cs.wisc.edu")
- For a job that prefers to run on one of three specific machines:
Rank = (machine == "friend1.cs.wisc.edu") || (machine == "friend2.cs.wisc.edu") || (machine == "friend3.cs.wisc.edu")
- For a job that wants the machine with the best floating point performance (on Linpack benchmarks):
Rank = kflops
This last example may cause problems, since not all machines have the kflops attribute defined. For machines where this attribute is not defined, Rank will evaluate to the value UNDEFINED, and Condor will use a default rank of 0.0 for the machine. The rank expression will only rank machines where the attribute is defined. Therefore, the machine with the highest floating point performance may not be the one given the highest rank.
Thus, before relying on a rank expression, it is always wise to check whether its evaluation will lead to the expected ranking of machines. Use the command condor_status -constraint <name> to see a list of machines that fit a certain constraint. For example, to see which machines in the pool have kflops defined, use condor_status -constraint kflops.
Alternatively, to see a list of machines where kflops is not defined, use condor_status -constraint "kflops=?=undefined".
- For a job that prefers specific machines in a specific order:
Rank = ((machine == "friend1.cs.wisc.edu")*3) + ((machine == "friend2.cs.wisc.edu")*2) + (machine == "friend3.cs.wisc.edu")
Example: If the machine being ranked is "friend1.cs.wisc.edu", then the expression
(machine == "friend1.cs.wisc.edu")
is true, and gives the value 1.0. The expressions
(machine == "friend2.cs.wisc.edu")
and
(machine == "friend3.cs.wisc.edu")
are false, and give the value 0.0. Therefore, rank evaluates to the value 3.0. In this way, machine "friend1.cs.wisc.edu" is ranked higher than machine "friend2.cs.wisc.edu", machine "friend2.cs.wisc.edu" is ranked higher than machine "friend3.cs.wisc.edu", and all three of these machines are ranked higher than others.
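The arithmetic of this ordered-preference expression can be checked with a few lines of Python. The snippet relies on the fact that Python, like the rule above, treats true as 1 and false as 0 in arithmetic; the host other.cs.wisc.edu is a hypothetical machine that is not on the list of friends.

```python
# Evaluate the ordered-preference rank expression for several machines.
def rank(machine):
    # True and False multiply as 1 and 0, mirroring the 1.0/0.0 rule.
    return float((machine == "friend1.cs.wisc.edu") * 3
                 + (machine == "friend2.cs.wisc.edu") * 2
                 + (machine == "friend3.cs.wisc.edu"))

names = ["other.cs.wisc.edu", "friend3.cs.wisc.edu",
         "friend1.cs.wisc.edu", "friend2.cs.wisc.edu"]
ordered = sorted(names, key=rank, reverse=True)
for name in ordered:
    print(name, rank(name))
```

The machines come out in the order friend1 (3.0), friend2 (2.0), friend3 (1.0), followed by the unlisted machine with rank 0.0.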
5.4 Submitting Jobs Using a Shared File System
To submit certain kinds of jobs (vanilla, Java, Parallel or MPI) without using the File Transfer mechanism, Condor must use a shared file system to access input and output files. This means that the job must be able to access the data files from any machine on which it could potentially run.
Example:
Assume a job is being submitted from condor.rcac.purdue.edu, and that this job requires a data-file /data/3/data.dat. To run this job in Condor, the input file must be available through a shared file system (NFS or AFS), in the same path as on the submission host, for the job to run correctly.
To make sure that users have access to the correct shared files, Condor uses the FileSystemDomain and UidDomain machine ClassAd attributes. These attributes specify which machines have access to the same shared file systems. All machines that mount the same shared directories in the same locations are considered to belong to the same file system domain. In the same way, all machines which share the same user information and/or the same UIDs are considered to be part of the same UID domain. To ensure that a job which relies on a shared file system runs on a machine in the correct UidDomain and FileSystemDomain, Condor uses the requirements expression. The default requirements are that the job must run on a machine with the same UidDomain and FileSystemDomain as the machine from which the job is submitted.
The only case where this default is not the right thing to do is when a Condor pool spans multiple UidDomains and/or FileSystemDomains. In that case, the user may need to specify a different requirements expression to have the job run on the correct machines.
Example:
Consider a Condor pool which is made up of a combination of ordinary desktop workstations and a dedicated compute cluster. Most of the pool, including the compute cluster, has access to a shared file system, but some of the desktop machines do not. To handle this, define the FileSystemDomain to be rcac.purdue.edu for all the machines that mount the shared files, and to be the full hostname for each machine that does not (for example, enterprise.rcac.purdue.edu).
If you then want to submit vanilla universe jobs from your desktop (enterprise.rcac.purdue.edu), which does not mount the shared file system (and is therefore in its own file system domain), and you want those jobs to be able to use the compute cluster, you should put the program and input data files on the shared file system. When you then submit your job, you must tell Condor to send these jobs only to machines which have access to that shared data. To do this, specify the following requirements expression:
Requirements = UidDomain == "rcac.purdue.edu" && FileSystemDomain == "rcac.purdue.edu"
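To see which machines such an expression would admit, it can be mirrored in a small Python filter. This is a simplified sketch: machine ads are dictionaries, the comparison is an ordinary Python string comparison, and node01.rcac.purdue.edu is a hypothetical cluster node; only enterprise.rcac.purdue.edu comes from the example above.

```python
# Toy filter mirroring the requirements expression above.
# Note: Python's == is case-sensitive, unlike the ClassAd == operator.
def requirements(machine):
    return (machine["UidDomain"] == "rcac.purdue.edu"
            and machine["FileSystemDomain"] == "rcac.purdue.edu")

# Hypothetical machine ads, for illustration only.
pool = [
    {"Machine": "node01.rcac.purdue.edu",
     "UidDomain": "rcac.purdue.edu",
     "FileSystemDomain": "rcac.purdue.edu"},
    {"Machine": "enterprise.rcac.purdue.edu",
     "UidDomain": "rcac.purdue.edu",
     "FileSystemDomain": "enterprise.rcac.purdue.edu"},
]
eligible = [m["Machine"] for m in pool if requirements(m)]
print(eligible)  # ['node01.rcac.purdue.edu']
```

The desktop is excluded because its FileSystemDomain is its own hostname, which is exactly the effect the requirements expression is meant to have.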
5.5 Submitting Jobs Without a Shared File System
A shared file system is not necessary to use Condor. Condor has a file transfer mechanism, which users can enable when they submit jobs. Any files that are needed by a job are temporarily transferred by Condor from the machine where the job was submitted to the machine where the job is executed. After the job has been executed, Condor transfers the output back to the machine from which the job was submitted. The user specifies which files should be transferred, and when to copy back the output files, in the job's submit description file.
For jobs submitted under the standard universe, the existence of a shared file system is not relevant. Access to files (input and output) is handled through Condor's remote system call mechanism. The executable and checkpoint files are transferred automatically, when needed. Therefore, the user does not need to change the submit description file if there is no shared file system.
For the vanilla (and Java) universe, access to files (including the executable) through a shared file system is presumed as a default on UNIX machines. If there is no shared file system, then Condor's file transfer mechanism must be explicitly enabled. When submitting a job from a Windows machine, Condor presumes the opposite: no access to a shared file system. It instead enables the file transfer mechanism by default. Submission of a job might need to specify which files to transfer, and/or when to transfer the output files back.
For the grid universe, jobs are to be executed on remote machines, so there would never be a shared file system between machines. For details about this, see section 5.3.2 of the manual.
For the scheduler universe, Condor is only using the machine from which the job is submitted and no shared file system is relevant.
Specification of file transfers
In order to use the file transfer mechanism, you must place two commands in the job's submit description file: should_transfer_files and when_to_transfer_output.
Example:
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
- should_transfer_files: specifies whether Condor should transfer input files from the submit machine to the remote machine where the job executes. It also specifies whether the output files are transferred back to the submit machine. There are three possible values:
- YES: Condor always transfers both input and output files.
- IF_NEEDED: Condor transfers files if the job is matched with (and to be executed on) a machine in a different FileSystemDomain than the one the submit machine belongs to. If the job is matched with a machine in the local FileSystemDomain, Condor will not transfer files and relies on a shared file system.
- NO: Condor's file transfer mechanism is disabled.
- when_to_transfer_output: tells Condor when output files are to be transferred back to the submit machine after the job has executed on a remote machine. There are two possible values:
- ON_EXIT: Condor transfers output files back to the submit machine only when the job exits on its own.
- ON_EXIT_OR_EVICT: Condor always does the transfer, whether the job exits on its own or is evicted from the machine. When the job has completed, the files are transferred back to the submitting directory.
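The meaning of the three should_transfer_files values can be sketched as a small decision function. This Python model is only an illustration of the rules listed above; the function name will_transfer is made up, and the domain names are taken from the earlier shared file system example.

```python
# Toy decision function for should_transfer_files (illustration only).
def will_transfer(should_transfer_files, submit_domain, execute_domain):
    value = should_transfer_files.upper()
    if value == "YES":
        return True
    if value == "NO":
        return False
    if value == "IF_NEEDED":
        # Transfer only when the two machines are in different
        # FileSystemDomains, that is, when no shared file system exists.
        return submit_domain != execute_domain
    raise ValueError(f"unknown value: {should_transfer_files}")

print(will_transfer("IF_NEEDED", "rcac.purdue.edu", "rcac.purdue.edu"))  # False
print(will_transfer("IF_NEEDED", "rcac.purdue.edu",
                    "enterprise.rcac.purdue.edu"))                       # True
print(will_transfer("YES", "rcac.purdue.edu", "rcac.purdue.edu"))        # True
```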
When submitting from a Unix platform, the file transfer mechanism is unused by default.
If neither when_to_transfer_output or should_transfer_files are defined, Condor assumes should_transfer_files = NO.
When submitting from a Windows platform, Condor does not provide any way to use a shared file system for jobs. Therefore, if neither when_to_transfer_output or should_transfer_files are defined, the file transfer mechanism is enabled by default with the following values:
should_transfer_files = YES when_to_transfer_output = ON_EXIT
Specification of which files should be transferred
If the file transfer mechanism is enabled, Condor will transfer the following files before the job is run on a remote machine.
- the executable
- the input, as defined with the input command
- any jar files (for the Java universe)
If your job requires other input files, these can be specified in the submit description file. These files will also be transferred before the job is run. To specify the files to transfer, include the following in the submit description file:
transfer_input_files = file1,file2,...
Note: except for grid universe jobs, it is best not to specify transfer_output_files, and thus to let Condor figure things out by itself based upon what output the job produces.
The file transfer mechanism specifies file names and/or paths on both the file system of the submit machine and on the file system of the execute machine. Files in the transfer_input_files command are specified as they are accessed on the submit machine. As the program executes, it accesses files as they are found on the execute machine.
There are three ways to specify files and paths for transfer_input_files:
- Relative to the submit directory, if the submit command initialdir is not specified.
- Relative to the initial directory, if the submit command initialdir is specified.
- Absolute.
Before Condor starts executing the program, it will copy the executable and any input files (specified by the submit command input and perhaps by transfer_input_files) to a temporary directory on the execute machine. Since these files are all put in the same temporary directory, the program must not use paths to access input files, and the input files must all be uniquely named (otherwise the last of two or more files with the same name will overwrite any earlier files).
Any output files created by the program will also be placed in this temporary directory and later transferred back to the submitting machine when the job completes.
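Because all transferred files land in one flat directory, it can be worth checking a transfer_input_files list for clashing base names before submitting. The helper below is a minimal sketch of such a check; the function name and the example paths are hypothetical and are not part of Condor.

```python
import posixpath
from collections import defaultdict


def find_name_collisions(transfer_input_files):
    """Map each base name that occurs more than once to the paths sharing it.

    transfer_input_files is the comma-separated value as it would appear in
    the submit description file.
    """
    by_name = defaultdict(list)
    for path in transfer_input_files.split(","):
        path = path.strip()
        if path:
            by_name[posixpath.basename(path)].append(path)
    return {name: paths for name, paths in by_name.items() if len(paths) > 1}


if __name__ == "__main__":
    # Hypothetical file list: both a.txt files would end up under the same name.
    print(find_name_collisions("data/a.txt, other/a.txt, b.txt"))
```

An empty result means every file keeps a distinct name in the temporary directory.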
To see some examples on the use of file transfer, click here. They are all from the Condor manual.
5.6 Execution on Differing Architectures
If executables are available for several different platforms, Condor can choose from a potentially larger pool of machines for a job. This is done by making changes to the submit description file.
Example:
Cross submission. An executable is available for one platform, but the submission is done from a different platform. Given the correct executable, the requirements command in the submit description file specifies the target architecture. Here, an executable compiled for a Sun 4, submitted from an Intel architecture running Linux would add the requirement
requirements = Arch == "SUN4x" && OpSys == "SOLARIS251"
Without this requirement, condor_submit will assume that the program is to be executed on a machine with the same platform as the machine where the job is submitted.
Cross submission works for both standard and vanilla universes. To see the architecture and OS for the machines in the pool, type the command condor_status.
Click here to see some examples (from the Condor manual) showing how cross submission works in the vanilla universe and here for an example for the standard universe.
5.7 Grid Computing
The idea of grid computing is to be able to use resources which span many administrative domains. Even though a Condor pool usually contains machines owned by many different people, collaborating researchers from different organizations often do not consider it feasible to combine all their computers in one large Condor pool. They will therefore have to use grid computing.
Condor has its own mechanisms for grid computing, and is also able to interact with other grid systems. The usual way for Condor to submit jobs from one pool to another is via flocking.
Flocking is enabled by configuration within each of the pools. Jobs migrate from one pool to another based on the availability of machines to execute jobs. If the local Condor pool currently does not have any available machines to run a job, the job will flock to another pool. This is not something the user needs to think about - nothing needs to be added or changed in the submit description file.
To learn more about this, about Condor-C jobs, about glidein, and about running when there is other middleware like Globus running, see section 5 of the official Condor manual. Glidein is a mechanism by which one or more Grid resources (remote machines) temporarily join a local Condor pool; the program condor_glidein is used to add a machine to a Condor pool.
Condor-C Job submission
Job submission is done the same way for Condor-C jobs as for all other Condor jobs. The only thing to remember is that the universe must be 'grid'. There should also be an entry 'grid_resource' in the submit description file, which specifies the remote condor_schedd daemon to which the job should be submitted. The value of 'grid_resource' consists of three fields: 1) the grid type (condor), 2) the name of the remote condor_schedd daemon (the same as the condor_schedd ClassAd attribute Name on the remote machine), and 3) the name of the remote pool's condor_collector. Here is an example submit description file:
Universe = grid
Executable = myjob
Output = myoutput
Error = myerror
Log = mylog
grid_resource = condor joe@remotemachine.example.com remotecentralmanager.example.com
+remote_jobuniverse = 5
+remote_requirements = True
+remote_ShouldTransferFiles = "YES"
+remote_WhenToTransferOutput = "ON_EXIT"
Queue
The remote machine needs to know the attributes of the job. In the submit description file these are specified with the '+' syntax, followed by the string remote_.
As a minimum, these must be the job's universe and the job's requirements. Most likely there will also be other attributes specific to the job's universe (on the remote pool).
Note: attributes set with '+' are inserted directly into the job's ClassAd. Specify attributes as they must appear in the job's ClassAd, not the submit description file.
See section 5.3.1.2 in the official Condor manual for more information and examples.
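The three-field layout of grid_resource for Condor-C can be illustrated by splitting the value from the example submit description file. This is a sketch for checking a value before submitting, not part of Condor; the function and field names are chosen here for illustration.

```python
from collections import namedtuple

# Field names are illustrative labels for the three fields described above.
CondorCResource = namedtuple("CondorCResource", "grid_type schedd_name collector")


def parse_grid_resource(value):
    """Split a Condor-C grid_resource value into its three fields."""
    fields = value.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {value!r}")
    resource = CondorCResource(*fields)
    if resource.grid_type != "condor":
        raise ValueError(f"not a Condor-C resource: {resource.grid_type!r}")
    return resource


if __name__ == "__main__":
    print(parse_grid_resource(
        "condor joe@remotemachine.example.com remotecentralmanager.example.com"
    ))
```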
5.8 Examples for submitting a job
Once the submit description file has been written, submitting the job to Condor is very simple. At the command prompt, just type condor_submit <job-name>, where job-name is the name of the submit description file.
Example: Here I am using a very simple submit description file, namely:
Executable = hello
Log = hello.log
Output = hello.out
Queue
Here hello is a C program which was first compiled with the command condor_compile gcc hello.c -o hello. I have named this submit description file 'run_hello'. In the following, I am running on radon:
user123@radon:~$ condor_submit run_hello
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 260182.
user123@radon:~$
It may take a (sometimes long) while before the job starts and finishes running, depending on how many others are using the machines, your rank, the requirements you have given for the job, etc. The progress can be checked with the command condor_q. When the job has completed, I have the two files hello.log and hello.out in my directory - just as I asked for in the submit description file. You should always use a log file.
The contents of the files are:
hello.log:
000 (260182.000.000) 08/29 16:21:31 Job submitted from host: <128.210.9.35:35407>
...
001 (260182.000.000) 08/29 16:22:42 Job executing on host: <128.211.131.51:32780>
...
005 (260182.000.000) 08/29 16:22:42 Job terminated.
    (1) Normal termination (return value 13)
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    830 - Run Bytes Sent By Job
    13490672 - Run Bytes Received By Job
    830 - Total Bytes Sent By Job
    13490672 - Total Bytes Received By Job
...
and
hello.out:
Hello World!
which was the output the program would otherwise have written to the screen. You will also receive an email, sent to the user, unless otherwise specified. The email received after running the above Condor job, can be seen here.
5.9 Condor DAGMan
DAGMan is Condor's Directed Acyclic Graph manager, and is a way of submitting many jobs at the same time, and automating workflows for execution in Condor.
As an example, consider the DAGMan submit file dagfile.dag:
Job 1 testrun1.submit
Job 2 testrun2.submit
Job 3 testrun3.submit
PARENT 1 CHILD 2
PARENT 2 CHILD 3
We then need a Condor submit file for each part. This could be 'testrun1.submit' (in the example below, the DAG is submitted across the grid; for a local submission, just change universe and remove globusscheduler):
universe = globus
executable = testrun1
Transfer_Executable = true
globusscheduler = tg-login1.ncsa.teragrid.org/jobmanager
output = testrun1.out
error = testrun1.error
log = testrun1.log
queue
The files testrun2.submit and testrun3.submit would be similar, with just 1 changed to 2 or 3. Create the executables testrun1, testrun2 and testrun3. Your directory should then contain the following files:
cu12:~/dagtest238% ls
dagfile.dag testrun1.submit testrun2.submit testrun3.submit testrun1 testrun2 testrun3
cu12:~/dagtest239%
To submit this DAG, give the command:
condor_submit_dag dagfile.dag
This gives the output:
cu12:~/dagtest239% condor_submit_dag dagfile.dag
Checking all your submit files for log file names.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : dagfile.dag.condor.sub
Log of DAGMan debugging messages : dagfile.dag.dagman.out
Log of Condor library debug messages : dagfile.dag.lib.out
Log of the life of condor_dagman itself : dagfile.dag.dagman.log
Condor Log file for all jobs of this DAG : /u/ncsa/user123/dagtest/testrun1.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 58.
-----------------------------------------------------------------------
cu12:~/dagtest240%
Just as for the ordinary condor_submit, the status of the job can be checked with condor_q and a job can be removed with condor_rm.
There are some examples of using DAG here and here.
The manual page for Condor DAG can be found here.
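The PARENT/CHILD lines in dagfile.dag define which jobs must finish before others may start. The sketch below reads only the Job and PARENT ... CHILD lines used in this section and derives one valid run order with a topological sort. It illustrates the dependency idea and is not DAGMan's implementation; the function names are chosen here, and other DAG file keywords are ignored.

```python
from collections import defaultdict, deque


def parse_dag(text):
    """Return (jobs, children) from 'Job' and 'PARENT ... CHILD ...' lines."""
    jobs = {}                      # job name -> submit description file
    children = defaultdict(list)   # parent name -> list of child names
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0].startswith("#"):
            continue
        keyword = tokens[0].upper()
        if keyword == "JOB":
            jobs[tokens[1]] = tokens[2]
        elif keyword == "PARENT":
            split = [t.upper() for t in tokens].index("CHILD")
            for parent in tokens[1:split]:
                children[parent].extend(tokens[split + 1:])
    return jobs, children


def run_order(jobs, children):
    """Return the job names in an order that respects every dependency."""
    waiting_on = {name: 0 for name in jobs}
    for kids in children.values():
        for kid in kids:
            waiting_on[kid] += 1
    ready = deque(name for name in jobs if waiting_on[name] == 0)
    order = []
    while ready:
        name = ready.popleft()
        order.append(name)
        for kid in children.get(name, []):
            waiting_on[kid] -= 1
            if waiting_on[kid] == 0:
                ready.append(kid)
    if len(order) != len(jobs):
        raise ValueError("the dependencies contain a cycle")
    return order


DAGFILE = """\
Job 1 testrun1.submit
Job 2 testrun2.submit
Job 3 testrun3.submit
PARENT 1 CHILD 2
PARENT 2 CHILD 3
"""

if __name__ == "__main__":
    jobs, children = parse_dag(DAGFILE)
    print(run_order(jobs, children))
```

For the three-job chain above, job 1 has no parent and runs first, followed by job 2 and then job 3.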
5.10 Running commercial packages in Condor - on Linux
It is possible to use Condor to run a number of commercial applications. Since it may vary which packages are installed and where, the examples below are only meant to run locally, at RCAC.
The examples below are for running on Linux:
Matlab:
Most machines in BoilerGrid that have Matlab installed and available for users advertise that fact via the ClassAd "HAS_MATLAB".
To run the simple .m file "fact.m":
You need to write a submit file that will run Matlab on fact.m, using Condor to transfer the input and output files. The example below will execute on any machine that advertises Matlab capability, regardless of operating system or administrative domain.
matlab.submit:
Executable = $$(MATLAB_EXE)
Arguments = $$(MATLAB_ARGS) -r fact
Universe = vanilla
Getenv = True
Requirements = ( HAS_MATLAB == True )
should_transfer_files = YES
transfer_executable = false
when_to_transfer_output = ON_EXIT
Input = fact.m
Log = mat.log
Output = mat.out
Queue
To submit you do the following:
condor_submit matlab.submit
Example:
-bash-3.00$ condor_submit matlab.submit
vanilla
Submitting job(s)
Logging submit event(s).
1 job(s) submitted to cluster 9586.
-bash-3.00$
You can use condor_q to check on the status of your job:
-bash-3.00$ condor_q
-- Submitter: tg-login64.rcac.purdue.edu : <128.211.143.238:32773> : tg-login64.rcac.purdue.edu
 ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
6558.0   user123   1/4  16:07   0+00:00:00 H  0   9.8  condor_dagman -f -
7080.0   user123   1/8  12:19   0+11:24:32 H  0   9.8  testrun1
7114.0   user123   1/8  16:31   0+00:02:20 H  0   9.8  testrun1
7386.0   user123   1/10 14:07   0+11:55:35 H  0   9.8  testrun1
7543.0   user123   1/11 15:34   0+00:00:00 I  0   9.8  hello
9545.0   user123   2/6  16:38   0+00:00:00 H  0   0.0  remote-testscript
9545.1   user123   2/6  16:38   0+00:00:00 H  0   0.0  remote-testscript
9546.0   user123   2/6  16:55   0+00:00:00 H  0   9.8  a.out
9588.0   user123   2/16 13:27   0+00:00:00 R  0   9.8  matlab -nodisplay
9 jobs; 1 idle, 1 running, 7 held
-bash-3.00$
When the job returns (and is thus no longer shown in the above queue), you can get the answer from your output file, which in this case was called 'mat.out'.
R:
Like Matlab, machines which have R installed advertise that fact with the "HAS_R" ClassAd.
The example input file, R_input, was found at http://www.mayin.org/ajayshah/KB/R/index.html, where other R examples can be found.
The following submit file will run R on any machine that advertises it, using Condor's file transfer.
Universe = vanilla
Executable = $$(R_EXE)
Requirements = ( HAS_R == TRUE )
initialdir = /home/ba01/u113/user123/condor_running/R
log = R.log
arguments = $$(R_ARGS)
should_transfer_files = YES
transfer_executable = false
when_to_transfer_output = ON_EXIT
input = R_input
output = R_output
error = err.$(Process)
Queue
Testrun:
-bash-3.00$ condor_submit r.submit
Submitting job(s)
Logging submit event(s).
1 job(s) submitted to cluster 11.
-bash-3.00$
Check with condor_q whether the program has finished running. When that is the case, you can look at the output: R_output.
SAS:
Machines with SAS installed advertise that fact with the "HAS_SAS" ClassAd.
The submit file below will run SAS on any node that advertises it, using Condor's file transfer.
Universe = vanilla
Executable = $$(SAS_EXE)
Requirements = ( HAS_SAS == TRUE )
initialdir = /home/ba01/u113/user123/SAS
log = SAS.log
# SAS needs the environment variable $HOME set to
# *your* home directory.
environment = HOME=/home/ba01/u113/user123
arguments = $$(SAS_ARGS)
input = SAS_input
output = SAS_output
error = err.$(Process)
Queue
The SAS example, SAS_input, was found at http://ftp.sas.com/samples/A57496 - Advanced Log-Linear Models Using SAS. Many other examples can be found on "SAS Online Samples".
Example showing this run:
-bash-3.00$ condor_submit sas.submit
vanilla
Submitting job(s)
Logging submit event(s).
1 job(s) submitted to cluster 10005.
-bash-3.00$
condor_q then gives:
-bash-3.00$ condor_q
-- Submitter: tg-login64.rcac.purdue.edu : <128.211.143.238:32773> : tg-login64.rcac.purdue.edu
 ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
9545.0   user123   2/6  16:38   0+00:00:00 H  0   0.0  remote-testscript
9545.1   user123   2/6  16:38   0+00:00:00 H  0   0.0  remote-testscript
9546.0   user123   2/6  16:55   0+00:00:00 H  0   9.8  a.out
0005.0   user123   2/19 13:50   0+00:00:00 R  0   9.8  sas -nonews -stdio
4 jobs; 0 idle, 1 running, 3 held
-bash-3.00$
When the program stops running, it returns the file SAS_output.
Maple:
Machines with Maple installed advertise that fact with the "HAS_MAPLE" ClassAd.
For the maple example, assuming that we require a specific version of Maple (Maple 11), we will run a small program, maple_input:
And then the submit file, maple.submit for Condor, using file transfer:
Universe = vanilla
Requirements = ( HAS_MAPLE == TRUE && MAPLE_VERSION == "11")
Executable = $$(MAPLE_EXE)
Arguments = maple_input
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Input = maple_input
Error = maple.err
Log = maple.log
Output = maple.out
Queue
You can now submit it:
-bash-3.00$ condor_submit maple.submit
vanilla
Submitting job(s)
Logging submit event(s).
1 job(s) submitted to cluster 10210.
-bash-3.00$
And do a condor_q:
-bash-3.00$ condor_q
-- Submitter: tg-login64.rcac.purdue.edu : <128.211.143.238:32773> : tg-login64.rcac.purdue.edu
 ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
6558.0   user123   1/4  16:07   0+00:00:00 H  0   9.8  condor_dagman -f -
...
10210.0  user123   2/19 14:30   0+00:00:00 R  0   9.8  maple maple_input
212 jobs; 202 idle, 1 running, 9 held
-bash-3.00$
When the program finishes running, the output can be seen in maple.out.
Mathematica:
Mathematica (command-line) is installed on RCAC ia32 Linux clusters at /opt/Wolfram/Mathematica/current/Executables/math.
I am using a small test program, mathematica_input which solves a third degree equation.
Create a shell script to run the Mathematica executable:
#!/bin/sh
/opt/Wolfram/Mathematica/current/Executables/math < "$1"
And then the submit file for Condor, using file transfer:
Universe = vanilla
Requirements = ( ARCH == "INTEL" && OPSYS == "LINUX" )
Executable = mathematica.sh
Arguments = mathematica_input
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Input = mathematica_input
Error = mathematica.err
Log = mathematica.log
Output = mathematica.out
Queue
Testrun:
-bash-3.00$ condor_submit mathematica.submit
Submitting job(s)
Logging submit event(s).
1 job(s) submitted to cluster 2.
-bash-3.00$
A little while later, after the program has returned (check with condor_q), you can see the returned result:
-bash-3.00$ less mathematica.out
Mathematica 5.2 for Linux x86 (64 bit)
Copyright 1988-2005 Wolfram Research, Inc.
-- Terminal graphics initialized --
In[1]:=
                     2    3
Out[1]= 1 + 3 x + 3 x  + x
In[2]:=
Out[2]= {{x -> -1}, {x -> -1}, {x -> -1}}
In[3]:=
-bash-3.00$
5.12 Multiple jobs in one Condor file
It is possible to submit multiple jobs in one Condor file. This is often useful if you have many small jobs to run, for example many small Matlab jobs.
Multiple Data Runs for a Single Executable:
Suppose you have a set of jobs that all use a common executable. For example, if you have two Mathematica jobs, you could enqueue them for Condor as in the following Job Configuration File:
Example:
executable = mathematica
universe = vanilla
input = test.data
output = loop.out
error = loop.err
Initialdir = run_1
queue
Initialdir = run_2
queue
The first job will then store its files in the directory run_1, and the second job will use run_2. This avoids the problem with overwritten files. Another way of doing this is to give each output/log file a new name, as in the example below:
Example:
Universe = vanilla
Notification = Complete
Executable = /bin/echo
Arguments = "test job"
GetEnv = True
Initialdir = /home/ba01/u113/user123/condor_running/multiple_jobs
Input = /dev/null
Output = /home/ba01/u113/user123/condor_running/multiple_jobs/myjob.out.$(Process)
Error = /home/ba01/u113/user123/condor_running/multiple_jobs/myjob.err
Log = /home/ba01/u113/user123/condor_running/multiple_jobs/myjob.log.$(Process)
Notify_user = user123@purdue.edu
Queue 10
The above example will simply run the executable 10 times with the same arguments. To run with different arguments or to run different programs, you will have to write a submit script similar to the following.
Example:
Universe = vanilla
Notification = Complete
Executable = /bin/echo
Arguments = "test job"
GetEnv = True
Initialdir = /home/ba01/u113/user123/condor_running/multiple_jobs
Input = /dev/null
Output = /home/ba01/u113/user123/condor_running/multiple_jobs/myjob.out.$(Process)
Error = /home/ba01/u113/user123/condor_running/multiple_jobs/myjob.err
Log = /home/ba01/u113/user123/condor_running/multiple_jobs/myjob.log.$(Process)
Notify_user = user123@purdue.edu
Queue 10
Arguments = "$(Process)"
Requirements = CPU_Speed >= 1
Queue 9
Executable = myjob.sh
Arguments = 99
Queue
The example above submits 10 instances of the first executable/arguments, then 9 with the same executable, but different arguments (it is also possible to add extra requirements). Finally, 1 job is submitted which has a different executable and arguments.
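When there are many data runs, the run_1/run_2 pattern from the first example in this section becomes tedious to type by hand. The sketch below generates such a submit description file as text; the function name and its parameters are hypothetical, and the input, output and error names are simply copied from that example.

```python
def make_submit_description(executable, run_dirs, universe="vanilla"):
    """Build a submit description that queues one job per run directory."""
    lines = [
        f"executable = {executable}",
        f"universe = {universe}",
        "input = test.data",
        "output = loop.out",
        "error = loop.err",
    ]
    for run_dir in run_dirs:
        # Each job gets its own initial directory, so files are not overwritten.
        lines.append(f"Initialdir = {run_dir}")
        lines.append("queue")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    dirs = [f"run_{i}" for i in range(1, 4)]
    print(make_submit_description("mathematica", dirs), end="")
```

The printed text can be saved to a file and passed to condor_submit. The run directories, each containing its own test.data, must exist before submission.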
6. Managing a Job
This section covers commands that apply to a job after it has been submitted. The first part looks at how the job is monitored. The commands are discussed only briefly; for a more detailed description, see the man pages for the commands referred to, either by typing man <command> or by looking in chapter 9 of the official online manual.
The last part of this section looks at ways to affect the job's execution after it has been submitted. This can (among other things) be done by changing the job priority.
6.1 Monitoring the job
As soon as the job has been submitted, Condor will start looking for resources to run it. By typing condor_status -submitters, you will get a list of the users who have submitted jobs. An example of this can be seen below:
user123@radon:~$ condor_status -submitters
Name                       Machine     Running IdleJobs HeldJobs
username1@bio.purdue.e     epsilon.bi        0        3        0
nice-user.user1@r          hamlet.rca        0      716        0
nice-user.login1@rca       itb.rcac.p        0       35        0
userlogin@rcac.purdue      lear.rcac.        0        1        0
username2@rcac.purdue.e    lepton.rca        0        1        0
user@rcac.purdue.          osg.rcac.p        0       74        0
nice-user.login2@r         radon.rcac        0        0      190
nice-user.login1@r         radon.rcac        0      586        0
userlogin@rcac.purdue.e    radon.rcac        0        4        0
username3@rcac.purdue.ed   radon.rcac        0        0        0
login3@nd.edu              sirius.cse        0        0        0
login1@rcac.purdue.edu     tg-login-l        0        0        0
username4@rcac.purdue      tg-login-l        0        1        0
login4@rcac.purdue.ed      tg-login-l        0        3        0
                           RunningJobs IdleJobs HeldJobs
username3@rcac.purdue.edu            0        0        0
login1@rcac.purdue.                  0       74        0
username1@bio.purdue.e               0        3        0
userlogin@rcac.purdue                0        1        0
nice-user.username2@r                0        0      190
nice-user.user@r                     0     1302        0
nice-user.user2@rca                  0       35        0
userlogin@rcac.purdue.e              0        5        0
login1@rcac.purdue                   0        1        0
username2@nd.edu                     0        0        0
login1@rcac.purdue.ed                0        3        0
Total                                0     1424      190
user123@radon:~$
Checking on the progress of jobs
To check on the status of your jobs, use the command condor_q. This command will display the status of all the queued jobs, not just your own. Click here to see an example.
That is, however, not the only way of tracking the progress of your jobs. Another way of doing this is through the user log. In your submit description file, you can specify a log command (by adding Log = <name>.log somewhere before the Queue command). When you have done this, the progress of the job may be followed by viewing the log file. Various events such as execution commencement, checkpoint, eviction and termination are logged in the file. Also logged is the time at which each event occurred.
As soon as your job begins executing, Condor will start up a condor_shadow process on the submit machine. The shadow process is the mechanism by which a remotely executing job can access the environment from which it was submitted, such as input and output files.
It is normal for a machine which has submitted hundreds of jobs to have hundreds of shadows running on the machine.
To find all the machines which are running your job, use the command condor_status. Example: say you wish to find all the machines which run jobs submitted by user123@rcac.purdue.edu. You would then type condor_status -constraint 'RemoteUser == "user123@rcac.purdue.edu"'.
user123@radon:~$ condor_status -constraint 'RemoteUser == "user123@rcac.purdue.edu"'
Name          OpSys  Arch   State    Activity  LoadAv  Mem  ActvtyTime
ba-005.rcac.p LINUX  INTEL  Claimed  Busy      1.000   502  0+00:24:44
ba-006.rcac.p LINUX  INTEL  Claimed  Busy      0.990   502  0+00:20:22
ba-007.rcac.p LINUX  INTEL  Claimed  Busy      1.000   502  0+00:23:16
ba-008.rcac.p LINUX  INTEL  Claimed  Busy      1.000   502  0+00:30:20
...
If you want to find all the machines that are running any job at all, then type: condor_status -run. Click here to see an example.
Removing a job from the queue
The command condor_rm can be used at any time to remove a job from the queue. If the job has already started running, then the job will be killed without a checkpoint, and its queue entry is removed. Here is an example:
Queue of jobs before:
user123@radon:~$ condor_q
Submitter: radon.rcac.purdue.edu : <128.210.9.35:35407> : radon.rcac.purdue.edu
 ID        OWNER             SUBMITTED     RUN_TIME ST PRI SIZE  CMD
...
260076.7   nice-user.user1   8/18 00:05   0+00:21:47 I  0   29.3  startfah.sh -oneun
260076.9   nice-user.user1   8/18 00:05   0+01:40:44 I  0   136.7 startfah.sh -oneun
260185.0   user123           8/30 13:01   0+00:00:00 R  0   19.5  hello
...
Queue of jobs after:
user123@radon:~$ condor_rm 260185.0
Job 260185.0 marked for removal
user123@radon:~$ condor_q
Submitter: radon.rcac.purdue.edu : <128.210.9.35:35407> : radon.rcac.purdue.edu
 ID        OWNER             SUBMITTED     RUN_TIME ST PRI SIZE  CMD
...
260076.7   nice-user.user1   8/18 00:05   0+00:21:47 I  0   29.3  startfah.sh -oneun
260076.9   nice-user.user1   8/18 00:05   0+01:40:44 I  0   136.7 startfah.sh -oneun
...
6.2 Affecting the job's execution
Placing a job on hold
To place a job in the queue on hold, use the command condor_hold. A job that is in the hold state remains there until later released for execution by the command condor_release.
See the manual page for more information.
Changing the priority of jobs
In addition to the priorities assigned to each user, Condor also provides each user with the capability of assigning priorities to each submitted job. These job priorities are local to each queue and can be any integer value, with higher values meaning better priority.
The default priority of a job is 0, but can be changed using the condor_prio command. Example: to change the priority of a job to -15
user123@radon:~$ condor_q user123
-- Submitter: radon.rcac.purdue.edu : <128.210.9.35:35407> : radon.rcac.purdue.edu
 ID        OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
260187.0   user123   8/30 13:59   0+00:00:00 I  0   19.5 hello
1 jobs; 1 idle, 0 running, 0 held
user123@radon:~$ condor_prio -p -15 260187.0
user123@radon:~$ condor_q user123
-- Submitter: radon.rcac.purdue.edu : <128.210.9.35:35407> : radon.rcac.purdue.edu
 ID        OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
260187.0   user123   8/30 13:59   0+00:00:03 R  -15 19.5 hello
1 jobs; 0 idle, 1 running, 0 held
user123@radon:~$
Note these job priorities are different from the user priorities assigned by Condor. Job priorities do not impact user priorities and are only a mechanism for the user to identify the relative importance of jobs among all the jobs submitted by the user to that specific queue.
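The effect of job priorities on your own jobs can be pictured as a sort, with higher values first. The snippet below is only an illustration of that ordering, not Condor's scheduling code. The job IDs and priorities are made-up values, and jobs with equal priority simply keep their list order here, which may differ from what Condor does.

```python
def order_by_job_priority(jobs):
    """Sort (job_id, priority) pairs so that higher priority values come first."""
    return sorted(jobs, key=lambda job: job[1], reverse=True)


if __name__ == "__main__":
    # Illustrative queue contents for a single user.
    my_jobs = [("260187.0", -15), ("260188.0", 0), ("260189.0", 5)]
    for job_id, priority in order_by_job_priority(my_jobs):
        print(job_id, priority)
```

A job lowered to -15, as in the condor_prio example, ends up behind the same user's jobs that were left at the default priority of 0.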
7. Log files and job completion
It is always a good idea to use a log file (include Log = file.log in the submit description file, somewhere before the word Queue). The log file lists the events which happened during the job run, in chronological order. Since the formatting is always the same, the log is machine readable. Four fields are always present, and they will most often be followed by other fields that give further information that is specific to the type of event.
See an example of a log file here.
- The first field in an event is the numeric value assigned as the event type in a 3-digit format.
- The second field identifies the job which generated the event. Within parentheses are the ClassAd job attributes of ClusterId value, ProcId value, and the MPI-specific rank for MPI universe jobs or a set of zeros (for jobs run under universes other than MPI), separated by periods.
- The third field is the date and time of the event logging.
- The fourth field is a string that briefly describes the event. Fields that follow the fourth field give further information for the specific event type.
Click here to see a list of all the events which can show up in a job log file.
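Since the four leading fields always have the same layout, an event header line can be split with a regular expression. The sketch below assumes the layout seen in the hello.log example earlier in this guide (a 3-digit event type, the job identifier in parentheses, a date and time in MM/DD HH:MM:SS form, then the description). The names LogEvent and parse_event_header are chosen here for illustration.

```python
import re
from collections import namedtuple

LogEvent = namedtuple(
    "LogEvent", "event_type cluster proc subproc timestamp description"
)

# Layout assumed from the hello.log example: "000 (260182.000.000) 08/29 16:21:31 ..."
EVENT_HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$"
)


def parse_event_header(line):
    """Return a LogEvent for an event header line, or None for any other line."""
    match = EVENT_HEADER.match(line.strip())
    if match is None:
        return None
    event_type, cluster, proc, subproc, timestamp, description = match.groups()
    return LogEvent(event_type, int(cluster), int(proc), int(subproc),
                    timestamp, description)


if __name__ == "__main__":
    print(parse_event_header(
        "005 (260182.000.000) 08/29 16:22:42 Job terminated."
    ))
```

Lines that do not start an event, such as the "..." separators or the indented usage lines, return None and can be skipped or attached to the preceding event.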
Job Completion
When your Condor job completes (either through normal means or abnormal termination by signal), Condor will remove it from the job queue. This means that it disappears from the condor_q output, but is inserted into the job history file.
This job history file can be inspected with the command condor_history.
Since the output from condor_history will likely be very long, you should constrain it by something like condor_history -constraint 'OWNER == "user123"', where you change 'user123' to your own username. It will only report what you have run from the machine you are currently logged into.
This is how that will look (for me, on radon):
user123@radon:~$ condor_history -constraint 'OWNER == "user123"'
 ID        OWNER     SUBMITTED     RUN_TIME ST COMPLETED    CMD
260177.0   user123   8/25 15:45   0+00:00:03 X  ???          /autohome/u96/b
260178.0   user123   8/25 15:50   0+00:00:15 X  ???          /autohome/u96/b
260179.0   user123   8/28 10:38   0+00:00:01 X  ???          /autohome/u96/b
260180.0   user123   8/28 10:42   0+00:00:12 C  8/28 10:42   /autohome/u96/b
260181.0   user123   8/28 13:03   0+00:00:05 C  8/28 13:04   /autohome/u96/b
260182.0   user123   8/29 16:21   0+00:00:05 C  8/29 16:22   /autohome/u96/b
260183.0   user123   8/30 12:10   0+00:00:06 C  8/30 12:11   /autohome/u96/b
260184.0   user123   8/30 12:11   0+00:00:03 C  8/30 12:11   /autohome/u96/b
260185.0   user123   8/30 13:01   0+00:00:06 C  8/30 13:02   /autohome/u96/b
260186.0   user123   8/30 13:03   0+00:00:00 X  ???          /autohome/u96/b
260187.0   user123   8/30 13:59   0+00:00:05 C  8/30 14:01   /autohome/u96/b
260188.0   user123   8/30 14:00   0+00:00:05 C  8/30 14:01   /autohome/u96/b
user123@radon:~$
The column marked 'ST' gives the status of the job. 'C' means completed, and 'X' means the job was removed.
If you specified a log file in your submit description file, then the job exit status will be recorded there as well.
When your job has completed, Condor will send you an email message. This is the default behaviour, which can be changed with the submit description file command notification = Always | Complete | Error | Never. The different options mean:
- Always: the owner will be notified whenever the job produces a checkpoint, as well as when the job completes.
- Complete: (the default), the owner will be notified when the job terminates.
- Error: the owner will only be notified if the job terminates abnormally.
- Never: the owner will not receive e-mail, regardless of what happens to the job.
By default, Condor sends the email to the address defined by job-owner@UID_DOMAIN, usually the person logged into the machine and submitting the job. If this is not the email address that you wish Condor to use, then you can add the command notify_user = email-address to the submit description file.
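The four notification settings can be restated as a lookup from setting to the kinds of event that trigger an email. This is just a restatement of the list above in code form; the event labels "checkpoint", "normal_exit" and "abnormal_exit" are hypothetical names invented for this sketch and are not Condor identifiers.

```python
# Which (hypothetically named) events lead to an email under each setting.
NOTIFICATION_RULES = {
    "always": {"checkpoint", "normal_exit", "abnormal_exit"},
    "complete": {"normal_exit", "abnormal_exit"},
    "error": {"abnormal_exit"},
    "never": set(),
}


def sends_email(notification, event):
    """Return True if the given notification setting sends mail for the event."""
    setting = notification.strip().lower()
    if setting not in NOTIFICATION_RULES:
        raise ValueError(f"unknown notification setting: {notification!r}")
    return event in NOTIFICATION_RULES[setting]


if __name__ == "__main__":
    for setting in ("Always", "Complete", "Error", "Never"):
        print(setting, sends_email(setting, "normal_exit"))
```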
8. Condor Universes
Condor allows several types of jobs, but the most used are "standard" and "vanilla". Standard jobs can be checkpointed and migrated from system to system transparently by Condor - jobs can be moved from node to node without restarting. However, for a code to be submitted as a standard job it must be recompiled using various Condor-specific compiler options and libraries. An application must also conform to a few other restrictions in order to run in the standard universe.
Those programs that cannot be recompiled can be submitted as vanilla jobs. Virtually any non-parallel program can be submitted. Vanilla jobs cannot be checkpointed. If a node ceases to be idle any running vanilla jobs may be suspended or killed (to be restarted elsewhere).
Under Windows, only vanilla jobs are allowed.
8.1 Choosing a Condor Universe
A universe in Condor defines an execution environment. Condor Version 7.0.0 supports several different universes for user jobs:
- Standard: The standard universe provides migration and reliability, but has some restrictions on the programs that can be run.
- Vanilla: The vanilla universe provides fewer services, but has very few restrictions.
- MPI: The MPI universe is for programs written to the MPICH interface. See section 2.10 of the Condor manual for more about MPI and Condor. The MPI universe has been superseded by the Parallel universe.
- Globus or Grid: The Globus or Grid universe allows users to submit jobs using Condor's interface. These jobs are submitted for execution on grid resources. For Globus jobs, see http://www.globus.org for more information.
- Java: The Java universe allows users to run jobs written for the Java Virtual Machine (JVM).
- Scheduler: The scheduler universe allows users to submit lightweight jobs to be spawned by the condor_schedd on the submit host itself.
- Local: The local universe allows a Condor job to be submitted and executed with different assumptions for the execution conditions of the job.
- Parallel: The Parallel universe is for programs that require multiple machines for one job. See section 2.10 for more about the Parallel universe.
The Universe attribute is specified in the submit description file. If a universe is not specified, the default is standard.
See chapter 2.4.1 of the Condor manual for more details about the different universes.
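Because the default universe is standard, a job that has not been relinked for Condor should name its universe explicitly. The following is a minimal Python sketch, not part of Condor itself, that builds the text of a submit description file with an explicit Universe line. The function name render_submit and its parameters are hypothetical; the universe names are taken from the list above, lowercased.

```python
# Universe names taken from the list in this section, lowercased.
KNOWN_UNIVERSES = {
    "standard", "vanilla", "mpi", "globus", "grid",
    "java", "scheduler", "local", "parallel",
}


def render_submit(executable, universe, arguments="",
                  output="job.out", log="job.log"):
    """Return submit description text that always names its universe."""
    name = universe.lower()
    if name not in KNOWN_UNIVERSES:
        raise ValueError(f"unknown universe: {universe!r}")
    lines = [f"Executable = {executable}"]
    if arguments:
        lines.append(f"Arguments = {arguments}")
    lines += [
        f"Universe = {name}",
        f"Output = {output}",
        f"Log = {log}",
        "Queue",
    ]
    return "\n".join(lines) + "\n"


# Example: a vanilla job that runs /bin/echo.
print(render_submit("/bin/echo", "vanilla", arguments="hello world",
                    output="hello.out", log="hello.log"))
```

The printed text can be saved to a file and passed to condor_submit.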
9. Limitations on Jobs which can be Checkpointed
Condor is able to schedule and run any type of process, but it does have some limitations on the jobs that it can transparently checkpoint and migrate:
- Multi-process jobs are not allowed. This includes system calls such as fork(), exec(), and system().
- Interprocess communication is not allowed. This includes pipes, semaphores, and shared memory.
- Network communication must be brief. A job may make network connections using system calls such as socket(), but a network connection left open for long periods will delay checkpointing and migration.
- Sending or receiving the SIGUSR2 or SIGTSTP signals is not allowed. Condor reserves these signals for its own use. Sending or receiving all other signals is allowed.
- Alarms, timers, and sleeping are not allowed. This includes system calls such as alarm(), getitimer(), and sleep().
- Multiple kernel-level threads are not allowed. However, multiple user-level threads are allowed.
- Memory mapped files are not allowed. This includes system calls such as mmap() and munmap().
- File locks are allowed, but not retained between checkpoints.
- All files must be opened read-only or write-only. A file opened for both reading and writing will cause trouble if a job must be rolled back to an old checkpoint image. For compatibility reasons, a file opened for both reading and writing will result in a warning but not an error.
- A fair amount of disk space must be available on the submitting machine for storing a job's checkpoint images. A checkpoint image is approximately equal to the virtual memory consumed by a job while it runs. If disk space is short, a special checkpoint server can be designated for storing all the checkpoint images for a pool.
- On Digital Unix (OSF/1), HP-UX, and Linux, your job must be statically linked. Dynamic linking is allowed on all other platforms.
Note: these limitations only apply to jobs which Condor has been asked to transparently checkpoint. If job checkpointing is not desired, the limitations above do not apply.
Note: Jobs Need to be Re-linked to get Checkpointing and Remote System Calls: Although typically no source code changes are required, Condor requires that the jobs be re-linked with the Condor libraries to take advantage of checkpointing and remote system calls. This often precludes commercial software binaries from taking advantage of these services because commercial packages rarely make their object code available. Condor's other services are still available for these commercial packages.
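Before relinking a program for the standard universe, it can help to scan its source for the calls named in the limitations above. The following Python sketch is a rough heuristic only: the function name and the rule labels are hypothetical, it matches text rather than parsing C, so it will also flag calls that appear in comments or strings, and it will miss calls made indirectly or through libraries. It covers only the calls the list names explicitly (the exec() entry is matched as the exec family, such as execv and execl).

```python
import re

# Calls named in the limitations list, grouped by the rule they break.
DISALLOWED = {
    "multi-process": r"fork|system|exec[lv]p?e?",
    "alarms/timers/sleeping": r"alarm|getitimer|sleep",
    "memory-mapped files": r"mmap|munmap",
}


def find_checkpoint_blockers(source):
    """Return sorted (line_number, call_name, rule) tuples for suspicious calls."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for rule, names in DISALLOWED.items():
            # \b keeps e.g. "vfork" or "usleep" from matching "fork" / "sleep".
            for match in re.finditer(rf"\b({names})\s*\(", line):
                hits.append((lineno, match.group(1), rule))
    return sorted(hits)


sample = """#include <unistd.h>
int main(void) {
    if (fork() == 0) { sleep(5); }
    return 0;
}
"""

for lineno, call, rule in find_checkpoint_blockers(sample):
    print(f"line {lineno}: {call}() breaks the '{rule}' rule")
```

A clean scan does not guarantee that a job can be checkpointed; it only rules out the most obvious problems.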
10. Notes, problems and errors
10.1 Condor test
To test Condor:
Create a Condor script (a submit description file). Here's a simple one:
    ########################
    # Submit description file for hello program
    ########################
    Executable = /bin/echo
    Arguments = hello woooorld
    Universe = vanilla
    Output = hello.out
    Log = hello.log
    Queue
Submit your script to Condor:
$ condor_submit <condorscript>
Your output will be in "hello.out" and "hello.log" will contain a log of your job.
10.2 Why does the job not run?
Users sometimes find that their jobs do not run. There are several reasons why a specific job does not run. These reasons include failed job or machine constraints, bias due to preferences, insufficient priority, and the preemption throttle that is implemented by the condor_negotiator to prevent thrashing. Many of these reasons can be diagnosed by using condor_q -analyze.
Example: Let us assume that a job (assigned the job id 100.0) submitted to the local pool is not running. Running condor_q's analyzer provided the following information:
    -bash-2.05b$ condor_q -analyze 100.0
    -- Submitter: lear.rcac.purdue.edu : <128.211.128.239:32788> : lear.rcac.purdue.edu
     ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
    ---
    100.000: Run analysis summary. Of 1016 machines,
       1016 are rejected by your job's requirements
          0 reject your job because of their own requirements
          0 match but are serving users with a better priority in the pool
          0 match but reject the job for unknown reasons
          0 match but will not currently preempt their existing job
          0 are available to run your job
       No successful match recorded.
       Last failed match: Fri Sep 1 13:37:18 2006
       Reason for last match failure: no match found
    WARNING: Be advised: No resources matched request's constraints
       Check the Requirements expression below:
    Requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (TARGET.FileSystemDomain == MY.FileSystemDomain)
    -bash-2.05b$
As can be seen, the requirements were too stringent and no machine could satisfy them.
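When many jobs need to be checked, the analyzer summary can be read by a script. The following Python sketch assumes the summary has the same layout as the output shown above (other Condor versions may word it differently); the function names are hypothetical. It collects the machine count for each reason line and reports whether every machine was rejected by the job's own requirements.

```python
import re

# Summary lines copied from the condor_q -analyze output in this section.
SAMPLE = """\
100.000: Run analysis summary. Of 1016 machines,
   1016 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      0 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
"""


def summarize_analysis(text):
    """Map each reason line of the analyzer summary to its machine count."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\d+)\s+((?:are|reject|match)\b.*)", line)
        if match:
            counts[match.group(2).strip()] = int(match.group(1))
    return counts


def requirements_too_strict(text):
    """True when every machine in the pool was rejected by the job's requirements."""
    total = re.search(r"Of (\d+) machines", text)
    rejected = summarize_analysis(text).get(
        "are rejected by your job's requirements", 0)
    return total is not None and rejected == int(total.group(1))


if requirements_too_strict(SAMPLE):
    print("No machine satisfies the Requirements expression; relax it.")
```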
Another common problem is that the job does not have a high enough priority to cause other jobs to be preempted.
While the analyzer can diagnose most common problems, there are some situations that it cannot reliably detect due to the instantaneous and local nature of the information it uses to detect the problem. Thus, it may be that the analyzer reports that resources are available to service the request, but the job still does not run. In most of these situations, the delay is transient, and the job will run during the next negotiation cycle.
If the problem persists and the analyzer is unable to detect the situation, it may be that the job begins to run but immediately terminates due to some problem. Viewing the job's error and log files (specified in the submit command file) and Condor's SHADOW_LOG file may assist in tracking down the problem. If the cause is still unclear, please contact your system administrator.
10.3 Why will my vanilla jobs only run on the machine where I submitted them from?
If your vanilla jobs will only run on the machine where you submitted them, check the following:
- Did you submit the job from a local file system that other computers can't access? See section 3.3.7 of the Condor manual for more details.
- Did you set a special requirements expression for vanilla jobs that's preventing them from running but not other jobs? See section 3.3.7 of the Condor manual for more details.
- Is Condor running as a non-root user? See section 3.6.12.1 of the Condor manual for more details.
10.4 My job starts but exits right away with signal 9.
This can occur when the machine your job is running on is missing a shared library required by your program. One solution is to install the shared library on all machines the job may execute on. Another, easier, solution is to try to re-link your program statically so it contains all the routines it needs.
10.5 Why aren't any or all of my jobs running?
A common problem is that people have submitted a number of jobs to their pool, but only some of them appear to be running, even though there are many free machines available. To solve this problem, try the following steps:
- Run condor_q -analyze and see what it says.
- Look at the User Log file (whatever you specified as "log = XXX" in the submit file). See if the jobs are starting to run but then exiting right away, or if they never even start.
- Look at the SchedLog on the submit machine after it negotiates for this user. If a user doesn't have enough priority to get more machines the SchedLog will contain a message like "lost priority, no more jobs".
- If jobs are successfully being matched with machines, they still might be dying when they try to execute due to file permission problems or the like. Check the ShadowLog on the submit machine for warnings or errors.
- Look at the NegotiatorLog during the negotiation for the user. Look for messages about priority, "no more machines", or similar.
- Another problem shows itself with statements within the log file produced by the condor_schedd daemon (given by $(SCHEDD_LOG)) that say the following:
12/3 17:46:53 Swap space estimate reached! No more jobs can be run!
12/3 17:46:53 Solution: get more swap space, or set RESERVED_SWAP = 0
12/3 17:46:53 0 jobs matched, 1 jobs idle
Condor computes the total swap space on your submit machine and then tries to limit the total number of jobs it will spawn based on an estimate of the size of the condor_shadow daemon's memory footprint and a configurable amount of swap space that should be reserved. This is done to avoid the situation within a very large pool in which all the jobs are submitted from a single host: the huge number of condor_shadow processes would overwhelm the submit machine, which would run out of swap space and thrash.
Things can go wrong if a machine has a lot of physical memory and little or no swap space. Condor does not consider the physical memory size, so the situation occurs where Condor thinks it has no swap space to work with, and it will not run the submitted jobs.
To see how much swap space Condor thinks a given machine has, type: condor_status -schedd [hostname] -long | grep VirtualMemory. Look at the output. If the value listed is 0, then this is what is confusing Condor. There are two ways to fix the problem:
- Configure your machine with some real swap space.
- Disable this check within Condor by setting the amount of reserved swap space for the submit machine to 0. Add RESERVED_SWAP = 0 to the configuration file, and then send a condor_restart to the submit machine.
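The VirtualMemory check described above can also be done from a script. The following Python sketch reads text in the "Attribute = value" form printed by condor_status -long and reports whether VirtualMemory is 0. The function names are hypothetical, and the sample ad is an invented excerpt (including the host name and the other attributes), not real output.

```python
import re


def virtual_memory(classad_text):
    """Return the integer VirtualMemory value found in the text, or None."""
    match = re.search(r"^\s*VirtualMemory\s*=\s*(\d+)\s*$",
                      classad_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def swap_looks_misreported(classad_text):
    """True when the ad reports VirtualMemory = 0, the case described above."""
    return virtual_memory(classad_text) == 0


# Hypothetical excerpt of `condor_status -schedd <hostname> -long` output.
sample_ad = 'Name = "submit.example.edu"\nVirtualMemory = 0\nTotalIdleJobs = 1\n'

if swap_looks_misreported(sample_ad):
    print("Condor sees no swap space: add swap or set RESERVED_SWAP = 0")
```

In practice the text would come from the condor_status command given above rather than from a string in the script.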
10.6 Why might my job be preempted (evicted)?
There are four circumstances under which Condor may evict a job. They are controlled by different expressions.
- User priority: controlled by the PREEMPTION_REQUIREMENTS expression in the configuration file. If there is a job from a higher priority user sitting idle, the condor_negotiator daemon may evict a currently running job submitted from a lower priority user if PREEMPTION_REQUIREMENTS is True.
- Owner (machine) policy: controlled by the PREEMPT expression in the configuration file. When a job is running and the PREEMPT expression evaluates to True, the condor_startd will evict the job. The PREEMPT expression should reflect the requirements under which the machine owner will not permit a job to continue to run.
- Owner (machine) preference: controlled by the RANK expression in the configuration file (sometimes called the startd rank or machine rank). The RANK expression is evaluated as a floating point number. When one job is running, a second idle job that evaluates to a higher RANK value tells the condor_startd to prefer the second job over the first. Therefore, the condor_startd will evict the first job so that it can start running the second (preferred) job.
- Condor shutdown: if Condor is to be shut down on a machine that is currently running a job, Condor evicts the currently running job before proceeding with the shutdown.
10.7 Why does the time output from condor_status appear as [?????] ?
As Condor runs, it collects timing information for many different uses. The collection of data depends on accurate times. Because Condor runs on a distributed system, the clocks on the different machines may be skewed relative to one another, leading to errors in the timing calculations. Since the resulting values can be either too large or too small, a timing value may end up being calculated as negative.
When this happens while the user is looking at the output of condor_status, the ActivityTime field appears as [?????], since a negative time does not make sense.
To solve this problem, the clocks on the machines will have to be synchronized.
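The effect of clock skew can be illustrated with a short Python sketch. This is not the code condor_status uses; it is a toy model in which the elapsed time is the local clock reading minus a timestamp recorded on another machine, printed in a days+HH:MM:SS layout. The function name and the numbers are made up for illustration.

```python
def format_activity_time(now, entered_activity):
    """Format elapsed whole seconds as days+HH:MM:SS, or [?????] if negative."""
    elapsed = now - entered_activity
    if elapsed < 0:
        # The remote clock is ahead of the local one, so the difference
        # is negative and no sensible duration can be shown.
        return "[?????]"
    days, rest = divmod(elapsed, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}+{hours:02d}:{minutes:02d}:{seconds:02d}"


# The remote machine stamped the activity start 90 seconds "in the future"
# relative to the local clock.
print(format_activity_time(now=1000, entered_activity=1090))
# With synchronized clocks the same calculation gives a normal duration.
print(format_activity_time(now=100000, entered_activity=10000))
```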
10.8 Where are my missing files?
This question comes up when the command when_to_transfer_output = ON_EXIT_OR_EVICT is in the submit description file.
Although it may appear as if files are missing, they are not. The transfer does take place whenever a job is preempted by another job, vacates the machine, or is killed. Look for the files in the directory defined by the SPOOL configuration variable. See section 2.5.4 of the manual for details on the naming of the intermediate files.
10.9 Useful tips and comments
- Don't queue up thousands and thousands of jobs in a queue. Use DAGMan to divide your jobs into reasonably-sized chunks (500 jobs or so).
- Long jobs should run in the standard universe, not in the vanilla universe: without checkpointing, a long job that is evicted has to start over and may never finish.
- The standard universe is the most desirable because of checkpoint availability, but sub-processes are not possible. Scripts can be used as executables. It is also necessary to link with the Condor run-time library (use of Intel compilers is not possible). Only static linking works. Good for longer jobs because of the checkpoint availability.
- The vanilla universe is the only possibility for Windows machines. It only has preemption by suspension or eviction and is thus bad for long jobs, but OK for short jobs. (Eviction is when the owner of the cluster bumps your job; the job will then restart.) Vanilla jobs can use Intel compilers (may run 30%-40% faster). Thus the vanilla universe may even be faster for somewhat longer jobs, because the speed gain may be bigger than the advantage from the checkpoint availability.
- To maximize throughput and minimize the possibility of your job being preempted, try to size your jobs to complete in 60-90 minutes.
- Purdue has both a scavenging/preempting and a scheduling system. Remember that the Condor pool is very heterogeneous with respect to both processor versions and OS versions/types (Linux of different varieties and some Windows).
- It is a good idea to statically link your binaries when submitting to Condor, to eliminate potential issues with differing library versions on execution nodes.
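The sizing advice in the tips above (jobs of 60-90 minutes, submitted in chunks of about 500) comes down to simple arithmetic. The following Python sketch shows one way to do it; the function names, the 75-minute midpoint, and the example workload are assumptions for illustration, and the per-item run time has to be measured for your own program.

```python
import math


def items_per_job(seconds_per_item, target_minutes=75):
    """How many work items to bundle so one job runs for about target_minutes."""
    return max(1, int(target_minutes * 60 // seconds_per_item))


def chunk(items, size=500):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Hypothetical workload: 100000 items that each take about 90 seconds.
total_items = 100000
bundle = items_per_job(90)
jobs = list(range(math.ceil(total_items / bundle)))
batches = chunk(jobs)
print(f"{bundle} items per job, {len(jobs)} jobs, {len(batches)} batches")
```

Each batch can then be submitted as its own group of jobs, for example as one node of a DAG.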
11. References
- Condor tutorials and slides from the 'Condor Boot Camp' that was held at Purdue University.
- The best source of documentation for using Condor can be found in the official Condor Manual at the University of Wisconsin.
- CondorView
- Resources in the Condor Pools
12. Examples
- Examples of Condor submit description files
- Compiling: condor_compile <compiler> <program.extension> -o <program name>
- Submitting the job: condor_submit <submit description file>