Running Jobs on TeraGrid

Overview of Running Jobs on TeraGrid

TeraGrid sites use several different local systems for managing jobs and queues. However, all support the Globus protocol for job submission and, by extension, Condor-G job submission. The following are specific instructions on how to use some of these to submit jobs either to Purdue resources or from Purdue to other TeraGrid sites.

Overview of Condor

Condor is one of several submission systems Purdue supports that you may use to run jobs on TeraGrid sites. Condor provides a framework for running programs on otherwise idle computers. While this has serious limitations for parallel jobs and programs with large I/O or memory requirements, Condor can provide a very large quantity of cycles for researchers who need to run hundreds or even thousands of smaller jobs. Condor may be used both to submit jobs to Purdue resources from outside Purdue and, from Purdue hosts, to submit jobs either locally or to other TeraGrid sites. Here are a couple of other references on Condor use:

General Tips on Condor Use

Do not queue up thousands of jobs at once. Use DAGMan to divide your jobs into reasonably-sized chunks. Overburdening the queue can slow down or even kill the scheduler.

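Long jobs need to be run in the "standard" universe, not in the "vanilla" universe. Because Condor runs jobs on otherwise idle machines, a job can be interrupted whenever a machine stops being idle. In the vanilla universe, unless your application supports check-pointing, your jobs may never have enough continuous time to complete.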

To find out what machines and architectures are available and the status of all the pools at Purdue, use "condor_all". Note that "condor_all" is a Purdue-specific tool; it is not available at other sites:

bash-3.00$ condor_all
Pool emu.rcac.purdue.edu
               Total Owner Claimed Unclaimed Matched Preempting Backfill

   INTEL/LINUX   332   319       4         9       0          0        0
  X86_64/LINUX  7249  3680    3004       565       0          0        0

         Total  7581  3999    3008       574       0          0        0

For a brief summary of the current pools' availability, use "condor_pool":

bash-3.00$ condor_pool -t
POOL
-------
egret.rcac.purdue.edu             (Total=2539,Unused=1716)
broker.ics.purdue.edu             (Total=241,Unused=151)
condor.calumet.purdue.edu         (Total=348,Unused=255)
emu.rcac.purdue.edu               (Total=7581,Unused=594)
flamingo.rcac.purdue.edu          (Total=2594,Unused=683)

Submitting to Purdue Resources from Purdue Hosts

Submitting to Purdue resources from a Purdue host requires only basic Condor (not Condor-G).

All Condor jobs submitted to Purdue resources must specify the TeraGrid project allocation they are being run under. This is done by adding the following to your Condor submission file:

+TGProject = "YourProjectNumber"

Your project number will look something like "TG-XYZ123456". Other ways to specify this are to create the file "~/.tg_default_project" containing your project number (and nothing else) or to set the environment variable $DEFAULT_PROJECT to your project number.
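
For orientation, a complete submission file for this local (non-Grid) case might look like the sketch below. It is an illustration under assumptions rather than a Purdue-supplied template: the file name "local_example.condor", the choice of the vanilla universe, and the use of /bin/hostname as the executable are placeholders, and TG-XYZ123456 stands in for your own project number. The shell commands only write the file; the condor_submit line is left commented out.

```shell
#!/bin/sh
# Write a minimal Condor submission file for a local Purdue submission.
# File name, universe, and executable are illustrative placeholders.
cat > local_example.condor <<'EOF'
# local_example.condor
# Simple local Condor example (sketch)

# Short test job, so the vanilla universe is sufficient here.
universe = vanilla

# The executable to run, given as a full path.
executable = /bin/hostname

# TeraGrid allocation project (replace with your own project number).
+TGProject = "TG-XYZ123456"

# Filenames for standard output, standard error, and Condor log.
output = local_example.out
error = local_example.err
log = local_example.log

queue
EOF

# Submit it from a Purdue host once the project number is filled in:
# condor_submit local_example.condor
```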

Submitting to Purdue Resources over the Grid

Submitting to Purdue resources over the Grid (not from a Purdue host) requires the use of Condor-G. This is a Condor front-end to Globus. While Globus may be used directly, and we also provide instructions on direct Globus use, you may find it more convenient to use Condor to manage the Globus submission(s) for you. This is what Condor-G does.

All Condor-G jobs submitted to Purdue resources must specify the TeraGrid project allocation they are being run under. This is done by adding the following to your Condor-G submission file:

GlobusRSL = (project=YourProjectNumber)

Your project number will look something like "TG-XYZ123456". Other ways to specify this are to create the file "~/.tg_default_project" containing your project number (and nothing else) or to set the environment variable $DEFAULT_PROJECT to your project number.

In order to have access to Condor-G, you will need to add it to your environment via softenv:

bash-3.00$ soft add +condor-g

You will also need to have a currently valid proxy certificate.
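
One way to check for a usable proxy before submitting is sketched below. It assumes the standard Globus command "grid-proxy-info" is on your PATH and that one hour of remaining lifetime is enough for your job; both the function name "check_proxy" and the one-hour threshold are choices made for this illustration. How you obtain the proxy in the first place depends on your site's setup, so the hint about grid-proxy-init is only one possibility.

```shell
#!/bin/sh
# Report whether a proxy certificate with at least one hour left exists.
# Assumes the Globus command-line tools are on your PATH.
check_proxy() {
    if ! command -v grid-proxy-info >/dev/null 2>&1; then
        echo "grid-proxy-info not found; is Globus in your environment?"
        return 2
    fi
    # -exists -valid H:M exits 0 only if the proxy is valid for that long.
    if grid-proxy-info -exists -valid 1:00 >/dev/null 2>&1; then
        echo "proxy is valid for at least one hour"
    else
        echo "no usable proxy; create one (for example with grid-proxy-init)"
        return 1
    fi
}

check_proxy || true
```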

Below is an example Condor-G submission to give you an idea of how to get started.

Submitting to (Non-Purdue) Grid Resources from Purdue Hosts

Submitting to Non-Purdue Grid resources from a Purdue host requires the use of Condor-G. This is a Condor front-end to Globus. While Globus may be used directly, and we also provide instructions on direct Globus use, you may find it more convenient to use Condor to manage the Globus submission(s) for you. This is what Condor-G does.

All Condor-G jobs submitted to TeraGrid resources must specify the TeraGrid project allocation they are being run under. This is done by adding the following to your Condor-G submission file:

GlobusRSL = (project=YourProjectNumber)

Your project number will look something like "TG-XYZ123456". Other ways to specify this are to create the file "~/.tg_default_project" containing your project number (and nothing else) or to set the environment variable $DEFAULT_PROJECT to your project number.

In order to have access to Condor-G, you will need to add it to your environment via softenv:

bash-3.00$ soft add +condor-g

You will also need to have a currently valid proxy certificate.

Below is an example Condor-G submission to give you an idea of how to get started.

Example Condor-G Submission Script

To run a Condor-G job you must write a Condor submission script. Here is a simple example:

#
# example.condor
# Simple Condor-G Example
#

# Specify your TeraGrid allocation project here.
globusrsl = (project=TG-XYZ123456)

# Submissions over the Grid must use the "globus" universe.
universe = globus

# The executable to run.  Need the full path.  ~/ does not work.
executable = /bin/hostname

# Command-line arguments to the executable.
arguments = 1 2 3

# false:  The executable is already on remote machine.
# true:   Copy the executable from the local machine to the remote.
transfer_executable = false

# Where to submit the job.  See the "Resources" page for local jobmanagers.
globusscheduler = tg-steele.purdue.teragrid.org/jobmanager-pbs

# Filenames for standard output, standard error, and Condor log.
output = example.out
error = example.err
log = example.log

# The following line is always required.  It is the command to submit the above.
queue

Condor Job Submission

To submit a job, run "condor_submit" and provide the Condor submission script filename:

bash-3.00$ condor_submit example.condor
Submitting job(s)...
Logging submit event(s)...
1 job(s) submitted to cluster 5890.
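
If you submit from a script, you may want to keep the cluster number for later use with condor_q or condor_rm. The sketch below pulls it out of output in the format shown above; the function name "cluster_id" is made up for this example, and the parsing assumes the "submitted to cluster N." wording does not differ on your Condor version. It is demonstrated on the sample text rather than on a live submission.

```shell
#!/bin/sh
# Print the cluster number found in condor_submit output read from stdin.
cluster_id() {
    sed -n 's/.*submitted to cluster \([0-9][0-9]*\)\..*/\1/p'
}

# Demonstration using the sample output shown above instead of a live submit.
printf '%s\n' \
    'Submitting job(s)...' \
    'Logging submit event(s)...' \
    '1 job(s) submitted to cluster 5890.' | cluster_id

# With Condor available, the same function would be used as:
# cluster=$(condor_submit example.condor | cluster_id)
```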

Condor Job Status

The command "condor_q" will report the progress of your job in the queue:

bash-3.00$ condor_q

-- Submitter: tg-steele.rcac.purdue.edu : <128.211.143.238:32775> : tg-steele.rcac.purdue.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  40.0   user123        11/20 12:36   0+00:00:00 H  0   0.0  example_hung
  40.1   user123        11/20 12:36   0+00:00:00 H  0   0.0  example_hung
  40.2   user123        11/20 12:36   0+00:00:00 H  0   0.0  example_hung
  57.0   user123        12/19 10:21   0+00:01:49 R  0   0.0  example_works
  57.1   user123        12/19 10:21   0+00:00:00 I  0   0.0  example_nomatch

5 jobs; 1 idle, 1 running, 3 held
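
When the queue is long, a per-state tally can be easier to read than the full listing. The following sketch counts jobs by the ST column (H for held, R for running, I for idle). It assumes the column layout shown above, in which the SUBMITTED date and time occupy two whitespace-separated fields so that the state is the sixth field; "count_states" is a hypothetical helper name.

```shell
#!/bin/sh
# Count jobs per state (the ST column) in condor_q output read from stdin.
# Job lines are recognized by an ID of the form cluster.process.
count_states() {
    awk '$1 ~ /^[0-9]+\.[0-9]+$/ { n[$6]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Demonstration on job lines like those listed above.
count_states <<'EOF'
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  40.0   user123        11/20 12:36   0+00:00:00 H  0   0.0  example_hung
  40.1   user123        11/20 12:36   0+00:00:00 H  0   0.0  example_hung
  40.2   user123        11/20 12:36   0+00:00:00 H  0   0.0  example_hung
  57.0   user123        12/19 10:21   0+00:01:49 R  0   0.0  example_works
  57.1   user123        12/19 10:21   0+00:00:00 I  0   0.0  example_nomatch
EOF

# With Condor available:
# condor_q | count_states
```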

If a job has not started running after some time, you can try to find out why using "condor_q -better-analyze". This will report whether your job failed to match any resources and, if so, which of the job's constraints no machine could meet:

bash-3.00$ condor_q -better-analyze 57.1

Condor Job Cancellation

To cancel a job, use the "condor_rm" command and the ID of the job from "condor_q":

bash-3.00$ condor_rm 57.1

Condor DAG (Workflows)

A Condor DAG or Directed Acyclic Graph is a way of submitting jobs which depend on each other's completion. This can be used to create a workflow, where job A must complete before job B can start, or to batch up large numbers of unrelated jobs, so that each set of 100 jobs will wait for the previous set of 100 jobs to complete before starting, or any combination of these arrangements. As a result, Condor DAGs can be extremely powerful and useful, and are highly encouraged. DAGMan is the Directed Acyclic Graph Manager and is used to create and submit Condor DAGs.

Here is an example Condor DAG:

Job 1 example1.condor
Job 2 example2.condor
Job 3 example3.condor
PARENT 1 CHILD 2
PARENT 2 CHILD 3

Each of the files "example1.condor", "example2.condor", and "example3.condor" is a Condor submission script as explained above.

To submit this DAG, use the "condor_submit_dag" command:

bash-3.00$ condor_submit_dag example.dag

Checking all your submit files for log file names.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor        : example.dag.condor.sub
Log of DAGMan debugging messages              : example.dag.dagman.out
Log of Condor library debug messages          : example.dag.lib.out
Log of the life of condor_dagman itself       : example.dag.dagman.log

Condor Log file for all jobs of this DAG      : /home/rcac/user123/dagtest/example1.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 58.
-----------------------------------------------------------------------
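
The batching arrangement mentioned at the start of this section (each set of jobs waiting for the previous set) can be generated rather than written by hand. The sketch below is one way to do so, under stated assumptions: the helper "make_batched_dag", the job names j1, j2, and so on, and the output file "batched.dag" are all invented for this example, and every job points at the same submission script, which is only sensible if that script distinguishes its runs in some other way. It relies on DAGMan accepting several parents and several children on a single PARENT ... CHILD line.

```shell
#!/bin/sh
# Hypothetical helper: write a DAG in which jobs run in batches,
# each batch waiting for the whole previous batch to finish.
# Usage: make_batched_dag TOTAL_JOBS BATCH_SIZE SUBMIT_FILE
make_batched_dag() {
    total=$1; batch=$2; submit=$3
    i=1
    prev=""      # job names in the previous batch
    cur=""       # job names in the batch being built
    while [ "$i" -le "$total" ]; do
        echo "Job j$i $submit"
        cur="$cur j$i"
        # Close the batch when it is full or the last job is reached.
        if [ $((i % batch)) -eq 0 ] || [ "$i" -eq "$total" ]; then
            if [ -n "$prev" ]; then
                echo "PARENT$prev CHILD$cur"
            fi
            prev=$cur
            cur=""
        fi
        i=$((i + 1))
    done
}

# Small demonstration: 6 jobs in batches of 2.
make_batched_dag 6 2 example.condor > batched.dag
cat batched.dag

# With Condor available:
# condor_submit_dag batched.dag
```

For real workloads you would call it with larger numbers, for example 1000 jobs in batches of 100.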

Just as with ordinary Condor above, the status of the DAG can be checked with "condor_q" and a DAG can be removed with "condor_rm". Some further DAG references you may find helpful are:

Running Commercial Applications via Condor

It is possible to use Condor to run a number of commercial applications. However, which software packages are installed on different TeraGrid sites and how they are configured varies widely. There are also often licensing problems with using commercial applications outside of Purdue. Below is information about R, which is currently the only specific package that can be used by non-Purdue affiliates on Purdue's Steele Cluster and Condor Pool resources.

Specific Commercial Application Instructions
R: on Linux

Overview of Globus

Globus is one of several submission systems Purdue supports that you may use to run jobs on TeraGrid sites. Globus provides a framework for job submission and management over the Internet using certificate-based authentication credentials. This is the de facto standard means of issuing jobs over most Grids, including TeraGrid. However, Condor-G provides a front-end to the Globus protocol with some additional features and is generally found by users to be simpler to use. That said, users do have the option to use Globus directly instead. Globus may be used both to submit jobs to Purdue resources from outside Purdue and, from Purdue hosts, to submit jobs either locally or to other TeraGrid sites. TeraGrid supports Globus Toolkit 4.0, which also includes file transfer and resource description tools. Here is another reference on Globus use:

Globus Authentication Test

If you are unsure your proxy certificate is being accepted, or that the remote Globus gatekeeper is responsive, you may wish to test using a simple authentication-only check:

bash-3.00$ globusrun -a -r tg-steele.purdue.teragrid.org/jobmanager-fork 

GRAM Authentication test successful

Globus Job Submission

To submit a job, there are three distinct commands you may use, each offering different functionality. To start a job, wait for its completion, and see the output as it runs, use "globus-job-run". To submit a batch job and not wait for it, use "globus-job-submit". Both of these take the script or executable you wish to run as an argument, and any Globus RSL parameters (such as your project number, the number of nodes requested, or the type of machine needed) must also be specified on the command line. The third option is "globusrun", which takes an RSL file as an argument; this RSL file may contain all the RSL parameters that would otherwise go on the command line. For quick tests, you may wish to use one of the "globus-job-*" commands, but if you want to save how you submitted a job for future reuse, you should construct an RSL submission file and use "globusrun".

All Globus jobs submitted to Purdue resources must specify the TeraGrid project allocation they are being run under. This is done by using the "project" RSL parameter:

(project = "YourProjectNumber")

Your project number will look something like "TG-XYZ123456". Another way to specify this is to set the environment variable $DEFAULT_PROJECT to your project number.

globus-job-run:

Use "globus-job-run" to start a quick job, wait for its completion, and view output as it runs:

bash-3.00$ globus-job-run tg-steele.purdue.teragrid.org/jobmanager-pbs \
                          -x '&(project=TG-XYZ123456)' /bin/hostname

For convenience, "globus-job-run" also offers the ability to access a local file using "-s" (this is done using GASS behind the scenes):

bash-3.00$ globus-job-run tg-steele.purdue.teragrid.org/jobmanager-pbs \
                          -x '&(project=TG-XYZ123456)' -s my_script

globus-job-submit:

You may use "globus-job-submit" to submit a batch job, although all RSL parameters must be specified on the command line. It returns a contact string, which is a URL unique to your job, and is used by other commands to manage this job:

bash-3.00$ globus-job-submit tg-steele.purdue.teragrid.org/jobmanager-pbs \
                             -x '&(project=TG-XYZ123456)' /bin/hostname
https://tg-steele.purdue.teragrid.org:42396/6249/1164739382/

Executables are not automatically copied over or accessed remotely when using globus-job-submit. To provide any local files, you must use GASS (direct remote access) or GridFTP (copy files in advance).

globusrun:

You may use "globusrun" to submit a batch job and provide an RSL file as input which contains all the job parameters, executable, output filenames, etc. It returns a contact string, which is a URL unique to your job, and is used by other commands to manage this job:

bash-3.00$ globusrun -r tg-steele.purdue.teragrid.org/jobmanager-pbs -f example.rsl
https://tg-steele.purdue.teragrid.org:3768/sdfkhkdfhg/ououjko/wouiu/

For convenience, "globusrun" also offers the ability to access a local file using "-s" (this is done using GASS behind the scenes):

bash-3.00$ globusrun -s -r tg-steele.purdue.teragrid.org/jobmanager-pbs -f example.rsl
https://tg-steele.purdue.teragrid.org:42396/6249/1164739382/
DONE

Example Globus RSL File

When submitting a Globus job you may wish to predefine and save all your job parameters in a file. This can be done using a Resource Specification Language (RSL) file, which may then be submitted using globusrun. Here is a simple example:

& (project=TG-XYZ123456)
  (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0))
  (jobtype=single)
  (executable="/bin/hostname")
  (stdout="example.out")
  (stderr="example.err")

For more information on RSL, you may refer to the Official Globus RSL Documentation.
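
As a further illustration, an RSL file can also carry scheduling requests. The sketch below writes a variant of the example that asks for a particular queue and a wall-clock limit. It assumes the remote job manager honors the GRAM attributes "queue" and "maxWallTime" (given in minutes); the queue name tg_workq is taken from the PBS section of this page, and the file name "example_limits.rsl" and the 30-minute limit are placeholders. Check the RSL documentation and the site's own pages before relying on these attributes.

```shell
#!/bin/sh
# Write an RSL file that also requests a queue and a wall-clock limit.
# The quoted 'EOF' keeps the shell from altering the RSL text.
cat > example_limits.rsl <<'EOF'
& (project=TG-XYZ123456)
  (jobtype=single)
  (queue=tg_workq)
  (maxWallTime=30)
  (executable="/bin/hostname")
  (stdout="example_limits.out")
  (stderr="example_limits.err")
EOF

# Submit it in the same way as example.rsl:
# globusrun -r tg-steele.purdue.teragrid.org/jobmanager-pbs -f example_limits.rsl
```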

Globus Job Status

The command "globus-job-status" with your job's contact string (URL) will report the progress of your job:

bash-3.00$ globus-job-status https://tg-steele.purdue.teragrid.org:42396/6249/1164739382/
DONE
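
For batch jobs you may prefer to poll the status until the job finishes instead of re-running the command by hand. The loop below is a minimal sketch of that idea. The names "wait_for_job", STATUS_CMD, and POLL_SECONDS are hypothetical and exist only in this example; STATUS_CMD defaults to globus-job-status but can be overridden so the loop can be tried on a machine without Globus. It assumes the status command prints a single word such as PENDING, ACTIVE, DONE, or FAILED.

```shell
#!/bin/sh
# Poll a job's status until it is DONE or FAILED, or until we give up.
# STATUS_CMD and POLL_SECONDS are hypothetical knobs for this sketch.
STATUS_CMD=${STATUS_CMD:-globus-job-status}
POLL_SECONDS=${POLL_SECONDS:-30}

wait_for_job() {
    contact=$1
    tries=${2:-120}          # maximum number of status checks
    while [ "$tries" -gt 0 ]; do
        state=$($STATUS_CMD "$contact" 2>/dev/null) || state="UNKNOWN"
        echo "state: $state"
        case $state in
            DONE)   return 0 ;;
            FAILED) return 1 ;;
        esac
        tries=$((tries - 1))
        sleep "$POLL_SECONDS"
    done
    return 2                 # still not finished after all checks
}

# Example call, using the contact string returned at submission:
# wait_for_job https://tg-steele.purdue.teragrid.org:42396/6249/1164739382/
```

The return value distinguishes a finished job (0), a failed job (1), and a job that was still pending or active when the checks ran out (2).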

Globus Output Retrieval

Once your job is done, you can retrieve your output remotely using the "globus-job-get-output" command and the job's contact string (URL):

bash-3.00$ globus-job-get-output https://tg-steele.purdue.teragrid.org:42396/6249/1164739382/
tg-steele

Globus Job Cancellation and Clean-Up

To cancel a job and clean up a job's output, use the "globus-job-clean" command and the job's contact string (URL):

bash-3.00$ globus-job-clean https://tg-steele.purdue.teragrid.org:42396/6249/1164739382/

    WARNING: Cleaning a job means:
        - Kill the job if it still running, and
        - Remove the cached output on the remote resource

    Are you sure you want to cleanup the job now (Y/N) ? y

Cleanup successful.

Access to Local Storage by Remote Jobs (GASS)

Globus Access to Secondary Storage (GASS) is meant to simplify remote file I/O when using Globus. Typically, a GASS server is started on the local machine a user is submitting from, which has local access to the user's files. This server may be manually started by the user themselves, or automatically by using the "-s" option to globus-job-run or globusrun. This server can then serve files needed and allow file output back from the remote machine to which the user submits Globus jobs, with some caching on the remote machine. This is generally much easier than it may sound, and with the "-s" option, you may not be aware this is being done at all.

If using the "-s" option to globus-job-run or globusrun, you will also need to use the $GLOBUSRUN_GASS_URL environment variable in your job submission, as the exact GASS URL will not be known until the job is submitted. Note that the "/./" is required and tells the GASS server to use the directory the GASS server was started in rather than an absolute path. Here is an example of some RSL that uses this to specify files in the working directory of the local filesystem:

  (executable=$(GLOBUSRUN_GASS_URL)/./my_script.sh)
  (stdin=$(GLOBUSRUN_GASS_URL)/./my_input)
  (stdout=$(GLOBUSRUN_GASS_URL)/./my_output)
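
If you generate such an RSL file from a shell script, take care that the shell does not interpret $(GLOBUSRUN_GASS_URL) itself. The sketch below writes a complete file using a quoted here-document delimiter, which leaves the text untouched. The file name "gass_example.rsl" is a placeholder, and the project line is added on the assumption that the same project requirement applies as for the other examples on this page.

```shell
#!/bin/sh
# Write an RSL file that reads and writes local files through GASS.
# The delimiter is quoted ('EOF') so that the shell leaves
# $(GLOBUSRUN_GASS_URL) alone; unquoted, the shell would try to run it
# as a command substitution.
cat > gass_example.rsl <<'EOF'
& (project=TG-XYZ123456)
  (executable=$(GLOBUSRUN_GASS_URL)/./my_script.sh)
  (stdin=$(GLOBUSRUN_GASS_URL)/./my_input)
  (stdout=$(GLOBUSRUN_GASS_URL)/./my_output)
EOF

# Submit with -s so that globusrun starts the GASS server for you:
# globusrun -s -r tg-steele.purdue.teragrid.org/jobmanager-pbs -f gass_example.rsl
```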

To manually start a GASS server, run "globus-gass-server":

bash-3.00$ globus-gass-server
https://tg-steele.purdue.teragrid.org:50000

There are several possible options to the GASS server as well. Some of these are:

  • -r
    Enable read access to the local filesystem.
  • -w
    Enable write access to the local filesystem.
  • -t
    Expand the "~" in "/~/filename" to reference the user's home directory.
  • -help
    Display all options

Overview of PBS

PBS is one of several submission systems Purdue supports that you may use to run jobs at Purdue. Note: It is only possible to submit jobs to Purdue resources via PBS from a Purdue host. Some other TeraGrid resources may also offer local PBS access, but not all. While we encourage the use of Grid tools such as Globus and Condor-G, it may be useful to use PBS if you are currently having problems submitting jobs using Grid tools. Here is another reference on PBS use:

PBS Queues

A given resource may offer several queues for the same resources, each with different constraints such as maximum job duration, maximum memory usage, or maximum number of CPUs, and often with different wait times as a result. In general, try to choose the queue that just meets your job's requirements, so that the resource can be scheduled most efficiently and your job can run as soon as possible.

To list the queues available on a resource, use "qstat -q":

bash-3.00$ qstat -q

server: steele.rcac.purdue.edu

Queue            Memory CPU Time Walltime Node   Run   Que   Lm  State
---------------- ------ -------- -------- ---- ----- ----- ----  -----
tg_workq           --      --    720:00:0  --      0     3   --   D S
preemptdef         --      --       --     --      0     0   --   D S
standby            --      --    04:00:00  --      0     1   --   D S
testq              --      --    720:00:0  --      0     0   --   D S
                                               ----- -----
                                                   0     4

To see more details about the limits on each queue, use "qstat -Qf":

bash-3.00$ qstat -Qf

Queue: tg_workq
    queue_type = Execution
    Priority = 1000
    total_jobs = 3
    state_count = Transit:0 Queued:3 Held:0 Waiting:0 Running:0 Exiting:0 Begun:0 
    resources_max.walltime = 720:00:00
    resources_default.ncpus = 1
    resources_default.nodes = 1
    resources_default.walltime = 00:30:00
    acl_group_enable = True
    acl_groups = teragrid,tgusers,itap,pucc
    resources_available.ncpus = 224
    enabled = False
    started = False
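
The full listing is long when a resource has many queues. As a rough aid for choosing a queue, the sketch below reduces "qstat -Qf" output to one line per queue showing its maximum walltime. It assumes the "Queue:" and "resources_max.walltime = ..." layout shown above; queues with no such limit are simply omitted, and "queue_walltimes" is a hypothetical helper name.

```shell
#!/bin/sh
# Print each queue name with its maximum walltime, reading
# "qstat -Qf" output from stdin.
queue_walltimes() {
    awk '$1 == "Queue:" { q = $2 }
         $1 == "resources_max.walltime" { print q, $3 }'
}

# Demonstration on an excerpt of the listing above.
queue_walltimes <<'EOF'
Queue: tg_workq
    queue_type = Execution
    resources_max.walltime = 720:00:00
    resources_default.walltime = 00:30:00
EOF

# On a login node with PBS available:
# qstat -Qf | queue_walltimes
```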

Example PBS Submission Script

To run a PBS job you must write a PBS submission script. Here is a simple example:

#!/bin/sh
#
## example.pbs
## Simple PBS Example
#
#PBS -q tg_workq
#PBS -N myexample
#PBS -l nodes=10:ppn=2
#PBS -l walltime=0:50:00
#PBS -o example.out
#PBS -e example.err
#PBS -V
#
mkdir -p $TG_CLUSTER_SCRATCH/username/myexample
cd $TG_CLUSTER_SCRATCH/username/myexample
mpirun -v -machinefile $PBS_NODEFILE -np 20 $TG_CLUSTER_HOME/a.out
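
In the script above, the value given to "-np" (20) must be kept in step with the "nodes=10:ppn=2" request by hand. A common alternative is to derive it from the node file. The sketch below shows the counting step on a stand-in file; it assumes that $PBS_NODEFILE contains one line per processor slot, which is the usual PBS convention but worth confirming on the resource you use. The helper name "count_slots" and the demo file are inventions for this example.

```shell
#!/bin/sh
# Count the processor slots listed in a PBS node file (one line per slot).
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# Stand-in node file for demonstration: 10 nodes with 2 slots each.
demo_nodefile=$(mktemp)
i=1
while [ "$i" -le 10 ]; do
    echo "node$i"
    echo "node$i"
    i=$((i + 1))
done > "$demo_nodefile"

count_slots "$demo_nodefile"
rm -f "$demo_nodefile"

# Inside a real PBS script the same idea would read:
# NP=$(wc -l < "$PBS_NODEFILE")
# mpirun -v -machinefile $PBS_NODEFILE -np $NP $TG_CLUSTER_HOME/a.out
```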

PBS Job Submission

You must first log in to the head/login node of the resource. From there, you may submit a PBS submission script using the "qsub" command. You must also specify which queue you wish to submit to (see above for how to list available queues), and the TeraGrid allocation project number this job is being run under (a number of the form "TG-XYZ123456"):

bash-3.00$ qsub -q queue_name -A TG-XYZ123456 example.pbs

If you are a local Purdue user (using a Purdue career account), you may belong to other local Unix groups in addition to the "tgusers" Unix group. In order to submit jobs to TeraGrid, "tgusers" or "itap" must be your primary group. To determine your current primary group and secondary group memberships, use the "id" command:

bash-3.00$ id -Gn
groupa groupb tgusers groupc

The first group in this list is your primary group. If this is not "tgusers", you will need to specify you wish to submit as part of the "tgusers" group when using qsub:

bash-3.00$ qsub -W group_list=tgusers -q queue_name -A TG-XYZ123456 example.pbs
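
The group check can be folded into a small wrapper so that the extra option is added only when needed. This is a sketch, not a Purdue-provided tool: "qsub_group_args" is a hypothetical function that prints "-W group_list=tgusers" unless the primary group is already "tgusers" or "itap". It takes the group name as an optional argument so that it can be tried without changing your account, and otherwise asks "id -gn" for your primary group.

```shell
#!/bin/sh
# Print the extra qsub option needed when the primary group is neither
# "tgusers" nor "itap"; print nothing otherwise.
qsub_group_args() {
    primary=${1:-$(id -gn)}
    case $primary in
        tgusers|itap) ;;                        # nothing extra needed
        *) echo "-W group_list=tgusers" ;;
    esac
}

# Demonstration with an explicit group name.
qsub_group_args groupa

# On a login node, the result can be passed straight to qsub:
# qsub $(qsub_group_args) -q queue_name -A TG-XYZ123456 example.pbs
```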

PBS Job Status

The command "qstat" will report the progress of your job in the queue:

bash-3.00$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
515520.steele     myexample        me                       0 Q tg_workq
524630.steele     foo              someone                  0 Q standby
526698.steele     bar              someoneelse              0 Q tg_workq
526698.steele     myexample2       me                       0 Q tg_workq        

PBS Job Cancellation

To cancel a job, use the "qdel" command and the ID of the job from "qstat":

bash-3.00$ qdel 515520

Test/Probe Remote System Environment

To conduct a basic test of a remote system and retrieve some information about the environment there, save the following as the file "probe.sh" and then send a job to the remote system you wish to probe with this script as the executable:

#!/bin/sh
#
# probe.sh
# Basic Environment Probe
#
echo "************************************************************"
echo "Date/Time = `date '+%Y-%m-%d %T'`"
echo "Machine = `hostname`"
echo "User = `whoami`"
echo "Working Directory = `pwd`"
echo "Environment Variables ="
echo ""
echo "`env`"
echo "************************************************************"

It will report the date and time (when it ran), machine name (where it ran), the user (who it ran as), the working directory (what directory it was run from), and the full set of environment variables. This information may prove useful in constructing your submissions or in locating a problem with another submission.
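
As one possible way to send the probe, the sketch below writes a Condor-G submission file modeled on the earlier example, changed so that the local script is copied to the remote machine. It assumes "probe.sh" is in the current directory and uses tg-steele as the target only because that is the resource used elsewhere on this page; the file name "probe.condor" and the output file names are placeholders.

```shell
#!/bin/sh
# Write a Condor-G submission file that ships probe.sh to the remote site.
# Assumes probe.sh is in the current directory. The executable line needs
# a full path, so $PWD is expanded by the shell here (unquoted EOF).
cat > probe.condor <<EOF
# probe.condor
globusrsl = (project=TG-XYZ123456)
universe = globus
executable = $PWD/probe.sh
# Copy the script from the local machine to the remote one.
transfer_executable = true
globusscheduler = tg-steele.purdue.teragrid.org/jobmanager-pbs
output = probe.out
error = probe.err
log = probe.log
queue
EOF

# Make the script executable, then submit:
# chmod +x probe.sh
# condor_submit probe.condor
```

Once the job completes, the report should appear in probe.out.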