Purdue offers several resources to TeraGrid users. This document provides detailed information on how to obtain an account and start using them. For more targeted information for new or potential TeraGrid users, refer to the new user guide.
TeraGrid grants allocations to research projects that have been approved for a specified amount of compute cycles on TeraGrid resources. How you obtain a TeraGrid account depends on whether you need a brand new allocation or only need to be added as a user of an existing allocation.
Regardless of the allocation, once your initial TeraGrid account has been approved and created, you will receive a startup package via regular mail. This package contains your account information for all the sites to which you requested access. It also contains a User Responsibility Form. The last page of this form must be signed and returned within 30 days. You may start using your account as soon as you receive this package. You should immediately log in to all the sites to which you have been given access and change your password(s).
To request a new allocation on the TeraGrid, refer to the TeraGrid Allocation Guide and follow the process described there.
If your institute/research group already has a TeraGrid allocation, you may be added to this allocation. This is something your PI must do. They will need to fill out the "Add User" form. To access the form they must first log in to the TeraGrid User Portal, then click on the "My TeraGrid" tab and select "Add/Remove User". After the application has been reviewed and accepted, the new user(s) will receive, in about two weeks, a packet via conventional (postal) mail that contains their account information.
Refer to the TeraGrid Allocation Guide for more information.
To get an account on Purdue TeraGrid resources, you will need to either be a local user with an existing Purdue career account, or you will need a User Certificate. For more information about user certificates, refer to the Certificates section of the User Guide. Once you have a certificate or Purdue career account, send an email to the TeraGrid HelpDesk including your certificate DN or career account name and ask to be given access to Purdue TeraGrid resources.
In order to log in and interactively use a shell on any TeraGrid resource, you must use a GSI-SSH or SSH client. GSI-SSH client software such as "gsissh" may be installed on Linux, Mac OSX, or other Unix variants. GSI-SSH allows SSH access to machines using your user certificate for authentication instead of a password.
We strongly urge you to use a GSI-SSH client and a proxy grid certificate to log in to remote TeraGrid resources rather than a standard SSH client. However, if that is not possible, standard SSH may be used to reach some TeraGrid resources. A standard SSH client will require either the password defined for your account at each TeraGrid resource or an SSH key. If you find GSI-SSH is not available to you, we have more information on setting up SSH keys.
Email services are not supported on Purdue TeraGrid systems. Outgoing email will be delivered, but incoming email will be forwarded to the address supplied when your TeraGrid account was requested. If that address is incorrect, please send your correct forwarding address to the TeraGrid HelpDesk.
When your account is first activated, your shell will be set to tcsh, an enhanced version of the Berkeley UNIX C shell (csh). If you would like to use another shell (e.g. bash, the GNU Bourne-Again SHell), please send a request to have your shell changed to the TeraGrid HelpDesk.
TeraGrid resources primarily use certificate-based authentication. These certificates (X.509 certificates) are the same kind of certificates that SSL/TLS uses to secure transactions on the internet, such as credit card payments. In this case, however, a certificate must be presented by you as proof you are who you claim to be. These user certificates are issued and signed by trusted Certificate Authorities (CAs). Not all CAs are trusted by TeraGrid sites.
You should receive instructions on obtaining an NCSA certificate with your introductory TeraGrid packet. Purdue affiliates may also obtain a Purdue certificate.
It is fairly easy to obtain an NCSA user certificate, so you may wish to start by getting one of those and using it to log in to systems in the future.
bash-3.00$ ncsa-cert-request
bash-3.00$ grid-cert-info -subject
bash-3.00$ grid-cert-info -enddate
If you do not wish to obtain an NCSA user certificate, it is possible to use a user certificate from any TeraGrid-trusted Certificate Authority. To do so, place those certificate files in your ".globus" directory. Read the "File Transfer" section of the user guide for information on how to transfer your certificate (and other files) between sites.
You should confirm your certificate is approved by locating your user certificate's Distinguished Name (DN) in the gridmap file.
First, to extract your DN from your user certificate:
bash-3.00$ grid-cert-info -subject
Then search for this DN in the gridmap file:
bash-3.00$ grep <your_DN> /etc/grid-security/grid-mapfile
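A DN usually contains slashes, equals signs, and spaces, so it should be quoted and matched as a literal string rather than as a regular expression. The following is a minimal sketch of that lookup; the function name "dn_in_gridmap" is a hypothetical helper, not a TeraGrid tool, and the default path is the one used above.

```shell
# Hypothetical helper: succeed if the given DN appears in a grid-mapfile.
# Usage: dn_in_gridmap "<your_DN>" [path_to_mapfile]
dn_in_gridmap() {
    local dn="$1"
    local map="${2:-/etc/grid-security/grid-mapfile}"
    # -F treats the DN as a literal string, -q suppresses output,
    # and -- stops option parsing in case the DN starts with a dash.
    grep -qF -- "$dn" "$map"
}

# Example use on a TeraGrid login node (not run here):
#   dn_in_gridmap "$(grid-cert-info -subject)" && echo "DN is mapped"
```

The function only reports success or failure through its exit status, so it can be used directly in an "if" statement.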
If you are not in the gridmap file, you may use "gx-request" on many of the sites to request your DN be added. See here for a list. If gx-request is not supported on a system, you will have to email your DN to the TeraGrid HelpDesk.
Proxy Certificates are temporary authentication credentials issued based on your user certificate. These proxy certificates may be used to authenticate to various TeraGrid resources in lieu of your user certificate. This is generally safer than using your actual certificate, as these proxy certificates are only good for a matter of hours, so should one fall into the wrong hands, it would be of limited impact.
There are two ways of obtaining a proxy certificate. One is to derive a proxy certificate from your existing user certificate in your .globus directory, as explained above. However, you may also obtain a proxy certificate without a local user certificate. This is possible using the automatic NCSA user certificate created for all TeraGrid users, and the TeraGrid MyProxy server. Either method may be used. You may already have a local user certificate, or you may find it easier to directly obtain proxy certificates whenever you need them from the TeraGrid MyProxy server.
Either method's proxy certificates may be used to run jobs at TeraGrid sites or to log in via gsissh, as described elsewhere.
If you have a local user certificate in place in your ".globus" directory, you may derive a proxy certificate from this by running "grid-proxy-init":
bash-3.00$ grid-proxy-init
Enter pass phrase: <user_certificate_passphrase>
A proxy has been received for user username in /tmp/x509up_u#####
By default, the proxy certificate is valid for 12 hours. If you wish to have it remain valid longer, this can be done using the "-valid" option:
bash-3.00$ grid-proxy-init -valid <hh:mm>
When your TeraGrid account was created, a user certificate and Distinguished Name (DN) were created on your behalf by the NCSA Certificate Authority (CA). This DN is stored in your profile in the TeraGrid account database and automatically propagated to all TeraGrid sites. A proxy certificate derived from this can be retrieved from the TeraGrid MyProxy certificate repository (myproxy.teragrid.org) while on any TeraGrid resource by using the command "myproxy-get-delegation", supplying your TeraGrid Portal user name with the "-l" option, and your TeraGrid Portal password as the MyProxy passphrase:
bash-3.00$ myproxy-get-delegation -l <portal_username> -s myproxy.teragrid.org
Enter MyProxy pass phrase: <portal_password>
A proxy has been received for user username in /tmp/x509up_u#####
By default, the proxy certificate is valid for 12 hours. If you wish to have it remain valid longer, this can be done using the "-t" option, which takes the lifetime as a whole number of hours:
bash-3.00$ myproxy-get-delegation -t <hours>
You can verify the status of your current proxy certificate at any time by running "grid-proxy-info":
bash-3.00$ grid-proxy-info
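Before starting long-running work it can be useful to check how much lifetime the proxy has left. The sketch below assumes the "-timeleft" option of grid-proxy-info, which prints the remaining lifetime in seconds; the function name "proxy_has_time" and the two-hour threshold in the usage comment are illustrative choices only.

```shell
# Hypothetical helper: succeed if at least MIN_HOURS of proxy lifetime remain.
# Usage: proxy_has_time SECONDS_LEFT MIN_HOURS
proxy_has_time() {
    local seconds_left="$1"
    local min_hours="$2"
    [ "$seconds_left" -ge $((min_hours * 3600)) ]
}

# Example use on a TeraGrid login node (not run here):
#   left=$(grid-proxy-info -timeleft 2>/dev/null || echo 0)
#   proxy_has_time "$left" 2 || grid-proxy-init -valid 24:00
```

Keeping the comparison separate from the call to grid-proxy-info makes the threshold logic easy to check without a certificate.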
You may delete your current proxy certificate at any time by running "grid-proxy-destroy":
bash-3.00$ grid-proxy-destroy
If you are a Purdue affiliate and physically on campus, you may obtain a user certificate from the Purdue Certificate Authority. This is not necessary, as you only need a certificate from any one of the TeraGrid sites, and you can generate one yourself at NCSA once you have received your initial TeraGrid account as described above.
To get a Purdue user certificate, contact ca-admin. You will then be contacted with further information. You will be asked to appear in person and show an ID. You should also bring a USB drive or disc onto which your certificate can be copied.
Copy the certificate files to the ".globus" directory in your home directory (create this directory if it does not exist):
bash-3.00$ mkdir -p ~/.globus
bash-3.00$ cp userkey.pem ~/.globus
bash-3.00$ cp usercert.pem ~/.globus
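Globus tools generally refuse to use a private key that other users can read, so it is worth setting file permissions when copying the files into place. The following sketch wraps the copy steps in a hypothetical function, "install_user_cert"; the modes used (644 for the certificate, 400 for the key) are the conventional ones and should be treated as an assumption if your site documents something different.

```shell
# Hypothetical helper: copy usercert.pem and userkey.pem from SRC_DIR into
# DEST_DIR (default: ~/.globus) and set conventional permissions.
# Usage: install_user_cert SRC_DIR [DEST_DIR]
install_user_cert() {
    local src="$1"
    local dest="${2:-$HOME/.globus}"
    mkdir -p "$dest" || return 1
    cp "$src/usercert.pem" "$src/userkey.pem" "$dest/" || return 1
    chmod 644 "$dest/usercert.pem" || return 1   # certificate may be world-readable
    chmod 400 "$dest/userkey.pem"                # private key: owner read-only
}
```

For example, "install_user_cert /media/usb" would copy both files from a mounted USB drive (the mount point is illustrative) into "~/.globus".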
Important: Be sure your certificate has a passphrase! To change or add a passphrase:
bash-3.00$ grid-change-pass-phrase
It is now possible to use Single Sign-On (SSO) directly from your desktop/laptop. TeraGrid provides detailed instructions on Single Sign-On for Mac OSX, Linux, and Windows. You will need a user certificate. If you have an NCSA certificate (see above), your certificate will already be registered for you to log in via Single Sign-On.
Install Single Sign-On software:
The TeraGrid Portal is another way of logging in to TeraGrid resources. It is a Web interface that makes TeraGrid account management easier, displays information about TeraGrid resources, and enables access to many of the existing TeraGrid services in a single place. All new TeraGrid users receive an introductory packet via U.S. postal mail that contains a portal username and password along with their other TeraGrid system account usernames and passwords.
The portal can be reached from the menu bar of any User Info page on the TeraGrid web site or directly at the URL http://portal.teragrid.org/.
On the Portal home page, log in using your portal username and password. If you do not see the login form immediately upon going to the Portal web site, click the "Login" link in the upper right-hand corner of the site.
Once you are logged in to the portal, you will find six tabs across the top of the page. These tabs represent the different services that are available from the portal. Under each tab is a set of navigation links for services within each category.
Purdue has two major TeraGrid compute resources, the Steele Cluster and the Condor Pool, as well as FPGA, file storage, data, and visualization resources.
This cluster has 893 Dell 1950 compute nodes, each with two quad-core Intel E5410 processors, all running Red Hat Enterprise Linux, version 4 (RHEL4). Each node therefore has eight 64-bit 2.33 GHz cores, and either 16 GB or 32 GB of RAM. Some of the nodes in this cluster are interconnected with InfiniBand, and most with Gigabit Ethernet. Steele users have access to a 1.3 PB DXUL archive system. Steele's peak performance is rated at 66.59 TFLOPS, and we believe it is well suited for a wide range of both serial and parallel jobs.
| SSH / Login: | tg-login.purdue.teragrid.org |
| GridFTP: | tg-steele.purdue.teragrid.org (non-striped) tg-data.purdue.teragrid.org (striped) |
| GRAM: | tg-steele.purdue.teragrid.org/jobmanager-pbs |
| Compilers: | Intel, PGI, GNU (use softenv to select) |
| PBS Queue: | tg_workq |
The Purdue Condor pool consists of over 14,000 CPUs. Of these, more than 6,500 are Linux/x86_64 CPUs, approximately 800 are Linux/Intel (ia32) CPUs, and 1,700 are WinNT51/Intel CPUs. There are also small numbers of Itanium Linux, Solaris and Mac OSX machines. Memory on compute nodes ranges from 512 MB to 16 GB, and most CPUs run at 3 GHz or better. With a total of over 25 TFLOPS available, the Purdue Condor pools can provide very large numbers of cycles in a short amount of time. All shared areas and software packages available on Steele are also available on Condor. Condor is designed for high-throughput computing, and is excellent for parameter sweeps, Monte Carlo simulation, or almost any serial application. Some classes of parallel jobs (master-worker) may also be run via Condor.
| GridFTP: | tg-condor.purdue.teragrid.org (non-striped) tg-data.purdue.teragrid.org (striped) |
| GRAM: | tg-condor.purdue.teragrid.org/jobmanager-condor |
Purdue provides limited FPGA resources to TeraGrid users. These resources consist of an SGI 450 (brutus.rcac.purdue.edu) with two RC100 FPGA blades, totaling 4 available FPGAs. Also available is a Sun Fire X2200 M2 (portia.rcac.purdue.edu) which serves both as a place & route node for preparing FPGA code for use on Brutus and as an entry point for GSI-SSH and job submission to Brutus by TeraGrid users.
| SSH / Login: | fpga.purdue.teragrid.org |
| Place & Route Software: | Mitrionics + Xilinx |
Purdue provides a cloud computing testbed, "Wispy", to TeraGrid users. It consists of one frontend and VM image storage nodes, and four dual-CPU VM host machines. The machines have 1.5 GB of available memory.
The cloud supports virtual machines with real Internet addresses, so researchers are able to get running in the cloud with minimal complications. Anyone interested in an account on our cloud should send justification and an X.509 DN to rcac-help@purdue.edu.
| Cloud Quick Start Guide (globus.org) |
| No VPN software is necessary. |
| Wispy works with version 9 of the workspace client |
| Configuration file is located here (remove .txt to use) |
| Nimbus Software Release T2.1 |
All TeraGrid users at Purdue have a home directory, which is the default directory you are placed in when you log in. This is where you should store files you want to keep over a long term such as source code, scripts, input data sets, etc. It should also be used for files you need to use often. Your home directory physically resides on a BlueArc Titan 2500 system, and this is backed up regularly. Aside from a home directory, all users have access to a scratch directory, which is pre-created for everyone with a quota of 250 GB. Note that no backup is made of this scratch space, and it is purged after 60 days. The scratch directories are also stored on a BlueArc server. User scratch directories are located in subdirectories on the scratch95 or scratch96 filesystems with names beginning with the first letter of the user's login name.
When referring to your scratch space in scripts, always use the variable $RCAC_SCRATCH. The specific path may change at any time, but this variable will always point to the correct location.
Purdue provides various file storage resources to TeraGrid users, and offers GridFTP access for data transfer to and from those resources. Following is more detailed information on these resources and guidelines on proper use.
Your home directory is the default directory you are placed in when you log on.
You should use this space for storing files you want to keep long term such as source code, scripts, input data sets, etc. It should also be used for files you want to keep and which you use often. The home directory physically resides on the BlueArc. You can find the path to your home directory by logging onto tg-login.purdue.teragrid.org and typing "pwd".
bash-3.00$ pwd
/home/ba01/u103/user123
The second component of the reply indicates the name of the host where your home directory physically resides. In this example, the home directory is on the RCAC home directory file server named "ba01" under area "u103". That will be different from person to person. Remember, you can always check where your home directory is located by doing a "pwd" command in your home directory.
Regardless of its physical location, your home directory and its contents are available on all the machines and their nodes via the Network File System (NFS).
The command to see your disk usage and limits is "quota". The command "du -h" can also be used to get an idea of how much space you use. Home directories are backed up daily.
Quotas and limits are measured in 1024-byte blocks. Note that the BlueArc file systems have no concept of soft and hard limits, nor do they have grace periods. Thus, as you can see in the example below, the "quota" column reports "0" as your soft quota. You do not get any warning about exceeding a soft quota; instead, you simply get an error when you reach your quota, which is reported under the "limit" column. This is true both for home directories on our systems and for scratch directories in the BlueArc scratch file systems (scratch95 and scratch96).
Here is an example of running the "quota" command on tg-login:
bash-3.00$ quota
Disk quotas for user user123 (uid 13185):
     Filesystem     blocks   quota      limit   grace     files   quota    limit   grace
ba12:/scratch96
                         0       0  250000000                 3       0   100000
ba02.rcac.purdue.edu:/apps/recycled
                 288130368       0  524288000           4490829       0        0
     ba01:/u103    2660608       0    5000000             18097       0    65535
The "Filesystem" column indicates the file system for which the quota is being reported. The next four columns indicate the account's current disk usage, its soft quota and hard limit, and its grace period. (If your usage is under your soft limit, the grace period will be blank.) The last four columns show similar information about the number of files the account has created in the file system.
If the soft quota is exceeded an asterisk will be placed next to the disk usage and there will be a time period under "grace", showing how much time is left before the grace period expires.
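Because the block counts are raw numbers, it can help to convert them into a percentage of the limit. The sketch below is one way to do that with awk for output laid out like the sample above, where a long file system name may wrap onto its own line and the "grace" column is blank. The function name "quota_percent" is hypothetical, and the parsing is an assumption about the layout rather than a general-purpose quota parser.

```shell
# Hypothetical helper: read "quota" output on stdin and print, for each
# file system with a non-zero block limit, the percentage of the limit in use.
# Usage: quota | quota_percent
quota_percent() {
    awk '
        NR <= 2 { next }                   # skip the two header lines
        NF == 1 { fs = $1; next }          # wrapped file system name
        $1 ~ /^[0-9]+\*?$/ { report(fs, $1, $3); next }   # numbers after a wrapped name
        { report($1, $2, $4) }             # name and numbers on one line
        function report(name, used, limit) {
            sub(/\*$/, "", used)           # drop the over-quota asterisk, if any
            if (limit > 0) printf "%s %.1f%%\n", name, 100 * used / limit
        }
    '
}
```

File systems whose limit is reported as 0 are skipped, since a percentage of an unlimited quota is not meaningful.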
Scratch file systems (shared temporary filesystems) are intended for short term use and should be considered volatile.
Please note that backups are not performed on the scratch directories. In the event of a disk crash or file purge, files on the scratch directories cannot be recovered. Therefore, you should make sure to back up your files to permanent storage as often as significant changes are made (at least daily). Files stored in the RCAC scratch storage areas will be purged after 60 days.
The scratch storage is provided by a BlueArc server. The same scratch storage is shared across all systems, rather than each system having its own separate area. There are two scratch file systems, scratch95 and scratch96. A scratch directory will have already been created for you on one of these file systems. These user scratch directories are located in subdirectories under scratch95 or scratch96 that are named with the first letter of the user's login name.
To find the path to your scratch directory, run the command "myscratch":
bash-3.00$ myscratch
/scratch/scratch96/b/user123
or type "echo $RCAC_SCRATCH" at the command prompt:
bash-3.00$ echo $RCAC_SCRATCH
/scratch/scratch96/b/user123
When referring to your scratch space in scripts you should always use either the variable $RCAC_SCRATCH or $TG_CLUSTER_SCRATCH, since the actual path may change, but the variables will always be set correctly. The two variables point to the same place.
bash-3.00$ echo $TG_CLUSTER_SCRATCH
/scratch/scratch96/b/user123
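As an illustration of using these variables instead of a hard-coded path, the sketch below creates a per-run working directory under scratch. The function name "scratch_workdir" and the subdirectory name in the usage comment are hypothetical; only the two environment variables come from this guide.

```shell
# Hypothetical helper: create SUBDIR under the scratch area and print its path.
# Fails if neither scratch variable is set, rather than guessing a path.
# Usage: scratch_workdir SUBDIR
scratch_workdir() {
    local base="${RCAC_SCRATCH:-${TG_CLUSTER_SCRATCH:-}}"
    if [ -z "$base" ]; then
        echo "scratch_workdir: neither RCAC_SCRATCH nor TG_CLUSTER_SCRATCH is set" >&2
        return 1
    fi
    mkdir -p "$base/$1" && printf '%s\n' "$base/$1"
}

# Example use in a job script:
#   workdir=$(scratch_workdir run_001) || exit 1
#   cd "$workdir"
```

Failing loudly when the variables are missing avoids silently writing output into the wrong directory.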
To find the path to someone else's scratch directory, run the command "findscratch XXX", where "XXX" is the login id you are interested in:
bash-3.00$ findscratch user123
/scratch/scratch96/b/user123
Note: Each user has a quota in their scratch directory. By default, this quota is 250 GB. If you need more space, send a request to rcac-help@purdue.edu.
The /tmp directories are intended for temporary files that are used during the execution of a process or job or while you examine files created by your jobs. Please do not use these directories for longer term storage of user files.
Since files in /tmp are removed whenever space is low and when the system is rebooted, you should only use it for files that can be recreated relatively easily. Files that are difficult or expensive to recreate should be stored elsewhere, such as your home directory. Files placed in /tmp may be purged at any time.
The /tmp directories on the nodes are purged whenever a node is rebooted. The files are also removed when a node is assigned to a job by the PBS job scheduler. This is done to ensure that each job will have access to as much /tmp space as possible as it begins its execution.
Always use the variable $TG_NODE_SCRATCH to refer to /tmp in all scripts.
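A job script might create its own private directory under node-local scratch and clean it up on exit, as in the sketch below. The function name "make_job_tmp" and the "job.XXXXXX" template are illustrative, and the fallback to /tmp when $TG_NODE_SCRATCH is unset is an assumption made so the sketch also works outside TeraGrid.

```shell
# Hypothetical helper: create a unique temporary directory under node-local
# scratch and print its path.
make_job_tmp() {
    mktemp -d "${TG_NODE_SCRATCH:-/tmp}/job.XXXXXX"
}

# Example use inside a job script:
#   workdir=$(make_job_tmp) || exit 1
#   trap 'rm -rf "$workdir"' EXIT
```

Removing the directory in an EXIT trap leaves more /tmp space for whichever job runs on the node next.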
You can see all your environment variables by typing "env" at the command prompt. Some of the specific ones important for TeraGrid are:
If you need to move a file, such as your certificate, to another site, this can be done either with traditional file transfer programs or with GridFTP programs. Here is how you might use the GridFTP program tgcp:
tgcp local_filename remote.system.name:/path/to/where_you_want_the_file/
The TeraGrid web site also has more information on GridFTP programs and their use.
TeraGrid sites use several different local systems for managing jobs and queues. However, all support the Globus protocol for job submission, and by extension, Condor-G job submission. Following are specific instructions on how to use some of these to submit jobs either to Purdue resources or from Purdue to other TeraGrid sites.
Condor is one of several submission systems Purdue supports which you may use to run jobs on TeraGrid sites. Condor provides a framework for running programs on otherwise idle computers. While this has serious limitations for parallel jobs and programs with large I/O or memory requirements, Condor can provide a very large quantity of cycles for researchers who need to run hundreds or even thousands of smaller jobs. Condor may be used both to submit jobs to Purdue resources from outside Purdue as well as from Purdue hosts to submit jobs either locally or to other TeraGrid sites. Here are a couple of other references on Condor use:
Do not queue up thousands of jobs at once. Use DAGMan to divide your jobs into reasonably-sized chunks. Overburdening the queue can slow down or even kill the scheduler.
Long jobs need to be run in the "standard" universe, not in the "vanilla" universe. The standard universe checkpoints jobs so that they can resume after being interrupted. In the vanilla universe, unless your application supports checkpointing itself, your jobs may never have enough continuous time to complete.
"condor_all" is a Purdue-specific tool. It is not available at other sites.
To find out what machines and architectures are available and the status of all the pools at Purdue, use "condor_all":
bash-3.00$ condor_all
Pool emu.rcac.purdue.edu
               Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
 INTEL/LINUX     332    319        4          9        0           0         0
X86_64/LINUX    7249   3680     3004        565        0           0         0
       Total    7581   3999     3008        574        0           0         0
For a brief summary of the current pools' availability, use "condor_pool":
bash-3.00$ condor_pool -t
POOL
-------
egret.rcac.purdue.edu (Total=2539,Unused=1716)
broker.ics.purdue.edu (Total=241,Unused=151)
condor.calumet.purdue.edu (Total=348,Unused=255)
emu.rcac.purdue.edu (Total=7581,Unused=594)
flamingo.rcac.purdue.edu (Total=2594,Unused=683)
Submitting to Purdue resources from a Purdue host requires only basic Condor (not Condor-G).
All Condor jobs submitted to Purdue resources must specify the TeraGrid project allocation they are being run under. This is done by adding the following to your Condor submission file:
+TGProject = "YourProjectNumber"
Your project number will look something like "TG-XYZ123456". Other ways to specify this are to create the file "~/.tg_default_project" containing your project number (and nothing else) or to set the environment variable $DEFAULT_PROJECT to your project number.
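To show where the project line fits, the sketch below writes a minimal submission file for a local (non-Grid) job. The function name "write_condor_submit" and the output file names are hypothetical; the "vanilla" universe is used purely for illustration, and the other keywords are ordinary Condor submit-file settings.

```shell
# Hypothetical helper: print a minimal Condor submission file for a local
# Purdue submission that charges the given TeraGrid project.
# Usage: write_condor_submit PROJECT EXECUTABLE > example_local.condor
write_condor_submit() {
    local project="$1"
    local executable="$2"
    cat <<EOF
# Local Purdue submission: basic Condor, no Condor-G needed.
universe   = vanilla
executable = $executable
output     = example.out
error      = example.err
log        = example.log
# Charge the job to this TeraGrid allocation.
+TGProject = "$project"
queue
EOF
}
```

For example, "write_condor_submit TG-XYZ123456 /bin/hostname > example_local.condor" produces a file that can then be passed to "condor_submit".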
Submitting to Purdue resources over the Grid (not from a Purdue host) requires the use of Condor-G. This is a Condor front-end to Globus. While Globus may be used directly, and we also provide instructions on direct Globus use, you may find it more convenient to use Condor to manage the Globus submission(s) for you. This is what Condor-G does.
All Condor-G jobs submitted to Purdue resources must specify the TeraGrid project allocation they are being run under. This is done by adding the following to your Condor-G submission file:
GlobusRSL = (project=YourProjectNumber)
Your project number will look something like "TG-XYZ123456". Other ways to specify this are to create the file "~/.tg_default_project" containing your project number (and nothing else) or to set the environment variable $DEFAULT_PROJECT to your project number.
In order to have access to Condor-G, you will need to add it to your environment via softenv:
bash-3.00$ soft add +condor-g
You will also need to have a currently valid proxy certificate.
Below is an example Condor-G submission to give you an idea of how to get started.
Submitting to Non-Purdue Grid resources from a Purdue host requires the use of Condor-G. This is a Condor front-end to Globus. While Globus may be used directly, and we also provide instructions on direct Globus use, you may find it more convenient to use Condor to manage the Globus submission(s) for you. This is what Condor-G does.
All Condor-G jobs submitted to TeraGrid resources must specify the TeraGrid project allocation they are being run under. This is done by adding the following to your Condor-G submission file:
GlobusRSL = (project=YourProjectNumber)
Your project number will look something like "TG-XYZ123456". Other ways to specify this are to create the file "~/.tg_default_project" containing your project number (and nothing else) or to set the environment variable $DEFAULT_PROJECT to your project number.
In order to have access to Condor-G, you will need to add it to your environment via softenv:
bash-3.00$ soft add +condor-g
You will also need to have a currently valid proxy certificate.
Below is an example Condor-G submission to give you an idea of how to get started.
To run a Condor-G job you must write a Condor submission script. Here is a simple example:
#
# example.condor
# Simple Condor-G Example
#
# Specify your TeraGrid allocation project here.
globusrsl = (project=TG-XYZ123456)
# Submissions over the Grid must use the "globus" universe.
universe = globus
# The executable to run. Need the full path. ~/ does not work.
executable = /bin/hostname
# Command-line arguments to the executable.
arguments = 1 2 3
# false: The executable is already on remote machine.
# true: Copy the executable from the local machine to the remote.
transfer_executable = false
# Where to submit the job. See the "Resources" page for local jobmanagers.
globusscheduler = tg-steele.purdue.teragrid.org/jobmanager-pbs
# Filenames for standard output, standard error, and Condor log.
output = example.out
error = example.err
log = example.log
# The following line is always required. It is the command to submit the above.
queue
To submit a job, run "condor_submit" and provide the Condor submission script filename:
bash-3.00$ condor_submit example.condor
Submitting job(s)...
Logging submit event(s)...
1 job(s) submitted to cluster 5890.
The command "condor_q" will report the progress of your job in the queue:
bash-3.00$ condor_q
-- Submitter: tg-steele.rcac.purdue.edu : <128.211.143.238:32775> : tg-steele.rcac.purdue.edu
 ID     OWNER     SUBMITTED     RUN_TIME    ST  PRI  SIZE  CMD
 40.0   user123   11/20 12:36   0+00:00:00  H   0    0.0   example_hung
 40.1   user123   11/20 12:36   0+00:00:00  H   0    0.0   example_hung
 40.2   user123   11/20 12:36   0+00:00:00  H   0    0.0   example_hung
 57.0   user123   12/19 10:21   0+00:01:49  R   0    0.0   example_works
 57.1   user123   12/19 10:21   0+00:00:00  I   0    0.0   example_nomatch
8 jobs; 1 idle, 1 running, 6 held
If a job has not been running for some time, you can try to find out why using "condor_q -better-analyze". This will report whether your job failed to match any resources and, if so, which job constraints no machine could satisfy:
bash-3.00$ condor_q -better-analyze 57.1
To cancel a job, use the "condor_rm" command and the ID of the job from "condor_q":
bash-3.00$ condor_rm 57.1
A Condor DAG or Directed Acyclic Graph is a way of submitting jobs which depend on each other's completion. This can be used to create a workflow, where job A must complete before job B can start, or to batch up large numbers of unrelated jobs, so that each set of 100 jobs will wait for the previous set of 100 jobs to complete before starting, or any combination of these arrangements. As a result, Condor DAGs can be extremely powerful and useful, and are highly encouraged. DAGMan is the Directed Acyclic Graph Manager and is used to create and submit Condor DAGs.
Here is an example Condor DAG:
Job 1 example1.condor
Job 2 example2.condor
Job 3 example3.condor
PARENT 1 CHILD 2
PARENT 2 CHILD 3
Each of the files "example1.condor", "example2.condor", and "example3.condor" is a Condor submission script as explained above.
To submit this DAG, use the "condor_submit_dag" command:
bash-3.00$ condor_submit_dag example.dag
Checking all your submit files for log file names.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor           : example.dag.condor.sub
Log of DAGMan debugging messages                 : example.dag.dagman.out
Log of Condor library debug messages             : example.dag.lib.out
Log of the life of condor_dagman itself          : example.dag.dagman.log
Condor Log file for all jobs of this DAG         : /home/rcac/user123/dagtest/example1.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 58.
-----------------------------------------------------------------------
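For the batching case, where each set of jobs waits for the previous set, writing the DAG by hand quickly becomes tedious. The sketch below generates such a DAG; the function name "make_chunked_dag" and the submission file names "job1.condor", "job2.condor", and so on are hypothetical, and it assumes one submission script per job.

```shell
# Hypothetical helper: print a DAG for TOTAL jobs in which each chunk of
# CHUNK jobs must finish before the next chunk starts.
# Usage: make_chunked_dag TOTAL CHUNK > chunked.dag
make_chunked_dag() {
    local total="$1"
    local chunk="$2"
    local i=1
    local prev=""
    local cur=""
    while [ "$i" -le "$total" ]; do
        echo "JOB $i job$i.condor"
        cur="$cur $i"
        # Close the chunk when it is full or when this is the last job.
        if [ $((i % chunk)) -eq 0 ] || [ "$i" -eq "$total" ]; then
            if [ -n "$prev" ]; then
                echo "PARENT$prev CHILD$cur"
            fi
            prev="$cur"
            cur=""
        fi
        i=$((i + 1))
    done
}
```

For example, "make_chunked_dag 1000 100 > chunked.dag" would describe ten sets of 100 jobs, and the resulting file can be submitted with "condor_submit_dag" in the same way as the example above.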
Just as with ordinary Condor above, the status of the DAG can be checked with "condor_q" and a DAG can be removed with "condor_rm". Some further DAG references you may find helpful are:
It is possible to use Condor to run a number of commercial applications. However, which software packages are installed on different TeraGrid sites and how they are configured varies widely. There are also often licensing problems with using commercial applications outside of Purdue. Below is information about R, which is currently the only specific package that can be used by non-Purdue affiliates on Purdue's Steele Cluster and Condor Pool resources.
| R: | on Linux |
Globus is one of several submission systems Purdue supports which you may use to run jobs on TeraGrid sites. Globus provides a framework for job submission and management over the Internet using certificate-based authentication credentials. This is the de facto standard means of issuing jobs over most Grids, including TeraGrid. However, Condor-G provides a front-end to the Globus protocol with some additional features and is generally found by users to be simpler to use. That said, users do have the option to use Globus directly instead. Globus may be used both to submit jobs to Purdue resources from outside Purdue as well as from Purdue hosts to submit jobs either locally or to other TeraGrid sites. TeraGrid supports Globus Toolkit 4.0, which also includes file transfer and resource description tools. Here is another reference on Globus use:
If you are unsure your proxy certificate is being accepted, or that the remote Globus gatekeeper is responsive, you may wish to test using a simple authentication-only check:
bash-3.00$ globusrun -a -r tg-steele.purdue.teragrid.org/jobmanager-fork
GRAM Authentication test successful
To submit a job, there are three distinct commands you may use. Each offers somewhat different functionality. To start a job, wait for its completion, and see the output as it runs, use "globus-job-run". To submit a batch job and not wait for it, use "globus-job-submit". Both of these take the script or executable you wish to run as an argument, and any Globus RSL parameters (such as your project number, the number of nodes requested, or the type of machine needed) must also be specified on the command line. The third option is to use "globusrun", which takes an RSL file as an argument; this RSL file may contain all the RSL parameters that would otherwise go on the command line. For quick tests, you may wish to use one of the "globus-job-*" commands, but if you want to save how you submitted a job for future reuse, you should construct an RSL submission file and use "globusrun".
All Globus jobs submitted to Purdue resources must specify the TeraGrid project allocation they are being run under. This is done by using the "project" RSL parameter:
(project = "YourProjectNumber")
Your project number will look something like "TG-XYZ123456". Another way to specify this is to set the environment variable $DEFAULT_PROJECT to your project number.
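To avoid retyping the project clause, a script can build it from an argument or from $DEFAULT_PROJECT. This is a minimal sketch; project_rsl is a hypothetical helper name, not a Globus command.

```shell
# Sketch: build the RSL "project" clause from an argument or $DEFAULT_PROJECT.
# project_rsl is a hypothetical helper, not part of Globus.
project_rsl() {
    proj=${1:-${DEFAULT_PROJECT:-}}
    if [ -z "$proj" ]; then
        echo "error: no project number given and DEFAULT_PROJECT is unset" >&2
        return 1
    fi
    printf '&(project=%s)\n' "$proj"
}

# Example use:
# globus-job-run tg-steele.purdue.teragrid.org/jobmanager-pbs \
#     -x "$(project_rsl TG-XYZ123456)" /bin/hostname
```

An explicit argument takes precedence over the environment variable, and the helper fails rather than emitting an empty project clause.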
Use "globus-job-run" to start a quick job, wait for its completion, and view output as it runs:
bash-3.00$ globus-job-run tg-steele.purdue.teragrid.org/jobmanager-pbs \
-x '&(project=TG-XYZ123456)' /bin/hostname
For convenience, "globus-job-run" also offers the ability to access a local file using "-s" (this is done using GASS behind the scenes):
bash-3.00$ globus-job-run tg-steele.purdue.teragrid.org/jobmanager-pbs \
-x '&(project=TG-XYZ123456)' -s my_script
You may use "globus-job-submit" to submit a batch job, although all RSL parameters must be specified on the command line. It returns a contact string, a URL unique to your job, which other commands use to manage the job:
bash-3.00$ globus-job-submit tg-steele.purdue.teragrid.org/jobmanager-pbs \
-x '&(project=TG-XYZ123456)' /bin/hostname
https://tg-steele.purdue.teragrid.org:42396/6249/1164739382/
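Because every later command needs the contact string, it can help to capture it when the job is submitted. The sketch below stores the string in a variable and appends it to a file. The function name submit_and_record, the GLOBUS_JOB_SUBMIT and CONTACT_FILE variables, and the default file name contacts.txt are all illustrative assumptions.

```shell
# Sketch: submit a batch job and keep its contact string for later commands.
# GLOBUS_JOB_SUBMIT and CONTACT_FILE are hypothetical override variables.
GLOBUS_JOB_SUBMIT=${GLOBUS_JOB_SUBMIT:-globus-job-submit}

submit_and_record() {
    # $1 = gatekeeper contact, $2 = project number, remaining args = command to run
    gatekeeper=$1
    project=$2
    shift 2
    contact=$("$GLOBUS_JOB_SUBMIT" "$gatekeeper" -x "&(project=$project)" "$@") || return 1
    # Append the contact string to a file so it survives the current shell session.
    echo "$contact" >> "${CONTACT_FILE:-contacts.txt}"
    echo "$contact"
}

# Example use:
# contact=$(submit_and_record tg-steele.purdue.teragrid.org/jobmanager-pbs \
#     TG-XYZ123456 /bin/hostname)
```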
Executables may not be automatically copied over or accessed remotely using globus-job-submit. To provide any local files, you must use GASS (direct remote access) or GridFTP (copy files in advance).
You may use "globusrun" to submit a batch job, providing as input an RSL file that contains all the job parameters, executable, output filenames, etc. It returns a contact string, a URL unique to your job, which other commands use to manage the job:
bash-3.00$ globusrun -r tg-steele.purdue.teragrid.org/jobmanager-pbs -f example.rsl
https://tg-steele.purdue.teragrid.org:3768/sdfkhkdfhg/ououjko/wouiu/
For convenience, "globusrun" also offers the ability to access a local file using "-s" (this is done using GASS behind the scenes):
bash-3.00$ globusrun -s -r tg-steele.purdue.teragrid.org/jobmanager-pbs -f example.rsl
https://tg-steele.purdue.teragrid.org:42396/6249/1164739382/
DONE
When submitting a Globus job you may wish to predefine and save all your job parameters in a file. This can be done using a Resource Specification Language (RSL) file, which may then be submitted using globusrun. Here is a simple example:
&
(project=TG-XYZ123456)
(environment=(GLOBUS_DUROC_SUBJOB_INDEX 0))
(jobtype=single)
(executable="/bin/hostname")
(stdout="example.out")
(stderr="example.err")
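If you submit many similar jobs, an RSL file like the one above can be generated from a few parameters. The following is a minimal sketch using a shell here-document; the function name write_rsl and its argument order are illustrative choices.

```shell
# Sketch: print an RSL description for a single-process job.
# write_rsl is a hypothetical helper; redirect its output to a file.
write_rsl() {
    # $1 = project number, $2 = executable path, $3 = base name for output files
    cat <<EOF
&
(project=$1)
(environment=(GLOBUS_DUROC_SUBJOB_INDEX 0))
(jobtype=single)
(executable="$2")
(stdout="$3.out")
(stderr="$3.err")
EOF
}

# Example use:
# write_rsl TG-XYZ123456 /bin/hostname example > example.rsl
# globusrun -r tg-steele.purdue.teragrid.org/jobmanager-pbs -f example.rsl
```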
For more information on RSL, you may refer to the Official Globus RSL Documentation.
The command "globus-job-status" with your job's contact string (URL) will report the progress of your job:
bash-3.00$ globus-job-status https://tg-steele.purdue.teragrid.org:42396/6249/1164739382/
DONE
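Rather than rerunning the status command by hand, a script can poll until the job finishes. This is a minimal sketch under stated assumptions: the function name wait_for_job and the GLOBUS_JOB_STATUS and POLL_INTERVAL variables are illustrative, and it assumes the command prints "DONE" on success (as shown above) and "FAILED" on failure. Any other status is treated as still in progress.

```shell
# Sketch: poll a job's status until it reaches a final state.
# Assumption: the status command prints DONE or FAILED for finished jobs.
GLOBUS_JOB_STATUS=${GLOBUS_JOB_STATUS:-globus-job-status}
POLL_INTERVAL=${POLL_INTERVAL:-30}   # seconds between checks

wait_for_job() {
    # $1 = job contact string (URL)
    while :; do
        state=$("$GLOBUS_JOB_STATUS" "$1") || return 2
        echo "status: $state"
        case $state in
            DONE)   return 0 ;;
            FAILED) return 1 ;;
        esac
        sleep "$POLL_INTERVAL"
    done
}

# Example use:
# wait_for_job https://tg-steele.purdue.teragrid.org:42396/6249/1164739382/
```

A modest polling interval keeps the load on the gatekeeper low; very frequent polling is rarely useful for batch jobs.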
Once your job is done, you can retrieve your output remotely using the "globus-job-get-output" command and the job's contact string (URL):
bash-3.00$ globus-job-get-output https://tg-steele.purdue.teragrid.org:42396/6249/1164739382/
tg-steele
To cancel a job and clean up a job's output, use the "globus-job-clean" command and the job's contact string (URL):
bash-3.00$ globus-job-clean https://tg-steele.purdue.teragrid.org:42396/6249/1164739382/
WARNING: Cleaning a job means:
- Kill the job if it still running, and
- Remove the cached output on the remote resource
Are you sure you want to cleanup the job now (Y/N) ? y
Cleanup successful.
Globus Access to Secondary Storage (GASS) is meant to simplify remote file I/O when using Globus. Typically, a GASS server is started on the local machine a user is submitting from, which has local access to the user's files. This server may be manually started by the user themselves, or automatically by using the "-s" option to globus-job-run or globusrun. This server can then serve files needed and allow file output back from the remote machine to which the user submits Globus jobs, with some caching on the remote machine. This is generally much easier than it may sound, and with the "-s" option, you may not be aware this is being done at all.
If using the "-s" option to globus-job-run or globusrun, you will also need to use the $(GLOBUSRUN_GASS_URL) RSL variable in your job submission, as the exact GASS URL will not be known until the job is submitted. Here is an example of some RSL that uses this to specify files in the working directory of the local filesystem:
(executable=$(GLOBUSRUN_GASS_URL)/./my_script.sh)
(stdin=$(GLOBUSRUN_GASS_URL)/./my_input)
(stdout=$(GLOBUSRUN_GASS_URL)/./my_output)
Note that the "/./" is required; it tells the GASS server to use the directory the GASS server was started in rather than an absolute path.
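One practical pitfall when generating such RSL from a shell script: the shell reads $(GLOBUSRUN_GASS_URL) as a command substitution unless the text is quoted. The sketch below writes the RSL literally by quoting the here-document delimiter. The function name write_gass_rsl is illustrative, and the project number is the placeholder used elsewhere in this guide.

```shell
# Sketch: emit RSL that refers to $(GLOBUSRUN_GASS_URL) without the shell
# trying to run a command of that name.
write_gass_rsl() {
    # The quoted 'EOF' delimiter disables all shell expansion inside the
    # here-document, so the text is written exactly as shown.
    cat <<'EOF'
&
(project=TG-XYZ123456)
(executable=$(GLOBUSRUN_GASS_URL)/./my_script.sh)
(stdin=$(GLOBUSRUN_GASS_URL)/./my_input)
(stdout=$(GLOBUSRUN_GASS_URL)/./my_output)
EOF
}

# Example use:
# write_gass_rsl > gass_example.rsl
# globusrun -s -r tg-steele.purdue.teragrid.org/jobmanager-pbs -f gass_example.rsl
```

The same applies on the command line: enclose RSL in single quotes, as in the "-x" examples earlier.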
To manually start a GASS server, run "globus-gass-server":
bash-3.00$ globus-gass-server
https://tg-steele.purdue.teragrid.org:50000
The GASS server also accepts several command-line options.
PBS is one of several submission systems that Purdue supports for running jobs at Purdue. Note: It is only possible to submit jobs to Purdue resources via PBS from a Purdue host. Some other TeraGrid resources may also offer local PBS access, but not all. While we encourage the use of Grid tools such as Globus and Condor-G, PBS may be useful if you are having problems submitting jobs using Grid tools.
A given resource may offer several queues for the same resources, each with different constraints such as maximum job duration, maximum memory usage, or maximum number of CPUs, and often with different wait times as a result. In general, try to choose the queue that minimally meets your job's requirements, so that the resource can be scheduled most efficiently and your job runs as soon as possible.
To list the queues available on a resource, use "qstat -q":
bash-3.00$ qstat -q
server: steele.rcac.purdue.edu
Queue            Memory CPU Time Walltime Node Run   Que   Lm   State
---------------- ------ -------- -------- ---- ----- ----- ---- -----
tg_workq         --     --       720:00:0 --       0     3 --   D S
preemptdef       --     --       --       --       0     0 --   D S
standby          --     --       04:00:00 --       0     1 --   D S
testq            --     --       720:00:0 --       0     0 --   D S
                                               ----- -----
                                                   0     4
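The advice to pick the queue that minimally meets your needs can be partly automated. The following is a minimal sketch that reads "qstat -q" output on standard input and prints the queue with the smallest listed walltime limit that still covers the hours you need. The function name pick_queue is illustrative. It assumes the column layout shown above, compares whole hours only, and ignores queues with no listed walltime limit as well as every other constraint and the queue state.

```shell
# Sketch: choose the queue with the smallest sufficient walltime limit.
# Assumption: walltime is the 4th whitespace-separated field, as in the
# listing above, and has the form HH:MM:SS (possibly truncated).
pick_queue() {
    # $1 = required walltime in whole hours; "qstat -q" output on stdin
    awk -v need="$1" '
        $4 ~ /^[0-9]+:[0-9]+:[0-9]+$/ {
            split($4, t, ":")
            hours = t[1] + 0
            if (hours >= need && (best == "" || hours < best_hours)) {
                best = $1
                best_hours = hours
            }
        }
        END { if (best != "") print best; else exit 1 }
    '
}

# Example use:
# qstat -q | pick_queue 2
```

When several queues share the same limit, the first one listed is chosen. The function returns a non-zero status if no listed limit is large enough.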
To see more details about the limits on each queue, use "qstat -Qf":
bash-3.00$ qstat -Qf
Queue: tg_workq
    queue_type = Execution
    Priority = 1000
    total_jobs = 3
    state_count = Transit:0 Queued:3 Held:0 Waiting:0 Running:0 Exiting:0 Begun:0
    resources_max.walltime = 720:00:00
    resources_default.ncpus = 1
    resources_default.nodes = 1
    resources_default.walltime = 00:30:00
    acl_group_enable = True
    acl_groups = teragrid,tgusers,itap,pucc
    resources_available.ncpus = 224
    enabled = False
    started = False
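To pull a single limit out of this listing, for example in a script that checks a job's walltime before submitting, the output can be filtered with awk. This is a minimal sketch; the function name queue_attr is illustrative, and it assumes the "Queue: name" and "attribute = value" layout shown above.

```shell
# Sketch: print the value of one attribute for one queue.
# Assumption: input follows the "qstat -Qf" layout shown above.
queue_attr() {
    # $1 = queue name, $2 = attribute name; "qstat -Qf" output on stdin
    awk -v q="$1" -v attr="$2" '
        $1 == "Queue:" { in_q = ($2 == q) }
        in_q && $1 == attr && $2 == "=" {
            sub(/^[^=]*= */, "")   # strip everything up to and including "= "
            print
            found = 1
        }
        END { exit !found }
    '
}

# Example use:
# qstat -Qf | queue_attr tg_workq resources_max.walltime
```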
To run a PBS job you must write a PBS submission script. Here is a simple example:
#!/bin/sh
#
## example.pbs
## Simple PBS Example
#
#PBS -q tg_workq
#PBS -N myexample
#PBS -l nodes=10:ppn=2
#PBS -l walltime=0:50:00
#PBS -o example.out
#PBS -e example.err
#PBS -V
#
mkdir -p $TG_CLUSTER_SCRATCH/username/myexample
cd $TG_CLUSTER_SCRATCH/username/myexample
mpirun -v -machinefile $PBS_NODEFILE -np 20 $TG_CLUSTER_HOME/a.out
You must first log in to the head/login node of the resource. From there, you may submit a PBS submission script using the "qsub" command. You must also specify which queue you wish to submit to (see above for how to list available queues), and the TeraGrid allocation project number this job is being run under (a number of the form "TG-XYZ123456"):
bash-3.00$ qsub -q queue_name -A TG-XYZ123456 example.pbs
If you are a local Purdue user (using a Purdue career account), you may belong to other local Unix groups in addition to the "tgusers" Unix group. In order to submit jobs to TeraGrid, "tgusers" or "itap" must be your primary group. To determine your current primary group and secondary group memberships, use the "id" command:
bash-3.00$ id -Gn
groupa groupb tgusers groupc
The first group in this list is your primary group. If it is not "tgusers" (or "itap"), you will need to specify that you wish to submit as a member of the "tgusers" group when using qsub:
bash-3.00$ qsub -W group_list=tgusers -q queue_name -A TG-XYZ123456 example.pbs
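The group check can be folded into a submission script so the extra option is added only when needed. The following is a minimal sketch; the function name qsub_group_args is illustrative, and it treats both "tgusers" and "itap" as acceptable primary groups, following the description above.

```shell
# Sketch: print the extra qsub option needed when the primary group
# is neither "tgusers" nor "itap"; print nothing otherwise.
qsub_group_args() {
    # $1 = primary group name (defaults to the output of "id -gn")
    primary=${1:-$(id -gn)}
    case $primary in
        tgusers|itap) ;;   # primary group is already acceptable
        *) printf '%s\n' "-W group_list=tgusers" ;;
    esac
}

# Example use (left unquoted on purpose so the option splits into two words):
# qsub $(qsub_group_args) -q queue_name -A TG-XYZ123456 example.pbs
```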
The command "qstat" will report the progress of your job in the queue:
bash-3.00$ qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
515520.steele    myexample        me                      0 Q tg_workq
524630.steele    foo              someone                 0 Q standby
526698.steele    bar              someoneelse             0 Q tg_workq
526698.steele    myexample2       me                      0 Q tg_workq
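On a busy system the listing contains many other users' jobs. The sketch below filters "qstat" output down to the numeric IDs of one user's jobs. The function name my_job_ids is illustrative, and it assumes the default column layout shown above, with two header lines and the user name in the third column.

```shell
# Sketch: list the numeric job IDs belonging to one user.
# Assumption: default "qstat" layout with two header lines, as shown above.
my_job_ids() {
    # $1 = user name; "qstat" output on stdin
    awk -v user="$1" 'NR > 2 && $3 == user { split($1, id, "."); print id[1] }'
}

# Example use:
# qstat | my_job_ids "$(whoami)"
```

The printed IDs are in the form that "qdel" accepts.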
To cancel a job, use the "qdel" command and the ID of the job from "qstat":
bash-3.00$ qdel 515520
To conduct a basic test of a remote system and retrieve some information about the environment there, save the following as the file "probe.sh" and then send a job to the remote system you wish to probe with this script as the executable:
#!/bin/sh
#
# probe.sh
# Basic Environment Probe
#
echo "************************************************************"
echo "Date/Time = `date '+%Y-%m-%d %T'`"
echo "Machine = `hostname`"
echo "User = `whoami`"
echo "Working Directory = `pwd`"
echo "Environment Variables ="
echo ""
echo "`env`"
echo "************************************************************"
It will report the date and time (when it ran), machine name (where it ran), the user (who it ran as), the working directory (what directory it was run from), and the full set of environment variables. This information may prove useful in constructing your submissions or in locating a problem with another submission.
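When comparing probe results from several systems, it can be handy to extract a single field from the saved output. This is a minimal sketch; the function name probe_field is illustrative, and it relies on the "Label = value" lines that probe.sh prints.

```shell
# Sketch: print the value of one "Label = value" line from probe.sh output.
probe_field() {
    # $1 = label, e.g. "Machine" or "Date/Time"; probe output on stdin
    # "|" is used as the sed delimiter because a label may contain "/".
    sed -n "s|^$1 = ||p"
}

# Example use, assuming the job's output was saved to probe.out:
# probe_field Machine < probe.out
```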