Other Materials
Implementing an Industrial-Strength Academic Cyberinfrastructure at Purdue University
Implementing a Central Quill Database in a Large Condor Installation (Condor Week 2008)
BoilerGrid for cryo-EM image processing (Condor Week 2008)
Events
Event Number: 000
Event Name: Job submitted
Event Description: This event occurs when a user submits a job. It is the
first event you will see for a job, and it should only
occur once.
Event Number: 001
Event Name: Job executing
Event Description: This event occurs when a job starts running on a machine.
It might occur more than once.
Event Number: 002
Event Name: Error in executable
Event Description: The job couldn't be run because the executable was bad.
Event Number: 003
Event Name: Job was checkpointed
Event Description: The job's complete state was written to a checkpoint file.
This might happen without the job being removed from a
machine, because the checkpointing can happen periodically.
Event Number: 004
Event Name: Job evicted from machine
Event Description: A job was removed from a machine before it finished,
usually for a policy reason: perhaps an interactive user
has claimed the computer, or perhaps another job is higher
priority.
Event Number: 005
Event Name: Job terminated
Event Description: The job has completed.
Event Number: 006
Event Name: Image size of job updated
Event Description: This event is informational. It reports the memory that
the job is using while running. It does not reflect the
state of the job.
Event Number: 007
Event Name: Shadow exception
Event Description: The condor_shadow, a program on the submit computer that
watches over the job and performs some services for the
job, failed for some catastrophic reason. The job will
leave the machine and go back into the queue.
Event Number: 008
Event Name: Generic log event
Event Description: Not used.
Event Number: 009
Event Name: Job aborted
Event Description: The user cancelled the job.
Event Number: 010
Event Name: Job was suspended
Event Description: The job is still on the computer, but it is no longer
executing. This is usually for a policy reason, like an
interactive user using the computer.
Event Number: 011
Event Name: Job was unsuspended
Event Description: The job has resumed execution, after being suspended earlier.
Event Number: 012
Event Name: Job was held
Event Description: The user has paused the job, perhaps with the condor_hold
command. The job was stopped and will stay in the queue
until it is aborted or released.
Event Number: 013
Event Name: Job was released
Event Description: The user is requesting that a job on hold be re-run.
Event Number: 014
Event Name: Parallel node executed
Event Description: A parallel (MPI) program is running on a node.
Event Number: 015
Event Name: Parallel node terminated
Event Description: A parallel (MPI) program has completed on a node.
Event Number: 016
Event Name: POST script terminated
Event Description: A node in a DAGMan workflow has a script that should be
run after a job. The script is run on the submit host. This
event signals that the post script has completed.
Event Number: 017
Event Name: Job submitted to Globus
Event Description: A grid job has been delegated to Globus (version 2, 3, or 4).
Event Number: 018
Event Name: Globus submit failed
Event Description: The attempt to delegate a job to Globus failed.
Event Number: 019
Event Name: Globus resource up
Event Description: The Globus resource that a job wants to run on was
unavailable, but is now available.
Event Number: 020
Event Name: Detected Down Globus Resource
Event Description: The Globus resource that a job wants to run on has become
unavailable.
Event Number: 021
Event Name: Remote error
Event Description: The condor_starter (which monitors the job on the
execution machine) has failed.
Event Number: 022
Event Name: Remote system call socket lost
Event Description: The condor_shadow and condor_starter (which communicate
while the job runs) have lost contact.
Event Number: 023
Event Name: Remote system call socket reestablished
Event Description: The condor_shadow and condor_starter (which communicate
while the job runs) have been able to resume contact before
the job lease expired.
Event Number: 024
Event Name: Remote system call reconnect failure
Event Description: The condor_shadow and condor_starter (which communicate
while the job runs) were unable to resume contact before
the job lease expired.
Event Number: 025
Event Name: Grid Resource Back Up
Event Description: A grid resource that was previously unavailable is now
available.
Event Number: 026
Event Name: Detected Down Grid Resource
Event Description: The grid resource that a job is to run on is unavailable.
Event Number: 027
Event Name: Job submitted to grid resource
Event Description: A job has been submitted, and is under the auspices of the
grid resource.
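
The event numbers above can be turned into a small lookup table for reading a job's log. The following is a minimal Python sketch, not an official Condor tool. It assumes that each event in the log begins with a header line of the form "NNN (cluster.proc.subproc) ...", where NNN is the three-digit event number. That layout, the function names, and the sample lines are assumptions made for illustration; check them against the logs your installation actually writes.

```python
import re

# Event numbers and names, copied from the list above.
EVENT_NAMES = {
    0: "Job submitted",
    1: "Job executing",
    2: "Error in executable",
    3: "Job was checkpointed",
    4: "Job evicted from machine",
    5: "Job terminated",
    6: "Image size of job updated",
    7: "Shadow exception",
    8: "Generic log event",
    9: "Job aborted",
    10: "Job was suspended",
    11: "Job was unsuspended",
    12: "Job was held",
    13: "Job was released",
    14: "Parallel node executed",
    15: "Parallel node terminated",
    16: "POST script terminated",
    17: "Job submitted to Globus",
    18: "Globus submit failed",
    19: "Globus resource up",
    20: "Detected Down Globus Resource",
    21: "Remote error",
    22: "Remote system call socket lost",
    23: "Remote system call socket reestablished",
    24: "Remote system call reconnect failure",
    25: "Grid Resource Back Up",
    26: "Detected Down Grid Resource",
    27: "Job submitted to grid resource",
}

# ASSUMPTION: an event header looks like "NNN (cluster.proc.subproc) ...".
# This layout is not stated in the list above.
HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\)")


def parse_event_header(line):
    """Return (event_number, event_name, job_id), or None for non-header lines."""
    match = HEADER.match(line)
    if match is None:
        return None
    number = int(match.group(1))
    name = EVENT_NAMES.get(number, "Unknown event")
    job_id = f"{int(match.group(2))}.{int(match.group(3))}"
    return number, name, job_id


def event_history(lines):
    """Map each job id to the ordered list of event numbers seen for it."""
    history = {}
    for line in lines:
        parsed = parse_event_header(line)
        if parsed is not None:
            number, _name, job_id = parsed
            history.setdefault(job_id, []).append(number)
    return history


if __name__ == "__main__":
    # Hypothetical sample lines, written only to exercise the parser.
    sample = [
        "000 (123.000.000) 10/15 12:00:00 Job submitted from host: <example>",
        "...",
        "001 (123.000.000) 10/15 12:01:00 Job executing on host: <example>",
        "005 (123.000.000) 10/15 12:30:00 Job terminated.",
    ]
    for job_id, numbers in event_history(sample).items():
        print(job_id, [EVENT_NAMES[n] for n in numbers])
```

Keeping the table as a plain dictionary means an unrecognized number is reported as "Unknown event" rather than raising an error, which is convenient if a newer Condor version writes events that are not in this list.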
