Jobs
Frequently asked questions related to running jobs.
Errors
Common errors and potential solutions/workarounds for them.
cannot connect to X server / cannot open display
Problem
You receive the following message after entering a command to bring up a graphical window:
cannot connect to X server
cannot open display
Solution
This can happen for several reasons:
- Reason: Your SSH client software does not support graphical display by itself (e.g. SecureCRT or PuTTY).
- Solution: Try using client software such as ThinLinc or MobaXterm, as described in the SSH X11 Forwarding guide.
- Reason: You did not enable X11 forwarding in your SSH connection.
- Solution: If you are in a Windows environment, make sure that X11 forwarding is enabled in your connection settings (e.g. in MobaXterm or PuTTY). If you are in a Linux environment, try
ssh -Y -l username hostname
- Reason: You are trying to open a graphical window within an interactive PBS job. Make sure you are using the -X option with qsub after following the previous step(s) for connecting to the front-end. Please see the example in the Interactive Jobs guide (a minimal sketch also follows this list).
- Reason: If none of the above apply, make sure that you are within the storage quota of your home directory.
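If the issue is a missing -X on an interactive job, a minimal sketch of such a request is shown below; the resource values (one node, one core, 30 minutes) are assumptions and should be adjusted to your needs.
qsub -I -X -l nodes=1:ppn=1 -l walltime=00:30:00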
bash: command not found
Problem
You receive the following message after typing a command:
bash: command not found
Solution
This means the system cannot find your command. Typically, you need to load a module that provides it, as in the sketch below.
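As a hedged illustration (using gcc as a stand-in for whatever command is missing), you can search for and load the module that provides it:
module avail gcc    # list modules matching "gcc"
module load gcc     # load the default version
which gcc           # the command should now be found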
qdel: Server could not connect to MOM 12345.hammer-adm.rcac.purdue.edu
Problem
You receive the following message after attempting to delete a job with the qdel command:
qdel: Server could not connect to MOM 12345.hammer-adm.rcac.purdue.edu
Solution
This error usually indicates that at least one node running your job has stopped responding or crashed. Please forward the job ID to support, and staff can help remove the job from the queue.
bash: module command not found
Problem
You receive the following message after typing a command, e.g. module load intel:
bash: module command not found
Solution
The system cannot find the module command. You need to source the modules.sh file as shown below:
source /etc/profile.d/modules.sh
or make your job script start with an interactive shell, which initializes the module environment automatically:
#!/bin/bash -i
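For example, a minimal PBS job script using this approach might look like the sketch below; the resource request, the intel module, and myprogram are placeholders.
#!/bin/bash -i
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00
# The -i flag above makes bash source the profile scripts, so the module command is available.
module load intel
./myprogram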
1234.hammer-adm.rcac.purdue.edu.SC: line 12: 12345 Killed
Problem
Your PBS job stopped running and you received an email with the following:
/var/spool/torque/mom_priv/jobs/1234.hammer-adm.rcac.purdue.edu.SC: line 12: 12345 Killed <command name>
Solution
This means that the node your job was running on ran out of memory while running your program. This may be because your job, or other jobs sharing your node(s), consumed more memory in total than is available on the node. The operating system killed your program to protect itself. There are two possible causes:
- You requested that your job share node(s) with other jobs, and either your job or one of the other jobs running on the node consumed too much memory. You should request all cores of the node or request exclusive access; exclusive access gives you full control over all the memory on the node.
- Your job requires more memory than is available on the node. You should use more nodes if your job supports MPI, or run a smaller dataset (see the sketch after this list).
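As a hedged sketch, the two remedies correspond to resource requests such as the following; the ppn values are assumptions and should match the actual core count of a Hammer node.
#PBS -l nodes=1:ppn=20   # request all cores on one node so no other job shares its memory
#PBS -l nodes=4:ppn=20   # or spread an MPI job across more nodes so each node needs less memory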
Questions
Frequently asked questions about jobs.
How do I check my job output while it is running?
Problem
After submitting your job to the cluster, you want to see the output that it generates.
Solution
There are two simple ways to do this:
- qpeek: Use the tool qpeek to check the job's output. The syntax of the command is:
qpeek <jobid>
- Redirect your output to a file: To do this, edit the main command in your job script as shown below. Note the redirection, which starts with the greater-than (>) sign.
myapplication ...other arguments... > "${PBS_JOBID}.output"
You can then view the file while the job is running, for example:
tail "<jobid>.output"
What is the "debug" queue?
The debug queue allows you to quickly start small, short, interactive jobs in order to debug code, test programs, or test configurations. You are limited to one running job at a time in the queue, and you may use up to two compute nodes for up to 30 minutes.
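For illustration, a hypothetical interactive submission to the debug queue within the limits above (the ppn value is an assumption):
qsub -I -q debug -l nodes=2:ppn=20 -l walltime=00:30:00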
How can I get email alerts about my PBS job status?
Question
How can I be notified when my PBS job starts running and whether it completed successfully?
Answer
Submit your job with the following command line arguments:
qsub -M email_address -m bea myjobsubmissionfile
Or, include the following in your job submission file.
#PBS -M email_address
#PBS -m bea
The -m option accepts any combination of the letters "a", "b", and "e":
a - mail is sent when the job is aborted by the batch system.
b - mail is sent when the job begins execution.
e - mail is sent when the job terminates.
Can I extend the walltime on a job?
In some circumstances, yes. Walltime extensions must be requested of and completed by staff. Walltime extension requests will be considered on named (your advisor or research lab) queues. Standby or debug queue jobs cannot be extended.
Extension requests are at the discretion of staff based on factors such as any upcoming maintenance or resource availability. Extensions can be made past the normal maximum walltime on named queues but these jobs are subject to early termination should a conflicting maintenance downtime be scheduled.
Please be mindful of time remaining on your job when making requests and make requests at least 24 hours before the end of your job AND during business hours. We cannot guarantee jobs will be extended in time with less than 24 hours notice, after-hours, during weekends, or on a holiday.
We ask that you make accurate walltime requests during job submissions. Accurate walltimes will allow the job scheduler to efficiently and quickly schedule jobs on the cluster. Please consider that extensions can impact scheduling efficiency for all users of the cluster.
Requests can be made by contacting support. We ask that you:
- Provide numerical job IDs, cluster name, and your desired extension amount.
- Provide at least 24 hours notice before job will end (more if request is made on a weekend or holiday).
- Consider making requests during business hours. We may not be able to respond in time to requests made after-hours, on a weekend, or on a holiday.
How do I find the Non-Uniform Memory Access (NUMA) layout on Hammer?
- You can learn about processor layout on Hammer nodes using the following command:
hammer-a003:~$ lstopo-no-graphics
- For detailed IO connectivity:
hammer-a003:~$ lstopo-no-graphics --physical --whole-io
- Please note that NUMA information is useful for advanced MPI/OpenMP/GPU optimizations. For most users, the default NUMA settings in MPI or OpenMP will give the best performance (example placement settings follow this list).
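If you do want to experiment with NUMA-aware placement, hedged examples look like the following; the exact flags depend on your MPI and compiler stack.
export OMP_PLACES=cores OMP_PROC_BIND=close   # keep OpenMP threads close to their data
mpirun --bind-to numa ./myapplication         # Open MPI: bind each MPI rank to a NUMA domain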
Why can't I use --mem=0 when submitting jobs?
Question
Why can't I specify --mem=0 for my job?
Answer
We no longer support requesting unlimited memory (--mem=0), as it has an adverse effect on the way the scheduler allocates jobs and could lead to a large number of nodes being blocked from use.
Most often we suggest relying on the default memory allocation (which is cluster-specific). If you do need a custom amount of memory, you can request it explicitly, for example --mem=20G.
If you want to use the entire node's memory, you can submit the job with the --exclusive option.
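For illustration, hedged sbatch examples of the two approaches (myjob.sh is a placeholder):
sbatch --mem=20G myjob.sh      # request 20 GB of memory explicitly
sbatch --exclusive myjob.sh    # reserve the whole node, including all of its memory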