
Gautschi User Guide

Gautschi is a Community Cluster optimized for communities running traditional, tightly-coupled science and engineering applications.

Link to section 'Overview of Gautschi' of 'Overview of Gautschi' Overview of Gautschi

Gautschi is a Community Cluster optimized for communities running traditional, tightly-coupled science and engineering applications. The Gautschi Community Cluster is equipped with both CPU and GPU compute nodes, each designed for specific computational tasks. The cluster includes Dell CPU compute nodes featuring dual 96-core AMD EPYC "Genoa" processors, providing 192 cores per node and 384 GB of memory. GPU compute nodes come with two 56-core Intel Xeon Platinum 8480+ processors, a total of 1031 GB of CPU memory, and eight NVIDIA H100 GPUs, each with 80 GB of memory. All compute nodes have 400 Gbps NDR InfiniBand interconnects and are in service through 2030.

Access to the Gautschi cluster is offered in units of 48 cores (a one-quarter share of a CPU node) for CPU nodes, or in 5 GPU-year packages for GPU nodes. To purchase access to Gautschi today, go to the Cluster Access Purchase page. Please subscribe to our Community Cluster Program Mailing List to stay informed on the latest purchasing developments, or contact us via email at rcac-cluster-purchase@lists.purdue.edu if you have any questions.

Link to section 'Gautschi Namesake' of 'Overview of Gautschi' Gautschi Namesake

Gautschi is named in honor of Dr. Walter Gautschi, Professor Emeritus of Computer Science and Mathematics at Purdue. More information about his life and impact on Purdue is available in the Biography of Gautschi.

Link to section 'Gautschi Specifications' of 'Overview of Gautschi' Gautschi Specifications

All Gautschi CPU compute nodes have 192 processing cores, 384 GB of memory, and 400 Gbps NDR InfiniBand interconnects. Each Gautschi GPU compute node has eight NVIDIA H100 GPUs with 80 GB of GPU memory each, 112 processing cores, 1031 GB of CPU memory, and 400 Gbps NDR InfiniBand interconnects.

Gautschi Front-Ends
Front-Ends   Number of Nodes   Processors per Node                                                                     Cores per Node   Memory per Node   Retires in
             8                 Two AMD EPYC 9654 96-Core "Genoa" CPUs @ 2.4 GHz                                        192              768 GB            2030
Gautschi Sub-Clusters
Sub-Cluster  Number of Nodes   Processors per Node                                                                     Cores per Node   Memory per Node   Retires in
A            338               Two AMD EPYC 9654 96-Core "Genoa" CPUs @ 2.4 GHz                                        192              384 GB            2030
B            6                 Two AMD EPYC 9654 96-Core "Genoa" CPUs @ 2.4 GHz                                        192              1.5 TB            2030
G            6                 Two AMD EPYC 9554 64-Core "Genoa" CPUs @ 3.1 GHz, two NVIDIA L40S GPUs (48 GB)          128              384 GB            2030
H            20                Two Intel Xeon Platinum 8480+ 56-Core CPUs @ 3.8 GHz, eight NVIDIA H100 GPUs (80 GB)    112              1031 GB           2030

Gautschi nodes run Rocky Linux 9 and use Slurm (Simple Linux Utility for Resource Management) as the batch scheduler for resource and job management. The application of operating system patches occurs as security needs dictate.

On Gautschi, the following set of compiler and message-passing libraries for parallel code are recommended:

  • GCC 14.1.0
  • OpenMPI
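
The recommended set can be loaded with a single command, described in more detail in the module section later in this guide:

$ module load rcac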

Link to section 'Walter Gautschi' of 'Biography of Gautschi' Walter Gautschi

Professor Gautschi is Professor Emeritus of Computer Science and Mathematics in the Department of Computer Science at Purdue. Before coming to Purdue, Professor Gautschi did postdoctoral work as a Janggen-Pöhn Research Fellow at the National Institute of Applied Mathematics in Rome and at the Harvard Computation Laboratory. He also held positions at the National Bureau of Standards, the American University, the Oak Ridge National Laboratory, and the University of Tennessee. Since coming to Purdue, he has been a Fulbright Scholar at the Technical University of Munich and has held visiting appointments at the University of Wisconsin, Argonne National Laboratory, the Wright-Patterson Air Force Base, ETH Zurich, the University of Padova, and the University of Basel.

He has been a Fulbright Lecturer, an ACM National Lecturer, and a SIAM Visiting Lecturer. He is, or has been, on the editorial boards of SIAM Journal on Mathematical Analysis, Numerische Mathematik, Calcolo, and Mathematics of Computation, and has served as a special editor for Linear Algebra and Its Applications. From 1984 to 1995, he was the managing editor of Mathematics of Computation and, since 1991, an honorary editor of Numerische Mathematik. In 2001, Professor Gautschi was elected a Corresponding Member of the Bavarian Academy of Sciences and Humanities and, in the same year, a Foreign Member of the Academy of Sciences of Turin.

Link to section 'Accounts on Gautschi' of 'Accounts' Accounts on Gautschi

Link to section 'Obtaining an Account' of 'Accounts' Obtaining an Account

To obtain an account, you must be part of a research group which has purchased access to Gautschi. Refer to the Accounts / Access page for more details on how to request access.

Link to section 'Outside Collaborators' of 'Accounts' Outside Collaborators

A valid Purdue Career Account is required for access to any resource. If you do not currently have a valid Purdue Career Account you must have a current Purdue faculty or staff member file a Request for Privileges (R4P) before you can proceed.

Logging In

To submit jobs on Gautschi, log in to the submission host gautschi.rcac.purdue.edu via SSH. This submission host is actually 8 front-end hosts: login00.gautschi through login07.gautschi. The login process randomly assigns one of these front-ends to each login to gautschi.rcac.purdue.edu.

Purdue Login

Link to section 'SSH' of 'Purdue Login' SSH

  • SSH to the cluster as usual.
  • When asked for a password, type your password followed by ",push".
  • Your Purdue Duo client will receive a notification to approve the login.

Link to section 'Thinlinc' of 'Purdue Login' Thinlinc

  • When asked for a password, type your password followed by ",push".
  • Your Purdue Duo client will receive a notification to approve the login.
  • The native Thinlinc client will prompt for Duo approval twice due to the way Thinlinc works.
  • The native Thinlinc client also supports key-based authentication.

Passwords

Gautschi supports either Purdue two-factor authentication (Purdue Login) or SSH keys.

SSH Client Software

Secure Shell or SSH is a way of establishing a secure connection between two computers. It uses public-key cryptography to authenticate the user with the remote computer and to establish a secure connection. Its usual function involves logging in to a remote machine and executing commands. There are many SSH clients available for all operating systems:

Linux / Solaris / AIX / HP-UX / Unix:

  • The ssh command is pre-installed. Log in using ssh myusername@gautschi.rcac.purdue.edu from a terminal.

Microsoft Windows:

  • MobaXterm is a small, easy to use, full-featured SSH client. It includes X11 support for remote displays, SFTP capabilities, and limited SSH authentication forwarding for keys.

Mac OS X:

  • The ssh command is pre-installed. You may start a local terminal window from "Applications->Utilities". Log in by typing the command ssh myusername@gautschi.rcac.purdue.edu.

When prompted for a password, enter your Purdue Career Account password followed by ",push". Your Purdue Duo client will then receive a notification to approve the login.

SSH Keys

Link to section 'General overview' of 'SSH Keys' General overview

To connect to Gautschi using SSH keys, you must follow three high-level steps:

  1. Generate a key pair consisting of a private and a public key on your local machine.
  2. Copy the public key to the cluster and append it to $HOME/.ssh/authorized_keys file in your account.
  3. Test if you can ssh from your local computer to the cluster without using your Purdue password.

Detailed steps for different operating systems and specific SSH client software are given below.

Link to section 'Mac and Linux:' of 'SSH Keys' Mac and Linux:

  1. Run ssh-keygen in a terminal on your local machine. You may supply a filename and a passphrase for protecting your private key, but it is not mandatory. To accept the default settings, press Enter without specifying a filename.
    Note: If you do not protect your private key with a passphrase, anyone with access to your computer could SSH to your account on Gautschi.

  2. By default, the key files will be stored in ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub on your local machine.

  3. Copy the contents of the public key into $HOME/.ssh/authorized_keys on the cluster with the following command. When asked for a password, type your password followed by ",push". Your Purdue Duo client will receive a notification to approve the login.

    ssh-copy-id -i ~/.ssh/id_rsa.pub myusername@gautschi.rcac.purdue.edu

    Note: use your actual Purdue account user name.

    If your system does not have the ssh-copy-id command, use this instead:

    cat ~/.ssh/id_rsa.pub | ssh myusername@gautschi.rcac.purdue.edu "mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys"

  4. Test the new key by SSH-ing to the server. The login should now complete without asking for a password.

  5. If the private key has a non-default name or location, you need to specify the key with the -i option:

    ssh -i my_private_key_name myusername@gautschi.rcac.purdue.edu
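
    Alternatively, you can record the key location in your local SSH configuration so the -i option is not needed each time. A minimal sketch of ~/.ssh/config (the host alias and key filename are illustrative):

    Host gautschi
        HostName gautschi.rcac.purdue.edu
        User myusername
        IdentityFile ~/.ssh/my_private_key_name

    You can then connect with simply ssh gautschi.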

Link to section 'Windows:' of 'SSH Keys' Windows:

Windows SSH Instructions
Programs Instructions
MobaXterm Open a local terminal and follow Linux steps
Git Bash Follow Linux steps
Windows 10 PowerShell Follow Linux steps
Windows 10 Subsystem for Linux Follow Linux steps
PuTTY Follow steps below

PuTTY:

  1. Launch PuTTYgen, keep the default key type (RSA) and length (2048 bits), and click the Generate button.

    PuTTYgen interface
    The "Generate" button can be found under the "Actions" section of the PuTTY Key Generator interface.
  2. Once the key pair is generated:

    Use the Save public key button to save the public key, e.g. Documents\SSH_Keys\mylaptop_public_key.pub

    Use the Save private key button to save the private key, e.g. Documents\SSH_Keys\mylaptop_private_key.ppk. When saving the private key, you can also choose a reminder comment, as well as an optional passphrase to protect your key, as shown in the image below. Note: If you do not protect your private key with a passphrase, anyone with access to your computer could SSH to your account on Gautschi.

    PuTTY Key Generator form with the passphrase and comment fields highlighted
    The PuTTY Key Generator form has inputs for the Key passphrase and optional reminder comment.

    From the menu of PuTTYgen, use the "Conversions -> Export OpenSSH key" tool to convert the private key into OpenSSH format, e.g. Documents\SSH_Keys\mylaptop_private_key.openssh, to be used later for ThinLinc.

  3. Configure PuTTY to use key-based authentication:

    Launch PuTTY and navigate to "Connection -> SSH -> Auth" on the left panel, click the Browse button under the "Authentication parameters" section, and choose your private key, e.g. mylaptop_private_key.ppk

    PuTTY Auth panel
    After clicking the Connection -> SSH -> Auth panel, the "Browse" option can be found at the bottom of the resulting panel.

    Navigate back to "Session" on the left panel. Highlight "Default Settings" and click the "Save" button to ensure the change is saved.

  4. Connect to the cluster. When asked for a password, type your password followed by ",push". Your Purdue Duo client will receive a notification to approve the login. Copy the contents of the public key from PuTTYgen as shown below and paste it into $HOME/.ssh/authorized_keys. Please double-check that your text editor did not wrap or fold the pasted value (it should be one very long line).

    PuTTY Key Generator form with the generated key highlighted
    The "Public key" will look like a long string of random letters and numbers in a text box at the top of the window.
  5. Test by connecting to the cluster. If successful, you will not be prompted for a password or receive a Duo notification. If you protected your private key with a passphrase in step 2, you will instead be prompted to enter your chosen passphrase when connecting.

SSH X11 Forwarding

SSH supports tunneling of X11 (X-Windows). If you have an X11 server running on your local machine, you may use X11 applications on remote systems and have their graphical displays appear on your local machine. These X11 connections are tunneled and encrypted automatically by your SSH client.

Link to section 'Installing an X11 Server' of 'SSH X11 Forwarding' Installing an X11 Server

To use X11, you will need to have a local X11 server running on your personal machine. Both free and commercial X11 servers are available for various operating systems.

Linux / Solaris / AIX / HP-UX / Unix:

  • An X11 server is at the core of all graphical sessions. If you are logged in to a graphical environment on these operating systems, you are already running an X11 server.
  • ThinLinc is an alternative to running an X11 server directly on your Linux computer. ThinLinc is a service that allows you to connect to a persistent remote graphical desktop session.

Microsoft Windows:

  • ThinLinc is an alternative to running an X11 server directly on your Windows computer. ThinLinc is a service that allows you to connect to a persistent remote graphical desktop session.
  • MobaXterm is a small, easy to use, full-featured SSH client. It includes X11 support for remote displays, SFTP capabilities, and limited SSH authentication forwarding for keys.

Mac OS X:

  • X11 is available as an optional install on the Mac OS X install disks prior to 10.7/Lion. Run the installer, select the X11 option, and follow the instructions. For 10.7+ please download XQuartz.
  • ThinLinc is an alternative to running an X11 server directly on your Mac computer. ThinLinc is a service that allows you to connect to a persistent remote graphical desktop session.

Link to section 'Enabling X11 Forwarding in your SSH Client' of 'SSH X11 Forwarding' Enabling X11 Forwarding in your SSH Client

Once you are running an X11 server, you will need to enable X11 forwarding/tunneling in your SSH client:

  • ssh: X11 tunneling should be enabled by default. To be certain it is enabled, you may use ssh -Y.
  • MobaXterm: Select "New session" and "SSH." Under "Advanced SSH Settings" check the box for X11 Forwarding.

SSH will set the remote environment variable $DISPLAY to "localhost:XX.YY" when this is working correctly. If you had previously set your $DISPLAY environment variable to your local IP or hostname, you must remove any set/export/setenv of this variable from your login scripts. The environment variable $DISPLAY must be left as SSH sets it, which is to a random local port address. Setting $DISPLAY to an IP or hostname will not work.
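
To verify that forwarding is working, log in with X11 forwarding enabled and inspect the variable; the exact display number will vary (the output shown is illustrative):

$ ssh -Y myusername@gautschi.rcac.purdue.edu
$ echo $DISPLAY
localhost:10.0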

ThinLinc

RCAC provides Cendio's ThinLinc as an alternative to running an X11 server directly on your computer. It allows you to run graphical applications or graphical interactive jobs directly on Gautschi through a persistent remote graphical desktop session.

ThinLinc is a service that allows you to connect to a persistent remote graphical desktop session. This service works very well over a high latency, low bandwidth, or off-campus connection compared to running an X11 server locally. It is also very helpful for Windows users who do not have an easy to use local X11 server, as little to no set up is required on your computer.

There are two ways in which to use ThinLinc: preferably through the native client or through a web browser.

Link to section 'Installing the ThinLinc native client' of 'ThinLinc' Installing the ThinLinc native client

The native ThinLinc client will offer the best experience especially over off-campus connections and is the recommended method for using ThinLinc. It is compatible with Windows, Mac OS X, and Linux.

  • Download the ThinLinc client from the ThinLinc website.
  • Start the ThinLinc client on your computer.
  • In the client's login window, use desktop.gautschi.rcac.purdue.edu as the Server. Use your Purdue Career Account username and password, but append ",push" to your password.
  • Click the Connect button.
  • Your Purdue Login Duo will receive a notification to approve your login.
  • Continue to following section on connecting to Gautschi from ThinLinc.

Link to section 'Using ThinLinc through your web browser' of 'ThinLinc' Using ThinLinc through your web browser

The ThinLinc service can be accessed from your web browser as a convenient alternative to installing the native client. This option requires no setup and is a good choice on computers where you do not have privileges to install software. All that is required is an up-to-date web browser. Older versions of Internet Explorer may not work.

  • Open a web browser and navigate to desktop.gautschi.rcac.purdue.edu.
  • Log in with your Purdue Career Account username and password, but append ",push" to your password.
  • You may safely proceed past any warning messages from your browser.
  • Your Purdue Login Duo will receive a notification to approve your login.
  • Continue to the following section on connecting to Gautschi from ThinLinc.

Link to section 'Connecting to Gautschi from ThinLinc' of 'ThinLinc' Connecting to Gautschi from ThinLinc

  • Once logged in, you will be presented with a remote Linux desktop running directly on a cluster front-end.
  • Open the terminal application on the remote desktop.
  • Once logged in to the Gautschi head node, you may use graphical editors, debuggers, software like Matlab, or run graphical interactive jobs. For example, to test the X forwarding connection issue the following command to launch the graphical editor gedit:
    $ gedit
  • This session will remain persistent even if you disconnect from the session. Any interactive jobs or applications you left running will continue running even if you are not connected to the session.

Link to section 'Tips for using ThinLinc native client' of 'ThinLinc' Tips for using ThinLinc native client

  • To exit a full screen ThinLinc session press the F8 key on your keyboard (fn + F8 key for Mac users) and click to disconnect or exit full screen.
  • Full screen mode can be disabled when connecting to a session by clicking the Options button and disabling full screen mode from the Screen tab.

Link to section 'Configure ThinLinc to use SSH Keys' of 'ThinLinc' Configure ThinLinc to use SSH Keys

  • The web client does NOT support public-key authentication.
  • The ThinLinc native client supports the use of an SSH key pair. For help generating and uploading keys to the cluster, see the SSH Keys section of this user guide.

    To set up SSH key authentication on the ThinLinc client:

    • Open the Options panel, and select Public key as your authentication method on the Security tab.

      ThinLinc Options window
      The "Options..." button in the ThinLinc Client can be found towards the bottom left, above the "Connect" button.
    • In the options dialog, switch to the "Security" tab and select the "Public key" radio button:

      ThinLinc's Security tab
      The "Security" tab found in the options dialog, will be the last of available tabs. The "Public key" option can be found in the "Authentication method" options group.
    • Click OK to return to the ThinLinc Client login window. You should now see a Key field in place of the Password field.
    • In the Key field, type the path to your locally stored private key or click the ... button to locate and select the key on your local system. Note: If PuTTY is used to generate the SSH Key pairs, please choose the private key in the openssh format.

      Thinlinc login with key
      The ThinLinc Client login window will now display a Key field instead of a Password field.

Software

Link to section 'Environment module' of 'Software' Environment module

Link to section 'Software catalog' of 'Software' Software catalog

Our clusters provide a number of software packages to users of the system via the module command.

Link to section 'Environment Management with the Module Command' of 'Environment Management with the Module Command' Environment Management with the Module Command

The module command is the preferred method to manage your processing environment. With this command, you may load applications and compilers along with their libraries and paths. Modules are packages that you load and unload as needed.

Please use the module command and do not manually configure your environment, as staff may change the specifics of various packages over time. If you use the module command to manage your environment, these changes will be transparent to you.

Link to section 'Hierarchy' of 'Environment Management with the Module Command' Hierarchy

Many modules have dependencies on other modules. For example, a particular openmpi module requires a specific version of the Intel compiler to be loaded. Often, these dependencies are not clear to users of the module, and there are many modules which may conflict. Arranging modules in a hierarchical fashion makes this dependency clear. This arrangement also helps make the software stack easy to understand - your view of the modules will not be cluttered with a bunch of conflicting packages.

Your default module view on Gautschi will include a set of compilers and a set of basic software that has no dependencies (such as Matlab and Fluent). To make software available that depends on a compiler, you must first load the compiler, and then software which depends on it becomes available to you. In this way, all software you see when doing module avail is completely compatible with each other.

Link to section 'Using the Hierarchy' of 'Environment Management with the Module Command' Using the Hierarchy


To see what modules are available on this system by default:

$ module avail

To see which versions of a specific compiler are available on this system:

$ module avail gcc
$ module avail intel

To continue further into the hierarchy of modules, you will need to choose a compiler. As an example, if you are planning on using the Intel compiler you will first want to load the Intel compiler:

$ module load intel

With intel loaded, you can repeat the avail command, and at the bottom of the output you will see the section of additional software that the intel module provides:

$ module avail

Several of these new packages also provide additional software packages, such as MPI libraries. You can repeat the last two steps with one of the MPI packages such as openmpi and you will have a few more software packages available to you.

If you are looking for a specific software package and do not see it in your default view, the module command provides a search function for searching the entire hierarchy tree of modules without the need for you to manually load and avail on every module.

To search for a software package:

$ module spider openmpi
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  openmpi:
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Versions:
        openmpi/4.1.6
        openmpi/5.0.5

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  For detailed information about a specific "openmpi" package (including how to load the modules) use the module's full name.
  Note that names that have a trailing (E) are extensions provided by other modules.
  For example:

     $ module spider openmpi/5.0.5
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


This will search for the openmpi software package. If you do not specify a version, you will be given a list of the versions available on the system. Select the version you wish to use and run module spider on it to see how to access the module:

$ module spider openmpi/5.0.5
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  openmpi: openmpi/5.0.5
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    You will need to load all module(s) on any one of the lines below before the "openmpi/5.0.5" module is available to load.

      gcc/11.4.1
      gcc/14.1.0
 
    Help:
      An open source Message Passing Interface implementation. The Open MPI
      Project is an open source Message Passing Interface implementation that
      is developed and maintained by a consortium of academic, research, and
      industry partners. Open MPI is therefore able to combine the expertise,
      technologies, and resources from all across the High Performance
      Computing community in order to build the best MPI library available.
      Open MPI offers advantages for system and software vendors, application
      developers and computer science researchers.

The output of this command will instruct you that you can load this module directly, or in the case of the above example, that you will need to first load a module or two. With the information provided with this command, you can now construct a load command to load a version of OpenMPI into your environment:

$ module load gcc/14.1.0 openmpi/5.0.5

Some user communities may maintain copies of their domain software for others to use. For example, the Purdue Bioinformatics Core provides a wide set of bioinformatics software for use by any user of RCAC clusters via the bioinfo module. The spider command will also search this repository of modules. If it finds a software package available in the bioinfo module repository, the spider command will instruct you to load the bioinfo module first.

Link to section 'Load / Unload a Module' of 'Environment Management with the Module Command' Load / Unload a Module

All modules consist of both a name and a version number. When loading a module, you may use only the name to load the default version, or you may specify which version you wish to load.

For each cluster, RCAC makes a recommendation regarding the set of compiler, math library, and MPI library for parallel code. To load the recommended set:

$ module load rcac

To verify what you loaded:

$ module list

To load the default version of a specific compiler, choose one of the following commands:

$ module load gcc
$ module load intel

To load a specific version of a compiler, include the version number:

$ module load gcc/11.2.0

When running a job, you must load any relevant modules in your job submission script so that they are available on the compute node(s). Loading modules on the front end before submitting your job makes the software available to your session on the front-end, but not to your job's environment.

To unload a compiler or software package you loaded previously:

$ module unload gcc
$ module unload intel
$ module unload matlab

To unload all currently loaded modules and reset your environment:

$ module purge

Link to section 'Show Module Details' of 'Environment Management with the Module Command' Show Module Details

To learn more about what a module does to your environment, you may use the module show command.

$ module show matlab

Here is an example showing what loading the default Matlab does to the processing environment:

-------------------------------------------------------------------------------------------------------------------------------------------
   /opt/spack/modulefiles/Core/matlab/R2022a.lua:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
whatis("Name : matlab")
whatis("Version : R2022a")
...
setenv("MATLAB_HOME","/apps/spack/gautschi/apps/matlab/R2022a-gcc-8.5.0-u54n6sa")
setenv("RCAC_MATLAB_ROOT","/apps/spack/gautschi/apps/matlab/R2022a-gcc-8.5.0-u54n6sa")
setenv("RCAC_MATLAB_VERSION","R2022a")
setenv("MATLAB","/apps/spack/gautschi/apps/matlab/R2022a-gcc-8.5.0-u54n6sa")
setenv("MLROOT","/apps/spack/gautschi/apps/matlab/R2022a-gcc-8.5.0-u54n6sa")
setenv("ARCH","glnxa64")
append_path("PATH","/apps/spack/gautschi/apps/matlab/R2022a-gcc-8.5.0-u54n6sa/bin/glnxa64:/apps/spack/gautschi/apps/matlab/R2019a-gcc-4.8.5-jg35hvf/bin")
append_path("CMAKE_PREFIX_PATH","/apps/spack/gautschi/apps/matlab/R2022a-gcc-8.5.0-u54n6sa/")
append_path("LD_LIBRARY_PATH","/apps/spack/gautschi/apps/matlab/R2022a-gcc-8.5.0-u54n6sa/runtime/glnxa64:/apps/spack/gautschi/apps/matlab/R2022a-gcc-8.5.0-u54n6sa/bin/glnxa64")

For more information, see the Lmod documentation.

Running Jobs

Jobs are submitted to Gautschi using SLURM, which performs job scheduling for the cluster's partitions. Jobs may be any type of program. You may use either batch or interactive mode to run your jobs: use batch mode for finished programs, and interactive mode only for debugging.

In this section, you'll find a few pages describing the basics of creating and submitting SLURM jobs, as well as a number of example SLURM jobs that you may be able to adapt to your own needs.

Basics of SLURM Jobs

The Simple Linux Utility for Resource Management (SLURM) is a system providing job scheduling and job management on compute clusters. With SLURM, a user requests resources and submits a job to a queue. The system will then take jobs from queues, allocate the necessary nodes, and execute them.

Do NOT run large, long, multi-threaded, parallel, or CPU-intensive jobs on a front-end login host. All users share the front-end hosts, and running anything but the smallest test job will negatively impact everyone's ability to use Gautschi. Always use SLURM to submit your work as a job.

Link to section 'Submitting a Job' of 'Basics of SLURM Jobs' Submitting a Job

The main steps to submitting a job are covered in the pages linked below, along with other basic information about jobs. A number of example SLURM jobs are also available.

Job Submission Script

To submit work to a SLURM queue, you must first create a job submission file. This job submission file is essentially a simple shell script. It will set any required environment variables, load any necessary modules, create or modify files and directories, and run any applications that you need:

#!/bin/bash
# FILENAME:  myjobsubmissionfile

# Loads Matlab and sets the application up
module load matlab

# Change to the directory from which you originally submitted this job.
cd $SLURM_SUBMIT_DIR

# Runs a Matlab script named 'myscript'
matlab -nodisplay -singleCompThread -r myscript

Once your script is prepared, you are ready to submit your job.

Link to section 'Job Script Environment Variables' of 'Job Submission Script' Job Script Environment Variables

SLURM sets several potentially useful environment variables which you may use within your job submission files. Here is a list of some:
Name Description
SLURM_SUBMIT_DIR Absolute path of the current working directory when you submitted this job
SLURM_JOBID Job ID number assigned to this job by the batch system
SLURM_JOB_NAME Job name supplied by the user
SLURM_JOB_NODELIST Names of nodes assigned to this job
SLURM_CLUSTER_NAME Name of the cluster executing the job
SLURM_SUBMIT_HOST Hostname of the system where you submitted this job
SLURM_JOB_PARTITION Name of the original queue to which you submitted this job
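
As an illustration, a job script can reference these variables directly. A minimal sketch (the script name is illustrative) that records where and what it is running:

#!/bin/bash
# FILENAME:  envdemo.sub  (illustrative example)

# Report job identity and location using SLURM-provided variables.
echo "Job $SLURM_JOBID ($SLURM_JOB_NAME) submitted from $SLURM_SUBMIT_HOST"
echo "Running on node(s): $SLURM_JOB_NODELIST in partition $SLURM_JOB_PARTITION"

# Run from the directory the job was submitted from.
cd $SLURM_SUBMIT_DIR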

Submitting a Job

Once you have a job submission file, you may submit this script to SLURM using the sbatch command. SLURM will find, or wait for, available resources matching your request and run your job there.

On Gautschi, in order to submit jobs, you need to specify the partition, account, and Quality of Service (QoS) to which you want to submit your jobs. To familiarize yourself with the partitions and QoSes available on Gautschi, visit Gautschi Queues and Partitions. To check the available partitions on Gautschi, you can use the showpartitions command; to check your available accounts, you can use the slist command. Slurm uses the term "Account" with the option -A or --account= to specify different batch accounts, the option -p or --partition= to select a specific partition for job submission, and the option -q or --qos= to select a Quality of Service.

showpartitions

Partition statistics for cluster gautschi at Fri Dec  6 05:26:30 PM EST 2024
     Partition     #Nodes     #CPU_cores  Cores_pending   Job_Nodes MaxJobTime Cores Mem/Node
     Name State Total  Idle  Total   Idle Resorc  Other   Min   Max  Day-hr:mn /node     (GB)
       ai    up    20    20   2240   2240      0      0     1 infin   infinite   112    1031 
      cpu    up   336   300  64512  57600      0      0     1 infin   infinite   192     386 
  highmem    up     6     6   1152   1152      0      0     1 infin   infinite   192    1547   
 smallgpu    up     6     6    768    768      0      0     1 infin   infinite   128     386 
profiling    up     2     2    384    384      0      0     1 infin   infinite   192     386 


Link to section 'CPU Partition' of 'Submitting a Job' CPU Partition

The CPU partition on Gautschi has two Quality of Service (QoS) levels: normal and standby. To submit your job to one compute node on the cpu partition with the 'normal' QoS, which has high priority:

$ sbatch --nodes=1 --ntasks=1 --partition=cpu --account=accountname --qos=normal myjobsubmissionfile
$ sbatch -N1 -n1 -p cpu -A accountname -q normal myjobsubmissionfile

To submit your job to one compute node on the cpu partition with the 'standby' QoS, which has low priority:

$ sbatch --nodes=1 --ntasks=1 --partition=cpu --account=accountname --qos=standby myjobsubmissionfile
$ sbatch -N1 -n1 -p cpu -A accountname -q standby myjobsubmissionfile

Link to section 'AI Partition' of 'Submitting a Job' AI Partition

The AI partition on Gautschi has two Quality of Service (QoS) levels: normal and preemptible. To submit your job to one compute node requesting one GPU on the ai partition with the 'normal' QoS, which has high priority:

$ sbatch --nodes=1 --gpus-per-node=1 --partition=ai --account=accountname --qos=normal myjobsubmissionfile
$ sbatch -N1 --gpus-per-node=1 -p ai -A accountname -q normal myjobsubmissionfile

To submit your job to one compute node requesting one GPU on the ai partition with the 'preemptible' QoS, which has low priority but is charged at a reduced rate:

$ sbatch --nodes=1 --gpus-per-node=1 --partition=ai --account=accountname --qos=preemptible myjobsubmissionfile
$ sbatch -N1 --gpus-per-node=1 -p ai -A accountname -q preemptible myjobsubmissionfile

Link to section 'Highmem Partition' of 'Submitting a Job' Highmem Partition

To submit your job to a compute node on the highmem partition, you don't need to specify the QoS name; only one QoS exists for this partition, and the default is normal. The highmem partition is intended for jobs with memory requirements that exceed the capacity of a standard node, so the number of requested tasks should be appropriately high (jobs in this partition must use more than 48 of the 192 cores; see the Queues section).
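
A matching submission might look like the following (accountname is a placeholder; 96 tasks are shown to satisfy the partition's requirement that jobs use more than 48 cores):

$ sbatch --nodes=1 --ntasks=96 --partition=highmem --account=accountname myjobsubmissionfile
$ sbatch -N1 -n96 -p highmem -A accountname myjobsubmissionfile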

Link to section 'Profiling Partition' of 'Submitting a Job' Profiling Partition

To submit your job to a compute node on the profiling partition, you also don’t need to specify the QoS name because only one QoS exists for this partition, and the default is normal.

$ sbatch --nodes=1 --ntasks=1 --partition=profiling --account=accountname myjobsubmissionfile
$ sbatch -N1 -n1 -p profiling -A accountname myjobsubmissionfile

Link to section 'Smallgpu Partition' of 'Submitting a Job' Smallgpu Partition

To submit your job to a compute node on the smallgpu partition, you don’t need to specify the QoS name because only one QoS exists for this partition, and the default is normal. You should request cores proportional to the number of GPUs you are using in this partition (i.e. if you only need one of the two GPUs, you should request half of the cores on the node).

$ sbatch --nodes=1 --ntasks=64 --gpus-per-node=1 --partition=smallgpu --account=accountname myjobsubmissionfile
$ sbatch -N1 -n64 --gpus-per-node=1 -p smallgpu -A accountname myjobsubmissionfile

Link to section 'General Information' of 'Submitting a Job' General Information

By default, each job receives 30 minutes of wall time, or clock time. If you know that your job will not need more than a certain amount of time to run, request less than the maximum wall time, as this may allow your job to run sooner. To request 1 hour and 30 minutes of wall time:

$ sbatch -t 01:30:00 -N 1 -n 1 -p cpu -A accountname -q standby myjobsubmissionfile

The --nodes= or -N value indicates how many compute nodes you would like for your job, and the --ntasks= or -n value indicates the number of tasks you want to run.

In some cases, you may want to request multiple nodes. To utilize multiple nodes, you will need to have a program or code that is specifically programmed to use multiple nodes such as with MPI. Simply requesting more nodes will not make your work go faster. Your code must support this ability.

To request 2 compute nodes:

$ sbatch -t 01:30:00 -N 2 -n 16 -p cpu -A accountname -q standby myjobsubmissionfile
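
For an MPI code, a matching job submission file might look like the following sketch (the module versions and the executable name my_mpi_app are illustrative):

#!/bin/bash
# FILENAME:  myjobsubmissionfile

# Load a compiler and MPI stack for the program.
module load gcc/14.1.0 openmpi/5.0.5

# Launch the program once per requested task across the allocated nodes.
srun ./my_mpi_app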

By default, jobs on Gautschi will share nodes with other jobs.

If more convenient, you may also specify any command line options to sbatch from within your job submission file, using a special form of comment:

#!/bin/sh -l
# FILENAME:  myjobsubmissionfile

#SBATCH --account=accountname
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --partition=cpu
#SBATCH --qos=normal
#SBATCH --time=1:30:00
#SBATCH --job-name myjobname

# Print the hostname of the compute node on which this job is running.
/bin/hostname

If an option is present in both your job submission file and on the command line, the option on the command line will take precedence.

After you submit your job with sbatch, it may wait in the queue for minutes, hours, or even weeks. How long it takes for a job to start depends on the specific queue, the resources and time requested, and the other jobs already waiting in that queue. It is impossible to say for sure when any given job will start. For best results, request no more resources than your job requires.

Once your job is submitted, you can monitor the job status, wait for the job to complete, and check the job output.

Checking Job Status

Once a job is submitted there are several commands you can use to monitor the progress of the job.

To see your jobs, use the squeue -u command and specify your username:

(Remember, in our SLURM environment a queue is referred to as an 'Account')

squeue -u myusername

    JOBID   ACCOUNT    NAME    USER   ST    TIME   NODES  NODELIST(REASON)
   182792   standby    job1    myusername    R   20:19       1  gautschi-a000
   185841   standby    job2    myusername    R   20:19       1  gautschi-a001
   185844   standby    job3    myusername    R   20:18       1  gautschi-a002
   185847   standby    job4    myusername    R   20:18       1  gautschi-a003

To retrieve useful information about your queued or running job, use the scontrol show job command with your job's ID number. The output should look similar to the following:

scontrol show job 3519

JobId=3519 JobName=t.sub
   UserId=myusername GroupId=mygroup MCS_label=N/A
   Priority=3 Nice=0 Account=(null) QOS=(null)
   JobState=PENDING Reason=BeginTime Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2019-08-29T16:56:52 EligibleTime=2019-08-29T23:30:00
   AccrueTime=Unknown
   StartTime=2019-08-29T23:30:00 EndTime=2019-09-05T23:30:00 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-08-29T16:56:52
   Partition=workq AllocNode:Sid=mack-fe00:54476
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/myusername/jobdir/myjobfile.sub
   WorkDir=/home/myusername/jobdir
   StdErr=/home/myusername/jobdir/slurm-3519.out
   StdIn=/dev/null
   StdOut=/home/myusername/jobdir/slurm-3519.out
   Power=

There are several useful bits of information in this output.

  • JobState lets you know if the job is Pending, Running, Completed, or Held.
  • RunTime and TimeLimit will show how long the job has run and its maximum time.
  • SubmitTime is when the job was submitted to the cluster.
  • The job's number of Nodes, Tasks, Cores (CPUs) and CPUs per Task are shown.
  • WorkDir is the job's working directory.
  • StdOut and Stderr are the locations of stdout and stderr of the job, respectively.
  • Reason will show why a PENDING job isn't running. The above error says that it has been requested to start at a specific, later time.

Checking Job Output

Once a job is submitted, and has started, it will write its standard output and standard error to files that you can read.

SLURM catches output written to standard output and standard error - what would be printed to your screen if you ran your program interactively. Unless you specified otherwise, SLURM will put the output in the directory where you submitted the job, in a file named slurm- followed by the job ID, with the extension out, for example slurm-3509.out. Note that both stdout and stderr will be written into the same file, unless you specify otherwise.

If your program writes its own output files, those files will be created as defined by the program. This may be in the directory where the program was run, or may be defined in a configuration or input file. You will need to check the documentation for your program for more details.

Link to section 'Redirecting Job Output' of 'Checking Job Output' Redirecting Job Output

It is possible to redirect job output to somewhere other than the default location with the --error and --output directives:

#!/bin/bash
#SBATCH --output=/home/myusername/joboutput/myjob.out
#SBATCH --error=/home/myusername/joboutput/myjob.out

# This job prints "Hello World" to output and exits
echo "Hello World"

Holding a Job

Sometimes you may want to submit a job but not have it run just yet. For example, you may want to allow lab mates to cut in front of you in the queue: hold the job until their jobs have started, and then release yours.

To place a hold on a job before it starts running, use the scontrol hold command:

$ scontrol hold myjobid

Once a job has started running, it cannot be placed on hold.

To release a hold on a job, use the scontrol release command:

$ scontrol release myjobid

You can find the job ID using the squeue command as explained in the SLURM Job Status section.

Job Dependencies

Dependencies are an automated way of holding and releasing jobs. Jobs with a dependency are held until the condition is satisfied; once it is satisfied, they become eligible to run but must still queue as normal.

Job dependencies may be configured to ensure jobs start in a specified order. Jobs can be configured to run after other job state changes, such as when the job starts or the job ends.

These examples illustrate setting dependencies in several ways. Typically dependencies are set by capturing and using the job ID from the last job submitted, as shown in the sketch after the examples below.

To run a job after job myjobid has started:

sbatch --dependency=after:myjobid myjobsubmissionfile

To run a job after job myjobid ends without error:

sbatch --dependency=afterok:myjobid myjobsubmissionfile

To run a job after job myjobid ends with errors:

sbatch --dependency=afternotok:myjobid myjobsubmissionfile

To run a job after job myjobid ends with or without errors:

sbatch --dependency=afterany:myjobid myjobsubmissionfile

To set more complex dependencies on multiple jobs and conditions:

sbatch --dependency=after:myjobid1:myjobid2:myjobid3,afterok:myjobid4 myjobsubmissionfile
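
In a shell script, the job ID is typically captured at submission time with sbatch's --parsable option, which prints just the job ID of the submitted job. A minimal sketch (the submission file names are illustrative):

# Submit the first job and capture its job ID.
jobid=$(sbatch --parsable first_job.sub)

# Submit a second job that runs only after the first ends without error.
sbatch --dependency=afterok:$jobid second_job.sub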

Canceling a Job

To stop a job before it finishes or remove it from a queue, use the scancel command:

scancel myjobid

You can find the job ID using the squeue command as explained in the SLURM Job Status section.

Queues

On Gautschi, the required options for job submission deviate from some of the other community clusters you might have experience using. In general, every job submission will have four parts:

  1. The number and type of resources you want 
  2. The account the resources should come out of
  3. The quality of service (QOS) this job expects from the resources
  4. The partition where the resources are located

If you have used other clusters, you will be familiar with the first item. If you have not, see the Basics of SLURM Jobs section above for how to format the request for those resources. The rest of this page will focus on the last three items, as they are specific to Gautschi.

Link to section 'Accounts' of 'Queues' Accounts

On the Gautschi community cluster, users will have access to one or more accounts, also known as queues. These accounts are dedicated to and named after each partner who has purchased access to the cluster, and they provide partners and their researchers with priority access to their portion of the cluster. These accounts can be thought of as bank accounts that contain the resources a group has purchased access to which may include some number of cores, GPU hours or both. To see the list of accounts that you have access to on Gautschi as well as the resources they contain, you can use the command "slist".

On previous clusters, you may have had access to several other special accounts like "standby", "highmem", "gpu", etc. On Gautschi, these modes of access still exist, but they are no longer accounts. Additionally, on Gautschi you must explicitly specify the account you want to submit to using the "-A/--account=" option.

Link to section 'Quality of Service (QOS)' of 'Queues' Quality of Service (QOS)

On Gautschi, we use a Slurm concept called a Quality of Service, or QOS. A QOS can be thought of as a tag for a job that tells the scheduler how that job should be treated with respect to limits, priority, etc. The cluster administrators define the available QOSes as well as the policies for how each QOS should be treated on the cluster. A toy example of such a policy might be: "no single user can have more than 200 jobs tagged with the QOS named highpriority". There are two classes of QOSes, and a job can have both:

  1. Partition QOSes: A partition QOS is a tag that is automatically added to your job when you submit to a partition that defines a partition QOS.
  2. Job QOSes: A Job QOS is a tag that you explicitly give to a job using the option "-q/--qos=". By explicitly tagging your jobs this way, you can choose the policy that each one of your jobs should abide by. We will describe the policies for the available job QOSes in the partition section below.

As an extended metaphor, if we think of a job as a package that we need to have shipped to some destination, then the partition can be thought of as the carrier we decide to ship our package with. That carrier is going to have some company policies that dictate how you need to label/pack that package, and that company policy is like the partition QOS. It is the policy that is enforced for simply deciding to submit to a particular partition. The Job QOS can then be thought of as the various different types of shipping options that carrier might offer. You might pay extra to have that package shipped overnight or have tracking information for it or you may choose to pay less and have your package arrive as available. Once we decide to go with a particular carrier, we are subject to their company policy, but we may negotiate how they handle that package by choosing one of their available shipping options. In the same way, when you choose to submit to a partition, you are subject to the limits enforced by the partition QOS and you may be able to ask for your job to be handled a particular way by specifying the job QOS, but that option has to be offered by the partition.

In order for a job to use a Job QOS, the user submitting the job must have access to the QOS, the account the job is being submitted to must accept the QOS, and the partition the job is being submitted to must accept the QOS. Every user and every account on Gautschi has access to the following job QOSes:

  1. "normal": The "normal" QOS is the default job QOS on the cluster meaning if you do not explicitly list an alternative job QOS, your job will be tagged with this QOS. The policy for this QOS provides a high priority and does not add any additional limits.
  2. "standby": The "standby" QOS must be explicitly used if desired by using the option "-q standby" or "--qos=standby". The policy for this QOS gives access to idle resources on the cluster. Jobs tagged with this QOS are "low priority" jobs and are only allowed to run for up to four hours at a time, however the resources used by these jobs do not count against the resources in your Account. For users of our previous clusters, usage of this QOS replaces the previous -A standby style of submission. 

Some of these QOSes may not be available in every partition. Each of the partitions in the following section will enumerate which of these QOSes are allowed in the partition. 

Link to section 'Partitions' of 'Queues' Partitions

On Gautschi, the various types of nodes on the cluster are organized into distinct partitions. This allows CPU and GPU nodes to be charged separately and differently. This also means that instead of only needing to specify the account name in the job script, the desired partition must also be specified. Each of these partitions is subject to different limitations and has a specific use case, described below.

Link to section 'CPU Partition' of 'Queues' CPU Partition

This partition contains the resources a group purchases access to when they purchase CPU resources on Gautschi and is made up of 336 Gautschi-A nodes. Each of these nodes contains two Zen 4 AMD EPYC 9654 96-core processors for a total of 192 cores and 384 GB of memory, giving more than 64,000 cores in the partition. Memory in this partition is allocated proportionally to your core request, such that each core requested is given about 2 GB of memory. Submission to this partition can be accomplished by using the option: "-p cpu" or "--partition=cpu".
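
For example, a job requesting 96 cores will be allocated roughly 96 * 2 GB = 192 GB of memory.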

The purchasing model for this partition allows groups to purchase high-priority access to some number of cores. When an account uses resources in this partition by submitting a job tagged with the "normal" QOS, the cores used by that job are withdrawn from the account and deposited back when the job terminates.

When using the CPU partition, jobs are tagged by the "normal" QOS by default, but they can be tagged with the "standby" QOS if explicitly submitted using the "-q standby" or "--qos=standby" option.

  1. Jobs tagged with the "normal" QOS are subject to the following policies:
    1. Jobs have a high priority and should not need to wait very long before starting.
    2. Any cores requested by these jobs are withdrawn from the account until the job terminates.
    3. These jobs can run for up to two weeks at a time. 
  2. Jobs tagged with the "standby" QOS are subject to the following policies:
    1. Jobs have a low priority and there is no expectation of job start time. If the partition is very busy with jobs using the "normal" QOS or if you are requesting a very large job, then jobs using the "standby" QOS may take hours or days to start.
    2. These jobs can use idle resources on the cluster and as such cores requested by these jobs are not withdrawn from the account to which they were submitted. 
    3. These jobs can run for up to four hours at a time.

Groups who purchased GPU resources on Gautschi will have the ability to run jobs in this partition tagged with the "standby" QOS described above, but will not be able to submit jobs tagged with the "normal" QOS unless they have purchased CPU resources on the cluster.

Available QOSes: "normal", "standby"

Link to section 'AI Partition' of 'Queues' AI Partition

This partition contains the resources a group purchases access to when they purchase GPU resources on Gautschi and is made up of 20 Gautschi-H nodes. Each of these nodes contains two Intel Xeon Platinum 8480+ 56-core processors and eight NVLinked H100s, for a total of 160 H100s in the partition. CPU memory in this partition is allocated proportionally to your core request, such that each core requested is given about 9 GB of memory. Submission to this partition can be accomplished by using the option: "-p ai" or "--partition=ai".

The purchasing model for this partition allows groups to purchase access to GPUs. When a group purchases access to a GPU for a year, they are given 365*24 hours of GPU time on the partition. The advantage of tracking resources this way is that if a group needs more than one GPU for a project, they can use multiple GPUs at once; instead of having to purchase additional GPUs, they simply consume their GPU hours faster. When an account uses resources in this partition, the balance of GPU hours in the account is charged at a rate of at least one GPU hour for each GPU the job uses per hour (tracked by the minute); however, some QOSes described below charge usage at a lower rate.
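
For example, a job that runs on 4 GPUs for 6 hours under the "normal" QOS consumes 4 * 6 = 24 GPU hours from the account; the same job under the "preemptible" QOS described below, charged at a quarter rate, would consume only 6 GPU hours.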

When using the AI partition, jobs are tagged by the "normal" QOS by default, but they can be tagged with the "preemptible" QOS if explicitly submitted using the "-q preemptible" or "--qos=preemptible" option.

  1. Jobs tagged with the "normal" QOS are subject to the following policies:
    1. Jobs have a high priority and should not need to wait very long before starting.
    2. The account the job is submitted to will be charged 1 GPU hour for each GPU this job uses per hour (this is tracked by the minute).
    3. These jobs can run for up to two weeks at a time. 
  2. Jobs tagged with the "preemptible" QOS are subject to the following policies:
    1. Jobs have a low priority and there is no expectation of job start time. If the partition is very busy with jobs using the "normal" QOS or if you are requesting a very large job, then jobs using the "preemptible" QOS may take hours or days to start.
    2. The account the job is submitted to will be charged 0.25 GPU hours for each GPU this job uses per hour (tracked by the minute), which effectively allows jobs to use 4x as many resources as a job tagged with the "normal" QOS for the same cost.
    3. These jobs can run for up to two weeks at a time.
    4. If there are not enough idle resources in this partition for a job tagged with the "normal" QOS to start, then this job may be cancelled to make room for that job. This means it is imperative to use checkpointing if using this QOS. 

Available QOSes: "normal", "preemptible"

Link to section 'Highmem Partition' of 'Queues' Highmem Partition

This partition is made up of 6 Gautschi-B nodes, which have four times as much memory as a standard Gautschi-A node; access to this partition is given to all accounts on the cluster to enable work with higher memory requirements. Each of these nodes contains two Zen 4 AMD EPYC 9654 96-core processors for a total of 192 cores and 1.5 TB of memory. Memory in this partition is allocated proportionally to your core request, such that each core requested is given about 8 GB of memory. Submission to this partition can be accomplished by using the option: "-p highmem" or "--partition=highmem".

When using the highmem partition, jobs are tagged with the "normal" QOS by default, and this is the only QOS available for this partition, so there is no need to specify a QOS. Additionally, jobs are tagged with a highmem partition QOS that enforces the following policies:

  1. There is no expectation of job start time, as these nodes are a shared resource provided as a bonus for purchasing high-priority access to resources on Gautschi.
  2. You can have 2 jobs running in this partition at once.
  3. You can have 8 jobs submitted to this partition at once.
  4. Your jobs must use more than 48 of the 192 cores on the node; otherwise, your memory footprint would fit on a standard Gautschi-A node.
  5. These jobs can run for up to 24 hours at a time.

Available QOSes: "normal"

Link to section 'Profiling Partition' of 'Queues' Profiling Partition

This partition is made up of 2 Gautschi-A nodes that have hardware performance counters enabled. With hardware performance counters enabled, profiling applications such as AMD uProf can track performance criteria for execution on the CPU, such as L3 cache events, speculative execution misses, etc. Because this allows greater visibility into the execution of each process, the nodes in this partition can only be used if your job uses the entire node. Submission to this partition can be accomplished by using the option: "-p profiling" or "--partition=profiling".

When using the profiling partition, jobs are tagged with the "normal" QOS by default, and this is the only QOS available for this partition, so there is no need to specify a QOS. Additionally, jobs are tagged with a profiling partition QOS that enforces the following policies:

  1. There is no expectation of job start time, as these nodes are a shared resource provided as a bonus for purchasing high-priority access to resources on Gautschi
  2. You can have 1 job running in this partition at once
  3. You must request the entire node when submitting to this partition, even if you run only a single process on it
  4. These jobs can run for up to 24 hours at a time. 

This is a resource reserved for groups profiling their applications, and we monitor usage to ensure that this partition is not being used simply as additional compute.

Available QOSes: "normal"

Link to section 'Smallgpu Partition' of 'Queues' Smallgpu Partition

This partition is made up of 6 Gautschi-G nodes. Each of these nodes contains two NVIDIA L40S GPUs (48 GB) and two Zen 4 AMD EPYC 9554 64-core processors, for a total of 128 cores and 384 GB of memory. Memory in this partition is allocated in proportion to your core request, at about 3 GB per core. You should request cores proportional to the number of GPUs you are using in this partition (i.e., if you only need one of the two GPUs, you should request half of the cores on the node). Submission to this partition can be accomplished by using the option: "-p smallgpu" or "--partition=smallgpu".

When using the smallgpu partition, jobs are tagged with the "normal" QOS by default, and this is the only QOS available for this partition, so there is no need to specify a QOS. Additionally, jobs are tagged with a smallgpu partition QOS that enforces the following policies:

  1. There is no expectation of job start time, as these nodes are a shared resource provided as a bonus for purchasing high-priority access to resources on Gautschi
  2. You can use up to 2 GPUs in this partition at once
  3. You can have 8 jobs submitted to this partition at once
  4. These jobs can run for up to 24 hours at a time. 

Available QOSes: "normal"

Example Jobs

A number of example jobs are available for you to look over and adapt to your own needs. The first few are generic examples, and latter ones go into specifics for particular software packages.

Generic SLURM Jobs

The following examples demonstrate the basics of SLURM jobs, and are designed to cover common job request scenarios. These example jobs will need to be modified to run your application or code.

Simple Job

Every SLURM job consists of a job submission file. A job submission file contains a list of commands that run your program and a set of resource (nodes, walltime, queue) requests. The resource requests can appear in the job submission file or can be specified at submit-time as shown below.

This simple example submits the job submission file hello.sub to a queue on Gautschi and requests a single node:

#!/bin/bash
# FILENAME: hello.sub

# Show this ran on a compute node by running the hostname command.
hostname

echo "Hello World"

sbatch -A myallocation -p queue-name --nodes=1 --ntasks=1 --cpus-per-task=1 --time=00:01:00 hello.sub 
Submitted batch job 3521

For a real job you would replace echo "Hello World" with a command, or sequence of commands, that run your program.

After your job finishes running, the ls command will show a new file in your directory, the .out file:

ls -l
hello.sub
slurm-3521.out

The file slurm-3521.out contains the output and errors your program would have written to the screen if you had typed its commands at a command prompt:

cat slurm-3521.out 


gautschi-a001.rcac.purdue.edu 
Hello World

You should see the hostname of the compute node your job was executed on, followed by the "Hello World" output.

Multiple Node

In some cases, you may want to request multiple nodes. To utilize multiple nodes, you will need a program or code that is specifically written to use multiple nodes, such as with MPI. Simply requesting more nodes will not make your work go faster; your code must support this ability.

This example shows a request for multiple compute nodes. The job submission file contains a single command to show the names of the compute nodes allocated:

#!/bin/bash
# FILENAME:  myjobsubmissionfile.sub
echo "$SLURM_JOB_NODELIST"

sbatch --nodes=2 --ntasks=384 --time=00:10:00 -A myallocation -p queue-name myjobsubmissionfile.sub

Compute nodes allocated:

gautschi-a[014-015]

The above example will allocate a total of 384 CPU cores across 2 nodes. Note that if your multi-node job requests fewer than a node's full 192 cores, by default Slurm provides no guarantee about how this total is distributed among the assigned nodes (i.e., the cores may not be split evenly). If you need a specific arrangement of your tasks and cores, you can use the --cpus-per-task= and/or --ntasks-per-node= flags, as in the example below. See the Slurm documentation or man sbatch for more options.
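
For example, to guarantee an even 192-per-node split of the 384 tasks above, you could specify the per-node task count explicitly (the allocation and queue names are placeholders as before):

sbatch --nodes=2 --ntasks-per-node=192 --time=00:10:00 -A myallocation -p queue-name myjobsubmissionfile.sub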

Directives

So far these examples have shown submitting jobs with the resource requests on the sbatch command line such as:

sbatch -A myallocation -p queue-name --nodes=1 --time=00:01:00 hello.sub

The resource requests can also be put into the job submission file itself. Documenting the resource requests in the job submission file is desirable because the job can be easily reproduced later; details left in your command history are quickly lost. Arguments are specified with the #SBATCH syntax:

#!/bin/bash

# FILENAME: hello.sub

#SBATCH -A myallocation -p queue-name 

#SBATCH --nodes=1 --time=00:01:00 

# Show this ran on a compute node by running the hostname command.
hostname

echo "Hello World"

The #SBATCH directives must appear at the top of your submission file. SLURM will stop parsing directives as soon as it encounters a line that does not start with '#'. If you insert a directive in the middle of your script, it will be ignored.

This job can be then submitted with:

sbatch hello.sub

Interactive Jobs

Interactive jobs run on compute nodes while giving you a shell to interact with. They give you the ability to type commands or use a graphical interface in the same way as if you were on a front-end login host.

To submit an interactive job, use sinteractive to run a login shell on allocated resources.

sinteractive accepts most of the same resource requests as sbatch, so to request a login shell on the cpu partition while allocating 2 nodes and 384 total cores, you might do:

sinteractive -A myallocation -p cpu -N2 -n384

To quit your interactive job:

exit or Ctrl-D

The above example will allocate a total of 384 CPU cores across 2 nodes. Note that if your multi-node job requests fewer than a node's full 192 cores, by default Slurm provides no guarantee about how this total is distributed among the assigned nodes (i.e., the cores may not be split evenly). If you need a specific arrangement of your tasks and cores, you can use the --cpus-per-task= and/or --ntasks-per-node= flags. See the Slurm documentation or man salloc for more options.

Serial Jobs

This shows how to submit one of the serial programs compiled in the section Compiling Serial Programs.

Create a job submission file:

#!/bin/bash
# FILENAME:  serial_hello.sub

./serial_hello

Submit the job:

sbatch --nodes=1 --ntasks=1 --time=00:01:00 serial_hello.sub

After the job completes, view results in the output file:

cat slurm-myjobid.out

Runhost:gautschi-a009.rcac.purdue.edu
hello, world 

If the job failed to run, then view error messages in the file slurm-myjobid.out.

Link to section 'Collecting System Resource Utilization Data' of 'Monitoring Resources' Collecting System Resource Utilization Data

Knowing the precise resource utilization an application had during a job, such as CPU load or memory, can be incredibly useful. This is especially the case when the application isn't performing as expected.

One approach is to run a program like htop during an interactive job and keep an eye on system resources. You can also get precise time-series data from the nodes associated with your job online, using XDMoD. But these methods don't gather telemetry in an automated fashion, nor do they give you control over the resolution or format of the data.

As a matter of course, a robust implementation of an HPC workload would include resource utilization data as a diagnostic tool in the event of a failure.

The monitor utility is a simple command line system resource monitoring tool for gathering such telemetry and is available as a module.

module load monitor 

Complete documentation is available online at resource-monitor.readthedocs.io. A full manual page is also available for reference, man monitor.

In the context of a SLURM job you will need to put this monitoring task in the background to allow the rest of your job script to proceed. Be sure to interrupt these tasks at the end of your job.

#!/bin/bash
# FILENAME: monitored_job.sh

module load monitor

# track per-core CPU load
monitor cpu percent --all-cores >cpu-percent.log &
CPU_PID=$!

# track memory usage
monitor cpu memory >cpu-memory.log &
MEM_PID=$!

# your code here

# shut down the resource monitors
kill -s INT $CPU_PID $MEM_PID

A particularly elegant solution would be to include such tools in your prologue script and have the teardown in your epilogue script.
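
One way to sketch this in plain bash, without separate prologue and epilogue scripts, is an EXIT trap that interrupts the monitors no matter how the job script ends (a minimal sketch following the example above):

#!/bin/bash
# FILENAME: monitored_job_trap.sh  (hypothetical variant of the script above)

module load monitor

# start the resource monitors in the background
monitor cpu percent --all-cores >cpu-percent.log &
CPU_PID=$!
monitor cpu memory >cpu-memory.log &
MEM_PID=$!

# the trap fires on any exit path, so the monitors are always cleaned up
trap 'kill -s INT $CPU_PID $MEM_PID' EXIT

# your code here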

For large distributed jobs spread across multiple nodes, mpiexec can be used to gather telemetry from all nodes in the job. The hostname is included in each line of output so that data can be grouped as such. A concise way of constructing the needed list of hostnames in SLURM is to simply use srun hostname | sort -u.

#!/bin/bash
# FILENAME: monitored_job.sh

module load monitor

# track all CPUs (one monitor per host)
mpiexec -machinefile <(srun hostname | sort -u) \
    monitor cpu percent --all-cores >cpu-percent.log &
CPU_PID=$!

# track memory on all hosts (one monitor per host)
mpiexec -machinefile <(srun hostname | sort -u) \
    monitor cpu memory >cpu-memory.log &
MEM_PID=$!

# your code here

# shut down the resource monitors
kill -s INT $CPU_PID $MEM_PID

To get resource data in a more readily computable format, the monitor program can be told to output in CSV format with the --csv flag.

monitor cpu memory --csv >cpu-memory.csv

For a distributed job, you will need to suppress the header lines; otherwise one will be created by each host.

monitor cpu memory --csv | head -1 >cpu-memory.csv
mpiexec -machinefile <(srun hostname | sort -u) \
    monitor cpu memory --csv --no-header >>cpu-memory.csv

Specific Applications

The following examples demonstrate job submission files for some common real-world applications. See the Generic SLURM Examples section for more examples on job submissions that can be adapted for use.

R

R, a GNU project, is a language and environment for data manipulation, statistics, and graphics. It is an open source version of the S programming language. R is quickly becoming the language of choice for data science due to the ease with which it can produce high quality plots and data visualizations. It is a versatile platform with a large, growing community and collection of packages.

For more general information on R visit The R Project for Statistical Computing.

Loading Data into R

R is an environment for manipulating data. In order to manipulate data, it must be brought into the R environment. R has functions for reading most of the file formats data is commonly stored in. Some of the most common file types, like comma-separated values (CSV) files, are handled by functions in the base R packages. Other, less common file types require additional packages. To read data from a CSV file into the R environment, enter the following command in the R prompt:

> read.csv(file = "path/to/data.csv", header = TRUE)

When R reads the file it creates an object that can then become the target of other functions. If you do not assign the result to a variable, read.csv() simply prints the data and discards it. To assign a name to the object created by read.csv, enter the following in the R prompt:

> my_variable <- read.csv(file = "path/to/data.csv", header = FALSE)

To display the properties (structure) of loaded data, enter the following:

> str(my_variable)


Installing R packages

Link to section 'Challenges of Managing R Packages in the Cluster Environment' of 'Installing R packages' Challenges of Managing R Packages in the Cluster Environment

  • Different clusters have different hardware and software. So, if you have access to multiple clusters, you must install your R packages separately for each cluster.
  • Each cluster has multiple versions of R, and packages installed with one version of R may not work with another version. So, libraries for each R version must be installed in a separate directory.
  • You can define the directory where your R packages will be installed using the environment variable R_LIBS_USER, as shown in the sketch after this list.
  • For your convenience, a sample ~/.Rprofile example file is provided that can be downloaded to your cluster account and renamed to ~/.Rprofile (or appended to an existing one) to customize your installation preferences.
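
For illustration, setting the library directory by hand might look like the following (a minimal sketch; the exact directory layout is an assumption, and the provided ~/.Rprofile sample is the recommended way to configure this):

# hypothetical per-cluster, per-R-version library path
export R_LIBS_USER=$HOME/R/gautschi/4.4.1
mkdir -p "$R_LIBS_USER"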

Link to section 'Installing Packages' of 'Installing R packages' Installing Packages

  • Step 0: Set up installation preferences.
    Follow the steps for setting up your ~/.Rprofile preferences. This step needs to be done only once. If you have created a ~/.Rprofile file previously on Gautschi, ignore this step.

  • Step 1: Check if the package is already installed.
    As part of the R installations on community clusters, a lot of R libraries are pre-installed. You can check if your package is already installed by opening an R terminal and entering the command installed.packages(). For example,

    module load r/4.4.1
    R
    installed.packages()["units",c("Package","Version")]
    Package Version 
    "units" "0.8-1"
    quit()

    If the package you are trying to use is already installed, simply load the library, e.g., library('units'). Otherwise, move to the next step to install the package.

  • Step 2: Load required dependencies. (if needed)
    For simple packages you may not need this step. However, some R packages depend on other libraries. For example, the sf package depends on gdal and geos libraries. So, you will need to load the corresponding modules before installing sf. Read the documentation for the package to identify which modules should be loaded.

    module load gdal
    module load geos
  • Step 3: Install the package.
    Now install the desired package using the command install.packages('package_name'). R will automatically download the package and all its dependencies from CRAN and install each one. Your terminal will show the build progress and eventually show whether the package was installed successfully or not.

    R
    install.packages('sf', repos="https://cran.case.edu/")
    Installing package into ‘/home/myusername/R/x86_64-pc-linux-gnu-library/4.4.1’
    (as ‘lib’ is unspecified)
    trying URL 'https://cran.case.edu/src/contrib/sf_0.9-7.tar.gz'
    Content type 'application/x-gzip' length 4203095 bytes (4.0 MB)
    ==================================================
    downloaded 4.0 MB
    ...
    ...
    more progress messages
    ...
    ...
    ** testing if installed package can be loaded from final location
    ** testing if installed package keeps a record of temporary installation path
    * DONE (sf)
    
    The downloaded source packages are in
        ‘/tmp/RtmpSVAGio/downloaded_packages’
  • Step 4: Troubleshooting. (if needed)
If Step 3 ended with an error, you need to investigate why the build failed. The most common reason for a build failure is not loading the necessary modules.

Link to section 'Loading Libraries' of 'Installing R packages' Loading Libraries

Once you have packages installed you can load them with the library() function as shown below:

library('packagename')

The package is now installed and loaded and ready to be used in R.

Link to section 'Example: Installing dplyr' of 'Installing R packages' Example: Installing dplyr

The following demonstrates installing the dplyr package assuming the above-mentioned custom ~/.Rprofile is in place (note its effect in the "Installing package into" information message):

module load r
R
install.packages('dplyr', repos="http://ftp.ussg.iu.edu/CRAN/")
Installing package into ‘/home/myusername/R/gautschi/4.4.1’
(as ‘lib’ is unspecified)
 ...
also installing the dependencies 'crayon', 'utf8', 'bindr', 'cli', 'pillar', 'assertthat', 'bindrcpp', 'glue', 'pkgconfig', 'rlang', 'Rcpp', 'tibble', 'BH', 'plogr'
 ...
 ...
 ...
The downloaded source packages are in 
    '/tmp/RtmpHMzm9z/downloaded_packages'

library(dplyr)

Attaching package: 'dplyr'


RStudio

RStudio is a graphical integrated development environment (IDE) for R. RStudio is the most popular environment for developing both R scripts and packages. RStudio is provided on most Research systems.

There are two methods to launch RStudio on the cluster: command-line and application menu icon.

Link to section 'Launch RStudio by the command-line:' of 'RStudio' Launch RStudio by the command-line:

module load gcc
module load r
module load rstudio
rstudio

Note that RStudio is a graphical program and in order to run it you must have a local X11 server running or use the ThinLinc Remote Desktop environment. See the ssh X11 forwarding section for more details.

Link to section 'Launch Rstudio by the application menu icon:' of 'RStudio' Launch Rstudio by the application menu icon:

  • Log into desktop.gautschi.rcac.purdue.edu with web browser or ThinLinc client
  • Click on the Applications drop down menu on the top left corner
  • Choose Cluster Software and then RStudio

This shows where to find Rstudio under the 'Cluster Software' option in the list of Applications.

R and RStudio are free to download and run on your local machine.

Running R jobs

This section illustrates how to submit a small R job to a SLURM queue. The example job computes a Pythagorean triple.

Prepare an R input file with an appropriate filename, here named myjob.R:

# FILENAME:  myjob.R

# Compute a Pythagorean triple.
a = 3
b = 4
c = sqrt(a*a + b*b)
c     # display result

Prepare a job submission file with an appropriate filename, here named myjob.sub:

#!/bin/bash
# FILENAME:  myjob.sub

module load r

# --vanilla: run R without saving or restoring the workspace, history, or profiles
# (implies --no-save: do not save datasets at the end of the R session)
R --vanilla --no-save < myjob.R

Submit the job:

sbatch myjob.sub

View the job status:

squeue -u myusername

After the job completes, view results in the output file:

cat slurm-myjobid.out


Singularity

On Gautschi, Singularity functionality is provided by Apptainer - see Apptainer section for details.

Apptainer

Note: Apptainer was formerly known as Singularity and is now a part of the Linux Foundation. When migrating from Singularity see the user compatibility documentation.

Link to section 'What is Apptainer?' of 'Apptainer' What is Apptainer?

Apptainer is an open-source container platform designed to be simple, fast, and secure. It allows the portability and reproducibility of operating systems and application environments through the use of Linux containers. It gives users complete control over their environment.

Apptainer is like Docker but tuned explicitly for HPC clusters. More information is available on the project’s website.

Link to section 'Features' of 'Apptainer' Features

  • Run the latest applications on an Ubuntu or CentOS userland
  • Gain access to the latest developer tools
  • Launch MPI programs easily
  • Much more

Apptainer’s user guide is available at: apptainer.org/docs/user/main/introduction.html

Link to section 'Example' of 'Apptainer' Example

Here is an example using an Ubuntu 16.04 image on Gautschi:

apptainer exec /depot/itap/singularity/ubuntu1604.img cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04 LTS"

Here is another example using a Centos 7 image:

apptainer exec /depot/itap/singularity/centos7.img cat /etc/redhat-release
CentOS Linux release 7.3.1611 (Core) 

Link to section 'Purdue Cluster Specific Notes' of 'Apptainer' Purdue Cluster Specific Notes

All service providers will integrate Apptainer slightly differently depending on site. The largest customization will be which default files are inserted into your images so that routine services will work.

Services we configure for your images include DNS settings and account information. File systems we overlay into your images are your home directory, scratch, Data Depot, and application file systems.

Here is a list of paths:

  • /etc/resolv.conf
  • /etc/hosts
  • /home/$USER
  • /apps
  • /scratch
  • /depot

This means that within the container environment these paths will be present and the same as outside the container. The /apps, /scratch, and /depot directories will need to exist inside your container to work properly.
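
To confirm these overlays from inside a container, you can list the paths using the Ubuntu image from the example above:

apptainer exec /depot/itap/singularity/ubuntu1604.img ls /home/$USER /scratch /depot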

Link to section 'Creating Apptainer Images' of 'Apptainer' Creating Apptainer Images

You can build on your system or straight on the cluster (you do not need root privileges to build or run the container).

You can find information and documentation on how to install and use Apptainer on your system on the project website.

We have version 1.1.6 (or newer) on the cluster. Please note that installed versions may change throughout the cluster's lifetime, so when in doubt, please check the exact version with the --version command-line flag:

apptainer --version
apptainer version 1.3.3-1.el9

Everything you need on how to build a container is available from their user guide. Below are merely some quick tips for getting your own containers built for Gautschi.

You can use a Definition File to both build your container and share its specification with collaborators (for the sake of reproducibility). Here is a simplistic example of such a file:

# FILENAME: Buildfile

Bootstrap: docker
From: ubuntu:18.04

%post
    apt-get update && apt-get upgrade -y
    mkdir /apps /depot /scratch

To build the image itself:

apptainer build ubuntu-18.04.sif Buildfile

The challenge with this approach, however, is that the build must start from scratch if you decide to change something. In order to create a container image iteratively and interactively, you can use the --sandbox option.

apptainer build --sandbox ubuntu-18.04 docker://ubuntu:18.04

This will not create a flat image file but a directory tree (i.e., a folder), the contents of which are the container's filesystem. In order to get a shell inside the container that allows you to modify it, use the --writable option.

apptainer shell --writable ubuntu-18.04
Apptainer>

You can then proceed to install any libraries, software, etc. within the container. Then to create the final image file, exit the shell and call the build command once more on the sandbox.

apptainer build ubuntu-18.04.sif ubuntu-18.04

Finally, copy the new image to Gautschi and run it.

File Storage and Transfer

Learn more about file storage transfer for Gautschi.

Storage Options

File storage options on RCAC systems include long-term storage (home directories, depot, Fortress) and short-term storage (scratch directories, /tmp directory). Each option has different performance and intended uses, and some options vary from system to system as well. Daily snapshots of home directories are provided for a limited time for accidental deletion recovery. Scratch directories and temporary storage are not backed up and old files are regularly purged from scratch and /tmp directories. More details about each storage option appear below.

Home Directory

Home directories are provided for long-term file storage. Each user has one home directory. You should use your home directory for storing important program files, scripts, input data sets, critical results, and frequently used files. You should store infrequently used files on Fortress. Your home directory becomes your current working directory, by default, when you log in.

Daily snapshots of your home directory are provided for a limited period of time in the event of accidental deletion. For additional security, you should store another copy of your files on more permanent storage, such as the Fortress HPSS Archive.

Your home directory physically resides on a GPFS storage system in the data center. To find the path to your home directory, first log in then immediately enter the following:

$ pwd
/home/myusername

Or from any subdirectory:

$ echo $HOME
/home/myusername

Your home directory has a quota limiting the total size of files you may store within. For more information, refer to the Storage Quotas / Limits Section.

Link to section 'Lost File Recovery' of 'Home Directory' Lost File Recovery

Nightly snapshots for 7 days, weekly snapshots for 4 weeks, and monthly snapshots for 3 months are kept. This means you will find snapshots from the last 7 nights, the last 4 Sundays, and the first day of each of the last 3 months. Files are available going back between two and three months, depending on how long ago the start of the month was. Snapshots beyond this are not kept. For additional security, you should store another copy of your files on more permanent storage, such as the Fortress HPSS Archive.

Link to section 'Performance' of 'Home Directory' Performance

Your home directory is medium-performance, non-purged space suitable for tasks like sharing data, editing files, developing and building software, and many other uses.

Your home directory is not designed or intended for use as high-performance working space for running data-intensive jobs with heavy I/O demands.

Scratch Space

Scratch directories are provided for short-term file storage only. The quota of your scratch directory is much greater than the quota of your home directory. You should use your scratch directory for storing temporary input files which your job reads or for writing temporary output files which you may examine after execution of your job. You should use your home directory and Fortress for longer-term storage or for holding critical results. The hsi and htar commands provide easy-to-use interfaces into the archive and can be used to copy files into the archive interactively or even automatically at the end of your regular job submission scripts.

Files in scratch directories are not recoverable. Files in scratch directories are not backed up. If you accidentally delete a file, a disk crashes, or old files are purged, they cannot be restored.

Files in scratch directories that have not been accessed or had content modified in 60 days are purged. Owners of these files receive a notice via email one week before removal. Be sure to regularly check your Purdue email account or set up mail forwarding to an email account you do regularly check. For more information, please refer to our Scratch File Purging Policy.

All users may access scratch directories on Gautschi. To find the path to your scratch directory:

$ findscratch
/scratch/gautschi/myusername

The value of variable $RCAC_SCRATCH is your scratch directory path. Use this variable in any scripts. Your actual scratch directory path may change without warning, but this variable will remain current.

$ echo $RCAC_SCRATCH
/scratch/gautschi/myusername

Scratch directories are specific to each cluster; only the /scratch/gautschi directory is available on Gautschi front-end and compute nodes. No other scratch directories are available on Gautschi.

Your scratch directory has a quota capping the total size and number of files you may store in it. For more information, refer to the section Storage Quotas / Limits.

Link to section 'Performance' of 'Scratch Space' Performance

Your scratch directory is located on a high-performance, large-capacity parallel filesystem engineered to provide work-area storage optimized for a wide variety of job types. It is designed to perform well with data-intensive computations, while scaling well to large numbers of simultaneous connections.

/tmp Directory

/tmp directories are provided for short-term file storage only. Each front-end and compute node has a /tmp directory. Your program may write temporary data to the /tmp directory of the compute node on which it is running. That data is available for as long as your program is active. Once your program terminates, that temporary data is no longer available. When used properly, /tmp may provide faster local storage to an active process than any other storage option. You should use your home directory and Fortress for longer-term storage or for holding critical results.

Backups are not performed for the /tmp directory, and the system removes files from /tmp whenever space is low or whenever the system needs a reboot. In the event of a disk crash or file purge, files in /tmp are not recoverable. You should copy any important files to more permanent storage.

Link to section 'Long-Term Storage' of 'Long-Term Storage' Long-Term Storage

Long-term Storage or Permanent Storage is available to users on the High Performance Storage System (HPSS), an archival storage system called Fortress. Program files, data files, and any other files which are not used often, but which must be saved, can be put in permanent storage. Fortress currently has over 10 PB of capacity.


File Transfer

Gautschi supports several methods for file transfer. Use the links below to learn more about these methods.

SCP

SCP (Secure CoPy) is a simple way of transferring files between two machines that use the SSH protocol. SCP is available as a protocol choice in some graphical file transfer programs and also as a command line program on most Linux, Unix, and Mac OS X systems. SCP can copy single files, but will also recursively copy directory contents if given a directory name.

After Aug 17, 2020, the community clusters will not support password-based authentication for login. Methods that can be used include two-factor authentication (Purdue Login) or SSH keys. If you do not have SSH keys installed, you would need to type your Purdue Login response into SCP's "Password" prompt.

Link to section 'Command-line usage:' of 'SCP' Command-line usage:

You can transfer files both to and from Gautschi while initiating an SCP session on either some other computer or on Gautschi (in other words, directionality of connection and directionality of data flow are independent from each other). The scp command appears somewhat similar to the familiar cp command, with an extra user@host:file syntax to denote files and directories on a remote host. Either Gautschi or another computer can be a remote.

  • Example: Initiating SCP session on some other computer (i.e. you are on some other computer, connecting to Gautschi):

          (transfer TO Gautschi)
          (Individual files) 
    $ scp  sourcefile  myusername@gautschi.rcac.purdue.edu:somedir/destinationfile
    $ scp  sourcefile  myusername@gautschi.rcac.purdue.edu:somedir/
          (Recursive directory copy)
    $ scp -pr sourcedirectory/  myusername@gautschi.rcac.purdue.edu:somedir/
    
          (transfer FROM Gautschi)
          (Individual files)
    $ scp  myusername@gautschi.rcac.purdue.edu:somedir/sourcefile  destinationfile
    $ scp  myusername@gautschi.rcac.purdue.edu:somedir/sourcefile  somedir/
          (Recursive directory copy)
    $ scp -pr myusername@gautschi.rcac.purdue.edu:sourcedirectory  somedir/
    

    The -p flag is optional. When used, it will cause the transfer to preserve file attributes and permissions. The -r flag is required for recursive transfers of entire directories.

  • Example: Initiating SCP session on Gautschi (i.e. you are on Gautschi, connecting to some other computer):

          (transfer TO Gautschi)
          (Individual files) 
    $ scp  myusername@another.computer.example.com:sourcefile  somedir/destinationfile
    $ scp  myusername@another.computer.example.com:sourcefile  somedir/
          (Recursive directory copy)
    $ scp -pr myusername@another.computer.example.com:sourcedirectory/  somedir/
    
          (transfer FROM Gautschi)
          (Individual files)
    $ scp  somedir/sourcefile  myusername@another.computer.example.com:destinationfile
    $ scp  somedir/sourcefile  myusername@another.computer.example.com:somedir/
          (Recursive directory copy)
    $ scp -pr sourcedirectory  myusername@another.computer.example.com:somedir/
    

    The -p flag is optional. When used, it will cause the transfer to preserve file attributes and permissions. The -r flag is required for recursive transfers of entire directories.

Link to section 'Software (SCP clients)' of 'SCP' Software (SCP clients)

Linux and other Unix-like systems:

  • The scp command-line program should already be installed.

Microsoft Windows:

  • MobaXterm
    Free, full-featured, graphical Windows SSH, SCP, and SFTP client.
  • Command-line scp program can be installed as part of Windows Subsystem for Linux (WSL), or Git-Bash.

Mac OS X:

  • The scp command-line program should already be installed. You may start a local terminal window from "Applications->Utilities".
  • Cyberduck is a full-featured and free graphical SFTP and SCP client.

FTP / SFTP

FTP is not supported on any research systems because it does not allow for secure transmission of data. Use SFTP instead, as described below.

SFTP (Secure File Transfer Protocol) is a reliable way of transferring files between two machines. SFTP is available as a protocol choice in some graphical file transfer programs and also as a command-line program on most Linux, Unix, and Mac OS X systems. SFTP has more features than SCP and allows for other operations on remote files, remote directory listing, and resuming interrupted transfers. Command-line SFTP cannot recursively copy directory contents; to do so, try using SCP or a graphical SFTP client.

After Aug 17, 2020, the community clusters will not support password-based authentication for login. Methods that can be used include two-factor authentication (Purdue Login) or SSH keys. If you do not have SSH keys installed, you would need to type your Purdue Login response into the SFTP's "Password" prompt.

Link to section 'Command-line usage' of 'FTP / SFTP' Command-line usage

You can transfer files both to and from Gautschi while initiating an SFTP session on either some other computer or on Gautschi (in other words, directionality of connection and directionality of data flow are independent from each other). Once the connection is established, you use put or get subcommands between "local" and "remote" computers. Either Gautschi or another computer can be a remote.

  • Example: Initiating SFTP session on some other computer (i.e. you are on another computer, connecting to Gautschi):

    $ sftp myusername@gautschi.rcac.purdue.edu
    
          (transfer TO Gautschi)
    sftp> put sourcefile somedir/destinationfile
    sftp> put -P sourcefile somedir/
    
          (transfer FROM Gautschi)
    sftp> get sourcefile somedir/destinationfile
    sftp> get -P sourcefile somedir/
    
    sftp> exit
    

    The -P flag is optional. When used, it will cause the transfer to preserve file attributes and permissions.

  • Example: Initiating SFTP session on Gautschi (i.e. you are on Gautschi, connecting to some other computer):

    $ sftp myusername@another.computer.example.com
    
          (transfer TO Gautschi)
    sftp> get sourcefile somedir/destinationfile
    sftp> get -P sourcefile somedir/
    
          (transfer FROM Gautschi)
    sftp> put sourcefile somedir/destinationfile
    sftp> put -P sourcefile somedir/
    
    sftp> exit
    

    The -P flag is optional. When used, it will cause the transfer to preserve file attributes and permissions.

Link to section 'Software (SFTP clients)' of 'FTP / SFTP' Software (SFTP clients)

Linux and other Unix-like systems:

  • The sftp command-line program should already be installed.

Microsoft Windows:

  • MobaXterm
    Free, full-featured, graphical Windows SSH, SCP, and SFTP client.
  • Command-line sftp program can be installed as part of Windows Subsystem for Linux (WSL), or Git-Bash.

Mac OS X:

  • The sftp command-line program should already be installed. You may start a local terminal window from "Applications->Utilities".
  • Cyberduck is a full-featured and free graphical SFTP and SCP client.

Globus

Link to section 'Globus' of 'Globus' Globus

Globus, previously known as Globus Online, is a powerful and easy to use file transfer service for transferring files virtually anywhere. It works within RCAC's various research storage systems; it connects between RCAC and remote research sites running Globus; and it connects research systems to personal systems. You may use Globus to connect to your home, scratch, and Fortress storage directories. Since Globus is web-based, it works on any operating system that is connected to the internet. The Globus Personal client is available on Windows, Linux, and Mac OS X. It is primarily used as a graphical means of transfer but it can also be used over the command line.

Link to section 'Globus Web:' of 'Globus' Globus Web:

  • Navigate to http://transfer.rcac.purdue.edu
  • Click "Proceed" to log in with your Purdue Career Account.
  • On your first login it will ask to make a connection to a Globus account. Accept the conditions.
  • Now you are at the main screen. Click "File Transfer", which will bring you to a two-panel interface (if you only see one panel, you can use the selector in the top-right corner to switch the view).
  • You will need to select one collection and file path on one side as the source, and the second collection on the other as the destination. This can be one of several Purdue endpoints, or another University, or even your personal computer (see Personal Client section below).

The RCAC collections are as follows. A search for "Purdue" will give you several suggested results you can choose from, or you can give a more specific search.

  • Home directory and Scratch storage: "Gautschi Cluster Collection", however, you can start typing "gautschi" and it will suggest appropriate matches.
  • Research Data Depot: "Purdue Research Computing - Data Depot", a search for "Depot" should provide appropriate matches to choose from.
  • Fortress: "Purdue Fortress HPSS Archive", a search for "Fortress" should provide appropriate matches to choose from.

From here, select a file or folder in either side of the two-pane window, and then use the arrows in the top-middle of the interface to instruct Globus to move files from one side to the other. You can transfer files in either direction. You will receive an email once the transfer is completed.

Link to section 'Globus Personal Client setup:' of 'Globus' Globus Personal Client setup:

Globus Connect Personal is a small software tool you can install to make your own computer a Globus endpoint on its own. It is useful if you need to transfer files via Globus to and from your computer directly.

  • On the "Collections" page from earlier, click "Get Globus Connect Personal" or download a version for your operating system it from here: Globus Connect Personal
  • Name this particular personal system and follow the setup prompts to create your Globus Connect Personal endpoint.
  • Your personal system is now available as a collection within the Globus transfer interface.

Link to section 'Globus Command Line:' of 'Globus' Globus Command Line:

Globus supports a command-line interface, allowing advanced automation of your transfers.

The recommended tool is the standalone Globus CLI application (the globus command):
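
A minimal sketch of a CLI-driven transfer (the endpoint UUIDs and paths below are placeholders; see the Globus CLI documentation for full details):

# authenticate the CLI with your Globus identity
globus login

# look up collection UUIDs by name
globus endpoint search "Gautschi Cluster Collection"

# transfer a file between two collections by UUID (placeholders)
globus transfer SRC_UUID:/path/to/sourcefile DEST_UUID:/path/to/destinationfile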

Link to section 'Sharing Data with Outside Collaborators' of 'Globus' Sharing Data with Outside Collaborators

Globus allows convenient sharing of data with outside collaborators. Data can be shared with collaborators' personal computers or directly with many other computing resources at other institutions. See the Globus documentation on how to share data.

For links to more information, please see Globus Support page and RCAC Globus presentation.

Windows Network Drive / SMB

SMB (Server Message Block), also known as CIFS, is an easy to use file transfer protocol that is useful for transferring files between RCAC systems and a desktop or laptop. You may use SMB to connect to your home, scratch, and Fortress storage directories. The SMB protocol is available on Windows, Linux, and Mac OS X. It is primarily used as a graphical means of transfer but it can also be used over the command line.

Note: to access Gautschi through SMB file sharing, you must be on a Purdue campus network or connected through VPN.

Link to section 'Windows:' of 'Windows Network Drive / SMB' Windows:

  • Windows 7: Click Windows menu > Computer, then click Map Network Drive in the top bar
  • Windows 8 & 10: Tap the Windows key, type computer, select This PC, click Computer > Map Network Drive in the top bar
  • Windows 11: Tap the Windows key, type File Explorer, select This PC, click Computer > Map Network Drive in the top bar
  • In the folder location enter the following information and click Finish:
    • To access your Gautschi home directory, enter \\home.gautschi.rcac.purdue.edu\gautschi-home.
    • To access your scratch space on Gautschi, enter \\scratch.gautschi.rcac.purdue.edu\gautschi-scratch. Once mapped, you will be able to navigate to your scratch directory.
  • Note: Use your career account login name and password when prompted. (You will not need to add ",push" nor use your Purdue Duo client.)
  • Your home or scratch directory should now be mounted as a drive in the Computer window.

Link to section 'Mac OS X:' of 'Windows Network Drive / SMB' Mac OS X:

  • In the Finder, click Go > Connect to Server
  • In the Server Address enter the following information and click Connect:
    • To access your Gautschi home directory, enter smb://home.gautschi.rcac.purdue.edu/gautschi-home.
    • To access your scratch space on Gautschi, enter smb://scratch.gautschi.rcac.purdue.edu/gautschi-scratch. Once mapped, you will be able to navigate to your scratch directory.
  • Note: Use your career account login name and password when prompted. (You will not need to add ",push" nor use your Purdue Duo client.)
  • Your home or scratch directory should now be mounted as a drive in the Computer window.

Link to section 'Linux:' of 'Windows Network Drive / SMB' Linux:

  • There are several graphical methods to connect in Linux depending on your desktop environment. Once you find out how to connect to a network server on your desktop environment, choose the Samba/SMB protocol and adapt the information from the Mac OS X section to connect.
  • If you would like access via samba on the command line you may install smbclient which will give you FTP-like access and can be used as shown below. For all the possible ways to connect look at the Mac OS X instructions.
    smbclient //home.gautschi.rcac.purdue.edu/gautschi-home -U myusername
    
    smbclient //scratch.gautschi.rcac.purdue.edu/gautschi-scratch -U myusername
  • Note: Use your career account login name and password when prompted. (You will not need to add ",push" nor use your Purdue Duo client.)

HSI

HSI, the Hierarchical Storage Interface, is the preferred method of transferring files to and from Fortress. HSI is designed to be a friendly interface for users of the High Performance Storage System (HPSS). It provides a familiar Unix-style environment for working within HPSS while automatically taking advantage of high-speed, parallel file transfers without requiring any special user knowledge.

HSI is provided on all research systems as the command hsi. HSI is also available for download for many operating systems.

Interactive usage:

$ hsi

*************************************************************************
*                    Purdue University
*                  High Performance Storage System (HPSS)
*************************************************************************
* This is the Purdue Data Archive, Fortress.  For further information
* see http://www.rcac.purdue.edu/storage/fortress/
*
*   If you are having problems with HPSS, please call IT/Operational
*   Services at 49-44000 or send E-mail to rcac-help@purdue.edu.
*
*************************************************************************
Username: myusername  UID: 12345  Acct: 12345(12345) Copies: 1 Firewall: off [hsi.3.5.8 Wed Sep 21 17:31:14 EDT 2011]

[Fortress HSI]/home/myusername->put data1.fits
put  'data1.fits' : '/home/myusername/data1.fits' ( 1024000000 bytes, 250138.1 KBS (cos=11))

[Fortress HSI]/home/myusername->lcd /tmp

[Fortress HSI]/home/myusername->get data1.fits
get  '/tmp/data1.fits' : '/home/myusername/data1.fits' (2011/10/04 16:28:50 1024000000 bytes, 325844.9 KBS )

[Fortress HSI]/home/myusername->quit

Batch transfer file:

put data1.fits
put data2.fits
put data3.fits
put data4.fits
put data5.fits
put data6.fits
put data7.fits
put data8.fits
put data9.fits

Batch usage:

$ hsi < my_batch_transfer_file
*************************************************************************
*                    Purdue University
*                  High Performance Storage System (HPSS)
*************************************************************************
* This is the Purdue Data Archive, Fortress.  For further information
* see http://www.rcac.purdue.edu/storage/fortress/
*
*   If you are having problems with HPSS, please call IT/Operational
*   Services at 49-44000 or send E-mail to rcac-help@purdue.edu.
*
*************************************************************************
Username: myusername  UID: 12345  Acct: 12345(12345) Copies: 1 Firewall: off [hsi.3.5.8 Wed Sep 21 17:31:14 EDT 2011]
put  'data1.fits' : '/home/myusername/data1.fits' ( 1024000000 bytes, 250200.7 KBS (cos=11))
put  'data2.fits' : '/home/myusername/data2.fits' ( 1024000000 bytes, 258893.4 KBS (cos=11))
put  'data3.fits' : '/home/myusername/data3.fits' ( 1024000000 bytes, 222819.7 KBS (cos=11))
put  'data4.fits' : '/home/myusername/data4.fits' ( 1024000000 bytes, 224311.9 KBS (cos=11))
put  'data5.fits' : '/home/myusername/data5.fits' ( 1024000000 bytes, 323707.3 KBS (cos=11))
put  'data6.fits' : '/home/myusername/data6.fits' ( 1024000000 bytes, 320322.9 KBS (cos=11))
put  'data7.fits' : '/home/myusername/data7.fits' ( 1024000000 bytes, 253192.6 KBS (cos=11))
put  'data8.fits' : '/home/myusername/data8.fits' ( 1024000000 bytes, 253056.2 KBS (cos=11))
put  'data9.fits' : '/home/myusername/data9.fits' ( 1024000000 bytes, 323218.9 KBS (cos=11))
EOF detected on TTY - ending HSI session


HTAR

HTAR (short for "HPSS TAR") is a utility program that writes TAR-compatible archive files directly onto Fortress, without having to first create a local file. Its command line was originally based on tar, with a number of extensions added to provide extra features.

HTAR is provided on all research systems as the command htar. HTAR is also available for download for many operating systems.

Link to section 'Usage:' of 'HTAR' Usage:

Create a tar archive on Fortress named data.tar including all files with the extension ".fits":

$ htar -cvf data.tar *.fits
HTAR: a   data1.fits
HTAR: a   data2.fits
HTAR: a   data3.fits
HTAR: a   data4.fits
HTAR: a   data5.fits
HTAR: a   /tmp/HTAR_CF_CHK_17953_1317760775
HTAR Create complete for data.tar. 5,120,006,144 bytes written for 5 member files, max threads: 3 Transfer time: 16.457 seconds (311.121 MB/s)
HTAR: HTAR SUCCESSFUL

Unpack a tar archive named data.tar, stored on Fortress, into a scratch directory for use in a batch job:

$ cd $RCAC_SCRATCH/job_dir
$ htar -xvf data.tar
HTAR: x data1.fits, 1024000000 bytes, 2000001 media blocks
HTAR: x data2.fits, 1024000000 bytes, 2000001 media blocks
HTAR: x data3.fits, 1024000000 bytes, 2000001 media blocks
HTAR: x data4.fits, 1024000000 bytes, 2000001 media blocks
HTAR: x data5.fits, 1024000000 bytes, 2000001 media blocks
HTAR: Extract complete for data.tar, 5 files. total bytes read: 5,120,004,608 in 18.841 seconds (271.749 MB/s )
HTAR: HTAR SUCCESSFUL

Look at the contents of the data.tar HTAR archive on Fortress:

$ htar -tvf data.tar
HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:30  data1.fits
HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:35  data2.fits
HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:35  data3.fits
HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:35  data4.fits
HTAR: -rw-r--r--  myusername/pucc 1024000000 2011-10-04 16:35  data5.fits
HTAR: -rw-------  myusername/pucc        256 2011-10-04 16:39  /tmp/HTAR_CF_CHK_17953_1317760775
HTAR: Listing complete for data.tar, 6 files 6 total objects
HTAR: HTAR SUCCESSFUL

Unpack a single file, "data5.fits", from the tar archive named data.tar on Fortress into a scratch directory:

$ htar -xvf data.tar data5.fits
HTAR: x data5.fits, 1024000000 bytes, 2000001 media blocks
HTAR: Extract complete for data.tar, 1 files. total bytes read: 1,024,000,512 in 3.642 seconds (281.166 MB/s )
HTAR: HTAR SUCCESSFUL

Link to section 'HTAR Archive Verification' of 'HTAR' HTAR Archive Verification

HTAR allows different types of content verification while creating archives. Users can ask HTAR to verify the contents of an archive during (or after) creation using the '-Hverify' switch. The syntax of this option is:

$ htar -Hverify=option[,option...] ... other arguments ... 
where option can be any of the following:
Option Explanation
info Compares tar header info with the corresponding values in the index.
crc Enables CRC checking of archive files for which a CRC was generated when the file is added to the archive.
compare Enables a byte-by-byte comparison of archive member files and their local file counterparts.
nocrc Disables CRC checking of archive files.
nocompare Disables a byte-by-byte comparison of archive member files and their local file counterparts.

Users can use a comma-separated list of options shown above, or a numeric value, or the wildcard all to specify the degree of verification. The numeric values for Hverify can be interpreted as follows:

0: Enables "info" verification.
1: Enables level 0 + "crc" verification.
2: Enables level 1 + "compare" verification.
all: Enables all comparison options.

An example to verify an archive during creation using checksums (crc):

htar -Hverify=1 -cvf abc.tar ./abc

An example to verify a previously created archive using checksums (crc):

htar -Hverify=1 -Kvf abc.tar

Please note that the time for verifying an archive increases as you increase the verification level. Carefully choose the option that suits your dataset best.

For details please see the HTAR Man Page.


HTAR has an individual file size limit of 64 GB. If any files you are trying to archive with HTAR are greater than 64 GB, then HTAR will immediately fail. This does not limit the number of files in the archive or the total overall size of the archive.

To get around this limitation, try using the htar_large command. It is slower than htar, but it works around the 64 GB file size limit. The usage of htar_large is almost the same as htar, except that htar_large will not generate the tar index file; thus, the -Hverify=1 option cannot be used, since it is based on the index file.
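
For example, to archive a directory containing files larger than 64 GB (a sketch; per the note above, htar_large accepts the same basic flags as htar):

$ htar_large -cvf big_data.tar ./big_dataset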

Storage Quota / Limits

Some limits are imposed on your disk usage on research systems. A quota is implemented on each filesystem. Each filesystem (home directory, scratch directory, etc.) may have a different limit. If you exceed the quota, you will not be able to save new files or new data to the filesystem until you delete or move data to long-term storage.

Link to section 'Checking Quota' of 'Storage Quota / Limits' Checking Quota

To check the current quotas of your home and scratch directories check the My Quota page or use the myquota command:

$ myquota
Type        Filesystem          Size    Limit  Use         Files    Limit  Use
==============================================================================
home        myusername         5.0GB   25.0GB  20%             -        -   -
scratch     gautschi        220.7GB  100.0TB  0.22%            8k   2,000k  0.43%

The columns are as follows:

  • Type: indicates home or scratch directory or your depot space.
  • Filesystem: name of storage option.
  • Size: sum of file sizes in bytes.
  • Limit: allowed maximum on sum of file sizes in bytes.
  • Use: percentage of file-size limit currently in use.
  • Files: number of files and directories (not the size).
  • Limit: allowed maximum on number of files and directories. It is possible, though unlikely, to reach this limit and not the file-size limit if you create a large number of very small files.
  • Use: percentage of file-number limit currently in use.

If you find that you reached your quota in either your home directory or your scratch file directory, obtain estimates of your disk usage. Find the top-level directories which have a high disk usage, then study the subdirectories to discover where the heaviest usage lies.

To see in a human-readable format an estimate of the disk usage of your top-level directories in your home directory:

$ du -h --max-depth=1 $HOME
32K     /home/myusername/mysubdirectory_1
529M    /home/myusername/mysubdirectory_2
608K    /home/myusername/mysubdirectory_3

The second directory is the largest of the three, so apply the du command to it.

To see in a human-readable format an estimate of the disk usage of your top-level directories in your scratch file directory:

$ du -h --max-depth=1 $RCAC_SCRATCH
160K    /scratch/gautschi/myusername

This strategy can be very helpful in figuring out the location of your largest usage. Move unneeded files and directories to long-term storage to free space in your home and scratch directories.

Link to section 'Increasing Quota' of 'Storage Quota / Limits' Increasing Quota

Link to section 'Home Directory' of 'Storage Quota / Limits' Home Directory

If you find you need additional disk space in your home directory, please consider archiving and compressing old files and moving them to long-term storage on the Fortress HPSS Archive, or purchasing Depot space for long-term storage. Unfortunately, it is not possible to increase your home directory quota beyond its current level.

Link to section 'Scratch Space' of 'Storage Quota / Limits' Scratch Space

If you find you need additional disk space in your scratch space, please first consider archiving and compressing old files and moving them to long-term storage on the Fortress HPSS Archive. If you are unable to do so, you may ask for a quota increase by contacting support.

Link to section 'Storage Environment Variables' of 'Storage Environment Variables' Storage Environment Variables

Several environment variables are automatically defined for you to help you manage your storage. Use environment variables instead of actual paths whenever possible to avoid problems if the specific paths to any of these change.

Some of the environment variables you should have are:
Name          Description
HOME          /home/myusername
PWD           path to your current directory
RCAC_SCRATCH  /scratch/gautschi/myusername

By convention, environment variable names are all uppercase. You may use them on the command line or in any scripts in place of and in combination with hard-coded values:

$ ls $HOME
...

$ ls $RCAC_SCRATCH/myproject
...

To find the value of any environment variable:

$ echo $RCAC_SCRATCH
/scratch/gautschi/myusername 

To list the values of all environment variables:

$ env
USER=myusername
HOME=/home/myusername
RCAC_SCRATCH=/scratch/gautschi/myusername 
...

You may create or overwrite an environment variable. To pass (export) the value of a variable in bash:

$ export MYPROJECT=$RCAC_SCRATCH/myproject

To assign a value to an environment variable in either tcsh or csh:

$ setenv MYPROJECT $RCAC_SCRATCH/myproject
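
For example, a minimal sketch of a job script fragment that relies on these variables instead of hard-coded paths (myproject and myapplication are illustrative names):

#!/bin/bash
# change to a project directory in scratch via the environment variable,
# rather than the literal /scratch/gautschi/myusername path
cd $RCAC_SCRATCH/myproject
./myapplication > output.log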

Link to section 'Archive and Compression' of 'Archive and Compression' Archive and Compression


There are several options for archiving and compressing groups of files or directories. The most commonly used options are:

Link to section 'tar' of 'Archive and Compression' tar

See the official documentation for tar for more information.

Saves many files together into a single archive file, and restores individual files from the archive. Includes automatic archive compression/decompression options and special features for incremental and full backups.

Examples:


  (list contents of archive somefile.tar)
$ tar tvf somefile.tar

  (extract contents of somefile.tar)
$ tar xvf somefile.tar

  (extract contents of gzipped archive somefile.tar.gz)
$ tar xzvf somefile.tar.gz

  (extract contents of bzip2 archive somefile.tar.bz2)
$ tar xjvf somefile.tar.bz2

  (archive all ".c" files in current directory into one archive file)
$ tar cvf somefile.tar *.c

  (archive and gzip-compress all files in a directory into one archive file)
$ tar czvf somefile.tar.gz somedirectory/

  (archive and bzip2-compress all files in a directory into one archive file)
$ tar cjvf somefile.tar.bz2 somedirectory/

Other arguments for tar can be explored by using the man tar command.

Link to section 'gzip' of 'Archive and Compression' gzip

See the official documentation for gzip for more information.

The standard compression system for all GNU software.

Examples:


  (compress file somefile - also removes uncompressed file)
$ gzip somefile

  (uncompress file somefile.gz - also removes compressed file)
$ gunzip somefile.gz

Link to section 'bzip2' of 'Archive and Compression' bzip2

See the official documentation for bzip2 for more information.

Strong, lossless data compressor based on the Burrows-Wheeler transform. Stronger compression than gzip.

Examples:


  (compress file somefile - also removes uncompressed file)
$ bzip2 somefile

  (uncompress file somefile.bz2 - also removes compressed file)
$ bunzip2 somefile.bz2

There are several other, less commonly used, options available as well:

  • zip
  • 7zip
  • xz
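
For example, zip bundles and compresses files in a single step:

  (archive and compress all files in a directory into one zip file)
$ zip -r somefile.zip somedirectory/

  (extract contents of somefile.zip)
$ unzip somefile.zip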

Link to section 'Sharing Files from Gautschi' of 'Sharing' Sharing Files from Gautschi

Gautschi supports several methods for file sharing. Use the links below to learn more about these methods.

Link to section 'Sharing Data with Globus' of 'Globus' Sharing Data with Globus

Data on any RCAC resource can be shared with other users within Purdue or with collaborators at other institutions. Globus allows convenient sharing of data with outside collaborators. Data can be shared with collaborators' personal computers or directly with many other computing resources at other institutions.

To share files, log in to https://transfer.rcac.purdue.edu, navigate to the endpoint (collection) of your choice, and follow the instructions in the Globus documentation on how to share data.

See also RCAC Globus presentation.

Link to section 'Sharing static content from your Data Depot space via the WWW' of 'WWW' Sharing static content from your Data Depot space via the WWW

Your research group can easily share static files (images, data, HTML) from your depot space via the WWW.

  • Contact support to set up a "www" folder in your Data Depot space.
  • Copy any files that you wish to share via the WWW into your Data Depot space's "www" folder.
  • For example, cp /path/to/image.jpg /depot/mylab/www/, where mylab is your research group name.
  • or upload to smb://datadepot.rcac.purdue.edu/depot/mylab/www, where mylab is your research group name.
  • Your file is now accessible via your web browser at the URL https://www.datadepot.rcac.purdue.edu/mylab/image.jpg

Note that it is not possible to run web sites, dynamic content, interpreters (PHP, Perl, Python), or CGI scripts from this service.

Frequently Asked Questions

Some common questions, errors, and problems are categorized below. Click the Expand Topics link in the upper right to see all entries at once. You can also use the search box above to search the user guide for any issues you are seeing.

About Gautschi

Frequently asked questions about Gautschi.

Can you remove me from the Gautschi mailing list?

Your subscription in the Gautschi mailing list is tied to your account on Gautschi. If you are no longer using your account on Gautschi, your account can be deleted from the My Accounts page. Hover over the resource you wish to remove yourself from and click the red 'X' button. Your account and mailing list subscription will be removed overnight. Be sure to make a copy of any data you wish to keep first.

Do I need to do anything to my firewall to access Gautschi?

No firewall changes are needed to access Gautschi. However, to access data through Network Drives (i.e., CIFS, "Z: Drive"), you must be on a Purdue campus network or connected through VPN.

Can I set up a shared space for my research group to share data?

Research groups are assigned a group data storage space within Fortress along with each Data Depot group space. Faculty should request a Data Depot trial to create a shared Fortress space for their research group.

RCAC resources are not intended to store data protected by Federal privacy and security laws (e.g., HIPAA, ITAR, classified, etc.). It is the responsibility of the faculty partner to ensure that no protected data is stored on the systems.

Please keep in mind that such spaces are, by design, accessible by others and should not be used to store private information such as grades, login credentials, or personal data. Contact us to create a group space for your group.

Does Gautschi have the same home directory as other clusters?

The Gautschi home directory and its contents are exclusive to the Gautschi cluster front-end hosts and compute nodes. This home directory is not available on any other RCAC machine. There is no automatic copying or synchronization between home directories.

At your discretion you can manually copy all or parts of your main research computing home to Gautschi using one of the suggested methods.

If you plan to use hsi or htar commands to access Fortress tape archive from Gautschi, please see also the keytab generation question for a temporary workaround to a potential caveat, while a permanent mitigation is being developed.

Logging In & Accounts

Frequently asked questions about logging in & accounts.

Errors

Common errors and solutions/work-arounds for them.

Account creation failed

An email came into rcac-help from the automated account checker reporting that an account creation failed. A few scenarios can cause this; here is what to check.

Link to section 'Account not created' of 'Account creation failed' Account not created

First, check which resource the user was added to and the corresponding role status on the User Search page.

Take the following steps for these scenarios:

Link to section 'No Role' of 'Account creation failed' No Role

This means either our website failed to add the role (rare, but there is a known bug where a faculty member requesting Radon/Hathi for themselves fails) or IAMO rejected the role.

You can try manually adding the role through the tool and see if it rejects it again, or ask IAMO about the status and if the role can be added (see below).

Link to section 'Role Pending' of 'Account creation failed' Role Pending

This means one of two things: IAMO's overnight process failed, or the account was added just past the cutoff for the overnight process but before the account check ran.

In the former scenario, something went wrong on IAMO's side. Usually Ben is on top of things and gets things sorted quickly when he gets in in the morning, but if it is afternoon and the role is still not there, ask IAMO about it.

For the latter scenario, there is a very narrow window when users can be added and trigger a false alarm (something like ~4-5am). It's rare, but it happens from time to time when we have a night-owl/early-bird faculty member (or one traveling abroad).

Link to section 'Role Ready' of 'Account creation failed' Role Ready

There are two scenarios here: IAMO's overnight process failed and has already been fixed, or the transd is broken on our end.

In the first scenario, there probably isn't anything to do. You can verify their account with ldapsearch -x uid=USERNAME | grep host and see if they have the proper host entry. If they do, they should be able to log in.

In the second scenario, the next step is to investigate the transd. The transd translates packets from IAMO into accounts on our systems. Log into xenon.rcac and look at /var/log/transd_log. Is there recent activity at the end of the log? If the end of the log is stale, something is probably stuck, such as a full disk. In that case, assign the ticket to systems and ask them to look at it. If there is recent activity, you should be able to grep the log for the username and look for account entries for them. If the transd is running, further investigation is probably needed.

Link to section 'Asking IAMO' of 'Account creation failed' Asking IAMO

The Footprints queue for IAMO is ITAP_IDENTITY_MANAGEMENT. Ben Lewis and Scott Morris are familiar with our web app and should be familiar with these "account failed" emails. If they come back and say the account is expired/graduated/etc., contact the faculty member separately with this information (see below). Otherwise, Ben should be able to push accounts through or clear the logjam.

Link to section 'Login Shell /opt/acmaint-3.10/etc/disable is invalid.' of 'Account creation failed' Login Shell /opt/acmaint-3.10/etc/disable is invalid.

This means the user account is no longer valid, i.e., they graduated. Remove the account from the Manage User page, and separately inform the faculty member who added them (don't use the FP ticket) that we were unable to create an account for the user. It is good to verify the student's graduation status with the PI (usually that will ring some bells with the faculty). They will need to have a Request for Privileges (R4P) filed, and then they can re-add the account once it is complete. If the faculty member thinks the student should be valid, ask IAMO about the status; the account may have been very recently added back, or had some other issue.

/usr/bin/xauth: error in locking authority file

Link to section 'Problem' of '/usr/bin/xauth: error in locking authority file' Problem

I receive this message when logging in:

/usr/bin/xauth: error in locking authority file

Link to section 'Solution' of '/usr/bin/xauth: error in locking authority file' Solution

Your home directory disk quota is full. You may check your quota with myquota.

You will need to free up space in your home directory.

The ncdu command is a convenient interactive tool to examine disk usage. Consider running ncdu $HOME to analyze where the bulk of the usage is. With this knowledge, you can then archive your data elsewhere (e.g. your research group's Data Depot space or the Fortress tape archive), or delete files you no longer need.

There are several common locations that tend to grow large over time and are merely cached downloads.  The following are safe to delete if you see them in the output of ncdu $HOME:


/home/myusername/.local/share/Trash
/home/myusername/.cache/pip
/home/myusername/.conda/pkgs
/home/myusername/.singularity/cache
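
For example, to clear the pip download cache (the same pattern applies to the other paths above):

$ rm -rf ~/.cache/pip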

My SSH connection hangs

Link to section 'Problem' of 'My SSH connection hangs' Problem

Your console hangs while trying to connect to an RCAC server.

Link to section 'Solution' of 'My SSH connection hangs' Solution

This can happen for various reasons. The most common reasons for a hanging SSH terminal are:

  • Network: If you are connected over WiFi, make sure that your Internet connection is stable.
  • Busy front-end server: When you connect to a cluster, you SSH to one of the front-end login nodes. Due to transient user loads, one or more of the front-ends may become unresponsive for a short while. To avoid this, try reconnecting to the cluster or wait until the login node you have connected to has reduced load.
  • File system issue: If a server has issues with one or more of the file systems (home, scratch, or depot) it may freeze your terminal. To avoid this you can connect to another front-end.

If neither of the suggestions above work, please contact support specifying the name of the server where your console is hung.

Thinlinc session frozen

Link to section 'Problem' of 'Thinlinc session frozen' Problem

Your Thinlinc session is frozen and you cannot launch any commands or close the session.

Link to section 'Solution' of 'Thinlinc session frozen' Solution

This can happen for various reasons. The most common is that you ran something memory-intensive inside the Thinlinc session on a front-end, so parts of the session were killed by cgroups and the entire session got stuck.

  • If you are using a web-version Thinlinc remote desktop (inside the browser):

    The web version does not have the capability to kill the existing session, only the standalone client does. Please install the standalone client and follow the steps below:

    ThinLinc

  • If you are using a Thinlinc client:

    Close the ThinLinc client, reopen the client login popup, and select End existing session.

    ThinLinc Login Popup
    Select "End existing session" and try "Connect" again.

Thinlinc session unreachable

Link to section 'Problem' of 'Thinlinc session unreachable' Problem

When trying to login to Thinlinc and re-connect to your existing session, you receive an error "Your Thinlinc session is currently unreachable".

Link to section 'Solution' of 'Thinlinc session unreachable' Solution

This can happen if the specific login node hosting your existing remote desktop session is currently offline or down, so Thinlinc cannot reconnect to it. Most often the session is not recoverable at this point, so the solution is to terminate your existing Thinlinc desktop session and start a new one.

  • If you are using a web-version Thinlinc remote desktop (inside the browser):

    The web version does not have the capability to kill the existing session, only the standalone client does. Please install the standalone client and follow the steps below:

    ThinLinc

  • If you are using a Thinlinc client:

    Close the ThinLinc client, reopen the client login popup, and select End existing session.

    ThinLinc Login Popup
    Select "End existing session" and try "Connect" again.

How to disable Thinlinc screensaver

Link to section 'Problem' of 'How to disable Thinlinc screensaver' Problem

Your ThinLinc desktop locks after being idle for a while and asks for a password to unlock. This means the "screensaver" and "lock screen" functions are turned on, and you want to disable them.

Link to section 'Solution' of 'How to disable Thinlinc screensaver' Solution

If your screen is locked, close the ThinLinc client, reopen the client login popup, and select End existing session.

ThinLinc Login Popup
Select "End existing session" and try "Connect" again.

To permanently avoid the screen lock issue, right-click the desktop and select Applications, then Settings, and select Screensaver.

ThinLinc Screensaver
Select "Applications", then "settings", and select "Screensaver".

Under Screensaver, turn off Enable Screensaver; then under Lock Screen, turn off Enable Lock Screen, and close the window.

ThinLinc Disable Screensaver
Under "Screensaver" tab, turn off the "Enable Screensaver" option.
ThinLinc Disable Lock Screen
Under "Lock Screen" tab, turn off the "Enable Lock Screen" option.

Questions

Frequently asked questions about logging in & accounts.

I worked on Gautschi after I graduated/left Purdue, but can not access it anymore

Link to section 'Problem' of 'I worked on Gautschi after I graduated/left Purdue, but can not access it anymore' Problem

You have graduated or left Purdue but continue collaboration with your Purdue colleagues. You find that your access to Purdue resources has suddenly stopped and your password is no longer accepted.

Link to section 'Solution' of 'I worked on Gautschi after I graduated/left Purdue, but can not access it anymore' Solution

Access to all resources depends on having a valid Purdue Career Account. Expired Career Accounts are removed twice a year, during Spring and October breaks (more details at the official page). If your Career Account was purged due to expiration, you will not be able to access the resources.

To provide remote collaborators with valid Purdue credentials, the University provides a special procedure called Request for Privileges (R4P). If you need to continue your collaboration with your Purdue PI, the PI will have to submit or renew an R4P request on your behalf.

After your R4P is completed and Career Account is restored, please note two additional necessary steps:

  • Access: Restored Career Accounts by default do not have any RCAC resources enabled for them. Your PI will have to log in to the Manage Users tool and explicitly re-enable your access by un-checking and then re-checking the checkboxes for the desired queue/Unix group resources.

  • Email: Restored Career Accounts by default do not have their @purdue.edu email service enabled. While this does not preclude you from using RCAC resources, any email messages (whether generated on the clusters or service announcements) would not be delivered, which may cause inconvenience or loss of compute jobs. To avoid this, we recommend setting your restored @purdue.edu email service to "Forward" (to an actual address you read). The easiest way to ensure this is to go through the Account Setup process.

Jobs

Frequently asked questions related to running jobs.

Errors

Common errors and potential solutions/workarounds for them.

cannot connect to X server / cannot open display

Link to section 'Problem' of 'cannot connect to X server / cannot open display' Problem

You receive the following message after entering a command to bring up a graphical window

cannot connect to X server
cannot open display

Link to section 'Solution' of 'cannot connect to X server / cannot open display' Solution

This can happen due to multiple reasons:

  1. Reason: Your SSH client software does not support graphical display by itself (e.g. SecureCRT or PuTTY).
  2. Reason: You did not enable X11 forwarding in your SSH connection.

    • Solution: If you are in a Windows environment, make sure that X11 forwarding is enabled in your connection settings (e.g. in MobaXterm or PuTTY). If you are in a Linux environment, try

      ssh -Y -l username hostname

  3. Reason: If you are trying to open a graphical window within an interactive PBS job, make sure you are using the -X option with qsub after following the previous step(s) for connecting to the front-end. Please see the example in the Interactive Jobs guide.
  4. Reason: If none of the above apply, make sure that you are within quota of your home directory.

bash: command not found

Link to section 'Problem' of 'bash: command not found' Problem

You receive the following message after typing a command

bash: command not found

Link to section 'Solution' of 'bash: command not found' Solution

This means the system cannot find your command. Typically, you need to load a module to make the command available, as shown below.
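
For example, a minimal sketch using the GCC compiler suite:

  (see which modules are available)
$ module avail

  (load a module, making its commands available in your shell)
$ module load gcc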

qdel: Server could not connect to MOM 12345.gautschi-adm.rcac.purdue.edu

Link to section 'Problem' of 'qdel: Server could not connect to MOM 12345.gautschi-adm.rcac.purdue.edu' Problem

You receive the following message after attempting to delete a job with the qdel command

qdel: Server could not connect to MOM 12345.gautschi-adm.rcac.purdue.edu

Link to section 'Solution' of 'qdel: Server could not connect to MOM 12345.gautschi-adm.rcac.purdue.edu' Solution

This error usually indicates that at least one node running your job has stopped responding or crashed. Please forward the job ID to support, and staff can help remove the job from the queue.

bash: module command not found

Link to section 'Problem' of 'bash: module command not found' Problem

You receive the following message after typing a command, e.g. module load intel

bash: module command not found

Link to section 'Solution' of 'bash: module command not found' Solution

The system cannot find the module command. You need to source the modules.sh file as shown below:

source /etc/profile.d/modules.sh

or run your job script under an interactive shell by making the following the first line (shebang) of the script:

#!/bin/bash -i

1234.gautschi-adm.rcac.purdue.edu.SC: line 12: 12345 Killed

Link to section 'Problem' of '1234.gautschi-adm.rcac.purdue.edu.SC: line 12: 12345 Killed' Problem

Your PBS job stopped running and you received an email with the following:

/var/spool/torque/mom_priv/jobs/1234.gautschi-adm.rcac.purdue.edu.SC: line 12: 12345 Killed <command name>

Link to section 'Solution' of '1234.gautschi-adm.rcac.purdue.edu.SC: line 12: 12345 Killed' Solution

This means that the node your job was running on ran out of memory to support your program or code. This may be due to your job or other jobs sharing your node(s) consuming more memory in total than is available on the node. Your program was killed by the operating system to keep the node functioning. There are two possible causes:

  • On Gautschi, jobs using fewer than 192 cores per node share the node(s) with other jobs by default. Either your job or one of the other jobs running on the node consumed too much memory. You should request all cores of the node or request exclusive access; requesting exclusive access will give you full control over all the memory on the node.
  • You requested exclusive access to the nodes but your job requires more memory than is available on the node. You should use more nodes if your job supports MPI or run a smaller dataset.

Close Firefox / Firefox is already running but not responding

Link to section 'Problem' of 'Close Firefox / Firefox is already running but not responding' Problem

You receive the following message after trying to launch Firefox browser inside your graphics desktop:

Close Firefox

Firefox is already running, but not responding.  To open a new window,
you  must first close the existing Firefox process, or restart your system.

Link to section 'Solution' of 'Close Firefox / Firefox is already running but not responding' Solution

When Firefox runs, it creates several lock files in the Firefox profile directory (inside ~/.mozilla/firefox/ folder in your home directory). If a newly-started Firefox instance detects the presence of these lock files, it complains.

This error can happen due to multiple reasons:

  1. Reason: You had a single Firefox process running, but it terminated abruptly without a chance to clean up its lock files (e.g. the job got terminated, the session ended, the node crashed or rebooted, etc.).
    • Solution: If you are certain you do not have any other Firefox processes running elsewhere, please use the following command in a terminal window to detect and remove the lock files:
      $ unlock-firefox
  2. Reason: You may indeed have another Firefox process (in another Thinlinc or Gateway session on this or another cluster, or on another front-end or compute node). With many clusters sharing a common home directory, a running Firefox instance on one can affect another.
    • Solution: Try finding and closing running Firefox process(es) on other nodes and clusters.
    • Solution: If you must have multiple Firefox instances running simultaneously, you may be able to create separate Firefox profiles and select which one to use for each instance (see the sketch after this list).
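
A minimal sketch of the profile approach (the profile name is illustrative):

  (open the Firefox profile manager to create a new profile)
$ firefox -ProfileManager

  (launch Firefox using a specific profile)
$ firefox -P mysecondprofile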

Jupyter: database is locked / can not load notebook format

Link to section 'Problem' of 'Jupyter: database is locked / can not load notebook format' Problem

You receive the following message after trying to load existing Jupyter notebooks inside your JupyterHub session:

Error loading notebook

An unknown error occurred while loading this notebook.  This version can load notebook formats or earlier. See the server log for details.

Alternatively, the notebook may open but present an error when creating or saving a notebook:

Autosave Failed!

Unexpected error while saving file:  MyNotebookName.ipynb database is locked

Link to section 'Solution' of 'Jupyter: database is locked / can not load notebook format' Solution

When Jupyter notebooks are opened, the server keeps track of their state in an internal database (located inside ~/.local/share/jupyter/ folder in your home directory). If a Jupyter process gets terminated abruptly (e.g. due to an out-of-memory error or a host reboot), the database lock is not cleared properly, and future instances of Jupyter detect the lock and complain.

Please follow these steps to resolve:

  1. Fully exit from your existing Jupyter session (close all notebooks, terminate Jupyter, log out from JupyterHub or JupyterLab, terminate OnDemand gateway's Jupyter app, etc).
  2. In a terminal window (SSH, Thinlinc or OnDemand gateway's terminal app) use the following command to clean up stale database locks:
    $ unlock-jupyter
  3. Start a new Jupyter session as usual.

Questions

Frequently asked questions about jobs.

How do I check my job output while it is running?

Link to section 'Problem' of 'How do I check my job output while it is running?' Problem

After submitting your job to the cluster, you want to see the output that it generates.

Link to section 'Solution' of 'How do I check my job output while it is running?' Solution

There are two simple ways to do this:

  • qpeek: Use the tool qpeek to check the job's output. Syntax of the command is:
    qpeek <jobid>
  • Redirect your output to a file: To do this you need to edit the main command in your jobscript as shown below. Please note the redirection command starting with the greater than (>) sign.
    myapplication ...other arguments... > "${PBS_JOBID}.output"
    On any front-end, go to the working directory of the job and scan the output file.
    tail "<jobid>.output"
    Make sure to replace <jobid> with an appropriate jobid.

How can I get email alerts about my PBS job status?

Link to section 'Question' of 'How can I get email alerts about my PBS job status?' Question

How can I be notified when my PBS job starts and whether it completed successfully?

Link to section 'Answer' of 'How can I get email alerts about my PBS job status?' Answer

Submit your job with the following command line arguments

qsub -M email_address -m bea myjobsubmissionfile

Or, include the following in your job submission file.

#PBS -M email_address
#PBS -m bea

The -m option can have any combination of the letters "a", "b", and "e":

a - mail is sent when the job is aborted by the batch system.
b - mail is sent when the job begins execution.
e - mail is sent when the job terminates.
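
Note that Gautschi's batch scheduler is Slurm; if you submit with sbatch rather than qsub, the equivalent directives are:

#SBATCH --mail-user=email_address
#SBATCH --mail-type=BEGIN,END,FAIL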

How do I know Non-uniform Memory Access (NUMA) layout on Gautschi?

  • You can learn about processor layout on Gautschi nodes using the following command:
    gautschi-a003:~$ lstopo-no-graphics
  • For detailed IO connectivity:
    gautschi-a003:~$ lstopo-no-graphics --physical --whole-io
  • Please note that NUMA information is useful for advanced MPI/OpenMP/GPU optimizations. For most users, using default NUMA settings in MPI or OpenMP would give you the best performance.

Why cannot I use --mem=0 when submitting jobs?

Link to section 'Question' of 'Why cannot I use --mem=0 when submitting jobs?' Question

Why can't I specify --mem=0 for my job?

Link to section 'Answer' of 'Why cannot I use --mem=0 when submitting jobs?' Answer

We no longer support requesting unlimited memory (--mem=0), as it has an adverse effect on the way the scheduler allocates jobs and could lead to a large number of nodes being blocked from use.

Most often we suggest relying on the default memory allocation (cluster-specific). But if you need to request a custom amount of memory, you can do so explicitly, for example --mem=20G.

If you want to use the entire node's memory, you can submit the job with the --exclusive option.
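
For example, the relevant lines of a job submission file might look like one of the following:

  (request an explicit amount of memory)
#SBATCH --mem=20G

  (request the whole node, including all of its memory)
#SBATCH --exclusive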

Software

Frequently asked questions about software.

Cannot use pip after loading ml-toolkit modules

Link to section 'Question' of 'Cannot use pip after loading ml-toolkit modules' Question

Pip throws an error after loading the machine learning modules. How can I fix it?

Link to section 'Answer' of 'Cannot use pip after loading ml-toolkit modules' Answer

Machine learning modules (tensorflow, pytorch, opencv, etc.) include a version of pip that is newer than the one installed with Anaconda. As a result, pip will throw an error when you try to use it directly:

$ pip --version
Traceback (most recent call last):
  File "/apps/cent7/anaconda/5.1.0-py36/bin/pip", line 7, in <module>
    from pip import main
ImportError: cannot import name 'main'

The preferred way to use pip with the machine learning modules is to invoke it via Python as shown below.

$ python -m pip --version
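
For example, to install a package into your home directory (the package name is illustrative):

$ python -m pip install --user mypackage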

Julia package installation

Users do not have write permission to the default Julia package installation destination; however, packages can be installed into your home directory under ~/.julia.

You can do this by explicitly defining where to put Julia packages:

$ export JULIA_DEPOT_PATH=$HOME/.julia
$ julia -e 'using Pkg; Pkg.add("PackageName")'

Gateway (Open OnDemand)

Gautschi's Gateway is an open-source HPC portal developed by the Ohio Supercomputer Center. Open OnDemand allows one to interact with HPC resources through a web browser and easily manage files, submit jobs, and interact with graphical applications directly in a browser, all with no software to install. Gautschi has an instance of OnDemand available that can be accessed via gateway.gautschi.rcac.purdue.edu.

Link to section 'Logging In' of 'Gateway (Open OnDemand)' Logging In

To log into Gateway, navigate to gateway.gautschi.rcac.purdue.edu in your web browser and authenticate with your Purdue credentials.

On the splash page you will see a quota usage report. If you are over 90% on any of your quotas a warning will be displayed. This information will update every 10-15 minutes while you are active on Gateway.

Link to section 'Apps' of 'Gateway (Open OnDemand)' Apps

There are a number of built-in apps in Gateway that can be accessed from the top menu bar. Below are links to documentation on each app.

Files

The Files app will let you access your files in your Home Directory, Scratch, and Data Depot spaces. The app lets you create, manage, and delete files and directories from your web browser. Navigate by double clicking on folders in the file explorer or by using the file tree on the left.

Open OnDemand file browser
The browser-based file explorer. Navigate by double clicking on folders in the file explorer or by using the file tree on the left.

On the top row, there are buttons to:

  • Go To: directly input a directory to navigate to
  • Open in Terminal: launches the Shell app and navigates you to the current directory in the terminal
  • New File: creates a new, empty file
  • New Dir: creates a new, empty directory
  • Upload: upload a file from your computer

Note: File uploads from your browser are limited to 100 GB per file. Be mindful that uploads over a few gigabytes may be unreliable through your browser, especially from off-campus connections. For very large files or off-campus transfers, alternative methods such as Globus are highly recommended.

The second row of buttons lets you perform typical file management operations. The Edit button opens files in a fully-fledged browser-based text editor featuring syntax highlighting and vim and Emacs key bindings.

Open OnDemand file editor
The browser-based text editor interface, shown here editing a Bash script, includes syntax highlighting, font-size adjustments, and various key bindings.

Jobs

There are two apps under the Jobs menu: Active Jobs and Job Composer. These are detailed below.

Link to section 'Active Jobs' of 'Jobs' Active Jobs

This shows you the active SLURM jobs currently on the cluster. The default view shows your current jobs, similar to squeue -u myusername. The button labeled "Your Jobs" in the upper right allows you to select different filters by queue (account). All accounts output by slist will appear for you here. The arrow on the left-hand side expands the full job details.

A table of active jobs
The table of active jobs shows useful information such as queue, status, cluster, and ID. It can be sorted by clicking the headers of each column or searched with the "Filter" box above it.

Link to section 'Job Composer' of 'Jobs' Job Composer

The Job Composer app allows you to create and submit jobs to the cluster. You can select from pre-defined templates (most of these are taken from the User Guide examples) or you can create your own templates for frequently used workflows.

Link to section 'Creating Job from Existing Template' of 'Jobs' Creating Job from Existing Template

Click "New Job" menu, then select "From Template":

The job composer interface
When clicking the 'New Job' button a drop-down will show a few options. "From Template" is usually the second item in the list.

Then select from one of the available templates.

A sortable data table containing a list of all the available templates.
Select one of the templates by clicking its row in the table of available templates.

Click 'Create New Job' in the second pane.

The 'Create New Job' pane
The "Create New Job" pane will show form options for "Job Name", "Cluster", and "Script Name" with the "Create New Job" button below.

Your new job should be selected in your list of jobs. In the 'Submit Script' pane you can see the job script that was generated, with an 'Open Editor' link to open the script in the built-in editor. Open the file in the editor and edit the script as necessary. By default the job will specify the standby queue; change this as appropriate, along with the node and walltime requests.

The 'Submit Script' pane
The "Submit Script" pane will show a preview of the contents of the script file and action buttons below.

When you are finished with editing the job and are ready to submit, click the green 'Submit' button at the top of the job list. You can monitor progress from here or from the Active Jobs app. Once completed, you should see the output files appear:

A list of files found in the output folder
The folder contents will be listed, showing the resulting output files from running the submitted script.

Clicking on one of the output files will open it in the file editor for your viewing.

Link to section 'Creating New Template' of 'Jobs' Creating New Template

First, prepare a template directory containing a template submission script along with any input files. Then, to import the job into the Job Composer app, click the 'Create New Template' button. Fill in the directory containing your template job script and files in the first box. Give it an appropriate name and notes.

The 'Create New Template' form
The "Create New Template" form has inputs for "Path", "Name", "Cluster", and "Notes". If "Path" is left blank, a default job script will be added to the new template.

This template will now appear in your list of templates to choose from when composing jobs. You can now go create and submit a job from this new template.

Cluster Tools

The Cluster Tools menu contains cluster utilities. At the moment, only a terminal app is provided. Additional apps may be developed and provided in the future.

Link to section 'Shell Access' of 'Cluster Tools' Shell Access

Launching the shell app will provide you with a web-based terminal session on the cluster front-end. This is equivalent to using a standalone SSH client to connect to gautschi.rcac.purdue.edu, where you are connected to one of several front-ends. The normal acceptable front-end use policy applies to access through the web app. X11 forwarding is not supported; use one of the interactive apps for graphical applications.

Interactive Apps

There are several interactive apps available through Gateway that can be accessed through the Interactive Apps dropdown menu. These apps are provided with a basic node and software configuration as a 'quick-launch' option to get your work up and running quickly. For simplicity, minimal options are provided - these apps are not intended for complex configuration/customization scenarios.

After you submit an interactive app to the queue, Gateway will track and manage the session. Once it starts, you may connect and disconnect from the session in your browser, leaving the job running while you log out of your browser.

Each of the available apps are documented through the following links.

Compute Node Desktop

The Compute Node Desktop app will launch a graphical desktop session on a compute node. This is similar to using Thinlinc; however, this gives you a desktop directly on a compute node instead of on a front-end. This app is useful if you would like to run a custom application, or an application not directly available as an interactive app, inside Gateway.

To launch a desktop session on a compute node, select the Gautschi Compute Desktop app. From the submit form, select from the available options - the queue to which you wish to submit and the number of wallclock hours you wish to have the job running. There is also a checkbox that enables an email notification when the job starts.

After the interactive job is submitted you will be taken to your list of active interactive app sessions. You can monitor the status of the job from here until it starts, or if you enabled the email notification, watch your Purdue email for the notification the job has started.

Once it is indicated the job has started you can connect to the desktop with the "Launch noVNC in New Tab" button. The session will be terminated after the wallclock hours you specified have elapsed or you terminate the session early with the "Delete" button from the list of sessions. Deleting the session when you are finished will free up queue resources for your lab mates and other users on the system.

MATLAB

The MATLAB app will launch a MATLAB session on a compute node and allow you to connect directly to it in a web browser.

To launch a MATLAB session on a compute node, select the MATLAB app. From the submit form, select from the available options - the version of MATLAB you are interested in running, the queue to which you wish to submit, and the number of wallclock hours you wish to have the job running. There is also a checkbox that enables an email notification when the job starts.

After the interactive job is submitted you will be taken to your list of active interactive app sessions. You can monitor the status of the job from here until it starts, or if you enabled the email notification, watch your Purdue email for the notification the job has started.

Once it is indicated the job has started you can connect to the desktop with the "Launch noVNC in New Tab" button. The session will be terminated after the wallclock hours you specified have elapsed or you terminate the session early with the "Delete" button from the list of sessions. Deleting the session when you are finished will free up queue resources for your lab mates and other users on the system.

NOTE: There are known issues with running MATLAB in this way and resizing your web browser. Graphical corruption may occur if you resize the browser. Fixes for this are being investigated.

Jupyter Notebook - Deep Neural Networks Demo (GPU)

The Notebook app will launch a Notebook session on a compute node and allow you to connect directly to it in a web browser. It can be used to run GPU applications such as TensorFlow and Keras. Below is a demo to get you started.

Open OnDemand launch page for Jupyter Notebooks
"Jupyter Notebook" can be found under "GUIs" in the "Interactive Apps" menu. This takes you to the launch page, with options for selecting the 'Queue', 'Number of hours', and email notifications.
  • Select the queue to which you wish to submit and enter the number of wallclock hours you require. Your notebook will be terminated after this number of hours elapses.
  • Click Launch.
  • Wait for your interactive session to change to Running state. This may take some time depending on how busy the queue and system is.
  • Click on 'Connect to Jupyter' once the button appears.
Active Jupyter Notebook session in Open OnDemand
When ready, the session will show a "Running state" with details about the session such as "Host", "Created at", "Time Remaining", and "Session ID". The "Connect to Jupyter" button will also become available.
  • Once in Jupyter, select 'Upload' in the upper right corner. You may first wish to create a folder or change into a different directory in which to put the demo notebook.
Upload button in a Jupyter Notebook
The 'Upload' button in a Notebook can be found in the upper right corner next to a directory selector and refresh button.
  • Select the demo notebook file you downloaded earlier. Click the blue Upload button to complete the upload. Then click the dnn.ipynb item from the file list to launch the notebook.
  • You should now have the notebook loaded and you should be able to re-execute the code cells, or modify them to your needs.
A running Jupyter Notebook
A running Notebook will have a main menu and toolbar buttons across the top with individually marked code and text cells below.

Jupyter Notebook

The Notebook app will launch a Notebook session on a compute node and allow you to connect directly to it in a web browser.

To launch a Notebook session on a compute node, select the Notebook app. From the submit form, select from the available options:

  1. Queue: This is a dropdown menu from which you can select a queue from all of the queues to which you have permission to submit.
  2. Walltime: This is a field which expects a number and represents how many hours you want to keep the session running. Note that this value should not exceed the maximum value given next to the selected queue name from the queue dropdown menu.
  3. Number of Cores/GPUs: This is a field which expects a number and represents the number of resources your session is requesting. Note that the amount of memory allocated for your session is proportional to the number of cores or GPUs that you request for your job, so if your session is running out of memory, consider increasing this value.
  4. Use Jupyter Lab: This is a checkbox which, when checked, will run Jupyter Lab instead of Jupyter Notebook. Both of these applications are interfaces to Jupyter, and you can launch Jupyter notebooks from within Jupyter Lab. Jupyter Notebook is more "barebones" while Jupyter Lab has additional features such as the ability to interact with additional file types.
  5. E-mail Notice: This is a checkbox which, when checked, will send you an e-mail notification to your Purdue e-mail that your session is ready when the scheduler has found resources to dedicate to your session.

After the interactive job is submitted you will be taken to your list of active interactive app sessions. You can monitor the status of the job from here until it starts, or if you enabled the email notification, watch your Purdue email for the notification the job has started.

Once it is indicated the job has started you can connect to the desktop with the "Connect to Jupyter" button. Once connected, you can create new notebooks, selecting from the Anaconda versions currently available as modules and any personally created Notebook kernels.

Oftentimes you may want to use one of your existing Anaconda environments within your Jupyter session to use libraries specific to your workflow. In order to do so, you must ensure that the Anaconda environment you want to use contains the Python packages "ipykernel" and "ipython", which are required by Jupyter. When you create a Jupyter session, Open OnDemand will check through your existing Anaconda environments and create a Jupyter kernel for any environment that contains these two packages, and you will be able to select that kernel from within the application.
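
For example, a minimal sketch of adding the required packages to an existing environment (the environment name myenv is illustrative, and this assumes conda is provided by an anaconda module):

$ module load anaconda
$ conda install -n myenv ipykernel ipython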

The session will be terminated after the number of hours you specified have elapsed or you terminate the session early with the "Delete" button from the list of sessions. Deleting the session when you are finished will free up queue resources for your lab mates and other users on the system.

RStudio Server

The RStudio app will launch an RStudio session on a compute node and allow you to connect directly to it in a web browser.

To launch an RStudio session on a compute node, select the RStudio app. From the submit form, select from the available options - the queue to which you wish to submit and the number of wallclock hours you wish to have the job running. There is also a checkbox that enables an email notification when the job starts.

After the interactive job is submitted you will be taken to your list of active interactive app sessions. You can monitor the status of the job from here until it starts, or if you enabled the email notification, watch your Purdue email for the notification the job has started.

Once it is indicated the job has started you can connect to the desktop with the "Connect to RStudio Server" button. The session will be terminated after the wallclock hours you specified have elapsed or you terminate the session early with the "Delete" button from the list of sessions. Deleting the session when you are finished will free up queue resources for your lab mates and other users on the system.

Windows Desktop

The Windows Desktop app will launch a Windows desktop session on a compute node. This is similar to using the Windows menu launcher through Thinlinc; however, this gives you a Windows desktop directly on a compute node instead of on a front-end.

To launch a Windows session on a compute node, select the Windows Desktop app. From the submit form, select from the available options - choose from the basic Windows configuration or the GIS-configured image, the queue to which you wish to submit, and the number of wallclock hours you wish to have the job running. There is also a checkbox that enables an email notification when the job starts.

This will create a file in your scratch space called windows-base.qcow2 or windows-gis.qcow2. If the file already exists, the existing image will be restarted. You can delete or rename the image at any time through the Files App to generate a fresh image. You can only have one instance of the image running at a time or corruption will occur. There are lock files to prevent this, but be mindful of this restriction. It is also recommended you make periodic backups of the image if you are making any modifications to it.

After the interactive job is submitted you will be taken to your list of active interactive app sessions. You can monitor the status of the job from here until it starts, or if you enabled the email notification, watch your Purdue email for the notification the job has started.

Once it is indicated the job has started you can connect to the desktop with the "Launch noVNC in New Tab" button. The session will be terminated after the wallclock hours you specified have elapsed or you terminate the session early with the "Delete" button from the list of sessions. Deleting the session when you are finished will free up queue resources for your lab mates and other users on the system.
