Weber User Guide

Overview of Weber

Weber is Purdue's new specialty high-performance computing cluster for data, applications, and research that are covered by export control regulations such as EAR or ITAR, or that require compliance with NIST SP 800-171. Weber was built through a partnership with HP and AMD in August 2019. Weber consists of HP compute nodes with two 10-core Intel Xeon E5 "Haswell" processors (20 cores per node) and 64 GB of memory. All nodes have a 56 Gbps FDR Infiniband interconnect.

To purchase access to Weber today, please contact the Export Controls office at exportcontrols@purdue.edu, or contact us via email at rcac-cluster-purchase@lists.purdue.edu if you have any questions.

Weber Namesake

Weber is named in honor of Mary Ellen Weber, scientist and former astronaut. More information about her life and impact on Purdue is available in an ITaP Biography of Weber.

Weber Specifications

All Weber nodes have 20 processor cores, 64 GB of RAM, and 56 Gbps Infiniband interconnects.

Weber Front-Ends
Front-Ends   Number of Nodes   Processors per Node           Cores per Node   Memory per Node   Retires in
Interim      2                 Two Sky Lake CPUs @ 2.10GHz   16               192 GB            2020
Coming       4                 AMD Rome CPUs                 64               256 GB            2023

Weber Sub-Clusters

Sub-Cluster   Number of Nodes   Processors per Node          Cores per Node   Memory per Node   Retires in
A             4                 Two Haswell CPUs @ 2.60GHz   20               64 GB             2023

Weber nodes run CentOS 7 and use SLURM as the batch system for resource and job management. The application of operating system patches occurs as security needs dictate. All nodes allow for unlimited stack usage, as well as unlimited core dump size (though disk space and server quotas may still be a limiting factor).
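
You can check the limits currently in effect in your own shell with the ulimit builtin:

  (show the current stack size limit)
$ ulimit -s

  (show the current core dump size limit)
$ ulimit -c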

On Weber, ITaP recommends the following set of compiler, math library, and message-passing library for parallel code:

  • Intel
  • MKL
  • Intel MPI

This compiler and these libraries are loaded by default. To load the recommended set again:

$ module load rcac

To verify what you loaded:

$ module list

Accounts on Weber

Access

Access to Weber may only be granted to projects subject to Controlled Unclassified Information (CUI) regulations, including International Traffic in Arms Regulations (ITAR). For more information, please visit the Export Controls office website. Questions about accessing this cluster may be directed to exportcontrols@purdue.edu.

VPN

In addition to login credentials for Weber itself, you will need access to a restricted Virtual Private Network (VPN) in order to connect to Weber. Login to the VPN is through BoilerKey, using two-factor authentication.

Weber users can request approval for reedvpn.itap.purdue.edu/cui themselves through the Purdue BoilerKey management page. Click the Manage link in the center, then Where you use BoilerKey in the second row of icons. Scroll down to the "BoilerKey Services you are eligible to request" section and click Request for cui. Once the request is approved, you should receive an email notification, and cui should appear in the "BoilerKey Services Approved for Use" section.

BoilerKey

Information about BoilerKey can be found here, and a self-serve page to request a BoilerKey is here.

You can choose to request either a physical keyfob device or a smartphone app for your BoilerKey access. In either case, you will need to use your BoilerKey each time you want to connect your workstation to the VPN.

Connecting to Weber

Windows:

VPN

  • Download and install the CISCO VPN client from Purdue WebVPN. Your VPN client may periodically auto-patch itself with the latest security enhancements.

Linux / Mac:

Login follows the same process as for Windows.

Additional login instructions may be available to you after signing in to this website (use the link in the upper right corner).

File Storage and Transfer for Weber

Archive and Compression

Archived files and directories must remain on Weber and cannot be removed from the cluster without prior authorization. Even after a project ends, project materials must be placed within ???. There are several options for archiving and compressing groups of files or directories on ITaP research systems. The most commonly used options are:

tar

See the official documentation for tar for more information.

Saves many files together into a single archive file, and restores individual files from the archive. Includes automatic archive compression/decompression options and special features for incremental and full backups.

Examples:


  (list contents of archive somefile.tar)
$ tar tvf somefile.tar

  (extract contents of somefile.tar)
$ tar xvf somefile.tar

  (extract contents of gzipped archive somefile.tar.gz)
$ tar xzvf somefile.tar.gz

  (extract contents of bzip2 archive somefile.tar.bz2)
$ tar xjvf somefile.tar.bz2

  (archive all ".c" files in current directory into one archive file)
$ tar cvf somefile.tar *.c

  (archive and gzip-compress all files in a directory into one archive file)
$ tar czvf somefile.tar.gz somedirectory/

  (archive and bzip2-compress all files in a directory into one archive file)
$ tar cjvf somefile.tar.bz2 somedirectory/

Other arguments for tar can be explored by using the man tar command.

gzip

See the official documentation for gzip for more information.

The standard compression system for all GNU software.

Examples:


  (compress file somefile - also removes uncompressed file)
$ gzip somefile

  (uncompress file somefile.gz - also removes compressed file)
$ gunzip somefile.gz

bzip2

See the official documentation for bzip2 for more information.

Strong, lossless data compressor based on the Burrows-Wheeler transform. Stronger compression than gzip.

Examples:


  (compress file somefile - also removes uncompressed file)
$ bzip2 somefile

  (uncompress file somefile.bz2 - also removes compressed file)
$ bunzip2 somefile.bz2

There are several other, less commonly used, options available as well (brief examples follow this list):

  • zip
  • 7zip
  • xz
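
For reference, typical invocations of these tools look like the following (a quick sketch; consult each tool's man page for full options):

  (recursively archive and compress a directory into a .zip file)
$ zip -r somefile.zip somedirectory/

  (extract a .zip archive)
$ unzip somefile.zip

  (archive and compress a directory with 7zip)
$ 7z a somefile.7z somedirectory/

  (compress file somefile - also removes uncompressed file)
$ xz somefile

  (uncompress file somefile.xz - also removes compressed file)
$ unxz somefile.xz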

Environment Variables

Several environment variables are automatically defined for you to help you manage your storage. Use environment variables instead of actual paths whenever possible to avoid problems if the specific paths to any of these change.

Some of the environment variables you should have are:
Name           Description
HOME           path to your home directory
PWD            path to your current directory
RCAC_SCRATCH   path to scratch filesystem

By convention, environment variable names are all uppercase. You may use them on the command line or in any scripts in place of and in combination with hard-coded values:

$ ls $HOME
...

$ ls $RCAC_SCRATCH/myproject
...

To find the value of any environment variable:

$ echo $RCAC_SCRATCH
/scratch/weber/myusername

To list the values of all environment variables:

$ env
USER=myusername
HOME=/home/myusername
RCAC_SCRATCH=/scratch/weber/myusername
...

You may create or overwrite an environment variable. To pass (export) the value of a variable in bash:

$ export MYPROJECT=$RCAC_SCRATCH/myproject

To assign a value to an environment variable in either tcsh or csh:

$ setenv MYPROJECT value

Storage Options

File storage options on ITaP research systems include long-term storage (home directories, Archive) and short-term storage (scratch directories). Each option has different performance and intended uses, and some options vary from system to system as well. ITaP provides daily snapshots of home directories for a limited time for accidental deletion recovery. ITaP does not back up scratch directories and regularly purges old files from scratch directories. More details about each storage option appear below.

Home Directory

ITaP provides home directories for long-term file storage. Each user has one home directory. You should use your home directory for storing important program files, scripts, input data sets, critical results, and frequently used files. Your home directory becomes your current working directory, by default, when you log in.

Your home directory physically resides on a dedicated storage system accessible only from Weber. To find the path to your home directory, first log in, then immediately enter the following:

$ pwd
/home/myusername

Or from any subdirectory:

$ echo $HOME
/home/myusername

Your home directory has a quota limiting the total size of files you may store within. For more information, refer to the Storage Quotas / Limits Section.

Lost File Recovery

ITaP maintains daily snapshots of your home directory for seven days in the event of accidental deletion. Cold storage backups of snapshots are kept for 90 days. For additional security, you should store another copy of your files on more permanent storage, such as the Fortress HPSS Archive.

Performance

Your home directory is medium-performance, non-purged space suitable for tasks like sharing data, editing files, developing and building software, and many other uses.

Your home directory is not designed or intended for use as high-performance working space for running data-intensive jobs with heavy I/O demands.

Project Directory

ITaP provides project directories for storing important program files, scripts, input data sets, critical results, and frequently used files that should be accessible to an entire research group. These files are shared on Weber Data Depot, which is a high-capacity, fast, reliable and secure data storage service designed, configured, and operated for restricted data. These data files are only accessible via Weber.

Weber Data Depot Features

Weber Data Depot offers research groups in need of restricted data storage unique features and benefits:

  • Available

    Research groups that use Weber have access to unlimited mirrored data storage without additional charges.

  • Accessible

    Directly from Weber.

  • Capable

    The Weber Data Depot facilitates joint work on shared files across your research group, avoiding the need for numerous copies of datasets across individuals' home or scratch directories. It is an ideal place to store group applications, tools, scripts, and documents.

  • Controllable Access

    Access is managed in consultation with the Export Control Office. Additional Unix groups may be created to assist you in setting appropriate permissions to allow exactly the access you want and prevent any you do not.

  • Data Retention

    All data kept in the Weber Data Depot remains owned by the research group's lead faculty. When researchers or students leave your group, any files left in their home directories may become difficult to recover. Files kept in Weber Data Depot remain with the research group, unaffected by turnover, and could head off potentially difficult disputes.

  • Never Purged

    The Weber Data Depot is never subject to purging.

  • Reliable

    The Weber Data Depot is redundant and protected against hardware failures and accidental deletion.
    ITaP maintains daily snapshots of your project directory for seven days in the event of accidental deletion. Cold storage backups of snapshots are kept for 90 days. For additional security, you should store another copy of your files on more permanent storage, such as the Fortress HPSS Archive.

  • Restricted Data

    The Weber Data Depot is approved for ITAR/CUI restricted data.

Weber Data Depot Hardware Details

The Weber Data Depot uses an enterprise-class ZFS storage solution with an initial total capacity of 10 TB. This storage is redundant, reliable, and features regular snapshots. The Weber Data Depot is non-purged space suitable for tasks such as sharing data, editing files, developing and building software, and many other uses. Built on Data Direct Networks' SFA12k storage platform, the Weber Data Depot has redundant storage arrays.

Default Configuration

This is what a default configuration looks like for a research group called "mylab":

/depot/mylab/
            +--apps/
            |
            +--data/
            |
            +--etc/
            |     +--bashrc
            |     +--cshrc
            |
 (other subdirectories)

The /depot/mylab/ directory is the main top-level directory for all your research group storage. All files are to be kept within one of its subdirectories, based on your specific access requirements. ITaP will create these subdirectories after consulting with you as to exactly what you need.

By default, ITaP will create the following subdirectories, with the following access and use models. All of these details can be changed to suit the particular needs of your research group.

  • data/
    Intended for read and write use by a limited set of people chosen by the research group's managers.
    Restricted to not be readable or writable by anyone else.
    This is frequently used as an open space for storage of shared research data.
  • apps/
    Intended for write use by a limited set of people chosen by the research group's managers.
    Restricted to not be writable by anyone else.
    Allows read and execute by anyone who has access to any cluster queues owned by the research group and anyone who has other file permissions granted by the research group (such as "data" access above).
    This is frequently used as a space for central management of shared research applications.
  • etc/
    Intended for write use by a limited set of people chosen by the research group's managers (by default, the same as for "apps" above).
    Restricted to not be writable by anyone else.
    Allows read and execute by anyone who has access to any cluster queues owned by the research group and anyone who has other file permissions granted by the research group (such as "data" access above).
    This is frequently used as a space for central management of shared startup/login scripts, environment settings, aliases, etc.
  • etc/bashrc
    etc/cshrc
    Examples of group-wide shell startup files. Group members can source these from their own $HOME/.bashrc or $HOME/.cshrc files so that they automatically pick up changes to their environment needed to work with the research group's applications and data. There are more detailed instructions in these files on how to use them; a minimal example appears after this list.
  • Additional subdirectories can be created as needed in the top and/or any of the lower levels. Just contact rcac-help@purdue.edu and we will be happy to figure out what will work best for your needs.
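
As an illustration, a member of the "mylab" group using bash could source the group-wide startup file from their own $HOME/.bashrc (a minimal sketch, assuming the default layout shown above):

# In $HOME/.bashrc: pick up the group's shared environment settings, if present.
if [ -f /depot/mylab/etc/bashrc ]; then
    source /depot/mylab/etc/bashrc
fi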

Archive "Cold Storage"

Storage for completed, inactive, or retired projects is available via Archive. Data stored in Archive must be stored in a compressed format. Once data is stored in Archive, it can only be retrieved by contacting Export Control to request access. Access may be granted for report generation, thesis/dissertation development, etc. However, once data is stored in Archive, it cannot be modified in any way.

Scratch Space

ITaP provides scratch directories for short-term file storage only. The quota of your scratch directory is much greater than the quota of your home directory. You should use your scratch directory for storing temporary input files which your job reads or for writing temporary output files which you may examine after execution of your job. You should use your home directory and Weber long-term storage for holding critical results.

Files in scratch directories are not recoverable. ITaP does not back up files in scratch directories. If you accidentally delete a file, a disk crashes, or old files are purged, they cannot be restored. Unique among our cluster resources, data are not purged from Weber scratch directories at this time.

All users may access scratch directories on Weber. To find the path to your scratch directory:

$ findscratch
/scratch/weber/myusername

The value of variable $RCAC_SCRATCH is your scratch directory path. Use this variable in any scripts. Your actual scratch directory path may change without warning, but this variable will remain current.

$ echo $RCAC_SCRATCH
/scratch/weber/myusername

Your scratch directory has a quota capping the total size and number of files you may store in it. For more information, refer to the section Storage Quotas / Limits.

Performance

Your scratch directory is located on a high-performance, large-capacity parallel filesystem engineered to provide work-area storage optimized for a wide variety of job types. It is designed to perform well with data-intensive computations, while scaling well to large numbers of simultaneous connections.

Storage Quota / Limits

ITaP imposes some limits on your disk usage on research systems. ITaP implements a quota on each filesystem. Each filesystem (home directory, scratch directory, etc.) may have a different limit. If you exceed the quota, you will not be able to save new files or new data to the filesystem until you delete or move data to long-term storage.

Checking Quota

To check the current quotas of your home and scratch directories check the My Quota page or use the myquota command:

$ myquota
Type        Filesystem          Size    Limit  Use         Files    Limit  Use
==============================================================================
home        myusername         5.0GB   25.0GB  20%             -        -   -
scratch     /scratch/weber/      8KB  476.8GB   0%             2  100,000   0%

The columns are as follows:

  • Type: indicates home or scratch directory.
  • Filesystem: name of storage option.
  • Size: sum of file sizes in bytes.
  • Limit: allowed maximum on sum of file sizes in bytes.
  • Use: percentage of file-size limit currently in use.
  • Files: number of files and directories (not the size).
  • Limit: allowed maximum on number of files and directories. It is possible, though unlikely, to reach this limit and not the file-size limit if you create a large number of very small files.
  • Use: percentage of file-number limit currently in use.

If you find that you reached your quota in either your home directory or your scratch file directory, obtain estimates of your disk usage. Find the top-level directories which have a high disk usage, then study the subdirectories to discover where the heaviest usage lies.

To see in a human-readable format an estimate of the disk usage of your top-level directories in your home directory:

$ du -h --max-depth=1 $HOME
32K     /home/myusername/mysubdirectory_1
529M    /home/myusername/mysubdirectory_2
608K    /home/myusername/mysubdirectory_3

The second directory is the largest of the three, so apply the du command to it next.

To see in a human-readable format an estimate of the disk usage of your top-level directories in your scratch file directory:

$ du -h --max-depth=1 $RCAC_SCRATCH
160K    /scratch/weber/myusername

This strategy can be very helpful in figuring out the location of your largest usage. Move unneeded files and directories to long-term storage to free space in your home and scratch directories.

Increasing Quota

Home Directory

If you find you need additional disk space in your home directory, please first consider archiving and compressing old files and moving them to long-term storage on Weber. Unfortunately, it is not possible to increase your home directory quota beyond its current level.

Scratch Space

If you find you need additional disk space in your scratch space, please first consider archiving and compressing old files and moving them to long-term storage on Weber.

File Transfer

In order to comply with regulatory security requirements, files may only be imported into Weber via Weber Inbox and exported from the system via Weber Outbox.

Egress HTTPS

Users may access white-listed, secure Drop Sites approved by Export Control and ITSP. These websites may be accessed via a web browser within the ThinLinc server instance.

Ingress SFTP

ITaP does not support FTP on any ITaP research systems because it does not allow for secure transmission of data. Use SFTP instead, as described below.

SFTP (Secure File Transfer Protocol) is a reliable way of transferring files between two machines. SFTP is available as a protocol choice in some graphical file transfer programs and also as a command-line program on most Linux, Unix, and Mac OS X systems. SFTP has more features than SCP and allows for other operations on remote files, remote directory listing, and resuming interrupted transfers. Command-line SFTP cannot recursively copy directory contents; to do so, try using SCP or a graphical SFTP client.
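
A typical command-line SFTP session looks like the following; the hostname shown is only a placeholder for the actual address of the server you are transferring to:

$ sftp myusername@some-sftp-hostname
sftp> put mydatafile.tar.gz
sftp> quit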

Microsoft Windows:

  • MobaXterm
    Free, full-featured, graphical Windows SSH, SCP, and SFTP client.

Mac OS X:

  • The "sftp" command-line program should already be installed. You may start a local terminal window from "Applications->Utilities".
  • Cyberduck is a full-featured and free graphical SFTP and SCP client.

Accessing the Weber Inbound SFTP Server

The only data that may be uploaded to Weber are those allowed by your project's Technology Control Plan. Normal, fundamental research data should not be uploaded to Weber.

Additional login instructions may be available to you after signing in to this website (use the link in the upper right corner).

Egress SFTP

ITaP does not support FTP on any ITaP research systems because it does not allow for secure transmission of data. Use SFTP instead, as described below.

SFTP (Secure File Transfer Protocol) is a reliable way of transferring files between two machines. SFTP is available as a protocol choice in some graphical file transfer programs and also as a command-line program on most Linux, Unix, and Mac OS X systems. SFTP has more features than SCP and allows for other operations on remote files, remote directory listing, and resuming interrupted transfers. Command-line SFTP cannot recursively copy directory contents; to do so, try using SCP or a graphical SFTP client.

Microsoft Windows:

  • MobaXterm
    Free, full-featured, graphical Windows SSH, SCP, and SFTP client.

Mac OS X:

  • The "sftp" command-line program should already be installed. You may start a local terminal window from "Applications->Utilities".
  • Cyberduck is a full-featured and free graphical SFTP and SCP client.

Transferring Files out of Weber

The only data that may be downloaded from Weber are those allowed by your project's Technology Control Plan.

It is the responsibility of the Principal Investigator and their project team members to ensure that data is either (A) unrestricted or (B) uncontrolled research before removing it from Weber. Controlled research could be subject to publication restrictions or dissemination controls, or may involve proprietary or controlled inputs that make it subject to regulations. Controlled research must be properly marked and protected prior to distribution if specified by the Technology Control Plan.

Additional login instructions may be available to you after signing in to this website (use the link in the upper right corner).

Applications

The cluster provides a number of software packages to users of the system via the module command.

Environment Management with the Module Command

ITaP uses the module command as the preferred method to manage your processing environment. With this command, you may load applications and compilers along with their libraries and paths. Modules are packages which you load and unload as needed.

Please use the module command and do not manually configure your environment, as ITaP staff may make changes to the specifics of various packages. If you use the module command to manage your environment, these changes will not be noticeable.

Hierarchy

Many modules have dependencies on other modules. For example, a particular openmpi module requires a specific version of the Intel compiler to be loaded. Often these dependencies are not clear to users of the module, and many modules may conflict with one another. Arranging modules in a hierarchical fashion makes these dependencies clear. This arrangement also helps keep the software stack easy to understand: your view of the modules will not be cluttered with conflicting packages.

Your default module view on Weber will include a set of compilers and the set of basic software that has no dependencies (such as Matlab and Fluent). To make software available that depends on a compiler, you must first load the compiler; the software which depends on it then becomes available to you. In this way, all software you see when running "module avail" is mutually compatible.

Using the Hierarchy

To see what modules are available on this system by default:

$ module avail

To see which versions of a specific compiler are available on this system:

$ module avail gcc
$ module avail intel

To continue further into the hierarchy of modules, you will need to choose a compiler. As an example, if you are planning on using the Intel compiler you will first want to load the Intel compiler:

$ module load intel

With intel loaded, you can repeat the avail command, and at the bottom of the output you will see a section of additional software that the intel module provides:

$ module avail

Several of these new packages also provide additional software packages, such as MPI libraries. You can repeat the last two steps with one of the MPI packages such as openmpi and you will have a few more software packages available to you.

If you are looking for a specific software package and do not see it in your default view, the module command provides a search function for searching the entire hierarchy tree of modules without the need for you to manually load and avail every module.
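
If the module system is Lmod, as the hierarchy described here suggests, that search function is the module spider subcommand (the package and version below are only examples):

  (search the entire module hierarchy for a package)
$ module spider openmpi

  (show what must be loaded to make a specific version available)
$ module spider openmpi/3.1.4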

Load / Unload a Module

All modules consist of both a name and a version number. When loading a module, you may use only the name to load the default version, or you may specify which version you wish to load.

For each cluster, ITaP makes a recommendation regarding the set of compiler, math library, and MPI library for parallel code. To load the recommended set:

$ module load rcac

To verify what you loaded:

$ module list

To load the default version of a specific compiler, choose one of the following commands:

$ module load gcc
$ module load intel

When running a job, you must load any relevant modules in the job submission file so that they are available on the compute node(s). Loading modules on the front end before submitting your job makes the software available to your front-end session, but not to your job's environment. You must load the necessary modules in your job submission script.

To unload a compiler or software package you loaded previously:

$ module unload gcc
$ module unload intel
$ module unload matlab

To unload all currently loaded modules and reset your environment:

$ module purge

Show Module Details

To learn more about what a module does to your environment, you may use the module show command.
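
For example, to see what loading the intel module would change in your environment:

$ module show intel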

Running Jobs

There is one method for submitting jobs to Weber. You may use SLURM to submit jobs to a partition on Weber. SLURM performs job scheduling. Jobs may be any type of program. You may use either the batch or interactive mode to run your jobs. Use the batch mode for finished programs; use the interactive mode only for debugging.

In this section, you'll find a few pages describing the basics of creating and submitting SLURM jobs, as well as a number of example SLURM jobs that you may be able to adapt to your own needs.

Basics of SLURM Jobs

The Simple Linux Utility for Resource Management (SLURM) is a system providing job scheduling and job management on compute clusters. With SLURM, a user requests resources and submits a job to a queue. The system will then take jobs from queues, allocate the necessary nodes, and execute them.

Do NOT run large, long, multi-threaded, parallel, or CPU-intensive jobs on a front-end login host. All users share the front-end hosts, and running anything but the smallest test job will negatively impact everyone's ability to use Weber. Always use SLURM to submit your work as a job.

Link to section 'Submitting a Job' of 'Basics of SLURM Jobs' Submitting a Job

The main steps to submitting a job are:

  • Prepare a job submission file (a shell script containing your commands)
  • Submit the job to a queue
  • Monitor the job's status and check its output

Follow the links below for information on these steps, and other basic information about jobs. A number of example SLURM jobs are also available.

Job Submission Script

To submit work to a SLURM queue, you must first create a job submission file. This job submission file is essentially a simple shell script. It will set any required environment variables, load any necessary modules, create or modify files and directories, and run any applications that you need:

#!/bin/sh -l
# FILENAME:  myjobsubmissionfile

# Loads Matlab and sets the application up
module load matlab

# Change to the directory from which you originally submitted this job.
cd $SLURM_SUBMIT_DIR

# Runs a Matlab script named 'myscript'
matlab -nodisplay -singleCompThread -r myscript

Once your script is prepared, you are ready to submit your job.

Link to section 'Job Script Environment Variables' of 'Job Submission Script' Job Script Environment Variables

SLURM sets several potentially useful environment variables which you may use within your job submission files. Here is a list of some:
Name                  Description
SLURM_SUBMIT_DIR      Absolute path of the current working directory when you submitted this job
SLURM_JOBID           Job ID number assigned to this job by the batch system
SLURM_JOB_NAME        Job name supplied by the user
SLURM_JOB_NODELIST    Names of nodes assigned to this job
SLURM_CLUSTER_NAME    Name of the cluster executing the job
SLURM_SUBMIT_HOST     Hostname of the system where you submitted this job
SLURM_JOB_PARTITION   Name of the original queue to which you submitted this job
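
As a brief illustration, a job script can use these variables to record where and how it ran (the filename is hypothetical):

#!/bin/bash
# FILENAME: where_am_i.sub

# Record the job's identity and placement in the job output file.
echo "Job $SLURM_JOBID ($SLURM_JOB_NAME) is running on: $SLURM_JOB_NODELIST"
echo "Submitted from $SLURM_SUBMIT_HOST in directory $SLURM_SUBMIT_DIR"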

Submitting a Job

Once you have a job submission file, you may submit this script to SLURM using the sbatch command. SLURM will find, or wait for, available resources matching your request and run your job there.

To submit your job to one compute node:

$ sbatch --nodes=1 myjobsubmissionfile

Slurm uses the word 'Account' and the option '-A' to specify different batch queues. To submit your job to a specific queue:

$ sbatch --nodes=1 -A partner myjobsubmissionfile

By default, each job receives 30 minutes of wall time, or clock time. If you know that your job will not need more than a certain amount of time to run, request less than the maximum wall time, as this may allow your job to run sooner. To request 1 hour and 30 minutes of wall time:

$ sbatch -t 1:30:00 --nodes=1 -A partner myjobsubmissionfile

The --nodes value indicates how many compute nodes you would like for your job.

Each compute node in Weber has 20 processor cores.

In some cases, you may want to request multiple nodes. To utilize multiple nodes, you will need to have a program or code that is specifically programmed to use multiple nodes such as with MPI. Simply requesting more nodes will not make your work go faster. Your code must support this ability.

To request 2 compute nodes:

$ sbatch --nodes=2 myjobsubmissionfile

SLURM jobs will have exclusive access to compute nodes and other jobs will not use the same nodes. SLURM will allow a single job to run multiple tasks, and those tasks can be allocated resources with the --ntasks option.

To submit a job using 1 compute node with 4 tasks, each using the default 1 core:

$ sbatch --nodes=1 --ntasks=4 myjobsubmissionfile

If more convenient, you may also specify any command line options to sbatch from within your job submission file, using a special form of comment:

#!/bin/sh -l
# FILENAME:  myjobsubmissionfile

#SBATCH -A myqueuename

#SBATCH --nodes=1 
#SBATCH --time=1:30:00
#SBATCH --job-name myjobname

# Print the hostname of the compute node on which this job is running.
/bin/hostname

If an option is present in both your job submission file and on the command line, the option on the command line will take precedence.

After you submit your job with sbatch, it may wait in the queue for minutes, hours, or even weeks. How long it takes for a job to start depends on the specific queue, the resources and time requested, and the other jobs already waiting in that queue. It is impossible to say for sure when any given job will start. For best results, request no more resources than your job requires.

Once your job is submitted, you can monitor the job status, wait for the job to complete, and check the job output.

Checking Job Status

Once a job is submitted there are several commands you can use to monitor the progress of the job.

To see your jobs, use the squeue -u command and specify your username:

(Remember, in our SLURM environment a queue is referred to as an 'Account')

$ squeue -u myusername

    JOBID   ACCOUNT    NAME          USER   ST    TIME   NODES  NODELIST(REASON)
   182792   partner    job1    myusername    R   20:19       1  weber-a000
   185841   partner    job2    myusername    R   20:19       1  weber-a001
   185844   partner    job3    myusername    R   20:18       1  weber-a002
   185847   partner    job4    myusername    R   20:18       1  weber-a003

To retrieve useful information about your queued or running job, use the scontrol show job command with your job's ID number. The output should look similar to the following:

$ scontrol show job 3519

JobId=3519 JobName=t.sub
   UserId=myusername GroupId=mygroup MCS_label=N/A
   Priority=3 Nice=0 Account=(null) QOS=(null)
   JobState=PENDING Reason=BeginTime Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2019-08-29T16:56:52 EligibleTime=2019-08-29T23:30:00
   AccrueTime=Unknown
   StartTime=2019-08-29T23:30:00 EndTime=2019-09-05T23:30:00 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-08-29T16:56:52
   Partition=workq AllocNode:Sid=mack-fe00:54476
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/myusername/jobdir/myjobfile.sub
   WorkDir=/home/myusername/jobdir
   StdErr=/home/myusername/jobdir/slurm-3519.out
   StdIn=/dev/null
   StdOut=/home/myusername/jobdir/slurm-3519.out
   Power=

There are several useful bits of information in this output.

  • JobState lets you know if the job is Pending, Running, Completed, or Held.
  • RunTime and TimeLimit will show how long the job has run and its maximum time.
  • SubmitTime is when the job was submitted to the cluster.
  • The job's number of Nodes, Tasks, Cores (CPUs) and CPUs per Task are shown.
  • WorkDir is the job's working directory.
  • StdOut and StdErr are the locations of the job's stdout and stderr, respectively.
  • Reason will show why a PENDING job isn't running. In the example above, the job has been requested to start at a specific, later time.

Checking Job Output

Once a job is submitted, and has started, it will write its standard output and standard error to files that you can read.

SLURM catches output written to standard output and standard error - what would be printed to your screen if you ran your program interactively. Unless you specified otherwise, SLURM will put the output in the directory where you submitted the job, in a file named slurm- followed by the job id, with the extension out. For example, slurm-3509.out. Note that both stdout and stderr will be written into the same file, unless you specify otherwise.

If your program writes its own output files, those files will be created as defined by the program. This may be in the directory where the program was run, or may be defined in a configuration or input file. You will need to check the documentation for your program for more details.

Link to section 'Redirecting Job Output' of 'Checking Job Output' Redirecting Job Output

It is possible to redirect job output to somewhere other than the default location with the --error and --output directives:

#! /bin/sh -l
#SBATCH --output=/home/myusername/joboutput/myjob.out
#SBATCH --error=/home/myusername/joboutput/myjob.out

# This job prints "Hello World" to output and exits
echo "Hello World"
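
Slurm also supports substitution patterns in these paths; %j, for example, expands to the job ID, which keeps successive runs from overwriting each other's output:

#! /bin/sh -l
#SBATCH --output=/home/myusername/joboutput/myjob-%j.out
#SBATCH --error=/home/myusername/joboutput/myjob-%j.out

echo "Hello World"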

Holding a Job

Sometimes you may want to submit a job but not have it run just yet. For example, you may want to allow labmates to cut in front of you in the queue: hold the job until their jobs have started, and then release yours.

To place a hold on a job before it starts running, use the scontrol hold job command:

$ scontrol hold job myjobid

Once a job has started running it can not be placed on hold.

To release a hold on a job, use the scontrol release job command:

$ scontrol release job myjobid

You find the job ID using the squeue command as explained in the SLURM Job Status section.

Job Dependencies

Dependencies are an automated way of holding and releasing jobs. Jobs with a dependency are held until the condition is satisfied. Once the condition is satisfied, the job becomes eligible to run but must still queue as normal.

Job dependencies may be configured to ensure jobs start in a specified order. Jobs can be configured to run after other job state changes, such as when the job starts or the job ends.

These examples illustrate setting dependencies in several ways. Typically, dependencies are set by capturing and using the job ID from the last job submitted, as in the sketch below.
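
For example, in bash you can capture the ID of a submitted job with sbatch's --parsable option, which prints only the job ID (the script names here are illustrative):

$ first=$(sbatch --parsable first_job.sub)
$ sbatch --dependency=afterok:$first second_job.sub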

To run a job after job myjobid has started:

$ sbatch --dependency=after:myjobid myjobsubmissionfile

To run a job after job myjobid ends without error:

$ sbatch --dependency=afterok:myjobid myjobsubmissionfile

To run a job after job myjobid ends with errors:

$ sbatch --dependency=afternotok:myjobid myjobsubmissionfile

To run a job after job myjobid ends with or without errors:

$ sbatch --dependency=afterany:myjobid myjobsubmissionfile

To set more complex dependencies on multiple jobs and conditions:

$ sbatch --dependency=after:myjobid1:myjobid2:myjobid3,afterok:myjobid4 myjobsubmissionfile

Canceling a Job

To stop a job before it finishes or remove it from a queue, use the scancel command:

$ scancel myjobid

You find the job ID using the squeue command as explained in the SLURM Job Status section.

Queues

Partner Queue

Weber provides partners and their researchers shared access to the cluster through a shared partner queue. This is the default queue for submitting short to moderately long jobs. It allows jobs up to 24 hours and lets researchers run up to 4 jobs simultaneously. The expectation is that any jobs submitted to the partner queue will start within 4 hours, assuming the queue currently has enough capacity for the job.

Dedicated Queues

If a research group has purchased dedicated access to Weber there will be a queue named after the faculty or research group. These queues provide faculty and their researchers with priority access to their portion of the cluster. Jobs in these queues are typically limited to 336 hours. The expectation is that any jobs submitted to dedicated queues will start within 4 hours, assuming the queue currently has enough capacity for the job (that is, your labmates aren't using all of the cores currently).

Debug Queue

The debug queue allows you to quickly start small, short, interactive jobs in order to debug code, test programs, or test configurations. You are limited to one running job at a time in the queue, and you may use up to two compute nodes for up to 30 minutes. The expectation is that debug jobs should start within a couple of minutes, assuming the queue's dedicated nodes are not all taken by others.

List of Queues

To see a list of all queues on Weber that you may submit to, use the slist command.

This lists each queue you can submit to, the number of nodes allocated to the queue, how many are available to run jobs, and the maximum walltime you may request. Options to the command will give more detailed information. This command can be used to get a general idea of how busy an individual queue is and how long you may have to wait for your job to start.

PBS to Slurm

This is a reference for the most common commands, environment variables, and job specification options used by the two workload management systems and their equivalents.

Quick Guide

This table lists the most common commands, environment variables, and job specification options used by the two workload management systems and their equivalents (adapted from http://www.schedmd.com/slurmdocs/rosetta.html).

Common commands across workload management systems

User Commands                        PBS/Torque                   Slurm
Job submission                       qsub [script_file]           sbatch [script_file]
Interactive Job                      qsub -I                      sinteractive
Job deletion                         qdel [job_id]                scancel [job_id]
Job status (by job)                  qstat [job_id]               squeue [-j job_id]
Job status (by user)                 qstat -u [user_name]         squeue [-u user_name]
Job hold                             qhold [job_id]               scontrol hold [job_id]
Job release                          qrls [job_id]                scontrol release [job_id]
Queue info                           qstat -Q                     squeue
Queue access                         qlist                        slist
Node list                            pbsnodes -l                  sinfo -N OR scontrol show nodes
Cluster status                       qstat -a                     sinfo
GUI                                  xpbsmon                      sview

Environment                          PBS/Torque                   Slurm
Job ID                               $PBS_JOBID                   $SLURM_JOB_ID
Job Name                             $PBS_JOBNAME                 $SLURM_JOB_NAME
Job Queue/Account                    $PBS_QUEUE                   $SLURM_JOB_ACCOUNT
Submit Directory                     $PBS_O_WORKDIR               $SLURM_SUBMIT_DIR
Submit Host                          $PBS_O_HOST                  $SLURM_SUBMIT_HOST
Number of nodes                      $PBS_NUM_NODES               $SLURM_JOB_NUM_NODES
Number of Tasks                      $PBS_NP                      $SLURM_NTASKS
Number of Tasks Per Node             $PBS_NUM_PPN                 $SLURM_NTASKS_PER_NODE
Node List (Compact)                  n/a                          $SLURM_JOB_NODELIST
Node List (One Core Per Line)        LIST=$(cat $PBS_NODEFILE)    LIST=$(srun hostname)
Job Array Index                      $PBS_ARRAYID                 $SLURM_ARRAY_TASK_ID

Job Specification                    PBS/Torque                   Slurm
Script directive                     #PBS                         #SBATCH
Queue                                -q [queue]                   -A [queue]
Node Count                           -l nodes=[count]             -N [min[-max]]
CPU Count                            -l ppn=[count]               -n [count] (Note: total, not per node)
Wall Clock Limit                     -l walltime=[hh:mm:ss]       -t [min] OR -t [hh:mm:ss] OR -t [days-hh:mm:ss]
Standard Output File                 -o [file_name]               -o [file_name]
Standard Error File                  -e [file_name]               -e [file_name]
Combine stdout/err                   -j oe (both to stdout) OR    (use -o without -e)
                                     -j eo (both to stderr)
Copy Environment                     -V                           --export=[ALL | NONE | variables] (Note: default behavior is ALL)
Copy Specific Environment Variable   -v myvar=somevalue           --export=NONE,myvar=somevalue OR --export=ALL,myvar=somevalue
Event Notification                   -m abe                       --mail-type=[events]
Email Address                        -M [address]                 --mail-user=[address]
Job Name                             -N [name]                    --job-name=[name]
Job Restart                          -r [y|n]                     --requeue OR --no-requeue
Working Directory                    n/a                          --workdir=[dir_name]
Resource Sharing                     -l naccesspolicy=singlejob   --exclusive OR --shared
Memory Size                          -l mem=[MB]                  --mem=[mem][M|G|T] OR --mem-per-cpu=[mem][M|G|T]
Account to charge                    -A [account]                 -A [account]
Tasks Per Node                       -l ppn=[count]               --tasks-per-node=[count]
CPUs Per Task                        n/a                          --cpus-per-task=[count]
Job Dependency                       -W depend=[state:job_id]     --depend=[state:job_id]
Job Arrays                           -t [array_spec]              --array=[array_spec]
Generic Resources                    -l other=[resource_spec]     --gres=[resource_spec]
Licenses                             n/a                          --licenses=[license_spec]
Begin Time                           -A "y-m-d h:m:s"             --begin=y-m-d[Th:m[:s]]

See the official Slurm Documentation for further details.

Notable Differences

  • Separate commands for Batch and Interactive jobs

    Unlike PBS, in Slurm interactive jobs and batch jobs are launched with completely distinct commands.
    Use sbatch [allocation request options] script to submit a job to the batch scheduler, and sinteractive [allocation request options] to launch an interactive job. sinteractive accepts most of the same allocation request options as sbatch does.

  • No need for cd $PBS_O_WORKDIR

    In Slurm your batch job starts to run in the directory from which you submitted the script, whereas in PBS/Torque you need to explicitly move back to that directory with cd $PBS_O_WORKDIR.

  • No need to manually export environment

    The environment variables that are defined in your shell session at the time that you submit the script are exported into your batch job, whereas in PBS/Torque you need to use the -V flag to export your environment.

  • Location of output files

    The output and error files are created in their final location as soon as the job begins or an error is generated, whereas in PBS/Torque temporary files are created that are only moved to the final location at the end of the job. Therefore, in Slurm you can examine your job's output and error files during its execution.

See the official Slurm Documentation for further details.

Example Jobs

A number of example jobs are available for you to look over and adapt to your own needs. The first few are generic examples, and the latter ones go into specifics for particular software packages.

Generic SLURM Jobs

The following examples demonstrate the basics of SLURM jobs, and are designed to cover common job request scenarios. These example jobs will need to be modified to run your application or code.

Simple Job

Every SLURM job consists of a job submission file. A job submission file contains a list of commands that run your program and a set of resource (nodes, walltime, queue) requests. The resource requests can appear in the job submission file or can be specified at submit-time as shown below.

This simple example submits the job submission file hello.sub to the partner queue on Weber and requests a single node:

#!/bin/bash
# FILENAME: hello.sub

# Show this ran on a compute node by running the hostname command.
hostname

echo "Hello World"

$ sbatch -A partner --nodes=1 --time=00:01:00 hello.sub
Submitted batch job 3521

For a real job you would replace echo "Hello World" with a command, or sequence of commands, that run your program.

After your job finishes running, the ls command will show a new file in your directory, the .out file:

$ ls -l
hello.sub
slurm-3521.out

The file slurm-3521.out contains the output and errors your program would have written to the screen if you had typed its commands at a command prompt:

$ cat slurm-3521.out 

weber.rcac.purdue.edu 
Hello World

You should see the hostname of the compute node your job was executed on, followed by the "Hello World" output.

Multiple Node

In some cases, you may want to request multiple nodes. To utilize multiple nodes, you will need to have a program or code that is specifically programmed to use multiple nodes such as with MPI. Simply requesting more nodes will not make your work go faster. Your code must support this ability.

This example shows a request for multiple compute nodes. The job submission file contains a single command to show the names of the compute nodes allocated:

#!/bin/bash
# FILENAME:  myjobsubmissionfile.sub

echo $SLURM_JOB_NODELIST

$ sbatch --nodes=2 --time=00:10:00 -A partner myjobsubmissionfile.sub

Compute nodes allocated:

weber-a[014-015]

Directives

So far these examples have shown submitting jobs with the resource requests on the sbatch command line such as:

$ sbatch -A partner --nodes=1 --time=00:01:00 hello.sub

The resource requests can also be put into the job submission file itself. Documenting the resource requests in the job submission file is desirable because the job can easily be reproduced later; details left in your command history are quickly lost. Arguments are specified with the #SBATCH syntax:

#!/bin/bash

# FILENAME: hello.sub
#SBATCH -A partner

#SBATCH --nodes=1 --time=00:01:00 

# Show this ran on a compute node by running the hostname command.
hostname

echo "Hello World"

The #SBATCH directives must appear at the top of your submission file. SLURM will stop parsing directives as soon as it encounters a line that does not start with '#'. If you insert a directive in the middle of your script, it will be ignored.

This job can be then submitted with:

$ sbatch hello.sub

Specific Types of Nodes

SLURM allows running a job on specific types of compute nodes to accommodate special hardware requirements (e.g., a certain CPU or GPU type).

Cluster nodes have a set of descriptive features assigned to them, and users can specify which of these features are required by their job by using the constraint option at submission time. Only nodes having features matching the job constraints will be used to satisfy the request.

Example: a job requires a compute node in an "A" sub-cluster:

$ sbatch --nodes=1 --ntasks=20 --constraint=A myjobsubmissionfile.sub

Compute node allocated:

weber-a003

Feature constraints can be used for both batch and interactive jobs, as well as for individual job steps inside a job. Multiple constraints can be specified with a predefined syntax to achieve complex request logic (see detailed description of the '--constraint' option in man sbatch or online Slurm documentation).
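
For instance, multiple features can be combined with AND (&) and OR (|) operators; apart from "A", the feature names below are hypothetical:

  (request a node having either feature)
$ sbatch --nodes=1 --constraint="A|B" myjobsubmissionfile.sub

  (request a node having both features)
$ sbatch --nodes=1 --constraint="A&bigmem" myjobsubmissionfile.sub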

Refer to the Detailed Hardware Specification section for a list of available sub-cluster labels, their respective per-node memory sizes, and other hardware details. You can also use the sfeatures command to list the available constraint feature names for different node types.

Interactive Jobs

Interactive jobs are run on compute nodes, while giving you a shell to interact with. They give you the ability to type commands or use a graphical interface as if you were on a front-end.

To submit an interactive job, use sinteractive to run a login shell on allocated resources.

sinteractive accepts most of the same resource requests as sbatch, so to request a login shell on the partner Account while allocating 2 nodes and 20 total cores, you might do:

$ sinteractive -A partner -N2 -n20

To quit your interactive job:

exit or Ctrl-D

Serial Jobs

This shows how to submit one of the serial programs compiled in the section Compiling Serial Programs.

Create a job submission file:

#!/bin/bash
# FILENAME:  serial_hello.sub

./serial_hello

Submit the job:

$ sbatch --nodes=1 --ntasks=1 --time=00:01:00 serial_hello.sub

After the job completes, view results in the output file:

$ cat slurm-myjobid.out

Runhost:weber.rcac.purdue.edu   hello, world

If the job failed to run, then view error messages in the file slurm-myjobid.out.

MPI

An MPI job is a set of processes that take advantage of multiple compute nodes by communicating with each other. OpenMPI and Intel MPI (IMPI) are implementations of the MPI standard.

This section shows how to submit one of the MPI programs compiled in the section Compiling MPI Programs.

Use module load to set up the paths to access these libraries. Use module avail to see all MPI packages installed on Weber.

Create a job submission file:

#!/bin/bash
# FILENAME:  mpi_hello.sub
#SBATCH  --nodes=2
#SBATCH  --ntasks-per-node=20
#SBATCH  --time=00:01:00
#SBATCH  -A partner

srun -n 40 ./mpi_hello

SLURM can run an MPI program with the srun command. The number of processes is requested with the -n option. If you do not specify the -n option, it will default to the total number of processor cores you request from SLURM.

If the code is built with OpenMPI, it can be run with a simple srun -n command. If it is built with Intel IMPI, then you also need to add the --mpi=pmi2 option: srun --mpi=pmi2 -n 40 ./mpi_hello in this example.

Submit the MPI job:

$ sbatch ./mpi_hello.sub

View results in the output file:

$ cat slurm-myjobid.out
Runhost:weber-a010.rcac.purdue.edu   Rank:0 of 40 ranks   hello, world
Runhost:weber-a010.rcac.purdue.edu   Rank:1 of 40 ranks   hello, world
...
Runhost:weber-a011.rcac.purdue.edu   Rank:20 of 40 ranks   hello, world
Runhost:weber-a011.rcac.purdue.edu   Rank:21 of 40 ranks   hello, world
...

If the job failed to run, then view error messages in the output file.

If an MPI job uses a lot of memory and 20 MPI ranks per compute node use all of the memory of the compute nodes, request more compute nodes, while keeping the total number of MPI ranks unchanged.

Submit the job with double the number of compute nodes and modify the resource request to halve the number of MPI ranks per compute node.

#!/bin/bash
# FILENAME:  mpi_hello.sub

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=10
#SBATCH -t 00:01:00
#SBATCH -A partner

srun -n 40 ./mpi_hello

$ sbatch ./mpi_hello.sub

View results in the output file:

$ cat slurm-myjobid.out
Runhost:weber-a010.rcac.purdue.edu   Rank:0 of 40 ranks   hello, world
Runhost:weber-a010.rcac.purdue.edu   Rank:1 of 40 ranks   hello, world
...
Runhost:weber-a011.rcac.purdue.edu   Rank:10 of 40 ranks   hello, world
...
Runhost:weber-a012.rcac.purdue.edu   Rank:20 of 40 ranks   hello, world
...
Runhost:weber-a013.rcac.purdue.edu   Rank:30 of 40 ranks   hello, world
...

Notes

  • Use slist to determine which queues (--account or -A option) are available to you. The name of the queue which is available to everyone on Weber is "partner".
  • Invoking an MPI program on Weber with ./program is typically wrong, since this will use only one MPI process and defeat the purpose of using MPI. Unless that is what you want (rarely the case), you should use srun or mpiexec to invoke an MPI program.
  • In general, the exact order in which MPI ranks output similar write requests to an output file is random.

OpenMP

A shared-memory job is a single process that takes advantage of a multi-core processor and its shared memory to achieve parallelization.

This example shows how to submit an OpenMP program compiled in the section Compiling OpenMP Programs.

When running OpenMP programs, all threads must be on the same compute node to take advantage of shared memory. The threads cannot communicate between nodes.

To run an OpenMP program, set the environment variable OMP_NUM_THREADS to the desired number of threads:

In csh:

$ setenv OMP_NUM_THREADS 20

In bash:

$ export OMP_NUM_THREADS=20

This should almost always be equal to the number of cores on a compute node. You may want to set it to another appropriate value if you are running several processes in parallel within a single job or node.

Create a job submission file:

#!/bin/bash
# FILENAME:  omp_hello.sub
#SBATCH --nodes=1
#SBATCH --ntasks=20
#SBATCH --time=00:01:00

export OMP_NUM_THREADS=20
./omp_hello 

Submit the job:

$ sbatch omp_hello.sub 

View the results from one of the sample OpenMP programs illustrating task parallelism:

$ cat slurm-myjobid.out
SERIAL REGION:     Runhost:weber-01.rcac.purdue.edu   Thread:0 of 1 thread    hello, world
PARALLEL REGION:   Runhost:weber-01.rcac.purdue.edu   Thread:0 of 20 threads   hello, world
PARALLEL REGION:   Runhost:weber-01.rcac.purdue.edu   Thread:1 of 20 threads   hello, world
   ...

If the job failed to run, then view error messages in the file slurm-myjobid.out.

If an OpenMP program uses a lot of memory and 20 threads use all of the memory of the compute node, use fewer OpenMP threads on that compute node.
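For example, to keep the whole node (and all of its memory) while running half as many threads, you could leave the core request unchanged and lower only the thread count. A sketch based on omp_hello.sub above:

#!/bin/bash
# FILENAME:  omp_hello.sub
#SBATCH --nodes=1
#SBATCH --ntasks=20
#SBATCH --time=00:01:00

export OMP_NUM_THREADS=10
./omp_hello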

Link to section 'Collecting System Resource Utilization Data' of 'Monitoring Resources' Collecting System Resource Utilization Data

Knowing the precise resource utilization an application had during a job, such as CPU load or memory, can be incredibly useful. This is especially the case when the application isn't performing as expected.

One approach is to run a program like htop during an interactive job and keep an eye on system resources. You can also get precise time-series data from the nodes associated with your job using XDMoD online. But neither of these methods gathers telemetry in an automated fashion, nor gives you control over the resolution or format of the data.

A robust HPC workflow would, as a matter of course, collect resource utilization data so that it is available as a diagnostic tool in the event of a failure.

The monitor utility is a simple command line system resource monitoring tool for gathering such telemetry and is available as a module.

$ module load utilities monitor

Complete documentation is available online at resource-monitor.readthedocs.io. A full manual page is also available for reference, man monitor.
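For a quick interactive look (for example, during an interactive job), the same monitors can be run in the foreground and stopped with Ctrl-C. A minimal sketch, using the subcommands that appear in the job scripts below:

$ module load utilities monitor
$ monitor cpu percent --all-cores
$ monitor cpu memory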

In the context of a SLURM job you will need to put this monitoring task in the background to allow the rest of your job script to proceed. Be sure to interrupt these tasks at the end of your job.

#!/bin/bash
# FILENAME: monitored_job.sh

module load utilities monitor

# track per-core CPU load
monitor cpu percent --all-cores >cpu-percent.log &
CPU_PID=$!

# track memory usage
monitor cpu memory >cpu-memory.log &
MEM_PID=$!

# your code here

# shut down the resource monitors
kill -s INT $CPU_PID $MEM_PID

A particularly elegant solution would be to start such monitors in your prologue script and have the teardown in your epilogue script.

For large distributed jobs spread across multiple nodes, mpiexec can be used to gather telemetry from all nodes in the job. The hostname is included in each line of output so that data can be grouped as such. A concise way of constructing the needed list of hostnames in SLURM is to simply use srun hostname | sort -u.
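For example, for the two-node MPI job shown earlier, the constructed host list might look like this (hostnames are illustrative only):

$ srun hostname | sort -u
weber-a010.rcac.purdue.edu
weber-a011.rcac.purdue.edu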

#!/bin/bash
# FILENAME: monitored_job.sh

module load utilities monitor

# track all CPUs (one monitor per host)
mpiexec -machinefile <(srun hostname | sort -u) \
    monitor cpu percent --all-cores >cpu-percent.log &
CPU_PID=$!

# track memory on all hosts (one monitor per host)
mpiexec -machinefile <(srun hostname | sort -u) \
    monitor cpu memory >cpu-memory.log &
MEM_PID=$!

# your code here

# shut down the resource monitors
kill -s INT $CPU_PID $MEM_PID

To get resource data in a more readily computable format, the monitor program can be told to output in CSV format with the --csv flag.

monitor cpu memory --csv >cpu-memory.csv

For a distributed job you will need to suppress the header lines otherwise one will be created by each host.

monitor cpu memory --csv | head -1 >cpu-memory.csv
mpiexec -machinefile <(srun hostname | sort -u) \
    monitor cpu memory --csv --no-header >>cpu-memory.csv

Specific Applications

The following examples demonstrate job submission files for some common real-world applications. See the Generic SLURM Examples section for more examples on job submissions that can be adapted for use.

Gaussian

Gaussian is a computational chemistry software package which works on electronic structure. This section illustrates how to submit a small Gaussian job to a SLURM queue. This Gaussian example runs the Fletcher-Powell multivariable optimization.

Prepare a Gaussian input file with an appropriate filename, here named myjob.com. The final blank line is necessary:

#P TEST OPT=FP STO-3G OPTCYC=2

STO-3G FLETCHER-POWELL OPTIMIZATION OF WATER

0 1
O
H 1 R
H 1 R 2 A

R 0.96
A 104.

To submit this job, load Gaussian then run the provided script, named subg16. This job uses one compute node with 20 processor cores:

$ module load gaussian16

$ subg16 myjob -N 1 -n 20 

View job status:

$ squeue -u myusername

View results in the file for Gaussian output, here named myjob.log. Only the first and last few lines appear here:

 
 Entering Gaussian System, Link 0=/apps/cent7/gaussian/g16-A.03/g16-haswell/g16/g16
 Initial command:

 /apps/cent7/gaussian/g16-A.03/g16-haswell/g16/l1.exe /scratch/weber/myusername/gaussian/Gau-7781.inp -scrdir=/scratch/weber/myusername/gaussian/ 
 Entering Link 1 = /apps/cent7/gaussian/g16-A.03/g16-haswell/g16/l1.exe PID=      7782.

 Copyright (c) 1988,1990,1992,1993,1995,1998,2003,2009,2016,
            Gaussian, Inc.  All Rights Reserved.

.
.
.

 Job cpu time:       0 days  0 hours  3 minutes 28.2 seconds.
 Elapsed time:       0 days  0 hours  0 minutes 12.9 seconds.
 File lengths (MBytes):  RWF=     17 Int=      0 D2E=      0 Chk=      2 Scr=      2
 Normal termination of Gaussian 16 at Tue May  1 17:12:00 2018.
real 13.85
user 202.05
sys 6.12
Machine:
weber-a012
weber-a012
weber-a012
weber-a012
weber-a012
weber-a012
weber-a012
weber-a012

Link to section 'Examples of Gaussian SLURM Job Submissions' of 'Gaussian' Examples of Gaussian SLURM Job Submissions

Submit job using 20 processor cores on a single node:


$ subg16 myjob  -N 1 -n 20 -t 200:00:00 -A myqueuename 

Submit job using 20 processor cores on each of 2 nodes:


$ subg16 myjob -N 2 -n 20 -t 200:00:00 -A myqueuename 


Machine Learning

We support several Machine Learning (ML) applications on ITaP community clusters. The collection of these packages is referred to as ML-Toolkit throughout this documentation. Currently, the following 9 applications are included in ML-Toolkit.

caffe           cntk            gym
keras           opencv          pytorch
tensorflow      tflearn         theano

Note that managing the Python dependencies of ML applications is non-trivial; therefore, we recommend that you read the documentation carefully before embarking on a journey to build intelligent machines.

ML-Toolkit

ITaP maintains a set of popular machine learning (ML) applications on Weber. These are Anaconda/Python-based distributions of the respective applications. Currently, applications are supported for two major Python versions (2.7 and 3.6). Detailed instructions for searching and using the installed ML applications are presented below.

Important: You must load one of the learning modules described below before loading the ML applications.

Link to section 'Instructions for using ML packages' of 'ML-Toolkit' Instructions for using ML packages

Link to section 'Prerequisite' of 'ML-Toolkit' Prerequisite

Make sure your Python environment is clean. Python is very sensitive to packages installed in your local pip folder or in your Conda environments. It is always safer to start with a clean environment. The steps below archive all your existing Python packages to backup directories, reducing the chance of conflicts.

$ mv ~/.conda ~/.conda.bak
$ mv ~/.local ~/.local.bak
$ mv ~/.cache ~/.cache.bak
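If you later need your archived packages back, simply reverse the moves (a sketch):

$ mv ~/.conda.bak ~/.conda
$ mv ~/.local.bak ~/.local
$ mv ~/.cache.bak ~/.cache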

Link to section 'Find installed ML applications' of 'ML-Toolkit' Find installed ML applications

To search for or load a machine learning application, you must first load one of the learning modules. The learning module loads the prerequisites (such as anaconda) and makes the ML applications visible to the user.

Step 1. Find and load a preferred learning module.

There are two learning modules available on Weber, each corresponding to a specific Python version. In the example below, we want to use the learning module for Python 3.6.

$ module spider learning

----------------------------------------------------------------------------
  learning:
----------------------------------------------------------------------------
     Versions:
        learning/conda-5.1.0-py27-cpu
        learning/conda-5.1.0-py36-cpu

.........
$ module load learning/conda-5.1.0-py36-cpu

Step 2. Find a machine learning application.

You can now use the module spider command to find installed applications. The following example searches for available PyTorch installations.

$ module spider pytorch

---------------------------------------------------------------------------------
  ml-toolkit-cpu/pytorch: ml-toolkit-cpu/pytorch/0.4.0
---------------------------------------------------------------------------------

    This module can be loaded directly: module load ml-toolkit-cpu/pytorch/0.4.0 

Step 3. List all machine learning applications.

Note that the ML packages are installed under the common application name ml-toolkit-cpu. To list all machine learning packages installed on Weber, run the command:

$ module spider ml-toolkit-cpu

Currently, ml-toolkit-cpu includes 9 popular ML packages listed below.

ml-toolkit-cpu/caffe/1.0.0
ml-toolkit-cpu/cntk/2.3
ml-toolkit-cpu/gym/0.10.5
ml-toolkit-cpu/keras/2.1.5
ml-toolkit-cpu/opencv/3.4.1
ml-toolkit-cpu/pytorch/0.4.0
ml-toolkit-cpu/tensorflow/1.4.0
ml-toolkit-cpu/tflearn/0.3.2
ml-toolkit-cpu/theano/1.0.2

Link to section 'Load and use the ML applications' of 'ML-Toolkit' Load and use the ML applications

Step 4. After loading a preferred learning module in Step 1, you can now load the desired ML applications in your environment. In the following example, we load the OpenCV and PyTorch modules.

$ module load ml-toolkit-cpu/opencv/3.4.1
$ module load ml-toolkit-cpu/pytorch/0.4.0

Step 5. You can list which ML applications are loaded in your environment using the command

$ module list

Link to section 'Verify application import' of 'ML-Toolkit' Verify application import

Step 6. The next step is to check that you can actually use the desired ML application. You can do this by running the import command in Python.

$ python -c "import torch; print(torch.__version__)"

If the import operation succeeded, then you can run your own ML codes. A few ML applications (such as tensorflow) print diagnostic warnings while loading; this is expected behavior.

If the import failed with an error, please see the troubleshooting information below.

Step 7. To load a different set of applications, unload the previously loaded applications and load the new applications. The example below loads Tensorflow and Keras instead of PyTorch and OpenCV.

$ module unload ml-toolkit-cpu/opencv/3.4.1
$ module unload ml-toolkit-cpu/pytorch/0.4.0
$ module load ml-toolkit-cpu/tensorflow/1.4.0
$ module load ml-toolkit-cpu/keras/2.1.5

Link to section 'Troubleshooting' of 'ML-Toolkit' Troubleshooting

ML applications depend on a wide range of Python packages, and mixing multiple versions of these packages can lead to errors. The following guidelines will assist you in identifying the cause of the problem.

  • Check that you are using the correct version of Python with the command python --version. This should match the Python version in the loaded anaconda module.
  • Make sure that your Python environment is clean. Follow the instructions in "Prerequisites" section above.
  • Start from a clean environment. Either start a new terminal session or unload all the modules: module purge. Then load the desired modules following Steps 1-4.
  • Verify that PYTHONPATH does not point to undesired packages. Run the following command to print PYTHONPATH: echo $PYTHONPATH
  • Note that Caffe has a conflicting version of PyQt5. So, if you want to use Spyder (or any GUI application that uses PyQt), then you should unload the caffe module.
  • Use Google search to your advantage. Paste the error message into Google and check probable causes.

More examples showing how to use ml-toolkit modules in a batch job are presented in this guide.

Link to section 'Installing ML applications' of 'ML-Toolkit' Installing ML applications

If the ML application you are trying to use is not in the list of supported applications, or if you need a newer version of an installed application, you can install it in your home directory. We recommend using anaconda environments to install and manage Python packages. Please follow the steps carefully, otherwise you may end up with a faulty installation. The example below shows how to install PyTorch 0.4.1 (a newer version) in your home directory.

Step 1: Unload all modules and start with a clean environment.

$ module purge

Step 2: Load the anaconda module with desired Python version.

$ module load anaconda/5.1.0-py36

Step 3: Create a custom anaconda environment. Make sure the python version matches the Python version in the anaconda module.

$ conda-env-mod create -n env_name_here

Step 4: Activate the anaconda environment by loading the modules displayed at the end of step 3.

$ module load use.own
$ module load conda-env/env_name_here-py3.6.4

Step 5: Now install the desired ML application. You can install multiple Python packages at this step using either conda or pip.

$ conda install -c pytorch pytorch-cpu=0.4.1

If the installation succeeded, you can now use the installed application.
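For example, the new installation can be verified with the same import test used for the ML-Toolkit modules; for this install it should report version 0.4.1 (a sketch):

$ python -c "import torch; print(torch.__version__)"
0.4.1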

Note that loading the modules generated by conda-env-mod has different behavior than conda create -n env_name_here followed by source activate env_name_here. After running source activate, you may not be able to access any Python packages in the anaconda or ml-toolkit modules. Therefore, using conda-env-mod is the preferred way of using your custom installations.

Link to section 'Troubleshooting' of 'ML-Toolkit' Troubleshooting

In most situations, dependencies among Python modules lead to error. If you cannot use a Python package after installing it, please follow the steps below to find a workaround.

  • Unload all the modules.
    $ module purge
    
  • Clean up PYTHONPATH.
    $ unset PYTHONPATH
  • Next load the modules, e.g., anaconda and your custom environment.
    $ module load anaconda/5.1.0-py36
    $ module load use.own
    $ module load conda-env/env_name_here-py3.6.4
    
  • Now try running your code again.
  • A few applications only run on specific versions of Python (e.g. Python 3.6). Please check the documentation of your application if that is the case.
  • If you have installed a newer version of an ml-toolkit package (e.g., a newer version of PyTorch or Tensorflow), make sure that the ml-toolkit modules are NOT loaded. In general, we recommend that you don't mix ml-toolkit modules with your custom installations.

Link to section 'Running Tensorflow code in a batch job' of 'Tensorflow Batch Job' Running Tensorflow code in a batch job

Batch jobs allow us to automate model training without human intervention. They are also useful when you need to run a large number of simulations on the clusters. In the example below, we shall run the tensor_hello.py script in a batch job (refer to Tensorflow guide to see the code). We consider two situations: in the first example, we use the ML-Toolkit modules to run tensorflow, while in the second example, we use our custom installation of tensorflow.

Link to section 'Using Ml-Toolkit modules' of 'Tensorflow Batch Job' Using Ml-Toolkit modules

Save the following code as tensor_hello.sub in the same directory where tensor_hello.py is located.

#!/bin/bash
# filename: tensor_hello.sub

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20 
#SBATCH --time=00:05:00
#SBATCH -A partner
#SBATCH -J hello_tensor

module purge

module load learning/conda-5.1.0-py36-cpu
module load ml-toolkit-cpu/tensorflow 
module list

python tensor_hello.py

Link to section 'Using custom tensorflow installation' of 'Tensorflow Batch Job' Using custom tensorflow installation

Save the following code as tensor_hello.sub in the same directory where tensor_hello.py is located.

#!/bin/bash
# filename: tensor_hello.sub

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20 
#SBATCH --time=00:05:00
#SBATCH -A partner
#SBATCH -J hello_tensor

module purge

module load anaconda/5.1.0-py36
module load use.own
module load conda-env/my_tf_env-py3.6.4 
module list

echo $PYTHONPATH

python tensor_hello.py

Now you can submit the batch job using the sbatch command.

$ sbatch tensor_hello.sub

Once the job finishes, you will find an output file (slurm-xxxxx.out). If tensorflow ran successfully, then the output file will contain the message shown below.

Hello, TensorFlow!

Link to section 'Tensorflow on Weber' of 'Tensorflow' Tensorflow on Weber

Link to section 'Tensorflow Modules' of 'Tensorflow' Tensorflow Modules

ITaP provides a set of stable tensorflow builds on Weber. At present, tensorflow is part of the ML-Toolkit packages. You must load one of the learning modules before you can load the tensorflow module. We recommend getting an interactive job for running Tensorflow.

  • First, load a desired learning module:
    $ module load learning/conda-5.1.0-py36-cpu
    
  • To list available tensorflow modules:
    $ module spider ml-toolkit-cpu/tensorflow
    
  • To show default tensorflow module:
    $ module show ml-toolkit-cpu/tensorflow
    
  • To load default tensorflow module:
    $ module load ml-toolkit-cpu/tensorflow
    
  • To test that tensorflow is available:
    $ python -c "import tensorflow as tf"
    
  • Important: The tensorflow modules previously available on Research Computing systems, such as tensorflow/1.2.0_py27-cpu and tensorflow/1.2.0_py35-cpu, are deprecated and will not work with the ml-toolkit-cpu modules. Please update your job scripts to use the ml-toolkit-cpu/tensorflow module.
Link to section 'Install' of 'Tensorflow' Install

    ITaP recommends downloading and installing Tensorflow in the user's home directory using anaconda environments. Installing Tensorflow in your home directory has the advantage that it can be upgraded to newer versions easily, so researchers will have access to the latest libraries when needed.

    • We recommend getting an interactive job for installing and running Tensorflow.
    • First load the necessary modules and define which tensorflow version to install:
      $ module purge
      $ module load anaconda/5.1.0-py36
      
      $ module list
      $ export TF_BINARY_URL="https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.12.0-cp36-cp36m-linux_x86_64.whl"
      
    • Create an anaconda environment using conda-env-mod. The script also prints a list of modules that should be loaded to use the custom environment, please note down these module names.
      $ conda-env-mod create -n my_tf_env
      
    • Activate the anaconda environment.
      $ module load use.own
      $ module load conda-env/my_tf_env-py3.6.4
      
    • Now install Tensorflow binaries in your home directory:
      $ pip install --ignore-installed --upgrade $TF_BINARY_URL
      
    • Wait for installation to finish.
    • If the installation finished successfully, you can now proceed with the examples below. If not, please look at common installation problems and how to resolve them.

    Link to section 'Testing the installation' of 'Tensorflow' Testing the installation

    • Check that you have the anaconda module and your custom environment loaded using the command module list. Otherwise, load the necessary modules:
      $ module load anaconda/5.1.0-py36
      
      $ module load use.own
      $ module load conda-env/my_tf_env-py3.6.4 
      
    • Save the following code as tensor_hello.py
      # filename: tensor_hello.py
      import tensorflow as tf
      hello = tf.constant('Hello, TensorFlow!')
      sess = tf.Session()
      print(sess.run(hello))
      
    • Run the example
      $ python tensor_hello.py
    • This will produce an output like the following:
      < ... tensorflow build related information ... >
      < ... hardware information ... >
      Hello, TensorFlow!
      

    Link to section 'Tensorboard' of 'Tensorflow' Tensorboard

    • You can visualize data from a Tensorflow session using Tensorboard. For this, you need to save your session summary as described in the Tensorboard User Guide.
    • Launch Tensorboard:
      $ python -m tensorboard.main --logdir=/path/to/session/logs
    • When Tensorboard is launched successfully, it will give you the URL for accessing Tensorboard.
      <... build related warnings ...> 
      TensorBoard 0.4.0 at http://weber-a000.rcac.purdue.edu:6006
      
    • Follow the printed URL to visualize your model.
    • Please note that due to firewall rules, the Tensorboard URL may only be accessible from Weber nodes. If you cannot access the URL directly, you can use the Firefox browser in ThinLinc.
    • For more details, please refer to the Tensorboard User Guide.

    Matlab

    MATLAB® (MATrix LABoratory) is a high-level language and interactive environment for numerical computation, visualization, and programming. MATLAB is a product of MathWorks.

    MATLAB, Simulink, Compiler, and several of the optional toolboxes are available to faculty, staff, and students. To see the kind and quantity of all MATLAB licenses, plus the number that you are currently using, use the matlab_licenses command:

    $ module load matlab
    $ matlab_licenses
    

    The MATLAB client can be run on the front-end for application development; however, computationally intensive jobs must be run on compute nodes.

    The following sections provide several examples illustrating how to submit MATLAB jobs to a Linux compute cluster.

    Matlab Script (.m File)

    This section illustrates how to submit a small, serial, MATLAB program as a job to a batch queue. This MATLAB program prints the name of the run host and gets three random numbers.

    Prepare a MATLAB script myscript.m, and a MATLAB function file myfunction.m:

    % FILENAME:  myscript.m
    
    % Display name of compute node which ran this job.
    [c name] = system('hostname');
    fprintf('\n\nhostname:%s\n', name);
    
    % Display three random numbers.
    A = rand(1,3);
    fprintf('%f %f %f\n', A);
    
    quit;
    
    % FILENAME:  myfunction.m
    
    function result = myfunction ()
    
        % Return name of compute node which ran this job.
        [c name] = system('hostname');
        result = sprintf('hostname:%s', name);
    
        % Return three random numbers.
        A = rand(1,3);
        r = sprintf('%f %f %f', A);
        result=strvcat(result,r);
    
    end
    

    Also, prepare a job submission file, here named myjob.sub. Run with the name of the script:

    #!/bin/bash
    # FILENAME:  myjob.sub
    
    echo "myjob.sub"
    
    # Load module, and set up environment for Matlab to run
    module load matlab
    
    unset DISPLAY
    
    # -nodisplay:        run MATLAB in text mode; X11 server not needed
    # -singleCompThread: turn off implicit parallelism
    # -r:                read MATLAB program; use MATLAB JIT Accelerator
    # Run Matlab, with the above options and specifying our .m file
    matlab -nodisplay -singleCompThread -r myscript
    

    Submit the job:

    $ sbatch myjob.sub

    View job status:

    $ squeue -u myusername

    View results of the job:

    $ cat slurm-myjobid.out

    myjob.sub
    
                                < M A T L A B (R) >
                      Copyright 1984-2011 The MathWorks, Inc.
                        R2011b (7.13.0.564) 64-bit (glnxa64)
                                  August 13, 2011
    
    To get started, type one of these: helpwin, helpdesk, or demo.
    For product information, visit www.mathworks.com.
    
    hostname:weber-a001.rcac.purdue.edu
    0.814724 0.905792 0.126987
    

    Output shows that a processor core on one compute node (weber-a001) processed the job. Output also displays the three random numbers.


    Implicit Parallelism

    MATLAB implements implicit parallelism which is automatic multithreading of many computations, such as matrix multiplication, linear algebra, and performing the same operation on a set of numbers. This is different from the explicit parallelism of the Parallel Computing Toolbox.

    MATLAB offers implicit parallelism in the form of thread-parallel enabled functions. Since these processor cores, or threads, share a common memory, many MATLAB functions contain multithreading potential. Vector operations, the particular application or algorithm, and the amount of computation (array size) contribute to the determination of whether a function runs serially or with multithreading.

    When your job triggers implicit parallelism, it attempts to allocate its threads on all processor cores of the compute node on which the MATLAB client is running, including processor cores running other jobs. This competition can degrade the performance of all jobs running on the node.

    When you know that you are coding a serial job but are unsure whether you are using thread-parallel enabled operations, run MATLAB with implicit parallelism turned off. Beginning with R2009b, you can turn multithreading off by starting MATLAB with -singleCompThread:

    $ matlab -nodisplay -singleCompThread -r mymatlabprogram
    

    When you are using implicit parallelism, make sure you request exclusive access to a compute node, as MATLAB has no facility for sharing nodes.
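    With SLURM, exclusive access can be requested in your job submission file (a minimal sketch; --exclusive is a standard SLURM option):

    #SBATCH --nodes=1
    #SBATCH --exclusive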


    Profile Manager

    MATLAB offers two kinds of profiles for parallel execution: the 'local' profile and user-defined cluster profiles. The 'local' profile runs a MATLAB job on the processor core(s) of the same compute node, or front-end, that is running the client. To run a MATLAB job on compute node(s) different from the node running the client, you must define a Cluster Profile using the Cluster Profile Manager.

    To prepare a user-defined cluster profile, use the Cluster Profile Manager in the Parallel menu. This profile contains the scheduler details (queue, nodes, processors, walltime, etc.) of your job submission. Ultimately, your cluster profile will be an argument to MATLAB functions like batch().

    For your convenience, ITaP provides a generic cluster profile that can be downloaded: myslurmprofile.settings

    Please note that modifications are very likely to be required to make myslurmprofile.settings work. You may need to change values for number of nodes, number of workers, walltime, and submission queue specified in the file. As well, the generic profile itself depends on the particular job scheduler on the cluster, so you may need to download or create two or more generic profiles under different names. Each time you run a job using a Cluster Profile, make sure the specific profile you are using is appropriate for the job and the cluster.

    To import the profile, start a MATLAB session and select Manage Cluster Profiles... from the Parallel menu. In the Cluster Profile Manager, select Import, navigate to the folder containing the profile, select myslurmprofile.settings and click OK. Remember that the profile will need to be customized for your specific needs. If you have any questions, please contact us.


    Parallel Toolbox (spmd)

    The MATLAB Parallel Computing Toolbox (PCT) extends the MATLAB language with high-level, parallel-processing features such as parallel for loops, parallel regions, message passing, distributed arrays, and parallel numerical methods. It offers a shared-memory computing environment with a maximum of eight MATLAB workers (labs, threads; version R2009a) and 12 workers (labs, threads; version R2011a) running on the local configuration in addition to your MATLAB client. Moreover, the MATLAB Distributed Computing Server (DCS) scales PCT applications up to the limit of your DCS licenses.

    This section illustrates how to submit a small, parallel, MATLAB program with a parallel region (spmd statement) as a MATLAB pool job to a batch queue.

    This example uses the submission command to submit to compute nodes a MATLAB client which interprets a MATLAB .m file with a user-defined cluster profile which scatters the MATLAB workers onto different compute nodes. This method uses the MATLAB interpreter, the Parallel Computing Toolbox, and the Distributed Computing Server; so, it requires and checks out six licenses: one MATLAB license for the client running on the compute node, one PCT license, and four DCS licenses. The four DCS licenses run the four copies of the spmd statement. This job is completely off the front end.

    Prepare a MATLAB script called myscript.m:

    % FILENAME:  myscript.m
    
    % SERIAL REGION
    [c name] = system('hostname');
    fprintf('SERIAL REGION:  hostname:%s\n', name)
    p = parpool(4);
    fprintf('                    hostname                         numlabs  labindex\n')
    fprintf('                    -------------------------------  -------  --------\n')
    tic;
    
    % PARALLEL REGION
    spmd
        [c name] = system('hostname');
        name = name(1:length(name)-1);
        fprintf('PARALLEL REGION:  %-31s  %7d  %8d\n', name,numlabs,labindex)
        pause(2);
    end
    
    % SERIAL REGION
    elapsed_time = toc;          % get elapsed time in parallel region
    delete(p);
    fprintf('\n')
    [c name] = system('hostname');
    name = name(1:length(name)-1);
    fprintf('SERIAL REGION:  hostname:%s\n', name)
    fprintf('Elapsed time in parallel region:   %f\n', elapsed_time)
    quit;
    

    Prepare a job submission file with an appropriate filename, here named myjob.sub. Run with the name of the script:

    #!/bin/bash 
    # FILENAME:  myjob.sub
    
    echo "myjob.sub"
    
    module load matlab
    
    unset DISPLAY
    
    matlab -nodisplay -r myscript
    

    Run MATLAB to set the default parallel configuration to your job configuration:

    $ matlab -nodisplay
    >> parallel.defaultClusterProfile('myslurmprofile');
    >> quit;
    $
    

    Submit the job:

    $ sbatch myjob.sub

    Once this job starts, a second job submission is made.

    View job status:

    $ squeue -u myusername

    View results for the job:

    $ cat slurm-myjobid.out

    myjob.sub
    
                                < M A T L A B (R) >
                      Copyright 1984-2011 The MathWorks, Inc.
                        R2011b (7.13.0.564) 64-bit (glnxa64)
                                  August 13, 2011
    
    To get started, type one of these: helpwin, helpdesk, or demo.
    For product information, visit www.mathworks.com.
    
    SERIAL REGION:  hostname:weber-a001.rcac.purdue.edu
    
    Starting matlabpool using the 'myslurmprofile' profile ... connected to 4 labs.
                        hostname                         numlabs  labindex
                        -------------------------------  -------  --------
    Lab 2:
      PARALLEL REGION:  weber-a002.rcac.purdue.edu            4         2
    Lab 1:
      PARALLEL REGION:  weber-a001.rcac.purdue.edu            4         1
    Lab 3:
      PARALLEL REGION:  weber-a003.rcac.purdue.edu            4         3
    Lab 4:
      PARALLEL REGION:  weber-a004.rcac.purdue.edu            4         4
    
    Sending a stop signal to all the labs ... stopped.
    
    SERIAL REGION:  hostname:weber-a001.rcac.purdue.edu
    Elapsed time in parallel region:   3.382151
    

    Output shows the name of one compute node (a001) that processed the job submission file myjob.sub and the two serial regions. The job submission scattered four processor cores (four MATLAB labs) among four different compute nodes (a001,a002,a003,a004) that processed the four parallel regions. The total elapsed time demonstrates that the jobs ran in parallel.


    Parallel Computing Toolbox (parfor)

    The MATLAB Parallel Computing Toolbox (PCT) extends the MATLAB language with high-level, parallel-processing features such as parallel for loops, parallel regions, message passing, distributed arrays, and parallel numerical methods. It offers a shared-memory computing environment running on the local cluster profile in addition to your MATLAB client. Moreover, the MATLAB Distributed Computing Server (DCS) scales PCT applications up to the limit of your DCS licenses.

    This section illustrates the fine-grained parallelism of a parallel for loop (parfor) in a pool job.

    The following examples illustrate a method for submitting a small, parallel, MATLAB program with a parallel loop (parfor statement) as a job to a queue. This MATLAB program prints the name of the run host and shows the values of variables numlabs and labindex for each iteration of the parfor loop.

    This method uses the job submission command to submit a MATLAB client which calls the MATLAB batch() function with a user-defined cluster profile.

    Prepare a MATLAB pool program in a MATLAB script with an appropriate filename, here named myscript.m:

    % FILENAME:  myscript.m
    
    % SERIAL REGION
    [c name] = system('hostname');
    fprintf('SERIAL REGION:  hostname:%s\n', name)
    numlabs = 4;     % matches the pool size requested via batch(...,'Pool',4) below
    fprintf('        hostname                         numlabs  labindex  iteration\n')
    fprintf('        -------------------------------  -------  --------  ---------\n')
    tic;
    
    % PARALLEL LOOP
    parfor i = 1:8
        [c name] = system('hostname');
        name = name(1:length(name)-1);
        fprintf('PARALLEL LOOP:  %-31s  %7d  %8d  %9d\n', name,numlabs,labindex,i)
        pause(2);
    end
    
    % SERIAL REGION
    elapsed_time = toc;        % get elapsed time in parallel loop
    fprintf('\n')
    [c name] = system('hostname');
    name = name(1:length(name)-1);
    fprintf('SERIAL REGION:  hostname:%s\n', name)
    fprintf('Elapsed time in parallel loop:   %f\n', elapsed_time)
    

    The execution of a pool job starts with a worker executing the statements of the first serial region up to the parfor block, when it pauses. A set of workers (the pool) executes the parfor block. When they finish, the first worker resumes by executing the second serial region. The code displays the names of the compute nodes running the batch session and the worker pool.

    Prepare a MATLAB script that calls MATLAB function batch() which makes a four-lab pool on which to run the MATLAB code in the file myscript.m. Use an appropriate filename, here named mylclbatch.m:

    % FILENAME:  mylclbatch.m
    
    !echo "mylclbatch.m"
    !hostname
    
    pjob=batch('myscript','Profile','myslurmprofile','Pool',4,'CaptureDiary',true);
    wait(pjob);
    diary(pjob);
    quit;
    

    Prepare a job submission file with an appropriate filename, here named myjob.sub:

    #!/bin/bash
    # FILENAME:  myjob.sub
    
    echo "myjob.sub"
    hostname
    
    module load matlab
    
    unset DISPLAY
    
    matlab -nodisplay -r mylclbatch
    

    Submit the job as a single compute node with one processor core:

    $ sbatch --nodes=1 --ntasks=1 myjob.sub

    One processor core runs myjob.sub and mylclbatch.m.

    Once this job starts, a second job submission is made.

    View job status:

    $ squeue -u myusername

    View results of the job:

    $ cat slurm-myjobid.out

    myjob.sub
    
                                < M A T L A B (R) >
                      Copyright 1984-2013 The MathWorks, Inc.
                        R2013a (8.1.0.604) 64-bit (glnxa64)
                                 February 15, 2013
    
    To get started, type one of these: helpwin, helpdesk, or demo.
    For product information, visit www.mathworks.com.
    
    mylclbatch.m
    weber-a000.rcac.purdue.edu
    SERIAL REGION:  hostname:weber-a000.rcac.purdue.edu
    
                    hostname                         numlabs  labindex  iteration
                    -------------------------------  -------  --------  ---------
    PARALLEL LOOP:  weber-a001.rcac.purdue.edu            4         1          2
    PARALLEL LOOP:  weber-a002.rcac.purdue.edu            4         1          4
    PARALLEL LOOP:  weber-a001.rcac.purdue.edu            4         1          5
    PARALLEL LOOP:  weber-a002.rcac.purdue.edu            4         1          6
    PARALLEL LOOP:  weber-a003.rcac.purdue.edu            4         1          1
    PARALLEL LOOP:  weber-a003.rcac.purdue.edu            4         1          3
    PARALLEL LOOP:  weber-a004.rcac.purdue.edu            4         1          7
    PARALLEL LOOP:  weber-a004.rcac.purdue.edu            4         1          8
    
    SERIAL REGION:  hostname:weber-a000.rcac.purdue.edu
    
    Elapsed time in parallel loop:   5.411486
    

    To scale up this method to handle a real application, first increase the wall time in the submission command to accommodate a longer-running job. Second, increase the wall time of myslurmprofile by using the Cluster Profile Manager in the Parallel menu to enter a new wall time in the property SubmitArguments.


    Distributed Computing Server (parallel job)

    The MATLAB Parallel Computing Toolbox (PCT) enables a parallel job via the MATLAB Distributed Computing Server (DCS). The tasks of a parallel job are identical, run simultaneously on several MATLAB workers (labs), and communicate with each other. This section illustrates an MPI-like program.

    This section illustrates how to submit a small MATLAB parallel job with four workers running one MPI-like task to a batch queue. The MATLAB program broadcasts an integer to four workers and gathers the names of the compute nodes running the workers and the lab IDs of the workers.

    This example uses the job submission command to submit a MATLAB script with a user-defined cluster profile which scatters the MATLAB workers onto different compute nodes. This method uses the MATLAB interpreter, the Parallel Computing Toolbox, and the Distributed Computing Server; so, it requires and checks out six licenses: one MATLAB license for the client running on the compute node, one PCT license, and four DCS licenses. The four DCS licenses run the four copies of the parallel job. This job is completely off the front end.

    Prepare a MATLAB script named myscript.m :

    
    % FILENAME:  myscript.m
    
    % Specify pool size.
    % Convert the parallel job to a pool job.
    p = parpool(4);
    spmd
    
    if labindex == 1
        % Lab (rank) #1 broadcasts an integer value to other labs (ranks).
        N = labBroadcast(1,int64(1000));
    else
        % Each lab (rank) receives the broadcast value from lab (rank) #1.
        N = labBroadcast(1);
    end
    
    % Form a string with host name, total number of labs, lab ID, and broadcast value.
    [c name] =system('hostname');
    name = name(1:length(name)-1);
    fmt = num2str(floor(log10(numlabs))+1);
    str = sprintf(['%s:%d:%' fmt 'd:%d   '], name,numlabs,labindex,N);
    
    % Apply global concatenate to all str's.
    % Store the concatenation of str's in the first dimension (row) and on lab #1.
    result = gcat(str,1,1);
    if labindex == 1
        disp(result)
    end
    
    end   % spmd
    delete(p);
    quit;
    

    Also, prepare a job submission, here named myjob.sub. Run with the name of the script:

    #!/bin/bash
    # FILENAME:  myjob.sub
    
    echo "myjob.sub"
    
    module load matlab
    
    unset DISPLAY
    
    # -nodisplay: run MATLAB in text mode; X11 server not needed
    # -r:         read MATLAB program; use MATLAB JIT Accelerator
    matlab -nodisplay -r myscript
    

    Run MATLAB to set the default parallel configuration to your appropriate Profile:

    $ matlab -nodisplay
    >> parallel.defaultClusterProfile('myslurmprofile');
    >> quit;
    $
    

    Submit the job as a single compute node with one processor core:

    $ sbatch --nodes=1 --ntasks=1 myjob.sub

    Once this job starts, a second job submission is made.

    View job status:

    $ squeue -u myusername

    View results of the job:

    $ cat slurm-myjobid.out

    myjob.sub
    
                                < M A T L A B (R) >
                      Copyright 1984-2011 The MathWorks, Inc.
                        R2011b (7.13.0.564) 64-bit (glnxa64)
                                  August 13, 2011
    
    To get started, type one of these: helpwin, helpdesk, or demo.
    For product information, visit www.mathworks.com.
    
    >Starting matlabpool using the 'myslurmprofile' configuration ... connected to 4 labs.
    Lab 1:
      weber-a006.rcac.purdue.edu:4:1:1000
      weber-a007.rcac.purdue.edu:4:2:1000
      weber-a008.rcac.purdue.edu:4:3:1000
      weber-a009.rcac.purdue.edu:4:4:1000
    Sending a stop signal to all the labs ... stopped.
    Did not find any pre-existing parallel jobs created by matlabpool.
    

    Output shows the name of one compute node (a006) that processed the job submission file myjob.sub. The job submission scattered four processor cores (four MATLAB labs) among four different compute nodes (a006,a007,a008,a009) that processed the four parallel regions.

    To scale up this method to handle a real application, first increase the wall time in the submission command to accommodate a longer-running job. Second, increase the wall time of myslurmprofile by using the Cluster Profile Manager in the Parallel menu to enter a new wall time in the property SubmitArguments.


    Python

    Notice: Python 2.7 has reached end-of-life on Jan 1, 2020 (announcement). Please update your codes and your job scripts to use Python 3.

    Python is a high-level, general-purpose, interpreted, dynamic programming language. We suggest using Anaconda which is a Python distribution made for large-scale data processing, predictive analytics, and scientific computing. For example, to use the default Anaconda distribution:

    $ module load anaconda
    

    For a full list of available Anaconda and Python modules enter:

    $ module spider anaconda
    

    Example Python Jobs

    This section illustrates how to submit a small Python job to a SLURM queue.

    Link to section 'Example 1: Hello world' of 'Example Python Jobs' Example 1: Hello world

    Prepare a Python input file with an appropriate filename, here named hello.py:

    # FILENAME:  hello.py
    
    print("Hello, world!")
    

    Prepare a job submission file with an appropriate filename, here named myjob.sub:

    #!/bin/bash
    # FILENAME:  myjob.sub
    
    module load anaconda
    
    python hello.py
    

    Submit the job:

    $ sbatch myjob.sub

    View job status:

    $ squeue -u myusername

    View results of the job:

    $ cat slurm-myjobid.out

    Hello, world!
    

    Link to section 'Example 2: Matrix multiply' of 'Example Python Jobs' Example 2: Matrix multiply

    Save the following script as matrix.py:

    # Matrix multiplication program
    
    x = [[3,1,4],[1,5,9],[2,6,5]]
    y = [[3,5,8,9],[7,9,3,2],[3,8,4,6]]
    
    result = [[sum(a*b for a,b in zip(x_row,y_col)) for y_col in zip(*y)] for x_row in x]
    
    for r in result:
            print(r)
    

    Change the last line in the job submission file above to read:

    python matrix.py
    

    The standard output file from this job will result in the following matrix:

    [28, 56, 43, 53]
    [65, 122, 59, 73]
    [63, 104, 54, 60]
    

    Link to section 'Example 3: Sine wave plot using numpy and matplotlib packages' of 'Example Python Jobs' Example 3: Sine wave plot using numpy and matplotlib packages

    Save the following script as sine.py:

    import numpy as np
    import matplotlib
    matplotlib.use('Agg')
    import matplotlib.pylab as plt
    
    x = np.linspace(-np.pi, np.pi, 201)
    plt.plot(x, np.sin(x))
    plt.xlabel('Angle [rad]')
    plt.ylabel('sin(x)')
    plt.axis('tight')
    plt.savefig('sine.png')
    

    Change your job submission file to submit this script and the job will output a png file and blank standard output and error files.


    Managing Environments with Conda

    Conda is a package manager in Anaconda that allows you to create and manage multiple environments where you can pick and choose which packages you want to use. To use Conda you must load an Anaconda module:

    $ module load anaconda
    

    Many packages are pre-installed in the global environment. To see these packages:

    $ conda list
    

    To create your own custom environment:

    $ conda create --name MyEnvName python=2.7 FirstPackageName SecondPackageName -y
    

    The --name option specifies that the environment created will be named MyEnvName. You can include as many packages as you require separated by a space. Including the -y option lets you skip the prompt to install the package. By default environments are created and stored in the $HOME/.conda directory.

    To create an environment at a custom location:

    $ conda create --prefix=$HOME/MyEnvName python=2.7 PackageName -y
    

    To see a list of your environments:

    $ conda env list
    

    To remove unwanted environments:

    $ conda remove --name MyEnvName --all
    

    To remove a package from an environment:

    $ conda remove --name MyEnvName PackageName
    

    Installing packages when creating your environment, instead of one at a time, will help you avoid dependency issues.

    To activate or deactivate an environment you have created:

    $ source activate MyEnvName
    $ source deactivate
    

    If you created your conda environment at a custom location using --prefix option, then you can activate or deactivate it using the full path.

    $ source activate $HOME/MyEnvName
    $ source deactivate
    

    To use a custom environment inside a job you must load the module and activate the environment inside your job submission script. Add the following lines to your submission script:

    $ module load anaconda
    $ source activate MyEnvName
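    Putting it together, a minimal submission file might look like this (a sketch; MyEnvName and myscript.py are placeholders for your environment and code):

    #!/bin/bash
    # FILENAME:  myjob.sub
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --time=00:10:00
    #SBATCH -A partner

    module load anaconda
    source activate MyEnvName

    python myscript.py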
    


    Managing Packages with Pip

    Pip is a Python package manager. The documentation for many Python packages provides pip instructions that result in permission errors, because by default pip will try to install in a system-wide location and fail.

    Exception:
    Traceback (most recent call last):
    ... ... stack trace ... ...
    OSError: [Errno 13] Permission denied: '/apps/cent7/anaconda/5.1.0-py27/lib/python2.7/site-packages/mpi4py-3.0.1.dist-info'
    

    If you encounter this error, it means that you cannot modify the global Python installation. We recommend installing Python packages in a conda environment. Detailed instructions for installing packages with pip can be found in our Python package installation page.
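    For example, assuming you have created and loaded a conda environment with conda-env-mod as described in the Installing Packages section, a pip install will then go into that environment instead of the system location (a sketch):

    $ module load anaconda/5.1.0-py36
    $ module load use.own
    $ module load conda-env/mypackages-py3.6.4
    $ pip install mpi4py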

    Below we list some other useful pip commands.

    • Search for a package in PyPI channels:
      $ pip search packageName
      
    • Check which packages are installed globally:
      $ pip list
      
    • Check which packages you have personally installed:
      $ pip list --user
      
    • Snapshot installed packages:
      $ pip freeze > requirements.txt
      
    • You can install packages from a snapshot inside a new conda environment. Make sure to load the appropriate conda environment first.
      $ pip install -r requirements.txt
      


    Installing Packages

    ITaP recommends installing Python packages in an Anaconda environment. One key advantage of Anaconda is that it allows users to install unrelated packages in separate self-contained environments. Individual packages can later be reinstalled or updated without impacting others. If you are unfamiliar with Conda environments, please check our Conda Guide.

    To facilitate the process of creating and using Conda environments, we support a script (conda-env-mod) that generates a module file for an environment, as well as an optional Jupyter kernel to use this environment in a JupyterHub notebook.

    You must load one of the anaconda modules in order to use this script.

    $ module load anaconda/5.1.0-py36

    Step-by-step instructions for installing custom Python packages are presented below.

    Link to section 'Step 1: Create a conda environment' of 'Installing Packages' Step 1: Create a conda environment

    Users can use the conda-env-mod script to create an empty conda environment. This script needs either a name or a path for the desired environment. After the environment is created, it generates a module file for using it in the future. Please note that conda-env-mod is different from the official conda-env script and supports a limited set of subcommands. Detailed instructions for using conda-env-mod can be found with the command conda-env-mod --help.

    • Example 1: Create a conda environment named mypackages in user's home directory.

      $ conda-env-mod create -n mypackages
    • Example 2: Create a conda environment named mypackages at a custom location.

      $ conda-env-mod create -p /depot/mylab/apps/mypackages

      Please follow the on-screen instructions while the environment is being created. After finishing, the script will print the instructions to use this environment.

      ... ... ...
      Preparing transaction: ...working... done
      Verifying transaction: ...working... done
      Executing transaction: ...working... done
      +------------------------------------------------------+
      | To use this environment, load the following modules: |
      |       module load use.own                            |
      |       module load conda-env/mypackages-py3.6.4       |
      +------------------------------------------------------+
      Your environment "mypackages" was created successfully.
      

    Note down the module names, as you will need to load these modules every time you want to use this environment. You may also want to add the module load lines to your job script, if it depends on custom Python packages, as sketched below.
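    For example, a job script that depends on the mypackages environment would load the modules before invoking Python (a sketch; my_script.py is a placeholder for your own code):

    module load anaconda/5.1.0-py36
    module load use.own
    module load conda-env/mypackages-py3.6.4
    python my_script.py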

    By default, module files are generated in your $HOME/privatemodules directory. The location of module files can be customized by specifying the -m /path/to/modules option to conda-env-mod.

    • Example 3: Create a conda environment named labpackages in your group's Data Depot space and place the module file at a shared location for the group to use.
      $ conda-env-mod create -p /depot/mylab/apps/labpackages -m /depot/mylab/etc/modules
      ... ... ...
      Preparing transaction: ...working... done
      Verifying transaction: ...working... done
      Executing transaction: ...working... done
      +-------------------------------------------------------+
      | To use this environment, load the following modules:  |
      |       module use /depot/mylab/etc/modules             |
      |       module load conda-env/labpackages-py3.6.4       |
      +-------------------------------------------------------+
      Your environment "labpackages" was created successfully.
      

    If you used a custom module file location, you need to run the module use command as printed by the script.

    By default, only the environment and a module file are created (no Jupyter kernel). If you plan to use your environment in a JupyterHub notebook, you need to append a --jupyter flag to the above commands.

    • Example 4: Create a Jupyter-enabled conda environment named labpackages in your group's Data Depot space and place the module file at a shared location for the group to use.
      $ conda-env-mod create -p /depot/mylab/apps/labpackages -m /depot/mylab/etc/modules --jupyter
      ... ... ...
      Jupyter kernel created: "Python (My labpackages Kernel)"
      ... ... ...
      Your environment "labpackages" was created successfully.
      

    Link to section 'Step 2: Load the conda environment' of 'Installing Packages' Step 2: Load the conda environment

    • The following instructions assume that you have used the conda-env-mod script to create an environment named mypackages (Examples 1 or 2 above). If you used conda create instead, please use conda activate mypackages.

      $ module load use.own
      $ module load conda-env/mypackages-py3.6.4
      

      Note that the conda-env module name includes the Python version that it supports (Python 3.6.4 in this example). This is the same as the Python version in the anaconda module.

    • If you used a custom module file location (Example 3 above), please use module use to load the conda-env module.

      $ module use /depot/mylab/etc/modules
      $ module load conda-env/mypackages-py3.6.4
      

    Link to section 'Step 3: Test the installed packages' of 'Installing Packages' Step 3: Test the installed packages

    To use the installed Python packages, you must load the module for your conda environment. If you have not loaded the conda-env module, please do so following the instructions at the end of Step 1.

    $ module load use.own
    $ module load conda-env/mypackages-py3.6.4
    
    • Example 1: Test that OpenCV is available.
      $ python -c "import cv2; print(cv2.__version__)"
      
    • Example 2: Test that mpi4py is available.
      $ python -c "import mpi4py; print(mpi4py.__version__)"
      

    If the commands finished without errors, then the installed packages can be used in your program.

    Link to section 'Additional capabilities of conda-env-mod script' of 'Installing Packages' Additional capabilities of conda-env-mod script

    The conda-env-mod tool is intended to facilitate creation of a minimal Anaconda environment, a matching module file, and optionally a Jupyter kernel. Once created, the environment can then be accessed via the familiar module load command, and tuned and expanded as necessary. Additionally, the script provides several auxiliary functions to help manage environments, module files, and Jupyter kernels.

    General usage for the tool adheres to the following pattern:

    $ conda-env-mod help
    $ conda-env-mod <subcommand> <required argument> [optional arguments]
    

    where required arguments are one of

    • -n|--name ENV_NAME (name of the environment)
    • -p|--prefix ENV_PATH (location of the environment)

    and optional arguments further modify behavior for specific actions (e.g. -m to specify alternative location for generated module file).

    Given a required name or prefix for an environment, the conda-env-mod script supports the following subcommands:

    • create - to create a new environment, its corresponding module file and optional Jupyter kernel.
    • delete - to delete an existing environment along with its module file and Jupyter kernel.
    • module - to generate just the module file for a given existing environment.
    • kernel - to generate just the Jupyter kernel for a given existing environment (note that the environment has to be created with a --jupyter option).
    • help - to display script usage help.

    Using these subcommands, you can iteratively fine-tune your environments, module files and Jupyter kernels, as well as delete and re-create them with ease. Below we cover several commonly occurring scenarios.

    Link to section 'Generating module file for an existing environment' of 'Installing Packages' Generating module file for an existing environment

    If you already have an existing configured Anaconda environment and want to generate a module file for it, follow appropriate examples from Step 1 above, but use the module subcommand instead of the create one. E.g.

    $ conda-env-mod module -n mypackages

    and follow printed instructions on how to load this module. With an optional --jupyter flag, a Jupyter kernel will also be generated.

    Note that if you intend to proceed with a Jupyter kernel generation (via the --jupyter flag or a kernel subcommand later), you will have to ensure that your environment has the ipython and ipykernel packages installed into it. To avoid this and other related complications, we highly recommend making a fresh environment using a suitable conda-env-mod create .... --jupyter command instead.

    Link to section 'Generating Jupyter kernel for an existing environment' of 'Installing Packages' Generating Jupyter kernel for an existing environment

    If you already have an existing configured Anaconda environment and want to generate a Jupyter kernel file for it, you can use the kernel subcommand. E.g.

    $ conda-env-mod kernel -n mypackages

    This will add a "Python (My mypackages Kernel)" item to the dropdown list of available kernels upon your next login to the JupyterHub.

    Note that generated Jupyter kernels are always personal (i.e. each user has to make their own, even for shared environments). Note also that you (or the creator of the shared environment) will have to ensure that your environment has the ipython and ipykernel packages installed into it.

    Link to section 'Managing and using shared Python environments' of 'Installing Packages' Managing and using shared Python environments

    Here is a suggested workflow for a common group-shared Anaconda environment with Jupyter capabilities:

    The PI or lab software manager:

    • Creates the environment and module file (once):

      $ module purge
      $ module load anaconda
      $ conda-env-mod create -p /depot/mylab/apps/labpackages -m /depot/mylab/etc/modules --jupyter
      
    • Installs required Python packages into the environment (as many times as needed):

      $ module use /depot/mylab/etc/modules
      $ module load conda-env/labpackages-py3.6.4
      $ conda install  .......                       # all the necessary packages
      

    Lab members:

    • Lab members can start using the environment in their command line scripts or batch jobs simply by loading the corresponding module:

      $ module use /depot/mylab/etc/modules
      $ module load conda-env/labpackages-py3.6.4
      $ python my_data_processing_script.py .....
      
    • To use the environment in Jupyter notebooks, each lab member will need to create their own Jupyter kernel (once). This is because Jupyter kernels are private to individuals, even for shared environments.

      $ module use /depot/mylab/etc/modules
      $ module load conda-env/labpackages-py3.6.4
      $ conda-env-mod kernel -p /depot/mylab/apps/labpackages
      

    A similar process can be devised for instructor-provided or individually-managed class software, etc.

    Link to section 'Troubleshooting' of 'Installing Packages' Troubleshooting

    • Python packages often fail to install or run due to dependency conflicts with other packages. In particular, if you previously installed packages in your home directory, it is safer to move those installations aside:
      $ mv ~/.local ~/.local.bak
      $ mv ~/.cache ~/.cache.bak
      
    • Unload all the modules.
      $ module purge
      
    • Clean up PYTHONPATH.
      $ unset PYTHONPATH
      
    • Next load the modules (e.g. anaconda) that you need.
      $ module load anaconda/5.1.0-py36
      $ module load use.own
      $ module load conda-env/mypackages-py3.6.4
      
    • Now try running your code again.
    • A few applications only run on specific versions of Python (e.g. Python 3.6). Please check your application's documentation if that is the case.

    Installing Packages from Source

    We maintain several Anaconda installations. Anaconda maintains numerous popular scientific Python libraries in a single installation. If you need a Python library not included with normal Python, we recommend first checking Anaconda. For a list of packages currently installed in the Anaconda Python distribution:

    $ module load anaconda
    $ conda list
    # packages in environment at /apps/cent7/anaconda/5.1.0-py27:
    #
    # Name                    Version                   Build  Channel
    _ipyw_jlab_nb_ext_conf    0.1.0            py27h08a7f0c_0  
    alabaster                 0.7.10           py27he5a193a_0  
    anaconda                  5.1.0                    py27_2  
    ...
    

    If you see the library in the list, you can simply import it into your Python code after loading the Anaconda module.

    If you do not find the package you need, you should be able to install the library in your own Anaconda customization. First try to install it with Conda or Pip. If the package is not available from either Conda or Pip, you may be able to install it from source.
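
    As a sketch of those first two attempts, run after loading your environment's module as shown below (mypackage is a placeholder name):

    $ conda install mypackage
    $ pip install mypackage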

    Use the following instructions as a guideline for installing packages from source. Make sure you have a download link to the software (usually it will be a tar.gz archive file). You will substitute it on the wget line below.

    We also assume that you have already created an empty conda environment as described in our Python package installation guide.

    $ mkdir ~/src
    $ cd ~/src
    $ wget http://path/to/source/tarball/app-1.0.tar.gz
    $ tar xzvf app-1.0.tar.gz
    $ cd app-1.0
    $ module load anaconda
    $ module load use.own
    $ module load conda-env/mypackages-py2.7.14
    $ python setup.py install
    $ cd ~
    $ python
    >>> import app
    >>> quit()
    

    The "import app" line should return without any output if installed successfully. You can then import the package in your python scripts.

    If you need further help or run into any issues installing a library contact us at rcac-help@purdue.edu or drop by Coffee Hour for in-person help.


    Example: Create and Use Biopython Environment with Conda

    Link to section 'Using conda to create an environment that uses the biopython package' of 'Example: Create and Use Biopython Environment with Conda' Using conda to create an environment that uses the biopython package

    To use Conda you must first load the anaconda module:

    $ module load anaconda
    

    Create an empty conda environment to install biopython:

    $ conda-env-mod create -n biopython
    

    Now activate the biopython environment:

    $ module load use.own
    $ module load conda-env/biopython-py2.7.14
    

    Install the biopython packages in your environment:

    $ conda install --channel anaconda biopython -y
    Fetching package metadata ..........
    Solving package specifications .........
    .......
    Linking packages ...
    [    COMPLETE    ]|################################################################
    

    The --channel option tells conda to search the anaconda channel for the biopython package. The -y argument is optional and allows you to skip the installation prompt. A list of packages will be displayed as they are installed.
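
    If you are not sure whether a channel carries a given package, you can search it before installing (conda search is part of the standard conda command-line interface):

    $ conda search --channel anaconda biopython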

    Remember to add the following lines to your job submission script to use the custom environment in your jobs:

    module load anaconda
    module load use.own
    module load conda-env/biopython-py2.7.14
    

    If you need further help or run into any issues with creating environments contact us at rcac-help@purdue.edu or drop by Coffee Hour for in-person help.


    Numpy Parallel Behavior

    The widely available Numpy package is the best way to handle numerical computation in Python. The numpy package provided by our anaconda modules is optimized using Intel's MKL library. It will automatically parallelize many operations to make use of all the cores available on a machine.

    In many contexts that would be the ideal behavior. On the cluster, however, it is often not the preferred behavior, because more than one user may be present on a system and/or more than one job may run on a node. Having multiple processes contend for the same cores will actually reduce performance.

    Setting the MKL_NUM_THREADS or OMP_NUM_THREADS environment variable(s) allows you to control this behavior. Our anaconda modules automatically set these variables to 1, but only if you do not already have them defined.

    When submitting batch jobs, it is always a good idea to be explicit rather than implicit. If you want a job to make use of the full resources available on the node, set one or both of these variables to the number of cores you want to allow numpy to use.

    #!/bin/bash
    
    
    module load anaconda
    export MKL_NUM_THREADS=20
    
    ...

    If you are submitting multiple jobs that you intend to be scheduled together on the same node, it is probably best to restrict numpy to a single core.

    #!/bin/bash
    
    
    module load anaconda
    export MKL_NUM_THREADS=1
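
    As a quick sanity check, you can print the variable from within your job's Python process; for the script above it should print 1 (a sketch using the standard os module):

    $ python -c "import os; print(os.environ.get('MKL_NUM_THREADS'))"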

    R

    R, a GNU project, is a language and environment for data manipulation, statistics, and graphics. It is an open-source implementation of the S programming language. R is quickly becoming the language of choice for data science due to the ease with which it can produce high-quality plots and data visualizations. It is a versatile platform with a large, growing community and collection of packages.

    For more general information on R visit The R Project for Statistical Computing.

    Running R jobs

    This section illustrates how to submit a small R job to a SLURM queue. The example job computes a Pythagorean triple.

    Prepare an R input file with an appropriate filename, here named myjob.R:

    # FILENAME:  myjob.R
    
    # Compute a Pythagorean triple.
    a = 3
    b = 4
    c = sqrt(a*a + b*b)
    c     # display result
    

    Prepare a job submission file with an appropriate filename, here named myjob.sub:

    #!/bin/bash
    # FILENAME:  myjob.sub
    
    module load r
    
    # --vanilla: combines --no-save, --no-restore, --no-site-file,
    #            --no-init-file and --no-environ
    # --no-save: do not save datasets at the end of an R session
    R --vanilla --no-save < myjob.R
    

    To run the job:

    • Submit the job (a sketch of these commands follows below)
    • View the job status
    • View the results of the job
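
    A minimal sketch of these steps using standard SLURM commands (the account name myqueue, the resource requests, and the job ID are placeholders):

    $ sbatch -A myqueue -N 1 -n 1 -t 00:05:00 myjob.sub
    $ squeue -u myusername
    $ cat slurm-<jobid>.out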


    Installing R packages

    Link to section 'Challenges of Managing R Packages in the Cluster Environment' of 'Installing R packages' Challenges of Managing R Packages in the Cluster Environment

    • Different clusters have different hardware and software. So, if you have access to multiple clusters, you must install your R packages separately for each cluster.
    • Each cluster has multiple versions of R, and packages installed with one version of R may not work with another version. So, libraries for each R version must be installed in a separate directory.
    • You can define the directory where your R packages will be installed using the environment variable R_LIBS_USER (see the sketch after this list).
    • For your convenience, ITaP provides a sample ~/.Rprofile example file that can be downloaded to your cluster account and renamed into ~/.Rprofile (or appended to one) to customize your installation preferences. See the 'Setting Up R Preferences with .Rprofile' section below for detailed instructions.
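
    If you prefer to set this variable manually for a single session rather than via ~/.Rprofile, a minimal sketch using the per-cluster path convention shown later in this guide:

    $ export R_LIBS_USER=~/R/weber/4.0.0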

    Link to section 'Installing Packages' of 'Installing R packages' Installing Packages

    • Step 0: Set up installation preferences.
      Follow the steps for setting up your ~/.Rprofile preferences. This step needs to be done only once. If you have created a ~/.Rprofile file previously on Weber, ignore this step.

    • Step 1: Check if the package is already installed.
      As part of the R installations on ITaP community clusters, a lot of R libraries are pre-installed. You can check if your package is already installed by opening an R terminal and entering the command installed.packages(). For example,

      $ module load r/4.0.0
      $ R
      > installed.packages()["units",c("Package","Version")]
      Package Version 
      "units" "0.6-3"
      > quit()

      If the package you are trying to use is already installed, simply load the library, e.g., library('units'). Otherwise, move to the next step to install the package.

    • Step 2: Load required dependencies. (if needed)
      For simple packages you may not need this step. However, some R packages depend on other libraries. For example, the sf package depends on gdal and geos libraries. So, you will need to load the corresponding modules before installing sf. Read the documentation for the package to identify which modules should be loaded.

      $ module load gdal
      $ module load geos
    • Step 3: Install the package.
      Now install the desired package using the command install.packages('package_name'). R will automatically download the package and all its dependencies from CRAN and install each one. Your terminal will show the build progress and eventually show whether the package was installed successfully or not.

      $ R
      > install.packages('sf', repos="https://cran.case.edu/")
      Installing package into ‘/home/myusername/R/weber/4.0.0’
      (as ‘lib’ is unspecified)
      trying URL 'https://cran.case.edu/src/contrib/sf_0.9-7.tar.gz'
      Content type 'application/x-gzip' length 4203095 bytes (4.0 MB)
      ==================================================
      downloaded 4.0 MB
      ...
      ...
      more progress messages
      ...
      ...
      ** testing if installed package can be loaded from final location
      ** testing if installed package keeps a record of temporary installation path
      * DONE (sf)
      
      The downloaded source packages are in
          ‘/tmp/RtmpSVAGio/downloaded_packages’
    • Step 4: Troubleshooting. (if needed)
      If Step 3 ended with an error, you need to investigate why the build failed. The most common reason for a build failure is not loading the necessary modules (see the sketch below).
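
    For example, if the sf build from Step 3 failed because its system libraries were missing, a sketch of the recovery would be to load the modules from Step 2 and re-run the installation:

    $ module load gdal geos
    $ R
    > install.packages('sf', repos="https://cran.case.edu/")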

    Link to section 'Loading Libraries' of 'Installing R packages' Loading Libraries

    Once you have packages installed you can load them with the library() function as shown below:

    > library('packagename')

    The package is now loaded and ready to be used in R.

    Link to section 'Example: Installing dplyr' of 'Installing R packages' Example: Installing dplyr

    The following demonstrates installing the dplyr package assuming the above-mentioned custom ~/.Rprofile is in place (note its effect in the "Installing package into" information message):

    $ module load r
    $ R
    > install.packages('dplyr', repos="http://ftp.ussg.iu.edu/CRAN/")
    Installing package into ‘/home/myusername/R/weber/4.0.0’
    (as ‘lib’ is unspecified)
     ...
    also installing the dependencies 'crayon', 'utf8', 'bindr', 'cli', 'pillar', 'assertthat', 'bindrcpp', 'glue', 'pkgconfig', 'rlang', 'Rcpp', 'tibble', 'BH', 'plogr'
     ...
     ...
     ...
    The downloaded source packages are in 
        '/tmp/RtmpHMzm9z/downloaded_packages'
    
    > library(dplyr)
    
    Attaching package: 'dplyr'
    >


    Loading Data into R

    R is an environment for manipulating data. In order to manipulate data, it must be brought into the R environment. R has functions to read data stored in most common file formats. Functions for some of the most common file types, like comma-separated values (CSV) files, come with the basic R packages; other, less common file types require additional packages to be installed. To read data from a CSV file into the R environment, enter the following command at the R prompt:

    > read.csv(file = "path/to/data.csv", header = TRUE)

    When R reads the file it creates an object that can then become the target of other functions. Note that calling read.csv() by itself only displays the data; to keep the data for later use, assign the result of read.csv() to a named object:

    > my_variable <- read.csv(file = "path/to/data.csv", header = TRUE)

    To display the properties (structure) of loaded data, enter the following:

    > str(my_variable)
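
    To preview the first few rows of the loaded data, you can use head(), which is part of base R:

    > head(my_variable)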


    Setting Up R Preferences with .Rprofile

    For your convenience, ITaP provides a sample ~/.Rprofile example file that can be downloaded to your cluster account and renamed into ~/.Rprofile (or appended to one). Follow these steps to download our recommended ~/.Rprofile example and copy it into place:

    $ curl -#LO https://www.rcac.purdue.edu/files/knowledge/run/examples/apps/r/Rprofile_example
    $ mv -ib Rprofile_example ~/.Rprofile

    The above installation step needs to be done only once on Weber. Now load the R module and run R:

    $ module load r/4.0.0
    $ R
    > .libPaths()
    [1] "/home/myusername/R/weber/4.0.0"                           
    [2] "/apps/spack/weber/apps/r/4.0.0-gcc-6.3.0-righufz/rlib/R/library"

    .libPaths() should output something similar to above if it is set up correctly.

    You are now ready to install R packages into the directory /home/myusername/R/weber/4.0.0.

    Singularity

    Note: Singularity was originally a project out of Lawrence Berkeley National Laboratory. It has since been spun off into a distinct offering from a new corporate entity, Sylabs Inc. This guide pertains to the open source community edition, SingularityCE.

    Link to section 'What is Singularity?' of 'Singularity' What is Singularity?

    Singularity is a feature of the Community Clusters that allows portability and reproducibility of operating system and application environments through the use of Linux containers. It gives users complete control over their environment.

    Singularity is like Docker but tuned explicitly for HPC clusters. More information is available from the project’s website.

    Link to section 'Features' of 'Singularity' Features

    • Run the latest applications on an Ubuntu or CentOS userland
    • Gain access to the latest developer tools
    • Launch MPI programs easily
    • Much more

    Singularity’s user guide is available at: sylabs.io/guides/3.8/user-guide

    Link to section 'Example' of 'Singularity' Example

    Here is an example using an Ubuntu 16.04 image on Weber:

    $ singularity exec /depot/itap/singularity/ubuntu1604.img cat /etc/lsb-release
    DISTRIB_ID=Ubuntu
    DISTRIB_RELEASE=16.04
    DISTRIB_CODENAME=xenial
    DISTRIB_DESCRIPTION="Ubuntu 16.04 LTS"

    Here is another example using a Centos 7 image:

    $ singularity exec /depot/itap/singularity/centos7.img cat /etc/redhat-release
    CentOS Linux release 7.2.1511 (Core) 

    Link to section 'Purdue Cluster Specific Notes' of 'Singularity' Purdue Cluster Specific Notes

    All service providers will integrate Singularity slightly differently depending on site. The largest customization will be which default files are inserted into your images so that routine services will work.

    Services we configure for your images include DNS settings and account information. File systems we overlay into your images are your home directory, scratch, Data Depot, and application file systems.

    Here is a list of paths:

    • /etc/resolv.conf
    • /etc/hosts
    • /home/$USER
    • /apps
    • /scratch
    • /depot

    This means that within the container environment these paths will be present and the same as outside the container. The /apps, /scratch, and /depot directories will need to exist inside your container to work properly.

    Link to section 'Creating Singularity Images' of 'Singularity' Creating Singularity Images

    Due to how Singularity containers work, you must have root privileges to build an image. Once you have a Singularity container image built on your own system, you can copy the image file up to the cluster (you do not need root privileges to run the container).

    Information and documentation on how to install and use Singularity on your own system is available from the Singularity user guide linked above.

    We have version 2.6.1-dist on the cluster. You will most likely not be able to run any container built with a newer Singularity version (i.e., version 3 or later), so be sure to follow the installation guide for version 2.6 on your system.

    $ singularity --version
    2.6.1-dist

    Everything you need on how to build a container is available from their user-guide. Below are merely some quick tips for getting your own containers built for Weber.

    You can use a Container Recipe to both build your container and share its specification with collaborators (for the sake of reproducibility). Here is a simplistic example of such a file:

    # FILENAME: Buildfile
    
    Bootstrap: docker
    From: ubuntu:18.04
    
    %post
        apt-get update && apt-get upgrade -y
        mkdir /apps /depot /scratch

    To build the image itself:

    $ sudo singularity build ubuntu-18.04.simg Buildfile

    The challenge with this approach, however, is that the build must start from scratch every time you decide to change something. To create a container image iteratively and interactively instead, you can use the --sandbox option.

    $ sudo singularity build --sandbox ubuntu-18.04 docker://ubuntu:18.04

    This will not create a flat image file but a directory tree (i.e., a folder), the contents of which are the container's filesystem. In order to get a shell inside the container that allows you to modify it, use the --writable option.

    $ sudo singularity shell --writable ubuntu-18.04
    Singularity: Invoking an interactive shell within container...
    
    Singularity ubuntu-18.04.sandbox:~>

    You can then proceed to install any libraries, software, etc. within the container. Then to create the final image file, exit the shell and call the build command once more on the sandbox.

    $ sudo singularity build ubuntu-18.04.simg ubuntu-18.04

    Finally, copy the new image to Weber and run it.
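
    A sketch of that final step (the login hostname is an assumption; substitute whichever Weber front-end you normally connect to):

    $ scp ubuntu-18.04.simg myusername@weber.rcac.purdue.edu:
    $ singularity exec ubuntu-18.04.simg cat /etc/lsb-release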

    Windows

    Windows virtual machines (VMs) are supported as batch jobs on HPC systems. This section illustrates how to submit a job and run a Windows instance in order to run Windows applications on the high-performance computing systems.

    Weber provides a basic Windows 10 image to execute Microsoft Office applications within the cluster's boundaries.

    • The Windows image is not persistent, and will default to a baseline state each time Windows is launched.
    • Only the provided Windows image is to be launched on Weber.

    The Windows VMs can be launched in two fashions:

    • Command line
    • Menu Launcher

    See the sections below for detailed instructions on using each of them.

    Command line

    If you wish to work with Windows VMs on the command line or work them into scripted workflows, you can interact directly with the Windows system:

    • Load the "qemu" module:
      $ module load qemu
      

    To launch a virtual machine in a batch job, use the "windows" script, specifying the path to your Windows virtual machine image. With no other command-line arguments, the windows script will autodetect a number of cores and an amount of memory for the Windows VM. A Windows network connection will be made to your home directory. To launch:

    
    $ /depot/windows/weberwin.sh 
    

    The Windows desktop will open, and automatically log in as a temporary user. No changes to the VM will be preserved.
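
    If you want to work this into a scripted batch workflow, a minimal sketch of a submission file (the filename is a placeholder; the weberwin.sh script autodetects cores and memory as described above):

    #!/bin/bash
    # FILENAME:  windows.sub

    module load qemu
    /depot/windows/weberwin.sh

    which you could then submit with sbatch windows.sub.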

    Menu Launcher

    Windows VMs can be easily launched through the Thinlinc remote desktop environment.

    • Log in via Thinlinc.
    • Click on Applications menu in the upper left corner.
    • Look under the Cluster Software menu.
    • The "Windows 10" launcher will launch a VM directly on the front-end.
    • Follow the dialogs to set up your VM.

    The dialog menus will walk you through setting up and loading your VM.

    Link to section 'Notes' of 'Menu Launcher' Notes

    Using the menu launcher will automatically select reasonable CPU and memory values. If you wish to choose other options or work Windows VMs into scripted workflows, see the section on using the command line.

    Link to section 'Compiling Source Code on Weber' of 'Compiling Source Code' Compiling Source Code on Weber

    Compiling Serial Programs

    A serial program is a single process which executes as a sequential stream of instructions on one processor core. Compilers capable of serial programming are available for C, C++, and versions of Fortran.

    To load a compiler, enter one of the following:

    $ module load intel
    $ module load gcc

    The following table illustrates how to compile your serial program:

    Fortran 77:
        Intel:  $ ifort myprogram.f -o myprogram
        GNU:    $ gfortran myprogram.f -o myprogram
    Fortran 90:
        Intel:  $ ifort myprogram.f90 -o myprogram
        GNU:    $ gfortran myprogram.f90 -o myprogram
    Fortran 95:
        Intel:  $ ifort myprogram.f90 -o myprogram
        GNU:    $ gfortran myprogram.f95 -o myprogram
    C:
        Intel:  $ icc myprogram.c -o myprogram
        GNU:    $ gcc myprogram.c -o myprogram
    C++:
        Intel:  $ icc myprogram.cpp -o myprogram
        GNU:    $ g++ myprogram.cpp -o myprogram

    The Intel and GNU compilers will not output anything for a successful compilation. Also, the Intel compiler does not recognize the suffix ".f95".

    Compiling OpenMP Programs

    All compilers installed on Weber include OpenMP functionality for C, C++, and Fortran. An OpenMP program is a single process that takes advantage of a multi-core processor and its shared memory to achieve a form of parallel computing called multithreading. It distributes the work of a process over processor cores in a single compute node without the need for MPI communications.

    OpenMP programs require including a header file:

    Fortran 77:  INCLUDE 'omp_lib.h'
    Fortran 90:  use omp_lib
    Fortran 95:  use omp_lib
    C:           #include <omp.h>
    C++:         #include <omp.h>


    To load a compiler, enter one of the following:

    $ module load intel
    $ module load gcc
    The following table illustrates how to compile your shared-memory program. Any compiler flags accepted by the ifort/icc compilers are compatible with OpenMP.

    Fortran 77:
        Intel:  $ ifort -openmp myprogram.f -o myprogram
        GNU:    $ gfortran -fopenmp myprogram.f -o myprogram
    Fortran 90:
        Intel:  $ ifort -openmp myprogram.f90 -o myprogram
        GNU:    $ gfortran -fopenmp myprogram.f90 -o myprogram
    Fortran 95:
        Intel:  $ ifort -openmp myprogram.f90 -o myprogram
        GNU:    $ gfortran -fopenmp myprogram.f95 -o myprogram
    C:
        Intel:  $ icc -openmp myprogram.c -o myprogram
        GNU:    $ gcc -fopenmp myprogram.c -o myprogram
    C++:
        Intel:  $ icc -openmp myprogram.cpp -o myprogram
        GNU:    $ g++ -fopenmp myprogram.cpp -o myprogram

    The Intel and GNU compilers will not output anything for a successful compilation. Also, the Intel compiler does not recognize the suffix ".f95".
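
    Putting it together, a minimal sketch of compiling and running an OpenMP program on one Weber node (the thread count of 20 matches the cores per node and is illustrative):

    $ module load gcc
    $ gfortran -fopenmp myprogram.f90 -o myprogram
    $ export OMP_NUM_THREADS=20
    $ ./myprogram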


    Intel MKL Library

    Intel Math Kernel Library (MKL) contains ScaLAPACK, LAPACK, Sparse Solver, BLAS, Sparse BLAS, CBLAS, GMP, FFTs, DFTs, VSL, VML, and Interval Arithmetic routines. MKL resides in the directory stored in the environment variable MKL_HOME, after loading a version of the Intel compiler with module.

    By using module load to load an Intel compiler, your environment will have several variables set up to help link applications with MKL. Here are some example combinations of simplified linking options:

    $ module load intel
    $ echo $LINK_LAPACK
    -L${MKL_HOME}/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
    
    $ echo $LINK_LAPACK95
    -L${MKL_HOME}/lib/intel64 -lmkl_lapack95_lp64 -lmkl_blas95_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
    

    ITaP recommends you use the provided variables to define MKL linking options in your compiling procedures. The Intel compiler modules also provide two other environment variables, LINK_LAPACK_STATIC and LINK_LAPACK95_STATIC, that you may use if you need to link MKL statically.
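
    For example, a sketch of compiling a Fortran program and dynamically linking it against LAPACK via the provided variable:

    $ module load intel
    $ ifort myprogram.f -o myprogram $LINK_LAPACK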

    ITaP recommends that you use dynamic linking of libguide. If so, define LD_LIBRARY_PATH such that you are using the correct version of libguide at run time. If you use static linking of libguide, then:

    • If you use the Intel compilers, link in the libguide version that comes with the compiler (use the -openmp option).
    • If you do not use the Intel compilers, link in the libguide version that comes with the Intel MKL above.


    Provided Compilers

    Compilers are available on Weber for Fortran, C, and C++. Compiler sets from Intel and GNU are installed. The recommended set of compiler, math library, and message-passing library is:

    • Intel
    • MKL
    • Intel MPI

    To load the recommended set:

    $ module load rcac
    $ module list
    


    GNU Compilers

    The official name of the GNU compilers is "GNU Compiler Collection" or "GCC". To discover which versions are available:

    $ module avail gcc

    Choose an appropriate GCC module and load it. For example:

    $ module load gcc

    An older version of the GNU compiler will be in your path by default. Do NOT use this version. Instead, load a newer version using the command module load gcc.
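
    After loading the module, you can confirm which gcc is now first in your path:

    $ which gcc
    $ gcc --version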

    Here are some examples for the GNU compilers:

    Fortran 77:
        Serial: $ gfortran myprogram.f -o myprogram
        MPI:    $ mpif77 myprogram.f -o myprogram
        OpenMP: $ gfortran -fopenmp myprogram.f -o myprogram
    Fortran 90:
        Serial: $ gfortran myprogram.f90 -o myprogram
        MPI:    $ mpif90 myprogram.f90 -o myprogram
        OpenMP: $ gfortran -fopenmp myprogram.f90 -o myprogram
    Fortran 95:
        Serial: $ gfortran myprogram.f95 -o myprogram
        MPI:    $ mpif90 myprogram.f95 -o myprogram
        OpenMP: $ gfortran -fopenmp myprogram.f95 -o myprogram
    C:
        Serial: $ gcc myprogram.c -o myprogram
        MPI:    $ mpicc myprogram.c -o myprogram
        OpenMP: $ gcc -fopenmp myprogram.c -o myprogram
    C++:
        Serial: $ g++ myprogram.cpp -o myprogram
        MPI:    $ mpiCC myprogram.cpp -o myprogram
        OpenMP: $ g++ -fopenmp myprogram.cpp -o myprogram

    More information on compiler options appears in the official man pages, which are accessible with the man command after loading the appropriate compiler module.


    Intel Compilers

    One or more versions of the Intel compiler are available on Weber. To discover which ones:

    $ module avail intel

    Choose an appropriate Intel module and load it. For example:

    $ module load intel

    Here are some examples for the Intel compilers:

    Fortran 77:
        Serial: $ ifort myprogram.f -o myprogram
        MPI:    $ mpiifort myprogram.f -o myprogram
        OpenMP: $ ifort -openmp myprogram.f -o myprogram
    Fortran 90:
        Serial: $ ifort myprogram.f90 -o myprogram
        MPI:    $ mpiifort myprogram.f90 -o myprogram
        OpenMP: $ ifort -openmp myprogram.f90 -o myprogram
    Fortran 95: same as Fortran 90
    C:
        Serial: $ icc myprogram.c -o myprogram
        MPI:    $ mpiicc myprogram.c -o myprogram
        OpenMP: $ icc -openmp myprogram.c -o myprogram
    C++:
        Serial: $ icpc myprogram.cpp -o myprogram
        MPI:    $ mpiicpc myprogram.cpp -o myprogram
        OpenMP: $ icpc -openmp myprogram.cpp -o myprogram

    More information on compiler options appears in the official man pages, which are accessible with the man command after loading the appropriate compiler module.


    Frequently Asked Questions

    Some common questions, errors, and problems are categorized below.

    Link to section 'About Weber' of 'About Weber' About Weber

    Can you remove me from the Weber mailing list?

    Your subscription to the Weber mailing list is tied to your account on Weber. If you are no longer using your account on Weber, your account can be deleted from the My Accounts page. Hover over the resource you wish to remove yourself from and click the red 'X' button. Your account and mailing list subscription will be removed overnight. Be sure to make a copy of any data you wish to keep first.

    Link to section 'Logging In & Accounts' of 'Logging In & Accounts' Logging In & Accounts

    Link to section 'Errors' of 'Errors' Errors

    /usr/bin/xauth: error in locking authority file

    Link to section 'Problem' of '/usr/bin/xauth: error in locking authority file' Problem

    I receive this message when logging in:

    /usr/bin/xauth: error in locking authority file

    Link to section 'Solution' of '/usr/bin/xauth: error in locking authority file' Solution

    Your home directory disk quota is full. You may check your quota with myquota.

    You will need to free up space in your home directory.
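
    A sketch of tracking down what is using the space (du and sort are standard tools; remove only files you no longer need):

    $ myquota
    $ du -h --max-depth=1 ~ | sort -h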

    My SSH connection hangs

    Link to section 'Problem' of 'My SSH connection hangs' Problem

    Your console hangs while trying to connect to an RCAC server.

    Link to section 'Solution' of 'My SSH connection hangs' Solution

    This can happen due to various reasons. Most common reasons for hanging SSH terminals are:

    • Network: If you are connected over WiFi, make sure that your Internet connection is stable.
    • Busy front-end server: When you connect to a cluster, you SSH to one of the front-ends. Due to transient user loads, one or more of the front-ends may become unresponsive for a short while. To avoid this, try reconnecting to the cluster or wait until the load on the server you are connected to subsides.
    • File system issue: If a server has issues with one or more of the file systems (home, scratch, or depot) it may freeze your terminal. To avoid this you can connect to another front-end.

    If none of the suggestions above help, please contact rcac-help@purdue.edu, specifying the name of the server where your console hung.

    Link to section 'Questions' of 'Questions' Questions

    I worked on Weber after I graduated/left Purdue, but can not access it anymore

    Link to section 'Problem' of 'I worked on Weber after I graduated/left Purdue, but can not access it anymore' Problem

    You have graduated or left Purdue but continue collaboration with your Purdue colleagues. You find that your access to Purdue resources has suddenly stopped and your password is no longer accepted.

    Link to section 'Solution' of 'I worked on Weber after I graduated/left Purdue, but can not access it anymore' Solution

    Access to all Research Computing resources depends on having a valid Purdue Career Account. Expired Career Accounts are removed twice a year, during Spring and October breaks (more details at the official page). If your Career Account was purged due to expiration, you will not be able to access the resources.

    To provide remote collaborators with valid Purdue credentials, the University provides a special procedure called R4P ("request for privileges") (see details under 'Data/Access' tab). If you need to continue your collaboration with your Purdue PI, the PI will have to work with their departmental Business Office to submit or renew an R4P request on your behalf.

    After your R4P is completed and Career Account is restored, please note two additional necessary steps:

    • Access: Restored Career Accounts by default do not have any Research Computing resources enabled for them. Your PI will have to log in to the Manage Users tool and explicitly re-enable your access by un-checking and then ticking back the checkboxes for the desired queues/Unix group resources.

    • Email: Restored Career Accounts by default do not have their @purdue.edu email service enabled. While this does not preclude you from using Research Computing resources, any email messages (whether generated on the clusters or service announcements) would not be delivered, which may cause inconvenience or loss of compute jobs. To avoid this, we recommend setting your restored @purdue.edu email service to "Forward" (to an actual address you read). The easiest way to ensure this is to go through the Account Setup process.

    Link to section 'Jobs' of 'Jobs' Jobs

    Link to section 'Errors' of 'Errors' Errors

    bash: command not found

    Link to section 'Problem' of 'bash: command not found' Problem

    You receive the following message after typing a command

    bash: command not found

    Link to section 'Solution' of 'bash: command not found' Solution

    This means the system doesn't know how to find your command. Typically, you need to load a module to make the command available.

    bash: module command not found

    Link to section 'Problem' of 'bash: module command not found' Problem

    You receive the following message after typing a command, e.g. module load intel

    bash: module command not found

    Link to section 'Solution' of 'bash: module command not found' Solution

    The system cannot find the module command. You need to source the modules.sh file as shown below:

    source /etc/profile.d/modules.sh

    or start your script with an interactive shell by using the following shebang line:

    #!/bin/bash -i

    Link to section 'Questions' of 'Questions' Questions

    How do I find the Non-Uniform Memory Access (NUMA) layout on Weber?

    • You can learn about processor layout on Weber nodes using the following command:
      weber-a000:~$ lstopo-no-graphics
    • For detailed IO connectivity:
      weber-a000:~$ lstopo-no-graphics --physical --whole-io
    • Please note that NUMA information is useful for advanced MPI/OpenMP/GPU optimizations. For most users, using default NUMA settings in MPI or OpenMP would give you the best performance.

    Link to section 'Data' of 'Data' Data

    How is my Data Secured on Weber?

    Weber is operated in line with policies, standards, and best practices as described within Secure Purdue, specific to Research Computing resources. In addition, L4 Export Controlled (ITAR) and Controlled Unclassified Information (CUI) stored within Weber are handled in compliance with EAR, ITAR, and NIST SP 800-171 regulations.

    For additional information, log in with your Purdue Career Account.

    Can I share data with outside collaborators?

    No, external collaboration is not allowed for Weber.

    Can I access Fortress from Weber?

    No. Weber has its own dedicated, secure long-term archival storage.

    Link to section 'Software' of 'Software' Software

    Cannot use pip after loading ml-toolkit modules

    Neither pip nor ml-toolkit is available on Weber, although both are available on the other community clusters.

    How can I get access to Sentaurus software?

    Sentaurus is not currently available on Weber. Please contact our support team at rcac-help@purdue.edu if you would like it to be installed for your project.

    Link to section 'About Research Computing' of 'About Research Computing' About Research Computing

    Can I get a private server from RCAC?

    Link to section 'Question' of 'Can I get a private server from RCAC?' Question

    Can I get a private (virtual or physical) server from RCAC?

    Link to section 'Answer' of 'Can I get a private server from RCAC?' Answer

    Often, researchers may want a private server to run databases, web servers, or other software. RCAC currently does not offer private servers (formerly known as "Firebox").

    For use cases like this, we recommend the Jetstream Cloud (http://jetstream-cloud.org/), an NSF-funded science cloud allocated through the XSEDE project. RCAC staff can help you get access to Jetstream for testing, or help write an allocation proposal for larger projects.

    Alternatively, you may consider commercial cloud providers such as Amazon Web Services, Azure, or Digital Ocean. These services are very flexible, but do come with a monetary cost.

    Biography of Mary Ellen Weber

    Portrait of Mary Ellen Weber

    Mary Ellen Weber is a Purdue alumna, astronaut, chemist, business executive and speaker.

    Dr. Weber grew up in Ohio and earned her bachelor's degree in chemical engineering with honors from Purdue in 1984. She went on to earn a doctorate in physical chemistry from the University of California-Berkeley in 1988 and a master of business administration degree from Southern Methodist University in 2002.

    Dr. Weber was selected by NASA to become an astronaut in 1992. She served on two space shuttle missions, STS-70 Discovery in 1995 and STS-101 Atlantis in 2000, traveling a total of 297 earth orbits and 7.8 million miles. On the Discovery mission, Dr. Weber successfully deployed a $200 million NASA communications satellite to its orbit 22,000 miles above Earth and performed biotechnology research related to colon cancer.

    On the Atlantis mission, which was the third shuttle mission devoted to the construction of the International Space Station, Dr. Weber operated the shuttle's 60-foot robotic arm to maneuver spacewalking crewmembers along the Station's surface and directed the transfer of more than three thousand pounds of equipment.

    In addition to her work in the Astronaut Corps, Dr. Weber held a variety of other positions within NASA, including working as the Legislative Affairs liaison at NASA headquarters in Washington, D.C. She is the recipient of the NASA Exceptional Service Medal.

    After leaving NASA, Dr. Weber was the Vice President for Government Affairs and Policy for nine years at the University of Texas Southwestern Medical Center in Dallas, Texas. She is the founder of Stellar Strategies, LLC, consulting in strategic communications, technology innovation and high-risk operations. She has over 20 years of experience as a speaker and has been a keynote speaker at many conferences and a frequent TV news guest.

    Dr. Weber is an active competitive skydiver, who has logged nearly 6,000 skydives and won two dozen medals at the U.S. National Skydiving Championships.
