Data Workbench User Guide
The Data Workbench is an interactive compute environment for non-batch big data analysis and simulation, and is a part of Purdue's Community Cluster Program.
Overview of Data Workbench
The Data Workbench is an interactive compute environment for non-batch big data analysis and simulation, and is a part of Purdue's Community Cluster Program. The Data Workbench consists of Dell compute nodes with one 24-core AMD EPYC 7401P processor per node and 512 GB of memory. All nodes are interconnected with 10 Gigabit Ethernet. The Data Workbench entered production on October 1, 2017.
To purchase access to Data Workbench today, go to the Cluster Access Purchase page. Please subscribe to our Community Cluster Program Mailing List to stay informed on the latest purchasing developments or contact us via email at rcac-cluster-purchase@lists.purdue.edu if you have any questions.
Data Workbench Specifications
The Data Workbench consists of Dell Servers with one 24-core AMD EPYC 7401P CPU, 512 GB of memory, and 10 Gigabit Ethernet network.
Front-Ends | Number of Nodes | Processors per Node | Cores per Node | Memory per Node | Retires in |
---|---|---|---|---|---|
1 | 6 | One AMD EPYC 7401P CPU @ 2.00GHz | 24 | 512 GB | 2024 |
Data Workbench nodes run CentOS 7 and are intended for interactive work via the ThinLinc remote desktop software or JupyterHub. Data Workbench provides no batch system.
The application of operating system patches occurs as security needs dictate. All nodes allow for unlimited stack usage, as well as unlimited core dump size (though disk space and server quotas may still be a limiting factor). All nodes guarantee even access to CPU and memory resources via Linux cgroups.
On Data Workbench, the following set of compilers and math libraries is recommended:
- Intel 17.0.1.132
- MKL
This compiler and these libraries are loaded by default. To load the recommended set again:
$ module load rcac
To verify what you loaded:
$ module list
Data Workbench Regular Maintenance
Regular planned maintenance on Data Workbench is scheduled for the first Thursday of every month, 8:00am to 5:00pm.
Software catalog
Accounts on Data Workbench
Obtaining an Account
To obtain an account, you must be part of a research group which has purchased access to Data Workbench. Refer to the Accounts / Access page for more details on how to request access.
Outside Collaborators
A valid Purdue Career Account is required for access to any resource. If you do not currently have a valid Purdue Career Account you must have a current Purdue faculty or staff member file a Request for Privileges (R4P) before you can proceed.
Logging In
To work on Data Workbench, log in to the host workbench.rcac.purdue.edu via SSH.
Purdue Login
SSH
- SSH to the cluster as usual.
- When asked for a password, type your password followed by ",push".
- Your Purdue Duo client will receive a notification to approve the login.
ThinLinc
- When asked for a password, type your password followed by ",push".
- Your Purdue Duo client will receive a notification to approve the login.
- The native ThinLinc client will prompt for Duo approval twice due to the way ThinLinc works.
- The native ThinLinc client also supports key-based authentication.
JupyterHub
The JupyterHub service can be accessed from your web browser.
- Open a web browser and navigate to workbench.rcac.purdue.edu.
- Log in with your Purdue Career Account username and password.
Passwords
Data Workbench supports either Purdue two-factor authentication (Purdue Login) or SSH keys.
RStudio Server
After Aug 22, 2020, the workbench cluster will NOT support RStudio Server. Please see how to launch RStudio on workbench.
SSH Client Software
Secure Shell or SSH is a way of establishing a secure connection between two computers. It uses public-key cryptography to authenticate the user with the remote computer and to establish a secure connection. Its usual function involves logging in to a remote machine and executing commands. There are many SSH clients available for all operating systems:
Linux / Solaris / AIX / HP-UX / Unix:
- The ssh command is pre-installed. Log in using ssh myusername@workbench.rcac.purdue.edu from a terminal.
Microsoft Windows:
- MobaXterm is a small, easy to use, full-featured SSH client. It includes X11 support for remote displays, SFTP capabilities, and limited SSH authentication forwarding for keys.
Mac OS X:
- The ssh command is pre-installed. You may start a local terminal window from "Applications->Utilities". Log in by typing the command ssh myusername@workbench.rcac.purdue.edu.
When prompted for a password, enter your Purdue career account password followed by ",push". Your Purdue Duo client will then receive a notification to approve the login.
SSH Keys
General overview
To connect to Data Workbench using SSH keys, you must follow three high-level steps:
- Generate a key pair consisting of a private and a public key on your local machine.
- Copy the public key to the cluster and append it to the $HOME/.ssh/authorized_keys file in your account.
- Test if you can ssh from your local computer to the cluster without using your Purdue password.
Detailed steps for different operating systems and specific SSH client software are given below.
Mac and Linux:
- Run ssh-keygen in a terminal on your local machine. You may supply a filename and a passphrase for protecting your private key, but it is not mandatory. To accept the default settings, press Enter without specifying a filename.
Note: If you do not protect your private key with a passphrase, anyone with access to your computer could SSH to your account on Data Workbench.
- By default, the key files will be stored in ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub on your local machine.
- Copy the contents of the public key into $HOME/.ssh/authorized_keys on the cluster with the following command. When asked for a password, type your password followed by ",push". Your Purdue Duo client will receive a notification to approve the login.
ssh-copy-id -i ~/.ssh/id_rsa.pub myusername@workbench.rcac.purdue.edu
Note: use your actual Purdue account user name.
If your system does not have the ssh-copy-id command, use this instead:
cat ~/.ssh/id_rsa.pub | ssh myusername@workbench.rcac.purdue.edu "mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys"
- Test the new key by SSH-ing to the server. The login should now complete without asking for a password.
- If the private key has a non-default name or location, you need to specify the key by
ssh -i my_private_key_name myusername@workbench.rcac.purdue.edu
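Optionally, if you connect often, an entry in your local ~/.ssh/config file can select the key and username automatically. This is only a minimal sketch: the "workbench" alias, the user name, and the key path are placeholders you should adjust to your own setup.
Host workbench
    HostName workbench.rcac.purdue.edu
    User myusername
    IdentityFile ~/.ssh/id_rsa
With this in place, ssh workbench connects using the specified key.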
Windows:
Programs | Instructions |
---|---|
MobaXterm | Open a local terminal and follow Linux steps |
Git Bash | Follow Linux steps |
Windows 10 PowerShell | Follow Linux steps |
Windows 10 Subsystem for Linux | Follow Linux steps |
PuTTY | Follow steps below |
PuTTY:
- Launch PuTTYgen, keep the default key type (RSA) and length (2048 bits), and click the Generate button.
The "Generate" button can be found under the "Actions" section of the PuTTY Key Generator interface.
- Once the key pair is generated:
Use the "Save public key" button to save the public key, e.g. Documents\SSH_Keys\mylaptop_public_key.pub
Use the "Save private key" button to save the private key, e.g. Documents\SSH_Keys\mylaptop_private_key.ppk. When saving the private key, you can also choose a reminder comment, as well as an optional passphrase to protect your key. Note: If you do not protect your private key with a passphrase, anyone with access to your computer could SSH to your account on Data Workbench.
From the PuTTYgen menu, use the "Conversions -> Export OpenSSH key" tool to convert the private key into OpenSSH format, e.g. Documents\SSH_Keys\mylaptop_private_key.openssh, to be used later for ThinLinc.
- Configure PuTTY to use key-based authentication:
Launch PuTTY and navigate to "Connection -> SSH -> Auth" on the left panel, click the Browse button under the "Authentication parameters" section, and choose your private key, e.g. mylaptop_private_key.ppk.
Navigate back to "Session" on the left panel. Highlight "Default Settings" and click the "Save" button to keep the change in place.
- Connect to the cluster. When asked for a password, type your password followed by ",push". Your Purdue Duo client will receive a notification to approve the login. Copy the contents of the public key from PuTTYgen (the long string of letters and numbers in the text box at the top of the window) and paste it into $HOME/.ssh/authorized_keys on the cluster. Please double-check that your text editor did not wrap or fold the pasted value (it should be one very long line).
- Test by connecting to the cluster. If successful, you will not be prompted for a password or receive a Duo notification. If you protected your private key with a passphrase in step 2, you will instead be prompted to enter your chosen passphrase when connecting.
ThinLinc
RCAC provides Cendio's ThinLinc as an alternative to running an X11 server directly on your computer. It allows you to run graphical applications or graphical interactive jobs directly on Data Workbench through a persistent remote graphical desktop session.
ThinLinc is a service that allows you to connect to a persistent remote graphical desktop session. This service works very well over a high latency, low bandwidth, or off-campus connection compared to running an X11 server locally. It is also very helpful for Windows users who do not have an easy to use local X11 server, as little to no set up is required on your computer.
There are two ways to use ThinLinc: through the native client (preferred) or through a web browser.
Installing the ThinLinc native client
The native ThinLinc client will offer the best experience especially over off-campus connections and is the recommended method for using ThinLinc. It is compatible with Windows, Mac OS X, and Linux.
- Download the ThinLinc client from the ThinLinc website.
- Start the ThinLinc client on your computer.
- In the client's login window, use desktop.workbench.rcac.purdue.edu as the Server. Use your Purdue Career Account username and password, but append ",push" to your password.
- Click the Connect button.
- Your Purdue Duo client will receive a notification to approve your login.
- Continue to the following section on connecting to Data Workbench from ThinLinc.
Using ThinLinc through your web browser
The ThinLinc service can be accessed from your web browser as an alternative to installing the native client. This option requires no setup and is a good choice for computers where you do not have privileges to install software. All that is required is an up-to-date web browser. Older versions of Internet Explorer may not work.
- Open a web browser and navigate to desktop.workbench.rcac.purdue.edu.
- Log in with your Purdue Career Account username and password, but append ",push" to your password.
- You may safely proceed past any warning messages from your browser.
- Your Purdue Duo client will receive a notification to approve your login.
- Continue to the following section on connecting to Data Workbench from ThinLinc.
Connecting to Data Workbench from ThinLinc
- Once logged in, you will be presented with a remote Linux desktop running directly on a cluster front-end.
- Open the terminal application on the remote desktop.
- Once logged in to the Data Workbench head node, you may use graphical editors, debuggers, software like Matlab, or run graphical interactive jobs. For example, to test the X forwarding connection issue the following command to launch the graphical editor gedit:
$ gedit
- This session will remain persistent even if you disconnect from the session. Any interactive jobs or applications you left running will continue running even if you are not connected to the session.
Tips for using ThinLinc native client
- To exit a full screen ThinLinc session press the F8 key on your keyboard (fn + F8 key for Mac users) and click to disconnect or exit full screen.
- Full screen mode can be disabled when connecting to a session by clicking the Options button and disabling full screen mode from the Screen tab.
Configure ThinLinc to use SSH Keys
- The web client does NOT support public-key authentication.
- The ThinLinc native client supports the use of an SSH key pair. For help generating and uploading keys to the cluster, see the SSH Keys section in our user guide.
To set up SSH key authentication on the ThinLinc client:
- Open the Options panel, and select Public key as your authentication method on the Security tab.
The "Options..." button in the ThinLinc Client can be found towards the bottom left, above the "Connect" button.
- In the options dialog, switch to the "Security" tab and select the "Public key" radio button. The "Public key" option can be found in the "Authentication method" options group.
- Click OK to return to the ThinLinc Client login window. You should now see a Key field in place of the Password field.
- In the Key field, type the path to your locally stored private key or click the ... button to locate and select the key on your local system. Note: If PuTTY was used to generate the SSH key pair, please choose the private key in OpenSSH format.
SSH X11 Forwarding
SSH supports tunneling of X11 (X-Windows). If you have an X11 server running on your local machine, you may use X11 applications on remote systems and have their graphical displays appear on your local machine. These X11 connections are tunneled and encrypted automatically by your SSH client.
Installing an X11 Server
To use X11, you will need to have a local X11 server running on your personal machine. Both free and commercial X11 servers are available for various operating systems.
Linux / Solaris / AIX / HP-UX / Unix:
- An X11 server is at the core of all graphical sessions. If you are logged in to a graphical environment on these operating systems, you are already running an X11 server.
- ThinLinc is an alternative to running an X11 server directly on your Linux computer. ThinLinc is a service that allows you to connect to a persistent remote graphical desktop session.
Microsoft Windows:
- ThinLinc is an alternative to running an X11 server directly on your Windows computer. ThinLinc is a service that allows you to connect to a persistent remote graphical desktop session.
- MobaXterm is a small, easy to use, full-featured SSH client. It includes X11 support for remote displays, SFTP capabilities, and limited SSH authentication forwarding for keys.
Mac OS X:
- X11 is available as an optional install on the Mac OS X install disks prior to 10.7/Lion. Run the installer, select the X11 option, and follow the instructions. For 10.7+ please download XQuartz.
- ThinLinc is an alternative to running an X11 server directly on your Mac computer. ThinLinc is a service that allows you to connect to a persistent remote graphical desktop session.
Enabling X11 Forwarding in your SSH Client
Once you are running an X11 server, you will need to enable X11 forwarding/tunneling in your SSH client:
- ssh: X11 tunneling should be enabled by default. To be certain it is enabled, you may use ssh -Y.
- MobaXterm: Select "New session" and "SSH." Under "Advanced SSH Settings" check the box for X11 Forwarding.
SSH will set the remote environment variable $DISPLAY to "localhost:XX.YY" when this is working correctly. If you had previously set your $DISPLAY environment variable to your local IP or hostname, you must remove any set/export/setenv of this variable from your login scripts. The environment variable $DISPLAY must be left as SSH sets it, which is to a random local port address. Setting $DISPLAY to an IP or hostname will not work.
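For example, after connecting with X11 forwarding enabled, you can check the tunneled display before launching a graphical program (the display number shown is only illustrative and will vary):
$ ssh -Y myusername@workbench.rcac.purdue.edu
$ echo $DISPLAY
localhost:10.0
$ gedit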
Purchasing Nodes
RCAC operates a significant shared cluster computing infrastructure developed over several years through focused acquisitions using funds from grants, faculty startup packages, and institutional sources. These "community clusters" are now at the foundation of Purdue's research cyberinfrastructure.
We strongly encourage any Purdue faculty or staff with computational needs to join this growing community and enjoy the enormous benefits this shared infrastructure provides:
- Peace of Mind
RCAC system administrators take care of security patches, attempted hacks, operating system upgrades, and hardware repair so faculty and graduate students can concentrate on research.
- Low Overhead
RCAC data centers provide infrastructure such as networking, racks, floor space, cooling, and power.
- Cost Effective
RCAC works with vendors to obtain the best price for computing resources by pooling funds from different disciplines to leverage greater group purchasing power.
Through the Community Cluster Program, Purdue affiliates have invested several million dollars in computational and storage resources from Q4 2006 to the present with great success in both the research accomplished and the money saved on equipment purchases.
For more information or to purchase access to our latest cluster today, see the Purchase page. Have questions? Contact us at rcac-cluster-purchase@lists.purdue.edu to discuss.
File Storage and Transfer
Learn more about file storage and transfer for Data Workbench.
Archive and Compression
There are several options for archiving and compressing groups of files or directories. The most commonly used options are:
tar
See the official documentation for tar for more information.
Saves many files together into a single archive file, and restores individual files from the archive. Includes automatic archive compression/decompression options and special features for incremental and full backups.
Examples:
(list contents of archive somefile.tar)
$ tar tvf somefile.tar
(extract contents of somefile.tar)
$ tar xvf somefile.tar
(extract contents of gzipped archive somefile.tar.gz)
$ tar xzvf somefile.tar.gz
(extract contents of bzip2 archive somefile.tar.bz2)
$ tar xjvf somefile.tar.bz2
(archive all ".c" files in current directory into one archive file)
$ tar cvf somefile.tar *.c
(archive and gzip-compress all files in a directory into one archive file)
$ tar czvf somefile.tar.gz somedirectory/
(archive and bzip2-compress all files in a directory into one archive file)
$ tar cjvf somefile.tar.bz2 somedirectory/
Other arguments for tar can be explored by using the man tar command.
gzip
The standard compression system for all GNU software.
Examples:
(compress file somefile - also removes uncompressed file)
$ gzip somefile
(uncompress file somefile.gz - also removes compressed file)
$ gunzip somefile.gz
bzip2
See the official documentation for bzip2 for more information.
Strong, lossless data compressor based on the Burrows-Wheeler transform. Stronger compression than gzip.
Examples:
(compress file somefile - also removes uncompressed file)
$ bzip2 somefile
(uncompress file somefile.bz2 - also removes compressed file)
$ bunzip2 somefile.bz2
There are several other, less commonly used, options available as well:
- zip
- 7zip
- xz
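For example, basic usage of zip and xz (assuming these tools are installed on the system) follows the same pattern as above:
(zip-compress a directory into somefile.zip, then extract it)
$ zip -r somefile.zip somedirectory/
$ unzip somefile.zip
(compress file somefile with xz - also removes uncompressed file)
$ xz somefile
(uncompress file somefile.xz - also removes compressed file)
$ unxz somefile.xz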
Environment Variables
Several environment variables are automatically defined for you to help you manage your storage. Use environment variables instead of actual paths whenever possible to avoid problems if the specific paths to any of these change.
Name | Description |
---|---|
HOME | path to your home directory |
PWD | path to your current directory |
RCAC_SCRATCH | path to scratch filesystem |
By convention, environment variable names are all uppercase. You may use them on the command line or in any scripts in place of and in combination with hard-coded values:
$ ls $HOME
...
$ ls $RCAC_SCRATCH/myproject
...
To find the value of any environment variable:
$ echo $RCAC_SCRATCH
${resource.scratch}/m/myusername
To list the values of all environment variables:
$ env
USER=myusername
HOME=/home/myusername
RCAC_SCRATCH=${resource.scratch}/m/myusername
...
You may create or overwrite an environment variable. To pass (export) the value of a variable in bash:
$ export MYPROJECT=$RCAC_SCRATCH/myproject
To assign a value to an environment variable in either tcsh or csh:
$ setenv MYPROJECT value
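For example, a variable defined this way can be combined with the predefined variables in later commands (the myproject directory and file names here are only illustrations):
$ export MYPROJECT=$RCAC_SCRATCH/myproject
$ mkdir -p $MYPROJECT
$ cp input_data.txt $MYPROJECT/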
Storage Options
File storage options on RCAC systems include long-term storage (home directories, Fortress) and short-term storage (scratch directories, /tmp directory). Each option has different performance and intended uses, and some options vary from system to system as well. Daily snapshots of home directories are provided for a limited time for accidental deletion recovery. Scratch directories and temporary storage are not backed up and old files are regularly purged from scratch and /tmp directories. More details about each storage option appear below.
Home Directory
Home directories are provided for long-term file storage. Each user has one home directory. You should use your home directory for storing important program files, scripts, input data sets, critical results, and frequently used files. You should store infrequently used files on Fortress. Your home directory becomes your current working directory, by default, when you log in.
Daily snapshots of your home directory are provided for a limited period of time in the event of accidental deletion. For additional security, you should store another copy of your files on more permanent storage, such as the Fortress HPSS Archive.
Your home directory physically resides on a GPFS storage system in the data center. To find the path to your home directory, first log in then immediately enter the following:
$ pwd
/home/myusername
Or from any subdirectory:
$ echo $HOME
/home/myusername
Your home directory and its contents are available on all RCAC machines, including front-end hosts and compute nodes.
Your home directory has a quota limiting the total size of files you may store within. For more information, refer to the Storage Quotas / Limits Section.
Lost File Recovery
Nightly snapshots for 7 days, weekly snapshots for 4 weeks, and monthly snapshots for 3 months are kept. This means you will find snapshots from the last 7 nights, the last 4 Sundays, and the first day of each of the last 3 months. Files are available going back between two and three months, depending on how long ago the last first of the month was. Snapshots beyond this are not kept. For additional security, you should store another copy of your files on more permanent storage, such as the Fortress HPSS Archive.
Performance
Your home directory is medium-performance, non-purged space suitable for tasks like sharing data, editing files, developing and building software, and many other uses.
Your home directory is not designed or intended for use as high-performance working space for running data-intensive jobs with heavy I/O demands.
Long-Term Storage
Long-term storage, or permanent storage, is available to users on the High Performance Storage System (HPSS), an archival storage system called Fortress. Program files, data files, and any other files which are not used often, but which must be saved, can be put in permanent storage. Fortress currently has over 10PB of capacity.
For more information about Fortress, how it works, user guides, and how to obtain an account:
/tmp Directory
/tmp directories are provided for short-term file storage only. Each front-end and compute node has a /tmp directory. Your program may write temporary data to the /tmp directory of the compute node on which it is running. That data is available for as long as your program is active. Once your program terminates, that temporary data is no longer available. When used properly, /tmp may provide faster local storage to an active process than any other storage option. You should use your home directory and Fortress for longer-term storage or for holding critical results.
Backups are not performed for the /tmp directory, and the system removes files from /tmp whenever space is low or whenever the system needs a reboot. In the event of a disk crash or file purge, files in /tmp are not recoverable. You should copy any important files to more permanent storage.
Storage Quota / Limits
Some limits are imposed on your disk usage on research systems. A quota is implemented on each filesystem. Each filesystem (home directory, scratch directory, etc.) may have a different limit. If you exceed the quota, you will not be able to save new files or new data to the filesystem until you delete or move data to long-term storage.
Checking Quota
To check the current quotas of your home and scratch directories, check the My Quota page or use the myquota command:
$ myquota
Type Filesystem Size Limit Use Files Limit Use
==============================================================================
home myusername 5.0GB 25.0GB 20% - - -
The columns are as follows:
- Type: indicates home or scratch directory.
- Filesystem: name of storage option.
- Size: sum of file sizes in bytes.
- Limit: allowed maximum on sum of file sizes in bytes.
- Use: percentage of file-size limit currently in use.
- Files: number of files and directories (not the size).
- Limit: allowed maximum on number of files and directories. It is possible, though unlikely, to reach this limit and not the file-size limit if you create a large number of very small files.
- Use: percentage of file-number limit currently in use.
If you find that you reached your quota in either your home directory or your scratch file directory, obtain estimates of your disk usage. Find the top-level directories which have a high disk usage, then study the subdirectories to discover where the heaviest usage lies.
To see in a human-readable format an estimate of the disk usage of your top-level directories in your home directory:
$ du -h --max-depth=1 $HOME
32K /home/myusername/mysubdirectory_1
529M /home/myusername/mysubdirectory_2
608K /home/myusername/mysubdirectory_3
The second directory is the largest of the three, so apply the du command to it as well.
To see in a human-readable format an estimate of the disk usage of your top-level directories in your scratch file directory:
$ du -h --max-depth=1 $RCAC_SCRATCH
K ${resource.scratch}/m/myusername
This strategy can be very helpful in figuring out the location of your largest usage. Move unneeded files and directories to long-term storage to free space in your home and scratch directories.
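If a directory contains many subdirectories, sorting the output can make the largest ones easier to spot (this assumes GNU sort with the -h option, which is available on typical Linux systems):
$ du -h --max-depth=1 $HOME | sort -h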
Increasing Quota
Home Directory
If you find you need additional disk space in your home directory, please first consider archiving and compressing old files and moving them to long-term storage on the Fortress HPSS Archive. Unfortunately, it is not possible to increase your home directory quota beyond its current level.
Sharing Files from Data Workbench
Data Workbench supports several methods for file sharing. Use the links below to learn more about these methods.
Sharing Data with Globus
Data on any RCAC resource can be shared with other users within Purdue or with collaborators at other institutions. Globus allows convenient sharing of data with outside collaborators. Data can be shared with collaborators' personal computers or directly with many other computing resources at other institutions.
To share files, login to https://transfer.rcac.purdue.edu, navigate to the endpoint (collection) of your choice, and follow instructions as described in Globus documentation on how to share data:
See also RCAC Globus presentation.
File Transfer
Data Workbench supports several methods for file transfer. Use the links below to learn more about these methods.
SCP
SCP (Secure CoPy) is a simple way of transferring files between two machines that use the SSH protocol. SCP is available as a protocol choice in some graphical file transfer programs and also as a command line program on most Linux, Unix, and Mac OS X systems. SCP can copy single files, but will also recursively copy directory contents if given a directory name.
After Aug 17, 2020, the community clusters will not support password-based authentication for login. Methods that can be used include two-factor authentication (Purdue Login) or SSH keys. If you do not have SSH keys installed, you would need to type your Purdue Login response into the SCP client's "Password" prompt.
Command-line usage:
You can transfer files both to and from Data Workbench while initiating an SCP session on either some other computer or on Data Workbench (in other words, directionality of connection and directionality of data flow are independent from each other). The scp command appears somewhat similar to the familiar cp command, with an extra user@host:file syntax to denote files and directories on a remote host. Either Data Workbench or another computer can be a remote.
Example: Initiating SCP session on some other computer (i.e. you are on some other computer, connecting to Data Workbench):
(transfer TO Data Workbench)
(Individual files)
$ scp sourcefile myusername@workbench.rcac.purdue.edu:somedir/destinationfile
$ scp sourcefile myusername@workbench.rcac.purdue.edu:somedir/
(Recursive directory copy)
$ scp -pr sourcedirectory/ myusername@workbench.rcac.purdue.edu:somedir/
(transfer FROM Data Workbench)
(Individual files)
$ scp myusername@workbench.rcac.purdue.edu:somedir/sourcefile destinationfile
$ scp myusername@workbench.rcac.purdue.edu:somedir/sourcefile somedir/
(Recursive directory copy)
$ scp -pr myusername@workbench.rcac.purdue.edu:sourcedirectory somedir/
The -p flag is optional. When used, it will cause the transfer to preserve file attributes and permissions. The -r flag is required for recursive transfers of entire directories.
Example: Initiating SCP session on Data Workbench (i.e. you are on Data Workbench, connecting to some other computer):
(transfer TO Data Workbench)
(Individual files)
$ scp myusername@another.computer.example.com:sourcefile somedir/destinationfile
$ scp myusername@another.computer.example.com:sourcefile somedir/
(Recursive directory copy)
$ scp -pr myusername@another.computer.example.com:sourcedirectory/ somedir/
(transfer FROM Data Workbench)
(Individual files)
$ scp somedir/sourcefile myusername@another.computer.example.com:destinationfile
$ scp somedir/sourcefile myusername@another.computer.example.com:somedir/
(Recursive directory copy)
$ scp -pr sourcedirectory myusername@another.computer.example.com:somedir/
The -p flag is optional. When used, it will cause the transfer to preserve file attributes and permissions. The -r flag is required for recursive transfers of entire directories.
Software (SCP clients)
Linux and other Unix-like systems:
- The scp command-line program should already be installed.
Microsoft Windows:
- MobaXterm: Free, full-featured, graphical Windows SSH, SCP, and SFTP client.
- Command-line scp program can be installed as part of Windows Subsystem for Linux (WSL), or Git-Bash.
Mac OS X:
- The scp command-line program should already be installed. You may start a local terminal window from "Applications->Utilities".
- Cyberduck is a full-featured and free graphical SFTP and SCP client.
Globus
Globus, previously known as Globus Online, is a powerful and easy to use file transfer service for transferring files virtually anywhere. It works within RCAC's various research storage systems; it connects between RCAC and remote research sites running Globus; and it connects research systems to personal systems. You may use Globus to connect to your home, scratch, and Fortress storage directories. Since Globus is web-based, it works on any operating system that is connected to the internet. The Globus Personal client is available on Windows, Linux, and Mac OS X. It is primarily used as a graphical means of transfer but it can also be used over the command line.
Globus Web:
- Navigate to http://transfer.rcac.purdue.edu
- Click "Proceed" to log in with your Purdue Career Account.
- On your first login it will ask to make a connection to a Globus account. Accept the conditions.
- Now you are at the main screen. Click "File Transfer" which will bring you to a two-panel interface (if you only see one panel, you can use selector in the top-right corner to switch the view).
- You will need to select one collection and file path on one side as the source, and the second collection on the other as the destination. This can be one of several Purdue endpoints, or another University, or even your personal computer (see Personal Client section below).
The RCAC collections are as follows. A search for "Purdue" will give you several suggested results you can choose from, or you can give a more specific search.
- Home Directory storage: "Purdue Research Computing - Home Directories", however, you can start typing "Purdue" and "Home Directories" and it will suggest appropriate matches.
- Data Workbench scratch storage: "Purdue Data Workbench Cluster", however, you can start typing "Purdue" and "Data Workbench" and it will suggest appropriate matches. From here you will need to navigate into the first letter of your username, and then into your username.
- Research Data Depot: "Purdue Research Computing - Data Depot", a search for "Depot" should provide appropriate matches to choose from.
- Fortress: "Purdue Fortress HPSS Archive", a search for "Fortress" should provide appropriate matches to choose from.
From here, select a file or folder in either side of the two-pane window, and then use the arrows in the top-middle of the interface to instruct Globus to move files from one side to the other. You can transfer files in either direction. You will receive an email once the transfer is completed.
Globus Personal Client setup:
Globus Connect Personal is a small software tool you can install to make your own computer a Globus endpoint on its own. It is useful if you need to transfer files via Globus to and from your computer directly.
- On the "Collections" page from earlier, click "Get Globus Connect Personal" or download a version for your operating system it from here: Globus Connect Personal
- Name this particular personal system and follow the setup prompts to create your Globus Connect Personal endpoint.
- Your personal system is now available as a collection within the Globus transfer interface.
Globus Command Line:
Globus supports a command line interface, allowing advanced automation of your transfers.
To use the recommended standalone Globus CLI application (the globus command):
- First time use: issue the globus login command and follow instructions for initial login.
- Commands for interfacing with the CLI can be found via Using the Command Line Interface, as well as the Globus CLI Examples pages.
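As a rough sketch of typical CLI usage (the endpoint IDs and paths below are placeholders; consult the Globus CLI pages mentioned above for exact syntax and options):
$ globus login
$ globus endpoint search "Purdue Data Workbench"
$ globus ls <endpoint-id>:/path/to/directory
$ globus transfer <source-endpoint-id>:/path/to/file <destination-endpoint-id>:/path/to/file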
Sharing Data with Outside Collaborators
Globus allows convenient sharing of data with outside collaborators. Data can be shared with collaborators' personal computers or directly with many other computing resources at other institutions. See the Globus documentation on how to share data:
For links to more information, please see Globus Support page and RCAC Globus presentation.
Windows Network Drive / SMB
SMB (Server Message Block), also known as CIFS, is an easy to use file transfer protocol that is useful for transferring files between RCAC systems and a desktop or laptop. You may use SMB to connect to your home, scratch, and Fortress storage directories. The SMB protocol is available on Windows, Linux, and Mac OS X. It is primarily used as a graphical means of transfer but it can also be used over the command line.
Note: to access Data Workbench through SMB file sharing, you must be on a Purdue campus network or connected through VPN.
Windows:
- Windows 7: Click Windows menu > Computer, then click Map Network Drive in the top bar
- Windows 8 & 10: Tap the Windows key, type computer, select This PC, click Computer > Map Network Drive in the top bar
- In the folder location enter the following information and click Finish:
- To access your home directory, enter \\home.rcac.purdue.edu\myusername.
- Note: Use your career account login name and password when prompted. (You will not need to add ",push" nor use your Purdue Duo client.)
- Your home directory should now be mounted as a drive in the Computer window.
Mac OS X:
- In the Finder, click Go > Connect to Server
- In the Server Address enter the following information and click Connect:
- To access your home directory, enter smb://home.rcac.purdue.edu/myusername.
- Note: Use your career account login name and password when prompted. (You will not need to add ",push" nor use your Purdue Duo client.)
Linux:
- There are several graphical methods to connect in Linux depending on your desktop environment. Once you find out how to connect to a network server on your desktop environment, choose the Samba/SMB protocol and adapt the information from the Mac OS X section to connect.
- If you would like access via samba on the command line you may install smbclient which will give you FTP-like access and can be used as shown below. For all the possible ways to connect look at the Mac OS X instructions.
smbclient //home.rcac.purdue.edu/myusername -U myusername
- Note: Use your career account login name and password when prompted. (You will not need to add ",push" nor use your Purdue Duo client.)
FTP / SFTP
FTP is not supported on any research systems because it does not allow for secure transmission of data. Use SFTP instead, as described below.
SFTP (Secure File Transfer Protocol) is a reliable way of transferring files between two machines. SFTP is available as a protocol choice in some graphical file transfer programs and also as a command-line program on most Linux, Unix, and Mac OS X systems. SFTP has more features than SCP and allows for other operations on remote files, remote directory listing, and resuming interrupted transfers. Command-line SFTP cannot recursively copy directory contents; to do so, try using SCP or a graphical SFTP client.
After Aug 17, 2020, the community clusters will not support password-based authentication for login. Methods that can be used include two-factor authentication (Purdue Login) or SSH keys. If you do not have SSH keys installed, you would need to type your Purdue Login response into the SFTP's "Password" prompt.
Command-line usage
You can transfer files both to and from Data Workbench while initiating an SFTP session on either some other computer or on Data Workbench (in other words, directionality of connection and directionality of data flow are independent from each other). Once the connection is established, you use put or get subcommands between "local" and "remote" computers. Either Data Workbench or another computer can be a remote.
Example: Initiating SFTP session on some other computer (i.e. you are on another computer, connecting to Data Workbench):
$ sftp myusername@workbench.rcac.purdue.edu
(transfer TO Data Workbench)
sftp> put sourcefile somedir/destinationfile
sftp> put -P sourcefile somedir/
(transfer FROM Data Workbench)
sftp> get sourcefile somedir/destinationfile
sftp> get -P sourcefile somedir/
sftp> exit
The -P flag is optional. When used, it will cause the transfer to preserve file attributes and permissions.
Example: Initiating SFTP session on Data Workbench (i.e. you are on Data Workbench, connecting to some other computer):
$ sftp myusername@another.computer.example.com
(transfer TO Data Workbench)
sftp> get sourcefile somedir/destinationfile
sftp> get -P sourcefile somedir/
(transfer FROM Data Workbench)
sftp> put sourcefile somedir/destinationfile
sftp> put -P sourcefile somedir/
sftp> exit
The -P flag is optional. When used, it will cause the transfer to preserve file attributes and permissions.
Software (SFTP clients)
Linux and other Unix-like systems:
- The sftp command-line program should already be installed.
Microsoft Windows:
- MobaXterm: Free, full-featured, graphical Windows SSH, SCP, and SFTP client.
- Command-line sftp program can be installed as part of Windows Subsystem for Linux (WSL), or Git-Bash.
Mac OS X:
- The sftp command-line program should already be installed. You may start a local terminal window from "Applications->Utilities".
- Cyberduck is a full-featured and free graphical SFTP and SCP client.
Software
Environment module
Software catalog
Compiling Source Code
Documentation on compiling source code on Data Workbench.
Compiling Serial Programs
A serial program is a single process which executes as a sequential stream of instructions on one processor core. Compilers capable of serial programming are available for C, C++, and versions of Fortran.
Here are a few sample serial programs:
- serial_hello.f
- serial_hello.f90
- serial_hello.f95
- serial_hello.c
To load a compiler, enter one of the following:
$ module load intel
$ module load gcc
Language | Intel Compiler | GNU Compiler |
---|---|---|
Fortran 77 | $ ifort myprogram.f -o myprogram | $ gfortran myprogram.f -o myprogram |
Fortran 90 | $ ifort myprogram.f90 -o myprogram | $ gfortran myprogram.f90 -o myprogram |
Fortran 95 | $ ifort myprogram.f90 -o myprogram | $ gfortran myprogram.f95 -o myprogram |
C | $ icc myprogram.c -o myprogram | $ gcc myprogram.c -o myprogram |
C++ | $ icpc myprogram.cpp -o myprogram | $ g++ myprogram.cpp -o myprogram |
The Intel and GNU compilers will not output anything for a successful compilation. Also, the Intel compiler does not recognize the suffix ".f95".
Compiling OpenMP Programs
All compilers installed on Data Workbench include OpenMP functionality for C, C++, and Fortran. An OpenMP program is a single process that takes advantage of a multi-core processor and its shared memory to achieve a form of parallel computing called multithreading. It distributes the work of a process over processor cores in a single compute node without the need for MPI communications.
Language | Header Files |
---|---|
Fortran 77 | INCLUDE 'omp_lib.h' |
Fortran 90 | use omp_lib |
Fortran 95 | use omp_lib |
C | #include <omp.h> |
C++ | #include <omp.h> |
Sample programs illustrate task parallelism of OpenMP:
A sample program illustrates loop-level (data) parallelism of OpenMP:
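The sample programs themselves are not reproduced here, but a minimal loop-level OpenMP example in C (illustrative only; the file and variable names are arbitrary) looks like this:
#include <omp.h>
#include <stdio.h>

int main(void) {
    int i;
    int n = 8;
    double a[8];
    /* distribute the loop iterations across the threads in the team */
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        a[i] = 2.0 * i;
        printf("thread %d computed a[%d]\n", omp_get_thread_num(), i);
    }
    return 0;
}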
To load a compiler, enter one of the following:
$ module load intel
$ module load gcc
Language | Intel Compiler | GNU Compiler |
---|---|---|
Fortran 77 | $ ifort -qopenmp myprogram.f -o myprogram | $ gfortran -fopenmp myprogram.f -o myprogram |
Fortran 90 | $ ifort -qopenmp myprogram.f90 -o myprogram | $ gfortran -fopenmp myprogram.f90 -o myprogram |
Fortran 95 | $ ifort -qopenmp myprogram.f90 -o myprogram | $ gfortran -fopenmp myprogram.f95 -o myprogram |
C | $ icc -qopenmp myprogram.c -o myprogram | $ gcc -fopenmp myprogram.c -o myprogram |
C++ | $ icpc -qopenmp myprogram.cpp -o myprogram | $ g++ -fopenmp myprogram.cpp -o myprogram |
The Intel and GNU compilers will not output anything for a successful compilation. Also, the Intel compiler does not recognize the suffix ".f95".
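At run time, the number of OpenMP threads can be controlled with the OMP_NUM_THREADS environment variable. For example, to run a compiled OpenMP program (named myprogram here purely for illustration) with 24 threads:
$ export OMP_NUM_THREADS=24
$ ./myprogram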
Here is some more documentation from other sources on OpenMP:
Intel MKL Library
Intel Math Kernel Library (MKL) contains ScaLAPACK, LAPACK, Sparse Solver, BLAS, Sparse BLAS, CBLAS, GMP, FFTs, DFTs, VSL, VML, and Interval Arithmetic routines. MKL resides in the directory stored in the environment variable MKL_HOME, after loading a version of the Intel compiler with module.
By using module load to load an Intel compiler your environment will have several variables set up to help link applications with MKL. Here are some example combinations of simplified linking options:
$ module load intel
$ echo $LINK_LAPACK
-L${MKL_HOME}/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
$ echo $LINK_LAPACK95
-L${MKL_HOME}/lib/intel64 -lmkl_lapack95_lp64 -lmkl_blas95_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
RCAC recommends you use the provided variables to define MKL linking options in your compiling procedures. The Intel compiler modules also provide two other environment variables, LINK_LAPACK_STATIC and LINK_LAPACK95_STATIC that you may use if you need to link MKL statically.
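For example, a C program could be compiled and linked against MKL's LAPACK interfaces using the provided variable (the source file name is illustrative):
$ icc myprogram.c -o myprogram $LINK_LAPACK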
RCAC recommends that you use dynamic linking of libguide. If so, define LD_LIBRARY_PATH such that you are using the correct version of libguide at run time. If you use static linking of libguide, then:
- If you use the Intel compilers, link in the libguide version that comes with the compiler (use the -openmp option).
- If you do not use the Intel compilers, link in the libguide version that comes with the Intel MKL above.
Here is some more documentation from other sources on the Intel MKL:
Provided Compilers
Compilers are available on Data Workbench for Fortran, C, and C++. Compiler sets from Intel and GNU are installed.
Detailed documentation on each compiler set available on Data Workbench follows.
On Data Workbench, the following set of compilers and libraries for building code is recommended:
- Intel 17.0.1.132
- MKL
To load the recommended set:
$ module load rcac
$ module list
More information about using these compilers:
GNU Compilers
The official name of the GNU compilers is "GNU Compiler Collection" or "GCC". To discover which versions are available:
$ module avail gcc
Choose an appropriate GCC module and load it. For example:
$ module load gcc
An older version of the GNU compiler will be in your path by default. Do NOT use this version. Instead, load a newer version using the command module load gcc.
Language | Serial Program | OpenMP Program |
---|---|---|
Fortran77 | $ gfortran myprogram.f -o myprogram | $ gfortran -fopenmp myprogram.f -o myprogram |
Fortran90 | $ gfortran myprogram.f90 -o myprogram | $ gfortran -fopenmp myprogram.f90 -o myprogram |
Fortran95 | $ gfortran myprogram.f95 -o myprogram | $ gfortran -fopenmp myprogram.f95 -o myprogram |
C | $ gcc myprogram.c -o myprogram | $ gcc -fopenmp myprogram.c -o myprogram |
C++ | $ g++ myprogram.cpp -o myprogram | $ g++ -fopenmp myprogram.cpp -o myprogram |
More information on compiler options appears in the official man pages, which are accessible with the man command after loading the appropriate compiler module.
For more documentation on the GCC compilers:
Intel Compilers
One or more versions of the Intel compiler are available on Data Workbench. To discover which ones:
$ module avail intel
Choose an appropriate Intel module and load it. For example:
$ module load intel
Language | Serial Program | OpenMP Program |
---|---|---|
Fortran77 | $ ifort myprogram.f -o myprogram | $ ifort -qopenmp myprogram.f -o myprogram |
Fortran90 | $ ifort myprogram.f90 -o myprogram | $ ifort -qopenmp myprogram.f90 -o myprogram |
Fortran95 | (same as Fortran 90) | (same as Fortran 90) |
C | $ icc myprogram.c -o myprogram | $ icc -qopenmp myprogram.c -o myprogram |
C++ | $ icpc myprogram.cpp -o myprogram | $ icpc -qopenmp myprogram.cpp -o myprogram |
More information on compiler options appears in the official man pages, which are accessible with the man command after loading the appropriate compiler module.
For more documentation on the Intel compilers:
Running Jobs
SLURM performs job scheduling. Jobs may be any type of program. You may use either the batch or interactive mode to run your jobs. Use the batch mode for finished programs; use the interactive mode only for debugging.
In this section, you'll find a few pages describing the basics of creating and submitting SLURM jobs, as well as a number of example SLURM jobs that you may be able to adapt to your own needs.
Example Jobs
A number of example jobs are available for you to look over and adapt to your own needs. The first few are generic examples, and latter ones go into specifics for particular software packages.
Specific Applications
The following examples demonstrate job submission files for some common real-world applications.
Python
Notice: Python 2.7 has reached end-of-life on Jan 1, 2020 (announcement). Please update your codes and your job scripts to use Python 3.
Python is a high-level, general-purpose, interpreted, dynamic programming language. We suggest using Anaconda which is a Python distribution made for large-scale data processing, predictive analytics, and scientific computing. For example, to use the default Anaconda distribution:
$ module load anaconda
For a full list of available Anaconda and Python modules enter:
$ module spider anaconda
Managing Environments with Conda
Conda is a package manager in Anaconda that allows you to create and manage multiple environments where you can pick and choose which packages you want to use. To use Conda you must load an Anaconda module:
$ module load anaconda
Many packages are pre-installed in the global environment. To see these packages:
$ conda list
To create your own custom environment:
$ conda create --name MyEnvName python=3.8 FirstPackageName SecondPackageName -y
The --name option specifies that the environment created will be named MyEnvName. You can include as many packages as you require separated by a space. Including the -y option lets you skip the prompt to install the package. By default environments are created and stored in the $HOME/.conda directory.
To create an environment at a custom location:
$ conda create --prefix=$HOME/MyEnvName python=3.8 PackageName -y
To see a list of your environments:
$ conda env list
To remove unwanted environments:
$ conda remove --name MyEnvName --all
To add packages to your environment:
$ conda install --name MyEnvName PackageNames
To remove a package from an environment:
$ conda remove --name MyEnvName PackageName
Installing packages when creating your environment, instead of one at a time, will help you avoid dependency issues.
To activate or deactivate an environment you have created:
$ source activate MyEnvName
$ source deactivate MyEnvName
If you created your conda environment at a custom location using --prefix option, then you can activate or deactivate it using the full path.
$ source activate $HOME/MyEnvName
$ source deactivate $HOME/MyEnvName
To use a custom environment inside a job you must load the module and activate the environment inside your job submission script. Add the following lines to your submission script:
$ module load anaconda
$ source activate MyEnvName
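For example, a simple shell script (the script and file names are purely illustrative) that prepares the environment before running a Python program might look like this:
#!/bin/bash
# load Anaconda and activate the custom environment
module load anaconda
source activate MyEnvName
# run the analysis
python my_analysis.py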
For more information about Python:
Managing Packages with Pip
Pip is a Python package manager. Much Python package documentation provides pip instructions that result in permission errors, because by default pip will try to install into a system-wide location and fail.
Exception:
Traceback (most recent call last):
... ... stack trace ... ...
OSError: [Errno 13] Permission denied: '/apps/cent7/anaconda/2020.07-py38/lib/python3.8/site-packages/mkl_random-1.1.1.dist-info'
If you encounter this error, it means that you cannot modify the global Python installation. We recommend installing Python packages in a conda environment. Detailed instructions for installing packages with pip can be found in our Python package installation page.
Below we list some other useful pip commands.
- Search for a package in PyPI channels:
$ pip search packageName
- Check which packages are installed globally:
$ pip list
- Check which packages you have personally installed:
$ pip list --user
- Snapshot installed packages:
$ pip freeze > requirements.txt
- You can install packages from a snapshot inside a new conda environment. Make sure to load the appropriate conda environment first.
$ pip install -r requirements.txt
For more information about Python:
Installing Packages
Installing Python packages in an Anaconda environment is recommended. One key advantage of Anaconda is that it allows users to install unrelated packages in separate self-contained environments. Individual packages can later be reinstalled or updated without impacting others. If you are unfamiliar with Conda environments, please check our Conda Guide.
To facilitate the process of creating and using Conda environments, we support a script (conda-env-mod) that generates a module file for an environment, as well as an optional Jupyter kernel to use this environment in a JupyterHub notebook.
You must load one of the anaconda modules in order to use this script.
$ module load anaconda
Step-by-step instructions for installing custom Python packages are presented below.
Step 1: Create a conda environment
Users can use the conda-env-mod script to create an empty conda environment. This script needs either a name or a path for the desired environment. After the environment is created, it generates a module file for using it in future. Please note that conda-env-mod is different from the official conda-env script and supports a limited set of subcommands. Detailed instructions for using conda-env-mod can be found with the command conda-env-mod --help.
- Example 1: Create a conda environment named mypackages in user's $HOME directory.
$ conda-env-mod create -n mypackages
- Example 2: Create a conda environment named mypackages at a custom location.
$ conda-env-mod create -p /depot/mylab/apps/mypackages
Please follow the on-screen instructions while the environment is being created. After finishing, the script will print the instructions to use this environment.
... ... ...
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
+------------------------------------------------------+
| To use this environment, load the following modules: |
|     module load use.own                              |
|     module load conda-env/mypackages-py3.8.5         |
+------------------------------------------------------+
Your environment "mypackages" was created successfully.
Note down the module names, as you will need to load these modules every time you want to use this environment. You may also want to add the module load lines in your jobscript, if it depends on custom Python packages.
By default, module files are generated in your $HOME/privatemodules directory. The location of module files can be customized by specifying the -m /path/to/modules option to conda-env-mod.
Note: The main differences between -p and -m are: 1) -p will change the location of packages to be installed for the env and the module file will still be located at the $HOME/privatemodules directory as defined in use.own. 2) -m will only change the location of the module file. So the method to load modules created with -m and -p are different, see Example 3 for details.
- Example 3: Create a conda environment named labpackages in your group's Data Depot space and place the module file at a shared location for the group to use.
$ conda-env-mod create -p /depot/mylab/apps/labpackages -m /depot/mylab/etc/modules
... ... ...
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
+-------------------------------------------------------+
| To use this environment, load the following modules:  |
|     module use /depot/mylab/etc/modules               |
|     module load conda-env/labpackages-py3.8.5         |
+-------------------------------------------------------+
Your environment "labpackages" was created successfully.
If you used a custom module file location, you need to run the module use command as printed by the command output above.
By default, only the environment and a module file are created (no Jupyter kernel). If you plan to use your environment in a JupyterHub notebook, you need to append a --jupyter flag to the above commands.
- Example 4: Create a Jupyter-enabled conda environment named labpackages in your group's Data Depot space and place the module file at a shared location for the group to use.
$ conda-env-mod create -p /depot/mylab/apps/labpackages -m /depot/mylab/etc/modules --jupyter
... ... ...
Jupyter kernel created: "Python (My labpackages Kernel)"
... ... ...
Your environment "labpackages" was created successfully.
Link to section 'Step 2: Load the conda environment' of 'Installing Packages' Step 2: Load the conda environment
- The following instructions assume that you have used the conda-env-mod script to create an environment named mypackages (Examples 1 or 2 above). If you used conda create instead, please use conda activate mypackages.
$ module load use.own
$ module load conda-env/mypackages-py3.8.5
Note that the conda-env module name includes the Python version that it supports (Python 3.8.5 in this example). This is the same as the Python version in the anaconda module.
- If you used a custom module file location (Example 3 above), please use module use to load the conda-env module.
$ module use /depot/mylab/etc/modules
$ module load conda-env/labpackages-py3.8.5
Link to section 'Step 3: Install packages' of 'Installing Packages' Step 3: Install packages
Now you can install custom packages in the environment using either conda install or pip install.
Link to section 'Installing with conda' of 'Installing Packages' Installing with conda
- Example 1: Install OpenCV (open-source computer vision library) using conda.
$ conda install opencv
- Example 2: Install a specific version of OpenCV using conda.
$ conda install opencv=4.5.5
- Example 3: Install OpenCV from a specific anaconda channel.
$ conda install -c anaconda opencv
Link to section 'Installing with pip' of 'Installing Packages' Installing with pip
- Example 4: Install pandas using pip.
$ pip install pandas
- Example 5: Install a specific version of pandas using pip.
$ pip install pandas==1.4.3
Follow the on-screen instructions while the packages are being installed. If installation is successful, please proceed to the next section to test the packages.
Note: Do NOT run Pip with the --user argument, as that will install packages in a different location and might mess up your account environment.
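If you are unsure where pip will install packages, a quick check of the active interpreter can help (a minimal sketch; the exact paths will differ on your system):
$ which python                                  # should point inside your conda environment
$ which pip                                     # should match the same environment as python
$ python -c "import sys; print(sys.prefix)"     # prints the environment's installation prefix
If these paths do not point at your conda environment, re-load the conda-env module before installing anything.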
Link to section 'Step 4: Test the installed packages' of 'Installing Packages' Step 4: Test the installed packages
To use the installed Python packages, you must load the module for your conda environment. If you have not loaded the conda-env module, please do so following the instructions at the end of Step 1.
$ module load use.own
$ module load conda-env/mypackages-py3.8.5
- Example 1: Test that OpenCV is available.
$ python -c "import cv2; print(cv2.__version__)"
- Example 2: Test that pandas is available.
$ python -c "import pandas; print(pandas.__version__)"
If the commands finished without errors, then the installed packages can be used in your program.
Link to section 'Additional capabilities of conda-env-mod script' of 'Installing Packages' Additional capabilities of conda-env-mod script
The conda-env-mod tool is intended to facilitate creation of a minimal Anaconda environment, a matching module file, and optionally a Jupyter kernel. Once created, the environment can then be accessed via the familiar module load command, and tuned and expanded as necessary. Additionally, the script provides several auxiliary functions to help manage environments, module files and Jupyter kernels.
General usage for the tool adheres to the following pattern:
$ conda-env-mod help
$ conda-env-mod <subcommand> <required argument> [optional arguments]
where required arguments are one of
- -n|--name ENV_NAME (name of the environment)
- -p|--prefix ENV_PATH (location of the environment)
and optional arguments further modify behavior for specific actions (e.g. -m to specify alternative location for generated module files).
Given a required name or prefix for an environment, the conda-env-mod script supports the following subcommands:
- create - to create a new environment, its corresponding module file and optional Jupyter kernel.
- delete - to delete existing environment along with its module file and Jupyter kernel.
- module - to generate just the module file for a given existing environment.
- kernel - to generate just the Jupyter kernel for a given existing environment (note that the environment has to be created with a --jupyter option).
- help - to display script usage help.
Using these subcommands, you can iteratively fine-tune your environments, module files and Jupyter kernels, as well as delete and re-create them with ease. Below we cover several commonly occurring scenarios.
Note: When you use conda-env-mod delete, remember to include the same arguments that you used when creating the environment (i.e. -p package_location and/or -m module_location).
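For instance, to remove the shared environment from Example 3 along with its module file, the deletion command would repeat the same -p and -m arguments (a sketch assuming the Example 3 paths):
$ conda-env-mod delete -p /depot/mylab/apps/labpackages -m /depot/mylab/etc/modules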
Link to section 'Generating module file for an existing environment' of 'Installing Packages' Generating module file for an existing environment
If you already have an existing configured Anaconda environment and want to generate a module file for it, follow appropriate examples from Step 1 above, but use the module subcommand instead of the create one. E.g.
$ conda-env-mod module -n mypackages
and follow the printed instructions on how to load this module. With an optional --jupyter flag, a Jupyter kernel will also be generated.
Note that the module name mypackages should be exactly the same as the existing conda environment name. Note also that if you intend to proceed with Jupyter kernel generation (via the --jupyter flag or the kernel subcommand later), you will have to ensure that your environment has the ipython and ipykernel packages installed into it. To avoid this and other related complications, we highly recommend making a fresh environment using a suitable conda-env-mod create .... --jupyter command instead.
Link to section 'Generating Jupyter kernel for an existing environment' of 'Installing Packages' Generating Jupyter kernel for an existing environment
If you already have an existing configured Anaconda environment and want to generate a Jupyter kernel file for it, you can use the kernel subcommand. E.g.
$ conda-env-mod kernel -n mypackages
This will add a "Python (My mypackages Kernel)" item to the dropdown list of available kernels upon your next login to the JupyterHub.
Note that generated Jupyter kernels are always personal (i.e. each user has to make their own, even for shared environments). Note also that you (or the creator of the shared environment) will have to ensure that your environment has the ipython and ipykernel packages installed into it.
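As a sketch (assuming the mypackages environment from the earlier examples), the prerequisite packages can be added and the kernel generated in one short session:
$ module load use.own
$ module load conda-env/mypackages-py3.8.5
$ conda install ipython ipykernel        # required before kernel generation
$ conda-env-mod kernel -n mypackages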
Link to section 'Managing and using shared Python environments' of 'Installing Packages' Managing and using shared Python environments
Here is a suggested workflow for a common group-shared Anaconda environment with Jupyter capabilities:
The PI or lab software manager:
- Creates the environment and module file (once):
$ module purge
$ module load anaconda
$ conda-env-mod create -p /depot/mylab/apps/labpackages -m /depot/mylab/etc/modules --jupyter
- Installs required Python packages into the environment (as many times as needed):
$ module use /depot/mylab/etc/modules
$ module load conda-env/labpackages-py3.8.5
$ conda install .......   # all the necessary packages
Lab members:
- Lab members can start using the environment in their command line scripts or batch jobs simply by loading the corresponding module:
$ module use /depot/mylab/etc/modules
$ module load conda-env/labpackages-py3.8.5
$ python my_data_processing_script.py .....
- To use the environment in Jupyter notebooks, each lab member will need to create their own Jupyter kernel (once). This is because Jupyter kernels are private to individuals, even for shared environments.
$ module use /depot/mylab/etc/modules
$ module load conda-env/labpackages-py3.8.5
$ conda-env-mod kernel -p /depot/mylab/apps/labpackages
A similar process can be devised for instructor-provided or individually-managed class software, etc.
Link to section 'Troubleshooting' of 'Installing Packages' Troubleshooting
- Python packages often fail to install or run due to dependency incompatibilities with other packages. In particular, if you have previously installed packages in your home directory, it is safer to move those installations out of the way first.
$ mv ~/.local ~/.local.bak
$ mv ~/.cache ~/.cache.bak
- Unload all the modules.
$ module purge
- Clean up PYTHONPATH.
$ unset PYTHONPATH
- Next load the modules (e.g. anaconda) that you need.
$ module load anaconda/2020.11-py38
$ module load use.own
$ module load conda-env/mypackages-py3.8.5
- Now try running your code again.
- Some applications only run on specific versions of Python (e.g. Python 3.6). Please check the documentation of your application to see if that is the case.
Installing Packages from Source
We maintain several Anaconda installations. Anaconda maintains numerous popular scientific Python libraries in a single installation. If you need a Python library not included with standard Python, we recommend first checking Anaconda. For a list of modules currently installed in the Anaconda Python distribution:
$ module load anaconda
$ conda list
# packages in environment at /apps/spack/bell/apps/anaconda/2020.02-py37-gcc-4.8.5-u747gsx:
#
# Name Version Build Channel
_ipyw_jlab_nb_ext_conf 0.1.0 py37_0
_libgcc_mutex 0.1 main
alabaster 0.7.12 py37_0
anaconda 2020.02 py37_0
...
If you see the library in the list, you can simply import it into your Python code after loading the Anaconda module.
If you do not find the package you need, you should be able to install the library in your own Anaconda customization. First try to install it with Conda or Pip. If the package is not available from either Conda or Pip, you may be able to install it from source.
Use the following instructions as a guideline for installing packages from source. Make sure you have a download link to the software (usually it will be a tar.gz
archive file). You will substitute it on the wget line below.
We also assume that you have already created an empty conda environment as described in our Python package installation guide.
$ mkdir ~/src
$ cd ~/src
$ wget http://path/to/source/tarball/app-1.0.tar.gz
$ tar xzvf app-1.0.tar.gz
$ cd app-1.0
$ module load anaconda
$ module load use.own
$ module load conda-env/mypackages-py3.8.5
$ python setup.py install
$ cd ~
$ python
>>> import app
>>> quit()
If the package installed successfully, the "import app" line should return without any output. You can then import the package in your Python scripts.
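If the package ships modern build metadata (a setup.py or pyproject.toml), pip can perform the same build-and-install step into the currently loaded conda environment. This is only a sketch, assuming the same hypothetical app-1.0 source directory as above:
$ cd ~/src/app-1.0
$ pip install .          # builds and installs into the active conda environment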
If you need further help or run into any issues installing a library, contact us or drop by Coffee Hour for in-person help.
For more information about Python:
Example: Create and Use Biopython Environment with Conda
Link to section 'Using conda to create an environment that uses the biopython package' of 'Example: Create and Use Biopython Environment with Conda' Using conda to create an environment that uses the biopython package
To use Conda you must first load the anaconda module:
module load anaconda
Create an empty conda environment to install biopython:
conda-env-mod create -n biopython
Now activate the biopython environment:
module load use.own
module load conda-env/biopython-py3.8.5
Install the biopython packages in your environment:
conda install --channel anaconda biopython -y
Fetching package metadata ..........
Solving package specifications .........
.......
Linking packages ...
[ COMPLETE ]|################################################################
The --channel option tells conda to search the anaconda channel for the biopython package. The -y argument is optional and allows you to skip the installation prompt. A list of packages will be displayed as they are installed.
Remember to add the following lines to your job submission script to use the custom environment in your jobs:
module load anaconda
module load use.own
module load conda-env/biopython-py3.8.5
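As a quick sanity check (a minimal sketch; Bio is the import name provided by the biopython package), you can confirm the installation before using it in your jobs:
python -c "import Bio; print(Bio.__version__)"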
If you need further help or run into any issues with creating environments, contact us or drop by Coffee Hour for in-person help.
For more information about Python:
Numpy Parallel Behavior
The widely available Numpy package is the best way to handle numerical computation in Python. The numpy
package provided by our anaconda
modules is optimized using Intel's MKL library. It will automatically parallelize many operations to make use of all the cores available on a machine.
In many contexts that would be the ideal behavior. On the cluster, however, it is often not the preferred behavior, because more than one user may be present on the system and/or more than one job may be running on a node. Having multiple processes contend for the same cores will actually degrade performance.
Setting the MKL_NUM_THREADS
or OMP_NUM_THREADS
environment variable(s) allows you to control this behavior. Our anaconda modules automatically set these variables to 1 if and only if you do not currently have that variable defined.
When submitting batch jobs it is always a good idea to be explicit rather than implicit. If you are submitting a job that you want to make use of the full resources available on the node, set one or both of these variables to the number of cores you want to allow numpy to make use of.
#!/bin/bash
module load anaconda
export MKL_NUM_THREADS=24
...
If you are submitting multiple jobs that you intend to be scheduled together on the same node, it is probably best to restrict numpy to a single core.
#!/bin/bash
module load anaconda
export MKL_NUM_THREADS=1
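To see the effect of the thread setting for yourself, you can time the same NumPy operation with different values (a rough sketch; the matrix size is arbitrary and timings will vary):
# Compare a large matrix multiply with 1 vs. 24 MKL threads
module load anaconda
MKL_NUM_THREADS=1 python -c "import numpy, time; a=numpy.random.rand(4000,4000); t=time.time(); a@a; print('1 thread:', round(time.time()-t, 2), 's')"
MKL_NUM_THREADS=24 python -c "import numpy, time; a=numpy.random.rand(4000,4000); t=time.time(); a@a; print('24 threads:', round(time.time()-t, 2), 's')"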
R
R, a GNU project, is a language and environment for data manipulation, statistics, and graphics. It is an open source version of the S programming language. R is quickly becoming the language of choice for data science due to the ease with which it can produce high quality plots and data visualizations. It is a versatile platform with a large, growing community and collection of packages.
For more general information on R visit The R Project for Statistical Computing.
Loading Data into R
R is an environment for manipulating data. In order to manipulate data, it must be brought into the R environment. R provides functions for reading data from most common file formats. Some of the most common file types, like comma-separated value (CSV) files, have functions that come in the basic R packages. Other, less common file types require additional packages to be installed. To read data from a CSV file into the R environment, enter the following command in the R prompt:
> read.csv(file = "path/to/data.csv", header = TRUE)
When R reads the file, it creates an object that can then become the target of other functions. To assign the object created by read.csv to a variable of your choice, enter the following in the R prompt:
> my_variable <- read.csv(file = "path/to/data.csv", header = FALSE)
To display the properties (structure) of loaded data, enter the following:
> str(my_variable)
For more functions and tutorials:
Installing R packages
Link to section 'Challenges of Managing R Packages in the Cluster Environment' of 'Installing R packages' Challenges of Managing R Packages in the Cluster Environment
- Different clusters have different hardware and software. So, if you have access to multiple clusters, you must install your R packages separately for each cluster.
- Each cluster has multiple versions of R, and packages installed with one version of R may not work with another version. So, libraries for each R version must be installed in a separate directory.
- You can define the directory where your R packages will be installed using the environment variable R_LIBS_USER (see the sketch after this list).
- For your convenience, a sample ~/.Rprofile example file is provided that can be downloaded to your cluster account and renamed to ~/.Rprofile (or appended to one) to customize your installation preferences. Detailed instructions.
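If you prefer not to use the ~/.Rprofile approach, the same effect can be achieved by exporting R_LIBS_USER in your shell before starting R. This is a minimal sketch; the directory path is only an example and should match the R version you load:
export R_LIBS_USER=$HOME/R/workbench/4.1.2   # example library location for R 4.1.2
mkdir -p "$R_LIBS_USER"                      # the directory must exist before installing packages
module load r/4.1.2
R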
Link to section 'Installing Packages' of 'Installing R packages' Installing Packages
- Step 0: Set up installation preferences.
Follow the steps for setting up your ~/.Rprofile preferences. This step needs to be done only once. If you have created a ~/.Rprofile file previously on Data Workbench, ignore this step.
- Step 1: Check if the package is already installed.
As part of the R installations on community clusters, a lot of R libraries are pre-installed. You can check if your package is already installed by opening an R terminal and entering the command installed.packages(). For example,
module load r/4.1.2
R
installed.packages()["units",c("Package","Version")]
 Package Version
 "units" "0.6-3"
quit()
If the package you are trying to use is already installed, simply load the library, e.g., library('units'). Otherwise, move to the next step to install the package.
- Step 2: Load required dependencies. (if needed)
For simple packages you may not need this step. However, some R packages depend on other libraries. For example, the sf package depends on the gdal and geos libraries. So, you will need to load the corresponding modules before installing sf. Read the documentation for the package to identify which modules should be loaded.
module load gdal
module load geos
- Step 3: Install the package.
Now install the desired package using the command install.packages('package_name'). R will automatically download the package and all its dependencies from CRAN and install each one. Your terminal will show the build progress and eventually show whether the package was installed successfully or not.
R
install.packages('sf', repos="https://cran.case.edu/")
Installing package into ‘/home/myusername/R/workbench/4.0.0’
(as ‘lib’ is unspecified)
trying URL 'https://cran.case.edu/src/contrib/sf_0.9-7.tar.gz'
Content type 'application/x-gzip' length 4203095 bytes (4.0 MB)
==================================================
downloaded 4.0 MB
...
... more progress messages ...
...
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (sf)
The downloaded source packages are in
‘/tmp/RtmpSVAGio/downloaded_packages’
- Step 4: Troubleshooting. (if needed)
If Step 3 ended with an error, you need to investigate why the build failed. The most common reason for build failure is not loading the necessary modules.
Link to section 'Loading Libraries' of 'Installing R packages' Loading Libraries
Once you have packages installed you can load them with the library()
function as shown below:
library('packagename')
The package is now installed and loaded and ready to be used in R.
Link to section 'Example: Installing dplyr' of 'Installing R packages' Example: Installing dplyr
The following demonstrates installing the dplyr
package assuming the above-mentioned custom ~/.Rprofile
is in place (note its effect in the "Installing package into" information message):
module load r
R
install.packages('dplyr', repos="http://ftp.ussg.iu.edu/CRAN/")
Installing package into ‘/home/myusername/R/workbench/4.0.0’
(as ‘lib’ is unspecified)
...
also installing the dependencies 'crayon', 'utf8', 'bindr', 'cli', 'pillar', 'assertthat', 'bindrcpp', 'glue', 'pkgconfig', 'rlang', 'Rcpp', 'tibble', 'BH', 'plogr'
...
...
...
The downloaded source packages are in
'/tmp/RtmpHMzm9z/downloaded_packages'
library(dplyr)
Attaching package: 'dplyr'
For more information about installing R packages:
RStudio
RStudio is a graphical integrated development environment (IDE) for R. RStudio is the most popular environment for developing both R scripts and packages. RStudio is provided on most Research systems.
There are two methods to launch RStudio on the cluster: command-line and application menu icon.
Link to section 'Launch RStudio by the command-line:' of 'RStudio' Launch RStudio by the command-line:
module load gcc
module load r
module load rstudio
rstudio
Note that RStudio is a graphical program and in order to run it you must have a local X11 server running or use Thinlinc Remote Desktop environment. See the ssh X11 forwarding section for more details.
Link to section 'Launch Rstudio by the application menu icon:' of 'RStudio' Launch Rstudio by the application menu icon:
- Log into desktop.workbench.rcac.purdue.edu with web browser or ThinLinc client
- Click on the
Applications
drop down menu on the top left corner - Choose
Cluster Software
and thenRStudio
R and RStudio are free to download and run on your local machine. For more information about RStudio:
Setting Up R Preferences with .Rprofile
For your convenience, a sample ~/.Rprofile example file is provided that can be downloaded to your cluster account and renamed to ~/.Rprofile
(or appended to one). Follow these steps to download our recommended ~/.Rprofile
example and copy it into place:
curl -#LO https://www.rcac.purdue.edu/files/knowledge/run/examples/apps/r/Rprofile_example
mv -ib Rprofile_example ~/.Rprofile
The above installation step needs to be done only once on Data Workbench. Now load the R module and run R:
module load r/4.1.2
R
.libPaths()
[1] "/home/myusername/R/workbench/4.1.2-gcc-6.3.0-ymdumss"
[2] "/apps/spack/workbench/apps/r/4.1.2-gcc-6.3.0-ymdumss/rlib/R/library"
.libPaths()
should output something similar to above if it is set up correctly.
You are now ready to install R packages into the dedicated directory /home/myusername/R/workbench/4.1.2-gcc-6.3.0-ymdumss
.
Singularity
Note: Singularity was originally a project out of Lawrence Berkeley National Laboratory. It has now been spun off into a distinct offering under a new corporate entity under the name Sylabs Inc. This guide pertains to the open source community edition, SingularityCE.
Link to section 'What is Singularity?' of 'Singularity' What is Singularity?
Singularity is a new feature of the Community Clusters allowing the portability and reproducibility of operating system and application environments through the use of Linux containers. It gives users complete control over their environment.
Singularity is like Docker but tuned explicitly for HPC clusters. More information is available from the project’s website.
Link to section 'Features' of 'Singularity' Features
- Run the latest applications on an Ubuntu or CentOS userland
- Gain access to the latest developer tools
- Launch MPI programs easily
- Much more
Singularity’s user guide is available at: sylabs.io/guides/3.8/user-guide
Link to section 'Example' of 'Singularity' Example
Here is an example using an Ubuntu 16.04 image on Data Workbench:
singularity exec /depot/itap/singularity/ubuntu1604.img cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04 LTS"
Here is another example using a Centos 7 image:
singularity exec /depot/itap/singularity/centos7.img cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)
Link to section 'Purdue Cluster Specific Notes' of 'Singularity' Purdue Cluster Specific Notes
All service providers will integrate Singularity slightly differently depending on site. The largest customization will be which default files are inserted into your images so that routine services will work.
Services we configure for your images include DNS settings and account information. File systems we overlay into your images are your home directory, scratch, Data Depot, and application file systems.
Here is a list of paths:
- /etc/resolv.conf
- /etc/hosts
- /home/$USER
- /apps
- /scratch
- /depot
This means that within the container environment these paths will be present and the same as outside the container. The /apps
, /scratch
, and /depot
directories will need to exist inside your container to work properly.
Link to section 'Creating Singularity Images' of 'Singularity' Creating Singularity Images
Due to how singularity containers work, you must have root privileges to build an image. Once you have a singularity container image built on your own system, you can copy the image file up to the cluster (you do not need root privileges to run the container).
You can find information and documentation for how to install and use singularity on your system:
We have version 3.8.0-1.el7
on the cluster. You will most likely not be able to run containers built with a newer version of Singularity than this, so be sure to follow the installation guide for version 3.8 on your system.
singularity --version
singularity version 3.8.0-1.el7
Everything you need on how to build a container is available from their user-guide. Below are merely some quick tips for getting your own containers built for Data Workbench.
You can use a Definition File to both build your container and share its specification with collaborators (for the sake of reproducibility). Here is a simplistic example of such a file:
# FILENAME: Buildfile
Bootstrap: docker
From: ubuntu:18.04
%post
apt-get update && apt-get upgrade -y
mkdir /apps /depot /scratch
To build the image itself:
sudo singularity build ubuntu-18.04.sif Buildfile
The challenge with this approach however is that it must start from scratch if you decide to change something. In order to create a container image iteratively and interactively, you can use the --sandbox
option.
sudo singularity build --sandbox ubuntu-18.04 docker://ubuntu:18.04
This will not create a flat image file but a directory tree (i.e., a folder), the contents of which are the container's filesystem. In order to get a shell inside the container that allows you to modify it, use the --writable
option.
sudo singularity shell --writable ubuntu-18.04
Singularity: Invoking an interactive shell within container...
Singularity ubuntu-18.04.sandbox:~>
You can then proceed to install any libraries, software, etc. within the container. Then to create the final image file, exit
the shell and call the build
command once more on the sandbox.
sudo singularity build ubuntu-18.04.sif ubuntu-18.04
Finally, copy the new image to Data Workbench and run it.
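A minimal sketch of that last step (the username and destination are placeholders; adjust to your own account and preferred storage location):
# On your local machine: copy the finished image to the cluster (home directory here; scratch is better for large images)
scp ubuntu-18.04.sif myusername@workbench.rcac.purdue.edu:

# On Data Workbench: run a command inside the container
singularity exec ~/ubuntu-18.04.sif cat /etc/lsb-release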
Windows
Windows virtual machines (VMs) are supported as batch jobs on HPC systems. This section illustrates how to submit a job and run a Windows instance in order to run Windows applications on the high-performance computing systems.
The following images are pre-configured and made available by staff:
- Windows 2016 Server Basic (minimal software pre-loaded)
- Windows 2016 Server GIS (GIS Software Stack pre-loaded)
The Windows VMs can be launched in two fashions:
- Menu Launcher - Point and click to start
- Command Line - Advanced and customized usage
Click each of the above links for detailed instructions on using them.
Link to section 'Software Provided in Pre-configured Virtual Machines' of 'Windows' Software Provided in Pre-configured Virtual Machines
The Windows 2016 Base server image available on Data Workbench has the following software packages preloaded:
- Anaconda Python 2 and Python 3
- JMP 13
- Matlab R2017b
- Microsoft Office 2016
- Notepad++
- NVivo 12
- Rstudio
- Stata SE 15
- VLC Media Player
The Windows 2016 GIS server image available on Data Workbench has the following software packages preloaded:
- ArcGIS Desktop 10.5
- ArcGIS Pro
- ArcGIS Server 10.5
- Anaconda Python 2 and Python 3
- ENVI5.3/IDL 8.5
- ERDAS Imagine
- GRASS GIS 7.4.0
- JMP 13
- Matlab R2017b
- Microsoft Office 2016
- Notepad++
- Pix4d Mapper
- QGIS Desktop
- Rstudio
- VLC Media Player
Command line
If you wish to work with Windows VMs on the command line or incorporate them into scripted workflows, you can interact directly with the Windows system:
Copy a Windows 2016 Server VM image to your storage. Scratch or Research Data Depot are good locations to save a VM image. If you are using scratch, remember that scratch spaces are temporary, and be sure to safely back up your disk image somewhere permanent, such as Research Data Depot or Fortress. To copy a basic image:
$ cp /depot/itap/windows/base/2k16.qcow2 $RCAC_SCRATCH/windows.qcow2
To copy a GIS image:
$ cp /depot/itap/windows/gis/2k16.qcow2 $RCAC_SCRATCH/windows.qcow2
To launch a virtual machine in a batch job, use the "windows" script, specifying the path to your Windows virtual machine image. With no other command-line arguments, the windows script will autodetect the number of cores and amount of memory for the Windows VM. A Windows network connection will be made to your home directory. To launch:
$ windows -i $RCAC_SCRATCH/windows.qcow2
Link to section 'Command line options:' of 'Command line' Command line options:
-i <path to qcow image file> (For example, $RCAC_SCRATCH/windows-2k16.qcow2)
-m <RAM>G (For example, 32G)
-c <cores> (For example, 20)
-s <smbpath> (UNIX Path to map as a drive, for example, $RCAC_SCRATCH)
-b (If present, launches VM in background. Use VNC to connect to Windows.)
To launch a virtual machine with 32GB of RAM, 20 cores, and a network mapping to your home directory:
$ windows -i /path/to/image.qcow2 -m 32G -c 20 -s $HOME
To launch a virtual machine with 16GB of RAM, 10 cores, and a network mapping to your Data Depot space:
$ windows -i /path/to/image.qcow2 -m 16G -c 10 -s /depot/mylab
The Windows 2016 server desktop will open, and automatically log in as an administrator, so that you can install any software into the Windows virtual machine that your research requires. Changes to the image will be stored in the file specified with the -i option.
Menu Launcher
Windows VMs can be easily launched through the Thinlinc remote desktop environment.
- Log in via Thinlinc.
- Click on Applications menu in the upper left corner.
- Look under the Cluster Software menu.
- The "Windows 10" launcher will launch a VM directly on the front-end.
- Follow the dialogs to set up your VM.

The dialog menus will walk you through setting up and loading your VM.
- You can choose to create a new image or load a saved image.
- New VMs should be saved on Scratch or Research Data Depot as they are too large for Home Directories.
- If you are using scratch, remember that scratch spaces are temporary, and be sure to safely back up your disk image somewhere permanent, such as Research Data Depot or Fortress.
You will also be prompted to select a storage space to mount on your image (Home, Scratch, or Data Depot). You can only choose one to be mounted. It will appear as a shortcut on the desktop once the VM loads.
Link to section 'Notes' of 'Menu Launcher' Notes
The menu launcher will automatically select reasonable CPU and memory values. If you wish to choose other options or work Windows VMs into scripted workflows, see the section on using the command line.
BioContainers Collection
Link to section 'What is BioContainers?' of 'BioContainers Collection' What is BioContainers?
The BioContainers project came from the idea of using container-based technologies such as Docker or rkt for bioinformatics software. Having a common and controllable environment for running software could help to deal with some of the current problems during software development and distribution. BioContainers is a community-driven project that provides the infrastructure and basic guidelines to create, manage and distribute bioinformatics containers with a special focus on omics fields such as proteomics, genomics, transcriptomics and metabolomics. For more information, please visit the BioContainers project.
Link to section ' Getting Started ' of 'BioContainers Collection' Getting Started
Users can download bioinformatics containers from BioContainers.pro and run them directly using the Singularity instructions from the corresponding container’s catalog page.
A brief Singularity guide and examples are available on the Data Workbench Singularity user guide page. The detailed Singularity user guide is available at: sylabs.io/guides/3.8/user-guide
In addition, a subset of pre-downloaded biocontainers wrapped into convenient software modules is provided. These modules hide the underlying complexity and provide the same commands that are expected from non-containerized versions of each application.
On Data Workbench, type the command below to see the list of biocontainers we have deployed.
module load biocontainers
module avail
------------ BioContainers collection modules -------------
bamtools/2.5.1
beast2/2.6.3
bedtools/2.30.0
blast/2.11.0
bowtie2/2.4.2
bwa/0.7.17
cufflinks/2.2.1
deeptools/3.5.1
fastqc/0.11.9
faststructure/1.0
htseq/0.13.5
[....]
Link to section ' Example ' of 'BioContainers Collection' Example
This example demonstrates how to run BLASTP with the blast module. This blast module is a biocontainer wrapper for NCBI BLAST.
module load biocontainers
module load blast
blastp -query query.fasta -db nr -out output.txt -outfmt 6 -evalue 0.01
To run a job in batch mode, first prepare a job script that specifies the BioContainer modules you want to launch and the resources required to run it. Then, use the sbatch
command to submit your job script to Slurm. The following example shows the job script to use Bowtie2 in bioinformatic analysis.
#!/bin/bash
#SBATCH -A myqueuename
#SBATCH -o bowtie2_%j.txt
#SBATCH -e bowtie2_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=1:30:00
#SBATCH --job-name bowtie2
# Load the Bowtie module
module load biocontainers
module load bowtie2
# Indexing a reference genome
bowtie2-build ref.fasta ref
# Aligning paired-end reads
bowtie2 -p 8 -x ref -1 reads_1.fq -2 reads_2.fq -S align.sam
To help users get started, we have provided detailed user guides for each containerized bioinformatics module on the ReadTheDocs platform.
Ansys Fluent
Ansys is a CAE/multiphysics engineering simulation software that utilizes finite element analysis for numerically solving a wide variety of mechanical problems. The software contains a list of packages and can simulate many structural properties such as strength, toughness, elasticity, thermal expansion, fluid dynamics as well as acoustic and electromagnetic attributes.
Link to section 'Ansys Licensing' of 'Ansys Fluent' Ansys Licensing
The Ansys licensing on our community clusters is maintained by Purdue ECN group. There are two types of licenses: teaching and research. For more information, please refer to ECN Ansys licensing page. If you are interested in purchasing your own research license, please send email to software@ecn.purdue.edu.
Link to section 'Ansys Workflow' of 'Ansys Fluent' Ansys Workflow
Ansys software consists of several sub-packages such as Workbench and Fluent. Most simulations are performed using the Ansys Workbench console, a GUI interface to manage and edit the simulation workflow. It requires X11 forwarding for remote display, so an SSH client with X11 support or a remote desktop portal is required. Please see the Logging In section for more details. To ensure the best performance, a ThinLinc remote desktop connection is highly recommended.
Typically users break down larger structures into small components in geometry with each of them modeled and tested individually. A user may start by defining the dimensions of an object, adding weight, pressure, temperature, and other physical properties.
Ansys Fluent is a computational fluid dynamics (CFD) simulation software known for its advanced physics modeling capabilities and accuracy. Fluent offers unparalleled analysis capabilities and provides all the tools needed to design and optimize new equipment and to troubleshoot existing installations.
In the following sections, we provide step-by-step instructions to lead you through the process of using Fluent. We will create a classical elbow pipe model and simulate the fluid dynamics when water flows through the pipe. The project files have been generated and can be downloaded via fluent_tutorial.zip.
Link to section 'Loading Ansys Module' of 'Ansys Fluent' Loading Ansys Module
Different versions of Ansys are installed on the clusters and can be listed with module spider
or module avail
command in the terminal.
$ module avail ansys/
---------------------- Core Applications -----------------------------
ansys/2019R3 ansys/2020R1 ansys/2021R2 ansys/2022R1 (D)
Before launching Ansys Workbench, a specific version of the Ansys module needs to be loaded. For example, you can module load ansys/2021R2
to use Ansys 2021R2. If no version is specified, the default module, marked with (D) (ansys/2022R1
in this case), will be loaded. You can also check the loaded modules with the module list
command.
Link to section 'Launching Ansys Workbench' of 'Ansys Fluent' Launching Ansys Workbench
Open a terminal on Data Workbench, enter rcac-runwb2
to launch Ansys Workbench.
You can also use runwb2
to launch Ansys Workbench. The main difference between runwb2
and rcac-runwb2
is that the latter sets the project folder to be in your scratch space. Ansys has a known bug where it might crash when the project folder is set to $HOME
on our systems.
Preparing Case Files for Fluent
Link to section 'Creating a Fluent fluid analysis system' of 'Preparing Case Files for Fluent' Creating a Fluent fluid analysis system
In the Ansys Workbench, create a new fluid flow analysis by double-clicking the Fluid Flow (Fluent) option under the Analysis Systems in the Toolbox on the left panel. You can also drag-and-drop the analysis system into the Project Schematic. A green dotted outline indicating a potential location for the new system initially appears in the Project Schematic. When you drag the system to one of the outlines, it turns into a red box to indicate the chosen location of the new system.

The red rectangle indicates the Fluid Flow system for Fluent, which includes all the essential workflows from “2 Geometry” to “6 Results”. You can rename it and carry out the necessary step-by-step procedures by double-clicking the corresponding cells.
It is important to save the project. Ansys Workbench saves the project with a .wbpj
extension and also all the supporting files into a folder with the same name. In this case, a file named elbow_demo.wbpj
and a folder $Ansys_PROJECT_FOLDER/elbow_demo_files/
are created in the Ansys project folder:
$ ll
total 33
drwxr-xr-x 7 myusername itap 9 Mar 3 17:47 elbow_demo_files
-rw-r--r-- 1 myusername itap 42597 Mar 3 17:47 elbow_demo.wbpj
You should always “Update Project” and save it after finishing a procedure.
Link to section 'Creating Geometry in the Ansys DesignModeler' of 'Preparing Case Files for Fluent' Creating Geometry in the Ansys DesignModeler
Create a geometry in the Ansys DesignModeler (by double-clicking “Geometry” cell in workflow), or import the appropriate geometry file (by right-clicking the Geometry cell and selecting “Import Geometry” option from the context menu).
You can use Ansys DesignModeler to create 2D/3D geometries or even draw the objects yourself. In our example, we created only half of the elbow pipe because the symmetry of the structure is taken into account to reduce the computation intensity.

After saving the geometry, a geometry file FFF.agdb
will be created in the folder: $Ansys_PROJECT_FOLDER/elbow_demo_file/dp0/FFF/DM/
. The project in Workbench will be updated automatically.
If you import a pre-existing geometry into Ansys DesignModeler, it will also generate this file with the same filename at this location.
Link to section 'Creating mesh in the Ansys Meshing' of 'Preparing Case Files for Fluent' Creating mesh in the Ansys Meshing
Now that we have created the elbow pipe geometry, a computational mesh can be generated by the Meshing application throughout the flow volume.
With the successful creation of the geometry, there should be a green check showing the completion of “Geometry” in the Ansys Workbench. A Refresh Required icon within the “Mesh” cell indicates the mesh needs to be updated and refreshed for the system.
Then it’s time to open the Ansys Meshing application by double-clicking the “Mesh” cell and editing the mesh for the project. Generally, there are several steps we need to take to define the mesh:
- Create names for all geometry boundaries such as the inlets, outlets and fluid body. Note: You can use the strings “velocity inlet” and “pressure outlet” in the named selections (with or without hyphens or underscore characters) to allow Ansys Fluent to automatically detect and assign the corresponding boundary types accordingly. Use “Fluid” for the body to let Ansys Fluent automatically detect that the volume is a fluid zone and treat it accordingly.
- Set basic meshing parameters for the Ansys Meshing application. Here are several important parameters you may need to assign: Sizing, Quality, Body Sizing Control, Inflation.
- Select “Generate” to generate the mesh and “Update” to update the mesh into the system. Note: Once the mesh is generated, you can view the mesh statistics by opening the Statistics node in the Details of “Mesh” view. This will display information such as the number of nodes and the number of elements, which gives you a general idea for the future computational resources and time.
After generation and updating the mesh, a mesh file FFF.msh
will be generated in folder $Ansys_PROJECT_FOLDER/elbow_demo_file/dp0/FFF/MECH/
and a mesh database file FFF.mshdb
will be generated in folder $Ansys_PROJECT_FOLDER/elbow_demo_file/dp0/global/MECH/
.
Parameters used in demo case (use default if not assigned):
- Length Unit=”mm”
- Names defined for geometry:
- velocity-inlet-large (large inlet on pipe);
- velocity-inlet-small (small inlet on pipe);
- pressure-outlet (outlet on pipe);
- symmetry (symmetry surface);
- Fluid (body);
- Mesh:
- Quality: Smoothing=”high”;
- Inflation: Use Automatic Inflation=“Program Controlled”, Inflation Option=”Smooth Transition”;
- Statistics:
- Nodes=29371;
- Elements=87647.
Case Calculating with Fluent
Link to section 'Calculation with Fluent' of 'Case Calculating with Fluent' Calculation with Fluent
Now all the files are ready for the Fluent calculations. Both “Geometry” and “Mesh” cells should have green checks. We can set up the CFD simulation parameters in the Ansys Fluent by double-clicking the “Setup” cell.
Ansys Fluent Launcher can be started by selecting “editing” on the “Setup” cell with many startup options (e.g. Precision, Parallel, Display). Note that “Dimension” is fixed to “3D” because we are using a 3D model in this project.

After the Fluent is opened, an Ansys Fluent settings file FFF.set
is written under the folder $Ansys_PROJECT_FOLDER/elbow_demo_file/dp0/FFF/Fluent/
.
Then we are going to set up all the necessary parameters for Fluent computation. Here are the key steps for the setup:
- Setting up the domain:
- Change the units for length to be consistent with the Mesh;
- Check the mesh statistics and quality;
- Setting up physics:
- Solver: “Energy”, “Viscous Model”, “Near-Wall Treatment”;
- Materials;
- Zones;
- Boundaries: Inlet, Outlet, Internal, Symmetry, Wall;
- Solving:
- Solution Methods;
- Reports;
- Initialization;
- Iterations and output frequency.
Then the calculation will be carried out and the results will be written out into FFF-1.cas.gz
under folder $Ansys_PROJECT_FOLDER/elbow_demo_file/dp0/FFF/Fluent/
.
This file contains all the settings and simulation results, which can be loaded for post-analysis and re-computation (more details will be introduced in the following sections). If only the configurations and settings within Fluent are needed, we can open Fluent independently or submit Fluent jobs with bash commands by loading the existing case, in order to facilitate the computation process.
Parameters used in demo case (use default if not assigned):
- Domain Setup: Length Units=”mm”;
- Solver: Energy=”on”; Viscous Model=”k-epsilon”; Near-Wall Treatment=”Enhanced Wall Treatment”;
- Materials: water (Density=1000[kg/m^3]; Specific Heat=4216[J/kg-k]; Thermal Conductivity=0.677[w/m-k]; Viscosity=8e-4[kg/m-s]);
- Zones=”fluid (water)”;
- Inlet=”velocity-inlet-large” (Velocity Magnitude=0.4m/s, Specification Method=”Intensity and Hydraulic Diameter”, Turbulent Intensity=5%; Hydraulic Diameter=100mm; Thermal Temperature=293.15k) &”velocity-inlet-small” (Velocity Magnitude=1.2m/s, Specification Method=”Intensity and Hydraulic Diameter”, Turbulent Intensity=5%; Hydraulic Diameter=25mm; Thermal Temperature=313.15k); Internal=”interior-fluid”; Symmetry=”symmetry”; Wall=”wall-fluid”;
- Solution Methods: Gradient=”Green-Gauss Node Based”;
- Report: plot residual and “Facet Maximum” for “pressure-outlet”
- Hybrid Initialization;
- 300 iterations.
Link to section 'Results analysis' of 'Case Calculating with Fluent' Results analysis
The best ways to view and analyze the simulation are Ansys Fluent (directly after computation) or Ansys CFD-Post (by entering “Results” in Ansys Workbench). Both methods are straightforward, so we will not cover this part in this tutorial. Here is a final simulation result showing the temperature on the symmetry surface after 300 iterations, for reference:

Fluent Text User Interface and Journal File
Link to section 'Fluent Text User Interface (TUI)' of 'Fluent Text User Interface and Journal File' Fluent Text User Interface (TUI)
If you pay attention to the “Console” window in the Fluent window when setting up and carrying out the calculation, you will see the corresponding commands appear and execute one after another. Almost all of the setup steps can be accomplished through these command lines, which are collectively called the Fluent Text User Interface (TUI). Here are the main commands in the Fluent TUI:
adjoint/ parallel/ solve/
define/ plot/ surface/
display/ preferences/ turbo-workflow/
exit print-license-usage views/
file/ report/
mesh/ server/
For example, instead of opening a case by clicking buttons in Ansys Fluent, we can type /file read-case case_file_name.cas.gz
to open the saved case.
Link to section 'Fluent Journal Files' of 'Fluent Text User Interface and Journal File' Fluent Journal Files
A Fluent journal file is a series of TUI commands stored in a text file. The file can be written in a text editor or generated by Fluent as a transcript of the commands given to Fluent during your session.
A journal file generated by Fluent will include any GUI operations (in a TUI form, though). This is quite useful if you have a series of tasks that you need to execute, as it provides a shortcut. To record a journal file, start recording with File -> Write -> Start Journal..., perform whatever tasks you need, and then stop recording with File -> Write -> Stop Journal...
You can also write your own journal file into a text file. The basic rule for a Fluent journal file is to reproduce the TUI commands that controlled the configuration and calculation of Fluent in their order. You can add a comment in a line starting with a ;
(semicolon).
Here are some reasons why you should use a Fluent journal file:
- Using journal files with bash scripting can allow you to automate your jobs.
- Using journal files can allow you to parameterize your models easily and automatically.
- Using a journal file can set parameters you do not have in your case file e.g. autosaving.
- Using a journal file can allow you to safely save, stop and restart your jobs easily.
The order of your journal file commands is highly important. The correct sequences must be followed and some stages have multiple options e.g. different initialization methods.
Here is a sample Fluent journal file for the demo case:
;testJournal.jou
;Set the TUI version for Fluent
/file/set-tui-version "22.1"
;Read the case. The default folder
/file read-case /home/jin456/Fluent_files/tutorial_case1/elbow_files/dp0/FFF/Fluent/FFF-1.cas.gz
;Initialize the case with Hybrid Initialization
/solve/initialize/hyb-initialization
;Set Number of Iterations to 1000, Reporting Interval to 10 iterations and Profile Update Interval to 1 iteration
/solve/iterate 1000 10 1
;Outputting solver performance data upon completion of the simulation
/parallel timer usage
;Write out the simulation results.
/file write-case-data /home/jin456/Fluent_files/tutorial_case1/elbow_files/dp0/FFF/Fluent/result.cas.h5
;After computation, exit Fluent
/exit
Before running this Fluent journal file, you need to make sure: 1) the ansys module has been loaded (it’s highly recommended to load the same version of Ansys that you used when building the case project); 2) the project case file (***.cas.gz
) has been created.
Then we can use Fluent to run this journal file in the terminal by simply using: fluent 3ddp -t$NTASKS -g -i testJournal.jou. Here, 3d indicates this is a 3D model, dp indicates double precision, -t$NTASKS tells Fluent how many solver processes to use (e.g. -t4), -g means to run without the GUI or graphics, and -i testJournal.jou tells Fluent to read the specified journal file.
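For convenience, the launch command can be wrapped in a small shell script (a sketch only; the Ansys version, task count, and file names are examples you would adjust):
#!/bin/bash
# run_fluent.sh - run a Fluent journal non-interactively
module load ansys/2021R2          # use the same version that built the case
NTASKS=4                          # number of solver processes
fluent 3ddp -t$NTASKS -g -i testJournal.jou > fluent_run.log 2>&1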
Here is a table for the available command line Options for Linux/UNIX and Windows Platforms in Ansys Fluent.
Option | Platform | Description |
---|---|---|
-cc | all | Use the classic color scheme |
-ccp x | Windows only | Use the Microsoft Job Scheduler where x is the head node name. |
-cnf=x | all | Specify the hosts or machine list file |
-driver | all | Sets the graphics driver (available drivers vary by platform - opengl or x11 or null (Linux/UNIX) - opengl or msw or null (Windows)) |
-env | all | Show environment variables |
-fgw | all | Disables the embedded graphics |
-g | all | Run without the GUI or graphics (Linux/UNIX); Run with the GUI minimized (Windows) |
-gr | all | Run without graphics |
-gu | all | Run without the GUI but with graphics (Linux/UNIX); Run with the GUI minimized but with graphics (Windows) |
-help | all | Display command line options |
-hidden | Windows only | Run in batch mode |
-host_ip=host:ip | all | Specify the IP interface to be used by the host process |
-i journal | all | Reads the specified journal file |
-lsf | Linux/UNIX only | Run FLUENT using LSF |
-mpi= | all | Specify MPI implementation |
-mpitest | all | Will launch an MPI program to collect network performance data |
-nm | all | Do not display mesh after reading |
-pcheck | Linux/UNIX only | Checks all nodes |
-post | all | Run the FLUENT post-processing-only executable |
-p | all | Choose the interconnect |
-r | all | List all releases installed |
-rx | all | Specify release number |
-sge | Linux/UNIX only | Run FLUENT under Sun Grid Engine |
-sge queue | Linux/UNIX only | Name of the queue for a given computing grid |
-sgeckpt ckpt_obj | Linux/UNIX only | Set checkpointing object to ckpt_obj for SGE |
-sgepe fluent_pe min_n-max_n | Linux/UNIX only | Set the parallel environment for SGE to fluent_pe; min_n and max_n are the number of min and max nodes requested |
-tx | all | Specify the number of processors x |
For more information for Fluent text user interface and journal files, please refer to Fluent FAQ.
Using Jupyter Hub
Link to section 'What is Jupyter Hub' of 'Using Jupyter Hub' What is Jupyter Hub
The name Jupyter comes from the languages Julia, Python and R. The application was originally developed for use with these languages but now supports many more. Jupyter stores your project in a notebook. It is called a notebook because it is not just a block of code but rather a collection of information related to a project. The way you organize your notebook can explain processes and steps taken as well as highlight results. Notebooks provide a variety of export formats, so you can share the project appropriately for the situation. In addition, Jupyter can compile and run code, as well as save its output, making it an ideal workspace for many types of projects.
Jupyter Hub is currently available here or under the url https://notebook.workbench.rcac.purdue.edu.
Link to section 'Getting Started' of 'Using Jupyter Hub' Getting Started
When logging in to Jupyter Hub on one of the clusters, you need to use your career account credentials. After logging in, you will see the contents of your home directory in a file explorer. To start a new notebook, click the "New" dropdown menu at the top right and select one of the available kernels: Bash, R or Python.
Link to section 'Create your own environment' of 'Using Jupyter Hub' Create your own environment
You can create your own environment in a kernel using a conda environment. Any environment you have created using conda can become a kernel ready to use in Jupyter Hub by following a few steps in the terminal or from the conda tab in the Jupyter Hub dashboard.
Below are listed the steps needed to create the environment for Jupyter from the terminal.
-
Load the anaconda module or use your own local installation.
$ module load anaconda/5.1.0-py36
-
Create your own Conda environment with the following packages.
$ conda create -n MyEnvName ipython ipykernel [...more-needed-packages...]
(and if you need a specific Python version in your environment, you can also add a
python=x.y
specification to the above command). -
Activate your environment.
$ source activate MyEnvName
-
Install the new Kernel.
$ ipython kernel install --user --name MyEnvName --display-name "Python (My Own MyEnvName Kernel)"
The
--name
value is used by Jupyter internally. These commands will overwrite any existing kernel with the same name.--display-name
is what you see in the notebook menus. - Go to your Jupyter dashboard and reload the page, you will see your own Kernel when you create a new Notebook. If you want to change the Kernel in the current Notebook, just go to the Kernel tab and select it from the "Change Kernel" option.
If you want to create the environment from the Dashboard, just go to the conda tab and create a new one with one of the available kernels. It will take a few minutes while all the base packages are installed. After the new environment shows up in the list, you can select the libraries you want from the box under the list.

Additionally, you can change the environment you are using at any time by clicking the "Kernel" dropdown menu and selecting "Change kernel".
If you want to install a new kernel different from Python (e.g. R or Bash), please refer to the links at the end.
To run code in a cell, select the cell and click the "run cell" icon on the toolbar.
To add descriptions or other plain text change the cell to markdown format. Any standard markdown tags will apply after you click the "run cell" tool.
Below is a simple example of a notebook created following the steps outlined above.
For more information about Jupyter Hub, kernels and example notebooks:
Frequently Asked Questions
Some common questions, errors, and problems are categorized below.
About Data Workbench
Frequently asked questions about Data Workbench.
Can you remove me from the Data Workbench mailing list?
Your subscription in the Data Workbench mailing list is tied to your account on Data Workbench. If you are no longer using your account on Data Workbench, your account can be deleted from the My Accounts page. Hover over the resource you wish to remove yourself from and click the red 'X' button. Your account and mailing list subscription will be removed overnight. Be sure to make a copy of any data you wish to keep first.
Do I need to do anything to my firewall to access Data Workbench?
No firewall changes are needed to access Data Workbench. However, to access data through Network Drives (i.e., CIFS, "Z: Drive"), you must be on a Purdue campus network or connected through VPN.
Logging In & Accounts
Frequently asked questions about logging in & accounts.
Errors
Common errors and solutions/work-arounds for them.
/usr/bin/xauth: error in locking authority file
Link to section 'Problem' of '/usr/bin/xauth: error in locking authority file' Problem
I receive this message when logging in:
/usr/bin/xauth: error in locking authority file
Link to section 'Solution' of '/usr/bin/xauth: error in locking authority file' Solution
Your home directory disk quota is full. You may check your quota with myquota.
You will need to free up space in your home directory. The ncdu command is a convenient interactive tool to examine disk usage. Consider running ncdu $HOME to analyze where the bulk of the usage is. With this knowledge, you could then archive your data elsewhere (e.g. your research group's Data Depot space, or the Fortress tape archive), or delete files you no longer need.
There are several common locations that tend to grow large over time and are merely cached downloads. The following are safe to delete if you see them in the output of ncdu $HOME (a cleanup sketch follows the list):
/home/myusername/.local/share/Trash
/home/myusername/.cache/pip
/home/myusername/.conda/pkgs
/home/myusername/.singularity/cache
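For example, assuming some of these cached locations show up as large in ncdu, they can be checked and removed directly. This is a minimal sketch; the paths simply mirror the list above.
$ myquota                                   # check current usage against your quota
$ rm -rf ~/.local/share/Trash ~/.cache/pip  # cached downloads and trash are safe to delete
$ rm -rf ~/.conda/pkgs ~/.singularity/cache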
My SSH connection hangs
Link to section 'Problem' of 'My SSH connection hangs' Problem
Your console hangs while trying to connect to an RCAC server.
Link to section 'Solution' of 'My SSH connection hangs' Solution
This can happen for various reasons. The most common reasons for a hanging SSH terminal are:
- Network: If you are connected over wifi, make sure that your Internet connection is fine.
- Busy front-end server: When you connect to a cluster, you SSH to one of the front-end login nodes. Due to transient user loads, one or more of the front-ends may become unresponsive for a short while. To avoid this, try reconnecting to the cluster or wait until the login node you have connected to has reduced load.
- File system issue: If a server has issues with one or more of the file systems (home, scratch, or depot), it may freeze your terminal. To avoid this you can connect to another front-end.
If neither of the suggestions above work, please contact support specifying the name of the server where your console is hung.
Thinlinc session frozen
Link to section 'Problem' of 'Thinlinc session frozen' Problem
Your Thinlinc session is frozen and you cannot launch any commands or close the session.
Link to section 'Solution' of 'Thinlinc session frozen' Solution
This can happen for various reasons. The most common reason is that you ran something memory-intensive inside that Thinlinc session on a front-end, so parts of the Thinlinc session were killed by cgroups and the entire session got stuck.
- If you are using a web-version Thinlinc remote desktop (inside the browser):
The web version does not have the capability to kill the existing session, only the standalone client does. Please install the standalone client and follow the steps below:
- If you are using a Thinlinc client:
Close the ThinLinc client, reopen the client login popup, and select "End existing session". Then try "Connect" again.
Thinlinc session unreachable
Link to section 'Problem' of 'Thinlinc session unreachable' Problem
When trying to login to Thinlinc and re-connect to your existing session, you receive an error "Your Thinlinc session is currently unreachable".
Link to section 'Solution' of 'Thinlinc session unreachable' Solution
This can happen if the specific login node your existing remote desktop session was residing on is currently offline or down, so Thinlinc can not reconnect to your existing session. Most often the session is non-recoverable at this point, so the solution is to terminate your existing Thinlinc desktop session and start a new one.
- If you are using a web-version Thinlinc remote desktop (inside the browser):
The web version does not have the capability to kill the existing session, only the standalone client does. Please install the standalone client and follow the steps below:
- If you are using a Thinlinc client:
Close the ThinLinc client, reopen the client login popup, and select "End existing session". Then try "Connect" again.
How to disable Thinlinc screensaver
Link to section 'Problem' of 'How to disable Thinlinc screensaver' Problem
Your ThinLinc desktop is locked after being idle for a while, and it asks for a password to refresh it. It means the "screensaver" and "lock screen" functions are turned on, but you want to disable these functions.
Link to section 'Solution' of 'How to disable Thinlinc screensaver' Solution
If your screen is locked, close the ThinLinc client, reopen the client login popup, and select "End existing session".
To permanently avoid the screen lock issue, right-click the desktop and select Applications, then Settings, and select Screensaver.
Under Screensaver, turn off "Enable Screensaver"; then under Lock Screen, turn off "Enable Lock Screen" and close the window.
Questions
Frequently asked questions about logging in & accounts.
I worked on Data Workbench after I graduated/left Purdue, but can not access it anymore
Link to section 'Problem' of 'I worked on Data Workbench after I graduated/left Purdue, but can not access it anymore' Problem
You have graduated or left Purdue but continue collaboration with your Purdue colleagues. You find that your access to Purdue resources has suddenly stopped and your password is no longer accepted.
Link to section 'Solution' of 'I worked on Data Workbench after I graduated/left Purdue, but can not access it anymore' Solution
Access to all resources depends on having a valid Purdue Career Account. Expired Career Accounts are removed twice a year, during Spring and October breaks (more details at the official page). If your Career Account was purged due to expiration, you will not be able to access the resources.
To provide remote collaborators with valid Purdue credentials, the University provides a special procedure called Request for Privileges (R4P). If you need to continue your collaboration with your Purdue PI, the PI will have to submit or renew an R4P request on your behalf.
After your R4P is completed and Career Account is restored, please note two additional necessary steps:
- Access: Restored Career Accounts by default do not have any RCAC resources enabled for them. Your PI will have to log in to the Manage Users tool and explicitly re-enable your access by un-checking and then re-checking the checkboxes for the desired queue/Unix group resources.
- Email: Restored Career Accounts by default do not have their @purdue.edu email service enabled. While this does not preclude you from using RCAC resources, any email messages (be they generated on the clusters, or any service announcements) would not be delivered, which may cause inconvenience or loss of compute jobs. To avoid this, we recommend setting your restored @purdue.edu email service to "Forward" (to an actual address you read). The easiest way to ensure this is to go through the Account Setup process.
Jobs
Frequently asked questions related to running jobs.
Errors
Common errors and potential solutions/workarounds for them.
cannot connect to X server / cannot open display
Link to section 'Problem' of 'cannot connect to X server / cannot open display' Problem
You receive the following message after entering a command to bring up a graphical window
cannot connect to X server
cannot open display
Link to section 'Solution' of 'cannot connect to X server / cannot open display' Solution
This can happen for multiple reasons:
- Reason: Your SSH client software does not support graphical display by itself (e.g. SecureCRT or PuTTY).
- Solution: Try using client software like Thinlinc or MobaXterm as described in the SSH X11 Forwarding guide.
- Reason: You did not enable X11 forwarding in your SSH connection.
- Solution: If you are in a Windows environment, make sure that X11 forwarding is enabled in your connection settings (e.g. in MobaXterm or PuTTY). If you are in a Linux environment, try ssh -Y -l username hostname
- Reason: If none of the above apply, make sure that you are within quota of your home directory.
bash: command not found
Link to section 'Problem' of 'bash: command not found' Problem
You receive the following message after typing a command
bash: command not found
Link to section 'Solution' of 'bash: command not found' Solution
This means the system doesn't know how to find your command. Typically, you need to load the module that provides it first.
bash: module command not found
Link to section 'Problem' of 'bash: module command not found' Problem
You receive the following message after typing a command, e.g. module load intel
bash: module command not found
Link to section 'Solution' of 'bash: module command not found' Solution
The system cannot find the module command. You need to source the modules.sh file as shown below:
source /etc/profile.d/modules.sh
Alternatively, if this happens inside a shell script, start the script with an interactive Bash shebang:
#!/bin/bash -i
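For example, a minimal sketch of a script that makes the module command available before loading software (the anaconda module shown earlier in this guide is used purely as an illustration):
#!/bin/bash
# Make the module command available in a non-interactive script
source /etc/profile.d/modules.sh
# Now modules can be loaded as usual
module load anaconda/5.1.0-py36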
Close Firefox / Firefox is already running but not responding
Link to section 'Problem' of 'Close Firefox / Firefox is already running but not responding' Problem
You receive the following message after trying to launch Firefox browser inside your graphics desktop:
Close Firefox
Firefox is already running, but not responding. To open a new window,
you must first close the existing Firefox process, or restart your system.
Link to section 'Solution' of 'Close Firefox / Firefox is already running but not responding' Solution
When Firefox runs, it creates several lock files in the Firefox profile directory (inside the ~/.mozilla/firefox/ folder in your home directory). If a newly-started Firefox instance detects the presence of these lock files, it complains.
This error can happen for multiple reasons:
- Reason: You had a single Firefox process running, but it terminated abruptly without a chance to clean its lock files (e.g. the job got terminated, session ended, node crashed or rebooted, etc).
- Solution: If you are certain you do not have any other Firefox processes running elsewhere, please use the following command in a terminal window to detect and remove the lock files:
$ unlock-firefox
- Reason: You may indeed have another Firefox process running (in another Thinlinc or Gateway session on this or another cluster, or on another front-end or compute node). With many clusters sharing a common home directory, a running Firefox instance on one can affect another.
- Solution: Try finding and closing running Firefox process(es) on other nodes and clusters.
- Solution: If you must have multiple Firefox instances running simultaneously, you may be able to create separate Firefox profiles and select which one to use for each instance (see the sketch below).
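A minimal sketch of the separate-profile approach, assuming Firefox's standard command-line options; the profile name is just a placeholder:
$ firefox -ProfileManager                  # create a new profile interactively
$ firefox --no-remote -P MySecondProfile   # launch a separate instance with its own profile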
Jupyter: database is locked / can not load notebook format
Link to section 'Problem' of 'Jupyter: database is locked / can not load notebook format' Problem
You receive the following message after trying to load existing Jupyter notebooks inside your JupyterHub session:
Error loading notebook
An unknown error occurred while loading this notebook. This version can load notebook formats or earlier. See the server log for details.
Alternatively, the notebook may open but present an error when creating or saving a notebook:
Autosave Failed!
Unexpected error while saving file: MyNotebookName.ipynb database is locked
Link to section 'Solution' of 'Jupyter: database is locked / can not load notebook format' Solution
When Jupyter notebooks are opened, the server keeps track of their state in an internal database (located inside the ~/.local/share/jupyter/ folder in your home directory). If a Jupyter process gets terminated abruptly (e.g. due to an out-of-memory error or a host reboot), the database lock is not cleared properly, and future instances of Jupyter detect the lock and complain.
Please follow these steps to resolve:
- Fully exit from your existing Jupyter session (close all notebooks, terminate Jupyter, log out from JupyterHub or JupyterLab, terminate OnDemand gateway's Jupyter app, etc).
- In a terminal window (SSH, Thinlinc or OnDemand gateway's terminal app) use the following command to clean up stale database locks:
$ unlock-jupyter
- Start a new Jupyter session as usual.
Questions
Frequently asked questions about jobs.
How do I know Non-uniform Memory Access (NUMA) layout on Data Workbench?
- You can learn about processor layout on Data Workbench nodes using the following command:
workbench-a003:~$ lstopo-no-graphics
- For detailed IO connectivity:
workbench-a003:~$ lstopo-no-graphics --physical --whole-io
- Please note that NUMA information is useful for advanced MPI/OpenMP/GPU optimizations. For most users, using default NUMA settings in MPI or OpenMP would give you the best performance.
Why cannot I use --mem=0 when submitting jobs?
Link to section 'Question' of 'Why cannot I use --mem=0 when submitting jobs?' Question
Why can't I specify --mem=0 for my job?
Link to section 'Answer' of 'Why cannot I use --mem=0 when submitting jobs?' Answer
We no longer support requesting unlimited memory (--mem=0) as it has an adverse effect on the way the scheduler allocates jobs, and could lead to a large number of nodes being blocked from usage.
Most often we suggest relying on the default memory allocation (cluster-specific). But if you have to request a custom amount of memory, you can do it explicitly, for example --mem=20G.
If you want to use the entire node's memory, you can submit the job with the --exclusive option.
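As an illustration only, assuming a Slurm-style scheduler (as implied by the --mem syntax) and a placeholder script name, the two alternatives look like this:
$ sbatch --mem=20G myjob.sh      # request 20 GB of memory explicitly
$ sbatch --exclusive myjob.sh    # claim a whole node, including all of its memory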
Data
Frequently asked questions about data and data management.
How is my Data Secured on Data Workbench?
Data Workbench is operated in line with policies, standards, and best practices as described within Secure Purdue, and specific to RCAC Resources.
Security controls for Data Workbench are based on ones defined in NIST cybersecurity standards.
Data Workbench supports research at the L1 (fundamental) and L2 (sensitive) levels. Data Workbench is not approved for storing data at the L3 (restricted, covered by HIPAA) or L4 (Export Controlled, ITAR) levels, or any Controlled Unclassified Information (CUI).
For resources designed to support research with heightened security requirements, please look for resources within the REED+ Ecosystem.
Link to section 'For additional information' of 'How is my Data Secured on Data Workbench?' For additional information
Log in with your Purdue Career Account.
Can I share data with outside collaborators?
Yes! Globus allows convenient sharing of data with outside collaborators. Data can be shared with collaborators' personal computers or directly with many other computing resources at other institutions. See the Globus documentation on how to share data.
Can I access Fortress from Data Workbench?
Yes. While Fortress directories are not directly mounted on Data Workbench for performance and archival protection reasons, they can be accessed from Data Workbench front-ends and nodes using any of the recommended methods of HSI, HTAR or Globus.
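For example, a minimal sketch using the standard HSI and HTAR tools from a Data Workbench front-end; the file and directory names are placeholders, and Globus transfers are done through its web interface instead:
$ hsi ls                          # browse your Fortress space
$ htar -cvf mydata.tar mydata/    # archive a local directory to Fortress
$ htar -xvf mydata.tar            # retrieve the archived contents later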
Software
Frequently asked questions about software.
Cannot use pip after loading ml-toolkit modules
Link to section 'Question' of 'Cannot use pip after loading ml-toolkit modules' Question
Pip throws an error after loading the machine learning modules. How can I fix it?
Link to section 'Answer' of 'Cannot use pip after loading ml-toolkit modules' Answer
Machine learning modules (tensorflow, pytorch, opencv etc.) include a version of pip that is newer than the one installed with Anaconda. As a result it will throw an error when you try to use it.
$ pip --version
Traceback (most recent call last):
File "/apps/cent7/anaconda/5.1.0-py36/bin/pip", line 7, in <module>
from pip import main
ImportError: cannot import name 'main'
The preferred way to use pip with the machine learning modules is to invoke it via Python as shown below.
$ python -m pip --version
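The same pattern works for other pip operations; for instance, to install a package into your home directory (the package name here is only an example):
$ python -m pip install --user scikit-image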
How can I get access to Sentaurus software?
Link to section 'Question' of 'How can I get access to Sentaurus software?' Question
How can I get access to Sentaurus tools for micro- and nano-electronics design?
Link to section 'Answer' of 'How can I get access to Sentaurus software?' Answer
Sentaurus software license requires a signed NDA. Please contact Dr. Mark Johnson, Director of ECE Instructional Laboratories to complete the process.
Once the licensing process is complete and you have been added to the cae2 Unix group, you can use Sentaurus on RCAC community clusters by loading the corresponding environment module:
module load sentaurus
Julia package installation
Users do not have write permission to the default Julia package installation destination. However, users can install packages into their home directory under ~/.julia.
Users can sidestep the default location by explicitly defining where to put Julia packages:
$ export JULIA_DEPOT_PATH=$HOME/.julia
$ julia -e 'using Pkg; Pkg.add("PackageName")'
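To confirm that packages are going into your home-directory depot, you can list what is installed (a small sketch, run in the same shell where JULIA_DEPOT_PATH was exported):
$ julia -e 'using Pkg; Pkg.status()'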
About Research Computing
Frequently asked questions about RCAC.
Can I get a private server from RCAC?
Link to section 'Question' of 'Can I get a private server from RCAC?' Question
Can I get a private (virtual or physical) server from RCAC?
Link to section 'Answer' of 'Can I get a private server from RCAC?' Answer
Often, researchers may want a private server to run databases, web servers, or other software. RCAC currently has Geddes, a Community Composable Platform optimized for composable, cloud-like workflows that are complementary to the batch applications run on Community Clusters. Funded by the National Science Foundation under grant OAC-2018926, Geddes consists of Dell Compute nodes with two 64-core AMD Epyc 'Rome' processors (128 cores per node).
To purchase access to Geddes today, go to the Cluster Access Purchase page. Please subscribe to our Community Cluster Program Mailing List to stay informed on the latest purchasing developments or contact us (rcac-cluster-purchase@lists.purdue.edu) if you have any questions.