
Trinity RNAseq assembly software running on RCAC Clusters



Trinity is a software package for reconstructing genetic sequences from RNA transcripts. It runs well on RCAC cluster nodes, with some consideration for the specific challenges it presents: fairly high memory requirements and a dependence on a very large number of intermediate files created during its operation. This document presents our suggestions for making the best use of RCAC resources while running Trinity, organized as two separate Standard Operating Procedure (SOP) checklists.

Trinity runs on a single working node, but makes good use of threading within the node; the Hansen and Rossmann clusters, with 48 or 24 processor cores per node respectively, are good choices. The Carter cluster has faster processors, but offers fewer cores per node (16); it may offer better overall performance for some data sets, but we don't yet have statistics from testing to guide that decision. For any of the clusters, you should use a work queue that has access to nodes with at least 128 GB of RAM.
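
If you are not sure what the nodes in your queue provide, the standard Torque client tools on the frontends can report a node's core count and installed memory. This is only a sketch; the node name below is a placeholder, and the exact output format may differ:

 # Replace hansen-a001 with the name of a node your queue can reach.
 # "np" is the core count; "physmem" (in the status line) is the installed RAM.
 pbsnodes hansen-a001 | grep -E 'np =|physmem'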

Trinity can be used in two modes, each requiring some preliminary setup at RCAC: using the Lustre scratch space for output, or using a working node's local disk for output and collecting the relevant files after the run. Each mode has some caveats. Using Lustre generates a load on the underlying file system that may have an impact on other users, but it preserves the intermediate files between runs and allows Trinity's internal checkpointing to resume operation in the event of a program failure. Using the node's internal disk does not impact other system users, but its files are deleted between jobs, so other means must be used to preserve checkpointability. Our suggestion for this case is to run the program interactively, as detailed in the second suggested SOP.

Suggested SOPs for Trinity on RCAC Clusters

SOP for Trinity Using the Lustre filesystem

  1. Ensure you have sufficient quota for running Trinity. Because of the high number of files produced, you will need to request a quota increase specifically for running Trinity. This can be done in an email to rcac-help@purdue.edu. Make sure you specifically mention 'Trinity' as the reason for your quota request.
  2. Create a subdirectory in your Lustre scratch space to use for Trinity output, and configure it for handling small files. By default, files are striped across multiple locations in the Lustre filesystem. This becomes wasteful for Trinity, which writes large numbers of files that are too small to benefit from striping. You can set the striping for specific files or directories in your scratch space; if you set the striping for a directory, all its subdirectories and files will inherit the new value. For Trinity, the optimum stripe count is 1, and it is set with the "lfs setstripe" command.

For example:

"cd $RCAC_SCRATCH" "mkdir trinity-out" "lfs setstripe -c 1 trinity-out"

Now $RCAC_SCRATCH/trinity-out is configured to handle Trinity's workload.
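
To double-check the configuration, and to keep an eye on how many files you are using against your quota while Trinity runs, the standard Lustre client commands can be used. A brief sketch, following the directory name from the example above (exact output varies between Lustre versions):

 lfs getstripe -d $RCAC_SCRATCH/trinity-out   # should report a stripe count of 1 for the directory
 lfs quota -u $USER $RCAC_SCRATCH             # shows current space and file usage against your quota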

  3. In your submission file for Trinity, add the PBS directive “#PBS -l software=trinity” (see sample submission file below).
  4. In your submission file, request exclusive access to one node by requesting all the cores on the node. On Hansen the directive would be “#PBS -l nodes=1:ppn=48”. On Rossmann, the directive would be “#PBS -l nodes=1:ppn=24”. In general, find the number of cores per node that your work queue has access to, and use that number in the “ppn=” option to the directive.
  5. In your submission file, load the Bioinformatics modules with the line “module use /apps/group/bioinformatics/modules”. Note: this is not a PBS directive, so it doesn't get the "#PBS" prefix. Also load the Trinity-specific module with the line “module load trinity”.
  6. Within the submission file, specify the directory you configured in step 2 as the Trinity output parameter: “Trinity.pl … --output $RCAC_SCRATCH/trinity-out”, for example. The actual directory name for output is up to you.

When Trinity completes its operation, the output directory will contain the final “Trinity.fasta” result file, as well as all the intermediate files necessary to produce it. If you want to save the complete directory tree, you can use the “tar” and “hsi” commands to save the entire output set to your Fortress account. For example, if your output directory is called “trinity-out”, and you want to save it to a file called “trinity-run-1”, you would use “cd” to go to the trinity-out directory, then use the command:

 tar -cf - . | hsi put - : trinity-run-1

In this command, the "." character in the tar command represents the current directory, and is the directory that will be archived. The ":" in the hsi command means the following name will be used for the file on the Fortress system.
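
To retrieve and unpack that archive later, the same pattern can be run in reverse. This is a sketch; it assumes hsi accepts "-" for streaming to standard output on "get" the same way it does for "put" on your system (check "man hsi" if unsure):

 mkdir trinity-run-1-restore
 cd trinity-run-1-restore
 hsi get - : trinity-run-1 | tar -xf -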

Sample submission file for Trinity running on the Hansen cluster. (see Notes below)

 #!/bin/bash
 #PBS -l nodes=1:ppn=48
 #PBS -l walltime=100:00:00
 #PBS -l software=trinity
 #PBS -q myqueuename
 #PBS -N Trinity

 module use /apps/group/bioinformatics/modules
 module load trinity

 Trinity.pl --seqType fq --JM 50G --left $RCAC_SCRATCH/leftreads.fastq --right $RCAC_SCRATCH/rightreads.fastq --output $RCAC_SCRATCH/trinity-out --min_contig_length 300 --CPU 48 --bflyMaxHeapsize 12G --bflyCPU 12 2>&1 | tee -a $RCAC_SCRATCH/trinity-stdout.txt

NOTES on sample submission file:

  • In this example the entire Trinity command is actually on one line - the line breaks are for presentation only.
  • Hansen has 48 processor cores per node, which is why we use "ppn=48". On Rossmann, with 24 cores per node, the directive would be "#PBS -l nodes=1:ppn=24".
  • "myqueuename" should be replaced by the name of the working queue you are using. It should have access to nodes with at least 128 GB RAM for any substantial Trinity data set.
  • The construction "2>&1" means to combine error and informational messages that Trinity would normally print to the screen, and the "| tee -a $RCAC_SCRATCH/trinity-stdout.txt" construction means to direct those messages into a file named trinity-stdout.txt in the scratch directory. This allows you to easily follow the progress of Trinity during the analysis run, by using the commands "cat" or "tail -f" on the file.

Trinity using a node-local disk for output

When a PBS job finishes, any files on the working node's local disk are erased. This means that if Trinity has a correctable failure during its run on a local disk, no intermediate files are saved, and once the failure has been addressed, Trinity needs to be started from the beginning again. This can be avoided by running Trinity as an interactive job, rather than a batch job.

Since an interactive job ends only when its walltime runs out, or when its connection to its working node is broken, a failure of Trinity will not result in the loss of its files as long as the interactive job is still running. We strongly recommend that interactive Trinity jobs be run under the "screen" command (described in the following SOP). This will help protect the interactive job from inadvertent disconnection, while allowing the running job to be monitored and modified by the user.

SOP for Trinity using a local disk

  1. Log in to the cluster frontend. Note the exact frontend you are on (e.g. rossmann-fe00, rossmann-fe01, etc.).

  • Run the command "screen". You won't see any changes in your login session, except the screen will clear.

  3. Submit an interactive job with a walltime long enough to ensure you can finish the Trinity job and, if necessary, correct any errors and restart Trinity. The command to submit an interactive job will be something like: "qsub -I -q myqueuename -l nodes=1:ppn=24,walltime=200:00:00". You do not include a script name on this command. You should use a queue that you have access to, that uses nodes with at least 128 GB RAM and 500 GB of local disk. You may be able to use a smaller local disk for small Trinity data sets.

There will be a delay, but when the job starts, you will be at a command prompt on your working node. Note the node's name - it is where Trinity will run.

NB. "ppn=24" is appropriate for Rossmann. On Hansen, use "ppn=48" in its place.

  4. Start Trinity. You can use a script similar to a batch submission file, since PBS directives are interpreted as comments by a command shell (see the sample command script below). The notable difference is that your Trinity output directory should begin with "/tmp/", rather than with "$RCAC_SCRATCH" or a simple directory name.

  5. Give the two-keystroke command "Ctrl-a" followed by "d" to detach from the "screen" session.

Notes for SOP for Trinity on a node-local disk.

After step 3 in the SOP above, your actual PBS job is a command shell that will run for your walltime. PBS doesn't know anything about Trinity, only about the command shell, so as long as that shell is running your job continues, and you have sole use of the particular node your job is running on.
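
If you want to confirm from the frontend that the shell (and therefore the job) is still alive, a standard PBS status query is enough; for example:

 qstat -u $USER    # the interactive job should be listed in state "R" while its shell is running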

Sole use of the node means that you can login to that node separately (not using "screen") to monitor the Trinity job's progress if necessary, without affecting the PBS job. If, for instance, the Trinity component Butterfly fails, you can login to the node to correct Butterfly's memory specification and restart Trinity. Your job's walltime clock will still be running throughout, so you need to specify enough time in your initial "qsub" request to allow for any re-runs you think you will need.
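
For example, to look in on the run from a second login without touching the 'screen' session (the node name here is only a placeholder; use the name you noted when the job started):

 ssh rossmann-a123        # placeholder; substitute your actual working node
 top -u $USER             # watch Trinity/Butterfly CPU and memory use; press q to quit
 ls -lh /tmp/trinity-out  # confirm that intermediate files are accumulating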

"Screen" allows you to detach from and reattach to your running interactive session. While you are detached, logging out from the cluster's frontend, or losing your network connection to it, will not affect your running job.

To detach from a running "screen" session: Ctrl-a d

To reattach to a detached session: Ensure you are logged into the same frontend the session was started from, and use the command: screen -r

To list active "screen" sessions: screen -ls Sample Trinity command script for interactive session on node local disk

#!/bin/bash

module use /apps/group/bioinformatics/modules
module load trinity

Trinity.pl --seqType fq --JM 50G --left $RCAC_SCRATCH/leftreads.fastq --right $RCAC_SCRATCH/rightreads.fastq --output /tmp/trinity-out --min_contig_length 300 --CPU 48 --bflyMaxHeapsize 12G --bflyCPU 12 2>&1 | tee -a $RCAC_SCRATCH/trinity-stdout.txt

cp /tmp/trinity-out/Trinity.fasta $RCAC_SCRATCH
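
If you also want to keep the intermediate files from the local disk, the tar-and-hsi pattern shown earlier for the Lustre case can be run from the node before you let the interactive job end. A sketch, with a made-up archive name:

 cd /tmp/trinity-out
 tar -cf - . | hsi put - : trinity-run-local-1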

Extended Notes on SOP for Trinity on a node-local disk.

It's important to run 'screen' BEFORE you submit the job.

After the "qsub" command, there will be a delay until PBS starts the job. Typically it's only a minute or two, but if the system is loaded it may be as long as 4 hours. Eventually the command prompt will return. If your command prompt does not display it, you can find the name of the working node with the command 'hostname -s'. It should be rossmann-txxx, where the 'xxx' is a 3 digit number and the 't' is a single letter.

After you 'ctrl-a d' to detach from the node, your terminal will automatically switch back to the system you ran 'screen' on. That's okay, don't worry.

You'll see Trinity's output start to display on your terminal before you 'ctrl-a d'. That's okay; you don't need a command prompt before you detach - anything can be going on inside the 'screen' session, and it will still detach and keep track of things for you.

The construction at the end of the Trinity command line, "2>&1 | tee -a $RCAC_SCRATCH/trinity-stdout.txt", creates a file in your scratch space that contains everything Trinity would normally write to the screen if you just ran it as a command. You can then use 'cat', 'tail -f', or 'vi -R' (read-only) on that file to see what Trinity is doing without affecting the job in any way. You'll also see that output whenever you reattach to the 'screen' session that Trinity is running in.

It's fairly important that you stay detached from the 'screen' session unless you are directly manipulating Trinity for your overall job (e.g. re-running Butterfly commands). It's easy to get confused and kill the terminal window that contains your 'screen' session. If you do that while you are detached, there is no problem at all. If you are attached when you do it, your PBS job is killed and you'll need to start from the very beginning again.

It's very easy to get a little bit confused about these instructions and end up starting 'screen' twice on the same frontend. That's not fatal, but that's the exact point where everything becomes A LOT MORE CONFUSING, so try not to do that. If you do, try to dig through 'man screen' to see how to attach to specific sessions, and try not to kill the session Trinity is running in.
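
For reference, sorting out multiple sessions looks roughly like this; the session identifier is whatever 'screen -ls' actually prints on your frontend:

 screen -ls       # lists sessions, e.g. "12345.pts-0.rossmann-fe00   (Detached)"
 screen -r 12345  # reattach to a specific session by its leading process ID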
