RNAseq Processing Quickstart
Introduction
This page summarizes the steps that must be completed to process a flow cell’s worth of RNAseq data. All of these steps assume that you have installed bripipetools according to the instructions at Installing bripipetools.
Retrieving Data
Before you begin: Make sure that you have an Illumina BaseSpace account at http://basespace.illumina.com/. Note that the following download can take up to a few hours, depending on the amount of data being retrieved.
The Genomics Core will send an email link to share access to the run and projects. You must accept all shares using the BaseSpace dashboard.
After accepting all shares and confirming the sequencing run is finished, go to your BaseSpace dashboard and click “RUNS” (top of the page), then the name of the run you want to download, then “Download” in the “SUMMARY” tab.
A popup will prompt you to download the data. If you haven’t already installed the BaseSpace Sequence Hub Downloader, do so now. Then select “FASTQ” as the file type to be downloaded, and click the “Download” button.
Save the FASTQ files to a directory called Unaligned within the flow cell directory. The exact path will differ depending on how you have the Bioinformatics directory mounted; for example, on a Linux server the path will be /mnt/bioinformatics/pipeline/Illumina/{FlowcellID}/Unaligned
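Once the download finishes, a quick sanity check of the Unaligned directory can catch an interrupted transfer before any processing starts. The following is a minimal Python 3 sketch and is not part of bripipetools; the function name `summarize_fastqs` is hypothetical, and it assumes that FASTQ files end in `.fastq` or `.fastq.gz`.

```python
from pathlib import Path


def summarize_fastqs(unaligned_dir):
    """Return (number of FASTQ files, list of zero-byte FASTQ files)."""
    root = Path(unaligned_dir)
    # Search recursively in case the FASTQs are nested in subfolders.
    fastqs = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.name.endswith((".fastq", ".fastq.gz"))
    )
    # A zero-byte file usually points to an interrupted download.
    empty = [p for p in fastqs if p.stat().st_size == 0]
    return len(fastqs), empty
```

For example, `summarize_fastqs("/mnt/bioinformatics/pipeline/Illumina/{FlowcellID}/Unaligned")` returns the number of FASTQ files found and a list of any that are empty.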
Preprocessing Data
Warning
FASTQ Directory Structure
Starting in October 2017, the directory structure used by BaseSpace to store FASTQs changed slightly. As a result, the following script must be run in order to set up the directory structure for workflow processing. If the script is not run there will be no warning or indication of failure, but only one lane’s worth of data will be processed for each project. If you see many fewer counts than expected, make sure that this step was done properly!
bripipetools/scripts/fix_fastq_paths.sh /mnt/bioinformatics/pipeline/Illumina/{FlowcellID}
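Because a missed step here fails silently, it can help to know how many lanes of data each library actually has on disk, so that unexpectedly low counts later can be compared against a baseline. The sketch below is only an inventory: it does not tell you whether fix_fastq_paths.sh has been run. It assumes Python 3 and the standard Illumina FASTQ naming pattern (`<sample>_S<n>_L<lane>_R<read>_001.fastq.gz`); the function names are hypothetical and not part of bripipetools.

```python
import re
from collections import defaultdict
from pathlib import Path

# Standard Illumina FASTQ naming: <sample>_S<n>_L<lane>_R<read>_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S\d+_L(?P<lane>\d{3})_[RI]\d_\d{3}\.fastq(?:\.gz)?$"
)


def lanes_per_sample(flowcell_dir):
    """Map each sample name to the sorted list of lanes that have FASTQ files."""
    lanes = defaultdict(set)
    for path in Path(flowcell_dir).rglob("*.fastq*"):
        match = FASTQ_NAME.match(path.name)
        if match:
            lanes[match.group("sample")].add(match.group("lane"))
    return {sample: sorted(found) for sample, found in lanes.items()}


def samples_missing_lanes(flowcell_dir):
    """Return samples that have fewer lanes than the best-covered sample."""
    per_sample = lanes_per_sample(flowcell_dir)
    if not per_sample:
        return {}
    expected = max(len(found) for found in per_sample.values())
    return {s: found for s, found in per_sample.items() if len(found) < expected}
```

If `samples_missing_lanes` returns a non-empty dictionary, those libraries have fewer lanes on disk than the others, which is worth checking against the run design before processing.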
Creating Workflow Batch Files
Before you begin: Make sure that the database configuration in bripipetools/bripipetools/config/default.ini is correct. There should be a config entry for [researchdb] with appropriately-set fields db_name, db_host, user, and password. If you have questions about the appropriate values to use, please contact Mario Rosasco.
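One way to confirm that the config file has the expected entries is to parse it with Python's standard `configparser` module. This is a minimal sketch, assuming Python 3; the function name `missing_researchdb_fields` is hypothetical and not part of bripipetools. It only checks that the four fields named above are present and non-blank, not that their values are correct.

```python
import configparser

REQUIRED_FIELDS = ("db_name", "db_host", "user", "password")


def missing_researchdb_fields(config_path):
    """Return the [researchdb] fields that are absent or blank in the config file."""
    # Disable interpolation so a '%' in a password is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    # read() silently skips unreadable files, so check what was actually loaded.
    if not parser.read(config_path):
        raise FileNotFoundError(config_path)
    if not parser.has_section("researchdb"):
        return list(REQUIRED_FIELDS)
    section = parser["researchdb"]
    return [
        field for field in REQUIRED_FIELDS
        if not section.get(field, fallback="").strip()
    ]
```

An empty list from `missing_researchdb_fields("bripipetools/bripipetools/config/default.ini")` means all four fields are set.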
Activate the bripipetools environment:
source activate bripipetools
Create a batch submission file. If you need to select non-standard workflows, you may want to look at the --workflow-dir and --all-workflows options.
bripipetools submit /mnt/bioinformatics/pipeline/Illumina/{FlowcellID}
Submit Workflow Batch to Globus Galaxy
Before you begin: Make sure that you have a Globus Genomics account that’s connected to the Galaxy instance that Globus hosts for BRI. If you have questions about this, contact Mario Rosasco or Scott Presnell.
Log in to your Globus account at https://app.globus.org/endpoints.
Click on “ENDPOINTS” on the left-hand side if you’re not already there.
Click on the local endpoint “benaroyaresearch#BRIGridFTP” then click on “Activate”. Activate using your BRI credentials, and click the “Advanced” button to set the credential lifetime to something large, like 10000 hours.
Go to the Globus-hosted Galaxy instance, which is currently at https://bri.globusgenomics.org/.
Under “DATA TRANSFER” (top, left menu), select the tool “Globus Data Transfer -> Get Data via Globus”.
In the “Source Path” field, paste the “/mnt/bioinformatics/pipeline…” path for the workflow batch file that was generated by bripipetools submit in the previous section. Enter the local endpoint (“benaroyaresearch#BRIGridFTP”) in the “Source Endpoint” field, then click “Execute”. If the file transfer is successful, the upload job on the right-hand side of the screen will turn green in a moment.
Under “DATA MANAGEMENT” (bottom, left menu), select the tool “Batch Management -> Workflow batch submit”. Select the appropriate job number for the batch file you just uploaded from the drop-down menu, then click “Execute”.
Note
Monitoring Batch Jobs
In general, it’s a good idea to check the status of jobs periodically during a run. Catching problems early saves time and AWS resources. To view currently running jobs, click on the gear in the top right corner of the Galaxy dashboard, then select “Saved Histories”. Any jobs with errors will appear with red boxes in the “Datasets” column.
Warning
Batch Submission Size
Depending on the number and type of jobs in the batch, it may take several hours or even a day or two for Galaxy to complete all of the jobs. It’s best to submit workflows with only a couple hundred jobs and wait for them to complete, in case there’s any troubleshooting that needs to take place during this phase. However, there’s nothing wrong with uploading all of your batch files at once and submitting them one at a time after each finishes.
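If a batch turns out to be much larger than a couple hundred jobs, one option is to split it into smaller pieces before uploading. The sketch below shows only the chunking logic. It assumes, hypothetically, that a batch file consists of a fixed number of header lines followed by one line per job; check an actual file produced by bripipetools submit before relying on that layout. The function name `split_batch` and the parameter `n_header_lines` are illustrative and not part of bripipetools.

```python
def split_batch(lines, n_header_lines, max_jobs=200):
    """Split batch-file lines into chunks, repeating the header in each chunk.

    ASSUMPTION: the first `n_header_lines` lines are a header and every
    remaining line describes exactly one job.
    """
    if max_jobs < 1:
        raise ValueError("max_jobs must be at least 1")
    header = lines[:n_header_lines]
    jobs = lines[n_header_lines:]
    return [
        header + jobs[start:start + max_jobs]
        for start in range(0, len(jobs), max_jobs)
    ]
```

Each returned chunk could then be written to its own file and submitted one at a time, as described above.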
Post Processing: Gene Counts and Alignment Metrics
If necessary, make sure that you’re in the bripipetools environment again.
source activate bripipetools
Wrap up the processing, stitching together summary files and inserting data into the Research Database. This step will alert you if there are any missing or empty files from the run. If that’s the case, you can make a copy of the workflow batch file you submitted, and modify it to include only the jobs that need to be re-processed. This can be re-submitted as described above.
bripipetools wrapup /mnt/bioinformatics/pipeline/Illumina/{FlowcellID}
Create the gene metrics plots
while read -r path; do python scripts/plot_gene_coverage.py "$path/"; done < <(find /mnt/bioinformatics/pipeline/Illumina/{FlowcellID} -maxdepth 2 -name "metrics")
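The `find` call in the loop above selects entries named metrics at most two levels below the flow cell directory. To preview which directories would be picked up before plotting, the same selection can be sketched in Python. This is an illustration only; `find_named_dirs` is a hypothetical helper, and unlike `find` without `-type d` it returns directories only.

```python
import os


def find_named_dirs(root, name, max_depth=2):
    """Return directories called `name` at most `max_depth` levels below `root`."""
    root = os.path.abspath(root)
    matches = []
    for current, dirnames, _ in os.walk(root):
        # Depth of `current` relative to `root` (the root itself is depth 0).
        if current == root:
            depth = 0
        else:
            depth = os.path.relpath(current, root).count(os.sep) + 1
        # Entries in `dirnames` sit one level below `current`.
        if depth + 1 <= max_depth:
            matches.extend(os.path.join(current, d) for d in dirnames if d == name)
        if depth + 1 >= max_depth:
            dirnames[:] = []  # prune: do not descend any deeper
    return sorted(matches)
```

The same helper applies to the Trinity directories used later on this page by passing `"Trinity"` as the name.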
Post Processing: Trinity and MiXCR (Optional)
Before you begin: Regardless of the machine you used for the previous steps, you must do the following from srvgalaxy01, which serves as the head node for the SLURM cluster at BRI.
Concatenate Trinity results.
while read -r path; do python scripts/concatenate_trinity_output.py "$path"; done < <(find /mnt/bioinformatics/pipeline/Illumina/{FlowcellID} -maxdepth 2 -name "Trinity")
Run MiXCR on the Trinity contigs. Note that you first have to move to a directory where SLURM has write capabilities, or the jobs will not be started properly.
# this could be a different SLURM-writeable directory, but this one is standard.
cd /mnt/bioinformatics/pipeline/Illumina
while read -r path; do outdir="$(dirname "$path")/mixcrOutput_trinity"; python /mnt/bioinformatics/workspace/code/shared/bripipetools/scripts/run_mixcr.py -i "$path" -o "$outdir"; done < <(find /mnt/bioinformatics/pipeline/Illumina/{FlowcellID} -maxdepth 2 -name "Trinity")
Confirm that the jobs are running properly using squeue. Once they’ve completed, generate a summary file and push the TCR data into the Research Database:
Rscript --vanilla /mnt/bioinformatics/workspace/code/shared/bripipetools/scripts/summarize_mixcr_output.R /mnt/bioinformatics/pipeline/Illumina/{FlowcellID}
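Before launching the MiXCR loop above, it can be useful to preview the commands it would run, in particular where each output directory will land. The sketch below rebuilds the same `-i`/`-o` arguments in Python without executing anything. The function name `mixcr_commands` is hypothetical; the script path and the `mixcrOutput_trinity` directory name are taken from the shell loop.

```python
import os

RUN_MIXCR = "/mnt/bioinformatics/workspace/code/shared/bripipetools/scripts/run_mixcr.py"


def mixcr_commands(trinity_dirs, script=RUN_MIXCR):
    """Build one run_mixcr.py command (as an argument list) per Trinity directory."""
    commands = []
    for path in trinity_dirs:
        # Output goes next to the Trinity directory, mirroring `dirname` in the shell loop.
        parent = os.path.dirname(path.rstrip(os.sep))
        outdir = os.path.join(parent, "mixcrOutput_trinity")
        commands.append(["python", script, "-i", path, "-o", outdir])
    return commands
```

Printing `" ".join(cmd)` for each entry gives a dry-run listing that can be compared against what later shows up in squeue.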
Sharing Data
Depending on the flow cell, information will need to be shared with bioinformaticians and analysts, other researchers, and outside collaborators/contractors. The nature of the data to be shared will vary from flow cell to flow cell, but to generate a list of links to the summarized project directories, you can use the following script:
/mnt/bioinformatics/workspace/code/shared/bripipetools/scripts/generate_project_links.sh /mnt/bioinformatics/pipeline/Illumina/{FlowcellID}
Backing Up Illumina Run Data
Before you begin: Make sure that you’re on a machine with Illumina’s basemount tool installed.
Mount BaseSpace data (the first time you do this you’ll need to authenticate with your BaseSpace account).
mkdir ~/basespace_mount # if necessary
basemount ~/basespace_mount
Run the backup script
python /mnt/bioinformatics/workspace/code/shared/bripipetools/scripts/backup_basespace.py ~/basespace_mount/ /mnt/bioinformatics/pipeline/Illumina/basespace_backup
After the backup is complete, unmount the BaseSpace directory.
basemount --unmount ~/basespace_mount
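If the mount step fails or authentication lapses, the mount point is just an ordinary empty directory, so it may be worth confirming that the mount is live before starting the backup script. The following is a small sketch using only the Python standard library; `basespace_mount_ready` is a hypothetical helper, and it assumes that an active basemount directory is reported as a mount point by the operating system and lists at least one entry.

```python
import os


def basespace_mount_ready(mount_point):
    """Return True if mount_point is an active mount that lists at least one entry."""
    mount_point = os.path.expanduser(mount_point)
    # ismount() is False for a plain directory, e.g. when basemount did not start.
    return os.path.ismount(mount_point) and bool(os.listdir(mount_point))
```

For example, `basespace_mount_ready("~/basespace_mount")` should be True after mounting and False again after unmounting.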