Processing data¶
Data processing is the primary function of bripipetools, encompassing all bioinformatics operations performed on raw data (typically sequencing libraries) to generate processed output files. BRI processing pipelines do not include statistical analyses performed on output data.
Workflows¶
Workflows are the heart of bioinformatics processing operation. They comprise a repeatable series of steps performed on a collection of data files. Steps within a workflow can be broken down or categorized into specific modules, depending on the goal or output of the step.
Processing modules¶
data QC¶
Quality control refers to basic inspection and computation of quality metrics/statistics for raw sequencing data. Of course, quality assessment can and should occur at multiple stages from data generation to analysis. Many QC-related metrics can be evaluated to help diagnose library performance, but in practice we tend to focus on 3 main metrics:
The median CV of gene model coverage for the top-expressed genes (generated by
picard)The total number of reads (ie: “depth”) in the input fastq file (generated by
tophatStats)The number of reads that were aligned, either as singletons or pairs (generated by
picard)
Other metrics such as library complexity, GC content, percent duplication, etc. may be useful in some experiments, and can be retrieved from the genomicsMetrics collection in The Research Database.
Current tools: FastQC, picard, tophatStats
trimming¶
Trimming in an RNAseq workflow refers to two different processing steps.
Hard trimming: Removal of a fixed number of bases from raw sequencing reads. A nearly universal characteristic of sequencing data from the BRI Genomics Core is an unexpected spike in GC skew at the last position (on the 3’ end) of reads INCLUDE SCREENSHOT. For this reason, the first step in most workflows is the removal of the last 3’ base from each read.
Quality trimming: Modern aligners use “dynamic” or “adaptive” trimming to remove bases from either end of individual reads to improve mapping to the reference. Reads can also be pre-processed to remove bases that do not pass a certain quality threshold. Generally, BRI workflows trim reads from both ends until a minimum PHRED-scaled quality score of 30 is reached at both ends. In theory, removing low quality (and thus, low confidence) bases can also improve mapping rate; however, care must be taken to impose a minimum length for trimmed reads, as extremely short reads will lead to high duplication rates and biased results.
Current tools: FASTQ Trimmer, FASTQ Quality Trimmer, trimmomatic
adapter removal¶
For certain library prep procedures (e.g., Nextera), oligonucleotide indexes will be included as part of the PCR amplifcation step, prior to library construction. In these cases, these adapter sequences will appear within reads, in contrast to typical sequencing adapters (e.g., Illumina adapters and indexes used for demultiplexing) that are nominally removed by tools like CASAVA and bcl2fastq.
Current tools: FastqMcf, trimmomatic
alignment¶
This is the central step for almost all NGS processing workflows. Following any trimming, short reads are aligned to a reference genome and the results are stored in a Sequence Alignment Map (SAM) file or in the corresponding binary BAM format. For RNAseq data, it is important to use aligners that are “splice aware” (e.g., TopHat and STAR) to account for the fact that reads from mRNA transcripts may include one or more exons that are not adjacent in the genome (and therefore might not align). Alternatively, RNA reads could be aligned to a reference transcriptome.
In summer of 2018, after multiple comparisons between STAR-aligned and TopHat-aligned libraries, BRI decided to transition sequencing workflows to STAR from TopHat. Some ongoing projects may require continued support for the legacy TopHat workflows.
Current tools: STAR
Previous tools: TopHat
counting¶
After reads have been aligned to the genome, reference annotations (i.e., gene models) can be used to inspect and quantify read coverage within regions of interest (e.g., exons). Note that the exact numbers resulting from this counting are sensitive to the version of the gene model annotations (ie: GTF file) that are used.
Current tool: htseq-count
variant calling and SNP fingerprinting¶
In order to detect sample annotation errors, SNPs called in multiple libraries from the same subject can be compared for consistency. This is done using KING, which generates an estimate of a “kinship coefficient.” Additionally, SNPs can be called for association with traits of interest, such as gene expression or disease phenotype.
Current tools: samtools, bcftools, KING
metrics¶
Unlike QC, “metrics” is a catch-all category to describe any summary or quality measures of post-alignment data. The primary source of metrics is the Picard suite of tools for evaluating alignment in SAM/BAM files. For downstream tools that produce reports or logs (e.g., htseq-count), outputs are also categorized and stored under metrics.
Current tools: Picard
VDJ annotation¶
Many projects (eg: scRNAseq projects) involve identification of TCR sequences. To achieve sequencing of these highly polymorphic sequences, we first perform a de novo assembly using Trinity (see below), and then use MiXCR to identify the TCR chains from the assembly.
Current tools: mixcr
assembly¶
For experiments where it doesn’t make sense to align to a reference (eg: no reference available, huge number of polymorphisms), we perform a de novo assembly of the reads. This essentially aligns the reads to one another in a series of steps, building long sequences representing transcripts from the short reads of fragments. These long transcript sequences can then be used for a number of purposes, such as building a transcriptome, or aligning transcripts with partially conserved and partially variable regions (as in TCR identification).
Current tools: Trinity
peak calling¶
For some sequencing experiments (ChIPseq, ATACseq, CUT&RUN, CUT&TAG) it is necessary to call peaks where reads are found to pile up. MACS2 is the most common tool used to achieve this, but performs variably depending on the details of the experimental design and the settings used. In practice, we have found it best to assess the quality of peak calls from different peak callers on an experiment-by-experiment basis. Current comparisons in the literature suggest that there is a fair amount of room for optimization of peak calling precision and recall (<https://www.biorxiv.org/content/10.1101/306621v2.full/>_).
Current tools: MACS2 (used in Galaxy workflows), Genrich (used outside of Galaxy)
Workflow options¶
The following workflows are currently available for batch processing in Globus Genomics.
Production Workflows: RNAseq: TruSeq, single-end, stranded, STAR (with or without Trinity) RNAseq: Nextera, single-end, non-stranded, STAR (with or without Trinity)
Workflows in Development: ATACseq: Nextera, paired-end, non-stranded, STAR, MACS2 CUT&RUN/CUT&TAG: custom libraries, paired-end, non-stranded, STAR, MACS2
Deprecated Workflows: TruSeq, Stranded, TopHat (with or without Trinity) Nextera, Non-stranded, TopHat (with or without Trinity)
Composing a workflow¶
(in Globus Galaxy)
Implementing a new (production) workflow in Globus Galaxy consists of two steps: building a new workflow and annotating all input and output steps.
Building a workflow in Galaxy¶
Use the Workflow Editor in Globus Galaxy for the following steps:
Add all tools for processing modules (e.g., trimming, alignment, counting).
Connect inputs and outputs of individual tools.
Add workflow inputs: 1. Get Globus FASTQ data 2. Input Dataset (for reference/annotation files)
Add workflow outputs (Send Globus data)
Set all get/send data endpoint and path options to ‘set at runtime’
(optional) Set build-specific and other options to ‘set at runtime’
Annotate input and output steps (and potentially build-specific parameters)
Annotating parameters¶
For all parameters where values are to be set at runtime *, tags of the following format should be added to the Annotation / Notes field in the Globus Galaxy Workflow Editor.
* “option” parameters are recognized by the combination of their tag (in the Annotation field) as well as their name. This is different than the label field in the Galaxy workflow. In older versions of Galaxy the label was assigned automatically, but more recent versions require the user to specify a unique label name for each tool in the Galaxy workflow. This is important to remember when importing/editing workflows that were developed on a previous instance of Galaxy.
Input parameters¶
Input parameters — indicating local files that will be uploaded to Globus Galaxy nodes at the start of workflow processing — should have the following form:
extension_in
This typically only applies to fastq_in.
Output parameters¶
Output parameters are expected to have the following form:
<source>_<type>_<extension>_<out>
For example, the tag picard-rnaseq_metrics_html_out will be parsed into a dictionary like this::
{
'type': 'metrics',
'label': 'metrics',
'source': 'picard-rnaseq',
'extension': 'html'
}
Both source and label can be given added specificity with a hyphen-separated string (e.g., picard vs. picard-rnaseq or metrics vs. metrics-rmdup). The parsing code should automatically detect and group these clauses appropriately.
Annotation input parameters¶
Some workflows will access and load datasets stored in the Globus Galaxy library. These inputs (represented as Input Dataset in the workflow editor) should have annotation tags in the following form:
annotation_<type>
You can also give a name to the dataset to possibly ease navigation within the editor, but these names will not be used by downstream code.
The most common annotation input parameters are the following:
GTF gene model files:
annotation_gtf(optional name:gtfFile)Gene model refFlat files:
annotation_refflat(optional name:refFlatFile)Ribosomal interval files:
annotation_ribosomal-intervals(optional name:riboIntsFile)Adapter files:
annotation_adapters(optional name:adapterFile)
Saving the workflow for use in bripipetools¶
Once a workflow is finished and ready for testing, both the workflow template and the workflow detail files must be downloaded from Galaxy. The template file will be used to generate workflow batch files, and the workflow detail file will be used to store tool version information in the research database.
Save the workflow template¶
Click the arrow next to the workflow name in the Galaxy Workflows tab.
Select “Submit via API batch mode”.
On the following page, click the link to “Export Workflow Parameters for batch submission” and save the .txt file under
genomics/galaxy_workflows(wherever the path exists relative to your local system); make sure to remove the leadingGalaxy-API-Workflow-from the filename.
Save the workflow details¶
Click the arrow next to the workflow name in the Galaxy Workflows tab.
Select “Download or Export”
Click the link that says “Download workflow to file so that it can be saved or imported into another Galaxy server” and save the .ga file under
genomics/galaxy_workflows(wherever the path exists relative to your local system); make sure to remove the leadingGalaxy-Workflow-from the filename.
You should now have a template file with a .txt extension and a details file with a .ga extension, with otherwise identical file names that corresponding to your workflow. Note that bripipetools requires both of these files for a given workflow in order to function properly.
Running a workflow¶
The following is a general overview of how to a workflow. For more details please see RNAseq Processing Quickstart.
All of the following steps except the initial BaseSpace download should work while on srvgalaxy01.
Pipeline steps¶
process-upload
process-collect
Downloading & prepping data¶
When a new flow cell is ready for processing, a notification email is sent from the Genomics Core via BaseSpace. Information about the flowcell and corresponding projects can be found in the Flowcell log.xlsx file under DFS_Chaussabel_LabShare/Illumina HiScan SQ/ on the [srvstor01](srvstor01.brivmrc.org) server. In particular, you’ll need to pay attention to the Lane Contents tab to determine the appropriate workflow to use for each project.
On srvgalaxy01 under /mnt/genomics/Illumina/<flowcell-folder>/, create a new folder called Unaligned/ (if it doesn’t already exist). Modify permissions such that all users can write to and read from the folder (chmod -R 777 Unaligned/). The new folder should look something like this:
FC_FOLDER="/mnt/genomics/Illumina/150615_D00565_0087_AC6VG0ANXX/Unaligned"
Using bripipetools¶
The bripipetools command (which calls bripipetools/__main__.py) is the entrypoint to application functionality. If you have the bripipetools package installed, you should be able to use this command from anywhere on your system.
bripipetools --help
Usage: bripipetools [OPTIONS] COMMAND [ARGS]...
Command line interface for the `bripipetools` library.
Options:
--quiet only display printed outputs in the console - i.e., no log messages
--debug include all debug log messages in the console
--help Show this message and exit.
Commands:
dbify Import data from a flowcell run or workflow...
postprocess Perform postprocessing operations on outputs...
qc Run quality control analyses on a target...
submit Prepare batch submission for unaligned...
wrapup Perform 'dbify' and 'postprocess' operations...
Preparing workflow batches for submission¶
At this point, you’ll need to identify the most applicable workflow. Important considerations are:
Species (mouse, human, E. coli, etc)
Genome assembly and gene annotation version (GRCh38.91 vs GRCh38.77, etc)
Library preparation method (TruSeq, Nextera, CUT&RUN, etc.)
Aligner (STAR unless the project needs to be combined with data from older, TopHat-based workflows)
Additional workflow requirements (Trinity/MiXCR, MACS2, etc)
Refer to flowcell log¶
The flowcell log can be found at DFS_Chaussabel_LabShare/Illumina HiScan SQ/Flowcell log.xlsx.
Using bripipetools to submit¶
bripipetools submit --help
Usage: bripipetools submit [OPTIONS] PATH
Prepare batch submission for unaligned samples from a flowcell run or from
a list of paths in a manifest file.
Options:
--endpoint TEXT Globus Online endpoint where input data is
stored and outputs will be saved
--workflow-dir TEXT path to folder containing Galaxy workflow
template files to be used for batch
processing
--all-workflows / --optimized-only
indicate whether to include all detected
workflows as options or to keep 'optimized'
workflows only
-s, --sort-samples sort samples from smallest to largest (based
on total size of raw data files) before
submitting; this is most useful when also
restricting the number of samples
-n, --num-samples INTEGER restrict the number of samples submitted for
each project on the flowcell
-m, --manifest indicates that input path is a manifest of
sample or folder paths (not a flowcell run)
from which a workflow batch is to be created
(note: options 'sort-samples' and 'num-
samples' will be ignored)
-o, --out-dir TEXT for input manifest, folder where outputs are
to be saved; default is current directory
--help Show this message and exit.
Here’s an example call::
bripipetools submit \
--workflow-dir /mnt/genomics/galaxy_workflows \
--endpoint benaroyaresearch#BRIGridFTP \
/mnt/genomics/Illumina/150615_D00565_0087_AC6VG0ANX
Here’s another example with a manifest file:
bripipetools submit \
--workflow-dir /Volumes/genomics/galaxy_workflows/ \
--out-dir /Volumes/genomics/ICAC/Gern/ -\
-tag gern \
--manifest <(find /Volumes/genomics/ICAC/Gern -name "Sample_*")
Submitting batches in Galaxy/Globus Genomics¶
Authenticating Globus endpoint¶
First, sign in to Globus <https://app.globus.org/endpoints/>_ and navigate to the ENDPOINTS page. Select benaroyaresearch#BRIGridFTP then Activate, after which you’ll be prompted to enter your login credentials for the srvgridftp01 BRI server. Make sure to expand the “advanced” options and set the “Credential Lifetime” to something large like 10000 hours (that way, you won’t need to reauthenticate for about a week).
Uploading batch submit files¶
In the web interface for the instance of Galaxy managed by Globus for BRI (currently at <bri.globusgenomics.org/>_), select “Batch Management -> Workflow batch submit” from the left-hand side menu.
Enter “benaroyaresearch#BRIGridFTP” into the Source Endpoint field, and enter the file path generated by
bripipetools submitfor your workflow batch into the Source Path field.
Submitting batch jobs¶
Note
Monitoring Batch Jobs
In general, it’s a good idea to monitor the status of jobs intermittently during a run. This can help diagnose any issues that come up early, which will save time and AWS resources. To view currently-running jobs, you can click on the gear in the top right corner of the Galaxy dashboard, then select “Saved Histories”. Any jobs with errors will appear with red boxes in the “Datasets” column.
Warning
Batch Submission Size
Depending on the number and type of jobs in the batch, it may take several hours or even a day or two for Galaxy to complete all of the jobs. It’s best to submit workflows with only a couple hundred jobs and wait for them to complete, in case there’s any troubleshooting that needs to take place during this phase. However, there’s nothing wrong with uploading all of your batch files at once and submitting them one at a time after each finishes.
In the Galaxy web interface select “Globus Data Transfer -> Get Data via Globus” from the left-hand side menu.
Select the job number for the workflow batch file you’ve uploaded, and click Execute.
Collecting workflow batch results¶
Usage: bripipetools wrapup [OPTIONS] PATH
Perform 'dbify' and 'postprocess' operations on all projects and workflow
batches from a flowcell run.
Options:
-t, --output-type [c|m|q|v|a] type of output file to combine: c [counts],
m [metrics], q [qc], v [validation], a [all]
-x, --exclude-types [c|m|q|v] type of output file to exclude: c [counts],
m [metrics], q [qc], v [validation]
--stitch-only / --stitch-and-compile
Do NOT compile and merge all summary (non-
count) data into a single file at the
project level
--clean-outputs / --outputs-as-is
Attempt to clean/organize output files
--help Show this message and exit.
Importing flowcell data into GenLIMS¶
Usage: bripipetools dbify [OPTIONS] PATH
Import data from a flowcell run or workflow processing batch into GenLIMS
database.
Options:
--help Show this message and exit.
Postprocessing workflow outputs¶
Usage: bripipetools postprocess [OPTIONS] PATH
Perform postprocessing operations on outputs of a workflow batch.
Options:
-t, --output-type [c|m|q|v|a] type of output file to combine: c [counts],
m [metrics], q [qc], v [validation], a [all]
-x, --exclude-types [c|m|q|v] type of output file to exclude: c [counts],
m [metrics], q [qc], v [validation]
--stitch-only / --stitch-and-compile
Do NOT compile and merge all summary (non-
count) data into a single file at the
project level
--clean-outputs / --outputs-as-is
Attempt to clean/organize output files
--help Show this message and exit.
Follow up steps¶
Not all pipeline steps have been integrated into the bripipetools application code base. Remaining steps are performed with scripts located in the scripts folder.
Generating gene model coverage plots¶
usage: plot_gene_coverage.py PATH
while read path; do \
python scripts/plot_gene_coverage.py $path;
done < <(find <path-to-flowcell-folder> -name "metrics" -maxdepth 2)
Running MiXCR (depending on workflow version)¶
Note: requires SLURM!! (must run on server srvgalaxy01)
/mnt/code/shared/bripipetools/
usage: run_mixcr.py [-h] -i INPUTDIR -o RESULTSDIR [-x EXCLUDENODES]
[-s SPECIES] [-c CHAINTYPE] [-k]
Notes on arguments:
-x nodename allows the user to exclude slurm nodes from use in processing the batch
-s species sets the species (‘hsa’ = human (default), ‘mmu’ = mouse)
-c chaintype defines the immunological chain to check for. This should be ‘TCR’ (default) or ‘ALL’ (for BCR identification)
-k sets the use of KAligner2 during the align phase, which is useful for alignments with large gaps (ie: BCR identification).
while read path; do \
outdir="$(dirname $path)/mixcrOutput_trinity";
python scripts/run_mixcr.py -i $path -o $outdir;
done < <(find <path-to-flowcell-folder> -name "Trinity" -maxdepth 2)
Handy shortcut::
# Custom formatted output from squeue
alias squeuel='squeue -o "%.7i %.9P %.30j %.10u %.8T %.10M %.6D %.5C %.8p %R"'
Concatenating Trinity outputs¶
usage: concatenate_trinity_output.py PATH
while read path; do \
python scripts/concatenate_trinity_output.py $path;
done < <(find <path-to-flowcell-folder> -name "Trinity" -maxdepth 2)
Generating project links¶
usage: generate_project_links.sh PATH
bash scripts/generate_project_links.sh <path-to-flowcell-folder>
Inspecting outputs¶
After running the pulldownGalaxyData.py script, results will be stored under the flowcell folder in a new folder that looks like Project_<project-id>Processed_<date>, where date is the YYMMDD string of the date on which the script was run — e.g., Project_P43-12Processed_151208.
Retrieving details for old workflows (DEPRECATED)¶
Note
For Reference Only
The following details refer to an instance of Galaxy that was hosted and managed by BRI. This instance has been turned off and the data archived. The following information is retained for reference and completeness, but workflow and history tracking are now managed via the Databases collections.
To collect details about old workflows and histories from processing jobs on the local Galaxy server, one can either use the PostgreSQL database directly, or take advantage of an R script for interacting with the database.
Galaxy PostgreSQL database queries¶
Keeping track of various queries here with thought of eventually combining into scripts or functions.
Basic login to db::
svc_galaxy@srvgalaxy02:~$ psql svc_galaxy
History info for a project::
svc_galaxy=# select * from history where name like '%P15-8%';
svc_galaxy=# select id from history where name like '%P15-8%';
Dataset info for a specific History¶
List datasets::
svc_galaxy=# SELECT dataset_id FROM history_dataset_association WHERE history_id = '536';
Get full dataset info::
svc_galaxy=# SELECT * FROM dataset WHERE id IN (SELECT dataset_id FROM history_dataset_association WHERE history_id = '536');
Job info for a specific History¶
svc_galaxy=# SELECT * FROM job WHERE history_id = '536';
Job metrics for specific steps¶
svc_galaxy=# SELECT * FROM job_metric_numeric WHERE job_id IN (SELECT id FROM job WHERE history_id = '529' AND tool_id LIKE '%/tophat/%') AND metric_name = 'runtime_seconds';
Job metrics for datasets¶
svc_galaxy=# SELECT * FROM job_to_input_dataset WHERE dataset_id IN (SELECT dataset_id FROM history_dataset_association WHERE history_id = '536');