Processing data

Data processing is the primary function of bripipetools, encompassing all bioinformatics operations performed on raw data (typically sequencing libraries) to generate processed output files. BRI processing pipelines do not include statistical analyses performed on output data.

Workflows

Workflows are the heart of bioinformatics processing operation. They comprise a repeatable series of steps performed on a collection of data files. Steps within a workflow can be broken down or categorized into specific modules, depending on the goal or output of the step.

Processing modules

data QC

Quality control refers to basic inspection and computation of quality metrics/statistics for raw sequencing data. Of course, quality assessment can and should occur at multiple stages from data generation to analysis. Many QC-related metrics can be evaluated to help diagnose library performance, but in practice we tend to focus on 3 main metrics:

  • The median CV of gene model coverage for the top-expressed genes (generated by picard)

  • The total number of reads (ie: “depth”) in the input fastq file (generated by tophatStats)

  • The number of reads that were aligned, either as singletons or pairs (generated by picard)

Other metrics such as library complexity, GC content, percent duplication, etc. may be useful in some experiments, and can be retrieved from the genomicsMetrics collection in The Research Database.

Current tools: FastQC, picard, tophatStats

trimming

Trimming in an RNAseq workflow refers to two different processing steps.

  1. Hard trimming: Removal of a fixed number of bases from raw sequencing reads. A nearly universal characteristic of sequencing data from the BRI Genomics Core is an unexpected spike in GC skew at the last position (on the 3’ end) of reads INCLUDE SCREENSHOT. For this reason, the first step in most workflows is the removal of the last 3’ base from each read.

  2. Quality trimming: Modern aligners use “dynamic” or “adaptive” trimming to remove bases from either end of individual reads to improve mapping to the reference. Reads can also be pre-processed to remove bases that do not pass a certain quality threshold. Generally, BRI workflows trim reads from both ends until a minimum PHRED-scaled quality score of 30 is reached at both ends. In theory, removing low quality (and thus, low confidence) bases can also improve mapping rate; however, care must be taken to impose a minimum length for trimmed reads, as extremely short reads will lead to high duplication rates and biased results.

Current tools: FASTQ Trimmer, FASTQ Quality Trimmer, trimmomatic

adapter removal

For certain library prep procedures (e.g., Nextera), oligonucleotide indexes will be included as part of the PCR amplifcation step, prior to library construction. In these cases, these adapter sequences will appear within reads, in contrast to typical sequencing adapters (e.g., Illumina adapters and indexes used for demultiplexing) that are nominally removed by tools like CASAVA and bcl2fastq.

Current tools: FastqMcf, trimmomatic

alignment

This is the central step for almost all NGS processing workflows. Following any trimming, short reads are aligned to a reference genome and the results are stored in a Sequence Alignment Map (SAM) file or in the corresponding binary BAM format. For RNAseq data, it is important to use aligners that are “splice aware” (e.g., TopHat and STAR) to account for the fact that reads from mRNA transcripts may include one or more exons that are not adjacent in the genome (and therefore might not align). Alternatively, RNA reads could be aligned to a reference transcriptome.

In summer of 2018, after multiple comparisons between STAR-aligned and TopHat-aligned libraries, BRI decided to transition sequencing workflows to STAR from TopHat. Some ongoing projects may require continued support for the legacy TopHat workflows.

Current tools: STAR

Previous tools: TopHat

counting

After reads have been aligned to the genome, reference annotations (i.e., gene models) can be used to inspect and quantify read coverage within regions of interest (e.g., exons). Note that the exact numbers resulting from this counting are sensitive to the version of the gene model annotations (ie: GTF file) that are used.

Current tool: htseq-count

variant calling and SNP fingerprinting

In order to detect sample annotation errors, SNPs called in multiple libraries from the same subject can be compared for consistency. This is done using KING, which generates an estimate of a “kinship coefficient.” Additionally, SNPs can be called for association with traits of interest, such as gene expression or disease phenotype.

Current tools: samtools, bcftools, KING

metrics

Unlike QC, “metrics” is a catch-all category to describe any summary or quality measures of post-alignment data. The primary source of metrics is the Picard suite of tools for evaluating alignment in SAM/BAM files. For downstream tools that produce reports or logs (e.g., htseq-count), outputs are also categorized and stored under metrics.

Current tools: Picard

VDJ annotation

Many projects (eg: scRNAseq projects) involve identification of TCR sequences. To achieve sequencing of these highly polymorphic sequences, we first perform a de novo assembly using Trinity (see below), and then use MiXCR to identify the TCR chains from the assembly.

Current tools: mixcr

assembly

For experiments where it doesn’t make sense to align to a reference (eg: no reference available, huge number of polymorphisms), we perform a de novo assembly of the reads. This essentially aligns the reads to one another in a series of steps, building long sequences representing transcripts from the short reads of fragments. These long transcript sequences can then be used for a number of purposes, such as building a transcriptome, or aligning transcripts with partially conserved and partially variable regions (as in TCR identification).

Current tools: Trinity

peak calling

For some sequencing experiments (ChIPseq, ATACseq, CUT&RUN, CUT&TAG) it is necessary to call peaks where reads are found to pile up. MACS2 is the most common tool used to achieve this, but performs variably depending on the details of the experimental design and the settings used. In practice, we have found it best to assess the quality of peak calls from different peak callers on an experiment-by-experiment basis. Current comparisons in the literature suggest that there is a fair amount of room for optimization of peak calling precision and recall (<https://www.biorxiv.org/content/10.1101/306621v2.full/>_).

Current tools: MACS2 (used in Galaxy workflows), Genrich (used outside of Galaxy)


Workflow options

The following workflows are currently available for batch processing in Globus Genomics.

Production Workflows: RNAseq: TruSeq, single-end, stranded, STAR (with or without Trinity) RNAseq: Nextera, single-end, non-stranded, STAR (with or without Trinity)

Workflows in Development: ATACseq: Nextera, paired-end, non-stranded, STAR, MACS2 CUT&RUN/CUT&TAG: custom libraries, paired-end, non-stranded, STAR, MACS2

Deprecated Workflows: TruSeq, Stranded, TopHat (with or without Trinity) Nextera, Non-stranded, TopHat (with or without Trinity)


Composing a workflow

(in Globus Galaxy)

Implementing a new (production) workflow in Globus Galaxy consists of two steps: building a new workflow and annotating all input and output steps.

Building a workflow in Galaxy

Use the Workflow Editor in Globus Galaxy for the following steps:

  1. Add all tools for processing modules (e.g., trimming, alignment, counting).

  2. Connect inputs and outputs of individual tools.

  3. Add workflow inputs: 1. Get Globus FASTQ data 2. Input Dataset (for reference/annotation files)

  4. Add workflow outputs (Send Globus data)

  5. Set all get/send data endpoint and path options to ‘set at runtime’

  6. (optional) Set build-specific and other options to ‘set at runtime’

  7. Annotate input and output steps (and potentially build-specific parameters)

Annotating parameters

For all parameters where values are to be set at runtime *, tags of the following format should be added to the Annotation / Notes field in the Globus Galaxy Workflow Editor.

* “option” parameters are recognized by the combination of their tag (in the Annotation field) as well as their name. This is different than the label field in the Galaxy workflow. In older versions of Galaxy the label was assigned automatically, but more recent versions require the user to specify a unique label name for each tool in the Galaxy workflow. This is important to remember when importing/editing workflows that were developed on a previous instance of Galaxy.

Input parameters

Input parameters — indicating local files that will be uploaded to Globus Galaxy nodes at the start of workflow processing — should have the following form:

extension_in

This typically only applies to fastq_in.

Output parameters

Output parameters are expected to have the following form:

<source>_<type>_<extension>_<out>

For example, the tag picard-rnaseq_metrics_html_out will be parsed into a dictionary like this::

{
    'type': 'metrics',
    'label': 'metrics',
    'source': 'picard-rnaseq',
    'extension': 'html'
 }

Both source and label can be given added specificity with a hyphen-separated string (e.g., picard vs. picard-rnaseq or metrics vs. metrics-rmdup). The parsing code should automatically detect and group these clauses appropriately.

Annotation input parameters

Some workflows will access and load datasets stored in the Globus Galaxy library. These inputs (represented as Input Dataset in the workflow editor) should have annotation tags in the following form:

annotation_<type>

You can also give a name to the dataset to possibly ease navigation within the editor, but these names will not be used by downstream code.

The most common annotation input parameters are the following:

  • GTF gene model files: annotation_gtf (optional name: gtfFile)

  • Gene model refFlat files: annotation_refflat (optional name: refFlatFile)

  • Ribosomal interval files: annotation_ribosomal-intervals (optional name: riboIntsFile)

  • Adapter files: annotation_adapters (optional name: adapterFile)

Saving the workflow for use in bripipetools

Once a workflow is finished and ready for testing, both the workflow template and the workflow detail files must be downloaded from Galaxy. The template file will be used to generate workflow batch files, and the workflow detail file will be used to store tool version information in the research database.

Save the workflow template

  1. Click the arrow next to the workflow name in the Galaxy Workflows tab.

  2. Select “Submit via API batch mode”.

  3. On the following page, click the link to “Export Workflow Parameters for batch submission” and save the .txt file under genomics/galaxy_workflows (wherever the path exists relative to your local system); make sure to remove the leading Galaxy-API-Workflow- from the filename.

Save the workflow details

  1. Click the arrow next to the workflow name in the Galaxy Workflows tab.

  2. Select “Download or Export”

  3. Click the link that says “Download workflow to file so that it can be saved or imported into another Galaxy server” and save the .ga file under genomics/galaxy_workflows (wherever the path exists relative to your local system); make sure to remove the leading Galaxy-Workflow- from the filename.

You should now have a template file with a .txt extension and a details file with a .ga extension, with otherwise identical file names that corresponding to your workflow. Note that bripipetools requires both of these files for a given workflow in order to function properly.

Importing a new workflow to the Research Database

When bripipetools wrapup is run on a workflow batch file for a new workflow, the tool details will be included in the new document in the genomicsWorkflowbatches collection (see Databases for more information).


Running a workflow

The following is a general overview of how to a workflow. For more details please see RNAseq Processing Quickstart.

All of the following steps except the initial BaseSpace download should work while on srvgalaxy01.

Pipeline steps

  1. Downloading & prepping data

  2. process-upload

  3. process-collect

  4. Follow up steps

Downloading & prepping data

When a new flow cell is ready for processing, a notification email is sent from the Genomics Core via BaseSpace. Information about the flowcell and corresponding projects can be found in the Flowcell log.xlsx file under DFS_Chaussabel_LabShare/Illumina HiScan SQ/ on the [srvstor01](srvstor01.brivmrc.org) server. In particular, you’ll need to pay attention to the Lane Contents tab to determine the appropriate workflow to use for each project.

On srvgalaxy01 under /mnt/genomics/Illumina/<flowcell-folder>/, create a new folder called Unaligned/ (if it doesn’t already exist). Modify permissions such that all users can write to and read from the folder (chmod -R 777 Unaligned/). The new folder should look something like this:

FC_FOLDER="/mnt/genomics/Illumina/150615_D00565_0087_AC6VG0ANXX/Unaligned"

Using bripipetools

The bripipetools command (which calls bripipetools/__main__.py) is the entrypoint to application functionality. If you have the bripipetools package installed, you should be able to use this command from anywhere on your system.

bripipetools --help
Usage: bripipetools [OPTIONS] COMMAND [ARGS]...

  Command line interface for the `bripipetools` library.

Options:
  --quiet  only display printed outputs in the console - i.e., no log messages
  --debug  include all debug log messages in the console
  --help   Show this message and exit.

Commands:
  dbify        Import data from a flowcell run or workflow...
  postprocess  Perform postprocessing operations on outputs...
  qc           Run quality control analyses on a target...
  submit       Prepare batch submission for unaligned...
  wrapup       Perform 'dbify' and 'postprocess' operations...

Preparing workflow batches for submission

At this point, you’ll need to identify the most applicable workflow. Important considerations are:

  • Species (mouse, human, E. coli, etc)

  • Genome assembly and gene annotation version (GRCh38.91 vs GRCh38.77, etc)

  • Library preparation method (TruSeq, Nextera, CUT&RUN, etc.)

  • Aligner (STAR unless the project needs to be combined with data from older, TopHat-based workflows)

  • Additional workflow requirements (Trinity/MiXCR, MACS2, etc)

Refer to flowcell log

The flowcell log can be found at DFS_Chaussabel_LabShare/Illumina HiScan SQ/Flowcell log.xlsx.

Using bripipetools to submit

bripipetools submit --help
Usage: bripipetools submit [OPTIONS] PATH

  Prepare batch submission for unaligned samples from a flowcell run or from
  a list of paths in a manifest file.

Options:
  --endpoint TEXT                 Globus Online endpoint where input data is
                                  stored and outputs will be saved
  --workflow-dir TEXT             path to folder containing Galaxy workflow
                                  template files to be used for batch
                                  processing
  --all-workflows / --optimized-only
                                  indicate whether to include all detected
                                  workflows as options or to keep 'optimized'
                                  workflows only
  -s, --sort-samples              sort samples from smallest to largest (based
                                  on total size of raw data files) before
                                  submitting; this is most useful when also
                                  restricting the number of samples
  -n, --num-samples INTEGER       restrict the number of samples submitted for
                                  each project on the flowcell
  -m, --manifest                  indicates that input path is a manifest of
                                  sample or folder paths (not a flowcell run)
                                  from which a workflow batch is to be created
                                  (note: options 'sort-samples' and 'num-
                                  samples' will be ignored)
  -o, --out-dir TEXT              for input manifest, folder where outputs are
                                  to be saved; default is current directory
  --help                          Show this message and exit.

Here’s an example call::

bripipetools submit \
    --workflow-dir /mnt/genomics/galaxy_workflows \
    --endpoint benaroyaresearch#BRIGridFTP \
    /mnt/genomics/Illumina/150615_D00565_0087_AC6VG0ANX

Here’s another example with a manifest file:

bripipetools submit \
    --workflow-dir /Volumes/genomics/galaxy_workflows/ \
    --out-dir /Volumes/genomics/ICAC/Gern/ -\
    -tag gern \
    --manifest <(find /Volumes/genomics/ICAC/Gern -name "Sample_*")

Submitting batches in Galaxy/Globus Genomics

Authenticating Globus endpoint

First, sign in to Globus <https://app.globus.org/endpoints/>_ and navigate to the ENDPOINTS page. Select benaroyaresearch#BRIGridFTP then Activate, after which you’ll be prompted to enter your login credentials for the srvgridftp01 BRI server. Make sure to expand the “advanced” options and set the “Credential Lifetime” to something large like 10000 hours (that way, you won’t need to reauthenticate for about a week).

Uploading batch submit files

  1. In the web interface for the instance of Galaxy managed by Globus for BRI (currently at <bri.globusgenomics.org/>_), select “Batch Management -> Workflow batch submit” from the left-hand side menu.

  2. Enter “benaroyaresearch#BRIGridFTP” into the Source Endpoint field, and enter the file path generated by bripipetools submit for your workflow batch into the Source Path field.

Submitting batch jobs

Note

Monitoring Batch Jobs

In general, it’s a good idea to monitor the status of jobs intermittently during a run. This can help diagnose any issues that come up early, which will save time and AWS resources. To view currently-running jobs, you can click on the gear in the top right corner of the Galaxy dashboard, then select “Saved Histories”. Any jobs with errors will appear with red boxes in the “Datasets” column.

Warning

Batch Submission Size

Depending on the number and type of jobs in the batch, it may take several hours or even a day or two for Galaxy to complete all of the jobs. It’s best to submit workflows with only a couple hundred jobs and wait for them to complete, in case there’s any troubleshooting that needs to take place during this phase. However, there’s nothing wrong with uploading all of your batch files at once and submitting them one at a time after each finishes.

  1. In the Galaxy web interface select “Globus Data Transfer -> Get Data via Globus” from the left-hand side menu.

  2. Select the job number for the workflow batch file you’ve uploaded, and click Execute.

Collecting workflow batch results

Usage: bripipetools wrapup [OPTIONS] PATH

  Perform 'dbify' and 'postprocess' operations on all projects and workflow
  batches from a flowcell run.

Options:
  -t, --output-type [c|m|q|v|a]   type of output file to combine: c [counts],
                                  m [metrics], q [qc], v [validation], a [all]
  -x, --exclude-types [c|m|q|v]   type of output file to exclude: c [counts],
                                  m [metrics], q [qc], v [validation]
  --stitch-only / --stitch-and-compile
                                  Do NOT compile and merge all summary (non-
                                  count) data into a single file at the
                                  project level
  --clean-outputs / --outputs-as-is
                                  Attempt to clean/organize output files
  --help                          Show this message and exit.

Importing flowcell data into GenLIMS

Usage: bripipetools dbify [OPTIONS] PATH

  Import data from a flowcell run or workflow processing batch into GenLIMS
  database.

Options:
  --help  Show this message and exit.

Postprocessing workflow outputs

Usage: bripipetools postprocess [OPTIONS] PATH

  Perform postprocessing operations on outputs of a workflow batch.

Options:
  -t, --output-type [c|m|q|v|a]   type of output file to combine: c [counts],
                                  m [metrics], q [qc], v [validation], a [all]
  -x, --exclude-types [c|m|q|v]   type of output file to exclude: c [counts],
                                  m [metrics], q [qc], v [validation]
  --stitch-only / --stitch-and-compile
                                  Do NOT compile and merge all summary (non-
                                  count) data into a single file at the
                                  project level
  --clean-outputs / --outputs-as-is
                                  Attempt to clean/organize output files
  --help                          Show this message and exit.

Follow up steps

Not all pipeline steps have been integrated into the bripipetools application code base. Remaining steps are performed with scripts located in the scripts folder.

Generating gene model coverage plots

usage: plot_gene_coverage.py PATH
while read path; do \
    python scripts/plot_gene_coverage.py $path;
done < <(find <path-to-flowcell-folder> -name "metrics" -maxdepth 2)

Running MiXCR (depending on workflow version)

Note: requires SLURM!! (must run on server srvgalaxy01)

/mnt/code/shared/bripipetools/
usage: run_mixcr.py [-h] -i INPUTDIR -o RESULTSDIR [-x EXCLUDENODES]
            [-s SPECIES] [-c CHAINTYPE] [-k]

Notes on arguments:

  • -x nodename allows the user to exclude slurm nodes from use in processing the batch

  • -s species sets the species (‘hsa’ = human (default), ‘mmu’ = mouse)

  • -c chaintype defines the immunological chain to check for. This should be ‘TCR’ (default) or ‘ALL’ (for BCR identification)

  • -k sets the use of KAligner2 during the align phase, which is useful for alignments with large gaps (ie: BCR identification).

while read path; do \
    outdir="$(dirname $path)/mixcrOutput_trinity";
    python scripts/run_mixcr.py -i $path -o $outdir;
done < <(find <path-to-flowcell-folder> -name "Trinity" -maxdepth 2)

Handy shortcut::

# Custom formatted output from squeue
alias squeuel='squeue -o "%.7i %.9P %.30j %.10u %.8T %.10M %.6D %.5C %.8p %R"'

Concatenating Trinity outputs

usage: concatenate_trinity_output.py PATH
while read path; do \
    python scripts/concatenate_trinity_output.py $path;
done < <(find <path-to-flowcell-folder> -name "Trinity" -maxdepth 2)

Inspecting outputs

After running the pulldownGalaxyData.py script, results will be stored under the flowcell folder in a new folder that looks like Project_<project-id>Processed_<date>, where date is the YYMMDD string of the date on which the script was run — e.g., Project_P43-12Processed_151208.


Retrieving details for old workflows (DEPRECATED)

Note

For Reference Only

The following details refer to an instance of Galaxy that was hosted and managed by BRI. This instance has been turned off and the data archived. The following information is retained for reference and completeness, but workflow and history tracking are now managed via the Databases collections.

To collect details about old workflows and histories from processing jobs on the local Galaxy server, one can either use the PostgreSQL database directly, or take advantage of an R script for interacting with the database.

Galaxy PostgreSQL database queries

Keeping track of various queries here with thought of eventually combining into scripts or functions.

Basic login to db::

svc_galaxy@srvgalaxy02:~$ psql svc_galaxy

History info for a project::

svc_galaxy=# select * from history where name like '%P15-8%';
svc_galaxy=# select id from history where name like '%P15-8%';

Dataset info for a specific History

List datasets::

svc_galaxy=# SELECT dataset_id FROM history_dataset_association WHERE history_id = '536';

Get full dataset info::

svc_galaxy=# SELECT * FROM dataset WHERE id IN (SELECT dataset_id FROM history_dataset_association WHERE history_id = '536');

Job info for a specific History

svc_galaxy=# SELECT * FROM job WHERE history_id = '536';

Job metrics for specific steps

svc_galaxy=# SELECT * FROM job_metric_numeric WHERE job_id IN (SELECT id FROM job WHERE history_id = '529' AND tool_id LIKE '%/tophat/%') AND metric_name = 'runtime_seconds';

Job metrics for datasets

svc_galaxy=# SELECT * FROM job_to_input_dataset WHERE dataset_id IN (SELECT dataset_id FROM history_dataset_association WHERE history_id = '536');