bripipetools application packages

Overview

Application-level packages are those exposed to the user through wrapper scripts and the command line. They are used to perform common, high-level tasks related to pipeline operations and data. Packages are listed roughly in order of dependency hierarchy (i.e., packages listed first depend on subsequently listed packages).

Note

Intended for developers!

The documentation below is effectively a dump of all high-level packages, modules, classes, and methods that are used to run bripipetools. Most users shouldn’t need this level of detail, but it provides a starting point for those looking to understand or modify the code.


Package details

dbification package

Manages the collection and annotation of data (e.g., generated by the Genomics Core or produced through bioinformatics processing) for import into GenLIMS. Modules are designed to handle the set of data associated with a particular “step” (e.g., a flowcell sequencing run or bioinformatics processing of a batch of samples). The dbification.control module inspects an input path and deploys the appropriate importer class.

control submodule

Parse arguments to determine and select appropriate importer class.

class bripipetools.dbification.control.ImportManager(path, db, run_opts)[source]

Takes an input argument (path) from script or module specifying a scope of data to be imported into GenLIMS; selects the appropriate importer class and makes insert command available.

_init_importer()[source]

Initialize the appropriate importer for the provided path.

_sniff_path()[source]

Check path for known patterns and return path type for importer.

run(collections='all')[source]

Execute the insert method of the selected importer.
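The sniff-then-dispatch flow described for ImportManager can be illustrated with a minimal sketch. The path patterns, the path type names, and the helper functions below are hypothetical and chosen only for illustration; the patterns that _sniff_path() actually checks are not listed in this documentation.

```python
import re

# Hypothetical patterns, for illustration only; the real patterns used by
# _sniff_path() are not shown in this documentation.
PATH_PATTERNS = [
    ("flowcell_path", re.compile(r"/\d{6}_\w+XX/?$")),
    ("workflowbatch_file", re.compile(r"/globus_batch_submission/[^/]+\.txt$")),
]


def sniff_path(path):
    """Return the first path type whose pattern matches, or None."""
    for path_type, pattern in PATH_PATTERNS:
        if pattern.search(path):
            return path_type
    return None


def select_importer(path, importers):
    """Look up the importer registered for the sniffed path type."""
    path_type = sniff_path(path)
    if path_type is None:
        raise ValueError(f"unrecognized path: {path}")
    return importers[path_type]
```

Keeping the patterns in a single table means a new importer only needs one new entry plus a registration in the importers mapping.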

flowcellrun module

Class for importing data from a sequencing run into GenLIMS and the Research DB as new objects.

class bripipetools.dbification.flowcellrun.FlowcellRunImporter(path, db, run_opts)[source]

Collects FlowcellRun and SequencedLibrary objects from a sequencing run, converts to documents, inserts into database.

_collect_flowcellrun()[source]

Collect FlowcellRun object for flowcell run.

_collect_librarygenecounts()[source]

Collect list of library gene count objects for flowcell run.

_collect_librarymetrics()[source]

Collect list of library metrics objects for flowcell run.

_collect_sequencedlibraries()[source]

Collect list of SequencedLibrary objects for flowcell run.

_insert_genomicsFlowcellRun(collection='all')[source]

Convert FlowcellRun object and insert into Research database.

_insert_genomicsLibrarymetrics()[source]

Convert library metrics objects and insert into Research database.

_insert_genomicsSequencedlibraries()[source]

Convert SequencedLibrary objects and insert into Research database.

_insert_genomicsWorkflowbatches()[source]

Collect WorkflowBatch objects and insert them into database.

_insert_librarygenecounts()[source]

Convert library gene count objects and insert into Research database.

insert(collection='all')[source]

Insert documents into ResearchDB databases. Note that ResearchDB collection names are prefixed with ‘genomics’ to indicate the data origin.
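As a rough illustration of how a collection argument that defaults to 'all' might select among per-collection insert methods, consider the sketch below. The handler mapping, the insert function, and the collection names used with it are assumptions made for this example, not the importer's actual internals.

```python
def insert(handlers, collection="all"):
    """Call every handler for 'all', or only the handler for one collection.

    `handlers` is a hypothetical mapping from collection name to a
    zero-argument callable that performs the insert.
    """
    if collection == "all":
        selected = list(handlers)
    elif collection in handlers:
        selected = [collection]
    else:
        raise ValueError(f"unknown collection: {collection}")
    return [handlers[name]() for name in selected]
```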

workflowbatch module

Class for importing data from a processing batch into databases as new objects. Supports both research database (“genomics…”) and GenLIMS collections.

class bripipetools.dbification.workflowbatch.WorkflowBatchImporter(path, db, run_opts)[source]

Collects WorkflowBatch and ProcessedLibrary objects from a processing batch, converts to documents, inserts into database.

_collect_processedlibraries()[source]

Collect list of ProcessedLibrary objects for workflow batch.

_collect_workflowbatch()[source]

Collect WorkflowBatch object for workflow batch.

_insert_genomicsProcessedlibraries()[source]

Convert ProcessedLibrary objects and insert into database.

_insert_genomicsWorkflowbatch()[source]

Convert WorkflowBatch object and insert into database.

insert(collection='all')[source]

Insert documents into databases. Note that ResearchDB collection names are prefixed with “genomics” to indicate the data origin.


postprocessing package

Covers a range of operations performed on outputs and other files produced through bioinformatics processing of a batch of samples. For example, the postprocessing.stitching module parses data from individual files of similar type and combines the data into a single table for all samples in a project. By extension, postprocessing.compiling takes these stitched tables of different types and combines them into a new, larger table for the project. The postprocessing.cleanup module, by contrast, fixes the way files are named and organized on disk.

stitching module

Combine parsed data from a set of batch processing output files and write to a single CSV file.

class bripipetools.postprocessing.stitching.OutputStitcher(path, output_type=None, outputs=None)[source]

Given a path to an output folder or list of files, combine parsed data from files and write CSV.

_add_mapped_reads_column(data)[source]

Add mapped_reads_w_dups column to metrics table data.

_build_combined_filename()[source]

Parse input path to create filename for combined CSV file.

_build_overrepresented_seq_table()[source]

Parse and combine overrepresented sequences tables from FastQC files.

_build_table()[source]

Combine parsed data into table for writing.

_get_outputs(output_type)[source]

Return list of outputs of specified type.

_get_parser(output_type, output_source)[source]

Return the appropriate parser for the current output file.

_read_data()[source]

Parse and store data for each output file.

_sniff_output_type()[source]

Return predicted output type based on specified path.

write_overrepresented_seq_table()[source]

Write combined overrepresented sequences table to CSV file.

write_table()[source]

Write the combined table to a CSV file.
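The stitching step (parsed values per file combined into one table, then written as CSV) can be sketched as follows. This is a simplified illustration, not the OutputStitcher implementation: the input shape (a dict of parsed values per sample) and the use of ‘libId’ as the header of the ID column are assumptions.

```python
import csv
import io


def build_table(parsed_by_sample):
    """Combine per-sample dicts of parsed values into a header plus rows."""
    # Use the union of all fields so samples with missing values still fit.
    fields = sorted({key for data in parsed_by_sample.values() for key in data})
    header = ["libId"] + fields
    rows = [
        [sample] + [data.get(field, "") for field in fields]
        for sample, data in sorted(parsed_by_sample.items())
    ]
    return [header] + rows


def write_table(table, handle):
    """Write the combined table to an open file handle as CSV."""
    csv.writer(handle, lineterminator="\n").writerows(table)
```

Writing to a handle rather than a path keeps the sketch easy to exercise in memory, for example with io.StringIO.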

compiling module

Compile combined/stitched ‘summary’ outputs of different types from batch processing and write to a single CSV file.

class bripipetools.postprocessing.compiling.OutputCompiler(paths)[source]

Reads combined output tables from list of file paths and compiles into single table, stored in a file at the project level.

_build_combined_filename()[source]

Modify input path to create filename for combined CSV file.

_build_table()[source]

Combine data into table for writing; only keep sample IDs (first column of each file, with header ‘libId’) from first file in list.

_read_data()[source]

Read, sort, and store data for each output file.

write_table()[source]

Write the combined table to a CSV file.
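A minimal sketch of the compile step, in which the ID column is kept only from the first table, might look like this. It assumes each table is a list of rows with a header first and that every table contains the same set of sample IDs; how OutputCompiler handles mismatched samples is not described here.

```python
def read_rows(rows):
    """Split the header from the data rows and sort data rows by sample ID."""
    header, *data = rows
    return [header] + sorted(data, key=lambda row: row[0])


def compile_tables(tables):
    """Join tables column-wise, keeping the ID column from the first only."""
    tables = [read_rows(table) for table in tables]
    combined = [list(row) for row in tables[0]]
    for table in tables[1:]:
        for out_row, row in zip(combined, table):
            out_row.extend(row[1:])  # drop the repeated ID column
    return combined
```

Sorting each table by sample ID before joining is what lets the rows be paired positionally.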

cleanup module

Clean up and organize outputs from a processing workflow batch.

class bripipetools.postprocessing.cleanup.OutputCleaner(path)[source]

Moves, renames, and deletes individual output files from a workflow processing batch for a selected project.

_get_output_paths(output_type)[source]

Return full path for individual output files.

_get_output_types()[source]

Identify the types of outputs included for the project.

_recode_output(path, output_type)[source]

Rename file according to template.

_unnest_output(path)[source]

Unnest files in a subfolder by concatenating filenames and moving up one level.

_unzip_output(path)[source]

Unzip the contents of a compressed output file.

clean_outputs()[source]

Walk through output types to unzip, unnest, and rename files.
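The unnesting operation (concatenating the subfolder name with each filename and moving the file up one level) can be sketched as below. The underscore separator and the removal of the emptied subfolder are assumptions for illustration; the actual naming template is not given in this documentation.

```python
import shutil
from pathlib import Path


def unnest_output(subfolder):
    """Move files out of `subfolder`, prefixing each with the folder name."""
    subfolder = Path(subfolder)
    moved = []
    for item in sorted(subfolder.iterdir()):
        if item.is_file():
            # Assumed naming scheme: <subfolder name>_<original filename>
            target = subfolder.parent / f"{subfolder.name}_{item.name}"
            shutil.move(str(item), str(target))
            moved.append(target)
    # Remove the subfolder only if nothing (e.g., nested folders) remains.
    if not any(subfolder.iterdir()):
        subfolder.rmdir()
    return moved
```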


monitoring package

Contains tools for monitoring the status of pipeline steps. Classes and methods here are designed to inspect files on the server and report on various indicators of state (e.g., file existence, access, completion, size, etc.).

workflowbatches module

Monitor the outputs of a workflow processing batch.

class bripipetools.monitoring.workflowbatches.WorkflowBatchMonitor(workflowbatch_file, pipeline_root)[source]

Controls operations (identification, annotation, etc.) for the set of outputs generated by a batch processing job in Globus Galaxy.

Parameters
  • workflowbatch_file (str) – File path of the submitted workflow batch file.

  • pipeline_root (str) – Path to the root directory for processing.

_clean_output_paths(outputs)[source]

Replaces ambiguous file path roots with current system root.

Parameters

outputs (list) – A list of dicts, one for each sample in the workflow batch, where key-value pairs in the dict describe the tag/label and path to each output file for the sample.

Return type

list

Returns

A list of dicts, with output file paths updated to use the current system root for the ‘genomics’ server.
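One way to picture the root replacement is the sketch below, which swaps whatever precedes a ‘genomics’ folder in each path for a given system root. Treating ‘genomics’ as the anchor folder and passing the root as a separate argument are assumptions for this example; the rule used by _clean_output_paths is not spelled out here.

```python
import re

# Match everything before the first '/genomics' path component.
_ROOT_PATTERN = re.compile(r"^.*?(?=/genomics(?:/|$))")


def clean_output_path(path, system_root):
    """Replace the part of `path` before '/genomics' with `system_root`."""
    root = system_root.rstrip("/")
    return _ROOT_PATTERN.sub(lambda _match: root, path, count=1)


def clean_output_paths(outputs, system_root):
    """Apply the replacement to every path in a list of per-sample dicts."""
    return [
        {tag: clean_output_path(path, system_root) for tag, path in sample.items()}
        for sample in outputs
    ]
```

Paths that contain no ‘genomics’ component are returned unchanged.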

_get_outputs()[source]

Collect all output files for the workflow batch, grouped by sample.

Returns

A list of dicts, one for each sample in the workflow batch, where key-value pairs in the dict describe the tag/label and path to each output file for the sample.

check_outputs()[source]

Check whether all expected output files are present for each sample in the batch.

Return type

dict

Returns

A dict, where for each sample, output files are flagged as ok, missing, or empty.

check_project_outputs(project_id)[source]

Check whether all expected output files are present for each sample in the batch that is part of the indicated project.

Return type

dict

Returns

A dict, where for each sample, output files are flagged as ok, missing, or empty.
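The ok/missing/empty flags can be derived from simple file system checks, as in this sketch. It takes the list-of-dicts structure described for _get_outputs() as input; keying the result by file path is an assumption, since the exact layout of the returned dict is not documented here.

```python
import os


def check_outputs(outputs):
    """Flag each output file as 'ok', 'missing', or 'empty'.

    `outputs` is a list of dicts (one per sample) mapping an output
    tag/label to a file path.
    """
    status = {}
    for sample_outputs in outputs:
        for path in sample_outputs.values():
            if not os.path.exists(path):
                flag = "missing"
            elif os.path.getsize(path) == 0:
                flag = "empty"
            else:
                flag = "ok"
            status[path] = flag
    return status
```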


submission package

Prepares data for batch submission through Globus Galaxy, typically starting from unaligned samples (libraries) from a flowcell run. The submission.batchcreate and submission.batchparameterize modules handle most of the work. The former takes a list of sample paths (or folders containing sample paths) and a workflow template file, and controls the preparation of a batch submit file as well as target folders for batch outputs. The latter sets individual parameter values (mostly input and output file paths) for each sample, which the BatchCreator class then uses to create and write the overall submission instructions. The submission.flowcellsubmit module provides a wrapper around batchcreate, allowing a user to select workflows and generate batch submissions for multiple unaligned projects from a flowcell run.

flowcellsubmit module

class bripipetools.submission.flowcellsubmit.FlowcellSubmissionBuilder(path, endpoint, db, workflow_dir=None, all_workflows=True)[source]

Prepares workflow batch submissions for all unaligned projects from a flowcell run.

_assign_workflows()[source]
_get_batch_tags(paths)[source]
_get_project_paths()[source]
_init_annotator()[source]
get_workflow_options(optimized_only=True)[source]
run(sort=False, num_samples=None)[source]

samplesubmit module

class bripipetools.submission.samplesubmit.SampleSubmissionBuilder(manifest, out_dir, endpoint, workflow_dir=None, all_workflows=True, tag=None)[source]

Prepares workflow batch submissions for a list of sample paths or folders of sample paths.

_assign_workflow()[source]
_read_paths()[source]
get_workflow_options(optimized_only=True)[source]
run()[source]

batchcreate module

class bripipetools.submission.batchcreate.BatchCreator(paths, workflow_template, endpoint, base_dir, submit_dir=None, group_tag=None, subgroup_tags=None, sort=False, num_samples=None, build='GRCh38.77', stranded=False)[source]

Given a list of sample paths or folders of sample paths as well as the path to a workflow template, creates a batch submit file for the input samples.

Parameters
  • paths (list) – List of paths to sample folders, where each folder contains one or more lane-specific FASTQ files (e.g., ‘<path-to-sample-folder>/sample_L001_R1.fastq.gz’); list can alternatively include one or more paths to folders that contain sample folders (e.g., a project folder).

  • workflow_template (str) – Path to workflow template file, exported from Globus Genomics for API batch submission.

  • endpoint (str) – Globus endpoint where input files are accessed and output files will be sent (e.g., ‘benaroyaresearch#BRIGridFTP’).

  • base_dir (str) – Path to folder where outputs will be stored; outputs will be grouped into one or more ‘Project_<label>Processed’ subfolder(s) in the base_dir.

  • submit_dir (str) – Name of folder where batch submit file will be saved. Folder will be created under base_dir. Defaults to ‘globus_batch_submission’.

  • group_tag (str) – String indicating overall group identifier for workflow batches (e.g., a flowcell ID).

  • subgroup_tags (list) – List of strings indicating subgroup identifiers (e.g., project labels from a flowcell run).

  • sort (bool) – Flag indicating whether or not to sort samples from smallest to largest (based on total size of raw data files) before submitting; most useful when also restricting number of samples.

  • num_samples (int) – Number of samples to submit from each folder, if input paths are folders of sample folders.

  • build (str) – ID string of reference genome build to be used for processing current set of samples.

_build_batch_name()[source]

Construct unique batch name indicating date, workflow, and build, as well as any group or subgroup identifier tags.

_check_input_type()[source]

Inspect list of input paths and determine whether they represent sample paths or folders of sample paths.

_get_input_params()[source]

For each input folder or for the full list of sample paths, create and map values (e.g., file paths) to each parameter in the workflow template. Return the combined set of sample parameter values across all samples or folders.

_get_sample_paths(folder)[source]

Return the list of sample paths for an individual folder. Optionally, sort and subset sample paths.

_prep_target_dir(folder=None)[source]

Create processed output folder for an individual input folder or for the full set of samples.

create_batch()[source]

Create batch name, prepare output folders, parameterize samples, and write the workflow batch submit file.

Return type

str

Returns

Path to the batch submit file.
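The interaction between the sort and num_samples parameters (sort samples from smallest to largest by raw data size, then keep the first few) can be sketched as follows. The helper names are hypothetical, and the sketch measures only files directly inside each sample folder.

```python
import os


def folder_size(sample_path):
    """Total size in bytes of the files directly inside a sample folder."""
    return sum(
        os.path.getsize(os.path.join(sample_path, name))
        for name in os.listdir(sample_path)
        if os.path.isfile(os.path.join(sample_path, name))
    )


def select_samples(sample_paths, sort=False, num_samples=None):
    """Optionally sort samples by size, then keep the first `num_samples`."""
    paths = list(sample_paths)
    if sort:
        paths.sort(key=folder_size)
    # Slicing with None keeps every sample.
    return paths[:num_samples]
```

Combining the two options yields the smallest samples in a folder, which is handy for a quick test submission.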

batchparameterize module

class bripipetools.submission.batchparameterize.BatchParameterizer(sample_paths, parameters, endpoint, target_dir, build='GRCh38.77', stranded=False)[source]

Defines workflow batch parameters for a list of input samples, given a list of parsed parameters for a Galaxy workflow.

Parameters
  • sample_paths (list) – List of paths to sample folders, where each folder contains one or more lane-specific FASTQ file(s).

  • parameters (list) – List of workflow parameters, parsed from a workflow template file, where each parameter is represented by a dict with fields tag, type, and name.

  • target_dir (str) – Path to folder where outputs are to be saved. Subfolders will be created within the target_dir based on output type.

  • endpoint (str) – Globus endpoint where input files are accessed and output files will be sent (e.g., ‘benaroyaresearch#BRIGridFTP’).

  • build (str) – ID string of reference genome build to be used for processing current set of samples.

_build_output_path(sample_name, parameter)[source]

Construct the full path of the current output file, formatted with the sample name and source/type-specific file label (as well as the appropriate extension).

_build_reference_path(parameter)[source]

Given a parameter for an input annotation dataset stored in a library on Globus Galaxy, return the path to the dataset based on the current build and annotation type.

_build_sample_parameters(sample_path)[source]

For a given input sample folder, create and set all parameter values for input paths, output paths, and other options.

_get_lane_fastq(sample_path, lane, read_number='R1')[source]

Retrieve the path for the FASTQ file from the specified lane within the sample folder. If no file exists, create and return the path of an empty FASTQ file.
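The lookup-with-fallback behavior described above could be sketched like this. The glob pattern follows the ‘sample_L001_R1.fastq.gz’ example given for BatchCreator; the name of the empty placeholder file is an assumption, and the placeholder is a zero-byte file rather than a valid gzip archive.

```python
import glob
import os


def get_lane_fastq(sample_path, lane, read_number="R1"):
    """Return the FASTQ path for a lane, creating an empty file if absent."""
    pattern = os.path.join(sample_path, f"*_L00{lane}_{read_number}*.fastq.gz")
    matches = sorted(glob.glob(pattern))
    if matches:
        return matches[0]
    # Hypothetical placeholder name; the real naming is not documented here.
    empty_path = os.path.join(
        sample_path, f"empty_L00{lane}_{read_number}.fastq.gz"
    )
    open(empty_path, "a").close()
    return empty_path
```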

_get_lane_order()[source]

Return the list of lane numbers (1-8) based on the order in which they appear for input FASTQs in the parameter list.

_prep_output_dir(output_type)[source]

Create a subfolder in the target_dir to store outputs of the specified type, return folder path.

_set_option_value(parameter)[source]
parameterize()[source]

Set all parameter values for the current workflow and input samples and return as list of sample parameters.

Return type

list

Returns

List of lists, where the original input list of parameter dicts has been replicated for each sample and updated to include values specific for that sample.
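The replicate-and-fill pattern behind the returned list of lists can be illustrated with a short sketch. The tag, type, and name fields come from the parameter description above, but the specific type values (‘sample’, ‘output’) and the way values are built from them are assumptions made for this example.

```python
import copy
import os


def parameterize(sample_paths, parameters, target_dir):
    """Copy the parameter list for each sample and fill in sample values."""
    batch = []
    for sample_path in sample_paths:
        sample_name = os.path.basename(os.path.normpath(sample_path))
        # Deep copy so each sample gets independent parameter dicts.
        sample_params = copy.deepcopy(parameters)
        for param in sample_params:
            if param["type"] == "sample":  # hypothetical type value
                param["value"] = sample_name
            elif param["type"] == "output":  # hypothetical type value
                param["value"] = os.path.join(
                    target_dir, f"{sample_name}_{param['name']}"
                )
            else:
                param["value"] = None
        batch.append(sample_params)
    return batch
```

The deep copy is the important detail: without it, every sample would share the same dicts and later samples would overwrite the values of earlier ones.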