bripipetools application packages¶
Overview¶
Application-level packages are those exposed to the user through wrapper scripts and the command line. They are used to perform common, high-level tasks related to pipeline operations and data. Packages are listed roughly in order of dependency hierarchy (i.e., packages listed first depend on subsequently listed packages).
Note
Intended for developers!
The documentation below is effectively a dump of all high-level packages, modules, classes, and methods that are used to run bripipetools. This amount of detail shouldn’t be needed for most users, but provides a starting point for those looking to understand or modify the code.
Package details¶
dbification package¶
Manages the collection and annotation of data (e.g., generated by the
Genomics Core or produced through bioinformatics processing) for import
into GenLIMS. Modules are designed to handle the set of data
associated with a particular “step” (e.g., a flowcell sequencing run or
bioinformatics processing of a batch of samples). The dbify.control
module inspects an input path and deploys the appropriate importer
class.
control submodule¶
Parse arguments to determine and select appropriate importer class.
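As a rough illustration of the dispatch described above, the sketch below picks an importer class based on what the input path looks like. The matching rules, the stand-in classes, and the `select_importer` name are assumptions made for this example; the actual logic in the control module may differ.

```python
import re


class FlowcellRunImporter:
    """Stand-in for the real importer; it only stores its arguments."""

    def __init__(self, path, db, run_opts):
        self.path, self.db, self.run_opts = path, db, run_opts


class WorkflowBatchImporter(FlowcellRunImporter):
    """Stand-in for the real workflow batch importer."""


# Hypothetical rules: a path ending in an Illumina-style run ID is treated as
# a flowcell run; a path ending in '.txt' is treated as a workflow batch file.
IMPORTER_RULES = [
    (re.compile(r"\d{6}_[A-Z0-9]+_\d+_[A-Z0-9]+/?$"), FlowcellRunImporter),
    (re.compile(r"\.txt$"), WorkflowBatchImporter),
]


def select_importer(path):
    """Return the first importer class whose pattern matches the path."""
    for pattern, importer_class in IMPORTER_RULES:
        if pattern.search(path):
            return importer_class
    raise ValueError("no importer available for path: {}".format(path))
```

A caller would then instantiate the selected class with the same `(path, db, run_opts)` arguments that both importers accept.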
flowcellrun module¶
Class for importing data from a sequencing run into GenLIMS and the Research DB as new objects.
-
class
bripipetools.dbification.flowcellrun.FlowcellRunImporter(path, db, run_opts)[source]¶ Collects FlowcellRun and SequencedLibrary objects from a sequencing run, converts to documents, inserts into database.
-
_insert_genomicsFlowcellRun(collection='all')[source]¶ Convert FlowcellRun object and insert into the research database.
-
_insert_genomicsLibrarymetrics()[source]¶ Convert Library Results objects and insert into Research database.
-
_insert_genomicsSequencedlibraries()[source]¶ Convert SequencedLibrary objects and insert into Research database.
-
_insert_genomicsWorkflowbatches()[source]¶ Collect WorkflowBatch objects and insert them into database.
workflowbatch module¶
Class for importing data from a processing batch into databases as new objects. Supports both research database (“genomics…”) and GenLIMS collections.
-
class
bripipetools.dbification.workflowbatch.WorkflowBatchImporter(path, db, run_opts)[source]¶ Collects WorkflowBatch and ProcessedLibrary objects from a processing batch, converts to documents, inserts into database.
postprocessing package¶
Covers a range of operations performed on outputs and other files
produced through bioinformatics processing of a batch of samples. For
example, the postprocess.stitching module parses data from
individual files of similar type and combines data into a single table
for all samples in a project. By extension, postprocess.compiling
will take these stitched tables of different types and combine them
into a new, large table for the project. On the other hand, the
postprocess.cleanup module deals with fixing the way files
are named and organized on the disk.
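The difference between stitching and compiling can be sketched with plain Python data structures. This is a minimal illustration, not the package's implementation: the function names, the `libId` key column, and the metric names are assumptions, and parsed file contents are represented as simple dicts.

```python
def stitch(parsed_by_sample, key="libId"):
    """Combine parsed data of one output type into a table for all samples.

    Input maps sample ID -> dict of parsed values; output is a list of rows
    with a header row first. Missing values are left as empty strings.
    """
    fields = sorted({field for data in parsed_by_sample.values() for field in data})
    rows = [
        [sample] + [data.get(field, "") for field in fields]
        for sample, data in sorted(parsed_by_sample.items())
    ]
    return [[key] + fields] + rows


def compile_tables(tables):
    """Join stitched tables of different types on their first (key) column."""
    header = [tables[0][0][0]]
    merged = {}
    for table in tables:
        fields = table[0][1:]
        header.extend(fields)
        for row in table[1:]:
            merged.setdefault(row[0], {}).update(zip(fields, row[1:]))
    return [header] + [
        [sample] + [values.get(field, "") for field in header[1:]]
        for sample, values in sorted(merged.items())
    ]
```

Note that this sketch assumes column names are unique across output types; a real implementation would need to handle name collisions and write the result to CSV.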
stitching module¶
Combine parsed data from a set of batch processing output files and write to a single CSV file.
-
class
bripipetools.postprocessing.stitching.OutputStitcher(path, output_type=None, outputs=None)[source]¶ Given a path to an output folder or list of files, combine parsed data from files and write CSV.
-
_build_overrepresented_seq_table()[source]¶ Parse and combine overrepresented sequences tables from FastQC files.
-
_get_parser(output_type, output_source)[source]¶ Return the appropriate parser for the current output file.
compiling module¶
Compile combined/stitched ‘summary’ outputs of different types from batch processing and write to a single CSV file.
-
class
bripipetools.postprocessing.compiling.OutputCompiler(paths)[source]¶ Reads combined output tables from list of file paths and compiles into single table, stored in a file at the project level.
cleanup module¶
Clean up and organize outputs from a workflow processing batch.
monitoring package¶
Contains tools for monitoring the status of pipeline steps. Classes and methods here are designed to inspect files on the server and report on various indicators of state (e.g., file existence, access, completion, size, etc.).
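The kinds of indicators mentioned above (existence, access, size) can be collected with the standard library. The following is a minimal sketch; the `file_status` helper and the keys it returns are hypothetical and are not part of bripipetools.

```python
import os


def file_status(path):
    """Report simple state indicators for a single output file."""
    exists = os.path.isfile(path)
    return {
        "exists": exists,
        # Only check read access and size when the file is actually there.
        "readable": exists and os.access(path, os.R_OK),
        "size": os.path.getsize(path) if exists else 0,
    }
```

A monitor could apply a helper like this to every expected output of a batch and flag samples whose files are missing or empty.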
workflowbatches module¶
Monitor the outputs of a workflow processing batch.
-
class
bripipetools.monitoring.workflowbatches.WorkflowBatchMonitor(workflowbatch_file, pipeline_root)[source]¶ Controls operations (identification, annotation, etc.) for the set of outputs generated by a batch processing job in Globus Galaxy.
- Parameters
workflowbatch_file (str) – File path of the submitted workflow batch file.
pipeline_root (str) – Path to the root directory for processing.
-
_clean_output_paths(outputs)[source]¶ Replaces ambiguous file path roots with current system root.
- Parameters
outputs (list) – A list of dicts, one for each sample in the workflow batch, where key-value pairs in the dict describe the tag/label and path to each output file for the sample.
- Return type
list
- Returns
A list of dicts, with output file paths updated to use the current system root for the ‘genomics’ server.
-
_get_outputs()[source]¶ Collect all output files for the workflow batch, grouped by sample.
- Returns
A list of dicts, one for each sample in the workflow batch, where key-value pairs in the dict describe the tag/label and path to each output file for the sample.
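To make the path clean-up concrete, here is a minimal sketch of replacing an ambiguous path root with the current system root. It assumes that every output path contains a 'genomics' folder and that `pipeline_root` is the local path of that folder; both the helper names and this convention are assumptions for illustration only.

```python
def swap_root(path, pipeline_root, marker="genomics"):
    """Replace everything up to and including the marker folder."""
    parts = path.split("/")
    if marker not in parts:
        return path  # nothing recognizable to replace
    tail = parts[parts.index(marker) + 1:]
    return "/".join([pipeline_root.rstrip("/")] + tail)


def clean_output_paths(outputs, pipeline_root):
    """Apply swap_root to each tag -> path pair, one dict per sample."""
    return [
        {tag: swap_root(path, pipeline_root) for tag, path in sample.items()}
        for sample in outputs
    ]
```

The input and output keep the same shape (a list of dicts, one per sample), so the result can be used wherever the original outputs were expected.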
submission package¶
Prepares data for batch submission through Globus Galaxy, typically
starting from unaligned samples (libraries) from a flowcell run. The
submission.batchcreate and submission.batchparameterize
modules handle most of the work: the first takes a list of sample
paths (or folders containing sample paths) and a workflow template
file and controls the preparation of a batch submit file as well as
target folders for batch outputs; the latter sets individual
parameter values (mostly input and output file paths) for each sample,
which are then used by the BatchCreator class to create and write
the overall submission instructions. The submission.flowcellsubmit
module provides a wrapper around batchcreate, allowing a user to
select workflows and generate batch submissions for multiple unaligned
projects from a flowcell run.
flowcellsubmit module¶
samplesubmit module¶
batchcreate module¶
-
class
bripipetools.submission.batchcreate.BatchCreator(paths, workflow_template, endpoint, base_dir, submit_dir=None, group_tag=None, subgroup_tags=None, sort=False, num_samples=None, build='GRCh38.77', stranded=False)[source]¶ Given a list of sample paths or folders of sample paths as well as the path to a workflow template, creates a batch submit file for the input samples.
- Parameters
paths (list) – List of paths to sample folders, where each folder contains one or more lane-specific FASTQ file(s) (e.g., ‘<path-to-sample-folder>/sample_L001_R1.fastq.gz’); list can alternatively include one or more paths to folders that contain sample folders (e.g., a project folder).
workflow_template (str) – Path to workflow template file, exported from Globus Genomics for API batch submission.
endpoint (str) – Globus endpoint where input files are accessed and output files will be sent (e.g., ‘benaroyaresearch#BRIGridFTP’).
base_dir (str) – Path to folder where outputs will be stored; outputs will be grouped into one or more ‘Project_<label>Processed’ subfolder(s) in the base_dir.
submit_dir (str) – Name of folder where batch submit file will be saved. Folder will be created under base_dir. Defaults to ‘globus_batch_submission’.
group_tag (str) – String indicating overall group identifier for workflow batches (e.g., a flowcell ID).
subgroup_tags (list) – List of strings indicating subgroup identifiers (e.g., project labels from a flowcell run).
sort (bool) – Flag indicating whether or not to sort samples from smallest to largest (based on total size of raw data files) before submitting; most useful when also restricting number of samples.
num_samples (int) – Number of samples to submit from each folder, if input paths are folders of sample folders.
build (str) – ID string of reference genome build to be used for processing current set of samples.
-
_build_batch_name()[source]¶ Construct unique batch name indicating date, workflow, and build, as well as any group or subgroup identifier tags.
-
_check_input_type()[source]¶ Inspect list of input paths and determine whether they represent sample paths or folders of sample paths.
-
_get_input_params()[source]¶ For each input folder or for the full list of sample paths, create and map values (e.g., file paths) to each parameter in the workflow template. Return the combined set of sample parameter values across all samples or folders.
-
_get_sample_paths(folder)[source]¶ Return the list of sample paths for an individual folder. Optionally, sort and subset sample paths.
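The interaction between the sort and num_samples options can be sketched as follows. This is an illustrative approximation rather than the BatchCreator code: `folder_size`, `select_samples`, and the `size_of` argument are hypothetical names introduced here.

```python
import os


def folder_size(folder):
    """Total size in bytes of all files under a sample folder."""
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, names in os.walk(folder)
        for name in names
    )


def select_samples(sample_paths, sort=False, num_samples=None, size_of=folder_size):
    """Optionally sort samples from smallest to largest, then subset."""
    paths = list(sample_paths)
    if sort:
        paths.sort(key=size_of)
    if num_samples is not None:
        paths = paths[:num_samples]
    return paths
```

Sorting before subsetting is what makes the combination useful: with both options set, a test submission picks the smallest samples in each folder rather than an arbitrary few.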
batchparameterize module¶
-
class
bripipetools.submission.batchparameterize.BatchParameterizer(sample_paths, parameters, endpoint, target_dir, build='GRCh38.77', stranded=False)[source]¶ Defines workflow batch parameters for a list of input samples, given a list of parsed parameters for a Galaxy workflow.
- Parameters
sample_paths (list) – List of paths to sample folders, where each folder contains one or more lane-specific FASTQ file(s).
parameters (list) – List of workflow parameters, parsed from a workflow template file, where each parameter is represented by a dict with fields tag, type, and name.
target_dir (str) – Path to folder where outputs are to be saved. Subfolders will be created within the target_dir based on output type.
endpoint (str) – Globus endpoint where input files are accessed and output files will be sent (e.g., ‘benaroyaresearch#BRIGridFTP’).
build (str) – ID string of reference genome build to be used for processing current set of samples.
-
_build_output_path(sample_name, parameter)[source]¶ Construct the full path of the current output file, formatted with the sample name and source/type-specific file label (as well as the appropriate extension).
-
_build_reference_path(parameter)[source]¶ Given a parameter for an input annotation dataset stored in a library on Globus Galaxy, return the path to the dataset based on the current build and annotation type.
-
_build_sample_parameters(sample_path)[source]¶ For a given input sample folder, create and set all parameter values for input paths, output paths, and other options.
-
_get_lane_fastq(sample_path, lane, read_number='R1')[source]¶ Retrieve the path for the FASTQ file from the specified lane within the sample folder. If no file exists, create and return the path of an empty FASTQ file.
-
_get_lane_order()[source]¶ Return the list of lane numbers (1-8) based on the order in which they appear for input FASTQs in the parameter list.
-
_prep_output_dir(output_type)[source]¶ Create a subfolder in the target_dir to store outputs of the specified type and return the folder path.
-
parameterize()[source]¶ Set all parameter values for the current workflow and input samples and return as list of sample parameters.
- Return type
list
- Returns
List of lists, where the original input list of parameter dicts has been replicated for each sample and updated to include values specific for that sample.
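The "list of lists" return value can be illustrated with a small sketch. The `tag`, `type`, and `name` fields come from the parameter description above; the `value` field, the 'sample' and 'output' type labels, and the output file naming are assumptions made for this example and may not match the real parameterizer.

```python
import copy
import posixpath


def parameterize(sample_paths, parameters, target_dir):
    """Replicate the parameter dicts for each sample and fill in values."""
    batch = []
    for sample_path in sample_paths:
        sample_name = posixpath.basename(sample_path.rstrip("/"))
        # Deep copy so that each sample gets its own independent dicts.
        sample_params = copy.deepcopy(parameters)
        for param in sample_params:
            if param["type"] == "sample":  # hypothetical type label
                param["value"] = sample_name
            elif param["type"] == "output":  # hypothetical type label
                filename = "{}_{}.txt".format(sample_name, param["name"])
                param["value"] = posixpath.join(target_dir, filename)
        batch.append(sample_params)
    return batch
```

Because the input list is copied rather than modified, the same parsed template parameters can be reused for every sample in the batch.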