bripipetools core packages
Overview
“Core” packages are where most of the heavy lifting happens, and are called by application-level modules to perform various pipeline tasks. Packages are listed roughly in order of dependency hierarchy (i.e., packages listed first depend on subsequently listed packages).
Note
Intended for developers!
The documentation below is effectively a dump of all low-level packages, modules, classes, and methods that are used to run bripipetools. This amount of detail shouldn’t be needed for most users, but provides a starting point for those looking to understand or modify the code.
Package details
annotation package
Includes critical functionality for identifying, locating, and describing data and results at various points (e.g., data generation, computational processing) in the bioinformatics pipeline. Each “annotator” class, contained in its respective module, is responsible for collecting and/or updating information for a specific object in the GenLIMS database. When possible, details for an object are retrieved directly from the database; for new objects or objects with missing fields, information is compiled, parsed, and formatted (as needed) from files on the server.
sequencedlibraries module
Classify / provide details for sequenced libraries (outputs of a flowcell sequencing run) and the associated raw data.
-
class bripipetools.annotation.sequencedlibs.SequencedLibraryAnnotator(path, library, project, run_id, db)
Identifies, stores, and updates information about a sequenced library.
flowcellruns module
Classify / provide details for objects generated from an Illumina sequencing run performed by the BRI Genomics Core.
-
class bripipetools.annotation.flowcellruns.FlowcellRunAnnotator(run_id, pipeline_root, db)
Identifies, stores, and updates information about a flowcell run.
-
_init_flowcellrun()
Try to retrieve data for the flowcell run from GenLIMS; if unsuccessful, create a new FlowcellRun object.
-
get_libraries(project=None)
Collect list of libraries for flowcell run from one or all projects.
processedlibraries module
-
class bripipetools.annotation.processedlibs.ProcessedLibraryAnnotator(workflowbatch_id, params, db)
Identifies, stores, and updates information about a processed library.
-
_append_processed_data()
Add details and outputs for the current workflow batch to the processed data array field for the processed library.
-
_init_processedlibrary()
Try to retrieve data for the processed library from GenLIMS; if unsuccessful, create a new ProcessedLibrary object.
workflowbatches module
Classify / provide details for objects generated from a Globus Galaxy workflow processing batch performed by the BRI Bioinformatics Core.
-
class bripipetools.annotation.workflowbatches.WorkflowBatchAnnotator(workflowbatch_file, pipeline_root, db, run_opts)
Identifies, stores, and updates information about a workflow batch.
-
_check_sex(processedlibrary)
Retrieve reported sex for sample and compare to predicted sex of processed library.
-
_init_workflowbatch()
Try to retrieve data for the workflow batch from GenLIMS; if unsuccessful, create a new GalaxyWorkflowBatch object.
-
get_processed_libraries(project=None, qc=False)
Collect processed library objects for workflow batch.
qc package
Contains classes and methods for performing post-hoc quality control operations on raw or processed genomics data. Modules are organized according to the specific QC step performed. Unlike the routine quality inspection metrics and information that standard bioinformatics tools provide through processing workflows, modules here are aimed more at identifying problems with sample handling or data generation. As such, outputs from these submodules are designated as a special type, ‘validation’, to distinguish them from the QC, metrics, counts, and other output types generated through processing.
sexcheck module
Class and methods to perform routine sex check on all processed libraries.
-
class bripipetools.qc.sexcheck.SexChecker(processedlibrary, reference, workflowbatch_id, pipeline_root, db, run_opts)
Reads gene counts for a processed library, maps genes to X and Y chromosomes, computes the ratio of Y to X counts, and gives a predicted sex based on a pre-defined rule.
-
_compute_x_y_data()
Collect and store X and Y gene/count data as well as total counts for the current processed library.
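The Y-to-X ratio check described for SexChecker can be illustrated with a minimal sketch. The function name `predict_sex`, the `cutoff` value, and the result keys are hypothetical and chosen only for illustration; the actual pre-defined rule used by bripipetools is not documented here.

```python
def predict_sex(gene_counts, x_genes, y_genes, cutoff=0.1):
    """Predict sex from the ratio of Y-linked to X-linked gene counts.

    ASSUMPTION: `cutoff` and the 'male'/'female' labels are illustrative
    only; they are not the rule implemented by SexChecker.
    """
    # Sum counts separately for genes mapped to each sex chromosome.
    x_total = sum(n for gene, n in gene_counts.items() if gene in x_genes)
    y_total = sum(n for gene, n in gene_counts.items() if gene in y_genes)

    # Without any X counts the ratio is undefined, so make no prediction.
    if x_total == 0:
        return {"x_counts": x_total, "y_counts": y_total,
                "y_x_ratio": None, "predicted_sex": None}

    ratio = y_total / x_total
    return {"x_counts": x_total, "y_counts": y_total,
            "y_x_ratio": ratio,
            "predicted_sex": "male" if ratio > cutoff else "female"}


# Toy counts: two X-linked genes, one Y-linked gene, one autosomal gene.
counts = {"XIST": 900, "KDM5C": 100, "KDM5D": 300, "ACTB": 5000}
result = predict_sex(counts, x_genes={"XIST", "KDM5C"}, y_genes={"KDM5D"})
```

Autosomal genes such as ACTB are ignored because only genes mapped to X or Y contribute to the ratio.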
sexverify module
Class and methods to perform routine sex check on all processed libraries.
sexpredict module
Class and methods to perform routine sex check on all processed libraries.
database package
Contains methods for interacting with - connecting to, retrieving data
from, and inserting data into - BRI databases (GenLIMS and ResDB) at
a low level. Under the hood, much of the functionality in this package
relies on the pymongo client library for MongoDB. The database.operations
module provides wrapper functions for getting/putting objects
from/to commonly used database collections, while database.mapping
helps to construct Python model class objects from database
documents. Methods in the database.connection module manage the
database connection, depending on environment and configurations.
connection module
Connect to a BRI Mongo database.
operations module
Basic operations for BRI Mongo databases.
-
bripipetools.database.operations.create_workflowbatch_id(db, prefix, date)
Check the ‘workflowbatches’ collection and construct ID with lowest available batch number (i.e., ‘<prefix>_<date>_<number>’).
- Parameters
db (type[pymongo.database.Database]) – database object for current MongoDB connection
prefix (str) – base string for workflow batch ID, based on workflow batch type (e.g., ‘globusgalaxy’ for a Globus Galaxy workflow)
date (type[datetime.datetime]) – date on which workflow batch was run
- Return type
str
- Returns
a unique ID for the workflow batch, with the prefix and date combination appended with the highest available integer
-
bripipetools.database.operations.find_objects(collection)
Return a decorator that retrieves objects from the specified collection, given a db connection and query.
- Parameters
collection (str) – string indicating the name of the collection
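A decorator factory of this kind might look roughly like the sketch below. This is not the bripipetools implementation: `FakeCollection` and `get_samples` are hypothetical names, and the in-memory collection stands in for a real pymongo collection so the example runs without a MongoDB server.

```python
import functools


class FakeCollection:
    """In-memory stand-in for a pymongo collection (exact-match queries only)."""

    def __init__(self, docs):
        self.docs = docs

    def find(self, query):
        return [doc for doc in self.docs
                if all(doc.get(key) == value for key, value in query.items())]


def find_objects(collection):
    """Return a decorator that looks up `query` in the named collection."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(db, query):
            # With pymongo, db[collection].find(query) returns a cursor;
            # list() materializes it. The wrapped function's own body is
            # not used in this sketch.
            return list(db[collection].find(query))
        return wrapper
    return decorator


@find_objects("genomicsSamples")
def get_samples(db, query):
    """Return documents from 'genomicsSamples' matching `query`."""


db = {"genomicsSamples": FakeCollection([
    {"_id": "lib1234", "type": "library"},
    {"_id": "lib1234_C6VG0ANXX", "type": "sequenced library"},
])}
libraries = get_samples(db, {"type": "library"})
```

Binding the collection name in the decorator keeps each `get_*` wrapper to a single declaration, which matches the one-function-per-collection layout of the operations module.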
-
bripipetools.database.operations.get_genomicsCounts(db, query)
Return list of documents from ‘genomicsCounts’ collection based on query.
-
bripipetools.database.operations.get_genomicsMetrics(db, query)
Return list of documents from ‘genomicsMetrics’ collection based on query.
-
bripipetools.database.operations.get_genomicsRuns(db, query)
Return list of documents from ‘genomicsRuns’ collection based on query.
-
bripipetools.database.operations.get_genomicsSamples(db, query)
Return list of documents from ‘genomicsSamples’ collection based on query.
-
bripipetools.database.operations.get_genomicsWorkflowbatches(db, query)
Return list of documents from ‘genomicsWorkflowbatches’ collection based on query.
-
bripipetools.database.operations.insert_objects(collection)
Return a decorator that inserts one or more objects into the specified collection; if an object already exists, update any individual fields that are not empty in the input object.
- Parameters
collection (str) – string indicating the name of the collection
-
bripipetools.database.operations.put_genomicsCounts(db, counts)
Insert each document in list into ‘genomicsCounts’ collection.
-
bripipetools.database.operations.put_genomicsMetrics(db, metrics)
Insert each document in list into ‘genomicsMetrics’ collection.
-
bripipetools.database.operations.put_genomicsRuns(db, runs)
Insert each document in list into ‘genomicsRuns’ collection.
-
bripipetools.database.operations.put_genomicsSamples(db, samples)
Insert each document in list into ‘genomicsSamples’ collection.
-
bripipetools.database.operations.put_genomicsWorkflowbatches(db, workflowbatches)
Insert each document in list into ‘genomicsWorkflowbatches’ collection.
-
bripipetools.database.operations.search_ancestors(db, sample_id, field)
Given an object in the ‘samples’ collection, specified by the input ID, iteratively walk through ancestors based on ‘parentId’ until a value is found for the requested field.
- Parameters
db (type[pymongo.database.Database]) – database object for current MongoDB connection
sample_id (str) – a unique ID for a sample in GenLIMS
field (str) – the field for which to search among ancestor samples
- Returns
value for field, if found
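The ancestor walk can be sketched without a database by using a plain dict keyed on sample ID in place of the ‘samples’ collection. The sample IDs, the `reportedSex` field, the choice to check the starting sample itself, and the `None` return for a missing value are all assumptions made for illustration.

```python
def search_ancestors(samples, sample_id, field):
    """Follow 'parentId' links until `field` has a value.

    ASSUMPTION: `samples` is a dict of documents keyed by ID, standing in
    for the database collection; returns None if no value is found.
    """
    seen = set()  # guards against accidental parentId cycles
    while sample_id is not None and sample_id not in seen:
        seen.add(sample_id)
        sample = samples.get(sample_id)
        if sample is None:
            return None
        value = sample.get(field)
        if value is not None:
            return value
        # Move one generation up the lineage.
        sample_id = sample.get("parentId")
    return None


samples = {
    "lib1234": {"parentId": "rna567"},
    "rna567": {"parentId": "pbmc89"},
    "pbmc89": {"reportedSex": "female"},
}
found = search_ancestors(samples, "lib1234", "reportedSex")
```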
mapping module
bripipetools mapping submodule: methods to map from Mongo documents to model classes.
-
bripipetools.database.mapping.get_model_class(doc)
Find the matching class for the document, based on its type.
- Parameters
doc (dict) – a dict representing a MongoDB document/object
- Return type
str
- Returns
a string representing the name of the matched class from the model module
-
bripipetools.database.mapping.map_keys(obj)
Convert keys in a dictionary (or nested dictionary) from camelCase to snake_case; ignore ‘_id’ keys.
- Parameters
obj (dict, list) – a dict or list of dicts with string keys to be converted
- Return type
dict, list
- Returns
a dict or list of dicts with string keys converted from camelCase to snake_case
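The recursive key conversion can be sketched as follows. The helper `_snake` and the sample document are hypothetical, and the single-regex conversion is a simplification rather than the library's own string utility.

```python
import re


def _snake(name):
    """Hypothetical helper: insert '_' before interior capitals, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def map_keys(obj):
    """Recursively convert dict keys from camelCase to snake_case, keeping '_id'."""
    if isinstance(obj, list):
        return [map_keys(item) for item in obj]
    if isinstance(obj, dict):
        return {
            (key if key == "_id" else _snake(key)): map_keys(value)
            for key, value in obj.items()
        }
    # Scalars (strings, numbers, None) are returned unchanged.
    return obj


doc = {"_id": "lib1234", "projectId": 14, "rawData": [{"laneId": "L001"}]}
mapped = map_keys(doc)
```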
model package
Establishes the underlying data model linking data from bioinformatics
processing pipelines to the GenLIMS/TG3 database. Python class
representations of database objects (documents) are defined in the
model.documents module. These classes include some basic
functionality, mostly related to setting/formatting attributes,
which are eventually fed back into the database as key-value pairs.
However, model classes are also the basic “currency” for several other
modules, where they are used to retrieve, modify, store, and return
data.
Depends on the util and parsing modules.
documents module
Classes representing documents in the GenLIMS database.
-
class bripipetools.model.documents.FlowcellRun(**kwargs)
GenLIMS object in the ‘runs’ collection of type ‘flowcell’.
-
_flowcell_path = None
-
property flowcell_path
Return root-agnostic path to flowcell data folder.
class bripipetools.model.documents.GalaxyWorkflowBatch(workflowbatch_file=None, **kwargs)
GenLIMS object in ‘workflow batches’ collection of type ‘Galaxy workflow’
- Parameters
workflowbatch_file (str) – path to file describing samples and parameters of Globus Galaxy workflow batch
-
class bripipetools.model.documents.GeneCounts(**kwargs)
Research Database object in ‘counts’ collection of type ‘gene counts’
-
property gene_counts
Return list of dictionaries with information about each library’s gene counts.
-
class bripipetools.model.documents.GenericRun(protocol_id=None, date=None, **kwargs)
GenLIMS object in the ‘runs’ collection
- Parameters
protocol_id (str) – unique ID of a protocol object in the GenLIMS database (in the ‘protocols’ collection)
date (str) – string indicating date of the run in ISO 8601 format
-
class bripipetools.model.documents.GenericSample(project_id=None, subproject_id=None, protocol_id=None, parent_id=None, **kwargs)
GenLIMS object in the ‘samples’ collection
- Parameters
project_id (int) – Genomics Core project number
subproject_id (int) – Genomics Core sub-project number
protocol_id (str) – unique ID of a protocol object in the GenLIMS database (in the ‘protocols’ collection)
parent_id (str) – unique ID of a sample object in the GenLIMS database from which the current sample was derived (in the ‘samples’ collection)
-
class bripipetools.model.documents.GenericWorkflow(**kwargs)
GenLIMS object in the ‘workflows’ collection
-
class bripipetools.model.documents.GenericWorkflowBatch(**kwargs)
GenLIMS object in the ‘workflow batches’ collection
-
class bripipetools.model.documents.GlobusGalaxyWorkflow(**kwargs)
GenLIMS object in ‘workflows’ collection of type ‘Globus Galaxy workflow’
-
class bripipetools.model.documents.Library(**kwargs)
GenLIMS object in ‘samples’ collection of type ‘library’
-
class bripipetools.model.documents.Metrics(**kwargs)
Research Database object in ‘metrics’ collection of type ‘metrics’
-
property metrics
Return list of dictionaries with information about each library’s metrics.
-
class bripipetools.model.documents.ProcessedLibrary(**kwargs)
GenLIMS object in ‘samples’ collection of type ‘processed library’
-
property processed_data
Return list of dictionaries with information about each set of data processing outputs (i.e., from workflow batches).
-
class bripipetools.model.documents.SequencedLibrary(run_id=None, **kwargs)
GenLIMS object in ‘samples’ collection of type ‘sequenced library’
- Parameters
run_id (str) – unique ID of a run object in the GenLIMS database
-
property raw_data
Return list of dictionaries with information about each raw data file (i.e., FASTQ) for a sequenced library.
-
class bripipetools.model.documents.TG3Object(_id=None, type=None, is_mapped=False)
Generic functions for objects in TG3 collections.
- Parameters
_id (str) – unique object identifier in the GenLIMS/TG3 Mongo database
type (str) – field indicating object type in a collection
is_mapped (bool) – flag indicating whether class instance was mapped from a database object (True) or created from scratch (False)
-
to_json()
Return object attributes as dictionary with keys formatted as camel case.
- Return type
dict
- Returns
a dict containing class instance attributes, with all field names converted from snake case to camel case
-
update_attrs(attr_map, force=False)
Given a dictionary of key-value pairs for attribute names with new values, update each attribute. Always update empty (‘None’) attributes and set any new attributes; update all modified attributes if force option is ‘True’.
- Parameters
attr_map (dict) – a dict with key-value pairs representing object attributes and values to which they should be set
force (bool) – force overwrite of object fields in database, if they already exist
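The update rule described for update_attrs (fill empty or new attributes always, overwrite existing values only when forced) can be sketched with a stand-in class. `Document` is a hypothetical name and this is not the TG3Object implementation.

```python
class Document:
    """Hypothetical stand-in for a model class with settable attributes."""

    def __init__(self, **attrs):
        for name, value in attrs.items():
            setattr(self, name, value)

    def update_attrs(self, attr_map, force=False):
        for name, new_value in attr_map.items():
            # Attributes that are missing or None are treated as empty.
            current = getattr(self, name, None)
            if current is None or force:
                setattr(self, name, new_value)


doc = Document(_id="lib1234", type=None, date="2015-06-15")

# Without force: 'type' is filled, 'project_id' is added, 'date' is kept.
doc.update_attrs({"type": "library", "date": "2016-01-01", "project_id": 14})
date_without_force = doc.date

# With force: the existing 'date' value is overwritten.
doc.update_attrs({"date": "2016-01-01"}, force=True)
```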
-
bripipetools.model.documents.convert_keys(obj)
Convert keys in a dictionary (or nested dictionary) from snake_case to camelCase; ignore ‘_id’ keys.
- Parameters
obj (dict, list) – a dict or list of dicts with string keys to be converted
- Return type
dict, list
- Returns
a dict or list of dicts with string keys converted from snake_case to camelCase
io package
Contains class representations of various file types produced through the generation or processing of genomics data. In particular, most of these classes provide methods for reading and parsing raw data from files and storing/returning these data in a more usable format, such as dictionaries or data frames. Each module contains the representation of a file generated by a particular tool or routine; some submodules may handle files from multiple methods within a tool (e.g., Picard). While not explicitly organized as such, modules adhere to a hierarchy based on the “type” of file, where current types include metrics, counts, QC, and validation.
workflow module
Class for reading and parsing Galaxy workflow files.
workflowbatch module
Classes for reading, parsing, and writing workflow batch submit files for Globus Galaxy.
-
class bripipetools.io.workflowbatch.WorkflowBatchFile(path, state='template')
A parser to map input sample names to expected output files based on a completed Globus Galaxy batch submit file.
- Parameters
path (str) – file path of batch submit file
state (str) – string indicating the current state of the batch submit file; either ‘template’ or ‘submit’ (if populated with project and sample information)
-
_locate_batch_name_line()
Identify batch file metadata line with placeholder for project name; return line number. Note: batch submissions can include multiple projects, so the ‘batch name’ label is more appropriate.
-
_locate_param_line()
Identify batch file header line with parameter names; return line number.
-
_locate_sample_start_line()
Identify batch file line where sample parameter info begins; return line number. Note: should immediately follow parameter header line.
-
_locate_workflow_name_line()
Identify batch file metadata line with name of workflow; return line number.
-
get_params()
Return the parameters defined for the current workflow.
- Return type
list
- Returns
a list of tuples with number (index) and dict with details for each parameter
-
get_sample_params(sample_line)
Collect the parameter details for each input sample; store the index and input for each parameter.
- Parameters
sample_line (str) – raw, tab-delimited line of text from workflow batch submit file describing the parameters for a single sample
- Return type
list
- Returns
a list of dicts, one for each sample
picardmetrics module
Class for reading and parsing Picard metrics files.
-
class bripipetools.io.picardmetrics.PicardMetricsFile(path)
Parser to read tables of metrics generated by one of several Picard tools, typically stored in an HTML file, and return as a parsed and formatted dictionary.
htseqmetrics module
Class for reading and parsing htseq metrics files.
tophatstats module
Class for reading and parsing Tophat Stats metrics files.
fastqc module
Class for reading and parsing FastQC report files.
-
class bripipetools.io.fastqc.FastQCFile(path)
Parser to read QC data from a FastQC report, stored in a tab-delimited text file.
-
_get_section_status(section_name, section_info)
Return a tuple with the section name and status.
-
_locate_sections()
Return a dict with section names as keys and tuples of start/end line numbers as values.
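As an illustration of how sections might be located, the sketch below relies on the layout of FastQC's tab-delimited data file, where each module starts with a `>>Name<TAB>status` line and ends with `>>END_MODULE`. The function names and signatures are hypothetical and differ from the private methods listed above; zero-based line numbers are an assumption.

```python
def locate_sections(lines):
    """Map each section name to a (start, end) tuple of zero-based line numbers."""
    sections = {}
    name, start = None, None
    for num, line in enumerate(lines):
        if line.startswith(">>END_MODULE"):
            if name is not None:
                sections[name] = (start, num)
                name = None
        elif line.startswith(">>"):
            # Header line: '>>Section name<TAB>status'
            name = line[2:].split("\t")[0]
            start = num
    return sections


def section_status(lines, section_name, section_info):
    """Return (section name, status) read from the section's header line."""
    header = lines[section_info[0]].rstrip("\n")
    return section_name, header.split("\t")[1]


report = [
    "##FastQC\t0.11.3",
    ">>Basic Statistics\tpass",
    "#Measure\tValue",
    "Filename\tsample.fastq.gz",
    ">>END_MODULE",
    ">>Per base sequence quality\tfail",
    "#Base\tMean",
    "1\t32.1",
    ">>END_MODULE",
]
sections = locate_sections(report)
status = section_status(report, "Per base sequence quality",
                        sections["Per base sequence quality"])
```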
htseqcounts module
Class for reading and parsing htseq files.
parsing package
Slightly more specialized than methods in the util.strings module,
provides functions for parsing and extracting information from strings
that follow some expected nomenclature. The primary examples of this
information are IDs, names, labels, and other metadata for files and
objects generated either by Illumina technology or the BRI Genomics
Core (via GenLIMS). The parsing.processing module is also designed
to handle specialized strings and labels related to processing
workflows in Globus Galaxy.
Depends on the util module.
gencore module
-
bripipetools.parsing.gencore.get_library_id(string)
Return library ID matched in input string.
- Parameters
string (str) – any string that might contain a library ID of the format ‘lib1234’
- Return type
str
- Returns
the matching substring representing the library ID or an empty string (‘’) if no match found
-
bripipetools.parsing.gencore.get_project_label(string)
Return a Genomics Core project label matched in input string.
- Parameters
string (str) – any string
- Return type
str
- Returns
Genomics Core project label (e.g., P00-0) substring or empty string, if no match found
-
bripipetools.parsing.gencore.get_sample_id(string)
More general than library ID; returns either the library ID (if present) or any word that starts with ‘Sample_’, ends in a number, and precedes any non-alphanumeric characters.
- Parameters
string (str) – any string that might contain a form of sample ID
- Return type
str
- Returns
the matching substring representing the sample ID or an empty string (‘’) if no match found
-
bripipetools.parsing.gencore.parse_batch_file_path(batchfile_path)
Return ‘genomics’ root and batch file name based on directory path.
-
bripipetools.parsing.gencore.parse_flowcell_path(flowcell_path)
Return ‘genomics’ root and run ID based on directory path.
-
bripipetools.parsing.gencore.parse_project_label(project_label)
Parse a Genomics Core project label (e.g., P00-0) and return individual components indicating project ID and subproject ID.
- Parameters
project_label (str) – string following Genomics Core convention for project labels, P<project ID>-<subproject ID>
- Return type
dict
- Returns
a dict with fields for ‘project_id’ and ‘subproject_id’
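Splitting a project label of the form P<project ID>-<subproject ID> might be sketched as below. Returning the two IDs as integers and raising `ValueError` on a non-matching label are assumptions; the label ‘P14-12’ is an arbitrary example.

```python
import re


def parse_project_label(project_label):
    """Split a label such as 'P14-12' into project and subproject IDs.

    ASSUMPTION: IDs are returned as ints and a bad label raises ValueError.
    """
    match = re.search(r"P(\d+)-(\d+)", project_label)
    if match is None:
        raise ValueError("not a project label: {!r}".format(project_label))
    return {
        "project_id": int(match.group(1)),
        "subproject_id": int(match.group(2)),
    }


parsed_label = parse_project_label("P14-12")
```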
illumina module
-
bripipetools.parsing.illumina.get_flowcell_id(string)
Return flowcell ID.
- Parameters
string (str) – any string that might contain an Illumina flowcell ID (e.g., C6VG0ANXX)
- Return type
str
- Returns
the matching substring representing the flowcell ID or an empty string (‘’) if no match found
-
bripipetools.parsing.illumina.parse_fastq_filename(path)
Parse standard Illumina FASTQ filename and return individual components indicating generic path, lane ID, read ID, and sample number.
- Parameters
path (str) – full path to FASTQ file with filename adhering to standard Illumina format (e.g., ‘1D-HC29-C04_S27_L001_R1_001.fastq.gz’)
- Return type
dict
- Returns
a dict with fields for ‘path’ (with root removed), ‘lane_id’, ‘read_id’, and ‘sample_number’
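A rough sketch of pulling the lane, read, and sample number out of a standard Illumina FASTQ filename follows. The regular expression, the integer sample number, and the directory in the example path are assumptions; unlike the documented function, this sketch leaves the path unchanged rather than removing the root.

```python
import os
import re

# Matches the '_S<sample>_L<lane>_<read>_<chunk>.fastq[.gz]' filename suffix.
FASTQ_SUFFIX = re.compile(
    r"_S(?P<sample>\d+)_(?P<lane>L\d{3})_(?P<read>[RI]\d)_\d{3}\.fastq(\.gz)?$"
)


def parse_fastq_filename(path):
    """Extract lane ID, read ID, and sample number from a FASTQ filename."""
    match = FASTQ_SUFFIX.search(os.path.basename(path))
    if match is None:
        raise ValueError("unexpected FASTQ filename: {!r}".format(path))
    return {
        "path": path,  # root removal is omitted in this sketch
        "lane_id": match.group("lane"),
        "read_id": match.group("read"),
        "sample_number": int(match.group("sample")),
    }


# The directory portion here is made up for the example.
fastq = parse_fastq_filename("/data/run1/1D-HC29-C04_S27_L001_R1_001.fastq.gz")
```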
-
bripipetools.parsing.illumina.parse_flowcell_run_id(run_id)
Parse Illumina flowcell run ID (or folder name) and return individual components indicating date, instrument ID, run number, flowcell ID, and flowcell position.
- Parameters
run_id (str) – string adhering to standard Illumina format (e.g., ‘150615_D00565_0087_AC6VG0ANXX’) for a sequencing run
- Return type
dict
- Returns
a dict with fields for ‘date’, ‘instrument_id’, ‘run_number’, ‘flowcell_id’, and ‘flowcell_position’
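The run ID in the example above splits into four underscore-separated parts, with the flowcell position as the first character of the last part. The sketch below assumes an ISO 8601 date string and an integer run number as output formats; the actual return types are not stated in this reference.

```python
from datetime import datetime


def parse_flowcell_run_id(run_id):
    """Split '<YYMMDD>_<instrument>_<run number>_<position><flowcell ID>'."""
    date_str, instrument_id, run_number, flowcell = run_id.split("_")
    return {
        # ASSUMPTION: date is reported as an ISO 8601 string.
        "date": datetime.strptime(date_str, "%y%m%d").date().isoformat(),
        "instrument_id": instrument_id,
        # ASSUMPTION: run number is reported as an int (leading zeros dropped).
        "run_number": int(run_number),
        "flowcell_id": flowcell[1:],
        "flowcell_position": flowcell[0],
    }


run = parse_flowcell_run_id("150615_D00565_0087_AC6VG0ANXX")
```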
processing module
-
bripipetools.parsing.processing.parse_batch_name(batch_name)
Parse batch name indicated in workflow batch submit file and return individual components indicating date, list of project labels, and flowcell ID.
-
bripipetools.parsing.processing.parse_output_filename(output_path)
Parse output name indicated by parameter tag in the output file and return individual components indicating processed library ID, output source, and type.
util module
Includes convenience methods related to handling and manipulating
strings (util.strings), file paths (util.files), as well as
user interactions via the command line (util.ui). Methods are used
throughout other packages to streamline common operations.
strings submodule
-
bripipetools.util.strings.matchdefault(pattern, string, default='')
Search for pattern in string and return default string if no match.
- Parameters
pattern (str) – non-compiled regular expression to search for in input string
string (str) – any string
default (str) – string to return if no match found
- Return type
str
- Returns
substring matched to regular expression or default string, if no match found
-
bripipetools.util.strings.matchlastdefault(pattern, string, default='')
Search for pattern in string from right and return default string if no match.
- Parameters
pattern (str) – non-compiled regular expression to search for in input string
string (str) – any string
default (str) – string to return if no match found
- Return type
str
- Returns
rightmost substring matched to regular expression or default string, if no match found
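Both helpers can be approximated with the standard `re` module, as in the sketch below. Taking the last non-overlapping match from `re.finditer` is one way to get the rightmost match and is an assumption about how the library does it.

```python
import re


def matchdefault(pattern, string, default=""):
    """Return the first match of `pattern` in `string`, else `default`."""
    match = re.search(pattern, string)
    return match.group() if match else default


def matchlastdefault(pattern, string, default=""):
    """Return the last non-overlapping match of `pattern`, else `default`."""
    matches = [m.group() for m in re.finditer(pattern, string)]
    return matches[-1] if matches else default


first = matchdefault(r"lib\d+", "lib1234_lib5678")
last = matchlastdefault(r"lib\d+", "lib1234_lib5678")
fallback = matchdefault(r"lib\d+", "Sample_12", default="NA")
```

Returning a default string instead of `None` lets callers chain further string operations without checking for a failed match first.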
-
bripipetools.util.strings.to_camel_case(snake_str)
Convert snake_case string to camelCase.
- Parameters
snake_str (str) – a string in snake_case format
- Return type
str
- Returns
input string converted to camelCase format
-
bripipetools.util.strings.to_snake_case(camel_str)
Convert camelCase string to snake_case. Adapted from: http://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-snake-case
- Parameters
camel_str (str) – a string in camelCase format
- Return type
str
- Returns
input string converted to snake_case format
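For reference, the two conversions can be sketched as below. The snake_case version uses the widely circulated two-pass regular expression approach; whether the library's code matches it exactly is an assumption.

```python
import re


def to_camel_case(snake_str):
    """Convert 'raw_data' style names to 'rawData'."""
    first, *rest = snake_str.split("_")
    # Capitalize only the first letter of each later component.
    return first + "".join(part[:1].upper() + part[1:] for part in rest)


def to_snake_case(camel_str):
    """Convert 'rawData' style names to 'raw_data'."""
    # Pass 1: split before a capital that begins a capitalized word.
    partial = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", camel_str)
    # Pass 2: split between a lowercase letter or digit and a capital.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", partial).lower()


camel = to_camel_case("flowcell_position")
snake = to_snake_case("parentId")
```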
files submodule
-
bripipetools.util.files.locate_root_folder(top_level, max_depth=3)
Find the root of a file path preceding a specified ‘top level’ directory.
- Parameters
top_level (str) – nominal ‘top level’ directory immediately following root (e.g., ‘genomics’ in ‘/Volumes/genomics’); should be a relatively unique folder name, at least within the specified depth
max_depth (int) – how many directory levels down from the true system root to search for the top_level folder
- Return type
str
- Returns
a string representing the part of the file path starting from the current system root up to (but not including) the top_level folder
-
bripipetools.util.files.swap_root(path, top_level, new_root='/~/')
Replace section of file path preceding a specified ‘top level’ directory with a different string (mostly for use with Globus transfers).
- Parameters
path (str) – any system file path
top_level (str) – nominal ‘top level’ directory to immediately follow new root (e.g., ‘genomics’ in ‘/Volumes/genomics’)
new_root (str) – string specifying the new root of the file path
- Return type
str
- Returns
modified path with new root
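One way to sketch the root swap is to split the path on ‘/’ and rejoin it from the top-level folder onward. Raising `ValueError` when the folder is absent and the subfolders in the example path are assumptions made for illustration.

```python
def swap_root(path, top_level, new_root="/~/"):
    """Replace everything before the `top_level` folder with `new_root`.

    ASSUMPTION: raises ValueError if `top_level` is not a path component.
    """
    parts = path.split("/")
    start = parts.index(top_level)  # ValueError if the folder is missing
    # Normalize so exactly one '/' separates the new root from the rest.
    return new_root.rstrip("/") + "/" + "/".join(parts[start:])


# The folders below 'genomics' are made up for the example.
swapped = swap_root("/Volumes/genomics/Illumina/run1", "genomics")
```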