bripipetools core packages¶

Overview¶

“Core” packages are where most of the heavy lifting happens, and are called by application-level modules to perform various pipeline tasks. Packages are listed roughly in order of dependency hierarchy (i.e., packages listed first depend on subsequently listed packages).

Note

Intended for developers!

The documentation below is effectively a dump of all low-level packages, modules, classes, and methods that are used to run bripipetools. This amount of detail shouldn’t be needed for most users, but provides a starting point for those looking to understand or modify the code.

Package details¶

`annotation` package¶

Includes critical functionality for identifying, locating, and describing data and results at various points (e.g., data generation, computational processing) in the bioinformatics pipeline. Each “annotator” class, contained in its respective module, is responsible for collecting and/or updating information for a specific object in the GenLIMS database. When possible, details for an object are retrieved directly from the database; for new objects or objects with missing fields, information is compiled, parsed, and formatted (as needed) from files on the server.
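The retrieve-or-create pattern described above can be sketched as follows. This is a minimal illustration, not the bripipetools implementation: the `Annotator` class, its method names, and the plain dict standing in for the GenLIMS database are all hypothetical.

```python
class Annotator:
    """Hypothetical stand-in for an annotator class (not the real API)."""

    def __init__(self, object_id, db):
        self.object_id = object_id
        self.db = db  # assumption: a dict of {id: document}, not a real database
        self.obj = self._init_object()

    def _init_object(self):
        # Prefer the existing database record; otherwise start a new object.
        doc = self.db.get(self.object_id)
        if doc is not None:
            return dict(doc)
        return {"_id": self.object_id}

    def _update_object(self, compiled):
        # Fill in only the fields that are missing or empty.
        for key, value in compiled.items():
            if self.obj.get(key) is None:
                self.obj[key] = value

    def get_object(self, compiled):
        # 'compiled' represents details gathered from files on the server.
        self._update_object(compiled)
        return self.obj
```

In this sketch, fields already present in the database record win over compiled values, and only gaps are filled from the server-side information.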

`sequencedlibraries` module¶

Classify / provide details for sequenced libraries (outputs of a flowcell sequencing run) and the associated raw data.

class bripipetools.annotation.sequencedlibs.SequencedLibraryAnnotator(path, library, project, run_id, db)[source]¶

Identifies, stores, and updates information about a sequenced library.

_get_raw_data()[source]¶: Locate and store details about raw data for sequenced library.

_init_sequencedlibrary()[source]¶: Try to retrieve data for the sequenced library from GenLIMS; if unsuccessful, create new SequencedLibrary object.

_update_sequencedlibrary()[source]¶: Add any missing fields to SequencedLibrary object.

get_sequenced_library()[source]¶: Return sequenced library object with updated fields.

`flowcellruns` module¶

Classify / provide details for objects generated from an Illumina sequencing run performed by the BRI Genomics Core.

class bripipetools.annotation.flowcellruns.FlowcellRunAnnotator(run_id, pipeline_root, db)[source]¶

Identifies, stores, and updates information about a flowcell run.

_init_flowcellrun()[source]¶: Try to retrieve data for the flowcell run from GenLIMS; if unsuccessful, create new FlowcellRun object.

_update_flowcellrun()[source]¶: Add any missing fields to FlowcellRun object.

get_flowcell_path()[source]¶: Find path to flowcell folder on the server.

get_flowcell_run()[source]¶: Return flowcell run object with updated fields.

get_libraries(project=None)[source]¶: Collect list of libraries for flowcell run from one or all projects.

get_library_gene_counts(project=None)[source]¶: Collect library gene count objects for flowcell run.

get_library_metrics(project=None)[source]¶: Collect library metrics objects for flowcell run.

get_processed_libraries(project=None, sub_path='inputFastqs')[source]¶: Collect list of processed libraries for flowcell run from one or all projects.

get_processed_projects()[source]¶: List processed projects for a flowcell run.

get_projects()[source]¶: Collect list of projects for flowcell run.

get_sequenced_libraries(project=None)[source]¶: Collect sequenced library objects for flowcell run.

get_unaligned_path()[source]¶: Find path to unaligned data in flowcell folder.

`processedlibraries` module¶

class bripipetools.annotation.processedlibs.ProcessedLibraryAnnotator(workflowbatch_id, params, db)[source]¶

Identifies, stores, and updates information about a processed library.

_append_processed_data()[source]¶: Add details and outputs for current workflow batch to processed data array field for processed library.

_get_outputs()[source]¶: Return the list of outputs from the processing workflow batch.

_get_seqlib_id()[source]¶: Return the ID of the parent sequenced library.

_group_outputs()[source]¶: Organize outputs according to type and source.

_init_processedlibrary()[source]¶: Try to retrieve data for the processed library from GenLIMS; if unsuccessful, create new ProcessedLibrary object.

_parse_output_name(output_name)[source]¶: Parse output name indicated by parameter tag in workflow batch submit file and return individual components indicating name, source, and type.

_update_processedlibrary()[source]¶: Add or update any missing fields in ProcessedLibrary object.

get_processed_library()[source]¶: Return updated ProcessedLibrary object.

`workflowbatches` module¶

Classify / provide details for objects generated from a Globus Galaxy workflow processing batch performed by the BRI Bioinformatics Core.

class bripipetools.annotation.workflowbatches.WorkflowBatchAnnotator(workflowbatch_file, pipeline_root, db, run_opts)[source]¶

Identifies, stores, and updates information about a workflow batch.

_check_sex(processedlibrary)[source]¶: Retrieve reported sex for sample and compare to predicted sex of processed library.

_init_workflowbatch()[source]¶: Try to retrieve data for the workflow batch from GenLIMS; if unsuccessful, create new GalaxyWorkflowBatch object.

_run_qc(processedlibrary)[source]¶

_update_workflowbatch()[source]¶: Add any missing fields to GalaxyWorkflowBatch object.

get_processed_libraries(project=None, qc=False)[source]¶: Collect processed library objects for workflow batch.

get_sequenced_libraries()[source]¶: Collect list of sequenced libraries processed as part of workflow batch.

get_workflow_batch()[source]¶: Return workflow batch object with updated fields.

`qc` package¶

Contains classes and methods for performing post-hoc quality control operations on raw or processed genomics data. Modules are organized according to the specific QC step performed. Unlike routine quality inspection metrics and information provided by standard bioinformatics tools through processing workflows, modules here are aimed more at identifying problems with sample handling or data generation. As such, outputs from these submodules are designated as a special type, ‘validation’, to distinguish them from the QC, metrics, counts, and other output types generated through processing.

`sexcheck` module¶

Class and methods to perform routine sex check on all processed libraries.

class bripipetools.qc.sexcheck.SexChecker(processedlibrary, reference, workflowbatch_id, pipeline_root, db, run_opts)[source]¶

Reads gene counts for a processed library, maps genes to X and Y chromosomes, computes ratio of Y to X counts, gives predicted sex based on pre-defined rule.

_compute_x_y_data()[source]¶: Collect and store X and Y gene/count data as well as total counts for the current processed library.

_get_counts_path()[source]¶: Construct the absolute path to the counts file for the specified workflow batch.

_get_x_y_counts()[source]¶: Extract and store counts for X and Y genes; also store count total.

_load_x_genes(ref='grch38')[source]¶: Read X chromosome gene IDs from file and return data frame.

_load_y_genes(ref='grch38')[source]¶: Read Y chromosome gene IDs from file and return data frame.

_predict_sex()[source]¶: Return predicted sex based on X/Y gene equation and cutoff.

_verify_sex()[source]¶: Compare predicted sex to reported sex.

_write_data()[source]¶: Save the sex validation data to a new file.

update()[source]¶: Add predicted sex validation field to processed library outputs and return processed library object.

`sexverify` module¶

Class and methods to verify the predicted sex of a processed library against the reported sex for the sample.

class bripipetools.qc.sexverify.SexVerifier(data, processedlibrary, db)[source]¶

Compares the reported sex for a sample to the predicted sex of a processed library.

_retrieve_sex(parent_id)[source]¶: Retrieve reported sex for sample.

verify()[source]¶

Compare reported sex for sample to predicted sex of processed library.

Returns

`sexpredict` module¶

Class and methods to predict sex from X and Y gene count data for a processed library.

class bripipetools.qc.sexpredict.SexPredictor(data, run_opts)[source]¶

Predicts sex based on X and Y gene count data using a pre-defined rule.

_compute_y_x_count_ratio()[source]¶: Calculate the ratio of Y counts to X counts.

_compute_y_x_gene_ratio()[source]¶: Calculate the ratio of Y genes detected to X genes detected, where detected = count > 0.

_predict_sex()[source]¶: Return predicted sex based on X/Y gene equation and cutoff.

predict()[source]¶
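The two ratios named above can be sketched in plain Python. The function names, the dict inputs, the `cutoff` parameter, and the 'male'/'female' labels are assumptions made for illustration; the actual rule and cutoff used by SexPredictor are not stated in this documentation.

```python
def compute_y_x_ratios(x_counts, y_counts):
    """Return (count ratio, gene ratio) of Y to X.

    x_counts and y_counts are assumed to be dicts of {gene ID: read count}.
    """
    x_total = sum(x_counts.values())
    y_total = sum(y_counts.values())
    # A gene is 'detected' when its count is greater than zero.
    x_detected = sum(1 for count in x_counts.values() if count > 0)
    y_detected = sum(1 for count in y_counts.values() if count > 0)
    count_ratio = y_total / x_total if x_total else float("nan")
    gene_ratio = y_detected / x_detected if x_detected else float("nan")
    return count_ratio, gene_ratio


def predict_sex(gene_ratio, cutoff):
    # Hypothetical rule: the caller must supply the cutoff; no real
    # threshold is implied here.
    return "male" if gene_ratio > cutoff else "female"
```

Returning NaN when no X counts or genes are present is a choice made for this sketch so that the caller can decide how to treat libraries without usable data.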

`database` package¶

Contains methods for interacting with BRI databases (GenLIMS and ResDB) at a low level: connecting to them, retrieving data from them, and inserting data into them. Under the hood, much of the functionality in this package relies on the pymongo client library for MongoDB. The database.operations module provides wrapper functions for getting/putting objects from/to commonly used database collections, while database.mapping helps to construct Python model class objects from database documents. Methods in the database.connection module manage the database connection, depending on environment and configurations.

`connection` module¶

Connect to a BRI Mongo database.

bripipetools.database.connection.connect(db_config_name)[source]¶

Check the current environment to determine which database parameters to use, then connect to the target database on the specified host.

Returns: A database connection object.

`operations` module¶

Basic operations for BRI Mongo databases.

bripipetools.database.operations.create_workflowbatch_id(db, prefix, date)[source]¶

Check the ‘workflowbatches’ collection and construct ID with lowest available batch number (i.e., ‘<prefix>_<date>_<number>’).

Parameters

db (type[pymongo.database.Database]) – database object for current MongoDB connection
prefix (str) – base string for workflow batch ID, based on workflow batch type (e.g., ‘globusgalaxy’ for Globus Galaxy workflow)
date (type[datetime.datetime]) – date on which workflow batch was run

Return type

str

Returns

a unique ID for the workflow batch, with the prefix and date combination appended with the highest available integer
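As a rough sketch of the ID construction, the example below works on an in-memory list of existing IDs rather than the ‘workflowbatches’ collection. The function name `next_workflowbatch_id`, the `%y%m%d` date format, and the choice to start numbering at 1 and take the lowest unused number are all assumptions.

```python
import datetime
import re


def next_workflowbatch_id(existing_ids, prefix, date):
    """Build '<prefix>_<date>_<number>' using the lowest unused number."""
    date_str = date.strftime("%y%m%d")  # assumed date format
    base = "{}_{}".format(prefix, date_str)
    pattern = re.compile(r"^{}_(\d+)$".format(re.escape(base)))

    # Collect batch numbers already taken for this prefix/date combination.
    used = set()
    for existing_id in existing_ids:
        match = pattern.match(existing_id)
        if match:
            used.add(int(match.group(1)))

    number = 1  # assumption: numbering starts at 1
    while number in used:
        number += 1
    return "{}_{}".format(base, number)
```

A real implementation would query the database for existing IDs instead of receiving them as a list.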

bripipetools.database.operations.find_objects(collection)[source]¶

Return a decorator that retrieves objects from the specified collection, given a db connection and query.

Parameters: collection (str) – String indicating the name of the collection
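A decorator factory of this kind might look like the sketch below. It is not the bripipetools source: `FakeCollection` is a hypothetical stand-in that supports only exact-match queries, so the example runs without a MongoDB server, and `get_samples` is an illustrative wrapper name. With pymongo, `db[collection].find(query)` would return a cursor, which `list()` also consumes.

```python
import functools


class FakeCollection:
    """Hypothetical stand-in for a pymongo collection (exact-match find only)."""

    def __init__(self, docs):
        self.docs = docs

    def find(self, query):
        return [doc for doc in self.docs
                if all(doc.get(key) == value for key, value in query.items())]


def find_objects(collection):
    """Return a decorator that retrieves documents from the named collection."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(db, query):
            # The wrapped function only supplies a name and docstring here.
            return list(db[collection].find(query))
        return wrapper
    return decorator


@find_objects("samples")
def get_samples(db, query):
    """Return list of documents from the 'samples' collection based on query."""
```

The appeal of this design is that each collection-specific getter reduces to a decorated stub, so the retrieval logic lives in one place.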

bripipetools.database.operations.get_genomicsCounts(db, query)[source]¶: Return list of documents from ‘genomicsCounts’ collection based on query.

bripipetools.database.operations.get_genomicsMetrics(db, query)[source]¶: Return list of documents from ‘genomicsMetrics’ collection based on query.

bripipetools.database.operations.get_genomicsRuns(db, query)[source]¶: Return list of documents from ‘genomicsRuns’ collection based on query.

bripipetools.database.operations.get_genomicsSamples(db, query)[source]¶: Return list of documents from ‘genomicsSamples’ collection based on query.

bripipetools.database.operations.get_genomicsWorkflowbatches(db, query)[source]¶: Return list of documents from ‘genomicsWorkflowbatches’ collection based on query.

bripipetools.database.operations.insert_objects(collection)[source]¶

Return a decorator that inserts one or more objects into the specified collection; if an object already exists, updates any individual fields that are not empty in the input object.

Parameters: collection (str) – string indicating the name of the collection

bripipetools.database.operations.put_genomicsCounts(db, counts)[source]¶: Insert each document in list into ‘genomicsCounts’ collection.

bripipetools.database.operations.put_genomicsMetrics(db, metrics)[source]¶: Insert each document in list into ‘genomicsMetrics’ collection.

bripipetools.database.operations.put_genomicsRuns(db, runs)[source]¶: Insert each document in list into ‘genomicsRuns’ collection.

bripipetools.database.operations.put_genomicsSamples(db, samples)[source]¶: Insert each document in list into ‘genomicsSamples’ collection.

bripipetools.database.operations.put_genomicsWorkflowbatches(db, workflowbatches)[source]¶: Insert each document in list into ‘genomicsWorkflowbatches’ collection.

bripipetools.database.operations.search_ancestors(db, sample_id, field)[source]¶

Given an object in the ‘samples’ collection, specified by the input ID, iteratively walk through ancestors based on ‘parentId’ until a value is found for the requested field.

Parameters

db (type[pymongo.database.Database]) – database object for current MongoDB connection
sample_id (str) – a unique ID for a sample in GenLIMS
field (str) – the field for which to search among ancestor samples

Returns

value for field, if found
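The ancestor walk can be illustrated with an in-memory dict in place of the ‘samples’ collection. This is a sketch under assumptions: the `samples` argument, the `None` return when nothing is found, and the cycle guard are not taken from the source.

```python
def search_ancestors(samples, sample_id, field):
    """Walk up 'parentId' links until a value for the field is found.

    samples is assumed to be a dict of {sample ID: document}.
    """
    seen = set()  # guard against accidental cycles in parent links
    current_id = sample_id
    while current_id is not None and current_id not in seen:
        seen.add(current_id)
        doc = samples.get(current_id)
        if doc is None:
            return None
        value = doc.get(field)
        if value is not None:
            return value
        current_id = doc.get("parentId")
    return None
```

For example, a processed library with no ‘sex’ field would inherit the value from the nearest ancestor sample that has one.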

`mapping` module¶

bripipetools mapping submodule: methods to map from Mongo documents to model classes.

bripipetools.database.mapping.get_model_class(doc)[source]¶

Find the matching class for the document, based on its type.

Parameters: doc (dict) – a dict representing a MongoDB document/object
Return type: str
Returns: a string representing the name of the matched class from the model module

bripipetools.database.mapping.map_keys(obj)[source]¶

Convert keys in a dictionary (or nested dictionary) from camelCase to snake_case; ignore ‘_id’ keys.

Parameters: obj (dict, list) – a dict or list of dicts with string keys to be converted
Return type: dict, list
Returns: a dict or list of dicts with string keys converted from camelCase to snake_case
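A recursive key conversion along these lines might be written as below. The helper `to_snake_case` uses a common two-pass regular expression approach; both functions are sketches and may differ from the library code in edge cases.

```python
import re


def to_snake_case(camel_str):
    # Insert underscores before capitalized words, then lowercase everything.
    step = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", camel_str)
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", step).lower()


def map_keys(obj):
    """Convert dict keys from camelCase to snake_case, leaving '_id' alone."""
    if isinstance(obj, list):
        return [map_keys(item) for item in obj]
    if isinstance(obj, dict):
        return {
            (key if key == "_id" else to_snake_case(key)): map_keys(value)
            for key, value in obj.items()
        }
    return obj  # non-container values pass through unchanged
```

Values that are neither dicts nor lists are returned as-is, so nested documents of any depth are handled by the same function.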

bripipetools.database.mapping.map_to_object(doc)[source]¶

Convert document to model class of appropriate type.

Parameters: doc (dict) – a dict representing a MongoDB document/object
Return type: type[docs.TG3Object]
Returns: a new instance of the matched model class

`model` package¶

Establishes the underlying data model linking data from bioinformatics processing pipelines to the GenLIMS/TG3 database. Python class representations of database objects (documents) are defined in the model.documents module. These classes include some basic functionality, mostly related to setting/formatting attributes, which are eventually fed back into the database as key-value pairs. However, model classes are also the basic “currency” for several other modules, where they are used to retrieve, modify, store, and return data.

Depends on the util and parsing modules.

`documents` module¶

Classes representing documents in the GenLIMS database.

class bripipetools.model.documents.FlowcellRun(**kwargs)[source]¶

GenLIMS object in the ‘runs’ collection of type ‘flowcell’.

_flowcell_path = None¶

property flowcell_path¶: Return root-agnostic path to flowcell data folder.

class bripipetools.model.documents.GalaxyWorkflowBatch(workflowbatch_file=None, **kwargs)[source]¶

GenLIMS object in ‘workflow batches’ collection of type ‘Galaxy workflow’

Parameters: workflowbatch_file (str) – path to file describing samples and parameters of Globus Galaxy workflow batch

class bripipetools.model.documents.GeneCounts(**kwargs)[source]¶

Research Database object in ‘counts’ collection of type ‘gene counts’

property gene_counts¶: Return list of dictionaries with information about each library’s gene counts.

class bripipetools.model.documents.GenericRun(protocol_id=None, date=None, **kwargs)[source]¶

GenLIMS object in the ‘runs’ collection

Parameters

protocol_id (str) – unique ID of a protocol object in the GenLIMS database (in the ‘protocols’ collection)
date (str) – string indicating date of the run in ISO 8601 format

class bripipetools.model.documents.GenericSample(project_id=None, subproject_id=None, protocol_id=None, parent_id=None, **kwargs)[source]¶

GenLIMS object in the ‘samples’ collection

Parameters

project_id (int) – Genomics Core project number
subproject_id (int) – Genomics Core sub-project number
protocol_id (str) – unique ID of a protocol object in the GenLIMS database (in the ‘protocols’ collection)
parent_id (str) – unique ID of a sample object in the GenLIMS database from which the current sample was derived (in the ‘samples’ collection)

class bripipetools.model.documents.GenericWorkflow(**kwargs)[source]¶: GenLIMS object in the ‘workflows’ collection

class bripipetools.model.documents.GenericWorkflowBatch(**kwargs)[source]¶: GenLIMS object in the ‘workflow batches’ collection

class bripipetools.model.documents.GlobusGalaxyWorkflow(**kwargs)[source]¶: GenLIMS object in ‘workflows’ collection of type ‘Globus Galaxy workflow’

class bripipetools.model.documents.Library(**kwargs)[source]¶: GenLIMS object in ‘samples’ collection of type ‘library’

class bripipetools.model.documents.Metrics(**kwargs)[source]¶

Research Database object in ‘metrics’ collection of type ‘metrics’

property metrics¶: Return list of dictionaries with information about each library’s metrics.

class bripipetools.model.documents.ProcessedLibrary(**kwargs)[source]¶

GenLIMS object in ‘samples’ collection of type ‘processed library’

property processed_data¶: Return list of dictionaries with information about each set of data processing outputs (i.e., from workflow batches).

class bripipetools.model.documents.SequencedLibrary(run_id=None, **kwargs)[source]¶

GenLIMS object in ‘samples’ collection of type ‘sequenced library’

Parameters: run_id (str) – unique ID of a run object in the GenLIMS database

property raw_data¶: Return list of dictionaries with information about each raw data file (i.e., FASTQ) for a sequenced library.

class bripipetools.model.documents.TG3Object(_id=None, type=None, is_mapped=False)[source]¶

Generic functions for objects in TG3 collections.

Parameters

_id (str) – unique object identifier in the GenLIMS/TG3 Mongo database
type (str) – field indicating object type in a collection
is_mapped (bool) – flag indicating whether class instance was mapped from a database object (True) or created from scratch (False)

to_json()[source]¶

Return object attributes as dictionary with keys formatted as camel case.

Return type: dict
Returns: a dict containing class instance attributes, with all field names converted from snake case to camel case

update_attrs(attr_map, force=False)[source]¶

Given a dictionary of key-value pairs for attribute names with new values, update each attribute. Always update empty (‘None’) attributes and set any new attributes; update all modified attributes if force option is ‘True’.

Parameters

attr_map (dict) – a dict with key-value pairs representing object attributes and values to which they should be set
force (bool) – force overwrite of object fields in database, if they already exist
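The update rule for update_attrs can be sketched with a small stand-in class. `Doc` is hypothetical and is not TG3Object; it only illustrates the described behavior: empty or new attributes are always set, and existing values are overwritten only when `force` is true.

```python
class Doc:
    """Hypothetical stand-in for a model class."""

    def __init__(self, **attrs):
        for name, value in attrs.items():
            setattr(self, name, value)

    def update_attrs(self, attr_map, force=False):
        for name, value in attr_map.items():
            # Missing attributes read as None, so new ones are always set.
            current = getattr(self, name, None)
            if current is None or force:
                setattr(self, name, value)
```

For instance, updating an object whose `type` is already set leaves it untouched by default, while a `parent_id` of `None` is filled in.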

bripipetools.model.documents.convert_keys(obj)[source]¶

Convert keys in a dictionary (or nested dictionary) from snake_case to camelCase; ignore ‘_id’ keys.

Parameters: obj (dict, list) – A dict or list of dicts with string keys to be converted.
Return type: dict, list
Returns: A dict or list of dicts with string keys converted from snake_case to camelCase.

`io` package¶

Contains class representations of various file types produced through the generation or processing of genomics data. In particular, most of these classes provide methods for reading and parsing raw data from files and storing/returning these data in a more usable format, such as dictionaries or data frames. Each module contains the representation of a file generated by a particular tool or routine; some submodules may handle files from multiple methods within a tool (e.g., Picard). While not explicitly organized as such, modules adhere to a hierarchy based on the “type” of file, where current types include metrics, counts, QC, and validation.

`workflow` module¶

Class for reading and parsing Galaxy workflow files.

class bripipetools.io.workflow.WorkflowFile(path)[source]¶

Parser for exported workflow descriptions from Galaxy, stored in a JSON-like format with extension .ga.

_read_file()[source]¶: Read file into dictionary.

get_tool_info()[source]¶: Retrieve tools and versions from a workflow as a dictionary

get_workflow_name()[source]¶: Retrieve the workflow name

parse()[source]¶: Parse workflow file and return dictionary.

`workflowbatch` module¶

Classes for reading, parsing, and writing workflow batch submit files for Globus Galaxy.

class bripipetools.io.workflowbatch.WorkflowBatchFile(path, state='template')[source]¶

A parser to map input sample names to expected output files based on a completed Globus Galaxy batch submit file.

Parameters

path (str) – File path of batch submit file.
state (str) – String indicating the current state of the batch submit file; either template or submit (if populated with project and sample information).

_locate_batch_name_line()[source]¶: Identify batch file metadata line with place-holder for project name; return line number. Note: batch submissions can include multiple projects, so the ‘batch name’ label is more appropriate.

_locate_param_line()[source]¶: Identify batch file header line with parameter names; return line number.

_locate_sample_start_line()[source]¶: Identify batch file line where sample parameter info begins; return line number. Note: should immediately follow parameter header line.

_locate_workflow_name_line()[source]¶: Identify batch file metadata line with name of workflow; return line number.

_read_file()[source]¶: Read and store lines from batch submit file.

get_batch_name()[source]¶: Return name of workflow batch for batch submit file.

get_params()[source]¶

Return the parameters defined for the current workflow.

Return type: list
Returns: A list of tuples with number (index) and dict with details for each parameter.

get_sample_params(sample_line)[source]¶

Collect the parameter details for each input sample; store the index and input for each parameter.

Parameters: sample_line (str) – Raw, tab-delimited line of text from workflow batch submit file describing the parameters for a single sample.
Return type: list
Returns: A list of dicts, one for each sample.

get_workflow_name()[source]¶: Return name of workflow for batch submit file.

parse()[source]¶: Parse workflow batch file and return dict.

update_batch_name(batch_name)[source]¶: Update name of workflow batch and insert in template lines.

write(path, batch_name=None, sample_lines=None)[source]¶: Write workflow batch data to file.

`picardmetrics` module¶

Class for reading and parsing Picard metrics files.

class bripipetools.io.picardmetrics.PicardMetricsFile(path)[source]¶

Parser to read tables of metrics generated by one of several Picard tools, typically stored in an HTML file, and return as a parsed and formatted dictionary.

_check_table_format()[source]¶: Check whether table is long (keys in one column, values in the other) or wide (keys in one row, values in the other).

_get_table()[source]¶: Extract metrics table from raw HTML string.

_parse_long()[source]¶: Parse long-formatted table to dictionary.

_parse_wide()[source]¶: Parse wide-formatted table to dictionary.

_read_file()[source]¶: Read file into raw HTML string.

parse()[source]¶: Parse metrics table and return dictionary.

`htseqmetrics` module¶

Class for reading and parsing htseq-count metrics files.

class bripipetools.io.htseqmetrics.HtseqMetricsFile(path)[source]¶

Parser to read tables of metrics generated by the htseq-count tool, stored in a tab-delimited text file.

_parse_lines()[source]¶: Get key-value pairs from text lines and return dictionary.

_read_file()[source]¶: Read file into list of raw strings.

parse()[source]¶: Parse metrics table and return dictionary.

`tophatstats` module¶

Class for reading and parsing Tophat Stats metrics files.

class bripipetools.io.tophatstats.TophatStatsFile(path)[source]¶

Parser to read tables of metrics generated by custom Tophat Stats PE tool, stored in a tab-delimited text file.

_parse_lines()[source]¶: Get key-value pairs from text lines and return dictionary.

_read_file()[source]¶: Read file into list of raw strings.

parse()[source]¶: Parse metrics table and return dictionary.

`fastqc` module¶

Class for reading and parsing FastQC report files.

class bripipetools.io.fastqc.FastQCFile(path)[source]¶

Parser to read QC data from a FastQC report, stored in a tab-delimited text file.

_clean_header(header)[source]¶: Extract section header from header line, convert to snake case.

_clean_value(value)[source]¶: Convert to numeric unless value contains text.

_get_section_status(section_name, section_info)[source]¶: Return a tuple with the section name and status.

_locate_sections()[source]¶: Return a dict with section names as keys and tuples of start/end line numbers as values.

_parse_section_table(section_info)[source]¶: For the specified section lines, parse tab-delimited columns into key-value pairs and return list of tuples.

_read_file()[source]¶: Read file into list of raw strings.

parse()[source]¶: Parse file and return key-value pairs as dictionary.

parse_overrepresented_seqs()[source]¶: Parse table of overrepresented sequences, return as list of dictionaries.
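Section location in a FastQC data file can be sketched as below, relying on the FastQC convention that each module starts with a `>>Name<TAB>status` line and ends with `>>END_MODULE`. The standalone function, the snake_case naming of sections, and the choice to record the header line and the end marker line as the start/end pair are assumptions; the class method may index lines differently.

```python
def locate_sections(lines):
    """Map each section name to a (start, end) tuple of line indices."""
    sections = {}
    name, start = None, None
    for index, line in enumerate(lines):
        if line.startswith(">>END_MODULE"):
            if name is not None:
                sections[name] = (start, index)
            name = None
        elif line.startswith(">>"):
            # Header looks like '>>Basic Statistics<TAB>pass'.
            header = line[2:].rstrip("\n").split("\t")[0]
            name = header.strip().lower().replace(" ", "_")
            start = index
    return sections
```

A section table can then be parsed by slicing the lines between the recorded start and end indices.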

`htseqcounts` module¶

Class for reading and parsing htseq files.

class bripipetools.io.htseqcounts.HtseqCountsFile(path)[source]¶

Parser to read tables of counts generated by the htseq-count tool, stored in a tab-delimited text file.

_read_file()[source]¶: Read file into Pandas data frame.

parse()[source]¶: Parse counts file and return data frame.

`sexcheck` module¶

Class for reading and parsing sex check validation files.

class bripipetools.io.sexcheck.SexcheckFile(path)[source]¶

Parser to read sex check validation data for a processed library, stored in a text file.

_parse_lines()[source]¶: Get key-value pairs from text lines and return dictionary.

_read_file()[source]¶: Read file into list of raw strings.

parse()[source]¶: Parse metrics table and return dictionary.

`parsing` package¶

Slightly more specialized than methods in the util.strings module, provides functions for parsing and extracting information from strings that follow some expected nomenclature. The primary examples of this information are IDs, names, labels, and other metadata for files and objects generated either by Illumina technology or the BRI Genomics Core (via GenLIMS). The parsing.processing module is also designed to handle specialized strings and labels related to processing workflows in Globus Galaxy.

Depends on the util module.

`gencore` module¶

bripipetools.parsing.gencore.get_library_id(string)[source]¶

Return library ID matched in input string.

Parameters: string (str) – any string that might contain a library ID of the format ‘lib1234’
Return type: str
Returns: the matching substring representing the library ID or an empty string (‘’) if no match found

bripipetools.parsing.gencore.get_project_label(string)[source]¶

Return a Genomics Core project label matched in input string.

Parameters: string (str) – any string
Return type: str
Returns: Genomics Core project label (e.g., P00-0) substring or empty string, if no match found

bripipetools.parsing.gencore.get_sample_id(string)[source]¶

More general than library ID; returns either the library ID (if present) or any word that starts with ‘Sample_’, ends in a number, and precedes any non-alphanumeric characters.

Parameters: string (str) – any string that might contain a form of sample ID
Return type: str
Returns: the matching substring representing the sample ID or an empty string (‘’) if no match found

bripipetools.parsing.gencore.parse_batch_file_path(batchfile_path)[source]¶: Return ‘genomics’ root and batch file name based on directory path.

bripipetools.parsing.gencore.parse_flowcell_path(flowcell_path)[source]¶: Return ‘genomics’ root and run ID based on directory path.

bripipetools.parsing.gencore.parse_project_label(project_label)[source]¶

Parse a Genomics Core project label (e.g., P00-0) and return individual components indicating project ID and subproject ID.

Parameters: project_label (str) – String following Genomics Core convention for project labels, P<project ID>-<subproject ID>
Return type: dict
Returns: a dict with fields for ‘project_id’ and ‘subproject_id’
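A minimal sketch of the label parsing, assuming the `P<project ID>-<subproject ID>` convention with integer IDs. The regular expression and the `None` values returned for a non-matching label are assumptions, not the library's documented behavior.

```python
import re


def parse_project_label(project_label):
    """Split a label such as 'P14-12' into project and subproject IDs."""
    match = re.search(r"P(\d+)-(\d+)", project_label)
    if match is None:
        # Assumed fallback for strings that do not follow the convention.
        return {"project_id": None, "subproject_id": None}
    return {
        "project_id": int(match.group(1)),
        "subproject_id": int(match.group(2)),
    }
```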

`illumina` module¶

bripipetools.parsing.illumina.get_flowcell_id(string)[source]¶

Return flowcell ID.

Parameters: string (str) – any string that might contain an Illumina flowcell ID (e.g., C6VG0ANXX)
Return type: str
Returns: the matching substring representing the flowcell ID or an empty string (‘’) if no match found

bripipetools.parsing.illumina.parse_fastq_filename(path)[source]¶

Parse standard Illumina FASTQ filename and return individual components indicating generic path, lane ID, read ID, and sample number.

Parameters: path (str) – full path to FASTQ file with filename adhering to standard Illumina format (e.g., ‘1D-HC29-C04_S27_L001_R1_001.fastq.gz’)
Return type: dict
Returns: a dict with fields for ‘path’ (with root removed), ‘lane_id’, ‘read_id’, and ‘sample_number’

bripipetools.parsing.illumina.parse_flowcell_run_id(run_id)[source]¶

Parse Illumina flowcell run ID (or folder name) and return individual components indicating date, instrument ID, run number, flowcell ID, and flowcell position.

Parameters: run_id (str) – string adhering to standard Illumina format (e.g., ‘150615_D00565_0087_AC6VG0ANXX’) for a sequencing run
Return type: dict
Returns: a dict with fields for ‘date’, ‘instrument_id’, ‘run_number’, ‘flowcell_id’, and ‘flowcell_position’
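Splitting a run ID into the components listed above might look like the following sketch. It assumes exactly four underscore-separated fields, with the flowcell position as the first character of the last field; the ISO date string and integer run number are formatting choices made here, not confirmed output types.

```python
import datetime


def parse_flowcell_run_id(run_id):
    """Parse an ID such as '150615_D00565_0087_AC6VG0ANXX'."""
    date, instrument_id, run_number, flowcell = run_id.split("_")
    return {
        # Assumed: two-digit year, month, day.
        "date": datetime.datetime.strptime(date, "%y%m%d").date().isoformat(),
        "instrument_id": instrument_id,
        "run_number": int(run_number),
        "flowcell_id": flowcell[1:],
        "flowcell_position": flowcell[0],
    }
```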

`processing` module¶

bripipetools.parsing.processing.parse_batch_name(batch_name)[source]¶: Parse batch name indicated in workflow batch submit file and return individual components indicating date, list of project labels, and flowcell ID.

bripipetools.parsing.processing.parse_output_filename(output_path)[source]¶: Parse output name indicated by parameter tag in the output file and return individual components indicating processed library ID, output source, and type.

bripipetools.parsing.processing.parse_output_name(output_name)[source]¶

bripipetools.parsing.processing.parse_run_id_for_batch(batch_file)[source]¶: Parse the run id (YYMMDD_D00565_####_FCID) from a path to a batch file.

bripipetools.parsing.processing.parse_workflow_param(param)[source]¶: Parse workflow parameter into components indicating tag, type, and name.

`util` module¶

Includes convenience methods related to handling and manipulating strings (util.strings), file paths (util.files), as well as user interactions via the command line (util.ui). Methods are used throughout other packages to streamline common operations.

`strings` submodule¶

bripipetools.util.strings.matchdefault(pattern, string, default='')[source]¶

Search for pattern in string and return default string if no match

Parameters

pattern (str) – non-compiled regular expression to search for in input string
string (str) – any string
default (str) – string to return if no match found

Return type

str

Returns

substring matched to regular expression or default string, if no match found

bripipetools.util.strings.matchlastdefault(pattern, string, default='')[source]¶

Search for pattern in string from right, return default string if no match

Parameters

pattern (str) – non-compiled regular expression to search for in input string
string (str) – any string
default (str) – string to return if no match found

Return type

str

Returns

rightmost substring matched to regular expression or default string, if no match found
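Both helpers can be approximated with the standard `re` module, as in the sketch below. Taking the last non-overlapping match from `re.finditer` is one reasonable reading of "rightmost"; the library's own implementation may search differently.

```python
import re


def matchdefault(pattern, string, default=""):
    """Return the first match of pattern in string, or the default."""
    match = re.search(pattern, string)
    return match.group() if match else default


def matchlastdefault(pattern, string, default=""):
    """Return the last (rightmost) match of pattern in string, or the default."""
    matches = [match.group() for match in re.finditer(pattern, string)]
    return matches[-1] if matches else default
```

With a library ID pattern such as `lib\d+`, the first function picks the leftmost ID in a string containing several, and the second picks the rightmost.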

bripipetools.util.strings.to_camel_case(snake_str)[source]¶

Convert snake_case string to camelCase

Parameters: snake_str (str) – a string in snake_case format
Return type: str
Returns: input string converted to camelCase format
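One simple way to perform this conversion is shown below, as a sketch rather than the library source. It assumes lowercase words separated by single underscores.

```python
def to_camel_case(snake_str):
    """Convert 'parent_id' to 'parentId'."""
    first, *rest = snake_str.split("_")
    # Keep the first word as-is and capitalize each following word.
    return first + "".join(word.capitalize() for word in rest)
```

Note that a leading underscore produces an empty first word in this sketch, so a key such as ‘_id’ would come out as ‘Id’; callers that need to preserve such keys have to skip them explicitly.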

bripipetools.util.strings.to_snake_case(camel_str)[source]¶

Convert camelCase to snake_case. Function adapted from: http://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-snake-case

Parameters: camel_str (str) – a string in camelCase format
Return type: str
Returns: input string converted to snake_case format

`files` submodule¶

bripipetools.util.files.locate_root_folder(top_level, max_depth=3)[source]¶

Find the root of a file path preceding a specified ‘top level’ directory.

Parameters

top_level (str) – Nominal ‘top level’ directory immediately following root (e.g., ‘genomics’ in ‘/Volumes/genomics’); should be a relatively unique folder name, at least within the specified depth.
max_depth (int) – How many directory levels down from the true system root to search for top_level folder.

Return type

str

Returns

A string representing the part of the file path starting from the current system root up to (but not including) the top_level folder.

bripipetools.util.files.swap_root(path, top_level, new_root='/~/')[source]¶

Replace section of file path preceding a specified ‘top level’ directory with a different string (mostly for use with Globus transfers).

Parameters

path (str) – Any system file path.
top_level (str) – Nominal ‘top level’ directory to immediately follow new root (e.g., ‘genomics’ in ‘/Volumes/genomics’).
new_root (str) – String specifying the new root of the file path.

Return type

str

Returns

modified path with new root
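The root swap can be sketched with plain string handling. This is an illustration under assumptions: paths use forward slashes, the first occurrence of the top-level folder is the one that counts, and a path without that folder is returned unchanged.

```python
def swap_root(path, top_level, new_root="/~/"):
    """Replace everything before the top-level folder with new_root."""
    parts = path.split("/")
    if top_level not in parts:
        return path  # assumed behavior when the folder is absent
    index = parts.index(top_level)
    tail = "/".join(parts[index:])
    # Normalize so exactly one slash joins the new root and the tail.
    return new_root.rstrip("/") + "/" + tail
```

For example, a local path under ‘/Volumes/genomics’ becomes a path under ‘/~/genomics’, which is the form the description above associates with Globus transfers.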
