bripipetools core packages

Overview

“Core” packages are where most of the heavy lifting happens, and are called by application-level modules to perform various pipeline tasks. Packages are listed roughly in order of dependency hierarchy (i.e., packages listed first depend on subsequently listed packages).

Note

Intended for developers!

The documentation below is effectively a dump of all low-level packages, modules, classes, and methods that are used to run bripipetools. This amount of detail shouldn’t be needed for most users, but provides a starting point for those looking to understand or modify the code.


Package details

annotation package

Includes critical functionality for identifying, locating, and describing data and results at various points (e.g., data generation, computational processing) in the bioinformatics pipeline. Each “annotator” class, contained in its respective module, is responsible for collecting and/or updating information for a specific object in the GenLIMS database. When possible, details for an object are retrieved directly from the database; for new objects or objects with missing fields, information is compiled, parsed, and formatted (as needed) from files on the server.
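The retrieve-or-create behavior described above can be sketched roughly as follows. This is only an illustration: a plain dict stands in for a GenLIMS collection, and `init_object`, `update_missing`, the example IDs, and the field names are hypothetical rather than part of the bripipetools API.

```python
def init_object(collection, object_id, object_type):
    """Return a copy of the stored document if present; otherwise start a new one."""
    doc = collection.get(object_id)
    if doc is None:
        doc = {'_id': object_id, 'type': object_type}
    return dict(doc)


def update_missing(doc, new_fields):
    """Fill in only fields that are absent or None; keep existing values."""
    for key, value in new_fields.items():
        if doc.get(key) is None:
            doc[key] = value
    return doc


# Hypothetical stand-in for a database collection
samples = {
    'lib1234_C6VG0ANXX': {'_id': 'lib1234_C6VG0ANXX',
                          'type': 'sequenced library',
                          'run_id': None},
}

# Object found in the "database": only the missing field is filled in
existing = update_missing(
    init_object(samples, 'lib1234_C6VG0ANXX', 'sequenced library'),
    {'run_id': '150615_D00565_0087_AC6VG0ANXX'})

# Object not found: a new, minimal document is created
created = init_object(samples, 'lib5678_C6VG0ANXX', 'sequenced library')
```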

sequencedlibraries module

Classify / provide details for sequenced libraries (outputs of a flowcell sequencing run) and the associated raw data.

class bripipetools.annotation.sequencedlibs.SequencedLibraryAnnotator(path, library, project, run_id, db)[source]

Identifies, stores, and updates information about a sequenced library.

_get_raw_data()[source]

Locate and store details about raw data for sequenced library.

_init_sequencedlibrary()[source]

Try to retrieve data for the sequenced library from GenLIMS; if unsuccessful, create new SequencedLibrary object.

_update_sequencedlibrary()[source]

Add any missing fields to SequencedLibrary object.

get_sequenced_library()[source]

Return sequenced library object with updated fields.

flowcellruns module

Classify / provide details for objects generated from an Illumina sequencing run performed by the BRI Genomics Core.

class bripipetools.annotation.flowcellruns.FlowcellRunAnnotator(run_id, pipeline_root, db)[source]

Identifies, stores, and updates information about a flowcell run.

_init_flowcellrun()[source]

Try to retrieve data for the flowcell run from GenLIMS; if unsuccessful, create new FlowcellRun object.

_update_flowcellrun()[source]

Add any missing fields to FlowcellRun object.

get_flowcell_path()[source]

Find path to flowcell folder on the server.

get_flowcell_run()[source]

Return flowcell run object with updated fields.

get_libraries(project=None)[source]

Collect list of libraries for flowcell run from one or all projects.

get_library_gene_counts(project=None)[source]

Collect library gene count objects for flowcell run.

get_library_metrics(project=None)[source]

Collect library metrics objects for flowcell run.

get_processed_libraries(project=None, sub_path='inputFastqs')[source]

Collect list of processed libraries for flowcell run from one or all projects.

get_processed_projects()[source]

List processed projects for a flowcell run.

get_projects()[source]

Collect list of projects for flowcell run.

get_sequenced_libraries(project=None)[source]

Collect sequenced library objects for flowcell run.

get_unaligned_path()[source]

Find path to unaligned data in flowcell folder.

processedlibraries module

class bripipetools.annotation.processedlibs.ProcessedLibraryAnnotator(workflowbatch_id, params, db)[source]

Identifies, stores, and updates information about a processed library.

_append_processed_data()[source]

Add details and outputs for current workflow batch to processed data array field for processed library.

_get_outputs()[source]

Return the list of outputs from the processing workflow batch.

_get_seqlib_id()[source]

Return the ID of the parent sequenced library.

_group_outputs()[source]

Organize outputs according to type and source.

_init_processedlibrary()[source]

Try to retrieve data for the processed library from GenLIMS; if unsuccessful, create new ProcessedLibrary object.

_parse_output_name(output_name)[source]

Parse output name indicated by parameter tag in workflow batch submit file and return individual components indicating name, source, and type.

_update_processedlibrary()[source]

Add or update any missing fields in ProcessedLibrary object.

get_processed_library()[source]

Return updated ProcessedLibrary object.

workflowbatches module

Classify / provide details for objects generated from a Globus Galaxy workflow processing batch performed by the BRI Bioinformatics Core.

class bripipetools.annotation.workflowbatches.WorkflowBatchAnnotator(workflowbatch_file, pipeline_root, db, run_opts)[source]

Identifies, stores, and updates information about a workflow batch.

_check_sex(processedlibrary)[source]

Retrieve reported sex for sample and compare to predicted sex of processed library.

_init_workflowbatch()[source]

Try to retrieve data for the workflow batch from GenLIMS; if unsuccessful, create new GalaxyWorkflowBatch object.

_run_qc(processedlibrary)[source]

_update_workflowbatch()[source]

Add any missing fields to GalaxyWorkflowBatch object.

get_processed_libraries(project=None, qc=False)[source]

Collect processed library objects for workflow batch.

get_sequenced_libraries()[source]

Collect list of sequenced libraries processed as part of workflow batch.

get_workflow_batch()[source]

Return workflow batch object with updated fields.


qc package

Contains classes and methods for performing post-hoc quality control operations on raw or processed genomics data. Modules are organized according to the specific QC step performed. Unlike routine quality inspection metrics and information provided by standard bioinformatics tools through processing workflows, modules here are aimed more at identifying problems with sample handling or data generation. As such, outputs from these submodules are designated as a special type, ‘validation’, to distinguish them from the QC, metrics, counts, and other output types generated through processing.

sexcheck module

Class and methods to perform routine sex check on all processed libraries.

class bripipetools.qc.sexcheck.SexChecker(processedlibrary, reference, workflowbatch_id, pipeline_root, db, run_opts)[source]

Reads gene counts for a processed library, maps genes to X and Y chromosomes, computes the ratio of Y to X counts, and predicts sex based on a pre-defined rule.

_compute_x_y_data()[source]

Collect and store X and Y gene/count data as well as total counts for the current processed library.

_get_counts_path()[source]

Construct the absolute path to the counts file for the specified workflow batch.

_get_x_y_counts()[source]

Extract and store counts for X and Y genes; also store count total.

_load_x_genes(ref='grch38')[source]

Read X chromosome gene IDs from file and return data frame.

_load_y_genes(ref='grch38')[source]

Read Y chromosome gene IDs from file and return data frame.

_predict_sex()[source]

Return predicted sex based on X/Y gene equation and cutoff.

_verify_sex()[source]

Compare predicted sex to reported sex.

_write_data()[source]

Save the sex validation data to a new file.

update()[source]

Add predicted sex validation field to processed library outputs and return processed library object.

sexverify module

Class and methods to perform routine sex check on all processed libraries.

class bripipetools.qc.sexverify.SexVerifier(data, processedlibrary, db)[source]

Retrieves the reported sex for a sample and compares it to the predicted sex of a processed library.

_retrieve_sex(parent_id)[source]

Retrieve reported sex for sample.

verify()[source]

Compare reported sex for sample to predicted sex of processed library.

sexpredict module

Class and methods to perform routine sex check on all processed libraries.

class bripipetools.qc.sexpredict.SexPredictor(data, run_opts)[source]

Predicts sex based on X and Y gene count data using a pre-defined rule.

_compute_y_x_count_ratio()[source]

Calculate the ratio of Y counts to X counts.

_compute_y_x_gene_ratio()[source]

Calculate the ratio of Y genes detected to X genes detected, where detected = count > 0.

_predict_sex()[source]

Return predicted sex based on X/Y gene equation and cutoff.
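A rough sketch of the ratio-and-cutoff idea is shown below. The actual equation and cutoff used by SexPredictor are not given in this documentation, so the `cutoff` value, the gene names, and the returned labels are placeholders chosen purely for illustration.

```python
def y_x_gene_ratio(x_counts, y_counts):
    """Ratio of detected Y genes to detected X genes (detected = count > 0)."""
    x_detected = sum(1 for count in x_counts.values() if count > 0)
    y_detected = sum(1 for count in y_counts.values() if count > 0)
    if x_detected == 0:
        return None  # ratio undefined without any detected X genes
    return y_detected / x_detected


def y_x_count_ratio(x_counts, y_counts):
    """Ratio of total Y counts to total X counts."""
    x_total = sum(x_counts.values())
    if x_total == 0:
        return None
    return sum(y_counts.values()) / x_total


def predict_sex(ratio, cutoff=0.1):
    """Apply a simple threshold rule; the cutoff here is a placeholder."""
    if ratio is None:
        return 'NA'
    return 'male' if ratio > cutoff else 'female'


# Hypothetical gene-level counts for one processed library
x_counts = {'x_gene_1': 10, 'x_gene_2': 0, 'x_gene_3': 5}
y_counts = {'y_gene_1': 0, 'y_gene_2': 0}

gene_ratio = y_x_gene_ratio(x_counts, y_counts)
predicted = predict_sex(gene_ratio)
```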

predict()[source]

database package

Contains methods for interacting with - connecting to, retrieving data from, and inserting data into - BRI databases (GenLIMS and ResDB) at a low level. Under the hood, much of the functionality in this package relies on the pymongo client library for MongoDB. The database.operations module provides wrapper functions for getting/putting objects from/to commonly used database collections, while database.mapping helps to construct Python model class objects from database documents. Methods in the database.connection module manage the database connection, depending on environment and configurations.

connection module

Connect to a BRI Mongo database.

bripipetools.database.connection.connect(db_config_name)[source]

Check the current environment to determine which database parameters to use, then connect to the target database on the specified host.

Returns

A database connection object.

operations module

Basic operations for BRI Mongo databases.

bripipetools.database.operations.create_workflowbatch_id(db, prefix, date)[source]

Check the ‘workflowbatches’ collection and construct ID with lowest available batch number (i.e., ‘<prefix>_<date>_<number>’).

Parameters
  • db (type[pymongo.database.Database]) – database object for current MongoDB connection

  • prefix (str) – base string for workflow batch ID, based on workflow batch type (e.g., ‘globusgalaxy’ for Globus Galaxy workflow)

  • date (type[datetime.datetime]) – date on which workflow batch was run

Return type

str

Returns

a unique ID for the workflow batch, with the prefix and date combination appended with the highest available integer
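As an illustration of the ‘<prefix>_<date>_<number>’ scheme, the sketch below builds an ID from a list of existing IDs rather than a live MongoDB collection. The ISO date format and the "lowest unused positive integer" rule are assumptions made for the example; `next_batch_id` is a hypothetical helper, not the library function.

```python
import datetime


def next_batch_id(existing_ids, prefix, date):
    """Build '<prefix>_<date>_<number>' using the lowest unused number."""
    # Assumed date format: ISO 8601 (YYYY-MM-DD)
    base = '{}_{}'.format(prefix, date.strftime('%Y-%m-%d'))
    used = set()
    for batch_id in existing_ids:
        head, _, number = batch_id.rpartition('_')
        if head == base and number.isdigit():
            used.add(int(number))
    number = 1
    while number in used:
        number += 1
    return '{}_{}'.format(base, number)


# Hypothetical IDs already present in the 'workflowbatches' collection
existing = ['globusgalaxy_2016-09-29_1',
            'globusgalaxy_2016-09-29_2',
            'globusgalaxy_2016-09-28_1']
new_id = next_batch_id(existing, 'globusgalaxy',
                       datetime.datetime(2016, 9, 29))
```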

bripipetools.database.operations.find_objects(collection)[source]

Return a decorator that retrieves objects from the specified collection, given a db connection and query.

Parameters

collection (str) – String indicating the name of the collection
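The decorator-factory pattern described here might look something like the sketch below. It is a stand-in only: `db` is assumed to be a dict mapping collection names to lists of documents (not a pymongo database), the query is matched by simple field equality, and `find_in_collection` and `get_samples` are hypothetical names.

```python
import functools


def find_in_collection(collection):
    """Build a decorator bound to one collection name."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(db, query):
            # Return every document whose fields equal all query values
            return [doc for doc in db.get(collection, [])
                    if all(doc.get(key) == value
                           for key, value in query.items())]
        return wrapper
    return decorator


@find_in_collection('samples')
def get_samples(db, query):
    """Return list of documents from 'samples' matching the query."""


# Hypothetical in-memory "database"
db = {'samples': [{'_id': 'lib1', 'type': 'library'},
                  {'_id': 'lib1_run1', 'type': 'sequenced library'}]}
libraries = get_samples(db, {'type': 'library'})
```

Binding the collection name in the outer function lets each `get_*` wrapper stay a one-line definition that differs only in the collection it targets.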

bripipetools.database.operations.get_genomicsCounts(db, query)[source]

Return list of documents from ‘genomicsCounts’ collection based on query.

bripipetools.database.operations.get_genomicsMetrics(db, query)[source]

Return list of documents from ‘genomicsMetrics’ collection based on query.

bripipetools.database.operations.get_genomicsRuns(db, query)[source]

Return list of documents from ‘genomicsRuns’ collection based on query.

bripipetools.database.operations.get_genomicsSamples(db, query)[source]

Return list of documents from ‘genomicsSamples’ collection based on query.

bripipetools.database.operations.get_genomicsWorkflowbatches(db, query)[source]

Return list of documents from ‘genomicsWorkflowbatches’ collection based on query.

bripipetools.database.operations.insert_objects(collection)[source]

Return a decorator that inserts one or more objects into the specified collection; if an object exists, updates any individual fields that are not empty in the input object.

Parameters

collection (str) – string indicating the name of the collection

bripipetools.database.operations.put_genomicsCounts(db, counts)[source]

Insert each document in list into ‘genomicsCounts’ collection.

bripipetools.database.operations.put_genomicsMetrics(db, metrics)[source]

Insert each document in list into ‘genomicsMetrics’ collection.

bripipetools.database.operations.put_genomicsRuns(db, runs)[source]

Insert each document in list into ‘genomicsRuns’ collection.

bripipetools.database.operations.put_genomicsSamples(db, samples)[source]

Insert each document in list into ‘genomicsSamples’ collection.

bripipetools.database.operations.put_genomicsWorkflowbatches(db, workflowbatches)[source]

Insert each document in list into ‘genomicsWorkflowbatches’ collection.

bripipetools.database.operations.search_ancestors(db, sample_id, field)[source]

Given an object in the ‘samples’ collection, specified by the input ID, iteratively walk through ancestors based on ‘parentId’ until a value is found for the requested field.

Parameters
  • db (type[pymongo.database.Database]) – database object for current MongoDB connection

  • sample_id (str) – a unique ID for a sample in GenLIMS

  • field (str) – the field for which to search among ancestor samples

Returns

value for field, if found
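The ancestor walk can be sketched as a simple loop over ‘parentId’ links. In this illustration a dict keyed by sample ID replaces the ‘samples’ collection, the starting sample itself is checked first (an assumption), and the sample IDs and the ‘reportedSex’ field are made up for the example.

```python
def find_in_ancestors(samples, sample_id, field):
    """Follow 'parentId' links until a value for field is found."""
    seen = set()
    current_id = sample_id
    while current_id is not None and current_id not in seen:
        seen.add(current_id)  # guard against circular parent links
        doc = samples.get(current_id)
        if doc is None:
            return None
        value = doc.get(field)
        if value is not None:
            return value
        current_id = doc.get('parentId')
    return None


# Hypothetical lineage: library -> RNA sample -> patient sample
samples = {
    'lib1': {'_id': 'lib1', 'parentId': 'rna1'},
    'rna1': {'_id': 'rna1', 'parentId': 'pt1'},
    'pt1': {'_id': 'pt1', 'reportedSex': 'female'},
}
sex = find_in_ancestors(samples, 'lib1', 'reportedSex')
```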

mapping module

bripipetools mapping submodule: methods to map from Mongo documents to model classes.

bripipetools.database.mapping.get_model_class(doc)[source]

Find the matching class for the document, based on its type.

Parameters

doc (dict) – a dict representing a MongoDB document/object

Return type

str

Returns

a string representing the name of the matched class from the model module

bripipetools.database.mapping.map_keys(obj)[source]

Convert keys in a dictionary (or nested dictionary) from camelCase to snake_case; ignore ‘_id’ keys.

Parameters

obj (dict, list) – a dict or list of dicts with string keys to be converted

Return type

dict, list

Returns

a dict or list of dicts with string keys converted from camelCase to snake_case
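A minimal sketch of recursive key conversion is shown below, assuming that nested dicts and lists of dicts are both traversed. The functions are illustrative re-implementations (using a common regular-expression approach for camelCase to snake_case), not the library source, and the example document is made up.

```python
import re


def to_snake_case(camel_str):
    """Convert a camelCase string to snake_case."""
    partial = re.sub(r'(.)([A-Z][a-z]+)', r'\1_\2', camel_str)
    return re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', partial).lower()


def map_keys_sketch(obj):
    """Recursively convert dict keys to snake_case, leaving '_id' untouched."""
    if isinstance(obj, list):
        return [map_keys_sketch(item) for item in obj]
    if isinstance(obj, dict):
        return {(key if key == '_id' else to_snake_case(key)):
                map_keys_sketch(value)
                for key, value in obj.items()}
    return obj


# Hypothetical MongoDB-style document
doc = {'_id': 'lib1', 'parentId': 'rna1',
       'rawData': [{'laneId': 'L001'}]}
mapped = map_keys_sketch(doc)
```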

bripipetools.database.mapping.map_to_object(doc)[source]

Convert document to model class of appropriate type.

Parameters

doc (dict) – a dict representing a MongoDB document/object

Return type

type[docs.TG3Object]

Returns

a new instance of the matched model class


model package

Establishes the underlying data model linking data from bioinformatics processing pipelines to the GenLIMS/TG3 database. Python class representations of database objects (documents) are defined in the model.documents module. These classes include some basic functionality, mostly related to setting/formatting attributes, which are eventually fed back into the database as key-value pairs. However, model classes are also the basic “currency” for several other modules, where they are used to retrieve, modify, store, and return data.

Depends on the util and parsing modules.

documents module

Classes representing documents in the GenLIMS database.

class bripipetools.model.documents.FlowcellRun(**kwargs)[source]

GenLIMS object in the ‘runs’ collection of type ‘flowcell’.

_flowcell_path = None
property flowcell_path

Return root-agnostic path to flowcell data folder.

class bripipetools.model.documents.GalaxyWorkflowBatch(workflowbatch_file=None, **kwargs)[source]

GenLIMS object in ‘workflow batches’ collection of type ‘Galaxy workflow’

Parameters

workflowbatch_file (str) – path to file describing samples and parameters of Globus Galaxy workflow batch

class bripipetools.model.documents.GeneCounts(**kwargs)[source]

Research Database object in ‘counts’ collection of type ‘gene counts’

property gene_counts

Return list of dictionaries with information about each library’s gene counts.

class bripipetools.model.documents.GenericRun(protocol_id=None, date=None, **kwargs)[source]

GenLIMS object in the ‘runs’ collection

Parameters
  • protocol_id (str) – unique ID of a protocol object in the GenLIMS database (in the ‘protocols’ collection)

  • date (str) – string indicating date of the run in ISO 8601 format

class bripipetools.model.documents.GenericSample(project_id=None, subproject_id=None, protocol_id=None, parent_id=None, **kwargs)[source]

GenLIMS object in the ‘samples’ collection

Parameters
  • project_id (int) – Genomics Core project number

  • subproject_id (int) – Genomics Core sub-project number

  • protocol_id (str) – unique ID of a protocol object in the GenLIMS database (in the ‘protocols’ collection)

  • parent_id (str) – unique ID of a sample object in the GenLIMS database from which the current sample was derived (in the ‘samples’ collection)

class bripipetools.model.documents.GenericWorkflow(**kwargs)[source]

GenLIMS object in the ‘workflows’ collection

class bripipetools.model.documents.GenericWorkflowBatch(**kwargs)[source]

GenLIMS object in the ‘workflow batches’ collection

class bripipetools.model.documents.GlobusGalaxyWorkflow(**kwargs)[source]

GenLIMS object in ‘workflows’ collection of type ‘Globus Galaxy workflow’

class bripipetools.model.documents.Library(**kwargs)[source]

GenLIMS object in ‘samples’ collection of type ‘library’

class bripipetools.model.documents.Metrics(**kwargs)[source]

Research Database object in ‘metrics’ collection of type ‘metrics’

property metrics

Return list of dictionaries with information about each library’s metrics.

class bripipetools.model.documents.ProcessedLibrary(**kwargs)[source]

GenLIMS object in ‘samples’ collection of type ‘processed library’

property processed_data

Return list of dictionaries with information about each set of data processing outputs (i.e., from workflow batches).

class bripipetools.model.documents.SequencedLibrary(run_id=None, **kwargs)[source]

GenLIMS object in ‘samples’ collection of type ‘sequenced library’

Parameters

run_id (str) – unique ID of a run object in the GenLIMS database

property raw_data

Return list of dictionaries with information about each raw data file (i.e., FASTQ) for a sequenced library.

class bripipetools.model.documents.TG3Object(_id=None, type=None, is_mapped=False)[source]

Generic functions for objects in TG3 collections.

Parameters
  • _id (str) – unique object identifier in the GenLIMS/TG3 Mongo database

  • type (str) – field indicating object type in a collection

  • is_mapped (bool) – flag indicating whether class instance was mapped from a database object (True) or created from scratch (False)

to_json()[source]

Return object attributes as dictionary with keys formatted as camel case.

Return type

dict

Returns

a dict containing class instance attributes, with all field names converted from snake case to camel case

update_attrs(attr_map, force=False)[source]

Given a dictionary of key-value pairs for attribute names with new values, update each attribute. Always update empty (‘None’) attributes and set any new attributes; update all modified attributes if force option is ‘True’.

Parameters
  • attr_map (dict) – a dict with key-value pairs representing object attributes and values to which they should be set

  • force (bool) – force overwrite of object fields in database, if they already exist
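The update rule described for update_attrs can be illustrated with a small stand-in class. `Document` and its attributes are hypothetical; the sketch only mirrors the stated behavior (empty and new attributes are always set, existing values change only when force is True).

```python
class Document(object):
    """Hypothetical stand-in for a model class with settable attributes."""

    def __init__(self, **kwargs):
        for attr, value in kwargs.items():
            setattr(self, attr, value)

    def update_attrs(self, attr_map, force=False):
        """Set new or empty attributes; overwrite existing ones only if forced."""
        for attr, value in attr_map.items():
            current = getattr(self, attr, None)
            if current is None or force:
                setattr(self, attr, value)


doc = Document(_id='lib1', type=None, run_id='run_a')
doc.update_attrs({'type': 'library', 'run_id': 'run_b', 'project_id': 14})
unforced_run_id = doc.run_id  # existing value is kept

doc.update_attrs({'run_id': 'run_b'}, force=True)
forced_run_id = doc.run_id  # existing value is overwritten
```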

bripipetools.model.documents.convert_keys(obj)[source]

Convert keys in a dictionary (or nested dictionary) from snake_case to camelCase; ignore ‘_id’ keys.

Parameters

obj (dict, list) – A dict or list of dicts with string keys to be converted.

Return type

dict, list

Returns

A dict or list of dicts with string keys converted from snake_case to camelCase.


io package

Contains class representations of various file types produced through the generation or processing of genomics data. In particular, most of these classes provide methods for reading and parsing raw data from files and storing/returning these data in a more usable format, such as dictionaries or data frames. Each module contains the representation of a file generated by a particular tool or routine; some submodules may handle files from multiple methods within a tool (e.g., Picard). While not explicitly organized as such, modules adhere to a hierarchy based on the “type” of file, where current types include metrics, counts, QC, and validation.

workflow module

Class for reading and parsing Galaxy workflow files.

class bripipetools.io.workflow.WorkflowFile(path)[source]

Parser for exported workflow descriptions from Galaxy, stored in a JSON-like format with extension .ga.

_read_file()[source]

Read file into dictionary.

get_tool_info()[source]

Retrieve tools and versions from a workflow as a dictionary

get_workflow_name()[source]

Retrieve the workflow name

parse()[source]

Parse workflow file and return dictionary.

workflowbatch module

Classes for reading, parsing, and writing workflow batch submit files for Globus Galaxy.

class bripipetools.io.workflowbatch.WorkflowBatchFile(path, state='template')[source]

A parser to map input sample names to expected output files based on a completed Globus Galaxy batch submit file.

Parameters
  • path (str) – File path of batch submit file.

  • state (str) – String indicating the current state of the batch submit file; either template or submit (if populated with project and sample information).

_locate_batch_name_line()[source]

Identify batch file metadata line with place-holder for project name; return line number. Note: batch submissions can include multiple projects, so the ‘batch name’ label is more appropriate.

_locate_param_line()[source]

Identify batch file header line with parameter names; return line number.

_locate_sample_start_line()[source]

Identify batch file line where sample parameter info begins; return line number. Note: should immediately follow parameter header line.

_locate_workflow_name_line()[source]

Identify batch file metadata line with name of workflow; return line number.

_read_file()[source]

Read and store lines from batch submit file.

get_batch_name()[source]

Return name of workflow batch for batch submit file.

get_params()[source]

Return the parameters defined for the current workflow.

Return type

list

Returns

A list of tuples with number (index) and dict with details for each parameter.

get_sample_params(sample_line)[source]

Collect the parameter details for each input sample; store the index and input for each parameter.

Parameters

sample_line (str) – Raw, tab-delimited line of text from workflow batch submit file describing the parameters for a single sample.

Return type

list

Returns

A list of dicts, one for each sample.

get_workflow_name()[source]

Return name of workflow for batch submit file.

parse()[source]

Parse workflow batch file and return dict.

update_batch_name(batch_name)[source]

Update name of workflow batch and insert in template lines.

write(path, batch_name=None, sample_lines=None)[source]

Write workflow batch data to file.

picardmetrics module

Class for reading and parsing Picard metrics files.

class bripipetools.io.picardmetrics.PicardMetricsFile(path)[source]

Parser to read tables of metrics generated by one of several Picard tools, typically stored in an HTML file, and return as a parsed and formatted dictionary.

_check_table_format()[source]

Check whether table is long (keys in one column, values in the other) or wide (keys in one row, values in the other).

_get_table()[source]

Extract metrics table from raw HTML string.

_parse_long()[source]

Parse long-formatted table to dictionary.

_parse_wide()[source]

Parse wide-formatted table to dictionary.

_read_file()[source]

Read file into raw HTML string.

parse()[source]

Parse metrics table and return dictionary.

htseqmetrics module

Class for reading and parsing htseq-count metrics files.

class bripipetools.io.htseqmetrics.HtseqMetricsFile(path)[source]

Parser to read tables of metrics generated by the htseq-count tool, stored in a tab-delimited text file.

_parse_lines()[source]

Get key-value pairs from text lines and return dictionary.

_read_file()[source]

Read file into list of raw strings.

parse()[source]

Parse metrics table and return dictionary.

tophatstats module

Class for reading and parsing Tophat Stats metrics files.

class bripipetools.io.tophatstats.TophatStatsFile(path)[source]

Parser to read tables of metrics generated by custom Tophat Stats PE tool, stored in a tab-delimited text file.

_parse_lines()[source]

Get key-value pairs from text lines and return dictionary.

_read_file()[source]

Read file into list of raw strings.

parse()[source]

Parse metrics table and return dictionary.

fastqc module

Class for reading and parsing FastQC report files.

class bripipetools.io.fastqc.FastQCFile(path)[source]

Parser to read QC data from a FastQC report, stored in a tab-delimited text file.

_clean_header(header)[source]

Extract section header from header line, convert to snake case.

_clean_value(value)[source]

Convert to numeric unless value contains text.

_get_section_status(section_name, section_info)[source]

Return a tuple with the section name and status.

_locate_sections()[source]

Return a dict with section names as keys and tuples of start/end line numbers as values.
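For illustration, section boundaries in a FastQC data file can be located by scanning for the ‘>>’ module markers, roughly as sketched below. The function name, the snake_case cleaning of section names, and the example lines are assumptions; only the ‘>>Section name’ and ‘>>END_MODULE’ markers come from the FastQC text report format.

```python
def locate_sections_sketch(lines):
    """Map snake_case section names to (start, end) line numbers."""
    sections = {}
    name, start = None, None
    for number, line in enumerate(lines):
        if line.startswith('>>END_MODULE'):
            if name is not None:
                sections[name] = (start, number)
            name = None
        elif line.startswith('>>'):
            # Header looks like '>>Basic Statistics<TAB>pass'
            header = line[2:].split('\t')[0]
            name = header.strip().lower().replace(' ', '_')
            start = number
    return sections


# Hypothetical excerpt of a FastQC data file
lines = ['##FastQC\t0.11.3',
         '>>Basic Statistics\tpass',
         '#Measure\tValue',
         'Total Sequences\t1000',
         '>>END_MODULE',
         '>>Per base sequence quality\tfail',
         '#Base\tMean',
         '1\t32.0',
         '>>END_MODULE']
sections = locate_sections_sketch(lines)
```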

_parse_section_table(section_info)[source]

For the specified section lines, parse tab-delimited columns into key-value pairs and return list of tuples.

_read_file()[source]

Read file into list of raw strings.

parse()[source]

Parse file and return key-value pairs as dictionary.

parse_overrepresented_seqs()[source]

Parse table of overrepresented sequences, return as list of dictionaries.

htseqcounts module

Class for reading and parsing htseq files.

class bripipetools.io.htseqcounts.HtseqCountsFile(path)[source]

Parser to read tables of counts generated by the htseq-count tool, stored in a tab-delimited text file.

_read_file()[source]

Read file into Pandas data frame.

parse()[source]

Parse counts file and return data frame.

sexcheck module

Class for reading and parsing sex check validation files.

class bripipetools.io.sexcheck.SexcheckFile(path)[source]

Parser to read sex check validation data, stored in a text file.

_parse_lines()[source]

Get key-value pairs from text lines and return dictionary.

_read_file()[source]

Read file into list of raw strings.

parse()[source]

Parse metrics table and return dictionary.


parsing package

Slightly more specialized than methods in the util.strings module, provides functions for parsing and extracting information from strings that follow some expected nomenclature. The primary examples of this information are IDs, names, labels, and other metadata for files and objects generated either by Illumina technology or the BRI Genomics Core (via GenLIMS). The parsing.processing module is also designed to handle specialized strings and labels related to processing workflows in Globus Galaxy.

Depends on the util module.

gencore module

bripipetools.parsing.gencore.get_library_id(string)[source]

Return library ID matched in input string.

Parameters

string (str) – any string that might contain a library ID of the format ‘lib1234’

Return type

str

Returns

the matching substring representing the library ID or an empty string (‘’) if no match found

bripipetools.parsing.gencore.get_project_label(string)[source]

Return a Genomics Core project label matched in input string.

Parameters

string (str) – any string

Return type

str

Returns

Genomics Core project label (e.g., P00-0) substring or empty string, if no match found

bripipetools.parsing.gencore.get_sample_id(string)[source]

More general than library ID; returns either the library ID (if present) or any word that starts with ‘Sample_’, ends in a number, and precedes any non-alphanumeric characters.

Parameters

string (str) – any string that might contain a form of sample ID

Return type

str

Returns

the matching substring representing the sample ID or an empty string (‘’) if no match found

bripipetools.parsing.gencore.parse_batch_file_path(batchfile_path)[source]

Return ‘genomics’ root and batch file name based on directory path.

bripipetools.parsing.gencore.parse_flowcell_path(flowcell_path)[source]

Return ‘genomics’ root and run ID based on directory path.

bripipetools.parsing.gencore.parse_project_label(project_label)[source]

Parse a Genomics Core project label (e.g., P00-0) and return individual components indicating project ID and subproject ID.

Parameters

project_label (str) – String following Genomics Core convention for project labels, P<project ID>-<subproject ID>

Return type

dict

Returns

a dict with fields for ‘project_id’ and ‘subproject_id’
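A small sketch of label parsing under the P<project ID>-<subproject ID> convention follows. Returning integer IDs and None values for a non-matching label are assumptions made for the example; the function is an illustrative stand-in, not the library source.

```python
import re


def parse_project_label_sketch(project_label):
    """Split a label such as 'P14-12' into project and subproject IDs."""
    match = re.search(r'P(\d+)-(\d+)', project_label)
    if match is None:
        return {'project_id': None, 'subproject_id': None}
    return {'project_id': int(match.group(1)),
            'subproject_id': int(match.group(2))}


label_parts = parse_project_label_sketch('P14-12')
```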

illumina module

bripipetools.parsing.illumina.get_flowcell_id(string)[source]

Return flowcell ID.

Parameters

string (str) – any string that might contain an Illumina flowcell ID (e.g., C6VG0ANXX)

Return type

str

Returns

the matching substring representing the flowcell ID or an empty string (‘’) if no match found

bripipetools.parsing.illumina.parse_fastq_filename(path)[source]

Parse standard Illumina FASTQ filename and return individual components indicating generic path, lane ID, read ID, and sample number.

Parameters

path (str) – full path to FASTQ file with filename adhering to standard Illumina format (e.g., ‘1D-HC29-C04_S27_L001_R1_001.fastq.gz’)

Return type

dict

Returns

a dict with fields for ‘path’ (with root removed), ‘lane_id’, ‘read_id’, and ‘sample_number’
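The filename components can be pulled out with a single regular expression, as in the sketch below. It handles only the lane, read, and sample number fields (root removal for ‘path’ is omitted), treats the sample number as an integer, and uses a made-up directory; all of these are assumptions for illustration.

```python
import os
import re


def parse_fastq_name_sketch(path):
    """Extract sample number, lane ID, and read ID from an Illumina FASTQ name."""
    filename = os.path.basename(path)
    match = re.search(
        r'_S(\d+)_(L\d{3})_([RI]\d)_\d{3}\.fastq(?:\.gz)?$', filename)
    if match is None:
        return {}
    return {'sample_number': int(match.group(1)),
            'lane_id': match.group(2),
            'read_id': match.group(3)}


# Directory is hypothetical; the filename is the example given above
fastq_parts = parse_fastq_name_sketch(
    '/data/Unaligned/1D-HC29-C04_S27_L001_R1_001.fastq.gz')
```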

bripipetools.parsing.illumina.parse_flowcell_run_id(run_id)[source]

Parse Illumina flowcell run ID (or folder name) and return individual components indicating date, instrument ID, run number, flowcell ID, and flowcell position.

Parameters

run_id (str) – string adhering to standard Illumina format (e.g., ‘150615_D00565_0087_AC6VG0ANXX’) for a sequencing run

Return type

dict

Returns

a dict with fields for ‘date’, ‘instrument_id’, ‘run_number’, ‘flowcell_id’, and ‘flowcell_position’
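Using the example run ID above, the components can be separated as sketched here. The ISO-formatted date string, the integer run number, and the reading of the first character of the last field as the flowcell position are assumptions for illustration, not confirmed details of the library function.

```python
import datetime


def parse_run_id_sketch(run_id):
    """Split '<date>_<instrument>_<run number>_<position><flowcell ID>'."""
    date_str, instrument_id, run_number, flowcell = run_id.split('_')
    date = datetime.datetime.strptime(date_str, '%y%m%d').date()
    return {'date': date.isoformat(),
            'instrument_id': instrument_id,
            'run_number': int(run_number),
            'flowcell_id': flowcell[1:],
            'flowcell_position': flowcell[0]}


run_parts = parse_run_id_sketch('150615_D00565_0087_AC6VG0ANXX')
```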

processing module

bripipetools.parsing.processing.parse_batch_name(batch_name)[source]

Parse batch name indicated in workflow batch submit file and return individual components indicating date, list of project labels, and flowcell ID.

bripipetools.parsing.processing.parse_output_filename(output_path)[source]

Parse output name indicated by parameter tag in output file and return individual components indicating processed library ID, output source, and type.

bripipetools.parsing.processing.parse_output_name(output_name)[source]

bripipetools.parsing.processing.parse_run_id_for_batch(batch_file)[source]

Parse the run id (YYMMDD_D00565_####_FCID) from a path to a batch file.

bripipetools.parsing.processing.parse_workflow_param(param)[source]

Parse workflow parameter into components indicating tag, type, and name.


util module

Includes convenience methods related to handling and manipulating strings (util.strings), file paths (util.files), as well as user interactions via the command line (util.ui). Methods are used throughout other packages to streamline common operations.

strings submodule

bripipetools.util.strings.matchdefault(pattern, string, default='')[source]

Search for pattern in string and return default string if no match

Parameters
  • pattern (str) – non-compiled regular expression to search for in input string

  • string (str) – any string

  • default (str) – string to return if no match found

Return type

str

Returns

substring matched to regular expression or default string, if no match found

bripipetools.util.strings.matchlastdefault(pattern, string, default='')[source]

Search for pattern in string from right, return default string if no match

Parameters
  • pattern (str) – non-compiled regular expression to search for in input string

  • string (str) – any string

  • default (str) – string to return if no match found

Return type

str

Returns

rightmost substring matched to regular expression or default string, if no match found
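The two matching helpers can be approximated with the standard `re` module, as in the sketch below. These are stand-in re-implementations for illustration (the library source may differ), and the example strings are made up.

```python
import re


def matchdefault_sketch(pattern, string, default=''):
    """Return the first match of pattern in string, or default."""
    match = re.search(pattern, string)
    return match.group(0) if match else default


def matchlastdefault_sketch(pattern, string, default=''):
    """Return the rightmost non-overlapping match, or default."""
    matches = [m.group(0) for m in re.finditer(pattern, string)]
    return matches[-1] if matches else default


first_lib = matchdefault_sketch(r'lib\d+', 'lib1234_C6VG0ANXX')
no_lib = matchdefault_sketch(r'lib\d+', 'no id here')
last_label = matchlastdefault_sketch(r'P\d+-\d+', 'P14-12_and_P109-1')
```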

bripipetools.util.strings.to_camel_case(snake_str)[source]

Convert snake_case string to camelCase

Parameters

snake_str (str) – a string in snake_case format

Return type

str

Returns

input string converted to camelCase format
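One common way to do this conversion is sketched below; it is an illustrative stand-in rather than the library source.

```python
def to_camel_case_sketch(snake_str):
    """Convert snake_case to camelCase by capitalizing all but the first word."""
    first, *rest = snake_str.split('_')
    return first + ''.join(word.capitalize() for word in rest)


camel = to_camel_case_sketch('flowcell_run_id')
```

With this approach, a key with a leading underscore such as ‘_id’ would come out as ‘Id’, which suggests why the key-conversion functions skip ‘_id’.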

bripipetools.util.strings.to_snake_case(camel_str)[source]

Convert camelCase to snake_case. Adapted from: http://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-snake-case

Parameters

camel_str (str) – a string in camelCase format

Return type

str

Returns

input string converted to snake_case format

files submodule

bripipetools.util.files.locate_root_folder(top_level, max_depth=3)[source]

Find the root of a file path preceding a specified ‘top level’ directory.

Parameters
  • top_level (str) – Nominal ‘top level’ directory immediately following root (e.g., ‘genomics’ in ‘/Volumes/genomics’); should be a relatively unique folder name, at least within the specified depth.

  • max_depth (int) – How many directory levels down from the true system root to search for top_level folder.

Return type

str

Returns

A string representing the part of the file path starting from the current system root up to (but not including) the top_level folder.

bripipetools.util.files.swap_root(path, top_level, new_root='/~/')[source]

Replace section of file path preceding a specified ‘top level’ directory with a different string (mostly for use with Globus transfers).

Parameters
  • path (str) – Any system file path.

  • top_level (str) – Nominal ‘top level’ directory to immediately follow new root (e.g., ‘genomics’ in ‘/Volumes/genomics’).

  • new_root (str) – String specifying the new root of the file path.

Return type

str

Returns

modified path with new root
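Root swapping can be sketched with plain string operations, as below. This is a stand-in for illustration: it assumes POSIX-style paths, returns the path unchanged when the top-level folder is absent (an assumption), and uses a made-up example path.

```python
def swap_root_sketch(path, top_level, new_root='/~/'):
    """Replace everything before the top_level folder with new_root."""
    parts = path.split('/')
    if top_level not in parts:
        return path  # assumed behavior when top_level is not in the path
    index = parts.index(top_level)
    return new_root.rstrip('/') + '/' + '/'.join(parts[index:])


# Hypothetical server path
swapped = swap_root_sketch('/Volumes/genomics/Illumina/run1', 'genomics')
```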