Post-processing Data¶
Postprocessing nominally entails any operations performed on data after it has left Galaxy (or more generally, outside the scope of a bioinformatics processing workflow).
QC & validation¶
QC here is geared less toward intrinsic qualities of the data and more toward leveraging what is known about the samples (i.e., through the BRI GenLIMS or Sample Repository databases) to identify potential sample swaps or other problematic mix-ups.
Sex check validation¶
The bripipetools.qc.sexcheck modules inspect X and Y chromosome gene counts for each processed library, provide a predicted sex value, compare this prediction to any reported sex value that exists in the database, and report the status.
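To make the flow concrete, here is a minimal sketch of that predict-compare-report sequence. It is not the actual bripipetools.qc.sexcheck code: the function names, the Y/X ratio rule, the 0.05 cutoff, and the output field names are all assumptions chosen for illustration.

```python
def predict_sex(x_count, y_count, y_x_ratio_cutoff=0.05):
    """Predict sex from total X and Y gene counts.

    The ratio rule and the 0.05 cutoff are hypothetical, not the
    values used by bripipetools.
    """
    if x_count <= 0:
        return "NA"  # no X counts, so no basis for a prediction
    return "male" if (y_count / x_count) > y_x_ratio_cutoff else "female"


def check_sex(predicted, reported):
    """Compare the prediction with the reported value, if one exists."""
    if predicted == "NA" or not reported:
        return "NA"
    return "pass" if predicted == reported.strip().lower() else "fail"


def sexcheck(x_count, y_count, reported=None):
    """Bundle prediction, reported value, and status for one library."""
    predicted = predict_sex(x_count, y_count)
    return {
        "predicted_sex": predicted,
        "reported_sex": reported,
        "sexcheck_pass": check_sex(predicted, reported),
    }
```

A library with no reported sex in the database yields a status of "NA" rather than a failure, which keeps missing metadata from being flagged as a swap.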
SNP fingerprinting¶
[PENDING] These modules would compare SNP profiles from either 1) a panel of common, highly discriminative SNPs, 2) all mitochondrial SNPs, or possibly 3) all common SNPs to confirm that libraries generated from the same subject match as expected.
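Since these modules are still pending, the following is only a rough sketch of what such a comparison could look like, assuming each library's SNP profile is available as a dictionary mapping a site identifier to a genotype call. The function names, the profile representation, and the 0.9 concordance threshold are all hypothetical.

```python
def snp_concordance(profile_a, profile_b):
    """Fraction of shared SNP sites with identical genotype calls.

    Returns None when the two profiles share no sites.
    """
    shared = set(profile_a) & set(profile_b)
    if not shared:
        return None
    matches = sum(1 for site in shared if profile_a[site] == profile_b[site])
    return matches / len(shared)


def same_subject(profile_a, profile_b, min_concordance=0.9):
    """Guess whether two libraries come from the same subject.

    The 0.9 threshold is an illustrative assumption.
    """
    score = snp_concordance(profile_a, profile_b)
    return score is not None and score >= min_concordance
```

Only sites called in both libraries are compared, so libraries with different coverage can still be matched on their overlap.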
File naming & organization¶
postprocessing package¶
Modules in the postprocessing package are used to parse, combine, rename, and organize files as needed such that they fit the desired layout.
Current (Globus)¶
No more crazy nesting or zipping, and files are labeled more explicitly. This organization also includes combined outputs across all samples for QC and validation data; furthermore, combined QC, metrics, and validation data are merged into a single table at the project level labeled combined_summary_data.
This particular organization (including the addition of parsed and combined QC and validation data) entered production with the 161014_D00565_0133_AC9E23ANXX flowcell (the handful prior were similar in terms of nesting and labeling, but did not include the full assortment of files).
Note: the bripipetools.postprocessing.cleanup module is designed to convert organization and naming from older schemes into the current structure, prior to other postprocessing steps (stitching, compiling). Such cleanup may be unnecessary if output files are parsed and imported directly into GenLIMS.
Project_P123-10Processed_globus_161017/
├── P123-10_C9E23ANXX_161017_combined_summary_data.csv
├── QC
│ ├── P123-10_C9E23ANXX_161017_combined_overrep_seqs.csv
│ ├── P123-10_C9E23ANXX_161017_combined_qc.csv
│ ├── lib13364_C9E23ANXX_fastqc_qc.html
│ └── lib13364_C9E23ANXX_fastqc_qc.txt
├── Trinity
│ ├── Trinity_combined1.fa
│ ├── Trinity_combined2.fa
│ └── lib13364_C9E23ANXX_trinity.fasta
├── alignments
│ ├── lib13364_C9E23ANXX_tophat_alignments.bam
│ └── lib13364_C9E23ANXX_tophat_alignments.bam.bai
├── counts
│ ├── P123-10_C9E23ANXX_161017_combined_counts.csv
│ └── lib13364_C9E23ANXX_htseq_counts.txt
├── log
│ └── lib13364_C9E23ANXX_workflow_log.txt
├── metrics
│ ├── P123-10_C9E23ANXX_161017_combined_metrics.csv
│ ├── P123-10_geneModelCoverage.pdf
│ ├── lib13364_C9E23ANXX_htseq_metrics.txt
│ ├── lib13364_C9E23ANXX_picard_align_metrics.html
│ ├── lib13364_C9E23ANXX_picard_markdups_metrics.html
│ ├── lib13364_C9E23ANXX_picard_rnaseq_metrics.html
│ └── lib13364_C9E23ANXX_tophat_stats_metrics.txt
├── mixcrOutput_trinity
│ ├── lib13364_mixcrAlign.vdjca
│ ├── lib13364_mixcrAlignPretty.txt
│ ├── lib13364_mixcrAssemble.clns
│ ├── lib13364_mixcrClns.txt
│ └── lib13364_mixcrReport.txt
├── trimmed
│ └── lib13364_C9E23ANXX_trimmed.fastq
└── validation
  ├── P123-10_C9E23ANXX_161017_combined_validation.csv
  └── lib13364_C9E23ANXX_sexcheck_validation.csv
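The explicit labels in this layout make file names easy to parse. As a sketch (not part of bripipetools), the patterns below are inferred only from the example names in the tree above: per-library files look like `<library>_<flowcell>_<source>_<type>.<ext>` and combined files like `<project>_<flowcell>_<date>_combined_<type>.<ext>`. The function name and the returned field names are assumptions.

```python
import re

# Per-library outputs, e.g. lib13364_C9E23ANXX_tophat_alignments.bam
LIBRARY_FILE = re.compile(
    r"^(?P<library>lib\d+)_(?P<flowcell>[A-Z0-9]+)_(?P<label>[^.]+)\.(?P<ext>.+)$"
)
# Project-level outputs, e.g. P123-10_C9E23ANXX_161017_combined_qc.csv
COMBINED_FILE = re.compile(
    r"^(?P<project>P\d+-\d+)_(?P<flowcell>[A-Z0-9]+)_(?P<date>\d{6})"
    r"_combined_(?P<type>[^.]+)\.(?P<ext>.+)$"
)


def parse_output_name(filename):
    """Split a current-layout file name into labeled fields.

    Returns None for names that follow neither pattern (for example,
    the mixcr outputs, which carry no flowcell ID).
    """
    match = COMBINED_FILE.match(filename)
    if match:
        return dict(match.groupdict(), level="project")
    match = LIBRARY_FILE.match(filename)
    if match:
        fields = match.groupdict()
        # The last token of the label is the output type; anything
        # before it names the tool or step that produced the file.
        source, _, output_type = fields.pop("label").rpartition("_")
        fields.update(source=source, type=output_type, level="library")
        return fields
    return None
```

For instance, `parse_output_name("lib13364_C9E23ANXX_picard_align_metrics.html")` gives a source of `picard_align` and a type of `metrics`, while `lib13364_mixcrReport.txt` returns `None`.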
Old (Globus)¶
Similar to the local organization, the initial Globus organization included various degrees of nesting, zipping, and cryptic naming (this was with the intention of matching the old structure). A few notes:
inputFastqs: these files represent concatenated lane-specific FASTQ files for a library; there’s no real reason for these to be output from workflows or saved long term (this behavior will change in future versions)
metrics: the zipped archives in this folder behave a bit differently than with the local org, as contents are extracted directly to the metrics folder, rather than to a new uncompressed folder; I might have fixed this at some point, but it’s something to watch out for
TrimmedFastqs: these represent FASTQs after adapter and quality trimming; they also don’t need to be saved
This organization was used through the 160817_D00565_0129_BC97JMANXX flowcell.
Project_P135-1Processed_globus_160622/
├── QC
│ └── lib12112_C8LB7ANXX
│   ├── qcR1
│   │ ├── fastqc_data.txt
│   │ └── fastqc_report.html
│   └── qcR1.zip
├── TrimmedFastqs
│ └── lib12112_C8LB7ANXX_trimmed.fastq
├── Trinity
│ ├── Trinity_combined1.fa
│ ├── Trinity_combined2.fa
│ └── lib12112_C8LB7ANXX
│   └── Trinity.fasta
├── alignments
│ ├── lib12112_C8LB7ANXX.bam
│ └── lib12112_C8LB7ANXX_tophat_alignments.bam.bai
├── counts
│ ├── P135-1_C8LB7ANXX_160622_combined_counts.csv
│ └── lib12112_C8LB7ANXX_count.txt
├── inputFastqs
│ └── lib12112_C8LB7ANXX_R1-final.fastq.gz
├── logs
│ └── lib12112_C8LB7ANXX_workflow_log.txt
├── metrics
│ ├── P135-1_C8LB7ANXX_160622_combined_metrics.csv
│ ├── P135-1_geneModelCoverage.pdf
│ ├── lib12112_C8LB7ANXXMarkDups.zip
│ │ └── (MarkDups_Dupes_Marked_html.html)
│ ├── lib12112_C8LB7ANXX_al.zip
│ │ └── (RNA_Seq_Metrics_html.html)
│ ├── lib12112_C8LB7ANXX_qc.zip
│ │ └── (Picard_Alignment_Summary_Metrics_html.html)
│ ├── lib12112_C8LB7ANXXmm.txt
│ └── lib12112_C8LB7ANXXths.txt
└── mixcrOutput_trinity
  ├── lib12112_mixcrAlign.vdjca
  ├── lib12112_mixcrAlignPretty.txt
  ├── lib12112_mixcrAssemble.clns
  ├── lib12112_mixcrClns.txt
  └── lib12112_mixcrReport.txt
Old (local)¶
The old file organization (used when workflows were run on a local Galaxy server and cluster) includes a lot of nesting, zipping, and cryptic naming.
For instance, metrics file abbreviations can be decoded as follows:
_qc: Picard Alignment Summary Metrics
_al: Picard CollectRnaSeqMetrics
MarkDups: Picard MarkDuplicates
ths: Tophat Stats
mm: htseq-count “other counts”
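The abbreviations above can be captured in a small lookup table. This is only an illustrative sketch, not the bripipetools.postprocessing.cleanup implementation; the dictionary and function names are hypothetical, and matching on the end of the file stem is an assumption based on the example names in this section.

```python
# Old-scheme metrics suffixes and the tool output each one stands for.
OLD_METRICS_SUFFIXES = {
    "_qc": "Picard Alignment Summary Metrics",
    "_al": "Picard CollectRnaSeqMetrics",
    "MarkDups": "Picard MarkDuplicates",
    "ths": "Tophat Stats",
    "mm": 'htseq-count "other counts"',
}


def decode_old_metrics_name(filename):
    """Return the description for an old-style metrics file or folder name.

    Returns None when the name carries none of the known suffixes.
    """
    stem = filename.split(".", 1)[0]  # drop .txt, .zip, and so on
    for suffix, description in OLD_METRICS_SUFFIXES.items():
        if stem.endswith(suffix):
            return description
    return None
```

For example, `decode_old_metrics_name("lib10852_C893JANXXths.txt")` returns `"Tophat Stats"`, while the combined metrics CSV and the gene model coverage PDF return `None`.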
The last flowcell for which this organization was used exclusively is 160307_D00565_0103_BC893JANXX (note: projects were processed both locally and with Globus Genomics for the next handful of flowcells until 160609_D00565_0113_BC8LB7ANXX, at which point operations transferred completely to Globus).
Project_P43-41Processed_160311/
├── P43-41_C893JANXX_160311_pulldownLog.txt
├── QC
│ └── lib10852_C893JANXX
│   ├── qcR1
│   │ ├── FastQC_FastqMcf_on_data_69_and_data_68__reads_html.html
│   │ ├── FastqMcf_on_data_69_and_data_68__reads_fastqc.zip
│   │ ├── duplication_levels.png
│   │ ├── error.png
│   │ ├── fastqc_data.txt
│   │ ├── fastqc_icon.png
│   │ ├── fastqc_report.html
│   │ ├── kmer_profiles.png
│   │ ├── per_base_gc_content.png
│   │ ├── per_base_n_content.png
│   │ ├── per_base_quality.png
│   │ ├── per_base_sequence_content.png
│   │ ├── per_sequence_gc_content.png
│   │ ├── per_sequence_quality.png
│   │ ├── rgFastQC96yA9X.log
│   │ ├── sequence_length_distribution.png
│   │ ├── summary.txt
│   │ ├── tick.png
│   │ └── warning.png
│   └── qcR1.zip
├── TrimmedFastqs
│ └── lib10852_C893JANXX_trimmed.fastq
├── Trinity
│ ├── Trinity_combined1.fa
│ └── lib10852_C893JANXX
│   └── Trinity.fasta
├── alignments
│ ├── lib10852_C893JANXX.bam
│ └── lib10852_C893JANXX.bam.bai
├── alignments_noDups
│ ├── lib10852_C893JANXX_noDups.bam
│ └── lib10852_C893JANXX_noDups.bam.bai
├── counts
│ ├── P43-41_C893JANXX_160311_combined_counts.csv
│ └── lib10852_C893JANXX_count.txt
├── metrics
│ ├── P43-41_C893JANXX_160311_combined_metrics.csv
│ ├── P43-41_geneModelCoverage.pdf
│ ├── lib10852_C893JANXXMarkDups
│ │ ├── MarkDuplicates.log
│ │ ├── MarkDuplicates.metrics.txt
│ │ └── MarkDups_Dupes_Marked_html.html
│ ├── lib10852_C893JANXXMarkDups.zip
│ ├── lib10852_C893JANXX_al
│ │ ├── CollectRnaSeqMetrics.log
│ │ ├── CollectRnaSeqMetrics.metrics.txt
│ │ └── RNA_Seq_Metrics_html.html
│ ├── lib10852_C893JANXX_al.zip
│ ├── lib10852_C893JANXX_qc
│ │ ├── CollectAlignmentSummaryMetrics.log
│ │ ├── CollectAlignmentSummaryMetrics.metrics.txt
│ │ └── Picard_Alignment_Summary_Metrics_html.html
│ ├── lib10852_C893JANXX_qc.zip
│ ├── lib10852_C893JANXXmm.txt
│ └── lib10852_C893JANXXths.txt
└── mixcrOutput_trinity
  ├── lib10852_mixcrAlign.vdjca
  ├── lib10852_mixcrAlignPretty.txt
  ├── lib10852_mixcrAssemble.clns
  ├── lib10852_mixcrClns.txt
  └── lib10852_mixcrReport.txt