Pipeline Catalog: Quality Control

Quality Control (FASTQC)

The quality of high-throughput genome sequencing data is most commonly analyzed using the FASTQC utility. The FASTQC summary includes quality scores as well as an assessment of anomalies in sequence content that could be associated with adapter contamination or errors in library construction. The MultiQC utility combines the FASTQC results from multiple datasets into a single summary document.
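To make the idea of a quality score concrete, the following minimal sketch decodes the quality line of a FASTQ record and averages it per read. It assumes the common Phred+33 encoding and simple four-line records; the function names are hypothetical and this is not how FASTQC itself is implemented.

```python
def mean_phred_quality(quality_line: str, offset: int = 33) -> float:
    """Return the mean Phred score of one FASTQ quality string.

    Assumes Phred+33 encoding (each character's ASCII code minus 33).
    """
    scores = [ord(ch) - offset for ch in quality_line]
    return sum(scores) / len(scores)


def parse_fastq(text: str):
    """Yield (read_id, sequence, quality) from FASTQ text with 4-line records."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        yield header[1:], seq, qual


# Two toy reads: 'I' encodes Phred 40 and '!' encodes Phred 0.
example = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!II\n"
per_read = {rid: mean_phred_quality(q) for rid, _, q in parse_fastq(example)}
print(per_read)
```

A report tool aggregates values like these across all reads and positions; the sketch only illustrates the per-read decoding step.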

User Guide

Workflow Repository: github.com/FredHutch/multi-fastqc-nf

Citations:

FASTQC: de Sena Brandine G, Smith AD. Falco: high-speed FastQC emulation for quality control of sequencing data. F1000Res. 2019 Nov 7;8:1874. doi: 10.12688/f1000research.21142.2. PMID: 33552473; PMCID: PMC7845152.
MultiQC: Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PMID: 27312411; PMCID: PMC5039924.

Source Tracking (NGSCheckMate)

NGSCheckMate is a software package for identifying next generation sequencing (NGS) data files from the same individual. It analyzes various types of NGS data files including (but not limited to) whole genome sequencing (WGS), whole exome sequencing (WES), RNA-seq, ChIP-seq, and targeted sequencing of various depths.

User Guide

Input Data: The NGSCheckMate tool can be run on any uploaded DNA or RNA dataset. Samples are compared only against other samples in the same dataset.

Outputs:

The output of this tool includes:

A dendrogram (tree) showing which samples are most likely from the same source
A table listing the similarity scores for all pairs of samples in the analysis
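The shape of a pairwise similarity table can be illustrated with a short sketch. It assumes, purely for illustration, that each sample is summarized by a hypothetical vector of allele fractions at a shared set of sites and that similarity is a plain Pearson correlation. The scoring that NGSCheckMate actually applies is described in its own documentation and may differ.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical allele fractions at five shared sites (illustrative values only).
allele_fractions = {
    "sample_A": [0.0, 0.5, 1.0, 0.5, 0.0],
    "sample_B": [0.0, 0.5, 1.0, 0.5, 0.0],
    "sample_C": [1.0, 0.0, 0.0, 0.5, 1.0],
}

# One score for every unordered pair of samples in the dataset.
pair_scores = {
    (a, b): pearson(allele_fractions[a], allele_fractions[b])
    for a, b in combinations(sorted(allele_fractions), 2)
}

for (a, b), score in pair_scores.items():
    print(f"{a}\t{b}\t{score:.3f}")
```

In this toy input, sample_A and sample_B have identical profiles and score highest, which is the kind of pattern a dendrogram would group together.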

Source Code: github.com/parklab/NGSCheckMate

PDF Documentation: github.com/parklab/NGSCheckMate/blob/master/Documentation.pdf

Citations:

Sejoon Lee, Soohyun Lee, Scott Ouellette, Woong-Yang Park, Eunjung A. Lee, Peter J. Park, NGSCheckMate: software for validating sample identity in next-generation sequencing studies within and across data types, Nucleic Acids Research, Volume 45, Issue 11, 20 June 2017, Page e103, https://doi.org/10.1093/nar/gkx193

Filter Sequences

When performing high-throughput sequencing of DNA or RNA, it can be helpful to remove sequences that originate from unintended or unwanted material present in the sample. One of the most common ways to perform this filtering is to align the sequence reads against a reference nucleotide sequence matching that contaminant.

Note: A read pair is removed only when both reads of the pair align to the indicated sequence.
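The pair-level rule in the note above can be sketched as follows. The function name and the representation of alignments as sets of pair identifiers are assumptions made for illustration; the workflow itself derives this information from the aligner output.

```python
def filter_pairs(pair_ids, aligned_r1, aligned_r2):
    """Return the pair IDs to keep.

    A pair is dropped only when BOTH of its reads aligned to the
    contaminant reference; a pair with one aligned read is retained.
    """
    kept = []
    for pair_id in pair_ids:
        both_aligned = pair_id in aligned_r1 and pair_id in aligned_r2
        if not both_aligned:
            kept.append(pair_id)
    return kept


pair_ids = ["p1", "p2", "p3", "p4"]
aligned_r1 = {"p1", "p2"}  # pairs whose first read aligned
aligned_r2 = {"p1", "p3"}  # pairs whose second read aligned

kept = filter_pairs(pair_ids, aligned_r1, aligned_r2)
print(kept)  # ['p2', 'p3', 'p4']
```

Only p1 is removed, because it is the only pair in which both reads aligned.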

Supports combining input datasets.

User Guide

Workflow Repository: github.com/FredHutch/star-filter-fastq

Reference Genome

To specify the sequence that should be removed, upload an uncompressed FASTA file as a reference of the type Reference Genome (FASTA).
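Before uploading, it can be useful to confirm that a file is a plain, uncompressed FASTA. The sketch below is a hypothetical local check, not part of the workflow: it rejects data that starts with the gzip magic bytes and otherwise requires the first non-whitespace character to be the FASTA header marker `>`.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def looks_like_uncompressed_fasta(data: bytes) -> bool:
    """Heuristic check on the leading bytes of a file."""
    if data[:2] == GZIP_MAGIC:
        return False  # gzip-compressed, so not acceptable as-is
    # A FASTA file begins with a '>' header line (ignoring leading blank space).
    return data.lstrip().startswith(b">")


plain = b">contaminant\nACGTACGT\n"
compressed = gzip.compress(plain)

print(looks_like_uncompressed_fasta(plain))       # True
print(looks_like_uncompressed_fasta(compressed))  # False
```

To check a file on disk, read its first few bytes with `open(path, "rb").read(64)` and pass them to the function.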

Aligner

The aligner used for filtering is the STAR aligner (version 2.7.10). A complete set of log files describing the software used and the performance of each filtering step can be found in the logs/ folder in the outputs.

Citations:

STAR: Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013 Jan 1;29(1):15-21. doi: 10.1093/bioinformatics/bts635. Epub 2012 Oct 25. PMID: 23104886; PMCID: PMC3530905.