Pipeline Catalog: Statistical Analysis

Differential Expression Analysis

One of the primary uses of bulk RNA sequencing data is to identify genes which are expressed at different levels between groups of samples. The three primary algorithms used for this analysis are DEseq2, edgeR, and limma-voom.

Image credit: WikiMedia

User Guide

To compare groups of samples on the basis of their metadata annotations, simply indicate which column in the sample metadata table should be used as the basis of comparison.

If the value in that column is numeric, then it will be treated as a continuous variable. If the value is categorical, then the user must also specify what the comparison value should be (against which all other values are contrasted).

The output of this analysis is a table of gene-level differential expression results, indicating the fold change in gene expression between sample groups as well as the FDR-adjusted significance of that difference.

Parameters:

Gene Counts: Select the table which contains the expression values for each gene in the analysis
Algorithm: Select the statistical method which should be used to identify significant changes in expression for each gene
Comparison Value: Enter the name of the column in the sample annotation table which indicates the comparison to make, either a continuous or a categorical value
Comparison References: In the case that a categorical variable was provided for the "Comparison Value", the user must specify the value in that column which should be treated as the control
Batch Correction: Optionally indicate any columns in the sample metadata table which annotate analysis batches which should be adjusted for in the differential expression analysis
Filtering: Optionally provide a boolean expression which is used to filter the set of samples included in the analysis (e.g. colA < 2 will exclude any samples which do not contain a value less than 2 in the column labeled 'colA')

Example 1 (Treatment/Control):

Consider the experimental design represented by:

sample	cage	status
mouse1	c1	control
mouse2	c1	control
mouse3	c1	treatment
mouse4	c1	treatment
mouse5	c2	control
mouse6	c2	control
mouse7	c2	treatment
mouse8	c2	treatment

To identify changes in gene expression in the treatment samples (relative to the control) samples, while controlling for any difference between cages c1 and c2, the appropriate parameters to use would be:

Comparison Value: status
Comparison Reference: control
Batch Correction: cage
Filtering: <blank>

Example 2 (Filtering):

Building from the example above, consider the scenario where two samples have been determined to have failed quality control, and need to be excluded from the anlaysis. To accomplish this, an additional column should be added (passing_qc), which is used to annotate the samples as a boolean pass/fail (1/0), and filter out the failing samples from the differential expression analysis

sample	cage	status	passing_qc
mouse1	c1	control	1
mouse2	c1	control	0
mouse3	c1	treatment	1
mouse4	c1	treatment	1
mouse5	c2	control	1
mouse6	c2	control	1
mouse7	c2	treatment	1
mouse8	c2	treatment	0

To identify changes in gene expression in the treatment samples (relative to the control) samples, while controlling for any difference between cages c1 and c2, the appropriate parameters to use would be:

Comparison Value: status
Comparison Reference: control
Batch Correction: cage
Filtering: passing_qc

Example 3 (Continuous Variable):

Instead of the treatment/control design above, consider the scenario where gene expression is being analyzed in comparison to a laboratory test for inflammation - C-reactive protein (CRP), and multiple samples have been collected from the same individuals over time:

sample	PTID	day	CRP
p01_01	p01	1	12.5
p01_03	p01	3	2.1
p01_12	p01	12	7.9
p02_06	p02	6	5.6
p02_08	p02	8	3.9
p02_11	p02	11	9.5
p03_01	p03	1	13.2
p03_02	p03	2	11.9
p03_09	p03	9	2.6

To identify changes in gene expression which are correlated with CRP levels (which in this case is a continuous variable), the appropriate parameters to use would be:

Comparison Value: CRP
Comparison Reference: <blank> (only needed for categorical variables)
Batch Correction: <blank> or PTID (to optionally control for differences between participants)
Filtering: <blank>

Workflow Repository: github.com/FredHutch/pw-differential-expression

Citations:

DEseq2: Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. doi: 10.1186/s13059-014-0550-8. PMID: 25516281; PMCID: PMC4302049.
edgeR: Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010 Jan 1;26(1):139-40. doi: 10.1093/bioinformatics/btp616. Epub 2009 Nov 11. PMID: 19910308; PMCID: PMC2796818.
limma-voom: Law CW, Chen Y, Shi W, Smyth GK. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014 Feb 3;15(2):R29. doi: 10.1186/gb-2014-15-2-r29. PMID: 24485249; PMCID: PMC4053721.

Gene Set Enrichment Analysis

Gene set enrichment analysis is performed to summarize the biological pathways which are commonly enriched or depleted in a collection of differentially expressed genes. The collections of genes which have been annotated by biological pathways can be found in the MSigDB. The Bioconductor package used to perform GSEA is fgsea.

The output of this analysis is a table indicating the enrichment scores associated with each of the hallmark gene sets in the MSigDB.

User Guide

Parameters:

Gene Ranking Table: The input table must contain a ranking of genes by their enrichment score, with values for both gene score and p-value
Gene ID Column Name: Indiciate the name of the column in the Gene Ranking Table which contains the Gene ID
Gene Score Column Name: Indiciate the name of the column in the Gene Ranking Table which contains the Gene Score
P-Value Column Name: Indiciate the name of the column in the Gene Ranking Table which contains the P-Value

Workflow Repository: github.com/FredHutch/pw-gene-set-enrichment

Citations:

fgsea: Korotkevich G, Sukhov V, Sergushichev A (2019). “Fast gene set enrichment analysis.” bioRxiv. doi: 10.1101/060012, http://biorxiv.org/content/early/2016/06/20/060012.

Lymphocyte Quantification

Having estimated the relative gene expression from bulk RNA sequencing, it is possible to estimate the proportions of different cell types which are present in the source tissue sample. This analysis is based on a mixture modeling approach using prior knowledge of the genes which are typically expressed in different cell types. The quanTIseq method focuses on different subtypes of lymphocytes, estimating the proportion of macrophages, T-cells, B-cells, etc.

User Guide

The source code for quanTIseq is available at https://github.com/icbi-lab/quanTIseq, with more extensive documentation at https://icbi.i-med.ac.at/software/quantiseq/doc/.

Parameters:

Gene Expression Table: The primary input for quanTIseq is a table with expression values for each gene, which defaults to the salmon.merged.gene_tpm.tsv file generated by the bulk RNAseq analysis workflow
Tumor Analysis: For samples generated from tumor specimens, selecting this option will use a tumor-specific prediction algorithm provided by quanTIseq

Workflow Repository: github.com/FredHutch/pw-bulk-rna-lymphocytes/

Citations:

quanTIseq: Finotello, F., Mayer, C., Plattner, C. et al. Molecular and pharmacological modulators of the tumor immune contexture revealed by deconvolution of RNA-seq data. Genome Med 11, 34 (2019). https://doi.org/10.1186/s13073-019-0638-6

Relative Abundance Table

The core innovation of "omics" technologies is the ability to measure a large number of features in a single sample. Depending on the instrumentation being used, these features may be (to pick a few):

Cell Types
Metabolites
Microbial Species
RNA Transcripts

To help support a wide range of data sources, Cirro provides a general-purpose dataset type for relative abundance data. The purpose of this dataset type is to (a) help better catalog dataset collections and (b) provide access to analysis tools which use this type of data as input.

Requirements:

The general-purpose format expected for relative abundance data is:

Comma-delimited table with the .csv file extension (or .csv.gz)
Rectangular table including a header row and index column
Each column contains a sample (observation)
Each row contains a feature (species, metabolite, gene, etc.)
Each cell contains a numeric value (integer or float)

index	sampleA	sampleB	sampleC
species1	0.1	0.08	0.15
species2	0.85	0.72	0.68
species3	0.05	0.20	0.17

Note: While the example table shows proportional values which add up to 1 for each sample, there is no such requirement or expectation for this data type.

Metadata:

It is strongly recommended to include a metadata table along with the relative abundance table. The only requirements for the metadata table is that it:

Is named samplesheet.csv
Uses a header row
Contains a sample column with values matching the relative abundance table

sample	batch	status
sampleA	0	control
sampleB	1	treatment
sampleC	0	control