Skip to content

Pipeline Catalog: Statistical Analysis

Differential Expression Analysis

Data Type

One of the primary uses of bulk RNA sequencing data is to identify genes which are expressed at different levels between groups of samples. The three primary algorithms used for this analysis are DEseq2, edgeR, and limma-voom.

p value vs fold change volcano plot Image credit: WikiMedia

User Guide

To compare groups of samples on the basis of their metadata annotations, simply indicate which column in the sample metadata table should be used as the basis of comparison.

If the value in that column is numeric, then it will be treated as a continuous variable. If the value is categorical, then the user must also specify what the comparison value should be (against which all other values are contrasted).

The output of this analysis is a table of gene-level differential expression results, indicating the fold change in gene expression between sample groups as well as the FDR-adjusted significance of that difference.

Parameters:

  • Gene Counts: Select the table which contains the expression values for each gene in the analysis
  • Algorithm: Select the statistical method which should be used to identify significant changes in expression for each gene
  • Comparison Value: Enter the name of the column in the sample annotation table which indicates the comparison to make, either a continuous or a categorical value
  • Comparison References: In the case that a categorical variable was provided for the "Comparison Value", the user must specify the value in that column which should be treated as the control
  • Batch Correction: Optionally indicate any columns in the sample metadata table which annotate analysis batches which should be adjusted for in the differential expression analysis
  • Filtering: Optionally provide a boolean expression which is used to filter the set of samples included in the analysis (e.g. colA < 2 will exclude any samples which do not contain a value less than 2 in the column labeled 'colA')

Example 1 (Treatment/Control):

Consider the experimental design represented by:

sample cage status
mouse1 c1 control
mouse2 c1 control
mouse3 c1 treatment
mouse4 c1 treatment
mouse5 c2 control
mouse6 c2 control
mouse7 c2 treatment
mouse8 c2 treatment

To identify changes in gene expression in the treatment samples (relative to the control) samples, while controlling for any difference between cages c1 and c2, the appropriate parameters to use would be:

  • Comparison Value: status
  • Comparison Reference: control
  • Batch Correction: cage
  • Filtering: <blank>

Example 2 (Filtering):

Building from the example above, consider the scenario where two samples have been determined to have failed quality control, and need to be excluded from the anlaysis. To accomplish this, an additional column should be added (passing_qc), which is used to annotate the samples as a boolean pass/fail (1/0), and filter out the failing samples from the differential expression analysis

sample cage status passing_qc
mouse1 c1 control 1
mouse2 c1 control 0
mouse3 c1 treatment 1
mouse4 c1 treatment 1
mouse5 c2 control 1
mouse6 c2 control 1
mouse7 c2 treatment 1
mouse8 c2 treatment 0

To identify changes in gene expression in the treatment samples (relative to the control) samples, while controlling for any difference between cages c1 and c2, the appropriate parameters to use would be:

  • Comparison Value: status
  • Comparison Reference: control
  • Batch Correction: cage
  • Filtering: passing_qc

Example 3 (Continuous Variable):

Instead of the treatment/control design above, consider the scenario where gene expression is being analyzed in comparison to a laboratory test for inflammation - C-reactive protein (CRP), and multiple samples have been collected from the same individuals over time:

sample PTID day CRP
p01_01 p01 1 12.5
p01_03 p01 3 2.1
p01_12 p01 12 7.9
p02_06 p02 6 5.6
p02_08 p02 8 3.9
p02_11 p02 11 9.5
p03_01 p03 1 13.2
p03_02 p03 2 11.9
p03_09 p03 9 2.6

To identify changes in gene expression which are correlated with CRP levels (which in this case is a continuous variable), the appropriate parameters to use would be:

  • Comparison Value: CRP
  • Comparison Reference: <blank> (only needed for categorical variables)
  • Batch Correction: <blank> or PTID (to optionally control for differences between participants)
  • Filtering: <blank>

Workflow Repository: github.com/FredHutch/pw-differential-expression

Citations:

  • DEseq2: Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. doi: 10.1186/s13059-014-0550-8. PMID: 25516281; PMCID: PMC4302049.
  • edgeR: Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010 Jan 1;26(1):139-40. doi: 10.1093/bioinformatics/btp616. Epub 2009 Nov 11. PMID: 19910308; PMCID: PMC2796818.
  • limma-voom: Law CW, Chen Y, Shi W, Smyth GK. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014 Feb 3;15(2):R29. doi: 10.1186/gb-2014-15-2-r29. PMID: 24485249; PMCID: PMC4053721.

Gene Set Enrichment Analysis

Data Type

Gene set enrichment analysis is performed to summarize the biological pathways which are commonly enriched or depleted in a collection of differentially expressed genes. The collections of genes which have been annotated by biological pathways can be found in the MSigDB. The Bioconductor package used to perform GSEA is fgsea.

The output of this analysis is a table indicating the enrichment scores associated with each of the hallmark gene sets in the MSigDB.

User Guide

Parameters:

  • Gene Ranking Table: The input table must contain a ranking of genes by their enrichment score, with values for both gene score and p-value
  • Gene ID Column Name: Indiciate the name of the column in the Gene Ranking Table which contains the Gene ID
  • Gene Score Column Name: Indiciate the name of the column in the Gene Ranking Table which contains the Gene Score
  • P-Value Column Name: Indiciate the name of the column in the Gene Ranking Table which contains the P-Value

Workflow Repository: github.com/FredHutch/pw-gene-set-enrichment

Citations:

  • fgsea: Korotkevich G, Sukhov V, Sergushichev A (2019). “Fast gene set enrichment analysis.” bioRxiv. doi: 10.1101/060012, http://biorxiv.org/content/early/2016/06/20/060012.

Lymphocyte Quantification

Data Type

Having estimated the relative gene expression from bulk RNA sequencing, it is possible to estimate the proportions of different cell types which are present in the source tissue sample. This analysis is based on a mixture modeling approach using prior knowledge of the genes which are typically expressed in different cell types. The quanTIseq method focuses on different subtypes of lymphocytes, estimating the proportion of macrophages, T-cells, B-cells, etc.

User Guide

The source code for quanTIseq is available at https://github.com/icbi-lab/quanTIseq, with more extensive documentation at https://icbi.i-med.ac.at/software/quantiseq/doc/.

Parameters:

  • Gene Expression Table: The primary input for quanTIseq is a table with expression values for each gene, which defaults to the salmon.merged.gene_tpm.tsv file generated by the bulk RNAseq analysis workflow
  • Tumor Analysis: For samples generated from tumor specimens, selecting this option will use a tumor-specific prediction algorithm provided by quanTIseq

Workflow Repository: github.com/FredHutch/pw-bulk-rna-lymphocytes/

Citations:

  • quanTIseq: Finotello, F., Mayer, C., Plattner, C. et al. Molecular and pharmacological modulators of the tumor immune contexture revealed by deconvolution of RNA-seq data. Genome Med 11, 34 (2019). https://doi.org/10.1186/s13073-019-0638-6

Relative Abundance Table

Data Type

The core innovation of "omics" technologies is the ability to measure a large number of features in a single sample. Depending on the instrumentation being used, these features may be (to pick a few):

  • Cell Types
  • Metabolites
  • Microbial Species
  • RNA Transcripts

To help support a wide range of data sources, Cirro provides a general-purpose dataset type for relative abundance data. The purpose of this dataset type is to (a) help better catalog dataset collections and (b) provide access to analysis tools which use this type of data as input.

Requirements:

The general-purpose format expected for relative abundance data is:

  • Comma-delimited table with the .csv file extension (or .csv.gz)
  • Rectangular table including a header row and index column
  • Each column contains a sample (observation)
  • Each row contains a feature (species, metabolite, gene, etc.)
  • Each cell contains a numeric value (integer or float)
index sampleA sampleB sampleC
species1 0.1 0.08 0.15
species2 0.85 0.72 0.68
species3 0.05 0.20 0.17

Note: While the example table shows proportional values which add up to 1 for each sample, there is no such requirement or expectation for this data type.

Metadata:

It is strongly recommended to include a metadata table along with the relative abundance table. The only requirements for the metadata table is that it:

  • Is named samplesheet.csv
  • Uses a header row
  • Contains a sample column with values matching the relative abundance table
sample batch status
sampleA 0 control
sampleB 1 treatment
sampleC 0 control