Pipeline Catalog: Targeted Sequencing

CRISPR Screen Analysis

One of the most useful applications of CRISPR is to perturb thousands of genes in parallel in vitro, using genome sequencing to identify which genes are involved in a particular biological process. In this approach a library of CRISPR guide sequences (called sgRNAs) is used to transform a population of cells, and DNA is then collected from experimental replicates which have undergone different perturbations. A genome sequencer is then used to count the number of times that each sgRNA sequence is found in each replicate. A reproducible change in the proportional abundance of a particular guide is interpreted as a change in cell replication rate caused by perturbing the gene targeted by that guide. Typically, a reproducible change must be seen in multiple guides which all target the same gene before its importance can be stated with confidence.
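The idea of a change in proportional abundance can be illustrated with a minimal sketch. The function name, the pseudocount, and the example guide IDs are assumptions made for illustration; this is not the calculation performed by the analysis software.

```python
import math


def log2_fold_changes(control_counts, treatment_counts, pseudocount=1.0):
    """Compare each guide's share of the reads in treatment vs. control.

    Both arguments map guide ID -> raw read count for one sample.
    The pseudocount (an illustrative choice) avoids taking log(0).
    """
    control_total = sum(control_counts.values())
    treatment_total = sum(treatment_counts.values())
    changes = {}
    for guide, control_n in control_counts.items():
        control_share = (control_n + pseudocount) / control_total
        treatment_share = (treatment_counts.get(guide, 0) + pseudocount) / treatment_total
        changes[guide] = math.log2(treatment_share / control_share)
    return changes


# Hypothetical counts: guide_b loses share in the treatment sample
control = {"guide_a": 100, "guide_b": 100}
treatment = {"guide_a": 150, "guide_b": 50}
print(log2_fold_changes(control, treatment, pseudocount=0.0))
```

A negative value means the guide makes up a smaller share of the treatment sample than of the control sample; a positive value means a larger share.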

User Guide

Non-Targeting Control (NTC) Normalization:

To enable normalization using the non-targeting control guides, you may upload a text file indicating which of the guide sequences are non-targeting controls. This list of controls may be uploaded at the same time as the library CSV it applies to (as described under Library Reference), as a text file with one non-targeting control guide ID per line.

Note: The non-targeting control file must contain "cont" as part of its filename (e.g. controls.txt).
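As a rough sketch of the file format described above, the following writes one guide ID per line and checks the filename rule before writing. The function name and the guide IDs are hypothetical.

```python
from pathlib import Path


def write_controls_file(guide_ids, path):
    """Write non-targeting control guide IDs, one per line.

    Raises ValueError if the filename does not contain "cont",
    which the note above states is required.
    """
    path = Path(path)
    if "cont" not in path.name:
        raise ValueError(f'filename must contain "cont": {path.name}')
    path.write_text("\n".join(guide_ids) + "\n")
    return path


# Example usage with hypothetical guide IDs:
# write_controls_file(["NTC_0001", "NTC_0002"], "controls.txt")
```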

Library Reference:

The collection of sgRNA sequences used in a particular experiment must be provided in CSV format using the columns:

guide: A unique ID for the sgRNA guide
sgrna: The nucleotide sequence of the sgRNA guide
gene: The name of the gene targeted by the sgRNA guide

Note: Please avoid manipulating or saving the CRISPR sgRNA library file using Microsoft Excel -- it often changes gene names into dates (e.g. SEPT1 -> 1-Sep).

After formatting the reference library CSV, upload it to Cirro from the References page using the CRISPR sgRNA Library type. When analyzing CRISPR data, select the appropriate reference using the name which was provided at the time of upload.

Note: The library CSV file must contain "lib" as part of its filename (e.g. library.csv).
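One way to check a library CSV without opening it in a spreadsheet program is a short script like the sketch below. The function name, the example rows, and the rule that sequences contain only A, C, G, and T are assumptions for illustration; Cirro may apply different checks.

```python
import csv
import io

REQUIRED_COLUMNS = ("guide", "sgrna", "gene")


def validate_library_csv(text):
    """Parse library CSV text and return its rows as dictionaries.

    Raises ValueError for missing columns, duplicate guide IDs,
    or sequences with characters other than A, C, G, T.
    """
    reader = csv.DictReader(io.StringIO(text))
    fieldnames = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in fieldnames]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    seen = set()
    rows = []
    for row in reader:
        if row["guide"] in seen:
            raise ValueError(f"duplicate guide ID: {row['guide']}")
        seen.add(row["guide"])
        if set(row["sgrna"].upper()) - set("ACGT"):
            raise ValueError(f"unexpected characters in sgrna for {row['guide']}")
        rows.append(row)
    return rows


# Hypothetical two-guide library
example = "guide,sgrna,gene\nGENEA_1,ACGTACGTACGTACGTACGT,GENEA\nGENEA_2,TTGACCATGGTTGACCATGG,GENEA\n"
print(len(validate_library_csv(example)), "guides parsed")
```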

Normalization Strategies

As part of the MAGeCK Flute analysis, a normalization process is implemented to determine which of the guide sequences are significantly different from the null expectation. The default behavior of the analysis software is to perform median count normalization, except when the total library read count is too low (as determined by the MAGeCK Flute software). In those cases of low read counts, normalization is instead performed using the total library counts.

As an alternative, when a library is configured with a list of non-targeting controls, the user may choose to perform 'Control Normalization'. In control normalization, the null distribution for each sample is generated from the NTC guides.
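The difference between using all guides and using only the NTC guides can be sketched with a median-of-ratios scaling, which is one common form of median normalization. This is an illustration under stated assumptions, not the procedure implemented by the analysis software: the function names and the data layout (a dictionary of samples, each mapping guide ID to count, with the same guides in every sample) are hypothetical.

```python
import math
from statistics import median


def size_factors(counts, guides=None):
    """Estimate one scaling factor per sample by median of ratios.

    counts: {sample: {guide: count}}
    guides: optional subset of guide IDs (e.g. non-targeting controls)
            used to estimate the factors; defaults to all guides.
    """
    samples = list(counts)
    if guides is None:
        guides = list(counts[samples[0]])
    # The geometric mean is only defined for guides seen in every sample
    usable = [g for g in guides if all(counts[s][g] > 0 for s in samples)]
    if not usable:
        raise ValueError("no guide has a nonzero count in every sample")
    geo_mean = {
        g: math.exp(sum(math.log(counts[s][g]) for s in samples) / len(samples))
        for g in usable
    }
    return {s: median(counts[s][g] / geo_mean[g] for g in usable) for s in samples}


def normalize(counts, guides=None):
    """Divide every count by its sample's size factor."""
    factors = size_factors(counts, guides)
    return {s: {g: n / factors[s] for g, n in counts[s].items()} for s in counts}


# Hypothetical counts: sample B was sequenced twice as deeply as sample A
counts = {
    "A": {"gene1_g1": 10, "ntc1": 100, "ntc2": 200},
    "B": {"gene1_g1": 5, "ntc1": 200, "ntc2": 400},
}
print(normalize(counts, guides=["ntc1", "ntc2"]))
```

Passing only the NTC guide IDs makes the scaling depend on guides that are expected to have no effect, so a strong, widespread response among the targeting guides does not distort the baseline.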

Uploading Data:

Sequencing data should be provided in FASTQ format from both treatment and control samples. To distinguish these groups of files, use the subfolders treatments/ and controls/. Each of the files in those folders must have the extension .fq.gz or .fastq.gz.
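Before uploading, the folder layout can be checked with a few lines of code. The sketch below assumes relative paths with forward slashes; the function name and file names are hypothetical.

```python
from pathlib import PurePosixPath

VALID_SUFFIXES = (".fq.gz", ".fastq.gz")


def group_fastqs(paths):
    """Group FASTQ paths by their parent folder (treatments/ or controls/).

    Raises ValueError for a wrong extension or an unexpected folder.
    """
    groups = {"treatments": [], "controls": []}
    for raw in paths:
        path = PurePosixPath(raw)
        if not path.name.endswith(VALID_SUFFIXES):
            raise ValueError(f"unsupported extension: {raw}")
        folder = path.parent.name
        if folder not in groups:
            raise ValueError(f"file is not in treatments/ or controls/: {raw}")
        groups[folder].append(path.name)
    return groups


print(group_fastqs(["treatments/drug_rep1.fastq.gz", "controls/dmso_rep1.fq.gz"]))
```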

Tools:

MAGeCK Count: Simply count the number of times that each sgRNA is found in each sample
MAGeCK Flute: Identify which genes are significantly enriched/depleted in treatment samples compared to controls

Parameters:

Treatment / Control FASTQs: Indicate which of the uploaded FASTQ files are treatments vs. controls (MAGeCK Flute only)
Library: Select the CRISPR sgRNA Library appropriate for this analysis
Adapter: The nucleotide sequence found immediately before the sgRNA sequence in each read
Insert Length: The nucleotide length of the sgRNA sequences used in the library
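To make the Adapter and Insert Length parameters concrete, the sketch below pulls the guide sequence out of a read by locating the adapter and taking the next Insert Length bases. It only illustrates what the parameters mean; the matching logic inside MAGeCK Count may differ, and the function names, adapter, and sequences are hypothetical.

```python
from collections import Counter


def extract_guide(read, adapter, insert_length):
    """Return the bases immediately after the adapter, or None.

    None is returned when the adapter is absent or the read ends
    before insert_length bases are available.
    """
    index = read.find(adapter)
    if index == -1:
        return None
    start = index + len(adapter)
    guide = read[start:start + insert_length]
    return guide if len(guide) == insert_length else None


def count_guides(reads, library, adapter, insert_length):
    """Count reads per guide ID.

    library maps sgRNA sequence -> guide ID (exact matches only).
    """
    counts = Counter()
    for read in reads:
        guide = extract_guide(read, adapter, insert_length)
        if guide in library:
            counts[library[guide]] += 1
    return counts


# Hypothetical adapter, guide, and reads
library = {"ACGTACGTACGTACGTACGT": "GENEA_1"}
reads = ["TTTTCACCGACGTACGTACGTACGTACGTGTTT", "TTTTCACCGACGT"]
print(count_guides(reads, library, adapter="CACCG", insert_length=20))
```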

Workflow Repository: https://github.com/FredHutch/crispr-screen-nf/

Citations:

MAGeCK: Li, W., Xu, H., Xiao, T. et al. MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. Genome Biol 15, 554 (2014). https://doi.org/10.1186/s13059-014-0554-4

VirScan

The presence of antibodies against specific epitopes can be estimated by measuring the degree of antibody binding to complex phage display libraries, a process generally called Phage Immunoprecipitation Sequencing (PhIP-Seq). The VirScan process uses PhIP-Seq to detect antibodies binding to a panel of viral epitopes, integrating antibody binding data across many peptides to infer past exposure to specific pathogens.

User Guide

References:

Reference libraries for VirScan analysis consist of two files, one containing the list of epitope sequences to search, and one containing a list of "public" epitopes which are more commonly detected across individuals. Any reference library may be used which conforms to the format shown here for the Vir3 library:

After formatting the necessary reference CSV files, upload them to Cirro from the References page using the VirScan Library type. Make sure to upload both the library and public epitopes CSV to the same reference. When analyzing VirScan data, select the appropriate reference using the name which was provided at the time of upload.

Uploading Data:

When uploading sequencing data (in FASTQ format) from VirScan assays, it is important that the experimental replicates are marked appropriately. The VirScan analysis process compares the degree of antigen recognition across experimental replicates to better identify high-confidence predictions.

To ensure that experimental replicates are marked appropriately, the best approach is to use a sample sheet (uploading a file named samplesheet.csv) which assigns a sample name to each FASTQ file. The sample names should go in a column named sample, while the FASTQ files should be listed in a column named fastq_1, with one line per file.

In addition, the sample sheet should include a column called control_status which indicates whether the sample is a control sample (beads_only) or an experimental sample (empirical). Every batch of analysis should include at least one beads_only control and one empirical sample (although greater numbers of controls are better).

An example samplesheet.csv (with two samples, two controls, and two replicates each) may look like this:

sample,fastq_1,control_status
sample1,sample1_rep1_S1_R1_001.fastq.gz,empirical
sample1,sample1_rep2_S2_R1_001.fastq.gz,empirical
sample2,sample2_rep1_S3_R1_001.fastq.gz,empirical
sample2,sample2_rep2_S4_R1_001.fastq.gz,empirical
control1,control1_rep1_S5_R1_001.fastq.gz,beads_only
control1,control1_rep2_S6_R1_001.fastq.gz,beads_only
control2,control2_rep1_S7_R1_001.fastq.gz,beads_only
control2,control2_rep2_S8_R1_001.fastq.gz,beads_only
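
A sample sheet in this format can be checked before upload with a short script. The following is a minimal sketch whose function name and error messages are hypothetical; it verifies the three columns, the allowed control_status values, and the presence of at least one control and one empirical sample.

```python
import csv
import io

REQUIRED_COLUMNS = {"sample", "fastq_1", "control_status"}
ALLOWED_STATUSES = {"beads_only", "empirical"}


def validate_samplesheet(text):
    """Parse samplesheet.csv text and return its rows as dictionaries."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = list(reader)
    statuses = {row["control_status"] for row in rows}
    invalid = statuses - ALLOWED_STATUSES
    if invalid:
        raise ValueError(f"unexpected control_status values: {sorted(invalid)}")
    for needed in sorted(ALLOWED_STATUSES):
        if needed not in statuses:
            raise ValueError(f"at least one {needed} sample is required")
    return rows


sheet = (
    "sample,fastq_1,control_status\n"
    "sample1,sample1_rep1_S1_R1_001.fastq.gz,empirical\n"
    "control1,control1_rep1_S5_R1_001.fastq.gz,beads_only\n"
)
print(len(validate_samplesheet(sheet)), "rows OK")
```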

Note: File names which match the pattern shown above (SAMPLE_REP_SN_R1_001.fastq.gz) can be uploaded without a sample sheet and will be parsed appropriately. However, that approach is not recommended because it does not support sample names with underscores, and control_status will need to be filled in manually.
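
The limitation with underscores can be seen in a small sketch of how such a filename pattern might be parsed. The regular expression below is an assumption for illustration and is not the parser used by Cirro.

```python
import re

# Hypothetical pattern for SAMPLE_REP_SN_R1_001.fastq.gz, where the
# sample and replicate fields may not themselves contain underscores
FILENAME_PATTERN = re.compile(
    r"^(?P<sample>[^_]+)_(?P<replicate>[^_]+)_S(?P<index>\d+)_R1_001\.fastq\.gz$"
)


def parse_filename(name):
    """Return (sample, replicate) or None if the name does not match."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    return match.group("sample"), match.group("replicate")


print(parse_filename("sample1_rep1_S1_R1_001.fastq.gz"))
print(parse_filename("my_sample_rep1_S1_R1_001.fastq.gz"))
```

In this sketch the second name is rejected because the extra underscore makes it unclear where the sample name ends and the replicate label begins, which is why a sample sheet is the safer choice.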

Sample Metadata:

If you feel comfortable with JSON files, you can follow the instructions below to edit the JSON schemas for metadata. If not, contact the Cirro team for assistance.

After uploading the samples, the user must indicate which were generated from beads-only controls, and which were generated from empirical samples.

If a sample sheet was used to automatically annotate control status at upload time, it is still helpful to follow the steps below so that the sample annotation page can be used to view and edit control status.

To add the appropriate field in the sample annotation page, upload the following metadata.schema.json:

{
  "$id": "https://json-schema.org/draft/2020-12/schema",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "control_status": {
      "type": "string",
      "title": "Treatment Status",
      "description": "Indicates whether a sample is a treatment or a control",
      "enum": [
        "beads_only",
        "empirical"
      ],
      "enumNames": [
        "Control (beads only)",
        "Empirical sample"
      ]
    }
  }
}

After updating the metadata schema, you will be able to mark each of the uploaded samples as either an empirical sample or a beads-only control.

Parameters:

Read / Peptide Length: Indicates the length of the sequence generated from each epitope which is listed in the library CSV (both read and peptide length should be the same in most cases)
Num. Mismatches: The maximum number of mismatches allowed between a sequence read and a reference sequence for the read to be counted as a hit
Z-score Threshold: Used for identifying significantly enriched peptides in each treatment sample relative to the beads-only controls for the CPM Enrichment Analysis
Enrichment Modeling: While the CPM Enrichment Analysis should work even with low numbers of beads-only controls, the Negative Binomial Modeling is only expected to perform well with larger numbers (>=10) of controls
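
The roles of the Num. Mismatches and Z-score Threshold parameters can be sketched as follows. This is an illustration under simplifying assumptions (ungapped comparison of equal-length sequences, and a per-peptide Z-score computed directly from the beads-only controls); the workflow's alignment and enrichment calculations may differ, and all names and values are hypothetical.

```python
from statistics import mean, stdev


def is_hit(read, reference, max_mismatches):
    """True if two equal-length sequences differ at no more than max_mismatches positions."""
    if len(read) != len(reference):
        return False
    mismatches = sum(a != b for a, b in zip(read, reference))
    return mismatches <= max_mismatches


def counts_per_million(counts):
    """Scale raw peptide counts to counts per million (CPM)."""
    total = sum(counts.values())
    return {peptide: 1e6 * n / total for peptide, n in counts.items()}


def enriched_peptides(sample_cpm, control_cpms, z_threshold):
    """List peptides whose CPM exceeds the beads-only mean by z_threshold standard deviations.

    control_cpms is a list of CPM dictionaries, one per beads-only control.
    At least two controls are needed to estimate a standard deviation.
    """
    if len(control_cpms) < 2:
        raise ValueError("need at least two beads-only controls")
    hits = []
    for peptide, value in sample_cpm.items():
        background = [control[peptide] for control in control_cpms]
        spread = stdev(background)
        if spread == 0:
            continue  # Z-score is undefined without variation in the controls
        if (value - mean(background)) / spread >= z_threshold:
            hits.append(peptide)
    return hits


# Hypothetical CPM values for one sample and two beads-only controls
sample = {"pep1": 100.0, "pep2": 10.0}
controls = [{"pep1": 10.0, "pep2": 9.0}, {"pep1": 12.0, "pep2": 11.0}]
print(enriched_peptides(sample, controls, z_threshold=3.0))
```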

Workflow Repository: https://github.com/matsengrp/phip-flow

Citations:

PhIP-Seq: Mohan D, Wansley DL, Sie BM, Noon MS, Baer AN, Laserson U, Larman HB. PhIP-Seq characterization of serum antibodies using oligonucleotide-encoded peptidomes. Nat Protoc. 2018 Sep;13(9):1958-1978. doi: 10.1038/s41596-018-0025-6. Erratum in: Nat Protoc. 2018 Oct 25. PMID: 30190553; PMCID: PMC6568263.
VirScan: Xu GJ, Kula T, Xu Q, Li MZ, Vernon SD, Ndung'u T, Ruxrungtham K, Sanchez J, Brander C, Chung RT, O'Connor KC, Walker B, Larman HB, Elledge SJ. Viral immunology. Comprehensive serological profiling of human populations using a synthetic human virome. Science. 2015 Jun 5;348(6239):aaa0698. doi: 10.1126/science.aaa0698. PMID: 26045439; PMCID: PMC4844011.