Pipeline Catalog: Single-Cell Sequencing

Uploading Data

In addition to the FASTQ files for the single-cell sequencing data, a sample sheet (named samplesheet.csv) must be provided which indicates what information is contained in each file.

The sample sheet should be a CSV file with the following columns:

sample: the name of the sequencing library (e.g. SampleA_GEX)
fastq_1: the path to the Read 1 FASTQ file (e.g. SampleA_GEX.R1.fastq.gz)
fastq_2: the path to the Read 2 FASTQ file (e.g. SampleA_GEX.R2.fastq.gz)
feature_types: the type of feature data contained in the sample (e.g. Gene Expression)
grouping: a grouping identifier for the sample (e.g. SampleA)

The grouping column is used to match up which GEX and VDJ FASTQs should be analyzed together. Including this column makes it possible to analyze a dataset which contains multiple groups of samples that should each be analyzed independently.

For example:

sample,fastq_1,fastq_2,grouping,feature_types
SampleA_GEX,SampleA_GEX.R1.fastq.gz,SampleA_GEX.R2.fastq.gz,SampleA,Gene Expression
SampleA_VDJ,SampleA_VDJ.R1.fastq.gz,SampleA_VDJ.R2.fastq.gz,SampleA,VDJ
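To illustrate how the grouping column ties libraries together, here is a minimal Python sketch that reads the example sample sheet above and collects the libraries in each group. The function name libraries_by_group is hypothetical and this is not the code Cirro runs; it only shows the idea of matching rows by their grouping value.

```python
import csv
import io
from collections import defaultdict

# The example sample sheet from the text, embedded as a string
SAMPLESHEET = """\
sample,fastq_1,fastq_2,grouping,feature_types
SampleA_GEX,SampleA_GEX.R1.fastq.gz,SampleA_GEX.R2.fastq.gz,SampleA,Gene Expression
SampleA_VDJ,SampleA_VDJ.R1.fastq.gz,SampleA_VDJ.R2.fastq.gz,SampleA,VDJ
"""


def libraries_by_group(samplesheet_text):
    """Map each grouping value to the (sample, feature_types) pairs it contains."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        groups[row["grouping"]].append((row["sample"], row["feature_types"]))
    return dict(groups)


print(libraries_by_group(SAMPLESHEET))
```

Each key in the result is one group of libraries that would be analyzed together.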

The different options for the feature_types column are:

Gene Expression
VDJ
VDJ-T
VDJ-B
Antibody Capture
CRISPR Guide Capture
Multiplexing Capture
Chromatin Accessibility
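Before uploading, it can help to check a sample sheet against the columns and feature_types values listed above. The following Python sketch does that under two assumptions which are not stated in this documentation: that values must match the listed options exactly (including capitalization), and that all five columns are required. The function name check_samplesheet is hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = {"sample", "fastq_1", "fastq_2", "feature_types", "grouping"}
FEATURE_TYPES = {
    "Gene Expression", "VDJ", "VDJ-T", "VDJ-B", "Antibody Capture",
    "CRISPR Guide Capture", "Multiplexing Capture", "Chromatin Accessibility",
}


def check_samplesheet(text):
    """Return a list of human-readable problems found in the sample sheet text."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {', '.join(sorted(missing))}"]
    problems = []
    # The header is line 1, so data rows start at line 2
    for line_number, row in enumerate(reader, start=2):
        value = row["feature_types"]
        if value not in FEATURE_TYPES:  # assumption: exact, case-sensitive match
            problems.append(f"line {line_number}: unknown feature_types {value!r}")
    return problems


good = (
    "sample,fastq_1,fastq_2,grouping,feature_types\n"
    "SampleA_GEX,SampleA_GEX.R1.fastq.gz,SampleA_GEX.R2.fastq.gz,SampleA,Gene Expression\n"
)
print(check_samplesheet(good))
```

An empty list means that no problems were found by these particular checks.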

Forgot to upload a sample sheet? No problem! You can upload it later by clicking on the "Upload Samplesheet" button for the dataset.

10X Single Cell Sequencing (cellranger)

Single-cell sequencing data produced by the 10X platform can be analyzed using the CellRanger software suite developed by 10X Genomics.

The CellRanger software suite can be run in many different ways, to account for all of the different ways that data can be collected from single cells.

Analysis options include:

Gene Expression Only
VDJ (T- or B-cells) Only
Fixed RNA Profiling
Gene Expression + VDJ and/or Antibody Capture
CellRanger Multiome ATAC + GEX

5' Cell Multiplexing: Analysis of data produced by the 5' Cell Multiplexing platform is supported using the approach shown here. Use the CellRanger Hashtagging workflow to analyze this data.

Note: If your dataset has one set of reads which contains both the 5' Cell Multiplexing tags and feature barcode data, those files should be marked as Multiplexing Capture in the feature_types column of the samplesheet.csv. If a Feature Reference CSV is provided to the analysis, then those reads will be processed as feature barcode data after demultiplexing.

Fixed RNA Profiling + Feature Barcodes: If your dataset contains both fixed RNA profiling data and feature barcode data, you can use the CellRanger Multi Analysis workflow to analyze it. Make sure to select the "Fixed RNA Profiling" checkbox in the workflow options and enter the appropriate probe barcode information.

Reference Genomes:

Human reference (GRCh38) - 2024-A (and legacy 2020-A)
Mouse reference (GRCm39) - 2024-A (and legacy 2020-A)
Human (GRCh38) and mouse (GRCm39) reference - 2024-A (and legacy 2020-A)
Human V(D)J reference (GRCh38) - 7.1.0 (and legacy 7.0.0)
Mouse V(D)J reference (GRCm38) - 7.0.0

More Details

Workflow Repository: github.com/FredHutch/nf-cellranger-tools

Citations:

CellRanger: 10X Documentation

Visualization of Gene Expression

In addition to the summary images produced by CellRanger, the results of single-cell gene expression analysis can be visualized directly in Cirro using an interactive display. Visualizations which are available for a dataset can be opened by clicking on the button immediately above the file browser:

[Screenshot: selecting a visualization]

After selecting the dataset of interest, the visualization will load directly in the browser:

[Screenshot: visualization displayed in the browser]

CellRanger Gene Expression

CellRanger gene expression analysis can be performed for any sequencing data produced by Chromium Single Cell Gene Expression. The output of this analysis is a set of files summarizing the relative expression of each gene detected across each of the cells prepared in the sample.

Parameters:

Transcriptome: Select the appropriate reference genome (human, mouse, or combined)
Include Introns: Retain reads which align to intronic regions of genes
CellRanger Version: Specify the CellRanger software version to be used for analysis

Technical Details

Workflow Repository: github.com/FredHutch/nf-cellranger-tools

Citations:

CellRanger: 10X Documentation

CellRanger V(D)J

CellRanger V(D)J analysis can be performed for any sequencing data produced by Chromium Single Cell 5' V(D)J libraries. The output of this analysis is a set of files summarizing the reconstructed V(D)J alleles from each cell.

Parameters:

Genome: Select the appropriate reference genome (human or mouse)
CellRanger Version: Specify the CellRanger software version to be used for analysis

Technical Details

Workflow Repository: github.com/FredHutch/nf-cellranger-tools

Citations:

CellRanger: 10X Documentation

CellRanger Multi Analysis

CellRanger multi analysis can be performed for any sequencing data produced by combining Chromium Single Cell Gene Expression with other modalities, including 5' V(D)J, Antibody Capture, or CRISPR Guide Capture.

The implementation of cellranger multi provided in this workflow does not currently support CMO multiplexing - please reach out to support@cirro.bio if you have an interest in CMOs.

Uploading Data:

Input FASTQ data must be annotated to indicate which files contain each type of library using the samplesheet.csv approach shown above. This information will be used to automatically construct the config CSV required by cellranger multi.

Feature References:

When analyzing Antibody Capture or CRISPR Guide Capture data, you must provide the appropriate Feature Reference CSV file. First upload that file to Cirro from the References page as the CellRanger Feature Reference (CSV) file type. Make sure to provide an appropriate name for the reference CSV which you upload. Then, when launching an analysis of CellRanger Multi, select the reference appropriate for that dataset.

Fixed RNA Profiling:

Data produced from Fixed RNA Profiling which also contains feature barcode data can be analyzed using cellranger multi as described here.

To enable this feature, select the "Fixed RNA Profiling" checkbox in the workflow options. If multiple samples were run in the same experiment, you can provide the probe barcode information in CSV format in the text box immediately below the checkbox.

For example:

sample_id,probe_barcode_ids,description
sample1,BC001+AB005,Control
sample2,BC002+AB006,Control
sample3,BC003+AB007,Case
sample4,BC004+AB008,Case
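When there are many samples, the probe barcode CSV can be generated rather than typed by hand. The Python sketch below renders a list of records in the three-column layout shown above; the function name probe_barcode_csv and the choice to build the text with the csv module are illustrative assumptions, and the output would still need to be pasted into the text box manually.

```python
import csv
import io


def probe_barcode_csv(samples):
    """Render (sample_id, probe_barcode_ids, description) tuples as CSV text."""
    buffer = io.StringIO()
    # Use plain "\n" line endings instead of the csv module default of "\r\n"
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample_id", "probe_barcode_ids", "description"])
    writer.writerows(samples)
    return buffer.getvalue()


text = probe_barcode_csv([
    ("sample1", "BC001+AB005", "Control"),
    ("sample3", "BC003+AB007", "Case"),
])
print(text)
```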

Parameters:

Transcriptome: Select the appropriate reference genome (human, mouse, or combined)
V(D)J Reference: Reference genome used for alignment of V(D)J data (human or mouse)
Include Introns: Retain reads which align to intronic regions of genes
Feature Reference: Select the Feature Reference CSV which has been uploaded to the References page (for either Antibody Capture or CRISPR Guide Capture)
Fixed RNA Profiling: Click the "Enabled" checkbox if the dataset contains Fixed RNA Profiling data
CellRanger Version: Specify the CellRanger software version to be used for analysis

Technical Details

Workflow Repository: github.com/FredHutch/nf-cellranger-tools

Citations:

CellRanger: 10X Documentation

CellRanger Hashtagging

Cell Multiplexing (or cell hashing) is a method for pooling samples using sample-specific molecular tags. These tags help identify transcripts that originate from the same donor sample. Cell Multiplexing offers several advantages, including:

Increased sample throughput in a single experiment
Increased number of cells assayed in a single experiment
Increased number of possible replicates in a single experiment
Detection of cell multiplets and their removal before analysis

For more details on Cell Multiplexing, see the 10X Genomics documentation.

Hashtagging Requirements:

In order to analyze hashtagged data, the FASTQ files must be annotated using the samplesheet.csv to indicate which ones contain the hashtag barcode sequences. In addition, the multiplexing information must be provided to indicate which barcodes were used for each sample. This information will be used to automatically construct the config CSV required by cellranger multi.

FASTQ Information (samplesheet.csv):

When constructing the samplesheet.csv used to annotate the uploaded FASTQ files (details on the samplesheet format), the feature_types column must contain the value "Multiplexing Capture" for the FASTQs containing hashtag data.

If a dataset contains a single set of reads which contains both the 5' Cell Multiplexing tags and feature barcode data, those files should be marked as Multiplexing Capture in the feature_types column of the samplesheet.csv.

Sample Information (CSV):

To match each hashtag barcode to the corresponding sample, a CSV file must be uploaded as a CellRanger Feature Reference on the Pipelines page. This file should contain columns sample_id and cmo_ids, for example:

sample_id,cmo_ids
SampleA,Hash-tag1
SampleB,Hash-tag2
SampleC,Hash-tag3
SampleD,Hash-tag4

The sample information CSV file may be uploaded with any name, and will be selected by the user when launching the analysis.

Barcode Information (feature_reference.csv):

To provide molecular information about the hashtag barcodes, a feature_reference.csv file must be provided which includes the Multiplexing Capture value as the feature_type. The full specification of this file can be found in the CellRanger documentation, with an example shown below:

id,name,read,pattern,sequence,feature_type
Hash-tag1,Hash-tag1,R2,^NNNNNNNNNN(BC)NNNNNNNNN,GTCAACTCTTTAGCG,Multiplexing Capture
Hash-tag2,Hash-tag2,R2,^NNNNNNNNNN(BC)NNNNNNNNN,TGATGGCCTATTGGG,Multiplexing Capture
Hash-tag3,Hash-tag3,R2,^NNNNNNNNNN(BC)NNNNNNNNN,TTCCGCCTCTCTTTG,Multiplexing Capture
Hash-tag4,Hash-tag4,R2,^NNNNNNNNNN(BC)NNNNNNNNN,AGTAAGTTCAGCGTA,Multiplexing Capture

Because this hashtag information may be used for multiple datasets, it will be uploaded as a Pipeline Reference.

The barcode information CSV file may have any name, and will be selected by the user when launching the analysis.
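Because the sample information CSV and the barcode information CSV are uploaded separately, a mismatch between them is easy to introduce. The following Python sketch checks that every entry in cmo_ids has a matching id with the Multiplexing Capture feature_type. The function name missing_hashtags is hypothetical, and splitting multiple tags on "|" is an assumption about how more than one tag per sample would be written.

```python
import csv
import io

SAMPLE_INFO = """\
sample_id,cmo_ids
SampleA,Hash-tag1
SampleB,Hash-tag2
"""

FEATURE_REFERENCE = """\
id,name,read,pattern,sequence,feature_type
Hash-tag1,Hash-tag1,R2,^NNNNNNNNNN(BC)NNNNNNNNN,GTCAACTCTTTAGCG,Multiplexing Capture
Hash-tag2,Hash-tag2,R2,^NNNNNNNNNN(BC)NNNNNNNNN,TGATGGCCTATTGGG,Multiplexing Capture
"""


def missing_hashtags(sample_info_text, feature_reference_text):
    """Return (sample_id, cmo_id) pairs with no Multiplexing Capture row in the reference."""
    known = {
        row["id"]
        for row in csv.DictReader(io.StringIO(feature_reference_text))
        if row["feature_type"] == "Multiplexing Capture"
    }
    missing = []
    for row in csv.DictReader(io.StringIO(sample_info_text)):
        # Assumption: several tags for one sample are separated by "|"
        for cmo_id in row["cmo_ids"].split("|"):
            if cmo_id not in known:
                missing.append((row["sample_id"], cmo_id))
    return missing


print(missing_hashtags(SAMPLE_INFO, FEATURE_REFERENCE))
```

An empty list indicates that every hashtag named in the sample information has barcode details available.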

Output:

The output of the hashtagging analysis will include the standard CellRanger outputs, including gene expression and V(D)J information if provided. Results will be provided for each individual sample which has been specified in the multiplexing table.

Technical Details

Workflow Repository: github.com/FredHutch/nf-cellranger-tools

Citations:

CellRanger: 10X Documentation

CellRanger Flex

Fixed RNA Profiling data can be analyzed using the CellRanger software suite provided by 10X Genomics.

Barcodes

When analyzing 10X data produced by Fixed RNA Profiling, the barcode used for each sample must be listed at the time of analysis. This information will be used to automatically populate a configuration CSV used by cellranger multi for analyzing this sample type.

Probe Sets

By default, the Chromium Probe Set v1.0.1 will be used for analysis. Custom probe sets are also supported by this analysis workflow. First upload the probe set CSV provided by 10X Genomics as a Pipeline Reference, selecting the type "CellRanger Probe Set (CSV)". Then select that uploaded probe set when analyzing the associated sequencing data.

Parameters:

Reference Genome: Select the appropriate reference genome (human or mouse)
Custom Probe Set: Optionally select a custom probe set to use for analysis
Samples: List the samples used for each barcode (BC001, BC002, etc.)
CellRanger Version: Specify the CellRanger software version to be used for analysis

Technical Details

Workflow Repository: github.com/FredHutch/nf-cellranger-tools

Citations:

CellRanger: 10X Documentation

CellRanger Multiome ATAC + GEX

Analyze Single Cell Multiome ATAC + Gene Expression data using the CellRanger ARC software suite provided by 10X Genomics.

Input Data:

After processing a sample for Multiome ATAC + Gene Expression, raw sequencing data will be produced in the form of FASTQ files for both chromatin accessibility (ATAC) and gene expression (GEX) information. In order to analyze these datasets, the FASTQ files must be marked appropriately according to the data type that they represent.

The best way to annotate the input FASTQ data is using a samplesheet.csv with columns for:

sample: Identifier for the sequencing library
fastq_1: Name of the R1 (or I1) FASTQ file
fastq_2: Name of the R2 (or I2) FASTQ file
grouping: Identifier for the sample analyzed for both GEX and ATAC
feature_types: Either Gene Expression, or Chromatin Accessibility

Example:

sample,fastq_1,fastq_2,grouping,feature_types
sampleA_gex,sampleA_gex_S1_L001_R1_001.fastq.gz,sampleA_gex_S1_L001_R2_001.fastq.gz,sampleA,Gene Expression
sampleA_atac,sampleA_atac_S2_L001_R1_001.fastq.gz,sampleA_atac_S2_L001_R2_001.fastq.gz,sampleA,Chromatin Accessibility

Note: The file described above should be named samplesheet.csv and can be uploaded either (1) along with the FASTQ files during initial upload or (2) by clicking on the "Upload Samplesheet" button for a previously-uploaded dataset.
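Since each sample needs both a Gene Expression and a Chromatin Accessibility library, a quick check of the sample sheet can catch a group that is missing one of them. This Python sketch is one way to do that; the function name incomplete_groups is hypothetical, and the requirement that every group contain exactly these two types is inferred from the description above rather than confirmed against the workflow.

```python
import csv
import io
from collections import defaultdict

EXPECTED = {"Gene Expression", "Chromatin Accessibility"}


def incomplete_groups(samplesheet_text):
    """Map each grouping to the library types it is missing (complete groups are omitted)."""
    seen = defaultdict(set)
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        seen[row["grouping"]].add(row["feature_types"])
    return {
        group: sorted(EXPECTED - types)
        for group, types in seen.items()
        if EXPECTED - types
    }


sheet = (
    "sample,fastq_1,fastq_2,grouping,feature_types\n"
    "sampleA_gex,a_R1.fastq.gz,a_R2.fastq.gz,sampleA,Gene Expression\n"
    "sampleA_atac,b_R1.fastq.gz,b_R2.fastq.gz,sampleA,Chromatin Accessibility\n"
    "sampleB_gex,c_R1.fastq.gz,c_R2.fastq.gz,sampleB,Gene Expression\n"
)
print(incomplete_groups(sheet))
```

In this made-up sheet, sampleB is reported because it has no Chromatin Accessibility library.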

Parameters:

Reference Genome: Select the appropriate reference genome (human or mouse)
CellRanger ARC Version: Specify the CellRanger ARC software version to be used for analysis

Technical Details

Workflow Repository: github.com/FredHutch/nf-cellranger-tools

Citations:

CellRanger: 10X Documentation

Multi-Sample Quality Control (scanpy)

The scanpy utility is a widely used tool for single-cell analysis in Python. This workflow uses a set of standard quality control metrics to filter cells and genes, while also combining multiple datasets into a single output.

Harmony is an efficient algorithm produced by the Broad Institute for integrating multiple data sets. The Python version of the Harmony algorithm (harmonypy) is used in this workflow to combine multiple datasets.

[Figure: Harmony algorithm]

Input Data

The input data for this workflow is a set of single-cell datasets which have been produced by the 10X platform. In particular, the datasets should be in the form of .h5 files.

Datasets may be uploaded directly as "CellRanger Analysis Output" datasets, or they may be analyzed from the outputs of one of the CellRanger pipelines described above.

To properly combine cells from multiple datasets, each dataset needs to be annotated with a unique name. This can be done either by using a name which is embedded in the filename, or by providing a sample sheet CSV.

Sample names will be parsed from filenames using the patterns:

SAMPLENAME_filtered_feature_bc_matrix.h5, or
SAMPLENAME_feature_bc_matrix.h5

If a file samplesheet.csv is provided in the input directory (or added after the dataset is uploaded), the names in that file will be used instead. The format of the samplesheet.csv file is:

sample,file
SampleA,fileA_filtered_feature_bc_matrix.h5
SampleB,fileB_filtered_feature_bc_matrix.h5

Note that when using a samplesheet.csv, the filename does not need to contain the sample name in any way.
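The naming rules above can be sketched in a few lines of Python. This is an illustration of the described behavior, not the workflow's own code: the function names are hypothetical, and returning None for an unmatched or unlisted file is an assumption.

```python
# The longer suffix is listed first because it also ends with the shorter one
SUFFIXES = ("_filtered_feature_bc_matrix.h5", "_feature_bc_matrix.h5")


def sample_name_from_filename(filename):
    """Return the SAMPLENAME prefix, or None if neither pattern matches."""
    for suffix in SUFFIXES:
        if filename.endswith(suffix) and len(filename) > len(suffix):
            return filename[: -len(suffix)]
    return None


def resolve_sample_names(filenames, samplesheet=None):
    """Name each file from the samplesheet (file -> sample) if given, else from its filename."""
    if samplesheet is not None:
        return {name: samplesheet.get(name) for name in filenames}
    return {name: sample_name_from_filename(name) for name in filenames}


print(sample_name_from_filename("SampleA_filtered_feature_bc_matrix.h5"))
print(resolve_sample_names(
    ["fileA_filtered_feature_bc_matrix.h5"],
    {"fileA_filtered_feature_bc_matrix.h5": "SampleA"},
))
```

In the second call the samplesheet entry is used, so the file is named SampleA even though its filename begins with fileA.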

This workflow supports combining multiple input datasets in a single analysis.

Parameters

Minimum Cells per Gene
Minimum number of cells required for a gene to be considered
Minimum Genes per Cell
Minimum number of genes required for a cell to be considered
Maximum Genes per Cell
Maximum number of genes for a cell to be considered
Percent Mitochondrial Cutoff
Maximum percentage of mitochondrial reads for a cell to be considered
Filter RBCs
Remove red blood cells (RBCs) from the dataset (bool)
Filter Platelets
Remove platelets from the dataset (bool)

Analysis Steps

Run Scrublet to identify doublets
Calculate the percentage of reads which are mitochondrial or ribosomal
Annotate cells which are platelets or red blood cells (RBCs) - Data saved at this step: adata_combined.h5ad
Remove doublets
Filter cells using a minimum number of genes (min_genes)
Filter genes using a minimum number of cells (min_cells)
Filter cells using a maximum percentage of mitochondrial reads (percent_mito_cutoff)
Filter out RBCs (optional: filter_rbc)
Filter out platelets (optional: filter_platelets)
Filter cells on a maximum number of genes (max_genes_cutoff)
Log transform the data (log1p) - Data saved at this step: adata_allgenes_log1p.h5ad
Filter to highly variable genes
Regress out the number of genes and the percentage of mitochondrial reads
Scale the data (max_value=10)
Perform PCA
Use harmonypy for batch correction
Perform UMAP, Louvain, and Leiden clustering - Data saved at this step: adata_combined_harmony_hvg.h5ad
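To make the threshold-based filtering steps above concrete, here is a small plain-Python sketch of the order in which the min_genes, min_cells, percent_mito_cutoff, and max_genes_cutoff filters are applied. It is only an illustration: the workflow itself uses scanpy, and several details here are assumptions, including the use of inclusive comparisons, identifying mitochondrial genes by an "MT-" prefix, and computing the per-cell metrics from the unfiltered counts. Doublet removal and the optional RBC and platelet filters are left out.

```python
def filter_cells_and_genes(cells, min_genes, min_cells, percent_mito_cutoff, max_genes_cutoff):
    """Apply the QC thresholds to `cells`, a dict of barcode -> {gene: count}."""

    def n_genes(counts):
        return sum(1 for count in counts.values() if count > 0)

    def percent_mito(counts):
        total = sum(counts.values())
        # Assumption: mitochondrial genes are named with an "MT-" prefix
        mito = sum(count for gene, count in counts.items() if gene.startswith("MT-"))
        return 100.0 * mito / total if total else 0.0

    # Filter cells using a minimum number of genes
    kept = {b: c for b, c in cells.items() if n_genes(c) >= min_genes}

    # Filter genes using a minimum number of cells
    cells_per_gene = {}
    for counts in kept.values():
        for gene, count in counts.items():
            if count > 0:
                cells_per_gene[gene] = cells_per_gene.get(gene, 0) + 1
    keep_genes = {gene for gene, n in cells_per_gene.items() if n >= min_cells}

    # Filter cells using a maximum percentage of mitochondrial reads
    kept = {b: c for b, c in kept.items() if percent_mito(c) <= percent_mito_cutoff}

    # Filter cells on a maximum number of genes
    kept = {b: c for b, c in kept.items() if n_genes(c) <= max_genes_cutoff}

    # Drop the genes that did not pass the min_cells filter
    return {b: {g: n for g, n in c.items() if g in keep_genes} for b, c in kept.items()}


example = {
    "c1": {"GeneA": 5, "GeneB": 3, "MT-CO1": 1},
    "c2": {"GeneA": 1},
    "c3": {"GeneA": 1, "MT-CO1": 9},
    "c4": {"GeneA": 2, "GeneB": 2, "GeneC": 1, "GeneD": 1},
}
result = filter_cells_and_genes(
    example, min_genes=2, min_cells=2, percent_mito_cutoff=20, max_genes_cutoff=3
)
print(result)
```

In this toy example c2 is removed for having too few genes, c3 for a high mitochondrial percentage, and c4 for exceeding the maximum number of genes, leaving only c1.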

References:

Wolock SL, Lopez R, Klein AM. Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data. Cell Syst. 2019 Apr 24;8(4):281-291.e9. doi: 10.1016/j.cels.2018.11.005. Epub 2019 Apr 3. PMID: 30954476; PMCID: PMC6625319.
Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018 Feb 6;19(1):15. doi: 10.1186/s13059-017-1382-0. PMID: 29409532; PMCID: PMC5802054.
Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, Baglaenko Y, Brenner M, Loh PR, Raychaudhuri S. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods. 2019 Dec;16(12):1289-1296. doi: 10.1038/s41592-019-0619-0. Epub 2019 Nov 18. PMID: 31740819; PMCID: PMC6884693.

Single-Cell Azimuth Projection

Single-cell sequencing provides an incredibly detailed description of the genes being expressed by individual cells from a complex tissue. However, it can sometimes be difficult to interpret this information in a coordinated way across multiple specimens, analysis batches, or sampling modalities. The Satija lab published a reference-based approach (Azimuth), which projects newly collected datasets into the multidimensional space established for a set of curated tissue-specific atlases. This approach can be used to quickly annotate cell types and align UMAP ordinations for new datasets to facilitate rapid comparison.

The Azimuth analysis can be run on single-cell gene expression datasets in Cirro, and will produce as an output an updated Seurat (h5seurat) or Scanpy (h5ad) object which can be used for further downstream analysis.

Azimuth Human Motor Cortex Reference Atlas:

[Figure: human brain UMAP]

Azimuth References:

Human Adipose - adiposeref Cells in Reference: 160,075
Human Bone Marrow - bonemarrowref Cells in Reference: 297,627
Human Fetal Development - fetusref Cells in Reference: 377,456
Human Heart - heartref Cells in Reference: 656,509
Human Kidney - kidneyref Cells in Reference: 64,693
Human Lung v2 - lungref Cells in Reference: 584,944
Human Motor Cortex - humancortexref Nuclei in Reference: 76,533
Mouse Motor Cortex - mousecortexref Nuclei in Reference: 159,738
Human Pancreas - pancreasref Cells in Reference: 35,289
Human PBMC - pbmcref Cells in Reference: 161,764
Human Tonsil - tonsilref Cells in Reference: 263,299

Citation:

Hao, Yuhan et al. “Integrated analysis of multimodal single-cell data.” Cell vol. 184,13 (2021): 3573-3587.e29. doi:10.1016/j.cell.2021.04.048

Aggregate CellRanger Outputs

Combine the results from multiple datasets with CellRanger aggr.

Many experiments generate data for multiple samples. Depending on the experimental design, these could be replicates from the same set of cells, cells from different tissues or time points from the same individual, or cells from different individuals. Samples could be processed through different Gel Bead-in Emulsion (GEM) wells or multiplexed within the same GEM well on Chromium instruments. The cellranger aggr pipeline can be used to aggregate samples from these scenarios into a single feature-barcode matrix.

When a single dataset in Cirro contains results from multiple samples, those results can be combined into a single set of outputs. This can be particularly useful when there is a need to project cells from multiple samples into the same t-SNE ordination and gene expression clusters.

Technical Details

Workflow Repository: github.com/FredHutch/nf-cellranger-tools

Citations:

CellRanger: 10X Documentation

Visium: Spatial Transcriptomics

Spatial Transcriptomics datasets generated on the Visium platform with custom probes can be analyzed on Cirro using the Space Ranger software provided by 10X.

Probe Reference:

Before running the analysis, build an analysis reference using the custom probes included in the analysis.

Steps:

Upload the Visium sequencing data as a "10X Single-Cell (FASTQ)" dataset
Upload the image files generated for those samples (see note below)
Open the "Analyze Visium Spatial Transcriptomics" pipeline
Run using the appropriate datasets for (a) FASTQs, (b) images, and (c) analysis reference

Images:

When uploading images, make sure to provide a samplesheet.csv file which matches up the images to the appropriate sequencing dataset. Use columns slide and area to include details on Visium Slide Serial Number and Capture Area. If those columns are omitted, then Space Ranger will be run with the --unknown-slide parameter.

For example:

sample,file,slide,area
sampleA,sampleA_image.tif,V19L01-041,A1
sampleB,sampleB_image.tif,V19L01-041,B1
sampleC,sampleC_image.tif,V19L01-041,D1
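The rule about the slide and area columns can be expressed as a short Python sketch. The function name slide_details is hypothetical, and treating a blank value the same way as an omitted column is an assumption; the documentation only describes the case where the columns are left out.

```python
import csv
import io


def slide_details(samplesheet_text):
    """For each sample, report its slide and area, or flag it as an unknown slide."""
    details = {}
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        slide, area = row.get("slide"), row.get("area")
        if slide and area:
            details[row["sample"]] = {"slide": slide, "area": area}
        else:
            # Corresponds to the case where --unknown-slide would be used
            details[row["sample"]] = {"unknown_slide": True}
    return details


with_slides = (
    "sample,file,slide,area\n"
    "sampleA,sampleA_image.tif,V19L01-041,A1\n"
)
without_slides = "sample,file\nsampleA,sampleA_image.tif\n"
print(slide_details(with_slides))
print(slide_details(without_slides))
```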

Images can be provided as one of:

CytAssist brightfield image
Colorized image (TIFF or JPEG)
Multi-channel, dark-background fluorescence image (TIFF)
Single H&E brightfield image in either TIFF or JPG format

Technical Details

Workflow Repository: github.com/FredHutch/nf-cellranger-tools

Citations:

SpaceRanger: 10X Documentation

Visium: Custom Probes

When analyzing Visium Spatial Transcriptomics data which has been prepared using custom probes, the reference genome must be combined with those probes prior to running that analysis.

Steps:

Upload the custom probe(s) as a "Nucleotide Sequences (FASTA)" dataset (either in a single or multiple files)
Open the "Build SpaceRanger Reference" pipeline and select the uploaded probes
Select the appropriate reference genome (human or mouse) to combine with those probes
Provide a memorable name for the custom reference and click "Run"
Once the custom reference is finished building, it can be used with the "Visium Spatial Transcriptomics (Custom Probes)" pipeline

Technical Details

Workflow Repository: github.com/FredHutch/nf-cellranger-tools

Citations:

SpaceRanger: 10X Documentation

CITE-seq Total Variational Inference

scvi-tools (single-cell variational inference tools) is an analysis package for end-to-end analysis of single-cell omics data primarily developed and maintained by the Yosef Lab at UC Berkeley and the Weizmann Institute of Science.

totalVI (total Variational Inference) provides a flexible generative model of CITE-seq RNA and protein data that can subsequently be used for many common downstream tasks.

Inputs:

Combined RNA and protein data must be uploaded in the form of a MuData multimodal data object (h5mu). Batch correction will be performed on the data, and so the batch information must be included in the input data as a column in the .obs.

After the input MuData file (h5mu) has been uploaded to Cirro, the "CITE-seq Total Variational Inference (scvi-tools/totalVI)" pipeline can be selected from the pipeline catalog.

Parameters:

Protein Layer: Provide the key used for the protein data in the MuData object
RNA Layer: Provide the key used for the RNA data in the MuData object
Batch Key: Provide the key used for the batch information in the MuData object

Citations:

Adam Gayoso, Zoë Steier, Romain Lopez, Jeffrey Regier, Kristopher L Nazor, Aaron Streets, Nir Yosef (2021), Joint probabilistic modeling of single-cell multi-omic data with totalVI, Nature Methods