Pipeline Catalog: Nanopore Sequencing

Introduction to Nanopore Sequencing

Nanopore sequencing is a technology for determining the sequence and chemical modifications present in a sample of DNA or RNA. Each measurement is made on a single DNA/RNA molecule as it passes through a small pore in a membrane, without the need for any intermediate amplification step.

For more information, visit Oxford Nanopore Technologies (ONT).

POD5 File Format:

The POD5 file format was developed by ONT using the Apache Arrow framework to store nanopore sequencing data in an easily accessible way. As with the prior FAST5 format, it includes the raw signal intensity information needed to perform basecalling and identify epigenetic modifications. More information can be found in the POD5 File Format Documentation.

FAST5 File Format:

The FAST5 file format was developed by ONT to store the electrical signal information captured by the sequencing instrument. It is based on HDF5 (Hierarchical Data Format), a highly interoperable file format that provides folder-like organization. FAST5 was the standard file format for ONT data prior to the release of the POD5 file format in December 2022.

Nanopore Basecalling (Dorado)

After collecting data with a nanopore sequencer, the raw signal intensity information often needs to be transformed into the nucleotide sequence information which is expected by typical bioinformatics analysis software.

Oxford Nanopore Technologies (ONT), the instrument manufacturer, has released a tool for basecalling called Dorado. In addition to identifying the sequence of bases in a DNA/RNA molecule, Dorado can be used to identify epigenetic (chemical) modifications made to those bases.

Note: The Dorado software requires GPUs, which must be enabled by the Project administrator (see the Project documentation).

Basecalling Model(s)

Because basecalling performance and accuracy continue to improve significantly, a number of different Dorado models are available. For more information, read the guide to decoding Dorado model names. Each model is specific to:

Analyte Type (e.g. DNA)
Pore Type (e.g. r10.4.1)
Chemistry (e.g. e8.2)
Translocation Speed (e.g. 400bps)
Model Type (e.g. hac)
Model Version Number (e.g. v5.0.0)
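As a rough illustration of how these components fit together, the sketch below splits a DNA model name into its parts. It assumes the six-part layout analyte_pore_chemistry_speed_type@version; the helper name parse_model_name and the DoradoModel tuple are hypothetical, and model names that follow a different layout (some RNA models, for example) would be rejected by this pattern.

```python
import re
from typing import NamedTuple


class DoradoModel(NamedTuple):
    """Components of a model name (hypothetical container for illustration)."""
    analyte: str
    pore: str
    chemistry: str
    speed: str
    model_type: str
    version: str


# Assumed layout: analyte_pore_chemistry_speed_type@version
_MODEL_RE = re.compile(
    r"^(?P<analyte>[a-z]+)"
    r"_(?P<pore>r[0-9.]+)"
    r"_(?P<chemistry>e[0-9.]+)"
    r"_(?P<speed>[0-9]+bps)"
    r"_(?P<model_type>[a-z]+)"
    r"@(?P<version>v[0-9.]+)$"
)


def parse_model_name(name: str) -> DoradoModel:
    """Split a model name into its components, or raise ValueError."""
    match = _MODEL_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised model name: {name!r}")
    return DoradoModel(**match.groupdict())


model = parse_model_name("dna_r10.4.1_e8.2_400bps_hac@v5.0.0")
print(model.pore, model.model_type, model.version)
```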

Analysis Workflow

The analysis workflow used to run the Dorado tool is epi2me-labs/wf-basecalling, published by EPI2ME Labs, a bioinformatics resource from Oxford Nanopore Technologies.

For a longer description of how the workflow functions and a full list of its parameters, see the workflow's documentation page.

This pipeline supports viewing its output using the embedded IGV Track Viewer App directly on the output dataset's Overview page.

Nanopore Variant Calling

SNV, SV and CNV calling, modified base calling, and STR genotyping of human samples.

Oxford Nanopore Technologies provides a workflow for analyzing genomic variants from long-read nanopore sequencing data. The workflow includes:

diploid variant calling (Clair3)
structural variant calling (Sniffles2)
copy number variant calling (Spectre or QDNAseq)
short tandem repeat (STR) expansion genotyping (Straglr)
analysis of modified base calls
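When the workflow is run directly with Nextflow rather than through a launch form, each of the components above is typically switched on with its own option. The sketch below assembles such a command line. The flag names (--bam, --ref, --snp, --sv, --cnv, --str, --mod) are an assumption based on the workflow's public README and should be checked against the current workflow documentation; build_command and the file names are hypothetical.

```python
# Assumed mapping from analysis component to workflow flag; verify against
# the wf-human-variation documentation before relying on it.
SUBWORKFLOW_FLAGS = {
    "snp": "--snp",  # diploid variant calling (Clair3)
    "sv": "--sv",    # structural variant calling (Sniffles2)
    "cnv": "--cnv",  # copy number variant calling (Spectre or QDNAseq)
    "str": "--str",  # STR expansion genotyping (Straglr)
    "mod": "--mod",  # analysis of modified base calls
}


def build_command(bam: str, ref: str, analyses: list[str]) -> list[str]:
    """Return an argv list that enables the requested analyses."""
    unknown = sorted(set(analyses) - set(SUBWORKFLOW_FLAGS))
    if unknown:
        raise ValueError(f"unknown analyses: {unknown}")
    cmd = ["nextflow", "run", "epi2me-labs/wf-human-variation",
           "--bam", bam, "--ref", ref]
    # Emit flags in a fixed order regardless of how they were requested.
    cmd += [flag for name, flag in SUBWORKFLOW_FLAGS.items() if name in analyses]
    return cmd


print(" ".join(build_command("sample.bam", "reference.fa", ["sv", "snp"])))
```

Returning a list rather than a single string keeps the command safe to pass to subprocess.run without shell quoting.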

Please note that this pipeline is intended for use with 20x whole-genome Oxford Nanopore Technologies sequencing data (QDNAseq is an exception; see the workflow documentation for more information). Usage outside of this range (e.g. with adaptive sampling data or lower-coverage inputs) may cause the workflow to terminate with an error or produce unexpected results.
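One way to confirm after a run that a sample actually reached the intended depth is to read the mean coverage from the mosdepth summary file listed in the outputs. The sketch below assumes the usual mosdepth summary layout (tab-separated columns chrom, length, bases, mean, min, max, with a final "total" row). The example values are made up, and the 20x threshold simply mirrors the note above rather than any limit enforced by the workflow.

```python
import csv
import io

# Intended coverage taken from the note above; not a workflow setting.
MIN_MEAN_COVERAGE = 20.0


def mean_total_coverage(summary_text: str) -> float:
    """Return the genome-wide mean depth from mosdepth summary text."""
    reader = csv.DictReader(io.StringIO(summary_text), delimiter="\t")
    for row in reader:
        if row["chrom"] == "total":
            return float(row["mean"])
    raise ValueError("no 'total' row found in summary")


# Made-up example in the assumed mosdepth summary layout.
example = (
    "chrom\tlength\tbases\tmean\tmin\tmax\n"
    "chr1\t1000\t25000\t25.00\t0\t60\n"
    "chr2\t500\t8000\t16.00\t0\t41\n"
    "total\t1500\t33000\t22.00\t0\t60\n"
)

coverage = mean_total_coverage(example)
print(f"mean coverage {coverage:.1f}x, sufficient: {coverage >= MIN_MEAN_COVERAGE}")
```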

User Guide

Workflow repository

epi2me-labs/wf-human-variation

Inputs

Nanopore sequencing data should be uploaded as either Unaligned BAM, or as Nanopore Sequencing POD5. If POD5 files are uploaded, they should be processed by the Dorado basecalling pipeline prior to analysis for variant calling.

Outputs

Output files are either aggregated across all samples or provided per sample. Per-sample files are prefixed with the sample's alias, shown below as {{ alias }}.

Title	File path	Description	Per sample or aggregated
Report of the alignment statistics	{{ alias }}.wf-human-alignment-report.html	Report summarising the results of the alignment statistics for the sample.	per-sample
JSON file of some base statistics	{{ alias }}.stats.json	This JSON file contains base statistics on the reads, mappings, SNPs and SVs for the sample.	per-sample
Report of the SNP workflow	{{ alias }}.wf-human-snp-report.html	Report summarising the results of the SNP subworkflow for the sample.	per-sample
Report of the SV workflow	{{ alias }}.wf-human-sv-report.html	Report summarising the results of the SV subworkflow for the sample.	per-sample
Report of the CNV workflow	{{ alias }}.wf-human-cnv-report.html	Report summarising the results of the CNV subworkflow for the sample.	per-sample
Report of the STR workflow	{{ alias }}.wf-human-str-report.html	Report summarising the results of the short tandem repeat subworkflow for the sample.	per-sample
Short variant VCF	{{ alias }}.wf_snp.vcf.gz	VCF file with the SNPs for the sample.	per-sample
Structural variant VCF	{{ alias }}.wf_sv.vcf.gz	VCF file with the SVs for the sample.	per-sample
Structural variant SNF	{{ alias }}.wf_sv.snf	SNF file with the SVs for the sample, for onward multi-sample SV calling.	per-sample
Copy number variants VCF	{{ alias }}.wf_cnv.vcf.gz	VCF file with the CNVs for the sample.	per-sample
Modified bases BEDMethyl	{{ alias }}.wf_mods.bedmethyl.gz	BED file with the aggregated modification counts for the sample.	per-sample
Modified bases BEDMethyl (haplotype 1)	{{ alias }}.wf_mods.1.bedmethyl.gz	BED file with the aggregated modification counts for haplotype 1 of the sample.	per-sample
Modified bases BEDMethyl (haplotype 2)	{{ alias }}.wf_mods.2.bedmethyl.gz	BED file with the aggregated modification counts for haplotype 2 of the sample.	per-sample
Modified bases BEDMethyl (ungrouped)	{{ alias }}.wf_mods.ungrouped.bedmethyl.gz	BED file with the aggregated modification counts of non-haplotagged reads for the sample.	per-sample
Short tandem repeat VCF	{{ alias }}.wf_str.vcf.gz	VCF file with the STR sites for the sample.	per-sample
Alignment file	{{ alias }}.cram	CRAM or BAM file with the aligned reads for the sample, generated when the input file is unaligned.	per-sample
Alignment file index	{{ alias }}.cram.crai	The index of the resulting CRAM or BAM file with the reads for the sample, generated when the input file is unaligned.	per-sample
Haplotagged alignment file	{{ alias }}.haplotagged.cram	CRAM or BAM file of all input reads with haplotags added by phasing.	per-sample
Haplotagged alignment file index	{{ alias }}.haplotagged.cram.crai	The index of the resulting CRAM or BAM file produced when haplotags have been added by phasing.	per-sample
Mean coverage for each region	{{ alias }}.regions.bed.gz	The mean coverage in the individual regions of the genome in BED format.	per-sample
Coverage per region above the given thresholds	{{ alias }}.thresholds.bed.gz	BED file reporting the number of bases in each region that are covered at or above each threshold value (1x, 10x, 20x and 30x).	per-sample
Distribution of the proportion of total bases covered by a given coverage value	{{ alias }}.mosdepth.global.dist.txt	The cumulative distribution indicating the proportion of total bases covered by a given coverage value, both genome-wide and by sequence.	per-sample
Mean coverage per sequence and target region	{{ alias }}.mosdepth.summary.txt	The summary of mean depths per chromosome and within specified regions per chromosome.	per-sample
BEDgraph of the single-base coverage	{{ alias }}.per-base.bedgraph.gz	The single-base coverage of the genome in BED graph format.	per-sample
Gene level coverage summary	SAMPLE.gene_summary.tsv	A table where each gene of the input BED file has columns describing the percentage of positions along the gene region that are covered to a given threshold, and a column with the average coverage.	per-sample
Haplocheck contamination summary	{{ alias }}.haplocheck.tsv	A table generated by haplocheck, with an estimate of contamination from the MT genome.	per-sample
FAI index of the reference FASTA file	{{ ref }}.fai	FAI Index of the reference FASTA file.	aggregated
GZI index of the reference FASTA file	{{ ref }}.gzi	GZI Index of the reference FASTA file.	aggregated
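The naming convention in the table can be applied programmatically, for example to check which per-sample files are present after a run. The sketch below covers only a subset of the suffixes listed above. The function names, the alias "sampleA", and the idea of checking a local directory are hypothetical.

```python
from pathlib import Path

# A subset of the per-sample suffixes from the table above.
PER_SAMPLE_SUFFIXES = {
    "snp_vcf": ".wf_snp.vcf.gz",
    "sv_vcf": ".wf_sv.vcf.gz",
    "cnv_vcf": ".wf_cnv.vcf.gz",
    "str_vcf": ".wf_str.vcf.gz",
    "mods_bedmethyl": ".wf_mods.bedmethyl.gz",
    "alignment": ".cram",
}


def expected_outputs(alias: str) -> dict[str, str]:
    """Map each output label to its file name for the given alias."""
    return {label: f"{alias}{suffix}"
            for label, suffix in PER_SAMPLE_SUFFIXES.items()}


def missing_outputs(output_dir: str, alias: str) -> list[str]:
    """List expected file names that do not exist in output_dir."""
    root = Path(output_dir)
    return [name for name in expected_outputs(alias).values()
            if not (root / name).exists()]


print(expected_outputs("sampleA")["snp_vcf"])
```

A missing file is not necessarily an error: which outputs appear is likely to depend on which parts of the analysis were run.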