Pipeline Catalog: Nanopore Sequencing
Introduction to Nanopore Sequencing
Nanopore sequencing is a technology for determining the sequence and chemical modifications present in a sample of DNA or RNA. Each measurement is made on a single DNA/RNA molecule as it passes through a small pore in a membrane, without the need for any intermediate amplification step.
For more information, visit Oxford Nanopore Technologies (ONT).
POD5 File Format:
The POD5 file format was developed by ONT using the Apache Arrow framework to store nanopore sequencing data in an easily accessible way. As with the earlier FAST5 format, it includes the raw signal intensity information needed to perform basecalling and identify epigenetic modifications. More information can be found in the POD5 File Format Documentation.
FAST5 File Format:
The FAST5 file format was developed by ONT to store the electrical signal information captured by the sequencing instrument. It is based on HDF5 (Hierarchical Data Format), a highly interoperable file format that provides folder-like organization. FAST5 was the standard file format for ONT data prior to the release of the POD5 file format in December 2022.
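Because FAST5 is an HDF5 container and POD5 is a separate Arrow-based container, the two can usually be told apart from the first bytes of a file. The sketch below is illustrative only: the HDF5 signature is the standard one, while the POD5 signature is an assumption taken from the POD5 format specification and should be checked against the current version. The function name `sniff_signal_format` is hypothetical.

```python
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"  # standard HDF5 signature (FAST5 files are HDF5)
POD5_MAGIC = b"\x8bPOD\r\n\x1a\n"  # assumed POD5 signature; verify against the spec


def sniff_signal_format(path):
    """Return 'fast5', 'pod5', or 'unknown' based on the first 8 bytes of a file.

    Only offset 0 is checked; HDF5 allows the signature at later offsets,
    which this sketch does not handle.
    """
    with open(path, "rb") as handle:
        head = handle.read(8)
    if head == HDF5_MAGIC:
        return "fast5"
    if head == POD5_MAGIC:
        return "pod5"
    return "unknown"
```

A check like this can be useful for catching mislabeled uploads, since file extensions alone are not always reliable.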
Nanopore Basecalling (Dorado)
After collecting data with a nanopore sequencer, the raw signal intensity information often needs to be transformed into the nucleotide sequence information which is expected by typical bioinformatics analysis software.
Oxford Nanopore Technologies (ONT), the instrument manufacturer, has released a tool for basecalling called Dorado. In addition to identifying the sequence of bases in a DNA/RNA molecule, Dorado can be used to identify epigenetic (chemical) modifications made to those bases.
Note: The Dorado software requires GPUs, which must be enabled by the Project administrator (see the Project documentation).
Basecalling Model(s)
Because the performance and accuracy of basecalling continue to improve significantly, a number of different Dorado models are available. For more information, read the guide to decoding Dorado model names, which are specific to:
- Analyte Type (e.g. `dna`)
- Pore Type (e.g. `r10.4.1`)
- Chemistry (e.g. `e8.2`)
- Translocation Speed (e.g. `400bps`)
- Model Type (e.g. `hac`)
- Model Version Number (e.g. `v5.0.0`)
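As an illustration of how these components fit together, the sketch below splits a DNA model name into its parts. It assumes names follow the pattern `{analyte}_{pore}_{chemistry}_{speed}_{type}@{version}`, for example `dna_r10.4.1_e8.2_400bps_hac@v5.0.0`. Other model families (such as RNA or modification models) may use different layouts that this pattern will reject. `DoradoModel` and `parse_model_name` are hypothetical names, not part of Dorado.

```python
import re
from typing import NamedTuple


class DoradoModel(NamedTuple):
    analyte: str
    pore: str
    chemistry: str
    speed: str
    model_type: str
    version: str


# Assumed layout: analyte_pore_chemistry_speed_type@version
_MODEL_RE = re.compile(
    r"^(?P<analyte>[a-z]+)_(?P<pore>r[\d.]+)_(?P<chemistry>e[\d.]+)_"
    r"(?P<speed>\d+bps)_(?P<model_type>[a-z]+)@(?P<version>v[\d.]+)$"
)


def parse_model_name(name: str) -> DoradoModel:
    """Split a model name into its components, or raise ValueError."""
    match = _MODEL_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised model name: {name!r}")
    return DoradoModel(**match.groupdict())


model = parse_model_name("dna_r10.4.1_e8.2_400bps_hac@v5.0.0")
print(model.pore, model.model_type, model.version)
```

Raising an error on unrecognised names, rather than guessing, makes it obvious when a model falls outside the assumed pattern.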
Analysis Workflow
The analysis workflow used to run the Dorado tool is epi2me-labs/wf-basecalling, published by EPI2ME Labs, a bioinformatics resource from Oxford Nanopore Technologies.
For a longer description of how the workflow functions, and all of the various parameters, see their documentation page.
This pipeline supports viewing its output using the embedded IGV Track Viewer App directly on the output dataset's Overview page.
Nanopore Variant Calling
This pipeline performs SNV, SV, and CNV calling, modified base calling, and STR genotyping of human samples.
Oxford Nanopore Technologies provides an analysis workflow for the analysis of genomic variants from long-read nanopore sequencing data. This analysis workflow includes:
- diploid variant calling (Clair3)
- structural variant calling (Sniffles2)
- copy number variant calling (Spectre or QDNAseq)
- short tandem repeat (STR) expansion genotyping (Straglr)
- analysis of modified base calls
Please note that this pipeline is intended for use with 20x whole-genome Oxford Nanopore Technologies sequencing data (QDNAseq is an exception; please see this section for more information). Usage outside of this (e.g. with adaptive sampling data, or with lower-coverage inputs) may cause the workflow to terminate with an error or produce unexpected results.
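One way to check whether a sample meets the intended coverage is to read the mean depth from a mosdepth summary file, such as the `{{ alias }}.mosdepth.summary.txt` output of this workflow. The sketch below assumes the standard mosdepth summary layout (tab-separated columns `chrom`, `length`, `bases`, `mean`, `min`, `max`, with a final `total` row). The function names and the example values are hypothetical.

```python
import csv
import io


def mean_coverage(summary_text, row_name="total"):
    """Return the mean depth for one row of a mosdepth summary table."""
    reader = csv.DictReader(io.StringIO(summary_text), delimiter="\t")
    for row in reader:
        if row["chrom"] == row_name:
            return float(row["mean"])
    raise KeyError(f"row {row_name!r} not found in summary")


def meets_target(summary_text, target=20.0):
    """True if the genome-wide mean depth is at or above the target."""
    return mean_coverage(summary_text) >= target


# Made-up example data in the assumed layout.
example = (
    "chrom\tlength\tbases\tmean\tmin\tmax\n"
    "chr1\t1000\t25000\t25.00\t0\t60\n"
    "chr2\t1000\t15000\t15.00\t0\t40\n"
    "total\t2000\t40000\t20.00\t0\t60\n"
)
print(meets_target(example))
```

A genome-wide mean can hide poorly covered regions, so the per-region and threshold outputs are still worth inspecting.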
User Guide
Workflow repository
epi2me-labs/wf-human-variation
Inputs
Nanopore sequencing data should be uploaded as either Unaligned BAM, or as Nanopore Sequencing POD5. If POD5 files are uploaded, they should be processed by the Dorado basecalling pipeline prior to analysis for variant calling.
Outputs
Output files are either aggregated (covering all samples) or provided per sample. Per-sample files are prefixed with the sample alias, represented below as {{ alias }}.
Title | File path | Description | Per sample or aggregated |
---|---|---|---|
Report of the alignment statistics | {{ alias }}.wf-human-alignment-report.html | Report summarising the results of the alignment statistics for the sample. | per-sample |
JSON file of some base statistics | {{ alias }}.stats.json | This JSON file contains base statistics on the reads, mappings, SNPs and SVs for the sample. | per-sample |
Report of the SNP workflow | {{ alias }}.wf-human-snp-report.html | Report summarising the results of the SNP subworkflow for the sample. | per-sample |
Report of the SV workflow | {{ alias }}.wf-human-sv-report.html | Report summarising the results of the SV subworkflow for the sample. | per-sample |
Report of the CNV workflow | {{ alias }}.wf-human-cnv-report.html | Report summarising the results of the CNV subworkflow for the sample. | per-sample |
Report of the STR workflow | {{ alias }}.wf-human-str-report.html | Report summarising the results of the short tandem repeat subworkflow for the sample. | per-sample |
Short variant VCF | {{ alias }}.wf_snp.vcf.gz | VCF file with the SNPs for the sample. | per-sample |
Structural variant VCF | {{ alias }}.wf_sv.vcf.gz | VCF file with the SVs for the sample. | per-sample |
Structural variant SNF | {{ alias }}.wf_sv.snf | SNF file with the SVs for the sample, for onward multi-sample SV calling. | per-sample |
Copy number variants VCF | {{ alias }}.wf_cnv.vcf.gz | VCF file with the CNV for the sample. | per-sample |
Modified bases BEDMethyl | {{ alias }}.wf_mods.bedmethyl.gz | BED file with the aggregated modification counts for the sample. | per-sample |
Modified bases BEDMethyl (haplotype 1) | {{ alias }}.wf_mods.1.bedmethyl.gz | BED file with the aggregated modification counts for haplotype 1 of the sample. | per-sample |
Modified bases BEDMethyl (haplotype 2) | {{ alias }}.wf_mods.2.bedmethyl.gz | BED file with the aggregated modification counts for haplotype 2 of the sample. | per-sample |
Modified bases BEDMethyl (ungrouped) | {{ alias }}.wf_mods.ungrouped.bedmethyl.gz | BED file with the aggregated modification counts of non-haplotagged reads for the sample. | per-sample |
Short tandem repeat VCF | {{ alias }}.wf_str.vcf.gz | VCF file with the STR sites for the sample. | per-sample |
Alignment file | {{ alias }}.cram | CRAM or BAM file with the aligned reads for the sample, generated when the input file is unaligned. | per-sample |
Alignment file index | {{ alias }}.cram.crai | The index of the resulting CRAM or BAM file with the reads for the sample, generated when the input file is unaligned. | per-sample |
Haplotagged alignment file | {{ alias }}.haplotagged.cram | CRAM or BAM file of all input reads with haplotags added by phasing. | per-sample |
Haplotagged alignment file index | {{ alias }}.haplotagged.cram.crai | The index of the resulting CRAM or BAM file produced when haplotags have been added by phasing. | per-sample |
Mean coverage for each region | {{ alias }}.regions.bed.gz | The mean coverage in the individual regions of the genome in BED format. | per-sample |
Coverage per region above the given thresholds | {{ alias }}.thresholds.bed.gz | BED file reporting the number of bases in each region that are covered at or above each threshold value (1x, 10x, 20x and 30x). | per-sample |
Distribution of the proportion of total bases covered by a given coverage value | {{ alias }}.mosdepth.global.dist.txt | The cumulative distribution indicating the proportion of total bases covered by a given coverage value, both genome-wide and by sequence. | per-sample |
Mean coverage per sequence and target region | {{ alias }}.mosdepth.summary.txt | The summary of mean depths per chromosome and within specified regions per chromosome. | per-sample |
BEDgraph of the single-base coverage | {{ alias }}.per-base.bedgraph.gz | The single-base coverage of the genome in BED graph format. | per-sample |
Gene level coverage summary | SAMPLE.gene_summary.tsv | A table where each gene of the input BED file has columns describing the percentage of positions along the gene region that are covered to a given threshold, and a column with the average coverage. | per-sample |
Haplocheck contamination summary | {{ alias }}.haplocheck.tsv | A table generated by haplocheck, with an estimate of contamination from the MT genome. | per-sample |
FAI index of the reference FASTA file | {{ ref }}.fai | FAI Index of the reference FASTA file. | aggregated |
GZI index of the reference FASTA file | {{ ref }}.gzi | GZI Index of the reference FASTA file. | aggregated |
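
Since per-sample files share the `{{ alias }}` prefix, an output file can be mapped back to its sample and type by matching the suffixes listed in the table above. The following is a minimal sketch covering only a subset of those suffixes. `OUTPUT_SUFFIXES` and `classify_output` are hypothetical names, and the example filenames are made up.

```python
# Subset of the per-sample suffixes from the outputs table.
OUTPUT_SUFFIXES = {
    ".wf_snp.vcf.gz": "Short variant VCF",
    ".wf_sv.vcf.gz": "Structural variant VCF",
    ".wf_cnv.vcf.gz": "Copy number variants VCF",
    ".wf_str.vcf.gz": "Short tandem repeat VCF",
    ".wf_mods.bedmethyl.gz": "Modified bases BEDMethyl",
    ".haplotagged.cram": "Haplotagged alignment file",
    ".cram": "Alignment file",
}


def classify_output(filename):
    """Return (alias, title) for a known per-sample output, else None."""
    # Try longer suffixes first so '.haplotagged.cram' wins over '.cram'.
    for suffix in sorted(OUTPUT_SUFFIXES, key=len, reverse=True):
        if filename.endswith(suffix):
            return filename[: -len(suffix)], OUTPUT_SUFFIXES[suffix]
    return None


for name in ["sampleA.wf_snp.vcf.gz", "sampleA.haplotagged.cram", "notes.txt"]:
    print(name, "->", classify_output(name))
```

Matching the longest suffix first avoids treating a haplotagged alignment as a plain alignment file whose alias ends in `.haplotagged`.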