Standard Reference Files

Workflows used to analyze genomic data often rely on a set of standard references such as:

  • Genome sequences for commonly studied organisms (e.g. Homo sapiens GRCh38)
  • Compiled alignment indexes for aligners like BWA, STAR, or Salmon
  • Annotations of gene and transcript locations

To save workflow developers the trouble of setting up those references independently in each account, a set of standard reference files is hosted in a location that any Cirro project can access.

iGenomes

The iGenomes are a collection of reference genomes which were originally hosted by Illumina but have since been moved to the Registry of Open Data on AWS. A mirror of the iGenomes files is provided within Cirro so that these large files do not need to be transferred or copied between AWS S3 regions during execution, which saves considerable time and money.

The collection provides reference sequences and annotation files for commonly analyzed organisms, downloaded from Ensembl, NCBI, or UCSC. Chromosome names have been simplified and made consistent with the download source. Each iGenome is available as a compressed file containing the sequences and annotation files for a single genomic build of an organism.

A broad collection of nf-core workflows use the iGenomes as the canonical source of reference genome information.

Any nf-core pipeline can be set up to use the Cirro mirror of iGenomes using the single parameter:

"igenomes_base": "s3://pubweb-references/igenomes"
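
As a minimal sketch of how this parameter might be combined with others, the Python snippet below builds a parameter dictionary that points at the Cirro mirror. The `genome` key follows the usual nf-core convention for selecting a build and is an assumption here, as is the helper name `build_params`; check the documentation of the specific pipeline for the keys it accepts.

```python
import json

# Cirro mirror of iGenomes, as given above.
IGENOMES_BASE = "s3://pubweb-references/igenomes"


def build_params(genome: str, **extra) -> dict:
    """Return pipeline parameters that point at the Cirro iGenomes mirror.

    `genome` is assumed to be the nf-core parameter that selects a build
    (for example "GRCh38"); `build_params` itself is a hypothetical helper.
    """
    params = {"igenomes_base": IGENOMES_BASE, "genome": genome}
    params.update(extra)  # any additional pipeline-specific parameters
    return params


params = build_params("GRCh38")
print(json.dumps(params, indent=2))
```

Outside of Cirro, a dictionary like this could be written to a JSON file and supplied to Nextflow with the `-params-file` option.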

For more information on the location of specific resources within the iGenomes package, please consult the excellent iGenomes configuration tool which has been provided by the hero of bioinformatics Phil Ewels.

GATK References

Beyond the GATK references provided within the Cirro iGenomes mirror, workflow developers may wish to access files sourced directly from the GATK Resource Bundle. To save the time and expense of pulling files from Google Cloud into AWS, these files have been mirrored in an S3 bucket which is available to all workflows in Cirro.

These reference files include:

WholeGenomeGermlineSingleSample.references (Struct):

{
    "contamination_sites_ud": "s3://pubweb-references/GATK/hg38/v0/contamination-resources/1000g/1000g.phase3.100k.b38.vcf.gz.dat.UD",
    "contamination_sites_bed": "s3://pubweb-references/GATK/hg38/v0/contamination-resources/1000g/1000g.phase3.100k.b38.vcf.gz.dat.bed",
    "contamination_sites_mu": "s3://pubweb-references/GATK/hg38/v0/contamination-resources/1000g/1000g.phase3.100k.b38.vcf.gz.dat.mu",
    "calling_interval_list": "s3://pubweb-references/GATK/hg38/v0/wgs_calling_regions.hg38.interval_list",
    "reference_fasta": {
        "ref_dict": "s3://pubweb-references/GATK/hg38/v0/dragen_reference/Homo_sapiens_assembly38_masked.dict",
        "ref_fasta": "s3://pubweb-references/GATK/hg38/v0/dragen_reference/Homo_sapiens_assembly38_masked.fasta",
        "ref_fasta_index": "s3://pubweb-references/GATK/hg38/v0/dragen_reference/Homo_sapiens_assembly38_masked.fasta.fai",
        "ref_sa": "s3://pubweb-references/GATK/hg38/v0/dragen_reference/Homo_sapiens_assembly38_masked.fasta.64.sa",
        "ref_alt": "s3://pubweb-references/GATK/hg38/v0/dragen_reference/Homo_sapiens_assembly38_masked.fasta.alt",
        "ref_amb": "s3://pubweb-references/GATK/hg38/v0/dragen_reference/Homo_sapiens_assembly38_masked.fasta.64.amb",
        "ref_bwt": "s3://pubweb-references/GATK/hg38/v0/dragen_reference/Homo_sapiens_assembly38_masked.fasta.64.bwt",
        "ref_ann": "s3://pubweb-references/GATK/hg38/v0/dragen_reference/Homo_sapiens_assembly38_masked.fasta.64.ann",
        "ref_pac": "s3://pubweb-references/GATK/hg38/v0/dragen_reference/Homo_sapiens_assembly38_masked.fasta.64.pac"
    },
    "known_indels_sites_vcfs": [
        "s3://pubweb-references/GATK/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
        "s3://pubweb-references/GATK/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz"
    ],
    "known_indels_sites_indices": [
        "s3://pubweb-references/GATK/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi",
        "s3://pubweb-references/GATK/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi"
    ],
    "dbsnp_vcf": "s3://pubweb-references/GATK/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf",
    "dbsnp_vcf_index": "s3://pubweb-references/GATK/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx",
    "evaluation_interval_list": "s3://pubweb-references/GATK/hg38/v0/wgs_evaluation_regions.hg38.interval_list",
    "haplotype_database_file": "s3://pubweb-references/GATK/hg38/v0/Homo_sapiens_assembly38.haplotype_database.txt"
}
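
Because this struct mixes plain strings, a nested object, and lists, it can be useful to flatten it before checking paths. The sketch below walks a shortened excerpt of the struct and confirms that every entry sits under the mirrored GATK prefix. The function name `iter_uris` and the `PREFIX` constant are illustrative choices, not part of Cirro or GATK.

```python
from typing import Any, Iterator


def iter_uris(node: Any) -> Iterator[str]:
    """Yield every string leaf found in a nested dict/list structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_uris(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_uris(item)


# Shortened excerpt of the struct above (not the full set of keys).
references = {
    "calling_interval_list": "s3://pubweb-references/GATK/hg38/v0/wgs_calling_regions.hg38.interval_list",
    "reference_fasta": {
        "ref_dict": "s3://pubweb-references/GATK/hg38/v0/dragen_reference/Homo_sapiens_assembly38_masked.dict",
    },
    "known_indels_sites_vcfs": [
        "s3://pubweb-references/GATK/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
        "s3://pubweb-references/GATK/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz",
    ],
}

PREFIX = "s3://pubweb-references/GATK/hg38/v0/"  # common prefix of the paths listed above
uris = list(iter_uris(references))
outside = [uri for uri in uris if not uri.startswith(PREFIX)]
print(f"{len(uris)} URIs found, {len(outside)} outside the expected prefix")
```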

PreProcessingForVariantDiscovery_GATK4:

{
  "ref_dict": "s3://pubweb-references/GATK/hg38/v0/Homo_sapiens_assembly38.dict",
  "ref_fasta": "s3://pubweb-references/GATK/hg38/v0/Homo_sapiens_assembly38.fasta",
  "ref_fasta_index": "s3://pubweb-references/GATK/hg38/v0/Homo_sapiens_assembly38.fasta.fai",
  "ref_alt": "s3://pubweb-references/GATK/hg38/v0/Homo_sapiens_assembly38.fasta.alt",
  "ref_sa": "s3://pubweb-references/GATK/hg38/v0/Homo_sapiens_assembly38.fasta.sa",
  "ref_amb": "s3://pubweb-references/GATK/hg38/v0/Homo_sapiens_assembly38.fasta.amb",
  "ref_bwt": "s3://pubweb-references/GATK/hg38/v0/Homo_sapiens_assembly38.fasta.bwt",
  "ref_ann": "s3://pubweb-references/GATK/hg38/v0/Homo_sapiens_assembly38.fasta.ann",
  "ref_pac": "s3://pubweb-references/GATK/hg38/v0/Homo_sapiens_assembly38.fasta.pac",
  "dbSNP_vcf": "s3://pubweb-references/GATK/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf",
  "dbSNP_vcf_index": "s3://pubweb-references/GATK/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx",
  "known_indels_sites_VCFs": "s3://pubweb-references/GATK/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz,s3://pubweb-references/GATK/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz",
  "known_indels_sites_indices": "s3://pubweb-references/GATK/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi,s3://pubweb-references/GATK/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi"
}
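
Note that in this second set the known-indels VCFs and their indices are given as comma-separated strings rather than lists. A small sketch of splitting those strings and pairing each VCF with its index is shown below. The helper name `pair_vcfs_with_indices` is hypothetical, and the check that each index is the VCF path plus `.tbi` is an assumption based on the paths listed above.

```python
from typing import List, Tuple

known_indels_sites_VCFs = (
    "s3://pubweb-references/GATK/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz,"
    "s3://pubweb-references/GATK/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz"
)
known_indels_sites_indices = (
    "s3://pubweb-references/GATK/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi,"
    "s3://pubweb-references/GATK/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi"
)


def pair_vcfs_with_indices(vcfs: str, indices: str) -> List[Tuple[str, str]]:
    """Split two comma-separated strings and pair entries by position."""
    vcf_list = [v.strip() for v in vcfs.split(",") if v.strip()]
    index_list = [i.strip() for i in indices.split(",") if i.strip()]
    if len(vcf_list) != len(index_list):
        raise ValueError("Number of VCFs and indices differ")
    pairs = list(zip(vcf_list, index_list))
    for vcf, index in pairs:
        # Assumption: each index is named after its VCF with a .tbi suffix.
        if index != vcf + ".tbi":
            raise ValueError(f"Index {index} does not match VCF {vcf}")
    return pairs


pairs = pair_vcfs_with_indices(known_indels_sites_VCFs, known_indels_sites_indices)
for vcf, index in pairs:
    print(vcf, "->", index)
```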

If there are any other elements of the GATK reference bundle that you require, please reach out to support@cirro.bio.

10X CellRanger

Reference genome files and probe sets for the 10X CellRanger and SpaceRanger analysis software can be found at:

s3://pubweb-references/cellranger/refdata-gex-GRCh38-2020-A/
s3://pubweb-references/cellranger/refdata-gex-mm10-2020-A/
s3://pubweb-references/cellranger/refdata-gex-GRCh38-and-mm10-2020-A/
s3://pubweb-references/cellranger/refdata-cellranger-vdj-GRCh38-alts-ensembl-7.0.0/
s3://pubweb-references/cellranger/refdata-cellranger-vdj-GRCm38-alts-ensembl-7.0.0/
s3://pubweb-references/cellranger/flex/Chromium_Human_Transcriptome_Probe_Set_v1.0.1_GRCh38-2020-A.csv
s3://pubweb-references/cellranger/flex/Chromium_Mouse_Transcriptome_Probe_Set_v1.0.1_mm10-2020-A.csv
s3://pubweb-references/spaceranger/probesets/Visium_Human_Transcriptome_Probe_Set_v1.0_GRCh38-2020-A.csv
s3://pubweb-references/spaceranger/probesets/Visium_Human_Transcriptome_Probe_Set_v2.0_GRCh38-2020-A.csv
s3://pubweb-references/spaceranger/probesets/Visium_Mouse_Transcriptome_Probe_Set_v1.0_mm10-2020-A.csv
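
The list above mixes directory-style references (paths ending in `/`) with single probe set files (`.csv`). When handling these paths in a script, it can help to split each URI into bucket and key and to tell the two kinds apart. The following is a minimal sketch using only the Python standard library; `split_s3_uri` and `is_directory_reference` are hypothetical helper names.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3:// URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def is_directory_reference(uri: str) -> bool:
    """Treat a trailing slash as marking a folder of files rather than one file."""
    return uri.endswith("/")


genome_ref = "s3://pubweb-references/cellranger/refdata-gex-GRCh38-2020-A/"
probe_set = "s3://pubweb-references/cellranger/flex/Chromium_Human_Transcriptome_Probe_Set_v1.0.1_GRCh38-2020-A.csv"

bucket, key = split_s3_uri(genome_ref)
print(bucket, key, is_directory_reference(genome_ref))
print(split_s3_uri(probe_set), is_directory_reference(probe_set))
```

A directory-style reference needs to be staged recursively, whereas a probe set can be staged as a single file.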

Adding Custom References

If you are adding a pipeline that uses very large reference files which may be shared across multiple projects, it may be worthwhile to add them as custom references in Cirro. Pulling large files from sources such as HTTPS, GCP, or FTP during workflow execution is generally discouraged because of the added time and cost.

To request that an additional custom reference be added to Cirro, please send an email to support@cirro.bio.