Analyzing a dataset

Running analysis on a dataset

from cirro import DataPortal

portal = DataPortal()
# Get the project by name
project = portal.get_project_by_name('Test Project')
print(f"Project '{project.name}' contains {len(project.list_datasets()):,} datasets")

# Get a particular dataset from that project
dataset = project.get_dataset_by_name('Test dataset for variant calling')
print(f"Dataset '{dataset.name}' contains {len(dataset.list_files()):,} files")

# Get the process to run on the dataset
process = portal.get_process_by_id('process-nf-core-sarek-3-0-1')
print(f"Using the '{process.name}' process (ID: {process.id})")
Project 'Test Project' contains 104 datasets
Dataset 'Test dataset for variant calling' contains 2 files
Using the 'Variant Calling (nf-core/sarek)' process (ID: process-nf-core-sarek-3-0-1)

Look up the parameters that the process accepts. You'll set values for these parameters later, using the key shown for each entry.

param_spec = process.get_parameter_spec()
param_spec.print()
Parameters:
    Experiment Design (Group)
        Reference Genome (key=genome, default=GATK.GRCh38, type=string, enum=['GATK.GRCh38', 'GATK.GRCh37', 'GRCm38'])
        Whole Exome/Targeted Gene Panel Assay (key=wes, type=boolean, description=Please indicate if your data was generated using a capture kit.)
        Genomic intervals (key=intervals, type=string, description=Target bed file in case of whole exome or targeted sequencing or intervals file for parallelization.)
        Variant Calling Type (key=analysis_type, default=Germline Variant Calling, enum=['Germline Variant Calling', 'Somatic Variant Calling'])
    Variant Annotation (Group)
        Annotation tool(s) (key=annotation_tool, type=array, description=Please select one or both variant annotation tools.)
    Read Trimming Options (Group)
        Trim reads using Trim-Galore? (key=trim_fastq, type=boolean)

Look up the references you'll need to use as input parameters. See the Using_references notebook for more information on how to find references.

references = project.list_references('Genome Regions (BED)')
# Prefix every entry, including the first, with the " - " bullet
print("The BED references available are:\n - " + "\n - ".join(map(str, references)))
reference_library = project.get_reference_by_name('GRCh38_Chr20', 'Genome Regions (BED)')

print(f"\nThe reference library we are using is: {reference_library.name}\nThe absolute path to the file is: {reference_library.absolute_path}")
The BED references available are:
 - GRCh38_Chr20
 - NimbleGen_SeqCap_EZ_Exome_primary-capture_hg19_chr17
 - wgs_calling_regions.hg19.bed
 - wgs_calling_regions.hg38.bed

The reference library we are using is: GRCh38_Chr20
The absolute path to the file is: s3://z-9a31492a-e679-43ce-9f06-d84213c8f7f7/resources/data/references/genome_bed/GRCh38_Chr20/regions.bed
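
The absolute path printed above is an S3 URI. If you need the bucket and object key separately (for example, to pass them to another tool), a minimal sketch using only the Python standard library could look like the following. The helper name split_s3_uri is hypothetical and is not part of the Cirro SDK.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3:// URI into (bucket, key). Hypothetical helper, not an SDK function."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the network location; the key is the path without its leading slash
    return parsed.netloc, parsed.path.lstrip("/")


# The URI is copied from the output above
uri = "s3://z-9a31492a-e679-43ce-9f06-d84213c8f7f7/resources/data/references/genome_bed/GRCh38_Chr20/regions.bed"
bucket, key = split_s3_uri(uri)
print(bucket)
print(key)
```

In the notebook itself you would pass reference_library.absolute_path instead of the literal string.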

Define the parameters you want to use. The dictionary keys come from the param_spec output above (the key= value shown for each entry).

params = {
    'genome': 'GATK.GRCh38',
    'wes': True,
    'intervals': reference_library.absolute_path,
    'trim_fastq': False,
    'annotation_tool': ['cnvkit', 'deepvariant']
}
params
{'genome': 'GATK.GRCh38',
 'wes': True,
 'intervals': 's3://z-9a31492a-e679-43ce-9f06-d84213c8f7f7/resources/data/references/genome_bed/GRCh38_Chr20/regions.bed',
 'trim_fastq': False,
 'annotation_tool': ['cnvkit', 'deepvariant']}
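
To catch typos in keys before going further, you can compare the dictionary against the keys listed in the parameter spec. The sketch below is only an illustration: the SPEC table is copied by hand from the printed spec above rather than read from the SDK, and check_params is a hypothetical helper, not a Cirro function.

```python
# Hand-copied from the param_spec.print() output above (an assumption of this
# sketch; it is not produced by an SDK call). Entries without a printed type
# have no "type" field here.
SPEC = {
    "genome": {"type": str, "enum": ["GATK.GRCh38", "GATK.GRCh37", "GRCm38"]},
    "wes": {"type": bool},
    "intervals": {"type": str},
    "analysis_type": {"enum": ["Germline Variant Calling", "Somatic Variant Calling"]},
    "annotation_tool": {"type": list},
    "trim_fastq": {"type": bool},
}


def check_params(params, spec=SPEC):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for key, value in params.items():
        entry = spec.get(key)
        if entry is None:
            problems.append(f"unknown key: {key}")
            continue
        expected = entry.get("type")
        if expected is not None and not isinstance(value, expected):
            problems.append(f"{key}: expected {expected.__name__}, got {type(value).__name__}")
        allowed = entry.get("enum")
        if allowed is not None and value not in allowed:
            problems.append(f"{key}: {value!r} is not one of {allowed}")
    return problems


params = {
    "genome": "GATK.GRCh38",
    "wes": True,
    "intervals": "s3://z-9a31492a-e679-43ce-9f06-d84213c8f7f7/resources/data/references/genome_bed/GRCh38_Chr20/regions.bed",
    "trim_fastq": False,
    "annotation_tool": ["cnvkit", "deepvariant"],
}
print(check_params(params))
print(check_params({"library": 1}))
```

This only checks key names, basic types, and enum membership. It is not a substitute for the validation the client performs.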

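Before the analysis is submitted, the client validates the parameters automatically. You can also validate them manually using validate_params. The call here passes a key, library, that does not appear in the spec printed above, and prints any error that is raised.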

try:
    param_spec.validate_params({
        'library': 1
    })
except Exception as e:
    print(e)
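
If you want to gate submission on the result, the try/except pattern above can be wrapped in a small helper. This is a sketch under the assumption, suggested by the cell above, that the validator raises an exception when the parameters are invalid. The names first_validation_error and stub_validate are hypothetical; the stub stands in for param_spec.validate_params so the example runs without a Cirro connection.

```python
from typing import Any, Callable, Dict, Optional


def first_validation_error(
    validate: Callable[[Dict[str, Any]], Any],
    params: Dict[str, Any],
) -> Optional[str]:
    """Return None if validate(params) succeeds, otherwise the error message."""
    try:
        validate(params)
    except Exception as exc:
        return str(exc)
    return None


def stub_validate(params: Dict[str, Any]) -> None:
    """Stand-in validator for this sketch only; rejects keys it does not know."""
    known = {"genome", "wes", "intervals", "analysis_type", "annotation_tool", "trim_fastq"}
    unknown = sorted(set(params) - known)
    if unknown:
        raise ValueError(f"unexpected parameters: {unknown}")


error = first_validation_error(stub_validate, {"library": 1})
if error is None:
    print("Parameters look valid")
else:
    print(f"Fix before submitting: {error}")
```

In the notebook you would call first_validation_error(param_spec.validate_params, params) and only run the analysis when it returns None.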

Run the analysis using the process, dataset, project, and parameters you defined above.

# Run the analysis, specifying a name and description for the resulting dataset
new_dataset_id = dataset.run_analysis(
    name='Variant Calling Analysis',
    description='Test from SDK',
    process='process-nf-core-sarek-3-0-1',
    params=params
)
print(new_dataset_id)
71ec598c-368b-47a5-84c8-c209739b050a
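
The ID printed above has the form of a UUID. If you store the returned ID for later lookups, a quick sanity check can catch a mix-up such as saving the dataset name instead. This sketch assumes that returned IDs are canonical UUID strings, which the single output above suggests but does not guarantee. The helper looks_like_uuid is hypothetical.

```python
import uuid


def looks_like_uuid(value) -> bool:
    """True if value is a UUID string in canonical hyphenated form."""
    try:
        # Round-trip through uuid.UUID so that only the canonical form is accepted
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


print(looks_like_uuid("71ec598c-368b-47a5-84c8-c209739b050a"))
print(looks_like_uuid("Variant Calling Analysis"))
```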