Configuring a Pipeline for Cirro

The Cirro pipeline configuration consists of five files:

  • process-form.json: Form used to collect user input prior to workflow launch
  • process-input.json: Mapping of form contents to workflow parameters/options
  • process-output.json: Listing of any workflow output files which can be directly visualized by Cirro
  • preprocess.py: Optional Python script which can be used to evaluate arbitrary configuration logic immediately prior to workflow launch
  • process-compute.config: Any additional compute configuration needed (used as nextflow.config for Nextflow, or as the config file for Cromwell/WDL)

You can choose to use the Cirro Pipeline Configurator Tool to set up the configuration files for your pipeline, or you can manually create them.

Pipeline Configurator Tool

The Cirro Pipeline Configurator Tool is designed to help users create and manage pipeline configurations for integrating their pipelines into Cirro.

Using the Tool

The tool consists of three primary tabs:

  • Input Parameters: Set up the input parameters which are used to run the workflow on a particular dataset
  • Output Files: Specify any output files which can be visualized in Cirro
  • Preprocess: Set up a Python script which can be used to modify the input parameters or inputs to the workflow

This tool doesn't include the process-compute.config file, which is used to override the pipeline's configuration. See the Pipeline Configuration Override section for more information.

You can resume from an existing configuration by clicking the "Load from File" button at the top right of the page.

Input Parameters

One of the most useful aspects of the Pipeline Configurator Tool is setting up the input parameters which are provided to the workflow. This functionally translates to:

  • Setting up elements in process-form.json which can render a web form to the user
  • Specifying which elements should be mapped to input parameters within process-input.json

To set up these input parameters, switch to the "Input Parameters" tab of the tool and click "Add Parameter".

You can see a live preview of the form as well as the JSON configuration which will be generated on the right side of the page.

Parameter Types

The dropdown will contain the parameter types described below.

Form Inputs:

Users will be able to fill in the values for these parameters using a web form.

  • Text: A simple text input field
  • Integer Number: A number input field which only accepts integers (e.g. 1, 2, 3)
  • Decimal Number: A number input field which accepts decimal numbers (e.g. 1.0, 2.5, 3.14)
  • Checkbox: A boolean input, which is represented as a checkbox
  • Dataset File: A file input, which allows the user to select files from the datasets that they have selected. This is useful for pipelines which do not run on the complete contents of a dataset, but which may want to allow the user to select a subset of files for analysis.
  • Reference File: A file input, which allows the user to select a reference file that is available in the project
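
These types map onto standard JSON Schema types in the generated process-form.json. As a minimal sketch (parameter names invented for illustration), a form with a text, integer, decimal, and checkbox input might look like:

{
    "form": {
        "properties": {
            "sample_label": {"title": "Sample Label", "type": "string"},
            "min_reads": {"title": "Minimum Reads", "type": "integer"},
            "error_rate": {"title": "Error Rate", "type": "number"},
            "trim_adapters": {"title": "Trim Adapters", "type": "boolean"}
        }
    },
    "ui": {}
}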

Other:

These parameters are not user-editable, but are set to a specific value.

  • Input Dataset(s) Path: The value of this will be set to the base S3 path of the input dataset(s).

    When multiple datasets are selected, the value will be a comma-separated list of S3 paths.

  • Output Dataset Path: The value of the parameter will be set to the base path of the output directory, which has been set up by Cirro as the location for the outputs of the pipeline.

    This option will be hidden if the parameter already exists.

  • Hardcoded Value: The value of the parameter will always be the same

    Hardcoded values could be used to set up the path to a Standard Reference provided by Cirro.

You can also add groups to the form, which can be used to group together related parameters. Use the "Add Group" button to add a group, and then drag and drop the parameters into the group.

Parameter Settings

Once a parameter has been added, you can edit the settings described below. Additional settings are available by clicking on the gear icon next to the parameter name.

  • Name: A short key (lowercase letters, numbers, and underscores) which identifies the value of the parameter to the workflow.

Form Inputs:

For form inputs, some of the following settings may be available based on your selected parameter type.

  • Title: The title of the parameter, which will be displayed in the web form. This can be different from the name, and can contain spaces and other characters.

  • Description: A longer description of the parameter, which will be displayed underneath the field in the form. This can be used to provide additional information about the parameter.

  • Default Value: The default value of the parameter; the field will be pre-filled with this value.

  • Pattern: (for text inputs) A regular expression which will be used to validate the value of the parameter. This can be used to enforce a particular format for the value of the parameter.

    If you specify this value, please make sure to provide a description of the allowed characters in the description field.

  • Options: (for text inputs) The options for the parameter, which will be displayed as a dropdown list in the form.

  • Reference Type: (for reference file inputs) The type of reference file which will be used.

  • File Pattern: (for dataset file inputs) The pattern which will be used to filter the files in the selector.

    This will be passed to the workflow as the complete S3 path to the file(s).

  • Allow Multiple Files: (for dataset file inputs) Whether the user can select multiple files or not.

    Multiple files will be passed to the workflow as a comma-separated list.

  • Required: Whether the parameter is required or not.
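
Most of these settings correspond to standard JSON Schema keywords (title, description, default, pattern), and the Options setting is typically expressed as an enum. A sketch of two text parameters using several of these settings (names invented for illustration):

{
    "form": {
        "properties": {
            "run_id": {
                "title": "Run ID",
                "description": "Letters, numbers, and dashes only",
                "type": "string",
                "default": "run-001",
                "pattern": "^[A-Za-z0-9-]+$"
            },
            "mode": {
                "title": "Mode",
                "type": "string",
                "enum": ["fast", "thorough"]
            }
        }
    },
    "ui": {}
}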

Hardcoded Inputs:

For hardcoded inputs, the following settings are available:

  • Value: The value passed to the workflow.

Output Files

This section is in development and will be available soon.

Preprocess

This tab provides an interface for setting up the preprocess script, a Python script that runs before the workflow is launched.

The example script in the tool contains a number of comments which describe the purpose of each section.

Saving Cirro Configuration

After setting up the pipeline configuration using the tool, configuration files can be downloaded by clicking the "Save" button in the top right corner of the app.

You may have to allow the site to download multiple files in your browser.

Once the files have been pushed to a GitHub repository, they can then be used to add the workflow to Cirro.

Manual Configuration

For more advanced configurations, you can manually create the configuration files (or start with the configurator tool and edit them).

The sections below describe the contents of each configuration file and how they are used.

Customizing the Form

While the Configurator Tool handles the syntax of the elements required to render a web form that gathers information from the user, it does not support the full set of features available to the developer.

The contents of process-form.json follow the JSON Schema format. The full set of available functionality can be found on the react-jsonschema-form page. The vast majority of the functionality described on that page can be used for Cirro pipelines.

To test modifications to the form, simply copy and paste the form section of process-form.json (omitting the ui element) into the input box on the react-jsonschema-form page.
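
For instance, given a process-form.json like the one shown in the Expert-Level Note below, you would paste only the value of the form key:

{
    "properties": {
        "foo": {"title": "Input Foo", "type": "string"}
    }
}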

Caveat: The Cirro-specific form elements will not render in their final form using this preview tool, and cannot currently be used as dependencies within logic-gated object types.

Expert-Level Note - Customizing the Form JSON

The content and structure of the form can be modified extensively without any impact on the functionality within Cirro. For example, the order of input elements in the form has no relation to the parsing of the user inputs by the website. However, keep in mind that nesting form elements within larger objects may require an updated reference in process-input.json.

For example, the form given by:

{
    "form": {
        "properties": {
            "foo": {"title": "Input Foo", "type": "string"}
        }
    },
    "ui": {}
}

can be referenced in process-input.json as:

{
    "foo": "$.dataset.params.foo"
}

While the nested element shown by:

{
    "form": {
        "properties": {
            "outer": {
                "type": "object",
                "title": "Grouping of Inputs",
                "properties": {
                    "foo": {"title": "Input Foo", "type": "string"}
                }
            }
        }
    },
    "ui": {}
}

should be referenced as:

{
    "foo": "$.dataset.params.outer.foo"
}

Form-to-Workflow Input Mapping

The contents of the process-input.json file are used to map the values from the form to the input parameters used by the workflow.

When using the Pipeline Configurator Tool, the process-input.json file is automatically generated based on the configured form.

For more complex parameter mappings, the process-input.json file can be modified directly. The mapping function supports the use of JSONPath syntax. You may use this functionality to access other parameters that are not directly part of the form, such as other dataset or process information.

Some rules to keep in mind when setting up the process-input.json file:

  • The workflow's output directory should always be present and mapped to "$.dataset.dataPath".
  • Input files can be sourced from either the $.inputs[*].dataPath path or from a sample sheet generated by the preprocess script.
  • Parameters with multiple matches (i.e., from the [*] operator) will be combined into a comma-separated string. This cannot be combined with the | operator.
  • Parameters can be combined using the | operator with other paths or static values.
  • The form parameters can be accessed through the $.dataset.params path.

The full list of parameters and examples can be found in the sections below.

Available Parameters

The following parameters are available for use in the process-input.json file:

  • $.dataset: Information about the newly created dataset, including $.dataset.params which contains the user input from the form;
  • $.project: Attributes of the project which the dataset is contained within;
  • $.process: Attributes of the process (workflow) which will be used to create the dataset;
  • $.inputs: A list of each of the dataset(s) which have been selected as inputs.

For example:

{
    "dataset": {
        "id": "111",
        "createdAt": "2024-09-23T19:55:28.020052Z",
        "updatedAt": "<SNIP>",
        "createdBy": "hello@cirro.bio",
        "params": {
            "level": 3,
            "experiment_format": {
                "plates": "A B C",
                "wells": "1 2 3"
            },
            "custom_options": {
                "with_custom": false
            }
        },
        "info": {},
        "projectId": "000",
        "tags": [],
        "processId": "process-hutch-interop-qc-1_0",
        "name": "New Dataset",
        "status": "PENDING",
        "s3": "s3://project-000/datasets/111",
        "dataPath": "s3://project-000/datasets/111/data"
    },
    "project": {
        "id": "000",
        "name": "Test Project",
        "description": "Test Project Description",
        "status": "COMPLETED"
    },
    "process": {
        "id": "process-hutch-interop-qc-1_0",
        "createdAt": "<SNIP>",
        "updatedAt": "<SNIP>",
        "pipelineType": "Community",
        "name": "InterOp QC",
        "description": "InterOp QC Reporting - Generate QC metrics from Illumina Sequencing run",
        "executor": "NEXTFLOW",
        "documentationUrl": "https://github.com/FredHutch/interop-nf",
        "code": {
            "repositoryType": "GITHUB_PUBLIC",
            "uri": "FredHutch/interop-nf",
            "version": "1.0",
            "script": "main.nf"
        },
        "archived": false,
        "category": "Quality Control"
    },
    "inputs": [
        {
            "id": "aaa",
            "createdAt": "<SNIP>",
            "updatedAt": "<SNIP>",
            "createdBy": "hello@cirro.bio",
            "params": {},
            "info": {},
            "projectId": "000",
            "sourceDatasetIds": [],
            "tags": [],
            "processId": "sequencing-run",
            "name": "MiSeq Demo 1",
            "status": "COMPLETED",
            "description": "test",
            "s3": "s3://project-000/aaa",
            "dataPath": "s3://project-000/aaa/data"
        },
        {
            "id": "bbb",
            "createdAt": "<SNIP>",
            "updatedAt": "<SNIP>",
            "createdBy": "hello@cirro.bio",
            "params": {},
            "info": {},
            "projectId": "000",
            "sourceDatasetIds": [],
            "tags": [],
            "processId": "sequencing-run",
            "name": "MiSeq Demo 2",
            "status": "COMPLETED",
            "description": "test",
            "s3": "s3://project-000/bbb",
            "dataPath": "s3://project-000/bbb/data"
        }
    ]
}

Example Input Mapping

Using the example shown above, the following process-input.json file:

{
    // Map the dataset name to the workflow parameter "name"
    "name": "$.dataset.name",
    // Map the "level" parameter on the form to the workflow parameter "level"
    "level": "$.dataset.params.level",
    // Map the paths of all the input datasets to the workflow parameter "inputs"
    "inputs": "$.inputs[*].dataPath",
    // Map the path to the dataset output directory to the workflow parameter "output"
    "output": "$.dataset.dataPath",
    // Map the plates and wells from the form to the workflow parameter "experiment_format"
    // The values will be combined with a colon
    "experiment_format": "$.dataset.params.experiment_format.plates| : |$.dataset.params.experiment_format.wells",
}

will be populated with data and provided to the workflow as:

{
    "name": "New Dataset",
    "level": 3,
    "inputs": "s3://project-000/aaa/data,s3://project-000/bbb/data",
    "output": "s3://project-000/datasets/111/data",
    "experiment_format": "A B C : 1 2 3"
}

Output Files

In addition to running pipelines, Cirro also provides functionality for visualizing data directly via Dashboards (Beta).

To set up the output data from a pipeline such that it can be viewed in a Cirro dashboard, add entries to the process-output.json file.

For delimiter-separated value (comma- or tab-delimited) files, the following information must be provided:

  • The name to display for the information from a particular file
  • The relative path of that file within the output directory
  • The delimiter used in that file (e.g. comma, tab, or pipe)
  • Names for each of the columns (headers) in that file

This can also be used to visualize spatial transcriptomics data (using Vitessce) or imaging data.

More information about the file syntax will be provided soon.

Setting up the Preprocess Script

The preprocess.py file produced by the configuration app is merely a placeholder -- you may need to customize the preprocess script to launch your workflow in the appropriate way.

One of the most common tasks performed in the preprocess script is to construct a sample sheet (or inputs JSON) listing all of the files from the input dataset.

In addition, the preprocess script can be used to execute any Python code after the user has filled out the analysis form, but before the workflow has started. This gives the workflow developer an opportunity to modify parameters or inputs to the workflow to fulfill any need which is too complex to be fully fulfilled by the form input.
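
As a loose sketch of the sample sheet pattern (with a hardcoded file list standing in for the dataset file listing that a real script would obtain from Cirro's helpers, as described in the guide linked below):

# Illustrative only: in a real preprocess script, the input file paths
# would come from the Cirro-provided helpers rather than a hardcoded list.
import pandas as pd

input_files = [
    "s3://project-000/aaa/data/sampleA_R1.fastq.gz",
    "s3://project-000/aaa/data/sampleA_R2.fastq.gz",
]

# Build a sample sheet with one row per FASTQ file
samplesheet = pd.DataFrame({
    "sample": [p.rsplit("/", 1)[-1].split("_R")[0] for p in input_files],
    "fastq": input_files,
})
samplesheet.to_csv("samplesheet.csv", index=False)

The resulting samplesheet.csv would then be passed to the workflow as an input parameter.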

Guide to the Preprocess Script

Pipeline configuration override

The process-compute.config file can be used to override a pipeline's configuration.

The following is an example of how you might override certain settings in a Nextflow pipeline. This file is especially useful if you are trying to integrate an off-the-shelf pipeline into Cirro or cannot make modifications to the pipeline itself.

params {
    // Override the container to use ECR instead of DockerHub
    // This can also be used to make a private container available to your Cirro organization without having the user input the container
    tool_container = "public.ecr.aws/cirrobio/qupath:0046339"
}

process {
    // If the task fails with an out-of-memory exit code (137-140), retry it; otherwise ignore the error
    errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'ignore' }

    // Override the maximum number of retries for a task
    maxRetries = 3

    // Monkey patch the make_miniature process
    withName: make_miniature {
        beforeScript = "mkdir -p bin/colormaps; cp /miniature/bin/colormaps/* bin/colormaps/"
    }

    // Override the resources for the io_limited task and increase the resources with each attempt
    withLabel: io_limited {
        cpus = { 1 * task.attempt }
        memory = { 4.GB * task.attempt }
    }

    // Disable caching for a specific process
    withName: CUSTOM_DUMPSOFTWAREVERSIONS {
        cache = false
    }

    // Add additional arguments to a specific process
    // This uses the nf-core ext property syntax
    // (https://nf-co.re/docs/contributing/components/ext_args)
    withName: '.*:FDRCONSENSUSID' {
        ext.args    = "-PSM true -protein false"
    }
}

Saving to a GitHub repository

When saving the pipeline configuration files to a GitHub repository, we recommend using the folder .cirro/ at the root of the repository. If the repository will ultimately contain multiple pipeline configurations, use subfolders to distinguish between them (.cirro/align/, .cirro/genotype/, etc.). The files can be sourced from any location, but this convention can help with organization.
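
For example, a repository containing two pipeline configurations might be laid out as:

.cirro/
├── align/
│   ├── process-form.json
│   ├── process-input.json
│   ├── process-output.json
│   ├── preprocess.py
│   └── process-compute.config
└── genotype/
    ├── process-form.json
    ├── process-input.json
    ├── process-output.json
    ├── preprocess.py
    └── process-compute.config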

Next Steps

After setting up your pipeline configuration files and saving them to a code repository, you are ready to add your workflow to Cirro.

Adding a Custom Workflow