Analysis pipelines can be added to Cirro, with support for:
- Running Nextflow or WDL workflow code located in a public or private GitHub repository
- Graphical interface for user input to select analysis parameters
- Any bespoke workflow configuration needed for execution on Cirro
How To Add a Pipeline to Cirro
- Set up Cirro configuration files (using the Workflow Configuration app or manually)
- Save those configuration files to a folder in any GitHub repository
- Import the pipeline to your Cirro account from that GitHub repository
Private GitHub Repositories
You can run a pipeline in Cirro even if it is located in a private repository.
To grant permission for a pipeline to be run from a private repository, go to the "Cirro Data Portal" app, click "Configure", and select the appropriate repository. By installing this app, you grant Cirro permission to run the code contained in the selected repository. That code will not be copied permanently to any storage system beyond the ephemeral instance used to execute the pipeline.
Setting up Cirro Configuration Files
You've already set up a Nextflow or WDL workflow, and now you'd like to run it in Cirro as an analysis pipeline.
Cirro is going to need to know a few basic things:
- Where can the workflow be found?
- Cirro supports running workflows from public or private GitHub repositories
- Is there a specific version that should be used?
- What information does the workflow need?
- Corresponding to 'parameters' for Nextflow, and 'inputs' and 'options' for WDL
- Does the back-end executor (AWS Batch) need any additional configuration?
- Beyond the authentication and queue information set up by default
- Are there output files which should be attached to Cirro visualization tools?
- Data output with various formats can be parsed for the Cirro Dashboard page
While it is possible to write the configuration files from scratch, we recommend getting started with the Cirro Workflow Configuration Builder.
Using Standard Reference Files
Workflows used to analyze genomic data often rely on a set of standard references such as:
- Genome sequences for commonly studied organisms (e.g. Homo sapiens GRCh38)
- Compiled alignment indexes for aligners like BWA, STAR, or Salmon
- Annotations of gene and transcript locations
To save workflow developers the trouble of setting up those references independently across different accounts, a set of standard reference files is hosted in a location that can be accessed by any Cirro project.
Understanding Pipeline Execution in Cirro
Pipelines are executed in Cirro using a framework that is intended to be flexible enough to accommodate a wide variety of use-cases, while also being relatively lightweight for straightforward pipelines.
Pipeline execution proceeds through the following series of events:
- The user is presented with a web form which can be used to collect any needed information
- The data returned by the web form is collected in a list of parameters
- A 'headnode' batch computing process is launched which has access to those parameters, as well as the complete workflow configuration
- The optional "preprocess script" is launched, which can execute any arbitrary code needed to configure the workflow in a more complex way than is supported by the web form
- The workflow repository is cloned to the execution environment
- The workflow executor (Nextflow or Cromwell) is launched using the provided workflow and parameters
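The ordering of these events can be sketched in plain Python. This is only an illustration of the sequence, not Cirro's actual implementation: `PipelineConfig`, `plan_execution`, the local `workflow` folder, and the file names `params.json`, `inputs.json`, and `cromwell.jar` are hypothetical. The command lines use standard Nextflow (`-params-file`, `-c`) and Cromwell (`run`, `--inputs`) options, but how Cirro actually invokes the executors is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PipelineConfig:
    """Hypothetical stand-in for the complete workflow configuration."""
    repository: str                  # GitHub "owner/name"
    version: str                     # branch or tag to check out
    executor: str                    # "nextflow" or "cromwell"
    entrypoint: str = "main.nf"      # workflow file inside the repository
    preprocess: Optional[Callable[[dict], dict]] = None


def plan_execution(form_data: dict, config: PipelineConfig):
    """Return (parameters, commands) in the order the headnode would use them."""
    # The data returned by the web form becomes the parameter set.
    params = dict(form_data)

    # The optional preprocess step may rewrite parameters arbitrarily.
    if config.preprocess is not None:
        params = config.preprocess(params)

    # The workflow repository is cloned to the execution environment.
    clone = [
        "git", "clone", "--branch", config.version,
        f"https://github.com/{config.repository}.git", "workflow",
    ]

    # The workflow executor is launched with the workflow and parameters.
    workflow_path = f"workflow/{config.entrypoint}"
    if config.executor == "nextflow":
        launch = ["nextflow", "run", workflow_path,
                  "-params-file", "params.json", "-c", "nextflow.config"]
    elif config.executor == "cromwell":
        launch = ["java", "-jar", "cromwell.jar", "run", workflow_path,
                  "--inputs", "inputs.json"]
    else:
        raise ValueError(f"Unsupported executor: {config.executor}")

    return params, [clone, launch]
```

The function only builds the plan and runs nothing, which keeps the sketch safe to execute anywhere. In this sketch the preprocess step is an ordinary function that receives and returns the parameter dictionary.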
Saving Cirro Configuration Files
The Cirro pipeline configuration consists of 6 files:
- process-dynamo.json: High-level pipeline information (name, description, repository, version, etc.)
- process-form.json: Form used to collect user input prior to workflow launch
- process-input.json: Mapping of form contents to workflow parameters/options
- process-output.json: Identifies any workflow output files which can be directly visualized by Cirro
- process-compute.config: Any additional compute configuration needed (used as nextflow.config for Nextflow, or the equivalent for WDL)
- preprocess.py: Optional Python script which can be used to evaluate arbitrary configuration logic immediately prior to workflow launch
After producing these files using the Cirro Workflow Configuration Builder, they should be saved to a folder in a GitHub repository. The example folder .cirro/ will be referenced below, but any name can be used.
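Before committing the folder, it can help to confirm that the expected files are present. The following is a minimal sketch, assuming that every file except preprocess.py is required (only preprocess.py is described as optional above). The function name `missing_config_files` is hypothetical and not part of any Cirro tooling.

```python
from pathlib import Path

# File names taken from the list of configuration files above.
REQUIRED_FILES = [
    "process-dynamo.json",
    "process-form.json",
    "process-input.json",
    "process-output.json",
    "process-compute.config",
]
OPTIONAL_FILES = ["preprocess.py"]  # absence of these is not reported


def missing_config_files(folder: str) -> list[str]:
    """Return the required configuration files that are absent from `folder`."""
    root = Path(folder)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # ".cirro" is the example folder name used in this guide.
    missing = missing_config_files(".cirro")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All required configuration files are present.")
```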
Organizing Cirro Configuration Files:
While the Cirro configuration files for a workflow can be stored in the repository which contains the Nextflow or WDL code for that workflow, they don't have to be. The most common reason to store them elsewhere is that you are not able to add files to the workflow repository (e.g. GATK, nf-core).
Note also that it is possible to set up multiple processes in Cirro which reference the same workflow repository, e.g. if a workflow can be run in different ways depending on the input dataset. To maintain these configurations, simply set up multiple subfolders (e.g.
Importing a Pipeline Configuration to Cirro
After setting up the configuration files, a pipeline can be imported to Cirro. The import process makes the pipeline available within a set of Cirro projects defined by the pipeline author. All users with Contributor-level permissions will be able to run the pipeline on datasets in those projects. However, the author can remove access to the pipeline at any time.
Pipeline Security Notes
The ability to add custom pipelines to Cirro opens the door for any arbitrary code to be executed in a Cirro Project. User-level security practices can be enforced by ensuring that only trusted actors are granted the Admin-level permissions required for adding a pipeline.
In addition, the actions which can be performed by pipeline code are strictly limited to listing, reading, writing, and deleting files located within the AWS S3 bucket associated with a Cirro Project. The headnode and pipeline code cannot take any action which accesses or modifies the larger Cirro platform, such as user permissions or access control.
While it would be ideal to prevent pipelines from deleting files entirely, deletion is currently how Nextflow implements the overwrite: true flag when publishing output files.
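For readers who think in terms of AWS permissions, the restriction described above (listing, reading, writing, and deleting files within the project bucket) resembles the IAM policy sketched below. This is an illustration only and not Cirro's actual policy. The function name and the bucket name `example-project-bucket` are hypothetical; the `s3:` action names are standard AWS IAM actions.

```python
import json


def sketch_bucket_policy(bucket: str) -> dict:
    """Build an illustrative IAM policy scoped to a single S3 bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": bucket_arn,
            },
            {
                # Reading, writing, and deleting apply to objects in the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket name, used only for illustration.
    print(json.dumps(sketch_bucket_policy("example-project-bucket"), indent=2))
```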
Recovering Deleted Data: Every modification to data stored in Cirro is versioned, which means that deleted data can be recovered from backups for as many days as the "Deleted dataset retention" specifies in the project configuration. Only the project administrators are able to view and modify the deleted dataset retention settings. Please reach out if you have any questions.
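As a worked example of the retention window, the sketch below checks whether deleted data would still fall inside the configured number of days. It assumes the window is counted from the moment of deletion and includes the final instant; Cirro's exact rule may differ. The function name `is_recoverable` is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_recoverable(deleted_at: datetime, retention_days: int,
                   now: Optional[datetime] = None) -> bool:
    """Return True if `deleted_at` is within `retention_days` of `now`."""
    # Default to the current time in UTC when no reference time is given.
    current = now if now is not None else datetime.now(timezone.utc)
    return current - deleted_at <= timedelta(days=retention_days)


if __name__ == "__main__":
    deleted = datetime(2024, 1, 1, tzinfo=timezone.utc)
    check_time = datetime(2024, 1, 20, tzinfo=timezone.utc)
    # 19 days have elapsed, so a 30-day retention still covers the deletion.
    print(is_recoverable(deleted, retention_days=30, now=check_time))
```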
While it is still important to only add pipelines which are trusted, the security guardrails set up in Cirro should mitigate the risk of adding custom analysis code to your Cirro environment.
Get in Touch
Have any questions about adding custom pipelines to Cirro? Please get in touch at email@example.com