Developing Pipelines

Every time you run a process in Cirro and create a new dataset, what's happening behind the scenes is that a reproducible computational 'workflow' is being executed. For more details on the fundamental concepts of Cirro, read more here.

Rather than using any custom system for defining workflows, Cirro wraps around workflow manager software which is used by the broader community. Cirro currently supports the Nextflow and Cromwell workflow management systems, and all of the documentation below refers to concepts which they employ. For more details on the syntax and structure of Nextflow, please consult its excellent documentation.

The documentation on this page should help you build a reproducible computational workflow which can be run either inside Cirro (on AWS) or on any other system supported by Nextflow or Cromwell.

Designing a Workflow

A workflow is a tool you can use to manipulate data. Before starting to write any code you should know what the tool is used for, and how it will complement the other tools you have available.

Start at the end

The end result of running your workflow will be a set of files which are created in an output directory. Start by writing out what that set of files should be. If you don't have particular requirements for the file format or structure, that's totally fine. However, if you do have an idea of what form of data would be the easiest to use in downstream applications, keep that in mind from the start.

For most applications, structured (tabular) data is written out in CSV format (or Arrow / Feather / HDF5 for more advanced use-cases). If you want to produce tabular data, try to lay out the number of tables that you will want in the end, and what their rows/columns might be.

Once you have an idea of what the outputs of the workflow should be, it will be much easier to design an efficient workflow which only concerns itself with the tasks which are strictly necessary.

Knowing your destination, write out the set of input data which you will start with. This should include all of the data files which are needed to produce the required outputs. Once you know what all of your starting inputs and final outputs will be, the workflow just needs to concern itself with filling in the intermediate steps.

Gather your tools

The primary challenge of bioinformatics is how to appropriately wield a collection of researcher-built computational tools. Depending on your technical area, you may make heavy use of short-read aligners like bwa mem or genome interval utilities like bedtools. Each of these tools will be used in discrete, interconnected steps of your workflow.

Looking over the analysis tasks which your workflow will encompass, make a list of the different component tools which you will need to use. The most efficient way to invoke each tool in a Nextflow workflow is by using Docker containers (which can be run via Singularity on HPC systems).

The vast majority of bioinformatics tools can be sourced as version controlled Docker images from the BioContainers Community. To find an image for a particular tool, the quickest place to check is directly on Quay.io via https://quay.io/repository/biocontainers/<TOOL_NAME>?tab=tags. For example, https://quay.io/repository/biocontainers/bwa?tab=tags has all of the images available for the BWA aligner. To use one of those images in a workflow, it would be referenced along with the tag as, e.g. quay.io/biocontainers/bwa:0.7.3a--h7132678_7.

Draw the map

One of the most challenging aspects of writing workflows can be the manipulation of channels which route data from one step to the next. A strategy which can help organize this process is to sketch out a rough flow of how the data needs to be transformed. For example, FASTQ reads are trimmed with cutadapt, trimmed reads are aligned with BWA, variants are called with bcftools, and then variants are aggregated across all samples.
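
As a rough sketch, that flow might eventually be expressed as a DSL 2 workflow block along these lines (the process names and the params.reads parameter below are hypothetical placeholders, not part of any existing workflow):

workflow {
    // Paired FASTQ files provided by a (hypothetical) --reads parameter
    reads_ch = Channel.fromFilePairs("${params.reads}")
    // Each step consumes the output channel of the previous step
    CUTADAPT(reads_ch)
    BWA(CUTADAPT.out)
    CALL_VARIANTS(BWA.out)
    // Aggregate the per-sample variant calls in a single task
    AGGREGATE(CALL_VARIANTS.out.collect())
}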

Mapping out the data transformations needed for a workflow in advance can help you stay organized as workflows start to expand. Instead of scrolling up and down a workflow looking for missing pieces, even a very rough map can be used as a checklist to track what is done and what remains.

Most importantly, you will want to identify the points at which data channels are joined, collected, or pivoted. For example, if the variants called from each sample need to be joined in some way with the alignment summary metrics for that sample, then you will need to pass the sample ID along in a tuple with each respective channel so that the appropriate files may be joined.
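
For example, if two hypothetical processes (CALL_VARIANTS and ALIGN_METRICS) each emit a tuple of [ sample_id, file ], the two channels can be combined on the sample ID with the join operator:

CALL_VARIANTS.out
    // join matches tuples on their first element (here, the sample ID)
    .join(ALIGN_METRICS.out)
    // Emits tuples of [ sample_id, variants_file, metrics_file ]
    .set { joined_ch }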

Fill in the details

Now that your requirements are laid out, it is much easier to fill in all of the details needed for the workflow. If you are new to writing workflows, read through the parts of a workflow to familiarize yourself with the tools at your disposal.

Knowing what your inputs and outputs are, the tools at your disposal, and the general structure for the workflow, you can simply work your way down the list. Start with the first step, check it off the list, and move on to the next. Make sure to run a test at every step, and before you know it the whole workflow will be finished.

Parts of a Workflow

When reading through the documentation below, it is useful to know the basic elements of a Nextflow workflow. Each workflow should be organized as a distinct software repository (e.g. on GitHub).

DSL 2

Because of the high degree of flexibility and modularity it provides, we recommend using the DSL 2 syntax in Nextflow. While DSL 2 was originally optional, it has become the default in recent releases of Nextflow and is more than worth the effort of becoming familiar with.

DSL 2 is enabled by adding nextflow.enable.dsl=2 at the beginning of your workflow script.
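
For example, the first lines of the workflow script would read:

#!/usr/bin/env nextflow
// Enable the DSL 2 syntax
nextflow.enable.dsl=2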

Configuration

The configuration of a workflow can be most easily defined in a file named nextflow.config located in the base directory of the workflow repository.

Workflow script

The primary entrypoint for the workflow should be in a file named main.nf located in the base directory of the workflow repository. For small workflows, this file may contain all of the processes and sub-workflows which are used. For larger workflows, processes and sub-workflows can be imported from other module files.

Modules

For workflows which contain multiple independent components, each section should be configured as a sub-workflow which is included in the main workflow script. A useful convention is to keep all of those components in different files in the modules/ directory.
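
For example, a sub-workflow defined in a hypothetical file modules/align.nf could be imported and invoked from main.nf as follows:

// main.nf
include { ALIGN } from './modules/align.nf'

workflow {
    // params.input is a hypothetical parameter pointing to the input files
    ALIGN(
        Channel.fromPath("${params.input}")
    )
}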

Testing and Debugging

When you want to start building a workflow, it can be hard to justify the effort needed to assemble a good testing dataset. However, once you get deep into workflow development, that test data will be an invaluable resource. A good rule of thumb is that it is worthwhile to invest in a testing framework for any workflow that you want to maintain in the future. Think of it as a gift to your future self.

Test data

The dataset used to test a workflow should fulfill two basic requirements:

  1. Small enough to be run through the workflow in minutes, not hours
  2. Complex enough to test any alternate routes which may exist in the workflow

It is useful to keep the absolute file sizes of test datasets below 10MB to more easily commit them to GitHub. When analyzing genomic datasets, a useful technique for making smaller datasets is to filter to a particular genomic region. This approach can be used for short read datasets as well as the genomic references used to align them.

Local testing

The most useful test is one which takes little effort to run, and which returns informative error messages. To make a test suite easily runnable, a good practice is to keep all testing files in a folder of the workflow repository named tests/. In addition to the input files used to run the test dataset, that folder should also contain a run.sh script which is sufficient for invoking the workflow appropriately.

For a multi-step testing suite, it is helpful to use BATS (the Bash Automated Testing System). When running a set of tests in BATS, logging messages will be suppressed for all successful steps, which makes it easier to identify and debug errors. Keep in mind that BATS will not return an error when commands return a non-zero exit status. Instead, use boolean evaluation (e.g., checking for file existence with [ -s expected.file.txt ]) to ensure that errors are caught.

Automated testing

Once a testing framework has been established for local execution and a workflow has reached a reasonable state of completion, it may be worthwhile to automate the testing suite with GitHub Actions. To enable GitHub Actions, set up a file in the repository under .github/workflows/test.yaml which follows this general pattern:

name: Test Suite

on: [push, pull_request]  

jobs:

  test:
    runs-on: ubuntu-latest
    steps:
      - name: Setup BATS
        uses: mig4/setup-bats@v1
        with:
          bats-version: 1.2.1

      - uses: actions/checkout@v3
      - uses: actions/setup-java@v2
        with:
          distribution: 'temurin'
          java-version: '17'

      - name: Install Nextflow
        run: |
          wget -qO- https://get.nextflow.io | bash
          sudo mv nextflow /usr/local/bin/

      - name: Set up Python 3
        uses: actions/setup-python@v3

      - name: Run tests
        working-directory: tests
        run: bash run.sh

In the example above, any pushes or pull requests will trigger an Action which installs prerequisites and then runs bash run.sh from the tests/ folder. This should not be thought of as a replacement for local testing because it is inefficient to wait for an action triggered from a pushed commit. Instead, this setup can be used in conjunction with status badges to maintain an independent validation that the updated code is in good working order.

Defensive workflow design

One of the biggest potential sources of errors in a workflow is when the content of user-provided parameters does not conform to the intended design. By checking explicitly for any unexpected inputs, a workflow can provide more helpful error messages guiding the user to correct their mistake without having to reach out to the developer for support.

Check for file existence

Consider the example where params.input is expected to contain the path to a FASTA file containing a set of sequences. If the user provides an incorrect path (or any other non-path string), this will not be caught by the default use of file() or the Channel.fromPath() constructor.

Set the checkIfExists flag to true for either file() or Channel.fromPath() to raise an error if an expected filepath cannot be found (see the Nextflow documentation).
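
For example, using the same params.input described above:

// Raise an informative error if the path provided in params.input does not exist
fasta = file("${params.input}", checkIfExists: true)

// The same flag is available for the channel factory
fasta_ch = Channel.fromPath("${params.input}", checkIfExists: true)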

Check for missing parameters

If you expect that the user will set a parameter (e.g. params.input), explicitly check in the workflow that the parameter was set and raise an error if not. One way to accomplish this is by setting a default value of false in the params scope of nextflow.config, and then raising an error in the workflow if that default has not been replaced by the user.

workflow {
    if ("${params.input}" == "false"){
        error "Error: User must provide a value for the --input parameter"
    }
}
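
A minimal sketch of the corresponding default in nextflow.config:

params {
    // Placeholder default which the user is expected to override
    input = false
}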

Check for empty channels

An extremely common failure case, which can produce an uninformative error message either during development or for the end user, is when a Channel is unexpectedly empty. To provide a more informative error message, it is best to defensively use the ifEmpty operator.

A common scenario to consider is when a user mistakenly does not provide inputs to a workflow. For example, if they specify an input folder which does not contain any files, then the result of a Channel.fromPath("${params.input_folder}/*") factory will be an empty channel. In that scenario, any downstream .collect() operator will immediately be evaluated and yield an empty list [] to whatever process it is used for (e.g. collectResults()). The error seen by the user will take the form of an empty input to collectResults, which will not point them to the issue with the empty input folder.

The solution in this scenario is to use the ifEmpty operator as follows:

Channel
    .fromPath("${params.input_folder}/*")
    .ifEmpty { error "No inputs in ${params.input_folder}" }
    .set { input_ch }

For more details on the nuances of this operator, consult the Nextflow documentation.

Best Practices

While a workflow can be built in any number of ways, we advocate for a set of best practices which help streamline workflow development and support. A number of these practices are particularly helpful for the cloud-based execution backend employed by Cirro (such as the use of software containers). The guidance below is intended purely as advice, and is not strictly required for a workflow to be used in Cirro.

Resource allocation

The amount of resources allotted for each process can be defined with the cpus and memory directives. If no resources are specified, then a task will be given a default of 1 CPU and 1 GB of RAM.

The combination of CPUs and RAM allocated for each process should correspond to the C4, M4, and R4 instance types on AWS (which are the instance types provisioned by the AWS Batch job scheduler). As a quick guide, the best options to select from are:

  • C4 Class (X CPU / ~2X GB RAM)
    • 1 CPU / 1.5 GB RAM
    • 2 CPU / 3.75 GB RAM
    • 4 CPU / 7.5 GB RAM
    • 8 CPU / 15 GB RAM
    • 16 CPU / 30 GB RAM
    • 36 CPU / 60 GB RAM
  • M4 Class (X CPU / 4X GB RAM)
    • 1 CPU / 4 GB RAM
    • 2 CPU / 8 GB RAM
    • 4 CPU / 16 GB RAM
    • 8 CPU / 32 GB RAM
    • 16 CPU / 64 GB RAM
    • 40 CPU / 160 GB RAM
    • 64 CPU / 256 GB RAM
  • R4 Class (X CPU / ~8X GB RAM)
    • 1 CPU / 7.5 GB RAM
    • 2 CPU / 15 GB RAM
    • 4 CPU / 30 GB RAM
    • 8 CPU / 60 GB RAM
    • 16 CPU / 122 GB RAM
    • 32 CPU / 244 GB RAM
    • 64 CPU / 488 GB RAM

We recommend that all resources be set in the nextflow.config file using one of the two available process selectors (see the sketch after this list):

  • withLabel: Assigns a set of resources to all of the processes which share the same label, making it easy to set up a handful of commonly re-used types like mem_large or cpu_large
  • withName: Sets the resources used by each process individually (which is a bit more cumbersome)
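
A minimal sketch of the withLabel approach (the label names and resource values below are only examples, not a required convention):

nextflow.config:

process {
    withLabel: cpu_large {
        cpus = 16
        memory = 30.GB
    }
}

main.nf:

process alignReads {
    label 'cpu_large'
    ...
}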

Parameters

All default parameters for a workflow should be set in the nextflow.config file inside the params scope. The values for all of those parameters can be accessed by any workflow or process with the params object (e.g. params.alpha_1, params.beta_2).

params {
    alpha_1 = true
    beta_2 = 'another string ..'
}

These defaults will be overridden by user input (following the Nextflow rules for configuration priority), and should represent the values to use when the user does not provide any.

When importing sub-workflows, it is possible to change the parameter values which are used in that context. This should be considered an expert-level option which is only used rarely -- it is generally best to define one set of parameters across a workflow to avoid confusion. In the event that it is necessary to add parameters to an imported sub-workflow, the best practice is to use the addParams option. Using the params option will omit any parameters which are not explicitly added in that statement, which is generally less useful.
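
For example, a hypothetical import which adds a parameter to a sub-workflow might read:

// min_length is added (or overridden) only within the imported TRIM sub-workflow
include { TRIM } from './modules/trim.nf' addParams(min_length: 50)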

Containers

While Nextflow can be used to provide the dependencies used for each process in a number of different ways, the most convenient approach with AWS Batch (the infrastructure of Cirro) is to use Docker containers.

The container used for each process is defined with the container directive. To conveniently re-use containers across processes, we recommend setting up all containers as default params in nextflow.config and specifying those parameters in the process definitions.

For Example

nextflow.config:

params {
    container__pandas = "quay.io/fhcrc-microbiome/python-pandas:v1.2.1_latest"
}

main.nf:

process processName {
    container "${params.container__pandas}"
    ...
}

Setting up containers in the params scope provides automatic inheritance into the namespace of all sub-workflows and modules, while also allowing the user to override default values at runtime to help with debugging and development.

Container Resources

There are many publicly available container resources which should be checked before building any custom Docker images for a workflow.

BioContainers

The BioContainers project is a fantastic community-supported resource for standardized installations of widely used bioinformatics tools. To check for a particular tool you can either browse the web-based BioContainers registry or look through the BioContainers GitHub repository.

If your tool of interest is available in BioContainers, check for the latest container build on their Quay repository by visiting https://quay.io/repository/biocontainers/<TOOL_NAME> and clicking on "Tags." For example, to check for the tool 'metaphlan' you would visit https://quay.io/repository/biocontainers/metaphlan?tab=tags.

To use a BioContainer in your workflow, specify the image hosted in Quay with the syntax: quay.io/biocontainers/TOOL_NAME:TAG, e.g. quay.io/biocontainers/metaphlan:3.0.14--pyhb7b1952_0.

Data Core Containers

To help provide general-use containers which fall outside of the purview of BioContainers, the Data Core maintains a handful of container images using the hdc-workflows organization on Quay. For example, the quay.io/hdc-workflows/ubuntu:20.04 image is a useful alternative to the DockerHub ubuntu:20.04 version when running into download rate throttling issues.

If you need to use a container which is not available on BioContainers and which you think might be useful for the broader community, please contact the Cirro team.

Sample sheets

An extremely common workflow scenario is to analyze a set of files in the context of some grouping or annotation which is attached to each individual file (or pair of files). The approach we use to organize this type of information is with "sample sheets," which are CSV files which take the general form:

file_uri,labelA,valB
/path/to/file1,label1,val2
/path/to/file2,label2,val2

To use these sample sheets in workflows, Nextflow provides the useful splitCsv operator. With splitCsv (using header: true), each row in the CSV becomes a map keyed by the column names:

Channel
    .from(
        file(
            "${params.sample_sheet}",
            checkIfExists: true
        )
    )
    .splitCsv(
        header: true,
        sep: ','
    )
    .set { input_ch }

In the example above, this would yield one map per row:

[file_uri: /path/to/file1, labelA: label1, valB: val2]
[file_uri: /path/to/file2, labelA: label2, valB: val2]

Note that at this point file_uri contains the path to each file (as a string) rather than the file itself. To replace each of the paths with file objects you can use the map operator immediately after the .splitCsv() operator:

.map { it -> [file(it.file_uri), it.labelA, it.valB] }

This pattern also has the benefit of setting the order of the tuple elements to match what is expected by a downstream process. In other words, this .map operator will yield the same order of values no matter what the column order of the sample sheet is, since it is matching on the header names.
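
A downstream process (hypothetical here) could then declare its inputs in that same order, and the mapped channel could be passed to it directly:

process summarizeSample {
    input:
    tuple path(seq_file), val(labelA), val(valB)

    ...
}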

Publishing outputs

An important feature of Nextflow to understand is that the files which are identified in the output block of a process are not necessarily 'published' as outputs of the workflow. This distinction makes it possible to create intermediate files which are only used to pass information between workflow steps, but which do not clutter up the filesystem when everything is done.

To specify exactly which files should be provided as outputs from the workflow, you must use the publishDir directive on that process. We recommend the following (combined in the sketch after this list):

  • Using a single parameter which specifies the base folder for all outputs (e.g. params.output)
  • Using publishDir with the following flags: publishDir "${params.output}", mode: "copy", overwrite: "true"
  • To use a sub-folder for the outputs from a particular process, use: publishDir "${params.output}/path/to/subfolder/", mode: "copy", overwrite: "true"
  • To only publish a subset of the outputs from a particular process, use pattern: "<glob>", e.g.: publishDir "${params.output}", mode: "copy", overwrite: "true", pattern: "*.csv.gz"
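
Putting those recommendations together, the directives on a process might look like the following sketch (the process name and sub-folder are illustrative):

process summarizeResults {
    // Publish only the compressed CSV outputs into a sub-folder of params.output
    publishDir "${params.output}/summary/", mode: "copy", overwrite: "true", pattern: "*.csv.gz"
    ...
}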

Workflow Templates

To help jump-start the development of Nextflow workflows in Cirro, we have produced a small collection of repositories which can be used as templates.

To use these templates, simply make a fork of the repository or copy the contents into your own repository. From there, start adding your own code or making any modifications you like.

Template workflow repositories

To organize the different assets which may be needed in a workflow repository, we recommend the following directory structure:

workflow/
    modules/        Contains the modules imported by main.nf, e.g.
        module.nf       include { MODULE } from './modules/module.nf'
    templates/      Contains process scripts which are interpolated with variables at runtime
    bin/            Any executables placed in this folder will be available in the PATH for all processes
    tests/          Testing data
    main.nf         Main entrypoint
    nextflow.config Configuration for the workflow