Pipeline Catalog: Proteomics

Uploading Data

Spectral data generated from proteomics experiments can be uploaded in either mzML or RAW file format. In addition, the experimental design structure which describes that set of files should be uploaded as a samplesheet.csv for each dataset.

Note: The path for each of the spectra files in the samplesheet.csv should not include any parent folders. For example, instead of /path/to/dataset/Sample1.mzML, just use Sample1.mzML.

Reference Data - Genome

The set of protein targets used for analyzing peptide spectra must be provided as a Reference Genome (FASTA) on the References page.

Note: If the protein FASTA provided as a reference does not already contain 'decoy' sequences, then the "Add Decoys" option must be selected when running the analysis.

Data-Dependent Acquisition – Label Free Quantitation (DDA-LFQ)

Analysis of DDA-LFQ data is performed using the nf-core/quantms pipeline.

Input Data:

Input files can be processed from either the .mzML or .raw file formats. The input for this workflow should be uploaded as either the "Proteomics (RAW)" or "Proteomics (mzML)" dataset type. Include a full batch of samples to be processed as a single dataset for optimal processing.

Supports combining input datasets in a single analysis.

Sample Metadata:

In addition to the spectral data files, metadata may be provided to annotate specific details of the experimental design. To provide sample-level metadata appropriately, upload a samplesheet.csv as part of the dataset.

The metadata included in the samplesheet.csv mirrors the Sample and Data Relationship (SDRF) Format for Proteomics, with the important distinction that the samplesheet.csv must be comma-delimited (and not tab-delimited as expected for SDRF).

Required Metadata Fields:

When providing a samplesheet.csv the columns sample and file must be provided. Every sample in the dataset must be listed on its own line. If there are multiple replicate files from the same sample, use additional columns file_2, file_3, etc.

Note: The file column should list the relative path of the file within the uploaded folder, not the full URI.

The information in those two columns is used to automatically populate the columns:

source name (from sample)
comment[file uri] (from file)
comment[data file] (from file)

If no samplesheet.csv is provided the following fields will be populated with default values:

characteristics[organism]: Homo sapiens
characteristics[organism part]: not applicable
characteristics[disease]: not applicable
characteristics[cell type]: not applicable
characteristics[biological replicate]: not applicable
comment[fraction identifier]: 1
comment[technical replicate]: (autoincrement files with the same sample)
comment[cleavage agent details]: NT=Trypsin
comment[instrument]: NT=LTQ Orbitrap XL
comment[label]: NT=label free sample

Optional Metadata Fields:

The following fields may also be provided as columns in the samplesheet.csv, with the values being passed along to the workflow for processing:

characteristics[cell line]
characteristics[ancestry category]
characteristics[age]
characteristics[sex]
characteristics[developmental stage]
characteristics[individual]
material type
technology type
comment[flow rate chromatogram]
comment[gradient time]
comment[fractionation method]
comment[modification parameters]
comment[modification parameters]
comment[modification parameters]
comment[fragment mass tolerance]
comment[precursor mass tolerance]
factor value[flow rate chromatogram]
factor value[gradient time]

Analyze Subset of Files:

If you wish to analyze only a portion of the files in a dataset, use the "Analyze Subset" dropdown menu when launching the workflow. If no files are selected (the default), then everything present in the dataset will be analyzed. If any files are selected, then only those files will be analyzed.

Workflow Overview:

nf-core/quantms workflow overview

While the nf-core/quantms workflow supports a variety of data types (DDA-LFQ, DDA-Isobaric, and DIA-LFQ), each of those approaches is provided as a distinct pipeline within Cirro.

For more details on the parameters available, see the full workflow documentation.

Note: The DDA-LFQ pipeline in Cirro is run with the parameters: acquisition_method: "dda", labelling_type: "label free sample"

Quality Control Report (nf-core/quantms)

To generate a high-level summary of the results generated by the nf-core/quantms pipeline, this workflow summarizes the number of spectra, peptides, and proteins which have been detected.

Quality control metrics are generated at the dataset level, combining results across all of the individual files which have been analyzed in a batch.

To compare across runs, the quality control report can be executed on multiple datasets. The results from each dataset will be labeled with the name assigned by the user to the dataset. Even if that dataset name has been modified after creation, the name at the of QC report creation will be used to label the results.

Results include:

Num_Proteins_Identified
Num_Proteins_Quantified
Num_MS2_Spectra_Identified
Num_Peptides_Identified

In addition, plots are provided in HTML, PDF, and PNG format showing the distribution of results across the datasets.