Samples

Datasets are more than just a group of files. The data contained in these files often corresponds to a unit of observation which is described in Cirro as a "sample". These samples can describe a physical specimen, an experimental replicate, or whatever the atomic observation may be for a particular experiment.

Cirro keeps track of dataset files and sample metadata side by side so the metadata can be used as input for different pipeline analyses. The resulting dataset from any pipeline run will contain a snapshot of the samples that were used in its creation in order to maintain a complete picture of the dataset provenance.

Some things to keep in mind with samples include:

Files have a many-to-one relationship with samples: a sample can be associated with many files, but each file can point to only one sample.
Sample metadata is shared across all of the datasets in a project, but cannot be shared across projects.
Sample entries can be modified by any user, while only administrators can modify the sample schema.

Adding a Sample

The sample table is populated at the time new datasets are uploaded. There is no way to explicitly add a sample, because every sample must be connected to an uploaded file. Users have three options for populating the sample table: upload a sample sheet at the time they upload their dataset, add a sample sheet to an existing dataset using the "Upload Samplesheet" button on the dataset's Overview page, or skip the sample sheet and let Cirro auto-populate the sample table based on the file names. Sample sheets can also be updated with newer versions using the same "Upload Samplesheet" button.

Warning

Note that once you upload a dataset without a sample sheet, you can't go back and add that file. You would have to delete the dataset and re-upload it or add the sample sheet information to Cirro by hand. So be careful when choosing whether or not to upload a sample sheet with the rest of your files.

Providing a Sample Sheet

The easiest way to annotate metadata for a set of files is to upload a sample sheet alongside the files. A sample sheet should be saved in CSV format, with one line per sample. The file should be named samplesheet.csv and should sit at the topmost level of the files you are uploading, not in a sub-folder.

To learn about the sample sheet requirements for a particular use case, check the pipeline catalog documentation for the relevant pipeline category.

The first column in the sample sheet should have the header sample, and should contain the name of the sample. If there is only one file per sample, then the next column should be named file and should contain all of the file names. If there are two files per sample, then use file_1 and file_2. You can use any number of file_N columns in this way, and there is no requirement that each sample has the same number of files. However, be sure that all the files listed are included in the set of files you are uploading to Cirro. You should not have information on files in your sample sheet that you are not uploading at the same time. If you have FASTQ data, use the columns fastq_1 and fastq_2 instead of file_1 and file_2, to automatically annotate each file appropriately as read 1 and 2.

Any additional columns after sample and file_N (or fastq_1/2) can be used to store the metadata associated with each sample. For example:

sample,file_1,file_2,exp_group,replicate
sample_a,sample_a_R1.fastq.gz,sample_a_R2.fastq.gz,treatment,1
sample_b,sample_b_R1.fastq.gz,sample_b_R2.fastq.gz,treatment,2
sample_c,sample_c_R1.fastq.gz,sample_c_R2.fastq.gz,control,1
sample_d,sample_d_R1.fastq.gz,sample_d_R2.fastq.gz,control,2
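
The column rules described above can be checked locally before an upload. The following is a minimal sketch rather than anything Cirro provides: the function name check_samplesheet and its arguments are hypothetical, and it assumes that a blank file cell is acceptable because samples do not need to have the same number of files.

```python
import csv
import io
import re


def check_samplesheet(csv_text, uploaded_files):
    """Return a list of problems found in a sample sheet (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = []

    # The first column must be the sample name.
    if not header or header[0] != "sample":
        problems.append("first column must be named 'sample'")

    # Columns that name files: file, file_N, fastq_1, or fastq_2.
    # Annotation columns such as file_1_read do not match this pattern.
    file_cols = [c for c in header if re.fullmatch(r"file(_\d+)?|fastq_[12]", c)]

    uploaded = set(uploaded_files)
    for line_no, row in enumerate(reader, start=2):
        for col in file_cols:
            name = (row.get(col) or "").strip()
            # Assumption: blank cells are allowed when a sample has fewer files.
            if name and name not in uploaded:
                problems.append(f"line {line_no}: {col} '{name}' is not in the upload")
    return problems
```

An empty list means every file named in the sheet is part of the upload and the first column is named sample.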

Expert Note: Metadata can be associated with files as well as samples. To add annotations to file_1 and file_2 separately, use the prefixes file_1_ and file_2_. For example, using fastq_1 and fastq_2 is the equivalent of using file_1 with file_1_read as 1 and file_2 with file_2_read as 2.
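
To make the equivalence in the note concrete, the sketch below rewrites a sample sheet row that uses fastq_1 and fastq_2 into the file_N form with a file_N_read annotation. The function name expand_fastq_columns is hypothetical and this is only an illustration of the note, not Cirro's internal handling.

```python
def expand_fastq_columns(row):
    """Rewrite fastq_1/fastq_2 entries as file_N plus a file_N_read annotation."""
    out = {}
    for key, value in row.items():
        if key in ("fastq_1", "fastq_2"):
            n = key.split("_")[1]        # "1" or "2"
            out[f"file_{n}"] = value     # the file name itself
            out[f"file_{n}_read"] = n    # file-level annotation: read number
        else:
            out[key] = value             # sample name and sample metadata pass through
    return out
```

For a row with fastq_1 set to sample_a_R1.fastq.gz, the result has file_1 set to that file name and file_1_read set to 1.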

Another benefit of the sample sheet is that you do not have to match any file name format requirements for your dataset type. If you have different file names, using a sample sheet is a great way to upload your data without editing the file names.

Using Auto-Population

If a sample sheet is not provided when a dataset is uploaded, Cirro will auto-populate the sample table for those files using the file names.

In this case, Cirro will check the dataset type that the files are being uploaded under (e.g. VirScan) and the expected file name formats for the uploaded files (e.g. *_*.{R1,R2}.fastq.gz and *_*_S*_{R1,R2}_00*.fastq.gz, where * denotes any set of characters). The portal will pull information from the file name to populate the sample table with values such as the "sample name" and "read number".
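
As a rough illustration of how a sample name and read number could be pulled from file names like these, the sketch below translates the two example formats into regular expressions. The regex translations, the function names, and the choice to treat everything before the read token (or before the _S token) as the sample name are all assumptions made for this example; Cirro's actual parsing rules may differ.

```python
import re

# Hypothetical regex translations of the two example file name formats.
PATTERNS = [
    # *_*_S*_{R1,R2}_00*.fastq.gz
    re.compile(r"(?P<sample>.+_.+)_S.*_R(?P<read>[12])_00.*\.fastq\.gz"),
    # *_*.{R1,R2}.fastq.gz
    re.compile(r"(?P<sample>.+_.+)\.R(?P<read>[12])\.fastq\.gz"),
]


def parse_file_name(name):
    """Return (sample_name, read_number), or None if no format matches."""
    for pattern in PATTERNS:
        match = pattern.fullmatch(name)
        if match:
            return match.group("sample"), int(match.group("read"))
    return None


def build_sample_table(file_names):
    """Group recognized files by sample name, keyed by read number."""
    table = {}
    for name in file_names:
        parsed = parse_file_name(name)
        if parsed is None:
            continue  # file name does not match any expected format
        sample, read = parsed
        table.setdefault(sample, {})[read] = name
    return table
```

Files that match neither format are skipped here, which is one reason a sample sheet is useful when file names do not follow the expected formats.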

You are able to go in and update the sample table if any of the populated information is wrong or you want to add more information in new columns, but it will be a longer process than simply uploading a sample sheet with your data.

Editing a Sample

After a sample has been added to the sample table, it can be edited to provide new information or correct existing information. To edit an entry in the table, click the row you want to edit to open a pop-up window where you can edit each of the column entries for that row. Some of these entries may be limited to a fixed set of options, while others accept free text. If you need to change an entry to a value that is not allowed, the sample table's schema files will need to be updated.

Schemas

The sample's schema is defined by a pair of files that are used to set the columns and possible values for the samples table and define how these options are rendered. These two files are the Form Schema file and the UI Schema file.

Form schemas are implemented as a JSON Schema and define the properties and allowed values for a sample. For example, a schema file could include a section that creates a column named "experimental_group" that only allows entries of "treatment" and "control", and a column named "age" that only allows a numerical entry. These can be very helpful when organizing data. The default schema contains either information based on the sample sheet (if provided) or information pulled from the file names, which could be as basic as only including the sample name. An example Form schema could look like:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "experimental_group": {
      "type": "string",
      "title": "Experimental Group",
      "description": "Indicates whether a sample is a treatment or a control",
      "enum": [
        "treatment",
        "control"
      ],
      "enumNames": [
        "Treatment",
        "Control"
      ]
    },
    "age": {
      "type": "number",
      "title": "Age (years)"
    }
  }
}
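
To show what the example schema allows and rejects, the sketch below checks a sample entry against it in plain Python. It is a hand-rolled illustration that only understands the "type" (string or number) and "enum" keywords used above; the names FORM_SCHEMA and check_sample are hypothetical, and a complete JSON Schema validator would cover far more of the specification.

```python
# The constraint-bearing parts of the example Form schema.
FORM_SCHEMA = {
    "type": "object",
    "properties": {
        "experimental_group": {"type": "string", "enum": ["treatment", "control"]},
        "age": {"type": "number"},
    },
}


def check_sample(sample, schema):
    """Return a list of problems for one sample entry (empty if it passes)."""
    problems = []
    for name, rules in schema["properties"].items():
        if name not in sample:
            continue  # the example has no "required" list, so missing values pass
        value = sample[name]
        expected = rules.get("type")
        if expected == "number":
            # bool is a subclass of int in Python but is not a JSON number.
            ok = isinstance(value, (int, float)) and not isinstance(value, bool)
        elif expected == "string":
            ok = isinstance(value, str)
        else:
            ok = True  # other types are not handled by this sketch
        if not ok:
            problems.append(f"{name}: expected {expected}")
        elif "enum" in rules and value not in rules["enum"]:
            problems.append(f"{name}: {value!r} is not one of {rules['enum']}")
    return problems
```

With this schema, an entry with experimental_group set to "treatment" and age set to 42 passes, while "placebo" or a non-numeric age is reported as a problem.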

UI Schemas define how the form inputs are rendered in the Edit Sample pop-up. This file is intended for advanced usage and is not typically needed. You can learn more about UI schemas from the react-jsonschema-form documentation. An example UI schema could look like:

{
  "ui:options": {
    "title": "Flow Cytometry Samples",
    "description": "Sample description"
  },
  "experimental_group": {
    "ui:widget": "autoComplete"
  }
}
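
Because the two files are edited separately, a field name in the UI schema can drift out of sync with the Form schema. The sketch below lists UI schema keys that do not correspond to any Form schema property. The function name unknown_ui_fields is hypothetical, and the check assumes the react-jsonschema-form convention that keys beginning with "ui:" are rendering directives while other top-level keys are field names.

```python
def unknown_ui_fields(form_schema, ui_schema):
    """Return UI schema field names that are not properties of the Form schema."""
    properties = form_schema.get("properties", {})
    return sorted(
        key
        for key in ui_schema
        if not key.startswith("ui:")  # skip directives such as "ui:options"
        and key not in properties
    )
```

An empty result means every field mentioned in the UI schema exists in the Form schema.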

You can test your schemas using the react-jsonschema-form playground.

Users can view or update the current schema files by clicking the "Edit Schema" button in the upper right corner of the Samples page. There you can edit the Form and UI files directly.


If you are interested in updating the sample schema for a project, you can see any specific requirements for different dataset types in the pipeline catalog documentation or contact the Cirro team for help.

Table Options

The samples are organized in a table and are sorted by default by the date they were last updated. You can change the sorting and add or remove columns using the three dots in the table header, and you can filter the samples using the sidebar on the right side of the page.

You can also download the entire table as a CSV file by clicking the "Export Data" button at the top of the page.