Datasets are more than just a group of files. The data contained in these files often corresponds to a unit of observation which is described in Cirro as a "sample". These samples can describe a physical specimen, an experimental replicate, or whatever the atomic observation may be for a particular experiment.
Cirro keeps track of dataset files and sample metadata side by side so the metadata can be used as input for different pipeline analyses. The resulting dataset from any pipeline run will contain a snapshot of the samples that were used in its creation in order to maintain a complete picture of the dataset provenance.
Some things to keep in mind with samples include:
- Files have a many-to-one relationship with samples: a sample can be associated with many files, but each file can point to only one sample.
- Sample metadata is shared across all of the datasets in a project, but cannot be shared across projects.
- Sample entries can be modified by any user, while only administrators can modify the sample schema.
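The file-to-sample relationship described above can be pictured as a simple mapping. The following is a minimal sketch, not Cirro's internal data model; the function name `group_files_by_sample` and the file names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def group_files_by_sample(file_to_sample):
    """Invert a file -> sample mapping into sample -> sorted list of files."""
    files_by_sample = defaultdict(list)
    for file_name, sample in file_to_sample.items():
        files_by_sample[sample].append(file_name)
    return {sample: sorted(files) for sample, files in files_by_sample.items()}


# Hypothetical file names. Each file is a dictionary key, so it can point to
# only one sample, while one sample may collect several files.
file_to_sample = {
    "sample_a_R1.fastq.gz": "sample_a",
    "sample_a_R2.fastq.gz": "sample_a",
    "sample_b_R1.fastq.gz": "sample_b",
}
print(group_files_by_sample(file_to_sample))
```

Keying the mapping by file name is what enforces the "one sample per file" rule in this sketch: assigning the same file a second time would overwrite the first assignment rather than create a second one.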
Adding a Sample
The sample table is populated at the time new datasets are uploaded. There is no way to explicitly add a sample, because each sample must be connected to an uploaded file. Users have three options for how the sample table is populated: they can upload a sample sheet at the time they upload their dataset, add a sample sheet to an existing dataset using the "Edit Dataset" button on the dataset's Overview page, or skip the sample sheet and let Cirro auto-populate the sample table based on the file names. Sample sheets can also be updated with newer versions using the same "Edit Dataset" button.
Note that once you upload a dataset without a sample sheet, you can't go back and add that file. You would have to delete the dataset and re-upload it or add the sample sheet information to Cirro by hand. So be careful when choosing whether or not to upload a sample sheet with the rest of your files.
Providing a Sample Sheet
The easiest way to annotate metadata for a set of files is to upload a sample sheet alongside the files. A sample sheet should be saved in CSV format, with one line per sample. The name of the file should be samplesheet.csv, and it should live in the topmost level of the files you are uploading, not in any sub-folder.
The first column in the sample sheet should have the header sample, and should contain the name of the sample. If there is only one file per sample, then the next column should be named file_1 and should contain all of the file names. If there are two files per sample, then use file_1 and file_2. You can use any number of file_N columns in this way, and there is no requirement that each sample has the same number of files. However, be sure that all the files listed are included in the set of files you are uploading to Cirro. You should not have information on files in your sample sheet that you are not uploading at the same time. If you have FASTQ data, use the columns fastq_1 and fastq_2 instead of file_1 and file_2, to automatically annotate each file appropriately as read 1 and 2.
Any additional columns after file_N (or fastq_1/2) can be used to store the metadata associated with each sample.
sample,file_1,file_2,exp_group,replicate
sample_a,sample_a_R1.fastq.gz,sample_a_R2.fastq.gz,treatment,1
sample_b,sample_b_R1.fastq.gz,sample_b_R2.fastq.gz,treatment,2
sample_c,sample_c_R1.fastq.gz,sample_c_R2.fastq.gz,control,1
sample_d,sample_d_R1.fastq.gz,sample_d_R2.fastq.gz,control,2
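Before uploading, it can be useful to check a sample sheet against the rules described in this section. The following is a minimal sketch of such a check, not a Cirro tool: the function name `check_sample_sheet` and its messages are hypothetical, and it only covers the rules stated here (first column named sample, file_N or fastq_N columns, every listed file present in the upload, and each file assigned to a single sample).

```python
import csv
import io
import re

# Columns such as file_1, file_2, fastq_1, fastq_2 hold file names.
FILE_COLUMN = re.compile(r"^(file|fastq)_\d+$")


def check_sample_sheet(csv_text, uploaded_files):
    """Return a list of problems found in a sample sheet (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = []
    if not header or header[0] != "sample":
        problems.append("first column must be named 'sample'")
    file_columns = [name for name in header if FILE_COLUMN.match(name)]
    if not file_columns:
        problems.append("no file_N or fastq_N columns found")

    uploaded = set(uploaded_files)
    owner = {}  # file name -> first sample that listed it
    for row in reader:
        sample = row.get("sample")
        for column in file_columns:
            file_name = (row.get(column) or "").strip()
            if not file_name:
                continue  # samples may have different numbers of files
            if file_name not in uploaded:
                problems.append(f"{file_name} is listed but not being uploaded")
            if file_name in owner and owner[file_name] != sample:
                problems.append(f"{file_name} is assigned to more than one sample")
            owner.setdefault(file_name, sample)
    return problems
```

A usage example, with hypothetical file names:

```python
sheet = (
    "sample,file_1,file_2,exp_group\n"
    "sample_a,a_R1.fastq.gz,a_R2.fastq.gz,treatment\n"
    "sample_b,b_R1.fastq.gz,,control\n"
)
print(check_sample_sheet(sheet, ["a_R1.fastq.gz", "b_R1.fastq.gz"]))
```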
Expert Note: Metadata can be associated with files as well as samples. To add annotations to file_1 and file_2 separately, use the prefixes file_1_ and file_2_. For example, using fastq_1 and fastq_2 is the equivalent of using
Another benefit of the sample sheet is that you do not have to match any file name format requirements for your dataset type. If you have different file names, using a sample sheet is a great way to upload your data without editing the file names.
If a sample sheet is not provided when a dataset is uploaded, Cirro will auto-populate the sample table for those files using the file names.
In this case, Cirro will check the dataset type that the files are being uploaded under (e.g. VirScan) and the expected file name formats for that dataset type (in these formats, * denotes any set of characters). The portal will pull the information from the file name to populate the sample table with things like the "sample name" and "read number".
You are able to go in and update the sample table if any of the populated information is wrong or you want to add more information in new columns, but it will be a longer process than simply uploading a sample sheet with your data.
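To give a rough idea of how a sample name and read number can be pulled out of a file name, here is a minimal sketch. The pattern below is an assumption made for illustration only (names of the form <sample>_R1.fastq.gz and <sample>_R2.fastq.gz); it is not Cirro's actual matching rule, which depends on the dataset type, and `infer_sample_rows` is a hypothetical helper.

```python
import re

# ASSUMED pattern for illustration: <sample>_R1.fastq.gz or <sample>_R2.fastq.gz.
# The real expected formats are defined per dataset type in Cirro.
ASSUMED_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq\.gz$")


def infer_sample_rows(file_names):
    """Split file names into inferred sample rows and names that did not match."""
    rows = []
    unmatched = []
    for name in file_names:
        match = ASSUMED_PATTERN.match(name)
        if match:
            rows.append(
                {"sample": match["sample"], "read": int(match["read"]), "file": name}
            )
        else:
            unmatched.append(name)
    return rows, unmatched


rows, unmatched = infer_sample_rows(
    ["sample_a_R1.fastq.gz", "sample_a_R2.fastq.gz", "notes.txt"]
)
print(rows)
print(unmatched)
```

Files that do not fit the expected format end up in the unmatched list in this sketch, which mirrors why a sample sheet is the safer route when file names do not follow the dataset type's conventions.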
Editing a Sample
After a sample has been added to the sample table, it can be edited to provide new information or correct existing information. To edit an entry in the table, click the row you want to edit to open a pop-up window where you can edit each of the column entries for that row. Some of these entries may have limited options, while others may accept free text. If you need to change an entry to something that is not allowed, the sample table's schema files will need to be updated.
The sample's schema is a set of JSON Schema files that are used to define the columns and possible values for the sample table. For example, a schema file could include a section that creates a column named "experimental_group" that only allows entries of "treatment" and "control", or a column named "age" that only allows a numerical entry. These can be very helpful when organizing data. The default schema contains either information based on the sample sheet (if provided) or information pulled from the file names, which could be as basic as only including the sample name.
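The two example columns above could be expressed with standard JSON Schema keywords roughly as follows. This is a minimal sketch: the exact layout of Cirro's schema files is not shown here, so the surrounding structure is an assumption, and `check_entry` is a hypothetical hand-rolled check that understands only the `enum` keyword and `"type": "number"`, not a full JSON Schema validator.

```python
import json

# Illustrative schema fragment. The keywords (type, properties, enum) are
# standard JSON Schema; the overall file layout is an assumption.
SCHEMA = json.loads("""
{
  "type": "object",
  "properties": {
    "experimental_group": {"type": "string", "enum": ["treatment", "control"]},
    "age": {"type": "number"}
  }
}
""")


def check_entry(entry, schema=SCHEMA):
    """Return a list of problems for one sample entry (empty if none)."""
    problems = []
    for column, rules in schema["properties"].items():
        if column not in entry:
            continue  # this sketch treats every column as optional
        value = entry[column]
        if "enum" in rules and value not in rules["enum"]:
            problems.append(f"{column}: {value!r} is not one of {rules['enum']}")
        if rules.get("type") == "number":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"{column}: {value!r} is not a number")
    return problems


print(check_entry({"experimental_group": "control", "age": 42}))
print(check_entry({"experimental_group": "placebo", "age": "old"}))
```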
Users can view the current schema files by clicking the "Download Schema" button in the upper right corner of the Samples page. They can also click the "Upload Schema" button to update the schema with their own files to track more information.
If you are interested in updating the sample schema for a project, you can see any specific requirements for different dataset types in the pipeline catalog documentation or contact the Cirro team for help.
The samples are organized in a table and by default are sorted by the date they were last updated. You can change the sorting and add or remove columns using the three dots in the table header, and you can filter the samples using the sidebar on the right side of the page.
You can also download the entire table as a CSV file by clicking the "Export Data" button at the top of the page.