Uploading a dataset

# Connect to the Cirro Data Portal
from cirro import DataPortal
portal = DataPortal()

To upload a dataset to the portal, you need to specify:

  1. The project which it will be uploaded to
  2. The process which defines the type of data being uploaded
  3. The name and description of the dataset
  4. The path to the folder which contains the files to upload
  5. Optionally, a filtered list of files which should be uploaded (default is to upload all files in a folder)
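
Putting those pieces together, the upload call has the following general shape (all values here are illustrative placeholders; each argument is worked through step by step below):

# General shape of the upload call -- values are placeholders, not real inputs
dataset = project.upload_dataset(
    name = 'My dataset',
    description = 'Optional free-text description',
    process = 'Name (or ID) of an ingest process',
    upload_folder = '/path/to/local/folder',
    files = ['sample.R1.fastq.gz', 'sample.R2.fastq.gz']  # optional subset of files
)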

First, get the project where the dataset will be uploaded

project = portal.get_project_by_name("Test Project")
print(f"Name: {project.name}")
print(f"ID: {project.id}")
Name: Test Project
ID: 9a31492a-e679-43ce-9f06-d84213c8f7f7

Next, select the process which will be used for ingesting the data

# Get a list of all available ingest processes
ingest_processes = portal.list_processes(ingest=True)
print(f"There are {len(ingest_processes):,} different ingest processes available")
# Uncomment the line below to print a list of all available ingest processes
# print(ingest_processes.description())

# Select the process used to ingest paired DNA sequencing data
process = ingest_processes.get_by_name("Paired DNAseq (FASTQ)")
print("\nSelected process:")
print(str(process))

# To do the above in a single step, simply run:
# process = portal.get_process_by_name("Paired DNAseq (FASTQ)", ingest=True)

# When uploading the dataset (below), the process used for ingest can be specified
# either with the process object which was obtained above, or simply with the name
# of that process (or its ID).
There are 12 different ingest processes available

Selected process:
Name: Paired DNAseq (FASTQ)
Id: paired_dnaseq
Description: FASTQ files generated from paired-end sequencing of DNA libraries
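
As noted in the comments above, the `process` argument of the upload call accepts any of these equivalent forms (the object and ID shown here come from the output above):

# Equivalent ways to specify the ingest process when uploading
by_object = process                     # the process object retrieved above
by_name = "Paired DNAseq (FASTQ)"       # the process name
by_id = "paired_dnaseq"                 # the process ID (shown as "Id" above)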

If you want to upload only a subset of the files in a folder, make a list of the filenames (paths relative to the upload directory) that you want to upload. This overrides the default behavior of uploading everything in the upload directory.

We've included two helper functions to get a list of files in the specified directory and filter them.

You can also create the list of files manually (using those relative paths), as shown after the example below.

from cirro.file_utils import get_files_in_directory, filter_files_by_pattern

directory_to_upload = '/tmp'

files = get_files_in_directory(directory_to_upload)
files_to_upload = filter_files_by_pattern(files, '*.fastq.gz')
files_to_upload
['test.R1.fastq.gz', 'test.R2.fastq.gz']
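
If you'd rather skip the helpers, the same list can simply be written out by hand, using paths relative to the upload folder:

# Equivalent manual list of relative paths (no helper functions needed)
files_to_upload = ['test.R1.fastq.gz', 'test.R2.fastq.gz']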

Fill in the name and description of your new dataset in the variables below, then run the cell to check the files and upload the data.

# Upload the data and return the new dataset object
uploaded_dataset = project.upload_dataset(
    name = 'Test dataset',
    description = '',
    upload_folder = '/tmp',
    files = files_to_upload,
    process = "Paired DNAseq (FASTQ)"
)

print(str(uploaded_dataset))
Uploading file test.R1.fastq.gz (180.76 KB) | 100.0%|█████████████████████████ | 669kB/s
Uploading file test.R2.fastq.gz (180.76 KB) | 100.0%|█████████████████████████ | 1.30MB/s


Name: Test dataset
Id: beccbeea-5ad4-4a76-9c63-08d5d9b63f7b
Description: 
Status: COMPLETED

That was an example of what a successful upload looks like.
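
The returned dataset object can be used right away; for example, assuming it exposes `name` and `id` attributes like the project object above (as its printed summary suggests), you can record its ID for later reference:

# Record the new dataset's name and ID for later reference
# (assumes the dataset object exposes `name` and `id`, as its printed summary suggests)
print(f"Name: {uploaded_dataset.name}")
print(f"ID: {uploaded_dataset.id}")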

Next, let's take a look at one of the most common issues that you might run into: if you try to upload files which don't match the expected pattern, an error will be raised:

import json
ingest_10X = portal.get_process_by_name("Single-cell sequencing data (10X)", ingest=True)
print("File pattern expected for 10X data (note the sample index number and lane ID):")
print(json.dumps(ingest_10X.file_mapping_rules, indent=3))

# Try to upload the data (which will cause an error)
project.upload_dataset(
    name = 'Test dataset',
    description = '',
    upload_folder = '/tmp',
    files = files_to_upload,
    process = "Single-cell sequencing data (10X)"
)

File pattern expected for 10X data (note the sample index number and lane ID):
[
   {
      "glob": "*_S*_L???_{I,R}{1,2}_001.fastq.gz",
      "min": null,
      "max": null,
      "description": "Paired FASTQ (Illumina Format)",
      "isSample": null,
      "sampleMatchingPattern": "(?P<sampleName>\\S*)_S(?P<sampleIndex>\\S*)_L(?P<lane>\\S*)_(?P<read_index>I|R)(?P<read>1|2)_001\\.fastq\\.gz"
   }
]



---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

/var/folders/ck/j40906kx3mj90bcc8qs7gyxm0000gp/T/ipykernel_83747/2225702019.py in <module>
      5 
      6 # Try to upload the data (which will cause an error)
----> 7 project.upload_dataset(
      8     name = 'Test dataset',
      9     description = '',


~/Documents/GitHub/Cirro-client/cirro/sdk/project.py in upload_dataset(self, name, description, process, upload_folder, files)
    126 
    127         # Make sure that the files match the expected pattern
--> 128         check_dataset_files(files, process.file_mapping_rules, upload_folder)
    129 
    130         # Create the ingest process request


~/Documents/GitHub/Cirro-client/cirro/file_utils.py in check_dataset_files(files, file_mapping_rules, directory)
    148 
    149     if not any(map(functools.partial(match_pattern, files), file_mapping_rules)):
--> 150         raise ValueError("Files do not match dataset type. Expected file type requirements: \n" + "\n".join(
    151             [f"{rule.get('description', '')} {rule.get('glob')}" for rule in file_mapping_rules]))


ValueError: Files do not match dataset type. Expected file type requirements: 
Paired FASTQ (Illumina Format) *_S*_L???_{I,R}{1,2}_001.fastq.gz
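
One way to avoid this error is to test your file names against the process's matching pattern before uploading. The snippet below is not part of the SDK, just a minimal sketch using the `sampleMatchingPattern` regex printed above (the rules are dicts, as shown in the JSON output):

import re

# Hypothetical pre-flight check (not an SDK feature): compare each file name against
# the sampleMatchingPattern regex from the process's file mapping rules
pattern = ingest_10X.file_mapping_rules[0]["sampleMatchingPattern"]
unmatched = [f for f in files_to_upload if not re.match(pattern, f)]
if unmatched:
    print(f"These files would be rejected by this process: {unmatched}")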