Pipeline Catalog: Protein Structure

Uploading FASTA/FASTP Data

Protein sequence data is commonly stored in FASTA format, sometimes called FASTP when the file contains amino acid sequences rather than nucleotide sequences. Cirro expects protein sequence files to use one of the following extensions: fastp, fasta, fsa, fnp, fa, or fp. Gzip-compressed files carry an additional .gz suffix.
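As a rough illustration of the naming rule above, the following Python sketch checks a file name before upload. The function name has_expected_extension is hypothetical and not part of Cirro, and the case-sensitive comparison is an assumption, since the text lists lowercase extensions only.

```python
from pathlib import Path

# Extensions listed in the text above for protein sequence data.
PROTEIN_FASTA_EXTENSIONS = {"fastp", "fasta", "fsa", "fnp", "fa", "fp"}


def has_expected_extension(filename: str) -> bool:
    """Return True if the name ends in a listed extension, optionally followed by .gz.

    Hypothetical helper. Assumption: the comparison is case-sensitive,
    because only lowercase extensions are listed.
    """
    suffixes = [s.lstrip(".") for s in Path(filename).suffixes]
    # Drop a trailing "gz" so that "x.fasta.gz" is judged by "fasta".
    if suffixes and suffixes[-1] == "gz":
        suffixes = suffixes[:-1]
    return bool(suffixes) and suffixes[-1] in PROTEIN_FASTA_EXTENSIONS


if __name__ == "__main__":
    for name in ["protein.fasta", "protein.fa.gz", "reads.fastq", "protein.gz"]:
        print(name, has_expected_extension(name))
```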

A protein FASTA file may contain one or more protein sequences. Each sequence is introduced by a header line starting with the > character, followed by one or more lines of sequence.

Note: Many tools for predicting protein structure from primary amino acid sequence expect that each file contains a single amino acid sequence.
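When a FASTA file holds several sequences, it can be split into single-sequence records before running a structure prediction tool. The sketch below is one minimal way to do that in Python; the helper names read_fasta and format_record are hypothetical, and the 60-character line width is an arbitrary choice.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def format_record(header: str, sequence: str, width: int = 60) -> str:
    """Format one record as FASTA text, wrapping the sequence at `width`."""
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(wrapped) + "\n"


if __name__ == "__main__":
    text = ">protein_a example\nMKTAYI\nAKQR\n>protein_b\nGSHM\n"
    for header, sequence in read_fasta(text.splitlines()):
        print(format_record(header, sequence), end="")
```

Each string returned by format_record can then be written to its own file, giving one amino acid sequence per file.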

AlphaFold (Ready2Run)

AlphaFold is an open-source deep learning system that predicts protein structure from amino acid sequence with high accuracy. The AlphaFold Ready2Run pipeline is a curated workflow designed to run on AWS infrastructure and is available in Cirro.

Analysis Options:

There are two tiers of AlphaFold Ready2Run analysis available in Cirro:

Tier 1: Proteins shorter than 600 amino acids
Tier 2: Proteins from 600 to 1,200 amino acids
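The tier boundaries above can be expressed as a small lookup. This is only a sketch: the function name alphafold_tier is hypothetical, and treating a sequence of exactly 1,200 amino acids as Tier 2 is an assumption based on the wording "from 600 to 1,200".

```python
def alphafold_tier(sequence_length: int) -> int:
    """Return the Ready2Run tier (1 or 2) for a protein of the given length.

    Hypothetical helper. Assumption: the upper bound of 1,200 is inclusive.
    """
    if sequence_length < 1:
        raise ValueError("sequence length must be a positive integer")
    if sequence_length < 600:
        return 1  # Tier 1: shorter than 600 amino acids
    if sequence_length <= 1200:
        return 2  # Tier 2: 600 to 1,200 amino acids
    raise ValueError("sequences longer than 1,200 amino acids fall outside both tiers")


if __name__ == "__main__":
    for length in (250, 600, 1200):
        print(length, "-> tier", alphafold_tier(length))
```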

Inputs:

Protein sequence in FASTA format (note: only single-chain predictions are currently supported)

Outputs:

The protein structure predicted by AlphaFold for the input sequence is provided as a PDB file in the prediction/results.tar.gz archive.

Example output folder structure:

out/
    prediction/results.tar.gz *
    bfd_metrics/metrics.json
    bfd_msa/bfd_hits.a3m
    fasta/input.fasta
    features/features.pkl
    features_metrics/metrics.json
    mgnify_metrics/metrics.json
    mgnify_msa/mgnify_hits.sto
    prediction_metrics/metrics.json
    templates/pdb_hits.hhr
    templates_metrics/metrics.json
    uniref90_metrics/metrics.json
    uniref90_msa/uniref90_hits.sto
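Once the results.tar.gz archive has been downloaded, the PDB file can be pulled out with Python's standard tarfile module. The sketch below makes some assumptions: the function name extract_pdb_files is hypothetical, and because the names of the files inside the archive are not documented here, it simply selects every member ending in .pdb.

```python
import tarfile
from pathlib import Path


def extract_pdb_files(archive_path, dest_dir):
    """Copy every .pdb member of a gzip-compressed tar archive into dest_dir.

    Hypothetical helper. Files are written under their base name only, so
    paths stored in the archive cannot point outside dest_dir; members that
    share a base name would overwrite each other.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile() and member.name.endswith(".pdb"):
                target = dest / Path(member.name).name
                source = archive.extractfile(member)
                target.write_bytes(source.read())
                extracted.append(target)
    return extracted


if __name__ == "__main__":
    # Assumed local path, following the folder structure shown above.
    archive = Path("out/prediction/results.tar.gz")
    if archive.exists():
        for path in extract_pdb_files(archive, "structures"):
            print("extracted", path)
```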

Citation:

Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D. Highly accurate protein structure prediction with AlphaFold. Nature. 2021 Aug;596(7873):583-589. doi: 10.1038/s41586-021-03819-2. Epub 2021 Jul 15. PMID: 34265844; PMCID: PMC8371605.

ESMFold (Ready2Run)

ESMFold is an open-source large language model (LLM) that predicts protein structure from primary amino acid sequence with high accuracy. The ESMFold Ready2Run pipeline is a curated workflow designed to run on AWS infrastructure and is available in Cirro.

ESMFold can analyze protein sequences up to 800 amino acids long.
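A quick length check before submission can catch sequences that exceed this limit. The following Python sketch assumes the input text holds a single FASTA record; the names sequence_length and fits_esmfold are hypothetical.

```python
# Maximum sequence length stated in the text above.
ESMFOLD_MAX_LENGTH = 800


def sequence_length(fasta_text: str) -> int:
    """Count residues in a single-record FASTA string, skipping header lines."""
    return sum(
        len(line.strip())
        for line in fasta_text.splitlines()
        if not line.startswith(">")
    )


def fits_esmfold(fasta_text: str) -> bool:
    """Return True if the sequence is non-empty and within the 800-residue limit."""
    return 0 < sequence_length(fasta_text) <= ESMFOLD_MAX_LENGTH


if __name__ == "__main__":
    example = ">example\nMKTAYIAKQR\nQISFVKSHFS\n"
    print(sequence_length(example), fits_esmfold(example))
```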

Inputs:

Protein sequence in FASTA format (note: only single-chain predictions are currently supported)

Outputs:

The protein structure predicted by ESMFold for the input sequence is provided as a PDB file in pdb/prediction.pdb.

Example output folder structure:

out/
    pdb/prediction.pdb *
    fasta/input.fasta
    metrics/metrics.json
    outputs/outputs.pt
    pae/pae.png
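The predicted structure in pdb/prediction.pdb is a plain-text file and can be inspected without specialized software. The sketch below reads the B-factor column of each alpha-carbon ATOM record using the fixed-width PDB column layout. Structure predictors often store a per-residue confidence score in that column, but whether this pipeline does so, and on what scale, is an assumption to verify against your own output. The function name ca_b_factors is hypothetical.

```python
from pathlib import Path


def ca_b_factors(pdb_text: str) -> list:
    """Return the B-factor value of every alpha-carbon (CA) ATOM record.

    Hypothetical helper. Uses the fixed-width PDB layout: atom name in
    columns 13-16 and B-factor in columns 61-66 (1-based).
    """
    values = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    return values


if __name__ == "__main__":
    # Assumed local path, following the folder structure shown above.
    pdb_path = Path("out/pdb/prediction.pdb")
    if pdb_path.exists():
        scores = ca_b_factors(pdb_path.read_text())
        if scores:
            print(len(scores), "residues, mean B-factor", sum(scores) / len(scores))
```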

Citation:

Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, Smetanin N, Verkuil R, Kabeli O, Shmueli Y, Dos Santos Costa A, Fazel-Zarandi M, Sercu T, Candido S, Rives A. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023 Mar 17;379(6637):1123-1130. doi: 10.1126/science.ade2574. Epub 2023 Mar 16. PMID: 36927031.