Pipeline Catalog: Protein Structure
Uploading FASTA/FASTP Data
A commonly used format for protein sequence data is the FASTA format,
which may be referred to as FASTP when it contains amino acid
sequences instead of nucleotide sequences.
The file extensions expected by Cirro for protein sequence data are fastp, fasta, fsa, fnp, fa, and fp.
An additional .gz suffix is expected for gzip-compressed files.
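As an illustration of the naming rules above, a file name can be checked locally before upload. This is a minimal sketch, not part of Cirro: the function name has_protein_extension is hypothetical, and the check is case-sensitive, which is an assumption about how extensions are matched.

```python
# Extensions listed above for protein sequence data.
PROTEIN_EXTENSIONS = ("fastp", "fasta", "fsa", "fnp", "fa", "fp")


def has_protein_extension(filename: str) -> bool:
    """Return True if the name ends in a listed extension, optionally followed by .gz.

    Hypothetical helper; matching is case-sensitive (an assumption).
    """
    name = filename
    # Strip the optional gzip suffix before looking at the extension.
    if name.endswith(".gz"):
        name = name[: -len(".gz")]
    # Split on the last dot; a name without a dot has no extension.
    _, dot, ext = name.rpartition(".")
    return dot == "." and ext in PROTEIN_EXTENSIONS


if __name__ == "__main__":
    for candidate in ("protein.fasta", "protein.fa.gz", "reads.fastq"):
        print(candidate, has_protein_extension(candidate))
```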
A protein FASTA file may contain one or more protein sequences, each
of which is represented by a header line starting with the > character.
Note: Many tools for predicting protein structure from primary amino acid sequence expect that each file contains a single amino acid sequence.
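When a file holds several sequences, one way to satisfy the single-sequence expectation is to split it into one record per file before upload. The following is a minimal sketch under the assumption that the input is plain FASTA text with header lines starting with >; the function names read_fasta and split_records are illustrative and are not part of Cirro or any prediction tool.

```python
from typing import Iterator, List, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA text.

    The header is returned without the leading '>' and wrapped sequence
    lines are joined into a single string.
    """
    header = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_records(text: str) -> List[str]:
    """Return one single-sequence FASTA string per record in the input."""
    return [f">{header}\n{sequence}\n" for header, sequence in read_fasta(text)]


if __name__ == "__main__":
    example = ">seq1\nMKTAYI\nAKQR\n>seq2 example\nGSHM\n"
    for record in split_records(example):
        print(record, end="")
```

Each returned string can then be written to its own file, using one of the extensions listed earlier.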
AlphaFold (Ready2Run)
AlphaFold is an open source deep learning system for predicting protein structure from amino acid sequence with high accuracy. The AlphaFold Ready2Run pipeline is a curated workflow that is designed to run on AWS infrastructure, and is available for use in Cirro.
Analysis Options:
There are two tiers of AlphaFold Ready2Run analysis available in Cirro:
- Tier 1: Proteins shorter than 600 amino acids
- Tier 2: Proteins from 600 to 1,200 amino acids
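The tier boundaries above can be expressed as a small length check. This is only a sketch of the stated thresholds: the function name alphafold_tier is hypothetical, and returning None for empty sequences or sequences longer than 1,200 residues is a choice made for this example rather than documented behavior.

```python
from typing import Optional


def alphafold_tier(sequence: str) -> Optional[int]:
    """Map a protein sequence to the tier described above.

    Returns 1 for fewer than 600 residues, 2 for 600 to 1,200 residues,
    and None when no listed tier applies (illustrative convention).
    """
    length = len(sequence)
    if length == 0:
        return None  # an empty sequence cannot be analyzed
    if length < 600:
        return 1
    if length <= 1200:
        return 2
    return None  # longer than either listed tier


if __name__ == "__main__":
    for n in (150, 600, 1200, 1500):
        print(n, alphafold_tier("M" * n))
```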
Inputs:
- Protein sequence in FASTA format (note: only single-chain predictions are currently supported)
Outputs:
The protein structure predicted by AlphaFold for the input sequence is
provided as a PDB file in the prediction/results.tar.gz archive.
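After downloading the archive, the PDB file can be pulled out with Python's standard tarfile module. The sketch below assumes only that the archive is a gzip-compressed tar containing one or more members whose names end in .pdb; the member names inside the archive are not specified here, so the code matches on the suffix instead of a fixed path. The function name extract_pdb_files is hypothetical.

```python
import tarfile
from pathlib import Path
from typing import List


def extract_pdb_files(archive_path: str, dest_dir: str) -> List[Path]:
    """Copy every .pdb member of a .tar.gz archive into dest_dir.

    Only the base name of each member is used for the output file, so
    directory components stored in the archive are not recreated.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if not (member.isfile() and member.name.endswith(".pdb")):
                continue
            source = tar.extractfile(member)
            if source is None:
                continue  # not a regular file after all
            target = dest / Path(member.name).name
            target.write_bytes(source.read())
            written.append(target)
    return written


# Example usage (paths are placeholders):
# extract_pdb_files("out/prediction/results.tar.gz", "structures")
```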
Example output folder structure:
out/
prediction/results.tar.gz *
bfd_metrics/metrics.json
bfd_msa/bfd_hits.a3m
fasta/input.fasta
features/features.pkl
features_metrics/metrics.json
mgnify_metrics/metrics.json
mgnify_msa/mgnify_hits.sto
prediction_metrics/metrics.json
templates/pdb_hits.hhr
templates_metrics/metrics.json
uniref90_metrics/metrics.json
uniref90_msa/uniref90_hits.sto
Citation:
- Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D. Highly accurate protein structure prediction with AlphaFold. Nature. 2021 Aug;596(7873):583-589. doi: 10.1038/s41586-021-03819-2. Epub 2021 Jul 15. PMID: 34265844; PMCID: PMC8371605.
ESMFold (Ready2Run)
ESMFold is an open source large language model (LLM) for predicting protein structure from primary amino acid sequence with high accuracy. The ESMFold Ready2Run pipeline is a curated workflow that is designed to run on AWS infrastructure, and is available for use in Cirro.
The maximum length of the protein sequence that can be analyzed by ESMFold is 800 amino acids.
Inputs:
- Protein sequence in FASTA format (note: only single-chain predictions are currently supported)
Outputs:
The protein structure predicted by ESMFold for the input sequence is
provided as a PDB file in pdb/prediction.pdb.
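A quick sanity check on a downloaded PDB file is to count its residues and compare that number with the length of the input sequence. The sketch below assumes the file uses standard fixed-column PDB ATOM records and counts one alpha-carbon (CA) atom per residue; the function name ca_residues is illustrative.

```python
from typing import List, Tuple


def ca_residues(pdb_text: str) -> List[Tuple[str, int, str]]:
    """Return (chain ID, residue number, residue name) for each CA atom.

    Relies on the fixed-column PDB layout: atom name in columns 13-16,
    residue name in 18-20, chain ID in 22, residue number in 23-26.
    """
    residues: List[Tuple[str, int, str]] = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA":
            continue
        chain_id = line[21]
        residue_number = int(line[22:26])
        residue_name = line[17:20]
        residues.append((chain_id, residue_number, residue_name))
    return residues


# Example usage (path is a placeholder):
# with open("out/pdb/prediction.pdb") as handle:
#     print(len(ca_residues(handle.read())), "residues")
```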
Example output folder structure:
out/
pdb/prediction.pdb *
fasta/input.fasta
metrics/metrics.json
outputs/outputs.pt
pae/pae.png
Citation:
- Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, Smetanin N, Verkuil R, Kabeli O, Shmueli Y, Dos Santos Costa A, Fazel-Zarandi M, Sercu T, Candido S, Rives A. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023 Mar 17;379(6637):1123-1130. doi: 10.1126/science.ade2574. Epub 2023 Mar 16. PMID: 36927031.