Readme
AlphaFold clusters produced by our novel Foldseek structural clustering algorithm.
We clustered the AlphaFold database into 2.27M non-singleton structural clusters.
On this site we provide the cluster data.
Barrio-Hernandez I, Yeo J, Jänes J, Wein T, Varadi M, Velankar S, Beltrao P, Steinegger M. Clustering predicted structures at the scale of the known protein universe. bioRxiv (2023). doi: 10.1101/2023.03.09.531927
Updates
- 2024-12-03: Added 6-all-vs-all-similarity-queryId_targetId_eValue-Dec3_2024_updated.tsv.gz. The representative ids were substituted
- 2023-06-05: Added 6-all-vs-all-similarity-queryId_targetId_eValue.tsv.gz.
- 2023-05-25: Fixed an erroneous merge in the fragment removal step. Added the cluFlag column to file 3 so that all of the human proteins
- 2023-03-19: Added 5-allmembers-repId-entryId-cluFlag-taxId.tsv.gz, which provides the full mapping information for AFDB. The identifiers in files 1, 2 and 3 are removed.
Website
For an interactive view of a subset of the data, please check out our website.
Data description
1-AFDBClusters-entryId_repId_taxId.tsv.gz: AFDB clusters' information
- memberID: identifier of the member in the cluster
- repId: identifier of the representative
- taxId: the taxonomy ID of the member
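As a minimal sketch of how file 1 might be consumed, the function below groups member identifiers by their representative. It assumes the file is headerless and tab-separated, with columns in the order given by the file name (entryId, repId, taxId); that layout is an assumption, not something this README states. The function name `group_members_by_rep` is hypothetical.

```python
from collections import defaultdict


def group_members_by_rep(lines):
    """Map each representative ID to the list of its member IDs.

    Assumption: each line is `entryId<TAB>repId<TAB>taxId` with no header.
    `lines` can be any iterable of text lines.
    """
    clusters = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        entry_id, rep_id = fields[0], fields[1]
        clusters[rep_id].append(entry_id)
    return dict(clusters)
```

With the real file, the lines can come from `gzip.open("1-AFDBClusters-entryId_repId_taxId.tsv.gz", "rt")`, which yields decoded text line by line. Note that the resulting dictionary holds every member in memory.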
2-repId_isDark_nMem_repLen_avgLen_repPlddt_avgPlddt_LCAtaxId.tsv.gz: Cluster overview files
- repId: ID of the representative protein in the Foldseek cluster
- isDark: whether the protein is from a dark proteome (0 = not dark, 1 = dark)
- nMem: number of members in the Foldseek cluster without fragments
- repLen: length of the representative protein sequence
- avgLen: average length of the protein sequences in the Foldseek cluster
- repPlddt: predicted Local Distance Difference Test (pLDDT) score of the representative protein structure; pLDDT is a per-residue confidence score
- avgPlddt: average pLDDT score of the protein structures in the Foldseek cluster
- LCAtaxId: the taxonomy ID of the lowest common ancestor (LCA) of the proteins in the Foldseek cluster
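The cluster overview columns above could be parsed into typed records along these lines. This is a sketch under the assumption that file 2 is headerless and tab-separated, with the eight columns in file-name order. The names `ClusterSummary`, `parse_summaries` and `dark_confident`, and the default pLDDT cutoff of 70, are illustrative choices rather than part of the dataset.

```python
from typing import Iterable, Iterator, List, NamedTuple


class ClusterSummary(NamedTuple):
    rep_id: str
    is_dark: bool
    n_mem: int
    rep_len: int
    avg_len: float
    rep_plddt: float
    avg_plddt: float
    lca_tax_id: str


def parse_summaries(lines: Iterable[str]) -> Iterator[ClusterSummary]:
    """Yield one record per row (assumed: 8 tab-separated fields, no header)."""
    for line in lines:
        f = line.rstrip("\r\n").split("\t")
        if len(f) < 8:
            continue  # skip blank or malformed rows
        yield ClusterSummary(
            rep_id=f[0],
            is_dark=(f[1] == "1"),  # 0 = not dark, 1 = dark
            n_mem=int(f[2]),
            rep_len=int(f[3]),
            avg_len=float(f[4]),
            rep_plddt=float(f[5]),
            avg_plddt=float(f[6]),
            lca_tax_id=f[7],
        )


def dark_confident(summaries, min_avg_plddt: float = 70.0) -> List[ClusterSummary]:
    """Keep dark clusters whose average pLDDT reaches the (illustrative) cutoff."""
    return [s for s in summaries if s.is_dark and s.avg_plddt >= min_avg_plddt]
```

If the file turns out to have a header row, the numeric conversions will raise a `ValueError`, which makes a wrong layout assumption visible early.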
3-sapId_sapGO_repId_cluFlag_LCAtaxId.tsv.gz: Human cluster
- human ID: ID of the human protein
- GOs: Gene Ontology terms associated with the protein
- rep ID: ID of the representative protein if the human protein is clustered in the Foldseek clustering step
- cluFlag: 1 = clustered in AFDB50, 2 = clustered in AFDB clusters, 3 = removed (fragments in Foldseek clusters), 4 = removed (singletons in Foldseek clusters)
- LCA taxonomy ID: LCA taxonomy ID of the cluster where the human protein is assigned
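One way to use the human cluster file is to build a lookup from each human protein to its representative, keeping only proteins that were not removed (cluFlag 1 or 2). The sketch below assumes a headerless, tab-separated layout in file-name order (sapId, sapGO, repId, cluFlag, LCAtaxId); the function name `human_to_rep` is hypothetical, and the GO field is left unparsed because its internal separator is not documented here.

```python
RETAINED_FLAGS = {"1", "2"}  # flags 3 and 4 mark removed fragments and singletons


def human_to_rep(lines):
    """Map human protein ID -> representative ID for retained proteins.

    Assumption: rows are `sapId<TAB>sapGO<TAB>repId<TAB>cluFlag<TAB>LCAtaxId`.
    """
    mapping = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed rows
        sap_id, _gos, rep_id, clu_flag = fields[0], fields[1], fields[2], fields[3]
        if clu_flag in RETAINED_FLAGS:
            mapping[sap_id] = rep_id
    return mapping
```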
4-domain-clustering.zip: Domain analysis related files
5-allmembers-repId-entryId-cluFlag-taxId.tsv.gz: cluster information of 214 million AFDB sequences
- repId: identifier of the representative
- memId: identifier of the member (AFDB sequences)
- cluFlag: 1 = clustered in AFDB50, 2 = clustered in AFDB clusters, 3 = removed (fragments in Foldseek clusters), 4 = removed (singletons in Foldseek clusters)
- taxId: the taxonomy ID of the member
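Because file 5 covers about 214 million sequences, it is probably best processed as a stream rather than loaded whole. The following sketch tallies rows per cluFlag one line at a time. It assumes headerless, tab-separated rows in the order repId, memId, cluFlag, taxId; the function name `count_by_clu_flag` and the shortened labels are illustrative.

```python
from collections import Counter

# Labels follow the cluFlag description in this README (shortened).
CLU_FLAG_LABELS = {
    "1": "clustered in AFDB50",
    "2": "clustered in AFDB clusters",
    "3": "removed (fragment)",
    "4": "removed (singleton)",
}


def count_by_clu_flag(lines):
    """Count rows per cluFlag label without holding the file in memory."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        flag = fields[2]
        counts[CLU_FLAG_LABELS.get(flag, "unknown flag " + flag)] += 1
    return counts
```

Passing the handle returned by `gzip.open(path, "rt")` as `lines` keeps memory use roughly constant, since only the counter is retained.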
6-all-vs-all-similarity-queryId_targetId_eValue.tsv.gz: cluster similarity e-values
- queryId: identifier of the query
- targetId: identifier of the target
- eValue: similarity e-value
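A typical use of the all-vs-all file is to keep only pairs below an e-value cutoff. The sketch below assumes headerless, tab-separated rows of queryId, targetId and eValue; the function name `similar_pairs` and the default cutoff of 1e-3 are illustrative assumptions, not values taken from the analysis.

```python
def similar_pairs(lines, max_evalue=1e-3):
    """Yield (queryId, targetId, eValue) for non-self pairs at or below the cutoff.

    Assumption: rows are `queryId<TAB>targetId<TAB>eValue`, with the e-value
    written in a form that float() accepts (for example 1.2e-10).
    """
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        query_id, target_id = fields[0], fields[1]
        evalue = float(fields[2])
        if query_id != target_id and evalue <= max_evalue:
            yield query_id, target_id, evalue
```

Since the function is a generator, it can be chained directly onto a streamed `gzip.open(path, "rt")` handle and written out pair by pair.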
commands.gz: script used for the analysis