Readme
AlphaFold clusters produced by our novel Foldseek structural clustering algorithm.
We clustered the AlphaFold database into 2.27M non-singleton structural clusters.
On this site we provide the cluster data.
Barrio-Hernandez I, Yeo J, Jänes J, Wein T, Varadi M, Velankar S, Beltrao P, Steinegger M. Clustering predicted structures at the scale of the known protein universe. bioRxiv doi: doi.org/10.1101/2023.03.09.531927 (2023)
Updates
- 2026-03-10: The cluster flag information is amended for v6.
- 2026-03-04: Updated to AFDB v6. The AFDB Clusters representative-member file (no. 1), the cluster information (no. 2), the all-member representative-member file (no. 5) and the all-vs-all alignment of the AFDB Clusters representatives are updated.
- 2025-09-12: share_db/8-cluster_alignment_tmscore_lddt.tar.gz is added. It contains the alignment information of the AFDB Clusters.
- 2024-12-16: 6-all-vs-all-similarity-queryId_targetId_eValue-Dec16_2024.tsv.gz is added. The representative IDs were substituted and the e-values have changed.
- 2024-12-13: 7-AFDB50-repId_memId.tsv.gz is added.
- 2024-12-03: 6-all-vs-all-similarity-queryId_targetId_eValue-Dec3_2024_updated.tsv.gz is added. The representative ids were substituted
- 2023-06-05: 6-all-vs-all-similarity-queryId_targetId_eValue.tsv.gz is added
- 2023-05-25: An erroneous merge in the fragment removal step was fixed. A cluFlag column is added to file 3 so that all of the human proteins
- 2023-03-19: 5-allmembers-repId-entryId-cluFlag-taxId.tsv.gz is added. All mapping information of the AFDB is now available. The identifiers in files 1, 2 and 3 are removed.
Website
For an interactive view on a subset of the data please check out our website here.
Data description
Versions: v3, v6
v3
1-AFDBClusters-entryId_repId_taxId.tsv.gz — AFDB clusters' information
- memberID: identifier of the member in the cluster (entryId in the file name)
- repId: identifier of the representative
- taxId: taxonomy ID of the member
2-repId_isDark_nMem_repLen_avgLen_repPlddt_avgPlddt_LCAtaxId.tsv.gz — Cluster overview
- repId: ID of the representative protein in the Foldseek cluster
- isDark: whether the protein is from a dark proteome (0 = not dark, 1 = dark)
- nMem: number of members in the Foldseek cluster without fragments
- repLen: length of the representative protein sequence
- avgLen: average length of the protein sequences in the Foldseek cluster
- repPlddt: pLDDT score of the representative protein structure
- avgPlddt: average pLDDT score of the protein structures in the Foldseek cluster
- LCAtaxId: taxonomy ID of the lowest common ancestor (LCA) of the proteins in the Foldseek cluster
3-sapId_sapGO_repId_cluFlag_LCAtaxId.tsv.gz — Human cluster
- human ID: ID of the human protein
- GOs: Gene Ontology terms associated with the protein
- rep ID: ID of the representative protein if the human protein is clustered in the Foldseek clustering step
- cluFlag: 1 = clustered in AFDB50, 2 = clustered in AFDB clusters, 3 = removed (fragments in Foldseek clusters), 4 = removed (singletons in Foldseek clusters)
- LCA taxonomy ID: LCA taxonomy ID of the cluster where the human protein is assigned
4-domain-clustering.zip — Domain analysis related files
5-allmembers-repId-entryId-cluFlag-taxId.tsv.gz — Cluster information of 214M AFDB sequences
- repId: identifier of the representative
- memId: identifier of the member (an AFDB sequence)
- cluFlag: 1 = clustered in AFDB50, 2 = clustered in AFDB clusters, 3 = removed (fragments in Foldseek clusters), 4 = removed (singletons in Foldseek clusters)
- taxId: taxonomy ID of the member
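The cluFlag codes listed above can be turned into readable labels with a small lookup. The following is a minimal sketch, assuming the v3 file is tab-separated without a header row and that its columns follow the order of the list above (repId, memId, cluFlag, taxId). The function name `describe_flag` and the sample row are hypothetical and are not taken from the data.

```python
# Meaning of cluFlag in the v3 files, copied from the list above.
V3_CLU_FLAGS = {
    1: "clustered in AFDB50",
    2: "clustered in AFDB clusters",
    3: "removed (fragments in Foldseek clusters)",
    4: "removed (singletons in Foldseek clusters)",
}


def describe_flag(line):
    """Split one row (assumed order: repId, memId, cluFlag, taxId)
    and return the member ID with the decoded flag."""
    rep_id, mem_id, clu_flag, tax_id = line.rstrip("\n").split("\t")
    return mem_id, V3_CLU_FLAGS[int(clu_flag)]


# Made-up row for illustration only; it does not come from the data.
example = "REP_EXAMPLE\tMEM_EXAMPLE\t3\t9606\n"
print(describe_flag(example))
```

Note that the v6 files use a different, two-value cluFlag coding, so this lookup applies to the v3 files only.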
6-all-vs-all-similarity-queryId_targetId_eValue.tsv.gz — Cluster similarity e-value
- queryId: identifier of the query
- targetId: identifier of the target
- e-value: similarity e-value
7-AFDB50-repId_memId.tsv.gz — Sequence clusters AFDB50
- repId: identifier of the representative
- memId: identifier of the member (an AFDB50 sequence)
8-cluster_alignment_tmscore_lddt.tar.gz — Alignment of AFDB Clusters
- aln-query_target_fident_alnlen_mismatch_gapopen_qstart_qend_tstart_tend_evalue_bits_lddt_alntmscore.tsv: Alignment of AFDB Clusters - members to their representative
- lddt-repId_sumLddt_nMem_avgLddt.tsv: Average LDDT computed for each cluster
- tmScore-repId_sumLddt_nMem_avgLddt.tsv: Average TM-score computed for each cluster
commands.gz — Script used for analysis
v6
1-AFDBClusters-repId_entryId_cluFlag_taxId.tsv.gz — AFDB clusters' information
- repId: identifier of the representative
- entryId: identifier of the member in the cluster
- cluFlag: 1 = clustered in AFDB clusters (Foldseek), 2 = clustered in AFDB50 (MMseqs)
- taxId: taxonomy ID of the member
Preview
L2GK61 L2GK61 1 993615
L2GK61 A0A820QN40 1 433720
A0A377JXT1 A0A377JXT1 1 562
A0A377JXT1 A0AAW7VL20 1 562
A0A0C2YUD9 A0A0C2YUD9 1 1036808
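To illustrate how the representative-to-member rows can be used, here is a minimal sketch that groups members by their representative. It runs on the five preview rows above and assumes four whitespace- or tab-separated columns in the order repId, entryId, cluFlag, taxId. The function name `members_by_rep` is a hypothetical helper, not part of any released script.

```python
from collections import defaultdict

# The five preview rows shown above (repId, entryId, cluFlag, taxId).
PREVIEW = """\
L2GK61 L2GK61 1 993615
L2GK61 A0A820QN40 1 433720
A0A377JXT1 A0A377JXT1 1 562
A0A377JXT1 A0AAW7VL20 1 562
A0A0C2YUD9 A0A0C2YUD9 1 1036808
"""


def members_by_rep(lines):
    """Map each repId to the list of entryIds assigned to it."""
    clusters = defaultdict(list)
    for line in lines:
        fields = line.split()  # handles tabs as well as spaces
        if len(fields) != 4:
            continue  # skip blank or malformed rows
        rep_id, entry_id, _clu_flag, _tax_id = fields
        clusters[rep_id].append(entry_id)
    return dict(clusters)


clusters = members_by_rep(PREVIEW.splitlines())
print(clusters["L2GK61"])  # ['L2GK61', 'A0A820QN40']
```

In the preview, each representative also appears as a member of its own cluster, so the representative is included in the member list.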
2-repId_isDark_nMem_repLen_avgLen_repPlddt_avgPlddt_LCAtaxId.tsv.gz — Cluster overview
- repId: ID of the representative protein in the Foldseek cluster
- isDark: whether the protein is from a dark proteome (0 = not dark, 1 = dark)
- nMem: number of members in the Foldseek cluster
- repLen: length of the representative protein sequence
- avgLen: average length of the protein sequences in the Foldseek cluster
- repPlddt: pLDDT score of the representative protein structure
- avgPlddt: average pLDDT score of the protein structures in the Foldseek cluster
- LCAtaxId: taxonomy ID of the lowest common ancestor (LCA) of the proteins in the Foldseek cluster
Preview
A0A522E0C9 1 2 3 73 76.6667 70.5 73.94
A0A924B8Z5 0 5 475 173 161.318 72.12 94.8553
A0A7I4YSJ4 0 8 10 120 126.4 62.69 68.378
A0A947S9J0 1 12 16 292 283.062 77.38 72.9294
A0A291RMX2 0 75 230 169 174.457 89.69 83.5313
5-allmembers-repId-entryId-cluFlag-taxId.tsv.gz — Cluster information of 214M AFDB sequences
- repId: identifier of the representative
- memId: identifier of the member (an AFDB sequence)
- cluFlag: 1 = clustered in AFDB clusters (Foldseek), 2 = clustered in AFDB50 (MMseqs)
- taxId: taxonomy ID of the member
Preview
A0AAZ3QHE7 A0AAZ3QHE7 1 74940
A0A7V8DG14 A0A7V8DG14 1 0
A0A060CI31 A0A060CI31 1 0
A0A060ZM91 A0A060ZM91 1 576784
A0A644Z1U6 A0A644Z1U6 1 1076179
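Because this file covers all AFDB sequences, reading it line by line is likely more practical than loading it whole. The sketch below streams a gzipped TSV with Python's standard `gzip` module and counts rows per cluFlag. It assumes four columns with no header row; the function name `count_flags` is hypothetical, and the demo writes a tiny temporary file holding two of the preview rows above so that it runs without downloading the real file.

```python
import gzip
import os
import tempfile
from collections import Counter


def count_flags(path):
    """Stream a gzipped TSV and count rows per cluFlag (third column)."""
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) == 4:  # repId, memId, cluFlag, taxId
                counts[int(fields[2])] += 1
    return counts


# Demo input: two preview rows written to a temporary .tsv.gz file.
rows = "A0AAZ3QHE7\tA0AAZ3QHE7\t1\t74940\nA0A7V8DG14\tA0A7V8DG14\t1\t0\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tsv.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write(rows)
    demo_counts = count_flags(demo_path)

print(demo_counts)  # Counter({1: 2})
```

To run it on the real data, pass the path of the downloaded 5-allmembers-repId-entryId-cluFlag-taxId.tsv.gz to `count_flags`.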
6-all-vs-all-similarity-queryId_targetId_eValue.tsv.gz — Cluster similarity e-value
- queryId: identifier of the query
- targetId: identifier of the target
- e-value: similarity e-value
Preview
A0AA39JG44 A0AA39JG44 7.579E-41
A0AA39L9A4 A0AA39L9A4 0.000E+00
A0AA39VAE4 A0AA39VAE4 7.703E-36
A0AA39YNF0 A0AA39YNF0 4.106E-66
A0AA40EHK3 A0AA40EHK3 1.068E-71
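The e-values are written in scientific notation (for example 7.579E-41), which Python's `float` parses directly. As a minimal sketch, the following keeps only the rows whose e-value is at or below a cutoff. It runs on the five preview rows above; the function name `hits_below` and the cutoff of 1e-50 are assumptions chosen for illustration, not recommended settings.

```python
# The five preview rows shown above (queryId, targetId, e-value).
PREVIEW = """\
A0AA39JG44 A0AA39JG44 7.579E-41
A0AA39L9A4 A0AA39L9A4 0.000E+00
A0AA39VAE4 A0AA39VAE4 7.703E-36
A0AA39YNF0 A0AA39YNF0 4.106E-66
A0AA40EHK3 A0AA40EHK3 1.068E-71
"""


def hits_below(lines, max_e_value):
    """Yield (queryId, targetId, e-value) for rows with e-value <= max_e_value."""
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        query_id, target_id, e_value = fields[0], fields[1], float(fields[2])
        if e_value <= max_e_value:
            yield query_id, target_id, e_value


# 1e-50 is an arbitrary cutoff used only for this example.
kept = [query for query, _target, _e in hits_below(PREVIEW.splitlines(), 1e-50)]
print(kept)  # ['A0AA39L9A4', 'A0AA39YNF0', 'A0AA40EHK3']
```

All five preview rows are self-hits (queryId equals targetId); a comparison between clusters would typically skip such rows.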