AFDB Cluster

Name	Uploaded	Size
1-AFDBClusters-entryId_repId_taxId.tsv.gz	Thu, 25 May 2023 12:05:52 GMT	261.8 MB
2-repId_isDark_nMem_repLen_avgLen_repPlddt_avgPlddt_LCAtaxId.tsv.gz	Thu, 25 May 2023 11:46:25 GMT	39.1 MB
3-sapId_sapGO_repId_cluFlag_LCAtaxId.tsv.gz	Thu, 25 May 2023 11:43:11 GMT	2.1 MB
4-domain-clustering.zip	Thu, 09 Mar 2023 17:10:03 GMT	4.7 MB
5-allmembers-repId-entryId-cluFlag-taxId.tsv.gz	Thu, 25 May 2023 12:07:55 GMT	1.6 GB
6-all-vs-all-similarity-queryId_targetId_eValue-Dec16_2024.tsv.gz	Mon, 16 Dec 2024 08:30:51 GMT	4.5 GB
7-AFDB50-repId_memId.tsv.gz	Fri, 13 Dec 2024 03:58:34 GMT	1.2 GB
commands.gz	Fri, 14 Apr 2023 02:17:59 GMT	2.8 kB

Readme

AlphaFold clusters produced by our novel Foldseek structural clustering algorithm. We clustered the AlphaFold database into 2.27M non-singleton structural clusters. This site provides the cluster data.

Barrio-Hernandez I, Yeo J, Jänes J, Wein T, Varadi M, Velankar S, Beltrao P, Steinegger M. Clustering predicted structures at the scale of the known protein universe. bioRxiv doi: doi.org/10.1101/2023.03.09.531927 (2023)

Updates

2024-12-16: 6-all-vs-all-similarity-queryId_targetId_eValue-Dec16_2024.tsv.gz was added. The representative IDs were substituted and the e-values have changed.
2024-12-13: 7-AFDB50-repId_memId.tsv.gz was added.
2024-12-03: 6-all-vs-all-similarity-queryId_targetId_eValue-Dec3_2024_updated.tsv.gz was added. The representative IDs were substituted.
2023-06-05: 6-all-vs-all-similarity-queryId_targetId_eValue.tsv.gz was added.
2023-05-25: An erroneous merge in the fragment removal step was fixed. A cluFlag column was added to file 3 so that all of the human proteins
2023-03-19: 5-allmembers-repId-entryId-cluFlag-taxId.tsv.gz was added. It provides the complete mapping information for AFDB. The identifiers in files 1, 2 and 3 were removed.

Website

For an interactive view on a subset of the data please check out our website here.

Data description

1-AFDBClusters-entryId_repId_taxId.tsv.gz: AFDB cluster information

memberID: identifier of the cluster member
repId: identifier of the representative
taxId: the taxonomy ID of the member
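As a minimal sketch of how file 1 could be read, the snippet below groups members under their representative. It assumes the columns follow the order given in the file name (entryId, repId, taxId), are tab-separated, and have no header row; none of this is confirmed by the description above. The function names and the example IDs are hypothetical.

```python
import gzip
from collections import defaultdict


def group_members_by_rep(lines):
    """Map each representative ID to a list of (member ID, taxonomy ID) pairs.

    Assumes three tab-separated columns in the order entryId, repId, taxId.
    """
    clusters = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        member_id, rep_id, tax_id = fields
        clusters[rep_id].append((member_id, tax_id))
    return dict(clusters)


def load_file1(path):
    """Read the gzipped TSV from disk; the whole mapping is held in memory."""
    with gzip.open(path, "rt") as handle:
        return group_members_by_rep(handle)


# Hypothetical rows, for illustration only.
example = ["M1\tR1\t9606\n", "M2\tR1\t10090\n", "R2\tR2\t562\n"]
clusters = group_members_by_rep(example)
```

Taxonomy IDs are kept as strings here so that an unexpected value does not abort the read.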

2-repId_isDark_nMem_repLen_avgLen_repPlddt_avgPlddt_LCAtaxId.tsv.gz: Cluster overview files

repId: ID of the representative protein in the Foldseek cluster
isDark: whether the protein is from a dark proteome (0 = not dark, 1 = dark)
nMem: number of members in the Foldseek cluster, excluding fragments
repLen: length of the representative protein sequence
avgLen: average length of the protein sequences in the Foldseek cluster
repPlddt: predicted Local Distance Difference Test score (pLDDT, a per-residue confidence score) of the representative protein structure
avgPlddt: average pLDDT score of the protein structures in the Foldseek cluster
LCAtaxId: the taxonomy ID of the lowest common ancestor (LCA) of the proteins in the Foldseek cluster
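The eight columns above map naturally onto a typed record. The following sketch assumes tab-separated values in the order given in the file name, no header row, and integer values for nMem and repLen. The class name, the function names, and the pLDDT cutoff of 90 are illustrative choices, not part of the dataset.

```python
from typing import Iterable, Iterator, List, NamedTuple


class ClusterSummary(NamedTuple):
    rep_id: str
    is_dark: bool
    n_mem: int
    rep_len: int
    avg_len: float
    rep_plddt: float
    avg_plddt: float
    lca_tax_id: str


def parse_cluster_summaries(lines: Iterable[str]) -> Iterator[ClusterSummary]:
    """Yield one ClusterSummary per well-formed line (assumed column order)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 8:
            continue  # skip blank or malformed lines
        rep_id, is_dark, n_mem, rep_len, avg_len, rep_plddt, avg_plddt, lca = fields
        yield ClusterSummary(
            rep_id,
            is_dark == "1",  # 0 = not dark, 1 = dark
            int(n_mem),
            int(rep_len),
            float(avg_len),
            float(rep_plddt),
            float(avg_plddt),
            lca,
        )


def confident_dark_clusters(
    summaries: Iterable[ClusterSummary], min_avg_plddt: float = 90.0
) -> List[ClusterSummary]:
    """Keep dark clusters whose average pLDDT reaches the (hypothetical) cutoff."""
    return [s for s in summaries if s.is_dark and s.avg_plddt >= min_avg_plddt]


# Hypothetical rows, for illustration only.
example = [
    "R1\t1\t5\t120\t118.4\t92.5\t91.0\t2\n",
    "R2\t0\t3\t80\t79.0\t95.0\t94.0\t9606\n",
    "R3\t1\t2\t60\t61.5\t70.0\t65.0\t131567\n",
]
summaries = list(parse_cluster_summaries(example))
selected = confident_dark_clusters(summaries)
```

For the real download, the same parser can be fed the text handle returned by `gzip.open(path, "rt")`.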

3-sapId_sapGO_repId_cluFlag_LCAtaxId.tsv.gz: Human cluster

human ID: ID of the human protein
GOs: Gene Ontology terms associated with the protein
rep ID: ID of the representative protein if the human protein is clustered in the Foldseek clustering step
cluFlag: 1 = clustered in AFDB50, 2 = clustered in AFDB clusters, 3 = removed (fragments in Foldseek clusters), 4 = removed (singletons in Foldseek clusters)
LCA taxonomy ID: LCA taxonomy ID of the cluster where the human protein is assigned
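Because the representative ID is only meaningful when the human protein was clustered, a reader may want to treat it as optional. The sketch below assumes five tab-separated columns in the file-name order (sapId, sapGO, repId, cluFlag, LCAtaxId), no header row, and an empty repId field for unclustered proteins. The separator between GO terms is not documented here, so the GO field is kept as a raw string. All names are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class HumanEntry(NamedTuple):
    human_id: str
    go_terms_raw: str  # left unsplit: the GO-term separator is an unknown
    rep_id: Optional[str]
    clu_flag: int
    lca_tax_id: str


def parse_human_entries(lines: Iterable[str]) -> Dict[str, HumanEntry]:
    """Index entries by human protein ID (assumed column order, no header)."""
    entries = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        human_id, go_raw, rep_id, flag, lca = fields
        entries[human_id] = HumanEntry(human_id, go_raw, rep_id or None, int(flag), lca)
    return entries


def clustered_humans(entries: Dict[str, HumanEntry]) -> Dict[str, str]:
    """Map human IDs to representative IDs for cluFlag 1 or 2 (not removed)."""
    return {
        hid: e.rep_id
        for hid, e in entries.items()
        if e.clu_flag in (1, 2) and e.rep_id is not None
    }


# Hypothetical rows, for illustration only.
example = [
    "P1\tGO:0005515\tR1\t2\t9606\n",
    "P2\t\t\t4\t\n",
    "P3\tGO:0003677\tR3\t1\t33208\n",
]
entries = parse_human_entries(example)
mapping = clustered_humans(entries)
```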

4-domain-clustering.zip: Domain analysis related files

5-allmembers-repId-entryId-cluFlag-taxId.tsv.gz: cluster information of 214 million AFDB sequences

repId: identifier of the representative
memId: identifier of the member (AFDB sequence)
cluFlag: 1 = clustered in AFDB50, 2 = clustered in AFDB clusters, 3 = removed (fragments in Foldseek clusters), 4 = removed (singletons in Foldseek clusters)
taxId: the taxonomy ID of the member
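The four cluFlag codes can be given names and tallied in a single streaming pass, which matters for a file of this size. This is a sketch under the assumption that the file is tab-separated with columns repId, memId, cluFlag, taxId and has no header row. The enum and function names are hypothetical.

```python
from collections import Counter
from enum import IntEnum


class CluFlag(IntEnum):
    """Named versions of the cluFlag codes listed above."""
    CLUSTERED_IN_AFDB50 = 1
    CLUSTERED_IN_AFDB_CLUSTERS = 2
    REMOVED_FRAGMENT = 3
    REMOVED_SINGLETON = 4


def count_flags(lines):
    """Count members per cluFlag without holding the file in memory."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        _rep_id, _mem_id, flag, _tax_id = fields
        counts[CluFlag(int(flag))] += 1
    return counts


# Hypothetical rows, for illustration only.
example = [
    "R1\tM1\t2\t9606\n",
    "R1\tM2\t1\t9606\n",
    "R2\tM3\t4\t562\n",
    "R1\tM4\t2\t10090\n",
]
flag_counts = count_flags(example)
```

Passing the handle from `gzip.open(path, "rt")` as `lines` would process the real file line by line. A flag outside 1 to 4 raises `ValueError`, which makes unexpected codes visible rather than silently counted.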

6-all-vs-all-similarity-queryId_targetId_eValue.tsv.gz: cluster similarity e-value

queryId: identifier of the query
targetId: identifier of the target
e-value: similarity e-value
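One common use of such a table is to keep only pairs below an e-value threshold. The sketch below assumes three tab-separated columns (queryId, targetId, e-value), no header row, and e-values written in a form that Python's `float` accepts. The function name and the default cutoff of 1e-3 are illustrative, and dropping self-hits is a design choice of this example, not something the dataset prescribes.

```python
def significant_pairs(lines, max_evalue=1e-3):
    """Yield (query, target, e-value) for non-self pairs at or below the cutoff."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        query_id, target_id, evalue_text = fields
        if query_id == target_id:
            continue  # ignore self-hits, if the file contains any
        evalue = float(evalue_text)
        if evalue <= max_evalue:
            yield query_id, target_id, evalue


# Hypothetical rows, for illustration only.
example = [
    "R1\tR2\t1.5E-10\n",
    "R1\tR3\t0.5\n",
    "R2\tR2\t0\n",
    "R3\tR4\t1e-3\n",
]
pairs = list(significant_pairs(example))
```

Because the function is a generator, it can be applied to `gzip.open(path, "rt")` without loading the 4.5 GB file into memory.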

7-AFDB50-repId_memId.tsv.gz: sequence clusters AFDB50

repId: identifier of the representative
memId: identifier of the member (AFDB50 sequence)

commands.gz: scripts used for the analysis

License

All files are available under a Creative Commons Attribution 4.0 International License.