Name Uploaded Size
1-AFDBClusters-entryId_repId_taxId.tsv.gz Thu, 25 May 2023 12:05:52 GMT 261.8 MB
2-repId_isDark_nMem_repLen_avgLen_repPlddt_avgPlddt_LCAtaxId.tsv.gz Thu, 25 May 2023 11:46:25 GMT 39.1 MB
3-sapId_sapGO_repId_cluFlag_LCAtaxId.tsv.gz Thu, 25 May 2023 11:43:11 GMT 2.1 MB
4-domain-clustering.zip Thu, 09 Mar 2023 17:10:03 GMT 4.7 MB
5-allmembers-repId-entryId-cluFlag-taxId.tsv.gz Thu, 25 May 2023 12:07:55 GMT 1.6 GB
6-all-vs-all-similarity-queryId_targetId_eValue.tsv.gz Mon, 05 Jun 2023 12:15:46 GMT 7.3 GB
commands.gz Fri, 14 Apr 2023 02:17:59 GMT 2.8 kB

Readme

AlphaFold clusters produced by our novel Foldseek structural clustering algorithm. We clustered the AlphaFold database into 2.27M non-singleton structural clusters. This site provides the cluster data.

Barrio-Hernandez I, Yeo J, Jänes J, Wein T, Varadi M, Velankar S, Beltrao P, Steinegger M. Clustering predicted structures at the scale of the known protein universe. bioRxiv doi: doi.org/10.1101/2023.03.09.531927 (2023)

Updates

  • 2023-06-05: 6-all-vs-all-similarity-queryId_targetId_eValue.tsv.gz was added.
  • 2023-05-25: An erroneous merge in the fragment removal step was fixed. The cluFlag column was added to file 3 so that all of the human proteins
  • 2023-03-19: 5-allmembers-repId-entryId-cluFlag-taxId.tsv.gz was added. All mapping information for AFDB is now available. The identifiers in files 1, 2 and 3 are removed.

Website

For an interactive view of a subset of the data, please check out our website here.

Data description

1-AFDBClusters-entryId_repId_taxId.tsv.gz: AFDB clusters' information

  1. memberID: identifier of the member in the cluster
  2. repId: identifier of the representative
  3. taxId: the taxonomy ID of the member
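
As an illustration of how this file might be consumed, the following Python sketch groups members by their representative. It assumes the file is tab-separated, has no header line, and keeps the three columns in the order listed above; the function names, the local path argument, and the sample identifiers are hypothetical.

```python
import csv
import gzip
import io
from collections import defaultdict


def group_members(lines):
    """Map each repId to the (memberId, taxId) pairs assigned to it."""
    clusters = defaultdict(list)
    # Assumed column order: member ID, representative ID, taxonomy ID.
    for member_id, rep_id, tax_id in csv.reader(lines, delimiter="\t"):
        clusters[rep_id].append((member_id, int(tax_id)))
    return dict(clusters)


def load_clusters(path):
    """Read a gzipped TSV at `path` (a hypothetical local copy of file 1)."""
    with gzip.open(path, "rt", newline="") as handle:
        return group_members(handle)


# Made-up rows standing in for the real file.
sample = io.StringIO("M1\tR1\t9606\nM2\tR1\t10090\nM3\tR2\t562\n")
clusters = group_members(sample)
```

Holding every cluster in a dictionary is convenient for small subsets; for the full file a streaming approach or a database may be more appropriate.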

2-repId_isDark_nMem_repLen_avgLen_repPlddt_avgPlddt_LCAtaxId.tsv.gz: Cluster overview files

  1. repId: ID of the representative protein in the Foldseek cluster
  2. isDark: whether the protein is from a dark proteome (0 = not dark, 1 = dark)
  3. nMem: number of members in the Foldseek cluster without fragments
  4. repLen: length of the representative protein sequence
  5. avgLen: average length of the protein sequences in the Foldseek cluster
  6. repPlddt: predicted Local Distance Difference Test score (pLDDT, a per-residue confidence score) of the representative protein structure
  7. avgPlddt: average pLDDT score of the protein structures in the Foldseek cluster
  8. LCAtaxId: the taxonomy ID of the lowest common ancestor (LCA) of the proteins in the Foldseek cluster
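
The eight columns above can be parsed into a typed record, which makes filtering easier to read. The sketch below is only an illustration: it assumes a tab-separated file without a header line, integer values for nMem, repLen and LCAtaxId, and it uses made-up rows. The `ClusterSummary` name and the pLDDT cutoff of 70 are hypothetical choices, not part of the dataset.

```python
import csv
import io
from typing import NamedTuple


class ClusterSummary(NamedTuple):
    rep_id: str
    is_dark: bool
    n_mem: int
    rep_len: int
    avg_len: float
    rep_plddt: float
    avg_plddt: float
    lca_tax_id: int


def parse_summaries(lines):
    """Yield one ClusterSummary per tab-separated row."""
    for row in csv.reader(lines, delimiter="\t"):
        rep_id, is_dark, n_mem, rep_len, avg_len, rep_plddt, avg_plddt, lca = row
        yield ClusterSummary(
            rep_id,
            is_dark == "1",  # 0 = not dark, 1 = dark
            int(n_mem),
            int(rep_len),
            float(avg_len),
            float(rep_plddt),
            float(avg_plddt),
            int(lca),
        )


# Made-up rows standing in for the real file.
sample = io.StringIO(
    "R1\t1\t12\t250\t243.5\t88.1\t81.4\t2\n"
    "R2\t0\t3\t120\t118.0\t65.0\t60.2\t9606\n"
    "R3\t1\t5\t90\t95.2\t55.0\t52.3\t131567\n"
)
# Keep dark clusters whose average pLDDT reaches the illustrative cutoff.
confident_dark = [
    s.rep_id for s in parse_summaries(sample) if s.is_dark and s.avg_plddt >= 70.0
]
```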

3-sapId_sapGO_repId_cluFlag_LCAtaxId.tsv.gz: Human cluster

  1. human ID: ID of the human protein
  2. GOs: Gene Ontology terms associated with the protein
  3. rep ID: ID of the representative protein if the human protein is clustered in the Foldseek clustering step
  4. cluFlag: 1 = clustered in AFDB50, 2 = clustered in AFDB clusters, 3 = removed (fragments in Foldseek clusters), 4 = removed (singletons in Foldseek clusters)
  5. LCA taxonomy ID: LCA taxonomy ID of the cluster where the human protein is assigned

4-domain-clustering.zip: Domain analysis related files

5-allmembers-repId-entryId-cluFlag-taxId.tsv.gz: cluster information of 214 million AFDB sequences

  1. repId: identifier of the representative
  2. memId: identifier of the member (AFDB sequences)
  3. cluFlag: 1 = clustered in AFDB50, 2 = clustered in AFDB clusters, 3 = removed (fragments in Foldseek clusters), 4 = removed (singletons in Foldseek clusters)
  4. taxId: the taxonomy ID of the member
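
The four cluFlag values can be given readable names when summarising this file. The following sketch counts how many rows carry each flag; it assumes a tab-separated file without a header line and the column order listed above. The enum member names and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter
from enum import IntEnum


class CluFlag(IntEnum):
    CLUSTERED_AFDB50 = 1         # clustered in AFDB50
    CLUSTERED_AFDB_CLUSTERS = 2  # clustered in AFDB clusters
    REMOVED_FRAGMENT = 3         # removed (fragments in Foldseek clusters)
    REMOVED_SINGLETON = 4        # removed (singletons in Foldseek clusters)


def count_flags(lines):
    """Count rows per cluFlag value; raises ValueError on an unknown flag."""
    counts = Counter()
    for _rep_id, _mem_id, flag, _tax_id in csv.reader(lines, delimiter="\t"):
        counts[CluFlag(int(flag))] += 1
    return counts


# Made-up rows standing in for the real file.
sample = io.StringIO(
    "R1\tR1\t2\t9606\n"
    "R1\tM2\t2\t10090\n"
    "R1\tM3\t3\t562\n"
    "R9\tR9\t4\t562\n"
)
flag_counts = count_flags(sample)
```

Because the rows are processed one at a time, the same function can be applied to a handle returned by `gzip.open(path, "rt")` without loading the whole file into memory.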

6-all-vs-all-similarity-queryId_targetId_eValue.tsv.gz: cluster similarity e-values

  1. queryId: identifier of the query
  2. targetId: identifier of the target
  3. e-value: similarity e-value
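
Since this is the largest file in the listing, it is probably best read as a stream. The sketch below filters query-target pairs by e-value one row at a time; it assumes a tab-separated file without a header line, and the `max_evalue` default of 1e-3 is an illustrative threshold rather than one taken from the dataset. The sample rows are made up.

```python
import csv
import io


def significant_pairs(lines, max_evalue=1e-3):
    """Yield (queryId, targetId, e-value) for rows at or below max_evalue."""
    for query_id, target_id, evalue in csv.reader(lines, delimiter="\t"):
        value = float(evalue)  # accepts notation such as 1.2e-20 or 1E-03
        if value <= max_evalue:
            yield query_id, target_id, value


# Made-up rows standing in for the real file.
sample = io.StringIO("Q1\tT1\t1.2e-20\nQ1\tT2\t0.5\nQ2\tT3\t1E-03\n")
hits = list(significant_pairs(sample))
```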

commands.gz: scripts used for the analysis

License

All files are available under a Creative Commons Attribution 4.0 International License.