gbyuvd committed on
Commit
4a3ec25
1 Parent(s): f4b2e60

Update README.md

Files changed (1)
  1. README.md +8 -1
README.md CHANGED
@@ -89,7 +89,7 @@ model-index:
  name: Spearman Max
  ---
 
- # ChEmbed v0.1
+ # ChEmbed v0.1 - Chemical Embeddings
 
  This prototype is a [sentence-transformers](https://www.SBERT.net) model based on [MiniLM-L6-H384-uncased](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) and fine-tuned on around 1 million pairs of valid natural compounds' SELFIES [(Krenn et al. 2020)](https://github.com/aspuru-guzik-group/selfies) taken from COCONUTDB [(Sorokina et al. 2021)](https://coconut.naturalproducts.net/). It maps compounds' *Self-Referencing Embedded Strings* (SELFIES) into a 768-dimensional dense vector space and can potentially be used for chemical similarity, similarity search, classification, clustering, and more.
 
@@ -184,6 +184,13 @@ print(similarities.shape)
  ## Limitations
  For now, the model might be ineffective at embedding synthetic drugs, since it has so far been trained only on natural products. Also, the tokenizer has not yet been customized.
 
+ ## Testing Generated Embeddings' Clusters
+ The plots below show how the model's embeddings (at this stage) cluster different classes of compounds, compared to using MACCS fingerprints.
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/c8_5IWjPgbrGY0Z9-ZHop.png)
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/EHEcaSnra4lldI0LY5tGq.png)
+
  ### Framework Versions
  - Python: 3.9.13
  - Sentence Transformers: 3.0.1
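
Note on the new "Testing Generated Embeddings' Clusters" section: the commit adds two plots but does not say how the clustering was evaluated. As a rough illustration only, the sketch below shows one possible way to put a number on how well a representation separates compound classes: the mean cosine similarity of same-class pairs minus that of different-class pairs. The function names (`cosine`, `separation_score`), the toy vectors, and the class labels are all hypothetical and are not taken from the README or from real model output.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def separation_score(vectors, labels):
    """Mean same-class similarity minus mean different-class similarity.

    Higher values mean that compounds sharing a label sit closer together
    than compounds with different labels.
    """
    within, between = [], []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine(vectors[i], vectors[j])
        if labels[i] == labels[j]:
            within.append(sim)
        else:
            between.append(sim)
    if not within or not between:
        raise ValueError("need at least one same-class and one different-class pair")
    return sum(within) / len(within) - sum(between) / len(between)


# Toy placeholder vectors (assumption): stand-ins for embeddings, not real outputs.
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
toy_labels = ["class_a", "class_a", "class_b", "class_b"]

score = separation_score(toy_vectors, toy_labels)
print(f"separation score: {score:.3f}")
```

Under these assumptions, the same function could be applied once to the model's embeddings and once to MACCS fingerprints converted to 0/1 float vectors, giving two scores that can be compared alongside the plots.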