gbyuvd/ChemEmbed-v01

Sentence Similarity

sentence-transformers

feature-extraction

Generated from Trainer

dataset_size:1,183,174

loss:CosineSimilarityLoss

text-embeddings-inference

Inference Endpoints

gbyuvd committed on Jun 27

Commit 4a3ec25 • 1 Parent(s): f4b2e60

Update README.md

Files changed (1)

README.md +8 -1

@@ -89,7 +89,7 @@ model-index:
       name: Spearman Max
 ---
-# ChEmbed v0.1
+# ChEmbed v0.1 - Chemical Embeddings
 This prototype is a [sentence-transformers](https://www.SBERT.net) model based on [MiniLM-L6-H384-uncased](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased), fine-tuned on around 1 million pairs of valid natural compounds' SELFIES [(Krenn et al. 2020)](https://github.com/aspuru-guzik-group/selfies) taken from COCONUTDB [(Sorokina et al. 2021)](https://coconut.naturalproducts.net/). It maps compounds' *Self-Referencing Embedded Strings* (SELFIES) into a 768-dimensional dense vector space and can potentially be used for chemical similarity, similarity search, classification, clustering, and more.
@@ -184,6 +184,13 @@ print(similarities.shape)
 ## Limitations
 For now, the model might be ineffective at embedding synthetic drugs, since it has so far been trained only on natural products. Also, the tokenizer has not yet been customized.
+## Testing Generated Embeddings' Clusters
+The plots below show how the model's embeddings (at this stage) cluster different classes of compounds, compared to using MACCS fingerprints.
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/c8_5IWjPgbrGY0Z9-ZHop.png)
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/EHEcaSnra4lldI0LY5tGq.png)
 ### Framework Versions
 - Python: 3.9.13
 - Sentence Transformers: 3.0.1
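
Note on the added section: the README compares clusters built from the model's embeddings with clusters built from MACCS fingerprints, but the commit does not state which similarity metric or clustering method produced the plots. As a rough illustration only, dense embeddings are commonly compared with cosine similarity and bit fingerprints with Tanimoto similarity. The sketch below shows both measures on made-up toy data; the function names, vectors, and bit indices are hypothetical, and no real model or fingerprint is computed.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two dense vectors (e.g. model embeddings)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# Toy stand-ins (assumptions): 4-dimensional "embeddings" and a few
# fingerprint bit indices, not real ChemEmbed or MACCS output.
emb_a = [0.2, 0.1, 0.7, 0.0]
emb_b = [0.1, 0.1, 0.8, 0.1]
fp_a = {3, 17, 42, 101}
fp_b = {3, 42, 88}

cos_ab = cosine_similarity(emb_a, emb_b)
tani_ab = tanimoto_similarity(fp_a, fp_b)
print(f"cosine: {cos_ab:.3f}, tanimoto: {tani_ab:.3f}")
```

With real data, the dense vectors would come from the model and the bit sets from a cheminformatics toolkit. Either similarity matrix could then be passed to the same clustering or projection step so the two representations are compared on equal terms.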