ChemEmbed-v01 / README.md
gbyuvd's picture
Update README.md
3b85d55 verified
|
raw
history blame
No virus
13.4 kB
---
datasets: [COCONUTDB]
language: [code]
library_name: sentence-transformers
metrics:
- pearson_cosine
- spearman_cosine
- pearson_manhattan
- spearman_manhattan
- pearson_euclidean
- spearman_euclidean
- pearson_dot
- spearman_dot
- pearson_max
- spearman_max
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:1,183,174
- loss:CosineSimilarityLoss
widget:
- source_sentence: '[O][=C][Branch2][Branch2][Ring1][O][C][C][Branch2][Ring1][=Branch1][O][C][=Branch1][C][=O][C][C][C][C][C][C][=C][C][C][C][C][C][C][C][C][C][C][O][P][=Branch1][C][=O][Branch1][C][O][O][C][C][Branch2][Ring1][Branch1][O][C][O][C][Branch1][Ring1][C][O][C][Branch1][C][O][C][Branch1][C][O][C][Ring1][#Branch2][O][C][Branch1][C][O][C][Branch1][C][O][C][Branch1][C][O][C][Ring2][Ring1][Branch1][O][C][O][C][Branch1][Ring1][C][O][C][Branch1][C][O][C][Branch1][C][O][C][Ring1][#Branch2][O][C][C][C][C][C][C][C][C][C][C][C][C][C][C]'
sentences:
- '[O][=C][Branch2][Ring1][N][N][N][=C][Branch1][N][C][=C][C][=C][Branch1][C][Cl][C][=C][Ring1][#Branch1][C][=C][C][=C][Branch1][C][Cl][C][=C][Ring1][#Branch1][C][=C][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=C][Ring1][#Branch2][O]'
- '[O][=C][Branch1][C][O][C][=C][Branch1][C][C][C][C][=Branch1][C][=O][O][C]'
- '[O][=C][Branch1][C][O][C][C][C][C][C][C][C][C][Branch1][C][O][C][Branch1][C][O][C][Branch1][C][O][C][#C][C][Branch1][C][O][C][C][C][C]'
- source_sentence: '[O][=C][Branch1][#Branch1][O][C][Branch1][C][C][C][C][C][C][C][C][C][C][C]'
sentences:
- '[O][=C][O][C][C][Branch1][C][O][C][C][Ring1][#Branch1][=C][C][C][C][C][C][C][C][C][C][C][C][C][C]'
- '[O][=C][C][=C][C][O][C][O][C][=Ring1][Branch1][C][=C][Ring1][=Branch2][Br]'
- '[O][=C][Branch2][#Branch1][=C][O][C][C][=Branch1][C][=C][C][C][Branch1][#Branch1][O][C][=Branch1][C][=O][C][C][C][C][=Branch1][C][=O][C][Branch2][=Branch1][Ring1][O][C][C][Ring1][Branch2][Branch1][C][C][C][Ring1][=Branch1][Branch1][C][O][C][Branch1][#Branch1][O][C][=Branch1][C][=O][C][C][Branch1][#Branch1][O][C][=Branch1][C][=O][C][C][Ring2][Ring1][N][Branch1][#C][C][O][C][=Branch1][C][=O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][Branch1][Branch2][C][O][C][=Branch1][C][=O][C][C][Ring2][Ring2][S][C][C][=C][C][=C][C][=C][C][=C][Ring1][=Branch1]'
- source_sentence: '[O][=C][O][C][=C][Branch2][Ring1][#C][C][=C][C][O][C][N][Branch1][S][C][=C][C][=C][Branch1][=Branch1][O][C][C][C][C][C][=C][Ring1][O][C][C][Ring2][Ring1][Branch1][=Ring1][P][C][=Branch1][=C][=C][Ring2][Ring1][=Branch2][C][C][=C][C][=C][C][=C][Ring1][=Branch1][C]'
sentences:
- '[O][=C][N][C][Branch1][S][C][=Branch1][C][=O][N][C][=C][C][=C][C][=C][Ring1][N][Ring1][=Branch1][C][C][=Branch1][C][=O][N][C][C][C][=N][C][=Branch1][Branch1][=C][S][Ring1][Branch1][C]'
- '[O][=C][C][=C][Branch2][Branch2][O][O][C][=C][C][Branch2][Ring2][#C][O][C][C][Branch1][Ring1][C][O][C][C][C][=C][C][NH1][C][=C][C][=Ring1][Branch1][C][=C][Ring1][=Branch2][C][C][=Branch1][C][=O][C][Branch1][Ring1][C][O][C][C][=Branch1][Branch2][=C][Ring2][Ring1][#C][Ring2][Ring1][O][C][Ring1][#Branch2][=C][C][Branch2][Ring2][O][O][C][Branch1][Ring2][C][Ring1][Branch1][C][Branch1][C][O][C][Branch2][Ring1][Branch2][C][=C][C][C][Branch1][S][N][Branch1][=Branch1][C][C][C][O][C][C][C][C][Ring1][S][Ring1][O][C][C][C][C][C][O][=C][Ring2][Branch1][=Branch2][C][O][C][=Branch1][C][=O][O][C][C]'
- '[O][C][C][O][C][Branch2][Branch2][#Branch2][O][C][C][Branch1][C][O][C][Branch1][C][O][C][Branch2][#Branch1][#Branch2][O][C][Ring1][Branch2][O][C][C][C][C][Branch2][=Branch1][=N][C][=Branch2][Branch1][P][=C][C][Branch1][C][O][C][Branch1][C][C][C][Ring1][Branch2][C][C][Branch1][C][O][C][C][Branch1][=Branch2][C][C][C][Ring1][O][Ring1][Branch1][C][C][Branch2][Ring1][Branch1][O][C][O][C][Branch1][Ring1][C][O][C][Branch1][C][O][C][Branch1][C][O][C][Ring1][#Branch2][O][Branch1][C][C][C][C][=C][C][Branch1][C][C][C][C][Ring2][Ring2][=Branch2][Branch1][C][C][C][C][C][O][C][Branch1][C][O][C][Branch1][C][O][C][Ring2][Branch1][S][O]'
- source_sentence: '[O][=C][O][C][=C][C][=C][Branch1][O][O][C][=Branch1][C][=O][N][Branch1][C][C][C][C][=C][Ring1][N][C][=Branch1][Ring2][=C][Ring1][S][C][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Ring1][=Branch2]'
sentences:
- '[O][=C][C][=Branch2][#Branch1][=N][=C][C][Branch2][Ring2][N][C][NH1+1][C][=Branch2][Ring2][Ring1][=C][Branch1][#Branch2][C][=N][C][=C][C][Ring1][Branch2][Ring1][Branch1][C][C][=C][C][=Branch1][S][=C][C][=Branch1][Ring2][=C][Ring1][=Branch1][C][C][C][O][C][C][Ring1][=Branch1][C][C][C][O][C][Branch1][C][O][C][C][Branch1][C][C][C][C][C][=Branch1][C][=O][C][Branch1][C][C][Branch1][C][C][C][Ring1][#Branch2][C][C][C][Ring1][=C][Branch1][C][C][C][Ring2][Ring2][=C][Branch1][C][C][C][Ring2][Branch1][C][C][Branch1][C][C][C][C][Branch1][C][O][C][O][C][Ring1][Ring1][Branch1][C][C][C]'
- '[O][=C][Branch1][C][O][C][Branch2][O][P][O][C][C][=Branch1][C][=O][N][C][C][Branch1][C][O][C][C][Branch2][=Branch2][=Branch2][O][C][C][=Branch1][C][=O][N][C][C][Branch1][C][O][C][C][Branch2][=Branch1][P][O][C][C][Branch1][C][O][C][Branch1][C][O][C][Branch1][C][O][C][Branch2][Branch1][#Branch2][O][P][=Branch1][C][=O][Branch1][C][O][O][C][C][Branch2][Ring1][O][N][C][=Branch1][C][=O][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][Branch1][C][O][C][=C][C][C][C][C][C][C][C][C][C][C][C][C][Ring2][Branch1][#Branch1][O][Branch1][P][O][C][Ring2][Branch1][S][C][Branch1][C][O][C][Branch1][C][O][C][O][C][C][=Branch1][C][=O][O][Branch1][P][O][C][Ring2][#Branch1][=Branch1][C][Branch1][C][O][C][Branch1][C][O][C][O][C][C][=Branch1][C][=O][O][O][C][Branch1][N][C][Branch1][C][O][C][Branch1][C][O][C][O][C][C][Branch1][Branch2][N][C][=Branch1][C][=O][C][O][C][Branch1][C][O][C][Ring2][=Branch2][Branch2]'
- '[O][=C][Branch1][C][N][C][=N][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Branch1][C][=O][N][Ring1][O][C][C][O][C]'
- source_sentence: '[O][=C][Branch1][#Branch1][C][=C][C][C][C][=C][C]'
sentences:
- '[O][=C][Branch1][C][O][C][C][C][C][C][C][C][C][Branch1][C][O][C][=C][C][#C][C][=C][C][C][C]'
- '[O][C][C][O][C][Branch2][=Branch2][Ring1][O][C][Branch1][C][C][C][C][C][Branch1][C][O][O][C][C][C][C][C][C][C][C][C][Branch2][Branch1][N][O][C][O][C][Branch1][Ring1][C][O][C][Branch2][Ring2][#Branch1][O][C][O][C][Branch1][Ring1][C][O][C][Branch1][C][O][C][Branch1][C][O][C][Ring1][#Branch2][O][C][O][C][Branch1][C][C][C][Branch1][C][O][C][Branch1][C][O][C][Ring1][=Branch2][O][C][Branch1][C][O][C][Ring2][Ring1][#C][O][C][C][C][Ring2][Ring2][#Branch1][Branch1][C][C][C][Ring2][Ring2][N][C][C][C][Ring2][Ring2][S][Branch1][C][C][C][Ring2][Branch1][Ring2][C][Ring2][Branch1][Branch2][C][C][Branch1][C][O][C][Branch1][C][O][C][Ring2][=Branch1][=Branch1][O]'
- '[O][=C][Branch1][#Branch2][C][=C][C][#C][C][#C][C][#C][C][N][C][C][C][=C][C][=C][C][=C][Ring1][=Branch1]'
model-index:
- name: SentenceTransformer
results:
- task:
type: semantic-similarity
name: Semantic Similarity
dataset:
name: NP isotest
type: NP-isotest
metrics:
- type: pearson_cosine
value: 0.936731178796972
name: Pearson Cosine
- type: spearman_cosine
value: 0.93027366634068
name: Spearman Cosine
- type: pearson_manhattan
value: 0.826340669261792
name: Pearson Manhattan
- type: spearman_manhattan
value: 0.845192256146849
name: Spearman Manhattan
- type: pearson_euclidean
value: 0.842726066770598
name: Pearson Euclidean
- type: spearman_euclidean
value: 0.865381289346298
name: Spearman Euclidean
- type: pearson_dot
value: 0.924283770507162
name: Pearson Dot
- type: spearman_dot
value: 0.923230424410894
name: Spearman Dot
- type: pearson_max
value: 0.936731178796972
name: Pearson Max
- type: spearman_max
value: 0.93027366634068
name: Spearman Max
---
# ChEmbed v0.1 - Chemical Embeddings
This prototype is a [sentence-transformers](https://www.SBERT.net) based on [MiniLM-L6-H384-uncased](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) fine-tuned on around 1 million pairs of valid natural compounds' SELFIES [(Krenn et al. 2020)](https://github.com/aspuru-guzik-group/selfies) taken from COCONUTDB [(Sorokina et al. 2021)](https://coconut.naturalproducts.net/). It maps compounds' *Self-Referencing Embedded Strings* (SELFIES) into a 768-dimensional dense vector space, potentially can be used for chemical similarity, similarity search, classification, clustering, and more.
I am planning to train this model with more epochs on current dataset, before moving on to a larger dataset with 6 million pairs generated from ChemBL34. However, this will take some time due to computational and financial constraints. A future project of mine is to develop a custom model specifically for cheminformatics to address any biases and optimization issues in repurposing an embedding model designed for NLP tasks.
### Disclaimer: For Academic Purposes Only
The information and model provided is for academic purposes only. It is intended for educational and research use, and should not be used for any commercial or legal purposes. The author do not guarantee the accuracy, completeness, or reliability of the information.
## Model Details
### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [MiniLM-L6-H384-uncased](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased)
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 tokens
- **Similarity Function:** Cosine Similarity
- **Training Dataset:** SELFIES pairs generated from COCONUTDB
- **Language:** SELFIES
- **License:** CC BY-NC 4.0
### Full Model Architecture
```
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': True, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': False})
)
```
## Usage
### Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("gbyuvd/ChEmbed-v01")
# Run inference
sentences = [
'[O][=C][Branch1][#Branch1][C][=C][C][C][C][=C][C]',
'[O][=C][Branch1][C][O][C][C][C][C][C][C][C][C][Branch1][C][O][C][=C][C][#C][C][=C][C][C][C]',
'[O][C][C][O][C][Branch2][=Branch2][Ring1][O][C][Branch1][C][C][C][C][C][Branch1][C][O][O][C][C][C][C][C][C][C][C][C][Branch2][Branch1][N][O][C][O][C][Branch1][Ring1][C][O][C][Branch2][Ring2][#Branch1][O][C][O][C][Branch1][Ring1][C][O][C][Branch1][C][O][C][Branch1][C][O][C][Ring1][#Branch2][O][C][O][C][Branch1][C][C][C][Branch1][C][O][C][Branch1][C][O][C][Ring1][=Branch2][O][C][Branch1][C][O][C][Ring2][Ring1][#C][O][C][C][C][Ring2][Ring2][#Branch1][Branch1][C][C][C][Ring2][Ring2][N][C][C][C][Ring2][Ring2][S][Branch1][C][C][C][Ring2][Branch1][Ring2][C][Ring2][Branch1][Branch2][C][C][Branch1][C][O][C][Branch1][C][O][C][Ring2][=Branch1][=Branch1][O]',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
## Dataset
| Dataset | Reference | Number of Pairs |
|:---------------------------|:-----------|:-----------|
| COCONUTDB (0.8:0.1:0.1 split) | [(Sorokina et al. 2021)](https://coconut.naturalproducts.net/) | 1,183,174 |
## Evaluation
### Metrics
#### Semantic Similarity
* Dataset: `NP-isotest`
* Number of test pairs: 118,318
* Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
| Metric | Value |
|:--------------------|:-----------|
| pearson_cosine | 0.9367 |
| **spearman_cosine** | **0.9303** |
| pearson_manhattan | 0.8263 |
| spearman_manhattan | 0.8452 |
| pearson_euclidean | 0.8654 |
| spearman_euclidean | 0.9243 |
| pearson_dot | 0.9232 |
| spearman_dot | 0.9367 |
| pearson_max | 0.9303 |
| spearman_max | 0.8961 |
## Limitations
For now, the model might be ineffective in embedding synthetic drugs, since it is still trained on just natural products. Also, the tokenizer used is still uncustomized one.
## Testing Generated Embeddings' Clusters
The plot below shows how the model's embeddings (at this stage) cluster different classes of compounds, compared to using MACCS fingerprints.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/c8_5IWjPgbrGY0Z9-ZHop.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/EHEcaSnra4lldI0LY5tGq.png)
### Framework Versions
- Python: 3.9.13
- Sentence Transformers: 3.0.1
- Transformers: 4.41.2
- PyTorch: 2.3.1+cu121
- Accelerate: 0.31.0
- Datasets: 2.20.0
- Tokenizers: 0.19.1
## Contact
G Bayu ([email protected])