---
language:
- en
license: mit
tags:
- chemistry
- SMILES
- yield
datasets:
- ORD
metrics:
- r_squared
---

# Model Card for ReactionT5v2-yield

This is ReactionT5, a model pre-trained to predict the yields of chemical reactions from their SMILES. You can try the demo [here](https://huggingface.co/spaces/sagawa/ReactionT5_task_yield).

### Model Sources

- **Repository:** https://github.com/sagawatatsuya/ReactionT5v2
- **Paper:** https://arxiv.org/abs/2311.06708
- **Demo:** https://huggingface.co/spaces/sagawa/ReactionT5_task_yield

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5ForConditionalGeneration, AutoConfig, PreTrainedModel


class ReactionT5Yield(PreTrainedModel):
    """T5 encoder-decoder with a regression head that predicts reaction yield (in %)."""
    config_class = AutoConfig

    def __init__(self, config):
        super().__init__(config)
        self.config = config
        self.model = T5ForConditionalGeneration.from_pretrained(self.config._name_or_path)
        self.model.resize_token_embeddings(self.config.vocab_size)
        # Projections of the decoder's first-step state and the encoder's
        # first-token state are concatenated and fed through an MLP that
        # ends in a single scalar.
        self.fc1 = nn.Linear(self.config.hidden_size, self.config.hidden_size // 2)
        self.fc2 = nn.Linear(self.config.hidden_size, self.config.hidden_size // 2)
        self.fc3 = nn.Linear(self.config.hidden_size // 2 * 2, self.config.hidden_size)
        self.fc4 = nn.Linear(self.config.hidden_size, self.config.hidden_size)
        self.fc5 = nn.Linear(self.config.hidden_size, 1)

        self._init_weights(self.fc1)
        self._init_weights(self.fc2)
        self._init_weights(self.fc3)
        self._init_weights(self.fc4)
        self._init_weights(self.fc5)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=0.01)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=0.01)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)

    def forward(self, inputs):
        encoder_outputs = self.model.encoder(**inputs)
        encoder_hidden_states = encoder_outputs[0]
        # Run the decoder for a single step, starting from decoder_start_token_id,
        # on the same device as the inputs.
        decoder_input_ids = torch.full(
            (inputs['input_ids'].size(0), 1),
            self.config.decoder_start_token_id,
            dtype=torch.long,
            device=inputs['input_ids'].device,
        )
        outputs = self.model.decoder(input_ids=decoder_input_ids,
                                     encoder_hidden_states=encoder_hidden_states)
        last_hidden_states = outputs[0]
        output1 = self.fc1(last_hidden_states.view(-1, self.config.hidden_size))
        output2 = self.fc2(encoder_hidden_states[:, 0, :].view(-1, self.config.hidden_size))
        output = self.fc3(torch.hstack((output1, output2)))
        output = self.fc4(output)
        output = self.fc5(output)
        return output * 100  # scale the regression output to a percentage yield


model = ReactionT5Yield.from_pretrained('sagawa/ReactionT5v2-yield')
tokenizer = AutoTokenizer.from_pretrained('sagawa/ReactionT5v2-yield')
inp = tokenizer(['REACTANT:CC(C)n1ncnc1-c1cn2c(n1)-c1cnc(O)cc1OCC2.CCN(C(C)C)C(C)C.Cl.NC(=O)[C@@H]1C[C@H](F)CN1REAGENT: PRODUCT:O=C(NNC(=O)C(F)(F)F)C(F)(F)F'], return_tensors='pt')
print(model(inp))  # tensor([[19.1666]], grad_fn=<MulBackward0>)
```
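
Judging from the example above, the model input concatenates `REACTANT:`, `REAGENT:`, and `PRODUCT:` sections, with multiple SMILES within a section joined by `.`. Below is a minimal helper for building such inputs; the function name `to_model_input` and the exact handling of an empty section are illustrative assumptions, so check the GitHub repository for the canonical preprocessing:

```python
def to_model_input(reactants, reagents, product):
    """Build a ReactionT5 yield-prediction input string (hypothetical helper).

    Section order follows the example above; an empty REAGENT section is
    left as a single space, as in 'REAGENT: PRODUCT:' in the example.
    """
    reagent_part = '.'.join(reagents) if reagents else ' '
    return f"REACTANT:{'.'.join(reactants)}REAGENT:{reagent_part}PRODUCT:{product}"


text = to_model_input(
    reactants=['CC(C)n1ncnc1-c1cn2c(n1)-c1cnc(O)cc1OCC2', 'CCN(C(C)C)C(C)C', 'Cl',
               'NC(=O)[C@@H]1C[C@H](F)CN1'],
    reagents=[],
    product='O=C(NNC(=O)C(F)(F)F)C(F)(F)F',
)
batch = tokenizer([text], return_tensors='pt', padding=True)
with torch.no_grad():
    yields = model(batch)  # predicted yield in percent, one row per reaction
```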

## Training Details

### Training Procedure 

We used the [Open Reaction Database (ORD) dataset](https://drive.google.com/file/d/1fa2MyLdN1vcA7Rysk8kLQENE92YejS9B/view?usp=drive_link) for model training. In addition, we excluded the test split of the palladium-catalyzed Buchwald-Hartwig [C-N cross-coupling reactions dataset](https://yzhang.hpc.nyu.edu/T5Chem/index.html) from the training data to prevent data leakage; a sketch of such overlap filtering appears after the training command below.
The command used for training is the following. For more information about data preprocessing and training, please refer to the paper and the GitHub repository.

```bash
python train.py \
    --train_data_path='../data/preprocessed_ord_train.csv' \
    --valid_data_path='../data/preprocessed_ord_valid.csv' \
    --test_data_path='../data/preprocessed_ord_test.csv' \
    --CN_test_data_path='../data/C_N_yield/MFF_Test1/test.csv' \
    --epochs=100 \
    --batch_size=32 \
    --output_dir='./'
```
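
To prevent the leakage described above, any training reaction that also appears in the held-out C-N test split must be removed from the training set. The following is a minimal sketch of such overlap filtering, not the repository's actual preprocessing script; the column name `input` and the file paths are assumptions for illustration:

```python
import pandas as pd

# Column name 'input' and the file paths are assumptions; a more robust
# filter would compare RDKit-canonicalized SMILES rather than raw strings.
train = pd.read_csv('../data/preprocessed_ord_train.csv')
cn_test = pd.read_csv('../data/C_N_yield/MFF_Test1/test.csv')

# Drop every ORD training reaction whose input also occurs in the test split.
test_inputs = set(cn_test['input'])
mask = ~train['input'].isin(test_inputs)
print(f'removing {(~mask).sum()} overlapping reactions')
train[mask].to_csv('../data/preprocessed_ord_train_filtered.csv', index=False)
```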

### Results

R² scores on the Buchwald-Hartwig C-N cross-coupling yield benchmark (a random 70/30 split and the four held-out test splits):

| **R²**        | **DFT**       | **MFF**       | **Yield-BERT**| **T5Chem**    | **CompoundT5**| **ReactionT5** (without finetuning)    | **ReactionT5** |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| Random 70/30  | 0.92          | 0.927 ± 0.007 | 0.951 ± 0.005 | 0.970 ± 0.003 | 0.971 ± 0.002 | 0.831 ± 0.012 | 0.947 ± 0.003 |
| Test 1        | 0.80          | 0.851         | 0.838         | 0.811         | 0.855         | 0.846         | 0.872         |
| Test 2        | 0.77          | 0.713         | 0.836         | 0.907         | 0.852         | 0.869         | 0.917         |
| Test 3        | 0.64          | 0.635         | 0.738         | 0.789         | 0.712         | 0.779         | 0.811         |
| Test 4        | 0.54          | 0.184         | 0.538         | 0.627         | 0.547         | 0.843         | 0.830         |
| Avg. Tests 1–4| 0.69 ± 0.104  | 0.596 ± 0.251 | 0.738 ± 0.122 | 0.785 ± 0.094 | 0.741 ± 0.126 | 0.834 ± 0.034 | 0.857 ± 0.041 |
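
Since the reported metric is R², predictions can be scored in a few lines, reusing `model` and `tokenizer` from the getting-started example above. A sketch, assuming a test CSV with `input` and `YIELD` columns (both column names are assumptions):

```python
import pandas as pd
import torch
from sklearn.metrics import r2_score

test = pd.read_csv('../data/C_N_yield/MFF_Test1/test.csv')

model.eval()
preds = []
with torch.no_grad():
    for i in range(0, len(test), 32):  # simple fixed-size batching
        batch = tokenizer(test['input'].iloc[i:i + 32].tolist(),
                          return_tensors='pt', padding=True, truncation=True)
        preds.extend(model(batch).squeeze(-1).tolist())

print('R^2:', r2_score(test['YIELD'], preds))
```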


## Citation

arXiv: https://arxiv.org/abs/2311.06708
```
@misc{sagawa2023reactiont5,  
      title={ReactionT5: a large-scale pre-trained model towards application of limited reaction data}, 
      author={Tatsuya Sagawa and Ryosuke Kojima},  
      year={2023},  
      eprint={2311.06708},  
      archivePrefix={arXiv},  
      primaryClass={physics.chem-ph}  
}
```