Tomlim committed on
Commit cfd82da
1 Parent(s): fbc4457

Model Card

Files changed (1): README.md (+244, -0)
---
license: mit
language:
- af
- am
- ar
- az
- be
- bg
- bn
- ca
- ceb
- co
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fil
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- haw
- he
- hi
- hmn
- ht
- hu
- hy
- id
- ig
- is
- it
- iw
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lb
- lo
- lt
- lv
- mg
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- my
- ne
- nl
- 'no'
- ny
- pa
- pl
- ps
- pt
- ro
- ru
- sd
- si
- sk
- sl
- sm
- sn
- so
- sq
- sr
- st
- su
- sv
- sw
- ta
- te
- tg
- th
- tr
- uk
- und
- ur
- uz
- vi
- xh
- yi
- yo
- zh
- zu
datasets:
- mc4
---

# MyT5

## Model Details

MyT5 (**My**te **T5**) is a multilingual language model based on the T5 architecture.
The model uses a **m**orphologically-driven **byte** (**MYTE**) representation described in our paper [Limisiewicz et al., 2024](https://arxiv.org/pdf/2403.10691.pdf).

### Model Description

- **Developed by:** Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer
- **Funded by:** University of Washington Fellowship, Charles University Grant Agency
- **Model type:** T5
- **Language(s) (NLP):** Multilingual
- **License:** MIT

### Model Sizes

- **[Small](https://huggingface.co/Tomlim/myt5-small)**: 300M parameters
- **[Base](https://huggingface.co/Tomlim/myt5-base)**: 582M parameters
- **[Large](https://huggingface.co/Tomlim/myt5-large)**: 1.2B parameters

### Model Sources

- **[Repository](https://github.com/tomlimi/MYTE)**
- **[Paper](https://arxiv.org/pdf/2403.10691.pdf)**

## How to Get Started with the Model

The snippet below shows the basic usage of the model for multilingual language modeling.
The custom tokenizer is available in the [GitHub repository](https://github.com/tomlimi/MYTE), in `src/myt5/myt5_tokenizer.py`.
We also plan to release it on Hugging Face in the future.

```python
import torch
from transformers import T5ForConditionalGeneration

# Custom tokenizer from the MYTE repository (src/myt5/myt5_tokenizer.py).
from src.myt5.myt5_tokenizer import MyT5Tokenizer

MODEL_SIZE = "large"  # small, base, or large

model = T5ForConditionalGeneration.from_pretrained(
    f"Tomlim/myt5-{MODEL_SIZE}", use_safetensors=True
)
tokenizer = MyT5Tokenizer()

pre_texts = [
    '"We now have',
    '„Mamy teraz myszy w wieku',
    '"""எங்களிடம் இப்போது',
]
post_texts = [
    '4-month-old mice that are non-diabetic that used to be diabetic," he added.',
    '4 miesięcy, które miały cukrzycę, ale zostały z niej wyleczone” – dodał.',
    '4-மாத-வயதுடைய எலி ஒன்று உள்ளது, முன்னர் அதற்கு நீரிழிவு இருந்தது தற்போது இல்லை"" என்று அவர் மேலும் கூறினார்."',
]

# Encode the prefixes as encoder inputs and the continuations as targets.
inputs = tokenizer(pre_texts, padding="longest", return_tensors="pt")
targets = tokenizer(post_texts, padding="longest", return_tensors="pt")

# Teacher-forced forward pass; the logits give per-position token distributions.
outputs = model(**inputs, labels=targets.input_ids)
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
```
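
As a follow-up usage example, here is a minimal generation sketch (not part of the original card): it assumes the custom `MyT5Tokenizer` exposes the standard Hugging Face `batch_decode` interface; check `src/myt5/myt5_tokenizer.py` if it differs.

```python
# Continuing the snippet above: generate continuations for the prefixes and
# decode them back to text. `max_new_tokens=100` is an arbitrary choice.
generated_ids = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```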

## Training Details

### Training Data

The model was trained on the standard T5 task of restoring corrupted spans in the multilingual mC4 dataset.

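To make the objective concrete, here is a purely illustrative sketch of T5-style span corruption (the example text and sentinel names are ours, not from the card; the actual MyT5 pipeline applies this over MYTE byte sequences rather than words):

```python
# Illustrative only: in T5-style span corruption, spans of the original text
# are replaced by sentinel tokens in the input, and the target reconstructs
# the removed spans behind the same sentinels.
original = "The quick brown fox jumps over the lazy dog"
corrupted_input = "The quick <extra_id_0> fox jumps <extra_id_1> the lazy dog"
target = "<extra_id_0> brown <extra_id_1> over <extra_id_2>"
```
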
### Preprocessing

Instead of UTF-8 bytes, we use a morphologically-driven byte (MYTE) representation.
See the description in our [paper](https://arxiv.org/pdf/2403.10691.pdf) for more details, and the quick check sketched below.

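As a rough, illustrative way to see the effect of the representation (not from the original card; it assumes the custom tokenizer can be called on a single string and returns `input_ids`, as the snippet above suggests):

```python
# Compare raw UTF-8 byte length with the length of the MYTE-encoded sequence.
# The example sentence is arbitrary; morphologically rich or non-Latin text
# is where the difference is expected to be most visible.
from src.myt5.myt5_tokenizer import MyT5Tokenizer

tokenizer = MyT5Tokenizer()
text = "მრავალენოვანი მოდელი"  # "multilingual model" in Georgian
utf8_length = len(text.encode("utf-8"))
myte_length = len(tokenizer(text).input_ids)
print(f"UTF-8 bytes: {utf8_length}, MYTE token ids: {myte_length}")
```
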
### Training Hyperparameters

We used the same hyperparameters as in the original ByT5 paper.
The only difference is that we decreased the number of training steps to 250,000 to avoid overfitting.

### Computational Infrastructure

Models were trained on TPUs available through the TPU Research Cloud (TRC).
We used a v3-8 TPU to train the small and base models and a v3-32 TPU for the large model.
Training each instance took:

- **Small**: 90h
- **Base**: 230h
- **Large**: 190h

## Evaluation

MyT5 models are compared with a reimplementation of [ByT5](https://huggingface.co/docs/transformers/model_doc/byt5) models trained for 250,000 steps.

### Language Modeling

We evaluated language modeling performance on the multi-parallel [FLORES 200](https://arxiv.org/pdf/2207.04672v3.pdf) corpus.
To compare scores across languages and models, we used a normalized metric, Bits per English Byte (BPEB).
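
As a rough sketch of the normalization (our reading; the exact formulation is given in the paper), the negative log-likelihood of a target sentence is divided by the byte length of its parallel English sentence rather than by its own length, which keeps scores comparable across languages with very different byte counts:

$$\mathrm{BPEB}(x) = \frac{1}{\lvert x^{\mathrm{EN}} \rvert_{\text{bytes}}} \sum_{t} -\log_2 p_\theta\!\left(x_t \mid x_{<t}\right)$$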

#### Results

| Size  | Languages | ByT5 BPEB | ByT5 T (ms) | MyT5 BPEB | MyT5 T (ms) |
|-------|-----------|-----------|-------------|-----------|-------------|
| small | All       | 10.1      | 7.0         | 4.6       | 6.7         |
| small | Latin     | 4.6       | 5.9         | 4.2       | 6.6         |
| small | Non-Latin | 18.1      | 8.5         | 5.1       | 6.8         |
| base  | All       | 8.2       | 11.5        | 5.8       | 8.9         |
| base  | Latin     | 4.9       | 9.4         | 5.0       | 8.7         |
| base  | Non-Latin | 13.0      | 14.6        | 6.9       | 9.1         |
| large | All       | 13.4      | 31.8        | 4.6       | 26.7        |
| large | Latin     | 10.1      | 28.1        | 4.0       | 26.6        |
| large | Non-Latin | 18.2      | 37.3        | 5.4       | 27.0        |

Bits per English Byte (BPEB) and inference times T (average milliseconds per FLORES 200 sentence), averaged over all languages and over the Latin-script and non-Latin-script groupings. Inference was run on an A40 GPU core.

## Citation

```bibtex
@misc{limisiewicz2024myte,
      title={MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling},
      author={Tomasz Limisiewicz and Terra Blevins and Hila Gonen and Orevaoghene Ahia and Luke Zettlemoyer},
      year={2024},
      eprint={2403.10691},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Model Card Author

[Tomasz Limisiewicz](mailto:[email protected])