R0k1e committed on
Commit
51fff37
1 Parent(s): e5a3ad0

Update README.md

Files changed (3)
  1. README.md +235 -0
  2. flow_diagram.png +0 -0
  3. infer.py +114 -0
README.md CHANGED
@@ -1,3 +1,238 @@
1
  ---
2
  license: mit
3
+ datasets:
4
+ - R0k1e/UltraLink
5
+ - stingning/ultrachat
6
+ - ise-uiuc/Magicoder-Evol-Instruct-110K
7
+ - ise-uiuc/Magicoder-OSS-Instruct-75K
8
+ language:
9
+ - eng
10
+ - fra
11
+ - rus
12
+ - spa
13
+ - zho
14
+ metrics:
15
+ - accuracy
16
  ---
17
+
18
+ <img src="flow_diagram.png" alt="UltraLink Flow Diagram" width="800" style="display:block; margin-left:auto; margin-right:auto"/>
19
+
20
+ # Model Card for UltraLink-LM
21
+
22
+ ## Model Summary
23
+ > UltraLink-LM is a multilingual generative language model that follows instructions in five languages: English, French, Russian, Spanish, and Chinese. It is trained on UltraLink combined with publicly available datasets, including ShareGPT, UltraChat, Magicoder-Evol-Instruct-110K, and Magicoder-OSS-Instruct-75K.
24
+ > UltraLink-LM outperforms [PolyLM-Chat-13b](https://huggingface.co/DAMO-NLP-MT/polylm-chat-13b), [Guanaco](https://huggingface.co/JosephusCheung/Guanaco), and [Bloomz-7b1-mt](https://huggingface.co/bigscience/bloomz-7b1-mt) in code, math, and chat abilities in four languages, and generates high-quality, diverse text in all five languages.
25
+ > The UltraLink-LM is trained using [UltraLink](https://huggingface.co/datasets/R0k1e/UltraLink), [UltraChat](https://huggingface.co/datasets/stingning/ultrachat), [Magicoder-Evol](https://huggingface.co/datasets/ise-uiuc/Magicoder-Evol-Instruct-110K), [Magicoder-OSS](https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K), and ShareGPT.
26
+ > We release the checkpoints under the MIT license to further our mission of multilingual technologies empowering a multilingual world.
27
+
28
+ - **Developed by:** [THUNLP](http://nlp.csai.tsinghua.edu.cn/)
29
+ - **Model type:** Transformer-style autoregressive multilingual language model
30
+ - **Paper**: [UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset](https://arxiv.org/abs/2402.04588)
31
+ - **Languages**: Refer to the list of languages in the `language` section of this model card.
32
+ - **License**: MIT
33
+ - **Model**: [UltraLink-LM](https://huggingface.co/R0k1e/UltraLink-LM)
34
+ - **Model Size**: 13 billion parameters
35
+ - **Datasets**: [UltraLink](https://huggingface.co/datasets/R0k1e/UltraLink), [UltraChat](https://huggingface.co/datasets/stingning/ultrachat), [Magicoder-Evol](https://huggingface.co/datasets/ise-uiuc/Magicoder-Evol-Instruct-110K), [Magicoder-OSS](https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K), and ShareGPT.
36
+
37
+ ## Use
38
+
39
+ ```python
40
+ from transformers import AutoModelForCausalLM, AutoTokenizer
41
+
42
+ checkpoint = "R0k1e/UltraLink-LM"
43
+
44
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
45
+ ultralink_lm = AutoModelForCausalLM.from_pretrained(checkpoint)
46
+
47
+ # Chat abilities in Chinese
48
+ # Please tell us about Tang Sancai (an ancient Chinese pottery type).
49
+ chat_inputs = tokenizer.encode("请介绍一下唐三彩。", return_tensors="pt")
50
+ chat_outputs = ultralink_lm.generate(chat_inputs, max_new_tokens=512)
51
+ print(tokenizer.decode(chat_outputs[0]))
52
+ # Expected output:
53
+ """
54
+ 唐三彩是中国古代的一种绘画艺术形式,起源于唐朝(618-907)。它以其精美的色彩和细致的绘画技巧而闻名。唐三彩是一种陶瓷,通常以圆形或扁平的形状制成,并刻有图案或图像。这些图案通常是以彩色的方式呈现,使用釉质陶瓷上的颜料。唐三彩的主题各异,包括动物、植物、人物和日常生活场景。它是中国艺朩的重要贡献,仍然是中国艺术的重要组成部分。
55
+ """
56
+ # Translations in English:
57
+ """
58
+ Tang Sancai is an ancient Chinese art form of painting that originated in the Tang Dynasty (618-907). It is known for its exquisite colours and detailed painting techniques. Tang Sancai is a type of ceramic, usually made in round or flat shapes and inscribed with designs or images. These designs are usually rendered in colour, using pigments on glazed ceramics. Tang sancai has a variety of themes, including animals, plants, figures and scenes from daily life. It was an important contribution to Chinese art kwun and remains an important part of Chinese art.
59
+ """
60
+
61
+ # Code abilities in Russian
62
+ # Please implement a bubble sort algorithm in Python.
63
+ code_inputs = tokenizer.encode("Реализуйте алгоритм пузырьковой сортировки на Python.", return_tensors="pt")
64
+ code_outputs = ultralink_lm.generate(code_inputs, max_new_tokens=512)
65
+ print(tokenizer.decode(code_outputs[0]))
66
+ # Expected output:
67
+ """
68
+ Вот простая реализация алгоритма пузырьковой сортировки на Python:
69
+
70
+ ```python
71
+ def bubbleSort(arr):
72
+     n = len(arr)
73
+
74
+     for i in range(n):
75
+         for j in range(0, n-i-1):
76
+             if arr[j] > arr[j+1]:
77
+                 arr[j], arr[j+1] = arr[j+1], arr[j]
78
+
79
+ arr = [64, 34, 25, 12, 22, 11, 90]
80
+ bubbleSort(arr)
81
+
82
+ print("Отсортированный массив:", arr)
83
+ \```
84
+
85
+ В этом коде функция `bubbleSort` сортирует массив `arr` с помощью алгоритма пузырьковой сортировки. Внутри функции происходит итерация по массиву, сравнение соседних элементов и обмен их местами, если они находятся в неправильном порядке. Этот процесс повторяется до тех пор, пока массив не будет отсортирован.
86
+
87
+ После вызова функции `bubbleSort` отсортированный массив выводится на экран.
88
+
89
+ Обратите внимание, что алгоритм пузырьковой сортировки не является эффективным для больших наборов данных, так как его средняя и худшая временная сложность составляют O(n^2), где n - количество элементов в массиве.
90
+ """
91
+ # Translations in English:
92
+ """
93
+ Here is a simple implementation of the bubble sort algorithm in Python:
94
+
95
+ ```python
96
+ def bubbleSort(arr):
97
+     n = len(arr)
97
+
98
+     for i in range(n):
99
+         for j in range(0, n-i-1):
100
+             if arr[j] > arr[j+1]:
101
+                 arr[j], arr[j+1] = arr[j+1], arr[j]
103
+
104
+ arr = [64, 34, 25, 12, 22, 11, 90]
105
+ bubbleSort(arr)
106
+
107
+ print("Sorted array:", arr)
108
+ \```
109
+
110
+ In this code, the `bubbleSort` function sorts the `arr` array using the bubble sort algorithm. The function iterates through the array, comparing neighbouring elements and swapping them if they are in the wrong order. This process is repeated until the array is sorted.
111
+
112
+ After calling the `bubbleSort` function, the sorted array is displayed on the screen.
113
+
114
+ Note that the bubble sort algorithm is not efficient for large datasets because its average and worst-case time complexity are O(n^2), where n is the number of elements in the array.
115
+ """
116
+
117
+ # Math abilities in French
118
+ # The length of a rectangle is twice its width. Calculate the area of the rectangle if its perimeter is 18 units.
119
+ math_inputs = tokenizer.encode("Lorsque la longueur d'un rectangle est le double de sa largeur, calculer l'aire du rectangle si son périmètre est de 18 unités?", return_tensors="pt")
120
+ math_outputs = ultralink_lm.generate(math_inputs, max_new_tokens=512)
121
+ print(tokenizer.decode(math_outputs[0]))
122
+ # Expected output:
123
+ """
124
+ La longueur du rectangle est deux fois sa largeur, donc la longueur est de 2w et la largeur est de w.
125
+
126
+ Le périmètre d'un rectangle est deux fois la somme de sa longueur et de sa largeur, donc l'équation est 2(2w + w) = 18.
127
+
128
+ En simplifiant l'équation, nous obtenons 2(3w) = 18, ce qui se simplifie en 6w = 18.
129
+
130
+ En divisant les deux côtés par 6, nous obtenons w = 3.
131
+
132
+ Par conséquent, la longueur du rectangle est de 2w = 2(3) = 6.
133
+
134
+ L'aire d'un rectangle est le produit de sa longueur et de sa largeur, donc l'aire est de 6 * 3 = 18.
135
+
136
+ La réponse est : 18
137
+ """
138
+ # Translations in English:
139
+ """
140
+ The length of the rectangle is twice its width, so the length is 2w and the width is w.
141
+
142
+ The perimeter of a rectangle is twice the sum of its length and width, so the equation is 2(2w + w) = 18.
143
+
144
+ Simplifying the equation, we get 2(3w) = 18, which simplifies to 6w = 18.
145
+
146
+ Dividing the two sides by 6 gives w = 3.
147
+
148
+ So the length of the rectangle is 2w = 2(3) = 6.
149
+
150
+ The area of a rectangle is the product of its length and width, so the area is 6 * 3 = 18.
151
+
152
+ The answer is: 18
153
+ """
154
+ ```
155
+
156
+ ## Model Details
157
+
158
+ ### Finetuning
159
+
160
+ - Architecture: Same as [Llama-2-13b](https://huggingface.co/meta-llama/Llama-2-13b-hf)
161
+ - Number of Samples seen during Finetuning: 1023K
162
+ - Batch size: 128
163
+ - Hardware: NVIDIA A100 80GB PCIe
164
+ - Software: BMTrain
165
+
166
+ ### Data Sources
167
+
168
+ The UltraLink-LM is trained on the following datasets:
169
+
170
+ - [UltraLink](https://huggingface.co/datasets/R0k1e/UltraLink)
171
+ - [UltraChat](https://huggingface.co/datasets/stingning/ultrachat)
172
+ - [Magicoder-Evol](https://huggingface.co/datasets/ise-uiuc/Magicoder-Evol-Instruct-110K)
173
+ - [Magicoder-OSS](https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K)
174
+ - ShareGPT
175
+
176
+ All the datasets are integrated into the UltraLink dataset.
177
+
178
+ ## Evaluation
179
+
180
+ ### Multilingual HumanEval
181
+
182
+ [HumanEval](https://github.com/openai/human-eval) is a well-known benchmark for evaluating the coding ability of LLMs. It executes the code snippets generated by the model and checks their correctness. Since there is no existing multilingual test set for code generation, we use GPT-3.5 with carefully designed prompts to translate HumanEval into other languages.
183
+
184
+ |Model|En|Zh|Es|Ru|Fr|Avg|
185
+ |-----|---|---|---|---|---|---|
186
+ |Bloomz-7b1-mt | 8.5 | 7.3 | 6.1 | 8.5 | 6.1 | 7.3 |
187
+ |Phoenix-inst-chat-7b | 11.0 | 10.4 | 8.5 | 1.2 | 13.4 | 12.2 |
188
+ |PolyLM-Multialpaca-13b | 8.5 | 7.3 | 6.1 | 6.1 | 6.1 | 6.8 |
189
+ |PolyLM-Chat-13b | 10.4 | 7.9 | 6.1 | 7.3 | 8.5 | 8.1 |
190
+ |Chimera-inst-chat-13b| 14.6 | 13.4 | 14.6 | 12.8 | 14.0 | 13.9 |
191
+ |Okapi-7b | 12.2 | 11.0 | 8.5 | 8.5 | 8.5 | 9.8 |
192
+ |Guanaco-7b | 9.2 | 6.7 | 11.0 | 9.8 | 12.8 | 9.9 |
193
+ |Guanaco-13b| 18.3 | 15.9 | 9.8 | 8.5 | 14.6 | 12.2 |
194
+ |UltraLink-LM | 60.4 | 43.9 | 40.9 | 49.4 | 39.6 | 46.8|
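The execution-based check described for HumanEval can be sketched roughly as follows. This is a minimal illustration, not the harness used for the table: the `add` problem, its tests, and the helper names `passes_tests` and `pass_at_1` are hypothetical, and it assumes the reported scores are pass@1-style percentages, which the card does not state. A real harness would also sandbox and time-limit the generated code.

```python
def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate code runs and every assertion in test_src holds."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # define the generated function(s)
        exec(test_src, namespace)       # run the assertions against them
    except Exception:
        return False
    return True


def pass_at_1(results: list[bool]) -> float:
    """Percentage of problems whose single generated sample passed."""
    return 100.0 * sum(results) / len(results)


# Hypothetical problem for illustration; not taken from HumanEval.
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

results = [passes_tests(good, tests), passes_tests(bad, tests)]
print(pass_at_1(results))  # 50.0
```

Only the natural-language prompt changes between languages in a translated benchmark of this kind; the executed tests stay the same, which is why a single checker like this can serve every column.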
195
+
196
+
197
+ ### MGSM
198
+
199
+ We employ [MGSM](https://github.com/google-research/url-nlp/tree/main/mgsm), a multilingual benchmark, to evaluate math reasoning ability. It compares the model's answers with the correct answers.
200
+ |Model|En|Zh|Es|Ru|Fr|Avg|
201
+ |-----|---|---|---|---|---|---|
202
+ |Bloomz-7b1-mt | 2.8 | 1.6 | 2.0 | 0.4 | 2.8 | 1.7 |
203
+ |Phoenix-inst-chat-7b | 3.2 | 3.2 | 2.8 | 3.2 | 3.2 | 3.1 |
204
+ |PolyLM-Multialpaca-13b | 1.2 | 2.8 | 1.6 | 2.8 | 2.4 | 2.4 |
205
+ |PolyLM-Chat-13b | 10.8 | 6.4 | 4.8 | 4.4 | 5.6 | 5.3 |
206
+ |Chimera-inst-chat-13b | 14.0 | 11.6 | 10.0 | 12.0 | 12.8 | 11.6 |
207
+ |Okapi-7b | 4.0 | 2.4 | 3.6 | 4.4 | 4.8 | 3.8 |
208
+ |Guanaco-7b | 4.0 | 1.6 | 3.2 | 2.8 | 4.4 | 3.0 |
209
+ |Guanaco-13b | 13.6 | 10.8 | 11.2 | 6.4 | 5.2 | 8.4 |
210
+ |UltraLink-LM| 70.4 | 56.0 | 70.4 | 64.8 | 63.6 | 63.7 |
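The card does not define how the Avg column is computed. As a quick sanity check, the sketch below recomputes it for the UltraLink-LM row of the MGSM table: the reported value matches the unweighted mean of the four non-English columns rather than the mean of all five. The other rows of the MGSM table follow the same pattern, but this is an observation from the numbers, not a definition given by the authors.

```python
from statistics import mean

# UltraLink-LM row of the MGSM table above.
row = {"En": 70.4, "Zh": 56.0, "Es": 70.4, "Ru": 64.8, "Fr": 63.6}
reported_avg = 63.7

# Candidate definitions of "Avg": without English, and over all five languages.
non_english = [score for lang, score in row.items() if lang != "En"]
avg_non_english = round(mean(non_english), 1)
avg_all = round(mean(row.values()), 1)

print(avg_non_english, avg_all)  # 63.7 65.0
```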
211
+
212
+ ### OMGEval
213
+ We use [OMGEval](https://github.com/blcuicall/OMGEval), a multilingual version of the widely used English benchmark AlpacaEval, to evaluate chat ability.
214
+
215
+ |Model|En|Zh|Es|Ru|Fr|Avg|
216
+ |-----|---|---|---|---|---|---|
217
+ |Bloomz-7b1-mt | 0.0 | 0.9 | 0.1 | 0.5 | 0.3 | 0.4 |
218
+ |Phoenix-inst-chat-7b | 6.9 | 13.3 | 7.4 | 2.9 | 8.1 | 7.7 |
219
+ |PolyLM-Multialpaca-13b | 3.4 | 5.0 | 2.1 | 5.1 | 2.2 | 3.6 |
220
+ |PolyLM-Chat-13b | 7.7 | 14.0 | 6.1 | 5.5 | 4.8 | 7.6 |
221
+ |Chimera-inst-chat-13b | 15.5 | 9.7 | 11.8 | 13.7 | 13.8 | 12.9 |
222
+ |Okapi-7b | 8.8 | 6.2 | 5.0 | 12.1 | 8.7 | 8.2 |
223
+ |Guanaco-7b | 4.6 | 3.8 | 0.4 | 1.8 | 1.2 | 2.4 |
224
+ |Guanaco-13b | 29.0 | 8.6 | 16.9 | 15.4 | 17.3 | 17.5 |
225
+ |UltraLink-LM | 28.8 | 21.9 | 23.5 | 37.6 | 29.0 | 28.2 |
226
+
227
+ ## Citation
228
+
229
+ ```bibtex
230
+ @misc{wang2024ultralink,
231
+ title={UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset},
232
+ author={Haoyu Wang and Shuo Wang and Yukun Yan and Xujia Wang and Zhiyu Yang and Yuzhuang Xu and Zhenghao Liu and Ning Ding and Xu Han and Zhiyuan Liu and Maosong Sun},
233
+ year={2024},
234
+ eprint={2402.04588},
235
+ archivePrefix={arXiv},
236
+ primaryClass={cs.CL}
237
+ }
238
+ ```
flow_diagram.png ADDED
infer.py ADDED
@@ -0,0 +1,114 @@
1
+ from transformers import AutoModelForCausalLM, AutoTokenizer
2
+
3
+ checkpoint = "R0k1e/UltraLink-LM"
4
+
5
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
6
+ ultralink_lm = AutoModelForCausalLM.from_pretrained(checkpoint)
7
+
8
+ # Chat abilities in Chinese
9
+ # Please tell us about Tang Sancai (an ancient Chinese pottery type).
10
+ chat_inputs = tokenizer.encode("请介绍一下唐三彩。", return_tensors="pt")
11
+ chat_outputs = ultralink_lm.generate(chat_inputs, max_new_tokens=512)
12
+ print(tokenizer.decode(chat_outputs[0]))
13
+ # Expected output:
14
+ """
15
+ 唐三彩是中国古代的一种绘画艺术形式,起源于唐朝(618-907)。它以其精美的色彩和细致的绘画技巧而闻名。唐三彩是一种陶瓷,通常以圆形或扁平的形状制成,并刻有图案或图像。这些图案通常是以彩色的方式呈现,使用釉质陶瓷上的颜料。唐三彩的主题各异,包括动物、植物、人物和日常生活场景。它是中国艺朩的重要贡献,仍然是中国艺术的重要组成部分。
16
+ """
17
+ # Translations in English:
18
+ """
19
+ Tang Sancai is an ancient Chinese art form of painting that originated in the Tang Dynasty (618-907). It is known for its exquisite colours and detailed painting techniques. Tang Sancai is a type of ceramic, usually made in round or flat shapes and inscribed with designs or images. These designs are usually rendered in colour, using pigments on glazed ceramics. Tang sancai has a variety of themes, including animals, plants, figures and scenes from daily life. It was an important contribution to Chinese art kwun and remains an important part of Chinese art.
20
+ """
21
+
22
+ # Code abilities in Russian
23
+ # Please implement a bubble sort algorithm in Python.
24
+ code_inputs = tokenizer.encode("Реализуйте алгоритм пузырьковой сортировки на Python.", return_tensors="pt")
25
+ code_outputs = ultralink_lm.generate(code_inputs, max_new_tokens=512)
26
+ print(tokenizer.decode(code_outputs[0]))
27
+ # Expected output:
28
+ """
29
+ Вот простая реализация алгоритма пузырьковой сортировки на Python:
30
+
31
+ ```python
32
+ def bubbleSort(arr):
33
+     n = len(arr)
33
+
34
+     for i in range(n):
35
+         for j in range(0, n-i-1):
36
+             if arr[j] > arr[j+1]:
37
+                 arr[j], arr[j+1] = arr[j+1], arr[j]
39
+
40
+ arr = [64, 34, 25, 12, 22, 11, 90]
41
+ bubbleSort(arr)
42
+
43
+ print("Отсортированный массив:", arr)
44
+ ```
45
+
46
+ В этом коде функция `bubbleSort` сортирует массив `arr` с помощью алгоритма пузырьковой сортировки. Внутри функции происходит итерация по массиву, сравнение соседних элементов и обмен их местами, если они находятся в неправильном порядке. Этот процесс повторяется до тех пор, пока массив не будет отсортирован.
47
+
48
+ После вызова функции `bubbleSort` отсортированный массив выводится на экран.
49
+
50
+ Обратите внимание, что алгоритм пузырьковой сортировки не является эффективным для больших наборов данных, так как его средняя и худшая временная сложность составляют O(n^2), где n - количество элементов в массиве.
51
+ """
52
+ # Translations in English:
53
+ """
54
+ Here is a simple implementation of the bubble sort algorithm in Python:
55
+
56
+ ```python
57
+ def bubbleSort(arr):
58
+     n = len(arr)
58
+
59
+     for i in range(n):
60
+         for j in range(0, n-i-1):
61
+             if arr[j] > arr[j+1]:
62
+                 arr[j], arr[j+1] = arr[j+1], arr[j]
64
+
65
+ arr = [64, 34, 25, 12, 22, 11, 90]
66
+ bubbleSort(arr)
67
+
68
+ print("Sorted array:", arr)
69
+ ```
70
+
71
+ In this code, the `bubbleSort` function sorts the `arr` array using the bubble sort algorithm. The function iterates through the array, comparing neighbouring elements and swapping them if they are in the wrong order. This process is repeated until the array is sorted.
72
+
73
+ After calling the `bubbleSort` function, the sorted array is displayed on the screen.
74
+
75
+ Note that the bubble sort algorithm is not efficient for large datasets because its average and worst-case time complexity are O(n^2), where n is the number of elements in the array.
76
+ """
77
+
78
+ # Math abilities in French
79
+ # The length of a rectangle is twice its width. Calculate the area of the rectangle if its perimeter is 18 units.
80
+ math_inputs = tokenizer.encode("Lorsque la longueur d'un rectangle est le double de sa largeur, calculer l'aire du rectangle si son périmètre est de 18 unités?", return_tensors="pt")
81
+ math_outputs = ultralink_lm.generate(math_inputs, max_new_tokens=512)
82
+ print(tokenizer.decode(math_outputs[0]))
83
+ # Expected output:
84
+ """
85
+ La longueur du rectangle est deux fois sa largeur, donc la longueur est de 2w et la largeur est de w.
86
+
87
+ Le périmètre d'un rectangle est deux fois la somme de sa longueur et de sa largeur, donc l'équation est 2(2w + w) = 18.
88
+
89
+ En simplifiant l'équation, nous obtenons 2(3w) = 18, ce qui se simplifie en 6w = 18.
90
+
91
+ En divisant les deux côtés par 6, nous obtenons w = 3.
92
+
93
+ Par conséquent, la longueur du rectangle est de 2w = 2(3) = 6.
94
+
95
+ L'aire d'un rectangle est le produit de sa longueur et de sa largeur, donc l'aire est de 6 * 3 = 18.
96
+
97
+ La réponse est : 18
98
+ """
99
+ # Translations in English:
100
+ """
101
+ The length of the rectangle is twice its width, so the length is 2w and the width is w.
102
+
103
+ The perimeter of a rectangle is twice the sum of its length and width, so the equation is 2(2w + w) = 18.
104
+
105
+ Simplifying the equation, we get 2(3w) = 18, which simplifies to 6w = 18.
106
+
107
+ Dividing the two sides by 6 gives w = 3.
108
+
109
+ So the length of the rectangle is 2w = 2(3) = 6.
110
+
111
+ The area of a rectangle is the product of its length and width, so the area is 6 * 3 = 18.
112
+
113
+ The answer is: 18
114
+ """
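The script repeats the same encode, generate, decode steps for each prompt and prints the prompt together with any special tokens. A small helper along the following lines could factor that out. It is a sketch rather than part of the commit: `generate_reply` is a hypothetical name, it assumes `model` and `tokenizer` are the objects loaded at the top of `infer.py`, and it sends the raw prompt exactly as the script does, since the card does not say whether a prompt template is expected.

```python
def generate_reply(model, tokenizer, prompt, max_new_tokens=512):
    """Generate a reply and return only the newly generated text."""
    # Assumption: `model` and `tokenizer` are the objects loaded in infer.py.
    # Move the input ids to wherever the model's weights live.
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # For a decoder-only model, generate() returns prompt + continuation,
    # so slice off the prompt tokens before decoding.
    new_token_ids = output_ids[0][input_ids.shape[-1]:]
    # Drop BOS/EOS and other special tokens from the printed text.
    return tokenizer.decode(new_token_ids, skip_special_tokens=True)
```

With the objects from the script, a call would look like `print(generate_reply(ultralink_lm, tokenizer, "请介绍一下唐三彩。"))`.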