Lin-K76 committed on
Commit cb8ef88
1 Parent(s): 4ef9216

Update README.md

Files changed (1)
  1. README.md +19 -19
README.md CHANGED
@@ -24,8 +24,8 @@ language:
 - **License(s):** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
 - **Model Developers:** Neural Magic
 
-Quantized version of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct). It achieves an average recovery of 99.81% on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1) compared to the unquantized model.
-<!-- It achieves an average score of 77.75 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 78.67. -->
+Quantized version of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct).
+It achieves an average score of 86.41 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 86.63.
 
 ### Model Optimizations
 
@@ -162,7 +162,8 @@ oneshot(
 
 ## Evaluation
 
-The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command:
+The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command.
+A modified version of ARC-C and GSM8k-cot was used for evaluations, in line with Llama 3.1's prompting. It can be accessed on the [Neural Magic fork of the lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct).
 ```
 lm_eval \
   --model vllm \
@@ -170,7 +171,6 @@ lm_eval \
   --tasks openllm \
   --batch_size auto
 ```
-Certain benchmarks for the full precision model are still being acquired. Average recovery is calculated only with metrics that both models have been evaluated on.
 
 ### Accuracy
 
@@ -189,41 +189,41 @@ Certain benchmarks for the full precision model are still being acquired. Averag
 <tr>
  <td>MMLU (5-shot)
  </td>
- <td>*
+ <td>86.25
  </td>
  <td>86.06
  </td>
- <td>*
+ <td>99.78%
  </td>
 </tr>
 <tr>
- <td>ARC Challenge (25-shot)
+ <td>ARC Challenge (0-shot)
  </td>
- <td>*
+ <td>96.93
  </td>
- <td>*
+ <td>96.33
  </td>
- <td>*
+ <td>99.38%
  </td>
 </tr>
 <tr>
- <td>GSM-8K (5-shot, strict-match)
+ <td>GSM-8K-cot (8-shot, strict-match)
 </td>
- <td>95.07
+ <td>96.44
 </td>
- <td>94.39
+ <td>95.91
 </td>
- <td>99.28%
+ <td>99.45%
 </td>
 </tr>
 <tr>
 <td>Hellaswag (10-shot)
 </td>
- <td>*
+ <td>88.33
 </td>
 <td>88.25
 </td>
- <td>*
+ <td>99.91%
 </td>
 </tr>
 <tr>
@@ -249,11 +249,11 @@ Certain benchmarks for the full precision model are still being acquired. Averag
 <tr>
 <td><strong>Average</strong>
 </td>
- <td><strong>*</strong>
+ <td><strong>86.63</strong>
 </td>
- <td><strong>*</strong>
+ <td><strong>86.41</strong>
 </td>
- <td><strong>99.81%</strong>
+ <td><strong>99.74%</strong>
 </td>
 </tr>
 </table>
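
For reference, the recovery column filled in by this commit follows directly from the two score columns. A minimal sketch of the arithmetic, assuming recovery is simply the quantized score divided by the unquantized baseline (published figures may round slightly differently, e.g. the 99.74% average):

```python
# Reproduce the "Recovery" column from the scores added in the diff above.
# Assumption: recovery = 100 * quantized_score / baseline_score.
scores = {
    # task: (Meta-Llama-3.1-405B-Instruct baseline, quantized model)
    "MMLU (5-shot)": (86.25, 86.06),
    "ARC Challenge (0-shot)": (96.93, 96.33),
    "GSM-8K-cot (8-shot, strict-match)": (96.44, 95.91),
    "Hellaswag (10-shot)": (88.33, 88.25),
}

for task, (baseline, quantized) in scores.items():
    print(f"{task}: {100 * quantized / baseline:.2f}%")
# -> 99.78%, 99.38%, 99.45%, 99.91%, matching the table
```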
 
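The lm_eval command in the diff appears without its `--model_args` line, which falls outside the hunk context. As a hedged sketch only, the same evaluation can be expressed through the harness's Python API; the checkpoint path and tensor-parallel setting below are illustrative placeholders, not values taken from the README:

```python
# Sketch only: the actual model_args are elided by the diff's hunk
# boundaries, so the values below are illustrative placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=<quantized-model-repo>,tensor_parallel_size=8",
    tasks=["openllm"],      # OpenLLM leaderboard v1 task group
    batch_size="auto",
)
print(results["results"])
```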