Lin-K76 committed
Commit 1ff28d3
Parent: 4ae61af

Update README.md

Files changed (1)
  1. README.md +26 -5
README.md CHANGED
@@ -26,7 +26,7 @@ base_model: meta-llama/Meta-Llama-3.1-405B-Instruct
 - **Model Developers:** Neural Magic
 
 Quantized version of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) with the updated 8 kv-heads.
-It achieves an average score of 86.39 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 86.57.
+It achieves an average score of 86.60 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 86.79.
 
 ### Model Optimizations
 
@@ -165,7 +165,7 @@ oneshot(
 
 The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
 Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
-This version of the lm-evaluation-harness includes versions of ARC-Challenge, GSM-8K, and MMLU that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).
+This version of the lm-evaluation-harness includes versions of ARC-Challenge, GSM-8K, MMLU, and MMLU-cot that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).
 
 ### Accuracy
 
@@ -191,6 +191,16 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
    <td>99.59%
    </td>
   </tr>
+  <tr>
+   <td>MMLU-cot (0-shot)
+   </td>
+   <td>88.11
+   </td>
+   <td>87.87
+   </td>
+   <td>99.73%
+   </td>
+  </tr>
   <tr>
   <td>ARC Challenge (0-shot)
   </td>
@@ -244,11 +254,11 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
  <tr>
   <td><strong>Average</strong>
   </td>
-  <td><strong>86.57</strong>
+  <td><strong>86.79</strong>
   </td>
-  <td><strong>86.39</strong>
+  <td><strong>86.60</strong>
   </td>
-  <td><strong>99.75%</strong>
+  <td><strong>99.74%</strong>
   </td>
  </tr>
 </table>
@@ -270,6 +280,17 @@ lm_eval \
   --batch_size auto
 ```
 
+#### MMLU-cot
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=8 \
+  --tasks mmlu_cot_0shot_llama_3.1_instruct \
+  --apply_chat_template \
+  --num_fewshot 0 \
+  --batch_size auto
+```
+
 #### ARC-Challenge
 ```
 lm_eval \
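
The recovery column in the updated table appears to be the quantized score expressed as a percentage of the unquantized score; for the newly added MMLU-cot row, 87.87 / 88.11 ≈ 99.73%.

For context, here is a minimal sketch (not part of this commit) of loading the FP8 checkpoint named in the evaluation commands with the [vLLM](https://docs.vllm.ai/en/stable/) engine that the README already references; the prompt, sampling settings, and generation length are illustrative assumptions.

```python
# Hedged sketch: offline generation with the FP8 checkpoint referenced in the eval commands.
from vllm import LLM, SamplingParams

llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8",  # checkpoint named above
    tensor_parallel_size=8,  # assumption: mirrors tensor_parallel_size=8 in the eval commands
    max_model_len=4096,      # assumption: mirrors max_model_len=4096 in the eval commands
)

sampling = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Summarize FP8 weight quantization in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```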