KingNish and loubnabnl (HF staff) committed on
Commit 96c937f
0 Parent(s):

Duplicate from HuggingFaceFW/ablation-model-fineweb-edu

Co-authored-by: Loubna Ben Allal <[email protected]>

.gitattributes ADDED
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
---
library_name: transformers
license: apache-2.0
language:
- en
datasets:
- HuggingFaceFW/fineweb-edu
---

# Model Card for HuggingFaceFW/ablation-model-fineweb-edu

## Model summary

This model is part of the 🍷 [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) ablations, detailed in this [technical report](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1).
The model has 1.82B parameters, a 2048-token context length, and uses the Llama architecture with RoPE. It was trained on 350B tokens from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), tokenized with the `gpt2` tokenizer.

- **Paper**: 🍷 FineWeb: decanting the web for the finest text data at scale https://hf.co/spaces/HuggingFaceFW/blogpost-fineweb-v1
- **License**: Apache 2.0
- **Languages**: English

## Use

### Intended use

This model was trained on English web data and is not instruction-tuned, making it intended for text completion in English.
It is important to note that the primary intended use case of this model is to compare its performance with that of other models trained under the same conditions. This model is not necessarily the best possible outcome achievable with the given dataset.

### Generation

```python
# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceFW/ablation-model-fineweb-edu"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("Machine Learning is", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
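Unless overridden, `generate` decodes greedily and stops at a small default `max_length` (20 tokens in recent `transformers` versions), so the completion above will be very short. A minimal variation of the same call with explicit, purely illustrative generation settings (not values recommended by the authors):

```python
# Continuing from the snippet above; the settings are illustrative, not tuned.
outputs = model.generate(
    inputs,
    max_new_tokens=100,  # generate up to 100 new tokens instead of the default cap
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```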

## Intermediate checkpoints (soon)

We are releasing intermediate checkpoints for this model every 1000 training steps, in separate branches. The naming convention is `step-001000-2BT`.

You can load a specific model revision with `transformers` using the argument `revision`:
```python
model = AutoModelForCausalLM.from_pretrained("HuggingFaceFW/ablation-model-fineweb-edu", revision="step-001000-2BT")
```
You can list all the revisions of the model with the following code:
```python
from huggingface_hub import list_repo_refs
out = list_repo_refs("HuggingFaceFW/ablation-model-fineweb-edu")
print([b.name for b in out.branches])
```
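If you want the checkpoint closest to a given training step rather than a hard-coded branch name, the step number can be parsed out of the branch names. A small sketch, assuming every checkpoint branch follows the `step-XXXXXX-YBT` pattern above (the helper below is ours, not part of the repo):

```python
import re
from huggingface_hub import list_repo_refs

def closest_checkpoint(repo_id: str, target_step: int) -> str:
    """Return the checkpoint branch whose step count is closest to target_step."""
    refs = list_repo_refs(repo_id)
    steps = {}
    for branch in refs.branches:
        match = re.match(r"step-(\d+)-", branch.name)
        if match:
            steps[int(match.group(1))] = branch.name
    if not steps:
        raise ValueError("no checkpoint branches found")
    best = min(steps, key=lambda s: abs(s - target_step))
    return steps[best]

# e.g. revision = closest_checkpoint("HuggingFaceFW/ablation-model-fineweb-edu", 50_000)
```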

## Training
### Model
- **Architecture**: Llama model
- **Pretraining steps**: 167k
- **Pretraining tokens**: 350B
- **Precision**: bfloat16

### Hardware
- **GPUs**: 64 H100
- **Training time**: 72 wall-clock hours
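As a back-of-the-envelope check (our arithmetic, not stated in the card): 350B tokens over 167k steps is roughly 2.1M tokens per optimizer step, which corresponds to a global batch of about 1024 sequences if sequences were packed to the full 2048-token context and the batch size was constant, and 72 hours on 64 GPUs works out to roughly 21k tokens per second per GPU.

```python
# Rough implied training throughput; assumes a constant global batch size
# and sequences packed to the full 2048-token context length.
tokens = 350e9
steps = 167_000
gpus, hours = 64, 72

tokens_per_step = tokens / steps                      # ~2.1e6 tokens per optimizer step
seqs_per_batch = tokens_per_step / 2048               # ~1024 sequences per global batch
tokens_per_gpu_sec = tokens / (gpus * hours * 3600)   # ~21k tokens/s per GPU
print(f"{tokens_per_step:.3g} tokens/step, {seqs_per_batch:.0f} seqs/batch, {tokens_per_gpu_sec:.3g} tokens/s/GPU")
```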

### Software
- [nanotron](https://github.com/huggingface/nanotron/) for training
- [datatrove](https://github.com/huggingface/datatrove) for tokenization
- [lighteval](https://github.com/huggingface/lighteval) for evaluation

## Evaluation
We used the same setup to evaluate all our ablation models with `lighteval`. To reproduce our numbers, make sure to follow the instructions [here](https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py#L12).
```bash
# download https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py and run:
accelerate launch --num_processes=1 lighteval/run_evals_accelerate.py --model_args="pretrained=HuggingFaceFW/ablation-model-fineweb-edu" \
--custom_tasks "lighteval_tasks.py" --output_dir [OUTPUTPATH] --max_samples 1000 \
--tasks "custom|hellaswag|0|1,custom|winogrande|0|1,custom|piqa|0|1,custom|siqa|0|1,custom|openbookqa|0|1,custom|arc:easy|0|1,custom|arc:challenge|0|1,custom|commonsense_qa|0|1,custom|mmlu:abstract_algebra|0|1,custom|mmlu:anatomy|0|1,custom|mmlu:astronomy|0|1,custom|mmlu:business_ethics|0|1,custom|mmlu:clinical_knowledge|0|1,custom|mmlu:college_biology|0|1,custom|mmlu:college_chemistry|0|1,custom|mmlu:college_computer_science|0|1,custom|mmlu:college_mathematics|0|1,custom|mmlu:college_medicine|0|1,custom|mmlu:college_physics|0|1,custom|mmlu:computer_security|0|1,custom|mmlu:conceptual_physics|0|1,custom|mmlu:econometrics|0|1,custom|mmlu:electrical_engineering|0|1,custom|mmlu:elementary_mathematics|0|1,custom|mmlu:formal_logic|0|1,custom|mmlu:global_facts|0|1,custom|mmlu:high_school_biology|0|1,custom|mmlu:high_school_chemistry|0|1,custom|mmlu:high_school_computer_science|0|1,custom|mmlu:high_school_european_history|0|1,custom|mmlu:high_school_geography|0|1,custom|mmlu:high_school_government_and_politics|0|1,custom|mmlu:high_school_macroeconomics|0|1,custom|mmlu:high_school_mathematics|0|1,custom|mmlu:high_school_microeconomics|0|1,custom|mmlu:high_school_physics|0|1,custom|mmlu:high_school_psychology|0|1,custom|mmlu:high_school_statistics|0|1,custom|mmlu:high_school_us_history|0|1,custom|mmlu:high_school_world_history|0|1,custom|mmlu:human_aging|0|1,custom|mmlu:human_sexuality|0|1,custom|mmlu:international_law|0|1,custom|mmlu:jurisprudence|0|1,custom|mmlu:logical_fallacies|0|1,custom|mmlu:machine_learning|0|1,custom|mmlu:management|0|1,custom|mmlu:marketing|0|1,custom|mmlu:medical_genetics|0|1,custom|mmlu:miscellaneous|0|1,custom|mmlu:moral_disputes|0|1,custom|mmlu:moral_scenarios|0|1,custom|mmlu:nutrition|0|1,custom|mmlu:philosophy|0|1,custom|mmlu:prehistory|0|1,custom|mmlu:professional_accounting|0|1,custom|mmlu:professional_law|0|1,custom|mmlu:professional_medicine|0|1,custom|mmlu:professional_psychology|0|1,custom|mmlu:public_relations|0|1,custom|mmlu:security_studies|0|1,custom|mmlu:sociology|0|1,custom|mmlu:us_foreign_policy|0|1,custom|mmlu:virology|0|1,custom|mmlu:world_religions|0|1"
```
In particular, the MMLU prompts are slightly different from those in `lm-evaluation-harness` and the Open LLM Leaderboard; see more in this [blog post](https://huggingface.co/blog/open-llm-leaderboard-mmlu#1001-flavors-of-mmlu). We use prompt templates that provide a better signal for small and non-instruction-tuned models.

## Limitations
This model was predominantly trained on English data, potentially limiting its performance in other languages. Furthermore, the model's behavior is influenced by the quality and diversity of its training data, which may include biases and harmful content.
config.json ADDED
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 2048,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 24,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.39.3",
  "use_cache": true,
  "vocab_size": 50272
}
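Note that `num_key_value_heads` equals `num_attention_heads`, i.e. standard multi-head attention rather than grouped-query attention, and the embeddings are tied. These fields can also be read programmatically instead of parsing the raw file; a minimal sketch using `transformers` (this fetches only `config.json`, not the weights):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("HuggingFaceFW/ablation-model-fineweb-edu")
print(config.model_type)               # "llama"
print(config.max_position_embeddings)  # 2048-token context window
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
```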
generation_config.json ADDED
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "transformers_version": "4.39.3"
}
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:1f7fded67785a48e5a758a7ff6f83c67112b90cb3aeb28b8703fd53f49417adb
size 3427365472
special_tokens_map.json ADDED
{
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "unk_token": "<|endoftext|>"
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
{
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "50256": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|endoftext|>",
  "model_max_length": 2048,
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>"
}
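Per this config, the tokenizer is the GPT-2 BPE tokenizer with `<|endoftext|>` serving as the BOS, EOS, and UNK token, and a `model_max_length` of 2048 matching the context window. A quick sanity check (our snippet, not part of the repo):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceFW/ablation-model-fineweb-edu")
print(type(tokenizer).__name__)    # GPT2TokenizerFast (the fast variant of GPT2Tokenizer)
print(tokenizer.model_max_length)  # 2048
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.unk_token)  # all "<|endoftext|>"
```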
vocab.json ADDED
The diff for this file is too large to render. See raw diff