JeongwonChoi committed on
Commit 0325e40 • 1 Parent(s): 060ff4f

Update README.md

Files changed (1)
  1. README.md +58 -41
README.md CHANGED
@@ -1,84 +1,101 @@
  ---
  tags:
- - text-generation
  license: cc-by-nc-sa-4.0
  language:
- - ko
  base_model: mistralai/Mistral-7B-Instruct-v0.2
  pipeline_tag: text-generation
  ---

  # **DataVortexM-7B-Instruct-v0.1**
- <img src="https://imgur.com/lpXTyPe.png" alt="DataVortex" style="height: 8em;">

- ## **License**
-
- [cc-by-nc-sa-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)

  ## **Model Details**

  ### **Base Model**
- [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)

  ### **Trained On**
- H100 80GB 4ea

  ### **Instruction format**

  It follows **Alpaca** format.

  ## **Model Benchmark**

  ### **Ko-LLM-Leaderboard**

- - **Average**: 39.81
- - **Ko-ARC**: 34.13
- - **Ko-HellaSwag**: 42.35
- - **Ko-MMLU**: 38.73
- - **Ko-TruthfulQA**: 45.46
- - **Ko-CommonGen V2**: 38.37

- # **Implementation Code**

- Since, chat_template already contains insturction format above.
  You can use the code below.

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

- device = "cuda"

- model = AutoModelForCausalLM.from_pretrained("Edentns/DataVortexM-7B-Instruct-v0.1", device_map=device)
- tokenizer = AutoTokenizer.from_pretrained("Edentns/DataVortexM-7B-Instruct-v0.1")

  messages = [
- { "role": "user", "content": "λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ–΄λ””μ•Ό?" }
  ]

- encoded = tokenizer.apply_chat_template(
- messages,
- add_generation_prompt=True,
- return_tensors="pt",
- return_token_type_ids=False
- ).to(device)
-
- decoded = model.generate(
- input_ids=encoded,
- temperature=0.2,
- top_p=0.9,
- repetition_penalty=1.2,
- do_sample=True,
- max_length=4096,
- eos_token_id=tokenizer.eos_token_id,
- pad_token_id=tokenizer.eos_token_id
- )
- decoded = decoded[0][encoded.shape[1]:decoded[0].shape[-1]]
- decoded_text = tokenizer.decode(decoded, skip_special_tokens=True)
- print(decoded_text)
  ```

  <div align="center">
  <a href="https://edentns.com/">
- <img src="https://imgur.com/MVVlYqG.png" alt="Logo" style="height: 3em;">
  </a>
  </div>
  ---
  tags:
+ - text-generation
  license: cc-by-nc-sa-4.0
  language:
+ - ko
  base_model: mistralai/Mistral-7B-Instruct-v0.2
  pipeline_tag: text-generation
+ datasets:
+ - beomi/KoAlpaca-v1.1a
  ---

  # **DataVortexM-7B-Instruct-v0.1**

+ <img src="./DataVortex.png" alt="DataVortex" style="height: 8em;">

  ## **Model Details**

  ### **Base Model**
+
+ [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)

  ### **Trained On**
+
+ - **OS**: Ubuntu 20.04
+ - **GPU**: H100 80GB x4
+ - **transformers**: v4.36.2
+
+ ### **Dataset**
+
+ - [beomi/KoAlpaca-v1.1a](https://huggingface.co/datasets/beomi/KoAlpaca-v1.1a) - 21k rows

  ### **Instruction format**

  It follows **Alpaca** format.

+ E.g.
+
+ ```python
+ text = """\
+ ### System:
+ 당신은 μ‚¬λžŒλ“€μ΄ 정보λ₯Ό 찾을 수 μžˆλ„λ‘ λ„μ™€μ£ΌλŠ” 인곡지λŠ₯ λΉ„μ„œμž…λ‹ˆλ‹€.
+
+ ### User:
+ λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ–΄λ””μ•Ό?
+
+ ### Assistant:
+ λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ„œμšΈμž…λ‹ˆλ‹€.
+
+ ### User:
+ μ„œμšΈ μΈκ΅¬λŠ” 총 λͺ‡ λͺ…이야?
+ """
+ ```
+
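The new format example above is a fixed string; the same layout can be assembled from a list of chat messages. A minimal, self-contained sketch (the `build_prompt` helper and `ROLE_HEADERS` mapping are our illustration, not part of the model card or the transformers library):

```python
# Hypothetical helper showing how the "### Role:" blocks above fit together.
ROLE_HEADERS = {"system": "### System:", "user": "### User:", "assistant": "### Assistant:"}

def build_prompt(messages):
    """Join chat messages into the Alpaca-style '### Role:' block format."""
    blocks = [f"{ROLE_HEADERS[m['role']]}\n{m['content']}" for m in messages]
    return "\n\n".join(blocks) + "\n"

prompt = build_prompt([
    {"role": "user", "content": "λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ–΄λ””μ•Ό?"},
])
print(prompt)
```

In practice the tokenizer's built-in chat_template handles this formatting, as the implementation code below shows.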
  ## **Model Benchmark**

  ### **Ko-LLM-Leaderboard**

+ | Model | Average | Ko-ARC | Ko-HellaSwag | Ko-MMLU | Ko-TruthfulQA | Ko-CommonGen V2 |
+ | -------------------------------- | --------- | --------- | ------------ | --------- | ------------- | --------------- |
+ | **DataVortexM-7B-Instruct-v0.1** | **39.81** | **34.13** | **42.35** | **38.73** | **45.46** | **38.37** |

+ ## **Implementation Code**

+ The model's chat_template already encodes the instruction format above.
  You can use the code below.

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

+ device = "cuda"  # the device to load the model onto

+ model = AutoModelForCausalLM.from_pretrained("Edentns/DataVortexM-7B-Instruct-v0.1")
+ tokenizer = AutoTokenizer.from_pretrained("Edentns/DataVortexM-7B-Instruct-v0.1")

  messages = [
+ {"role": "system", "content": "당신은 μ‚¬λžŒλ“€μ΄ 정보λ₯Ό 찾을 수 μžˆλ„λ‘ λ„μ™€μ£ΌλŠ” 인곡지λŠ₯ λΉ„μ„œμž…λ‹ˆλ‹€."},
+ {"role": "user", "content": "λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ–΄λ””μ•Ό?"},
+ {"role": "assistant", "content": "λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ„œμšΈμž…λ‹ˆλ‹€."},
+ {"role": "user", "content": "μ„œμšΈ μΈκ΅¬λŠ” 총 λͺ‡ λͺ…이야?"}
  ]

+ encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
+
+ model_inputs = encodeds.to(device)
+ model.to(device)
+
+ generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
+ decoded = tokenizer.batch_decode(generated_ids)
+ print(decoded[0])
  ```
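Note that `batch_decode` in the new snippet returns the prompt together with the reply, whereas the removed snippet sliced the prompt tokens off before decoding. A minimal sketch of that slicing with dummy token ids (no model download; the ids are made up for illustration):

```python
# Dummy stand-ins: in the real snippet these come from apply_chat_template
# and model.generate; here we only illustrate the slicing arithmetic.
prompt_ids = [101, 7592, 102]                  # tokens fed into the model
generated_ids = [101, 7592, 102, 2054, 2003]   # prompt + newly generated tokens

# Keep only the tokens produced after the prompt -- the same pattern the
# removed snippet expressed as decoded[0][encoded.shape[1]:].
new_tokens = generated_ids[len(prompt_ids):]
print(new_tokens)  # the reply tokens, ready for tokenizer.decode(...)
```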

+ ## **License**
+
+ The model is licensed under the [cc-by-nc-sa-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license, which allows others to copy, modify, and share the work non-commercially, as long as they give appropriate credit and distribute any derivative works under the same license.
+
  <div align="center">
  <a href="https://edentns.com/">
+ <img src="./Logo.png" alt="Logo" style="height: 3em;">
  </a>
  </div>