omitakahiro committed
Commit 5d4af2f
1 parent: 506a6ed

Update README.md

Files changed (1): README.md (+27 −2)
README.md CHANGED
@@ -21,7 +21,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("stockmark/stockmark-100b")
 model = AutoModelForCausalLM.from_pretrained("stockmark/stockmark-100b", device_map="auto", torch_dtype=torch.bfloat16)
 
-inputs = tokenizer("人工知能とは、", return_tensors="pt").input_ids.to(model.device)
+input_ids = tokenizer("人工知能とは、", return_tensors="pt").input_ids.to(model.device)
 with torch.inference_mode():
     tokens = model.generate(
         input_ids,
@@ -33,4 +33,29 @@ with torch.inference_mode():
 
 output = tokenizer.decode(tokens[0], skip_special_tokens=True)
 print(output)
-```
+```
+
+## Dataset (pretraining)
+
+Stockmark-100b was trained on a total of about 910B tokens of Japanese and English text. The Japanese data is summarized in the table below.
+
+| corpus | tokens after preprocessing |
+|:---:|:---:|
+| Stockmark Web Corpus (this dataset will not be released) | 8.8 billion |
+| Patent | 37.5 billion |
+| Wikipedia | 1.5 billion |
+| mC4 | 52.6 billion |
+| CommonCrawl (snapshots 2020-50 ~ 2024-10) | 203.7 billion |
+
+English data is sampled from [RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1).
+
+## Environment
+- GPU: 48 nodes of 8×H100 instances
+- Library: [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
+
+## License
+[MIT](https://opensource.org/licenses/MIT)
+
+## Developed by
+[Stockmark Inc.](https://stockmark.co.jp/)
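The rename in the first hunk matters because the README snippet later calls `model.generate(input_ids, ...)`: before this commit the tokenized tensor was bound to the name `inputs`, so `input_ids` was undefined at the call site. A minimal sketch of that failure mode with toy stand-ins (not the repo's code; no transformers required):

```python
def generate(input_ids):
    # stand-in for model.generate
    return input_ids

def before_fix():
    inputs = [101, 2054]        # stand-in for tokenizer(...).input_ids
    return generate(input_ids)  # NameError: name 'input_ids' is not defined

def after_fix():
    input_ids = [101, 2054]     # same tensor, bound to the name generate() uses
    return generate(input_ids)

try:
    before_fix()
except NameError as e:
    print("before fix:", e)

print("after fix:", after_fix())
```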
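The Japanese corpora in the table above sum to roughly 304B tokens, implying that the rest of the ~910B-token budget comes from the English (RedPajama) side. A quick arithmetic check, with the figures copied from the table (the exact Japanese/English split is not stated in the diff):

```python
# Figures copied from the Dataset table, in billions of tokens.
japanese_corpora_billion_tokens = {
    "Stockmark Web Corpus": 8.8,
    "Patent": 37.5,
    "Wikipedia": 1.5,
    "mC4": 52.6,
    "CommonCrawl (2020-50 ~ 2024-10)": 203.7,
}

japanese_total = sum(japanese_corpora_billion_tokens.values())
print(f"Japanese subtotal: {japanese_total:.1f}B tokens")
print(f"Remainder of the ~910B budget (English etc.): {910 - japanese_total:.1f}B")
```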
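The Environment bullet implies a total cluster size of 48 × 8 = 384 H100 GPUs; a trivial check:

```python
# Cluster size implied by "48 nodes of 8*H100 instances".
nodes = 48
gpus_per_node = 8
total_gpus = nodes * gpus_per_node
print(total_gpus)  # 384 GPUs in total
```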