---
library_name: transformers
license: mit
language:
- ja
- en
---

# stockmark/stockmark-100b

Stockmark-100b is a 100-billion-parameter LLM pretrained from scratch on a Japanese and English corpus of about 910B tokens. This model was developed by [Stockmark Inc.](https://stockmark.co.jp/)

This project is supported by the [GENIAC project](https://www.meti.go.jp/policy/mono_info_service/geniac/index.html).

- Instruction-tuned model: [stockmark-100b-instruct-v0.1](https://huggingface.co/stockmark/stockmark-100b-instruct-v0.1)

## How to use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stockmark/stockmark-100b")

# bfloat16 weights, sharded across the available GPUs
model = AutoModelForCausalLM.from_pretrained(
    "stockmark/stockmark-100b", device_map="auto", torch_dtype=torch.bfloat16
)

# Prompt: "人工知能とは、" ("Artificial intelligence is ...")
input_ids = tokenizer("人工知能とは、", return_tensors="pt").input_ids.to(model.device)

with torch.inference_mode():
    tokens = model.generate(
        input_ids,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.5,
        top_p=0.95,
    )

output = tokenizer.decode(tokens[0], skip_special_tokens=True)
print(output)
```

Loading in bfloat16 requires roughly 200 GB of GPU memory for the weights alone; a lower-memory loading sketch is given at the end of this card.

## Dataset (pretraining)

Stockmark-100b was pretrained on a total of about 910B tokens of Japanese and English text. The Japanese data is summarized in the table below.

| Corpus | Tokens after preprocessing |
|:---:|:---:|
| Stockmark Web Corpus (this dataset will not be released) | 8.8 billion |
| Patent | 37.5 billion |
| Wikipedia | 1.5 billion |
| mC4 | 52.6 billion |
| CommonCrawl (snapshots 2020-50 to 2024-10) | 203.7 billion |

The English data is sampled from [RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1).

## Environment

- GPU: 48 nodes of 8×H100 instances
- Library: [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)

## License

[MIT](https://opensource.org/licenses/MIT)

## Developed by

[Stockmark Inc.](https://stockmark.co.jp/)
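
## Lower-memory loading (unofficial sketch)

The bfloat16 example above needs roughly 200 GB of GPU memory for the weights alone. As an optional variant that is not part of the original card, the sketch below loads the weights with 4-bit quantization through `BitsAndBytesConfig`, assuming the `bitsandbytes` library is installed and supports this model's architecture; generation quality may differ from full-precision inference.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization of the weights; matmuls computed in bfloat16.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("stockmark/stockmark-100b")
model = AutoModelForCausalLM.from_pretrained(
    "stockmark/stockmark-100b",
    device_map="auto",                        # shard across the available GPUs
    quantization_config=quantization_config,
)

input_ids = tokenizer("人工知能とは、", return_tensors="pt").input_ids.to(model.device)
with torch.inference_mode():
    tokens = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.5, top_p=0.95)
print(tokenizer.decode(tokens[0], skip_special_tokens=True))
```

Even with 4-bit weights, a 100B-parameter model still occupies on the order of 50 GB of GPU memory, so multiple GPUs (or a single very large one) are still required.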