---
library_name: transformers
license: mit
language:
- ja
- en
---

# stockmark/stockmark-100b

Stockmark-100b is a 100-billion-parameter LLM pretrained from scratch on a Japanese and English corpus of about 910B tokens. This model was developed by [Stockmark Inc.](https://stockmark.co.jp/)

This project is supported by the [GENIAC project](https://www.meti.go.jp/policy/mono_info_service/geniac/index.html).

- Instruction-tuned model: [stockmark-100b-instruct-v0.1](https://huggingface.co/stockmark/stockmark-100b-instruct-v0.1)

## How to use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stockmark/stockmark-100b")

# bfloat16 weights, sharded across the available GPUs
model = AutoModelForCausalLM.from_pretrained(
    "stockmark/stockmark-100b", device_map="auto", torch_dtype=torch.bfloat16
)

# Prompt: "人工知能とは、" ("Artificial intelligence is ...")
input_ids = tokenizer("人工知能とは、", return_tensors="pt").input_ids.to(model.device)

with torch.inference_mode():
    tokens = model.generate(
        input_ids,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.5,
        top_p=0.95,
    )

output = tokenizer.decode(tokens[0], skip_special_tokens=True)
print(output)
```

Loading in bfloat16 requires roughly 200 GB of GPU memory for the weights alone; a lower-memory loading sketch is given at the end of this card.

## Dataset (pretraining)

Stockmark-100b was pretrained on a total of about 910B tokens of Japanese and English text. The Japanese data is summarized in the table below.

| Corpus | Tokens after preprocessing |
|:---:|:---:|
| Stockmark Web Corpus (this dataset will not be released) | 8.8 billion |
| Patent | 37.5 billion |
| Wikipedia | 1.5 billion |
| mC4 | 52.6 billion |
| CommonCrawl (snapshots 2020-50 to 2024-10) | 203.7 billion |

The English data is sampled from [RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1).

## Environment

- GPU: 48 nodes of 8×H100 instances
- Library: [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)

## License

[MIT](https://opensource.org/licenses/MIT)

## Developed by

[Stockmark Inc.](https://stockmark.co.jp/)
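
## Lower-memory loading (unofficial sketch)

The bfloat16 example above needs roughly 200 GB of GPU memory for the weights alone. As an optional variant that is not part of the original card, the sketch below loads the weights with 4-bit quantization through `BitsAndBytesConfig`, assuming the `bitsandbytes` library is installed and supports this model's architecture; generation quality may differ from full-precision inference.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization of the weights; matmuls computed in bfloat16.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("stockmark/stockmark-100b")
model = AutoModelForCausalLM.from_pretrained(
    "stockmark/stockmark-100b",
    device_map="auto",                        # shard across the available GPUs
    quantization_config=quantization_config,
)

input_ids = tokenizer("人工知能とは、", return_tensors="pt").input_ids.to(model.device)
with torch.inference_mode():
    tokens = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.5, top_p=0.95)
print(tokenizer.decode(tokens[0], skip_special_tokens=True))
```

Even with 4-bit weights, a 100B-parameter model still occupies on the order of 50 GB of GPU memory, so multiple GPUs (or a single very large one) are still required.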