OOM error when running the model on a single A100 GPU

#1
by nb2904 - opened

Hello,

First of all, thank you so much for your great contribution to the open-source LLM community; I really appreciate all the effort you have put into this model. I have a couple of questions about running it on GPUs.

  1. I tried to run the model on an NVIDIA A100 40GB GPU, but I got an out-of-memory (OOM) error.
  2. I then upgraded my setup to two L4 GPUs, but I still hit the same error.

Can you recommend the minimum hardware requirements for running the model?

Note: I was able to run the SOLAR-10.7B model from the official Upstage repo on a single NVIDIA A100 40GB GPU.
My setup: I'm running the models on GCP instances.

Error:
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 5/5 [00:06<00:00, 1.25s/it]
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 5/5 [00:06<00:00, 1.34s/it]
Traceback (most recent call last):
  File "/Projects/yanolja_solar_test/test.py", line 42, in <module>
    mp.spawn(train,
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 163, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/Projects/yanolja_solar_test/test.py", line 23, in train
    model = AutoModelForCausalLM.from_pretrained(model_name).to(rank)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2556, in to
    return super().to(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1160, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1158, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 1 has a total capacty of 21.99 GiB of which 56.69 MiB is free. Process 10524 has 21.93 GiB memory in use. Of the allocated memory 21.75 GiB is allocated by PyTorch, and 1.17 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Code for running the model on the 2x L4 GPUs:

import torch
import torch.distributed as dist
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch.multiprocessing as mp
import os

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size, model_name):
    setup(rank, world_size)

    # Bind this process to its own GPU
    torch.cuda.set_device(rank)

    # Load tokenizer and model (a full copy of the model per process)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(rank)

    # Wrap model for distributed data parallelism
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])

    # Example inference
    prompt_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\nHuman: {prompt}\nAssistant:\n"
    # "What is the capital of Korea? Please choose from the options below.
    #  (A) Gyeongseong (B) Busan (C) Pyongyang (D) Seoul (E) Jeonju"
    text = 'ν•œκ΅­μ˜ μˆ˜λ„λŠ” μ–΄λ””μΈκ°€μš”? μ•„λž˜ 선택지 쀑 κ³¨λΌμ£Όμ„Έμš”.\n\n(A) κ²½μ„±\n(B) λΆ€μ‚°\n(C) 평양\n(D) μ„œμšΈ\n(E) μ „μ£Ό'
    model_inputs = tokenizer(prompt_template.format(prompt=text), return_tensors='pt').to(rank)

    outputs = model.module.generate(**model_inputs, max_new_tokens=256)
    output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    print(output_text)

    cleanup()

if __name__ == "__main__":
    model_name = "yanolja/EEVE-Korean-Instruct-10.8B-v1.0"
    world_size = 2  # Number of GPUs you want to use
    mp.spawn(train,
             args=(world_size, model_name),
             nprocs=world_size,
             join=True)
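
[Editor's note] A likely culprit in the script above: AutoModelForCausalLM.from_pretrained loads weights in float32 by default, so a 10.8B-parameter model needs roughly 43 GB per process, and DistributedDataParallel keeps a full replica on every GPU, so adding a second L4 does not reduce the per-GPU footprint. For plain inference, a minimal single-process sketch that loads the weights in half precision instead (assuming accelerate is installed so device_map="auto" can place layers across the visible GPUs; the prompt is illustrative):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "yanolja/EEVE-Korean-Instruct-10.8B-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # ~2 bytes per parameter: roughly 22 GB instead of ~43 GB
    device_map="auto",          # requires accelerate; shards layers across available GPUs
)

prompt = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\nHuman: What is the capital of Korea?\nAssistant:\n"
model_inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**model_inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

At ~22 GB of fp16 weights the model fits comfortably on a single 40 GB A100, and with device_map="auto" it can also be split across the two 24 GB L4s.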

Yanolja org

Hi Ismoilov,

Thank you for your kind words. I'm pleased to hear about your interest in the model. It should be possible to run the model on a single L4 GPU; my personal preference is a single RTX 3090. It's possible that your machine has preloaded data on the GPU that hasn't been released, which could be causing the issue; I recommend checking this by running nvidia-smi in your terminal. Alternatively, you might consider running this model with vLLM by specifying its Hugging Face model name on the command line (a minimal sketch follows below).

Thanks,
Seungduk
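
[Editor's note] For the vLLM route Seungduk suggests, a minimal offline-inference sketch, assuming vLLM is installed (pip install vllm) and reusing the prompt template from the original script; dtype="half" keeps the weights around 22 GB:

from vllm import LLM, SamplingParams

prompt_template = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's "
    "questions.\nHuman: {prompt}\nAssistant:\n"
)

llm = LLM(model="yanolja/EEVE-Korean-Instruct-10.8B-v1.0", dtype="half")
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

outputs = llm.generate(
    [prompt_template.format(prompt="What is the capital of Korea?")],
    sampling_params,
)
print(outputs[0].outputs[0].text)

The same model can also be served over HTTP with vLLM's OpenAI-compatible server, e.g. python -m vllm.entrypoints.openai.api_server --model yanolja/EEVE-Korean-Instruct-10.8B-v1.0.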

