Well done! I got an error when running on multiple GPUs

#3
by nicolaschaillan - opened

Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

Any clue how to solve that? Thanks!

Analytics Club at ETH Zürich org

Hi - thanks for bringing this up! I haven't tested with multi-GPU, so it's likely there are some updates needed to the MPT modeling code. Could you paste the detailed error message, including the full Python traceback? We can go from there.

If I had to guess, some custom matrix multiplication in one or more of the MPT-specific modeling files is not properly handling torch math across two separate devices when using device_map (this would make sense, since the original model doesn't support accelerate at all and I only added the basics). It's hard to say without the details. You can also check out this section of the accelerate docs as a possible starting point for this rabbit hole 🐇
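
To illustrate the class of failure I mean, here is a minimal sketch (assuming a machine with at least two visible GPUs; the shapes are arbitrary):

import torch
import torch.nn.functional as F

# activations on one GPU, weights on another: the situation device_map can
# create when a model's custom code doesn't account for multiple devices
hidden = torch.randn(1, 4, 8, device="cuda:1")
weight = torch.randn(16, 8, device="cuda:0")

# F.linear(hidden, weight) would raise:
# RuntimeError: Expected all tensors to be on the same device, ...

# aligning the operands explicitly avoids the error
out = F.linear(hidden, weight.to(hidden.device))
print(out.shape, out.device)  # torch.Size([1, 4, 16]) cuda:1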

I see. I was using DataParallel, but I would love to use device_map instead. Is there any plan from you guys to support device_map natively? It would be much easier for most people to be able to spread the model across a few GPUs... many cloud offerings on Azure etc. provide multi-GPU instances nowadays.

Analytics Club at ETH Zürich org

Ah, sorry, let me clarify a bit: it should already support device_map with the changes I've made. I've been able to run inference with device_map and also train on a single GPU.

That said, I've only tested a single-GPU setup, and from the error you're getting, it seems adjustments need to be made to the model's custom code. This is because MosaicML did not implement device_map support in the original model and its custom code. I haven't seen these errors on a single GPU, so it will be an iterative process to understand what isn't working with multiple devices and debug it. For that, I'll need your full error logs, as I'm not using multi-GPU at the moment.
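
In the meantime, one thing worth checking on your multi-GPU box is how accelerate actually split the modules. A small sketch (hf_device_map is the placement dict transformers stores on the model when device_map is used):

import torch
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(
    'ethzanalytics/mpt-7b-storywriter-sharded',
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map='auto',
)

# shows which submodules ended up on which GPU, e.g. the embeddings on
# device 0 and the last transformer blocks on device 1
print(model.hf_device_map)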

I am on very thin ice here, as I really have no clue what I'm doing... but I would love for this code and these logs to help solve the multi-GPU issue.
Feel free to ignore this if the code doesn't make sense, though.
The execution below is from a multi-GPU environment with two Tesla P100-SXM2-16GB GPUs.

Code:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # expose both GPUs

import torch
import transformers

# load the sharded checkpoint, letting accelerate split it across the two GPUs
model = transformers.AutoModelForCausalLM.from_pretrained(
    'ethzanalytics/mpt-7b-storywriter-sharded',
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    load_in_8bit=False,
    device_map='auto',
)
model.eval()
# model.to("cuda:0") is not needed; device_map='auto' already places the weights
!nvidia-smi  # notebook magic: check GPU memory usage after loading

model_size = sum(t.numel() for t in model.parameters())
print(f"Modelsize: {model_size/1000**2:.1f}M parameters")

tokenizer = transformers.AutoTokenizer.from_pretrained('ethzanalytics/mpt-7b-storywriter-sharded')
txt = """
#What is the difference between an Alpaca and
"""
tokenized_example = tokenizer(txt, return_tensors='pt')
tokenized_example['input_ids']  # inspect the token ids (notebook cell output)

# move the input ids to GPU 0 and generate
outputs = model.generate(
    tokenized_example['input_ids'].to(0),
    max_new_tokens=150,
    do_sample=False,
    top_k=5,
    top_p=0.95,
)
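
Side note, unrelated to the error itself: instead of hard-coding .to(0) in the last line, a sketch that reads the device of the embedding layer from model.hf_device_map (the module name 'transformer.wte' is an assumption based on MPT's structure):

# send the prompt to whatever device hosts the token embeddings
first_device = model.hf_device_map.get('transformer.wte', 0)
outputs = model.generate(
    tokenized_example['input_ids'].to(first_device),
    max_new_tokens=150,
    do_sample=False,
)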

Logs from execution below:

/root/.cache/huggingface/modules/transformers_modules/ethzanalytics/mpt-7b-storywriter-sharded/26347e86eca753ce5dd9a50f21201976f2e8a9b9/attention.py:269: UserWarning: Using attn_impl: torch. If your model does not use alibi or prefix_lm we recommend using attn_impl: flash otherwise we recommend using attn_impl: triton.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1383: UserWarning: positional arguments and argument "destination" are deprecated. nn.Module.state_dict will not accept them in the future. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
Loading checkpoint shards: 100% 7/7 [00:13<00:00, 1.95s/it]
Modelsize: 6649.3M parameters

RuntimeError Traceback (most recent call last)
Input In [2], in <cell line: 23>()
21 tokenized_example = tokenizer(txt, return_tensors='pt')
22 tokenized_example['input_ids']
---> 23 outputs = model.generate(tokenized_example['input_ids'].to(0), max_new_tokens=150, do_sample=False, top_k=5, top_p=0.95)

File /usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
24 @functools.wraps(func)
25 def decorate_context(*args, **kwargs):
26 with self.clone():
---> 27 return func(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py:1515, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, **kwargs)
1509 raise ValueError(
1510 "num_return_sequences has to be 1 when doing greedy search, "
1511 f"but is {generation_config.num_return_sequences}."
1512 )
1514 # 11. run greedy search
-> 1515 return self.greedy_search(
1516 input_ids,
1517 logits_processor=logits_processor,
1518 stopping_criteria=stopping_criteria,
1519 pad_token_id=generation_config.pad_token_id,
1520 eos_token_id=generation_config.eos_token_id,
1521 output_scores=generation_config.output_scores,
1522 return_dict_in_generate=generation_config.return_dict_in_generate,
1523 synced_gpus=synced_gpus,
1524 streamer=streamer,
1525 **model_kwargs,
1526 )
1528 elif is_contrastive_search_gen_mode:
1529 if generation_config.num_return_sequences > 1:

File /usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py:2332, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
2329 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
2331 # forward pass to get next token
-> 2332 outputs = self(
2333 **model_inputs,
2334 return_dict=True,
2335 output_attentions=output_attentions,
2336 output_hidden_states=output_hidden_states,
2337 )
2339 if synced_gpus and this_peer_finished:
2340 continue # don't waste resources running the code we don't need

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1110, in Module._call_impl(self, *input, **kwargs)
1106 # If we don't have any hooks, we want to skip the rest of the logic in
1107 # this function, and just call forward.
1108 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1109 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110 return forward_call(*input, **kwargs)
1111 # Do not call functions when jit is used
1112 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)

File ~/.cache/huggingface/modules/transformers_modules/ethzanalytics/mpt-7b-storywriter-sharded/26347e86eca753ce5dd9a50f21201976f2e8a9b9/modeling_mpt.py:406, in MPTForCausalLM.forward(self, input_ids, past_key_values, attention_mask, prefix_mask, sequence_id, labels, return_dict, output_attentions, output_hidden_states, use_cache)
394 use_cache = use_cache if use_cache is not None else self.config.use_cache
395 outputs = self.transformer(
396 input_ids=input_ids,
397 past_key_values=past_key_values,
(...)
404 use_cache=use_cache,
405 )
--> 406 logits = F.linear(outputs.last_hidden_state, self.transformer.wte.weight)
407 if self.logit_scale is not None:
408 if self.logit_scale == 0:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper_mm)
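
If it helps, the traceback points at modeling_mpt.py line 406, where the logits are computed by re-using the embedding weight; with device_map='auto' the wte weight can sit on cuda:0 while the last hidden state comes off cuda:1. A possible workaround (just a sketch, I have not verified it against this repo) would be to move the weight onto the hidden state's device before the matmul:

# in MPTForCausalLM.forward, replacing the line flagged above (sketch only)
logits = F.linear(
    outputs.last_hidden_state,
    self.transformer.wte.weight.to(outputs.last_hidden_state.device),
)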
