slow inference speed

#25
by InformaticsSolutions - opened

I was wondering if you could comment on the speed of inference. I have a 16 GB RTX 3080, and with transformers 4.37.1 I'm getting 10.6 s for 1000 tokens in 4-bit and 1 min 6.6 s in 8-bit. Previously, with transformers 4.36, it was even slower (35 s and 2 min 40 s, respectively). Is there any way to speed up the inference process?
Here's my code:

from transformers import LlamaTokenizer, MistralForCausalLM, BitsAndBytesConfig
import torch
import bitsandbytes, flash_attn  # imported only to make sure both packages are installed

device = "cuda" # the device to load the model onto
model_name='teknium/OpenHermes-2.5-Mistral-7B'

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I’m quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I’m cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

tokenizer = LlamaTokenizer.from_pretrained(model_name, trust_remote_code=True)
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
model_inputs = encodeds.to(device)

free_in_GB = int(torch.cuda.mem_get_info()[0] / 1024**3)
max_memory = f'{free_in_GB - 2}GB'  # leave ~2 GB of headroom per GPU
n_gpus = torch.cuda.device_count()
max_memory = {i: max_memory for i in range(n_gpus)}

# 4-bit: device_map='auto' lets accelerate decide where each layer goes
model4b = MistralForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map='auto',
    max_memory=max_memory,
    attn_implementation="flash_attention_2",
)
generated_ids4b = model4b.generate(model_inputs, max_new_tokens=1000, do_sample=False, eos_token_id=tokenizer.eos_token_id)

# 8-bit: same setup, quantized to 8 bit instead
model8b = MistralForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    max_memory=max_memory,
    attn_implementation="flash_attention_2",
)

generated_ids8b = model8b.generate(model_inputs, max_new_tokens=1000, do_sample=False, eos_token_id=tokenizer.eos_token_id)

decoded4b = tokenizer.batch_decode(generated_ids4b, skip_special_tokens=True, clean_up_tokenization_spaces=True)
decoded8b = tokenizer.batch_decode(generated_ids8b, skip_special_tokens=True, clean_up_tokenization_spaces=True)
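For anyone trying to reproduce numbers like the ones above, here is a minimal timing sketch (not the exact code used for the figures in the question, and assuming the model4b, model_inputs and tokenizer defined above) that reports a tokens-per-second rate:

import time

torch.cuda.synchronize()
start = time.perf_counter()
out = model4b.generate(model_inputs, max_new_tokens=1000, do_sample=False, eos_token_id=tokenizer.eos_token_id)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - model_inputs.shape[-1]
print(f"{new_tokens} tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.1f} tok/s")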

I was under the assumption that device_map='auto' would force the whole model onto the GPU. In fact, only some layers were placed on the GPU while the rest stayed on the CPU, and I believe this offloading is the reason for the slowness. Loading the whole model onto the GPU (e.g. with model.to(device)) makes inference much faster. So this was not an issue with the model, but with the way I was loading it.
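For anyone hitting the same thing, here is a minimal sketch (assuming a single GPU with enough free VRAM for the quantized weights) of how to check where the layers actually ended up and how to pin everything onto one GPU at load time instead of relying on 'auto':

from transformers import MistralForCausalLM, BitsAndBytesConfig
import torch

model = MistralForCausalLM.from_pretrained(
    'teknium/OpenHermes-2.5-Mistral-7B',
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
    torch_dtype=torch.float16,
    device_map={'': 0},  # put every layer on GPU 0 instead of letting 'auto' split across CPU and GPU
    attn_implementation="flash_attention_2",
)

# Any 'cpu' or 'disk' entries here mean layers were offloaded, which makes generation very slow.
print(model.hf_device_map)

Note that .to(device) is not supported for bitsandbytes-quantized models, so for the 4-bit/8-bit variants pinning the device_map at load time is the safer route; moving the model with .to(device) works when loading in plain fp16.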

InformaticsSolutions changed discussion status to closed
