Add max model length of 128k to config.json for conversion to GGUF

#24

I have been adding support for Command-R to llama.cpp.
We need to store the actual 128k context length inside the GGUF file read by llama.cpp. Then the user can use -c when running llama.cpp inference to set the desired context length based on available memory.
Currently we hardcode the 128k for this model in the conversion script (see discussion: https://github.com/ggerganov/llama.cpp/pull/6033#discussion_r1525370279).
Can we instead add this as a parameter in config.json?
Note: I haven't tested whether this would cause problems with the reference Python implementation.
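
For illustration, here is a minimal sketch (my own, not the actual conversion script) of the consumer side: read the context length from config.json if present, otherwise fall back to the hardcoded 128k (131072 tokens). The key name "model_max_length" and the helper function are assumptions, not settled names.

```python
import json

def read_context_length(config_path: str, fallback: int = 131072) -> int:
    """Hypothetical helper for a GGUF converter.

    Prefer an explicit entry in config.json ("model_max_length" here is
    only the proposed key name); otherwise fall back to the hardcoded
    128k (131072 tokens) currently baked into the conversion script.
    """
    with open(config_path) as f:
        config = json.load(f)
    return int(config.get("model_max_length", fallback))
```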

Cohere For AI org

Hi, the PyTorch reference is based on the Llama implementation, which materializes this huge buffer; that is not feasible for a 128k context. Setting this context length as the default in config.json would cause problems:
```python
causal_mask = torch.full(
    (config.max_position_embeddings, config.max_position_embeddings),
    fill_value=True,
    dtype=torch.bool,
)
```
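
To make the scale concrete, a quick back-of-the-envelope check (my own numbers, assuming 128k = 131072 tokens and one byte per torch.bool element):

```python
# Dense boolean causal mask at 128k context:
# 131072 x 131072 entries, 1 byte each for torch.bool.
n = 131072
mask_bytes = n * n
print(f"{mask_bytes / 2**30:.0f} GiB")  # 16 GiB for this single buffer
```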

ahmetustun changed pull request status to closed

Hi @ahmetustun, I completely agree we don't want to affect the PyTorch reference.
I just checked the configuration parameters for the Python transformers Llama implementation; they never use "model_max_length":
https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/configuration_llama.py#L31

My suggestion is simply to put the true context length somewhere in config.json (we can pick a different name if you'd like).

The goal is to allow third-party tools (like the GGUF converter) to correctly detect the true context length.
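
As a sketch of the producer side (purely illustrative; the final key name is up to you), the extra field could be added to config.json like this, leaving max_position_embeddings untouched for the PyTorch reference:

```python
import json

# Hypothetical one-off edit to config.json: record the true 128k context
# under a dedicated key ("model_max_length" is only the name floated above)
# without touching max_position_embeddings, which the PyTorch reference uses.
with open("config.json") as f:
    cfg = json.load(f)

cfg["model_max_length"] = 131072  # 128k tokens

with open("config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```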

ahmetustun changed pull request status to open
Cohere For AI org

Hi @andrewcanis, I misread the config params you are suggesting. I'll run a test and add the parameter to the config file if all is good!

saurabhdash changed pull request status to merged
