How to load q6 with LlamaCpp?

#25 opened by YanaS

I see the Q6 quantization consists of 2 GGUF files. How do you load it with LlamaCpp? Are they somehow merged before giving the model path to llama.cpp?

You just load the first part (00001); llama.cpp, or any other library built on top of it, will load the rest automatically. https://github.com/ggerganov/llama.cpp/discussions/6404
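For example, with the llama-cpp-python bindings (a minimal sketch; the local filename below is illustrative and assumes both split files were downloaded into the current directory):

from llama_cpp import Llama

# Point model_path at the first split only; llama.cpp discovers the
# remaining *-of-00002.gguf split automatically when it sits next to it.
llm = Llama(
    model_path="./Meta-Llama-3-70B-Instruct.Q6_K-00001-of-00002.gguf",
    n_ctx=8192,       # context window, adjust as needed
    n_gpu_layers=-1,  # offload everything to GPU; set 0 for CPU-only
)

print(llm("Hi! How are you?", max_tokens=64)["choices"][0]["text"])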

When I try to load it I get this warning: UserWarning: huggingface_hub cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them, and then loading fails with:

  • llama_load_model_from_file: failed to load model
  • pydantic.v1.error_wrappers.ValidationError: 1 validation error for LlamaCpp

I also use LlamaCpp, which is the LangChain integration. Do I need to set any additional parameters?

Could you please share how you are using the model?

huggingface-cli download MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF --local-dir . --include '*Q6_K*gguf'

Once all the Q6 splits were downloaded, locate the directory and point to the first split:

./llama.cpp/main -m ./path_to_q6/Meta-Llama-3-70B-Instruct.Q6_K-00001-of-00002.gguf -r '<|eot_id|>' --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi! How are you?<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n" -n 1024

from huggingface_hub import hf_hub_download
from langchain_community.llms import LlamaCpp


def load_model(model_id, model_basename, args):
    # Download a single GGUF file from the Hub (only the file named in
    # model_basename is fetched, not the other splits).
    model_path = hf_hub_download(
        repo_id=model_id,
        filename=model_basename,
        resume_download=True,
        cache_dir=args.models_dir,
    )

    kwargs = {
        "model_path": model_path,
        "n_ctx": args.max_new_tokens,
        "max_tokens": args.max_new_tokens,
        "n_batch": args.n_batch,
        "rope_freq_scale": args.rope_freq_scale,
        "stop": ["<|eot_id|>"],
    }

    if args.device.lower() == "mps":
        kwargs["n_gpu_layers"] = 1
    if args.device.lower() == "cuda":
        kwargs["n_gpu_layers"] = args.num_gpu_layers  # set this based on your GPU

    print("KWARGS:", kwargs)
    return LlamaCpp(**kwargs)

where model_id and model_basename are set to:
MODEL_ID_LLAMA3: "MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF"
MODEL_BASENAME_LLAMA3: "Meta-Llama-3-70B-Instruct.Q6_K-00001-of-00002.gguf"

Sorry, I don't know which version of llama.cpp langchain_community is using. llama.cpp itself and all the other libraries on top of it work without any issue. You can follow the link I shared to merge the splits back into a single file if LangChain doesn't support split GGUF models.

I have figured it out. Instead of hf_hub_download, which only downloads a single file, I use snapshot_download with an allow pattern for the file format:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id=model_id,
    cache_dir=args.models_dir,
    allow_patterns=["*Q6_K*gguf"],
)
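Something like this then points the LangChain LlamaCpp wrapper at the first split from the snapshot directory (the repo id, glob pattern, and parameters below are illustrative, not part of the snippet above):

import glob
import os

from huggingface_hub import snapshot_download
from langchain_community.llms import LlamaCpp

# Download all Q6_K splits and get the local snapshot directory back.
snapshot_dir = snapshot_download(
    repo_id="MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF",
    allow_patterns=["*Q6_K*gguf"],
)

# Pass only the first split to llama.cpp; it picks up the second one itself.
first_split = sorted(glob.glob(os.path.join(snapshot_dir, "*Q6_K*-00001-of-*.gguf")))[0]

llm = LlamaCpp(
    model_path=first_split,
    n_ctx=8192,
    stop=["<|eot_id|>"],
)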

Thanks for sharing. I think the command from the README works as well: huggingface-cli download MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF --local-dir . --include '*Q2_K*gguf'
