How to load q6 with LlamaCpp?

#25 opened by YanaS

I see the Q6 quantization consists of 2 GGUF files. How do you load it with LlamaCpp? Are they somehow merged before giving the model path to llama.cpp?

You just load the first part (00001); llama.cpp, or any other library built on top of it, will load the rest automatically. https://github.com/ggerganov/llama.cpp/discussions/6404
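For example, with the llama-cpp-python bindings (a minimal sketch; the local filename below is illustrative and assumes both split files were downloaded into the current directory):

from llama_cpp import Llama

# Point model_path at the first split only; llama.cpp discovers the
# remaining *-of-00002.gguf split automatically when it sits next to it.
llm = Llama(
    model_path="./Meta-Llama-3-70B-Instruct.Q6_K-00001-of-00002.gguf",
    n_ctx=8192,       # context window, adjust as needed
    n_gpu_layers=-1,  # offload everything to GPU; set 0 for CPU-only
)

print(llm("Hi! How are you?", max_tokens=64)["choices"][0]["text"])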

When I try to load it I get this warning: UserWarning: huggingface_hub cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them, and then loading fails with:

  • llama_load_model_from_file: failed to load model
  • pydantic.v1.error_wrappers.ValidationError: 1 validation error for LlamaCpp

I also use LlamaCpp, which is the LangChain integration. Do I need to set any additional parameters?

Could you please share how you are using the model?

huggingface-cli download MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF --local-dir . --include '*Q6_K*gguf'

Once all the Q6 splits were downloaded, locate the directory and point to the first split:

./llama.cpp/main -m ./path_to_q6/Meta-Llama-3-70B-Instruct.Q6_K-00001-of-00002.gguf -r '<|eot_id|>' --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi! How are you?<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n" -n 1024

from huggingface_hub import hf_hub_download
from langchain_community.llms import LlamaCpp


def load_model(model_id, model_basename, args):
    # Download a single GGUF file from the Hub (only the file named in
    # model_basename is fetched, not the other splits).
    model_path = hf_hub_download(
        repo_id=model_id,
        filename=model_basename,
        resume_download=True,
        cache_dir=args.models_dir,
    )

    kwargs = {
        "model_path": model_path,
        "n_ctx": args.max_new_tokens,
        "max_tokens": args.max_new_tokens,
        "n_batch": args.n_batch,
        "rope_freq_scale": args.rope_freq_scale,
        "stop": ["<|eot_id|>"],
    }

    if args.device.lower() == "mps":
        kwargs["n_gpu_layers"] = 1
    if args.device.lower() == "cuda":
        kwargs["n_gpu_layers"] = args.num_gpu_layers  # set this based on your GPU

    print("KWARGS:", kwargs)
    return LlamaCpp(**kwargs)

where model_id and model_basename are set to:
MODEL_ID_LLAMA3: "MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF"
MODEL_BASENAME_LLAMA3: "Meta-Llama-3-70B-Instruct.Q6_K-00001-of-00002.gguf"

Sorry, I don't know which version of llama.cpp langchain_community is using. llama.cpp itself and all the other libraries on top of it work without any issue. You can follow the link I shared to merge the splits back into a single file if LangChain doesn't support split GGUF models.

I have figured it out. Instead of hf_hub_download, which only downloads a single file, I use snapshot_download with an allow pattern for the file format:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id=model_id,
    cache_dir=args.models_dir,
    allow_patterns=["*Q6_K*gguf"],
)
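Something like this then points the LangChain LlamaCpp wrapper at the first split from the snapshot directory (the repo id, glob pattern, and parameters below are illustrative, not part of the snippet above):

import glob
import os

from huggingface_hub import snapshot_download
from langchain_community.llms import LlamaCpp

# Download all Q6_K splits and get the local snapshot directory back.
snapshot_dir = snapshot_download(
    repo_id="MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF",
    allow_patterns=["*Q6_K*gguf"],
)

# Pass only the first split to llama.cpp; it picks up the second one itself.
first_split = sorted(glob.glob(os.path.join(snapshot_dir, "*Q6_K*-00001-of-*.gguf")))[0]

llm = LlamaCpp(
    model_path=first_split,
    n_ctx=8192,
    stop=["<|eot_id|>"],
)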

Thanks for sharing. I think the command from the README works as well: huggingface-cli download MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF --local-dir . --include '*Q2_K*gguf'
