How did you manage to produce gguf files, when llama.cpp/convert.py gives an error about the ROPE encoding?

#1
by BigDeeper

I've applied a pull request that has since been merged, so a recent build of llama.cpp should be able to convert the model using the convert and quantize scripts without errors. Still, it doesn't help much: the 128k version is not yet supported by llama.cpp, so the model converts and quantizes, but doesn't perform any differently than the Phi-3 model with 4k context length.
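
For reference, the conversion itself is just two steps. Roughly something like this (a sketch only; the script name, model folder and output filenames here are assumptions, and the exact names have changed between llama.cpp versions):

```
# convert the HF checkpoint to an f16 GGUF, then quantize it to Q6_K
python convert-hf-to-gguf.py ./Phi-3-mini-128k-instruct --outtype f16 --outfile phi-3-128k-f16.gguf
./quantize phi-3-128k-f16.gguf phi-3-128k-Q6_K.gguf Q6_K
```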

Interesting. I have managed to get the Q6_K version from PrunaAI to kind of work with Ollama by setting the context to 60000. That said, I just created two GGUFs myself, Q8_0 and Q6_K, and neither of them seems to work even after dropping the context size.
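
In case it helps, this is roughly how I set the context in Ollama, via a Modelfile (the file names and model tag here are just placeholders):

```
# point a Modelfile at the local GGUF and cap the context at 60000
cat > Modelfile <<'EOF'
FROM ./phi-3-128k-Q6_K.gguf
PARAMETER num_ctx 60000
EOF
ollama create phi3-128k-q6 -f Modelfile
ollama run phi3-128k-q6
```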

The other thing that seems very strange is that some flags are set in such a way that the model is replicated across multiple GPUs, and several times on each GPU. The Q6_K is only 3.1 GB, but Ollama fills 11.2 GiB on each of 4 GPUs.
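
One thing I may try (just a guess on my part, and it assumes the CUDA backend) is pinning the Ollama server to a single GPU, to see whether the replication across cards is what blows up the memory use:

```
# make only the first GPU visible to the Ollama server before loading the model
CUDA_VISIBLE_DEVICES=0 ollama serve
```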

This seems strange to me. My setup doesn't use a GPU at all, and the model (even f16) always fits into 16 GB of RAM (8 GB or slightly above most of the time). But then I have absolutely no clue how the model is distributed over multiple GPUs, or whether that is actually necessary for it to work.
