How did you manage to produce gguf files, when llama.cpp/convert.py gives an error about the ROPE encoding?

#1
by BigDeeper

I've applied a pull request that has since been merged, so a recent build of llama.cpp should be able to convert the model using the convert and quantize scripts without errors. Still, it doesn't help much: the 128k version is not yet supported by llama.cpp, so the model converts and quantizes, but doesn't perform any differently than the Phi-3 model with 4k context length.
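
For reference, the conversion itself is just two steps. Roughly something like this (a sketch only; the script name, model folder and output filenames here are assumptions, and the exact names have changed between llama.cpp versions):

```
# convert the HF checkpoint to an f16 GGUF, then quantize it to Q6_K
python convert-hf-to-gguf.py ./Phi-3-mini-128k-instruct --outtype f16 --outfile phi-3-128k-f16.gguf
./quantize phi-3-128k-f16.gguf phi-3-128k-Q6_K.gguf Q6_K
```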

Interesting. I have managed to get the Q6_K version from PrunaAI to kind of work with Ollama by setting the context to 60000. That said, I just created two GGUFs myself, Q8_0 and Q6_K, and neither of them seems to work even after dropping the context size.
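
In case it helps, this is roughly how I set the context in Ollama, via a Modelfile (the file names and model tag here are just placeholders):

```
# point a Modelfile at the local GGUF and cap the context at 60000
cat > Modelfile <<'EOF'
FROM ./phi-3-128k-Q6_K.gguf
PARAMETER num_ctx 60000
EOF
ollama create phi3-128k-q6 -f Modelfile
ollama run phi3-128k-q6
```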

The other thing that seems very strange is that some flags are set in such a way that the model is replicated across multiple GPUs, and several times on each GPU. The Q6_K is only 3.1 GB, but Ollama fills 11.2 GiB on each of 4 GPUs.
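
One thing I may try (just a guess on my part, and it assumes the CUDA backend) is pinning the Ollama server to a single GPU, to see whether the replication across cards is what blows up the memory use:

```
# make only the first GPU visible to the Ollama server before loading the model
CUDA_VISIBLE_DEVICES=0 ollama serve
```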

This seems strange to me. My setup doesn't use a GPU at all, and the model (even f16) always fits into 16 GB of RAM (8 GB or slightly above most of the time). But then I have absolutely no clue how the model is distributed over multiple GPUs, or whether that is actually necessary for it to work.
