--- duplicated_from: localmodels/LLM --- # Wizard Vicuna 13B Uncensored GPTQ From: https://huggingface.co/ehartford/Wizard-Vicuna-13B-Uncensored merged with [SuperHOT 8K](https://huggingface.co/kaiokendev/superhot-13b-8k-no-rlhf-test). **This is an experimental new GPTQ which offers up to 8K context size** The increased context is tested to work with [ExLlama](https://github.com/turboderp/exllama), via the latest release of [text-generation-webui](https://github.com/oobabooga/text-generation-webui). It has also been tested from Python code using AutoGPTQ and `trust_remote_code=True`. Please read carefully below to see how to use it. ## How to use this model in text-generation-webui with ExLlama Using the latest version of text-generation-webui: 1. Click the **Model tab**. 2. Under **Download custom model or LoRA**, enter `localmodels/Wizard-Vicuna-13B-Uncensored-SuperHOT-8K-GPTQ`. 3. Click **Download**. 4. The model will start downloading. Once it's finished it will say "Done" 5. Untick **Autoload the model** 6. In the top left, click the refresh icon next to **Model**. 7. In the **Model** dropdown, choose the model you just downloaded: `Wizard-Vicuna-13B-Uncensored-SuperHOT-8K-GPTQ` 8. To use the increased context, set the **Loader** to **ExLlama**, set **max_seq_len** to 8192 or 4096, and set **compress_pos_emb** to **4** for 8192 context, or to **2** for 4096 context. 9. Now click **Save Settings** followed by **Reload** 10. The model will automatically load, and is now ready for use! 11. Once you're ready, click the **Text Generation tab** and enter a prompt to get started! ## How to use this GPTQ model from Python code with AutoGPTQ First make sure you have AutoGPTQ and Einops installed: ``` pip3 install einops auto-gptq ``` Then run the following code. Note that in order to get this to work, `config.json` has been hardcoded to a sequence length of 8192. If you want to try 4096 instead to reduce VRAM usage, manually edit `config.json` to set `max_position_embeddings` to the value you want. ```python from transformers import AutoTokenizer, pipeline, logging from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig import argparse model_name_or_path = "TheBloke/Wizard-Vicuna-13B-Uncensored-SuperHOT-8K-GPTQ" model_basename = "wizard-vicuna-13b-uncensored-superhot-8k-GPTQ-4bit-128g.no-act.order" use_triton = False tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True) model = AutoGPTQForCausalLM.from_quantized(model_name_or_path, model_basename=model_basename, use_safetensors=True, trust_remote_code=True, device_map='auto', use_triton=use_triton, quantize_config=None) model.seqlen = 8192 # Note: check the prompt template is correct for this model. prompt = "Tell me about AI" prompt_template=f'''USER: {prompt} ASSISTANT:''' print("\n\n*** Generate:") input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda() output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512) print(tokenizer.decode(output[0])) # Inference can also be done using transformers' pipeline # Prevent printing spurious transformers error when using pipeline with AutoGPTQ logging.set_verbosity(logging.CRITICAL) print("*** Pipeline:") pipe = pipeline( "text-generation", model=model, tokenizer=tokenizer, max_new_tokens=512, temperature=0.7, top_p=0.95, repetition_penalty=1.15 ) print(pipe(prompt_template)[0]['generated_text']) ``` ## Model **wizard-vicuna-13b-uncensored-superhot-8k-GPTQ-4bit-128g.no-act.order.safetensors** This will work with AutoGPTQ, ExLlama, and CUDA versions of GPTQ-for-LLaMa. There are reports of issues with Triton mode of recent GPTQ-for-LLaMa. If you have issues, please use AutoGPTQ instead. It was created with group_size 128 to increase inference accuracy, but without --act-order (desc_act) to increase compatibility and improve inference speed. * `wizard-vicuna-13b-uncensored-superhot-8k-GPTQ-4bit-128g.no-act.order.safetensors` * Works for use with ExLlama with increased context (4096 or 8192) * Works with AutoGPTQ in Python code, including with increased context if `trust_remote_code=True` is set. * Parameters: Groupsize = 128. No act-order. --- ### SuperHOT Prototype 2 w/ 8K Context This is a second prototype of SuperHOT, this time 30B with 8K context and no RLHF, using the same technique described in [the github blog](https://kaiokendev.github.io/til#extending-context-to-8k). Tests have shown that the model does indeed leverage the extended context at 8K. #### Looking for Merged & Quantized Models? - 30B 4-bit CUDA: [tmpupload/superhot-30b-8k-4bit-safetensors](https://huggingface.co/tmpupload/superhot-30b-8k-4bit-safetensors) - 30B 4-bit CUDA 128g: [tmpupload/superhot-30b-8k-4bit-128g-safetensors](https://huggingface.co/tmpupload/superhot-30b-8k-4bit-128g-safetensors) #### Training Details I trained the LoRA with the following configuration: - 1200 samples (~400 samples over 2048 sequence length) - learning rate of 3e-4 - 3 epochs - The exported modules are: - q_proj - k_proj - v_proj - o_proj - no bias - Rank = 4 - Alpha = 8 - no dropout - weight decay of 0.1 - AdamW beta1 of 0.9 and beta2 0.99, epsilon of 1e-5 - Trained on 4-bit base model