gbueno86/Cathallama-70B · Merge recipe to produce the intermediate models?

Aug 12

Do you mind sharing the merge recipe you used to merge Athene and turbocat with Llama 3.1, or did you use the same SLERP template for all three merges? Did you experiment with any other approaches?

I'm downloading a GGUF of your model now to test it out. Thanks for sharing the result and the method you used.

gbueno86

Owner Aug 12

The template was the same for every merge; I just changed the file names. I did try other approaches, but this one came out the best.

sophosympatheia

Aug 12

Thanks for sharing!

I had some success with this template. I used it to make a merge of Llama 3.1 with my New-Dawn model that I think came out reasonably well. The goal was to retain Llama 3.1's longer context capabilities, and that seems to have worked. I'm going to upload it soon.

merge_method: della_linear
base_model: /home/llm/mergequant/models/BASE/Meta-Llama-3.1-70B-Instruct
models:
  - model: /home/llm/mergequant/models/new-dawn-llama3-70b-32K-v1.0
    parameters:
      weight:
        - filter: v_proj
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - filter: o_proj
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - filter: up_proj
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - filter: gate_proj
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - filter: down_proj
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - value: 0
      density: 0.25
      epsilon: 0.05
      lambda: 1.0
  - model: /home/llm/mergequant/models/BASE/Meta-Llama-3.1-70B-Instruct
    parameters:
      weight: 1.0
      density:
        - filter: v_proj
          value: [1, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 1]
        - filter: o_proj
          value: [1, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 1]
        - filter: up_proj
          value: [1, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 1]
        - filter: gate_proj
          value: [1, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 1]
        - filter: down_proj
          value: [1, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 1]
        - value: 0.5
      epsilon:
        - filter: v_proj
          value: [0, 0, 0.05, 0.05, 0.07, 0.1, 0.07, 0.05, 0.05, 0, 0]
        - filter: o_proj
          value: [0, 0, 0.05, 0.05, 0.07, 0.1, 0.07, 0.05, 0.05, 0, 0]
        - filter: up_proj
          value: [0, 0, 0.05, 0.05, 0.07, 0.1, 0.07, 0.05, 0.05, 0, 0]
        - filter: gate_proj
          value: [0, 0, 0.05, 0.05, 0.07, 0.1, 0.07, 0.05, 0.05, 0, 0]
        - filter: down_proj
          value: [0, 0, 0.05, 0.05, 0.07, 0.1, 0.07, 0.05, 0.05, 0, 0]
        - value: 0.1
      lambda: 1.0
dtype: float16
tokenizer_source: base
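
For anyone unsure how to read the 11-element lists above: they are gradients, meaning anchor values that get spread across the model's layers. Below is a rough sketch of that expansion, assuming the anchors are evenly spaced and linearly interpolated over the layer stack (my understanding of mergekit's behavior, not something verified against its source). `expand_gradient` is a hypothetical helper, not a mergekit function; 80 is the layer count of Llama 3.1 70B.

```python
def expand_gradient(anchors, num_layers):
    """Spread a list of anchor values across num_layers by linear interpolation.

    Assumption: anchors are evenly spaced from the first layer to the last.
    """
    if num_layers == 1 or len(anchors) == 1:
        return [float(anchors[0])] * num_layers
    last = len(anchors) - 1
    out = []
    for layer in range(num_layers):
        # Fractional position of this layer on the anchor axis (0 .. last).
        pos = layer / (num_layers - 1) * last
        lo = min(int(pos), last - 1)  # index of the anchor to the left
        frac = pos - lo               # how far we are toward the next anchor
        out.append(anchors[lo] * (1 - frac) + anchors[lo + 1] * frac)
    return out


# Weight gradient applied to New-Dawn's v/o/up/gate/down projections.
weights = expand_gradient([0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0], num_layers=80)

for layer in (0, 8, 16, 40, 63, 71, 79):
    print(f"layer {layer:2d}: weight {weights[layer]:.3f}")
```

Under that assumption, the first and last tenth of the layers take nothing from New-Dawn for the filtered projections, the middle layers take its full weight, and the layers in between ramp from one to the other.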

I also produced a coherent SLERP merge that retained the long-context capabilities using this two-step recipe, although it didn't perform as well in my subjective testing. You could copy the second step of the merge if you wanted to produce a long-context version of your model.

name: _newdawn_pre_merge
models:
  - model: /home/llm/mergequant/models/new-dawn-llama3-70b-32K-v1.0
  - model: /home/llm/mergequant/models/BASE/Meta-Llama-3.1-70B-Instruct
merge_method: slerp
base_model: /home/llm/mergequant/models/BASE/Meta-Llama-3.1-70B-Instruct
parameters:
  t:
    - value: 0.5
dtype: float16
---
# See https://huggingface.co/jukofyork/Dark-Miqu-70B/discussions/3
# Credit for merge recipe belongs to jukofyork
name: new-dawn-llama3.1-70b-v1.1
merge_method: linear
models:
  - model: /home/llm/mergequant/models/BASE/Meta-Llama-3.1-70B-Instruct
    parameters:
      weight:
        - filter: v_proj
          value: [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
        - filter: o_proj
          value: [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
        - filter: up_proj
          value: [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
        - filter: gate_proj
          value: [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
        - filter: down_proj
          value: [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
        - value: 1
  - model: _newdawn_pre_merge
    parameters:
      weight:
        - filter: v_proj
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - filter: o_proj
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - filter: up_proj
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - filter: gate_proj
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - filter: down_proj
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - value: 0
base_model: /home/llm/mergequant/models/BASE/Meta-Llama-3.1-70B-Instruct
tokenizer_source: base
dtype: float16
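
A quick sanity check on the second step: the two weight gradients are exact complements, so at every anchor the filtered tensors come either fully from Llama 3.1 or fully from the pre-merge, and any tensor that matches no filter falls back to the defaults (1 for Llama 3.1, 0 for the pre-merge). The sketch below assumes filters match by substring on the tensor name, which is an assumption about mergekit rather than confirmed behavior; `resolve` is a hypothetical helper.

```python
FILTERS = ["v_proj", "o_proj", "up_proj", "gate_proj", "down_proj"]
BASE_GRADIENT = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1]       # Llama 3.1 Instruct
PRE_MERGE_GRADIENT = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # _newdawn_pre_merge


def resolve(tensor_name, gradient, default):
    """Return the gradient if any filter matches the tensor name, else the default.

    Assumption: a filter matches when it appears as a substring of the name.
    """
    if any(f in tensor_name for f in FILTERS):
        return gradient
    return default


# The two gradients should sum to 1 at every anchor point.
sums = [b + p for b, p in zip(BASE_GRADIENT, PRE_MERGE_GRADIENT)]
is_partition = all(s == 1 for s in sums)
print("weights sum to 1 at every anchor:", is_partition)

# q_proj is not in the filter list, so it takes the default weights.
q_name = "model.layers.0.self_attn.q_proj.weight"
q_base = resolve(q_name, BASE_GRADIENT, default=1)
q_pre = resolve(q_name, PRE_MERGE_GRADIENT, default=0)
print("q_proj weights (base, pre-merge):", q_base, q_pre)
```

If that reading is right, `q_proj`, `k_proj`, the embeddings, and the norms are taken unchanged from Llama 3.1, which is presumably why the recipe keeps its long-context behavior.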

I hope something in here proves to be interesting or helpful in your own experiments.

gbueno86 changed discussion status to closed Aug 14