Why does this one perform substantially worse in some tests?

#2
by QuamtumX - opened

What is the difference between yours and the other one I listed below, other than the fact that yours lacks censorship?

cognitivecomputations: dolphin-2.9-llama3-8b-q8_0.gguf (Failed 9 out of 10):

  1. She picked up an apple from the ground and took a bite.
  2. He saw a red apple on the tree, glistening in the sun.
  3. They painted a beautiful picture of a green apple against a blue sky.
  4. The little girl was excited to have her first taste of the crunchy apple.
  5. In the story, the main character discovered an enchanted apple that turned everything it touched into gold.
  6. As she walked by the orchard, she inhaled the sweet aroma of thousands of apples.
  7. After playing all day at the park, he couldn't wait to have his favorite apple slices with mom's homemade sauce.
  8. When the farmer handed her an entire peck of fresh, crisp apples, she felt incredibly grateful and lucky.
  9. In the midst of a forest filled with various fruit trees, the ancient Granny Smith apple tree stood tall, proud, and strong.
  10. The sweet and tart flavors of the freshly picked apples combined to make a delightful, mouthwatering treat for all ages.

Bartowksi model Meta-Llama-3-8B-Instruct-Q8_0.gguf (Succeeded 10 out of 10) :

  1. The teacher handed out a freshly picked apple.
  2. She decided to take a bite into the crisp apple.
  3. He carefully placed the juicy red apple.
  4. After lunch, they all sat under the tree and shared an apple.
  5. His wife loved to bake delicious pies with fresh apple.
  6. The farmer's market was filled with baskets of sweet apple.
  7. She mixed it in a blender with some ripe banana and a crunchy apple.
  8. At the top of the hill stood a giant statue of a shiny green apple.
  9. They packed snacks for their hike, including a few fresh Granny Smith apple.
  10. The chef added a sprinkle of cinnamon to the warm, caramelized apple.

When companies release open-weight models, they usually also release a fine-tune of their base model, but they are almost always very basic. In addition, it seems like most are fine-tuned on synthetic data because they sound and behave very much like ChatGPT.

For LLama 3, Meta fine-tuned it on 10 million high-quality samples, possibly human-made or at the very least had humans somewhere in the process, that were then ranked by human preference. If you've played around with it enough, you'll see that LLama 3 has a personality by default and doesn't exhibit any ChatGPT-like behavior, making it clear that they put a lot of care into it, or at the very least money.

I think this time around, it will be very hard for third-party fine-tunes to surpass the original. Until open-source datasets can get close to their quality, I think running DPO on Meta's own fine-tune might be the way to go.

I still don't get it. These dolphin models have censorship removed, right? Then why does it for example, like in this case, lose the ability to write 10 sentences that end with the word "apple"? Then what's the point of having a model with censorship removed, if it becomes crippled?

All this new AI technology is so poorly explained everywhere, at times it feels that the focus of any AI youtube channel just seems to be to entertain fellow AI nerds, while ignoring any newcomers to AI.
I'm an experienced IT guy, and more often than I like I can't bloody understand what the heck they're talking about.
Why can't we have more excellent videos like this: "[1hr Talk] Intro to Large Language Models" by Andrej Karpathy?

Cognitive Computations org

To answer your question, we do not train from the instruct version, which is where you're seeing those improvements - we train the base model.

Crystalcareai changed discussion status to closed

@QuamtumX

To put it simply, a base model is what is created from the initial training of a large language model (LLM). This phase introduces essentially all of the information the model will know; it is not trained to follow instructions. You need to have a base model first in order to do instruction fine-tuning.

Instruction fine-tuning is where you teach the LLM how to follow instructions by giving it an input, a human asking it whatever question, and an output, the response we want the LLM to have to that question. Fine-tuning requires a chat template, such as ChatML or Alpaca, in order to introduce the concept of a system message and clearly separate the AI and Human messages.

Fine-tuning is also where alignment is introduced; basically, the values you want to instill into the model and the alignment companies put into their models are usually very, very restrictive.

Essentially, it's trained like this:

LLama-3-8B-base_model --> LLama-3-8B-Instruct
LLama-3-8B-base_model --> dolphin-2.9-llama3-8B

And not like this:

LLama-3-8B-Instruct --> dolphin-2.9-llama3-8B

It's not that Dolphin made the model dumber; it did improve over the base model; it's just that Meta's fine-tune had higher-quality data, so when comparing the two, it performs better. Meta can just throw money and get good data, but most open-source data sets are comprised of synthetic data (AI-generated). The great thing about synthetic data is that, when models improve, so does the data we can get from it. This means that open models likely won't be behind for long, since we can just create a better synthetic dateset now, especially once LLama-3-400B releases.

Oh, so that's how it works, now it all starts to make sense to me.
What kind of you to explain things in a way anyone can understand. It's people like you that encourage newcomers like me to continue this interesting AI journey, instead of getting frustrated, and losing interest. Hopefully one day I've gained enough understanding to share my own experiences and knowledge with others, like you're doing.

Thank you very much HiroseKoichi!

Sign up or log in to comment