How does giving negative reward to complex contexts make the model better?

#2
by maywell - opened

Hi, I appreciate your amazing work.

I'm just wondering

# Note - Example of a complex phrase
good_phrases:
  - phrase: "The apple is in the bedroom"
    weight: 1
    contexts: ["Question: If I'm in the living room and pick up the apple, go to the bedroom and drop the apple, then walk to the kitchen, where is the apple? Explain your reasoning. Answer: "]

why these reasoning examples receive a negative reward.

That's because the algorithm chases a single number (the combined probability of all phrases), which it tries to decrease as much as possible.

Bad phrases should have their probability lowered.
Good phrases should have their probability increased; behind the scenes, that increase is negated and subtracted from the total, so a lower total is still what's desired.
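The scoring idea described above can be sketched as a single scalar to minimize. This is a minimal illustrative sketch, not the project's actual code; the function and field names (`combined_score`, `prob`) are hypothetical:

```python
# Hypothetical sketch of the single combined score described above.
# Lower is better: bad-phrase probabilities add to the score, while
# good-phrase probabilities are negated, so raising them also lowers it.

def combined_score(good_phrases, bad_phrases):
    score = 0.0
    for p in bad_phrases:
        score += p["weight"] * p["prob"]  # bad: higher prob -> higher score
    for p in good_phrases:
        score -= p["weight"] * p["prob"]  # good: higher prob -> lower score
    return score

# Toy example: one good phrase the model assigns 0.6, one bad at 0.2.
good = [{"phrase": "The apple is in the bedroom", "weight": 1, "prob": 0.6}]
bad = [{"phrase": "The apple is in the kitchen", "weight": 1, "prob": 0.2}]
print(combined_score(good, bad))
```

Under this framing, making the good phrase more likely (raising its `prob`) drives the combined score further down, which is why "lower" is the target for both kinds of phrases.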

Ah, thank you, now I understand!

You made me review the corresponding code, and I just found a mistake in my logic (introduced by extensive additions since then) that I'll be hotfixing shortly. So thank you for asking!

Gryphe changed discussion status to closed
