
dense_reward_trainer_final_opt__NumTrainEpochs5_SaveStrategiesepoch_reward_modeling_anthropic_hh

This model is a fine-tuned version of facebook/opt-1.3b, trained as a reward model (the model name indicates the Anthropic HH-RLHF preference dataset). It achieves the following results on the evaluation set:

  • Loss: 2.4124
  • Accuracy: 0.6660
  • Train Rewards/chosen: 9.2061
  • Train Rewards/rejected: -9.4536
  • Train Rewards/accuracies: 0.9844
  • Train Rewards/margins: 18.6597
  • Train Nll Loss: 2.1547
  • Train Logit Total Loss: 0.0587
  • Train Logit Loss: 0.0375
  • Rewards/chosen: 3.4303
  • Rewards/rejected: -2.3575
  • Rewards/accuracies: 0.6484
  • Rewards/margins: 5.7878
  • Nll Loss: 2.1950
  • Logit Total Loss: 2.4421
  • Logit Loss: 2.4446
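The rewards metrics above follow the usual pairwise reward-modeling convention: `Rewards/margins` is the mean reward of the chosen response minus that of the rejected one. Assuming the standard Bradley-Terry pairwise loss used for reward models, the margin maps to an implied preference probability via a sigmoid. A minimal sketch using the final evaluation numbers:

```python
import math

def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """P(chosen preferred over rejected) under a Bradley-Terry model:
    the sigmoid of the reward margin."""
    margin = reward_chosen - reward_rejected
    return 1.0 / (1.0 + math.exp(-margin))

# Final evaluation rewards reported above.
eval_margin = 3.4303 - (-2.3575)   # matches Rewards/margins = 5.7878
p = preference_probability(3.4303, -2.3575)
print(f"margin={eval_margin:.4f}, P(chosen)={p:.4f}")
```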

Model description

More information needed

Intended uses & limitations

More information needed
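A reward model of this kind is typically used to score or rank candidate responses to a prompt. The sketch below shows that ranking pattern with a hypothetical stand-in scorer so it runs without downloading weights; in a real deployment the score would come from a forward pass of the fine-tuned OPT-1.3b (for example via a sequence-classification head in transformers), which is not loaded here.

```python
from typing import Callable, List

def rank_responses(prompt: str,
                   candidates: List[str],
                   score_fn: Callable[[str, str], float]) -> List[str]:
    """Order candidate responses by reward, highest first.
    `score_fn` stands in for the reward model's scalar output."""
    return sorted(candidates, key=lambda c: score_fn(prompt, c), reverse=True)

# Hypothetical toy scores for illustration only; a real deployment would
# compute these with the fine-tuned reward model's forward pass.
toy_scores = {"polite answer": 2.5, "rude answer": -1.3}
score_fn = lambda prompt, cand: toy_scores[cand]

ranked = rank_responses("How do I reset my password?",
                        ["rude answer", "polite answer"], score_fn)
print(ranked)  # → ['polite answer', 'rude answer']
```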

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1.41e-05
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 5
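The listed total_train_batch_size follows from the per-device batch size times the gradient-accumulation steps, and the linear scheduler decays the learning rate from its peak toward zero over training. A minimal sketch of that arithmetic (the total step count is illustrative, not taken from this run, and warmup is assumed to be zero):

```python
def effective_batch_size(per_device: int, accumulation: int, num_devices: int = 1) -> int:
    """Effective batch size = per-device batch * accumulation steps * devices."""
    return per_device * accumulation * num_devices

def linear_lr(step: int, total_steps: int, base_lr: float = 1.41e-05) -> float:
    """Linear decay from base_lr to 0, matching lr_scheduler_type='linear'
    with no warmup (an assumption; the run's warmup setting is not reported)."""
    return base_lr * max(0.0, 1.0 - step / total_steps)

print(effective_batch_size(4, 4))        # matches total_train_batch_size: 16
total_steps = 4000                        # illustrative; depends on dataset size
print(linear_lr(0, total_steps), linear_lr(2000, total_steps))
```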

Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Nll Loss | Logit Total Loss | Logit Loss |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.7077 | 0.11 | 100 | 0.6897 | 0.6165 | -1.6378 | -1.8001 | 0.6016 | 0.1622 | 2.8092 | 0.6881 | 0.6667 |
| 0.7117 | 0.23 | 200 | 0.6764 | 0.6103 | -2.8148 | -3.0536 | 0.5964 | 0.2388 | 2.8927 | 0.6746 | 0.6522 |
| 0.6502 | 0.34 | 300 | 0.6626 | 0.6536 | -0.8018 | -1.1645 | 0.6399 | 0.3627 | 2.9696 | 0.6611 | 0.6377 |
| 0.655 | 0.46 | 400 | 0.6503 | 0.6144 | -1.5457 | -1.9648 | 0.5984 | 0.4191 | 2.7773 | 0.6489 | 0.6274 |
| 0.6467 | 0.57 | 500 | 0.6653 | 0.6165 | -0.9541 | -1.3483 | 0.6036 | 0.3942 | 2.8139 | 0.6643 | 0.6426 |
| 0.6694 | 0.69 | 600 | 0.6432 | 0.6392 | -1.5917 | -1.9439 | 0.6278 | 0.3522 | 2.7779 | 0.6426 | 0.6211 |
| 0.6753 | 0.8 | 700 | 0.6494 | 0.6371 | -1.3508 | -1.7191 | 0.6246 | 0.3683 | 2.8056 | 0.6474 | 0.6256 |
| 0.6806 | 0.91 | 800 | 0.6449 | 0.6103 | -1.4576 | -1.8165 | 0.6004 | 0.3589 | 2.7215 | 0.6424 | 0.6214 |
| 0.5434 | 1.03 | 900 | 0.6827 | 0.6557 | -0.8965 | -1.6611 | 0.6468 | 0.7645 | 2.6762 | 0.6816 | 0.6615 |
| 0.5448 | 1.14 | 1000 | 0.7194 | 0.6392 | -0.8661 | -1.8265 | 0.6266 | 0.9604 | 2.6214 | 0.7184 | 0.6992 |
| 0.5129 | 1.26 | 1100 | 0.7990 | 0.6289 | 1.3108 | 0.2390 | 0.6165 | 1.0718 | 2.6526 | 0.7966 | 0.7779 |
| 0.5033 | 1.37 | 1200 | 0.6888 | 0.6557 | -0.9571 | -1.8601 | 0.6488 | 0.9030 | 2.6263 | 0.6868 | 0.6672 |
| 0.404 | 1.49 | 1300 | 0.7422 | 0.6309 | -1.1408 | -2.0297 | 0.6226 | 0.8890 | 2.6046 | 0.7348 | 0.7159 |
| 0.5512 | 1.6 | 1400 | 0.6762 | 0.6474 | -2.5166 | -3.3023 | 0.6327 | 0.7857 | 2.5872 | 0.6766 | 0.6573 |
| 0.4558 | 1.71 | 1500 | 0.6843 | 0.6619 | -2.3183 | -3.2412 | 0.6476 | 0.9229 | 2.5268 | 0.6811 | 0.6625 |
| 0.5184 | 1.83 | 1600 | 0.7135 | 0.6557 | -1.5991 | -2.5538 | 0.6456 | 0.9547 | 2.5671 | 0.7179 | 0.6992 |
| 0.4213 | 1.94 | 1700 | 0.7220 | 0.6495 | -1.3947 | -2.4198 | 0.6395 | 1.0251 | 2.5040 | 0.7198 | 0.7018 |
| 0.1508 | 2.06 | 1800 | 1.0827 | 0.6598 | 2.5282 | 0.2534 | 0.6476 | 2.2748 | 2.6437 | 1.0758 | 1.0599 |
| 0.1216 | 2.17 | 1900 | 1.1376 | 0.6474 | -0.0750 | -2.1523 | 0.6302 | 2.0773 | 2.5506 | 1.1502 | 1.1361 |
| 0.1044 | 2.29 | 2000 | 1.4682 | 0.6722 | -0.4860 | -3.5268 | 0.6577 | 3.0408 | 2.5292 | 1.4836 | 1.4730 |
| 0.0952 | 2.4 | 2100 | 1.6303 | 0.6639 | 1.9842 | -1.3673 | 0.6444 | 3.3515 | 2.5293 | 1.6377 | 1.6287 |
| 0.1951 | 2.51 | 2200 | 1.1515 | 0.6784 | -0.0674 | -2.4660 | 0.6637 | 2.3985 | 2.4589 | 1.1463 | 1.1331 |
| 0.1119 | 2.63 | 2300 | 1.3845 | 0.6722 | 4.4149 | 1.2669 | 0.6548 | 3.1480 | 2.4797 | 1.3869 | 1.3759 |
| 0.1613 | 2.74 | 2400 | 1.1948 | 0.6536 | -4.3162 | -7.1133 | 0.6367 | 2.7971 | 2.4661 | 1.2014 | 1.1887 |
| 0.1408 | 2.86 | 2500 | 1.4167 | 0.6557 | -3.1501 | -6.3592 | 0.6415 | 3.2091 | 2.4591 | 1.4242 | 1.4137 |
| 0.2694 | 2.97 | 2600 | 1.2168 | 0.6536 | 0.5185 | -2.2531 | 0.6395 | 2.7716 | 2.4397 | 1.2074 | 1.1949 |
| 0.1184 | 3.09 | 2700 | 1.6729 | 0.6412 | 0.5427 | -3.2829 | 0.6315 | 3.8257 | 2.4188 | 1.6627 | 1.6551 |
| 0.1004 | 3.2 | 2800 | 1.8768 | 0.6742 | 3.9205 | -0.6543 | 0.6629 | 4.5748 | 2.3906 | 1.8625 | 1.8572 |
| 0.1029 | 3.31 | 2900 | 1.7461 | 0.6619 | 0.1775 | -4.2079 | 0.6496 | 4.3854 | 2.3534 | 1.7356 | 1.7294 |
| 0.0401 | 3.43 | 3000 | 1.9949 | 0.6825 | 3.6497 | -1.3819 | 0.6698 | 5.0317 | 2.3327 | 1.9902 | 1.9868 |
| 0.04 | 3.54 | 3100 | 2.0206 | 0.6763 | -0.5106 | -5.0903 | 0.6597 | 4.5798 | 2.3202 | 2.0224 | 2.0194 |
| 0.1035 | 3.66 | 3200 | 2.1971 | 0.6660 | 2.3511 | -2.5645 | 0.6536 | 4.9156 | 2.3137 | 2.2218 | 2.2209 |
| 0.0589 | 3.77 | 3300 | 2.1599 | 0.6412 | 2.0054 | -2.7469 | 0.6262 | 4.7523 | 2.2936 | 2.1789 | 2.1777 |
| 0.084 | 3.89 | 3400 | 2.2096 | 0.6598 | 1.7952 | -3.0061 | 0.6391 | 4.8013 | 2.2833 | 2.2386 | 2.2382 |
| 0.063 | 4.0 | 3500 | 2.2277 | 0.6660 | 4.2291 | -0.8513 | 0.6484 | 5.0805 | 2.2693 | 2.2539 | 2.2537 |
| 0.065 | 4.11 | 3600 | 2.3431 | 0.6598 | 2.1719 | -3.1923 | 0.6444 | 5.3642 | 2.2499 | 2.3575 | 2.3585 |
| 0.0453 | 4.23 | 3700 | 2.4069 | 0.6474 | 5.6839 | 0.2229 | 0.6335 | 5.4609 | 2.2344 | 2.4327 | 2.4347 |
| 0.0377 | 4.34 | 3800 | 2.4983 | 0.6557 | 2.7785 | -2.9928 | 0.6355 | 5.7714 | 2.2258 | 2.5397 | 2.5429 |
| 0.0559 | 4.46 | 3900 | 2.4027 | 0.6536 | 2.8063 | -2.8587 | 0.6375 | 5.6650 | 2.2135 | 2.4278 | 2.4299 |
| 0.0219 | 4.57 | 4000 | 2.4322 | 0.6598 | 3.9024 | -1.8412 | 0.6435 | 5.7436 | 2.2081 | 2.4805 | 2.4832 |
| 0.09 | 4.69 | 4100 | 2.4041 | 0.6680 | 3.7769 | -1.9890 | 0.6496 | 5.7659 | 2.2011 | 2.4248 | 2.4271 |
| 0.0897 | 4.8 | 4200 | 2.3727 | 0.6722 | 2.7679 | -3.0182 | 0.6524 | 5.7861 | 2.1974 | 2.3815 | 2.3833 |
| 0.0474 | 4.91 | 4300 | 2.4124 | 0.6660 | 3.4303 | -2.3575 | 0.6484 | 5.7878 | 2.1950 | 2.4421 | 2.4446 |

Framework versions

  • Transformers 4.37.2
  • PyTorch 2.4.0+cu121
  • Datasets 2.21.0
  • Tokenizers 0.15.2
Model size: 1.42B parameters (Safetensors, F32)
