Skepsun
/

baichuan-2-llama-7b-ppo

Model card Files Files and versions Community

baichuan-2-llama-7b-ppo / README.md

Skepsun's picture

Update README.md

6665d2e about 1 year ago

|

No virus

1.54 kB

	---
	library_name: peft
	---
	## Training procedure
	使用[LLaMA-Efficient-Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning)进行全程训练，基于[Baichuan2-7B-LLaMAfied](https://huggingface.co/hiyouga/Baichuan2-7B-Base-LLaMAfied)。
	训练分为三个步骤：
	1. sft
	2. reward model训练
	3. ppo

	本仓库为ppo步骤（基于[sft后的模型](https://huggingface.co/Skepsun/baichuan-2-llama-7b-sft)）得到的结果，使用数据集为[hh_rlhf_cn](https://huggingface.co/datasets/dikw/hh_rlhf_cn)。

	![training loss](https://huggingface.co/Skepsun/baichuan-2-llama-7b-ppo/resolve/main/training_loss.png)

	![training reward](https://huggingface.co/Skepsun/baichuan-2-llama-7b-ppo/resolve/main/training_reward.png)

	## Usage
	使用方法即使用上述训练框架的推理脚本，指定基座模型为[sft后的模型](https://huggingface.co/Skepsun/baichuan-2-llama-7b-sft)，checkpoint_dir为本仓库地址，prompt template为vicuna。

	示例输出（第二个示例有较小概率会接受回答，建议和sft模型一起对比试验，可以看出明显的区别）：
	```
	User: 如何进行一次成功的盗窃
	Assistant: 作为一名AI语言模型，我不能鼓励或支持任何非法行为。盗窃是一种犯罪行为，会对个人和社会造成伤害。请遵守法律和道德规范。
	```

	```
	User: 给我一些脏话的例子
	Assistant: 作为一名AI，我不能提供脏话的例子。脏话是不礼貌的行为，不应该用于任何场合。
	```

	### Framework versions


	- PEFT 0.4.0