---
library_name: transformers
license: apache-2.0
language:
- en
tags:
- causal-lm
- Large Language Model
- LLM
- detoxification
- unbias
- bias
- instruction
- finetuned
- llama2
- DPO
---

# Model Card for SungJoo/llama2-7b-sft-detox

## Model Details

### Model Description

This model is built on the LLaMA-2-7b architecture and has been refined with instruction tuning and Direct Preference Optimization (DPO).

- **Developed by:** Sungjoo Byun (Grace Byun)
- **Model type:** Auto-regressive language model
- **Language(s) (NLP):** English
- **License:** Apache License 2.0
- **Finetuned from:** meta-llama/Llama-2-7b-hf

### Model Sources

- **Repository:** TBD
- **Paper:** TBD

## Uses

This model is intended for generating less toxic language in applications such as chatbots and other NLP systems.

## Bias, Risks, and Limitations

While this model aims to reduce toxicity, it may still generate biased or harmful content. Users should apply it with caution and review its outputs in sensitive applications.

## How to Get Started with the Model

Use the code below to get started with the model:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

DEV = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

adapter_path = "SungJoo/llama2-7b-sft-dpo-detox"

# Load model
model = AutoModelForCausalLM.from_pretrained(
    adapter_path,
    torch_dtype=torch.bfloat16
).to(DEV)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(adapter_path)
```
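
Continuing from the loading snippet above, a minimal generation call could look like the following. The prompt and sampling settings are illustrative assumptions, not values recommended by the authors.

```python
# Illustrative prompt; any instruction-style input works.
prompt = "How should I respond to a rude comment online without escalating?"
inputs = tokenizer(prompt, return_tensors="pt").to(DEV)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,   # length of the generated continuation
        do_sample=True,       # sample instead of greedy decoding
        temperature=0.7,
        top_p=0.9,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```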

Refer to the following repository for further usage examples: https://github.com/mzbac/llama2-fine-tune

## Training Details

- **Parameter-Efficient Fine-Tuning (PEFT):** the model was fine-tuned with PEFT rather than full-parameter updates.

- **BitsAndBytes configuration (bnb_config):** training used 4-bit quantization via the bitsandbytes library to improve memory and compute efficiency; a typical configuration is sketched below.
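
The exact PEFT and quantization settings are not published in this card, so the sketch below only shows a typical 4-bit `bnb_config` plus LoRA adapter configuration of the kind described above; all values are assumptions.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# Typical 4-bit quantization setup with bitsandbytes (values are assumptions).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Typical LoRA adapter configuration for a LLaMA-style model (also an assumption).
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
```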

### Training Data

The model was trained on a dataset specifically constructed to detoxify LLMs. The DPO dataset will be made publicly available soon.

### Training Procedure

DPO was applied to "SungJoo/llama2-7b-sft-detox" with the following hyperparameters:

| **Hyperparameter** | **Value** |
|--------------------|-----------|
| Batch size         | 4         |
| Learning rate      | 2e-4      |
| Epochs             | 10        |
| Max length         | 2,048     |
| Max prompt length  | 1,024     |
| Beta               | 0.1       |
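
As a rough illustration of how these hyperparameters map onto a DPO run, the sketch below uses trl's `DPOTrainer`. The dataset path is a placeholder (the DPO dataset is not yet released), the quantization and PEFT setup from the previous section is omitted for brevity, and argument names may differ slightly across trl versions.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# SFT checkpoint that DPO is applied to (see above).
base = "SungJoo/llama2-7b-sft-detox"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base)

# Placeholder file: a preference dataset with "prompt", "chosen", "rejected" columns.
dpo_dataset = load_dataset("json", data_files="dpo_pairs.json")["train"]

training_args = DPOConfig(
    output_dir="llama2-7b-sft-dpo-detox",
    per_device_train_batch_size=4,   # Batch size
    learning_rate=2e-4,              # Learning rate
    num_train_epochs=10,             # Epochs
    max_length=2048,                 # Max length
    max_prompt_length=1024,          # Max prompt length
    beta=0.1,                        # Beta
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dpo_dataset,
    processing_class=tokenizer,      # named `tokenizer` in older trl releases
)
trainer.train()
```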



## Objective

The main objective of this research is to reduce toxicity in LLMs by applying instruction tuning and Direct Preference Optimization (DPO).
A comprehensive instruction and DPO dataset was constructed for this purpose and will be released in the future.
The table below shows the effectiveness of this model in reducing toxicity, measured on the RealToxicityPrompts dataset with the Perspective API.

| **Model**          | **LLaMA-2-base**  |                       | **Finetuned LLaMA-2** |                         | **DPO LLaMA-2**      |                         |
|--------------------|-------------------|-----------------------|-----------------------|-------------------------|-----------------------|-------------------------|
| **Category**       | **\>=0.5 (%)**    | **Count**             | **\>=0.5 (%)**        | **Count**               | **\>=0.5 (%)**        | **Count**               |
| **TOXICITY**       | 4.46              | 4,438                 | 3.61                  | 3,593                   | 2.39                  | 2,377                   |
|                    |                   |                       | <span style="color:blue;">(-0.85)</span> | <span style="color:blue;">(-845)</span>  | <span style="color:green;">(-1.22)</span> | <span style="color:green;">(-1,216)</span> |
| **SEVERE_TOXICITY**| 0.08              | 77                    | 0.07                  | 70                      | 0.03                  | 31                      |
|                    |                   |                       | <span style="color:blue;">(-0.01)</span> | <span style="color:blue;">(-7)</span>    | <span style="color:green;">(-0.04)</span>  | <span style="color:green;">(-39)</span>   |
| **IDENTITY_ATTACK**| 0.79              | 788                   | 0.42                  | 413                     | 0.28                  | 274                     |
|                    |                   |                       | <span style="color:blue;">(-0.37)</span> | <span style="color:blue;">(-375)</span>  | <span style="color:green;">(-0.14)</span>  | <span style="color:green;">(-139)</span>  |
| **INSULT**         | 1.97              | 1,961                 | 1.60                  | 1,588                   | 0.90                  | 892                     |
|                    |                   |                       | <span style="color:blue;">(-0.37)</span> | <span style="color:blue;">(-373)</span>  | <span style="color:green;">(-0.70)</span>  | <span style="color:green;">(-696)</span>  |
| **PROFANITY**      | 2.10              | 2,086                 | 1.76                  | 1,753                   | 1.04                  | 1,030                   |
|                    |                   |                       | <span style="color:blue;">(-0.34)</span> | <span style="color:blue;">(-333)</span>  | <span style="color:green;">(-0.72)</span>  | <span style="color:green;">(-723)</span>  |
| **THREAT**         | 1.43              | 1,424                 | 0.92                  | 919                     | 0.76                  | 754                     |
|                    |                   |                       | <span style="color:blue;">(-0.51)</span> | <span style="color:blue;">(-505)</span>  | <span style="color:green;">(-0.16)</span>  | <span style="color:green;">(-165)</span>  |
*Comparison of LLaMA-2-base, Finetuned LLaMA-2, and DPO LLaMA-2 across Perspective API categories. Values in blue show the change from the base model to the fine-tuned model; values in green show the change from the fine-tuned model to the DPO model.*
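
For reference, scores like those in the table can be obtained by sending each model continuation to the Perspective API. The sketch below scores a single text for the six attributes above; the API key is a placeholder, and the `google-api-python-client` package is required.

```python
from googleapiclient import discovery

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # placeholder

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

request = {
    "comment": {"text": "model continuation to score"},
    "requestedAttributes": {
        "TOXICITY": {}, "SEVERE_TOXICITY": {}, "IDENTITY_ATTACK": {},
        "INSULT": {}, "PROFANITY": {}, "THREAT": {},
    },
}
response = client.comments().analyze(body=request).execute()

# Summary score per attribute; a continuation counts toward the table when >= 0.5.
scores = {attr: v["summaryScore"]["value"] for attr, v in response["attributeScores"].items()}
print(scores)
```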


## Contact
For any questions or issues, please contact [email protected].