---
license: apache-2.0
language:
- en
- hi
library_name: transformers
pipeline_tag: translation
tags:
- translation
- Bilingual
datasets:
- Aarif1430/english-to-hindi
- Sampuran01/english-hindi-translation
metrics:
- bleu
---

# Model Description
This is a merged LoRA model, fine-tuned from the base model [sarvamai/OpenHathi-7B-Hi-v0.1-Base](https://huggingface.co/sarvamai/OpenHathi-7B-Hi-v0.1-Base) using [Unsloth](https://github.com/unslothai/unsloth).
It can translate from English to Hindi and from Hindi to English.
<img src="https://cdn-uploads.huggingface.co/production/uploads/6487239cca30096ea9f52115/Rsixw_aSB-ytZT7VEQ06c.jpeg" width="500" height="500" alt="Image">

# Steps to try the model

## Load the model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the fine-tuned model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("damerajee/openhathi-h2e-e2h")
model = AutoModelForCausalLM.from_pretrained("damerajee/openhathi-h2e-e2h")
```
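If a GPU is available, you can optionally load the model in half precision with automatic device placement. This is a minimal sketch using standard `transformers` arguments (it assumes the `accelerate` package is installed) and is not required by the model:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Optional: half precision + automatic device placement for faster inference
tokenizer = AutoTokenizer.from_pretrained("damerajee/openhathi-h2e-e2h")
model = AutoModelForCausalLM.from_pretrained(
    "damerajee/openhathi-h2e-e2h",
    torch_dtype=torch.float16,  # reduces memory use on GPU
    device_map="auto",          # requires the `accelerate` package
)
```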
## Inference

### For English to Hindi (e2h)
```python
# Wrap the source sentence in the prompt format the model expects
inputs = tokenizer(["[INST]translate this from english to hindi: Be a free thinker and don't accept everything you hear as truth. Be critical and evaluate what you believe in. [/INST]<s> hindi output:"], return_tensors = "pt")

# Increase max_new_tokens for longer translations
outputs = model.generate(**inputs, max_new_tokens = 18, use_cache = True)
tokenizer.batch_decode(outputs)
```
### For Hindi to English (h2e)
```python
# Same pattern, with the Hindi-to-English prompt. The Hindi input means:
# "If you want to shine like the sun, learn to burn like the sun."
inputs = tokenizer(["[INST]translate this from hindi to english: अगर तुम सूरज की तरह चमकना चाहते हो, तो सूरज की तरह जलना सीखो।[/INST]<s> english output:"], return_tensors = "pt")

outputs = model.generate(**inputs, max_new_tokens = 18, use_cache = True)
tokenizer.batch_decode(outputs)
```
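Since the two calls above differ only in the prompt template, it can be convenient to wrap them in a small helper. The sketch below is illustrative; the `translate` function and its defaults are assumptions, not part of the released model:

```python
def translate(text, direction="e2h", max_new_tokens=64):
    """Translate with the prompt formats shown above.

    direction: "e2h" (English to Hindi) or "h2e" (Hindi to English).
    """
    if direction == "e2h":
        prompt = f"[INST]translate this from english to hindi: {text} [/INST]<s> hindi output:"
    else:
        prompt = f"[INST]translate this from hindi to english: {text}[/INST]<s> english output:"
    inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, use_cache=True)
    # Return only the newly generated tokens, not the prompt
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(translate("Knowledge is power.", direction="e2h"))
```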


# Dataset
* The dataset used was a combination of the two datasets listed above, which gave a total of 1,786,788 rows.
* The rows were then pre-processed to look something like this:

```
[INST]translate this from english to hindi: When it is said to him: 'Fear Allah' egotism takes him in his sin. Gehenna (Hell) shall be enough for him. How evil a cradling! [/INST] hindi output: और जब उससे कहा जाता है, "अल्लाह से डर", तो अहंकार उसे और गुनाह पर जमा देता है। अतः उसके लिए तो जहन्नम ही काफ़ी है, और वह बहुत-ही बुरी शय्या है!
```
* This was done for both English to Hindi and Hindi to English, hence the names e2h and h2e.
* Doing this for both directions gives a total of over 3 million rows; a hypothetical sketch of the preprocessing is shown below.
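
For illustration only, the preprocessing could look roughly like the sketch below; the column names (`english_sentence`, `hindi_sentence`) and the exact template are assumptions based on the example above, not the actual training script:

```python
# Hypothetical preprocessing sketch: build one prompt per direction
# from a parallel English-Hindi pair. Column names are assumptions.
def make_rows(example):
    en = example["english_sentence"]
    hi = example["hindi_sentence"]
    e2h = f"[INST]translate this from english to hindi: {en} [/INST] hindi output: {hi}"
    h2e = f"[INST]translate this from hindi to english: {hi} [/INST] english output: {en}"
    return [e2h, h2e]
```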

# Training details
* The model was loaded in 4-bit.
* The LoRA target modules were "q_proj", "k_proj", "v_proj", and "o_proj".
* The fine-tuning was done on a free Google Colab instance with a single T4 GPU (huge thanks to Unsloth for this).
* Even though the full dataset was almost 3 million rows, the LoRA model was fine-tuned on only 1 million rows for each direction; a configuration sketch follows below.
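
A rough sketch of this setup with Unsloth is shown below; the sequence length, rank, and alpha values are illustrative assumptions, not the actual training hyperparameters:

```python
from unsloth import FastLanguageModel

# Load the base model in 4-bit, as described above
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="sarvamai/OpenHathi-7B-Hi-v0.1-Base",
    max_seq_length=2048,  # illustrative value
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention projections listed above;
# r and lora_alpha here are illustrative, not the actual config
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```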
  
# Limitations
The model was not trained on the full dataset, and not much evaluation has been done, so any contributions (for example, evaluation results) would be helpful.
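
If you would like to contribute an evaluation, a minimal BLEU sketch with the Hugging Face `evaluate` library could look like this; the sentences are placeholders:

```python
import evaluate

# Minimal BLEU sketch; predictions and references are placeholders
bleu = evaluate.load("bleu")
predictions = ["Knowledge is power."]
references = [["Knowledge is power."]]
print(bleu.compute(predictions=predictions, references=references))
```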

As of right now this is a smaller model; a better model trained on a better dataset will be released.