---
license: apache-2.0
language:
- en
- hi
library_name: transformers
pipeline_tag: translation
tags:
- translation
- Bilingual
datasets:
- Aarif1430/english-to-hindi
- Sampuran01/english-hindi-translation
metrics:
- bleu
---
# Model Description
This model is a merged LoRA fine-tune of the base model [sarvamai/OpenHathi-7B-Hi-v0.1-Base](https://huggingface.co/sarvamai/OpenHathi-7B-Hi-v0.1-Base), trained with [Unsloth](https://github.com/unslothai/unsloth).
It can translate from English to Hindi and from Hindi to English.
<img src="https://cdn-uploads.huggingface.co/production/uploads/6487239cca30096ea9f52115/Rsixw_aSB-ytZT7VEQ06c.jpeg" width="500" height="500" alt="Image">
# Steps to try the model
## Load the model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("damerajee/openhathi-h2e-e2h")
model = AutoModelForCausalLM.from_pretrained("damerajee/openhathi-h2e-e2h")
```
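Since this is a 7B-parameter model, you may want to load it quantized for inference on smaller GPUs. A minimal sketch, assuming `bitsandbytes` is installed and using the standard `transformers` quantization config (this loading path is not part of the original card):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load the weights in 4-bit to cut memory use (requires a CUDA GPU)
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained("damerajee/openhathi-h2e-e2h")
model = AutoModelForCausalLM.from_pretrained(
    "damerajee/openhathi-h2e-e2h",
    quantization_config=bnb_config,
    device_map="auto",
)
```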
## Inference
### For English to Hindi (e2h)
```python
inputs = tokenizer(
    ["[INST]translate this from english to hindi: Be a free thinker and don't accept everything you hear as truth. Be critical and evaluate what you believe in. [/INST]<s> hindi output:"],
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_new_tokens=18, use_cache=True)
print(tokenizer.batch_decode(outputs))
```
### For Hindi to English (h2e)
```python
inputs = tokenizer(
    ["[INST]translate this from hindi to english: अगर तुम सूरज की तरह चमकना चाहते हो, तो सूरज की तरह जलना सीखो।[/INST]<s> english output:"],
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_new_tokens=18, use_cache=True)
print(tokenizer.batch_decode(outputs))
```
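Both examples share the same prompt template, so a small helper can wrap either direction. Here is a sketch (the `translate` function and its defaults are illustrative, not part of the original card) that also strips the prompt from the decoded output:
```python
import torch

def translate(text, src="english", tgt="hindi", max_new_tokens=64):
    # Same prompt template as the examples above
    prompt = f"[INST]translate this from {src} to {tgt}: {text} [/INST]<s> {tgt} output:"
    inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, use_cache=True)
    # Keep only the newly generated tokens, dropping the prompt
    generated = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True)

print(translate("Be a free thinker.", src="english", tgt="hindi"))
```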
# Dataset
* The training data combines the two datasets listed above, for a total of 1,786,788 translation pairs
* Each pair was then preprocessed into a prompt that looks like this:
```python
[INST]translate this from english to hindi: When it is said to him: 'Fear Allah' egotism takes him in his sin. Gehenna (Hell) shall be enough for him. How evil a cradling! [/INST] hindi output: और जब उससे कहा जाता है, "अल्लाह से डर", तो अहंकार उसे और गुनाह पर जमा देता है। अतः उसके लिए तो जहन्नम ही काफ़ी है, और वह बहुत-ही बुरी शय्या है!
```
* This was done in both directions, English to Hindi and Hindi to English, hence the names e2h and h2e
* Duplicating each pair in both directions yields a little over 3.5 million rows in total; a sketch of this preprocessing is shown below
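A minimal sketch of that preprocessing, assuming both datasets share the same schema and expose `english_sentence`/`hindi_sentence` columns (the column names are an assumption, not verified against the datasets):
```python
from datasets import load_dataset, concatenate_datasets

def to_prompt(en, hi, direction):
    # Prompt layout matching the example row above
    if direction == "e2h":
        return f"[INST]translate this from english to hindi: {en} [/INST] hindi output: {hi}"
    return f"[INST]translate this from hindi to english: {hi} [/INST] english output: {en}"

def make_rows(batch):
    # Column names here are assumptions
    texts = []
    for en, hi in zip(batch["english_sentence"], batch["hindi_sentence"]):
        texts.append(to_prompt(en, hi, "e2h"))  # English -> Hindi row
        texts.append(to_prompt(en, hi, "h2e"))  # Hindi -> English row
    return {"text": texts}

combined = concatenate_datasets([
    load_dataset("Aarif1430/english-to-hindi", split="train"),
    load_dataset("Sampuran01/english-hindi-translation", split="train"),
])
train = combined.map(make_rows, batched=True, remove_columns=combined.column_names)
```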
# Training details
* The model was loaded in 4-bit
* The LoRA target modules were `q_proj`, `k_proj`, `v_proj`, and `o_proj`
* Fine-tuning ran on a free Google Colab instance with a single T4 GPU (huge thanks to Unsloth for making this possible)
* Although the full dataset is roughly 3.5 million rows, the LoRA adapter was fine-tuned on only 1 million rows per direction; a sketch of the setup follows
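A hedged sketch of this setup with Unsloth's API; only the 4-bit load and the target modules are stated above, and every other hyperparameter below is an illustrative assumption:
```python
from unsloth import FastLanguageModel

# Load the base model in 4-bit, as stated above
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="sarvamai/OpenHathi-7B-Hi-v0.1-Base",
    max_seq_length=2048,  # assumption
    load_in_4bit=True,
)

# Attach LoRA adapters to the stated target modules
model = FastLanguageModel.get_peft_model(
    model,
    r=16,            # assumption
    lora_alpha=16,   # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```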
# Limitations
The model was not trained on the full dataset and has had little evaluation, so any contributions are welcome.
This is currently a small-scale model; a better model trained on a better dataset will be released.