Commit dc685c3 by cointegrated: "Update README.md"
1 parent: 0dd911d
Files changed (1): README.md (+27 -3)
README.md CHANGED
@@ -11,8 +11,32 @@ license: mit
 widget:
 - text: "Миниатюрная модель для [MASK] разных задач."
 ---
-This is a very small distilled version of the [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) model for Russian and English.
-
-This model is useful if you want to fine-tune it for a relatively simple Russian task (e.g. NER or sentiment classification), and you care more about speed and size than about accuracy. It is approximately x10 smaller and faster than [DeepPavlov/rubert-base-cased-sentence](https://huggingface.co/DeepPavlov/rubert-base-cased-sentence).
-
-It was trained on the [Yandex Translate corpus](https://translate.yandex.ru/corpus) using MLM loss (partially distilled from [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased)) and translation ranking loss (partially distilled from [LaBSE](https://huggingface.co/sentence-transformers/LaBSE)).
+This is a very small distilled version of the [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) model for Russian and English (45 MB, 12M parameters).
+
+This model is useful if you want to fine-tune it for a relatively simple Russian task (e.g. NER or sentiment classification), and you care more about speed and size than about accuracy. It is approximately x10 smaller and faster than a base-sized BERT. Its `[CLS]` embeddings can be used as a sentence representation.
+
+It was trained on the [Yandex Translate corpus](https://translate.yandex.ru/corpus), [OPUS-100](https://huggingface.co/datasets/opus100) and [Tatoeba](https://huggingface.co/datasets/tatoeba), using MLM loss (distilled from [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased)), translation ranking loss, and `[CLS]` embeddings distilled from [LaBSE](https://huggingface.co/sentence-transformers/LaBSE), [rubert-base-cased-sentence](https://huggingface.co/DeepPavlov/rubert-base-cased-sentence), Laser and USE.
+
+There is a more detailed [description in Russian](https://habr.com/ru/post/562064/).
+
+Sentence embeddings can be produced as follows:
+
+```python
+# pip install transformers sentencepiece
+import torch
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny")
+model = AutoModel.from_pretrained("cointegrated/rubert-tiny")
+# model.cuda()  # uncomment it if you have a GPU
+
+def embed_bert_cls(text, model, tokenizer):
+    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
+    with torch.no_grad():
+        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
+    embeddings = model_output.last_hidden_state[:, 0, :]
+    embeddings = torch.nn.functional.normalize(embeddings)
+    return embeddings[0].cpu().numpy()
+
+print(embed_bert_cls('привет мир', model, tokenizer).shape)
+# (312,)
+```
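
A side note on the snippet added above: `torch.nn.functional.normalize` rescales each `[CLS]` vector to unit length, so dot products between the resulting embeddings are cosine similarities. A minimal sketch of that step with dummy tensors (no model download needed; 312 is this model's hidden size, and the shapes are illustrative assumptions):

```python
import torch

# Dummy stand-in for model_output.last_hidden_state:
# (batch_size=2, seq_len=5, hidden_size=312)
hidden = torch.randn(2, 5, 312)

# Take the [CLS] token (position 0) and L2-normalize each vector,
# mirroring the last two steps of embed_bert_cls
cls = hidden[:, 0, :]
emb = torch.nn.functional.normalize(cls)

print(emb.shape)        # torch.Size([2, 312])
print(emb.norm(dim=1))  # each row's norm is ~1.0
```

After normalization, `emb @ emb.T` gives the cosine similarity between the two embeddings directly.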