Language list

#2
by lbourdois - opened

Hi

On the website https://www.sbert.net/docs/pretrained_models.html there is information about the languages covered by the models https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1 and https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2, but not for https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased.
The language tag on this model card indicates that the model is "multilingual". Would it be possible to know which languages are covered, please?

Also wondering what the supported language list is :)

Cheers, everyone
The language list is available at https://sbert.net/docs/sentence_transformer/pretrained_models.html. Each model card there specifies its training data, so when the languages are not listed explicitly you can check the training data instead.
For this model, the supported languages are: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, and Turkish.

Hello!

Thanks for your response! I have absolutely no idea how I missed that :)

But you know what is interesting? I had already used it in my experiment (similarity search on a Latvian and English dataset) and got a pretty solid recall result. What is intriguing is that the results are much better, almost twice as good, on long texts (multiple paragraphs) than on very short queries (1 or 2 words).
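
For context, the experiment was roughly along these lines; this is just a minimal sketch with made-up example texts and the standard sentence-transformers API, not the actual dataset or setup:

```python
# Minimal sketch of a cross-lingual similarity search with this model.
# The texts below are made-up examples, not the real dataset.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/distiluse-base-multilingual-cased")

corpus = [
    "The cat sits on the windowsill and watches the birds.",
    "Kaķis sēž uz palodzes un vēro putnus.",  # Latvian, same meaning as the line above
    "Quarterly revenue grew faster than analysts expected.",
]
query = "kaķis"  # a very short (single-word) query

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every corpus entry.
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
for text, score in zip(corpus, scores):
    print(f"{score.item():.3f}  {text}")
```

With single-word queries like this the scores were much less separated than with full paragraphs.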

Waaaaait, I just double-checked and I think you are incorrect. I had already seen the link you mentioned some time ago, and it describes a different model, distiluse-base-multilingual-cased-v1, while we are discussing distiluse-base-multilingual-cased (without the v1 at the end).

Sentence Transformers org

I believe Nils only had one dataset collection for multilingual models at the time, so I suspect that both models were trained on the same dataset. That said, the original DistilBERT may have been pretrained on more languages, and it is common for multilingual models to perform well on a language even if the model only saw that language during pretraining and was fine-tuned on others.

What is intriguing is that the results are much better, almost twice as good, on long texts (multiple paragraphs) than on very short queries (1 or 2 words).

Perhaps multiple paragraphs are more similar to the data that was used during training than 1 or 2 words; I can imagine that being the case.
