Is a new version supporting a larger maximum sentence limit (256 or 512) possible in the future?

#10
by lamhieu - opened

Honestly, this is a very good multilingual model for embeddings for relevance search tasks, but the problem is that the input limit is too short. I wonder if you guys plan to make another version with a larger limit like 256 or 512? I think it will be great.

Sentence Transformers org

Hello!

It's certainly possible to train models with higher limits, but the real bottleneck is that there isn't much good training data with texts longer than e.g. 128 tokens. It also gets harder to annotate/label those longer texts well. As a result, we could easily train a model with a larger limit, but it simply wouldn't work well on longer texts. That's sadly not really an improvement over the current situation.

  • Tom Aarsen

Thank you, I understand. That matches my experience with current embedding models: the ones with higher limits (even paid ones) don't reach the quality of this model.

Hello, I wonder if I could use the pretrained model and set the max_seq_length to 512 to train on my dataset?

Sentence Transformers org

Hello, I wonder if I could use the pretrained model and set the max_seq_length to 512 to train on my dataset?

Yes, you can totally do that. The only reason we use the lower maximum sequence length here is that our training data consisted of short texts. If your training dataset has longer texts, then you're free to increase the max_seq_length accordingly.
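
For example, here's a minimal sketch of raising the limit and fine-tuning. The training pairs are placeholders; it uses the classic fit API with MultipleNegativesRankingLoss, which assumes (anchor, positive) pairs:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
print(model.max_seq_length)  # 128 by default
# Raise the token limit; the underlying XLM-R backbone has position embeddings for up to 512 tokens.
model.max_seq_length = 512

# Hypothetical long-text (anchor, positive) training pairs
train_examples = [
    InputExample(texts=["a long anchor document ...", "a long matching document ..."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)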

  • Tom Aarsen

Hey Tom, real beginner here just started learning and evaluating models for my work contexts. I have some doubts regarding the limit of inputs, for example:

sentences = [
    "Lorem ipsum criatus feralis...",
    "Donec et dolor tincidunt, dictum ...",
    '{"json here": "content", "more content": "hello"}',  # JSON passed as a plain string
]

Does this input limit apply to the number of texts in the sentences list, or to each text individually?

Newbie question; I appreciate any insight in advance.

Sentence Transformers org

The limit is on the number of tokens per text. In short, a token is a word part that a natural language processing model reasons with. This is because models can't reason with e.g. 100 million words across all languages, but they can reason with e.g. 250 thousand word parts that can be combined to form all of those words.
For this model, sentence-transformers/paraphrase-multilingual-mpnet-base-v2, the maximum length is 128 tokens, which roughly translates to about 80 words. Anything beyond that is truncated.
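
You can check this yourself; here's a small sketch (the texts are placeholders). The limit applies to each text independently, and the number of texts in the list doesn't matter:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

sentences = [
    "A short text.",
    "A much longer text that might exceed the limit...",
]

# The 128-token limit applies per text, not to the list as a whole.
for text in sentences:
    n_tokens = len(model.tokenizer(text)["input_ids"])
    print(n_tokens)  # texts longer than 128 tokens are truncated at encode time

embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 768): one embedding per text, regardless of list size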

I hope that helps!

  • Tom Aarsen
