How did you create (title, paragraph) pairs from C4?

#45
by itayair - opened

In addition, did you filter the C4 data?

Owner

To create contrastive pairs, please refer to the discussion at https://huggingface.co/intfloat/multilingual-e5-large/discussions/37#664b1fe87a1ed3e001471b2f

And yes, we filter the mC4 data using the consistency-based filtering approach described in the paper "Text Embeddings by Weakly-Supervised Contrastive Pre-training".
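For readers unfamiliar with consistency-based filtering, here is a minimal sketch of the idea as the paper describes it: a model trained on the noisy pairs scores each pair against a pool of random passages, and the pair is kept only if its own passage ranks in the top k. This is not the authors' code. The function names (`cosine`, `keep_pair`), the toy 2-d vectors, and the tiny pool are all assumptions for illustration; the paper uses a much larger pool of random passages.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def keep_pair(query_vec, passage_vec, pool_vecs, k=2):
    """Return True if the paired passage ranks in the top-k for the query.

    The passage is ranked against the random pool; k=2 is an assumed
    default for this sketch.
    """
    pair_score = cosine(query_vec, passage_vec)
    # Number of pool passages that score strictly higher than the paired one.
    better = sum(1 for p in pool_vecs if cosine(query_vec, p) > pair_score)
    # Rank is better + 1, so "rank <= k" is the same as "better < k".
    return better < k

# Toy embeddings (hypothetical values standing in for model outputs).
title_vec = [1.0, 0.0]
good_paragraph = [0.9, 0.1]    # close to the title
noisy_paragraph = [0.0, 1.0]   # unrelated to the title
random_pool = [[0.5, 0.5], [0.8, 0.2], [0.7, 0.1], [-1.0, 0.0]]

consistent = keep_pair(title_vec, good_paragraph, random_pool)
inconsistent = keep_pair(title_vec, noisy_paragraph, random_pool)
```

With these toy vectors the first pair is kept and the second is dropped, because three pool passages score higher than the unrelated paragraph.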

One more thing: what do you take as the page_content? (The web pages might be much longer than 512 tokens.)

Owner

In that case, texts will be truncated to fit the model's maximum supported length.
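As a rough illustration of what truncation means here, the sketch below cuts a token sequence down to a fixed limit. It is only an assumption-laden toy: `toy_tokenize` is a hypothetical whitespace splitter standing in for the real subword tokenizer, and `truncate_tokens` ignores the special tokens a real tokenizer would also reserve room for. With Hugging Face tokenizers the same effect is usually obtained by passing `truncation=True` and `max_length=512` when calling the tokenizer.

```python
def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: splits on whitespace."""
    return text.split()

def truncate_tokens(tokens, max_length=512):
    """Keep only the first max_length tokens; the rest are discarded."""
    return tokens[:max_length]

# A long "web page" of 1000 whitespace-separated words.
page_content = " ".join(f"word{i}" for i in range(1000))

tokens = toy_tokenize(page_content)
kept = truncate_tokens(tokens)

# Everything after the limit never reaches the model.
dropped = len(tokens) - len(kept)
```

The practical consequence is that only the beginning of a long page contributes to its embedding.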
