	---
	pipeline_tag: sentence-similarity
	language: fr
	license: apache-2.0
	datasets:
	- unicamp-dl/mmarco
	metrics:
	- recall
	tags:
	- sentence-similarity
	library_name: sentence-transformers
	---
	# crossencoder-distilcamembert-base-mmarcoFR

	This is a [sentence-transformers](https://www.SBERT.net) model trained on the French portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.

	It performs cross-attention between a question-passage pair and outputs a relevance score between 0 and 1. The model can be used for tasks like clustering or [semantic search](https://www.sbert.net/examples/applications/retrieve_rerank/README.html): given a query, score it against a set of candidate passages (e.g., retrieved with BM25 or a bi-encoder), then sort the passages in decreasing order of relevance according to the model's predictions.

	## Usage
	***

	#### Sentence-Transformers

	Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

	```bash
	pip install -U sentence-transformers
	```

	Then you can use the model like this:

	```python
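	from sentence_transformers import CrossEncoder

	# Each pair is (query, candidate passage).
	pairs = [('Query', 'Paragraph1'), ('Query', 'Paragraph2'), ('Query', 'Paragraph3')]

	# Full repository ID on the Hugging Face Hub.
	model = CrossEncoder('antoinelouis/crossencoder-distilcamembert-base-mmarcoFR')
	scores = model.predict(pairs)  # one relevance score per pair
	print(scores)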
	```
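
To turn these scores into the ranking described in the introduction, the passages only need to be sorted by decreasing score. The sketch below is illustrative: `rerank` and `toy_score_pairs` are hypothetical names, and the word-overlap scorer is merely a stand-in so the example runs without downloading the model.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    passages: Sequence[str],
    score_pairs: Callable[[List[Tuple[str, str]]], Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and sort passages by decreasing score."""
    pairs = [(query, passage) for passage in passages]
    scores = score_pairs(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked]


# Hypothetical stand-in scorer (word overlap), used only to keep the sketch self-contained.
# In practice, pass `model.predict` from the CrossEncoder example above.
def toy_score_pairs(pairs):
    return [len(set(q.lower().split()) & set(p.lower().split())) for q, p in pairs]


candidates = ["Paris est la capitale de la France.", "Le chat dort.", "La France est en Europe."]
for passage, score in rerank("capitale de la France", candidates, toy_score_pairs):
    print(f"{score:.2f}\t{passage}")
```

Passing the scorer as an argument keeps the sorting logic independent of how the candidates were retrieved and of which model produces the scores.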

	#### 🤗 Transformers

	Without [sentence-transformers](https://www.SBERT.net), you can use the model as follows:

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	# Full repository ID on the Hugging Face Hub.
	model_name = 'antoinelouis/crossencoder-distilcamembert-base-mmarcoFR'
	model = AutoModelForSequenceClassification.from_pretrained(model_name)
	tokenizer = AutoTokenizer.from_pretrained(model_name)

	# Each pair is (query, candidate passage).
	pairs = [('Query', 'Paragraph1'), ('Query', 'Paragraph2'), ('Query', 'Paragraph3')]
	features = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt')

	model.eval()
	with torch.no_grad():
	    # Raw logits, one row per pair (no activation applied).
	    scores = model(**features).logits
	print(scores)
	```
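
The logits printed above are unnormalized. Assuming the classification head emits a single logit per pair (consistent with the binary cross-entropy training described below, but not verified here), a sigmoid maps them to the 0-1 relevance range. This is a minimal pure-Python sketch with made-up logit values; `logits_to_scores` is a hypothetical helper, and on tensors `torch.sigmoid` performs the same mapping.

```python
import math
from typing import List


def logits_to_scores(logits: List[List[float]]) -> List[float]:
    """Map one raw logit per pair (shape [batch, 1]) to a relevance score in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-row[0])) for row in logits]


# Made-up logits for three pairs, e.g. obtained via `model(**features).logits.tolist()`.
example_logits = [[2.0], [0.0], [-3.0]]
print(logits_to_scores(example_logits))
```

Because the sigmoid is monotonic, sorting by raw logits or by these scores yields the same ranking.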

	## Evaluation
	***

	We evaluated our model on 500 random queries from the mMARCO-fr train set (which were excluded from training). Each of these queries has at least one relevant and up to 200 irrelevant passages.

	| r-precision | mrr@10 | recall@10 | recall@20 | recall@50 | recall@100 |
	|------------:|-------:|----------:|----------:|----------:|-----------:|
	| 27.28 | 43.71 | 80.3 | 89.1 | 95.55 | 98.6 |
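
For reference, the per-query metrics reported here can be sketched as follows. This is not the authors' evaluation code, which is not shown; the function names and passage IDs are hypothetical, and the table values are presumably averages over the 500 queries expressed as percentages.

```python
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top k results."""
    hits = sum(1 for pid in ranked[:k] if pid in relevant)
    return hits / len(relevant)


def mrr_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant passage in the top k, else 0."""
    for rank, pid in enumerate(ranked[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


def r_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Precision over the top R results, where R is the number of relevant passages."""
    r = len(relevant)
    return sum(1 for pid in ranked[:r] if pid in relevant) / r


# Made-up ranking for one query: passage IDs ordered by decreasing model score.
ranked_ids = ["p3", "p1", "p7", "p2"]
relevant_ids = {"p1", "p2"}
print(recall_at_k(ranked_ids, relevant_ids, k=2))
print(mrr_at_k(ranked_ids, relevant_ids))
print(r_precision(ranked_ids, relevant_ids))
```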

	Below, we compare its results with those of other cross-encoder models fine-tuned on the same dataset:

	|    | model | r-precision | mrr@10 | recall@10 (↑) | recall@20 | recall@50 | recall@100 |
	|---:|:------|------------:|-------:|--------------:|----------:|----------:|-----------:|
	| 1 | [crossencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR) | 35.65 | 50.44 | 82.95 | 91.5 | 96.8 | 98.8 |
	| 2 | [crossencoder-mMiniLMv2-L12-H384-distilled-from-XLMR-Large-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-H384-distilled-from-XLMR-Large-mmarcoFR) | 34.37 | 51.01 | 82.23 | 90.6 | 96.45 | 98.4 |
	| 3 | [crossencoder-mmarcoFR-mMiniLMv2-L12-H384-v1-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mmarcoFR-mMiniLMv2-L12-H384-v1-mmarcoFR) | 34.22 | 49.2 | 81.7 | 90.9 | 97.1 | 98.9 |
	| 4 | [crossencoder-mpnet-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mpnet-base-mmarcoFR) | 29.68 | 46.13 | 80.45 | 87.9 | 93.15 | 96.6 |
	| 5 | crossencoder-distilcamembert-base-mmarcoFR | 27.28 | 43.71 | 80.3 | 89.1 | 95.55 | 98.6 |
	| 6 | [crossencoder-roberta-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-roberta-base-mmarcoFR) | 33.33 | 48.87 | 79.33 | 86.75 | 94.15 | 97.6 |
	| 7 | [crossencoder-electra-base-french-europeana-cased-discriminator-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-electra-base-french-europeana-cased-discriminator-mmarcoFR) | 28.32 | 45.28 | 79.22 | 87.15 | 93.15 | 95.75 |
	| 8 | [crossencoder-mMiniLMv2-L6-H384-distilled-from-XLMR-Large-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L6-H384-distilled-from-XLMR-Large-mmarcoFR) | 33.92 | 49.33 | 79 | 88.35 | 94.8 | 98.2 |
	| 9 | [crossencoder-msmarco-electra-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-electra-base-mmarcoFR) | 25.52 | 42.46 | 78.73 | 88.85 | 96.55 | 98.85 |
	| 10 | [crossencoder-bert-base-uncased-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-bert-base-uncased-mmarcoFR) | 30.48 | 45.79 | 78.35 | 89.45 | 94.15 | 97.45 |
	| 11 | [crossencoder-msmarco-MiniLM-L-12-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-MiniLM-L-12-v2-mmarcoFR) | 29.07 | 44.41 | 77.83 | 88.1 | 95.55 | 99 |
	| 12 | [crossencoder-msmarco-MiniLM-L-6-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-MiniLM-L-6-v2-mmarcoFR) | 32.92 | 47.56 | 77.27 | 88.15 | 94.85 | 98.15 |
	| 13 | [crossencoder-msmarco-MiniLM-L-4-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-MiniLM-L-4-v2-mmarcoFR) | 30.98 | 46.22 | 76.35 | 85.8 | 94.35 | 97.55 |
	| 14 | [crossencoder-MiniLM-L6-H384-uncased-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L6-H384-uncased-mmarcoFR) | 29.23 | 45.12 | 76.08 | 83.7 | 92.65 | 97.45 |
	| 15 | [crossencoder-electra-base-discriminator-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-electra-base-discriminator-mmarcoFR) | 28.48 | 43.58 | 75.63 | 86.15 | 93.25 | 96.6 |
	| 16 | [crossencoder-electra-small-discriminator-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-electra-small-discriminator-mmarcoFR) | 31.83 | 45.97 | 75.13 | 84.95 | 94.55 | 98.15 |
	| 17 | [crossencoder-distilroberta-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-distilroberta-base-mmarcoFR) | 28.22 | 42.85 | 74.13 | 84.08 | 94.2 | 98.5 |
	| 18 | [crossencoder-msmarco-TinyBERT-L-6-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-TinyBERT-L-6-mmarcoFR) | 28.23 | 42.7 | 73.63 | 85.65 | 92.65 | 98.35 |
	| 19 | [crossencoder-msmarco-TinyBERT-L-4-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-TinyBERT-L-4-mmarcoFR) | 28.6 | 43.19 | 72.17 | 81.95 | 92.8 | 97.4 |
	| 20 | [crossencoder-msmarco-MiniLM-L-2-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-MiniLM-L-2-v2-mmarcoFR) | 30.82 | 44.3 | 72.03 | 82.65 | 93.35 | 98.1 |
	| 21 | [crossencoder-distilbert-base-uncased-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-distilbert-base-uncased-mmarcoFR) | 25.47 | 40.11 | 71.37 | 85.6 | 93.85 | 97.95 |
	| 22 | [crossencoder-msmarco-TinyBERT-L-2-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-TinyBERT-L-2-v2-mmarcoFR) | 31.08 | 43.88 | 71.3 | 81.43 | 92.6 | 98.1 |

	## Training
	***

	#### Background

	We used the [cmarkea/distilcamembert-base](https://huggingface.co/cmarkea/distilcamembert-base) model and fine-tuned it with a binary cross-entropy loss function on 1M question-passage pairs in French, of which one pair in four is relevant (i.e., 25% of the pairs are relevant and 75% are irrelevant).
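
As an illustration of the objective, binary cross-entropy on a single logit per pair can be written as below. This is a sketch of the standard formula, not the training code; the function name, logits, and labels are made up, and the batch simply mirrors the one-relevant-in-four proportion.

```python
import math
from typing import Sequence


def bce_with_logits(logits: Sequence[float], labels: Sequence[int]) -> float:
    """Mean binary cross-entropy between sigmoid(logit) and 0/1 relevance labels."""
    total = 0.0
    for z, y in zip(logits, labels):
        # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Made-up batch of four pairs: one relevant (label 1) and three irrelevant (label 0).
print(bce_with_logits([2.0, -1.0, -2.0, 0.5], [1, 0, 0, 0]))
```

The loss shrinks as relevant pairs receive high logits and irrelevant pairs receive low ones, which is what pushes the sigmoid output toward 1 or 0.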

	#### Hyperparameters

	We trained the model on a single Tesla V100 GPU with 32 GB of memory for 10 epochs (i.e., 312.4k steps) using a batch size of 32. We used the AdamW optimizer with an initial learning rate of 2e-05, a weight decay of 0.01, learning rate warmup over the first 500 steps, and linear decay of the learning rate. The sequence length was limited to 512 tokens.
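
The learning rate schedule described above can be sketched as a plain function of the step index. Two assumptions are made here: the rounded figure of 312.4k is taken as exactly 312,400 steps, and the linear decay is assumed to end at zero on the final step.

```python
def learning_rate(step: int, base_lr: float = 2e-5, warmup_steps: int = 500,
                  total_steps: int = 312_400) -> float:
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Assumption: decay reaches exactly zero at the last training step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 156_450, 312_400):
    print(step, learning_rate(step))
```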

	#### Data

	We used the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset to fine-tune our model. mMARCO is a multi-lingual machine-translated version of the MS MARCO dataset, a popular large-scale IR dataset.

	## Citation
	***

	```bibtex
	@online{louis2023,
	    author    = {Antoine Louis},
	    title     = {crossencoder-distilcamembert-base-mmarcoFR: A Cross-Encoder Model Trained on 1M sentence pairs in French},
	    publisher = {Hugging Face},
	    month     = {september},
	    year      = {2023},
	    url       = {https://huggingface.co/antoinelouis/crossencoder-distilcamembert-base-mmarcoFR},
	}
	```