clarine committed
Commit
44a46d4
1 Parent(s): d4a1a0e

Add metrics on Polish datasets

Files changed (1)
  1. README.md +30 -3
README.md CHANGED
@@ -38,9 +38,10 @@ Besides the aforementioned languages, basic support can be expected for addition
 
  ## Scores
 
- | Metric              | Value |
- |:--------------------|------:|
- | Relevance (NDCG@10) | 0.480 |
+ | Metric                      | Value |
+ |:----------------------------|------:|
+ | English Relevance (NDCG@10) | 0.474 |
+ | Polish Relevance (NDCG@10)  | 0.380 |
 
  Note that the relevance score is computed as an average over 14 retrieval datasets (see
  [details below](#evaluation-metrics)).
@@ -93,6 +94,8 @@ can be around 0.5 to 1 GiB depending on the used GPU.
 
  ### Evaluation Metrics
 
+ #### English
+
  To determine the relevance score, we averaged the results that we obtained when evaluating on the datasets of the
  [BEIR benchmark](https://github.com/beir-cellar/beir). Note that all these datasets are in English.
 
@@ -115,6 +118,30 @@ To determine the relevance score, we averaged the results that we obtained when
  | TREC-COVID        | 0.651 |
  | Webis-Touche-2020 | 0.312 |
 
+ #### Polish
+
+ This model has Polish capabilities, which are evaluated on a subset of
+ the [PIRB benchmark](https://github.com/sdadas/pirb), with BM25 as the first-stage retrieval.
+
+
+ | Dataset       | NDCG@10 |
+ |:--------------|--------:|
+ | Average       |   0.380 |
+ |               |         |
+ | arguana-pl    |   0.285 |
+ | dbpedia-pl    |   0.283 |
+ | fiqa-pl       |   0.223 |
+ | hotpotqa-pl   |   0.603 |
+ | msmarco-pl    |   0.259 |
+ | nfcorpus-pl   |   0.293 |
+ | nq-pl         |   0.355 |
+ | quora-pl      |   0.613 |
+ | scidocs-pl    |   0.128 |
+ | scifact-pl    |   0.581 |
+ | trec-covid-pl |   0.560 |
+
+ #### Other languages
+
  We evaluated the model on the datasets of the [MIRACL benchmark](https://github.com/project-miracl/miracl) to test its multilingual capabilities. Note that not all training languages are part of the benchmark, so we only report the metrics for the existing languages.
 
  | Language | NDCG@10 |
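For reference, the "Average" row of the Polish table is the unweighted mean of the eleven per-dataset NDCG@10 values (the README states the English figure of 0.474 is likewise an average over the 14 BEIR datasets). A minimal Python sketch of that aggregation, with the scores copied from the diff and rounding to three decimals assumed:

```python
# Reproduce the "Average" row of the Polish table above: the unweighted mean
# of the per-dataset NDCG@10 scores (rounding to three decimals is assumed).
ndcg_at_10 = {
    "arguana-pl": 0.285,
    "dbpedia-pl": 0.283,
    "fiqa-pl": 0.223,
    "hotpotqa-pl": 0.603,
    "msmarco-pl": 0.259,
    "nfcorpus-pl": 0.293,
    "nq-pl": 0.355,
    "quora-pl": 0.613,
    "scidocs-pl": 0.128,
    "scifact-pl": 0.581,
    "trec-covid-pl": 0.560,
}

average = sum(ndcg_at_10.values()) / len(ndcg_at_10)
print(f"Polish relevance (NDCG@10): {average:.3f}")  # prints 0.380
```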
 
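The Polish section above mentions BM25 as the first-stage retrieval, i.e. a retrieve-then-rerank setup: BM25 selects candidate passages per query and the model only rescores those candidates before NDCG@10 is computed. The sketch below illustrates that pattern with the `rank_bm25` package; the `rerank` function is a hypothetical placeholder rather than this model's actual API, and the real protocol is implemented in the PIRB harness linked above.

```python
# Sketch of the retrieve-then-rerank pattern implied by "BM25 as the
# first-stage retrieval": BM25 picks candidates per query, the model then
# rescores only those candidates. rerank() is a hypothetical placeholder,
# not this model's API; the actual protocol lives in the PIRB harness
# (https://github.com/sdadas/pirb). Requires the external rank_bm25 package.
from rank_bm25 import BM25Okapi


def tokenize(text: str) -> list[str]:
    """Very naive whitespace tokenizer used for both stages of this sketch."""
    return text.lower().replace(".", "").split()


corpus = [
    "Warszawa to stolica Polski.",
    "Kraków leży nad Wisłą.",
    "Bałtyk to morze na północ od Polski.",
]
bm25 = BM25Okapi([tokenize(doc) for doc in corpus])


def rerank(query: str, passages: list[str]) -> list[float]:
    """Hypothetical stand-in scorer: token overlap between query and passage."""
    q_tokens = set(tokenize(query))
    return [float(len(q_tokens & set(tokenize(p)))) for p in passages]


query = "stolica Polski"

# First stage: BM25 scores the whole corpus; keep the top-k candidates.
bm25_scores = bm25.get_scores(tokenize(query))
top_k = sorted(range(len(corpus)), key=lambda i: bm25_scores[i], reverse=True)[:2]
candidates = [corpus[i] for i in top_k]

# Second stage: rescore only the candidates; NDCG@10 is then computed from
# this reranked list against the dataset's relevance judgments (qrels).
reranked = sorted(zip(candidates, rerank(query, candidates)),
                  key=lambda pair: pair[1], reverse=True)
print([passage for passage, _ in reranked])
```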