	---
	license: cc0-1.0
	language:
	- is
	tags:
	- MaCoCu
	---

	# Model description

	XLMR-MaCoCu-is is a large pre-trained language model trained on Icelandic texts. It was created by continuing training from the [XLM-RoBERTa-large](https://huggingface.co/xlm-roberta-large) model. It was developed as part of the [MaCoCu](https://macocu.eu/) project and only uses data that was crawled during the project. The main developer is [Rik van Noord](https://www.rikvannoord.nl/) from the University of Groningen.

	XLMR-MaCoCu-is was trained on 4.4 GB of Icelandic text, which equals 688M tokens. It was trained for 75,000 steps with a batch size of 1,024. It uses the same vocabulary as the original XLM-R-large model.
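To put these figures in perspective, the short calculation below derives two quantities from them: the number of training sequences processed and the average amount of raw text per token. It is only a back-of-the-envelope sketch. It assumes 1 GB = 10**9 bytes, and the variable names are illustrative rather than taken from the training code.

```python
# Figures quoted in the model description.
corpus_bytes = 4.4e9      # 4.4 GB, assuming 1 GB = 10**9 bytes
corpus_tokens = 688e6     # 688M tokens
train_steps = 75_000
batch_size = 1_024

# Sequences processed during continued pre-training (steps x batch size).
sequences_seen = train_steps * batch_size

# Average amount of raw text per token.
bytes_per_token = corpus_bytes / corpus_tokens

print(f"{sequences_seen:,} sequences seen")
print(f"{bytes_per_token:.1f} bytes per token on average")
```

Under these assumptions the model processed 76,800,000 sequences, and a token corresponds to roughly 6.4 bytes of text on average.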

	The training and fine-tuning procedures are described in detail in our [GitHub repo](https://github.com/macocu/LanguageModels).

	# How to use

	```python
	from transformers import AutoTokenizer, AutoModel, TFAutoModel

	tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaCoCu-is")

	# Load the model with one of the two backends; keep only the line you need.
	model = AutoModel.from_pretrained("RVN/XLMR-MaCoCu-is")  # PyTorch
	# model = TFAutoModel.from_pretrained("RVN/XLMR-MaCoCu-is")  # TensorFlow
	```
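As an illustration of what can be done with the loaded encoder, the sketch below turns Icelandic sentences into fixed-size vectors by mean-pooling the last hidden layer. Mean pooling is our choice for this example, not something prescribed by the model, and `embed_sentences` is a hypothetical helper name. The sketch assumes the PyTorch backend is installed.

```python
def embed_sentences(sentences, model_name="RVN/XLMR-MaCoCu-is"):
    """Return one mean-pooled embedding per sentence (PyTorch backend)."""
    # Imported lazily so that defining the function does not load the heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # Pad to the longest sentence in the batch and return PyTorch tensors.
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_size)

    # Average over real tokens only; padding positions have attention_mask == 0.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

Calling `embed_sentences(["Ég heiti Anna."])` downloads the model on first use and returns a tensor with one row per input sentence. For downstream tasks such as tagging or NER, fine-tuning as described in the GitHub repo is the intended route; pooled embeddings are only a quick way to inspect the encoder.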

	# Data

	For training, we used all Icelandic data that was present in the monolingual Icelandic [MaCoCu](https://macocu.eu/) corpus. After de-duplicating the data, we were left with a total of 4.4 GB of text, which equals 688M tokens.

	# Benchmark performance

	We tested the performance of XLMR-MaCoCu-is on benchmarks of XPOS, UPOS, NER and COPA. For UPOS and XPOS, we used the data from the [Universal Dependencies](https://universaldependencies.org/) project. For NER, we used the MIM-GOLD-NER data set. For COPA, we automatically translated the English data set with Google Translate; for details, please see our [GitHub repo](https://github.com/RikVN/COPA). We compare performance to the strong multilingual models XLM-R-base and XLM-R-large, as well as the monolingual [IceBERT](https://huggingface.co/vesteinn/IceBERT) model. For details on the XPOS/UPOS/NER fine-tuning procedure, check out our [GitHub repo](https://github.com/macocu/LanguageModels).

	Scores are averages of three runs, except for COPA, for which we use 10 runs. We use the same hyperparameter settings for all models.

	| Model | UPOS Dev | UPOS Test | XPOS Dev | XPOS Test | NER Dev | NER Test | COPA Test |
	|--------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
	| XLM-R-base | 96.8 | 96.5 | 94.6 | 94.3 | 85.3 | 89.7 | 55.2 |
	| XLM-R-large | 97.0 | 96.7 | 94.9 | 94.7 | 88.5 | 91.7 | 54.3 |
	| IceBERT | 96.4 | 96.0 | 94.0 | 93.7 | 83.8 | 89.7 | 54.6 |
	| XLMR-MaCoCu-is | 97.3 | 97.0 | 95.4 | 95.1 | 90.8 | 93.2 | 59.6 |
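To make the comparison easier to read, the snippet below computes, for each task, how many points XLMR-MaCoCu-is gains on the test set over the best-scoring baseline. The scores are copied from the table; the helper name `gain_over_best_baseline` is only illustrative.

```python
# Test-set scores copied from the table above.
test_scores = {
    "XLM-R-base":     {"UPOS": 96.5, "XPOS": 94.3, "NER": 89.7, "COPA": 55.2},
    "XLM-R-large":    {"UPOS": 96.7, "XPOS": 94.7, "NER": 91.7, "COPA": 54.3},
    "IceBERT":        {"UPOS": 96.0, "XPOS": 93.7, "NER": 89.7, "COPA": 54.6},
    "XLMR-MaCoCu-is": {"UPOS": 97.0, "XPOS": 95.1, "NER": 93.2, "COPA": 59.6},
}


def gain_over_best_baseline(scores, model="XLMR-MaCoCu-is"):
    """For each task, return (best baseline name, gain in points) for `model`."""
    gains = {}
    for task, score in scores[model].items():
        baselines = {name: s[task] for name, s in scores.items() if name != model}
        best = max(baselines, key=baselines.get)
        gains[task] = (best, round(score - baselines[best], 1))
    return gains


for task, (best, gain) in gain_over_best_baseline(test_scores).items():
    print(f"{task}: +{gain} over {best}")
```

On the test sets, the gains are 0.3 (UPOS), 0.4 (XPOS) and 1.5 (NER) points over XLM-R-large, and 4.4 points on COPA over XLM-R-base, which is the strongest baseline on that task.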

	# Acknowledgements

	Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). The authors received funding from the European Union’s Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341 (MaCoCu).

	# Citation

	If you use this model, please cite the following paper:

	```bibtex
	@inproceedings{non-etal-2022-macocu,
	title = "{M}a{C}o{C}u: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages",
	author = "Ba{\~n}{\'o}n, Marta and
	Espl{\`a}-Gomis, Miquel and
	Forcada, Mikel L. and
	Garc{\'\i}a-Romero, Cristian and
	Kuzman, Taja and
	Ljube{\v{s}}i{\'c}, Nikola and
	van Noord, Rik and
	Sempere, Leopoldo Pla and
	Ram{\'\i}rez-S{\'a}nchez, Gema and
	Rupnik, Peter and
	Suchomel, V{\'\i}t and
	Toral, Antonio and
	van der Werff, Tobias and
	Zaragoza, Jaume",
	booktitle = "Proceedings of the 23rd Annual Conference of the European Association for Machine Translation",
	month = jun,
	year = "2022",
	address = "Ghent, Belgium",
	publisher = "European Association for Machine Translation",
	url = "https://aclanthology.org/2022.eamt-1.41",
	pages = "303--304"
	}
	```