---
license: apache-2.0
datasets:
- bigbio/cas
language:
- fr
metrics:
- f1
library_name: transformers
tags:
- medical
widget:
- text: Patiente atteinte d’une pathologie chronique
- text: Vous êtes amené à prendre en charge un homme de 54 ans qui souffre d’une spondylarthrite ankylosante sévère.
---
<p align="center">
<img src="https://github.com/qanastek/DrBERT/blob/main/assets/logo.png?raw=true" alt="drawing" width="250"/>
</p>
- Corpora: [bigbio/cas](https://huggingface.co/datasets/bigbio/cas)
- Embeddings & Sequence Labelling: [DrBERT-7GB](https://arxiv.org/abs/2304.00958)
- Number of Epochs: 200
# DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains
In recent years, pre-trained language models (PLMs) have achieved the best performance on a wide range of natural language processing (NLP) tasks. While the first models were trained on general-domain data, specialized ones have emerged to handle specific domains more effectively.
In this paper, we propose an original study of PLMs in the medical domain applied to the French language. We compare, for the first time, the performance of PLMs trained on both public data from the web and private data from healthcare establishments. We also evaluate different learning strategies on a set of biomedical tasks.
Finally, we release the first specialized PLMs for the biomedical field in French, called DrBERT, as well as the largest corpus of medical data under free license on which these models are trained.
# CAS: French Corpus with Clinical Cases
| | Train | Dev | Test |
|:---------:|:-----:|:-----:|:-----:|
| Documents | 5,306 | 1,137 | 1,137 |
The ESSAIS (Dalloux et al., 2021) and CAS (Grabar et al., 2018) corpora respectively contain 13,848 and 7,580 clinical cases in French. Some clinical cases are associated with discussions. A subset of the whole set of cases is enriched with morpho-syntactic (part-of-speech (POS) tagging, lemmatization) and semantic (UMLS concepts, negation, uncertainty) annotations. In our case, we focus only on the POS tagging task.
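The released checkpoint is a standard Transformers token-classification model fine-tuned for this POS tagging task, so it can be queried through the `pipeline` API. The sketch below is a minimal example under that assumption; the model identifier is a placeholder for this repository, not a confirmed ID.
```python
# Minimal POS-tagging inference sketch. Assumptions: the model ID below is a
# placeholder for this repository, and the checkpoint is a standard
# token-classification head on top of DrBERT.
from transformers import pipeline

model_id = "Dr-BERT/DrBERT-CAS-POS"  # hypothetical identifier, replace with this repo's ID

pos_tagger = pipeline(
    "token-classification",
    model=model_id,
    aggregation_strategy="simple",  # merge sub-word pieces back into whole words
)

sentence = "Patiente atteinte d’une pathologie chronique"
for token in pos_tagger(sentence):
    print(token["word"], token["entity_group"], round(token["score"], 4))
```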
# Model Metrics
```plain
precision recall f1-score support
ABR 0.8683 0.8480 0.8580 171
ADJ 0.9634 0.9751 0.9692 4018
ADV 0.9935 0.9849 0.9892 926
DET:ART 0.9982 0.9997 0.9989 3308
DET:POS 1.0000 1.0000 1.0000 133
INT 1.0000 0.7000 0.8235 10
KON 0.9883 0.9976 0.9929 845
NAM 0.9144 0.9353 0.9247 834
NOM 0.9827 0.9803 0.9815 7980
NUM 0.9825 0.9845 0.9835 1422
PRO:DEM 0.9924 1.0000 0.9962 131
PRO:IND 0.9630 1.0000 0.9811 78
PRO:PER 0.9948 0.9931 0.9939 579
PRO:REL 1.0000 0.9908 0.9954 109
PRP 0.9989 0.9982 0.9985 3785
PRP:det 1.0000 0.9985 0.9993 681
PUN 0.9996 0.9958 0.9977 2376
PUN:cit 0.9756 0.9524 0.9639 84
SENT 1.0000 0.9974 0.9987 1174
SYM 0.9495 1.0000 0.9741 94
VER:cond 1.0000 1.0000 1.0000 11
VER:futu 1.0000 0.9444 0.9714 18
VER:impf 1.0000 0.9963 0.9981 804
VER:infi 1.0000 0.9585 0.9788 193
VER:pper 0.9742 0.9564 0.9652 1261
VER:ppre 0.9617 0.9901 0.9757 203
VER:pres 0.9833 0.9904 0.9868 830
VER:simp 0.9123 0.7761 0.8387 67
VER:subi 1.0000 0.7000 0.8235 10
VER:subp 1.0000 0.8333 0.9091 18
accuracy 0.9842 32153
macro avg 0.9799 0.9492 0.9623 32153
weighted avg 0.9843 0.9842 0.9842 32153
```
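The table above is a per-tag classification report over the flattened token-level predictions on the test split. A report in the same format can be reproduced with scikit-learn; the sketch below assumes the gold and predicted tag sequences have already been collected into two parallel lists (the placeholder values are illustrative only).
```python
# Sketch of reproducing the per-tag report above. Assumption: gold and predicted
# POS tags for every token of the test set are flattened into two parallel lists.
from sklearn.metrics import classification_report

y_true = ["NOM", "ADJ", "VER:pres", "PUN"]   # gold tags (placeholder values)
y_pred = ["NOM", "ADJ", "VER:pres", "SENT"]  # predicted tags (placeholder values)

print(classification_report(y_true, y_pred, digits=4, zero_division=0))
```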
# Citation BibTeX
```bibtex
@inproceedings{labrak2023drbert,
title = {{DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains}},
author = {Labrak, Yanis and Bazoge, Adrien and Dufour, Richard and Rouvier, Mickael and Morin, Emmanuel and Daille, Béatrice and Gourraud, Pierre-Antoine},
booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL'23), Long Paper},
month = jul,
year = 2023,
address = {Toronto, Canada},
publisher = {Association for Computational Linguistics}
}
```