---
license: lgpl-3.0
base_model: sdadas/polish-roberta-base-v2
tags:
- generated_from_trainer
datasets:
- nkjp1m
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: polish-roberta-base-v2-cposes-tagging
  results:
  - task:
      name: Token Classification
      type: token-classification
    dataset:
      name: nkjp1m
      type: nkjp1m
      config: nkjp1m
      split: test
      args: nkjp1m
    metrics:
    - name: Precision
      type: precision
      value: 0.9913009231909743
    - name: Recall
      type: recall
      value: 0.9912435137138621
    - name: F1
      type: f1
      value: 0.9912722176212015
    - name: Accuracy
      type: accuracy
      value: 0.9889172310669364
widget:
- text: "Niosę dwa miedziane leje"
- text: "Ale dzisiaj leje"
language:
- pl
---

# polish-roberta-base-v2-cposes-tagging

This model is a fine-tuned version of [sdadas/polish-roberta-base-v2](https://huggingface.co/sdadas/polish-roberta-base-v2) on the nkjp1m dataset.
It achieves the following results on the evaluation set:
- Loss: 0.0458
- Precision: 0.9913
- Recall: 0.9912
- F1: 0.9913
- Accuracy: 0.9889

You can find the training notebook here: https://github.com/WikKam/roberta-pos-finetuning

## Usage

```
from transformers import pipeline

# Load the fine-tuned tagger from the Hugging Face Hub
nlp = pipeline("token-classification", model="wkaminski/polish-roberta-base-v2-cposes-tagging")

# Tag a sentence and print the predictions
print(nlp("Ale dzisiaj leje"))
```

## Model description

This model is a coarse part-of-speech tagger for the Polish language based on sdadas/polish-roberta-base-v2.
It supports 13 classes representing coarse parts of speech:

```
{
    0: 'A',
    1: 'Adv',
    2: 'Comp',
    3: 'Conj',
    4: 'Dig',
    5: 'Interj',
    6: 'N',
    7: 'Num',
    8: 'Part',
    9: 'Prep',
    10: 'Punct',
    11: 'V',
    12: 'X'
}
```

The tags have the same meaning as in the nkjp1m dataset:

| Tag    | Description in English       | Description in Polish                   | Example in Polish |
|--------|------------------------------|-----------------------------------------|-------------------|
| A      | Adjective                    | przymiotnik                             | szybki            |
| Adv    | Adverb                       | przysłówek                              | szybko            |
| Comp   | Comparative / Complementizer | stopień porównawczy / spójnik podrzędny | lepszy / że       |
| Conj   | Conjunction                  | spójnik                                 | i                 |
| Dig    | Digit                        | cyfra                                   | 5, 3              |
| Interj | Interjection                 | wykrzyknik                              | och!              |
| N      | Noun                         | rzeczownik                              | dom               |
| Num    | Numeral                      | liczebnik                               | jeden             |
| Part   | Particle                     | partykuła                               | by                |
| Prep   | Preposition                  | przyimek                                | w                 |
| Punct  | Punctuation                  | interpunkcja                            | ., !, ?           |
| V      | Verb                         | czasownik                               | biegać            |
| X      | Unknown / Other              | niesklasyfikowane                       | xxx               |

## Intended uses & limitations

Even though good POS-tagging tools for Polish already exist (for example http://morfeusz.sgjp.pl/), I needed a Polish POS tagger that could be easily loaded inside the browser. Hugging Face supports this, which is why I created this model.

## Training and evaluation data

The model was trained on half of the test data of the nkjp1m dataset (~0.5 million tokens).

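
For downstream use, the tag ids and names listed in the model description can be turned into readable labels. The snippet below is a minimal sketch in plain Python: the helper names `normalize_tag` and `describe` are hypothetical, the input dicts only mimic the `word`/`entity` keys of the `transformers` token-classification pipeline, and the `LABEL_<id>` fallback is an assumption in case the model config exposes generic label names. The sample predictions are hand-written placeholders, not real model output.

```python
# Tag ids and names copied from the model description.
ID2TAG = {
    0: "A", 1: "Adv", 2: "Comp", 3: "Conj", 4: "Dig", 5: "Interj", 6: "N",
    7: "Num", 8: "Part", 9: "Prep", 10: "Punct", 11: "V", 12: "X",
}

# English descriptions copied from the tag table.
TAG2DESC = {
    "A": "Adjective", "Adv": "Adverb", "Comp": "Comparative / Complementizer",
    "Conj": "Conjunction", "Dig": "Digit", "Interj": "Interjection",
    "N": "Noun", "Num": "Numeral", "Part": "Particle", "Prep": "Preposition",
    "Punct": "Punctuation", "V": "Verb", "X": "Unknown / Other",
}


def normalize_tag(entity):
    """Return a tag name for 'N'-style or 'LABEL_6'-style entity strings."""
    if entity.startswith("LABEL_"):
        # Assumption: generic labels carry the class id after the prefix.
        return ID2TAG[int(entity[len("LABEL_"):])]
    return entity


def describe(predictions):
    """Map pipeline-style dicts to (word, tag, description) tuples."""
    rows = []
    for item in predictions:
        tag = normalize_tag(item["entity"])
        rows.append((item["word"], tag, TAG2DESC.get(tag, "Unknown / Other")))
    return rows


# Hand-written placeholder input, NOT real model output.
sample = [
    {"word": "Ale", "entity": "Conj"},
    {"word": "dzisiaj", "entity": "LABEL_1"},
    {"word": "leje", "entity": "V"},
]

for row in describe(sample):
    print(row)
```

In practice, `sample` would be replaced by the list returned from the pipeline call shown in the Usage section.
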
## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3

### Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| 0.0471        | 1.0   | 2155 | 0.0491          | 0.9896    | 0.9900 | 0.9898 | 0.9873   |
| 0.0291        | 2.0   | 4310 | 0.0467          | 0.9901    | 0.9905 | 0.9903 | 0.9884   |
| 0.0191        | 3.0   | 6465 | 0.0458          | 0.9913    | 0.9912 | 0.9913 | 0.9889   |

### Framework versions

- Transformers 4.35.2
- Pytorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0
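
The reported numbers can be sanity-checked with a few lines of arithmetic. The sketch below recomputes F1 as the harmonic mean of the precision and recall given in the card metadata, confirms that three epochs of 2155 steps give the final step count of 6465, and prints the learning rate at each epoch boundary under a linear schedule. The function names are hypothetical, and the schedule assumes no warmup and decay to zero; the card only states `lr_scheduler_type: linear`, so treat that part as an illustration rather than the exact training setup.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def linear_lr(step, total_steps, base_lr=2e-05, warmup_steps=0):
    """Linear warmup followed by linear decay to zero.

    Assumption: warmup_steps=0, since the card does not mention warmup.
    """
    if step < warmup_steps:
        return base_lr * (step / max(1, warmup_steps))
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / max(1, total_steps - warmup_steps))


# Values copied from the card metadata and the training results table.
PRECISION = 0.9913009231909743
RECALL = 0.9912435137138621
REPORTED_F1 = 0.9912722176212015
STEPS_PER_EPOCH = 2155
NUM_EPOCHS = 3
TOTAL_STEPS = 6465

computed_f1 = f1_score(PRECISION, RECALL)
print("F1 matches reported value:", abs(computed_f1 - REPORTED_F1) < 1e-9)
print("Step count is consistent:", STEPS_PER_EPOCH * NUM_EPOCHS == TOTAL_STEPS)

# Learning rate at the start and at the end of each epoch.
for step in (0, 2155, 4310, 6465):
    print(step, linear_lr(step, TOTAL_STEPS))
```

With a batch size of 16, 2155 steps per epoch also implies at most 34,480 training sequences per epoch.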