---
license: lgpl-3.0
base_model: sdadas/polish-roberta-base-v2
tags:
- generated_from_trainer
datasets:
- nkjp1m
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: polish-roberta-base-v2-pos-tagging
  results:
  - task:
      name: Token Classification
      type: token-classification
    dataset:
      name: nkjp1m
      type: nkjp1m
      config: nkjp1m
      split: test
      args: nkjp1m
    metrics:
    - name: Precision
      type: precision
      value: 0.9853198910270871
    - name: Recall
      type: recall
      value: 0.9858245297268206
    - name: F1
      type: f1
      value: 0.9855721457799069
    - name: Accuracy
      type: accuracy
      value: 0.9884294612942691
---

# polish-roberta-base-v2-pos-tagging

This model is a fine-tuned version of [sdadas/polish-roberta-base-v2](https://huggingface.co/sdadas/polish-roberta-base-v2) on the nkjp1m dataset.
It achieves the following results on the evaluation set:
- Loss: 0.0508
- Precision: 0.9853
- Recall: 0.9858
- F1: 0.9856
- Accuracy: 0.9884

## Model description

This model is a part-of-speech tagger for the Polish language based on [sdadas/polish-roberta-base-v2](https://huggingface.co/sdadas/polish-roberta-base-v2).
It supports 40 classes, each representing a flexemic class (a detailed part of speech):

```
{
    0: 'adj',
    1: 'adja',
    2: 'adjc',
    3: 'adjp',
    4: 'adv',
    5: 'aglt',
    6: 'bedzie',
    7: 'brev',
    8: 'comp',
    9: 'conj',
    10: 'depr',
    11: 'dig',
    12: 'fin',
    13: 'frag',
    14: 'ger',
    15: 'imps',
    16: 'impt',
    17: 'inf',
    18: 'interj',
    19: 'interp',
    20: 'num',
    21: 'numcomp',
    22: 'pact',
    23: 'pacta',
    24: 'pant',
    25: 'part',
    26: 'pcon',
    27: 'ppas',
    28: 'ppron12',
    29: 'ppron3',
    30: 'praet',
    31: 'pred',
    32: 'prep',
    33: 'romandig',
    34: 'siebie',
    35: 'subst',
    36: 'sym',
    37: 'winien',
    38: 'xxs',
    39: 'xxx'
}
```

The tags have the same meaning as in the nkjp1m dataset:

| flexeme | abbreviation | base form | example |
|------------------------------|--------------|--------------------------------------------------------------|------------------|
| noun | subst | singular nominative | profesor |
| depreciative form | depr | singular nominative form of the corresponding noun | profesor |
| main numeral | num | inanimate masculine nominative form | pięć, dwa |
| collective numeral | numcol | inanimate masculine nominative form of the main numeral | pięć, dwa |
| adjective | adj | singular nominative masculine positive form | polski |
| ad-adjectival adjective | adja | singular nominative masculine positive form of the adjective | polski |
| post-prepositional adjective | adjp | singular nominative masculine positive form of the adjective | polski |
| predicative adjective | adjc | singular nominative masculine positive form of the adjective | zdrowy, ciekawy |
| adverb | adv | positive form | dobrze, bardzo |
| non-3rd person pronoun | ppron12 | singular nominative | ja |
| 3rd-person pronoun | ppron3 | singular nominative | on |
| pronoun siebie | siebie | accusative | siebie |
| non-past form | fin | infinitive | czytać |
| future być | bedzie | infinitive | być |
| agglutinate być | aglt | infinitive | być |
| l-participle | praet | infinitive | czytać |
| imperative | impt | infinitive | czytać |
| impersonal | imps | infinitive | czytać |
| infinitive | inf | infinitive | czytać |
| contemporary adv. participle | pcon | infinitive | czytać |
| anterior adv. participle | pant | infinitive | czytać |
| gerund | ger | infinitive | czytać |
| active adj. participle | pact | infinitive | czytać |
| passive adj. participle | ppas | infinitive | czytać |
| winien | winien | singular masculine form | powinien, rad |
| predicative | pred | the only form of that flexeme | warto |
| preposition | prep | the non-vocalic form of that flexeme | na, przez, w |
| coordinating conjunction | conj | the only form of that flexeme | oraz |
| subordinating conjunction | comp | the only form of that flexeme | że |
| particle-adverb | qub | the only form of that flexeme | nie, -że, się |
| abbreviation | brev | the full dictionary form | rok, i tak dalej |
| bound word | burk | the only form of that flexeme | trochu, oścież |
| interjection | interj | the only form of that flexeme | ech, kurde |
| punctuation | interp | the only form of that flexeme | ;, ., (, ] |
| alien | xxx | the only form of that flexeme | cool, nihil |

## Intended uses & limitations

Although good tools for Polish POS tagging already exist (for example http://morfeusz.sgjp.pl/), I needed a POS tagger for Polish that could be easily loaded inside the browser. Hugging Face supports this, which is why I created this model.

## Training and evaluation data

The model was trained on half of the test data of the nkjp1m dataset (~0.5 million tokens).

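
As a rough illustration of how the class ids from the mapping under "Model description" might be turned back into word-level tags, the sketch below keeps the prediction of the first sub-token of every word. The helper name `decode_word_tags` and the example ids are hypothetical. `word_ids` is assumed to have the shape returned by a fast tokenizer's `word_ids()` method (`None` for special tokens), and the first-sub-token rule is a common convention rather than something this card specifies.

```python
from typing import List, Optional

# Label mapping copied from the "Model description" section of this card.
ID2LABEL = dict(enumerate([
    "adj", "adja", "adjc", "adjp", "adv", "aglt", "bedzie", "brev", "comp", "conj",
    "depr", "dig", "fin", "frag", "ger", "imps", "impt", "inf", "interj", "interp",
    "num", "numcomp", "pact", "pacta", "pant", "part", "pcon", "ppas", "ppron12", "ppron3",
    "praet", "pred", "prep", "romandig", "siebie", "subst", "sym", "winien", "xxs", "xxx",
]))


def decode_word_tags(pred_ids: List[int], word_ids: List[Optional[int]]) -> List[str]:
    """Return one tag per word, taken from the word's first sub-token.

    Hypothetical helper: pred_ids holds one predicted class id per sub-token,
    word_ids maps each sub-token to its word index (None for special tokens).
    """
    tags: List[str] = []
    previous_word = None
    for pred, word in zip(pred_ids, word_ids):
        # Skip special tokens and the continuation pieces of a word.
        if word is not None and word != previous_word:
            tags.append(ID2LABEL[pred])
        previous_word = word
    return tags


# Made-up predictions for a three-word input whose second word has two sub-tokens.
example_word_ids = [None, 0, 1, 1, 2, None]
example_pred_ids = [19, 29, 12, 12, 35, 19]
example_tags = decode_word_tags(example_pred_ids, example_word_ids)
print(example_tags)
```

Taking the first sub-token is only one option; averaging the logits over all sub-tokens of a word is another, and the card does not say which was used during evaluation.
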
## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3

### Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| 0.0665        | 1.0   | 2155 | 0.0629          | 0.9835    | 0.9836 | 0.9836 | 0.9867   |
| 0.0369        | 2.0   | 4310 | 0.0539          | 0.9842    | 0.9848 | 0.9845 | 0.9876   |
| 0.0243        | 3.0   | 6465 | 0.0508          | 0.9853    | 0.9858 | 0.9856 | 0.9884   |

### Framework versions

- Transformers 4.36.0
- Pytorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0