---
license: lgpl-3.0
base_model: sdadas/polish-roberta-base-v2
tags:
- generated_from_trainer
datasets:
- nkjp1m
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: polish-roberta-base-v2-pos-tagging
  results:
  - task:
      name: Token Classification
      type: token-classification
    dataset:
      name: nkjp1m
      type: nkjp1m
      config: nkjp1m
      split: test
      args: nkjp1m
    metrics:
    - name: Precision
      type: precision
      value: 0.9853198910270871
    - name: Recall
      type: recall
      value: 0.9858245297268206
    - name: F1
      type: f1
      value: 0.9855721457799069
    - name: Accuracy
      type: accuracy
      value: 0.9884294612942691
widget:
- text: "Niosę dwa miedziane leje"
- text: "Ale dzisiaj leje"
language:
- pl
---


# polish-roberta-base-v2-pos-tagging

This model is a fine-tuned version of [sdadas/polish-roberta-base-v2](https://huggingface.co/sdadas/polish-roberta-base-v2) on the nkjp1m dataset.
It achieves the following results on the evaluation set:
- Loss: 0.0508
- Precision: 0.9853
- Recall: 0.9858
- F1: 0.9856
- Accuracy: 0.9884
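
As a quick sanity check on the numbers above, the reported F1 can be recomputed from the reported precision and recall. This is a minimal sketch that assumes F1 here is the plain harmonic mean of the overall precision and recall; the full-precision values are the ones from the card metadata.

```python
# Full-precision values as reported in the model card metadata
precision = 0.9853198910270871
recall = 0.9858245297268206

# F1 as the harmonic mean of precision and recall (assumed definition)
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.9856
```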

You can find the training notebook here: https://github.com/WikKam/roberta-pos-finetuning

## Usage

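```python
from transformers import pipeline

# Load the tagger from the Hugging Face Hub
nlp = pipeline("token-classification", model="wkaminski/polish-roberta-base-v2-pos-tagging")

# Returns one prediction (tag, score, character offsets) per token
nlp("Ale dzisiaj leje")
```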
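
The pipeline predicts a tag per tokenizer piece, so a single word may come back as several entries. The sketch below shows one possible way to merge them back into word-level tags. It is a heuristic, not part of the model: `merge_subword_tags` is a hypothetical helper, it assumes each prediction carries `entity`, `start` and `end` fields (character offsets may be missing with a slow tokenizer), and the predictions shown are made-up mock data rather than real model output.

```python
def merge_subword_tags(text, predictions):
    """Group contiguous pieces that share a tag into (surface form, tag) pairs."""
    merged = []  # each entry: [start, end, tag]
    for pred in predictions:
        start, end, tag = pred["start"], pred["end"], pred["entity"]
        same_word = (
            merged
            and merged[-1][1] == start               # no gap after the previous piece
            and not text[start:start + 1].isspace()  # piece does not begin with whitespace
            and merged[-1][2] == tag                 # both pieces carry the same tag
        )
        if same_word:
            merged[-1][1] = end  # extend the current word
        else:
            merged.append([start, end, tag])
    return [(text[s:e].strip(), tag) for s, e, tag in merged]


# Hypothetical pipeline output for illustration (not real model predictions)
text = "Ale dzisiaj leje"
mock_predictions = [
    {"entity": "conj", "start": 0, "end": 3},
    {"entity": "adv", "start": 4, "end": 8},
    {"entity": "adv", "start": 8, "end": 11},
    {"entity": "fin", "start": 12, "end": 16},
]

tagged = merge_subword_tags(text, mock_predictions)
print(tagged)
```

Because the merge relies on adjacent pieces sharing a tag, neighbouring punctuation marks with the same tag would also be joined; treat the result as a convenience view, not a tokenization.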

## Model description

This model is a part-of-speech tagger for the Polish language based on [sdadas/polish-roberta-base-v2](https://huggingface.co/sdadas/polish-roberta-base-v2). 

It supports 40 classes, each representing a flexemic class (a detailed part of speech):
```python
{
 0: 'adj',
 1: 'adja',
 2: 'adjc',
 3: 'adjp',
 4: 'adv',
 5: 'aglt',
 6: 'bedzie',
 7: 'brev',
 8: 'comp',
 9: 'conj',
 10: 'depr',
 11: 'dig',
 12: 'fin',
 13: 'frag',
 14: 'ger',
 15: 'imps',
 16: 'impt',
 17: 'inf',
 18: 'interj',
 19: 'interp',
 20: 'num',
 21: 'numcomp',
 22: 'pact',
 23: 'pacta',
 24: 'pant',
 25: 'part',
 26: 'pcon',
 27: 'ppas',
 28: 'ppron12',
 29: 'ppron3',
 30: 'praet',
 31: 'pred',
 32: 'prep',
 33: 'romandig',
 34: 'siebie',
 35: 'subst',
 36: 'sym',
 37: 'winien',
 38: 'xxs',
 39: 'xxx'
}
```
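
The label ids above follow the alphabetical order of the tags, so the mapping can be rebuilt and inverted in a few lines. This is only an illustrative sketch: `LABELS` and `decode` are names introduced here, and when the model is loaded with Transformers the authoritative mapping is the one stored in `model.config.id2label`.

```python
# The 40 tags in the order listed above (index == class id)
LABELS = (
    "adj adja adjc adjp adv aglt bedzie brev comp conj "
    "depr dig fin frag ger imps impt inf interj interp "
    "num numcomp pact pacta pant part pcon ppas ppron12 ppron3 "
    "praet pred prep romandig siebie subst sym winien xxs xxx"
).split()

id2label = dict(enumerate(LABELS))
label2id = {label: i for i, label in id2label.items()}


def decode(class_ids):
    """Map predicted class ids (e.g. an argmax over logits) to tag names."""
    return [id2label[i] for i in class_ids]


print(decode([9, 4, 12]))  # ['conj', 'adv', 'fin']
```
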
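The tags have the same meaning as in the nkjp1m dataset. Note that the table and the label list do not match exactly: `numcol`, `qub` and `burk` appear only in the table, while labels such as `numcomp`, `part`, `dig` and `romandig` appear only in the label list.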

| flexeme                    | abbreviation | base form                                               | example               |
|----------------------------|--------------|---------------------------------------------------------|-----------------------|
| noun                       | subst        | singular nominative                                     | profesor              |
| depreciative form          | depr         | singular nominative form of the corresponding noun      | profesor              |
| main numeral               | num          | inanimate masculine nominative form                     | pięć, dwa             |
| collective numeral         | numcol       | inanimate masculine nominative form of the main numeral | pięć, dwa             |
| adjective                  | adj          | singular nominative masculine positive form             | polski                |
| ad-adjectival adjective    | adja         | singular nominative masculine positive form of the adjective | polski             |
| post-prepositional adjective | adjp       | singular nominative masculine positive form of the adjective | polski             |
| predicative adjective      | adjc         | singular nominative masculine positive form of the adjective | zdrowy, ciekawy    |
| adverb                     | adv          | positive form                                           | dobrze, bardzo        |
| non-3rd person pronoun     | ppron12      | singular nominative                                     | ja                    |
| 3rd-person pronoun         | ppron3       | singular nominative                                     | on                    |
| pronoun siebie             | siebie       | accusative                                              | siebie                |
| non-past form              | fin          | infinitive                                              | czytać                |
| future być                 | bedzie       | infinitive                                              | być                   |
| agglutinate być            | aglt         | infinitive                                              | być                   |
| l-participle               | praet        | infinitive                                              | czytać                |
| imperative                 | impt         | infinitive                                              | czytać                |
| impersonal                 | imps         | infinitive                                              | czytać                |
| infinitive                 | inf          | infinitive                                              | czytać                |
| contemporary adv. participle | pcon        | infinitive                                              | czytać                |
| anterior adv. participle   | pant         | infinitive                                              | czytać                |
| gerund                     | ger          | infinitive                                              | czytać                |
| active adj. participle     | pact         | infinitive                                              | czytać                |
| passive adj. participle    | ppas         | infinitive                                              | czytać                |
| winien                     | winien       | singular masculine form                                 | powinien, rad         |
| predicative                | pred         | the only form of that flexeme                            | warto                 |
| preposition                | prep         | the non-vocalic form of that flexeme                     | na, przez, w          |
| coordinating conjunction   | conj         | the only form of that flexeme                            | oraz                  |
| subordinating conjunction  | comp         | the only form of that flexeme                            | że                    |
| particle-adverb            | qub          | the only form of that flexeme                            | nie, -że, się         |
| abbreviation               | brev         | the full dictionary form                                | rok, i tak dalej      |
| bound word                 | burk         | the only form of that flexeme                            | trochu, oścież        |
| interjection               | interj       | the only form of that flexeme                            | ech, kurde            |
| punctuation                | interp       | the only form of that flexeme                            | ;, ., (, ]            |
| alien                      | xxx          | the only form of that flexeme                            | cool, nihil           |

## Intended uses & limitations

Good tools for Polish POS tagging already exist (e.g. http://morfeusz.sgjp.pl/), but I needed a Polish POS tagger that could be easily loaded inside the browser. Hugging Face supports this, which is why I created this model.

## Training and evaluation data

The model was trained on half of the test data of the nkjp1m dataset (~0.5 million tokens).

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3

### Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| 0.0665        | 1.0   | 2155 | 0.0629          | 0.9835    | 0.9836 | 0.9836 | 0.9867   |
| 0.0369        | 2.0   | 4310 | 0.0539          | 0.9842    | 0.9848 | 0.9845 | 0.9876   |
| 0.0243        | 3.0   | 6465 | 0.0508          | 0.9853    | 0.9858 | 0.9856 | 0.9884   |


### Framework versions

- Transformers 4.36.0
- PyTorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0