---
license: apache-2.0
base_model: sentence-transformers/LaBSE
tags:
- generated_from_trainer
- news
- russian
- media
- text-classification
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: news_classifier_ft
  results: []
datasets:
- data-silence/rus_news_classifier
pipeline_tag: text-classification
language:
- ru
widget:
- text: Введите новостной текст для классификации
  example_title: Классификация новостей
  button_text: Классифицировать
  api_name: classify
library_name: transformers
---

# any-news-classifier

This model is a fine-tuned version of [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE) on my [news dataset](https://huggingface.co/datasets/data-silence/rus_news_classifier).
The training dataset is a well-balanced sample of news published over the last five years.

It achieves the following results on the evaluation set (a sketch for recomputing comparable metrics follows the list):
- Loss: 0.3820
- Accuracy: 0.9029
- F1: 0.9025
- Precision: 0.9030
- Recall: 0.9029
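
The sketch below shows one way to recompute comparable metrics with the published pipeline and scikit-learn. The split name and the `text` / `labels` column names are assumptions, not something taken from the dataset card, so check the actual schema before running.

```python
# Sketch only: recomputing similar metrics on the evaluation data.
# Split name and column names ("text", "labels") are assumptions.
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import pipeline

classifier = pipeline("text-classification", model="data-silence/rus-news-classifier")

ds = load_dataset("data-silence/rus_news_classifier", split="test")
outputs = classifier(ds["text"], batch_size=32, truncation=True)

# Pipeline labels look like "LABEL_7"; recover the integer class id
preds = [int(out["label"].split("_")[-1]) for out in outputs]

print("accuracy:", accuracy_score(ds["labels"], preds))
print("weighted F1:", f1_score(ds["labels"], preds, average="weighted"))
```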

## Model description

This is a multi-class classifier for Russian news, built by fine-tuning the LaBSE model for the [AntiSMI Project](https://github.com/data-silence/antiSMI-Project).
The classifier assigns each news item to one of the following 11 categories (a sketch after the list shows how to attach these names to the loaded model):
- climate (климат)
- conflicts (конфликты)
- culture (культура)
- economy (экономика)
- gloss (глянец)
- health (здоровье)
- politics (политика)
- science (наука)
- society (общество)
- sports (спорт)
- travel (путешествия)
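
The published checkpoint returns its classes as `LABEL_0` … `LABEL_10` (see the `category_mapper` in the usage example below). If you want the pipeline to emit the category names directly, one option is to override `id2label` when loading the model. This is a sketch, not part of the original repository code; the index order simply mirrors the mapping used below.

```python
# Optional sketch: load the checkpoint with human-readable class names,
# so the pipeline returns e.g. "sports" instead of "LABEL_9".
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

categories = ["climate", "conflicts", "culture", "economy", "gloss", "health",
              "politics", "science", "society", "sports", "travel"]
id2label = {i: name for i, name in enumerate(categories)}
label2id = {name: i for i, name in id2label.items()}

model = AutoModelForSequenceClassification.from_pretrained(
    "data-silence/rus-news-classifier", id2label=id2label, label2id=label2id
)
tokenizer = AutoTokenizer.from_pretrained("data-silence/rus-news-classifier")
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

classifier("Сборная выиграла золото Олимпиады")
# [{'label': 'sports', 'score': ...}]
```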

## Testing this model on `Spaces`

You can try the model and evaluate its quality [here](https://huggingface.co/spaces/data-silence/rus-news-classifier).


## How to use

```python
from transformers import pipeline

# Maps raw pipeline labels to human-readable category names
category_mapper = {
    'LABEL_0': 'climate',
    'LABEL_1': 'conflicts',
    'LABEL_2': 'culture',
    'LABEL_3': 'economy',
    'LABEL_4': 'gloss',
    'LABEL_5': 'health',
    'LABEL_6': 'politics',
    'LABEL_7': 'science',
    'LABEL_8': 'society',
    'LABEL_9': 'sports',
    'LABEL_10': 'travel'
}

# Use the pre-trained model from the Hugging Face Hub
classifier = pipeline("text-classification", model="data-silence/rus-news-classifier")

def predict_category(text):
    result = classifier(text)
    category = category_mapper[result[0]['label']]
    score = result[0]['score']
    return category, score

predict_category("В Париже завершилась церемония закрытия Олимпийских игр")
# ('sports', 0.9959506988525391)
```
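
The pipeline also accepts a list of texts for batched inference; enabling truncation guards against articles longer than the encoder's maximum input length. A small sketch (the batch size is arbitrary):

```python
texts = [
    "Центробанк сохранил ключевую ставку на прежнем уровне",
    "Учёные описали новый вид глубоководных рыб",
]
# One result dict per input text; truncation avoids errors on very long articles
for text, result in zip(texts, classifier(texts, batch_size=8, truncation=True)):
    print(category_mapper[result["label"]], round(result["score"], 3), "<-", text[:40])
```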


## Intended uses & limitations

The "gloss" category is used to select yellow press, trashy and dubious news. The model can get confused in the classification of news categories politics, society and conflicts.

## Training and evaluation data

The model was trained and evaluated on the [rus_news_classifier](https://huggingface.co/datasets/data-silence/rus_news_classifier) dataset linked above; see the dataset card for details.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a `TrainingArguments` sketch follows the list):
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5
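
The training script itself is not part of this card; the following is a minimal `TrainingArguments` sketch matching the listed hyperparameters. The output directory and the per-epoch evaluation strategy are assumptions inferred from the results table below.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="news_classifier_ft",  # assumed; mirrors the model-index name
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=5,
    eval_strategy="epoch",  # assumed from the per-epoch validation results below
)
# Adam betas (0.9, 0.999) and epsilon 1e-08 match the library defaults,
# so no extra optimizer arguments are needed.
```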

### Training results

| Training Loss | Epoch | Step  | Validation Loss | Accuracy | F1     | Precision | Recall |
|:-------------:|:-----:|:-----:|:---------------:|:--------:|:------:|:---------:|:------:|
| 0.3544        | 1.0   | 3596  | 0.3517          | 0.8861   | 0.8860 | 0.8915    | 0.8861 |
| 0.2738        | 2.0   | 7192  | 0.3190          | 0.8995   | 0.8987 | 0.9025    | 0.8995 |
| 0.19          | 3.0   | 10788 | 0.3524          | 0.9016   | 0.9015 | 0.9019    | 0.9016 |
| 0.1402        | 4.0   | 14384 | 0.3820          | 0.9029   | 0.9025 | 0.9030    | 0.9029 |
| 0.1055        | 5.0   | 17980 | 0.4399          | 0.9022   | 0.9018 | 0.9024    | 0.9022 |


### Framework versions

- Transformers 4.42.4
- Pytorch 2.3.1+cu121
- Datasets 2.20.0
- Tokenizers 0.19.1