---
license: cc-by-4.0
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
language: et
model-index:
- name: TalTechNLP/whisper-medium-et
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11
      type: mozilla-foundation/common_voice_11_0
      config: et
      split: test
    metrics:
    - name: Test WER
      type: wer
      value: 14.66
    - name: Test CER
      type: cer
      value: 3.76
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 8
      type: mozilla-foundation/common_voice_8_0
      config: et
      split: test
    metrics:
    - name: Test WER
      type: wer
      value: 13.793
    - name: Test CER
      type: cer
      value: 3.194
---

# Whisper-medium-et

This model is [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) finetuned on around 800 hours of diverse Estonian data.

## Model description

This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.

## Intended uses & limitations

This model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, talks, etc.

## How to use

Use it like any other Whisper model via Hugging Face transformers, or with a faster decoder such as [faster-whisper](https://github.com/guillaumekln/faster-whisper). A minimal usage sketch is given at the end of this card.

#### Limitations and bias

Since this model was trained mostly on broadcast speech and texts from the web, it might have problems correctly decoding the following:

* Speech containing technical and other domain-specific terms
* Children's speech
* Non-native speech
* Speech recorded under very noisy conditions or with a microphone far from the speaker
* Very spontaneous and overlapping speech

## Training data

Acoustic training data:

| Type                  | Amount (h) |
|-----------------------|:----------:|
| Broadcast speech      | 591   |
| Spontaneous speech    | 53    |
| Elderly speech corpus | 53    |
| Talks, lectures       | 49    |
| Parliament speeches   | 31    |
| *Total*               | *761* |

## Training procedure

Finetuned using ESPnet and then converted to the transformers format using [this](https://gist.github.com/alumae/2dcf473b667cec9d513b80ea24e94672) script. The finetuning procedure is similar to that of [this](https://huggingface.co/espnet/shihlun_asr_whisper_medium_finetuned_librispeech100) model.

## Evaluation results

### WER

The WER results below are obtained using greedy decoding (i.e., beam size 1).

| Dataset | WER |
|---|---|
| Common Voice 8.0  | 13.8 |
| Common Voice 11.0 | 14.7 |
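
Below is a minimal transcription sketch for the transformers-based usage mentioned in the "How to use" section above. The audio file name is a placeholder, and the `generate_kwargs` options are assumptions that may need adjusting depending on your transformers version; this is not the exact setup used by the model authors.

```python
# Minimal transcription sketch (assumes a recent transformers version and ffmpeg for audio loading).
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="TalTechNLP/whisper-medium-et",
    chunk_length_s=30,  # chunked decoding for recordings longer than 30 s
    device=0 if torch.cuda.is_available() else -1,
)

# "example_et.wav" is a placeholder path; forcing the language/task is optional
# but avoids accidental language-detection errors.
result = asr(
    "example_et.wav",
    generate_kwargs={"language": "et", "task": "transcribe"},
)
print(result["text"])
```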
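
And a rough sketch of how the greedy-decoding evaluation above could be reproduced on the Common Voice 11.0 Estonian test split. The exact text normalization behind the reported numbers is not specified on this card, so scores obtained this way may differ slightly.

```python
# Rough WER evaluation sketch on Common Voice 11.0 "et" test (greedy decoding, i.e. beam size 1).
# Note: the Common Voice dataset requires accepting its terms on the Hugging Face Hub.
import torch
import evaluate
from datasets import Audio, load_dataset
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="TalTechNLP/whisper-medium-et",
    device=0 if torch.cuda.is_available() else -1,
)
wer = evaluate.load("wer")

ds = load_dataset("mozilla-foundation/common_voice_11_0", "et", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

predictions, references = [], []
for sample in ds:
    out = asr(
        sample["audio"],
        generate_kwargs={"num_beams": 1, "language": "et", "task": "transcribe"},  # greedy decoding
    )
    predictions.append(out["text"])
    references.append(sample["sentence"])

print(f"WER: {100 * wer.compute(predictions=predictions, references=references):.2f}")
```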