# SWR2-ACR
Automatic speech recognition model for the seminar Spoken Word Recognition 2 (SWR2) in the summer term 2023.
# Installation
```
poetry install
```
# Usage
## Training the tokenizer
We use a byte-pair encoding (BPE) tokenizer. To train the tokenizer, run
```
poetry run train-bpe-tokenizer --dataset_path="DATA_PATH" --language=mls_german_opus --split=all --out_path="data/tokenizers/bpe_tokenizer_german_3000.json" --vocab_size=3000
```
with the desired values for `DATA_PATH` and `vocab_size`.
You can also use a character-level tokenizer, which can be trained with
```
poetry run train-char-tokenizer --dataset_path="DATA_PATH" --language=mls_german_opus --split=all --out_path="data/tokenizers/char_tokenizer_german.txt"
```
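The BPE tokenizer is written to a single JSON file. Assuming it follows the Hugging Face `tokenizers` file format (an assumption based on the `.json` output path, not something this README states), it can be loaded and inspected with a short sketch like this:
```
from tokenizers import Tokenizer

# Load the trained BPE tokenizer (path from the command above).
# Assumes the JSON file is in the Hugging Face `tokenizers` format.
tokenizer = Tokenizer.from_file("data/tokenizers/bpe_tokenizer_german_3000.json")

# Encode a sample sentence and inspect the resulting sub-word tokens and ids.
encoding = tokenizer.encode("das ist ein test")
print(encoding.tokens)  # sub-word tokens
print(encoding.ids)     # integer ids used as model targets
```
Judging by the `.txt` extension, the character-level tokenizer writes a plain-text vocabulary file instead.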
## Training
Train using the provided train script:
```
poetry run train
```
## Evaluation
## Inference
Run inference using the provided recognize script:
```
poetry run recognize
```
## CI
You can use the Makefile to run the formatting and linting checks manually:
```
make format
make lint
```