# SWR2-ASR

Automatic speech recognition model for the seminar Spoken Word
Recognition 2 (SWR2) in the summer term 2023.

# Installation
```
poetry install
```

# Usage

## Training the tokenizer
We use a byte pair encoding tokenizer. To train the tokenizer, run
```
poetry run train-bpe-tokenizer --dataset_path="DATA_PATH" --language=mls_german_opus --split=all --out_path="data/tokenizers/bpe_tokenizer_german_3000.json" --vocab_size=3000
```
with the desired values for `DATA_PATH` and `vocab_size`.
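
For intuition about what the trainer learns, here is a minimal pure-Python sketch of byte pair encoding merge learning. It is not the code behind `train-bpe-tokenizer`; the function name `learn_bpe_merges` and the toy word list are made up for illustration, and a real tokenizer additionally handles normalization, special tokens, and serialization.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words.

    Hypothetical helper for illustration only, not part of this repository.
    """
    # Each distinct word becomes a tuple of single-character symbols,
    # mapped to how often the word occurs.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties go to the lexicographically
        # largest pair so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, replacing occurrences of the pair by one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


if __name__ == "__main__":
    print(learn_bpe_merges(["haus", "haus", "maus", "aus"], 3))
```

In this picture, a larger `--vocab_size` roughly corresponds to more merge steps, which yields longer subword units and shorter token sequences.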

You can also use a character-level tokenizer, which can be trained with
```
poetry run train-char-tokenizer --dataset_path="DATA_PATH" --language=mls_german_opus --split=all --out_path="data/tokenizers/char_tokenizer_german.txt"
```
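
As a rough illustration of the character-level alternative, the sketch below builds a vocabulary from the distinct characters in some training text and maps text to integer ids and back. The class `CharTokenizer` is hypothetical and does not reflect the file format written by `train-char-tokenizer`.

```python
class CharTokenizer:
    """Toy character-level tokenizer (illustrative assumption, not this repo's class)."""

    def __init__(self, texts):
        # The vocabulary is the sorted set of characters seen in training.
        chars = sorted(set("".join(texts)))
        self.char_to_id = {char: idx for idx, char in enumerate(chars)}
        self.id_to_char = {idx: char for char, idx in self.char_to_id.items()}

    def encode(self, text):
        # Raises KeyError for characters that were not seen during training.
        return [self.char_to_id[char] for char in text]

    def decode(self, ids):
        return "".join(self.id_to_char[idx] for idx in ids)


if __name__ == "__main__":
    tokenizer = CharTokenizer(["hallo welt"])
    ids = tokenizer.encode("hallo")
    print(ids, tokenizer.decode(ids))
```

Compared with BPE, the vocabulary stays very small, at the cost of longer token sequences per utterance.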
## Training

Train using the provided train script:

    poetry run train

## Evaluation

## Inference

    poetry run recognize

## CI

You can use the Makefile to run the formatting and linting steps manually:

    make format

    make lint