# SWR2-ASR

Automatic speech recognition model for the seminar "Spoken Word
Recognition 2 (SWR2)" by Konstantin Sering in the summer term 2023.

Authors:
Silja Kasper, Marvin Borner, Philipp Merkel, Valentin Schmidt 

# Dataset
We use the German [Multilingual LibriSpeech dataset](http://www.openslr.org/94/) (mls_german_opus). If the dataset is not found under the specified path, it will be downloaded automatically.

If you want to train the model on custom data, the code expects the following folder structure:
```
<dataset_path>
  ├── <language>
  │  ├── train
  │  │  ├── transcripts.txt
  │  │  └── audio
  │  │     └── <speakerid>
  │  │        └── <bookid>
  │  │           └── <speakerid>_<bookid>_<chapterid>.opus/.flac
  │  ├── dev
  │  │  ├── transcripts.txt
  │  │  └── audio
  │  │     └── <speakerid>
  │  │        └── <bookid>
  │  │           └── <speakerid>_<bookid>_<chapterid>.opus/.flac
  │  └── test
  │     ├── transcripts.txt
  │     └── audio
  │        └── <speakerid>
  │           └── <bookid>
  │              └── <speakerid>_<bookid>_<chapterid>.opus/.flac
```
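
Each `transcripts.txt` maps utterance IDs to their transcripts. As a minimal sketch, assuming the usual MLS tab-separated format (`<speakerid>_<bookid>_<chapterid>`, a tab, then the transcript), such a file can be read like this:
```
from pathlib import Path

def load_transcripts(split_dir: Path) -> dict[str, str]:
    """Read an MLS-style transcripts.txt into {utterance_id: transcript}."""
    transcripts = {}
    for line in (split_dir / "transcripts.txt").read_text(encoding="utf-8").splitlines():
        utt_id, text = line.split("\t", maxsplit=1)
        transcripts[utt_id] = text
    return transcripts

train = load_transcripts(Path("mls_german_opus/german/train"))
```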


# Installation
The preferred method of installation is using [`poetry`](https://python-poetry.org/docs/#installation). After installing poetry, run
```
poetry install
```
to install all dependencies. `poetry` also enables you to run our scripts using
```
poetry run SCRIPT_NAME
```

Alternatively, you can use the provided `requirements.txt` file to install the dependencies using `pip` or `conda`.

# Usage

## Tokenizer

We include a pre-trained character-level tokenizer for the German language in the `data/tokenizers` directory.

If the path to the tokenizer you specified in the `config.yaml` file does not exist or is `None` (`~` in YAML), a new tokenizer will be trained on the training data.
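
For intuition, a character-level tokenizer just maps each character to an integer ID. A minimal sketch (illustrative only; the project's actual tokenizer implementation and special tokens may differ):
```
def train_char_tokenizer(transcripts: list[str]) -> dict[str, int]:
    """Build a character-to-ID vocabulary from training transcripts."""
    specials = ["<blank>", "<unk>", " "]  # e.g. CTC blank, unknown, space
    chars = sorted({c for text in transcripts for c in text.lower() if c != " "})
    return {token: idx for idx, token in enumerate(specials + chars)}

vocab = train_char_tokenizer(["hallo welt", "guten tag"])
encoded = [vocab.get(c, vocab["<unk>"]) for c in "hallo tag"]
```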

## Decoder
There are two options for the decoder:
- greedy
- beam search with language model

The language model is a KenLM model supplied with the Multilingual LibriSpeech dataset. If you want to use a different KenLM language model, you can specify its path in the `config.yaml` file.
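
For intuition, the greedy option corresponds to standard greedy CTC-style decoding: pick the most likely token per frame, collapse repeats, and drop blanks. A minimal sketch, assuming the model emits per-frame log-probabilities of shape `(time, vocab_size)`:
```
import torch

def greedy_decode(log_probs: torch.Tensor, blank: int = 0) -> list[int]:
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    best = log_probs.argmax(dim=-1).tolist()
    out, prev = [], blank
    for idx in best:
        if idx != prev and idx != blank:
            out.append(idx)
        prev = idx
    return out
```
Beam search with a language model instead keeps several candidate transcripts per frame and rescores them with the KenLM model, trading decoding speed for accuracy.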

## Training the model

All hyperparameters can be configured in the `config.yaml` file. The main sections (see the sketch after this list) are:
- model
- training
- dataset
- tokenizer
- checkpoints
- inference
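
A hedged sketch of what such a config might look like and how it is loaded; the keys shown here are illustrative, not the project's exact schema:
```
import yaml  # PyYAML

example = """
model:
  n_cnn_layers: 3
training:
  epochs: 10
dataset:
  dataset_root_path: /data/mls_german_opus
tokenizer:
  tokenizer_path: data/tokenizers/char_tokenizer_german.json
checkpoints:
  model_load_path: ~
inference:
  model_load_path: ~
"""

config = yaml.safe_load(example)  # "~" parses to None
print(config["training"]["epochs"])  # -> 10
```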

Train using the provided train script:

    poetry run train \
    --config_path="PATH_TO_CONFIG_FILE"

You can also find our model, trained for 67 epochs on mls_german_opus, [here](https://drive.google.com/file/d/1gcgCjlCH6DjT6f7EWTx0LcYP3CCuXYP-/view?usp=sharing).

## Inference
The `config.yaml` also includes a section for inference. 
To run inference on a single audio file, run:

    poetry run recognize \
    --config_path="PATH_TO_CONFIG_FILE" \
    --file_path="PATH_TO_AUDIO_FILE" \
    --target_path="PATH_TO_TARGET_FILE"

The target path is optional. If it is not specified, the recognized text is printed to the console; otherwise, the word error rate (WER) is computed against the reference transcript in the target file.
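
For reference, the WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words (WER = (S + D + I) / N). A minimal sketch (the actual script may use a library implementation instead):
```
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("guten tag welt", "guten tags welt"))  # -> 0.333...
```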