Diffstat (limited to 'readme.md')
-rw-r--r-- | readme.md | 86
1 file changed, 64 insertions, 22 deletions
@@ -1,42 +1,84 @@
 # SWR2-ASR
-Automatic speech recognition model for the seminar spoken word
-recogniton 2 (SWR2) in the summer term 2023.
+Automatic speech recognition model for the seminar "Spoken Word
+Recognition 2 (SWR2)" by Konstantin Sering in the summer term 2023.
+
+Authors:
+Silja Kasper, Marvin Borner, Philipp Merkel, Valentin Schmidt
+
+# Dataset
+We use the German [Multilingual LibriSpeech dataset](http://www.openslr.org/94/) (mls_german_opus). If the dataset is not found under the specified path, it will be downloaded automatically.
+
+If you want to train this model on custom data, this code expects a folder structure like this:
+```
+<dataset_path>
+├── <language>
+│  ├── train
+│  │  ├── transcripts.txt
+│  │  └── audio
+│  │     └── <speakerid>
+│  │        └── <bookid>
+│  │           └── <speakerid>_<bookid>_<chapterid>.opus/.flac
+│  ├── dev
+│  │  ├── transcripts.txt
+│  │  └── audio
+│  │     └── <speakerid>
+│  │        └── <bookid>
+│  │           └── <speakerid>_<bookid>_<chapterid>.opus/.flac
+│  └── test
+│     ├── transcripts.txt
+│     └── audio
+│        └── <speakerid>
+│           └── <bookid>
+│              └── <speakerid>_<bookid>_<chapterid>.opus/.flac
+```
+
 # Installation
+The preferred method of installation is using [`poetry`](https://python-poetry.org/docs/#installation). After installing poetry, run
 ```
 poetry install
 ```
+to install all dependencies. `poetry` also enables you to run our scripts using
+```
+poetry run SCRIPT_NAME
+```
+
+Alternatively, you can use the provided `requirements.txt` file to install the dependencies using `pip` or `conda`.
 
 # Usage
-## Training the tokenizer
-We use a byte pair encoding tokenizer. To train the tokenizer, run
-```
-poetry run train-bpe-tokenizer --dataset_path="DATA_PATH" --language=mls_german_opus --split=all --out_path="data/tokenizers/bpe_tokenizer_german_3000.json" --vocab_size=3000
-```
-with the desired values for `DATA_PATH` and `vocab_size`.
+## Tokenizer
 
-You can also use a character level tokenizer, which can be trained with
-```
-poetry run train-char-tokenizer --dataset_path="DATA_PATH" --language=mls_german_opus --split=all --out_path="data/tokenizers/char_tokenizer_german.txt"
-```
+We include a pre-trained character-level tokenizer for the German language in the `data/tokenizers` directory.
 
-## Training
+If the path to the tokenizer you specified in the `config.yaml` file does not exist or is None (~), a new tokenizer will be trained on the training data.
 
-Train using the provided train script:
+## Training the model
 
-    poetry run train
+All hyperparameters can be configured in the `config.yaml` file. The main sections are:
+- model
+- training
+- dataset
+- tokenizer
+- checkpoints
+- inference
 
-## Evaluation
+Train using the provided train script:
 
-## Inference
+    poetry run train \
+    --config_path="PATH_TO_CONFIG_FILE"
 
-    poetry run recognize
+## Evaluation
+Evaluation metrics are computed during training and are serialized with the checkpoints.
 
-## CI
+TODO: manual evaluation script / access to the evaluation metrics?
 
-You can use the Makefile to run these commands manually
+## Inference
+The `config.yaml` also includes a section for inference.
+To run inference on a single audio file, run:
 
-    make format
-
-    make lint
+    poetry run recognize \
+    --config_path="PATH_TO_CONFIG_FILE" \
+    --file_path="PATH_TO_AUDIO_FILE"
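
The new readme describes `config.yaml` only by its section names (`model`, `training`, `dataset`, `tokenizer`, `checkpoints`, `inference`). For orientation, a minimal sketch of such a file is shown below; the section names are taken from the readme, while every key and value inside them is an illustrative assumption, not the repository's actual schema.

```yaml
# Hypothetical config.yaml sketch. Only the six top-level section names come
# from the readme; all keys and values below are assumed placeholders.
model:
  hidden_size: 512            # assumed model hyperparameter
  num_layers: 5               # assumed model hyperparameter

training:
  batch_size: 16              # assumed
  epochs: 10                  # assumed
  learning_rate: 0.0005       # assumed

dataset:
  dataset_root_path: "/data/mls_german_opus"   # readme: downloaded automatically if missing
  language: "mls_german_opus"

tokenizer:
  tokenizer_path: ~           # readme: ~ (None) or a missing path trains a new tokenizer

checkpoints:
  checkpoints_dir: "data/checkpoints"          # readme: evaluation metrics are serialized with checkpoints

inference:
  model_load_path: "data/checkpoints/model.pt" # assumed
  device: "cpu"                                # assumed
```

A file along these lines would then be passed to both entry points via `--config_path`, e.g. `poetry run train --config_path="config.yaml"` or `poetry run recognize --config_path="config.yaml" --file_path="PATH_TO_AUDIO_FILE"`.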