From 0a629be04f7a27c531b671921e1a445de34895b4 Mon Sep 17 00:00:00 2001
From: Pherkel
Date: Sun, 20 Aug 2023 13:11:58 +0200
Subject: added tokenizer training

---
 readme.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/readme.md b/readme.md
index 47d9a31..795283b 100644
--- a/readme.md
+++ b/readme.md
@@ -10,6 +10,17 @@ poetry install
 
 # Usage
 
+## Training the tokenizer
+We use a byte pair encoding tokenizer. To train the tokenizer, run
+```
+poetry run train-bpe-tokenizer --dataset_path="DATA_PATH" --language=mls_german_opus --split=all --out_path="data/tokenizers/bpe_tokenizer_german_3000.json" --vocab_size=3000
+```
+with the desired values for `DATA_PATH` and `vocab_size`.
+
+You can also use a character-level tokenizer, which can be trained with
+```
+poetry run train-char-tokenizer --dataset_path="DATA_PATH" --language=mls_german_opus --split=all --out_path="data/tokenizers/char_tokenizer_german.txt"
+```
 ## Training
 
 Train using the provided train script:
-- 
cgit v1.2.3
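For readers unfamiliar with byte pair encoding, the following is a minimal, self-contained sketch of what a command like `train-bpe-tokenizer` does conceptually: it repeatedly merges the most frequent adjacent symbol pair in the corpus until the desired number of merge rules is learned. The function name and corpus here are illustrative, not the project's actual implementation.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words (toy illustration)."""
    # Represent each word as a tuple of symbols, starting from characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word.
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges

merges = train_bpe(["lower", "lowest", "low", "low"], num_merges=3)
print(merges)  # → [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

A real tokenizer additionally stores the resulting vocabulary (which is what the `--vocab_size` flag bounds) and applies the learned merges to unseen text at encoding time; the character-level tokenizer is the degenerate case with no merges at all.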