aboutsummaryrefslogtreecommitdiff
path: root/readme.md
diff options
context:
space:
mode:
Diffstat (limited to 'readme.md')
-rw-r--r--readme.md11
1 files changed, 11 insertions, 0 deletions
diff --git a/readme.md b/readme.md
index 47d9a31..795283b 100644
--- a/readme.md
+++ b/readme.md
@@ -10,6 +10,17 @@ poetry install
# Usage
+## Training the tokenizer
+We use a byte pair encoding tokenizer. To train the tokenizer, run
+```
+poetry run train-bpe-tokenizer --dataset_path="DATA_PATH" --language=mls_german_opus --split=all --out_path="data/tokenizers/bpe_tokenizer_german_3000.json" --vocab_size=3000
+```
+with the desired values for `DATA_PATH` and `vocab_size`.
+
+You can also use a character level tokenizer, which can be trained with
+```
+poetry run train-char-tokenizer --dataset_path="DATA_PATH" --language=mls_german_opus --split=all --out_path="data/tokenizers/char_tokenizer_german.txt"
+```
## Training
Train using the provided train script: