author Pherkel 2023-08-20 15:50:36 +0200
committer GitHub 2023-08-20 15:50:36 +0200
commit 14ceeb5ad36beea2f05214aa26260cdd1d86590b
tree 891cedeb665913af1a078a3778afffbccd37bae7 /readme.md
parent f88c9afc6e9efcb6f79a959779114095c23e0cef
parent 899a5e1cd7ca9b0601ed64ca3157e2052dd3e669
Merge pull request #22 from Algo-Boys/tokenizer
Tokenizer
Diffstat (limited to 'readme.md')
-rw-r--r-- readme.md 11
1 files changed, 11 insertions, 0 deletions
diff --git a/readme.md b/readme.md
index 47d9a31..795283b 100644
--- a/readme.md
+++ b/readme.md
@@ -10,6 +10,17 @@ poetry install
# Usage
+## Training the tokenizer
+We use a byte pair encoding tokenizer. To train the tokenizer, run
+```
+poetry run train-bpe-tokenizer --dataset_path="DATA_PATH" --language=mls_german_opus --split=all --out_path="data/tokenizers/bpe_tokenizer_german_3000.json" --vocab_size=3000
+```
+with the desired values for `DATA_PATH` and `--vocab_size`.
+
+You can also use a character-level tokenizer, which can be trained with
+```
+poetry run train-char-tokenizer --dataset_path="DATA_PATH" --language=mls_german_opus --split=all --out_path="data/tokenizers/char_tokenizer_german.txt"
+```
## Training
Train using the provided train script: