author | Pherkel | 2023-08-20 15:50:36 +0200 |
---|---|---|
committer | GitHub | 2023-08-20 15:50:36 +0200 |
commit | 14ceeb5ad36beea2f05214aa26260cdd1d86590b (patch) | |
tree | 891cedeb665913af1a078a3778afffbccd37bae7 /readme.md | |
parent | f88c9afc6e9efcb6f79a959779114095c23e0cef (diff) | |
parent | 899a5e1cd7ca9b0601ed64ca3157e2052dd3e669 (diff) |
Merge pull request #22 from Algo-Boys/tokenizer
Tokenizer
Diffstat (limited to 'readme.md')
-rw-r--r-- | readme.md | 11 |
1 files changed, 11 insertions, 0 deletions
@@ -10,6 +10,17 @@ poetry install
 # Usage
+## Training the tokenizer
+We use a byte pair encoding tokenizer. To train the tokenizer, run
+```
+poetry run train-bpe-tokenizer --dataset_path="DATA_PATH" --language=mls_german_opus --split=all --out_path="data/tokenizers/bpe_tokenizer_german_3000.json" --vocab_size=3000
+```
+with the desired values for `DATA_PATH` and `vocab_size`.
+
+You can also use a character level tokenizer, which can be trained with
+```
+poetry run train-char-tokenizer --dataset_path="DATA_PATH" --language=mls_german_opus --split=all --out_path="data/tokenizers/char_tokenizer_german.txt"
+```
 ## Training
 Train using the provided train script:
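For readers unfamiliar with the byte pair encoding mentioned in the added README section, the sketch below shows roughly how merge rules are learned: the most frequent adjacent symbol pair is fused, repeatedly, until the merge budget is used up. This is a plain-Python illustration of the general technique, not the code behind `train-bpe-tokenizer`. The function name `learn_bpe_merges` and its parameters are hypothetical, and the link between `--vocab_size` and the number of merges is an assumption.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules (hypothetical helper, illustration only)."""
    # Each distinct word becomes a tuple of single-character symbols with a frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the lexicographically smallest pair
        # so that the result is deterministic.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges
```

In a real tokenizer the final vocabulary would be the base symbols plus one entry per learned merge, which is presumably what a setting such as `--vocab_size=3000` bounds.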