sentencepiece_train_model_encode_decode_text_quickstart.py

python

This quickstart demonstrates how to train a SentencePiece model from a tex


google/sentencepiece


import sentencepiece as spm

# Train a SentencePiece model from `botchan.txt`; this writes `m.model` and `m.vocab`.
# `m.vocab` is for reference only and is not used in segmentation.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')

# Create a segmenter instance and load the model file (m.model).
sp = spm.SentencePieceProcessor()
sp.load('m.model')

# encode: text => pieces, text => ids
print(sp.encode_as_pieces('This is a test'))
print(sp.encode_as_ids('This is a test'))

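# decode: pieces => text, ids => text
# Pieces mark word boundaries with '▁' (U+2581), not an ASCII space.
print(sp.decode_pieces(['▁This', '▁is', '▁a', '▁t', 'est']))
# The ids below depend on the vocabulary of the trained model.
print(sp.decode_ids([209, 31, 9, 435]))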
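
The piece list passed to `decode_pieces` uses '▁' (U+2581) as a visible stand-in for whitespace. As a rough illustration of why decoding can recover the original text, the sketch below joins pieces by hand. `join_pieces` is a hypothetical helper written for this note, not part of the SentencePiece API, and it assumes the default setting in which a single '▁' is prepended to the input.

```python
# Conceptual sketch only: `join_pieces` is a hypothetical helper,
# not a SentencePiece function.
WORD_BOUNDARY = "\u2581"  # '▁', the marker SentencePiece uses for whitespace


def join_pieces(pieces):
    """Concatenate pieces, map '▁' back to spaces, drop one leading space."""
    text = "".join(pieces).replace(WORD_BOUNDARY, " ")
    # Assumption: the default dummy prefix added one '▁' at the start.
    return text[1:] if text.startswith(" ") else text


print(join_pieces(["\u2581This", "\u2581is", "\u2581a", "\u2581t", "est"]))
```

In real code, prefer `sp.decode_pieces`, which also handles cases this sketch ignores, such as control symbols and unknown pieces.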