sudachipy_japanese_tokenization_with_multi_granular_split_modes.py
Tokenize Japanese text using SudachiPy with the sudachidict-core dictionary.
from sudachipy import dictionary
from sudachipy import tokenizer

tokenizer_obj = dictionary.Dictionary().create()

# Multi-granular tokenization: Sudachi supports three split modes
# (Mode.A: short units, Mode.B: middle units, Mode.C: named-entity units)
mode = tokenizer.Tokenizer.SplitMode.C
txt = "外国人参政権"

print([m.surface() for m in tokenizer_obj.tokenize(txt, mode)])
# Output: ['外国人参政権']

mode = tokenizer.Tokenizer.SplitMode.A
print([m.surface() for m in tokenizer_obj.tokenize(txt, mode)])
# Output: ['外国', '人', '参政', '権']