
sudachipy_japanese_tokenizer_with_pos_and_normalized_form.py


Tokenizes Japanese text using SudachiPy and outputs the surface form, part-of-speech tags, and normalized form for each token.

from sudachipy import tokenizer
from sudachipy import dictionary

# Load dictionary and create tokenizer
tokenizer_obj = dictionary.Dictionary().create()

# Define text and split mode (Mode.C keeps long compounds/named entities as single units)
mode = tokenizer.Tokenizer.SplitMode.C
txt = "国家公務員"

# Tokenize
tokens = tokenizer_obj.tokenize(txt, mode)

# Print surface form, part-of-speech tags, and normalized form for each token
for m in tokens:
    print(m.surface(), m.part_of_speech(), m.normalized_form(), sep="\t")