nltk.tokenize.api module
Tokenizer Interface
- class nltk.tokenize.api.TokenizerI
Bases: ABC
A processing interface for tokenizing a string. Subclasses must define tokenize() or tokenize_sents() (or both).
- abstract tokenize(s: str) → List[str]
Return a tokenized copy of s.
- Return type: List[str]
- Parameters: s (str)
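The usual way to satisfy this interface is to subclass TokenizerI and override tokenize(). The following is a minimal sketch, illustrative only; CommaTokenizer is a hypothetical name, not part of NLTK.

```python
from typing import List

from nltk.tokenize.api import TokenizerI


class CommaTokenizer(TokenizerI):
    """Hypothetical tokenizer that splits a string on commas."""

    def tokenize(self, s: str) -> List[str]:
        # Split on commas, trim whitespace, and drop empty pieces.
        return [piece.strip() for piece in s.split(",") if piece.strip()]


print(CommaTokenizer().tokenize("red, green, blue"))
# ['red', 'green', 'blue']
```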
- span_tokenize(s: str) → Iterator[Tuple[int, int]]
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
- Return type: Iterator[Tuple[int, int]]
- Parameters: s (str)
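For a concrete illustration of the span contract, a regexp-based tokenizer such as WhitespaceTokenizer implements span_tokenize(). The snippet below is a usage sketch; the printed spans are indicative rather than exhaustive.

```python
from nltk.tokenize import WhitespaceTokenizer

s = "Good muffins cost $3.88 in New York."

for start, end in WhitespaceTokenizer().span_tokenize(s):
    # Each (start_i, end_i) pair recovers its token via slicing.
    print((start, end), s[start:end])
# (0, 4) Good
# (5, 12) muffins
# ...
```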
- class nltk.tokenize.api.StringTokenizer
Bases: TokenizerI
A tokenizer that divides a string into substrings by splitting on the specified string (defined in subclasses).
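NLTK ships subclasses such as SpaceTokenizer and TabTokenizer that fix the split string. The sketch below assumes the split string is exposed through a `_string` attribute, as those subclasses do; SemicolonTokenizer is a hypothetical example, and the attribute name may vary between NLTK versions.

```python
from nltk.tokenize.api import StringTokenizer
from nltk.tokenize.simple import SpaceTokenizer


class SemicolonTokenizer(StringTokenizer):
    """Hypothetical subclass that splits on semicolons (assumes the
    `_string` attribute used by NLTK's own StringTokenizer subclasses)."""

    _string = ";"


print(SpaceTokenizer().tokenize("Good muffins cost $3.88"))
# ['Good', 'muffins', 'cost', '$3.88']
print(SemicolonTokenizer().tokenize("a;b;c"))
# ['a', 'b', 'c']
```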