nltk.tokenize.api module¶
Tokenizer Interface
- class nltk.tokenize.api.TokenizerI[source]¶
Bases:
ABC
A processing interface for tokenizing a string. Subclasses must define
tokenize()
or
tokenize_sents()
(or both).
- abstract tokenize(s: str) → List[str] [source]¶
Return a tokenized copy of s.
- Return type
List[str]
- Parameters
s (str) –
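The contract above can be sketched with the standard library alone: an abstract base class declares tokenize(), and a concrete subclass must implement it before it can be instantiated. The class names below are illustrative stand-ins, not NLTK's own; the real interface is nltk.tokenize.api.TokenizerI.

```python
# Stdlib-only sketch of the TokenizerI contract; names are illustrative.
from abc import ABC, abstractmethod
from typing import List


class SketchTokenizerI(ABC):
    """A processing interface for tokenizing a string."""

    @abstractmethod
    def tokenize(self, s: str) -> List[str]:
        """Return a tokenized copy of s."""


class WhitespaceSketchTokenizer(SketchTokenizerI):
    # Concrete subclass: satisfies the contract by splitting on whitespace.
    def tokenize(self, s: str) -> List[str]:
        return s.split()


print(WhitespaceSketchTokenizer().tokenize("A tokenized copy of s."))
# → ['A', 'tokenized', 'copy', 'of', 's.']
```

Instantiating SketchTokenizerI directly raises TypeError, which is how the abstract-method requirement is enforced at runtime.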
- span_tokenize(s: str) → Iterator[Tuple[int, int]] [source]¶
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
- Return type
Iterator[Tuple[int, int]]
- Parameters
s (str) –
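The span semantics can be illustrated with a stdlib-only sketch: yield one (start_i, end_i) pair per token so that slicing the original string with each pair recovers that token. The regex-based whitespace splitting is an assumption for illustration, not NLTK's implementation.

```python
# Sketch of span_tokenize semantics: each yielded (start_i, end_i) pair
# satisfies s[start_i:end_i] == token. Illustrative, not NLTK's code.
import re
from typing import Iterator, Tuple


def span_tokenize(s: str) -> Iterator[Tuple[int, int]]:
    # One span per maximal run of non-whitespace characters.
    for m in re.finditer(r"\S+", s):
        yield m.start(), m.end()


s = "Good muffins cost $3.88"
spans = list(span_tokenize(s))
# Slicing with each span recovers the corresponding token.
assert [s[a:b] for a, b in spans] == s.split()
print(spans)
# → [(0, 4), (5, 12), (13, 17), (18, 23)]
```

Offsets are useful when the caller needs to map tokens back to positions in the original text (e.g. for highlighting), which plain tokenize() cannot do.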
- class nltk.tokenize.api.StringTokenizer[source]¶
Bases:
TokenizerI
A tokenizer that divides a string into substrings by splitting on the specified string (defined in subclasses).
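A stdlib-only sketch of this pattern: the base class implements tokenize() by splitting on a string attribute that each subclass defines. The attribute name `_string` and the class names here are assumptions for illustration, not a guaranteed mirror of NLTK's internals.

```python
# Sketch of the StringTokenizer pattern: subclasses define the string
# to split on; the base class does the splitting. Illustrative names.
from typing import List


class SketchStringTokenizer:
    _string: str  # defined in subclasses (assumed attribute name)

    def tokenize(self, s: str) -> List[str]:
        return s.split(self._string)


class LineSketchTokenizer(SketchStringTokenizer):
    _string = "\n"  # divide the input at newline boundaries


print(LineSketchTokenizer().tokenize("one\ntwo\nthree"))
# → ['one', 'two', 'three']
```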