nltk.tokenize.SExprTokenizer

class nltk.tokenize.SExprTokenizer[source]

Bases: TokenizerI

A tokenizer that divides strings into s-expressions. An s-expression can be either:

  • a parenthesized expression, including any nested parenthesized expressions, or

  • a sequence of non-whitespace non-parenthesis characters.

For example, the string (a (b c)) d e (f) consists of four s-expressions: (a (b c)), d, e, and (f).

By default, the characters ( and ) are treated as open and close parentheses, but alternative strings may be specified.
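For instance, curly braces can be passed as the delimiter pair (the input string below is chosen for illustration; the output follows from the description above):

>>> SExprTokenizer(parens='{}').tokenize('{a b {c d}} e f {g}')
['{a b {c d}}', 'e', 'f', '{g}']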

Parameters
  • parens (str or list) – A two-element sequence specifying the open and close parentheses that should be used to find sexprs. This will typically be either a two-character string, or a list of two strings.

  • strict (bool) – If True, then raise an exception when tokenizing an ill-formed sexpr.

__init__(parens='()', strict=True)[source]
tokenize(text)[source]

Return a list of s-expressions extracted from text. For example:

>>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']

All parentheses are assumed to mark s-expressions. (No special processing is done to exclude parentheses that occur inside strings, or following backslash characters.)

If the given expression contains non-matching parentheses, the behavior of the tokenizer depends on the strict parameter to the constructor. If strict is True, a ValueError is raised. If strict is False, any unmatched close parenthesis is listed as its own s-expression, and the final partial s-expression with unmatched open parentheses is listed as its own s-expression:

>>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
['c', ')', 'd', ')', 'e', '(f (g']
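With strict=True (the default), the same ill-formed input raises a ValueError instead. A sketch of that case; the exact error message wording is an assumption, only the exception type is documented above:

>>> SExprTokenizer(strict=True).tokenize('c) d) e (f (g')
Traceback (most recent call last):
    ...
ValueError: Un-matched close paren at char 1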
Parameters
  • text (str or iter(str)) – the string to be tokenized

Return type
  iter(str)

span_tokenize(s: str) → Iterator[Tuple[int, int]][source]

Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
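A minimal sketch of what this contract means, assuming tok is a tokenizer instance that provides span_tokenize (the input string is a placeholder):

s = '(a b) (c)'
spans = list(tok.span_tokenize(s))
tokens = [s[start:end] for start, end in spans]   # each span picks the corresponding token out of s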

Return type
  Iterator[Tuple[int, int]]

Parameters
  • s (str) –

span_tokenize_sents(strings: List[str]) → Iterator[List[Tuple[int, int]]][source]

Apply self.span_tokenize() to each element of strings. I.e.:

return [self.span_tokenize(s) for s in strings]

Yield
  List[Tuple[int, int]]

Parameters
  • strings (List[str]) –

Return type
  Iterator[List[Tuple[int, int]]]
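A small usage sketch, again assuming tok is a tokenizer instance that provides span_tokenize; the sentences are placeholders:

sents = ['(a b) (c)', 'd (e f)']
for sent, spans in zip(sents, tok.span_tokenize_sents(sents)):
    print([sent[start:end] for start, end in spans])   # the tokens of each sentence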

tokenize_sents(strings: List[str]) → List[List[str]][source]

Apply self.tokenize() to each element of strings. I.e.:

return [self.tokenize(s) for s in strings]

Return type
  List[List[str]]

Parameters
  • strings (List[str]) –
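For example (the input strings here are illustrative; each inner list is what tokenize() returns for the corresponding string, as documented above):

>>> SExprTokenizer().tokenize_sents(['(a b) (c)', 'd (e (f g))'])
[['(a b)', '(c)'], ['d', '(e (f g))']]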