nltk.tokenize.sexpr module¶
S-Expression Tokenizer
SExprTokenizer is used to find parenthesized expressions in a
string. In particular, it divides a string into a sequence of
substrings that are either parenthesized expressions (including any
nested parenthesized expressions), or other whitespace-separated
tokens.
>>> from nltk.tokenize import SExprTokenizer
>>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']
By default, SExprTokenizer will raise a ValueError exception if
used to tokenize an expression with non-matching parentheses:
>>> SExprTokenizer().tokenize('c) d) e (f (g')
Traceback (most recent call last):
...
ValueError: Un-matched close paren at char 1
The strict argument can be set to False to allow for
non-matching parentheses. Any unmatched close parentheses will be
listed as their own s-expression; and the last partial sexpr with
unmatched open parentheses will be listed as its own sexpr:
>>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
['c', ')', 'd', ')', 'e', '(f (g']
The characters used for open and close parentheses may be customized
using the parens argument to the SExprTokenizer constructor:
>>> SExprTokenizer(parens='{}').tokenize('{a b {c d}} e f {g}')
['{a b {c d}}', 'e', 'f', '{g}']
The s-expression tokenizer is also available as a function:
>>> from nltk.tokenize import sexpr_tokenize
>>> sexpr_tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']
- class nltk.tokenize.sexpr.SExprTokenizer[source]¶
Bases:
TokenizerIA tokenizer that divides strings into s-expressions. An s-expresion can be either:
a parenthesized expression, including any nested parenthesized expressions, or
a sequence of non-whitespace non-parenthesis characters.
For example, the string
(a (b c)) d e (f)consists of four s-expressions:(a (b c)),d,e, and(f).By default, the characters
(and)are treated as open and close parentheses, but alternative strings may be specified.- Parameters
parens (str or list) – A two-element sequence specifying the open and close parentheses that should be used to find sexprs. This will typically be either a two-character string, or a list of two strings.
strict – If true, then raise an exception when tokenizing an ill-formed sexpr.
- tokenize(text)[source]¶
Return a list of s-expressions extracted from text. For example:
>>> SExprTokenizer().tokenize('(a b (c d)) e f (g)') ['(a b (c d))', 'e', 'f', '(g)']
All parentheses are assumed to mark s-expressions. (No special processing is done to exclude parentheses that occur inside strings, or following backslash characters.)
If the given expression contains non-matching parentheses, then the behavior of the tokenizer depends on the
strictparameter to the constructor. IfstrictisTrue, then raise aValueError. IfstrictisFalse, then any unmatched close parentheses will be listed as their own s-expression; and the last partial s-expression with unmatched open parentheses will be listed as its own s-expression:>>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g') ['c', ')', 'd', ')', 'e', '(f (g']
- Parameters
text (str or iter(str)) – the string to be tokenized
- Return type
iter(str)
- nltk.tokenize.sexpr.sexpr_tokenize(text)[source]¶
Return a list of s-expressions extracted from text. For example:
>>> SExprTokenizer().tokenize('(a b (c d)) e f (g)') ['(a b (c d))', 'e', 'f', '(g)']
All parentheses are assumed to mark s-expressions. (No special processing is done to exclude parentheses that occur inside strings, or following backslash characters.)
If the given expression contains non-matching parentheses, then the behavior of the tokenizer depends on the
strictparameter to the constructor. IfstrictisTrue, then raise aValueError. IfstrictisFalse, then any unmatched close parentheses will be listed as their own s-expression; and the last partial s-expression with unmatched open parentheses will be listed as its own s-expression:>>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g') ['c', ')', 'd', ')', 'e', '(f (g']
- Parameters
text (str or iter(str)) – the string to be tokenized
- Return type
iter(str)