nltk.parse.CoreNLPParser¶
- class nltk.parse.CoreNLPParser[source]¶
Bases:
GenericCoreNLPParser
>>> parser = CoreNLPParser(url='http://localhost:9000')
>>> next( ... parser.raw_parse('The quick brown fox jumps over the lazy dog.') ... ).pretty_print() ROOT | S _______________|__________________________ | VP | | _________|___ | | | PP | | | ________|___ | NP | | NP | ____|__________ | | _______|____ | DT JJ JJ NN VBZ IN DT JJ NN . | | | | | | | | | | The quick brown fox jumps over the lazy dog .
>>> (parse_fox, ), (parse_wolf, ) = parser.raw_parse_sents( ... [ ... 'The quick brown fox jumps over the lazy dog.', ... 'The quick grey wolf jumps over the lazy fox.', ... ] ... )
>>> parse_fox.pretty_print() ROOT | S _______________|__________________________ | VP | | _________|___ | | | PP | | | ________|___ | NP | | NP | ____|__________ | | _______|____ | DT JJ JJ NN VBZ IN DT JJ NN . | | | | | | | | | | The quick brown fox jumps over the lazy dog .
>>> parse_wolf.pretty_print() ROOT | S _______________|__________________________ | VP | | _________|___ | | | PP | | | ________|___ | NP | | NP | ____|_________ | | _______|____ | DT JJ JJ NN VBZ IN DT JJ NN . | | | | | | | | | | The quick grey wolf jumps over the lazy fox .
>>> (parse_dog, ), (parse_friends, ) = parser.parse_sents( ... [ ... "I 'm a dog".split(), ... "This is my friends ' cat ( the tabby )".split(), ... ] ... )
>>> parse_dog.pretty_print() ROOT | S _______|____ | VP | ________|___ NP | NP | | ___|___ PRP VBP DT NN | | | | I 'm a dog
>>> parse_friends.pretty_print() ROOT | S ____|___________ | VP | ___________|_____________ | | NP | | _______|_________ | | NP PRN | | _____|_______ ____|______________ NP | NP | | NP | | | ______|_________ | | ___|____ | DT VBZ PRP$ NNS POS NN -LRB- DT NN -RRB- | | | | | | | | | | This is my friends ' cat -LRB- the tabby -RRB-
>>> parse_john, parse_mary, = parser.parse_text( ... 'John loves Mary. Mary walks.' ... )
>>> parse_john.pretty_print() ROOT | S _____|_____________ | VP | | ____|___ | NP | NP | | | | | NNP VBZ NNP . | | | | John loves Mary .
>>> parse_mary.pretty_print() ROOT | S _____|____ NP VP | | | | NNP VBZ . | | | Mary walks .
Special cases
>>> next( ... parser.raw_parse( ... 'NASIRIYA, Iraq—Iraqi doctors who treated former prisoner of war ' ... 'Jessica Lynch have angrily dismissed claims made in her biography ' ... 'that she was raped by her Iraqi captors.' ... ) ... ).height() 20
>>> next( ... parser.raw_parse( ... "The broader Standard & Poor's 500 Index <.SPX> was 0.46 points lower, or " ... '0.05 percent, at 997.02.' ... ) ... ).height() 9
- parser_annotator = 'parse'¶
- accuracy(gold)[source]¶
Score the accuracy of the tagger against the gold standard. Strip the tags from the gold standard text, retag it using the tagger, then compute the accuracy score.
- Parameters
gold (list(list(tuple(str, str)))) – The list of tagged sentences to score the tagger on.
- Return type
float
- confusion(gold)[source]¶
Return a ConfusionMatrix with the tags from
gold
as the reference values, with the predictions fromtag_sents
as the predicted values.>>> from nltk.tag import PerceptronTagger >>> from nltk.corpus import treebank >>> tagger = PerceptronTagger() >>> gold_data = treebank.tagged_sents()[:10] >>> print(tagger.confusion(gold_data)) | - | | N | | O P | | N J J N N P P R R V V V V V W | | ' E C C D E I J J J M N N N O R P R B R T V B B B B B D ` | | ' , - . C D T X N J R S D N P S S P $ B R P O B D G N P Z T ` | -------+----------------------------------------------------------------------------------------------+ '' | <1> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . | , | .<15> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . | -NONE- | . . <.> . . 2 . . . 2 . . . 5 1 . . . . 2 . . . . . . . . . . . | . | . . .<10> . . . . . . . . . . . . . . . . . . . . . . . . . . . | CC | . . . . <1> . . . . . . . . . . . . . . . . . . . . . . . . . . | CD | . . . . . <5> . . . . . . . . . . . . . . . . . . . . . . . . . | DT | . . . . . .<20> . . . . . . . . . . . . . . . . . . . . . . . . | EX | . . . . . . . <1> . . . . . . . . . . . . . . . . . . . . . . . | IN | . . . . . . . .<22> . . . . . . . . . . 3 . . . . . . . . . . . | JJ | . . . . . . . . .<16> . . . . 1 . . . . 1 . . . . . . . . . . . | JJR | . . . . . . . . . . <.> . . . . . . . . . . . . . . . . . . . . | JJS | . . . . . . . . . . . <1> . . . . . . . . . . . . . . . . . . . | MD | . . . . . . . . . . . . <1> . . . . . . . . . . . . . . . . . . | NN | . . . . . . . . . . . . .<28> 1 1 . . . . . . . . . . . . . . . | NNP | . . . . . . . . . . . . . .<25> . . . . . . . . . . . . . . . . | NNS | . . . . . . . . . . . . . . .<19> . . . . . . . . . . . . . . . | POS | . . . . . . . . . . . . . . . . <1> . . . . . . . . . . . . . . | PRP | . . . . . . . . . . . . . . . . . <4> . . . . . . . . . . . . . | PRP$ | . . . . . . . . . . . . . . . . . . <2> . . . . . . . . . . . . | RB | . . . . . . . . . . . . . . . . . . . <4> . . . . . . . . . . . | RBR | . . . . . . . . . . 1 . . . . . . . . . <1> . . . . . . . . . . | RP | . . . . . . . . . . . . . . . . . . . . . <1> . . . . . . . . . | TO | . . . . . . . . . . . . . . . . . . . . . . <5> . . . . . . . . | VB | . . . . . . . . . . . . . . . . . . . . . . . <3> . . . . . . . | VBD | . . . . . . . . . . . . . 1 . . . . . . . . . . <6> . . . . . . | VBG | . . . . . . . . . . . . . 1 . . . . . . . . . . . <4> . . . . . | VBN | . . . . . . . . . . . . . . . . . . . . . . . . 1 . <4> . . . . | VBP | . . . . . . . . . . . . . . . . . . . . . . . . . . . <3> . . . | VBZ | . . . . . . . . . . . . . . . . . . . . . . . . . . . . <7> . . | WDT | . . . . . . . . 2 . . . . . . . . . . . . . . . . . . . . <.> . | `` | . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . <1>| -------+----------------------------------------------------------------------------------------------+ (row = reference; col = test)
- Parameters
gold (list(list(tuple(str, str)))) – The list of tagged sentences to run the tagger with, also used as the reference values in the generated confusion matrix.
- Return type
- evaluate_per_tag(gold, alpha=0.5, truncate=None, sort_by_count=False)[source]¶
Tabulate the recall, precision and f-measure for each tag from
gold
or from runningtag
on the tokenized sentences fromgold
.>>> from nltk.tag import PerceptronTagger >>> from nltk.corpus import treebank >>> tagger = PerceptronTagger() >>> gold_data = treebank.tagged_sents()[:10] >>> print(tagger.evaluate_per_tag(gold_data)) Tag | Prec. | Recall | F-measure -------+--------+--------+----------- '' | 1.0000 | 1.0000 | 1.0000 , | 1.0000 | 1.0000 | 1.0000 -NONE- | 0.0000 | 0.0000 | 0.0000 . | 1.0000 | 1.0000 | 1.0000 CC | 1.0000 | 1.0000 | 1.0000 CD | 0.7143 | 1.0000 | 0.8333 DT | 1.0000 | 1.0000 | 1.0000 EX | 1.0000 | 1.0000 | 1.0000 IN | 0.9167 | 0.8800 | 0.8980 JJ | 0.8889 | 0.8889 | 0.8889 JJR | 0.0000 | 0.0000 | 0.0000 JJS | 1.0000 | 1.0000 | 1.0000 MD | 1.0000 | 1.0000 | 1.0000 NN | 0.8000 | 0.9333 | 0.8615 NNP | 0.8929 | 1.0000 | 0.9434 NNS | 0.9500 | 1.0000 | 0.9744 POS | 1.0000 | 1.0000 | 1.0000 PRP | 1.0000 | 1.0000 | 1.0000 PRP$ | 1.0000 | 1.0000 | 1.0000 RB | 0.4000 | 1.0000 | 0.5714 RBR | 1.0000 | 0.5000 | 0.6667 RP | 1.0000 | 1.0000 | 1.0000 TO | 1.0000 | 1.0000 | 1.0000 VB | 1.0000 | 1.0000 | 1.0000 VBD | 0.8571 | 0.8571 | 0.8571 VBG | 1.0000 | 0.8000 | 0.8889 VBN | 1.0000 | 0.8000 | 0.8889 VBP | 1.0000 | 1.0000 | 1.0000 VBZ | 1.0000 | 1.0000 | 1.0000 WDT | 0.0000 | 0.0000 | 0.0000 `` | 1.0000 | 1.0000 | 1.0000
- Parameters
gold (list(list(tuple(str, str)))) – The list of tagged sentences to score the tagger on.
alpha (float) – Ratio of the cost of false negative compared to false positives, as used in the f-measure computation. Defaults to 0.5, where the costs are equal.
truncate (int, optional) – If specified, then only show the specified number of values. Any sorting (e.g., sort_by_count) will be performed before truncation. Defaults to None
sort_by_count (bool, optional) – Whether to sort the outputs on number of occurrences of that tag in the
gold
data, defaults to False
- Returns
A tabulated recall, precision and f-measure string
- Return type
str
- f_measure(gold, alpha=0.5)[source]¶
Compute the f-measure for each tag from
gold
or from runningtag
on the tokenized sentences fromgold
. Then, return the dictionary with mappings from tag to f-measure. The f-measure is the harmonic mean of theprecision
andrecall
, weighted byalpha
. In particular, given the precision p and recall r defined by:p = true positive / (true positive + false negative)
r = true positive / (true positive + false positive)
The f-measure is:
1/(alpha/p + (1-alpha)/r)
With
alpha = 0.5
, this reduces to:2pr / (p + r)
- Parameters
gold (list(list(tuple(str, str)))) – The list of tagged sentences to score the tagger on.
alpha (float) – Ratio of the cost of false negative compared to false positives. Defaults to 0.5, where the costs are equal.
- Returns
A mapping from tags to precision
- Return type
Dict[str, float]
- parse(sent, *args, **kwargs)[source]¶
- Returns
An iterator that generates parse trees for the sentence. When possible this list is sorted from most likely to least likely.
- Parameters
sent (list(str)) – The sentence to be parsed
- Return type
iter(Tree)
- parse_sents(sentences, *args, **kwargs)[source]¶
Parse multiple sentences.
Takes multiple sentences as a list where each sentence is a list of words. Each sentence will be automatically tagged with this CoreNLPParser instance’s tagger.
If a whitespace exists inside a token, then the token will be treated as several tokens.
- Parameters
sentences (list(list(str))) – Input sentences to parse
- Return type
iter(iter(Tree))
- parse_text(text, *args, **kwargs)[source]¶
Parse a piece of text.
The text might contain several sentences which will be split by CoreNLP.
- Parameters
text (str) – text to be split.
- Returns
an iterable of syntactic structures. # TODO: should it be an iterable of iterables?
- precision(gold)[source]¶
Compute the precision for each tag from
gold
or from runningtag
on the tokenized sentences fromgold
. Then, return the dictionary with mappings from tag to precision. The precision is defined as:p = true positive / (true positive + false negative)
- Parameters
gold (list(list(tuple(str, str)))) – The list of tagged sentences to score the tagger on.
- Returns
A mapping from tags to precision
- Return type
Dict[str, float]
- raw_parse(sentence, properties=None, *args, **kwargs)[source]¶
Parse a sentence.
Takes a sentence as a string; before parsing, it will be automatically tokenized and tagged by the CoreNLP Parser.
- Parameters
sentence (str) – Input sentence to parse
- Return type
iter(Tree)
- raw_parse_sents(sentences, verbose=False, properties=None, *args, **kwargs)[source]¶
Parse multiple sentences.
Takes multiple sentences as a list of strings. Each sentence will be automatically tokenized and tagged.
- Parameters
sentences (list(str)) – Input sentences to parse.
- Return type
iter(iter(Tree))
- raw_tag_sents(sentences)[source]¶
Tag multiple sentences.
Takes multiple sentences as a list where each sentence is a string.
- Parameters
sentences (list(str)) – Input sentences to tag
- Return type
list(list(list(tuple(str, str)))
- recall(gold) Dict[str, float] [source]¶
Compute the recall for each tag from
gold
or from runningtag
on the tokenized sentences fromgold
. Then, return the dictionary with mappings from tag to recall. The recall is defined as:r = true positive / (true positive + false positive)
- Parameters
gold (list(list(tuple(str, str)))) – The list of tagged sentences to score the tagger on.
- Returns
A mapping from tags to recall
- Return type
Dict[str, float]
- span_tokenize(s: str) Iterator[Tuple[int, int]] [source]¶
Identify the tokens using integer offsets
(start_i, end_i)
, wheres[start_i:end_i]
is the corresponding token.- Return type
Iterator[Tuple[int, int]]
- Parameters
s (str) –
- span_tokenize_sents(strings: List[str]) Iterator[List[Tuple[int, int]]] [source]¶
Apply
self.span_tokenize()
to each element ofstrings
. I.e.:return [self.span_tokenize(s) for s in strings]
- Yield
List[Tuple[int, int]]
- Parameters
strings (List[str]) –
- Return type
Iterator[List[Tuple[int, int]]]
- tag(sentence)[source]¶
Tag a list of tokens.
- Return type
list(tuple(str, str))
>>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='ner') >>> tokens = 'Rami Eid is studying at Stony Brook University in NY'.split() >>> parser.tag(tokens) [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]
>>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='pos') >>> tokens = "What is the airspeed of an unladen swallow ?".split() >>> parser.tag(tokens) [('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
- tag_sents(sentences)[source]¶
Tag multiple sentences.
Takes multiple sentences as a list where each sentence is a list of tokens.
- Parameters
sentences (list(list(str))) – Input sentences to tag
- Return type
list(list(tuple(str, str))
- tokenize(text, properties=None)[source]¶
Tokenize a string of text.
>>> parser = CoreNLPParser(url='http://localhost:9000')
>>> text = 'Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\nThanks.' >>> list(parser.tokenize(text)) ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> s = "The colour of the wall is blue." >>> list( ... parser.tokenize( ... 'The colour of the wall is blue.', ... properties={'tokenize.options': 'americanize=true'}, ... ) ... ) ['The', 'color', 'of', 'the', 'wall', 'is', 'blue', '.']