nltk.corpus.reader.semcor module
Corpus reader for the SemCor Corpus.
- class nltk.corpus.reader.semcor.SemcorCorpusReader
Bases: XMLCorpusReader
Corpus reader for the SemCor Corpus. For access to the complete XML data structure, use the xml() method. For access to simple word lists and tagged word lists, use words(), sents(), tagged_words(), and tagged_sents().
- __init__(root, fileids, wordnet, lazy=True)
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, it is converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can be specified either explicitly, as a list of strings, or implicitly, as a regular expression over file paths. The absolute path for each file is constructed by joining the reader’s root to each file name.
encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
- A string: encoding is the encoding name for all files.
- A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, the file contents are processed using non-unicode byte strings.
- A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id is the encoding value of the first tuple whose regexp matches file_id. If no tuple’s regexp matches file_id, the file contents are processed using non-unicode byte strings.
- None: the contents of all files are processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
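The lookup rules for encoding described above can be sketched in plain Python. This is only an illustration of the documented behaviour, not the library's implementation: the helper name resolve_encoding and the file identifiers are hypothetical, and the sketch assumes that a regexp is matched from the start of the file identifier.

```python
import re


def resolve_encoding(encoding, file_id):
    """Return the encoding name for file_id, or None for raw byte strings.

    Hypothetical helper that mirrors the four documented cases.
    """
    if isinstance(encoding, str):
        # A string: one encoding name for all files.
        return encoding
    if isinstance(encoding, dict):
        # A dictionary: per-file lookup; missing files fall back to bytes.
        return encoding.get(file_id)
    if isinstance(encoding, list):
        # A list: the first (regexp, encoding) tuple whose regexp matches wins.
        # Assumption: matching is anchored at the start of file_id.
        for regexp, name in encoding:
            if re.match(regexp, file_id):
                return name
        return None
    # None (or anything else): process the contents as byte strings.
    return None


# Illustrative file identifiers, not taken from the corpus.
rules = [(r"brown1/.*", "utf8"), (r"brown2/.*", "latin-1")]
print(resolve_encoding(rules, "brown2/example.xml"))
print(resolve_encoding(rules, "other/example.xml"))
```

Because the list form is order-sensitive, more specific patterns should come before catch-all patterns.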
- words(fileids=None)
- Returns
the given file(s) as a list of words and punctuation symbols.
- Return type
list(str)
- chunks(fileids=None)
- Returns
the given file(s) as a list of chunks, each of which is a list of words and punctuation symbols that form a unit.
- Return type
list(list(str))
- tagged_chunks(fileids=None, tag='pos')
- Returns
the given file(s) as a list of tagged chunks, represented in tree form.
- Return type
list(Tree)
- Parameters
tag – ‘pos’ (part of speech), ‘sem’ (semantic), or ‘both’ to indicate the kind of tags to include. Semantic tags consist of WordNet lemma IDs, plus an ‘NE’ node if the chunk is a named entity without a specific entry in WordNet. (Named entities of type ‘other’ have no lemma. Other chunks not in WordNet have no semantic tag. Punctuation tokens have None for their part of speech tag.)
- sents(fileids=None)
- Returns
the given file(s) as a list of sentences, each encoded as a list of word strings.
- Return type
list(list(str))
- chunk_sents(fileids=None)
- Returns
the given file(s) as a list of sentences, each encoded as a list of chunks.
- Return type
list(list(list(str)))
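The return types of words(), chunks(), sents(), and chunk_sents() differ only in how deeply the strings are nested. The following sketch uses made-up data shaped like the documented list(list(list(str))) to show how the shallower shapes relate to the deepest one. It does not call the reader, and the sample tokens are illustrative only.

```python
from itertools import chain

# Hypothetical data in the shape documented for chunk_sents():
# sentences -> chunks -> word strings.
chunk_sents = [
    [["The"], ["Fulton", "County", "Grand", "Jury"], ["said"]],
    [["It"], ["was"], ["conducted"], ["."]],
]

# Shape of sents(): flatten the chunks inside each sentence.
sents = [list(chain.from_iterable(sent)) for sent in chunk_sents]

# Shape of chunks(): drop the sentence level, keep the chunk level.
chunks = list(chain.from_iterable(chunk_sents))

# Shape of words(): drop both the sentence and chunk levels.
words = list(chain.from_iterable(chunks))

print(sents[0])
print(chunks[1])
print(len(words))
```

A multiword chunk such as the four-token name above stays grouped in the chunk-level shapes and is split into separate strings in the word-level shapes.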
- tagged_sents(fileids=None, tag='pos')
- Returns
the given file(s) as a list of sentences. Each sentence is represented as a list of tagged chunks (in tree form).
- Return type
list(list(Tree))
- Parameters
tag – ‘pos’ (part of speech), ‘sem’ (semantic), or ‘both’ to indicate the kind of tags to include. Semantic tags consist of WordNet lemma IDs, plus an ‘NE’ node if the chunk is a named entity without a specific entry in WordNet. (Named entities of type ‘other’ have no lemma. Other chunks not in WordNet have no semantic tag. Punctuation tokens have None for their part of speech tag.)
- class nltk.corpus.reader.semcor.SemcorSentence
Bases: list
A list of words, augmented by an attribute num used to record the sentence identifier (the n attribute from the XML).
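The pattern of a list that carries an extra identifier can be sketched as follows. NumberedSentence is a hypothetical stand-in written for illustration, not the library class; the constructor signature and the use of a string identifier are assumptions.

```python
class NumberedSentence(list):
    """A list of words that also records a sentence identifier.

    Hypothetical stand-in for illustration; the argument order is assumed.
    """

    def __init__(self, num, items):
        # Store the identifier alongside the usual list contents.
        self.num = num
        super().__init__(items)


sent = NumberedSentence("1", ["The", "jury", "said"])
print(sent.num)
print(len(sent))
print(sent == ["The", "jury", "said"])
```

Since the class derives from list, indexing, slicing, and iteration behave as for a plain list. Note that slicing returns an ordinary list, so the num attribute is not carried over to slices.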
- class nltk.corpus.reader.semcor.SemcorWordView
Bases: XMLCorpusView
A stream-backed corpus view specialized for use with the SemCor corpus.
- __init__(fileid, unit, bracket_sent, pos_tag, sem_tag, wordnet)
- Parameters
fileid – The name of the underlying file.
unit – One of ‘token’, ‘word’, or ‘chunk’.
bracket_sent – If true, include sentence bracketing.
pos_tag – Whether to include part-of-speech tags.
sem_tag – Whether to include semantic tags, namely WordNet lemma and OOV named entity status.
- handle_elt(elt, context)
Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.
- Returns
The view value corresponding to elt.
- Parameters
elt (ElementTree) – The element that should be converted.
context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.
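The context string format can be illustrated with the standard library's ElementTree. The sketch below is not how the corpus view streams its input: iter_with_context is a hypothetical helper that walks a small in-memory document and builds the slash-separated tag path for each element, and handle_elt here only reproduces the documented default of returning the element unchanged.

```python
import xml.etree.ElementTree as ET


def iter_with_context(elt, parents=()):
    """Yield (element, context) pairs in document order.

    The context is the slash-joined path of tags from the top-level
    element down to elt. Hypothetical helper for illustration.
    """
    path = parents + (elt.tag,)
    yield elt, "/".join(path)
    for child in elt:
        yield from iter_with_context(child, path)


def handle_elt(elt, context):
    # Documented default behaviour: return the element unchanged.
    return elt


root = ET.fromstring("<foo><bar><baz>text</baz></bar></foo>")
contexts = [context for _, context in iter_with_context(root)]
print(contexts)  # ['foo', 'foo/bar', 'foo/bar/baz']
```

A handler that needs only certain elements can test the context string, for example by checking whether it ends with a particular tag.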