nltk.corpus.reader.ipipan module
- class nltk.corpus.reader.ipipan.IPIPANCorpusReader
Bases: CorpusReader
Corpus reader designed to work with the corpus created by IPI PAN. See http://korpus.pl/en/ for more details about the IPI PAN corpus.
The corpus includes information about text domain, channel and categories. You can access the possible values using domains(), channels() and categories(). You can also use this metadata to filter files, e.g.: fileids(channel='prasa'), fileids(categories='publicystyczny').
The reader supports the methods words, sents, paras and their tagged versions. You can get the part of speech instead of the full tag by giving the “simplify_tags=True” parameter, e.g.: tagged_sents(simplify_tags=True).
You can also get all disambiguated tags by specifying the parameter “one_tag=False”, e.g.: tagged_paras(one_tag=False).
You can get all tags that were assigned by a morphological analyzer by specifying the parameter “disamb_only=False”, e.g.: tagged_words(disamb_only=False).
The IPIPAN Corpus contains tags indicating if there is a space between two tokens. To add special “no space” markers, specify the parameter “append_no_space=True”, e.g.: tagged_words(append_no_space=True). As a result, wherever there should be no space between two tokens, a new pair ('', 'no-space') is inserted (for tagged data), or just '' for methods without tags.
The corpus reader can also try to append spaces between words. To enable this option, specify the parameter “append_space=True”, e.g.: words(append_space=True). As a result, either ' ' or (' ', 'space') is inserted between tokens.
By default, XML entities like &quot; and &amp; are replaced by the corresponding characters. You can turn this feature off by specifying the parameter “replace_xmlentities=False”, e.g.: words(replace_xmlentities=False).
- __init__(root, fileids)
- Parameters
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
- A string: encoding is the encoding name for all files.
- A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
- A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
- None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
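The four accepted forms of encoding described above can be illustrated with a small lookup function. This is a minimal sketch of the documented rules, not NLTK's own code: resolve_encoding is a hypothetical helper, the file identifiers are made up, and matching each regexp with re.match is an assumption.

```python
import re


def resolve_encoding(encoding, file_id):
    """Hypothetical helper: return the encoding name for file_id.

    Returns None when, per the rules above, the file contents would be
    processed as non-unicode byte strings.
    """
    # A string applies to every file; None means "no encoding" for all files.
    if encoding is None or isinstance(encoding, str):
        return encoding
    # A dictionary maps file identifiers to encodings; missing keys give None.
    if isinstance(encoding, dict):
        return encoding.get(file_id)
    # Otherwise assume a list of (regexp, encoding) tuples; the first match wins.
    for regexp, name in encoding:
        if re.match(regexp, file_id):  # assumption: anchored match via re.match
            return name
    return None


# Illustrative rules: XML files are UTF-8, everything else falls back to Latin-1.
rules = [(r".*\.xml", "utf-8"), (r".*", "latin-1")]
print(resolve_encoding(rules, "sample/morph.xml"))
print(resolve_encoding({"sample/morph.xml": "iso-8859-2"}, "other/morph.xml"))
```

Order matters in the list form: placing the catch-all pattern first would shadow every more specific rule after it.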
- class nltk.corpus.reader.ipipan.IPIPANCorpusView
Bases: StreamBackedCorpusView
- WORDS_MODE = 0
- SENTS_MODE = 1
- PARAS_MODE = 2
- __init__(filename, startpos=0, **kwargs)
Create a new corpus view, based on the file fileid, and read with block_reader. See the class documentation for more information.
- Parameters
fileid – The path to the file that is read by this corpus view. fileid can either be a string or a PathPointer.
startpos – The file position at which the view will start reading. This can be used to skip over preface sections.
encoding – The unicode encoding that should be used to read the file’s contents. If no encoding is specified, then the file’s contents will be read as a non-unicode string (i.e., a str).
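The “no space” markers that IPIPANCorpusReader emits with append_no_space=True can be used to rebuild running text from a token list. The sketch below is one possible post-processing step under stated assumptions: join_tokens is a hypothetical helper that is not part of NLTK, and the sample tokens and tags are hand-written for illustration rather than read from the corpus.

```python
def join_tokens(tokens):
    """Hypothetical helper: rebuild text from words or (word, tag) pairs.

    An empty-string token ('' or ('', 'no-space')) means the tokens on
    either side of it are written without a space between them.
    """
    pieces = []
    glue = True  # True: do not put a space before the next word
    for tok in tokens:
        # Accept both plain words and (word, tag) tuples.
        word = tok[0] if isinstance(tok, tuple) else tok
        if word == "":
            glue = True  # marker seen: suppress the next separator
            continue
        if pieces and not glue:
            pieces.append(" ")
        pieces.append(word)
        glue = False
    return "".join(pieces)


# Illustrative input in the shape described for append_no_space=True.
untagged = ["Ala", "ma", "kota", "", "."]
tagged = [("Ala", "subst"), ("ma", "fin"), ("kota", "subst"),
          ("", "no-space"), (".", "interp")]
print(join_tokens(untagged))
print(join_tokens(tagged))
```

Both calls yield the same string, since the helper looks only at the word part of each token and ignores the tag.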