nltk.corpus.reader.crubadan module
An NLTK interface for the n-gram statistics gathered from the corpora for each language using An Crubadan.
The data has many potential applications, but this reader was created for use in language identification.
For details about An Crubadan, this data, and its potential uses, see: http://borel.slu.edu/crubadan/index.html
- class nltk.corpus.reader.crubadan.CrubadanCorpusReader
Bases: CorpusReader
A corpus reader used to access the An Crubadan n-gram files for each language.
- __init__(root, fileids, encoding='utf8', tagset=None)
- Parameters
  - root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
  - fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
  - encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
    - A string: encoding is the encoding name for all files.
    - A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
    - A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
    - None: the file contents of all files will be processed using non-unicode byte strings.
  - tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
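The four accepted forms of the encoding argument can be illustrated with a small standalone sketch. The helper name resolve_encoding and the file names below are hypothetical and are not part of NLTK. The sketch only mirrors the lookup rules described above, and it assumes that "matches" means a match anchored at the start of the identifier (re.match).

```python
import re


def resolve_encoding(encoding, file_id):
    """Return the encoding name for file_id, or None for raw byte strings.

    Hypothetical helper that follows the documented rules; it is not
    the code NLTK itself uses.
    """
    # A string applies to every file; None means "no decoding" for every file.
    if encoding is None or isinstance(encoding, str):
        return encoding
    # A dictionary maps file identifiers to encodings; missing keys give None.
    if isinstance(encoding, dict):
        return encoding.get(file_id)
    # Otherwise treat it as a list of (regexp, encoding) tuples:
    # the first regexp that matches the identifier wins.
    for regexp, enc in encoding:
        if re.match(regexp, file_id):
            return enc
    return None


# Hypothetical file identifiers, used only for illustration.
rules = [(r".*-3grams\.txt", "utf8"), (r".*\.txt", "latin-1")]
print(resolve_encoding(rules, "en-3grams.txt"))
print(resolve_encoding(rules, "table.csv"))
```

Because the list form is order-sensitive, more specific patterns should come before general ones; in the sketch, reversing the two rules would make every .txt file resolve to latin-1.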