nltk.downloader module
The NLTK corpus and module downloader. This module defines several interfaces which can be used to download corpora, models, and other data packages that can be used with NLTK.
Downloading Packages
If called with no arguments, download() will display an interactive
interface which can be used to download and install new packages.
If Tkinter is available, then a graphical interface will be shown;
otherwise, a simple text interface will be provided.
Individual packages can be downloaded by calling the download()
function with a single argument, giving the package identifier for the
package that should be downloaded:
>>> download('treebank')
[nltk_data] Downloading package 'treebank'...
[nltk_data] Unzipping corpora/treebank.zip.
NLTK also provides a number of “package collections”, consisting of
a group of related packages. To download all packages in a
collection, simply call download() with the collection’s
identifier:
>>> download('all-corpora')
[nltk_data] Downloading package 'abc'...
[nltk_data] Unzipping corpora/abc.zip.
[nltk_data] Downloading package 'alpino'...
[nltk_data] Unzipping corpora/alpino.zip.
...
[nltk_data] Downloading package 'words'...
[nltk_data] Unzipping corpora/words.zip.
Download Directory
By default, packages are installed either in a system-wide directory
(if Python has sufficient access to write to it) or in the current
user’s home directory. However, the download_dir argument may be
used to specify a different installation target, if desired.
See Downloader.default_download_dir() for a more detailed
description of how the default download directory is chosen.
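As a rough illustration of the download_dir argument, the sketch below installs one package into a caller-chosen directory and then adds that directory to nltk.data.path so that later lookups can find it. The helper name install_package is an assumption made for this example, not part of NLTK. The import is deferred so that defining the helper needs neither NLTK nor network access.

```python
import os


def install_package(package_id, target_dir):
    """Download one package into target_dir (hypothetical helper)."""
    # Deferred import: NLTK is only required when the helper is called.
    import nltk

    os.makedirs(target_dir, exist_ok=True)
    ok = nltk.download(package_id, download_dir=target_dir, quiet=True)
    # Make the custom directory visible to NLTK's data loader.
    if target_dir not in nltk.data.path:
        nltk.data.path.append(target_dir)
    return ok
```

A call such as install_package('treebank', '/tmp/my_nltk_data') would then place the package under that directory; the path shown is only an example.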
NLTK Download Server
Before downloading any packages, the corpus and module downloader
contacts the NLTK download server to retrieve an index file
describing the available packages. By default, this index file is
loaded from https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml.
If necessary, it is possible to create a new Downloader object,
specifying a different URL for the package index file.
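A minimal sketch of pointing a Downloader at a different index file follows. It assumes the constructor accepts server_index_url and download_dir keyword arguments; the helper name make_downloader is hypothetical. The import is deferred so the definition itself does not require NLTK.

```python
def make_downloader(index_url, download_dir=None):
    """Build a Downloader that reads its package index from index_url."""
    # Deferred import: NLTK is only required when the helper is called.
    from nltk.downloader import Downloader

    # Assumption: these keyword names match the Downloader constructor.
    return Downloader(server_index_url=index_url, download_dir=download_dir)
```

The returned object exposes the same download() method as the module-level function, so a mirror can be used by calling make_downloader(mirror_url).download('treebank'), where mirror_url is whatever index location applies.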
Usage:
python nltk/downloader.py [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS
or:
python -m nltk.downloader [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS
- class nltk.downloader.Package[source]¶
Bases: object
A directory entry for a downloadable package. These entries are extracted from the XML index file that is downloaded by Downloader. Each package consists of a single file; but if that file is a zip file, then it can be automatically decompressed when the package is installed.
- __init__(id, url, name=None, subdir='', size=None, unzipped_size=None, checksum=None, svn_revision=None, copyright='Unknown', contact='Unknown', license='Unknown', author='Unknown', unzip=True, **kw)[source]¶
- id¶
A unique identifier for this package.
- name¶
A string name for this package.
- subdir¶
The subdirectory where this package should be installed. E.g., 'corpora' or 'taggers'.
- url¶
A URL that can be used to download this package’s file.
- size¶
The filesize (in bytes) of the package file.
- unzipped_size¶
The total filesize of the files contained in the package’s zipfile.
- checksum¶
The MD5 checksum of the package file.
- svn_revision¶
A subversion revision number for this package.
- copyright¶
Copyright holder for this package.
- contact¶
Name & email of the person who should be contacted with questions about this package.
- license¶
License information for this package.
- author¶
Author of this package.
- filename¶
The filename that should be used for this package’s file. It is formed by joining self.subdir with self.id, and using the same extension as url.
- unzip¶
A flag indicating whether this corpus should be unzipped by default.
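The attributes above can be illustrated with a small sketch that reads package entries from an index and derives filename as described: subdir joined with id, plus the extension taken from url. The sample XML is a simplified, made-up index (element layout and URL are assumptions), and read_packages and package_filename are hypothetical helpers, not NLTK functions.

```python
import os
import xml.etree.ElementTree as ET

# Simplified, made-up index used only for this example.
SAMPLE_INDEX = """<nltk_data>
  <packages>
    <package id="treebank" name="Penn Treebank Sample" subdir="corpora"
             url="https://example.org/packages/corpora/treebank.zip"
             size="1000" unzip="1"/>
  </packages>
</nltk_data>"""


def package_filename(subdir, package_id, url):
    """Join subdir with the id, reusing the extension of the URL's last segment."""
    ext = os.path.splitext(url.rsplit("/", 1)[-1])[1]
    return os.path.join(subdir, package_id + ext)


def read_packages(index_xml):
    """Return one attribute dict per <package> element, with 'filename' added."""
    root = ET.fromstring(index_xml)
    entries = []
    for elem in root.iter("package"):
        attrs = dict(elem.attrib)
        attrs["filename"] = package_filename(
            attrs.get("subdir", ""), attrs["id"], attrs["url"]
        )
        entries.append(attrs)
    return entries
```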
- class nltk.downloader.Collection[source]¶
Bases: object
A directory entry for a collection of downloadable packages. These entries are extracted from the XML index file that is downloaded by Downloader.
- id¶
A unique identifier for this collection.
- name¶
A string name for this collection.
- children¶
A list of the Collections or Packages directly contained by this collection.
- packages¶
A list of Packages contained by this collection or any collections it recursively contains.
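The difference between children (direct members) and packages (all packages reachable through nested collections) can be sketched with stand-in classes. PackageEntry, CollectionEntry, and flatten_packages are hypothetical names for this example. Skipping a package that was already seen is a choice made in the sketch, not confirmed NLTK behaviour.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class PackageEntry:
    id: str


@dataclass
class CollectionEntry:
    id: str
    children: List[Union["CollectionEntry", PackageEntry]] = field(
        default_factory=list
    )


def flatten_packages(collection):
    """Return every package reachable from collection, depth-first, without repeats."""
    result = []
    seen = set()

    def visit(node):
        for child in node.children:
            if isinstance(child, CollectionEntry):
                visit(child)  # descend into nested collections
            elif child.id not in seen:
                seen.add(child.id)
                result.append(child)

    visit(collection)
    return result
```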
- class nltk.downloader.DownloaderMessage[source]¶
Bases: object
A status message object, used by incr_download to communicate its progress.
- class nltk.downloader.StartCollectionMessage[source]¶
Bases: DownloaderMessage
Data server has started working on a collection of packages.
- class nltk.downloader.FinishCollectionMessage[source]¶
Bases: DownloaderMessage
Data server has finished working on a collection of packages.
- class nltk.downloader.StartPackageMessage[source]¶
Bases: DownloaderMessage
Data server has started working on a package.
- class nltk.downloader.FinishPackageMessage[source]¶
Bases: DownloaderMessage
Data server has finished working on a package.
- class nltk.downloader.StartDownloadMessage[source]¶
Bases: DownloaderMessage
Data server has started downloading a package.
- class nltk.downloader.FinishDownloadMessage[source]¶
Bases: DownloaderMessage
Data server has finished downloading a package.
- class nltk.downloader.StartUnzipMessage[source]¶
Bases: DownloaderMessage
Data server has started unzipping a package.
- class nltk.downloader.FinishUnzipMessage[source]¶
Bases: DownloaderMessage
Data server has finished unzipping a package.
- class nltk.downloader.UpToDateMessage[source]¶
Bases: DownloaderMessage
The package download file is already up-to-date.
- class nltk.downloader.StaleMessage[source]¶
Bases: DownloaderMessage
The package download file is out-of-date or corrupt.
- class nltk.downloader.ErrorMessage[source]¶
Bases: DownloaderMessage
Data server encountered an error.
- class nltk.downloader.ProgressMessage[source]¶
Bases: DownloaderMessage
Indicates how much progress the data server has made.
- class nltk.downloader.SelectDownloadDirMessage[source]¶
Bases: DownloaderMessage
Indicates what download directory the data server is using.
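One way a caller might consume such a stream of status messages is to dispatch on the message class. The sketch below uses stand-in classes that reuse the names listed above so that it runs without NLTK. Their constructor arguments and attributes (progress, package, message) are assumptions made for this example, and summarize is a hypothetical helper.

```python
class DownloaderMessage:
    """Stand-in base class for this example only."""


class ProgressMessage(DownloaderMessage):
    def __init__(self, progress):
        self.progress = progress  # assumed attribute: completion value


class ErrorMessage(DownloaderMessage):
    def __init__(self, package, message):
        self.package = package  # assumed attribute: affected package
        self.message = message  # assumed attribute: error text


def summarize(messages):
    """Return the last progress value seen and the list of error texts."""
    last_progress = None
    errors = []
    for msg in messages:
        if isinstance(msg, ProgressMessage):
            last_progress = msg.progress
        elif isinstance(msg, ErrorMessage):
            errors.append(msg.message)
        # Other message types are ignored in this sketch.
    return last_progress, errors
```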
- class nltk.downloader.Downloader[source]¶
Bases: object
A class used to access the NLTK data server, which can be used to download corpora and other data packages.
- INDEX_TIMEOUT = 3600¶
The amount of time, in seconds, after which the cached copy of the data server index will be considered ‘stale,’ and will be re-downloaded.
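The staleness rule can be sketched as a simple age check against the timeout. This is an illustration only: index_is_stale is a hypothetical helper, and treating a missing timestamp as stale is an assumption of the sketch.

```python
import time

INDEX_TIMEOUT = 3600  # same value as the class attribute above


def index_is_stale(fetched_at, now=None, timeout=INDEX_TIMEOUT):
    """Return True if the cached index is older than timeout or was never fetched."""
    if fetched_at is None:
        return True  # nothing cached yet
    if now is None:
        now = time.time()
    return (now - fetched_at) > timeout
```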
- DEFAULT_URL = 'https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml'¶
The default URL for the NLTK data server’s index. An alternative URL can be specified when creating a new Downloader object.
- INSTALLED = 'installed'¶
A status string indicating that a package or collection is installed and up-to-date.
- NOT_INSTALLED = 'not installed'¶
A status string indicating that a package or collection is not installed.
- STALE = 'out of date'¶
A status string indicating that a package or collection is corrupt or out-of-date.
- PARTIAL = 'partial'¶
A status string indicating that a collection is partially installed (i.e., only some of its packages are installed).
- list(download_dir=None, show_packages=True, show_collections=True, header=True, more_prompt=False, skip_installed=False)[source]¶
- download(info_or_id=None, download_dir=None, quiet=False, force=False, prefix='[nltk_data] ', halt_on_error=True, raise_on_error=False, print_error_to=sys.stderr)[source]¶
- status(info_or_id, download_dir=None)[source]¶
Return a constant describing the status of the given package or collection. Status can be one of INSTALLED, NOT_INSTALLED, STALE, or PARTIAL.
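For a collection, the status has to be derived from the statuses of its packages. The sketch below shows one plausible way to combine them, consistent with the description of PARTIAL above. The precedence rules (STALE first, and an empty collection counting as installed) are assumptions of this example, and collection_status is a hypothetical helper.

```python
INSTALLED = "installed"
NOT_INSTALLED = "not installed"
STALE = "out of date"
PARTIAL = "partial"


def collection_status(package_statuses):
    """Combine per-package status strings into one collection status."""
    statuses = set(package_statuses)
    if STALE in statuses:
        return STALE  # any stale member marks the whole collection
    if PARTIAL in statuses:
        return PARTIAL  # a nested collection is itself partial
    if INSTALLED in statuses and NOT_INSTALLED in statuses:
        return PARTIAL  # only some packages are installed
    if NOT_INSTALLED in statuses:
        return NOT_INSTALLED
    return INSTALLED
```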
- index()[source]¶
Return the XML index describing the packages available from the data server. If necessary, this index will be downloaded from the data server.
- property url¶
The URL for the data server’s index file.
- default_download_dir()[source]¶
Return the directory to which packages will be downloaded by default. This value can be overridden using the constructor, or on a case-by-case basis using the download_dir argument when calling download().
On Windows, the default download directory is PYTHONHOME/lib/nltk, where PYTHONHOME is the directory containing Python, e.g. C:\Python25.
On all other platforms, the default directory is the first of the following which exists or which can be created with write permission: /usr/share/nltk_data, /usr/local/share/nltk_data, /usr/lib/nltk_data, /usr/local/lib/nltk_data, ~/nltk_data.
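The "first directory which exists or can be created with write permission" rule for non-Windows platforms can be sketched as follows. This is not the library's implementation: first_usable_dir is a hypothetical helper, the candidate list is passed in by the caller, and the function creates a missing directory as a side effect when it can.

```python
import os


def first_usable_dir(candidates):
    """Return the first candidate that is writable or can be created, else None."""
    for path in candidates:
        path = os.path.expanduser(path)
        if os.path.isdir(path):
            if os.access(path, os.W_OK):
                return path
            continue  # exists but is not writable: try the next one
        try:
            os.makedirs(path)
        except OSError:
            continue  # cannot be created here: try the next one
        return path
    return None
```

With the list given above as input, a user without administrative rights would typically fall through to the last entry, ~/nltk_data.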
- property download_dir¶
The default directory to which packages will be downloaded. This defaults to the value returned by default_download_dir(). To override this default on a case-by-case basis, use the download_dir argument when calling download().
- class nltk.downloader.DownloaderGUI[source]¶
Bases: object
Graphical interface for downloading packages from the NLTK data server.
- COLUMNS = ['', 'Identifier', 'Name', 'Size', 'Status', 'Unzipped Size', 'Copyright', 'Contact', 'License', 'Author', 'Subdir', 'Checksum']¶
A list of the names of columns. This controls the order in which the columns will appear. If this is edited, then _package_to_columns() may need to be edited to match.
- COLUMN_WEIGHTS = {'': 0, 'Name': 5, 'Size': 0, 'Status': 0}¶
A dictionary specifying how columns should be resized when the table is resized. Columns with weight 0 will not be resized at all; and columns with high weight will be resized more. Default weight (for columns not explicitly listed) is 1.
- COLUMN_WIDTHS = {'': 1, 'Identifier': 20, 'Name': 45, 'Size': 10, 'Status': 12, 'Unzipped Size': 10}¶
A dictionary specifying how wide each column should be, in characters. The default width (for columns not explicitly listed) is specified by DEFAULT_COLUMN_WIDTH.
- DEFAULT_COLUMN_WIDTH = 30¶
The default width for columns that are not explicitly listed in COLUMN_WIDTHS.
- INITIAL_COLUMNS = ['', 'Identifier', 'Name', 'Size', 'Status']¶
The set of columns that should be displayed by default.
- HELP = 'This tool can be used to download a variety of corpora and models\nthat can be used with NLTK. Each corpus or model is distributed\nin a single zip file, known as a "package file." You can\ndownload packages individually, or you can download pre-defined\ncollections of packages.\n\nWhen you download a package, it will be saved to the "download\ndirectory." A default download directory is chosen when you run\n\nthe downloader; but you may also select a different download\ndirectory. On Windows, the default download directory is\n\n\n"package."\n\nThe NLTK downloader can be used to download a variety of corpora,\nmodels, and other data packages.\n\nKeyboard shortcuts::\n [return]\t Download\n [up]\t Select previous package\n [down]\t Select next package\n [left]\t Select previous tab\n [right]\t Select next tab\n'¶
- c = 'Status'¶
- nltk.downloader.md5_hexdigest(file)[source]¶
Calculate and return the MD5 checksum for a given file. file may either be a filename or an open stream.
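A minimal standard-library sketch of the same idea, accepting either a filename or an open binary stream, is shown below. The name md5_of and the chunk size are choices made for this example and are not taken from NLTK.

```python
import hashlib
import os


def _md5_stream(stream, chunk_size):
    """Hash a binary stream in fixed-size chunks to keep memory use bounded."""
    digest = hashlib.md5()
    for block in iter(lambda: stream.read(chunk_size), b""):
        digest.update(block)
    return digest.hexdigest()


def md5_of(file, chunk_size=65536):
    """Return the MD5 hex digest of a file given by name or as an open binary stream."""
    if isinstance(file, (str, bytes, os.PathLike)):
        with open(file, "rb") as stream:
            return _md5_stream(stream, chunk_size)
    return _md5_stream(file, chunk_size)
```

Comparing this value with a package's checksum attribute is one way to detect a corrupt or out-of-date download.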
- nltk.downloader.unzip(filename, root, verbose=True)[source]¶
Extract the contents of the zip file filename into the directory root.
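The extraction step can be sketched with the standard zipfile module. This is an illustration rather than NLTK's own code: extract_zip is a hypothetical helper, and returning the list of archive members is a choice made for the example.

```python
import os
import zipfile


def extract_zip(filename, root):
    """Extract every member of the zip file into root and return the member names."""
    os.makedirs(root, exist_ok=True)
    with zipfile.ZipFile(filename) as archive:
        archive.extractall(root)
        return archive.namelist()
```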
- nltk.downloader.build_index(root, base_url)[source]¶
Create a new data.xml index file, by combining the xml description files for various packages and collections.
root should be the path to a directory containing the package xml and zip files; and the collection xml files. The root directory is expected to have the following subdirectories:
    root/
      packages/ .................. subdirectory for packages
        corpora/ ................. zip & xml files for corpora
        grammars/ ................ zip & xml files for grammars
        taggers/ ................. zip & xml files for taggers
        tokenizers/ .............. zip & xml files for tokenizers
        etc.
      collections/ ............... xml files for collections
For each package, there should be two files: package.zip (where package is the package name), which contains the package itself as a compressed zip file; and package.xml, which is an xml description of the package. The zipfile package.zip should expand to a single subdirectory named package/. The base filename package must match the identifier given in the package’s xml file.
For each collection, there should be a single file collection.xml describing the collection, where collection is the name of the collection.
All identifiers (for both packages and collections) must be unique.
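Before building an index, it can help to check that the directory follows the layout described above. The sketch below reports package zip files that have no matching xml description beside them. It checks only that one rule, and missing_descriptions is a hypothetical helper, not an NLTK function.

```python
import os


def missing_descriptions(root):
    """Return paths of package zip files under root/packages with no sibling .xml file."""
    missing = []
    packages_dir = os.path.join(root, "packages")
    for dirpath, _dirnames, filenames in os.walk(packages_dir):
        for name in filenames:
            base, ext = os.path.splitext(name)
            if ext == ".zip" and base + ".xml" not in filenames:
                missing.append(os.path.join(dirpath, name))
    return sorted(missing)
```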