nltk.probability.FreqDist
- class nltk.probability.FreqDist
Bases: Counter
A frequency distribution for the outcomes of an experiment. A frequency distribution records the number of times each outcome of an experiment has occurred. For example, a frequency distribution could be used to record the frequency of each word type in a document. Formally, a frequency distribution can be defined as a function mapping from each sample to the number of times that sample occurred as an outcome.
Frequency distributions are generally constructed by running a number of experiments, and incrementing the count for a sample every time it is an outcome of an experiment. For example, the following code will produce a frequency distribution that encodes how often each word occurs in a text:
>>> from nltk.tokenize import word_tokenize
>>> from nltk.probability import FreqDist
>>> sent = 'This is an example sentence'
>>> fdist = FreqDist()
>>> for word in word_tokenize(sent):
...     fdist[word.lower()] += 1
An equivalent way to do this is with the initializer:
>>> fdist = FreqDist(word.lower() for word in word_tokenize(sent))
- __init__(samples=None)
Construct a new frequency distribution. If samples is given, then the frequency distribution will be initialized with the count of each object in samples; otherwise, it will be initialized to be empty.
In particular, FreqDist() returns an empty frequency distribution; and FreqDist(samples) first creates an empty frequency distribution, and then calls update with the list samples.
- Parameters
samples (Sequence) – The samples to initialize the frequency distribution with.
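Because FreqDist derives from Counter, the equivalence described above can be sketched with the standard-library base class alone. The `samples` list is made up for illustration, and the assumption is that replacing `Counter` with `FreqDist` gives the same counts in this simple case.

```python
from collections import Counter

# Illustrative data only; any iterable of hashable samples would do.
samples = ["the", "cat", "saw", "the", "dog"]

# Counting directly in the constructor ...
direct = Counter(samples)

# ... matches creating an empty counter and then calling update().
stepwise = Counter()
stepwise.update(samples)

print(direct == stepwise)
print(direct["the"])
```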
- N()
Return the total number of sample outcomes that have been recorded by this FreqDist. For the number of unique sample values (or bins) with counts greater than zero, use FreqDist.B().
- Return type
int
- B()
Return the total number of sample values (or “bins”) that have counts greater than zero. For the total number of sample outcomes recorded, use FreqDist.N(). (FreqDist.B() is the same as len(FreqDist).)
- Return type
int
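The difference between N() and B() is easy to mix up, so here is a minimal sketch of the two definitions using a plain Counter as a stand-in. The variable names are illustrative; with NLTK installed, the same numbers would be expected from `N()` and `B()` on a FreqDist built from the same string.

```python
from collections import Counter

# Character counts: a=5, b=2, r=2, c=1, d=1
counts = Counter("abracadabra")

# Total number of recorded outcomes (the quantity N() is documented to return).
total_outcomes = sum(counts.values())

# Number of distinct samples (the quantity B() is documented to return).
nonzero_bins = len(counts)

print(total_outcomes, nonzero_bins)
```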
- r_Nr(bins=None)
Return the dictionary mapping r to Nr, the number of samples with frequency r, where Nr > 0.
- Parameters
bins (int) – The number of possible sample outcomes. bins is used to calculate Nr(0). In particular, Nr(0) is bins-self.B(). If bins is not specified, it defaults to self.B() (so Nr(0) will be 0).
- Return type
dict
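The r to Nr mapping can be hard to picture from the description alone. The following is a sketch that follows the documented definition, not NLTK's own code; `r_nr_sketch` is a hypothetical helper name, and storing an explicit entry for r = 0 is an assumption of this sketch.

```python
from collections import Counter, defaultdict

def r_nr_sketch(counts, bins=None):
    """Map each frequency r to the number of samples that occurred r times."""
    table = defaultdict(int)
    for count in counts.values():
        table[count] += 1
    observed = len(counts)
    # Nr(0) is bins minus the observed bins; bins defaults to the observed bins.
    table[0] = (bins if bins is not None else observed) - observed
    return dict(table)

counts = Counter("abracadabra")  # a=5, b=2, r=2, c=1, d=1
default_table = r_nr_sketch(counts)
alphabet_table = r_nr_sketch(counts, bins=26)
print(default_table)
print(alphabet_table[0])
```

With `bins=26`, the 21 letters that never occur are counted as the samples with frequency 0.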
- freq(sample)
Return the frequency of a given sample. The frequency of a sample is defined as the count of that sample divided by the total number of sample outcomes that have been recorded by this FreqDist. The count of a sample is defined as the number of times that sample outcome was recorded by this FreqDist. Frequencies are always real numbers in the range [0, 1].
- Parameters
sample (any) – the sample whose frequency should be returned.
- Return type
float
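As a worked instance of the definition above (count divided by total outcomes), the sketch below computes relative frequencies with a plain Counter. `relative_frequency` is a hypothetical helper, and returning 0.0 for an empty counter is an assumption of this sketch rather than something the text above states.

```python
from collections import Counter

counts = Counter("abracadabra")  # a=5, b=2, r=2, c=1, d=1
total = sum(counts.values())

def relative_frequency(sample):
    # Count of the sample divided by all recorded outcomes.
    # The zero guard for an empty counter is an assumption of this sketch.
    return counts[sample] / total if total else 0.0

freq_a = relative_frequency("a")
freq_unseen = relative_frequency("z")
freq_sum = sum(relative_frequency(s) for s in counts)
print(freq_a, freq_unseen, freq_sum)
```

Every value lies in [0, 1], and the frequencies of all observed samples add up to 1, up to floating-point rounding.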
- max()
Return the sample with the greatest number of outcomes in this frequency distribution. If two or more samples have the same number of outcomes, return one of them; which sample is returned is undefined. If no outcomes have occurred in this frequency distribution, return None.
- Returns
The sample with the maximum number of outcomes in this frequency distribution.
- Return type
any or None
- plot(*args, title='', cumulative=False, percents=False, show=True, **kwargs)
Plot samples from the frequency distribution, displaying the most frequent sample first. If an integer parameter is supplied, stop after this many samples have been plotted. For a cumulative plot, specify cumulative=True. Additional **kwargs are passed to matplotlib’s plot function. (Requires Matplotlib to be installed.)
- Parameters
title (str) – The title for the graph.
cumulative (bool) – Whether the plot is cumulative. (default = False)
percents (bool) – Whether the plot uses percents instead of counts. (default = False)
show (bool) – Whether to show the plot, or only return the ax.
- tabulate(*args, **kwargs)
Tabulate the given samples from the frequency distribution (optionally cumulative), displaying the most frequent sample first. If an integer parameter is supplied, stop after this many samples have been tabulated.
- Parameters
samples (list) – The samples to tabulate (default is all samples)
cumulative – A flag to specify whether the freqs are cumulative (default = False)
- pprint(maxlen=10, stream=None)
Print a string representation of this FreqDist to stream.
- Parameters
maxlen (int) – The maximum number of items to print
stream – The stream to print to. stdout by default
- pformat(maxlen=10)
Return a string representation of this FreqDist.
- Parameters
maxlen (int) – The maximum number of items to display
- Return type
string
- __new__(**kwargs)
- clear() -> None. Remove all items from D.
- elements()
Iterator over elements repeating each as many times as its count.
>>> from collections import Counter
>>> c = Counter('ABCABC')
>>> sorted(c.elements())
['A', 'A', 'B', 'B', 'C', 'C']

>>> # Knuth’s example for prime factors of 1836: 2**2 * 3**3 * 17**1
>>> prime_factors = Counter({2: 2, 3: 3, 17: 1})
>>> product = 1
>>> for factor in prime_factors.elements():  # loop over factors
...     product *= factor                    # and multiply them
>>> product
1836
Note, if an element’s count has been set to zero or is a negative number, elements() will ignore it.
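- classmethod fromkeys(iterable, v=None)
Create a new dictionary with keys from iterable and values set to v. This description is inherited from dict; Counter itself leaves fromkeys undefined and raises NotImplementedError, so use FreqDist(iterable) to build a distribution from an iterable.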
- get(key, default=None, /)
Return the value for key if key is in the dictionary, else default.
- items() -> a set-like object providing a view on D's items
- keys() -> a set-like object providing a view on D's keys
- most_common(n=None)
List the n most common elements and their counts from the most common to the least. If n is None, then list all element counts.
>>> Counter('abracadabra').most_common(3)
[('a', 5), ('b', 2), ('r', 2)]
- pop(k[, d]) -> v, remove specified key and return the corresponding value.
If the key is not found, return the default if given; otherwise, raise a KeyError.
- popitem()
Remove and return a (key, value) pair as a 2-tuple.
Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.
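The LIFO behaviour of popitem() can be illustrated with the base class. This is a sketch on a plain Counter with made-up keys; it assumes FreqDist inherits the behaviour unchanged.

```python
from collections import Counter

counts = Counter()
for word in ["alpha", "beta", "gamma"]:  # insertion order matters here
    counts[word] += 1

# The most recently inserted key comes out first.
last = counts.popitem()
second_last = counts.popitem()

counts.popitem()  # removes the remaining 'alpha' entry
try:
    counts.popitem()
    raised = False
except KeyError:
    # An empty counter has nothing left to pop.
    raised = True

print(last, second_last, raised)
```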
- subtract(iterable=None, /, **kwds)
Like dict.update() but subtracts counts instead of replacing them. Counts can be reduced below zero. Both the inputs and outputs are allowed to contain zero and negative counts.
Source can be an iterable, a dictionary, or another Counter instance.
>>> c = Counter('which')
>>> c.subtract('witch')             # subtract elements from another iterable
>>> c.subtract(Counter('watch'))    # subtract elements from another counter
>>> c['h']                          # 2 in which, minus 1 in witch, minus 1 in watch
0
>>> c['w']                          # 1 in which, minus 1 in witch, minus 1 in watch
-1
- values() -> an object providing a view on D's values