nltk.probability.SimpleGoodTuringProbDist

class nltk.probability.SimpleGoodTuringProbDist

Bases: ProbDistI

SimpleGoodTuringProbDist approximates the relationship between a frequency and its frequency of frequency as a straight line in log space, fitted by linear regression. Details of the Simple Good-Turing algorithm can be found in:

“Good Turing smoothing without tears” (Gale & Sampson 1995), Journal of Quantitative Linguistics, vol. 2 pp. 217-237.
“Speech and Language Processing” (Jurafsky & Martin), 2nd Edition, Chapter 4.5 p103 (log(Nc) = a + b*log(c))
https://www.grsampson.net/RGoodTur.html

Given a set of pairs (xi, yi), where xi denotes the frequency and yi denotes the frequency of frequency, we want to minimize the squared deviation of the fitted line from the data. E(x) and E(y) denote the means of xi and yi.

slope: b = sigma((xi - E(x)) (yi - E(y))) / sigma((xi - E(x)) (xi - E(x)))
intercept: a = E(y) - b * E(x)
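
The slope and intercept formulas can be sketched in plain Python. This is a minimal illustration under stated assumptions, not NLTK's implementation: the function name fit_log_log and the sample data are hypothetical, and the regression is run on (log x, log y) pairs because the class description says the line is fitted in log space.

```python
import math

def fit_log_log(xs, ys):
    """Least-squares fit of log(y) = a + b * log(x); returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    # Numerator and denominator of the slope formula.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    var_x = sum((x - mean_x) ** 2 for x in log_x)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data lying exactly on y = 100 * x**-2.
counts = [1, 2, 4, 8]
freq_of_freq = [100.0, 25.0, 6.25, 1.5625]
a, b = fit_log_log(counts, freq_of_freq)
```

Because the sample points lie exactly on a power law, the fit recovers a slope close to -2 and an intercept close to log(100), up to floating-point rounding.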

SUM_TO_ONE = False: True if the probabilities of the samples in this probability distribution will always sum to one.

__init__(freqdist, bins=None)

Parameters

freqdist (FreqDist) – The frequency counts upon which to base the estimation.
bins (int) – The number of possible event types. This must be larger than the number of bins in the freqdist. If None, it is assumed to be equal to freqdist.B() + 1.
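
A short usage sketch of the constructor follows. The word counts are hypothetical and chosen only for illustration. With so few data points NLTK may warn that the fitted line is unreliable, so the resulting values should not be read as meaningful estimates.

```python
from nltk.probability import FreqDist, SimpleGoodTuringProbDist

# Hypothetical word counts, chosen only for illustration.
word_counts = {
    "the": 10, "of": 6, "cat": 3, "dog": 3,
    "sat": 2, "mat": 2, "ran": 2,
    "hat": 1, "bat": 1, "rat": 1, "vat": 1, "pat": 1,
}
fd = FreqDist(word_counts)

# bins is omitted, so it defaults to fd.B() + 1 (one unseen event type).
sgt = SimpleGoodTuringProbDist(fd)

p_seen = sgt.prob("the")      # a sample present in fd
p_unseen = sgt.prob("zebra")  # a sample absent from fd
```

Passing an explicit bins value larger than fd.B() spreads the unseen probability mass over more hypothetical event types.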

find_best_fit(r, nr): Use simple linear regression in log-log space to fit the parameters self._slope and self._intercept to the counts and Nr(count). Working in log space avoids floating-point underflow.

smoothedNr(r)

Return the smoothed estimate of the number of samples with count r.

Parameters: r (int) – The count (frequency) to look up.
Return type: float
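
As a rough sketch of what such a smoothed value looks like, the log-linear model from the cited references, log(Nc) = a + b*log(c), can be evaluated directly. The function name smoothed_nr and the parameter values below are hypothetical, and this is an illustration of the model rather than NLTK's code.

```python
import math

def smoothed_nr(r, intercept, slope):
    """Evaluate the fitted line log(Nr) = intercept + slope * log(r) at r."""
    return math.exp(intercept + slope * math.log(r))

# Hypothetical fitted parameters corresponding to Nr = 100 * r**-2.
intercept, slope = math.log(100.0), -2.0
estimate = smoothed_nr(5, intercept, slope)
```

The result is a float rather than an integer because it is read off the fitted line, not counted from the data.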

prob(sample)

Return the sample’s probability.

Parameters: sample (str) – sample of the event
Return type: float

check()

discount(): Return the total probability mass transferred from the seen samples to the unseen samples.
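
For intuition, the textbook Good-Turing estimate of the mass reserved for unseen samples is N1 / N, the number of samples seen exactly once divided by the total number of observations. The sketch below computes that quantity from raw counts. The function name missing_mass is hypothetical, and NLTK's own computation may differ (for example, by using a smoothed count).

```python
from collections import Counter

def missing_mass(counts):
    """Textbook Good-Turing estimate of the unseen probability mass: N1 / N."""
    total = sum(counts.values())
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / total

# Eight tokens, five of which occur exactly once.
counts = Counter("the cat sat on the mat the end".split())
p0 = missing_mass(counts)
```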

max()

Return the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.

Return type: any

samples()

Return a list of all samples that have nonzero probabilities. Use prob to find the probability of each sample.

Return type: list

freqdist()

generate(): Return a randomly selected sample from this probability distribution. The probability of returning each sample samp is equal to self.prob(samp).
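
One common way to draw a sample with the stated probabilities is to walk the cumulative distribution until it passes a uniform random number. The sketch below assumes a plain dict of probabilities and a hypothetical standalone function. It is not NLTK's implementation.

```python
import random

def generate(probs, rng=random):
    """Pick a sample with probability proportional to probs[sample]."""
    threshold = rng.random()
    cumulative = 0.0
    for sample, p in probs.items():
        cumulative += p
        if threshold < cumulative:
            return sample
    # Reached only if the probabilities sum to less than the threshold
    # (rounding error or missing mass); fall back to the last sample.
    return sample

probs = {"a": 0.5, "b": 0.3, "c": 0.2}
rng = random.Random(0)  # seeded so the draws are repeatable
draws = [generate(probs, rng) for _ in range(1000)]
```

The fallback branch matters for distributions whose seen-sample probabilities do not sum to one, which is the case SUM_TO_ONE = False signals.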

logprob(sample)

Return the base 2 logarithm of the probability for a given sample.

Parameters: sample (any) – The sample whose probability should be returned.
Return type: float
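
A minimal sketch of the base 2 conversion, written as a hypothetical standalone function that takes a probability directly. Returning negative infinity for a zero probability is an assumption made here for illustration.

```python
import math

def logprob(p):
    """Base-2 logarithm of a probability; -inf when p is zero."""
    return math.log2(p) if p > 0 else float("-inf")

value = logprob(0.125)  # 0.125 == 2**-3
```

Log probabilities are convenient when multiplying many small probabilities, since the products become sums and do not underflow.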
