nltk.probability.SimpleGoodTuringProbDist
- class nltk.probability.SimpleGoodTuringProbDist
Bases: ProbDistI
SimpleGoodTuringProbDist fits a straight line, by linear regression in log-log space, to the relationship between a frequency and its frequency of frequency. Details of the Simple Good-Turing algorithm can be found in:
“Good Turing smoothing without tears” (Gale & Sampson 1995), Journal of Quantitative Linguistics, vol. 2 pp. 217-237.
“Speech and Language Processing” (Jurafsky & Martin), 2nd Edition, Chapter 4.5 p103 (log(Nc) = a + b*log(c))
Given a set of pairs (xi, yi), where xi denotes a frequency and yi the corresponding frequency of frequency (both taken in log space), we want the line that minimizes the sum of squared errors. E(x) and E(y) denote the means of the xi and yi.
slope: b = sigma((xi - E(x)) (yi - E(y))) / sigma((xi - E(x)) (xi - E(x)))
intercept: a = E(y) - b * E(x)
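The slope and intercept formulas can be sketched in plain Python as follows. This is a minimal illustration, not the library's implementation: the helper name `fit_log_log` and the sample counts are hypothetical, and the counts were chosen to follow Nr = 100 * r^-2 exactly so that the fitted line is easy to check by hand.

```python
import math


def fit_log_log(r, nr):
    """Fit log(nr) = a + b * log(r) by ordinary least squares.

    Returns (slope, intercept), i.e. (b, a) in the formulas above.
    """
    xs = [math.log(v) for v in r]
    ys = [math.log(v) for v in nr]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    # Numerator: sigma((xi - E(x)) (yi - E(y)))
    xy_cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    # Denominator: sigma((xi - E(x)) (xi - E(x)))
    x_var = sum((x - x_mean) ** 2 for x in xs)
    slope = xy_cov / x_var
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical data: counts r and the number of types Nr seen r times.
r = [1, 2, 5, 10]
nr = [100, 25, 4, 1]
slope, intercept = fit_log_log(r, nr)
print(slope, intercept)  # approximately -2 and log(100)
```

Because the points lie exactly on a line in log-log space, the fit recovers b close to -2 and a close to log(100). Real frequency-of-frequency data is noisier, so the fitted line is only an approximation.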
- SUM_TO_ONE = False
True if the probabilities of the samples in this probability distribution will always sum to one.
- __init__(freqdist, bins=None)
- Parameters
freqdist (FreqDist) – The frequency counts upon which to base the estimation.
bins (int) – The number of possible event types. This must be larger than the number of bins in the freqdist. If None, then it’s assumed to be equal to freqdist.B() + 1
- find_best_fit(r, nr)
Use simple linear regression to tune the parameters self._slope and self._intercept in log-log space, based on count and Nr(count). (Work in log space to avoid floating point underflow.)
- smoothedNr(r)
Return the smoothed estimate of the number of samples with count r.
- Parameters
r (int) – The count (frequency) for which to estimate the number of samples.
- Return type
float
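As a rough sketch of how a smoothed value can be read off the fitted line log(Nc) = a + b*log(c) cited in the class description: the function name `smoothed_nr` and the slope and intercept values below are assumptions chosen for illustration, not values produced by the library.

```python
import math


def smoothed_nr(r, slope, intercept):
    """Evaluate the fitted line at count r and map it back from log space."""
    return math.exp(intercept + slope * math.log(r))


# Hypothetical fitted parameters: Nr = 100 * r**-2
slope, intercept = -2.0, math.log(100)

# Count 3 may never have been observed, yet the line still yields
# a positive (and generally non-integer) estimate for it.
values = [smoothed_nr(r, slope, intercept) for r in (1, 2, 3)]
print(values)  # approximately 100, 25 and 11.1
```

This also shows why the return type is float rather than int: the estimate comes from the regression line, not from a direct tally.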
- prob(sample)
Return the sample’s probability.
- Parameters
sample (str) – The sample whose probability should be returned.
- Return type
float
- discount()
Return the total probability mass transferred from the seen samples to the unseen samples.
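For intuition, the classic Good-Turing estimate of the mass reserved for unseen samples is N1 / N, the number of samples seen exactly once divided by the total number of observations. The sketch below computes that quantity from raw counts on a hypothetical token list. It is an illustration of the idea only; the value returned by this method may differ, for example if a smoothed N1 is used in place of the raw one.

```python
from collections import Counter

# Hypothetical observations
tokens = "a a a b b c d e".split()
counts = Counter(tokens)  # a: 3, b: 2, c: 1, d: 1, e: 1

n_total = sum(counts.values())                  # N: total observations
n1 = sum(1 for c in counts.values() if c == 1)  # N1: samples seen once

# Classic Good-Turing estimate of the probability mass for unseen samples
unseen_mass = n1 / n_total
print(unseen_mass)  # 0.375
```

Here three of the eight observations are singletons, so 3/8 of the probability mass is set aside for samples that were never seen.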
- max()
Return the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.
- Return type
any
- samples()
Return a list of all samples that have nonzero probabilities. Use prob to find the probability of each sample.
- Return type
list