nltk.probability.HeldoutProbDist
- class nltk.probability.HeldoutProbDist
Bases: ProbDistI
The heldout estimate for the probability distribution of the experiment used to generate two frequency distributions. These two frequency distributions are called the “heldout frequency distribution” and the “base frequency distribution.” The “heldout estimate” uses the “heldout frequency distribution” to predict the probability of each sample, given its frequency in the “base frequency distribution”.
In particular, the heldout estimate approximates the probability for a sample that occurs r times in the base distribution as the average frequency in the heldout distribution of all samples that occur r times in the base distribution.
This average frequency is Tr[r]/(Nr[r].N), where:
Tr[r] is the total count in the heldout distribution for all samples that occur r times in the base distribution.
Nr[r] is the number of samples that occur r times in the base distribution.
N is the number of outcomes recorded by the heldout frequency distribution.
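The formula above can be worked through by hand on a small example. The following is a minimal sketch in plain Python using collections.Counter in place of FreqDist; the function name heldout_estimates and the toy sentences are illustrative assumptions and are not part of NLTK. It covers only values of r that actually occur in the base counts (r >= 1).

```python
from collections import Counter, defaultdict


def heldout_estimates(base_counts, heldout_counts):
    """Map each r found in base_counts to Tr[r] / (Nr[r] * N).

    Hypothetical helper for illustration; not an NLTK function.
    """
    n = sum(heldout_counts.values())  # N: outcomes recorded in the heldout data
    tr = defaultdict(int)             # Tr[r]: heldout count of samples with base count r
    nr = defaultdict(int)             # Nr[r]: number of samples with base count r
    for sample, r in base_counts.items():
        nr[r] += 1
        tr[r] += heldout_counts[sample]  # a Counter yields 0 for unseen samples
    return {r: tr[r] / (nr[r] * n) for r in nr}


base = Counter("the cat sat on the mat the end".split())
heldout = Counter("the dog sat on the log".split())
estimates = heldout_estimates(base, heldout)

# "the" is the only sample with r = 3 and occurs twice in the heldout data,
# so estimates[3] is 2 / (1 * 6).
# Five samples have r = 1; together they occur twice in the heldout data,
# so estimates[1] is 2 / (5 * 6).
```

Every sample with the same base count r receives the same estimate, which is why the result is keyed by r rather than by sample.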
In order to increase the efficiency of the prob member function, Tr[r]/(Nr[r].N) is precomputed for each value of r when the HeldoutProbDist is created.
- Variables
_estimate – A list mapping from r, the number of times that a sample occurs in the base distribution, to the probability estimate for that sample. _estimate[r] is calculated by finding the average frequency in the heldout distribution of all samples that occur r times in the base distribution. In particular, _estimate[r] = Tr[r]/(Nr[r].N).
_max_r – The maximum number of times that any sample occurs in the base distribution. _max_r is used to decide how large _estimate must be.
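The relationship between the two variables can be sketched as a list indexed by r with max_r + 1 entries, so that a probability lookup is a single index operation. This is an illustrative sketch in plain Python, not the library's code: build_estimate_table and lookup are hypothetical names, and storing 0.0 for values of r that no sample has is a simplifying assumption made here.

```python
from collections import Counter


def build_estimate_table(base_counts, heldout_counts):
    """Return a list whose entry r is Tr[r] / (Nr[r] * N).

    Hypothetical helper; entries with Nr[r] == 0 are set to 0.0 (an assumption).
    """
    max_r = max(base_counts.values())  # largest count in the base data
    tr = [0] * (max_r + 1)
    nr = [0] * (max_r + 1)
    for sample, r in base_counts.items():
        nr[r] += 1
        tr[r] += heldout_counts[sample]
    n = sum(heldout_counts.values())
    return [tr[r] / (nr[r] * n) if nr[r] else 0.0 for r in range(max_r + 1)]


def lookup(sample, base_counts, table):
    """Look up the precomputed estimate using the sample's base count."""
    return table[base_counts[sample]]


base = Counter({"a": 3, "b": 1, "c": 1})
heldout = Counter({"a": 4, "b": 1, "d": 1})
table = build_estimate_table(base, heldout)

p_a = lookup("a", base, table)  # r = 3: 4 / (1 * 6)
p_c = lookup("c", base, table)  # r = 1: (1 + 0) / (2 * 6)
```

Because the table is built once, each later lookup costs one count lookup and one list index, regardless of how many samples share the same r.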
- SUM_TO_ONE = False
True if the probabilities of the samples in this probability distribution will always sum to one.
- __init__(base_fdist, heldout_fdist, bins=None)
Use the heldout estimate to create a probability distribution for the experiment used to generate base_fdist and heldout_fdist.
- Parameters
base_fdist (FreqDist) – The base frequency distribution.
heldout_fdist (FreqDist) – The heldout frequency distribution.
bins (int) – The number of sample values that can be generated by the experiment that is described by the probability distribution. This value must be correctly set for the probabilities of the sample values to sum to one. If bins is not specified, it defaults to freqdist.B().
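To see why bins affects whether the probabilities sum to one, the sketch below totals the probability mass over all bins. It is written in plain Python and rests on an assumption this page does not spell out: that Nr[0], the number of unseen sample values, is taken as bins minus the number of distinct samples in the base distribution. The name total_mass is hypothetical.

```python
from collections import Counter


def total_mass(base_counts, heldout_counts, bins=None):
    """Sum Nr[r] * Tr[r] / (Nr[r] * N) over every r with Nr[r] > 0.

    Assumes Nr[0] = bins - (number of distinct base samples); illustrative only.
    """
    n = sum(heldout_counts.values())
    observed = len(base_counts)
    if bins is None:
        bins = observed                   # no unseen sample values are assumed
    nr = Counter(base_counts.values())    # Nr[r] for r >= 1
    nr[0] = bins - observed               # Nr[0]: unseen sample values
    tr = Counter()                        # Tr[r], including r = 0
    for sample, count in heldout_counts.items():
        tr[base_counts[sample]] += count
    return sum(nr[r] * (tr[r] / (nr[r] * n)) for r in nr if nr[r] > 0)


base = Counter({"a": 3, "b": 1, "c": 1})
heldout = Counter({"a": 4, "b": 1, "d": 1})

mass_with_bins = total_mass(base, heldout, bins=5)  # two unseen values share d's mass
mass_without_bins = total_mass(base, heldout)       # d's heldout mass is dropped
```

In this toy data the heldout sample "d" never occurs in the base counts. With bins=5 its mass is spread over the two unseen bins and the total comes to one; without bins the total falls short by the share of heldout outcomes that were unseen in the base data.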
- base_fdist()
Return the base frequency distribution that this probability distribution is based on.
- Return type
FreqDist
- heldout_fdist()
Return the heldout frequency distribution that this probability distribution is based on.
- Return type
FreqDist
- samples()
Return a list of all samples that have nonzero probabilities. Use prob to find the probability of each sample.
- Return type
list
- prob(sample)
Return the probability for a given sample. Probabilities are always real numbers in the range [0, 1].
- Parameters
sample (any) – The sample whose probability should be returned.
- Return type
float
- max()
Return the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.
- Return type
any
- discount()
Return the ratio by which counts are discounted on average: c*/c
- Return type
float