nltk.metrics.ghd

nltk.metrics.ghd(ref, hyp, ins_cost=2.0, del_cost=2.0, shift_cost_coeff=1.0, boundary='1')

Compute the Generalized Hamming Distance between a reference and a hypothetical segmentation. The distance is the minimal cost of transforming the hypothetical segmentation into the reference segmentation through boundary insertion, deletion and shift operations.

A segmentation is any sequence over a vocabulary of two items (e.g. “0”, “1”), where the specified boundary value is used to mark the edges of the segments.

The recommended parameter values are a shift_cost_coeff of 2, with ins_cost and del_cost both equal to the mean segment length in the reference segmentation (see the sketch after the examples below).

>>> from nltk.metrics import ghd
>>> # Same examples as Kulyukin C++ implementation
>>> ghd('1100100000', '1100010000', 1.0, 1.0, 0.5)
0.5
>>> ghd('1100100000', '1100000001', 1.0, 1.0, 0.5)
2.0
>>> ghd('011', '110', 1.0, 1.0, 0.5)
1.0
>>> ghd('1', '0', 1.0, 1.0, 0.5)
1.0
>>> ghd('111', '000', 1.0, 1.0, 0.5)
3.0
>>> ghd('000', '111', 1.0, 2.0, 0.5)
6.0
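
Following the recommendation above, a minimal sketch with illustrative strings (not part of the NLTK doctest): the reference below has three segments of length four, so the mean segment length, and hence the suggested ins_cost and del_cost, is 4.0.

>>> ref = '100010001000'
>>> hyp = '100001001000'  # middle boundary shifted right by one position
>>> mean_seg_len = len(ref) / ref.count('1')  # 12 / 3 = 4.0
>>> ghd(ref, hyp, ins_cost=mean_seg_len, del_cost=mean_seg_len, shift_cost_coeff=2.0)
2.0

The single misplaced boundary is matched by a shift of one position, so the score is shift_cost_coeff * 1 = 2.0.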
Parameters
  • ref (str or list) – the reference segmentation

  • hyp (str or list) – the hypothetical segmentation

  • ins_cost (float) – insertion cost

  • del_cost (float) – deletion cost

  • shift_cost_coeff (float) – constant used to compute the cost of a shift. shift cost = shift_cost_coeff * |i - j|, where i and j are the positions indicating the shift; a worked example follows at the end of this entry

  • boundary (str or int or bool) – boundary value

Return type

float
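
To illustrate the shift cost (a sketch with illustrative strings, not part of the NLTK doctest): with ins_cost = del_cost = 1.0 and shift_cost_coeff = 0.5, a lone boundary two positions away from its reference location is matched by a shift costing 0.5 * 2 = 1.0, which is cheaper than deleting it and inserting a new one (1.0 + 1.0 = 2.0).

>>> from nltk.metrics import ghd
>>> ghd('00100', '00001', 1.0, 1.0, 0.5)
1.0

When shift_cost_coeff * |i - j| exceeds ins_cost + del_cost, the deletion/insertion pair is chosen instead, which is why the second Kulyukin example above scores 2.0 rather than 0.5 * 5 = 2.5.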