Comparisons

Overview

The comparisons configuration section defines constraints on the matching process. Unlike comparison_features and feature_selections, which define features for use with a machine-learning algorithm, comparisons define rules which directly filter the output potential_matches table. These rules often depend on some comparison features, and hlink always applies the rules after exploding and blocking in the matching task.

As an example, suppose that your comparisons configuration section looks like the following.

[comparisons]
comparison_type = "threshold"
feature_name = "namefrst_jw"
threshold = 0.79

This comparison defines a rule that depends on the namefrst_jw comparison feature. During matching, only pairs of records with namefrst_jw greater than or equal to 0.79 will be added to the potential matches table. Pairs of records which do not satisfy the comparison will not be potential matches.
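The effect of this rule can be sketched in plain Python. This is only an illustration of the filtering semantics, not hlink's implementation; the pairs list and the apply_threshold helper are hypothetical names made up for the example.

```python
# Hypothetical record pairs, each with a precomputed namefrst_jw score.
pairs = [
    {"id_a": 1, "id_b": 10, "namefrst_jw": 0.95},
    {"id_a": 2, "id_b": 11, "namefrst_jw": 0.79},
    {"id_a": 3, "id_b": 12, "namefrst_jw": 0.50},
]


def apply_threshold(pairs, feature_name, threshold):
    """Keep only the pairs whose feature value is >= threshold."""
    return [pair for pair in pairs if pair[feature_name] >= threshold]


# Mirrors feature_name = "namefrst_jw" and threshold = 0.79 from the section above.
potential_matches = apply_threshold(pairs, "namefrst_jw", 0.79)
```

The pair scoring exactly 0.79 is kept because the default comparison is greater than or equal to, while the pair scoring 0.50 is dropped.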

Note: This page focuses on the comparisons section in particular, but the household comparisons section hh_comparisons has the same structure. It defines rules which hlink uses to filter record pairs after household blocking in the hh_matching task. These rules are effectively filters on the output hh_potential_matches table.

Comparison Types

Currently the only comparison_type supported for the comparisons section is "threshold". This type requires the threshold attribute (or the threshold_expr attribute described below) and, by default, restricts a comparison feature to be greater than or equal to the value given by threshold. The configuration section

[comparisons]
comparison_type = "threshold"
feature_name = "namelast_jw"
threshold = 0.84

adds the condition namelast_jw >= 0.84 to each record pair considered during matching. Only record pairs which satisfy this condition are marked as potential matches.

For more flexibility, hlink also supports a threshold_expr attribute in comparisons. This attribute takes a fragment of SQL syntax and replaces the threshold attribute described above. For example, to define the condition flag < 0.5, you could set threshold_expr as follows.

[comparisons]
comparison_type = "threshold"
feature_name = "flag"
threshold_expr = "< 0.5"

Note that there is now no need for the threshold attribute, because threshold_expr supplies both the comparison operator and the value.
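One way to picture how threshold and threshold_expr relate is to write out the condition each one implies. The sketch below is an assumption about the semantics described on this page, not hlink's own code, and threshold_condition is a hypothetical helper.

```python
def threshold_condition(comparison):
    """Return a SQL-style condition string for one threshold comparison.

    Hypothetical helper: illustrates the documented behavior only.
    """
    feature = comparison["feature_name"]
    if "threshold_expr" in comparison:
        # threshold_expr carries its own operator and value, e.g. "< 0.5".
        return f"{feature} {comparison['threshold_expr']}"
    # A plain threshold defaults to "greater than or equal to".
    return f"{feature} >= {comparison['threshold']}"


flag_condition = threshold_condition(
    {"comparison_type": "threshold", "feature_name": "flag", "threshold_expr": "< 0.5"}
)
namelast_condition = threshold_condition(
    {"comparison_type": "threshold", "feature_name": "namelast_jw", "threshold": 0.84}
)
```

Under these assumptions, the first call yields the condition flag < 0.5 and the second yields namelast_jw >= 0.84.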

Defining Multiple Comparisons

In some cases, you may have multiple comparisons to make between record pairs. The comparisons section supports this in a flexible but somewhat verbose way. Suppose that you would like to combine two of the conditions used in the examples above, so that record pairs are potential matches only if namefrst_jw >= 0.79 and namelast_jw >= 0.84. You could do this by setting the operator attribute to "AND" and then defining the comp_a (comparison A) and comp_b (comparison B) attributes.

[comparisons]
operator = "AND"

[comparisons.comp_a]
comparison_type = "threshold"
feature_name = "namefrst_jw"
threshold = 0.79

[comparisons.comp_b]
comparison_type = "threshold"
feature_name = "namelast_jw"
threshold = 0.84
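To see what the "AND" operator means for individual record pairs, here is a small Python sketch. The sample pairs and the passes helper are hypothetical and only illustrate the logic; hlink itself applies these rules while building the potential matches table.

```python
# Hypothetical record pairs with precomputed comparison features.
pairs = [
    {"pair": "A", "namefrst_jw": 0.90, "namelast_jw": 0.90},
    {"pair": "B", "namefrst_jw": 0.90, "namelast_jw": 0.60},
    {"pair": "C", "namefrst_jw": 0.70, "namelast_jw": 0.95},
]

# The two sub-comparisons, written as the mappings the TOML above describes.
comp_a = {"comparison_type": "threshold", "feature_name": "namefrst_jw", "threshold": 0.79}
comp_b = {"comparison_type": "threshold", "feature_name": "namelast_jw", "threshold": 0.84}


def passes(pair, comp):
    """Check one threshold sub-comparison against a record pair."""
    return pair[comp["feature_name"]] >= comp["threshold"]


# operator = "AND": a pair is kept only if both sub-comparisons hold.
kept = [p["pair"] for p in pairs if passes(p, comp_a) and passes(p, comp_b)]
```

Only pair A satisfies both conditions. Pair B fails the namelast_jw rule and pair C fails the namefrst_jw rule, so neither would be a potential match.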

Both comp_a and comp_b are recursive, so they may have the same structure as the comparisons section itself. This means that you can add as many comparisons as you would like by nesting them. The operator attribute may be either "AND" or "OR" and defines the logic that connects the two sub-comparisons comp_a and comp_b. Sections with more than two comparisons quickly become verbose and hard to read, so take care when defining nested comparisons. Here is an example of a section with three comparisons.

# This comparisons section defines 3 rules for potential matches.
# A record pair is a potential match if either
# 1. flag < 0.5
# OR
# 2. namefrst_jw >= 0.79 AND 3. namelast_jw >= 0.84
[comparisons]
operator = "OR"

[comparisons.comp_a]
comparison_type = "threshold"
feature_name = "flag"
threshold_expr = "< 0.5"

[comparisons.comp_b]
operator = "AND"

[comparisons.comp_b.comp_a]
comparison_type = "threshold"
feature_name = "namefrst_jw"
threshold = 0.79

[comparisons.comp_b.comp_b]
comparison_type = "threshold"
feature_name = "namelast_jw"
threshold = 0.84
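The recursive structure can be easier to follow when the nested section is flattened into a single condition. The Python sketch below does this for the three-comparison example. It is a hedged illustration only: build_condition is a hypothetical function, the dictionary is what a TOML parser would be expected to produce for the section above, and hlink's real implementation may differ.

```python
def build_condition(comparison):
    """Recursively turn a comparisons mapping into a SQL-style condition string.

    Hypothetical helper that mirrors the structure described on this page.
    """
    if "operator" in comparison:
        # A combined comparison: recurse into comp_a and comp_b.
        left = build_condition(comparison["comp_a"])
        right = build_condition(comparison["comp_b"])
        return f"({left} {comparison['operator']} {right})"
    feature = comparison["feature_name"]
    if "threshold_expr" in comparison:
        return f"{feature} {comparison['threshold_expr']}"
    return f"{feature} >= {comparison['threshold']}"


# The three-comparison section above, written as a nested dictionary.
comparisons = {
    "operator": "OR",
    "comp_a": {
        "comparison_type": "threshold",
        "feature_name": "flag",
        "threshold_expr": "< 0.5",
    },
    "comp_b": {
        "operator": "AND",
        "comp_a": {
            "comparison_type": "threshold",
            "feature_name": "namefrst_jw",
            "threshold": 0.79,
        },
        "comp_b": {
            "comparison_type": "threshold",
            "feature_name": "namelast_jw",
            "threshold": 0.84,
        },
    },
}

condition = build_condition(comparisons)
print(condition)
# prints: (flag < 0.5 OR (namefrst_jw >= 0.79 AND namelast_jw >= 0.84))
```

Writing the condition out this way is a quick check that the nesting of comp_a and comp_b expresses the grouping you intended.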