Comparisons¶
Overview¶
The comparisons
configuration section defines constraints on the matching
process. Unlike comparison_features
and feature_selections
, which define
features for use with a machine-learning algorithm, comparisons
define rules
which directly filter the output potential_matches
table. These rules often
depend on some comparison features, and hlink always applies the rules after
exploding and blocking in the matching task.
As an example, suppose that your comparisons
configuration section looks like
the following.
[comparisons]
comparison_type = "threshold"
feature_name = "namefrst_jw"
threshold = 0.79
This comparison defines a rule that depends on the namefrst_jw
comparison
feature. During matching, only pairs of records with namefrst_jw
greater than
or equal to 0.79 will be added to the potential matches table. Pairs of records
which do not satisfy the comparison will not be potential matches.
Note: This page focuses on the comparisons
section in particular, but the
household comparisons section hh_comparisons
has the same structure. It
defines rules which hlink uses to filter record pairs after household blocking
in the hh_matching task. These rules are effectively filters on the output
hh_potential_matches
table.
Comparison Types¶
Currently the only comparison_type
supported for the comparisons
section is
"threshold"
. This requires the threshold
attribute, and by default, it
restricts a comparison feature to be greater than or equal to the value given
by threshold
. The configuration section
[comparisons]
comparison_type = "threshold"
feature_name = "namelast_jw"
threshold = 0.84
adds the condition namelast_jw >= 0.84
to each record pair considered during
matching. Only record pairs which satisfy this condition are marked as
potential matches.
Hlink also supports a threshold_expr
attribute in comparisons
for more
flexibility. This attribute takes SQL syntax and replaces the threshold
attribute described above. For example, to define the condition flag < 0.5
,
you could set threshold_expr
like
[comparisons]
comparison_type = "threshold"
feature_name = "flag"
threshold_expr = "< 0.5"
Note that there is now no need for the threshold
attribute because the
threshold_expr
implicitly defines it.
Defining Multiple Comparisons¶
In some cases, you may have multiple comparisons to make between record pairs.
The comparisons
section supports this in a flexible but somewhat verbose way.
Suppose that you would like to combine two of the conditions used in the
examples above, so that record pairs are potential matches only if namefrst_jw >= 0.79
and namelast_jw >= 0.84
. You could do this by setting the operator
attribute to "AND"
and then defining the comp_a
(comparison A) and comp_b
(comparison B) attributes.
[comparisons]
operator = "AND"
[comparisons.comp_a]
comparison_type = "threshold"
feature_name = "namefrst_jw"
threshold = 0.79
[comparisons.comp_b]
comparison_type = "threshold"
feature_name = "namelast_jw"
threshold = 0.84
Both comp_a
and comp_b
are recursive, so they may have the same structure
as the comparisons
section itself. This means that you can add as many
comparisons as you would like by recursively defining comparisons. operator
may be either "AND"
or "OR"
and defines the logic for connecting the two
sub-comparisons comp_a
and comp_b
. Defining more than two comparisons can
get pretty ugly and verbose, so make sure to use care when defining nested
comparisons. Here is an example of a section with three comparisons.
# This comparisons section defines 3 rules for potential matches.
# They are that potential matches must either have
# 1. flag < 0.5
# OR
# 2. namefrst_jw >= 0.79 AND 3. namelast_jw >= 0.84
[comparisons]
operator = "OR"
[comparisons.comp_a]
comparison_type = "threshold"
feature_name = "flag"
threshold_expr = "< 0.5"
[comparisons.comp_b]
operator = "AND"
[comparisons.comp_b.comp_a]
comparison_type = "threshold"
feature_name = "namefrst_jw"
threshold = 0.79
[comparisons.comp_b.comp_b]
comparison_type = "threshold"
feature_name = "namelast_jw"
threshold = 0.84