# Comparisons

## Overview

The `comparisons` configuration section defines constraints on the matching process. Unlike `comparison_features` and `feature_selections`, which define features for use with a machine-learning algorithm, `comparisons` define rules which directly filter the output `potential_matches` table. These rules often depend on some comparison features, and hlink always applies the rules after exploding and blocking in the matching task.

As an example, suppose that your `comparisons` configuration section looks like the following.

```
[comparisons]
comparison_type = "threshold"
feature_name = "namefrst_jw"
threshold = 0.79
```

This comparison defines a rule that depends on the `namefrst_jw` comparison feature. During matching, only pairs of records with `namefrst_jw` greater than or equal to 0.79 will be added to the potential matches table. Pairs of records which do not satisfy the comparison will not be potential matches.

*Note: This page focuses on the `comparisons` section in particular, but the household comparisons section `hh_comparisons` has the same structure. It defines rules which hlink uses to filter record pairs after household blocking in the hh_matching task. These rules are effectively filters on the output `hh_potential_matches` table.*

## Comparison Types

Currently the only `comparison_type` supported for the `comparisons` section is `"threshold"`. This requires the `threshold` attribute, and by default, it restricts a comparison feature to be greater than or equal to the value given by `threshold`. The configuration section

```
[comparisons]
comparison_type = "threshold"
feature_name = "namelast_jw"
threshold = 0.84
```

adds the condition `namelast_jw >= 0.84` to each record pair considered during matching. Only record pairs which satisfy this condition are marked as potential matches.

Hlink also supports a `threshold_expr` attribute in `comparisons` for more flexibility. This attribute takes SQL syntax and replaces the `threshold` attribute described above. For example, to define the condition `flag < 0.5`, you could set `threshold_expr` like

```
[comparisons]
comparison_type = "threshold"
feature_name = "flag"
threshold_expr = "< 0.5"
```

Note that there is now no need for the `threshold` attribute because the `threshold_expr` implicitly defines it.

## Defining Multiple Comparisons

In some cases, you may have multiple comparisons to make between record pairs. The `comparisons` section supports this in a flexible but somewhat verbose way. Suppose that you would like to combine two of the conditions used in the examples above, so that record pairs are potential matches only if `namefrst_jw >= 0.79` and `namelast_jw >= 0.84`. You could do this by setting the `operator` attribute to `"AND"` and then defining the `comp_a` (comparison A) and `comp_b` (comparison B) attributes.

```
[comparisons]
operator = "AND"

[comparisons.comp_a]
comparison_type = "threshold"
feature_name = "namefrst_jw"
threshold = 0.79

[comparisons.comp_b]
comparison_type = "threshold"
feature_name = "namelast_jw"
threshold = 0.84
```

Both `comp_a` and `comp_b` are recursive, so they may have the same structure as the `comparisons` section itself. This means that you can add as many comparisons as you would like by recursively defining comparisons. `operator` may be either `"AND"` or `"OR"` and defines the logic for connecting the two sub-comparisons `comp_a` and `comp_b`. Defining more than two comparisons can get pretty ugly and verbose, so make sure to use care when defining nested comparisons. Here is an example of a section with three comparisons.

```
# This comparisons section defines 3 rules for potential matches.
# They are that potential matches must either have
# 1. flag < 0.5
# OR
# 2. namefrst_jw >= 0.79 AND 3. namelast_jw >= 0.84

[comparisons]
operator = "OR"

[comparisons.comp_a]
comparison_type = "threshold"
feature_name = "flag"
threshold_expr = "< 0.5"

[comparisons.comp_b]
operator = "AND"

[comparisons.comp_b.comp_a]
comparison_type = "threshold"
feature_name = "namefrst_jw"
threshold = 0.79

[comparisons.comp_b.comp_b]
comparison_type = "threshold"
feature_name = "namelast_jw"
threshold = 0.84
```
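
To make the nesting easier to reason about, it can help to write out the condition that a nested `comparisons` section describes. The following is a minimal Python sketch of that idea, not hlink's actual implementation. The helper `comparison_to_sql` is hypothetical, and the `comparisons` dictionary is hand-written to mirror what the three-comparison example would look like after the TOML is parsed.

```python
def comparison_to_sql(comp):
    """Render a (possibly nested) comparisons dict as a SQL-like condition.

    Hypothetical helper for illustration only; hlink's internals may differ.
    """
    # A section with an operator combines two sub-comparisons recursively.
    if "operator" in comp:
        left = comparison_to_sql(comp["comp_a"])
        right = comparison_to_sql(comp["comp_b"])
        return f"({left} {comp['operator']} {right})"

    # Otherwise this is a leaf comparison, which must be a threshold.
    if comp.get("comparison_type") != "threshold":
        raise ValueError(f"unsupported comparison_type: {comp.get('comparison_type')!r}")

    feature = comp["feature_name"]
    # threshold_expr replaces threshold when it is present.
    if "threshold_expr" in comp:
        return f"{feature} {comp['threshold_expr']}"
    # By default, the feature must be greater than or equal to threshold.
    return f"{feature} >= {comp['threshold']}"


# Dictionary form of the three-comparison example above (assumed parse result).
comparisons = {
    "operator": "OR",
    "comp_a": {
        "comparison_type": "threshold",
        "feature_name": "flag",
        "threshold_expr": "< 0.5",
    },
    "comp_b": {
        "operator": "AND",
        "comp_a": {
            "comparison_type": "threshold",
            "feature_name": "namefrst_jw",
            "threshold": 0.79,
        },
        "comp_b": {
            "comparison_type": "threshold",
            "feature_name": "namelast_jw",
            "threshold": 0.84,
        },
    },
}

condition = comparison_to_sql(comparisons)
print(condition)
# (flag < 0.5 OR (namefrst_jw >= 0.79 AND namelast_jw >= 0.84))
```

Printing the condition this way is a quick check that the `comp_a` and `comp_b` tables are nested under the operator you intended, which is easy to get wrong once a section has more than two comparisons.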