Comparison Features¶
During matching, hlink computes comparison features on each record pair which
it considers a potential match. These comparison features can be passed as
features to machine-learning algorithms or used to define
comparisons which filter the potential_matches
table.
Each comparison feature must have a comparison type, which tells hlink how to compute the comparison feature. This page has information on the available comparison types and how to configure them. It also lists some attributes available to all comparison types and some predefined aggregate features which do not need to be explicitly configured.
Comparison Types¶
Each section below describes a comparison type. Each type represents a different operation, computation, or transformation that hlink can perform on one or more input columns. Some comparison types expect their own attributes for additional configuration. These attributes are listed in each section, along with an example.
maximum_jaro_winkler¶
Finds the greatest Jaro-Winkler value among the cartesian product of multiple
columns. For example, given an input of column_names = ['namefrst', 'namelast']
, it would return the maximum Jaro-Winkler name comparison value
among the following four comparisons:
a.namefrst, b.namefrst
a.namefrst, b.namelast
a.namelast, b.namefrst
a.namelast, b.namelast
Attributes:
column_names
– Type: list of strings. Required. The list of columns used as input for the set of comparisons, which are generated by taking the Cartesian product of the set of input columns with itself.
[[comparison_features]]
alias = "maximum_jw"
column_names = ["namelast", "namefrst"]
comparison_type = "maximum_jaro_winkler"
jaro_winkler¶
Returns the Jaro-Winkler comparison score for a given column.
Attributes:
column_name
– Type:string
. Required. The column to compare using the Jaro-Winkler score.
[[comparison_features]]
alias = "namefrst_jw"
column_name = "namefrst"
comparison_type = "jaro_winkler
jaro_winkler_street¶
Uses an additional geographic column value to filter for major location changes before comparing street names. If boundary column A is not equal to boundary column B, a Jaro-Winkler score of zero is returned. If boundary column A and B are equal, the Jaro-Winkler comparison score of the street columns is returned.
Attributes:
column_name
– Type:string
. Required. The input street column.boundary
– Type:string
. Required. An input column to match on before comparing street name values.
[[comparison_features]]
alias = "jw_street"
column_name = "street"
boundary = "enum_dist"
comparison_type = "jaro_winkler_street"
max_jaro_winkler¶
Returns the greatest Jaro-Winkler value from the comparisons of a list of names.
Attributes:
column_name
– Type:string
. Required. Input column containing a list of names to compare (such as related household members, or neighborhood surnames).
[[comparison_features]]
alias = "related_individual_max_jw"
column_name= "namefrst_related"
comparison_type = "max_jaro_winkler"
equals¶
Asserts that values are the same for both compared columns using SQL: a.{column_name} IS NOT DISTINCT FROM b.{column_name}
[[comparison_features]]
alias = "relatematch"
column_name = "relate_div_100"
comparison_type = "equals"
categorical = true
f1_match¶
Evaluates if the first name initial A matches either the first name first initial B or either the first or second middle initial of B. If so, returns 1. Otherwise, returns 2.
1 = First initial of first first name A matches first initial of any of potential match first names B
2 = mismatch
Uses the following SQL query:
"CASE WHEN (
(a.{fi} IS NOT DISTINCT FROM b.{fi}) OR
(a.{fi} IS NOT DISTINCT FROM b.{mi0}) OR
(a.{fi} IS NOT DISTINCT FROM b.{mi1})
) THEN 1 ELSE 2 END"
[[comparison_features]]
alias = "f1_match"
first_init_col = "namefrst_init"
mid_init_cols = ["namefrst_mid_init", "namefrst_mid_init_2"]
comparison_type = "f1_match"
categorical = true
f2_match¶
Evaluates if first middle initial A is empty/null. If so, return 0. Otherwise, if either first or second middle initial A is not null and matches first name initial B, or first or second middle initial B, return 1. Otherwise, return 2.
1 = First initial of A second first name matches first initial of any of potential match first names B
2 = mismatch
0 = no second first name A
Uses the following SQL:
CASE WHEN ((a.{mi0} == '') OR (a.{mi0} IS NULL)) THEN 0 WHEN (
(a.{mi0} IS NOT DISTINCT FROM b.{fi}) OR
((a.{mi1} IS NOT NULL) AND (a.{mi1} IS NOT DISTINCT FROM b.{fi})) OR
(a.{mi0} IS NOT DISTINCT FROM b.{mi0}) OR
(a.{mi0} IS NOT DISTINCT FROM b.{mi1}) OR
((a.{mi1} IS NOT NULL) AND (a.{mi1} IS NOT DISTINCT FROM b.{mi0})) OR
((a.{mi1} IS NOT NULL) AND (a.{mi1} IS NOT DISTINCT FROM b.{mi1}))
) THEN 1 ELSE 2 END
Attributes:
first_init_col
– Type:string
. Required. First name initial input column.mid_init_cols
– Type: list of strings. Required. List of first and second middle initial input columns.
[[comparison_features]]
alias = "f2_match"
first_init_col = "namefrst_init"
mid_init_cols = ["namefrst_mid_init", "namefrst_mid_init_2"]
comparison_type = "f2_match"
categorical = true
not_equals¶
Asserts that values are distinct between compared individuals using SQL: a.{column_name} IS DISTINCT FROM b.{column_name}
. Used mainly in caution flag features (f_caution, m_caution, sp_caution).
Attributes:
column_name
– Type:string
. Required. Input column to compare.
[[comparison_features]]
alias = "m_caution"
column_names = ["mbpl", "mother_birthyr", "stepmom", "momloc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "mbpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "mother_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "stepmom"
comparison_type = "parent_step_change"
[comparison_features.comp_d]
column_name = "momloc"
comparison_type = "present_both_years"
equals_as_int¶
Checks for equality using equals sign and returns boolean result in integer form. Uses SQL: CAST(a.{col} = b.{col} as INT)
Attributes:
column_name
– Type:string
. Required. Input column to compare.
[[comparison_features]]
alias = "namelast_equal_as_int"
column_name = "namelast_clean"
comparison_type = "equals_as_int"
all_equals¶
Asserts whether the values in all given columns match. Uses a SQL expression generated by joining a.{col} = b.{col}
and AND
for each given column.
Attributes:
column_names
– Type: list of strings. Required. List of the columns to evaluate if all are equal across records being compared.
[[comparison_features]]
alias = "exact"
column_names = ["namefrst_unstd", "namelast_clean"]
comparison_type = "all_equals"
not_zero_and_not_equals¶
Checks that both values are present (not null) and nonzero and that they are not equal to one another. Evaluates to a boolean. This is primarily useful when a value of 0 indicates some kind of incomparibility akin to the value being missing.
See also present_and_equal_categorical_in_universe, which is a similar but more general comparison type.
Attributes:
column_name
– Type:string
. Required. Input column to compare.
[[comparison_features]]
alias = "fbpl_nomatch"
column_name = "fbpl"
comparison_type = "not_zero_and_not_equals"
or¶
Allows for the concatenation of up to four comparison features into one feature using a SQL OR
between the generated clause for each sub-comparison.
Attributes:
column_names
– Type: list of strings. Required. A list of all input columns used by sub-comparisons.comp_a
,comp_b
– Type: Object. Required. Sub-comparison using any of the comparison feature types documented in this section.comp_c
,comp_d
– Type: Object. Optional. Sub-comparison using any of the comparison feature types documented in this section.
[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr"]
comparison_type = "or"
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
lower_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7
and¶
Allows for the concatenation of up to four comparison features into one feature using a SQL AND
between the generated clause for each sub-comparison.
Attributes:
column_names
– Type: list of strings. Required. A list of all input columns used by sub-comparisons.comp_a
,comp_b
– Type: Object. Required. Sub-comparison using any of the comparison feature types documented in this section.comp_c
,comp_d
– Type: Object. Optional. Sub-comparison using any of the comparison feature types documented in this section.
In this example, the and
comparison appears in [comparison_features.comp_b]
.
[[comparison_features]]
alias = "street_jw"
comparison_type = "times"
column_names = ["street","county", "statefip"]
[comparison_features.comp_a]
column_name = "street"
comparison_type = "jaro_winkler"
lower_threshold = 0.9
[comparison_features.comp_b]
comparison_type = "and"
column_names = ["county", "statefip"]
[comparison_features.comp_b.comp_a]
column_name = "county"
comparison_type = "equals"
[comparison_features.comp_b.comp_b]
column_name = "statefip"
comparison_type = "equals"
times¶
Takes the output of two sub-comparisons and multiplies them together after casting as floats.
Attributes:
column_names
– Type: list of strings. Required. A list of all input columns used by sub-comparisons.comp_a
,comp_b
– Type: Object. Required. Sub-comparison using any of the comparison feature types documented in this section.comp_a
andcomp_b
can also have sub-comparisons, as in the given example.
[[comparison_features]]
alias = "street_jw"
comparison_type = "times"
column_names = ["street","county", "statefip"]
[comparison_features.comp_a]
column_name = "street"
comparison_type = "jaro_winkler"
lower_threshold = 0.9
[comparison_features.comp_b]
comparison_type = "and"
column_names = ["county", "statefip"]
[comparison_features.comp_b.comp_a]
column_name = "county"
comparison_type = "equals"
[comparison_features.comp_b.comp_b]
column_name = "statefip"
comparison_type = "equals"
caution_comp_3¶
Generates an SQL expression in the form (comparison A OR comparison B) AND comparison C
.
Attributes:
column_names
– Type: list of strings. Required. A list of all input columns used by sub-comparisons.comp_a
,comp_b
,comp_c
– Type: Object. Required. Sub-comparisons using any of the comparison feature types documented in this section.
[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr", "sploc"]
comparison_type = "caution_comp_3"
categorical = true
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7
caution_comp_3_012¶
Similar to caution_comp_3
, but first checks the value of comparison C. If comparison C evaluates to false,
then caution_comp_3_012
evaluates to 2. Otherwise, it evaluates to the result of caution_comp_3
, so 0 or 1.
Attributes:
column_names
– Type: list of strings. Required. A list of all input columns used by sub-comparisons.comp_a
,comp_b
,comp_c
– Type: Object. Required. Sub-comparison using any of the comparison feature types documented in this section.
caution_comp_4¶
Generates an SQL expression in the form (comparison A OR comparison B OR comparison C) AND comparison D
.
Attributes:
column_names
– Type: list of strings. Required. A list of all input columns used by sub-comparisons.comp_a
,comp_b
,comp_c
,comp_d
– Type: Object. Required. Sub-comparisons using any of the comparison feature types documented in this section.
[[comparison_features]]
alias = "m_caution"
column_names = ["mbpl", "mother_birthyr", "stepmom", "momloc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "mbpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "mother_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "stepmom"
comparison_type = "parent_step_change"
[comparison_features.comp_d]
column_name = "momloc"
comparison_type = "present_both_years"
caution_comp_4_012¶
Similar to caution_comp_4
, but first checks the value of comparison D. If comparison D evaluates to false,
then caution_comp_4_012
evaluates to 2. Otherwise, it evaluates to the result of caution_comp_4
, so 0 or 1.
Attributes:
column_names
– Type: list of strings. Required. A list of all input columns used by sub-comparisons.comp_a
,comp_b
,comp_c
,comp_d
– Type: Object. Required. Sub-comparisons using any of the comparison feature types documented in this section.
any_equals¶
Used to compare middle initials and first names under specific circumstances.
If middle initial A is not empty/null and is the same as either middle initial B or first name B,
OR if first name A is not empty/null and is the same as middle initial B.
Attributes:
column_names
– Type: list of strings. Required. The first input column should be the middle initial column, and the second input column should be the first name column.
[[comparison_features]]
alias = "mid_init_match"
column_names = ["namefrst_mid_init", "namefrst_unstd"]
comparison_type = "any_equals"
either_are_1¶
Checks if the column value for either A or B is equal to 1.
Attributes:
column_name
– Type:string
. Required. Input column to compare to 1.
[[comparison_features]]
alias = "either_1"
column_name = "nativity"
comparison_type = "either_are_1"
categorical = true
either_are_0¶
Checks if the column value for either A or B is equal to 0.
Attributes:
column_name
– Type:string
. Required. Input column to compare to 0.
[[comparison_features]]
alias = "either_0"
column_name = "nativity"
comparison_type = "either_are_0"
categorical = true
second_gen_imm¶
Checks if individual A is a second-generation immigrant by looking for nativity
value of 2, 3, or 4 (one or both parents foreign-born).
Attributes:
column_name
– Type:string
. Required. Input should be the name of the nativity column.
[[comparison_features]]
alias = "sgen"
column_name = "nativity"
comparison_type = "second_gen_imm"
categorical = true
rel_jaro_winkler¶
Uses a Scala function to determine the number of people in the input column with a name similarity score (Jaro-Winkler) greater than or equal to the given jw_threshold
, an age difference less than or equal to the given age_threshold
, and matching sex for the sample A individual and the sample B potential match. Takes a column generated with the feature selection transform related_individual_rows
as input (list of person data objects to compare). Can be used for related or unrelated individuals, depending on the input column specified.
Attributes:
column_name
– Type:string
. The input column with data in the form of a list of person data objects.name_col
– Type:string
.birthyr_col
– Type:string
.jw_threshold
– Type:float
.age_threshold
– Type:int
.
[[comparison_features]]
alias = "rel"
column_name = "namefrst_related_rows"
name_col = "namefrst_unstd"
birthyr_col = "replaced_birthyr"
comparison_type = "rel_jaro_winkler"
jw_threshold = 0.9
age_threshold = 5
extra_children¶
Using a Scala function, checks to see if there are children present in sample B who are not present in sample A, but based on relate codes, age, sex, and name, we would have expected to be present in A. Returns a count of suspected “extra” children. Takes a column generated with the feature selection transform related_individual_rows
as input (list of person data objects to compare).
Attributes:
column_name
– Type:string
. The input column with data in the form of a list of person data objects.relate_col
– Type:string
. The name of the column with therelate
code.histid_col
– Type:string
. The name of the id column.name_col
– Type:string
. The name of the column containing the first name for comparison.birthyr_col
– Type:string
. The name of the column containing the birth year.year_b
– Type:int
. The year that sample B was taken.jw_threshold
– Type:float
. The minimum acceptable Jaro-Winkler score to consider a match.age_threshold
– Type:int
. The maximum acceptable age difference to consider a match.
[[comparison_features]]
alias = "extra_children"
column_name = "namefrst_related_rows"
relate_col = "relate"
histid_col = "histid"
name_col = "namefrst_unstd"
birthyr_col = "replaced_birthyr"
year_b = 1910
comparison_type = "extra_children"
jw_threshold = 0.8
age_threshold = 2
jaro_winkler_rate¶
Uses a Scala function to calculate the percentage of individuals who have a Jaro-Winkler score greater than or equal to the given threshold. Rate returned as a percentage as a float data type.
Attributes:
column_name
– Type:string
. The input column with data in the form of a list of person data objects. The input column seen below (”namelast_neighbors”)was generated using a “neighbor_aggregate” feature selection.jw_threshold
– Type:float
. The minimum Jaro-Winkler threshold to consider an acceptable match.
In the following example, a lower_threshold
feature add-on is used to convert the returned rate to a boolean asserting whether it meets the given minimum threshold. (>= 5% of neighbors have a Jaro-Winkler score >= 0.95)
[[comparison_features]]
alias = "nbors"
comparison_type = "times"
column_names = ["namelast_neighbors", "county", "statefip"]
[comparison_features.comp_a]
column_name = "namelast_neighbors"
comparison_type = "jaro_winkler_rate"
jw_threshold = 0.95
lower_threshold = 0.05
[comparison_features.comp_b]
comparison_type = "and"
column_names = ["county", "statefip"]
[comparison_features.comp_b.comp_a]
column_name = "county"
comparison_type = "equals"
[comparison_features.comp_b.comp_b]
column_name = "statefip"
comparison_type = "equals"
sum¶
Adds the column values for A and B together (takes the sum).
Attributes:
column_name
– Type:string
. The input column to be added.
[[comparison_features]]
alias = "namelast_popularity_sum"
column_name = "namelast_popularity"
comparison_type = "sum"
length_b¶
Returns the length of the column value in record B using the SQL size()
function.
Attributes:
column_name
– Type:string
. The name of the input column to take the length of in dataset B.
abs_diff¶
Takes the absolute value of the difference between the values of the given column in datasets A and B.
Attributes:
column_name
– Type:string
. The input column to evaluate.not_equals
– Type:int
. OPTIONAL. You can specify a value for the column to be considered invalid input, in which case the expression would return the value -1 instead of an absolute difference. For example, if you are evaluating the difference in marriage duration values, and “99” is a placeholder value for “unknown” in the data, you can exclude those values from consideration using this attribute.
[[comparison_features]]
alias = "byrdiff"
column_name = "replaced_birthyr"
comparison_type = "abs_diff"
[[comparison_features]]
alias = "mardurmatch"
column_name = "durmarr"
not_equals = 99
comparison_type = "abs_diff"
btwn_threshold = [9, 14]
categorical = True
b_minus_a¶
Returns the value of subtracting the value of column A from the value of column B.
Attributes:
column_name
– Type:string
. The input column to evaluate.not_equals
– Type:int
. OPTIONAL. You can specify a value for the column to be considered invalid input, in which case the expression would return the value -1 instead of an absolute difference. For example, if you are evaluating the difference in marriage duration values, and “99” is a placeholder value for “unknown” in the data, you can exclude those values from consideration using this attribute.
[[comparison_features]]
alias = "mardurmatch"
column_name = "durmarr"
not_equals = 99
comparison_type = "b_minus_a"
btwn_threshold = [5,14]
categorical = true
geo_distance¶
Uses a lookup table to find the geographic distance between locations. The SQL expression is generated by hlink/linking/core/dist_table.py
. There are several ways to configure this feature. You can look up distances in the given file using one or two keys (specified with the key_count
attribute). You can also optionally have a secondary look-up table that serves as a back-up value in the case that the primary look-up does not contain a value for the locations given. This is particularly useful for county distance, as you can set the primary join to be across counties, but set up a secondary join on state, which has much fewer combinations and thus less risk of nulls, to fill in when the counties specified aren’t in the look-up.
Attributes:
key_count
– Type:int
. The number of keys used to join on the primary (or only) look-up table. Acceptable values are 1 or 2. Ex: for state and county, key_count = 2. For just state, key_count = 1 even though there is county_a and county_b.distances_file
– Type:string
of path. Path to the distances look-up file.table_name
– Type:string
. What to name the table that will be generated from the distances file. If you want to do multiple look-ups, if the table_name is the same across all feature specifications, it will only be read in once.Attributes for
key_count = 1
:column_name
– Type:string
. The column in the input data that you want to use as a key to look up the geographic distance.loc_a
– Type:string
. First column to join on in the look-up table (where to find the value coming from thecolumn_name
column A).loc_b
– Type:string
. Second column to join on in the look-up table (where to find the value coming from thecolumn_name
column B).distance_col
– Type:string
. Name of the column containing the geographic distance in the look-up table.
Attributes for
key_count = 2
:column_names
– Type: list of strings. The two columns you want to use as keys to look up the geographic distance.source_column_a
– Type:string
. First column to join on in the source data.source_column_b
– Type:string
. Second column to join on in the source data.loc_a_0
– Type:string
. First column to join on in the look-up table.loc_a_1
– Type:string
. First column to join on in the look-up table.loc_b_0
– Type:string
. Second column to join on in the look-up table.loc_b_1
– Type:string
. Second column to join on in the look-up table.distance_col
– Type:string
. Name of the column containing the geographic distance in the look-up table.
Attributes if using a secondary join:
secondary_key_count
– Type:int
. The number of keys used to join on the secondary (backup) look-up table. Acceptable values are 1 or 2.secondary_table_name
– Type:string
. What to name the table that will be generated from thesecondary_distances_file
. If you want to do multiple look-ups, if the table_name is the same across all feature specifications, it will only be read in once.secondary_distances_file
– Type:string
of path. Path to the secondary distances look-up file.secondary_source_column
– Type:string
. The column in the input data that you want to use as a key in the secondary geographic distance look-up.secondary_loc_a
– Type:string
. First column to join on in the secondary look-up table.secondary_loc_b
– Type:string
. Second column to join on in the secondary look-up table.secondary_distance_col
– Type:string
. Name of the column containing the geographic distance in the secondary look-up table.
[[comparison_features]]
alias = "state_distance"
comparison_type = "geo_distance"
key_count = 1
table_name = "state_distance_lookup"
distances_file = "/path/to/county_state_distance.csv"
column_name = "bpl"
loc_a = "statecode1"
loc_b = "statecode2"
distance_col = "dist"
[[comparison_features]]
alias = "county_distance"
comparison_type = "geo_distance"
column_names = ["county", "statefip"]
key_count = 2
table_name = "county_distance_lookup"
distances_file = "/path/to/county_1900_1910_distances_km.csv"
# columns to join on in the data
source_column_a = "county"
source_column_b = "statefip"
# column names from the csv lookup file
loc_a_0 = "from_icpsrctyi"
loc_a_1 = "to_icpsrctyi"
loc_b_0 = "from_statefip"
loc_b_1 = "to_statefip"
distance_col = "distance_km"
# SECONDARY JOIN
secondary_key_count = 1
secondary_table_name = "state_distance_lookup"
secondary_distances_file = "/path/to/state_1900_1910_distances_km.csv"
secondary_source_column = "statefip"
secondary_loc_a = "from_statefip"
secondary_loc_b = "to_statefip"
secondary_distance_col = "distance_km"
fetch_a¶
Gets the value of column A.
Attributes:
column_name
– Type:string
. Required. The column to get the value from.
[[comparison_features]]
alias = "race"
column_name = "race"
comparison_type = "fetch_a"
categorical = true
fetch_b¶
Gets the value of column B.
Attributes:
column_name
– Type:string
. The column to get the value from.
[[comparison_features]]
alias = "race"
column_name = "race"
comparison_type = "fetch_b"
categorical = true
present_both_years¶
Checks whether column A and column B are both present and both positive (> 0). Evaluates to 1 if both are present and positive and 0 otherwise.
Attributes:
column_name
– Type:string
. The column to check. Must be a column with a numerical type.
[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr", "sploc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7
[comparison_features.comp_d]
column_name = "sploc"
comparison_type = "present_both_years"
neither_are_null¶
Checks that neither column A nor column B is null or the empty string ''
.
Evaluates to 1 if neither column is null or ''
and evaluates to 0 otherwise.
Attributes:
column_name
– Type:string
. The column of type string to check.
[[comparison_features]]
alias = "mpres"
column_name = "m_namefrst"
comparison_type = "neither_are_null"
categorical = true
present_and_matching_categorical¶
Checks that both column A and column B are present and that they match according to SQL’s IS DISTINCT FROM
. Evaluates to 0, 1, or 2:
0 → columns are both present and match
1 → columns are both present but are distinct
2 → one or both columns are missing
Attributes:
column_name
– Type:string
. Required. The column to check.
present_and_not_equal¶
Checks that column A and column B are both present but are not equal.
Attributes:
column_name
– Type:string
. The column to check.
present_and_equal_categorical_in_universe¶
Checks that column A and column B are both present, are not equal to the not-in-universe value NIU,
and are equal to each other according to SQL’s IS DISTINCT FROM
. Evaluates to 0 if either column is missing or
if either column is the NIU value. Otherwise, evaluates to 0 if the columns are distinct or 1 if
the columns are equal.
Attributes:
column_name
– Type:string
. Required. The column to check.NIU
– Type: same as the type of the input column. Required. The not-in-universe value to use in the check.
[[comparison_features]]
alias = "mfbplmatch"
column_name = "nativity"
comparison_type = "present_and_equal_categorical_in_universe"
NIU = "0"
categorical = true
multi_jaro_winkler_search¶
This is a specialized comparison feature that simplifies particular ways of searching
multiple columns for Jaro-Winkler similarity with additional constraints on some
other columns. It does Jaro-Winkler and equality comparisons on a variable number
num_cols
of columns whose names can be templated with the numbers 1, 2, …, num_cols
.
It takes a number of columns and templates for generating the column names. For
each combination (M, N) in the Cartesian product {1, 2, …, num_cols
} ×
{1, 2, …, num_cols
}, it…
Generates two column names per template by replacing the string
{n}
in the given templates with the current column numbers M and N, respectively.Compares the columns generated by
jw_col_template
with thejw
function and checks if the similarity between the columns is at leastjw_threshold
.Compares each pair of columns generated by
equal_and_not_null_templates
to see if they are equal and neither are null.Takes the AND of all comparisons in steps 2 and 3, which is the final result for this (M, N) combination.
After this comparison feature has computed step 4 for each combination, it returns 1 if any step 4 returned 1, or 0 if all of the step 4s are 0.
For example,
[[comparison_features]]
comparison_type = "multi_jaro_winkler_search"
num_cols = 5
jw_col_template = "s{n}_namefrst"
jw_threshold = 0.7
equal_and_not_null_templates = ["s{n}_bpl", "s{n}_sex"]
does 25 comparisons. For each (M, N) in the Cartesian product {1, 2, 3, 4, 5} × {1, 2, 3, 4, 5}, it computes
jw(nvl(a.sM_namefrst, ''), nvl(b.sN_namefrst, '')) >= 0.7 AND
a.sM_bpl IS NOT NULL AND
b.sN_bpl IS NOT NULL AND
a.sM_bpl = b.sN_bpl AND
a.sM_sex IS NOT NULL AND
b.sN_sex IS NOT NULL AND
a.sM_sex = b.sN_sex
It returns 1 if any of these expressions evaluate to 1, and 0 if all of them evaluate to 0.
Attributes:
num_cols
– Type:integer
. Required. The number of columns being compared.jw_col_template
– Type:string
. Required. The template for the columns to be compared with Jaro-Winkler.jw_threshold
– Type:float
. Required. The minimum Jaro-Winkler score required for the comparison to pass.equal_and_not_null_templates
– Type: list of strings. Optional, defaults to the empty list. The templates for the columns to be compared for equality. If one or both of the columns are NULL, they are considered not to be equal.
sql_condition¶
This is a flexible comparison type that allows users to write their own SQL expressions to be evaluated. Favor using a different comparison type if that’s a reasonable option. If there are no other comparison types that work for a particular use case, this one is a good fallback.
Attributes:
column_names
– Type: list of strings. Required. A list of all columns used in the SQL expression.condition
– Type:string
. The SQL expression to evaluate.
In this example, we make use of the hlink-defined jw
function, which computes
the Jaro-Winkler similarity of two strings. nvl
is a Spark builtin function
which returns its second argument if the first is null, and the first argument
otherwise.
[[comparison_features]]
alias = "namelast_jw_max"
comparison_type = "sql_condition"
column_names = ["namelast1", "namelast2", "namelast3"]
condition = "GREATEST(jw(nvl(a.namelast1, ''), nvl(b.namelast1, '')), jw(nvl(a.namelast2, ''), nvl(b.namelast2, '')), jw(nvl(a.namelast3, ''), nvl(b.namelast3, '')))"
Feature add-ons¶
These attributes can be added to most comparison feature types above to extend the type of output returned beyond the standard comparison feature.
alias¶
Attributes:
alias
: Type:string
. Should be used at the top level comparison of every comparison feature. The name for the output column.
[[comparison_features]]
alias = "jw_f"
column_name = "father_namefrst"
comparison_type = "jaro_winkler"
power¶
Raises a comparison feature to a given exponential power.
Attributes:
power
– Type:int
. The power to raise the comparison output to. For example,power = 2
will square the output.
[[comparison_features]]
alias = "county_distance_squared"
comparison_type = "geo_distance"
column_names = ["county", "statefip"]
# PRIMARY JOIN
# key count: the number of keys used for the join per source file. Ex: for state and county, key_count = 2. For just state, key_count = 1 even though there is county_a and county_b
key_count = 2
table_name = "county_distance_lookup"
#distances_file = "/path/to/county_state_distance.csv"
distances_file = "/path/to/county_1900_1910_distances_km.csv"
# columns to join on in the data
source_column_a = "county"
source_column_b = "statefip"
# column names from the csv lookup file
loc_a_0 = "from_icpsrctyi"
loc_a_1 = "to_icpsrctyi"
loc_b_0 = "from_statefip"
loc_b_1 = "to_statefip"
distance_col = "distance_km"
# SECONDARY JOIN
secondary_key_count = 1
secondary_table_name = "state_distance_lookup"
secondary_distances_file = "/path/to/state_1900_1910_distances_km.csv"
secondary_source_column = "statefip"
secondary_loc_a = "from_statefip"
secondary_loc_b = "to_statefip"
secondary_distance_col = "distance_km"
power = 2
threshold¶
Attributes:
threshold
– Type: numeric types. Asserts if the comparison feature output is not null and is greater than or equal to (>=
) the given threshold value.
[[comparison_features]]
alias = "imm"
column_name = "nativity"
comparison_type = "fetch_a"
threshold = 5
categorical = true
lower_threshold¶
Attributes:
lower_threshold
– Type: numeric types. Asserts if the comparison feature output is not null and is greater than or equal to (>=
) the given threshold value.
[[comparison_features]]
alias = "street_jw"
comparison_type = "times"
column_names = ["street","county", "statefip"]
[comparison_features.comp_a]
column_name = "street"
comparison_type = "jaro_winkler"
lower_threshold = 0.9
[comparison_features.comp_b]
comparison_type = "and"
column_names = ["county", "statefip"]
[comparison_features.comp_b.comp_a]
column_name = "county"
comparison_type = "equals"
[comparison_features.comp_b.comp_b]
column_name = "statefip"
comparison_type = "equals"
upper_threshold¶
Attributes:
upper_threshold
– Type: numeric types. Asserts if the comparison feature output is not null and is less than or equal to (<=
) the given threshold value.
[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr", "sploc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7
[comparison_features.comp_d]
column_name = "sploc"
comparison_type = "present_both_years"
gt_threshold¶
Attributes:
gt_threshold
– Type: numeric types. Asserts if the comparison feature output is not null and is greater than (>
) the given threshold value.
[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr", "sploc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7
[comparison_features.comp_d]
column_name = "sploc"
comparison_type = "present_both_years"
btwn_threshold¶
Attributes:
btwn_threshold
– Type: List of numeric type. Asserts if the comparison feature output is greater than or equal to (>=
) the first threshold value, and less than or equal to (<=
) the second threshold value.
[[comparison_features]]
alias = "mardurmatch"
column_name = "durmarr"
not_equals = 99
comparison_type = "b_minus_a"
btwn_threshold = [5,14]
categorical = true
look_at_addl_var¶
Attributes:
look_at_addl_var
– Type: boolean. Flags the program to consider an additional column value before reporting the comparison feature value.addl_var
– Type:string
. The additional column to consider.check_val_expr
– Type: expression. The expression to use to evaluate the additional column. For example,check_val_expr = "= 5"
.else_val
– Type: same type as comparison feature output. If the additional volumn value does not meet thecheck_val_expr
specification, the value to return instead of the comparison feature value.
In the following example, the generated SQL expression for the column would be: CASE WHEN {datasource}.nativity = 5 then {yrimmig abs_diff value} else -1 END
.
[[comparison_features]]
alias = "immyear_diff"
column_name = "yrimmig"
comparison_type = "abs_diff"
look_at_addl_var = true
addl_var = "nativity"
datasource = "a"
check_val_expr = "= 5"
else_val = -1
Aggregate Features¶
These features are not configurable. To include them in the generated comparison features, they just need to be included in the [training][independent_vars]
section of the config. They are generated using the “aggregate_features” SQL template.
hits¶
The number of potential matches generated for the given individual (counted by aggregating on {id_column}_a
).
hits2¶
hits
squared.
exact_mult¶
Indicator for the existence of multiple potential matches with the exact same first and last name as the A sample individual within the B data. Returns numeric boolean (0 or 1).
Household Aggregate Features¶
These features are not configurable. To include them in the generated comparison features, they just need to be included in the [hh_training][independent_vars]
section of the config. They are generated using the “hh_aggregate_features” SQL template.
jw_max_a¶
The highest Jaro-Winkler score for any of the first names in linked household A against the first name in linked household B where birth year difference is less than or equal to ten, excluding the individual A in the current potential match. Returns 0
if no other individuals are in the household for comparison.
jw_max_b¶
The highest Jaro-Winkler score for any of the first names in linked household A against the first name in linked household B where sex matches and birth year difference is less than or equal to ten, excluding the individual A in the current potential match. Returns 0
if no other individuals are in the household for comparison.