Comparison Features

During matching, hlink computes comparison features on each record pair which it considers a potential match. These comparison features can be passed as features to machine-learning algorithms or used to define comparisons which filter the potential_matches table.

Each comparison feature must have a comparison type, which tells hlink how to compute the comparison feature. This page has information on the available comparison types and how to configure them. It also lists some attributes available to all comparison types and some predefined aggregate features which do not need to be explicitly configured.

Comparison Types

Each section below describes a comparison type. Each type represents a different operation, computation, or transformation that hlink can perform on one or more input columns. Some comparison types expect their own attributes for additional configuration. These attributes are listed in each section, along with an example.

maximum_jaro_winkler

Finds the greatest Jaro-Winkler value among the cartesian product of multiple columns. For example, given an input of column_names = ['namefrst', 'namelast'], it would return the maximum Jaro-Winkler name comparison value among the following four comparisons:

a.namefrst, b.namefrst
a.namefrst, b.namelast
a.namelast, b.namefrst
a.namelast, b.namelast
  • Attributes:

    • column_names – Type: list of strings. Required. The list of columns used as input for the set of comparisons, which are generated by taking the Cartesian product of the set of input columns with itself.

[[comparison_features]]
alias = "maximum_jw"
column_names = ["namelast", "namefrst"]
comparison_type = "maximum_jaro_winkler"

jaro_winkler

Returns the Jaro-Winkler comparison score for a given column.

  • Attributes:

    • column_name – Type: string. Required. The column to compare using the Jaro-Winkler score.

[[comparison_features]]
alias = "namefrst_jw"
column_name = "namefrst"
comparison_type = "jaro_winkler

jaro_winkler_street

Uses an additional geographic column value to filter for major location changes before comparing street names. If boundary column A is not equal to boundary column B, a Jaro-Winkler score of zero is returned. If boundary column A and B are equal, the Jaro-Winkler comparison score of the street columns is returned.

  • Attributes:

    • column_name – Type: string. Required. The input street column.

    • boundary – Type: string. Required. An input column to match on before comparing street name values.

[[comparison_features]]
alias = "jw_street"
column_name = "street"
boundary = "enum_dist"
comparison_type = "jaro_winkler_street"

max_jaro_winkler

Returns the greatest Jaro-Winkler value from the comparisons of a list of names.

  • Attributes:

    • column_name – Type: string. Required. Input column containing a list of names to compare (such as related household members, or neighborhood surnames).

[[comparison_features]]
alias = "related_individual_max_jw"
column_name= "namefrst_related"
comparison_type = "max_jaro_winkler"

equals

Asserts that values are the same for both compared columns using SQL: a.{column_name} IS NOT DISTINCT FROM b.{column_name}

[[comparison_features]]
alias = "relatematch"
column_name = "relate_div_100"
comparison_type = "equals"
categorical = true

f1_match

Evaluates if the first name initial A matches either the first name first initial B or either the first or second middle initial of B. If so, returns 1. Otherwise, returns 2.

1 = First initial of first first name A matches first initial of any of potential match first names B

2 = mismatch

Uses the following SQL query:

"CASE WHEN (
    (a.{fi} IS NOT DISTINCT FROM b.{fi}) OR 
    (a.{fi} IS NOT DISTINCT FROM b.{mi0}) OR
    (a.{fi} IS NOT DISTINCT FROM b.{mi1})
) THEN 1 ELSE 2 END"
[[comparison_features]]
alias = "f1_match"
first_init_col = "namefrst_init"
mid_init_cols = ["namefrst_mid_init", "namefrst_mid_init_2"]
comparison_type = "f1_match"
categorical = true

f2_match

Evaluates if first middle initial A is empty/null. If so, return 0. Otherwise, if either first or second middle initial A is not null and matches first name initial B, or first or second middle initial B, return 1. Otherwise, return 2.

1 = First initial of A second first name matches first initial of any of potential match first names B

2 = mismatch

0 = no second first name A

Uses the following SQL:

CASE WHEN ((a.{mi0} == '') OR (a.{mi0} IS NULL)) THEN 0 WHEN (
    (a.{mi0} IS NOT DISTINCT FROM b.{fi}) OR
    ((a.{mi1} IS NOT NULL) AND (a.{mi1} IS NOT DISTINCT FROM b.{fi})) OR
    (a.{mi0} IS NOT DISTINCT FROM b.{mi0}) OR
    (a.{mi0} IS NOT DISTINCT FROM b.{mi1}) OR
    ((a.{mi1} IS NOT NULL) AND (a.{mi1} IS NOT DISTINCT FROM b.{mi0})) OR
    ((a.{mi1} IS NOT NULL) AND (a.{mi1} IS NOT DISTINCT FROM b.{mi1}))
) THEN 1 ELSE 2 END
  • Attributes:

    • first_init_col – Type: string. Required. First name initial input column.

    • mid_init_cols – Type: list of strings. Required. List of first and second middle initial input columns.

[[comparison_features]]
alias = "f2_match"
first_init_col = "namefrst_init"
mid_init_cols = ["namefrst_mid_init", "namefrst_mid_init_2"]
comparison_type = "f2_match"
categorical = true

not_equals

Asserts that values are distinct between compared individuals using SQL: a.{column_name} IS DISTINCT FROM b.{column_name}. Used mainly in caution flag features (f_caution, m_caution, sp_caution).

  • Attributes:

    • column_name – Type: string. Required. Input column to compare.

[[comparison_features]]
alias = "m_caution"
column_names = ["mbpl", "mother_birthyr", "stepmom", "momloc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "mbpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "mother_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "stepmom"
comparison_type = "parent_step_change"
[comparison_features.comp_d]
column_name = "momloc"
comparison_type = "present_both_years"

equals_as_int

Checks for equality using equals sign and returns boolean result in integer form. Uses SQL: CAST(a.{col} = b.{col} as INT)

  • Attributes:

    • column_name – Type: string. Required. Input column to compare.

[[comparison_features]]
alias = "namelast_equal_as_int"
column_name = "namelast_clean"
comparison_type  = "equals_as_int"

all_equals

Asserts whether the values in all given columns match. Uses a SQL expression generated by joining a.{col} = b.{col} and AND for each given column.

  • Attributes:

    • column_names – Type: list of strings. Required. List of the columns to evaluate if all are equal across records being compared.

[[comparison_features]]
alias = "exact"
column_names = ["namefrst_unstd", "namelast_clean"]
comparison_type = "all_equals"

not_zero_and_not_equals

Checks that both values are present (not null) and nonzero and that they are not equal to one another. Evaluates to a boolean. This is primarily useful when a value of 0 indicates some kind of incomparibility akin to the value being missing.

See also present_and_equal_categorical_in_universe, which is a similar but more general comparison type.

  • Attributes:

    • column_name – Type: string. Required. Input column to compare.

[[comparison_features]]
alias = "fbpl_nomatch"
column_name = "fbpl"
comparison_type = "not_zero_and_not_equals"

or

Allows for the concatenation of up to four comparison features into one feature using a SQL OR between the generated clause for each sub-comparison.

  • Attributes:

    • column_names – Type: list of strings. Required. A list of all input columns used by sub-comparisons.

    • comp_a, comp_b – Type: Object. Required. Sub-comparison using any of the comparison feature types documented in this section.

    • comp_c, comp_d – Type: Object. Optional. Sub-comparison using any of the comparison feature types documented in this section.

[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr"]
comparison_type = "or"
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
lower_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7

and

Allows for the concatenation of up to four comparison features into one feature using a SQL AND between the generated clause for each sub-comparison.

  • Attributes:

    • column_names – Type: list of strings. Required. A list of all input columns used by sub-comparisons.

    • comp_a, comp_b – Type: Object. Required. Sub-comparison using any of the comparison feature types documented in this section.

    • comp_c, comp_d – Type: Object. Optional. Sub-comparison using any of the comparison feature types documented in this section.

In this example, the and comparison appears in [comparison_features.comp_b].

[[comparison_features]]
alias = "street_jw"
comparison_type = "times"
column_names = ["street","county", "statefip"]
[comparison_features.comp_a]
column_name = "street"
comparison_type = "jaro_winkler"
lower_threshold = 0.9
[comparison_features.comp_b]
comparison_type = "and"
column_names = ["county", "statefip"]
[comparison_features.comp_b.comp_a]
column_name = "county"
comparison_type = "equals"
[comparison_features.comp_b.comp_b]
column_name = "statefip"
comparison_type = "equals"

times

Takes the output of two sub-comparisons and multiplies them together after casting as floats.

  • Attributes:

    • column_names – Type: list of strings. Required. A list of all input columns used by sub-comparisons.

    • comp_a, comp_b – Type: Object. Required. Sub-comparison using any of the comparison feature types documented in this section. comp_a and comp_b can also have sub-comparisons, as in the given example.

[[comparison_features]]
alias = "street_jw"
comparison_type = "times"
column_names = ["street","county", "statefip"]
[comparison_features.comp_a]
column_name = "street"
comparison_type = "jaro_winkler"
lower_threshold = 0.9
[comparison_features.comp_b]
comparison_type = "and"
column_names = ["county", "statefip"]
[comparison_features.comp_b.comp_a]
column_name = "county"
comparison_type = "equals"
[comparison_features.comp_b.comp_b]
column_name = "statefip"
comparison_type = "equals"

caution_comp_3

Generates an SQL expression in the form (comparison A OR comparison B) AND comparison C.

  • Attributes:

    • column_names – Type: list of strings. Required. A list of all input columns used by sub-comparisons.

    • comp_a, comp_b, comp_c – Type: Object. Required. Sub-comparisons using any of the comparison feature types documented in this section.

[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr", "sploc"]
comparison_type = "caution_comp_3"
categorical = true
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7

caution_comp_3_012

Similar to caution_comp_3, but first checks the value of comparison C. If comparison C evaluates to false, then caution_comp_3_012 evaluates to 2. Otherwise, it evaluates to the result of caution_comp_3, so 0 or 1.

  • Attributes:

    • column_names – Type: list of strings. Required. A list of all input columns used by sub-comparisons.

    • comp_a, comp_b, comp_c – Type: Object. Required. Sub-comparison using any of the comparison feature types documented in this section.

caution_comp_4

Generates an SQL expression in the form (comparison A OR comparison B OR comparison C) AND comparison D.

  • Attributes:

    • column_names – Type: list of strings. Required. A list of all input columns used by sub-comparisons.

    • comp_a, comp_b, comp_c, comp_d – Type: Object. Required. Sub-comparisons using any of the comparison feature types documented in this section.

[[comparison_features]]
alias = "m_caution"
column_names = ["mbpl", "mother_birthyr", "stepmom", "momloc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "mbpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "mother_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "stepmom"
comparison_type = "parent_step_change"
[comparison_features.comp_d]
column_name = "momloc"
comparison_type = "present_both_years"

caution_comp_4_012

Similar to caution_comp_4, but first checks the value of comparison D. If comparison D evaluates to false, then caution_comp_4_012 evaluates to 2. Otherwise, it evaluates to the result of caution_comp_4, so 0 or 1.

  • Attributes:

    • column_names – Type: list of strings. Required. A list of all input columns used by sub-comparisons.

    • comp_a, comp_b, comp_c, comp_d – Type: Object. Required. Sub-comparisons using any of the comparison feature types documented in this section.

any_equals

Used to compare middle initials and first names under specific circumstances.
If middle initial A is not empty/null and is the same as either middle initial B or first name B, OR if first name A is not empty/null and is the same as middle initial B.

  • Attributes:

    • column_names – Type: list of strings. Required. The first input column should be the middle initial column, and the second input column should be the first name column.

[[comparison_features]]
alias = "mid_init_match"
column_names = ["namefrst_mid_init", "namefrst_unstd"]
comparison_type = "any_equals"

either_are_1

Checks if the column value for either A or B is equal to 1.

  • Attributes:

    • column_name – Type: string. Required. Input column to compare to 1.

[[comparison_features]]
alias = "either_1"
column_name = "nativity"
comparison_type = "either_are_1"
categorical = true

either_are_0

Checks if the column value for either A or B is equal to 0.

  • Attributes:

    • column_name – Type: string. Required. Input column to compare to 0.

[[comparison_features]]
alias = "either_0"
column_name = "nativity"
comparison_type = "either_are_0"
categorical = true

second_gen_imm

Checks if individual A is a second-generation immigrant by looking for nativity value of 2, 3, or 4 (one or both parents foreign-born).

  • Attributes:

    • column_name – Type: string. Required. Input should be the name of the nativity column.

[[comparison_features]]
alias =  "sgen"
column_name = "nativity"
comparison_type = "second_gen_imm"
categorical = true

rel_jaro_winkler

Uses a Scala function to determine the number of people in the input column with a name similarity score (Jaro-Winkler) greater than or equal to the given jw_threshold, an age difference less than or equal to the given age_threshold, and matching sex for the sample A individual and the sample B potential match. Takes a column generated with the feature selection transform related_individual_rows as input (list of person data objects to compare). Can be used for related or unrelated individuals, depending on the input column specified.

  • Attributes:

    • column_name – Type: string. The input column with data in the form of a list of person data objects.

    • name_col – Type: string.

    • birthyr_col – Type: string.

    • jw_threshold – Type: float.

    • age_threshold – Type: int.

[[comparison_features]]
alias = "rel"
column_name = "namefrst_related_rows"
name_col = "namefrst_unstd"
birthyr_col = "replaced_birthyr"
comparison_type = "rel_jaro_winkler"
jw_threshold = 0.9
age_threshold = 5

extra_children

Using a Scala function, checks to see if there are children present in sample B who are not present in sample A, but based on relate codes, age, sex, and name, we would have expected to be present in A. Returns a count of suspected “extra” children. Takes a column generated with the feature selection transform related_individual_rows as input (list of person data objects to compare).

  • Attributes:

    • column_name – Type: string. The input column with data in the form of a list of person data objects.

    • relate_col – Type: string. The name of the column with the relate code.

    • histid_col – Type: string. The name of the id column.

    • name_col – Type: string. The name of the column containing the first name for comparison.

    • birthyr_col – Type: string. The name of the column containing the birth year.

    • year_b – Type: int. The year that sample B was taken.

    • jw_threshold – Type: float. The minimum acceptable Jaro-Winkler score to consider a match.

    • age_threshold – Type: int. The maximum acceptable age difference to consider a match.

[[comparison_features]]
alias = "extra_children"
column_name = "namefrst_related_rows"
relate_col = "relate"
histid_col = "histid"
name_col = "namefrst_unstd"
birthyr_col = "replaced_birthyr"
year_b = 1910
comparison_type = "extra_children"
jw_threshold = 0.8
age_threshold = 2

jaro_winkler_rate

Uses a Scala function to calculate the percentage of individuals who have a Jaro-Winkler score greater than or equal to the given threshold. Rate returned as a percentage as a float data type.

  • Attributes:

    • column_name – Type: string. The input column with data in the form of a list of person data objects. The input column seen below (”namelast_neighbors”)was generated using a “neighbor_aggregate” feature selection.

    • jw_threshold – Type: float. The minimum Jaro-Winkler threshold to consider an acceptable match.

In the following example, a lower_threshold feature add-on is used to convert the returned rate to a boolean asserting whether it meets the given minimum threshold. (>= 5% of neighbors have a Jaro-Winkler score >= 0.95)

[[comparison_features]]
alias = "nbors"
comparison_type = "times"
column_names = ["namelast_neighbors", "county", "statefip"]
[comparison_features.comp_a]
column_name = "namelast_neighbors"
comparison_type = "jaro_winkler_rate"
jw_threshold = 0.95
lower_threshold = 0.05
[comparison_features.comp_b]
comparison_type = "and"
column_names = ["county", "statefip"]
[comparison_features.comp_b.comp_a]
column_name = "county"
comparison_type = "equals"
[comparison_features.comp_b.comp_b]
column_name = "statefip"
comparison_type = "equals"

sum

Adds the column values for A and B together (takes the sum).

  • Attributes:

    • column_name – Type: string. The input column to be added.

[[comparison_features]]
alias = "namelast_popularity_sum"
column_name = "namelast_popularity"
comparison_type = "sum"

length_b

Returns the length of the column value in record B using the SQL size() function.

  • Attributes:

    • column_name – Type: string. The name of the input column to take the length of in dataset B.

abs_diff

Takes the absolute value of the difference between the values of the given column in datasets A and B.

  • Attributes:

    • column_name – Type: string. The input column to evaluate.

    • not_equals – Type: int. OPTIONAL. You can specify a value for the column to be considered invalid input, in which case the expression would return the value -1 instead of an absolute difference. For example, if you are evaluating the difference in marriage duration values, and “99” is a placeholder value for “unknown” in the data, you can exclude those values from consideration using this attribute.

[[comparison_features]]
alias = "byrdiff"
column_name = "replaced_birthyr"
comparison_type = "abs_diff"

[[comparison_features]]
alias = "mardurmatch"
column_name = "durmarr"
not_equals = 99
comparison_type = "abs_diff"
btwn_threshold = [9, 14]
categorical = True

b_minus_a

Returns the value of subtracting the value of column A from the value of column B.

  • Attributes:

    • column_name – Type: string. The input column to evaluate.

    • not_equals – Type: int. OPTIONAL. You can specify a value for the column to be considered invalid input, in which case the expression would return the value -1 instead of an absolute difference. For example, if you are evaluating the difference in marriage duration values, and “99” is a placeholder value for “unknown” in the data, you can exclude those values from consideration using this attribute.

[[comparison_features]]
alias = "mardurmatch"
column_name = "durmarr"
not_equals = 99
comparison_type = "b_minus_a"
btwn_threshold = [5,14]
categorical = true

geo_distance

Uses a lookup table to find the geographic distance between locations. The SQL expression is generated by hlink/linking/core/dist_table.py. There are several ways to configure this feature. You can look up distances in the given file using one or two keys (specified with the key_count attribute). You can also optionally have a secondary look-up table that serves as a back-up value in the case that the primary look-up does not contain a value for the locations given. This is particularly useful for county distance, as you can set the primary join to be across counties, but set up a secondary join on state, which has much fewer combinations and thus less risk of nulls, to fill in when the counties specified aren’t in the look-up.

  • Attributes:

    • key_count – Type: int. The number of keys used to join on the primary (or only) look-up table. Acceptable values are 1 or 2. Ex: for state and county, key_count = 2. For just state, key_count = 1 even though there is county_a and county_b.

    • distances_file – Type: string of path. Path to the distances look-up file.

    • table_name – Type: string. What to name the table that will be generated from the distances file. If you want to do multiple look-ups, if the table_name is the same across all feature specifications, it will only be read in once.

    • Attributes for key_count = 1:

      • column_name – Type: string. The column in the input data that you want to use as a key to look up the geographic distance.

      • loc_a – Type: string. First column to join on in the look-up table (where to find the value coming from the column_name column A).

      • loc_b – Type: string. Second column to join on in the look-up table (where to find the value coming from the column_name column B).

      • distance_col – Type: string. Name of the column containing the geographic distance in the look-up table.

    • Attributes for key_count = 2:

      • column_names – Type: list of strings. The two columns you want to use as keys to look up the geographic distance.

      • source_column_a – Type: string. First column to join on in the source data.

      • source_column_b – Type: string. Second column to join on in the source data.

      • loc_a_0 – Type: string. First column to join on in the look-up table.

      • loc_a_1 – Type: string. First column to join on in the look-up table.

      • loc_b_0 – Type: string. Second column to join on in the look-up table.

      • loc_b_1 – Type: string. Second column to join on in the look-up table.

      • distance_col – Type: string. Name of the column containing the geographic distance in the look-up table.

    • Attributes if using a secondary join:

      • secondary_key_count – Type: int. The number of keys used to join on the secondary (backup) look-up table. Acceptable values are 1 or 2.

      • secondary_table_name – Type: string. What to name the table that will be generated from the secondary_distances_file. If you want to do multiple look-ups, if the table_name is the same across all feature specifications, it will only be read in once.

      • secondary_distances_file – Type: string of path. Path to the secondary distances look-up file.

      • secondary_source_column – Type: string. The column in the input data that you want to use as a key in the secondary geographic distance look-up.

      • secondary_loc_a – Type: string. First column to join on in the secondary look-up table.

      • secondary_loc_b – Type: string. Second column to join on in the secondary look-up table.

      • secondary_distance_col – Type: string. Name of the column containing the geographic distance in the secondary look-up table.

[[comparison_features]]
alias = "state_distance"
comparison_type = "geo_distance"
key_count = 1
table_name = "state_distance_lookup"
distances_file = "/path/to/county_state_distance.csv"
column_name = "bpl"
loc_a = "statecode1"
loc_b = "statecode2"
distance_col = "dist"


[[comparison_features]]
alias = "county_distance"
comparison_type = "geo_distance"
column_names = ["county", "statefip"]
key_count = 2
table_name = "county_distance_lookup"
distances_file = "/path/to/county_1900_1910_distances_km.csv"
# columns to join on in the data
source_column_a = "county"
source_column_b = "statefip"

# column names from the csv lookup file
loc_a_0 = "from_icpsrctyi"
loc_a_1 = "to_icpsrctyi"
loc_b_0 = "from_statefip"
loc_b_1 = "to_statefip"
distance_col = "distance_km"

# SECONDARY JOIN
secondary_key_count = 1
secondary_table_name = "state_distance_lookup"
secondary_distances_file = "/path/to/state_1900_1910_distances_km.csv"
secondary_source_column = "statefip"
secondary_loc_a = "from_statefip"
secondary_loc_b = "to_statefip"
secondary_distance_col = "distance_km"

fetch_a

Gets the value of column A.

  • Attributes:

    • column_name – Type: string. Required. The column to get the value from.

[[comparison_features]]
alias = "race"
column_name = "race"
comparison_type = "fetch_a"
categorical = true

fetch_b

Gets the value of column B.

  • Attributes:

    • column_name – Type: string. The column to get the value from.

[[comparison_features]]
alias = "race"
column_name = "race"
comparison_type = "fetch_b"
categorical = true

present_both_years

Checks whether column A and column B are both present and both positive (> 0). Evaluates to 1 if both are present and positive and 0 otherwise.

  • Attributes:

    • column_name – Type: string. The column to check. Must be a column with a numerical type.

[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr", "sploc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7
[comparison_features.comp_d]
column_name = "sploc"
comparison_type = "present_both_years"

neither_are_null

Checks that neither column A nor column B is null or the empty string ''. Evaluates to 1 if neither column is null or '' and evaluates to 0 otherwise.

  • Attributes:

    • column_name – Type: string. The column of type string to check.

[[comparison_features]]
alias = "mpres"
column_name = "m_namefrst"
comparison_type = "neither_are_null"
categorical = true

present_and_matching_categorical

Checks that both column A and column B are present and that they match according to SQL’s IS DISTINCT FROM. Evaluates to 0, 1, or 2:

0 → columns are both present and match

1 → columns are both present but are distinct

2 → one or both columns are missing

  • Attributes:

    • column_name – Type: string. Required. The column to check.

present_and_not_equal

Checks that column A and column B are both present but are not equal.

  • Attributes:

    • column_name – Type: string. The column to check.

present_and_equal_categorical_in_universe

Checks that column A and column B are both present, are not equal to the not-in-universe value NIU, and are equal to each other according to SQL’s IS DISTINCT FROM. Evaluates to 0 if either column is missing or if either column is the NIU value. Otherwise, evaluates to 0 if the columns are distinct or 1 if the columns are equal.

  • Attributes:

    • column_name – Type: string. Required. The column to check.

    • NIU – Type: same as the type of the input column. Required. The not-in-universe value to use in the check.

[[comparison_features]]
alias = "mfbplmatch"
column_name = "nativity"
comparison_type = "present_and_equal_categorical_in_universe"
NIU = "0"
categorical = true

sql_condition

This is a flexible comparison type that allows users to write their own SQL expressions to be evaluated. Favor using a different comparison type if that’s a reasonable option. If there are no other comparison types that work for a particular use case, this one is a good fallback.

  • Attributes:

    • column_names – Type: list of strings. Required. A list of all columns used in the SQL expression.

    • condition – Type: string. The SQL expression to evaluate.

In this example, we make use of the hlink-defined jw function, which computes the Jaro-Winkler similarity of two strings. nvl is a Spark builtin function which returns its second argument if the first is null, and the first argument otherwise.

[[comparison_features]]
alias = "namelast_jw_max"
comparison_type = "sql_condition"
column_names = ["namelast1", "namelast2", "namelast3"]
condition = "GREATEST(jw(nvl(a.namelast1, ''), nvl(b.namelast1, '')), jw(nvl(a.namelast2, ''), nvl(b.namelast2, '')), jw(nvl(a.namelast3, ''), nvl(b.namelast3, '')))"

Feature add-ons

These attributes can be added to most comparison feature types above to extend the type of output returned beyond the standard comparison feature.

alias

  • Attributes:

    • alias: Type: string. Should be used at the top level comparison of every comparison feature. The name for the output column.

[[comparison_features]]
alias = "jw_f"
column_name = "father_namefrst"
comparison_type = "jaro_winkler"

power

Raises a comparison feature to a given exponential power.

  • Attributes:

    • power – Type: int. The power to raise the comparison output to. For example, power = 2 will square the output.

[[comparison_features]]
alias = "county_distance_squared"
comparison_type = "geo_distance"
column_names = ["county", "statefip"]
# PRIMARY JOIN
# key count: the number of keys used for the join per source file.  Ex: for state and county, key_count = 2.  For just state, key_count = 1 even though there is county_a and county_b
key_count = 2
table_name = "county_distance_lookup"
#distances_file = "/path/to/county_state_distance.csv"
distances_file = "/path/to/county_1900_1910_distances_km.csv"
# columns to join on in the data
source_column_a = "county"
source_column_b = "statefip"
# column names from the csv lookup file
loc_a_0 = "from_icpsrctyi"
loc_a_1 = "to_icpsrctyi"
loc_b_0 = "from_statefip"
loc_b_1 = "to_statefip"
distance_col = "distance_km"
# SECONDARY JOIN
secondary_key_count = 1
secondary_table_name = "state_distance_lookup"
secondary_distances_file = "/path/to/state_1900_1910_distances_km.csv"
secondary_source_column = "statefip"
secondary_loc_a = "from_statefip"
secondary_loc_b = "to_statefip"
secondary_distance_col = "distance_km"
power = 2

threshold

  • Attributes:

    • threshold – Type: numeric types. Asserts if the comparison feature output is not null and is greater than or equal to (>=) the given threshold value.

[[comparison_features]]
alias = "imm"
column_name = "nativity"
comparison_type = "fetch_a"
threshold = 5
categorical = true

lower_threshold

  • Attributes:

    • lower_threshold – Type: numeric types. Asserts if the comparison feature output is not null and is greater than or equal to (>=) the given threshold value.

[[comparison_features]]
alias = "street_jw"
comparison_type = "times"
column_names = ["street","county", "statefip"]
[comparison_features.comp_a]
column_name = "street"
comparison_type = "jaro_winkler"
lower_threshold = 0.9
[comparison_features.comp_b]
comparison_type = "and"
column_names = ["county", "statefip"]
[comparison_features.comp_b.comp_a]
column_name = "county"
comparison_type = "equals"
[comparison_features.comp_b.comp_b]
column_name = "statefip"
comparison_type = "equals"

upper_threshold

  • Attributes:

    • upper_threshold – Type: numeric types. Asserts if the comparison feature output is not null and is less than or equal to (<=) the given threshold value.

[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr", "sploc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7
[comparison_features.comp_d]
column_name = "sploc"
comparison_type = "present_both_years"

gt_threshold

  • Attributes:

    • gt_threshold – Type: numeric types. Asserts if the comparison feature output is not null and is greater than (>) the given threshold value.

[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr", "sploc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7
[comparison_features.comp_d]
column_name = "sploc"
comparison_type = "present_both_years"

btwn_threshold

  • Attributes:

    • btwn_threshold – Type: List of numeric type. Asserts if the comparison feature output is greater than or equal to (>=) the first threshold value, and less than or equal to (<=) the second threshold value.

[[comparison_features]]
alias = "mardurmatch"
column_name = "durmarr"
not_equals = 99
comparison_type = "b_minus_a"
btwn_threshold = [5,14]
categorical = true

look_at_addl_var

  • Attributes:

    • look_at_addl_var – Type: boolean. Flags the program to consider an additional column value before reporting the comparison feature value.

    • addl_var – Type: string. The additional column to consider.

    • check_val_expr – Type: expression. The expression to use to evaluate the additional column. For example, check_val_expr = "= 5".

    • else_val – Type: same type as comparison feature output. If the additional volumn value does not meet the check_val_expr specification, the value to return instead of the comparison feature value.

In the following example, the generated SQL expression for the column would be: CASE WHEN {datasource}.nativity = 5 then {yrimmig abs_diff value} else -1 END.

[[comparison_features]]
alias = "immyear_diff"
column_name = "yrimmig"
comparison_type = "abs_diff"
look_at_addl_var = true
addl_var = "nativity"
datasource = "a"
check_val_expr = "= 5"
else_val = -1

Aggregate Features

These features are not configurable. To include them in the generated comparison features, they just need to be included in the [training][independent_vars] section of the config. They are generated using the “aggregate_features” SQL template.

hits

The number of potential matches generated for the given individual (counted by aggregating on {id_column}_a).

hits2

hits squared.

exact_mult

Indicator for the existence of multiple potential matches with the exact same first and last name as the A sample individual within the B data. Returns numeric boolean (0 or 1).

Household Aggregate Features

These features are not configurable. To include them in the generated comparison features, they just need to be included in the [hh_training][independent_vars] section of the config. They are generated using the “hh_aggregate_features” SQL template.

jw_max_a

The highest Jaro-Winkler score for any of the first names in linked household A against the first name in linked household B where birth year difference is less than or equal to ten, excluding the individual A in the current potential match. Returns 0 if no other individuals are in the household for comparison.

jw_max_b

The highest Jaro-Winkler score for any of the first names in linked household A against the first name in linked household B where sex matches and birth year difference is less than or equal to ten, excluding the individual A in the current potential match. Returns 0 if no other individuals are in the household for comparison.