Link Tasks
Preprocessing
Overview
Read in raw data and prepare it for linking. This task may include a variety of transformations on the data, such as stripping out whitespace and normalizing strings that have common abbreviations. The same transformations are applied to both input datasets.
Task steps
Step 0: Read raw data in from Parquet or CSV files. Register the raw dataframes with the program.
Step 1: Prepare the dataframes for linking. Perform substitutions, transformations, and column mappings as requested.
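As a rough illustration of Step 1, the sketch below applies the same whitespace stripping and abbreviation substitution to two small in-memory datasets. It is written in plain Python and does not use hlink's own API; the function names, the ABBREVIATIONS table, the namefrst column, and the sample records are all hypothetical.

```python
# Hypothetical abbreviation table; real substitutions would come from your config.
ABBREVIATIONS = {"wm": "william", "jno": "john", "chas": "charles"}


def clean_name(raw):
    """Collapse whitespace, lowercase, and expand a known abbreviation."""
    name = " ".join(raw.split()).lower()
    return ABBREVIATIONS.get(name, name)


def prepare(records):
    """Return a copy of each record with a cleaned 'namefrst' value."""
    return [{**rec, "namefrst": clean_name(rec["namefrst"])} for rec in records]


# The same transformation is applied to both input datasets.
dataset_a = [{"id": 1, "namefrst": "  Wm "}, {"id": 2, "namefrst": "Mary"}]
dataset_b = [{"id": 10, "namefrst": "william"}, {"id": 11, "namefrst": "MARY  "}]

prepared_a = prepare(dataset_a)
prepared_b = prepare(dataset_b)
```

After this step, the first record in each dataset carries the same normalized name, which is what makes the later comparisons meaningful.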
Training and Household Training
Overview
Train a machine learning model to use for classification of potential links. This requires training data, which is read in during the first step. Comparison features are generated for the training data, and then the model is trained on the data and saved for use in the Matching task. The last step optionally saves some metadata like feature importances or coefficients for the model to help with introspection.
Task steps
The first three steps in each of these tasks are the same:
Step 0: Ingest the training data from a CSV file.
Step 1: Create comparison features.
Step 2: Train and save the model.
The last step is available only for Training, not for Household Training.
Step 3: Save the coefficients or feature importances of the model for inspection. This step is skipped by default. To enable it, set the training.feature_importances config attribute to true in your config file.
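The way the optional last step depends on the config can be sketched as follows. This is only an illustration of the rule described above, not hlink's implementation: the config is modeled as a plain Python dict standing in for a parsed config file, and the training_steps function and its household flag are hypothetical.

```python
def training_steps(config, household=False):
    """List the step names that would run for a training task.

    `config` is a plain dict standing in for a parsed config file.
    """
    steps = [
        "ingest training data",
        "create comparison features",
        "train and save model",
    ]
    section = config.get("training", {})
    # Step 3 exists only for Training and is skipped unless explicitly enabled.
    if not household and section.get("feature_importances", False):
        steps.append("save coefficients or feature importances")
    return steps


default_config = {"training": {}}
enabled_config = {"training": {"feature_importances": True}}

default_steps = training_steps(default_config)
enabled_steps = training_steps(enabled_config)
household_steps = training_steps(enabled_config, household=True)
```

Only enabled_steps contains the fourth step; the default config and the Household Training case both stop after the model is saved.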
Related Configuration Sections
The training section is the most important for Training and provides configuration attributes for many aspects of the task. For Household Training, use the hh_training section instead.
comparison_features and pipeline_features are both generated in order to train the model. These sections are also used extensively by the Matching task.
Matching
Overview
Run the linking algorithm, generating a table with potential matches between records in the two datasets. This is the core of hlink’s work and may take the longest of all of the tasks. Universe definition and blocking reduce the number of comparisons needed when determining potential matches, which can drastically improve the runtime of Matching.
Task steps
Step 0: Perform blocking, separating records into different buckets to reduce the total number of comparisons needed during matching. Some columns may be “exploded” here if needed.
Step 1: Run the matching algorithm, outputting potential matches to the potential_matches table.
Step 2: Score the potential matches with the trained model. This step will be automatically skipped if machine learning is not being used.
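To see why blocking in Step 0 cuts down the work, consider this minimal sketch in plain Python. It is not hlink's blocking code: the block and candidate_pairs functions, the bpl blocking column, and the sample records are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def block(records, key):
    """Group records into buckets keyed by the value of `key`."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec[key]].append(rec)
    return buckets


def candidate_pairs(records_a, records_b, key):
    """Yield only pairs of records that fall into the same blocking bucket."""
    buckets_a = block(records_a, key)
    buckets_b = block(records_b, key)
    for value in buckets_a.keys() & buckets_b.keys():
        yield from product(buckets_a[value], buckets_b[value])


records_a = [{"id": 1, "bpl": "MN"}, {"id": 2, "bpl": "WI"}, {"id": 3, "bpl": "MN"}]
records_b = [{"id": 10, "bpl": "MN"}, {"id": 11, "bpl": "IA"}]

pairs = sorted(
    (a["id"], b["id"]) for a, b in candidate_pairs(records_a, records_b, "bpl")
)
```

Without blocking, these inputs would produce 3 x 2 = 6 comparisons. Blocking on the shared column leaves only the two pairs whose records share a bucket.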
Related Configuration Sections
The potential_matches_universe section may be used to provide a universe for matches in the form of a SQL condition. Only records that satisfy the condition are eligible for matching.
blocking specifies how to block the input records into separate buckets before matching. Two records are eligible to match with one another only if they are grouped into the same blocking bucket.
comparison_features support computing features on each record. These features may be passed to a machine learning model through the training section and/or passed to deterministic rules with the comparisons section. There are many different comparison types available for use with comparison_features.
pipeline_features are machine learning transformations useful for reshaping and interacting data before they are fed to the machine learning model.
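The relationship between a universe condition, comparison features, and deterministic rules might look roughly like the sketch below. Everything in it is hypothetical: the Python predicate stands in for a SQL universe condition, and the feature names, the age threshold, and the sample records are invented for illustration rather than taken from hlink.

```python
def in_universe(record):
    """Stand-in for a SQL universe condition such as "age >= 18" (hypothetical)."""
    return record["age"] >= 18


def comparison_features(a, b):
    """Compute illustrative features for one candidate pair of records."""
    return {
        "same_name": a["name"] == b["name"],
        "age_diff": abs(a["age"] - b["age"]),
    }


def passes_rules(features, max_age_diff=2):
    """Deterministic rule: names agree and ages differ by at most `max_age_diff`."""
    return features["same_name"] and features["age_diff"] <= max_age_diff


a = {"name": "anna", "age": 30}
b = {"name": "anna", "age": 31}
child = {"name": "anna", "age": 9}

# Only records that satisfy the universe condition are eligible for matching.
eligible = [rec for rec in (a, child) if in_universe(rec)]
features = comparison_features(a, b)
is_match = passes_rules(features)
```

The same feature dictionary could instead be handed to a trained model; the deterministic rule here is just the simpler of the two paths described above.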
Household Matching
Overview
Generate a table with potential matches between households in the two datasets.
Task steps
Step 0: Block on households.
Step 1: Filter households based on hh_comparisons configuration settings.
Step 2: Score the potential matches with the trained model. This step will be automatically skipped if machine learning is not being used.
Related Configuration Sections
comparison_features and pipeline_features are used as they are in the Matching task.
hh_comparisons correspond to comparisons in the Matching task and may be thought of as “post-blocking filters”. Only potential matches that pass these comparisons will be eligible for being scored as matches.
hh_training corresponds to training in Matching.
Model Exploration and Household Model Exploration
Overview
Evaluate the performance of different types of models and different parameter combinations on training data. These tasks are highly configurable and are typically not part of a full linking run. Instead, they are usually run ahead of time, and then the best-performing model is chosen and used for the full linking run.
Task steps
The steps in each of these tasks are the same:
Step 0: Ingest the training data file specified in the config with the dataset attribute.
Step 1: Create training features on the training data. If the use_training_data_features attribute is provided in the respective training config section, then instead read features from the training data file.
Step 2: Run n_training_iterations number of train-test splits on each of the models in the config model_parameters.
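The loop in Step 2 can be pictured with the sketch below, which repeats a random train-test split n_training_iterations times for every entry in model_parameters and averages the test accuracy. It is a simplified stand-in, not hlink's code: the "model" is a toy threshold classifier with nothing to fit, and the 25% test fraction, the seed, and the synthetic rows are assumptions.

```python
import random
from statistics import mean


def train_test_split(rows, test_fraction, rng):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def accuracy(threshold, rows):
    """Score a toy classifier that predicts a match when score >= threshold."""
    correct = sum((score >= threshold) == label for score, label in rows)
    return correct / len(rows)


def explore(rows, model_parameters, n_training_iterations, seed=0):
    """Average test accuracy of each parameter set over repeated splits."""
    rng = random.Random(seed)
    results = {}
    for params in model_parameters:
        scores = []
        for _ in range(n_training_iterations):
            _train, test = train_test_split(rows, 0.25, rng)
            # A real model would be fit on `_train` here; the toy one has nothing to fit.
            scores.append(accuracy(params["threshold"], test))
        results[params["threshold"]] = mean(scores)
    return results


# Toy training data: (similarity score, is_match) pairs.
rows = [(i / 20, i >= 10) for i in range(20)]
model_parameters = [{"threshold": 0.5}, {"threshold": 0.9}]
results = explore(rows, model_parameters, n_training_iterations=5)
best_threshold = max(results, key=results.get)
```

Comparing the averaged scores across parameter sets is what lets you pick the best-performing model ahead of a full linking run.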
Related Configuration Sections
training is used extensively by Model Exploration, and hh_training is used extensively by Household Model Exploration.
comparison_features and pipeline_features are used to generate features that are passed as input to the trained models.
Reporting
Overview
Report on characteristics of the linked data. This task is experimental and focused primarily on demographic census data. At the moment, it does not allow very much configuration.
Task steps
Step 0: For households with anyone linked in Matching, report the percent of remaining household members linked in Household Matching.
Step 1: Report on the representativeness of linked data compared to source populations.
Step 2: Pull in key demographic data for linked individuals and export a fixed-width crosswalk file.
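The statistic described in Step 0 can be sketched in a few lines of plain Python. This is one possible reading of that step, not hlink's reporting code: the function name, the data layout (household ID mapped to member IDs), and the choice to skip households with no remaining members are all assumptions.

```python
def percent_remaining_linked(households, linked_in_matching, linked_in_hh_matching):
    """For each household with at least one Matching link, return the percent
    of its other members that were linked in Household Matching."""
    report = {}
    for hh_id, members in households.items():
        anchors = [m for m in members if m in linked_in_matching]
        if not anchors:
            continue  # no one linked in Matching, so the household is not reported
        remaining = [m for m in members if m not in linked_in_matching]
        if not remaining:
            continue  # nobody left to link (an assumption of this sketch)
        linked = sum(m in linked_in_hh_matching for m in remaining)
        report[hh_id] = 100.0 * linked / len(remaining)
    return report


households = {"h1": ["p1", "p2", "p3"], "h2": ["p4", "p5"], "h3": ["p6"]}
linked_in_matching = {"p1", "p6"}
linked_in_hh_matching = {"p2"}

report = percent_remaining_linked(
    households, linked_in_matching, linked_in_hh_matching
)
```

Here only household h1 is reported: one of its two remaining members was linked in Household Matching, so it scores 50 percent.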
Related Configuration Sections
The alias attributes are read from both datasource_a and datasource_b. The step uses them to construct the output reports.