Link Tasks¶
Preprocessing¶
Overview¶
Read in raw data and prepare it for linking. This task may include a variety of transformations on the data, such as stripping out whitespace and normalizing strings that have common abbreviations. The same transformations are applied to both input datasets.
Task steps¶
Step 0: Read raw data in from Parquet or CSV files. Register the raw dataframes with the program.
Step 1: Prepare the dataframes for linking. Perform substitutions, transformations, and column mappings as requested.
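As a rough illustration of the kind of transformation Step 1 describes, the sketch below strips whitespace and expands a few common abbreviations in plain Python. The function name normalize_name and the abbreviation table are hypothetical and are not part of hlink, which applies its transformations to dataframes as requested in the config file.

```python
import re

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"wm": "william", "jno": "john", "chas": "charles"}


def normalize_name(raw: str) -> str:
    """Lowercase, trim, collapse inner whitespace, and expand abbreviations."""
    cleaned = re.sub(r"\s+", " ", raw.strip().lower())
    # Look tokens up without a trailing period so "wm." matches "wm".
    tokens = [ABBREVIATIONS.get(token.rstrip("."), token) for token in cleaned.split(" ")]
    return " ".join(tokens)


# The same function is applied to both input datasets.
dataset_a = ["  Wm.  Smith ", "Mary Jones"]
dataset_b = ["william smith", " MARY   JONES"]
prepared_a = [normalize_name(name) for name in dataset_a]
prepared_b = [normalize_name(name) for name in dataset_b]
```

Applying one shared function to both datasets keeps values comparable once the records reach the later tasks.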
Training and Household Training¶
Overview¶
Train a machine learning model to classify potential links. This requires training data, which is read in during the first step. Comparison features are generated for the training data, and then the model is trained on the data and saved for use in the Matching task. The last step optionally saves model metadata, such as feature importances or coefficients, to help with introspection.
Task steps¶
The first three steps in each of these tasks are the same:
Step 0: Ingest the training data from a CSV file.
Step 1: Create comparison features.
Step 2: Train and save the model.
Step 3: Save the coefficients or feature importances of the model for inspection. This step is skipped by default. To enable it, set the training.feature_importances and/or the hh_training.feature_importances config attribute(s) to true in your config file.
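The opt-in behavior of Step 3 can be pictured with a small sketch. It assumes the config has already been loaded into a nested dictionary. The helper should_save_model_metadata is hypothetical and only mirrors the rule stated above, not hlink's internal code.

```python
def should_save_model_metadata(config: dict, section: str = "training") -> bool:
    """Return True only when <section>.feature_importances is set to true."""
    # A missing section or attribute means the step is skipped (the default).
    return bool(config.get(section, {}).get("feature_importances", False))


config = {
    "training": {"feature_importances": True},
    "hh_training": {},  # attribute not set, so the step stays disabled
}

run_for_training = should_save_model_metadata(config, "training")
run_for_hh_training = should_save_model_metadata(config, "hh_training")
```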
Related Configuration Sections¶
The training section is the most important for Training and provides configuration attributes for many aspects of the task. For Household Training, use the hh_training section instead.
comparison_features and pipeline_features are both generated in order to train the model. These sections are also used extensively by the Matching task.
Matching¶
Overview¶
Run the linking algorithm, generating a table with potential matches between records in the two datasets. This is the core of hlink’s work and may take the longest of all of the tasks. Universe definition and blocking reduce the number of comparisons needed when determining potential matches, which can drastically improve the runtime of Matching.
Task steps¶
Step 0: Perform blocking, separating records into different buckets to reduce the total number of comparisons needed during matching. Some columns may be “exploded” here if needed.
Step 1: Run the matching algorithm, outputting potential matches to the potential_matches table.
Step 2: Score the potential matches with the trained model. This step will be automatically skipped if machine learning is not being used.
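To show why blocking reduces the number of comparisons, here is a minimal sketch in plain Python. The record layout, the blocking keys (sex and birthplace), and the helper names are assumptions made for this example. hlink performs blocking on dataframes according to the config rather than with these functions.

```python
from collections import defaultdict
from itertools import product


def block_records(records, key_fields):
    """Group records into buckets keyed by the values of key_fields."""
    buckets = defaultdict(list)
    for record in records:
        buckets[tuple(record[field] for field in key_fields)].append(record)
    return buckets


def candidate_pairs(records_a, records_b, key_fields):
    """Yield (id_a, id_b) only for records that share a blocking bucket."""
    buckets_a = block_records(records_a, key_fields)
    buckets_b = block_records(records_b, key_fields)
    for key, bucket_a in buckets_a.items():
        for rec_a, rec_b in product(bucket_a, buckets_b.get(key, [])):
            yield rec_a["id"], rec_b["id"]


records_a = [
    {"id": "a1", "sex": "F", "birthplace": "OH"},
    {"id": "a2", "sex": "M", "birthplace": "NY"},
]
records_b = [
    {"id": "b1", "sex": "F", "birthplace": "OH"},
    {"id": "b2", "sex": "M", "birthplace": "OH"},
    {"id": "b3", "sex": "M", "birthplace": "NY"},
]

pairs = sorted(candidate_pairs(records_a, records_b, ["sex", "birthplace"]))
# Without blocking there would be 2 * 3 = 6 comparisons; here only 2 remain.
```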
Related Configuration Sections¶
The potential_matches_universe section may be used to provide a universe for matches in the form of a SQL condition. Only records that satisfy the condition are eligible for matching.
blocking specifies how to block the input records into separate buckets before matching. Two records are eligible to match with one another only if they are grouped into the same blocking bucket.
comparison_features support computing features on each record. These features may be passed to a machine learning model through the training section and/or passed to deterministic rules with the comparisons section. There are many different comparison types available for use with comparison_features.
pipeline_features are machine learning transformations useful for reshaping and interacting data before they are fed to the machine learning model.
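The relationship between comparison features and deterministic rules might look like the following sketch. The feature names (surname_exact, byr_diff), the record fields, and the two-year tolerance are all hypothetical. They stand in for whatever comparison types and comparisons a real config defines.

```python
def comparison_features(rec_a: dict, rec_b: dict) -> dict:
    """Compute illustrative features for one potential match."""
    return {
        "surname_exact": rec_a["surname"] == rec_b["surname"],
        "byr_diff": abs(rec_a["birthyr"] - rec_b["birthyr"]),
    }


def passes_rule(features: dict, max_byr_diff: int = 2) -> bool:
    """Deterministic rule: same surname and birth years within max_byr_diff."""
    return features["surname_exact"] and features["byr_diff"] <= max_byr_diff


rec_a = {"surname": "smith", "birthyr": 1880}
rec_b = {"surname": "smith", "birthyr": 1881}
rec_c = {"surname": "smith", "birthyr": 1890}

features_ab = comparison_features(rec_a, rec_b)
features_ac = comparison_features(rec_a, rec_c)
```

The same feature dictionary could instead be handed to a trained model for scoring, which is the other route described above.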
Household Matching¶
Overview¶
Generate a table with potential matches between households in the two datasets.
Task steps¶
Step 0: Block on households.
Step 1: Filter households based on hh_comparisons configuration settings.
Step 2: Score the potential matches with the trained model. This step will be automatically skipped if machine learning is not being used.
Related Configuration Sections¶
comparison_features and pipeline_features are used as they are in the Matching task.
hh_comparisons correspond to comparisons in the Matching task and may be thought of as “post-blocking filters”. Only potential matches that pass these comparisons will be eligible for being scored as matches.
hh_training corresponds to training in Matching.
Model Exploration and Household Model Exploration¶
Overview¶
Evaluate the performance of different types of models and different parameter combinations on training data. These tasks are highly configurable and are typically not part of a full linking run. Instead, they are usually run ahead of time, and then the best-performing model is chosen and used for the full linking run.
Task steps¶
The steps in each of these tasks are the same:
Step 0: Ingest the training data file specified in the config with the dataset attribute.
Step 1: Create training features on the training data. If the use_training_data_features attribute is provided in the respective training config section, then instead read features from the training data file.
Step 2: Run n_training_iterations train-test splits on each of the models listed in model_parameters in the config.
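The loop in Step 2 can be sketched as repeated train-test splits over a list of candidate models. Everything below is an assumption made for illustration: the toy "model" is a single score cutoff fit on the training split, and the offset parameter and the helper names are hypothetical. Only the names model_parameters and n_training_iterations come from the text above.

```python
import random


def train_test_split(rows, test_fraction, rng):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_threshold(train, offset):
    """Fit a cutoff halfway between mean match and non-match scores."""
    matches = [score for score, label in train if label]
    non_matches = [score for score, label in train if not label]
    midpoint = (sum(matches) / len(matches) + sum(non_matches) / len(non_matches)) / 2
    return midpoint + offset


def accuracy(threshold, rows):
    """Share of rows where (score >= threshold) agrees with the label."""
    correct = sum((score >= threshold) == label for score, label in rows)
    return correct / len(rows)


def explore(rows, model_parameters, n_training_iterations, seed=0):
    """Average test accuracy per candidate model over repeated splits."""
    rng = random.Random(seed)
    results = {}
    for params in model_parameters:
        scores = []
        for _ in range(n_training_iterations):
            train, test = train_test_split(rows, 0.25, rng)
            threshold = fit_threshold(train, params["offset"])
            scores.append(accuracy(threshold, test))
        results[params["name"]] = sum(scores) / len(scores)
    return results


# (similarity score, is_true_match) pairs standing in for training data.
rows = [(0.95, True), (0.9, True), (0.8, True), (0.7, True),
        (0.4, False), (0.3, False), (0.2, False), (0.1, False)]
model_parameters = [
    {"name": "midpoint", "offset": 0.0},
    {"name": "strict", "offset": 0.4},
]
results = explore(rows, model_parameters, n_training_iterations=5)
best = max(results, key=results.get)
```

Averaging over several splits makes the comparison between candidates less sensitive to any single random partition of the training data.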
Related Configuration Sections¶
training is used extensively by Model Exploration, and hh_training is used extensively by Household Model Exploration.
comparison_features and pipeline_features are used to generate features that are passed as input to the trained models.
Reporting¶
Overview¶
Report on characteristics of the linked data. This task is experimental and focused primarily on demographic census data. At the moment, it does not allow very much configuration.
Task steps¶
Step 0: For households with anyone linked in Matching, report the percent of remaining household members linked in Household Matching.
Step 1: Report on the representativeness of linked data compared to source populations.
Step 2: Pull in key demographic data for linked individuals and export a fixed-width crosswalk file.
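One possible reading of the Step 0 statistic is sketched below. The data layout (a mapping from household ID to member IDs, plus two sets of linked person IDs) and the function name are hypothetical. The actual report is computed by hlink from its own tables and may define the measure differently.

```python
def household_link_rates(households, linked_in_matching, linked_in_hh_matching):
    """Percent of remaining members linked, per household with a Matching link."""
    rates = {}
    for household_id, members in households.items():
        if not any(member in linked_in_matching for member in members):
            continue  # no one linked in Matching, so the household is not reported
        remaining = [m for m in members if m not in linked_in_matching]
        if not remaining:
            continue  # nobody left for Household Matching to link
        linked = sum(m in linked_in_hh_matching for m in remaining)
        rates[household_id] = 100.0 * linked / len(remaining)
    return rates


households = {
    "h1": ["p1", "p2", "p3"],
    "h2": ["p4", "p5"],
    "h3": ["p6"],
}
linked_in_matching = {"p1", "p6"}
linked_in_hh_matching = {"p2"}

rates = household_link_rates(households, linked_in_matching, linked_in_hh_matching)
```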
Related Configuration Sections¶
The alias attributes are read from both datasource_a and datasource_b. The step uses them to construct the output reports.