# Link Tasks ## Preprocessing ### Overview Read in raw data and prepare it for linking. This task may include a variety of transformations on the data, such as stripping out whitespace and normalizing strings that have common abbreviations. The same transformations are applied to both input datasets. ### Task steps * Step 0: Read raw data in from Parquet or CSV files. Register the raw dataframes with the program. * Step 1: Prepare the dataframes for linking. Perform substitutions, transformations, and column mappings as requested. ### Related Configuration Sections * The [`datasource_a` and `datasource_b`](config.html#data-sources) sections specify where to find the input data. * [`column_mappings`](column_mappings.html#column-mappings), [`feature_selections`](feature_selection_transforms.html#feature-selection-transforms), and [`substitution_columns`](substitutions.html#substitutions) may all be used to define transformations on the input data. * The [`filter`](config.html#filter) section may be used to filter some records out of the input data as they are read in. ## Training and Household Training ### Overview Train a machine learning model to use for classification of potential links. This requires training data, which is read in in the first step. Comparison features are generated for the training data, and then the model is trained on the data and saved for use in the Matching task. The last step optionally saves some metadata like feature importances or coefficients for the model to help with introspection. ### Task steps The first three steps in each of these tasks are the same: * Step 0: Ingest the training data from a CSV file. * Step 1: Create comparison features. * Step 2: Train and save the model. The last step is available only for Training, not for Household Training. * Step 3: Save the coefficients or feature importances of the model for inspection. This step is skipped by default. To enable it, set the `training.feature_importances` config attribute to true in your config file. ### Related Configuration Sections * The [`training`](config.html#training-and-models) section is the most important for Training and provides configuration attributes for many aspects of the task. For Household Training, use the [`hh_training`](config.html#household-training-and-models) section instead. * [`comparison_features`](config.html#comparison-features) and [`pipeline_features`](pipeline_features.html#pipeline-generated-features) are both generated in order to train the model. These sections are also used extensively by the Matching task. ## Matching ### Overview Run the linking algorithm, generating a table with potential matches between records in the two datasets. This is the core of hlink's work and may take the longest of all of the tasks. Universe definition and blocking reduce the number of comparisons needed when determining potential matches, which can drastically improve the runtime of Matching. ### Task steps * Step 0: Perform blocking, separating records into different buckets to reduce the total number of comparisons needed during matching. Some columns may be "exploded" here if needed. * Step 1: Run the matching algorithm, outputting potential matches to the `potential_matches` table. * Step 2: Score the potential matches with the trained model. This step will be automatically skipped if machine learning is not being used. ### Related Configuration Sections * The [`potential_matches_universe`](config.html#potential-matches-universe) section may be used to provide a universe for matches in the form of a SQL condition. Only records that satisfy the condition are eligible for matching. * [`blocking`](config.html#blocking) specifies how to block the input records into separate buckets before matching. Two records are eligible to match with one another only if they are grouped into the same blocking bucket. * [`comparison_features`](config.html#comparison-features) support computing features on each record. These features may be passed to a machine learning model through the [`training`](config.html#training-and-models) section and/or passed to deterministic rules with the [`comparisons`](config.html#comparisons) section. There are many different [comparison types](comparison_features) available for use with `comparison_features`. * [`pipeline_features`](pipeline_features.html#pipeline-generated-features) are machine learning transformations useful for reshaping and interacting data before they are fed to the machine learning model. ## Household Matching ### Overview Generate a table with potential matches between households in the two datasets. ### Task steps * Step 0: Block on households. * Step 1: Filter households based on `hh_comparisons` configuration settings. * Step 2: Score the potential matches with the trained model. This step will be automatically skipped if machine learning is not being used. ### Related Configuration Sections * [`comparison_features`](config.html#comparison-features) and [`pipeline_features`](pipeline_features.html#pipeline-generated-features) are used as they are in the Matching task. * [`hh_comparisons`](config.html#household-comparisons) correspond to `comparisons` in the Matching task and may be thought of as "post-blocking filters". Only potential matches that pass these comparisons will be eligible for being scored as matches. * [`hh_training`](config.html#household-training-and-models) corresponds to `training` in Matching. ## Model Exploration and Household Model Exploration ### Overview Evaluate the performance of different types of models and different parameter combinations on training data. These tasks are highly configurable and are typically not part of a full linking run. Instead, they are usually run ahead of time, and then the best-performing model is chosen and used for the full linking run. ### Task steps The steps in each of these tasks are the same: * Step 0: Ingest the training data file specified in the config with the `dataset` attribute. * Step 1: Create training features on the training data. If the `use_training_data_features` attribute is provided in the respective training config section, then instead read features from the training data file. * Step 2: Run `n_training_iterations` number of train-test splits on each of the models in the config `model_parameters`. ### Related Configuration Sections * [`training`](config.html#training-and-models) is used extensively by Model Exploration, and [`hh_training`](config.html#household-training-and-models) is used extensively by Household Model Exploration. * [`comparison_features`](config.html#comparison-features) and [`pipeline_features`](pipeline_features.html#pipeline-generated-features) are used to generate features that are passed as input to the trained models. ## Reporting ### Overview Report on characteristics of the linked data. This task is experimental and focused primarily on demographic census data. At the moment, it does not allow very much configuration. ### Task steps * Step 0: For households with anyone linked in Matching, report the percent of remaining household members linked in Household Matching. * Step 1: Report on the representivity of linked data compared to source populations. * Step 2: Pull in key demographic data for linked individuals and export a fixed-width crosswalk file. ### Related Configuration Sections * The `alias` attributes are read from both [`datasource_a`](config.html#data-sources) and [`datasource_b`](config.html#data-sources). The step uses them to construct the output reports.