Changelog¶
The format of this changelog is based on Keep A Changelog. Hlink adheres to semantic versioning as much as possible.
v4.2.1 (2025-08-18)¶
Fixed¶
Fixed a bug where hlink would throw an error if you tried to manually provide a seed to the random_forest, decision_tree, or gradient_boosted_trees machine learning model. Now hlink accepts the seed parameter and passes it along to the model. If you do not pass a seed parameter, hlink behaves in the same way as before and automatically sets a default seed for you. PR #222
Fixed a bug where hlink did not automatically set the seed for the XGBoost and LightGBM machine learning models. The new behavior is to accept the seed parameter if it is passed by the user, or set a default seed if it is not passed. This matches the new behavior for the other models. PR #222
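The seed handling described in these two fixes can be sketched roughly as follows. This is an illustration only, not hlink's actual code: the function name resolve_seed, the DEFAULT_SEED value, and the shape of the parameter dictionary are all assumptions made for the example.

```python
# Hypothetical sketch of "use the user's seed if given, otherwise a default".
# DEFAULT_SEED is a placeholder value, not the seed hlink really uses.
DEFAULT_SEED = 0


def resolve_seed(model_params: dict) -> dict:
    """Return a copy of model_params in which a seed is always present."""
    resolved = dict(model_params)  # copy so the caller's dict is not mutated
    # Keep a user-provided seed; fall back to the default only when it is absent.
    resolved.setdefault("seed", DEFAULT_SEED)
    return resolved


with_user_seed = resolve_seed({"type": "random_forest", "seed": 42})
with_default_seed = resolve_seed({"type": "random_forest"})
```

Under these assumptions, with_user_seed keeps the seed 42, while with_default_seed receives the placeholder default.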
v4.2.0 (2025-04-29)¶
Added¶
Added documentation for the column mapping transforms condense_prefixes, length, swap_words, expand, and cast_as_int. These transforms have been around for a long time but have been missing documentation until now. PR #212
Added support for custom column mapping transforms. You can now pass a dictionary of custom transforms to the LinkRun constructor, and hlink will automatically invoke them when the configuration calls for them. Please see the hlink.linking.core.column_mapping module for more information. PR #213
Added individual functions which compute the built-in column mapping transforms to hlink.linking.core.column_mapping. These are automatically invoked by select_column_mapping when the configuration calls for them. PR #207
Changed¶
Deprecated¶
The hlink.linking.core.transforms.apply_transform function, which applies column mapping transforms, is now deprecated. Please use hlink.linking.core.column_mapping.apply_transform instead. column_mapping.apply_transform supports the same interface. PR #207
Fixed¶
Fixed a bug where command-line hlink would sometimes crash if the command history file was missing. PR #215
v4.1.0 (2025-04-15)¶
Added¶
Added a new configuration option hh_matching.records_to_match that controls which records are eligible for re-matching in the hh_matching task. You can find the documentation for this option in the new Household Matching section on the Configuration page. PR #201
Added the hh_training.feature_importances configuration option for saving model feature importances or coefficients as step 3 of “Household Training” when set to true. PR #202
Fixed¶
Fixed a bug in the calculation of predicted matches. Previously, if there was a second-best probability, hlink applied the threshold ratio only if the first and second-best probabilities were both at least at the alpha threshold. Now it always applies the threshold ratio when the best probability is at least at the alpha threshold and there is a second-best probability. PR #200
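As a rough illustration of the corrected rule, the sketch below decides whether the best candidate counts as a predicted match. It is not hlink's implementation: the function name is hypothetical, and it assumes that the threshold ratio test compares the best probability against the second-best probability multiplied by the ratio, with inclusive comparisons.

```python
from typing import Optional


def is_predicted_match(
    best: float,
    second_best: Optional[float],
    alpha_threshold: float,
    threshold_ratio: float,
) -> bool:
    """Hypothetical helper illustrating the rule described in the entry above."""
    # The best probability must reach the alpha threshold.
    if best < alpha_threshold:
        return False
    # With no second-best probability, the ratio test does not apply.
    if second_best is None:
        return True
    # Assumed form of the ratio test: best / second_best >= threshold_ratio,
    # written as a multiplication to avoid dividing by zero.
    # It is applied even when second_best is below the alpha threshold.
    return best >= threshold_ratio * second_best


# The second-best probability (0.5) is below alpha (0.8), but the ratio test
# still applies under the new behavior.
close_runner_up = is_predicted_match(0.9, 0.5, alpha_threshold=0.8, threshold_ratio=2.0)
distant_runner_up = is_predicted_match(0.9, 0.3, alpha_threshold=0.8, threshold_ratio=2.0)
```

In this sketch close_runner_up is False and distant_runner_up is True. Under the previous behavior, the first case would have skipped the ratio test because the second-best probability was below the alpha threshold.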
v4.0.0 (2025-04-07)¶
Added¶
Changed¶
Overhauled the model exploration task to use a nested cross-validation approach. PR #169
Changed hlink.linking.core.classifier functions to not interact with threshold and threshold_ratio. Please ensure that the parameter dictionaries passed to these functions only contain parameters for the chosen model. PR #175
Simplified the parameters required for hlink.linking.core.threshold.predict_using_thresholds. Instead of passing the entire training configuration section to this function, you now need only pass training.decision. PR #175
Added a new required checkpoint_dir argument to SparkConnection, which lets hlink set different directories for the tmp and checkpoint directories. PR #182
Swapped to using tomli as the default TOML parser. This should fix several issues with how hlink parses TOML files. load_conf_file() provides the use_legacy_toml_parser argument for backwards compatibility if necessary. PR #185
Deprecated¶
Deprecated the training.param_grid attribute in favor of the new, more flexible training.model_parameter_search table. This is part of supporting the new randomized parameter search. PR #168
Removed¶
Removed functionality for outputting “suspicious” training data from model exploration. We determined that this is out of the scope of model exploration step 2. This change greatly simplifies the model exploration code. PR #178
Removed the deprecated hlink.linking.transformers.interaction_transformer module. This module was deprecated in v3.5.0. Please use pyspark.ml.feature.Interaction instead. PR #184
Removed some alternate configuration syntax which has been deprecated since v3.0.0. PR #184
Removed hlink.scripts.main.load_conf in favor of a much simpler approach to finding the configuration file and configuring Spark. Please call hlink.configs.load_config.load_conf_file directly instead. load_conf_file now returns both the path to the configuration file and its contents as a mapping. PR #182
v3.8.0 (2024-12-04)¶
Added¶
Added optional support for the XGBoost and LightGBM gradient boosting machine learning libraries. You can find documentation on how to use these libraries here. PR #165
Added a new hlink.linking.transformers.RenameVectorAttributes transformer which can rename the attributes or “slots” of Spark vector columns. PR #165
Fixed¶
v3.7.0 (2024-10-10)¶
Added¶
Added an optional argument to SparkConnection to allow setting a custom Spark app name. The default is still to set the app name to “linking”. PR #156
Changed¶
Improved model exploration step 2’s terminal output, logging, and documentation to make the step easier to work with. PR #155
Fixed¶
Updated all modules to log to module-level loggers instead of the root logger. This gives users of the library more control over filtering logs from hlink. PR #152
v3.6.1 (2024-08-14)¶
Fixed¶
Fixed a crash in matching step 0 triggered when there were multiple exploded columns in the blocking section. Multiple exploded columns are now supported. PR #143
v3.6.0 (2024-06-18)¶
Added¶
Added OR conditions in blocking. This new feature supports connecting some or all blocking conditions together with ORs instead of ANDs. Note that using many ORs in blocking may have negative performance implications for large datasets since it increases the size of the blocks and makes each block more difficult to compute. You can find documentation on OR blocking conditions under the or_group bullet point here. PR #138
v3.5.5 (2024-05-31)¶
Added¶
Added support for a variable number of columns in the array feature selection transform, instead of forcing it to use exactly 2 columns. PR #135
v3.5.4 (2024-02-20)¶
Added¶
Fixed¶
Fixed a bug where config validation checks did not respect column mapping overrides. PR #131
v3.5.3 (2023-11-02)¶
Added¶
Fixed¶
v3.5.2 (2023-10-26)¶
Changed¶
Made some minor updates to the format of training step 3’s output. There are now 3 columns: feature_name, category, and coefficient_or_importance. Feature names are not suffixed with the category value anymore. PR #112
BUG reverted in v3.5.3: Started erroring out on invalid categories in training data instead of creating a new category for them. PR #109
Fixed¶
v3.5.1 (2023-10-23)¶
Added¶
Made a new training step 3 to replace model exploration step 3, which was buggy. Training step 3 saves model feature importances or coefficients when training.feature_importances is set to true. PR #101
Removed¶
Removed the buggy implementation of model exploration step 3. Training step 3 replaces this. PR #101
v3.5.0 (2023-10-16)¶
Added¶
Changed¶
Upgraded from PySpark 3.3 to 3.5. PR #94
Deprecated¶
Deprecated the hlink.linking.transformers.interaction_transformer module. Please use PySpark 3’s pyspark.ml.feature.Interaction class instead. Hlink’s interaction_transformer module is scheduled for removal in version 4. PR #97
Fixed¶
Fixed a bug where the hlink script’s autocomplete feature sometimes did not work correctly. PR #96
v3.4.0 (2023-08-09)¶
Added¶
Removed¶
Dropped the comment column from the script’s desc command. This column was always full of nulls and cluttered up the screen. PR #88
v3.3.1 (2023-06-02)¶
Changed¶
Fixed¶
Fixed a bug where comparison features were marked as categorical whenever the categorical key was present, even if it was set to false. PR #82
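This kind of bug typically comes from testing whether a key exists rather than reading its value. The sketch below shows the difference. The function names and the dictionary shape are hypothetical and are not taken from hlink's code.

```python
def is_categorical_buggy(feature: dict) -> bool:
    """Treats a feature as categorical whenever the key exists at all."""
    return "categorical" in feature


def is_categorical_fixed(feature: dict) -> bool:
    """Treats a feature as categorical only when the key is set to a true value."""
    return bool(feature.get("categorical", False))


# A comparison feature that explicitly opts out of categorical handling.
feature = {"alias": "example_feature", "categorical": False}

buggy_result = is_categorical_buggy(feature)
fixed_result = is_categorical_fixed(feature)
```

Here buggy_result is True even though the feature sets categorical to false, while fixed_result is False.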
v3.3.0 (2022-12-13)¶
Added¶
Changed¶
Fixed¶
Fixed a bug which caused Jaro-Winkler scores to be 1.0 for two empty strings. The scores are now 0.0 on two empty strings. PR #59
v3.2.7 (2022-09-14)¶
Added¶
Changed¶
Fixed¶
v3.2.6 (2022-07-18)¶
Added¶
Made hlink installable with pip via PyPI.org.
v3.2.1 (2022-05-24)¶
Added¶
Improved logging during startup and for the LinkTask.run_all_steps() method. PR #7
Changed¶
Added code to adjust the number of Spark partitions based on the size of the input datasets for some link steps. This should help these steps scale better with large datasets. PR #10
Fixed¶
Fixed a bug where model exploration’s step 3 would run into a TypeError due to trying to manually build up a file path. PR #8
v3.2.0 (2022-05-16)¶
Changed¶
v3.1.0 (2022-05-04)¶
Added¶
Started exporting true positive and true negative data along with false positive and false negative data in model exploration. PR #1
Fixed¶
Fixed a bug where exact_all_mult was not handled correctly in config validation. PR #2
v3.0.0 (2022-04-27)¶
Added¶
This is the initial open-source version of hlink.