Changelog
The format of this changelog is based on Keep A Changelog. Hlink adheres to semantic versioning as much as possible.
v4.0.0 (Unreleased)
Added
Changed
Overhauled the model exploration task to use a nested cross-validation approach. PR #169
Changed hlink.linking.core.classifier functions to not interact with threshold and threshold_ratio. Please ensure that the parameter dictionaries passed to these functions only contain parameters for the chosen model. PR #175
Simplified the parameters required for hlink.linking.core.threshold.predict_using_thresholds. Instead of passing the entire training configuration section to this function, you now need only pass training.decision. PR #175
Added a new required checkpoint_dir argument to SparkConnection, which lets hlink set different directories for the tmp and checkpoint directories. PR #182
Swapped to using tomli as the default TOML parser. This should fix several issues with how hlink parses TOML files. load_conf_file() provides the use_legacy_toml_parser argument for backwards compatibility if necessary. PR #185
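As a rough illustration of the TOML-related entries above, the sketch below parses a small configuration string with tomli's API (or the standard library's tomllib, which exposes the same loads function on Python 3.11+) and pulls out the training.decision value. The configuration keys and values are placeholders chosen for illustration, and this is not hlink's own loading code; it only shows what parsing with tomli looks like.

```python
# Minimal sketch: parse TOML with tomli's API and read training.decision.
# Assumption: the keys and values below are illustrative placeholders,
# not a complete or authoritative hlink configuration.
try:
    import tomllib  # standard library on Python 3.11+, same API as tomli
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport for older Pythons

CONFIG_TEXT = """
[training]
decision = "drop_duplicate_with_threshold_ratio"
n_training_iterations = 10
"""

# loads() takes a str; load() takes a file object opened in binary mode.
config = tomllib.loads(CONFIG_TEXT)

# Only this single value, rather than the whole training table,
# would be handed to a function that needs the decision setting.
decision = config["training"]["decision"]
print(decision)
```

When reading from disk instead, open the file with mode "rb", since tomli's load function accepts only binary file objects.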
Deprecated
Deprecated the training.param_grid attribute in favor of the new, more flexible training.model_parameter_search table. This is part of supporting the new randomized parameter search. PR #168
Removed
Removed functionality for outputting “suspicious” training data from model exploration. We determined that this is out of the scope of model exploration step 2. This change greatly simplifies the model exploration code. PR #178
Removed the deprecated hlink.linking.transformers.interaction_transformer module. This module was deprecated in v3.5.0. Please use pyspark.ml.feature.Interaction instead. PR #184
Removed some alternate configuration syntax which has been deprecated since v3.0.0. PR #184
Removed hlink.scripts.main.load_conf in favor of a much simpler approach to finding the configuration file and configuring Spark. Please call hlink.configs.load_config.load_conf_file directly instead. load_conf_file now returns both the path to the configuration file and its contents as a mapping. PR #182
v3.8.0 (2024-12-04)
Added
Added optional support for the XGBoost and LightGBM gradient boosting machine learning libraries. You can find documentation on how to use these libraries here. PR #165
Added a new hlink.linking.transformers.RenameVectorAttributes transformer which can rename the attributes or “slots” of Spark vector columns. PR #165
Fixed
v3.7.0 (2024-10-10)
Added
Added an optional argument to SparkConnection to allow setting a custom Spark app name. The default is still to set the app name to “linking”. PR #156
Changed
Improved model exploration step 2’s terminal output, logging, and documentation to make the step easier to work with. PR #155
Fixed
Updated all modules to log to module-level loggers instead of the root logger. This gives users of the library more control over filtering logs from hlink. PR #152
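The practical effect of module-level loggers can be sketched with the standard logging module alone. The example assumes that hlink's modules name their loggers after the module path (the usual logging.getLogger(__name__) convention), so that they all sit under a parent logger named "hlink"; the child logger name used here is a stand-in for illustration, not a guaranteed hlink logger name.

```python
import logging

# Assumption: hlink modules create their loggers with
# logging.getLogger(__name__), so every name starts with "hlink.".
hlink_logger = logging.getLogger("hlink")

# Quiet everything coming from hlink below WARNING without touching
# the root logger or any other library.
hlink_logger.setLevel(logging.WARNING)

# Stand-in for a logger that a module inside hlink would create.
module_logger = logging.getLogger("hlink.linking.link_task")

# An application's own logger is unaffected by the hlink setting.
app_logger = logging.getLogger("my_app")
app_logger.setLevel(logging.INFO)

print(module_logger.isEnabledFor(logging.INFO))  # child inherits WARNING
print(app_logger.isEnabledFor(logging.INFO))
```

Because child loggers inherit the effective level of their nearest configured ancestor, one setLevel call on the parent is enough to filter the whole package.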
v3.6.1 (2024-08-14)
Fixed
Fixed a crash in matching step 0 triggered when there were multiple exploded columns in the blocking section. Multiple exploded columns are now supported. PR #143
v3.6.0 (2024-06-18)
Added
Added OR conditions in blocking. This new feature supports connecting some or all blocking conditions together with ORs instead of ANDs. Note that using many ORs in blocking may have negative performance implications for large datasets since it increases the size of the blocks and makes each block more difficult to compute. You can find documentation on OR blocking conditions under the or_group bullet point here. PR #138
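To make the AND/OR distinction concrete, here is a small sketch of how grouped blocking conditions could be combined into one join condition. It assumes that blocking entries sharing the same or_group label are joined with OR, that separate groups are joined with AND, and that entries without a label each form their own group. The column_name key, the a/b table aliases, and the build_blocking_condition helper are hypothetical names for illustration; none of this is hlink's actual implementation.

```python
def build_blocking_condition(blocking):
    """Combine blocking entries into a SQL-like join condition.

    Assumption: entries with the same "or_group" are ORed together,
    and the resulting groups are ANDed. Entries without an "or_group"
    each get their own single-entry group.
    """
    groups = {}
    for index, entry in enumerate(blocking):
        key = entry.get("or_group", f"_ungrouped_{index}")
        column = entry["column_name"]
        groups.setdefault(key, []).append(f"a.{column} = b.{column}")

    clauses = []
    for conditions in groups.values():
        if len(conditions) == 1:
            clauses.append(conditions[0])
        else:
            # Parenthesize OR groups so AND does not bind across them.
            clauses.append("(" + " OR ".join(conditions) + ")")
    return " AND ".join(clauses)


blocking = [
    {"column_name": "sex"},
    {"column_name": "bpl", "or_group": "place"},
    {"column_name": "birthyr", "or_group": "place"},
]
condition = build_blocking_condition(blocking)
print(condition)
```

Each OR widens the set of candidate pairs that satisfy the condition, which is the source of the performance caveat in the entry above.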
v3.5.5 (2024-05-31)
Added
Added support for a variable number of columns in the array feature selection transform, instead of forcing it to use exactly 2 columns. PR #135
v3.5.4 (2024-02-20)
Added
Fixed
Fixed a bug where config validation checks did not respect column mapping overrides. PR #131
v3.5.3 (2023-11-02)
Added
Fixed
v3.5.2 (2023-10-26)
Changed
Made some minor updates to the format of training step 3’s output. There are now 3 columns: feature_name, category, and coefficient_or_importance. Feature names are not suffixed with the category value anymore. PR #112
BUG reverted in v3.5.3: Started erroring out on invalid categories in training data instead of creating a new category for them. PR #109
Fixed
v3.5.1 (2023-10-23)
Added
Made a new training step 3 to replace model exploration step 3, which was buggy. Training step 3 saves model feature importances or coefficients when training.feature_importances is set to true. PR #101
Removed
Removed the buggy implementation of model exploration step 3. Training step 3 replaces this. PR #101
v3.5.0 (2023-10-16)
Added
Changed
Upgraded from PySpark 3.3 to 3.5. PR #94
Deprecated
Deprecated the hlink.linking.transformers.interaction_transformer module. Please use PySpark 3’s pyspark.ml.feature.Interaction class instead. Hlink’s interaction_transformer module is scheduled for removal in version 4. PR #97
Fixed
Fixed a bug where the hlink script’s autocomplete feature sometimes did not work correctly. PR #96
v3.4.0 (2023-08-09)
Added
Removed
Dropped the comment column from the script’s desc command. This column was always full of nulls and cluttered up the screen. PR #88
v3.3.1 (2023-06-02)
Changed
Fixed
Fixed a bug where comparison features were marked as categorical whenever the categorical key was present, even if it was set to false. PR #82
v3.3.0 (2022-12-13)
Added
Changed
Fixed
Fixed a bug which caused Jaro-Winkler scores to be 1.0 for two empty strings. The scores are now 0.0 on two empty strings. PR #59
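The empty-string convention described above can be illustrated with a textbook Jaro-Winkler implementation. This is a pure-Python sketch written for this note, not hlink's implementation; the function names, the prefix scale of 0.1, and the 4-character prefix cap are the commonly cited defaults and are assumptions here.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity, returning 0.0 if either string is empty."""
    if not s1 or not s2:
        # Convention from the entry above: empty strings score 0.0,
        # even when both are empty and therefore "equal".
        return 0.0
    if s1 == s2:
        return 1.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, char in enumerate(s1):
        start = max(0, i - window)
        stop = min(len(s2), i + window + 1)
        for j in range(start, stop):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2
    return (
        matches / len(s1)
        + matches / len(s2)
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(jaro_winkler("", ""))
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

Returning 0.0 for two empty strings keeps missing values from looking like perfect matches, which is the behavior the fix describes.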
v3.2.7 (2022-09-14)
Added
Changed
Fixed
v3.2.6 (2022-07-18)
Added
Made hlink installable with pip via PyPI.org.
v3.2.1 (2022-05-24)
Added
Improved logging during startup and for the LinkTask.run_all_steps() method. PR #7
Changed
Added code to adjust the number of Spark partitions based on the size of the input datasets for some link steps. This should help these steps scale better with large datasets. PR #10
Fixed
Fixed a bug where model exploration’s step 3 would run into a TypeError due to trying to manually build up a file path. PR #8
v3.2.0 (2022-05-16)
Changed
v3.1.0 (2022-05-04)
Added
Started exporting true positive and true negative data along with false positive and false negative data in model exploration. PR #1
Fixed
Fixed a bug where exact_all_mult was not handled correctly in config validation. PR #2
v3.0.0 (2022-04-27)
Added
This is the initial open-source version of hlink.