Models

These are the machine learning models available for use in the model evaluation and training tasks and in their household counterparts.

There are a few attributes available for all models.

  • type – Type: string. The name of the model type. The available model types are listed below.

  • threshold – Type: float. The “alpha threshold”. This is the probability score required for a potential match to be labeled a match. 0 ≤ threshold ≤ 1.

  • threshold_ratio – Type: float. The threshold ratio or “beta threshold”. This applies to records which have multiple potential matches when training.decision is set to "drop_duplicate_with_threshold_ratio". For each record, a potential match is labeled a match only if it has the highest probability among the record’s potential matches, its probability is at least threshold, and its probability is at least threshold_ratio times the second-highest probability (see the sketch below). This is sometimes called the “de-duplication distance ratio”. 1 ≤ threshold_ratio < ∞.

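Here is a minimal sketch of that decision rule in Python. It illustrates the logic only; it is not hlink’s actual implementation, and the function and variable names are hypothetical.

def is_match(probability, highest, second_highest, threshold, threshold_ratio):
    """Decide whether one potential match for a record is a match.

    `highest` and `second_highest` are the highest and second-highest
    probabilities among the record's potential matches.
    """
    return (
        # It must be the record's highest-probability potential match,
        probability == highest
        # it must meet the alpha threshold,
        and probability >= threshold
        # and it must beat the runner-up by at least threshold_ratio.
        and probability >= threshold_ratio * second_highest
    )
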
In addition, any model parameters documented in a model type’s Spark documentation can be passed as parameters to the model through hlink’s training.chosen_model and training.model_exploration configuration sections.

Here is an example training.chosen_model configuration. The type, threshold, and threshold_ratio attributes are hlink-specific. maxDepth is a parameter to the random forest model which hlink passes through to the underlying Spark classifier.

[training.chosen_model]
type = "random_forest"
threshold = 0.2
threshold_ratio = 1.2
maxDepth = 5

random_forest

Uses pyspark.ml.classification.RandomForestClassifier.

  • Parameters:

    • maxDepth – Type: int. Maximum depth of the tree. Spark default value is 5.

    • numTrees – Type: int. The number of trees to train. Spark default value is 20, must be >= 1.

    • featureSubsetStrategy – Type: string. Per the Spark docs: “The number of features to consider for splits at each tree node. Supported options: auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n].”

[training.chosen_model]
type = "random_forest"
threshold = 0.15
threshold_ratio = 1.0
maxDepth = 5
numTrees = 75
featureSubsetStrategy = "sqrt"

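Given the configuration above, hlink passes maxDepth, numTrees, and featureSubsetStrategy through to Spark. Roughly speaking, that corresponds to the construction below. This is an illustration of the pass-through only, not code you need to write; hlink builds and fits the model for you.

from pyspark.ml.classification import RandomForestClassifier

# threshold and threshold_ratio are used by hlink itself and are not
# passed to Spark; the remaining parameters go to the classifier.
classifier = RandomForestClassifier(
    maxDepth=5,
    numTrees=75,
    featureSubsetStrategy="sqrt",
)
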
probit

Uses pyspark.ml.regression.GeneralizedLinearRegression with family="binomial" and link="probit".

[training.chosen_model]
type = "probit"
threshold = 0.85
threshold_ratio = 1.2

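Roughly, this model type corresponds to the Spark construction below (an illustration only; hlink builds the model for you). Because the family is binomial and the link is probit, the fitted model predicts a match probability between 0 and 1, which hlink then compares against threshold and threshold_ratio.

from pyspark.ml.regression import GeneralizedLinearRegression

# A binomial family with a probit link yields predicted probabilities
# in [0, 1], which hlink thresholds to label matches.
model = GeneralizedLinearRegression(family="binomial", link="probit")
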
logistic_regression

Uses pyspark.ml.classification.LogisticRegression.

[training.chosen_model]
type = "logistic_regression"
threshold = 0.5
threshold_ratio = 1.0

decision_tree

Uses pyspark.ml.classification.DecisionTreeClassifier.

  • Parameters:

    • maxDepth – Type: int. Maximum depth of the tree.

    • minInstancesPerNode – Type: int. Per the Spark docs: “Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.”

    • maxBins – Type: int. Per the Spark docs: “Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.”

[training.chosen_model]
type = "decision_tree"
threshold = 0.5
threshold_ratio = 1.5
maxDepth = 6
minInstancesPerNode = 2
maxBins = 4

gradient_boosted_trees

Uses pyspark.ml.classification.GBTClassifier.

  • Parameters:

    • maxDepth – Type: int. Maximum depth of the tree.

    • minInstancesPerNode – Type: int. Per the Spark docs: “Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.”

    • maxBins – Type: int. Per the Spark docs: “Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.”

[training.chosen_model]
type = "gradient_boosted_trees"
threshold = 0.7
threshold_ratio = 1.3
maxDepth = 4
minInstancesPerNode = 1
maxBins = 6

xgboost

Added in version 3.8.0.

XGBoost is an alternate, high-performance implementation of gradient boosting. It uses xgboost.spark.SparkXGBClassifier. Since the XGBoost-PySpark integration which the xgboost Python package provides is currently unstable, support for the xgboost model type is disabled in hlink by default. hlink will stop with an error if you try to use this model type without enabling support for it. To enable support for xgboost, install hlink with the xgboost extra.

pip install hlink[xgboost]

This installs the xgboost package and its Python dependencies. Depending on your machine and operating system, you may also need to install the libomp library, which is another dependency of xgboost. xgboost should raise a helpful error if it detects that you need to install libomp.

You can view a full list of xgboost’s parameters in the xgboost documentation.

[training.chosen_model]
type = "xgboost"
threshold = 0.8
threshold_ratio = 1.5
max_depth = 5
eta = 0.5
gamma = 0.05

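Note that xgboost parameters use snake_case names like max_depth, unlike the camelCase Spark parameters used by the other model types. Roughly, the configuration above corresponds to the construction below (an illustration only; hlink builds the model for you).

from xgboost.spark import SparkXGBClassifier

# xgboost parameters such as max_depth, eta, and gamma are passed to
# the classifier as keyword arguments.
classifier = SparkXGBClassifier(max_depth=5, eta=0.5, gamma=0.05)
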
lightgbm

Added in version 3.8.0.

LightGBM is another alternate, high-performance implementation of gradient boosting. It uses synapse.ml.lightgbm.LightGBMClassifier. synapse.ml is a library which provides various PySpark integrations, including an integration between PySpark and the C++ LightGBM library.

LightGBM requires some additional Scala libraries that hlink does not usually install, so support for the lightgbm model is disabled in hlink by default. hlink will stop with an error if you try to use this model type without enabling support for it. To enable support for lightgbm, install hlink with the lightgbm extra.

pip install hlink[lightgbm]

This installs the lightgbm package and its Python dependencies. Depending on your machine and operating system, you may also need to install the libomp library, which is another dependency of lightgbm. If you encounter errors when training a lightgbm model, please try installing libomp if you do not have it installed.

lightgbm has an enormous number of available parameters. Many of these are available as normal in hlink, via the LightGBMClassifier class. Others are available through the special passThroughArgs parameter, which passes additional parameters through to the underlying C++ library. You can see a full list of the supported parameters in the LightGBM documentation.

[training.chosen_model]
type = "lightgbm"
# hlink's threshold and threshold_ratio
threshold = 0.8
threshold_ratio = 1.5
# LightGBMClassifier supports these parameters (and many more).
maxDepth = 5
learningRate = 0.5
# LightGBMClassifier does not directly support this parameter,
# so we have to send it to the C++ library with passThroughArgs.
passThroughArgs = "force_row_wise=true"
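
Roughly, the configuration above corresponds to the construction below (an illustration only; hlink builds the model for you). Parameters that LightGBMClassifier supports directly become keyword arguments, and passThroughArgs is forwarded to the C++ library.

from synapse.ml.lightgbm import LightGBMClassifier

# maxDepth and learningRate are supported directly by the classifier;
# force_row_wise reaches the C++ library via passThroughArgs.
classifier = LightGBMClassifier(
    maxDepth=5,
    learningRate=0.5,
    passThroughArgs="force_row_wise=true",
)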