Models¶
These are the machine learning models available for use in the model evaluation and training tasks and in their household counterparts.
There are a few attributes available for all models.
type – Type: string. The name of the model type. The available model types are listed below.

threshold – Type: float. The “alpha threshold”. This is the probability score required for a potential match to be labeled a match. 0 ≤ threshold ≤ 1.

threshold_ratio – Type: float. The threshold ratio or “beta threshold”. This applies to records which have multiple potential matches when training.decision is set to "drop_duplicate_with_threshold_ratio". For each record, only potential matches which have the highest probability, have a probability of at least threshold, and whose probabilities are at least threshold_ratio times larger than the second-highest probability are matches. This is sometimes called the “de-duplication distance ratio”. 1 ≤ threshold_ratio < ∞.
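The interaction between threshold and threshold_ratio can be sketched in plain Python. This is an illustrative re-implementation of the selection rule described above, not hlink's actual code, and the function name select_match is hypothetical.

```python
def select_match(probabilities, threshold, threshold_ratio):
    """Return the index of the winning potential match for one record,
    or None if no potential match qualifies.

    probabilities: probability scores for each potential match of a record.
    """
    if not probabilities:
        return None
    # Rank candidate indices from highest to lowest probability.
    ranked = sorted(
        range(len(probabilities)), key=lambda i: probabilities[i], reverse=True
    )
    best = ranked[0]
    # The alpha threshold: the best candidate must score at least `threshold`.
    if probabilities[best] < threshold:
        return None
    # The beta threshold: the best probability must be at least
    # `threshold_ratio` times the second-highest probability.
    if len(ranked) > 1:
        second = probabilities[ranked[1]]
        if probabilities[best] < threshold_ratio * second:
            return None
    return best
```

For example, with threshold = 0.5 and threshold_ratio = 1.5, a record whose potential matches score 0.9 and 0.4 keeps the first candidate, while a record scoring 0.9 and 0.7 keeps neither, because the two candidates are too close to distinguish confidently.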
In addition, any model parameters documented in a model type’s Spark documentation can be passed as parameters to the model through hlink’s training.chosen_model and training.model_exploration configuration sections.
Here is an example training.chosen_model configuration. The type, threshold, and threshold_ratio attributes are hlink-specific. maxDepth is a parameter to the random forest model which hlink passes through to the underlying Spark classifier.
[training.chosen_model]
type = "random_forest"
threshold = 0.2
threshold_ratio = 1.2
maxDepth = 5
random_forest¶
Uses pyspark.ml.classification.RandomForestClassifier.
Parameters:
maxDepth – Type: int. Maximum depth of the tree. The Spark default value is 5.

numTrees – Type: int. The number of trees to train. Must be >= 1; the Spark default value is 20.

featureSubsetStrategy – Type: string. Per the Spark docs: “The number of features to consider for splits at each tree node. Supported options: auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n].”
[training.chosen_model]
type = "random_forest"
threshold = 0.15
threshold_ratio = 1.0
maxDepth = 5
numTrees = 75
featureSubsetStrategy = "sqrt"
probit¶
Uses pyspark.ml.regression.GeneralizedLinearRegression with family="binomial" and link="probit".
[training.chosen_model]
type = "probit"
threshold = 0.85
threshold_ratio = 1.2
logistic_regression¶
Uses pyspark.ml.classification.LogisticRegression.
[training.chosen_model]
type = "logistic_regression"
threshold = 0.5
threshold_ratio = 1.0
decision_tree¶
Uses pyspark.ml.classification.DecisionTreeClassifier.
Parameters:
maxDepth – Type: int. Maximum depth of the tree.

minInstancesPerNode – Type: int. Per the Spark docs: “Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.”

maxBins – Type: int. Per the Spark docs: “Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.”
[training.chosen_model]
type = "decision_tree"
threshold = 0.5
threshold_ratio = 1.5
maxDepth = 6
minInstancesPerNode = 2
maxBins = 4
gradient_boosted_trees¶
Uses pyspark.ml.classification.GBTClassifier.
Parameters:
maxDepth – Type: int. Maximum depth of the tree.

minInstancesPerNode – Type: int. Per the Spark docs: “Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.”

maxBins – Type: int. Per the Spark docs: “Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.”
[training.chosen_model]
type = "gradient_boosted_trees"
threshold = 0.7
threshold_ratio = 1.3
maxDepth = 4
minInstancesPerNode = 1
maxBins = 6
xgboost¶
Added in version 3.8.0.
XGBoost is an alternate, high-performance implementation of gradient boosting.
It uses xgboost.spark.SparkXGBClassifier.
Since the XGBoost-PySpark integration which the xgboost Python package provides is currently unstable, support for the xgboost model type is disabled in hlink by default. hlink will stop with an error if you try to use this model type without enabling support for it. To enable support for xgboost, install hlink with the xgboost extra.
pip install hlink[xgboost]
This installs the xgboost package and its Python dependencies. Depending on your machine and operating system, you may also need to install the libomp library, which is another dependency of xgboost. xgboost should raise a helpful error if it detects that you need to install libomp.
You can view a list of xgboost’s parameters here.
[training.chosen_model]
type = "xgboost"
threshold = 0.8
threshold_ratio = 1.5
max_depth = 5
eta = 0.5
gamma = 0.05
lightgbm¶
Added in version 3.8.0.
LightGBM is another alternate, high-performance implementation of gradient
boosting. It uses
synapse.ml.lightgbm.LightGBMClassifier.
synapse.ml
is a library which provides various integrations with PySpark,
including integrations between the C++ LightGBM library and PySpark.
LightGBM requires some additional Scala libraries that hlink does not usually install, so support for the lightgbm model is disabled in hlink by default. hlink will stop with an error if you try to use this model type without enabling support for it. To enable support for lightgbm, install hlink with the lightgbm extra.
pip install hlink[lightgbm]
This installs the lightgbm package and its Python dependencies. Depending on your machine and operating system, you may also need to install the libomp library, which is another dependency of lightgbm. If you encounter errors when training a lightgbm model, please try installing libomp if you do not have it installed.
lightgbm has an enormous number of available parameters. Many of these are available as normal in hlink, via the LightGBMClassifier class. Others are available through the special passThroughArgs parameter, which passes additional parameters through to the C++ library. You can see a full list of the supported parameters here.
[training.chosen_model]
type = "lightgbm"
# hlink's threshold and threshold_ratio
threshold = 0.8
threshold_ratio = 1.5
# LightGBMClassifier supports these parameters (and many more).
maxDepth = 5
learningRate = 0.5
# LightGBMClassifier does not directly support this parameter,
# so we have to send it to the C++ library with passThroughArgs.
passThroughArgs = "force_row_wise=true"