# Models

These are the machine learning models available for use in the model evaluation and training tasks and in their household counterparts.

There are a few attributes available for all models.

* `type` -- Type: `string`. The name of the model type. The available model types are listed below.
* `threshold` -- Type: `float`. The "alpha threshold". This is the probability score required for a potential match to be labeled a match. `0 ≤ threshold ≤ 1`.
* `threshold_ratio` -- Type: `float`. The threshold ratio or "beta threshold". This applies to records which have multiple potential matches when `training.decision` is set to `"drop_duplicate_with_threshold_ratio"`. For each record, only potential matches which have the highest probability, have a probability of at least `threshold`, *and* whose probabilities are at least `threshold_ratio` times larger than the second-highest probability are matches. This is sometimes called the "de-duplication distance ratio". `1 ≤ threshold_ratio < ∞`. See the sketch below for a concrete illustration of how the two thresholds interact.

In addition, any model parameters documented in a model type's Spark documentation can be passed as parameters to the model through hlink's `training.chosen_model` and `training.model_exploration` configuration sections.

Here is an example `training.chosen_model` configuration. The `type`, `threshold`, and `threshold_ratio` attributes are hlink-specific. `maxDepth` is a parameter to the random forest model which hlink passes through to the underlying Spark classifier.

```toml
[training.chosen_model]
type = "random_forest"
threshold = 0.2
threshold_ratio = 1.2
maxDepth = 5
```
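To make the interaction between `threshold` and `threshold_ratio` concrete, here is a minimal Python sketch of the decision rule described above, assuming `training.decision` is set to `"drop_duplicate_with_threshold_ratio"`. It illustrates the logic only and is not hlink's implementation; the function and variable names are invented for this example.

```python
def is_match(probabilities: list[float], threshold: float, threshold_ratio: float) -> bool:
    """Illustrative only: decide whether the highest-scoring potential match
    for a single record should be labeled a match."""
    ranked = sorted(probabilities, reverse=True)
    best = ranked[0]

    # The best candidate must clear the alpha threshold.
    if best < threshold:
        return False

    # When the record has competing candidates, the best one must also score
    # at least threshold_ratio times higher than the runner-up (beta threshold).
    if len(ranked) > 1 and best < threshold_ratio * ranked[1]:
        return False

    return True


# With threshold = 0.2 and threshold_ratio = 1.2, as in the example above:
print(is_match([0.9, 0.6], threshold=0.2, threshold_ratio=1.2))  # True:  0.9 >= 1.2 * 0.6
print(is_match([0.9, 0.8], threshold=0.2, threshold_ratio=1.2))  # False: 0.9 <  1.2 * 0.8
```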
## random_forest

Uses [pyspark.ml.classification.RandomForestClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html).

* Parameters:
  * `maxDepth` -- Type: `int`. Maximum depth of the tree. Spark default value is 5.
  * `numTrees` -- Type: `int`. The number of trees to train. Spark default value is 20, must be >= 1.
  * `featureSubsetStrategy` -- Type: `string`. Per the Spark docs: "The number of features to consider for splits at each tree node. Supported options: auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n]."

```toml
[training.chosen_model]
type = "random_forest"
threshold = 0.15
threshold_ratio = 1.0
maxDepth = 5
numTrees = 75
featureSubsetStrategy = "sqrt"
```
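To show how these parameters pass through to Spark, here is a rough sketch of the classifier that the configuration above corresponds to. This is illustrative only, not hlink's actual code; the `featuresCol` and `labelCol` values are placeholder column names.

```python
from pyspark.ml.classification import RandomForestClassifier

# Roughly equivalent to the TOML above: hlink forwards the extra parameters
# to the underlying Spark classifier. Column names here are placeholders.
classifier = RandomForestClassifier(
    maxDepth=5,
    numTrees=75,
    featureSubsetStrategy="sqrt",
    featuresCol="features",
    labelCol="label",
)
```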
## probit

Uses [pyspark.ml.regression.GeneralizedLinearRegression](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.GeneralizedLinearRegression.html) with `family="binomial"` and `link="probit"`.

```toml
[training.chosen_model]
type = "probit"
threshold = 0.85
threshold_ratio = 1.2
```

## logistic_regression

Uses [pyspark.ml.classification.LogisticRegression](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.LogisticRegression.html).

```toml
[training.chosen_model]
type = "logistic_regression"
threshold = 0.5
threshold_ratio = 1.0
```

## decision_tree

Uses [pyspark.ml.classification.DecisionTreeClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.DecisionTreeClassifier.html).

* Parameters:
  * `maxDepth` -- Type: `int`. Maximum depth of the tree.
  * `minInstancesPerNode` -- Type: `int`. Per the Spark docs: "Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1."
  * `maxBins` -- Type: `int`. Per the Spark docs: "Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature."

```toml
[training.chosen_model]
type = "decision_tree"
threshold = 0.5
threshold_ratio = 1.5
maxDepth = 6
minInstancesPerNode = 2
maxBins = 4
```

## gradient_boosted_trees

Uses [pyspark.ml.classification.GBTClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.GBTClassifier.html).

* Parameters:
  * `maxDepth` -- Type: `int`. Maximum depth of the tree.
  * `minInstancesPerNode` -- Type: `int`. Per the Spark docs: "Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1."
  * `maxBins` -- Type: `int`. Per the Spark docs: "Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature."

```toml
[training.chosen_model]
type = "gradient_boosted_trees"
threshold = 0.7
threshold_ratio = 1.3
maxDepth = 4
minInstancesPerNode = 1
maxBins = 6
```

## xgboost

*Added in version 3.8.0.*

XGBoost is an alternate, high-performance implementation of gradient boosting. It uses [xgboost.spark.SparkXGBClassifier](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.spark.SparkXGBClassifier). Since the XGBoost-PySpark integration which the xgboost Python package provides is currently unstable, support for the xgboost model type is disabled in hlink by default. hlink will stop with an error if you try to use this model type without enabling support for it.

To enable support for xgboost, install hlink with the `xgboost` extra.

```
pip install hlink[xgboost]
```

This installs the xgboost package and its Python dependencies. Depending on your machine and operating system, you may also need to install the libomp library, which is another dependency of xgboost. xgboost should raise a helpful error if it detects that you need to install libomp.

You can view a list of xgboost's parameters [here](https://xgboost.readthedocs.io/en/latest/parameter.html).

```toml
[training.chosen_model]
type = "xgboost"
threshold = 0.8
threshold_ratio = 1.5
max_depth = 5
eta = 0.5
gamma = 0.05
```

## lightgbm

*Added in version 3.8.0.*

LightGBM is another alternate, high-performance implementation of gradient boosting. It uses [synapse.ml.lightgbm.LightGBMClassifier](https://mmlspark.blob.core.windows.net/docs/1.0.8/pyspark/synapse.ml.lightgbm.html#module-synapse.ml.lightgbm.LightGBMClassifier). `synapse.ml` is a library which provides various integrations with PySpark, including integrations between the C++ LightGBM library and PySpark.

LightGBM requires some additional Scala libraries that hlink does not usually install, so support for the lightgbm model type is disabled in hlink by default. hlink will stop with an error if you try to use this model type without enabling support for it.

To enable support for lightgbm, install hlink with the `lightgbm` extra.

```
pip install hlink[lightgbm]
```

This installs the lightgbm package and its Python dependencies. Depending on your machine and operating system, you may also need to install the libomp library, which is another dependency of lightgbm. If you encounter errors when training a lightgbm model, please try installing libomp if you do not have it installed.

lightgbm has an enormous number of available parameters. Many of these can be set as normal hlink model parameters via the [LightGBMClassifier class](https://mmlspark.blob.core.windows.net/docs/1.0.8/pyspark/synapse.ml.lightgbm.html#module-synapse.ml.lightgbm.LightGBMClassifier). Others are available through the special `passThroughArgs` parameter, which passes additional parameters through to the C++ library. You can see a full list of the supported parameters [here](https://lightgbm.readthedocs.io/en/latest/Parameters.html).

```toml
[training.chosen_model]
type = "lightgbm"
# hlink's threshold and threshold_ratio
threshold = 0.8
threshold_ratio = 1.5
# LightGBMClassifier supports these parameters (and many more).
maxDepth = 5
learningRate = 0.5
# LightGBMClassifier does not directly support this parameter,
# so we have to send it to the C++ library with passThroughArgs.
passThroughArgs = "force_row_wise=true"
```
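For reference, the configuration above corresponds roughly to constructing a classifier like the one below. This is an illustrative sketch only, not hlink's internal code; the `featuresCol` and `labelCol` values are placeholder column names.

```python
from synapse.ml.lightgbm import LightGBMClassifier

# Roughly equivalent to the TOML above (illustrative only).
classifier = LightGBMClassifier(
    maxDepth=5,
    learningRate=0.5,
    # Parameters that LightGBMClassifier does not expose directly are
    # forwarded to the C++ LightGBM library as a key=value string.
    passThroughArgs="force_row_wise=true",
    featuresCol="features",
    labelCol="label",
)
```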