Model Exploration¶
Overview¶
The model exploration task provides a way to try out different types of machine learning models and different parameter settings for those models. It tests those models on splits of the training data and outputs information on their performance. The purpose of model exploration is to help you choose a model that performs well without having to test each model individually on the entire input datasets. If you're interested in the exact workings of the model exploration algorithm, see the Details section below.
Model exploration uses several configuration attributes listed in the training section because it is closely related to training.
Searching for Model Parameters¶
Part of the process of model exploration is searching for model parameters which give good results on the training data. Hlink supports three strategies for model parameter searches, controlled by the training.model_parameter_search table.
Explicit Search (strategy = "explicit"
)¶
An explicit model parameter search lists out all of the parameter combinations to be tested. Each element of the training.model_parameters list becomes one set of parameters to evaluate. This is the simplest search strategy and is hlink's default behavior.
This example training section uses an explicit search over two sets of model parameters. Model exploration will train two random forest models. The first will have a maxDepth of 3 and numTrees of 50, and the second will have a maxDepth of 3 and numTrees of 20.
[training.model_parameter_search]
strategy = "explicit"
[[training.model_parameters]]
type = "random_forest"
maxDepth = 3
numTrees = 50
[[training.model_parameters]]
type = "random_forest"
maxDepth = 3
numTrees = 20
Grid Search (strategy = "grid")¶
A grid search takes multiple values for each model parameter and generates one model for each possible combination of the given parameters. This is often much more compact than writing out all of the possible combinations in an explicit search.
For example, this training section generates 30 combinations of model parameters for testing. The first has a maxDepth of 1 and numTrees of 20, the second has a maxDepth of 1 and numTrees of 30, and so on.
[training.model_parameter_search]
strategy = "grid"
[[training.model_parameters]]
type = "random_forest"
maxDepth = [1, 2, 3, 5, 10]
numTrees = [20, 30, 40, 50, 60, 70]
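As an illustration of how this grid expands, the short Python sketch below (not hlink code) builds the same 30 parameter settings with a Cartesian product; the dictionary representation is only for illustration.

import itertools

max_depths = [1, 2, 3, 5, 10]
num_trees = [20, 30, 40, 50, 60, 70]

# One parameter setting per element of the Cartesian product: 5 * 6 = 30 total.
grid = [
    {"type": "random_forest", "maxDepth": depth, "numTrees": trees}
    for depth, trees in itertools.product(max_depths, num_trees)
]

print(len(grid))  # 30
print(grid[0])    # {'type': 'random_forest', 'maxDepth': 1, 'numTrees': 20}
print(grid[1])    # {'type': 'random_forest', 'maxDepth': 1, 'numTrees': 30}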
Although grid search is more compact than explicitly listing out all of the model parameters, it can be quite time-consuming to check every possible combination of model parameters. Randomized search, described below, can be a more efficient way to evaluate models with large numbers of parameters or large parameter ranges.
Randomized Search (strategy = "randomized")¶
Added in version 4.0.0.
A randomized parameter search generates model parameter settings by sampling each parameter from a distribution or set. The number of samples is an additional parameter to the strategy. This separates the size of the search space from the number of samples taken, making a randomized search more flexible than a grid search. The downside is that, unlike a grid search, a randomized search does not necessarily test every value given for each parameter; it is non-exhaustive by design.
In a randomized search, each model parameter may take one of 3 forms:
A list, which is a set of values to sample from with replacement. Each value has an equal chance of being chosen for each sample.
[[training.model_parameters]]
type = "random_forest"
numTrees = [20, 30, 40]
A single value, which “pins” the model parameter to always be that value. This is syntactic sugar for sampling from a list with one element.
[[training.model_parameters]]
type = "random_forest"
# numTrees will always be 30.
# This is equivalent to numTrees = [30].
numTrees = 30
A table defining a distribution from which to sample the parameter. The available distributions are
"randint", to choose a random integer from a range,
"uniform", to choose a random floating-point number from a range, and
"normal", to choose a floating-point number from a normal distribution with a given mean and standard deviation.
For example, this training section generates 20 model parameter combinations for testing, using a randomized search. Each of the three given model parameters uses a different type of distribution.
[training.model_parameter_search]
strategy = "randomized"
num_samples = 20
[[training.model_parameters]]
type = "random_forest"
numTrees = {distribution = "randint", low = 20, high = 70}
minInfoGain = {distribution = "uniform", low = 0.0, high = 0.3}
subsamplingRate = {distribution = "normal", mean = 1.0, standard_deviation = 0.2}
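To give a sense of what these distributions mean, the Python sketch below draws 20 samples using standard-library functions. This is only an approximation of the behavior described above, not hlink's sampling code; details such as whether the upper bound of "randint" is inclusive may differ.

import random

num_samples = 20
sampled_parameters = []
for _ in range(num_samples):
    sampled_parameters.append({
        "type": "random_forest",
        # "randint": an integer drawn at random between low and high
        "numTrees": random.randint(20, 70),
        # "uniform": a float drawn uniformly between low and high
        "minInfoGain": random.uniform(0.0, 0.3),
        # "normal": a float drawn from a normal distribution with the
        # given mean and standard deviation
        "subsamplingRate": random.gauss(1.0, 0.2),
    })

print(sampled_parameters[0])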
The training.param_grid Attribute¶
As of version 4.0.0, the training.param_grid attribute is deprecated. Please use training.model_parameter_search instead, as it is more flexible and supports additional parameter search strategies. Prior to version 4.0.0, you will need to use training.param_grid.

param_grid has a direct mapping to model_parameter_search.
[training]
param_grid = true
is equivalent to
[training]
model_parameter_search = {strategy = "grid"}
and
[training]
param_grid = false
is equivalent to
[training]
model_parameter_search = {strategy = "explicit"}
Types and Thresholds¶
There are 3 attributes which are hlink-specific and are not passed through as model parameters.
type is the name of the model type.
threshold and threshold_ratio control how hlink classifies potential matches based on the probabilistic output of the models. They may each be either a float or a list of floats, and hlink will always use a grid strategy to generate the set of test combinations for these parameters.
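For example, a hypothetical model_parameters entry like the one below would have each trained random forest evaluated at all four combinations of the two threshold values and the two threshold_ratio values.

[[training.model_parameters]]
type = "random_forest"
maxDepth = 5
numTrees = 50
threshold = [0.5, 0.8]
threshold_ratio = [1.2, 1.3]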
For more details, please see the Models page and the Details section below.
The Details¶
The current model exploration implementation uses a technique called nested cross-validation to evaluate each model which the search strategy generates. The algorithm follows this basic outline.
Let N be the value of training.n_training_iterations.
Let J be 3. (Currently J is hard-coded.)

1. Split the prepared training data into N outer folds. This forms a partition of the training data into N distinct pieces, each of roughly equal size.
2. Choose the first outer fold.
3. Combine the N - 1 other outer folds into the set of outer training data.
4. Split the outer training data into J inner folds. This forms a partition of the outer training data into J distinct pieces, each of roughly equal size.
5. Choose the first inner fold.
6. Combine the J - 1 other inner folds into the set of inner training data.
7. Train, test, and score all of the models using the inner training data and the first inner fold as the test data.
8. Repeat steps 5-7 for each other inner fold.
9. After finishing all of the inner folds, choose the single model with the best aggregate score over those folds.
10. For each setting of threshold and threshold_ratio, train the best model on the outer training data and test it on the chosen outer fold. Collect metrics on the performance of the model based on its confusion matrix.
11. Repeat steps 2-10 for each other outer fold.
12. Report on all of the metrics gathered for the best-scoring models.
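To make the outline more concrete, here is a minimal, runnable Python sketch of the same nested cross-validation structure. It uses scikit-learn and a toy dataset as stand-ins for hlink's Spark models and prepared training data, scores inner folds with accuracy, reports precision and recall on the outer folds, and omits threshold_ratio; all of those choices are illustrative assumptions rather than hlink's actual implementation.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import KFold

# Toy stand-ins for hlink's prepared training data and model parameter settings.
X, y = make_classification(n_samples=600, random_state=0)
model_parameters = [
    {"max_depth": 3, "n_estimators": 50},
    {"max_depth": 3, "n_estimators": 20},
]
thresholds = [0.5, 0.8]  # threshold_ratio omitted for simplicity

N = 5  # stands in for training.n_training_iterations
J = 3  # inner fold count (hard-coded in hlink)

outer_cv = KFold(n_splits=N, shuffle=True, random_state=0)
for outer_train_idx, outer_test_idx in outer_cv.split(X):  # steps 1-3 and 11
    # Inner cross-validation over the outer training data (steps 4-8).
    inner_cv = KFold(n_splits=J, shuffle=True, random_state=0)
    mean_scores = []
    for params in model_parameters:
        fold_scores = []
        for inner_train, inner_test in inner_cv.split(outer_train_idx):
            train_idx = outer_train_idx[inner_train]
            test_idx = outer_train_idx[inner_test]
            model = RandomForestClassifier(**params).fit(X[train_idx], y[train_idx])
            fold_scores.append(model.score(X[test_idx], y[test_idx]))  # step 7
        mean_scores.append(np.mean(fold_scores))
    # Step 9: pick the model with the best aggregate score over the inner folds.
    best_params = model_parameters[int(np.argmax(mean_scores))]

    # Step 10: retrain the best model on the outer training data and collect
    # confusion-matrix metrics on the held-out outer fold at each threshold.
    model = RandomForestClassifier(**best_params).fit(X[outer_train_idx], y[outer_train_idx])
    probabilities = model.predict_proba(X[outer_test_idx])[:, 1]
    for threshold in thresholds:
        predictions = probabilities >= threshold
        print(
            best_params,
            threshold,
            precision_score(y[outer_test_idx], predictions, zero_division=0),
            recall_score(y[outer_test_idx], predictions),
        )
# Step 12: hlink aggregates metrics like these and reports them.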