# Model Exploration

## Overview

The model exploration task provides a way to try out different types of machine learning models and sets of parameters for those models. It tests those models on splits of the training data and outputs information on the performance of the models. The purpose of model exploration is to help you choose a model that performs well without having to test each model individually on the entire input datasets.

If you're interested in the exact workings of the model exploration algorithm, see the [Details](#the-details) section below.

Model exploration uses several configuration attributes listed in the `training` section because it is closely related to `training`.

## Searching for Model Parameters

Part of the process of model exploration is searching for model parameters which give good results on the training data. Hlink supports three strategies for model parameter searches, controlled by the `training.model_parameter_search` table.

### Explicit Search (`strategy = "explicit"`)

An explicit model parameter search lists out all of the parameter combinations to be tested. Each element of the `training.model_parameters` list becomes one set of parameters to evaluate. This is the simplest search strategy and is hlink's default behavior.

This example `training` section uses an explicit search over two sets of model parameters. Model exploration will train two random forest models. The first will have a `maxDepth` of 3 and `numTrees` of 50, and the second will have a `maxDepth` of 3 and `numTrees` of 20.

```toml
[training.model_parameter_search]
strategy = "explicit"

[[training.model_parameters]]
type = "random_forest"
maxDepth = 3
numTrees = 50

[[training.model_parameters]]
type = "random_forest"
maxDepth = 3
numTrees = 20
```

### Grid Search (`strategy = "grid"`)

A grid search takes multiple values for each model parameter and generates one model for each possible combination of the given parameters. This is often much more compact than writing out all of the possible combinations in an explicit search.

For example, this `training` section generates 30 combinations of model parameters for testing. The first has a `maxDepth` of 1 and `numTrees` of 20, the second has a `maxDepth` of 1 and `numTrees` of 30, and so on.

```toml
[training.model_parameter_search]
strategy = "grid"

[[training.model_parameters]]
type = "random_forest"
maxDepth = [1, 2, 3, 5, 10]
numTrees = [20, 30, 40, 50, 60, 70]
```
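To see where the 30 combinations come from, here is a short illustrative Python snippet (not part of hlink) that expands the same two parameter lists into a full grid:

```python
from itertools import product

# Parameter lists from the grid search example above.
max_depths = [1, 2, 3, 5, 10]
num_trees = [20, 30, 40, 50, 60, 70]

# A grid search tests every (maxDepth, numTrees) pair:
# 5 values x 6 values = 30 combinations.
combinations = [
    {"type": "random_forest", "maxDepth": depth, "numTrees": trees}
    for depth, trees in product(max_depths, num_trees)
]

print(len(combinations))  # 30
print(combinations[0])    # {'type': 'random_forest', 'maxDepth': 1, 'numTrees': 20}
print(combinations[1])    # {'type': 'random_forest', 'maxDepth': 1, 'numTrees': 30}
```

The ordering matches the description above because `product` varies the last list (here `numTrees`) fastest.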
Although grid search is more compact than explicitly listing out all of the model parameters, it can be quite time-consuming to check every possible combination of model parameters. Randomized search, described below, can be a more efficient way to evaluate models with large numbers of parameters or large parameter ranges.

### Randomized Search (`strategy = "randomized"`)

*Added in version 4.0.0.*

A randomized parameter search generates model parameter settings by sampling each parameter from a distribution or set. The number of samples is an additional parameter to the strategy. This separates the size of the search space from the number of samples taken, making a randomized search more flexible than a grid search. The downside is that, unlike a grid search, a randomized search does not necessarily test all of the possible values given for each parameter. It is necessarily non-exhaustive.

In a randomized search, each model parameter may take one of 3 forms:

* A list, which is a set of values to sample from with replacement. Each value has an equal chance of being chosen for each sample.

  ```toml
  [[training.model_parameters]]
  type = "random_forest"
  numTrees = [20, 30, 40]
  ```

* A single value, which "pins" the model parameter to always be that value. This is syntactic sugar for sampling from a list with one element.

  ```toml
  [[training.model_parameters]]
  type = "random_forest"
  # numTrees will always be 30.
  # This is equivalent to numTrees = [30].
  numTrees = 30
  ```

* A table defining a distribution from which to sample the parameter. The available distributions are `"randint"`, to choose a random integer from a range, `"uniform"`, to choose a random floating-point number from a range, and `"normal"`, to choose a floating-point number from a normal distribution with a given mean and standard deviation.

For example, this `training` section generates 20 model parameter combinations for testing, using a randomized search. Each of the three given model parameters uses a different type of distribution.

```toml
[training.model_parameter_search]
strategy = "randomized"
num_samples = 20

[[training.model_parameters]]
type = "random_forest"
numTrees = {distribution = "randint", low = 20, high = 70}
minInfoGain = {distribution = "uniform", low = 0.0, high = 0.3}
subsamplingRate = {distribution = "normal", mean = 1.0, standard_deviation = 0.2}
```
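As a rough illustration of how the three parameter forms behave, the Python sketch below draws sampled parameter settings from specs like the ones above. It is not hlink's internal code: the `sample_parameter` helper is hypothetical, and details such as whether the `randint` bounds are inclusive are assumptions.

```python
import random

def sample_parameter(spec):
    """Hypothetical helper: draw one value for a parameter given in any of the three forms."""
    if isinstance(spec, list):
        # A list: each listed value has an equal chance of being chosen.
        return random.choice(spec)
    if isinstance(spec, dict):
        # A table: sample from the named distribution.
        if spec["distribution"] == "randint":
            return random.randint(spec["low"], spec["high"])
        if spec["distribution"] == "uniform":
            return random.uniform(spec["low"], spec["high"])
        if spec["distribution"] == "normal":
            return random.gauss(spec["mean"], spec["standard_deviation"])
        raise ValueError(f"unknown distribution: {spec['distribution']}")
    # A single value: the parameter is pinned to that value.
    return spec

# Parameter specs mirroring the randomized search example above.
model_parameters = {
    "numTrees": {"distribution": "randint", "low": 20, "high": 70},
    "minInfoGain": {"distribution": "uniform", "low": 0.0, "high": 0.3},
    "subsamplingRate": {"distribution": "normal", "mean": 1.0, "standard_deviation": 0.2},
}

# With num_samples = 20, the search draws 20 parameter settings.
samples = [
    {name: sample_parameter(spec) for name, spec in model_parameters.items()}
    for _ in range(20)
]
print(samples[0])
```

In this sketch each sample is drawn independently, so the same combination of values can appear more than once.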
### The `training.param_grid` Attribute

As of version 4.0.0, the `training.param_grid` attribute is deprecated. Please use `training.model_parameter_search` instead, as it is more flexible and supports additional parameter search strategies. On versions of hlink prior to 4.0.0, you will need to use `training.param_grid`.

`param_grid` has a direct mapping to `model_parameter_search`.

```toml
[training]
param_grid = true
```

is equivalent to

```toml
[training]
model_parameter_search = {strategy = "grid"}
```

and

```toml
[training]
param_grid = false
```

is equivalent to

```toml
[training]
model_parameter_search = {strategy = "explicit"}
```

### Types and Thresholds

There are 3 attributes which are hlink-specific and are not passed through as model parameters:

* `type` is the name of the model type.
* `threshold` and `threshold_ratio` control how hlink classifies potential matches based on the probabilistic output of the models. They may each be either a float or a list of floats, and hlink will always use a grid strategy to generate the set of test combinations for these parameters.

For more details, please see the [Models](models) page and the [Details](#the-details) section below.

## The Details

The current model exploration implementation uses a technique called nested cross-validation to evaluate each model which the search strategy generates. The algorithm follows this basic outline; a condensed code sketch appears after the numbered list.

Let `N` be the value of `training.n_training_iterations`. Let `J` be 3. (Currently `J` is hard-coded.)

1. Split the prepared training data into `N` **outer folds**. This forms a partition of the training data into `N` distinct pieces, each of roughly equal size.
2. Choose the first **outer fold**.
3. Combine the `N - 1` other **outer folds** into the set of outer training data.
4. Split the outer training data into `J` **inner folds**. This forms a partition of the outer training data into `J` distinct pieces, each of roughly equal size.
5. Choose the first **inner fold**.
6. Combine the `J - 1` other **inner folds** into the set of inner training data.
7. Train, test, and score all of the models using the inner training data and the chosen **inner fold** as the test data.
8. Repeat steps 5 - 7 for each other **inner fold**.
9. After finishing all of the **inner folds**, choose the single model with the best aggregate score over those folds.
10. For each setting of `threshold` and `threshold_ratio`, train the best model on the outer training data and test it on the chosen **outer fold**. Collect metrics on the performance of the model based on its confusion matrix.
11. Repeat steps 2-10 for each other **outer fold**.
12. Report on all of the metrics gathered for the best-scoring models.
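For readers who find code easier to follow, here is a condensed, illustrative Python sketch of the loop above. It is not hlink's implementation: the data is a plain list rather than a Spark DataFrame, and `split_into_folds`, `train_and_score`, and `collect_metrics` are placeholder names standing in for hlink's actual fold-splitting, training, and scoring steps.

```python
import random

def split_into_folds(rows, k):
    """Placeholder: partition rows into k folds of roughly equal size."""
    shuffled = random.sample(rows, len(rows))
    return [shuffled[i::k] for i in range(k)]

def train_and_score(params, train_rows, test_rows):
    """Placeholder: train a model with these parameters and return an aggregate score."""
    return random.random()

def collect_metrics(params, threshold, threshold_ratio, train_rows, test_rows):
    """Placeholder: train on train_rows, test on test_rows, and compute confusion-matrix metrics."""
    return {"params": params, "threshold": threshold, "threshold_ratio": threshold_ratio}

def nested_cross_validation(rows, model_parameters, thresholds, threshold_ratios, n, j=3):
    reports = []
    outer_folds = split_into_folds(rows, n)                                      # step 1
    for i, outer_fold in enumerate(outer_folds):                                 # steps 2 and 11
        outer_train = [row for f, fold in enumerate(outer_folds)
                       if f != i for row in fold]                                # step 3
        inner_folds = split_into_folds(outer_train, j)                           # step 4
        scores = {idx: [] for idx in range(len(model_parameters))}
        for k, inner_fold in enumerate(inner_folds):                             # steps 5 and 8
            inner_train = [row for f, fold in enumerate(inner_folds)
                           if f != k for row in fold]                            # step 6
            for idx, params in enumerate(model_parameters):                      # step 7
                scores[idx].append(train_and_score(params, inner_train, inner_fold))
        best = max(scores, key=lambda idx: sum(scores[idx]) / len(scores[idx]))  # step 9
        for threshold in thresholds:                                             # step 10
            for ratio in threshold_ratios:
                reports.append(collect_metrics(model_parameters[best], threshold,
                                               ratio, outer_train, outer_fold))
    return reports                                                               # step 12

# Example run: 4 outer folds, two candidate models, and a 2 x 1 threshold grid.
reports = nested_cross_validation(
    rows=list(range(100)),
    model_parameters=[{"type": "random_forest", "maxDepth": d} for d in (3, 5)],
    thresholds=[0.5, 0.8],
    threshold_ratios=[1.2],
    n=4,
)
print(len(reports))  # 4 outer folds x 2 thresholds x 1 ratio = 8 reports
```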