Label | Explanation | Data Type |

Prediction Type
| Specifies the operation mode that will be used. The tool can be run to train a model to only assess performance, predict features, or create a prediction surface. - Train only—A model will be trained, but no predictions will be generated. Use this option to assess the accuracy of the model before generating predictions. This option will output model diagnostics in the messages window and a chart of variable importance. This is the default.
- Predict to features—Predictions or classifications will be generated for features. Explanatory variables must be provided for both the training features and the features to be predicted. The output of this option will be a feature class, model diagnostics in the messages window, and an optional table and chart of variable importance.
- Predict to raster—A prediction raster will be generated for the area where the explanatory rasters intersect. Explanatory rasters must be provided for both the training area and the area to be predicted. The output of this option will be a prediction surface, model diagnostics in the messages window, and an optional table and chart of variable importance.
| String |

Input Training Features
| The feature class containing the Variable to Predict parameter value and, optionally, the explanatory training variables from fields. | Feature Layer |

Variable to Predict
(Optional) | The variable from the Input Training Features parameter value containing the values to be used to train the model. This field contains known (training) values of the variable that will be used to predict at unknown locations. | Field |

Treat Variable as Categorical (Optional) | Specifies whether the Variable to Predict value is a categorical variable. - Checked—The Variable to Predict value is a categorical variable and classification will be performed.
- Unchecked—The Variable to Predict value is continuous and regression will be performed. This is the default.
| Boolean |

Explanatory Training Variables
(Optional) | A list of fields representing the explanatory variables that help predict the value or category of the Variable to Predict value. Check the Categorical check box for any variables that represent classes or categories (such as land cover or presence or absence). | Value Table |

Explanatory Training Distance Features
(Optional) | The feature layer containing the explanatory training distance features. Explanatory variables will be automatically created by calculating a distance from the provided features to the Input Training Features values. Distances will be calculated from each of the features in the Input Training Features value to the nearest Explanatory Training Distance Features values. If the input Explanatory Training Distance Features values are polygons or lines, the distance attributes will be calculated as the distance between the closest segments of the pair of features. | Feature Layer |
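
The distance calculation described above can be illustrated for the simplest case, point features. This is a minimal sketch in plain Python, not the tool's implementation; the function name `nearest_distance` and the coordinates are hypothetical, and line or polygon inputs (closest-segment distance) are not covered.

```python
import math

def nearest_distance(point, features):
    """Return the straight-line distance from `point` to the closest feature."""
    return min(math.dist(point, f) for f in features)

# Hypothetical coordinates in a projected coordinate system.
training_points = [(0.0, 0.0), (10.0, 0.0)]
distance_features = [(3.0, 4.0), (10.0, 2.0)]

# One new explanatory value per training feature.
distance_variable = [nearest_distance(p, distance_features) for p in training_points]
print(distance_variable)  # [5.0, 2.0]
```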

Explanatory Training Rasters (Optional) | The explanatory training variables extracted from rasters. Explanatory training variables will be automatically created by extracting raster cell values. For each feature in the Input Training Features parameter, the value of the raster cell is extracted at that exact location. Bilinear raster resampling is used when extracting the raster value for continuous rasters. Nearest neighbor assignment is used when extracting a raster value from categorical rasters. Check the Categorical check box for any rasters that represent classes or categories such as land cover or presence or absence. | Value Table |
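
The difference between the two extraction methods named above can be sketched as follows. This is an illustration only: the function names are hypothetical, and coordinates are assumed to be expressed in cell-center index units (column, row) rather than map units.

```python
import math

def nearest_value(grid, col, row):
    """Value of the cell whose center is closest to (col, row)."""
    return grid[int(math.floor(row + 0.5))][int(math.floor(col + 0.5))]

def bilinear_value(grid, col, row):
    """Distance-weighted average of the four surrounding cell centers."""
    c0, r0 = int(math.floor(col)), int(math.floor(row))
    c1 = min(c0 + 1, len(grid[0]) - 1)  # clamp at the raster edge
    r1 = min(r0 + 1, len(grid) - 1)
    fc, fr = col - c0, row - r0
    top = grid[r0][c0] * (1 - fc) + grid[r0][c1] * fc
    bottom = grid[r1][c0] * (1 - fc) + grid[r1][c1] * fc
    return top * (1 - fr) + bottom * fr

# Hypothetical 2 x 2 continuous raster.
elevation = [[0.0, 10.0],
             [20.0, 30.0]]

print(bilinear_value(elevation, 0.5, 0.5))   # 15.0, blends all four cells
print(nearest_value(elevation, 0.75, 0.75))  # 30.0, takes one cell unchanged
```

Nearest neighbor assignment never produces a value that is absent from the raster, which is why it suits class codes such as land cover.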

Input Prediction Features (Optional) | A feature class representing the locations where predictions will be made. This feature class must also contain any explanatory variables provided as fields that correspond to those used from the training data. | Feature Layer |

Output Predicted Features
(Optional) | The output feature class containing the prediction results. | Feature Class |

Output Prediction Surface (Optional) | The output raster containing the prediction results. The default cell size will be the maximum cell size of the raster inputs. To set a different cell size, use the Cell Size environment setting. | Raster Dataset |

Match Explanatory Variables
(Optional) | A list of the Explanatory Variables values specified from the Input Training Features parameter on the right and corresponding fields from the Input Prediction Features parameter on the left. | Value Table |

Match Distance Features
(Optional) | A list of the Explanatory Distance Features values specified for the Input Training Features parameter on the right and corresponding feature sets from the Input Prediction Features parameter on the left. Explanatory Distance Features values that are more appropriate for the Input Prediction Features parameter can be provided if those used for training are in a different study area or time period. | Value Table |

Match Explanatory Rasters
(Optional) | A list of the Explanatory Rasters values specified for the Input Training Features parameter on the right and corresponding rasters from the Input Prediction Features parameter or the Prediction Surface parameter to be created on the left. The Explanatory Rasters values that are more appropriate for the Input Prediction Features parameter can be provided if those used for training are in a different study area or time period. | Value Table |

Output Trained Features
(Optional) | The explanatory variables used for training (including sampled raster values and distance calculations), as well as the observed Variable to Predict field and accompanying predictions that will be used to further assess performance of the trained model. | Feature Class |

Output Variable Importance Table
(Optional) | The table that will contain information describing the importance of each explanatory variable used in the model. The explanatory variables include fields, distance features, and rasters used to create the model. If the Model Type parameter value is Gradient Boosted, importance is measured by gain, weight, and cover, and the table will include these fields. The output will include a chart of the importance of the explanatory variables: a bar chart if the Number of Runs for Validation parameter value is 1, or a box plot if the value is greater than 1. | Table |

Convert Polygons to Raster Resolution for Training
(Optional) | Specifies how polygons will be treated when training the model if the Input Training Features values are polygons with a categorical Variable to Predict value and only Explanatory Training Rasters values have been provided. - Checked—The polygon will be divided into all of the raster cells with centroids falling within the polygon. The raster values at each centroid will be extracted and used to train the model. The model will no longer be trained on the polygon; it will be trained on the raster values extracted for each cell centroid. This is the default.
- Unchecked—Each polygon will be assigned the average value of the underlying continuous rasters and the majority for underlying categorical rasters.
| Boolean |

Number of Trees
(Optional) | The number of trees that will be created in the Forest-based and Gradient Boosted models. The default is 100. If the Model Type parameter value is Forest-based, more trees will generally result in more accurate model predictions; however, the model will take longer to calculate. If the Model Type parameter value is Gradient Boosted, more trees may result in more accurate model predictions; however, they may also lead to overfitting the training data. To avoid overfitting the data, provide values for the Maximum Tree Depth, L2 Regularization (Lambda), Minimum Loss Reduction for Splits (Gamma), and Learning Rate (Eta) parameters. | Long |

Minimum Leaf Size
(Optional) | The minimum number of observations required to keep a leaf (that is, the terminal node on a tree without further splits). The default minimum for regression is 5 and the default for classification is 1. For very large data, increasing these numbers will decrease the run time of the tool. | Long |

Maximum Tree Depth
(Optional) | The maximum number of splits that will be made down a tree. A large maximum depth creates more splits, which may increase the chances of overfitting the model. If the Model Type parameter value is Forest-based, the default is data driven and depends on the number of trees created and the number of variables included. If the Model Type parameter value is Gradient Boosted, the default is 6. | Long |

Data Available per Tree (%)
(Optional) | The percentage of the Input Training Features values that will be used for each decision tree. The default is 100 percent of the data. Each decision tree in the forest is created using a random sample (approximately two-thirds) of the data specified. Using a lower percentage of the input data for each decision tree decreases the run time of the tool for very large datasets. | Long |

Number of Randomly Sampled Variables
(Optional) | The number of explanatory variables that will be used to create each decision tree. Each decision tree in the forest-based and gradient boosted models is created using a random subset of the specified explanatory variables. Increasing the number of variables used in each decision tree will increase the chances of overfitting the model, particularly if there is one or more dominant variables. The default is to use the square root of the total number of explanatory variables (fields, distances, and rasters combined) if the Variable to Predict value is categorical or to divide the total number of explanatory variables (fields, distances, and rasters combined) by 3 if the Variable to Predict value is numeric. | Long |
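
The default described above can be written out as a small calculation. This is a sketch only: the function name is hypothetical, and rounding down with a minimum of 1 is an assumption, since the rounding rule is not stated here.

```python
import math

def default_sampled_variables(n_explanatory, categorical):
    """Square root of the variable count for classification, one third for regression."""
    raw = math.sqrt(n_explanatory) if categorical else n_explanatory / 3
    # Assumption: round down, but always sample at least one variable.
    return max(1, math.floor(raw))

# 12 explanatory variables in total (fields, distances, and rasters combined).
print(default_sampled_variables(12, categorical=True))   # 3
print(default_sampled_variables(12, categorical=False))  # 4
```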

Training Data Excluded for Validation (%)
(Optional) | The percentage (between 10 percent and 50 percent) of the Input Training Features values that will be reserved as the test dataset for validation. The model will be trained without this random subset of data, and the model predicted values for those features will be compared to the observed values. The default is 10 percent. | Double |
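
A random holdout of this kind can be sketched in a few lines. The function name, the fixed seed, and the rounding of the test-set size are assumptions made for the illustration; the tool's own sampling is not documented here.

```python
import random

def split_for_validation(feature_ids, percent_excluded=10.0, seed=0):
    """Return (training_ids, test_ids) with a random subset held out for validation."""
    if not 10 <= percent_excluded <= 50:
        raise ValueError("percent_excluded must be between 10 and 50")
    ids = list(feature_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is repeatable
    n_test = round(len(ids) * percent_excluded / 100)
    return ids[n_test:], ids[:n_test]

train_ids, test_ids = split_for_validation(range(200), percent_excluded=10)
print(len(train_ids), len(test_ids))  # 180 20
```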

Output Classification Performance Table (Confusion Matrix)
(Optional) | A confusion matrix that summarizes the performance of the model created on the validation data. The matrix compares the model predicted categories for the validation data to the actual categories. This table can be used to calculate additional diagnostics that are not included in the output messages. This parameter is available when the Variable to Predict value is categorical and the Treat Variable as Categorical parameter is checked. | Table |
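
As an example of an additional diagnostic, per-category precision and recall can be derived from a confusion matrix. The sketch below assumes the matrix has been read into a nested list with actual categories as rows and predicted categories as columns; that orientation, the function name, and the counts are assumptions, so check the layout of the output table before reusing it.

```python
def per_class_diagnostics(labels, matrix):
    """matrix[i][j] = count of features with actual class i predicted as class j."""
    results = {}
    for k, label in enumerate(labels):
        true_pos = matrix[k][k]
        actual_total = sum(matrix[k])                    # row sum
        predicted_total = sum(row[k] for row in matrix)  # column sum
        recall = true_pos / actual_total if actual_total else 0.0
        precision = true_pos / predicted_total if predicted_total else 0.0
        results[label] = {"precision": precision, "recall": recall}
    return results

# Hypothetical validation counts for two categories.
diagnostics = per_class_diagnostics(["forest", "urban"], [[8, 2], [1, 9]])
print(diagnostics["forest"]["recall"])  # 0.8
```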

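Output Validation Table (Optional) | A table that contains the R² value of each model if the Variable to Predict value is continuous, or the accuracy of each model if it is categorical. | Table |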

Compensate for Sparse Categories
(Optional) | Specifies whether each category in the training dataset, regardless of its frequency, will be represented in each tree. This parameter is available when the Model Type parameter value is Forest-based. - Checked—Each tree will include every category that is represented in the training dataset.
- Unchecked—Each tree will be created based on a random sample of the categories in the training dataset. This is the default.
| Boolean |

Number of Runs for Validation
(Optional) | The number of iterations of the tool. The distribution of R-squared values (continuous) or accuracies (categorical) of all the models can be displayed using the Output Validation Table parameter. If the Prediction Type parameter value is Predict to raster or Predict to features, the model that produced the median R-squared value or accuracy will be used to make predictions. Using the median value helps ensure stability of the predictions. | Long |
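
Selecting the run with the median score can be sketched as follows. The function name and the scores are hypothetical, and using the lower of the two middle values when the number of runs is even is an assumption.

```python
import statistics

def pick_median_run(scores):
    """Return the index of the run whose score is the median of all runs."""
    # median_low always returns a value that occurs in the data.
    median_score = statistics.median_low(scores)
    return scores.index(median_score)

# Hypothetical R-squared values from five validation runs.
r_squared_by_run = [0.71, 0.64, 0.69, 0.75, 0.66]
print(pick_median_run(r_squared_by_run))  # 2
```

Choosing the median run rather than the best one avoids basing predictions on a model that scored well only because of a favorable random split.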

Calculate Uncertainty
(Optional) | Specifies whether prediction uncertainty will be calculated when training, predicting to features, or predicting to raster. This parameter is available when the Model Type parameter value is Forest-based. - Checked—A prediction uncertainty interval will be calculated.
- Unchecked—Uncertainty will not be calculated. This is the default.
| Boolean |

Output Trained Model File
(Optional) | An output model file that will save the trained model, which can be used later for prediction. | File |

Model Type
(Optional) | Specifies the method that will be used to create the model. - Forest-based—A model will be created using an adaptation of the random forest algorithm. The model will use the votes from hundreds of decision trees. Each decision tree will be created from a randomly generated subset of the original data and original variables.
- Gradient Boosted—A model will be created using the Extreme Gradient Boosting (XGBoost) algorithm. The model will create a sequence of hundreds of trees in which each subsequent tree corrects the errors of the previous trees.
| String |

L2 Regularization (Lambda)
(Optional) | A regularization term that reduces the model's sensitivity to individual features. Increasing this value will make the model more conservative and prevent overfitting the training data. If the value is 0, the model becomes the traditional Gradient Boosting model. The default is 1. This parameter is available when the Model Type parameter value is Gradient Boosted. | Double |

Minimum Loss Reduction for Splits (Gamma)
(Optional) | A threshold for the minimum loss reduction needed to split trees. Potential splits are evaluated for their loss reduction. If the candidate split has a higher loss reduction than this threshold value, the partition will occur. Higher threshold values avoid overfitting and result in more conservative models with fewer partitions. The default is 0. This parameter is available when the Model Type parameter value is Gradient Boosted. | Double |
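
To show how Lambda and Gamma interact, the sketch below uses the split loss reduction formula from the XGBoost documentation, where G and H are the sums of first and second derivatives of the loss in each child node. Whether the tool evaluates splits in exactly this form is an assumption; the function names and numbers are hypothetical.

```python
def loss_reduction(g_left, h_left, g_right, h_right, reg_lambda=1.0):
    """Loss reduction of a candidate split (XGBoost structure score difference)."""
    def score(g, h):
        return g * g / (h + reg_lambda)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent)

def should_split(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """The partition occurs only if the loss reduction exceeds the gamma threshold."""
    return loss_reduction(g_left, h_left, g_right, h_right, reg_lambda) > gamma

# Hypothetical gradient statistics for one candidate split.
print(should_split(-4.0, 2.0, 4.0, 2.0, reg_lambda=1.0, gamma=0.0))  # True
print(should_split(-4.0, 2.0, 4.0, 2.0, reg_lambda=1.0, gamma=6.0))  # False
```

A larger Lambda shrinks the loss reduction of every candidate split, and a larger Gamma raises the bar each split must clear, so both lead to fewer partitions.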

Learning Rate (Eta)
(Optional) | A value that reduces the contribution of each tree to the final prediction. The value should be greater than 0 and less than or equal to 1. A lower learning rate prevents overfitting the model; however, it may require longer computation times. The default is 0.3. This parameter is available when the Model Type parameter value is Gradient Boosted. | Double |

Maximum Number of Bins for Searching Splits (Optional) | The number of bins that the training data will be divided into to search for the best splitting point. The value cannot be 1. The default is 0, which corresponds to the use of a greedy algorithm. A greedy algorithm will create a candidate split at every data point. Providing too few bins for searching is not recommended because it will lead to poor model prediction performance. This parameter is available when the Model Type parameter value is Gradient Boosted. | Long |

Optimize Parameters
(Optional) | Specifies whether an optimization method will be used to find the set of hyperparameters that achieve optimal model performance. - Checked—An optimization method will be used to find the set of hyperparameters.
- Unchecked—An optimization method will not be used. This is the default.
| Boolean |

Optimization Method
(Optional) | Specifies the optimization method that will be used to select and test search points to find the optimal set of hyperparameters. Search points are combinations of hyperparameters within the search space specified by the Model Parameter Setting parameter. This option is available when the Optimize Parameters parameter is checked. - Random Search (Quick)—A stratified random sampling algorithm will be used to select the search points within the search space. This is the default.
- Random Search (Robust)—A stratified random sampling algorithm will be used to select the search points. Each search will be run 10 times using a different random seed. The result of each search will be the median best run determined by the Optimize Target (Objective) parameter value. This option is available if the Model Type parameter value is Forest-based.
- Grid Search—Every search point within the search space will be selected.
| String |

Optimize Target (Objective)
(Optional) | Specifies the objective function or value that will be minimized or maximized to find the optimal set of hyperparameters. - R-Squared—The optimization method will maximize R² to find the optimal set of hyperparameters. This option is only available when the variable to predict is not categorical. This is the default when the variable to predict is not categorical.
- Root Mean Square Error (RMSE)—The optimization method will minimize root mean square error to find the optimal set of hyperparameters. This option is only available when the variable to predict is not categorical.
- Accuracy—The optimization method will maximize accuracy to find the optimal model. This option is only available when the variable to predict is categorical. This is the default when the variable to predict is categorical.
- Matthews correlation coefficient (MCC)—The optimization method will maximize Matthews correlation coefficient to find the optimal set of hyperparameters. This option is only available when the variable to predict is categorical.
- F1-Score—The optimization method will maximize the F1-Score to find the optimal set of hyperparameters. This option is only available when the variable to predict is categorical.
| String |
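
The objectives listed above follow standard definitions, sketched here in plain Python. The function names and sample values are hypothetical, and the classification scores are shown for the two-category case only; multi-category averaging is not covered.

```python
import math

def r_squared(observed, predicted):
    """1 minus the ratio of residual to total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

def rmse(observed, predicted):
    """Square root of the mean squared difference."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

def binary_scores(tp, fp, fn, tn):
    """Accuracy, F1-score, and Matthews correlation coefficient from binary counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, f1, mcc

print(rmse([1, 2, 3, 4], [1, 2, 3, 5]))  # 0.5
accuracy, f1, mcc = binary_scores(tp=8, fp=2, fn=1, tn=9)
print(accuracy)  # 0.85
```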

Number of Runs for Parameter Sets
(Optional) | The number of search points within the search space specified by the Model Parameter Setting parameter that will be tested. This parameter is available when the Optimization Method value is Random Search (Quick) or Random Search (Robust). | Long |

Model Parameter Setting
(Optional) | A list of hyperparameters and their search spaces. Customize the search space of each hyperparameter by providing a lower bound, upper bound, and interval. The lower bound and upper bound specify the range of possible values for the hyperparameter. The following is the range of valid values for each hyperparameter: - Number of Trees (number_of_trees)—An integer value greater than 1.
- Maximum Tree Depth (maximum_depth)—An integer value greater than or equal to 0.
- Minimum Leaf Size (minimum_leaf_size)—An integer value greater than 1.
- Data Available per Tree (%) (sample_size)—An integer value greater than 0 and less than or equal to 100.
- Number of Randomly Sampled Variables (random_variables)—An integer value less than or equal to the number of explanatory variables. This includes the explanatory variables from fields, distance features, and rasters.
- Learning Rate (Eta) (eta)—A double value greater than 0 and less than or equal to 1.
- L2 Regularization (Lambda) (reg_lambda)—A double value greater than or equal to 0.
- Minimum Loss Reduction for Splits (Gamma) (gamma)—A double value greater than or equal to 0.
- Maximum Number of Bins for Searching Splits (max_bins)—An integer value greater than 1 or the value 0. A value of 0 means the model will create a candidate split at every data point.
| Value Table |
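
The way a lower bound, upper bound, and interval define the search points for a grid search can be sketched as follows. The hyperparameter names are taken from the list above; the bounds are hypothetical, and treating the upper bound as inclusive is an assumption.

```python
import itertools

def axis_values(lower, upper, interval):
    """Values from lower to upper (inclusive) in steps of interval."""
    count = int((upper - lower) / interval + 1e-9) + 1  # tolerance for float error
    return [round(lower + i * interval, 10) for i in range(count)]

# (lower bound, upper bound, interval) for each hyperparameter.
search_space = {
    "number_of_trees": (100, 300, 100),
    "maximum_depth": (4, 8, 2),
    "eta": (0.1, 0.3, 0.1),
}

axes = {name: axis_values(*bounds) for name, bounds in search_space.items()}
# Every combination of axis values is one search point.
grid_points = [dict(zip(axes, combo)) for combo in itertools.product(*axes.values())]
print(len(grid_points))  # 27
```

The number of search points is the product of the number of values on each axis, so narrowing a range or widening an interval is the most direct way to shorten a grid search.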

Output Parameter Tuning Table
(Optional) | A table that contains the parameter settings and objective values for each optimization trial. The output includes a chart of all the trials and their objective values. This option is available when Optimize Parameters is checked. | Table |

Include All Prediction Probabilities (Optional) | For categorical variables to predict, specifies whether the probability of every category of the categorical variable or only the probability of the record's category will be predicted. For example, if a categorical variable has categories A, B, and C, and the first record has category B, use this parameter to specify whether the probability for categories A, B, and C will be predicted or only the probability of category B will be predicted for the record. - Checked—Probabilities for all categories of the categorical variable will be predicted and included in the output.
- Unchecked—Only the probability of the category of the record will be predicted and included in the output. This is the default.
| Boolean |

### Derived Output

Label | Explanation | Data Type |

Output Uncertainty Raster Layers | When the Calculate Uncertainty parameter is checked, the tool will calculate a 90 percent prediction interval around each predicted value of the Variable to Predict parameter. | Raster Layer |
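
How the tool derives the interval is not described here. As a generic illustration only, a 90 percent interval can be formed from a set of candidate values (for example, hypothetical per-tree predictions for one location) by taking their 5th and 95th percentiles. The function name and data below are assumptions.

```python
import statistics

def interval_90(candidate_values):
    """Return the 5th and 95th percentiles of the candidate values."""
    # n=20 yields cut points at 5, 10, ..., 95 percent.
    cuts = statistics.quantiles(candidate_values, n=20, method="inclusive")
    return cuts[0], cuts[-1]

# Hypothetical candidate predictions for a single location.
tree_predictions = [float(v) for v in range(101)]
lower, upper = interval_90(tree_predictions)
print(lower, upper)  # 5.0 95.0
```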