How Evaluate Predictions with Cross-validation works

The Evaluate Predictions with Cross-validation tool performs K-fold cross-validation to evaluate how well a model predicts unseen data through multiple validations. The tool splits the input dataset into groups, reserves a single group as the testing set, trains a model using the remaining groups, and calculates evaluation metrics that measure how well the model predicted the values in the reserved group. It then repeats this process for each group. Groups can be randomly selected (Random k-fold) or spatially clustered (Spatial k-fold) when you want to understand the model's predictive power on unknown data in new geospatial regions. The tool also has data balancing options, which can help when classifying rare events. This tool is used in conjunction with predictive tools, such as Forest-based and Boosted Classification and Regression, Generalized Linear Regression, and Presence-only Prediction (MaxEnt). It provides a more robust way to evaluate a model's performance than the validation options offered in the predictive tools themselves.

K-fold cross-validation

The Evaluate Predictions with Cross-validation tool evaluates how well a model predicts unseen data through multiple validations. In K-fold validation, the input analysis result features are first split into a number (k) of groups (folds) of the same or similar size. In a validation run, a single group is reserved as the testing set while the model is trained on the remaining groups. The model is then used to predict the testing set and statistical metrics are generated to evaluate the model’s performance. The tool then iteratively uses each group as the testing set and performs a validation run.

K-fold cross-validation repeats the validation process multiple times and creates a more comprehensive assessment of the model performance with different testing sets. While simple validation with a single train-test split can be straightforward and useful, K-fold cross-validation is more informative. The predictive tools, such as Forest-based and Boosted Classification and Regression and Generalized Linear Regression, offer a validation step; however, K-fold cross-validation is superior because it repeatedly splits the data into different training and testing sets. This provides a more reliable estimate of the model's performance in predicting new data and helps reveal potential overfitting that occurred during training.
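
The sketch below illustrates this train-on-k-minus-1, test-on-one workflow using scikit-learn rather than the tool itself; the synthetic data, the choice of a linear model, and k = 5 are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Hypothetical training data: 100 features with two explanatory variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)  # k = 5 groups
fold_rmse = []
for train_idx, test_idx in kfold.split(X):
    # Train on the remaining k - 1 groups; hold one group out as the testing set.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])
    fold_rmse.append(mean_squared_error(y[test_idx], predictions) ** 0.5)

# Each group has served once as the testing set; average the per-fold metrics.
print("RMSE per fold:", np.round(fold_rmse, 3))
print("Average RMSE:", round(float(np.mean(fold_rmse)), 3))
```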

Grouping features

K-fold cross-validation splits the analysis features into groups. The Number of Groups parameter controls the number of groups (k) that are created. The parameter value can range from 2 to the number of features in the dataset. The Evaluation Type parameter determines whether the features in a group are randomly selected or spatially clustered. When working with categorical variables, the categorical levels may not be represented equally across the groups; some categories may be rare while others are frequent.

Random K-fold

Random K-fold cross-validation randomly splits the analysis result features into k groups. Each group contains the same, or a similar, number of features.
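
A minimal sketch of the random grouping idea follows; the dataset size and group count are assumed, and this is an illustration rather than the tool's internal code.

```python
import numpy as np

# Shuffle the feature indices, then split them into k nearly equal-sized groups.
rng = np.random.default_rng(42)
n_features, k = 103, 5          # hypothetical dataset size and group count
indices = rng.permutation(n_features)
groups = np.array_split(indices, k)

print([len(group) for group in groups])  # [21, 21, 21, 20, 20]
```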

Spatial K-fold

Spatial K-fold ensures that each training group and testing group are spatially separate from one another. The spatial groups are created using k-means clustering, which takes the coordinates of each feature and creates k spatially partitioned groups. However, the groups may not contain the same number of features. Spatial K-fold validation is helpful for understanding the model's predictive power on unknown data in new geospatial regions.
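
The following sketch shows how spatial groups of this kind could be formed; it uses scikit-learn's k-means on synthetic coordinates and illustrates the idea rather than the tool's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical feature locations (x, y).
rng = np.random.default_rng(1)
coords = rng.uniform(0, 100, size=(200, 2))

# Cluster the coordinates into k spatially partitioned groups.
k = 5
group_labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(coords)

# Unlike random K-fold, the spatial groups can contain unequal numbers of features.
print(np.bincount(group_labels))

# A validation run holds out one spatial cluster as the testing set, so the
# model must predict into a region it was not trained on.
test_mask = group_labels == 0
train_mask = ~test_mask
print(test_mask.sum(), "testing features,", train_mask.sum(), "training features")
```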

Leave one out cross-validation

If the number of groups equals the number of input features, leave one out cross-validation (LOOCV) is performed—for example, if a Generalized Linear Regression analysis output with 100 features is the Analysis Result Feature and the Number of Groups parameter is set to 100. The model will be trained on 99 features and then used to predict and evaluate the single remaining feature. This process is repeated 100 times. The advantage of LOOCV is that it provides a robust and unbiased measure of error metrics such as MSE, RMSE, and MAPE. However, it should not be used to evaluate global metrics like R-squared, as these cannot be calculated with a sample size of 1, and LOOCV is not reliable for very small datasets.
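
A minimal LOOCV sketch using scikit-learn follows; the synthetic 100-feature dataset and the linear model are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

# Hypothetical 100-feature dataset: each LOOCV run trains on 99 features
# and predicts the single remaining feature.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.3, size=100)

absolute_errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    prediction = model.predict(X[test_idx])[0]
    absolute_errors.append(abs(prediction - y[test_idx][0]))

# Error metrics such as MAE can be averaged across the 100 runs; R-squared
# cannot be computed from a single held-out feature.
print("LOOCV MAE:", round(float(np.mean(absolute_errors)), 3))
```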

Evaluating spatial k-fold results

Evaluation metrics for spatial cross-validation are influenced by the number of groups selected. The smaller the spatially contiguous cluster used as the validation set, the closer the evaluation metrics will be to those of leave one out cross-validation. A smaller spatially contiguous validation set is likely to involve less spatial extrapolation because it has closer neighbors in the training set. On the other hand, random cross-validation metrics tend to remain stable and similar or equal to the leave one out cross-validation metrics regardless of the number of groups selected. Therefore, the number of groups selected for spatial cross-validation is a crucial parameter to consider. For example, if you train your model on data from counties in 49 of 50 states in the United States and aim to make predictions in the 50th state, an appropriate number of groups might be 49. This approach ensures that each fold represents a hypothetical state, allowing the final metrics to accurately reflect the model's performance when predicting in a new state.

Comparing evaluation types

In general, spatial cross-validation metrics tend to yield poorer evaluation results compared to random cross-validation. For example, while random cross-validation might achieve an average accuracy of 90 percent across the folds, spatial cross-validation could show a lower average accuracy of around 70 percent. This discrepancy is expected because random cross-validation benefits from spatial autocorrelation. In random validation sets, features often have spatial neighbors that closely resemble them in the corresponding training set, especially when autocorrelation is high. In contrast, spatial validation subsets lack this advantage, leading to a degree of spatial extrapolation in which predictions are made in a new spatial area that the model has not been trained on. Using random cross-validation to evaluate a model does not make the underlying model better, even if the metrics look better; rather, it overestimates how the model will perform in a real-world scenario where predictions are made for new regions.

Reviewing cross-validation results

A common misconception about cross-validation and other model validation procedures is that they are intended to determine whether the model is correct for the data. Models are never correct for data collected from the real world, but they don't need to be correct to provide actionable information for decision-making. Cross-validation statistics are a means to quantify the usefulness of a model, not a checklist to determine whether a model is correct. With the many available statistics (individual values, summary statistics, and charts), it's possible to look too closely and find problems and deviations from ideal values and patterns. Models are never perfect because they never perfectly represent the data.

When reviewing cross-validation results, it is important to remember the goals and expectations of your analysis. For example, suppose you are predicting temperature in degrees Celsius to make public health recommendations during a heat wave. In this scenario, how should you interpret a mean error value of 0.1? Taken literally, it means that the model has positive bias and tends to overpredict temperature values. However, the average bias is only one-tenth of a degree, which is likely not large enough to be relevant to public health policy. On the other hand, a root mean square error value of 10 degrees means that, on average, the predicted values were off by 10 degrees from the true temperatures. This model would likely be too inaccurate to be useful because differences of 10 degrees would elicit very different public health recommendations.
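
To make the distinction concrete, the short sketch below uses a hypothetical set of residuals to show how the mean error can be near zero while the root mean square error remains large.

```python
import numpy as np

# Hypothetical prediction errors (predicted minus actual temperature, in degrees):
# the bias (mean error) is near zero even though individual errors are large.
errors = np.array([10.1, -9.9, 10.2, -9.8, 9.9, -10.1])

mean_error = errors.mean()                # about 0.07 degrees of bias
rmse = np.sqrt(np.mean(errors ** 2))      # about 10 degrees of typical error

print(f"Mean error: {mean_error:.2f}")
print(f"RMSE: {rmse:.2f}")
```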

Outputs

The tool generates geoprocessing messages and two outputs: a feature class and a table. The feature class contains the training dataset along with the training and out of sample prediction results for each feature. The table records the evaluation metrics for each validation run. The geoprocessing messages include the Average Out of Sample Diagnostic Statistics table.

Geoprocessing messages

You can access the messages by hovering over the progress bar, clicking the pop-out button, or expanding the messages section in the Geoprocessing pane. You can also access the messages for a previous run of this tool in the geoprocessing history. The messages include an Average Out of Sample Diagnostic Statistics table.

Average Out of Sample Diagnostic Statistics table

Analysis diagnostics are provided in the Average Out of Sample Diagnostic Statistics table.

The Average Out of Sample Diagnostic Statistics table contains the following diagnostics:

  • R-squared—R-squared is a measure of goodness of fit. It is the proportion of dependent variable variance accounted for by the regression model. The value varies from 0.0 to 1.0, and a higher value denotes a better model. Unlike the R-squared value for the training data, the out of sample R-squared can decrease when additional explanatory variables are included, so it can be used to determine whether including new explanatory variables improves the model. R-squared will not be calculated when groups contain fewer than three features.
  • Adjusted R-squared—Adjusted R-squared is similar to R-squared; however, it adds a penalty for including additional explanatory variables in order to give some preference to models with fewer explanatory variables. Calculations for the adjusted R-squared value normalize the numerator and denominator by their degrees of freedom. In making this adjustment, you lose the interpretation of the value as a proportion of the variance explained. This metric is only calculated for Generalized Linear Regression models. Adjusted R-squared will not be calculated when groups contain fewer than three features.
  • Root Mean Square Error (RMSE)—RMSE is the square root of the mean square error (MSE), which is the average of the squared differences between the actual values and the predicted values. As with MAE (Mean Absolute Error), RMSE represents the average model prediction error in the units of the variable of interest; however, RMSE is more sensitive to large errors and outliers. This statistic is generally used to measure prediction accuracy. Because RMSE is in the units of the variable of interest, it cannot be compared across models that predict different variables.
  • Mean Absolute Error (MAE)—MAE is the average of the absolute differences between the actual values and the predicted values of the Variable of Interest parameter. A value of 0 means the model correctly predicted every observed value. Because MAE is in the units of the variable of interest, it cannot be compared across models that predict different variables.
  • Mean Absolute Percentage Error (MAPE)—MAPE is similar to MAE in that it represents the difference between the actual values and the predicted values. However, while MAE represents the difference in the original units, MAPE represents the difference as a percentage. MAPE is a relative error, so it is a better diagnostic when comparing models with different variables of interest. Due to how MAPE is calculated, it cannot be used if any of the original values are 0. If the original values are very close to 0, MAPE will approach infinity and appear as Null in the table. Another limitation of MAPE is that it weights errors relative to the magnitude of the actual value: if two features have the same absolute difference between the actual and predicted values, the feature with the smaller actual value will contribute more to MAPE.
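
The sketch below computes these diagnostics for a single validation run from their standard definitions, using assumed actual and predicted values; it illustrates the formulas rather than the tool's internal implementation.

```python
import numpy as np

# Hypothetical actual and out of sample predicted values for one validation run.
actual = np.array([10.0, 12.5, 9.0, 15.0, 11.0, 13.5])
predicted = np.array([10.5, 12.0, 9.8, 14.0, 11.4, 13.0])

residuals = actual - predicted

ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot                    # proportion of variance explained

rmse = np.sqrt(np.mean(residuals ** 2))            # penalizes large errors more heavily
mae = np.mean(np.abs(residuals))                   # average absolute error, in the variable's units
mape = np.mean(np.abs(residuals / actual)) * 100   # percentage error; undefined if any actual value is 0

print(f"R-squared: {r_squared:.3f}  RMSE: {rmse:.3f}  MAE: {mae:.3f}  MAPE: {mape:.2f}%")
```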

Additional outputs

This tool also produces a table and an output feature class.

Output table

The output validation table contains the same diagnostics included in the geoprocessing messages: Adjusted R-squared, R-squared, Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), and Mean Absolute Error (MAE). The table shows the statistics for each of the K folds.

Output features

The fields in the output features include the explanatory training variables used in the model, the variable to predict, the average training predicted value, the average training residual, the out of sample predicted value, and the out of sample residual. You can use the predicted value and residual fields to evaluate how accurately each feature's value was predicted.

Best practices and limitations

The following are best practices and limitations when using this tool:

  • Use this tool during parameter tuning and model optimization. For example, you can specify parameter settings in the Forest-based and Boosted Classification and Regression tool and assess the trained model by providing the output training dataset to the Evaluate Predictions with Cross-validation tool. Based on the cross-validation results, you can return to the Forest-based and Boosted Classification and Regression tool to fine-tune certain parameters. These two steps can be repeated until the cross-validation metrics meet your needs. You can then prepare the final model using the full training dataset or a balanced dataset and predict to new, unknown data.
  • Decide which evaluation metric is most important for your specific use case. Consider the following:
    • For classification—If you are predicting a rare event that is very important, you can optimize the sensitivity of that category. If you have many categories and you want the model that predicts best across all of the categories, consider the MCC or overall F1 metrics. Accuracy is not always the best metric, especially when rare categories are involved. For example, if 99% of your data is Category A and 1% of your data is Category B, a model that predicts every feature as Category A would have 99% accuracy but 0% sensitivity for Category B (see the sketch after this list).
    • For regression—If you are interested in the overall fit of the model to the data, you may want to optimize the R-squared. If you are concerned with individual errors of the model, you may want to optimize MAPE or MAE. If you are concerned with individual errors and minimizing extreme errors, you may want to optimize based on RMSE.
  • The hyperparameters that yield optimal metrics from a random split may not be the same ones that provide the best metrics for a spatial split. If your objective is to make predictions for a new spatial area, evaluate using spatial splits. Experiment with various models and parameter selections and input each into the tool to determine which combination results in the best average metrics with spatial cross-validation.
  • R-squared and Adjusted R-squared will not be calculated when a validation set contains fewer than three features. This means that they will not be calculated if the number of groups is greater than one-third of the number of features.
  • Matthews Correlation Coefficient (MCC) cannot be calculated if all predicted outputs are the same value.
  • Data balancing can help improve model accuracy when classifying rare case events.
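
To make the rare-category example in the classification bullet concrete, the following is a minimal sketch using scikit-learn metrics; the data and category labels are hypothetical, and this is not output from the tool.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical rare-category scenario: 99 percent of features are Category A,
# 1 percent are Category B, and the model predicts Category A for every feature.
actual = np.array(["A"] * 99 + ["B"])
predicted = np.array(["A"] * 100)

accuracy = accuracy_score(actual, predicted)                     # 0.99
sensitivity_b = recall_score(actual, predicted, pos_label="B")   # 0.0

print(f"Overall accuracy: {accuracy:.2f}")                  # looks excellent
print(f"Sensitivity for Category B: {sensitivity_b:.2f}")   # the rare event is never detected
```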

Related topics