The Forest-based Classification and Regression tool trains a model based on known values provided as part of a training dataset. This prediction model can then be used to predict unknown values in a prediction dataset that has the same associated explanatory variables. The tool creates models and generates predictions using an adaptation of Leo Breiman's random forest algorithm, which is a supervised machine learning method. The tool creates many decision trees, called an ensemble or a forest, that are used for prediction. Each tree generates its own prediction and is used as part of a voting scheme to make final predictions. The final predictions are not based on any single tree but rather on the entire forest. The use of the entire forest rather than an individual tree helps avoid overfitting the model to the training dataset, as does the use of both a random subset of the training data and a random subset of explanatory variables in each tree that comprises the forest.
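As a minimal sketch of how per-tree predictions might be combined into a final prediction, the snippet below uses a majority vote for categories and a simple average for continuous values. The function names are hypothetical, and averaging for regression is an assumption based on common random forest practice rather than a description of the tool's internals.

```python
from collections import Counter


def combine_classification(tree_votes):
    """Return the category predicted by the most trees.

    Ties go to the category that was seen first (illustrative choice).
    """
    return Counter(tree_votes).most_common(1)[0][0]


def combine_regression(tree_predictions):
    """Return the average of the per-tree predictions (assumed scheme)."""
    return sum(tree_predictions) / len(tree_predictions)


# Five hypothetical trees vote on one feature.
votes = ["Land", "Water", "Land", "Land", "Water"]
print(combine_classification(votes))           # Land
print(combine_regression([10.0, 12.0, 14.0]))  # 12.0
```

No single tree decides the outcome: two trees predicted Water, but the forest as a whole predicts Land.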
The following are potential applications for this tool:
- Given data on occurrence of seagrass, as well as a number of environmental explanatory variables represented as both attributes and rasters, in addition to distances to factories upstream and major ports, future seagrass occurrence can be predicted based on future projections for those same environmental explanatory variables.
- Suppose you have crop yield data at hundreds of farms across the country along with other attributes at each of those farms (number of employees, acreage, and so on), as well as a number of rasters that represent the slope, elevation, rainfall, and temperature at each farm. Using these pieces of data, you can provide a set of features representing farms where you don't have crop yield (but you do have all of the other variables), and make a prediction about crop yield.
- Housing values can be predicted based on the prices of houses that have been sold in the current year. The sale price of homes sold along with information about the number of bedrooms, distance to schools, proximity to major highways, average income, and crime counts can be used to predict sale prices of similar homes.
- Land use types can be classified using training data and a combination of raster layers, including multiple individual bands, and products such as NDVI.
- Given information on the blood lead levels of children and the tax parcel ID of their homes, combined with parcel-level attributes such as age of home, census-level data such as income and education levels, and national datasets reflecting toxic release of lead and lead compounds, the risk of lead exposure for parcels without blood lead level data can be predicted. These risk predictions could drive policies and education programs in the area.
Training a model
The first step in using the Forest-based Classification and Regression tool is training a model for prediction. Training builds a forest that establishes a relationship between explanatory variables and the Variable to Predict parameter. Whether you choose the Train only option or train and predict, the tool begins by constructing a model based on the Variable to Predict parameter and any combination of the Explanatory Training Variables, Explanatory Training Distance Features (available with an Advanced license), and Explanatory Training Rasters (available with a Spatial Analyst license) parameters. The tool evaluates the performance of the model created and provides additional diagnostics.
By default, 10 percent of the training data is excluded from training for validation purposes. After the model is trained, it is used to predict the values for the test data, and those predicted values are compared to the observed values to provide a measure of prediction accuracy based on data that was not included in the training process. Additional diagnostics about the model, including forest characteristics, Out of Bag (OOB) Errors, and a Summary of Variable Importance are also included. These outputs are described in more detail below.
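The random exclusion of validation data can be pictured with the following sketch. It is not the tool's code: `split_for_validation` and its arguments are hypothetical names, and rounding the validation count to the nearest whole feature is an assumption.

```python
import random


def split_for_validation(feature_ids, percent_excluded=10, seed=None):
    """Randomly reserve a percentage of features for validation.

    Returns (training_ids, validation_ids). A fixed seed makes the
    split repeatable between runs.
    """
    rng = random.Random(seed)
    shuffled = list(feature_ids)
    rng.shuffle(shuffled)
    n_validation = round(len(shuffled) * percent_excluded / 100)
    return shuffled[n_validation:], shuffled[:n_validation]


train_ids, validation_ids = split_for_validation(range(200), 10, seed=42)
print(len(train_ids), len(validation_ids))  # 180 20
```

Because the split is random, diagnostics computed on the validation subset change from run to run unless a seed is fixed.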
The model can be constructed to predict either a categorical Variable to Predict (classification) or a continuous Variable to Predict (regression). When Treat Variable as Categorical is chosen, the model constructed is based on classification trees. When left unchecked, the Variable to Predict parameter is assumed to be continuous, and the model constructed is based on regression trees.
Explanatory Training Variables
One of the most common forms of explanatory variable used to train a forest model is a field in the training dataset, the same dataset that contains the Variable to Predict parameter. These fields can be continuous or categorical, regardless of whether you choose to predict a continuous variable or a categorical variable. If the trained model is also being used to predict, each of the provided Explanatory Training Variables must be available for both the training dataset and the prediction dataset.
Explanatory Training Distance Features
Although Forest-based Classification and Regression is not a spatial machine learning tool, one way to leverage the power of space in your analysis is using distance features. If you were modeling the performance of a series of retail stores, a variable representing the distance to highway on-ramps or the distance to the closest competitor could be critical to producing accurate predictions. Similarly, if modeling air quality, an explanatory variable representing distance to major sources of pollution or distance to major roadways would be crucial. Distance features are used to automatically create explanatory variables by calculating a distance from the provided features to the Input Training Features. Distances will be calculated from each of the input Explanatory Training Distance Features to the nearest Input Training Features. If the input Explanatory Training Distance Features are polygons or lines, the distance attributes are calculated as the distance between the closest segments of the pair of features. However, distances are calculated differently for polygons and lines. See How proximity tools calculate distance for details.
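For point data, the idea of a distance-based explanatory variable can be sketched as follows. This is an illustration only: it assumes planar (projected) coordinates and point distance features, the names `nearest_distance`, `stores`, and `on_ramps` are hypothetical, and it does not reproduce how the tool handles lines or polygons.

```python
import math


def nearest_distance(point, distance_features):
    """Straight-line distance from one training point to the closest
    distance feature (all coordinates assumed to be planar x, y pairs)."""
    px, py = point
    return min(math.hypot(px - fx, py - fy) for fx, fy in distance_features)


stores = [(0.0, 0.0), (6.0, 4.0)]    # hypothetical training features
on_ramps = [(0.0, 5.0), (6.0, 8.0)]  # hypothetical distance features

# One new explanatory value per training feature.
dist_to_ramp = [nearest_distance(store, on_ramps) for store in stores]
print(dist_to_ramp)  # [5.0, 4.0]
```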
Explanatory Training Rasters
Explanatory Training Rasters can also be used to train the model, which exposes a wealth of data sources including imagery, DEMs, population density models, or environmental measurements. The Explanatory Training Rasters parameter is only available if you have a Spatial Analyst license. If your Input Training Features are points, the tool drills down to extract explanatory variables at each point location. For multiband rasters, only the first band is used. For mosaic datasets, use the Make Mosaic Layer tool first.
Each of the Explanatory Training Rasters can be either continuous or categorical, regardless of whether you choose to predict a continuous variable or a categorical variable.
If your Input Training Features are polygons, the Variable to Predict is categorical, and you're using Explanatory Training Rasters, there is an option to Convert Polygons to Raster Resolution for Training. If this option is checked, the polygon is divided into points at the centroid of each raster cell whose centroid falls within the polygon and the polygon is treated as a point dataset. The raster values at each point location are then extracted and used to train the model. The model is no longer trained on the polygon, but rather the model is trained on the raster values extracted for each cell centroid. A bilinear sampling method is used for numeric variables, and the nearest method is used for categorical variables. The default cell size of the converted polygons will be the maximum cell size of input rasters. However, this can be changed using the Cell Size environment setting. If not checked, one raster value for each polygon will be used in the model. Each polygon is assigned the average value for continuous rasters and the majority for categorical rasters.
Predicting using a forest-based model
It is best practice to start with the Train only option, evaluate the results of the analysis, adjust the variables included and the advanced parameters as necessary, and once a good model is found, rerun the tool to predict to either features or raster. When moving on to prediction, it is best practice to change the Training Data Excluded for Validation (%) parameter to 0% so that you can include all available training data in the final model used to predict. You can make predictions in the following ways:
Predicting in the same study area
When predicting to features in the same study area, each prediction feature must have all of the associated explanatory variables (fields), as well as overlapping extents with the Explanatory Training Distance Features and Explanatory Training Rasters.
When predicting to a raster in the same study area using the provided Explanatory Training Rasters, the prediction will be the overlapping extent of all explanatory rasters.
Predicting in a different study area
When predicting to features in a different study area, each prediction feature must have all of the associated explanatory variables (fields), and new explanatory distance features and explanatory rasters must be matched to their corresponding Explanatory Training Distance Features and rasters. These new distance features and rasters must be available for the new study area and correspond to the Explanatory Training Distance Features and Explanatory Training Rasters. For instance, if a categorical raster is used to train the model, the corresponding prediction explanatory raster cannot have different categories or a dramatically different range of values.
When predicting to a raster in a different study area, new explanatory prediction rasters must be provided and matched to their corresponding Explanatory Training Rasters. The corresponding prediction explanatory raster cannot have different categories or a dramatically different range of values. The resulting Output Prediction Raster will be the overlapping extent of all provided explanatory prediction rasters.
Predicting to a different time period by matching the explanatory variables used for training to variables with projections into the future
When predicting to a future time period, whether predicting to features or a raster, each projected explanatory prediction variable (fields, distance features, and rasters) needs to be matched to the corresponding explanatory training variables.
Predicting to features
A model that has been trained using any combination of Explanatory Training Variables, Explanatory Training Distance Features, and Explanatory Training Rasters can be used to predict to either points or polygons in either the same or a different study area. Predicting to features requires that every feature receiving a prediction have values for each field, distance feature, and raster provided.
If the Input Training Features and Input Prediction Features field names do not match, a variable matching parameter is provided. When matching explanatory variables, the Prediction and Training fields must be of the same type (a double field in Training must be matched to a double field in Prediction).
If you want to use distance features or rasters other than those used for training the model because you are predicting either in a different study area or a different time period, Match Distance Features and Match Explanatory Rasters parameters are provided.
Predicting to rasters
Using a model that has been trained using only Explanatory Training Rasters, you can choose to predict to a raster in either the same or a different study area. If you want to use prediction rasters other than those used for training the model because you are predicting either in a different study area or a different time period, a Match Explanatory Rasters parameter is provided. An Output Prediction Raster can be created with a Spatial Analyst license by choosing Predict to raster for Prediction Type.
Output messages and diagnostics
This tool also creates messages and charts to help you understand the performance of the model. You can access the messages by hovering over the progress bar, clicking the pop-out button, or expanding the messages section in the Geoprocessing pane. You can also access the messages for a previous run of the Forest-based Classification and Regression tool via the Geoprocessing history. The messages include information on the characteristics of your model, OOB errors, variable importance, and validation diagnostics.
The Forest Characteristics table contains information on a number of important aspects of your forest model, some of which are chosen through parameters in Advanced Forest Options, and some of which are data driven. Data-driven forest characteristics can be important to understand when optimizing the performance of the model. The Tree Depth Range shows the minimum tree depth found in the forest and maximum tree depth found in the forest (the maximum is set as a parameter, but any depth below that maximum is possible). The Mean Tree Depth reports the average depth of trees in the forest. If a maximum depth was set to 100, but the range and mean depth indicate that a much smaller depth is being used most of the time, a smaller maximum depth parameter may improve the performance of the model as it decreases the chances of overfitting the model to the training data. The Number of Randomly Sampled Variables option reports the number of randomly selected variables that are used for any given tree in the forest. Each tree will have a different combination of variables, but each will use exactly that number. The number chosen by default is based on a combination of the number of features and the number of variables available. For regression, it is one-third of the total number of explanatory variables (including features, rasters, and distance features). For classification, it is the square-root of the total number of variables.
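The default described above can be written as a small helper. The function name is hypothetical, and rounding down with a floor of one variable is an assumption, since the text does not say how fractional results are handled.

```python
import math


def default_sampled_variables(n_explanatory, categorical):
    """Default number of randomly sampled variables per the rule above.

    Classification: square root of the number of explanatory variables.
    Regression: one-third of the number of explanatory variables.
    Rounding down (and never below 1) is an assumption.
    """
    if categorical:
        return max(1, math.isqrt(n_explanatory))
    return max(1, n_explanatory // 3)


# 12 explanatory variables (fields + distance features + rasters)
print(default_sampled_variables(12, categorical=True))   # 3
print(default_sampled_variables(12, categorical=False))  # 4
```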
In addition to basic characteristics of the forest, OOB errors are provided to help evaluate the accuracy of the model. Both Mean Squared Error (MSE) and percentage of variation explained are based on the ability of the model to accurately predict the Variable to Predict value based on the observed values in the training dataset. OOB is a prediction error calculated for each feature in the training dataset using only the trees that did not see that feature during training. If you want to train a model on 100 percent of your data, you will be relying on OOB to assess the accuracy of your model. These errors are reported for half the number of trees and the total number of trees to help evaluate whether increasing the number of trees improves the performance of the model. If the errors and percentage of variation explained are similar for both numbers of trees, it is an indicator that a smaller number of trees can be used with minimal impact on model performance. However, it is best practice to use as many trees as your machine allows. A higher number of trees in the forest will result in more stable results and a model that is less prone to noise in the data and sampling scheme.
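To make the OOB idea concrete, here is a minimal sketch that bags one-split regression trees (stumps) on synthetic one-dimensional data and scores each observation only with the trees whose bootstrap sample did not contain it. The stump learner, the data, and every function name are illustrative assumptions; the tool's trees are far richer.

```python
import random


def average(values):
    return sum(values) / len(values)


def fit_stump(xs, ys):
    """Fit a one-split regression tree on 1-D data; return a predictor."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = average(left), average(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x identical: fall back to the mean
        overall = average(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def oob_mse(xs, ys, n_trees, seed=0):
    """Mean squared OOB error of a bagged ensemble of stumps."""
    rng = random.Random(seed)
    n = len(xs)
    oob_predictions = [[] for _ in range(n)]
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        predict = fit_stump([xs[i] for i in sample], [ys[i] for i in sample])
        in_bag = set(sample)
        for i in range(n):
            if i not in in_bag:  # this tree never saw observation i
                oob_predictions[i].append(predict(xs[i]))
    errors = [(average(preds) - ys[i]) ** 2
              for i, preds in enumerate(oob_predictions) if preds]
    return average(errors)


noise = random.Random(1)
xs = [i / 10 for i in range(60)]
ys = [2.0 * x + noise.gauss(0, 0.5) for x in xs]

# Same seed, so the first 50 trees are shared by both runs.
print("OOB MSE, 50 trees: ", oob_mse(xs, ys, 50))
print("OOB MSE, 100 trees:", oob_mse(xs, ys, 100))
```

If the two numbers are close, adding trees beyond the halfway point changed little for this dataset, which mirrors the half versus total comparison described above.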
Another major factor in the performance of the model is the explanatory variables used. The Top Variable Importance table lists the variables with the top 20 importance scores. Importance is calculated using Gini coefficients, which can be thought of as the number of times a variable is responsible for a split and the impact of that split divided by the number of trees. Splits are each individual decision within a decision tree. Variable importance can be used to create a more parsimonious model that contains variables that are detected to be meaningful.
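The impact of a single split can be illustrated with Gini impurity. The sketch below assumes the importance described above corresponds to the impurity-decrease measure commonly used with random forests; the function names are hypothetical, and the tool's exact weighting is not specified here.

```python
from collections import Counter


def gini_impurity(labels):
    """1 minus the sum of squared class proportions (0 means pure)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """How much a split reduces impurity, weighting children by size."""
    n = len(parent)
    weighted_children = ((len(left) / n) * gini_impurity(left)
                         + (len(right) / n) * gini_impurity(right))
    return gini_impurity(parent) - weighted_children


parent = ["Land"] * 4 + ["Water"] * 4
print(gini_impurity(parent))                                     # 0.5
print(impurity_decrease(parent, ["Land"] * 4, ["Water"] * 4))    # 0.5
```

A variable that repeatedly produces large decreases like this across many trees would accumulate a high importance score.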
An optional bar chart displaying the importance of each variable used in the model will be created if Output Diagnostic Table is specified and can be accessed by selecting the List By Charts tab in the Contents pane. The chart displays the variables used in the model on the y-axis and their importance based on the Gini coefficient on the x-axis.
Another important way to evaluate the performance of the model is to use the model to predict values for features that were not included in the training of the model. By default, this test dataset is 10 percent of the Input Training Features, and can be controlled using the Training Data Excluded for Validation (%) parameter. One disadvantage of OOB is that it uses a subset of the forest (trees that have not used a specific feature from the training dataset) as opposed to using the entire forest. By excluding some data for validation, error metrics can be assessed for the entire forest.
When predicting a continuous variable, the observed value for each of the test features is compared to the predictions for those features based on the trained model, and an associated R-Squared, p-value, and Standard Error are reported. These diagnostics will change each time you run through the training process because the selection of the test dataset is random. To create a model that does not change with every run, you can set a seed in the Random Number Generator environment setting.
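One common way to compare observed and predicted values is the coefficient of determination, sketched below. Whether the tool computes R-Squared exactly this way (rather than, for example, from a linear fit between observed and predicted values) is an assumption; the function name and sample values are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    observed_mean = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - observed_mean) ** 2 for o in observed)
    return 1.0 - ss_residual / ss_total


observed = [3.0, 5.0, 7.0, 9.0]    # held-out test features
predicted = [2.5, 5.5, 6.5, 9.5]   # model predictions for the same features
print(round(r_squared(observed, predicted), 2))  # 0.95
```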
When predicting a categorical variable, both sensitivity and accuracy are reported in the messages window. These diagnostics are calculated using a confusion matrix, which keeps track of each instance that the category of interest is correctly and incorrectly classified and when other categories are misclassified as the category of interest. Sensitivity for each category is reported as the percentage of times features with an observed category were correctly predicted with that category. For instance, if you are predicting Land and Water, and Land has a sensitivity of 1.00, every feature that should have been marked Land was correctly predicted. If, however, a Water feature was incorrectly marked Land, it would not be reflected in the sensitivity number for Land. This would be reflected in Water's sensitivity number, however, as it would mean one of the water features was not correctly predicted.
The accuracy diagnostic takes into account both how well features with a particular category are predicted and how often other categories are miscategorized for the category of interest. It gives an idea about how frequently a category is identified correctly among the total number of "confusions" for that category. When classifying a variable with only two classes, the accuracy measure will be the same for each class, but the sensitivity can differ. When classifying a variable with more than two classes, both sensitivity and accuracy can differ between classes.
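The Land and Water example can be worked through with a small sketch. It assumes per-category accuracy is (true positives + true negatives) divided by all test features, which is consistent with the two-class behavior described above but is not confirmed as the tool's exact formula; the function name is hypothetical.

```python
def class_diagnostics(observed, predicted, category):
    """Return (sensitivity, accuracy) for one category of interest."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == category and p == category)
    fn = sum(1 for o, p in pairs if o == category and p != category)
    tn = sum(1 for o, p in pairs if o != category and p != category)
    sensitivity = tp / (tp + fn)
    accuracy = (tp + tn) / len(pairs)
    return sensitivity, accuracy


observed = ["Land", "Land", "Land", "Water", "Water"]
predicted = ["Land", "Land", "Land", "Land", "Water"]  # one Water marked Land

print(class_diagnostics(observed, predicted, "Land"))   # (1.0, 0.8)
print(class_diagnostics(observed, predicted, "Water"))  # (0.5, 0.8)
```

The misclassified Water feature leaves Land's sensitivity at 1.0 but lowers Water's sensitivity, while the accuracy is identical for both classes because there are only two of them.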
The same diagnostics are also provided to compare predicted values to observed values for the training dataset. These diagnostics can help you understand how well the model fits the training data.
The explanatory range diagnostics can help you evaluate whether the values used for training, validation, and prediction are sufficient to produce a good model and allow you to trust other model diagnostics. The data used to train a random forest model has a large impact on the quality of the resulting classification and predictions. Ideally, the training data should be representative of the data you are modeling. By default, 10 percent of the training data is randomly excluded, resulting in a training subset and a validation subset of the Input Training Features. The Explanatory Variable Range Diagnostics table shows the minimum and maximum values of these subsets and, if predicting to features or rasters, for the data used for prediction.
Due to the random nature of how subsets are determined, the values of the variables in the training subset may not be representative of the overall values in the Input Training Features. For each continuous explanatory variable, the Training Share column indicates the percentage of overlap between the values of the training subset and the values of all the features in the Input Training Features. For example, if variable A from the Input Training Features had the values 1 through 100 and the training subset had the values 50 through 100, the Training Share for Variable A would be 0.50 or 50 percent. For variable A, 50 percent of the range of values of the Input Training Features are covered by the training subset. If the training subset does not cover a wide range of the values found in the Input Training Features for each explanatory variable in the model, other model diagnostics may be biased. A similar calculation is performed to produce the Validation Share diagnostic. It is important that the values used to validate the model cover as much of the range of the values used to train the model as possible. For example, if variable B from the training subset had the values 1 through 100 and the validation subset had the values 1 through 10, the Validation Share for Variable B would be 0.10 or 10 percent. This small range of values might all be low values or all high values and would thus bias other diagnostics. If the validation subset held all low values, other model diagnostics such as MSE and % of variation explained would actually be reporting how well the model predicts to low values, not the complete range of values found in the Input Training Features.
The Prediction Share diagnostic is particularly important. Forest-based models do not extrapolate; they can only classify or predict to a value on which the model was trained. Prediction Share is the percentage of overlap between the values of the training data and the prediction data. Values less than 1 indicate that you are attempting to predict to a value on which the model was not trained. A value of 1 indicates that the range of values in the training subset and the range of values being used for prediction are equivalent. A value greater than 1 indicates that the range of the values in the training subset is greater than the range of the values used for prediction.
All three share diagnostics are only valid if the ranges of the subsets are coincident. For example, if the validation subset for variable C had the values 1 through 100 and the training subset had the values 90 through 200, they would overlap by 10 percent, but they do not have coincident ranges. In this case, the diagnostic is marked with an asterisk to show noncoincident ranges. Examine the minimum and maximum values to see the extent and direction of nonoverlap. Prediction Share is marked with a plus sign (+) if the model is attempting to predict outside the range of the training data.
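The Training Share and Validation Share examples for variables A, B, and C can be approximated with the sketch below. The exact formula used by the tool is not given in the text, so this assumes the share is the length of the overlapping interval divided by the length of the reference range, and it treats "coincident" as meaning the subset range lies entirely inside the reference range. The function name is hypothetical, and Prediction Share is deliberately not modeled.

```python
def range_share(subset_min, subset_max, reference_min, reference_max):
    """Return (share, coincident) for one continuous variable.

    share: overlap length / reference range length (assumed formula).
    coincident: True when the subset range sits inside the reference range.
    """
    overlap = min(subset_max, reference_max) - max(subset_min, reference_min)
    share = max(0.0, overlap) / (reference_max - reference_min)
    coincident = reference_min <= subset_min and subset_max <= reference_max
    return share, coincident


# Variable A: training subset 50-100 within all training features 1-100.
print(range_share(50, 100, 1, 100))   # share is about 0.5, coincident
# Variable B: validation subset 1-10 against training subset 1-100.
print(range_share(1, 10, 1, 100))     # share is about 0.1, coincident
# Variable C: validation subset 1-100 against training subset 90-200.
print(range_share(1, 100, 90, 200))   # share is about 0.1, not coincident
```

Under these assumptions the results land close to, but not exactly on, the rounded percentages quoted in the text.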
There are no absolute rules for acceptable values for the Explanatory Variable Range Diagnostics table. Training Share and Validation Share should be as high as possible, given the constraints of your training data. Prediction Share should not be less than 1. If the Validation Share diagnostic is low, consider increasing the value of the Training Data Excluded for Validation (%) parameter. Also, consider running the model multiple times and choose the run that balances the best values of the range diagnostics. The random seed used in each run is reported in the messages.
This tool also produces a variety of different outputs. Output Trained Features will contain all of the Input Training Features and Explanatory Training Variables used in the model as well as new fields for the extracted raster values for each Explanatory Training Rasters and calculated distance values for each Explanatory Training Distance Features. These new fields can be used to rerun the training portion of the analysis without extracting raster values and calculating distance values each time. Output Trained Features will also contain predictions for all of the features, those used in training and those excluded for testing, which can be helpful in assessing the performance of the model. The trained_features field in Output Trained Features will have a zero value for all test data (indicating it was not used in training) and a 1 value for all training data. When using this tool for prediction, the tool will produce either a new feature class containing the Output Predicted Features or a new Output Prediction Surface if explanatory rasters are provided.
Advanced forest options
The strength of the forest-based method is in capturing commonalities of weak predictors (or trees) and combining them to create a powerful predictor (the forest). If a relationship is persistently captured by singular trees, it means that there is a strong relationship in the data that can be detected even when the model is not complex. Adjusting the forest parameters can help create a high number of weak predictors resulting in a powerful model. You can create weak predictors by using less information in each tree. This can be accomplished by any combination of using a small subset of the features per tree, a small number of variables per tree, and a low tree depth. The number of trees controls how many of these weak predictors are created; and the weaker your predictors (trees), the more trees you need to create a strong model.
The following advanced training options are available in the tool:
- The default value for Number of Trees is 100. Increasing the number of trees in the forest model will generally result in more accurate model prediction, but the model will take longer to calculate.
- Minimum Leaf Size is the minimum number of observations required to keep a leaf (that is, the terminal node on a tree without further splits). The default minimum for regression is 5, and the default for classification is 1. For very large datasets, increasing these numbers will decrease the run time of the tool. For very small leaf sizes (close to the minimums defined), your forest will be prone to noise in your data. For a more stable model, experiment with increasing Minimum Leaf Size.
- Maximum Tree Depth is the maximum number of splits that will be made down a tree. When using a large maximum depth, more splits will be created, which may increase the chance of overfitting the model. The default is data driven and depends on the number of trees created and the number of variables included. Note that a node cannot be split after it reaches the Minimum Leaf Size value. If both Minimum Leaf Size and Maximum Tree Depth are set, Minimum Leaf Size will dominate in the determination of the depth of trees.
- The Data Available per Tree (%) parameter specifies the percentage of Input Training Features used for each decision tree. The default is 100 percent of the data. Each decision tree in the forest is created using a random subset (approximately two-thirds) of the training data available. Using a lower percentage of the input data for each decision tree increases the speed of the tool for very large datasets.
- The Number of Randomly Sampled Variables parameter specifies the number of explanatory variables used to create each decision tree. Each decision tree in the forest is created using a random subset of the explanatory variables specified. Increasing the number of variables used in each decision tree will increase the chance of overfitting your model, particularly if there are one or more dominant variables. A common practice (and the default used by the tool) is to use the square root of the total number of explanatory variables (fields, distance features, and rasters) if the Variable to Predict is categorical, or divide the total number of explanatory variables (fields, distance features, and rasters) by 3 if the Variable to Predict is numeric.
- The Training Data Excluded for Validation (%) parameter specifies the percentage (between 10 percent and 50 percent) of the Input Training Features to reserve as the test dataset for validation. The model will be trained without this random subset of data, and the observed values for those features will be compared to the predicted values to validate model performance. The default is 10 percent.
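The interaction between Minimum Leaf Size and Maximum Tree Depth can be explored with a toy one-dimensional regression tree. Everything in this sketch is illustrative: the data are synthetic, the names `build_tree`, `predict`, and `tree_depth` are hypothetical, and a split is assumed to be allowed only when both children keep at least the minimum number of observations.

```python
def average(values):
    return sum(values) / len(values)


def build_tree(xs, ys, min_leaf_size=5, max_depth=None, depth=0):
    """Grow a 1-D regression tree as nested dicts; leaves hold a mean."""
    node = {"value": average(ys)}
    if max_depth is not None and depth >= max_depth:
        return node
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [(x, y) for x, y in zip(xs, ys) if x < threshold]
        right = [(x, y) for x, y in zip(xs, ys) if x >= threshold]
        if len(left) < min_leaf_size or len(right) < min_leaf_size:
            continue  # split would create a leaf that is too small
        left_mean = average([y for _, y in left])
        right_mean = average([y for _, y in right])
        sse = (sum((y - left_mean) ** 2 for _, y in left)
               + sum((y - right_mean) ** 2 for _, y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left, right)
    if best is None:
        return node  # no allowed split: this node stays a leaf
    _, threshold, left, right = best
    node["threshold"] = threshold
    node["left"] = build_tree([x for x, _ in left], [y for _, y in left],
                              min_leaf_size, max_depth, depth + 1)
    node["right"] = build_tree([x for x, _ in right], [y for _, y in right],
                               min_leaf_size, max_depth, depth + 1)
    return node


def predict(node, x):
    while "threshold" in node:
        node = node["left"] if x < node["threshold"] else node["right"]
    return node["value"]


def tree_depth(node):
    if "threshold" not in node:
        return 0
    return 1 + max(tree_depth(node["left"]), tree_depth(node["right"]))


xs = [float(i) for i in range(64)]
ys = [2.0 * x + 1.0 for x in xs]

deep = build_tree(xs, ys, min_leaf_size=1)
stable = build_tree(xs, ys, min_leaf_size=8)
capped = build_tree(xs, ys, min_leaf_size=1, max_depth=2)
print(tree_depth(deep), tree_depth(stable), tree_depth(capped))
print(predict(deep, 1000.0), max(ys))  # prediction cannot exceed max(ys)
```

Raising the minimum leaf size stops splitting earlier even when no depth limit is set, and a query far outside the training range still returns a leaf mean, so the prediction stays within the range of observed values.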
The following are best practices when using the Forest-based Classification and Regression tool:
- The forest model should be trained on at least several hundred features for best results and is not an appropriate tool for very small datasets.
- This tool may perform poorly when trying to predict with explanatory variables that are out of range of the explanatory variables used to train the model. Forest-based models do not extrapolate; they can only classify or predict to the value range on which the model was trained. When predicting a value based on explanatory variables much higher or lower than the range of the original training dataset, the model will estimate the value to be around the highest or lowest value in the original dataset.
- To improve performance when extracting values from Explanatory Training Rasters and calculating distances using Explanatory Training Distance Features, consider training the model on 100 percent of the data without excluding data for validation, and choose to create Output Trained Features. The next time you run the tool, use the Output Trained Features as your Input Training Features and use all of the extracted values and distances as Explanatory Training Variables instead of extracting them each time you train the model. If you choose to do this, set the Number of Trees, Maximum Tree Depth, and Number of Randomly Sampled Variables to 1 to create a very small dummy tree to quickly prepare your data for analysis.
- Although the tool defaults the Number of Trees parameter to 100, this number is not data driven. The number of trees needed increases with the complexity of relationships between the explanatory variables, size of the dataset, and the Variable to Predict, in addition to the variation of these variables.
- Increase the Number of Trees value and keep track of the OOB or classification error. It is recommended that you increase the Number of Trees value at least 3 times, up to at least 500 trees, to best evaluate model performance.
- Tool execution time is highly sensitive to the number of variables used per tree. Using a small number of variables per tree decreases chances of over-fitting; however, be sure to use many trees if using a small number of variables per tree to improve model performance.
- To create a model that does not change in every run, a seed can be set in the Random Number Generator environment setting. There will still be randomness in the model, but that randomness will be consistent between runs.
References
Breiman, Leo. (1996). "Out-Of-Bag Estimation." Abstract.
Breiman, L. (1996). "Bagging predictors." Machine learning 24 (2): 123–140.
Breiman, Leo. (2001). "Random Forests." Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.
Breiman, L., J.H. Friedman, R.A. Olshen, and C.J. Stone. (2017). Classification and regression trees. New York: Routledge. Chapter 4.
Dietterich, T. G. (2000, June). "Ensemble methods in machine learning." In International workshop on multiple classifier systems, 1–15. Springer, Berlin, Heidelberg.
Gini, C. (1912, 1955). Variabilità e mutabilità. Reprinted in Memorie di metodologica statistica (eds. E. Pizetti and T. Salvemini). Rome: Libreria Eredi Virgilio Veschi.
Grömping, U. (2009). "Variable importance assessment in regression: linear regression versus random forest." The American Statistician 63 (4): 308–319.
Ho, T. K. (1995, August). "Random decision forests." In Document analysis and recognition, 1995., proceedings of the third international conference on Document Analysis and Recognition Vol. 1: 278–282. IEEE.
James, G., D. Witten, T. Hastie, and R. Tibshirani. (2013). An introduction to statistical learning Vol. 112. New York: Springer.
LeBlanc, M. and R. Tibshirani. (1996). "Combining estimates in regression and classification." Journal of the American Statistical Association 91 (436): 1641–1650.
Loh, W. Y. and Y. S. Shih. (1997). "Split selection methods for classification trees." Statistica sinica, 815–840.
Nadeau, C. and Y. Bengio. (2000). "Inference for the generalization error." In Advances in neural information processing systems, 307–313.
Strobl, C., A. L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis. (2008). "Conditional variable importance for random forests." BMC bioinformatics 9 (1): 307.
Zhou, Z. H. (2012). Ensemble methods: foundations and algorithms. CRC press.