The Forest-based Classification and Regression tool trains a model based on known values provided as part of a training dataset. This prediction model can then be used to predict unknown values in a prediction dataset that has the same associated explanatory variables. The tool creates models and generates predictions using an adaptation of Leo Breiman's random forest algorithm, a supervised machine learning method. The tool creates many decision trees, called an ensemble or a forest, that are then used for prediction. Each tree generates its own prediction and contributes a vote to the final prediction, so the final prediction is based on the entire forest rather than any single tree. Using the entire forest rather than an individual tree helps avoid overfitting the model to the training dataset, as does using a random subset of the training data and a random subset of the explanatory variables in each tree that makes up the forest.
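The voting scheme described above can be sketched in a few lines. This is a conceptual illustration only, not the tool's implementation; the stand-in "trees" are hypothetical fixed classifiers rather than trees trained on data.

```python
# Conceptual sketch of forest voting (not the tool's implementation).
# In a real forest, each tree is trained on a random subset of the data
# and a random subset of the variables; here each "tree" is a stand-in
# that simply returns a class label.

def predict_with_forest(trees, x):
    """Combine individual tree predictions by majority vote."""
    votes = [tree(x) for tree in trees]
    # The final prediction comes from the whole forest, not any single tree.
    return max(set(votes), key=votes.count)

# Three toy "trees" that disagree on the threshold between class "A" and "B".
trees = [
    lambda x: "A" if x < 4 else "B",
    lambda x: "A" if x < 5 else "B",
    lambda x: "A" if x < 6 else "B",
]

print(predict_with_forest(trees, 4.5))  # two of three trees vote "A"
```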
The following are potential applications for this tool:
- Given data on the occurrence of seagrass, along with environmental explanatory variables represented as both attributes and rasters, and distances to upstream factories and major ports, future seagrass occurrence can be predicted based on future projections of those same environmental explanatory variables.
- Suppose you have crop yield data at hundreds of farms across the country along with other attributes at each of those farms (number of employees, acreage, and so on), as well as a number of rasters that represent the slope, elevation, rainfall, and temperature at each farm. Using these pieces of data, you can provide a set of features representing farms where you don't have crop yield (but you do have all of the other variables), and make a prediction about crop yield.
- Housing values can be predicted based on the prices of houses that have been sold in the current year. The sale price of homes sold along with information about the number of bedrooms, distance to schools, proximity to major highways, average income, and crime counts can be used to predict sale prices of similar homes.
- Land use types can be classified using training data and a combination of raster layers, including multiple individual bands, and products such as NDVI.
- Given information on the blood lead levels of children and the tax parcel ID of their homes, combined with parcel-level attributes such as age of home, census-level data such as income and education levels, and national datasets reflecting toxic release of lead and lead compounds, the risk of lead exposure for parcels without blood lead level data can be predicted. These risk predictions could drive policies and education programs in the area.
Training a model
The first step in using the Forest-based Classification and Regression tool is training a model for prediction. Training builds a forest that establishes a relationship between explanatory variables and the Variable to Predict. Whether you choose the Train only option or train and predict, the tool begins by constructing a model based on the Variable to Predict parameter and any combination of the Explanatory Training Variables, Explanatory Training Distance Features (available with an Advanced license), and Explanatory Training Rasters (available with a Spatial Analyst license) parameters. The tool then evaluates the performance of that model and provides additional diagnostics.
By default, 10 percent of the training data is excluded from training for validation purposes. After the model is trained, it is used to predict the values for the test data, and those predicted values are compared to the observed values to provide a measure of prediction accuracy based on data that was not included in the training process. Additional diagnostics about the model, including forest characteristics, Out of Bag errors, and a summary of variable importance, are also included. These outputs are described in more detail below.
The model can be constructed to predict either a categorical Variable to Predict (classification) or a continuous Variable to Predict (regression). When Treat Variable as Categorical is checked, the model constructed is based on classification trees. When left unchecked, the Variable to Predict parameter is assumed to be continuous and the model constructed is based on regression trees.
Explanatory Training Variables
One of the most common forms of explanatory variables used to train a forest model is fields in the training dataset that also contain the Variable to Predict. These fields can be continuous or categorical. Regardless of whether you choose to predict a continuous variable or a categorical variable, each of the Explanatory Training Variables can be either continuous or categorical. If the trained model is also being used to predict, each of the provided Explanatory Training Variables must be available for both the training dataset and the prediction dataset.
Explanatory Distance Features
Although Forest-based Classification and Regression is not a spatial machine learning tool, one way to leverage the power of space in your analysis is to use distance features. If you were modeling the performance of a series of retail stores, a variable representing the distance to highway on-ramps or the distance to the closest competitor could be critical to producing accurate predictions. Similarly, if modeling air quality, an explanatory variable representing the distance to major sources of pollution or to major roadways would be crucial. Distance features are used to automatically create explanatory variables by calculating a distance from the provided features to the Input Training Features. Distances will be calculated from each of the input Explanatory Training Distance Features to the nearest Input Training Features. If the input Explanatory Training Distance Features are polygons or lines, the distance attributes are calculated as the distance between the closest segments of the pair of features. However, distances are calculated differently for polygons and lines. See How proximity tools calculate distance for details.
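As a rough illustration of how a distance feature becomes an explanatory variable, the sketch below computes the nearest distance from each training point to a hypothetical set of highway on-ramp points. It handles points only; the tool also supports line and polygon distance features using closest-segment distances.

```python
import math

# Minimal sketch of deriving an explanatory distance variable: for each
# training point, take the distance to the nearest feature in a provided
# distance-feature set. The coordinates below are hypothetical.

def nearest_distance(point, features):
    """Distance from one training point to its nearest feature point."""
    return min(math.hypot(point[0] - fx, point[1] - fy) for fx, fy in features)

training_points = [(0.0, 0.0), (10.0, 0.0)]
highway_ramps = [(3.0, 4.0), (9.0, 0.0)]

# One new explanatory variable per distance-feature input.
distance_variable = [nearest_distance(p, highway_ramps) for p in training_points]
print(distance_variable)  # [5.0, 1.0]
```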
Explanatory Training Rasters
Explanatory Training Rasters can also be used to train the model, which exposes a wealth of data sources including imagery, DEMs, population density models, and environmental measurements. The Explanatory Training Rasters parameter is only available if you have a Spatial Analyst license. If your Input Training Features are points, the tool drills down to extract explanatory variables at each point location. For multiband rasters, only the first band is used. For mosaic datasets, use the Make Mosaic Layer tool first.
These rasters can be continuous or categorical. Regardless of whether you choose to predict a continuous variable or a categorical variable, each of the Explanatory Training Rasters can be either continuous or categorical.
If your Input Training Features are polygons, the Variable to Predict is categorical, and you're using Explanatory Training Rasters, there is an option to Convert Polygons to Raster Resolution for Training. If this option is checked, the polygon is divided into a point at the centroid of each raster cell whose centroid falls within the polygon, and the polygon is treated as a point dataset. The raster values at each point location are then extracted and used to train the model. The model is no longer trained on the polygon, but rather on the raster values extracted for each cell centroid. A bilinear sampling method is used for numeric variables, and the nearest method is used for categorical variables. The default cell size of the converted polygons will be the maximum cell size of the input rasters. However, this can be changed using the Cell Size environment setting. If not checked, one raster value for each polygon will be used in the model. Each polygon is assigned the average value for continuous rasters and the majority for categorical rasters.
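The two ways a polygon can receive raster values can be sketched with hypothetical cell values. The values below stand in for the cells whose centroids fall inside one polygon; the extraction itself is done by the tool.

```python
from collections import Counter

# Hypothetical extracted values for the cells whose centroids fall
# inside one polygon.
continuous_cells = [12.0, 14.0, 16.0, 18.0]                  # e.g. elevation
categorical_cells = ["forest", "forest", "water", "forest"]  # e.g. land cover

# Convert Polygons to Raster Resolution unchecked: one value per polygon.
mean_value = sum(continuous_cells) / len(continuous_cells)        # average
majority_value = Counter(categorical_cells).most_common(1)[0][0]  # majority
print(mean_value, majority_value)  # 15.0 forest

# Checked: each cell centroid becomes its own training point and keeps
# its own extracted value (the per-cell lists above, used directly).
```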
Predicting using a forest-based model
It is best practice to start with the Train only option, evaluate the results of the analysis, and adjust the variables included and the advanced parameters as necessary. Once a good model is found, rerun the tool to predict to either features or a raster. When moving on to prediction, it is best practice to change the Training Data Excluded for Validation (%) parameter to 0% so that you can include all available training data in the final model used to predict. You can make predictions in the following ways:
Predicting in the same study area
When predicting to features in the same study area, each prediction feature must have all of the associated explanatory variables (fields), as well as overlapping extents with the Explanatory Training Distance Features and Explanatory Training Rasters.
When predicting to a raster in the same study area using the provided Explanatory Training Rasters, the prediction will be the overlapping extent of all explanatory rasters.
Predicting in a different study area
When predicting to features in a different study area, each prediction feature must have all of the associated explanatory variables (fields), and new explanatory distance features and explanatory rasters covering the new study area must be matched to their corresponding Explanatory Training Distance Features and Explanatory Training Rasters. For instance, if a categorical raster was used to train the model, the corresponding prediction explanatory raster cannot have different categories or a dramatically different range of values.
When predicting to a raster in a different study area, new explanatory prediction rasters must be provided and matched to their corresponding Explanatory Training Rasters. The corresponding prediction explanatory raster cannot have different categories or a dramatically different range of values. The resulting Output Prediction Raster will be the overlapping extent of all provided explanatory prediction rasters.
Predicting to a different time period by matching the explanatory variables used for training to variables with projections into the future
When predicting to a future time period, whether predicting to features or a raster, each projected explanatory prediction variable (fields, distance features, and rasters) needs to be matched to the corresponding explanatory training variables.
Predicting to features
A model that has been trained using any combination of Explanatory Training Variables, Explanatory Training Distance Features, and Explanatory Training Rasters can be used to predict to either points or polygons in either the same or a different study area. Predicting to features requires that every feature receiving a prediction have values for each field, distance feature, and raster provided.
If the Input Training Features and Input Prediction Features field names do not match, a variable matching parameter is provided. When matching explanatory variables, the Prediction and Training fields must be of the same type (a double field in Training must be matched to a double field in Prediction).
If you want to use distance features or rasters other than those used for training the model because you are predicting either in a different study area or a different time period, Match Distance Features and Match Explanatory Rasters parameters are provided.
Predicting to rasters
Using a model that has been trained using only Explanatory Training Rasters, you can choose to predict to a raster in either the same or a different study area. If you want to use prediction rasters other than those used for training the model because you are predicting either in a different study area or a different time period, a Match Explanatory Rasters parameter is provided. An Output Prediction Raster can be created with a Spatial Analyst license by choosing Predict to raster for Prediction Type.
Output messages and diagnostics
This tool also creates messages and charts to help you understand the performance of the model. You can access the messages by hovering over the progress bar, clicking the pop-out button, or expanding the messages section in the Geoprocessing pane. You can also access the messages for a previous run of the Forest-based Classification and Regression tool via the Geoprocessing history. The messages include information on the characteristics of your model, Out of Bag errors, variable importance, and validation diagnostics.
The Forest Characteristics table contains information on a number of important aspects of your forest model, some of which are chosen through parameters in Advanced Forest Options, and some of which are data driven. Data-driven forest characteristics can be important to understand when optimizing the performance of the model. The Tree Depth Range shows the minimum and maximum tree depth found in the forest (the maximum is set as a parameter, but any depth below that maximum is possible). The Mean Tree Depth reports the average depth of trees in the forest. If the maximum depth was set to 100, but the range and mean depth indicate that a much smaller depth is being used most of the time, a smaller maximum depth parameter may improve the performance of the model, as it decreases the chances of overfitting the model to the training data. The Number of Randomly Sampled Variables option reports the number of randomly selected variables used for any given tree in the forest. Each tree will have a different combination of variables, but each will use exactly that number. The number chosen by default is based on a combination of the number of features and the number of variables available. For regression, it is one-third of the total number of explanatory variables (including fields, distance features, and rasters). For classification, it is the square root of the total number of variables.
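The default rule described above can be expressed as a small helper. This is an illustrative restatement of the stated rule, not the tool's internal code; `default_sampled_variables` is a hypothetical name.

```python
import math

# Default Number of Randomly Sampled Variables, per the rule above:
# one-third of the explanatory variables for regression, the square
# root for classification. Illustrative only.

def default_sampled_variables(n_variables, categorical):
    if categorical:                               # classification forest
        return max(1, round(math.sqrt(n_variables)))
    return max(1, n_variables // 3)               # regression forest

print(default_sampled_variables(12, categorical=True))   # 3
print(default_sampled_variables(12, categorical=False))  # 4
```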
In addition to basic characteristics of the forest, Out of Bag (OOB) errors are provided to help evaluate the accuracy of the model. Both the Mean Squared Error (MSE) and the percentage of variation explained are based on the ability of the model to accurately predict the Variable to Predict value based on the observed values in the training dataset. OOB error is a prediction error calculated for each observation in the training dataset using only the trees in the forest that did not see that observation during training. If you train a model on 100 percent of your data, you will rely on OOB to assess the accuracy of your model. These errors are reported for half the number of trees and for the total number of trees used, to help evaluate whether increasing the number of trees improves the performance of the model. If the errors and percentage of variation explained are similar for both numbers of trees, it is an indicator that a smaller number of trees could be used with minimal impact on model performance. However, it is best practice to use as many trees as your machine allows. A higher number of trees in the forest will result in more stable results and a model that is less prone to noise in the data and sampling scheme.
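A hand-rolled sketch of the OOB bookkeeping, assuming trivial mean-value "trees" so the mechanics stay visible: each tree trains on a bootstrap sample, and every observation is scored only by the trees whose sample excluded it.

```python
import random

# OOB sketch: the "trees" here are stand-in mean predictors, just to
# show the bookkeeping; real trees are far more expressive.

random.seed(0)
y = [2.0, 4.0, 6.0, 8.0, 10.0]
n, n_trees = len(y), 50

oob_predictions = [[] for _ in range(n)]
for _ in range(n_trees):
    in_bag = [random.randrange(n) for _ in range(n)]  # bootstrap sample
    leaf_value = sum(y[i] for i in in_bag) / n        # stand-in "tree"
    for i in range(n):
        if i not in in_bag:                           # i is out of bag
            oob_predictions[i].append(leaf_value)

# OOB error: compare each observation with the average prediction of
# the trees that excluded it.
oob_mse = sum(
    (y[i] - sum(p) / len(p)) ** 2 for i, p in enumerate(oob_predictions) if p
) / n
print(round(oob_mse, 2))
```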
Another major factor in the performance of the model is the explanatory variables used. The Top Variable Importance table lists the variables with the top 20 importance scores. Importance is calculated using Gini coefficients, which can be thought of as the number of times a variable is responsible for a split and the impact of that split divided by the number of trees. Splits are each individual decision within a decision tree. Variable importance can be used to create a more parsimonious model that contains variables that are detected to be meaningful.
An optional bar chart displaying the importance of each variable used in the model will be created if Output Diagnostic Table is specified and can be accessed by selecting the List By Charts tab in the Contents pane. The chart displays the variables used in the model on the y-axis and their importance based on the Gini coefficient on the x-axis.
Another important way to evaluate the performance of the model is to use the model to predict values for features that were not included in the training of the model. By default, this test dataset is 10 percent of the Input Training Features, and it can be controlled using the Training Data Excluded for Validation (%) parameter. One disadvantage of OOB is that it uses a subset of the forest (the trees that have not used a specific feature from the training dataset) as opposed to using the entire forest. By excluding some data for validation, error metrics can be assessed for the entire forest.
When predicting a continuous variable, the observed values for each of the test features are compared to the predictions for those features based on the trained model, and an associated R-Squared, p-value, and Standard Error are reported. These diagnostics will change each time you run the training process because the selection of the test dataset is random. To create a model that does not change with every run, you can set a seed in the Random Number Generator environment setting.
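The effect of the random test split, and of fixing a seed, can be illustrated with a hypothetical `validation_split` helper:

```python
import random

# Sketch of why validation diagnostics change between runs: the test
# subset is drawn at random. Fixing a seed (as with the Random Number
# Generator environment setting) makes the split, and therefore the
# diagnostics, repeatable.

def validation_split(n_features, percent_excluded, seed=None):
    rng = random.Random(seed)
    n_test = max(1, n_features * percent_excluded // 100)
    return sorted(rng.sample(range(n_features), n_test))

# Same seed, same test set every run:
print(validation_split(100, 10, seed=42) == validation_split(100, 10, seed=42))  # True
# No seed: a different test set (and different diagnostics) each run.
```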
When predicting a categorical variable, both sensitivity and accuracy are reported in the messages window. These diagnostics are calculated using a confusion matrix, which keeps track of each instance in which the category of interest is correctly or incorrectly classified and each instance in which other categories are misclassified as the category of interest. Sensitivity for each category is reported as the percentage of times features with an observed category were correctly predicted with that category. For instance, if you are predicting Land and Water and Land has a sensitivity of 1.00, every feature that should have been marked Land was correctly predicted. If, however, a Water feature was incorrectly marked Land, this would not be reflected in the sensitivity number for Land. It would, however, be reflected in Water's sensitivity number, as it would mean one of the Water features was not correctly predicted.
The accuracy diagnostic takes into account both how well features with a particular category are predicted and how often other categories were miscategorized for the category of interest. It gives an idea about how frequently a category is identified correctly among the total number of "confusions" for that category. When classifying a variable with only two classes, the accuracy measure will be the same for each class, but the sensitivity can differ. When classifying a variable with more than two classes, both sensitivity and accuracy can differ between classes.
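The sensitivity and accuracy diagnostics can be sketched from a confusion matrix. The Land/Water counts below are hypothetical, and the per-class accuracy formula shown, (TP + TN) / total, is one formulation consistent with the behavior described above (identical for both classes in a two-class problem); the tool's exact computation may differ.

```python
# Rows = observed category, columns = predicted category.
categories = ["Land", "Water"]
confusion = [
    [40, 0],   # observed Land: 40 predicted Land, 0 predicted Water
    [5, 55],   # observed Water: 5 predicted Land, 55 predicted Water
]
total = sum(sum(row) for row in confusion)

diagnostics = {}
for i, cat in enumerate(categories):
    tp = confusion[i][i]                          # correctly classified
    fn = sum(confusion[i]) - tp                   # missed this category
    fp = sum(row[i] for row in confusion) - tp    # others misclassified as it
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn)
    accuracy = (tp + tn) / total  # same for both classes when only two exist
    diagnostics[cat] = (round(sensitivity, 2), round(accuracy, 2))

print(diagnostics)  # Land sensitivity 1.0, Water 0.92; accuracy 0.95 for both
```

Note how the five Water features marked Land leave Land's sensitivity at 1.00 but lower Water's, exactly as described above.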
The same diagnostics are also provided to compare predicted values to observed values for the training dataset. These diagnostics can help you understand how well the model fits the training data.
This tool produces a variety of outputs. Output Trained Features will contain all of the Input Training Features and the Explanatory Training Variables used in the model, as well as new fields for the extracted raster values for each Explanatory Training Raster and the calculated distance values for each Explanatory Training Distance Feature. These new fields can be used to rerun the training portion of the analysis without needing to extract raster values and calculate distance values each time. The Output Trained Features will also contain predictions for all of the features, both those used in training and those excluded for testing, which can be helpful in assessing the performance of the model. The trained_features field in the Output Trained Features will have a 0 for all test data (indicating it was not used in training) and a 1 for all training data. When using this tool for prediction, the tool will produce either a new feature class containing the Output Predicted Features or a new Output Prediction Surface if explanatory rasters are provided.
Advanced forest options
The strength of the forest-based method is in capturing commonalities of weak predictors (or trees) and combining them to create a powerful predictor (the forest). If a relationship is persistently captured by individual trees, it means that there is a strong relationship in the data that can be detected even when the model is not complex. Adjusting the forest parameters can help create a high number of weak predictors, resulting in a powerful model. You can create weak predictors by using less information in each tree. This can be accomplished by any combination of using a small subset of the features per tree, a small number of variables per tree, and a low tree depth. The number of trees controls how many of these weak predictors are created, and the weaker your predictors (trees), the more trees you need to create a strong model.
The following advanced training options are available in the tool:
- The default value for Number of Trees is 100. Increasing the number of trees in the forest model will generally result in more accurate model prediction, but the model will take longer to calculate.
- Minimum Leaf Size is the minimum number of observations required to keep a leaf (that is, the terminal node on a tree without further splits). The default minimum for regression is 5, and the default for classification is 1. For very large datasets, increasing these numbers will decrease the run time of the tool. For very small leaf sizes (close to the minimums defined), your forest will be prone to noise in your data. For a more stable model, experiment with increasing the Minimum Leaf Size.
- Maximum Tree Depth is the maximum number of splits that will be made down a tree. When using a large maximum depth, more splits will be created, which may increase the chances of overfitting the model. The default is data driven and depends on the number of trees created and the number of variables included. Note that a node cannot be split after it reaches the Minimum Leaf Size. If both Minimum Leaf Size and Maximum Tree Depth are set, Minimum Leaf Size will dominate in the determination of the depth of trees.
- The Data Available per Tree (%) parameter specifies the percentage of Input Training Features used for each decision tree. The default is 100 percent of the data. Each decision tree in the forest is created using a random subset (approximately two-thirds) of the training data available. Using a lower percentage of the input data for each decision tree increases the speed of the tool for very large datasets.
- The Number of Randomly Sampled Variables parameter specifies the number of explanatory variables used to create each decision tree. Each decision tree in the forest is created using a random subset of the explanatory variables specified. Increasing the number of variables used in each decision tree will increase the chances of overfitting your model, particularly if there are one or more dominant variables. A common practice (and the default used by the tool) is to use the square root of the total number of explanatory variables (fields, distance features, and rasters) if the Variable to Predict is categorical, or to divide the total number of explanatory variables by 3 if the Variable to Predict is numeric.
- The Training Data Excluded for Validation (%) parameter specifies the percentage (between 10 percent and 50 percent) of the Input Training Features to reserve as the test dataset for validation. The model will be trained without this random subset of data, and the observed values for those features will be compared to the predicted values to validate model performance. The default is 10 percent.
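For orientation, the advanced options above correspond roughly to hyperparameters found in generic random forest libraries (scikit-learn names shown). The mapping is approximate and not an exact description of the tool's implementation.

```python
# Approximate correspondence between the tool's advanced options and
# generic random forest hyperparameters (scikit-learn names shown for
# orientation only; the tool's own implementation may differ in detail).
option_analogues = {
    "Number of Trees":                           "n_estimators (default 100)",
    "Minimum Leaf Size":                         "min_samples_leaf",
    "Maximum Tree Depth":                        "max_depth",
    "Data Available per Tree (%)":               "max_samples (bootstrap fraction)",
    "Number of Randomly Sampled Variables":      "max_features",
    "Training Data Excluded for Validation (%)": "a held-out test split",
}
for option, analogue in option_analogues.items():
    print(f"{option} -> {analogue}")
```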
The following are best practices when using the Forest-based Classification and Regression tool:
- The forest model should be trained on at least several hundred features for best results; it is not an appropriate tool for very small datasets.
- This tool may perform poorly when trying to predict with explanatory variables that are out of range of the explanatory variables used to train the model. Forest-based models do not extrapolate; they can only classify or predict within the value range on which the model was trained. When predicting a value based on explanatory variables much higher or lower than the range of the original training dataset, the model will estimate the value to be around the highest or lowest value in the original dataset.
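A one-split regression tree (a "stump") is enough to show why forest-based models do not extrapolate: every leaf predicts an average of training values, so no prediction can fall outside the training range of the variable to predict. The data and split below are hypothetical.

```python
# Demonstration of the no-extrapolation behavior using a single
# one-split regression tree. Leaf predictions are averages of training
# values, so predictions are bounded by the training range.

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]   # y = 2x within the training range

split = 3.5                       # a fixed split for illustration
left = [yi for xi, yi in zip(x, y) if xi <= split]
right = [yi for xi, yi in zip(x, y) if xi > split]
left_leaf = sum(left) / len(left)     # 4.0
right_leaf = sum(right) / len(right)  # 9.0

def stump_predict(x_new):
    return left_leaf if x_new <= split else right_leaf

# Far outside the training range, the prediction stays near the edge of
# the training values, not the linearly extrapolated 2 * 100 = 200.
print(stump_predict(100.0))  # 9.0
```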
- To improve performance when extracting values from Explanatory Training Rasters and calculating distances using Explanatory Training Distance Features, consider training the model on 100% of the data without excluding data for validation and choose to create Output Trained Features. Next time you run the tool, use the Output Trained Features as your Input Training Features and use all of the extracted values and distances as Explanatory Training Variables instead of extracting them each time you train the model. If you choose to do this, set the Number of Trees, Maximum Tree Depth and Number of Randomly Sampled Variables to 1 to create a very small dummy tree to quickly prepare your data for analysis.
- Although the tool defaults the Number of Trees parameter to 100, this number is not data driven and the number of trees needed increases with the complexity of relationships between the explanatory variables, size of the dataset, and the Variable to Predict, in addition to the variation of these variables.
- Increase the Number of Trees in the forest and keep track of the OOB or classification error. It is recommended that you increase the Number of Trees at least 3 times, up to at least 500 trees, to best evaluate model performance.
- Tool execution time is highly sensitive to the number of variables used per tree. Using a small number of variables per tree decreases the chances of overfitting; however, be sure to use many trees if using a small number of variables per tree, to improve model performance.
- To create a model that does not change in every run, a seed can be set in the Random Number Generator environment setting. There will still be randomness in the model, but that randomness will be consistent between runs.
Breiman, L. (1996). Out-of-bag estimation.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32. doi:10.1023/A:1010933404324.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (2017). Classification and Regression Trees. New York: Routledge. Chapter 4.
Dietterich, T. G. (2000). Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems (pp. 1-15). Springer, Berlin, Heidelberg.
Gini, C. (1912). Variabilità e mutabilità. Reprinted in Memorie di Metodologica Statistica (Eds. Pizetti, E., & Salvemini, T.). Rome: Libreria Eredi Virgilio Veschi.
Grömping, U. (2009). Variable importance assessment in regression: Linear regression versus random forest. The American Statistician, 63(4), 308-319.
Ho, T. K. (1995). Random decision forests. In Proceedings of the Third International Conference on Document Analysis and Recognition (Vol. 1, pp. 278-282). IEEE.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112). New York: Springer.
LeBlanc, M., & Tibshirani, R. (1996). Combining estimates in regression and classification. Journal of the American Statistical Association, 91(436), 1641-1650.
Loh, W. Y., & Shih, Y. S. (1997). Split selection methods for classification trees. Statistica Sinica, 815-840.
Nadeau, C., & Bengio, Y. (2000). Inference for the generalization error. In Advances in Neural Information Processing Systems (pp. 307-313).
Strobl, C., Boulesteix, A. L., Kneib, T., Augustin, T., & Zeileis, A. (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9(1), 307.
Zhou, Z. H. (2012). Ensemble Methods: Foundations and Algorithms. CRC Press.