Summary
Creates models and generates predictions using an adaptation of Leo Breiman's random forest algorithm, which is a supervised machine learning method. Predictions can be performed for both categorical variables (classification) and continuous variables (regression). Explanatory variables can take the form of fields in the attribute table of the training features, raster datasets, and distance features used to calculate proximity values for use as additional variables. In addition to validation of model performance based on the training data, predictions can be made to either features or a prediction raster.
Learn more about how Forest-based Classification and Regression works
Illustration
Usage
This tool creates hundreds of trees, called an ensemble of decision trees, to create a model that can then be used for prediction. Each decision tree is created using randomly generated portions of the original (training) data. Each tree generates its own prediction and votes on an outcome. The forest model considers votes from all decision trees to predict or classify the outcome of an unknown sample. This is important because individual trees may overfit a model; combining multiple trees in a forest for prediction addresses the overfitting problem associated with any single tree.
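The voting step described above can be illustrated with a minimal sketch (plain Python; `forest_vote` is a hypothetical helper for illustration, not the tool's implementation):

```python
from collections import Counter

def forest_vote(tree_predictions):
    # Hypothetical helper: combine per-tree class predictions by
    # majority vote, as a forest does when classifying an unknown sample.
    return Counter(tree_predictions).most_common(1)[0][0]

# Five trees vote on the class of one unknown sample.
print(forest_vote(["urban", "forest", "urban", "urban", "water"]))  # urban
```

For regression, the forest averages the per-tree predictions instead of taking a majority vote.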
This tool can be used in three different operation modes. The Train option can be used to evaluate the performance of different models as you explore different explanatory variables and tool settings. Once a good model has been found, you can use the Predict to features or Predict to raster option. This is a data-driven tool and performs best on large datasets. The tool should be trained on at least several hundred features for best results. It is not an appropriate tool for very small datasets.
The Input Training Features can be points or polygons. This tool does not work with multipart data.
A Spatial Analyst license is required to use rasters as explanatory variables or to predict to an Output Prediction Surface.
This tool produces a variety of different outputs. Output Trained Features will contain all of the Input Training Features used in the model created as well as all of the explanatory variables used in the model (including the input fields used, any distances calculated, and any raster values extracted or calculated). It will also contain predictions for all of the features used for training the model, which can be helpful in assessing the performance of the model created. When using this tool for prediction, it will produce either a new feature class containing the Output Predicted Features or a new Output Prediction Surface if explanatory rasters are provided.
When the Predict to features option is used, a new feature class containing the Output Predicted Features will be created. When the Predict to raster option is used, a new Output Prediction Surface will be created.
This tool also creates messages and charts to help you understand the performance of the model created. You can access the messages by hovering over the progress bar, clicking the pop-out button, or expanding the messages section in the Geoprocessing pane. You can also access the messages for a previous run of the Forest-based Classification and Regression tool via the Geoprocessing history. The messages include information on the model characteristics, out of bag errors, variable importance, and validation diagnostics.
You can use the Output Variable Importance Table parameter to create a table to display a chart of variable importance for evaluation. The top 20 variable importance values are also reported in the messages window. The chart can be accessed directly below the layer in the Contents pane.
Explanatory variables can come from fields or be calculated from distance features or extracted from rasters. You can use any combination of these explanatory variable types, but at least one type is required. The explanatory variables (from fields, distance features, or rasters) used should contain a variety of values. If the explanatory variable is categorical, the Categorical check box should be checked (variables of type string will automatically be checked). Categorical explanatory variables are limited to 60 unique values, though a smaller number of categories will improve model performance. For a given data size, the more categories a variable contains, the more likely it is that it will dominate the model and lead to less effective prediction results.
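The 60-unique-value limit for categorical variables can be checked up front before training. A minimal sketch (plain Python; `check_categorical` is a hypothetical helper, not part of arcpy):

```python
def check_categorical(values, limit=60):
    # Count the unique categories and report whether the count is
    # within the tool's limit of 60 unique values.
    n = len(set(values))
    return n, n <= limit

n, ok = check_categorical(["residential", "commercial", "industrial"])
print(n, ok)  # 3 True
```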
Distance features are used to automatically create explanatory variables representing a distance from the provided features to the Input Training Features. Distances will be calculated from each of the input Explanatory Training Distance Features to the nearest Input Training Feature. If the input Explanatory Training Distance Features are polygons or lines, the distance attributes are calculated as the distance between the closest segments of the pair of features. However, distances are calculated differently for polygons and lines. See How proximity tools calculate distance for details.
If your Input Training Features are points and you are using Explanatory Training Rasters, the tool drills down to extract explanatory variables at each point location. For multiband rasters, only the first band is used.
Although you can have multiple layers with the same name in the Contents pane, the tool is unable to accept explanatory layers with the same name or to remove duplicate layer names in the drop-down lists. To avoid this issue, ensure that each layer has a unique name.
If your Input Training Features are polygons, the Variable to Predict is categorical, and you are using exclusively Explanatory Training Rasters, there is an option to Convert Polygons to Raster Resolution for Training. If this option is checked, the polygon is divided into points at the centroid of each raster cell whose centroid falls within the polygon. The raster values at each point location are then extracted and used to train the model. A bilinear sampling method is used for numeric variables, and the nearest method is used for categorical variables. The default cell size of the converted polygons will be the maximum cell size of input rasters. However, this can be changed using the Cell Size environment setting. If not checked, one raster value for each polygon will be used in the model. Each polygon is assigned the average value for continuous rasters and the majority for categorical rasters.
There must be variation in the data used for each explanatory variable specified. If you receive an error that there is no variation in one of the fields or rasters specified, you can try running the tool again, marking that variable as categorical. If 95 percent of the features have the same value for a particular variable, that variable is flagged as having no variation.
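The 95 percent rule above can be sketched as a quick pre-check (plain Python; `has_variation` is a hypothetical helper mirroring the described behavior, not the tool's code):

```python
from collections import Counter

def has_variation(values, threshold=0.95):
    # A variable is flagged as having no variation when the most
    # common value accounts for 95 percent or more of the features.
    top_count = Counter(values).most_common(1)[0][1]
    return top_count / len(values) < threshold

print(has_variation([1] * 95 + [2] * 5))  # False: 95% share one value
```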
The Compensate for Sparse Categories parameter can be used if the variation in your categories is unbalanced. For instance, if some categories occur hundreds of times in your dataset and a few occur significantly less often, checking this parameter will ensure that each category is represented in each tree to create balanced models.
When matching explanatory variables, the Prediction and Training fields must be of the same type (a double field in Training must be matched to a double field in Prediction).
Forest-based models do not extrapolate; they can only classify or predict to a value that the model was trained on. When predicting a value based on explanatory variables much higher or lower than the range of the original training dataset, the model will estimate the value to be around the highest or lowest value in the original dataset. Consequently, this tool may perform poorly when predicting with explanatory variables that are out of the range of those used to train the model.
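This behavior can be illustrated conceptually: because a regression forest averages leaf values learned from the training data, its output is effectively bounded by the observed range. A minimal sketch (plain Python; `clamp_to_training_range` is an illustrative stand-in, not the actual algorithm):

```python
def clamp_to_training_range(prediction, training_values):
    # Illustrative only: a forest's prediction is effectively bounded
    # by the minimum and maximum values seen during training.
    lo, hi = min(training_values), max(training_values)
    return max(lo, min(prediction, hi))

training = [10.0, 25.0, 40.0]
print(clamp_to_training_range(100.0, training))  # 40.0
```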
The tool will fail if categories exist in the prediction explanatory variables that are not present in the training features.
To use mosaic datasets as explanatory variables, use the Make Mosaic Layer tool first and copy the full path to the layer into the tool, or use the Make Mosaic Layer and Make Raster Layer tools to adjust the processing template for the mosaic dataset.
The default value for the Number of Trees parameter is 100. Increasing the number of trees in the forest model will generally result in more accurate model prediction, but the model will take longer to calculate.
When the Calculate Uncertainty parameter is checked, the tool will calculate a 90 percent prediction interval around each predicted value of the Variable to Predict. When Prediction Type is Train only or Predict to features, two fields are added to either Output Trained Features or Output Predicted Features. These fields, ending with _P05 and _P95, represent the lower and upper bounds of the prediction interval. Given the same explanatory variables, you can predict with 90 percent confidence that the value of a new observation will fall within this interval. When the Predict to raster option is used, two additional rasters representing the lower and upper bounds of the prediction interval are added to the Contents pane.
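Conceptually, such an interval comes from the spread of the per-tree predictions, in the spirit of quantile regression forests (Meinshausen, 2006). A minimal sketch of extracting P05 and P95 from hypothetical per-tree predictions (plain Python, not the tool's implementation):

```python
import statistics

def prediction_interval(tree_predictions):
    # Take the 5th and 95th percentiles of the per-tree predictions
    # as the lower (_P05) and upper (_P95) interval bounds.
    cuts = statistics.quantiles(tree_predictions, n=20, method="inclusive")
    return cuts[0], cuts[-1]

p05, p95 = prediction_interval(list(range(1, 21)))
print(round(p05, 2), round(p95, 2))  # 1.95 19.05
```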
This tool supports parallel processing for prediction and uses 50 percent of available processors by default. The number of processors can be increased or decreased using the Parallel Processing Factor environment.
To learn more about how this tool works and understand the output messages and charts, see How Forest-based Classification and Regression works.
References:
Breiman, L. (1996). Out-of-bag estimation.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32. doi:10.1023/A:1010933404324.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (2017). Classification and regression trees. New York: Routledge. Chapter 4.
Dietterich, T. G. (2000, June). Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems (pp. 1-15). Springer, Berlin, Heidelberg.
Gini, C. (1912). Variabilità e mutabilità. Reprinted in Memorie di metodologica statistica (Eds. Pizetti, E., & Salvemini, T.). Rome: Libreria Eredi Virgilio Veschi.
Grömping, U. (2009). Variable importance assessment in regression: Linear regression versus random forest. The American Statistician, 63(4), 308-319.
Ho, T. K. (1995, August). Random decision forests. In Proceedings of the Third International Conference on Document Analysis and Recognition (Vol. 1, pp. 278-282). IEEE.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). New York: Springer.
LeBlanc, M., & Tibshirani, R. (1996). Combining estimates in regression and classification. Journal of the American Statistical Association, 91(436), 1641-1650.
Loh, W. Y., & Shih, Y. S. (1997). Split selection methods for classification trees. Statistica Sinica, 815-840.
Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research, 7(Jun), 983-999.
Nadeau, C., & Bengio, Y. (2000). Inference for the generalization error. In Advances in Neural Information Processing Systems (pp. 307-313).
Strobl, C., Boulesteix, A. L., Kneib, T., Augustin, T., & Zeileis, A. (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9(1), 307.
Zhou, Z. H. (2012). Ensemble methods: Foundations and algorithms. CRC Press.
Syntax
Forest(prediction_type, in_features, {variable_predict}, {treat_variable_as_categorical}, {explanatory_variables}, {distance_features}, {explanatory_rasters}, {features_to_predict}, {output_features}, {output_raster}, {explanatory_variable_matching}, {explanatory_distance_matching}, {explanatory_rasters_matching}, {output_trained_features}, {output_importance_table}, {use_raster_values}, {number_of_trees}, {minimum_leaf_size}, {maximum_depth}, {sample_size}, {random_variables}, {percentage_for_training}, {output_classification_table}, {output_validation_table}, {compensate_sparse_categories}, {number_validation_runs}, {calculate_uncertainty})
Parameter | Explanation | Data Type |
prediction_type | Specifies the operation mode of the tool. The tool can be run to train a model to only assess performance (TRAIN), predict to features (PREDICT_FEATURES), or create a prediction surface (PREDICT_RASTER). | String |
in_features | The feature class containing the variable_predict parameter and, optionally, the explanatory training variables from fields. | Feature Layer |
variable_predict (Optional) | The variable from the in_features parameter containing the values to be used to train the model. This field contains known (training) values of the variable that will be used to predict at unknown locations. | Field |
treat_variable_as_categorical (Optional) | Specifies whether the variable_predict represents a categorical variable (CATEGORICAL, as shown in the code samples below) rather than a continuous one. | Boolean |
explanatory_variables [[Variable, Categorical],...] (Optional) | A list of fields representing the explanatory variables that help predict the value or category of the variable_predict. Specify the variable as true for any variables that represent classes or categories (such as land cover or presence or absence) and false if the variable is continuous. | Value Table |
distance_features [distance_features,...] (Optional) | Automatically creates explanatory variables by calculating a distance from the provided features to the in_features. Distances will be calculated from each of the input distance_features to the nearest in_features. If the input distance_features are polygons or lines, the distance attributes are calculated as the distance between the closest segments of the pair of features. | Feature Layer |
explanatory_rasters [[Variable, Categorical],...] (Optional) | Automatically creates explanatory training variables in your model whose values are extracted from rasters. For each feature in the in_features, the value of the raster cell is extracted at that exact location. Bilinear raster resampling is used when extracting the raster value unless it is specified as categorical, in which case nearest neighbor assignment is used. Specify the raster as true for any rasters that represent classes or categories such as land cover or presence or absence and false if the raster is continuous. | Value Table |
features_to_predict (Optional) | A feature class representing locations where predictions will be made. This feature class must also contain any explanatory variables provided as fields that correspond to those used in the training data, if any. | Feature Layer |
output_features (Optional) | The output feature class containing the prediction results. | Feature Class |
output_raster (Optional) | The output raster containing the prediction results. The default cell size will be the maximum cell size of the raster inputs. To set a different cell size, use the cell size environment setting. | Raster Dataset |
explanatory_variable_matching [[Prediction, Training],...] (Optional) | A list of the explanatory_variables specified from the in_features on the right and their corresponding fields from the features_to_predict on the left, for example, [["LandCover2000", "LandCover2010"], ["Income", "PerCapitaIncome"]]. | Value Table |
explanatory_distance_matching [[Prediction, Training],...] (Optional) | A list of the distance_features specified for the in_features on the right. Corresponding feature sets should be specified for the features_to_predict on the left. Distance features that are more appropriate for the features_to_predict can be provided if those used for training are in a different study area or time period. | Value Table |
explanatory_rasters_matching [[Prediction, Training],...] (Optional) | A list of the explanatory_rasters specified for the in_features on the right. Corresponding rasters should be specified for the features_to_predict or output_raster to be created on the left. explanatory_rasters that are more appropriate for the features_to_predict can be provided if those used for training are in a different study area or time period. | Value Table |
output_trained_features (Optional) | The output_trained_features will contain all explanatory variables used for training (including sampled raster values and distance calculations), as well as the observed variable_predict field and accompanying predictions that can be used to further assess the performance of the trained model. | Feature Class |
output_importance_table (Optional) | If specified, the table will contain information describing the importance of each explanatory variable (fields, distance features, and rasters) used in the model created. | Table |
use_raster_values (Optional) | Specifies how polygons are treated when training the model if the in_features are polygons with a categorical variable_predict and only explanatory_rasters have been specified. If true, each polygon is divided into points at the centroid of each raster cell whose centroid falls within the polygon, and the raster values at those points are used for training. If false, one raster value is used for each polygon: the average for continuous rasters and the majority for categorical rasters. | Boolean |
number_of_trees (Optional) | The number of trees to create in the forest model. More trees will generally result in more accurate model prediction, but the model will take longer to calculate. The default number of trees is 100. | Long |
minimum_leaf_size (Optional) | The minimum number of observations required to keep a leaf (that is, the terminal node of a tree without further splits). The default minimum for regression is 5, and the default for classification is 1. For very large data, increasing these numbers will decrease the run time of the tool. | Long |
maximum_depth (Optional) | The maximum number of splits that will be made down a tree. With a large maximum depth, more splits will be created, which may increase the chances of overfitting the model. The default is data driven and depends on the number of trees created and the number of variables included. | Long |
sample_size (Optional) | Specifies the percentage of the in_features used for each decision tree. The default is 100 percent of the data. Each decision tree in the forest is created using a random sample, approximately two-thirds, of the data specified. Using a lower percentage of the input data for each decision tree increases the speed of the tool for very large datasets. | Long |
random_variables (Optional) | Specifies the number of explanatory variables used to create each decision tree. Each of the decision trees in the forest is created using a random subset of the explanatory variables specified. Increasing the number of variables used in each decision tree will increase the chances of overfitting your model, particularly if one or two dominant variables exist. A common practice is to use the square root of the total number of explanatory variables (fields, distances, and rasters combined) if the variable_predict is numeric, or to divide the total number of explanatory variables by 3 if the variable_predict is categorical. | Long |
percentage_for_training (Optional) | Specifies the percentage (between 10 percent and 50 percent) of in_features to reserve as the test dataset for validation. The model will be trained without this random subset of data, and the observed values for those features will be compared to the predicted value. The default is 10 percent. | Double |
output_classification_table (Optional) | If specified, creates a confusion matrix for classification summarizing the performance of the model created. This table can be used to calculate other diagnostics beyond the accuracy and sensitivity measures the tool calculates in the output messages. | Table |
output_validation_table (Optional) | If the number_validation_runs value is greater than 2, this table can be used to create a chart of the distribution of R2 values from each model. This distribution can be used to assess the stability of your model. | Table |
compensate_sparse_categories (Optional) | If there are categories in your dataset that don't occur as often as others, checking this parameter will ensure that each category is represented in each tree. | Boolean |
number_validation_runs (Optional) | The tool will run for the number of iterations specified. The distribution of the R2 for each run can be displayed using the Output Validation Table parameter. When this is set and predictions are being generated, only the model that produced the highest R2 value will be used for predictions. | Long |
calculate_uncertainty (Optional) | Specifies whether prediction uncertainty will be calculated when training, predicting to features, or predicting to raster. | Boolean |
Derived Output
Name | Explanation | Data Type |
output_uncertainty_raster_layers | When calculate_uncertainty is true, the tool will calculate a 90 percent prediction interval around each predicted value of the variable_predict. These raster layers represent the lower and upper bounds of that interval. | Raster Layer |
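The rule of thumb described for the random_variables parameter can be sketched as a small helper (hypothetical, plain Python; the tool chooses its own data-driven default):

```python
import math

def default_random_variables(total_variables, predict_is_categorical):
    # Rule of thumb: square root of the explanatory variable count for a
    # numeric variable_predict, one third of the count for a categorical one.
    if predict_is_categorical:
        return max(1, total_variables // 3)
    return max(1, round(math.sqrt(total_variables)))

print(default_random_variables(9, False))  # 3
print(default_random_variables(9, True))   # 3
```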
Code sample
The following Python script demonstrates how to use the Forest function.
import arcpy
arcpy.env.workspace = r"c:\data"
# Forest-based model using only the training method and all data
# comes from a single polygon feature class. The tool excludes 10% of the
# input features from training and uses these values to validate the model.
prediction_type = "TRAIN"
in_features = r"Boston_Vandalism.shp"
variable_predict = "VandCnt"
explanatory_variables = [["Educat", "false"], ["MedAge", "false"],
["HHInc", "false"], ["Pop", "false"]]
output_trained_features = "TrainingFeatures.shp"
number_of_trees = 100
sample_size = 100
percentage_for_training = 10
arcpy.stats.Forest(prediction_type, in_features, variable_predict, None,
explanatory_variables, None, None, None, None, None, None, None, None,
output_trained_features, None, True, number_of_trees, None, None, sample_size,
None, percentage_for_training)
The following Python script demonstrates how to use the Forest function to predict to features.
# Import system modules
import arcpy
# Set property to overwrite existing outputs
arcpy.env.overwriteOutput = True
# Set the workspace to a file geodatabase
arcpy.env.workspace = r"C:\Data\BostonCrimeDB.gdb"
# Forest-based model taking advantage of both distance features and
# explanatory rasters. The training and prediction data has been manually
# split so the percentage to exclude parameter was set to 0. A variable importance
# table is created to help assess results and advanced options have been used
# to fine tune the model.
prediction_type = "PREDICT_FEATURES"
in_features = r"Boston_Vandalism_Training"
variable_predict = "Vandalism_Count"
treat_variable_as_categorical = None
explanatory_variables = [["EduClass", "true"], ["MedianAge", "false"],
["HouseholdIncome", "false"], ["TotalPopulation", "false"]]
distance_features = r"Boston_Highways"
explanatory_rasters = [["LandUse", "true"]]
features_to_predict = r"Boston_Vandalism_Prediction"
output_features = r"Prediction_Output"
output_raster = None
explanatory_variable_matching = [["EduClass", "EduClass"], ["MedianAge", "MedianAge"],
["HouseholdIncome", "HouseholdIncome"], ["TotalPopulation", "TotalPopulation"]]
explanatory_distance_matching = [["Boston_Highways", "Boston_Highways"]]
explanatory_rasters_matching = [["LandUse", "LandUse"]]
output_trained_features = r"Training_Output"
output_importance_table = r"Variable_Importance"
use_raster_values = True
number_of_trees = 100
minimum_leaf_size = 2
maximum_depth = 5
sample_size = 100
random_variables = 3
percentage_for_training = 0
arcpy.stats.Forest(prediction_type, in_features, variable_predict,
                   treat_variable_as_categorical, explanatory_variables, distance_features,
                   explanatory_rasters, features_to_predict, output_features, output_raster,
                   explanatory_variable_matching, explanatory_distance_matching,
                   explanatory_rasters_matching, output_trained_features, output_importance_table,
                   use_raster_values, number_of_trees, minimum_leaf_size, maximum_depth,
                   sample_size, random_variables, percentage_for_training)
The following Python script demonstrates how to use the Forest function to create a prediction surface.
# Import system modules
import arcpy
# Set property to overwrite existing outputs
arcpy.env.overwriteOutput = True
# Set the workspace to a file geodatabase
arcpy.env.workspace = r"C:\Data\Landsat.gdb"
# Using a forest-based model to classify a landsat image. The TrainingPolygons feature
# class was created manually and is used to train the model to
# classify the remainder of the landsat image.
prediction_type = "PREDICT_RASTER"
in_features = r"TrainingPolygons"
variable_predict = "LandClassName"
treat_variable_as_categorical = "CATEGORICAL"
explanatory_variables = None
distance_features = None
explanatory_rasters = [["Band1", "false"], ["Band2", "false"], ["Band3", "false"]]
features_to_predict = None
output_features = None
output_raster = r"PredictionSurface"
explanatory_variable_matching = None
explanatory_distance_matching = None
explanatory_rasters_matching = [["Band1", "Band1"], ["Band2", "Band2"], ["Band3", "Band3"]]
output_trained_features = None
output_importance_table = None
use_raster_values = True
number_of_trees = 100
minimum_leaf_size = None
maximum_depth = None
sample_size = 100
random_variables = None
percentage_for_training = 10
arcpy.stats.Forest(prediction_type, in_features, variable_predict,
                   treat_variable_as_categorical, explanatory_variables, distance_features,
                   explanatory_rasters, features_to_predict, output_features, output_raster,
                   explanatory_variable_matching, explanatory_distance_matching,
                   explanatory_rasters_matching, output_trained_features, output_importance_table,
                   use_raster_values, number_of_trees, minimum_leaf_size, maximum_depth,
                   sample_size, random_variables, percentage_for_training)
Environments
- Random number generator
The Random Generator Type used is always Mersenne Twister.
- Parallel Processing Factor
Parallel processing is only used when predictions are being made.
Licensing information
- Basic: Limited
- Standard: Limited
- Advanced: Limited