Forest-based Classification and Regression (Spatial Statistics)—ArcGIS Pro

Summary

Creates models and generates predictions using an adaptation of Leo Breiman's random forest algorithm, which is a supervised machine learning method. Predictions can be performed for both categorical variables (classification) and continuous variables (regression). Explanatory variables can take the form of fields in the attribute table of the training features, raster datasets, and distance features used to calculate proximity values for use as additional variables. In addition to validation of model performance based on the training data, predictions can be made to either features or a prediction raster.

Learn more about how Forest-based Classification and Regression works

Illustration

Usage

This tool creates hundreds of trees, called an ensemble of decision trees, to create a model that can then be used for prediction. Each decision tree is created using randomly generated portions of the original (training) data. Each tree generates its own prediction and votes on an outcome. The forest model considers votes from all decision trees to predict or classify the outcome of an unknown sample. This is important as individual trees may have issues with overfitting a model; however, combining multiple trees in a forest for prediction addresses the overfitting problem associated with a single tree.
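The voting idea described above can be illustrated with a minimal sketch in plain Python. It is not the tool's implementation: the one-split "stump" learner, the function names, and the toy data are all assumptions made for illustration. Each stump is trained on a random resample of the training data, and the forest returns the label that receives the most votes.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a one-split 'tree' on (x, label) pairs by choosing the threshold with the fewest errors."""
    best = None
    for threshold, _ in sample:
        left = [label for x, label in sample if x <= threshold]
        right = [label for x, label in sample if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(label != left_label for label in left)
                  + sum(label != right_label for label in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label

def train_forest(data, n_trees=25, seed=0):
    """Train each stump on a random resample (with replacement) of the training data."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]

def predict(forest, x):
    """Every tree votes; the most common label wins."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: values below 10 are "low", the rest are "high"
data = [(x, "low") for x in range(10)] + [(x, "high") for x in range(10, 20)]
forest = train_forest(data)
print(predict(forest, 2), predict(forest, 17))
```

Any single stump may place its threshold poorly because of the resample it saw, but the majority vote across all stumps is far less sensitive to such individual errors.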
This tool can be used in three different operation modes. The Train option can be used to evaluate the performance of different models as you explore different explanatory variables and tool settings. Once a good model has been found, you can use the Predict to features or Predict to raster option. This is a data-driven tool and performs best on large datasets. The tool should be trained on at least several hundred features for best results. It is not an appropriate tool for very small datasets.
The Input Training Features can be points or polygons. This tool does not work with multipart data.
A Spatial Analyst license is required to use rasters as explanatory variables or to predict to an Output Prediction Surface.
This tool produces a variety of different outputs. Output Trained Features will contain all of the Input Training Features used in the model created as well as all of the explanatory variables used in the model (including the input fields used, any distances calculated, and any raster values extracted or calculated). It will also contain predictions for all of the features used for training the model, which can be helpful in assessing the performance of the model created. When using this tool for prediction, it will produce either a new feature class containing the Output Predicted Features or a new Output Prediction Surface if explanatory rasters are provided.
When using the Predict to features option, a new feature class containing the Output Predicted Features will be created. When using the Predict to raster option, a new Output Prediction Surface will be created.
This tool also creates messages and charts to help you understand the performance of the model created. You can access the messages by hovering over the progress bar, clicking the pop-out button, or expanding the messages section in the Geoprocessing pane. You can also access the messages for a previous run of the Forest-based Classification and Regression tool via the Geoprocessing history. The messages include information on the model characteristics, out of bag errors, variable importance, and validation diagnostics.
You can use the Output Variable Importance Table parameter to create a table to display a chart of variable importance for evaluation. The top 20 variable importance values are also reported in the messages window. The chart can be accessed directly below the layer in the Contents pane.
Explanatory variables can come from fields or be calculated from distance features or extracted from rasters. You can use any combination of these explanatory variable types, but at least one type is required. The explanatory variables (from fields, distance features, or rasters) used should contain a variety of values. If the explanatory variable is categorical, the Categorical check box should be checked (variables of type string will automatically be checked). Categorical explanatory variables are limited to 60 unique values, though a smaller number of categories will improve model performance. For a given data size, the more categories a variable contains, the more likely it is that it will dominate the model and lead to less effective prediction results.
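Before marking a field as categorical, it can help to count its unique values against the 60-value limit mentioned above. The following is a minimal sketch in plain Python; the function name and sample values are hypothetical, and in practice the values would be read from the attribute table.

```python
def categorical_summary(values, max_unique=60):
    """Return (number of unique values, whether that count is within the limit)."""
    unique_count = len(set(values))
    return unique_count, unique_count <= max_unique

# Hypothetical field values
land_cover = ["forest", "water", "urban", "forest", "crop"]
parcel_ids = [f"parcel_{i}" for i in range(75)]

print(categorical_summary(land_cover))
print(categorical_summary(parcel_ids))
```

A field such as the hypothetical parcel identifier above exceeds the limit and is a poor candidate for a categorical explanatory variable.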
Distance features are used to automatically create explanatory variables representing a distance from the provided features to the Input Training Features. Distances will be calculated from each of the input Explanatory Training Distance Features to the nearest Input Training Feature. If the input Explanatory Training Distance Features are polygons or lines, the distance attributes are calculated as the distance between the closest segments of the pair of features. However, distances are calculated differently for polygons and lines. See How proximity tools calculate distance for details.
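As a rough illustration of a distance-based explanatory variable, the sketch below computes, for each training point, the planar distance to the nearest of a set of distance-feature points. It assumes point features in projected coordinates and does not reproduce the segment-based calculation the tool uses for lines and polygons; all names and coordinates are hypothetical.

```python
import math

def nearest_distance(training_points, distance_points):
    """For each training point, return the distance to the closest distance-feature point."""
    return [min(math.dist(t, d) for d in distance_points) for t in training_points]

# Hypothetical coordinates in a projected coordinate system
training = [(0.0, 0.0), (3.0, 4.0)]
highway_vertices = [(0.0, 5.0), (6.0, 8.0)]

distances = nearest_distance(training, highway_vertices)
print(distances)
```

Each resulting value would become one row's entry in a new explanatory variable, one variable per set of distance features.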
If your Input Training Features are points and you are using Explanatory Training Rasters, the tool drills down to extract explanatory variables at each point location. For multiband rasters, only the first band is used.
Although you can have multiple layers with the same name in the Contents pane, the tool is unable to accept explanatory layers with the same name or to remove duplicate layer names in the drop-down lists. To avoid this issue, ensure that each layer has a unique name.
If your Input Training Features are polygons, the Variable to Predict is categorical, and you are using exclusively Explanatory Training Rasters, there is an option to Convert Polygons to Raster Resolution for Training. If this option is checked, the polygon is divided into points at the centroid of each raster cell whose centroid falls within the polygon. The raster values at each point location are then extracted and used to train the model. A bilinear sampling method is used for numeric variables, and the nearest method is used for categorical variables. The default cell size of the converted polygons will be the maximum cell size of input rasters. However, this can be changed using the Cell Size environment setting. If not checked, one raster value for each polygon will be used in the model. Each polygon is assigned the average value for continuous rasters and the majority for categorical rasters.
Polygons are converted to raster resolution (left) or assigned an average value (right)
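The conversion of a polygon to training points at raster resolution can be sketched as follows. This is a simplified illustration, not the tool's code: it assumes a raster defined by a lower-left origin and a square cell size, uses a standard ray-casting point-in-polygon test, and ignores cells whose centroid lies exactly on the polygon boundary. All names are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count how many polygon edges a ray to the right of (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def cell_centroids_in_polygon(polygon, origin, cell_size, n_cols, n_rows):
    """Return the centroids of all raster cells whose centroid falls within the polygon."""
    x0, y0 = origin  # lower-left corner of the raster
    points = []
    for row in range(n_rows):
        for col in range(n_cols):
            cx = x0 + (col + 0.5) * cell_size
            cy = y0 + (row + 0.5) * cell_size
            if point_in_polygon(cx, cy, polygon):
                points.append((cx, cy))
    return points

# Hypothetical triangular training polygon over a 4 x 4 raster with 1-unit cells
triangle = [(0.0, 0.0), (4.5, 0.0), (0.0, 4.5)]
points = cell_centroids_in_polygon(triangle, (0.0, 0.0), 1.0, 4, 4)
print(len(points), "training points from one polygon")
```

With this approach a single polygon contributes many training samples, one per cell centroid, instead of one averaged value.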
There must be variation in the data used for each explanatory variable specified. If you receive an error that there is no variation in one of the fields or rasters specified, you can try running the tool again marking that variable as categorical. If 95 percent of the features have the same value for a particular variable, that variable is flagged as having no variation.
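The 95 percent rule above can be checked ahead of time with a small sketch like the following. The function name is hypothetical, and treating "95 percent" as "at least 95 percent" is an assumption about how the threshold is applied.

```python
from collections import Counter

def lacks_variation(values, threshold=0.95):
    """Return True when the most common value makes up at least `threshold` of all values."""
    most_common_count = Counter(values).most_common(1)[0][1]
    return most_common_count / len(values) >= threshold

# Hypothetical field values
mostly_constant = [1] * 95 + [2] * 5
more_varied = [1] * 94 + [2] * 6

print(lacks_variation(mostly_constant), lacks_variation(more_varied))
```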
The Compensate for Sparse Categories parameter can be used if the categories in your dataset are unbalanced. For instance, if you have some categories that occur hundreds of times in your dataset and a few that occur significantly less often, checking this parameter will ensure that each category is represented in each tree to create balanced models.
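One way to guarantee that every category appears in a tree's sample is to draw one record per category first and then fill the rest of the sample at random. The sketch below illustrates that idea only; how the tool actually compensates for sparse categories is not described here, and the function name and data are hypothetical.

```python
import random
from collections import defaultdict

def sample_with_all_categories(records, sample_size, seed=0):
    """Draw a sample of (category, value) records that contains every category at least once."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for record in records:
        by_category[record[0]].append(record)
    # One guaranteed draw per category
    sample = [rng.choice(group) for group in by_category.values()]
    # Fill the remainder with ordinary random draws (with replacement)
    while len(sample) < sample_size:
        sample.append(rng.choice(records))
    return sample

# Hypothetical unbalanced data: 500 common records and a single rare one
records = [("common", i) for i in range(500)] + [("rare", 0)]
sample = sample_with_all_categories(records, 50)
print(sorted({category for category, _ in sample}))
```

A purely random sample of 50 records would usually miss the rare category entirely, which is the situation the parameter is meant to address.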
When matching explanatory variables, the Prediction and Training fields must be of the same type (a double field in Training must be matched to a double field in Prediction).
Forest-based models do not extrapolate; they can only classify or predict to a value that the model was trained on. When predicting a value based on explanatory variables much higher or lower than the range of the original training dataset, the model will estimate the value to be around the highest or lowest value in the original dataset. This tool may perform poorly when trying to predict with explanatory variables that are out of range of the explanatory variables used to train the model.
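The reason tree-based models cannot extrapolate is that each leaf predicts an average of training values, so no prediction can leave the range of the training data. The sketch below shows this with a deliberately simple one-variable regression tree that splits at the median; it is an illustration under those assumptions, not the splitting rule the tool uses.

```python
def build_tree(points, min_leaf=2):
    """Build a tree from (x, y) pairs sorted by x. Leaves hold the mean y of their points."""
    if len(points) < 2 * min_leaf:
        return sum(y for _, y in points) / len(points)
    mid = len(points) // 2
    threshold = (points[mid - 1][0] + points[mid][0]) / 2
    return (threshold,
            build_tree(points[:mid], min_leaf),
            build_tree(points[mid:], min_leaf))

def predict(tree, x):
    """Walk down the tree until a leaf (a plain number) is reached."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

# Hypothetical training data following y = 2x for x from 0 to 10
training = [(x, 2.0 * x) for x in range(11)]
tree = build_tree(training)

# Inputs far outside the training range fall into the outermost leaves
print(predict(tree, 1000), predict(tree, -1000))
```

Although the underlying relationship would give 2000 at x = 1000, the prediction stays within the 0 to 20 range of the training values.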
The tool will fail if categories exist in the prediction explanatory variables that are not present in the training features.
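A quick comparison of category values before running the tool can reveal this problem early. The following is a minimal sketch with hypothetical names and values; in practice the two lists would be read from the training and prediction attribute tables.

```python
def unseen_categories(training_values, prediction_values):
    """Return categories that appear in the prediction data but not in the training data."""
    return sorted(set(prediction_values) - set(training_values))

# Hypothetical category values
training_classes = ["forest", "water", "urban"]
prediction_classes = ["forest", "urban", "wetland"]

missing = unseen_categories(training_classes, prediction_classes)
print(missing)
```

Any value returned would need to be added to the training data, reclassified, or removed from the prediction features before prediction.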
To use mosaic datasets as explanatory variables, use the Make Mosaic Layer tool first and copy the full path to the layer into the tool or use the Make Mosaic Layer tool and the Make Raster Layer tool to adjust the processing template for the mosaic dataset.
The default value for the Number of Trees parameter is 100. Increasing the number of trees in the forest model will generally result in more accurate model prediction, but the model will take longer to calculate.
When the Calculate Uncertainty parameter is checked, the tool will calculate a 90 percent prediction interval around each predicted value of the Variable to Predict. When Prediction Type is Train only or Predict to features, two fields are added to either Output Trained Features or Output Predicted Features. These fields, ending with _P05 and _P95, represent the lower and upper bounds of the prediction interval. For any new observation, you can predict with 90 percent confidence that the value of a new observation will fall within the interval, given the same explanatory variables. When the Predict to raster option is used, two additional rasters representing the upper and lower bounds of the prediction interval are added to the Contents pane.
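To illustrate what bounds at the 5th and 95th percentiles mean, the sketch below takes a set of candidate values for a single prediction and returns the two percentiles that enclose the central 90 percent. It only demonstrates the percentile arithmetic using Python's standard library; the way the tool derives the distribution of values for each prediction is not shown, and the input values are hypothetical.

```python
import statistics

def interval_90(values):
    """Return the 5th and 95th percentiles, which bound the central 90 percent of the values."""
    cuts = statistics.quantiles(values, n=20, method="inclusive")
    return cuts[0], cuts[-1]

# Hypothetical candidate values for one predicted feature
values = list(range(1, 101))
p05, p95 = interval_90(values)

inside = sum(p05 <= v <= p95 for v in values)
print(p05, p95, inside)
```

The lower bound corresponds to a field ending with _P05 and the upper bound to a field ending with _P95.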
This tool supports parallel processing for prediction and uses 50 percent of available processors by default. The number of processors can be increased or decreased using the Parallel Processing Factor environment.
To learn more about how this tool works and understand the output messages and charts, see How Forest-based Classification and Regression works.
References:
Breiman, Leo. Out-Of-Bag Estimation. 1996.
Breiman, L. (1996). Bagging predictors. Machine learning, 24(2), 123-140.
Breiman, Leo. "Random Forests". Machine Learning. 45 (1): 5-32. doi:10.1023/A:1010933404324. 2001.
Breiman, L., J.H. Friedman, R.A. Olshen, C.J. Stone. Classification and regression trees. New York: Routledge. Chapter 4. 2017.
Dietterich, T. G. (2000, June). Ensemble methods in machine learning. In International workshop on multiple classifier systems (pp. 1-15). Springer, Berlin, Heidelberg.
Gini, C. (1912). Variabilità e mutabilità. Reprinted in Memorie di metodologica statistica (Ed. Pizetti E, Salvemini, T). Rome: Libreria Eredi Virgilio Veschi.
Grömping, U. (2009). Variable importance assessment in regression: linear regression versus random forest. The American Statistician, 63(4), 308-319.
Ho, T. K. (1995, August). Random decision forests. In Document analysis and recognition, 1995., proceedings of the third international conference on Document Analysis and Recognition. (Vol. 1, pp. 278-282). IEEE.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). New York: Springer.
LeBlanc, M., & Tibshirani, R. (1996). Combining estimates in regression and classification. Journal of the American Statistical Association, 91(436), 1641-1650.
Loh, W. Y., & Shih, Y. S. (1997). Split selection methods for classification trees. Statistica sinica, 815-840.
Meinshausen, Nicolai. "Quantile regression forests." Journal of Machine Learning Research 7. Jun (2006): 983-999.
Nadeau, C., & Bengio, Y. (2000). Inference for the generalization error. In Advances in neural information processing systems (pp. 307-313).
Strobl, C., Boulesteix, A. L., Kneib, T., Augustin, T., & Zeileis, A. (2008). Conditional variable importance for random forests. BMC bioinformatics, 9(1), 307.
Zhou, Z. H. (2012). Ensemble methods: foundations and algorithms. CRC press.

Syntax

arcpy.stats.Forest(prediction_type, in_features, {variable_predict}, {treat_variable_as_categorical}, {explanatory_variables}, {distance_features}, {explanatory_rasters}, {features_to_predict}, {output_features}, {output_raster}, {explanatory_variable_matching}, {explanatory_distance_matching}, {explanatory_rasters_matching}, {output_trained_features}, {output_importance_table}, {use_raster_values}, {number_of_trees}, {minimum_leaf_size}, {maximum_depth}, {sample_size}, {random_variables}, {percentage_for_training}, {output_classification_table}, {output_validation_table}, {compensate_sparse_categories}, {number_validation_runs}, {calculate_uncertainty})

Parameter	Explanation	Data Type
prediction_type	Specifies the operation mode of the tool. The tool can be run to train a model to only assess performance, predict features, or create a prediction surface. TRAIN —A model will be trained, but no predictions will be generated. Use this option to assess the accuracy of your model before generating predictions. This option will output model diagnostics in the messages window and a chart of variable importance. This is the default. PREDICT_FEATURES —Predictions or classifications will be generated for features. Explanatory variables must be provided for both the training features and the features to be predicted. The output of this option will be a feature class, model diagnostics in the messages window, and an optional table and chart of variable importance. PREDICT_RASTER —A prediction raster will be generated for the area where the explanatory rasters intersect. Explanatory rasters must be provided for both the training area and the area to be predicted. The output of this option will be a prediction surface, model diagnostics in the messages window, and an optional table and chart of variable importance.	String
in_features	The feature class containing the variable_predict parameter and, optionally, the explanatory training variables from fields.	Feature Layer
variable_predict (Optional)	The variable from the in_features parameter containing the values to be used to train the model. This field contains known (training) values of the variable that will be used to predict at unknown locations.	Field
treat_variable_as_categorical (Optional)	CATEGORICAL —The variable_predict is a categorical variable and the tool will perform classification. NUMERIC —The variable_predict is continuous and the tool will perform regression. This is the default.	Boolean
explanatory_variables [[Variable, Categorical],...] (Optional)	A list of fields representing the explanatory variables that help predict the value or category of the variable_predict. Specify the variable as true if it represents classes or categories (such as land cover or presence or absence) and false if it is continuous.	Value Table
distance_features [distance_features,...] (Optional)	Automatically creates explanatory variables by calculating a distance from the provided features to the in_features. Distances will be calculated from each of the input distance_features to the nearest in_features. If the input distance_features are polygons or lines, the distance attributes are calculated as the distance between the closest segments of the pair of features.	Feature Layer
explanatory_rasters [[Variable, Categorical],...] (Optional)	Automatically creates explanatory training variables in your model whose values are extracted from rasters. For each feature in the in_features, the value of the raster cell is extracted at that exact location. Bilinear raster resampling is used when extracting the raster value unless it is specified as categorical, in which case nearest neighbor assignment is used. Specify the raster as true for any rasters that represent classes or categories such as land cover or presence or absence and false if the raster is continuous.	Value Table
features_to_predict (Optional)	A feature class representing locations where predictions will be made. This feature class must also contain any explanatory variables provided as fields that correspond to those used from the training data if any.	Feature Layer
output_features (Optional)	The output feature class that will receive the prediction results.	Feature Class
output_raster (Optional)	The output raster containing the prediction results. The default cell size will be the maximum cell size of the raster inputs. To set a different cell size, use the cell size environment setting.	Raster Dataset
explanatory_variable_matching [[Prediction, Training],...] (Optional)	A list of the explanatory_variables specified from the in_features on the right and their corresponding fields from the features_to_predict on the left, for example, [["LandCover2000", "LandCover2010"], ["Income", "PerCapitaIncome"]].	Value Table
explanatory_distance_matching [[Prediction, Training],...] (Optional)	A list of the distance_features specified for the in_features on the right. Corresponding feature sets should be specified for the features_to_predict on the left. Distance features that are more appropriate for the features_to_predict can be provided if those used for training are in a different study area or time period.	Value Table
explanatory_rasters_matching [[Prediction, Training],...] (Optional)	A list of the explanatory_rasters specified for the in_features on the right. Corresponding rasters should be specified for the features_to_predict or output_raster to be created on the left. explanatory_rasters that are more appropriate for the features_to_predict can be provided if those used for training are in a different study area or time period.	Value Table
output_trained_features (Optional)	output_trained_features will contain all explanatory variables used for training (including sampled raster values and distance calculations), as well as the observed variable_predict field and accompanying predictions that can be used to further assess performance of the trained model.	Feature Class
output_importance_table (Optional)	If specified, the table will contain information describing the importance of each explanatory variable (fields, distance features, and rasters) used in the model created.	Table
use_raster_values (Optional)	Specifies how polygons are treated when training the model if the in_features are polygons with a categorical variable_predict and only explanatory_rasters have been specified. TRUE —The polygon is divided into all of the raster cells with centroids falling within the polygon. The raster values at each centroid are then extracted and used to train the model. The model is no longer trained on the polygon itself, but rather the model is trained on the raster values extracted for each cell centroid. This is the default. FALSE —Each polygon is assigned the average value of the underlying continuous rasters and the majority for underlying categorical rasters.	Boolean
number_of_trees (Optional)	The number of trees to create in the forest model. More trees will generally result in more accurate model prediction, but the model will take longer to calculate. The default number of trees is 100.	Long
minimum_leaf_size (Optional)	The minimum number of observations required to keep a leaf (that is, the terminal node on a tree without further splits). The default minimum for regression is 5 and the default for classification is 1. For very large data, increasing these numbers will decrease the run time of the tool.	Long
maximum_depth (Optional)	The maximum number of splits that will be made down a tree. Using a large maximum depth, more splits will be created, which may increase the chances of overfitting the model. The default is data driven and depends on the number of trees created and the number of variables included.	Long
sample_size (Optional)	Specifies the percentage of the in_features used for each decision tree. The default is 100 percent of the data. Each decision tree in the forest is created using a random sample or subset (approximately two-thirds) of the data specified. Using a lower percentage of the input data for each decision tree increases the speed of the tool for very large datasets.	Long
random_variables (Optional)	Specifies the number of explanatory variables used to create each decision tree. Each of the decision trees in the forest is created using a random subset of the explanatory variables specified. Increasing the number of variables used in each decision tree will increase the chances of overfitting your model, particularly if there are one or two dominant variables. A common practice is to use the square root of the total number of explanatory variables (fields, distances, and rasters combined) if your variable_predict is numeric or divide the total number of explanatory variables (fields, distances, and rasters combined) by 3 if variable_predict is categorical.	Long
percentage_for_training (Optional)	Specifies the percentage (between 10 percent and 50 percent) of in_features to reserve as the test dataset for validation. The model will be trained without this random subset of data, and the observed values for those features will be compared to the predicted value. The default is 10 percent.	Double
output_classification_table (Optional)	If specified, creates a confusion matrix for classification summarizing the performance of the model created. This table can be used to calculate other diagnostics beyond the accuracy and sensitivity measures the tool calculates in the output messages.	Table
output_validation_table (Optional)	If the number_validation_runs value specified is greater than 2, this table creates a chart of the distribution of R² for each model. This distribution can be used to assess the stability of your model.	Table
compensate_sparse_categories (Optional)	Specifies whether each category will be represented in each tree, which helps when some categories in your dataset occur less often than others. TRUE —Each tree will include every category that is represented in the training dataset. FALSE —Each tree will be created based on a random sample of the training dataset. This is the default.	Boolean
number_validation_runs (Optional)	The tool will run for the number of iterations specified. The distribution of the R² for each run can be displayed using the output_validation_table parameter. When this is set and predictions are being generated, only the model that produced the highest R² value will be used for predictions.	Long
calculate_uncertainty (Optional)	Specifies whether prediction uncertainty will be calculated when training, predicting to features, or predicting to raster. TRUE — A prediction uncertainty interval will be calculated. FALSE — Uncertainty will not be calculated. This is the default.	Boolean
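
The selection rule described for number_validation_runs, keeping the run with the highest R², can be sketched as follows. The sketch uses the conventional coefficient of determination, which may differ in detail from the value the tool reports, and the observed and predicted values are hypothetical.

```python
def r_squared(observed, predicted):
    """Conventional R²: 1 minus the ratio of residual to total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def best_run(observed, runs):
    """Return the index of the run with the highest R², along with all scores."""
    scores = [r_squared(observed, predicted) for predicted in runs]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores

# Hypothetical held-out observations and the predictions from three validation runs
observed = [1.0, 2.0, 3.0, 4.0]
runs = [
    [2.0, 2.0, 2.0, 2.0],
    [1.1, 2.0, 2.9, 4.2],
    [1.5, 1.5, 3.5, 3.5],
]

index, scores = best_run(observed, runs)
print(index, [round(score, 3) for score in scores])
```

The spread of the scores across runs is what the validation table's chart is intended to show; a wide spread suggests an unstable model.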

Derived Output

Name	Explanation	Data Type
output_uncertainty_raster_layers	When calculate_uncertainty is TRUE, the tool will calculate a 90 percent prediction interval around each predicted value of the variable_predict.	Raster Layer

Code sample

Forest example 1 (Python window)

The following Python script demonstrates how to use the Forest function.

import arcpy
arcpy.env.workspace = r"c:\data"

# Forest-based model run in training mode only. All data comes from a
# single polygon feature class. The tool excludes 10% of the 
# input features from training and uses these values to validate the model.

prediction_type = "TRAIN"
in_features = r"Boston_Vandalism.shp"
variable_predict = "VandCnt"
explanatory_variables = [["Educat", "false"], ["MedAge", "false"], 
    ["HHInc", "false"], ["Pop", "false"]]
output_trained_features = "TrainingFeatures.shp"
number_of_trees = 100
sample_size = 100
percentage_for_training = 10

arcpy.stats.Forest(prediction_type, in_features, variable_predict, None,
    explanatory_variables, None, None, None, None, None, None, None, None,
    output_trained_features, None, True, number_of_trees, None, None, sample_size, 
    None, percentage_for_training)
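
Because the function takes many optional positional arguments, a long run of None placeholders is easy to get wrong. One possible alternative, sketched below, is to collect only the settings you need in a dictionary keyed by the parameter names from the Syntax section. The helper forest_kwargs is hypothetical and is not part of arcpy; it only checks names and drops unset values, so it runs without ArcGIS installed.

```python
# Parameter names copied from the Syntax section, in order
FOREST_PARAMETERS = (
    "prediction_type", "in_features", "variable_predict",
    "treat_variable_as_categorical", "explanatory_variables",
    "distance_features", "explanatory_rasters", "features_to_predict",
    "output_features", "output_raster", "explanatory_variable_matching",
    "explanatory_distance_matching", "explanatory_rasters_matching",
    "output_trained_features", "output_importance_table", "use_raster_values",
    "number_of_trees", "minimum_leaf_size", "maximum_depth", "sample_size",
    "random_variables", "percentage_for_training",
    "output_classification_table", "output_validation_table",
    "compensate_sparse_categories", "number_validation_runs",
    "calculate_uncertainty",
)

def forest_kwargs(**settings):
    """Validate setting names against the Syntax list and drop values left as None."""
    unknown = sorted(set(settings) - set(FOREST_PARAMETERS))
    if unknown:
        raise ValueError(f"Unknown Forest parameter(s): {', '.join(unknown)}")
    return {name: settings[name] for name in FOREST_PARAMETERS
            if settings.get(name) is not None}

kwargs = forest_kwargs(
    prediction_type="TRAIN",
    in_features="Boston_Vandalism.shp",
    variable_predict="VandCnt",
    explanatory_variables=[["Educat", "false"], ["MedAge", "false"]],
    number_of_trees=100,
    maximum_depth=None,  # left unset, so it is dropped
)
print(list(kwargs))
```

In an ArcGIS Pro Python environment, the resulting dictionary could then be passed as arcpy.stats.Forest(**kwargs). A misspelled name such as maximum_level raises an error immediately instead of silently shifting later arguments.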

Forest example 2 (stand-alone script)

The following Python script demonstrates how to use the Forest function to predict to features.

# Import system modules
import arcpy

# Set property to overwrite existing outputs
arcpy.env.overwriteOutput = True

# Set the work space to a gdb
arcpy.env.workspace = r"C:\Data\BostonCrimeDB.gdb"

# Forest-based model taking advantage of both distance features and 
# explanatory rasters. The training and prediction data has been manually
# split so the percentage to exclude parameter was set to 0. A variable importance
# table is created to help assess results and advanced options have been used
# to fine tune the model.

prediction_type = "PREDICT_FEATURES"
in_features = r"Boston_Vandalism_Training"
variable_predict = "Vandalism_Count"
treat_variable_as_categorical = None
explanatory_variables = [["EduClass", "true"], ["MedianAge", "false"],
    ["HouseholdIncome", "false"], ["TotalPopulation", "false"]]
distance_features = r"Boston_Highways"
explanatory_rasters = [["LandUse", "true"]]
features_to_predict = r"Boston_Vandalism_Prediction"
output_features = r"Prediction_Output"
output_raster = None
explanatory_variable_matching = [["EduClass", "EduClass"], ["MedianAge", "MedianAge"], 
    ["HouseholdIncome", "HouseholdIncome"], ["TotalPopulation", "TotalPopulation"]]
explanatory_distance_matching = [["Boston_Highways", "Boston_Highways"]]
explanatory_rasters_matching = [["LandUse", "LandUse"]]
output_trained_features = r"Training_Output"
output_importance_table = r"Variable_Importance"
use_raster_values = True
number_of_trees = 100
minimum_leaf_size = 2
maximum_depth = 5
sample_size = 100
random_variables = 3
percentage_for_training = 0

arcpy.stats.Forest(prediction_type, in_features, variable_predict,
    treat_variable_as_categorical, explanatory_variables, distance_features,
    explanatory_rasters, features_to_predict, output_features, output_raster,
    explanatory_variable_matching, explanatory_distance_matching, 
    explanatory_rasters_matching, output_trained_features, output_importance_table,
    use_raster_values, number_of_trees, minimum_leaf_size, maximum_depth,
    sample_size, random_variables, percentage_for_training)

Forest example 3 (stand-alone script)

The following Python script demonstrates how to use the Forest function to create a prediction surface.

# Import system modules
import arcpy

# Set property to overwrite existing outputs
arcpy.env.overwriteOutput = True

# Set the work space to a gdb
arcpy.env.workspace = r"C:\Data\Landsat.gdb"

# Using a forest-based model to classify a Landsat image. The TrainingPolygons feature 
# class was created manually and is used to train the model to 
# classify the remainder of the Landsat image.

prediction_type = "PREDICT_RASTER"
in_features = r"TrainingPolygons"
variable_predict = "LandClassName"
treat_variable_as_categorical = "CATEGORICAL" 
explanatory_variables = None
distance_features = None
explanatory_rasters = [["Band1", "false"], ["Band2", "false"], ["Band3", "false"]]
features_to_predict = None
output_features = None
output_raster = r"PredictionSurface"
explanatory_variable_matching = None
explanatory_distance_matching = None
explanatory_rasters_matching = [["Band1", "Band1"], ["Band2", "Band2"], ["Band3", "Band3"]]
output_trained_features = None
output_importance_table = None
use_raster_values = True
number_of_trees = 100
minimum_leaf_size = None
maximum_depth = None
sample_size = 100
random_variables = None
percentage_for_training = 10

arcpy.stats.Forest(prediction_type, in_features, variable_predict,
    treat_variable_as_categorical, explanatory_variables, distance_features,
    explanatory_rasters, features_to_predict, output_features, output_raster,
    explanatory_variable_matching, explanatory_distance_matching, 
    explanatory_rasters_matching, output_trained_features, output_importance_table,
    use_raster_values, number_of_trees, minimum_leaf_size, maximum_depth,
    sample_size, random_variables, percentage_for_training)

Environments

Cell Size, Output Coordinate System, Random number generator, Mask, Parallel Processing Factor

Random number generator: The Random Generator Type used is always Mersenne Twister.

Parallel Processing Factor: Parallel processing is only used when predictions are being made.

Licensing information

Basic: Limited
Standard: Limited
Advanced: Limited