Forest-based Classification and Regression (GeoAnalytics Desktop)

Summary

Creates models and generates predictions using an adaptation of the random forest algorithm, which is a supervised machine learning method developed by Leo Breiman and Adele Cutler. Predictions can be performed for both categorical variables (classification) and continuous variables (regression). Explanatory variables can take the form of fields in the attribute table of the training features. In addition to validating model performance based on the training data, the tool can make predictions for features.
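
To make the distinction between classification and regression concrete, the following minimal sketch shows the conventional way a forest combines per-tree predictions: a majority vote for categories and an average for continuous values. It is a plain Python illustration, not the tool's internal implementation; the function name combine_tree_predictions and the sample values are hypothetical.

```python
from collections import Counter
from statistics import mean


def combine_tree_predictions(tree_predictions, categorical):
    """Combine the predictions of individual trees into one forest prediction.

    Illustrative only: a majority vote for classification and a mean for
    regression, which is the conventional random forest aggregation.
    """
    if not tree_predictions:
        raise ValueError("At least one tree prediction is required.")
    if categorical:
        # most_common(1) returns [(value, count)] for the most frequent vote.
        return Counter(tree_predictions).most_common(1)[0][0]
    return mean(tree_predictions)


# Hypothetical predictions from five trees for one unknown sample.
votes = ["retail", "office", "retail", "retail", "office"]
print(combine_tree_predictions(votes, categorical=True))   # retail
print(combine_tree_predictions([10.0, 12.0, 11.0, 13.0, 14.0], categorical=False))  # 12.0
```

Because each tree sees a different random portion of the data, the combined result tends to be less sensitive to the quirks of any single tree.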

Usage

  • This tool creates hundreds of trees, called an ensemble of decision trees, to create a model that can be used for prediction. Each decision tree is created using randomly generated portions of the original (training) data. Each tree generates its own prediction and votes on an outcome. The forest model considers votes from all decision trees to predict or classify the outcome of an unknown sample. This is important, as individual trees may have issues with overfitting a model; however, combining multiple trees in a forest for prediction addresses the overfitting problem associated with a single tree.

  • This tool can be used in two operation modes. The Train mode can be used to evaluate the performance of different models as you explore different explanatory variables and tool settings. Once a good model has been found, you can use the Train and Predict mode.

  • Input Training Features can be tables, points, lines, or polygon features. This tool does not work with multipart data.

  • Features with one or more null values or empty string values in prediction or explanatory fields will be excluded from the output. You can modify values using the Calculate Field tool if necessary.

  • This tool produces a variety of outputs depending on the following operation modes:

    • Train produces the following two outputs:
      • Output trained features—Contains all of the Input Training Features used in the model created as well as all of the explanatory variables used in the model. It also contains predictions for all of the features used for training the model, which can be helpful in assessing the performance of the model created.
      • Tool summary messages—Messages to help you understand the performance of the model created. The messages include information about model characteristics, variable importance, and validation diagnostics.
    • Train and Predict produces the following three outputs:
      • Output trained features—Contains all of the Input Training Features used in the model created as well as all of the explanatory variables used in the model. It also contains predictions for all of the features used for training the model, which can be helpful in assessing the performance of the model created.
      • Output predicted features—A layer of predicted results. Predictions are applied to the layer to predict (use the Input Prediction Features option) using the model generated from the training layer.
      • Tool summary messages—Messages to help you understand the performance of the model created. The messages include information about model characteristics, variable importance, and validation diagnostics.

  • You can use the Variable of Importance Table parameter to create a table to display a chart of variable importance for evaluation. The top 20 variable importance values are also reported in the messages window.

  • Explanatory variables can come from fields and should contain a variety of values. If the explanatory variable is categorical, the Categorical check box should be checked (variables of type string will automatically be checked). Categorical explanatory variables are limited to 60 unique values, though a smaller number of categories will improve model performance. For a given data size, the more categories a variable contains, the more likely it is that it will dominate the model and lead to less effective prediction results.

  • When matching explanatory variables, the fields specified for Training Field and Prediction Field must be of the same type (for example, a double field in Training Field must be matched to a double field in Prediction Field).

  • Forest-based models do not extrapolate; they can only classify or predict to a value that the model was trained on. Train the model with training features and explanatory variables that are within the range of your target features and variables. The tool will fail if categories exist in the prediction explanatory variables that are not present in the training features.

  • The default value for the Number of Trees parameter is 100. Increasing the number of trees in the forest model will generally result in more accurate model prediction, but the model will take longer to calculate.

  • A single layer for training and a single layer for prediction are supported. To combine multiple datasets into one, use the Build Multi-Variable Grid and Enrich from Multi-Variable Grid tools to generate input data.

  • This geoprocessing tool is powered by Spark. Analysis is completed on your desktop machine using multiple cores in parallel. See Considerations for GeoAnalytics Desktop tools to learn more about running analysis.

  • When running GeoAnalytics Desktop tools, the analysis is completed on your desktop machine. For optimal performance, data should be available on your desktop. If you are using a hosted feature layer, it is recommended that you use ArcGIS GeoAnalytics Server. If your data isn't local, it will take longer to run a tool. To use your ArcGIS GeoAnalytics Server to perform analysis, see GeoAnalytics Tools.
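
Several of the usage notes above describe conditions that can be checked before the tool is run: records with null or empty values are excluded, categorical variables are limited to 60 unique values, and prediction data must not contain categories absent from the training data. The following is a minimal sketch of such checks on plain Python dictionaries. The helper names and the sample records are hypothetical, and the tool itself performs its own validation; this only illustrates the rules.

```python
MAX_CATEGORIES = 60  # Limit on unique values per categorical variable, per the usage notes.


def drop_incomplete(rows, fields):
    """Return only the rows with no None or empty-string value in the given fields."""
    return [r for r in rows if all(r.get(f) not in (None, "") for f in fields)]


def check_categories(training_rows, prediction_rows, field_pairs):
    """Report categorical problems before running the tool.

    field_pairs maps a training field name to its prediction field name.
    Returns a list of problem descriptions (empty if none were found).
    """
    problems = []
    for train_field, predict_field in field_pairs.items():
        seen = {r[train_field] for r in training_rows}
        if len(seen) > MAX_CATEGORIES:
            problems.append(f"{train_field}: {len(seen)} unique values exceeds {MAX_CATEGORIES}")
        unseen = {r[predict_field] for r in prediction_rows} - seen
        if unseen:
            problems.append(f"{predict_field}: categories not in training data: {sorted(unseen)}")
    return problems


# Hypothetical records standing in for attribute table rows.
training = [
    {"STORE_CATEGORY": "grocery", "AVG_INCOME": 52000.0},
    {"STORE_CATEGORY": "hardware", "AVG_INCOME": None},  # dropped: null value
    {"STORE_CATEGORY": "", "AVG_INCOME": 61000.0},       # dropped: empty string
]
prediction = [
    {"STORE_CATEGORY": "grocery", "MEAN_INCOME": 50000.0},
    {"STORE_CATEGORY": "pharmacy", "MEAN_INCOME": 58000.0},
]

clean_training = drop_incomplete(training, ["STORE_CATEGORY", "AVG_INCOME"])
clean_prediction = drop_incomplete(prediction, ["STORE_CATEGORY", "MEAN_INCOME"])
print(len(clean_training))
print(check_categories(clean_training, clean_prediction, {"STORE_CATEGORY": "STORE_CATEGORY"}))
```

In this sample, dropping the incomplete training records also removes the only "hardware" record, which shows how null handling can shrink the set of categories the model is trained on.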

Parameters

Label | Explanation | Data Type
Prediction Type

Specifies the operation mode of the tool. The tool can be run to train a model to only assess performance, or to train a model and predict features.

  • Train only—A model will be trained, but no predictions will be generated. Use this option to assess the accuracy of your model before generating predictions. This option will output model diagnostics in the messages window and a chart of variable importance. This is the default.
  • Train and Predict—Predictions or classifications will be generated for features. Explanatory variables must be provided for both the training features and the features to be predicted. The output of this option will be a feature class, model diagnostics in the messages window, and an optional table of variable importance.
String
Input Training Features

The layer containing the Variable to Predict field and the explanatory training variable fields.

Table View
Output Trained Features
(Optional)

The output feature layer name.

Table;Feature Class
Variable to Predict
(Optional)

The variable from the Input Training Features parameter containing the values to be used to train the model. This field contains known (training) values of the variable that will be used to predict at unknown locations.

Field
Treat Variable as Categorical
(Optional)

Specifies whether Variable to Predict is a categorical variable.

  • Checked—Variable to Predict is a categorical variable and the tool will perform classification.
  • Unchecked—Variable to Predict is continuous and the tool will perform regression. This is the default.
Boolean
Explanatory Variables
(Optional)

A list of fields representing the explanatory variables that help predict the value or category of Variable to Predict. Check the Categorical check box for any variables that represent classes or categories (such as land cover or presence or absence).

Value Table
Input Prediction Features
(Optional)

A feature layer representing locations where predictions will be made. This feature layer must also contain any explanatory variables provided as fields that correspond to those used from the training data.

Table View
Variable of Importance Table
(Optional)

A table containing information describing the importance of each explanatory variable to be used in the created model.

Table
Output Predicted Features
(Optional)

The output feature class that will receive the prediction results.

Table;Feature Class
Match Explanatory Variables
(Optional)

A list of Explanatory Variables specified from Input Training Features on the right and their corresponding fields from Input Prediction Features on the left.

Value Table
Number of Trees
(Optional)

The number of trees to create in the forest model. More trees will generally result in more accurate model prediction, but the model will take longer to calculate. The default number of trees is 100.

Long
Minimum Leaf Size
(Optional)

The minimum number of observations required to keep a leaf (that is, the terminal node on a tree without further splits). The default minimum for regression is 5, and the default for classification is 1. For very large data, increasing these numbers will decrease the run time of the tool.

Long
Maximum Tree Depth
(Optional)

The maximum number of splits that will be made down a tree. Using a large maximum depth, more splits will be created, which may increase the chances of overfitting the model. The default is data driven and depends on the number of trees created and the number of variables included.

Long
Data Available per Tree (%)
(Optional)

The percentage of Input Training Features used for each decision tree. The default is 100 percent of the data. Samples for each tree are taken randomly from two-thirds of the data specified.

Each decision tree in the forest is created using a random sample or subset (approximately two-thirds) of the training data available. Using a lower percentage of the input data for each decision tree increases the speed of the tool for very large datasets.

Long
Number of Randomly Sampled Variables
(Optional)

The number of explanatory variables used to create each decision tree.

Each decision tree in the forest is created using a random subset of the explanatory variables specified. Increasing the number of variables used in each decision tree will increase the chances of overfitting your model, particularly if there are one or more dominant variables. A common practice is to use the square root of the total number of explanatory variables if Variable to Predict is numeric, or divide the total number of explanatory variables by 3 if Variable to Predict is categorical.

Long
Training Data Excluded for Validation (%)
(Optional)

The percentage (between 10 percent and 50 percent) of Input Training Features to reserve as the test dataset for validation. The model will be trained without this random subset of data, and the observed values for those features will be compared to the predicted values. The default is 10 percent.

Long
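
The Training Data Excluded for Validation (%) parameter describes a holdout: a random subset is set aside, the model is trained on the rest, and observed values are compared to predicted values for the reserved features. The following minimal sketch shows that idea in plain Python. The function names are hypothetical, the fixed seed is only there to make the illustration repeatable, and the share of matching values is a stand-in measure chosen for simplicity; the diagnostics the tool actually reports are not specified here.

```python
import random


def split_for_validation(rows, percentage=10, seed=0):
    """Randomly reserve a percentage of rows as a test set.

    Returns (training_rows, test_rows). The 10 to 50 range mirrors the
    parameter description above.
    """
    if not 10 <= percentage <= 50:
        raise ValueError("percentage must be between 10 and 50")
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * percentage / 100)
    return shuffled[n_test:], shuffled[:n_test]


def share_correct(observed, predicted):
    """Fraction of test features whose predicted value equals the observed value."""
    matches = sum(1 for o, p in zip(observed, predicted) if o == p)
    return matches / len(observed)


rows = list(range(200))  # stand-in for 200 training features
train, test = split_for_validation(rows, percentage=10)
print(len(train), len(test))  # 180 20

# Hypothetical observed and predicted categories for four reserved features.
print(share_correct(["a", "b", "a", "c"], ["a", "b", "c", "c"]))  # 0.75
```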

arcpy.gapro.Forest(prediction_type, in_features, {output_trained_features}, {variable_predict}, {treat_variable_as_categorical}, {explanatory_variables}, {features_to_predict}, {variable_of_importance}, {output_predicted}, {explanatory_variable_matching}, {number_of_trees}, {minimum_leaf_size}, {maximum_tree_depth}, {sample_size}, {random_variables}, {percentage_for_validation})
Name | Explanation | Data Type
prediction_type

Specifies the operation mode of the tool. The tool can be run to train a model to only assess performance, or to train a model and predict features.

  • TRAIN—A model will be trained, but no predictions will be generated. Use this option to assess the accuracy of your model before generating predictions. This option will output model diagnostics in the messages window and a chart of variable importance. This is the default.
  • TRAIN_AND_PREDICT—Predictions or classifications will be generated for features. Explanatory variables must be provided for both the training features and the features to be predicted. The output of this option will be a feature class, model diagnostics in the messages window, and an optional table of variable importance.
String
in_features

The feature class containing the variable_predict field and the explanatory training variable fields.

Table View
output_trained_features
(Optional)

The output feature layer name.

Table;Feature Class
variable_predict
(Optional)

The variable from the in_features parameter containing the values to be used to train the model. This field contains known (training) values of the variable that will be used to predict at unknown locations.

Field
treat_variable_as_categorical
(Optional)

Specifies whether variable_predict is a categorical variable.

  • CATEGORICAL—variable_predict is a categorical variable and the tool will perform classification.
  • NUMERIC—variable_predict is continuous and the tool will perform regression. This is the default.
Boolean
explanatory_variables
[[Variable, Categorical],...]
(Optional)

A list of fields representing the explanatory variables that help predict the value or category of variable_predict. For each variable, specify true if it represents classes or categories (such as land cover or presence or absence) and false if it is continuous.

Value Table
features_to_predict
(Optional)

A feature layer representing locations where predictions will be made. This feature layer must also contain any explanatory variables provided as fields that correspond to those used from the training data.

Table View
variable_of_importance
(Optional)

A table containing information describing the importance of each explanatory variable to be used in the created model.

Table
output_predicted
(Optional)

The output feature class that will receive the prediction results.

Table;Feature Class
explanatory_variable_matching
[[Prediction, Training],...]
(Optional)

A list of explanatory_variables specified from in_features on the right and their corresponding fields from features_to_predict on the left, for example, [["LandCover2000", "LandCover2010"], ["Income", "PerCapitaIncome"]].

Value Table
number_of_trees
(Optional)

The number of trees to create in the forest model. More trees will generally result in more accurate model prediction, but the model will take longer to calculate. The default number of trees is 100.

Long
minimum_leaf_size
(Optional)

The minimum number of observations required to keep a leaf (that is, the terminal node on a tree without further splits). The default minimum for regression is 5, and the default for classification is 1. For very large data, increasing these numbers will decrease the run time of the tool.

Long
maximum_tree_depth
(Optional)

The maximum number of splits that will be made down a tree. Using a large maximum depth, more splits will be created, which may increase the chances of overfitting the model. The default is data driven and depends on the number of trees created and the number of variables included.

Long
sample_size
(Optional)

The percentage of in_features used for each decision tree. The default is 100 percent of the data. Samples for each tree are taken randomly from two-thirds of the data specified.

Each decision tree in the forest is created using a random sample or subset (approximately two-thirds) of the training data available. Using a lower percentage of the input data for each decision tree increases the speed of the tool for very large datasets.

Long
random_variables
(Optional)

The number of explanatory variables used to create each decision tree.

Each decision tree in the forest is created using a random subset of the explanatory variables specified. Increasing the number of variables used in each decision tree will increase the chances of overfitting your model, particularly if there are one or more dominant variables. A common practice is to use the square root of the total number of explanatory variables if variable_predict is numeric, or divide the total number of explanatory variables by 3 if variable_predict is categorical.

Long
percentage_for_validation
(Optional)

The percentage (between 10 percent and 50 percent) of in_features to reserve as the test dataset for validation. The model will be trained without this random subset of data, and the observed values for those features will be compared to the predicted values. The default is 10 percent.

Long
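
The two value table parameters are easy to get out of order, so the following helper is a minimal sketch of how they might be assembled from one list of variables. It follows the list forms given in the syntax above, [[Variable, Categorical],...] and [[Prediction, Training],...], and applies the rule that matched fields must share a type. The helper name, the field-type dictionaries, and the use of the strings "true" and "false" for the Categorical column are assumptions made for illustration.

```python
def build_value_tables(variables, training_types, prediction_types):
    """Build explanatory_variables and explanatory_variable_matching lists.

    variables: list of (training_field, prediction_field, is_categorical).
    training_types / prediction_types: field name -> field type string.
    Raises ValueError when a matched pair does not share a field type.
    """
    explanatory_variables = []
    matching = []
    for train_field, predict_field, is_categorical in variables:
        t_type = training_types[train_field]
        p_type = prediction_types[predict_field]
        if t_type != p_type:
            raise ValueError(
                f"{train_field} ({t_type}) and {predict_field} ({p_type}) differ in type"
            )
        # [[Variable, Categorical], ...]
        explanatory_variables.append([train_field, "true" if is_categorical else "false"])
        # [[Prediction, Training], ...]: prediction field first, training field second.
        matching.append([predict_field, train_field])
    return explanatory_variables, matching


# Hypothetical field types for a training layer and a prediction layer.
training_types = {"STORE_CATEGORY": "String", "AVG_INCOME": "Double"}
prediction_types = {"STORE_CATEGORY": "String", "MEAN_INCOME": "Double"}

explanatory, matching = build_value_tables(
    [("STORE_CATEGORY", "STORE_CATEGORY", True), ("AVG_INCOME", "MEAN_INCOME", False)],
    training_types,
    prediction_types,
)
print(explanatory)  # [['STORE_CATEGORY', 'true'], ['AVG_INCOME', 'false']]
print(matching)     # [['STORE_CATEGORY', 'STORE_CATEGORY'], ['MEAN_INCOME', 'AVG_INCOME']]
```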

Code sample

Forest example (Python window)

The following Python window script demonstrates how to use the Forest tool.

This script runs Forest on sales data from 1980 and predicts sales for 1981.

#-------------------------------------------------------------------------------
# Name: Forest.py
# Description: Run Forest on sales data from 1980 and predict for sales in 1981
#
# Requirements: Advanced License

# Import system modules
import arcpy
arcpy.env.workspace = "c:/data/commercial.gdb"   

# Set local variables
trainingDataset = "sales"
predictionDataset = "next_year"
outputName = "training"
outputPredictedName = "predicted"

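# Execute Forest
# Keyword arguments follow the parameter names in the syntax above.
# Each matching pair lists the prediction field first, then the training field.
arcpy.gapro.Forest(
    prediction_type="TRAIN_AND_PREDICT",
    in_features=trainingDataset,
    output_trained_features=outputName,
    variable_predict="PERIMETER",
    explanatory_variables="STORE_CATEGORY true;AVG_INCOME false;POPULATION false",
    features_to_predict=predictionDataset,
    output_predicted=outputPredictedName,
    explanatory_variable_matching="STORE_CATEGORY STORE_CATEGORY;MEAN_INCOME AVG_INCOME;POPULATION POPULATION",
    number_of_trees=100,
    percentage_for_validation=10)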
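
The usage notes suggest a two-step workflow: evaluate models in the Train mode, then switch to Train and Predict once a good model has been found. The following is a minimal sketch of that workflow, assuming the same datasets and fields as the sample above. The helper names and the variable importance table name "importance" are hypothetical. The arcpy import is kept inside run_forest so that the argument dictionary can be inspected on a machine without ArcGIS Pro.

```python
def forest_kwargs(predict=False):
    """Return keyword arguments for arcpy.gapro.Forest for either operation mode."""
    kwargs = {
        "prediction_type": "TRAIN",
        "in_features": "sales",
        "output_trained_features": "training",
        "variable_predict": "PERIMETER",
        "explanatory_variables": "STORE_CATEGORY true;AVG_INCOME false;POPULATION false",
        "variable_of_importance": "importance",  # hypothetical output table name
        "number_of_trees": 100,
        "percentage_for_validation": 10,
    }
    if predict:
        kwargs.update({
            "prediction_type": "TRAIN_AND_PREDICT",
            "features_to_predict": "next_year",
            "output_predicted": "predicted",
            # Pairs follow [[Prediction, Training],...]: prediction field first.
            "explanatory_variable_matching": (
                "STORE_CATEGORY STORE_CATEGORY;MEAN_INCOME AVG_INCOME;POPULATION POPULATION"
            ),
        })
    return kwargs


def run_forest(predict=False):
    """Run the tool; requires ArcGIS Pro with an Advanced license."""
    import arcpy

    arcpy.env.workspace = "c:/data/commercial.gdb"
    return arcpy.gapro.Forest(**forest_kwargs(predict))


# Step 1: call run_forest(predict=False) and review the diagnostics in the messages.
# Step 2: once the model looks acceptable, call run_forest(predict=True).
print(forest_kwargs(predict=False)["prediction_type"])  # TRAIN
print(forest_kwargs(predict=True)["prediction_type"])   # TRAIN_AND_PREDICT
```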

Licensing information

  • Basic: No
  • Standard: No
  • Advanced: Yes

Related topics