Forest-based Classification and Regression (GeoAnalytics)

Summary

Creates models and generates predictions using an adaptation of the random forest algorithm, which is a supervised machine learning method developed by Leo Breiman and Adele Cutler. Predictions can be performed for both categorical variables (classification) and continuous variables (regression). Explanatory variables can take the form of fields in the attribute table of the training features. In addition to validation of model performance based on the training data, predictions can be made to features.

Legacy:

The ArcGIS GeoAnalytics Server extension is being deprecated in ArcGIS Enterprise. The final release of GeoAnalytics Server was included with ArcGIS Enterprise 11.3. This geoprocessing tool is available through ArcGIS Enterprise 11.3 and earlier versions.

Usage

  • This tool creates hundreds of trees, called an ensemble of decision trees, to create a model that can be used for prediction. Each decision tree is created using randomly generated portions of the original (training) data. Each tree generates its own prediction and votes on an outcome. The forest model considers votes from all decision trees to predict or classify the outcome of an unknown sample. This is important, as individual trees may have issues with overfitting a model; however, combining multiple trees in a forest for prediction addresses the overfitting problem associated with a single tree.

  • This tool can be used in two operation modes. The Train mode can be used to evaluate the performance of different models as you explore different explanatory variables and tool settings. Once a good model has been found, you can use the Train and Predict mode.

  • This tool does not support inputs with date only or time only fields.

  • This is a data-driven tool and performs best on large datasets. The tool should be trained on at least several hundred features for best results. It is not an appropriate tool for very small datasets.

  • Input Training Features can be tables, points, lines, or polygon features. This tool does not work with multipart data.

  • Features with one or more null values or empty string values in prediction or explanatory fields will be excluded from the output. You can modify values using the Calculate Field tool if necessary.

  • This tool produces a variety of outputs depending on the following operation modes:

    • Train produces the following two outputs:
      • Output trained features—Contains all of the Input Training Features used in the model created as well as all of the explanatory variables used in the model. It also contains predictions for all of the features used for training the model, which can be helpful in assessing the performance of the model created.
      • Tool summary messages—Messages to help you understand the performance of the model created. The messages include information about model characteristics, variable importance, and validation diagnostics.
    • Train and Predict produces the following three outputs:
      • Output trained features—Contains all of the Input Training Features used in the model created as well as all of the explanatory variables used in the model. It also contains predictions for all of the features used for training the model, which can be helpful in assessing the performance of the model created.
      • Output predicted features—A layer of predicted results. Predictions are applied to the layer to predict (use the Input Prediction Features option) using the model generated from the training layer.
      • Tool summary messages—Messages to help you understand the performance of the model created. The messages include information about model characteristics, variable importance, and validation diagnostics.

  • You can use the Create Variable Importance Table parameter to create a table to display a chart of variable importance for evaluation. The top 20 variable importance values are also reported in the messages window.

  • Explanatory variables can come from fields and should contain a variety of values. If the explanatory variable is categorical, the Categorical check box should be checked (variables of type string will automatically be checked). Categorical explanatory variables are limited to 60 unique values, though a smaller number of categories will improve model performance. For a given data size, the more categories a variable contains, the more likely it is that it will dominate the model and lead to less effective prediction results.

  • When matching explanatory variables, the Training Field and Prediction Field must have fields that are the same type (a double field in Training Field must be matched to a double field in Prediction Field for example).

  • Forest-based models do not extrapolate; they can only classify or predict to a value that the model was trained on. Train the model with training features and explanatory variables that are within the range of your target features and variables. The tool will fail if categories exist in the prediction explanatory variables that are not present in the training features.

  • The default value for the Number of Trees parameter is 100. Increasing the number of trees in the forest model will result in more accurate model prediction, but the model will take longer to calculate.

  • A single layer for training and a single layer for prediction are supported. To combine multiple datasets into one, use the Build Multi-Variable Grid and Enrich from Multi-Variable Grid tools to generate input data.

  • This geoprocessing tool is powered by ArcGIS GeoAnalytics Server. Analysis is completed on GeoAnalytics Server, and results are stored in your content in ArcGIS Enterprise.

  • When running GeoAnalytics Server tools, the analysis is completed on GeoAnalytics Server. For optimal performance, make data available to GeoAnalytics Server through feature layers hosted on your ArcGIS Enterprise portal or through big data file shares. Data that is not local to GeoAnalytics Server will be moved to GeoAnalytics Server before analysis begins. This means that it will take longer to run a tool and, in some cases, moving the data from ArcGIS Pro to GeoAnalytics Server may fail. The threshold for failure depends on your network speeds, as well as the size and complexity of the data. It is recommended that you always share your data or create a big data file share.

    Learn more about sharing data to your portal

    Learn more about creating a big data file share through Server Manager

Parameters

LabelExplanationData Type
Prediction Type

Specifies the operation mode of the tool. The tool can be run to train a model to only assess performance, predict features, or create a prediction surface.

  • Train onlyA model will be trained, but no predictions will be generated. Use this option to assess the accuracy of your model before generating predictions. This option will output model diagnostics in the messages window and a chart of variable importance. This is the default
  • Train and PredictPredictions or classifications will be generated for features. Explanatory variables must be provided for both the training features and the features to be predicted. The output of this option will be a feature class, model diagnostics in the messages window, and an optional table of variable importance.
String
Input Training Features

The layercontaining the Variable to Predict parameter and the explanatory training variables fields.

Record Set
Output Features Name
(Optional)

The output feature layer name.

String
Variable to Predict
(Optional)

The variable from the Input Training Features parameter containing the values to be used to train the model. This field contains known (training) values of the variable that will be used to predict at unknown locations.

Field
Treat Variable as Categorical
(Optional)

Specifies whether Variable to Predict is a categorical variable.

  • Checked—Variable to Predict is a categorical variable and the tool will perform classification.
  • Unchecked—Variable to Predict is continuous and the tool will perform regression. This is the default.
Boolean
Explanatory Variables
(Optional)

A list of fields representing the explanatory variables that help predict the value or category of Variable to Predict. Check the Categorical check box for any variables that represent classes or categories (such as land cover or presence or absence).

Value Table
Create Variable Importance Table
(Optional)

Specifies whether the output table will contain information describing the importance of each explanatory variable used in the model.

  • Checked—The output table will contain information for each explanatory variable.
  • Unchecked—The output table will not contain information for each explanatory variable. This is the default.
Boolean
Input Prediction Features
(Optional)

A feature layer representing locations where predictions will be made. This feature layer must also contain any explanatory variables provided as fields that correspond to those used from the training data.

Record Set
Match Explanatory Variables
(Optional)

A list of Explanatory Variables specified from Input Training Features on the right and their corresponding fields from Input Prediction Features on the left.

Value Table
Number of Trees
(Optional)

The number of trees to create in the forest model. More trees will generally result in more accurate model prediction, but the model will take longer to calculate. The default number of trees is 100.

Long
Minimum Leaf Size
(Optional)

The minimum number of observations required to keep a leaf (that is, the terminal node on a tree without further splits). The default minimum for regression is 5, and the default for classification is 1. For very large data, increasing these numbers will decrease the run time of the tool.

Long
Maximum Tree Depth
(Optional)

The maximum number of splits that will be made down a tree. Using a large maximum depth, more splits will be created, which may increase the chances of overfitting the model. The default is data driven and depends on the number of trees created and the number of variables included.

Long
Data Available per Tree (%)
(Optional)

The percentage of Input Training Features used for each decision tree. The default is 100 percent of the data. Samples for each tree are taken randomly from two-thirds of the data specified.

Each decision tree in the forest is created using a random sample or subset (approximately two-thirds) of the training data available. Using a lower percentage of the input data for each decision tree increases the speed of the tool for very large datasets.

Long
Number of Randomly Sampled Variables
(Optional)

The number of explanatory variables used to create each decision tree.

Each decision tree in the forest is created using a random subset of the explanatory variables specified. Increasing the number of variables used in each decision tree will increase the chances of overfitting your model, particularly if there is one or more dominant variables. A common practice is to use the square root of the total number of explanatory variables if Variable to Predict is numeric, or divide the total number of explanatory variables by 3 if Variable to Predict is categorical.

Long
Training Data Excluded for Validation (%)
(Optional)

The percentage (between 10 percent and 50 percent) of Input Training Features to reserve as the test dataset for validation. The model will be trained without this random subset of data, and the observed values for those features will be compared to the predicted values. The default is 10 percent.

Long
Data Store
(Optional)

Specifies the ArcGIS Data Store where the output will be stored. All results stored in a spatiotemporal big data store will be stored in WGS84. Results stored in a relational data store will maintain their coordinate system.

  • Spatiotemporal big data storeOutput will be stored in a spatiotemporal big data store. This is the default.
  • Relational data storeOutput will be stored in a relational data store.
String

Derived Output

LabelExplanationData Type
Output Trained Features

The output containing the input variables used for training, as well as the observed variable to predict parameter, and the accompanying predictions that can be used to further assess the performance of the model.

Record Set
Variable of Importance Table

A table containing information describing the importance of each explanatory variable to be used in the created model.

Record Set
Output Predicted Features

The layer that will receive the predictions of the model.

Record Set

arcpy.geoanalytics.Forest(prediction_type, in_features, {output_trained_name}, {variable_predict}, {treat_variable_as_categorical}, {explanatory_variables}, {create_variable_importance_table}, {features_to_predict}, {explanatory_variable_matching}, {number_of_trees}, {minimum_leaf_size}, {maximum_tree_depth}, {sample_size}, {random_variables}, {percentage_for_validation}, {data_store})
NameExplanationData Type
prediction_type

Specifies the operation mode of the tool. The tool can be run to train a model to only assess performance, predict features, or create a prediction surface.

  • TRAINA model will be trained, but no predictions will be generated. Use this option to assess the accuracy of your model before generating predictions. This option will output model diagnostics in the messages window and a chart of variable importance. This is the default
  • TRAIN_AND_PREDICTPredictions or classifications will be generated for features. Explanatory variables must be provided for both the training features and the features to be predicted. The output of this option will be a feature class, model diagnostics in the messages window, and an optional table of variable importance.
String
in_features

The feature class containing the variable_predict parameter and the explanatory training variables fields.

Record Set
output_trained_name
(Optional)

The output feature layer name.

String
variable_predict
(Optional)

The variable from the in_features parameter containing the values to be used to train the model. This field contains known (training) values of the variable that will be used to predict at unknown locations.

Field
treat_variable_as_categorical
(Optional)
  • CATEGORICALvariable_predict is a categorical variable and the tool will perform classification.
  • NUMERICvariable_predict is continuous and the tool will perform regression. This is the default.
Boolean
explanatory_variables
[[Variable, Categorical],...]
(Optional)

A list of fields representing the explanatory variables that help predict the value or category of variable_predict. Use the treat_variable_as_categorical parameter for any variables that represent classes or categories (such as land cover or presence or absence). Specify the variable as true for any that represent classes or categories such as land cover or presence or absence and false if the variable is continuous.

Value Table
create_variable_importance_table
(Optional)

Specifies whether the output table will contain information describing the importance of each explanatory variable used in the model.

  • CREATE_TABLEThe output table will contain information for each explanatory variable.
  • NO_TABLEThe output table will not contain information for each explanatory variable. This is the default.
Boolean
features_to_predict
(Optional)

A feature layer representing locations where predictions will be made. This feature layer must also contain any explanatory variables provided as fields that correspond to those used from the training data.

Record Set
explanatory_variable_matching
[[Prediction, Training],...]
(Optional)

A list of explanatory_variables specified from in_features on the right and their corresponding fields from features_to_predict on the left, for example, [["LandCover2000", "LandCover2010"], ["Income", "PerCapitaIncome"]].

Value Table
number_of_trees
(Optional)

The number of trees to create in the forest model. More trees will generally result in more accurate model prediction, but the model will take longer to calculate. The default number of trees is 100.

Long
minimum_leaf_size
(Optional)

The minimum number of observations required to keep a leaf (that is, the terminal node on a tree without further splits). The default minimum for regression is 5, and the default for classification is 1. For very large data, increasing these numbers will decrease the run time of the tool.

Long
maximum_tree_depth
(Optional)

The maximum number of splits that will be made down a tree. Using a large maximum depth, more splits will be created, which may increase the chances of overfitting the model. The default is data driven and depends on the number of trees created and the number of variables included.

Long
sample_size
(Optional)

The percentage of in_features used for each decision tree. The default is 100 percent of the data. Samples for each tree are taken randomly from two-thirds of the data specified.

Each decision tree in the forest is created using a random sample or subset (approximately two-thirds) of the training data available. Using a lower percentage of the input data for each decision tree increases the speed of the tool for very large datasets.

Long
random_variables
(Optional)

The number of explanatory variables used to create each decision tree.

Each decision tree in the forest is created using a random subset of the explanatory variables specified. Increasing the number of variables used in each decision tree will increase the chances of overfitting your model, particularly if there is one or more dominant variables. A common practice is to use the square root of the total number of explanatory variables if variable_predict is numeric, or divide the total number of explanatory variables by 3 if variable_predict is categorical.

Long
percentage_for_validation
(Optional)

The percentage (between 10 percent and 50 percent) of in_features to reserve as the test dataset for validation. The model will be trained without this random subset of data, and the observed values for those features will be compared to the predicted value. The default is 10 percent.

Long
data_store
(Optional)

Specifies the ArcGIS Data Store where the output will be stored. All results stored in a spatiotemporal big data store will be stored in WGS84. Results stored in a relational data store will maintain their coordinate system.

  • SPATIOTEMPORAL_DATA_STOREOutput will be stored in a spatiotemporal big data store. This is the default.
  • RELATIONAL_DATA_STOREOutput will be stored in a relational data store.
String

Derived Output

NameExplanationData Type
output_trained

The output containing the input variables used for training, as well as the observed variable to predict parameter, and the accompanying predictions that can be used to further assess the performance of the model.

Record Set
variable_of_importance

A table containing information describing the importance of each explanatory variable to be used in the created model.

Record Set
output_predicted

The layer that will receive the predictions of the model.

Record Set

Code sample

Forest example (Python window)

The following Python window script demonstrates how to use the ForestBasedClassificationAndRegression tool.

In this script, network features are described and a sample layer of 2500 features is created.

#-------------------------------------------------------------------------------
# Name: Forest.py
# Description: Run Forest on sales data from 1980 and predict for sales in 1981
#
# Requirements: ArcGIS GeoAnalytics Server

# Import system modules
import arcpy
arcpy.env.workspace = "c:/data/commercial.gdb"
# Set local variables
trainingDataset = "https://analysis.org.com/server/rest/services/Hosted/sales/FeatureServer/0"
predictionDataset = "https://analysis.org.com/server/rest/services/Hosted/next_year/FeatureServer/0"
outputName = "training"
outputPredictedName = "predicted"

# Execute Forest
arcpy.geoanalytics.Forest("TRAIN_AND_PREDICT", inputDataset, outputName, "PERIMETER", None, ""STORE_CATEGORY true;AVG_INCOME false;POPULATION false", None, predictionDataset, 
"STORE_CATEGORY STORE_CATEGORY;AVG_INCOME MEAN_INCOME; POPULATION POPULATION", 100, , , 120, , 10, "SPATIOTEMPORAL_DATA_STORE")

Environments

Special cases

Output Coordinate System

The coordinate system that will be used for analysis. Analysis will be completed in the input coordinate system unless specified by this parameter. For GeoAnalytics Tools, final results will be stored in the spatiotemporal data store in WGS84.

Licensing information

  • Basic: Requires ArcGIS GeoAnalytics Server
  • Standard: Requires ArcGIS GeoAnalytics Server
  • Advanced: Requires ArcGIS GeoAnalytics Server

Related topics