Forest-based Classification and Regression (GeoAnalytics)

ArcGIS Pro 3.3


Creates models and generates predictions using an adaptation of the random forest algorithm, a supervised machine learning method developed by Leo Breiman and Adele Cutler. Predictions can be performed for both categorical variables (classification) and continuous variables (regression). Explanatory variables take the form of fields in the attribute table of the training features. In addition to validating model performance based on the training data, the tool can make predictions for features.
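
As a rough illustration of how a forest combines the outputs of its trees, the sketch below takes a majority vote for classification and an average for regression. The function names and sample values are hypothetical, and averaging for regression is the usual random forest convention assumed here, not a documented detail of this tool.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_votes):
    """Return the category predicted by the largest number of trees."""
    return Counter(tree_votes).most_common(1)[0][0]


def forest_regress(tree_predictions):
    """Return the average of the per-tree numeric predictions (assumed convention)."""
    return mean(tree_predictions)


# Hypothetical per-tree outputs for a single unknown feature.
category_votes = ["wetland", "forest", "forest", "urban", "forest"]
numeric_predictions = [10.0, 12.0, 14.0]

predicted_category = forest_classify(category_votes)
predicted_value = forest_regress(numeric_predictions)
print(predicted_category, predicted_value)
```

Because each tree sees a different random portion of the data, a single overfit tree is outvoted or averaged out by the rest of the ensemble.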


  • This tool creates hundreds of trees, called an ensemble of decision trees, to create a model that can be used for prediction. Each decision tree is created using randomly generated portions of the original (training) data. Each tree generates its own prediction and votes on an outcome. The forest model considers votes from all decision trees to predict or classify the outcome of an unknown sample. This is important, as individual trees may have issues with overfitting a model; however, combining multiple trees in a forest for prediction addresses the overfitting problem associated with a single tree.

  • This tool can be used in two operation modes. The Train mode can be used to evaluate the performance of different models as you explore different explanatory variables and tool settings. Once a good model has been found, you can use the Train and Predict mode.

  • This tool does not support inputs with date only or time only fields.

  • This is a data-driven tool and performs best on large datasets. The tool should be trained on at least several hundred features for best results. It is not an appropriate tool for very small datasets.

  • Input Training Features can be tables, points, lines, or polygon features. This tool does not work with multipart data.

  • Features with one or more null values or empty string values in prediction or explanatory fields will be excluded from the output. You can modify values using the Calculate Field tool if necessary.

  • This tool produces a variety of outputs depending on the following operation modes:

    • Train produces the following two outputs:
      • Output trained features—Contains all of the Input Training Features used in the model created as well as all of the explanatory variables used in the model. It also contains predictions for all of the features used for training the model, which can be helpful in assessing the performance of the model created.
      • Tool summary messages—Messages to help you understand the performance of the model created. The messages include information about model characteristics, variable importance, and validation diagnostics.
    • Train and Predict produces the following three outputs:
      • Output trained features—Contains all of the Input Training Features used in the model created as well as all of the explanatory variables used in the model. It also contains predictions for all of the features used for training the model, which can be helpful in assessing the performance of the model created.
      • Output predicted features—A layer of predicted results. Predictions are applied to the layer to predict (use the Input Prediction Features option) using the model generated from the training layer.
      • Tool summary messages—Messages to help you understand the performance of the model created. The messages include information about model characteristics, variable importance, and validation diagnostics.

  • You can use the Create Variable Importance Table parameter to create a table to display a chart of variable importance for evaluation. The top 20 variable importance values are also reported in the messages window.

  • Explanatory variables can come from fields and should contain a variety of values. If the explanatory variable is categorical, the Categorical check box should be checked (variables of type string will automatically be checked). Categorical explanatory variables are limited to 60 unique values, though a smaller number of categories will improve model performance. For a given data size, the more categories a variable contains, the more likely it is that it will dominate the model and lead to less effective prediction results.

  • When matching explanatory variables, each Training Field and its matching Prediction Field must be the same type (for example, a double field in Training Field must be matched to a double field in Prediction Field).

  • Forest-based models do not extrapolate; they can only classify or predict to a value that the model was trained on. Train the model with training features and explanatory variables that are within the range of your target features and variables. The tool will fail if categories exist in the prediction explanatory variables that are not present in the training features.

  • The default value for the Number of Trees parameter is 100. Increasing the number of trees in the forest model will generally result in more accurate model prediction, but the model will take longer to calculate.

  • A single layer for training and a single layer for prediction are supported. To combine multiple datasets into one, use the Build Multi-Variable Grid and Enrich from Multi-Variable Grid tools to generate input data.

  • This geoprocessing tool is powered by ArcGIS GeoAnalytics Server. Analysis is completed on your GeoAnalytics Server, and results are stored in your content in ArcGIS Enterprise.

  • For optimal performance, make data available to your GeoAnalytics Server through feature layers hosted on your ArcGIS Enterprise portal or through big data file shares. Data that is not local to your GeoAnalytics Server is moved there before analysis begins, so the tool takes longer to run, and in some cases moving the data from ArcGIS Pro to your GeoAnalytics Server may fail. The failure threshold depends on your network speed and on the size and complexity of the data. It is recommended that you always share your data or create a big data file share.

    Learn more about sharing data to your portal

    Learn more about creating a big data file share through Server Manager
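
Several of the usage notes above describe conditions that can be checked before running the tool: features with null or empty values are excluded, categorical explanatory variables are limited to 60 unique values, and the tool fails when prediction data contains categories absent from the training data. The following is a minimal sketch of such checks on plain Python dictionaries. The function names, field names, and sample rows are hypothetical, and the code does not call the tool or any ArcGIS API.

```python
MAX_CATEGORIES = 60  # limit quoted in the usage notes


def drop_incomplete(rows, fields):
    """Keep only rows with no None or empty-string value in the given fields."""
    return [r for r in rows if all(r.get(f) not in (None, "") for f in fields)]


def check_categorical(train_rows, predict_rows, field):
    """Return a list of problems found for one categorical explanatory field."""
    train_values = {r[field] for r in train_rows}
    predict_values = {r[field] for r in predict_rows}
    problems = []
    if len(train_values) > MAX_CATEGORIES:
        problems.append(
            f"{field}: {len(train_values)} categories exceeds {MAX_CATEGORIES}"
        )
    unseen = sorted(predict_values - train_values)
    if unseen:
        problems.append(f"{field}: categories not in training data: {unseen}")
    return problems


# Hypothetical training and prediction rows.
train = [
    {"landcover": "forest", "elev": 120.0},
    {"landcover": "urban", "elev": None},   # null value, excluded
    {"landcover": "", "elev": 80.0},        # empty string, excluded
    {"landcover": "wetland", "elev": 15.0},
]
predict = [
    {"landcover": "forest", "elev": 100.0},
    {"landcover": "desert", "elev": 300.0},  # category never seen in training
]

clean_train = drop_incomplete(train, ["landcover", "elev"])
problems = check_categorical(clean_train, predict, "landcover")
print(len(clean_train), problems)
```

Note that dropping incomplete rows first can itself remove a category from the training set, so the category check is run on the cleaned rows.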


Label | Explanation | Data Type
Prediction Type

Specifies the operation mode of the tool. The tool can be run to train a model to only assess performance, or to train a model and predict features.

  • Train only: A model will be trained, but no predictions will be generated. Use this option to assess the accuracy of your model before generating predictions. This option will output model diagnostics in the messages window and a chart of variable importance. This is the default.
  • Train and Predict: Predictions or classifications will be generated for features. Explanatory variables must be provided for both the training features and the features to be predicted. The output of this option will be a feature class, model diagnostics in the messages window, and an optional table of variable importance.
Input Training Features

The layer containing the Variable to Predict field and the explanatory variable fields used for training.

Record Set
Output Features Name

The output feature layer name.

Variable to Predict

The variable from the Input Training Features parameter containing the values to be used to train the model. This field contains known (training) values of the variable that will be used to predict at unknown locations.

Treat Variable as Categorical

Specifies whether Variable to Predict is a categorical variable.

  • Checked—Variable to Predict is a categorical variable and the tool will perform classification.
  • Unchecked—Variable to Predict is continuous and the tool will perform regression. This is the default.
Explanatory Variables

A list of fields representing the explanatory variables that help predict the value or category of Variable to Predict. Check the Categorical check box for any variables that represent classes or categories (such as land cover or presence or absence).

Value Table
Create Variable Importance Table

Specifies whether the output table will contain information describing the importance of each explanatory variable used in the model.

  • Checked—The output table will contain information for each explanatory variable.
  • Unchecked—The output table will not contain information for each explanatory variable. This is the default.
Input Prediction Features

A feature layer representing locations where predictions will be made. This feature layer must also contain any explanatory variables provided as fields that correspond to those used from the training data.

Record Set
Match Explanatory Variables

A list of Explanatory Variables specified from Input Training Features on the right and their corresponding fields from Input Prediction Features on the left.

Value Table
Number of Trees

The number of trees to create in the forest model. More trees will generally result in more accurate model prediction, but the model will take longer to calculate. The default number of trees is 100.

Minimum Leaf Size

The minimum number of observations required to keep a leaf (that is, the terminal node on a tree without further splits). The default minimum for regression is 5, and the default for classification is 1. For very large data, increasing these numbers will decrease the run time of the tool.

Maximum Tree Depth

The maximum number of splits that will be made down a tree. With a large maximum depth, more splits will be created, which may increase the chances of overfitting the model. The default is data driven and depends on the number of trees created and the number of variables included.

Data Available per Tree (%)

The percentage of Input Training Features used for each decision tree. The default is 100 percent of the data. Samples for each tree are taken randomly from two-thirds of the data specified.

Each decision tree in the forest is created using a random sample or subset (approximately two-thirds) of the training data available. Using a lower percentage of the input data for each decision tree increases the speed of the tool for very large datasets.
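
One way to read the description above is sketched below: the percentage limits how much of the training data is available to a tree, and the tree then draws roughly two-thirds of that portion at random. The function name is hypothetical, and sampling without replacement and rounding down are assumptions of this sketch. The tool's actual sampling scheme is not specified here.

```python
import random


def sample_for_tree(rows, percent_available=100, rng=None):
    """Draw a per-tree sample: two-thirds of the available share of the rows."""
    rng = rng or random.Random()
    # Portion of the training data made available to this tree.
    n_available = len(rows) * percent_available // 100
    available = rng.sample(rows, n_available)
    # Approximately two-thirds of the available rows are used by the tree.
    n_tree = (2 * n_available) // 3
    return rng.sample(available, n_tree)


rows = list(range(300))  # stand-in for 300 training features
full = sample_for_tree(rows, 100, random.Random(0))
half = sample_for_tree(rows, 50, random.Random(0))
print(len(full), len(half))
```

Halving the percentage halves the number of rows each tree has to process, which is where the speedup on very large datasets comes from.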

Number of Randomly Sampled Variables

The number of explanatory variables used to create each decision tree.

Each decision tree in the forest is created using a random subset of the explanatory variables specified. Increasing the number of variables used in each decision tree will increase the chances of overfitting your model, particularly if there is one or more dominant variables. A common practice is to use the square root of the total number of explanatory variables if Variable to Predict is numeric, or divide the total number of explanatory variables by 3 if Variable to Predict is categorical.
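
The rule of thumb in the preceding paragraph can be written as a small helper. This sketch follows the rule exactly as worded above. The function name is hypothetical, and rounding down with a minimum of one variable is an assumption.

```python
import math


def suggested_variable_count(n_explanatory, categorical_target):
    """Suggest how many explanatory variables to sample per tree."""
    if n_explanatory < 1:
        raise ValueError("at least one explanatory variable is required")
    if categorical_target:
        # Categorical Variable to Predict: total divided by 3.
        return max(1, n_explanatory // 3)
    # Numeric Variable to Predict: square root of the total.
    return max(1, math.isqrt(n_explanatory))


print(suggested_variable_count(16, categorical_target=False))
print(suggested_variable_count(12, categorical_target=True))
```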

Training Data Excluded for Validation (%)

The percentage (between 10 percent and 50 percent) of Input Training Features to reserve as the test dataset for validation. The model will be trained without this random subset of data, and the observed values for those features will be compared to the predicted values. The default is 10 percent.
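
The holdout described above can be sketched as a random split. The function name and the seed argument are hypothetical, and rounding the test-set size down is an assumption. The tool performs this step internally, so the code is only meant to show the idea.

```python
import random


def split_for_validation(rows, percent_excluded=10, seed=None):
    """Return (training_rows, test_rows) with a random subset held out."""
    if not 10 <= percent_excluded <= 50:
        raise ValueError("percent_excluded must be between 10 and 50")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = len(shuffled) * percent_excluded // 100
    # The first n_test shuffled rows are reserved for validation.
    return shuffled[n_test:], shuffled[:n_test]


training, test = split_for_validation(list(range(100)), 10, seed=1)
print(len(training), len(test))
```

The model is fitted on the training rows only, and its predictions for the test rows are then compared with their observed values.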

Data Store

Specifies the ArcGIS Data Store where the output will be saved. The default is Spatiotemporal big data store. All results stored in a spatiotemporal big data store will be stored in WGS84. Results stored in a relational data store will maintain their coordinate system.

  • Spatiotemporal big data store: Output will be stored in a spatiotemporal big data store. This is the default.
  • Relational data store: Output will be stored in a relational data store.
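
The defaults quoted in the parameter descriptions above can be collected in one place, for example when recording the settings used for a run. The class and field names below are hypothetical and are not the tool's parameter names. Only the default values come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForestSettings:
    """Illustrative container for the documented default settings."""

    categorical_target: bool = False          # regression unless checked
    number_of_trees: int = 100
    minimum_leaf_size: Optional[int] = None   # resolved from the mode below
    data_available_per_tree: int = 100        # percent
    percent_for_validation: int = 10

    def __post_init__(self):
        if self.minimum_leaf_size is None:
            # Documented defaults: 1 for classification, 5 for regression.
            self.minimum_leaf_size = 1 if self.categorical_target else 5
        if self.number_of_trees < 1:
            raise ValueError("number_of_trees must be at least 1")


regression = ForestSettings()
classification = ForestSettings(categorical_target=True)
print(regression.minimum_leaf_size, classification.minimum_leaf_size)
```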

Derived Output

Label | Explanation | Data Type
Output Trained Features

The output containing the input variables used for training, the observed values of the Variable to Predict parameter, and the accompanying predictions, which can be used to further assess the performance of the model.

Record Set
Variable of Importance Table

A table containing information describing the importance of each explanatory variable used in the model created.

Record Set
Output Predicted Features

A layer that will receive the predictions of the model.

Record Set


Special cases

Output Coordinate System

The coordinate system that will be used for analysis. Analysis will be completed in the input coordinate system unless otherwise specified by this parameter. For GeoAnalytics Tools, final results will be stored in the spatiotemporal data store in WGS84.

Licensing information

  • Basic: Requires ArcGIS GeoAnalytics Server
  • Standard: Requires ArcGIS GeoAnalytics Server
  • Advanced: Requires ArcGIS GeoAnalytics Server
