Train Using AutoML (GeoAI)

Summary

Trains a machine learning model by building training pipelines and automating much of the training process. This includes exploratory data analysis, feature selection, feature engineering, model selection, hyperparameter tuning, and model training. Its outputs include performance metrics of the best model on the training data, as well as the trained deep learning model package (.dlpk) file, which can be used as input to the Predict Using AutoML tool to predict on a new dataset.

Learn more about how AutoML works

Usage

  • You must install the proper deep learning framework for Python in ArcGIS Pro.

    Learn how to install deep learning frameworks for ArcGIS

  • The time it takes for the tool to produce the trained model depends on the following:

    • The amount of data provided during training
    • The AutoML Mode parameter value

    By default, the timer for all modes is set at 60 minutes. Regardless of the amount of data used in training, the Basic option will not take the entire 60 minutes to find the optimum model. The fit process will complete as soon as the optimum model is identified.

    The Advanced option will take more time due to the additional tasks of feature engineering, feature selection, and hyperparameter tuning. In addition to the new features obtained by combining multiple features from the input, the tool creates spatial features with names from zone3_id through zone7_id. These new features will be extracted from the location information in the input data and will be used to train better models. For more information about the new spatial features, see How AutoML Works.

    If the amount of data being trained is large, all combinations of the models may not be evaluated within 60 minutes. In such cases, the best performing model determined within 60 minutes will be considered the optimum model. You can then either use this model or rerun the tool with a higher Total Time Limit (Minutes) parameter value.

  • An ArcGIS Spatial Analyst extension license is required to use rasters as explanatory variables.

  • The Output Report parameter value is a file in HTML format that provides a way to review the information in the working directory.

    The first page in the output report includes links to each of the models evaluated and shows their performance on a validation dataset along with the time it took to train them. Based on the evaluation metric, the report shows the best performing model that was chosen.

    RMSE is the default evaluation metric for regression problems, while Logloss is the default metric for classification problems. The following metrics are available in the output report:

      • Classification—AUC, Logloss, F1, Accuracy, Average precision
      • Regression—MSE, RMSE, MAE, R2, MAPE, Spearman coefficient, Pearson coefficient

    When you click a model combination, details about the training for that model combination are displayed including the learning curves, variable importance curves, hyperparameters used, and so on.

  • Potential use cases for the tool include training an annual solar energy generation model based on weather factors, training a crop prediction model using related variables, and training a house value prediction model.

  • For information about requirements for running this tool and issues you may encounter, see Deep Learning frequently asked questions.
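
The default evaluation metrics named in the usage notes above (RMSE for regression, Logloss for classification) can be illustrated with a minimal sketch. This is plain Python written for illustration only; the function names are hypothetical, and the code is not the tool's internal implementation.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss; y_true holds 0/1 labels, p_pred holds P(label == 1)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)

# Lower is better for both metrics.
print(rmse([3.0, 5.0], [1.0, 5.0]))
print(log_loss([1, 0], [0.9, 0.2]))
```

Both metrics are errors, so a smaller value in the report indicates a better model.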

Parameters

Label | Explanation | Data Type
Input Training Features

The input feature class that will be used to train the model.

Feature Layer; Table View
Output Model

The output trained model that will be saved as a deep learning package (.dlpk) file.

File
Variable to Predict

A field from the Input Training Features parameter value that contains the values that will be used to train the model. This field contains known (training) values of the variable that will be used to predict at unknown locations.

Field
Treat Variable as Categorical
(Optional)

Specifies whether the Variable to Predict parameter value will be treated as a categorical variable.

  • Checked—The Variable to Predict parameter value will be treated as a categorical variable and the tool will perform classification.
  • Unchecked—The Variable to Predict parameter value will be treated as continuous and the tool will perform regression. This is the default.

Boolean
Explanatory Training Variables
(Optional)

A list of fields representing the explanatory variables that will help predict the value or category of the Variable to Predict parameter value. Check the accompanying check box for any variables that represent classes or categories (such as land cover, presence, or absence).

Value Table
Explanatory Training Distance Features
(Optional)

The features whose distances from the input training features will be estimated automatically and added as more explanatory variables. Distances will be calculated from each of the input explanatory training distance features to the nearest input training features. Point and polygon features are supported, and if the input explanatory training distance features are polygons, the distance attributes will be calculated as the distance between the closest segments of the pair of features.

Feature Layer
Explanatory Training Rasters
(Optional)

The rasters whose values will be extracted and used as explanatory variables for the model. Each layer forms one explanatory variable. For each feature in the input training features, the value of the raster cell will be extracted at that exact location. Bilinear raster resampling will be used when extracting the raster value for continuous rasters. Nearest neighbor assignment will be used when extracting a raster value from categorical rasters. If the Input Training Features parameter value has polygons, and you have specified this parameter, one raster value for each polygon will be used in the model. Each polygon is assigned the average value for continuous rasters and the majority for categorical rasters. Check the Categorical column check box for any raster that represents classes or categories such as land cover, presence, or absence.

Value Table
Total Time Limit (Minutes)
(Optional)

The total time limit, in minutes, for AutoML model training. The default is 60 (1 hour).

Double
AutoML Mode
(Optional)

Specifies the goal of AutoML and how intensive the AutoML search will be.

  • Basic: Basic is used to explain the significance of the different variables and the data. Feature engineering, feature selection, and hyperparameter tuning will not be performed. Full descriptions and explanations for model learning curves, feature importance plots generated for tree-based models, and SHAP plots for all other models will be included in reports. This mode takes the least amount of processing time. This is the default.
  • Intermediate: Intermediate is used to train a model that will be used in real-life use cases. This mode uses 5-fold cross validation (CV) and produces output of learning curves and importance plots in the reports, but SHAP plots are not available.
  • Advanced: Advanced is used for machine learning competitions (for maximum performance). This mode uses 10-fold cross validation (CV) and performs feature engineering, feature selection, and hyperparameter tuning. Input training features are assigned to multiple spatial grids of different sizes based on their location, and the corresponding grid IDs are passed as additional categorical explanatory variables to the model. The report only includes learning curves; model explainability is not available.
String
Algorithms
(Optional)

Specifies the algorithms that will be used during the training.

By default, all the algorithms will be used.

  • Linear: The Linear regression supervised algorithm will be used to train a regression machine learning model. If Linear is the only algorithm specified, ensure that the total number of records is less than 10,000 and the number of columns is less than 1,000. Other models can accommodate larger datasets, and it is recommended that you use Linear with other algorithms and not as the sole algorithm.
  • Random Forest: The Random Forest decision tree-based supervised machine learning algorithm will be used. It can be used for both classification and regression.
  • XGBoost: The XGBoost (extreme gradient boosting) supervised machine learning algorithm will be used. It can be used for both classification and regression.
  • Light GBM: The Light GBM gradient boosting ensemble algorithm, which is based on decision trees, will be used. It can be used for both classification and regression. Light GBM is optimized for high performance with distributed systems.
  • Decision Tree: The Decision Tree supervised machine learning algorithm, which classifies or regresses the data using true and false answers to certain questions, will be used. Decision trees are easily understood and are good for explainability.
  • Extra Tree: The Extra Tree (extremely randomized trees) ensemble supervised machine learning algorithm, which uses decision trees, will be used. This algorithm is similar to Random Forests but can be faster.
Multivalue
Validation Percentage
(Optional)

The percentage of input data that will be used for validation. The default value is 10.

Long
Output Report
(Optional)

The output report that will be generated as an .html file. If the path provided is not empty, the report will be created in a new folder under the provided path. The report will contain details of the various models as well as details of the hyperparameters that were used during the evaluation and the performance of each model. Hyperparameters are parameters that control the training process. They are not updated during training and include model architecture, learning rate, number of epochs, and so on.

File
Output Importance Table
(Optional)

An output table containing information about the importance of each explanatory variable (fields, distance features, and rasters) used in the model.

Table
Output Feature Class
(Optional)

The feature layer containing the values predicted by the best performing model for the training features. It can be used to verify model performance by visually comparing the predicted values with the ground truth.

Feature Class
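
The Validation Percentage and AutoML Mode parameters above refer to a validation holdout and to 5-fold or 10-fold cross validation. The following sketch shows what those two splitting schemes mean on row indices. It is an illustration in plain Python with hypothetical function names; how the tool actually partitions the data is not documented here and may differ.

```python
import random

def holdout_split(n_rows, validation_percent=10, seed=0):
    """Return (train_indices, validation_indices) for a random holdout."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_val = round(n_rows * validation_percent / 100)
    return indices[n_val:], indices[:n_val]

def k_fold_indices(n_rows, k=5):
    """Yield (train_indices, validation_indices) once per fold."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

train, val = holdout_split(100, validation_percent=10)
print(len(train), len(val))

for train, val in k_fold_indices(20, k=5):
    print(len(train), len(val))
```

With k-fold cross validation, every row is used for validation exactly once, which is why the Intermediate and Advanced modes take longer than a single holdout evaluation.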

arcpy.geoai.TrainUsingAutoML(in_features, out_model, variable_predict, {treat_variable_as_categorical}, {explanatory_variables}, {distance_features}, {explanatory_rasters}, {total_time_limit}, {autoML_mode}, {algorithms}, {validation_percent}, {out_report}, {out_importance}, {out_features})
Name | Explanation | Data Type
in_features

The input feature class that will be used to train the model.

Feature Layer; Table View
out_model

The output trained model that will be saved as a deep learning package (.dlpk) file.

File
variable_predict

A field from the in_features parameter that contains the values that will be used to train the model. This field contains known (training) values of the variable that will be used to predict at unknown locations.

Field
treat_variable_as_categorical
(Optional)

Specifies whether the variable_predict parameter value will be treated as a categorical variable.

  • CATEGORICAL: The variable_predict parameter value will be treated as a categorical variable and the tool will perform classification.
  • CONTINUOUS: The variable_predict parameter value will be treated as continuous and the tool will perform regression. This is the default.
Boolean
explanatory_variables
[explanatory_variables,...]
(Optional)

A list of fields representing the explanatory variables that will help predict the value or category of the variable_predict parameter value. Pass the True value ('name_of_variable',True) for any variables that represent classes or categories (such as land cover, presence, or absence).

Value Table
distance_features
[distance_features,...]
(Optional)

The features whose distances from the input training features will be estimated automatically and added as more explanatory variables. Distances will be calculated from each of the input explanatory training distance features to the nearest input training features. Point and polygon features are supported, and if the input explanatory training distance features are polygons, the distance attributes will be calculated as the distance between the closest segments of the pair of features.

Feature Layer
explanatory_rasters
[explanatory_rasters,...]
(Optional)

The rasters whose values will be extracted and used as explanatory variables for the model. Each layer forms one explanatory variable. For each feature in the input training features, the value of the raster cell will be extracted at that exact location. Bilinear raster resampling will be used when extracting the raster value for continuous rasters. Nearest neighbor assignment will be used when extracting a raster value from categorical rasters. If the in_features parameter value has polygons, and you have specified this parameter, one raster value for each polygon will be used in the model. Each polygon is assigned the average value for continuous rasters and the majority for categorical rasters. Pass a true value using "<name_of_raster> true" for any raster that represents classes or categories such as land cover, presence, or absence.

Value Table
total_time_limit
(Optional)

The total time limit, in minutes, for AutoML model training. The default is 60 (1 hour).

Double
autoML_mode
(Optional)

Specifies the goal of AutoML and how intensive the AutoML search will be.

  • BASIC: Basic is used to explain the significance of the different variables and the data. Feature engineering, feature selection, and hyperparameter tuning will not be performed. Full descriptions and explanations for model learning curves, feature importance plots generated for tree-based models, and SHAP plots for all other models will be included in reports. This mode takes the least amount of processing time. This is the default.
  • INTERMEDIATE: Intermediate is used to train a model that will be used in real-life use cases. This mode uses 5-fold cross validation (CV) and produces output of learning curves and importance plots in the reports, but SHAP plots are not available.
  • ADVANCED: Advanced is used for machine learning competitions (for maximum performance). This mode uses 10-fold cross validation (CV) and performs feature engineering, feature selection, and hyperparameter tuning. Input training features are assigned to multiple spatial grids of different sizes based on their location, and the corresponding grid IDs are passed as additional categorical explanatory variables to the model. The report only includes learning curves; model explainability is not available.
String
algorithms
[algorithms,...]
(Optional)

Specifies the algorithms that will be used during the training.

  • LINEAR: The Linear regression supervised algorithm will be used to train a regression machine learning model. If Linear is the only algorithm specified, ensure that the total number of records is less than 10,000 and the number of columns is less than 1,000. Other models can accommodate larger datasets, and it is recommended that you use Linear with other algorithms and not as the sole algorithm.
  • RANDOM FOREST: The Random Forest decision tree-based supervised machine learning algorithm will be used. It can be used for both classification and regression.
  • XGBOOST: The XGBoost (extreme gradient boosting) supervised machine learning algorithm will be used. It can be used for both classification and regression.
  • LIGHT GBM: The Light GBM gradient boosting ensemble algorithm, which is based on decision trees, will be used. It can be used for both classification and regression. Light GBM is optimized for high performance with distributed systems.
  • DECISION TREE: The Decision Tree supervised machine learning algorithm, which classifies or regresses the data using true and false answers to certain questions, will be used. Decision trees are easily understood and are good for explainability.
  • EXTRA TREE: The Extra Tree (extremely randomized trees) ensemble supervised machine learning algorithm, which uses decision trees, will be used. This algorithm is similar to Random Forests but can be faster.

By default, all the algorithms will be used.

Multivalue
validation_percent
(Optional)

The percentage of input data that will be used for validation. The default value is 10.

Long
out_report
(Optional)

The output report that will be generated as an .html file. If the path provided is not empty, the report will be created in a new folder under the provided path. The report will contain details of the various models as well as details of the hyperparameters that were used during the evaluation and the performance of each model. Hyperparameters are parameters that control the training process. They are not updated during training and include model architecture, learning rate, number of epochs, and so on.

File
out_importance
(Optional)

An output table containing information about the importance of each explanatory variable (fields, distance features, and rasters) used in the model.

Table
out_features
(Optional)

The feature layer containing the values predicted by the best performing model for the training features. It can be used to verify model performance by visually comparing the predicted values with the ground truth.

Feature Class
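
Two of the parameter rules above lend themselves to small helper functions: marking categorical entries in a value table, and the size limits that apply when LINEAR is the only algorithm. The sketch below is plain Python with hypothetical helper names. It assumes the semicolon-delimited string form used in the code sample ("#" meaning the default) and the "<name> true" form described for explanatory_rasters; whether explanatory_variables accepts exactly the same string form is an assumption to verify in your environment.

```python
# Limits stated in the LINEAR algorithm description.
LINEAR_MAX_RECORDS = 10_000
LINEAR_MAX_COLUMNS = 1_000

def to_value_table(variables):
    """Build a value table string from (name, is_categorical) pairs.

    Assumption: "#" leaves the categorical flag at its default and
    "true" marks the entry as categorical.
    """
    return ";".join(
        f"{name} {'true' if is_categorical else '#'}"
        for name, is_categorical in variables
    )

def linear_only_ok(algorithms, n_records, n_columns):
    """Return False when LINEAR is the sole algorithm and the data is too large."""
    if [a.upper() for a in algorithms] != ["LINEAR"]:
        return True  # the limits only apply when LINEAR is used alone
    return n_records < LINEAR_MAX_RECORDS and n_columns < LINEAR_MAX_COLUMNS

print(to_value_table([("bathrooms", False), ("zipcode", True)]))
print(linear_only_ok(["LINEAR"], n_records=25_000, n_columns=12))
```

A check like linear_only_ok can be run before calling the tool so that an oversized dataset is paired with at least one additional algorithm.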

Code sample

TrainUsingAutoML (Python window)

This example shows how to use the TrainUsingAutoML function.

# Name: TrainUsingAutoML.py
# Description: Train a machine learning model on feature or tabular data
# using AutoML.
  
# Import system modules
import arcpy
import os

# Set local variables

datapath = "path_to_data"
out_path = "path_to_trained_model"

in_feature = os.path.join(datapath, "train_data.gdb", "name_of_data")
out_model = os.path.join(out_path, "model.dlpk")

# Run Train Using AutoML with a 60-minute limit in BASIC mode.
# "#" after each field name keeps the default (not categorical).
arcpy.geoai.TrainUsingAutoML(in_feature, out_model, "price", None, 
                             "bathrooms #;bedrooms #;square_feet #", None, None, 
                             60, "BASIC")
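
The positional call above can be hard to read once optional outputs are added. The following variation is a sketch that passes the same tool keyword arguments, using the parameter names from the syntax line, with ADVANCED mode, a longer time limit, and an output report. The helper names build_automl_kwargs and train, and the report.html file name, are hypothetical. The arcpy import is deferred so the argument-building part can be checked outside ArcGIS Pro.

```python
import os

# Optional parameter names, copied from the tool syntax line.
OPTIONAL_PARAMS = {
    "treat_variable_as_categorical", "explanatory_variables",
    "distance_features", "explanatory_rasters", "total_time_limit",
    "autoML_mode", "algorithms", "validation_percent",
    "out_report", "out_importance", "out_features",
}

def build_automl_kwargs(in_features, out_model, variable_predict, **optional):
    """Collect tool arguments and reject misspelled optional names."""
    unknown = set(optional) - OPTIONAL_PARAMS
    if unknown:
        raise ValueError(f"Unknown parameter(s): {sorted(unknown)}")
    kwargs = {
        "in_features": in_features,
        "out_model": out_model,
        "variable_predict": variable_predict,
    }
    kwargs.update(optional)
    return kwargs

def train(kwargs):
    """Run the tool; requires ArcGIS Pro with the deep learning frameworks."""
    import arcpy  # imported here so the rest of the script runs without it
    return arcpy.geoai.TrainUsingAutoML(**kwargs)

kwargs = build_automl_kwargs(
    os.path.join("path_to_data", "train_data.gdb", "name_of_data"),
    os.path.join("path_to_trained_model", "model.dlpk"),
    "price",
    explanatory_variables="bathrooms #;bedrooms #",
    total_time_limit=120,
    autoML_mode="ADVANCED",
    out_report=os.path.join("path_to_trained_model", "report.html"),
)
# train(kwargs)  # uncomment inside an ArcGIS Pro Python environment
```

Naming each argument avoids the run of None placeholders that positional calls need when skipping optional parameters.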

Licensing information

  • Basic: No
  • Standard: No
  • Advanced: Yes
