Generalized Linear Regression (GeoAnalytics)

Summary

Performs generalized linear regression (GLR) to generate predictions or to model a dependent variable in terms of its relationship to a set of explanatory variables. This tool can be used to fit continuous (OLS), binary (logistic), and count (Poisson) models.
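As a conceptual aid, the three model families differ in how the linear combination of explanatory variables is turned into a predicted value. The sketch below assumes the textbook link functions (identity for Gaussian, logit for logistic, log for Poisson); it is not the tool's implementation, and the function name predicted_mean is hypothetical.

```python
import math

def predicted_mean(model_type, intercept, coefficients, values):
    """Return the modeled mean of the dependent variable for one feature."""
    # Linear predictor: intercept + sum(coefficient * explanatory value)
    eta = intercept + sum(c * x for c, x in zip(coefficients, values))
    if model_type == "CONTINUOUS":   # Gaussian: the predictor is the estimate
        return eta
    if model_type == "BINARY":       # Logistic: probability of a 1 (presence)
        return 1.0 / (1.0 + math.exp(-eta))
    if model_type == "COUNT":        # Poisson: expected count, always positive
        return math.exp(eta)
    raise ValueError(f"Unknown model type: {model_type}")
```

The same coefficients therefore mean different things under each model type, which is one reason the diagnostics reported depend on the model type.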

Usage

  • This tool can be used in two operation modes. You can evaluate the performance of different models as you explore different explanatory variables and tool settings. Once a good model has been found, you can apply the model to a new dataset to generate predictions.

  • This tool does not support inputs with date only or time only fields.

  • Use the Input Features parameter with a field representing the phenomenon you are modeling (the Dependent Variable parameter value) and one or more fields representing the explanatory variables.

  • The Generalized Linear Regression tool also produces output features and diagnostics. Output feature layers are automatically added to the map with a rendering scheme applied to model residuals. An explanation of each output is provided below.

  • Choose the Model Type parameter option (Continuous, Binary, or Count) that matches the dependent variable. An incorrect option will produce inaccurate regression results.

  • Model summary results and diagnostics are written to the messages window, and charts will be created below the output feature class. The diagnostics reported depend on the Model Type parameter value. The three model type options are as follows:

    • Use the Continuous (Gaussian) model type if the dependent variable can accept a wide range of values such as temperature or total sales. Ideally, the dependent variable will be normally distributed.
    • Use the Binary (Logistic) model type if the dependent variable can accept one of two possible values, such as success and failure or presence and absence. The field containing the dependent variable must be either a numeric field or a text field. If the field is numeric, it should contain only ones and zeros. If the field is text, it should contain only two distinct values, and you must use the Map Dependent Variables parameter to map the distinct text values to ones and zeros. The data must contain both values (both ones and zeros, or both distinct text values).

    • Use the Count (Poisson) model type if the dependent variable is discrete and represents the number of occurrences of an event such as a count of crimes. Count models can also be used if the dependent variable represents a rate and the denominator of the rate is a fixed value such as sales per month or number of people with cancer per 10,000 in the population. In the Count model, it is assumed that the mean and variance of the dependent variable are equal, and the values of the dependent variable cannot be negative or contain decimals.

    The Dependent Variable and Explanatory Variable parameter values should be numeric fields containing a range of values. The tool cannot solve a model when a variable has the same value for every feature (if all the values for a field are 9.0, for example).

  • Features with one or more null values or empty string values in prediction or explanatory fields will be excluded from the output. You can modify values using the Calculate Field tool if necessary.

  • Review the over- and underpredictions evident in the regression residuals to see whether they provide information about potential missing variables from the regression model.

  • You can use the regression model that has been created to make predictions for other features. Creating these predictions requires that each prediction feature has values for each of the explanatory variables provided. If the field names from the Input Features and Input Prediction Features parameters do not match, use the Match Explanatory Variables parameter to pair them. When matching the explanatory variables, the fields from the Input Features and Input Prediction Features parameters must be of the same type (double fields must be matched with double fields, for example).

  • The GeoAnalytics implementation of GLR has the following limitations:

    • It is a global regression model and does not take the spatial distribution of data into account.
    • Analysis does not apply Moran's I test on the residuals.
    • Feature datasets (points, lines, polygons, and tables) are supported as input; rasters are not supported.
    • You cannot classify values into multiple classes.

  • This geoprocessing tool is powered by ArcGIS GeoAnalytics Server. Analysis is completed on your GeoAnalytics Server, and results are stored in your content in ArcGIS Enterprise.

  • When running GeoAnalytics Server tools, the analysis is completed on the GeoAnalytics Server. For optimal performance, make data available to the GeoAnalytics Server through feature layers hosted on your ArcGIS Enterprise portal or through big data file shares. Data that is not local to your GeoAnalytics Server will be moved to your GeoAnalytics Server before analysis begins. This means that it will take longer to run a tool, and in some cases, moving the data from ArcGIS Pro to your GeoAnalytics Server may fail. The threshold for failure depends on your network speeds, as well as the size and complexity of the data. It is recommended that you always share your data or create a big data file share.

    Learn more about sharing data to your portal

    Learn more about creating a big data file share through Server Manager
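The data requirements in the usage notes above can be checked before the tool is run. The following is a minimal sketch in plain Python and is not part of the tool. The helper names check_field and dispersion_ratio are hypothetical, the field values are assumed to have already been read into a list, and count values are assumed to be numeric.

```python
from statistics import mean, pvariance

def check_field(values, model_type):
    """Return a list of warnings for one field, given the chosen model type."""
    problems = []
    # Null and empty-string values cause the record to be excluded.
    present = [v for v in values if v is not None and v != ""]
    if len(present) < len(values):
        problems.append("records with null or empty values will be excluded")
    distinct = set(present)
    if len(distinct) < 2:
        # For example, every value is 9.0.
        problems.append("field has no variation")
    elif model_type == "BINARY" and len(distinct) != 2:
        problems.append("binary models need exactly two distinct values")
    elif model_type == "COUNT" and any(v < 0 or v != int(v) for v in present):
        problems.append("count values must be non-negative whole numbers")
    return problems

def dispersion_ratio(counts):
    """Variance divided by mean; values near 1 match the Poisson assumption."""
    return pvariance(counts) / mean(counts)
```

An empty list from check_field means none of these particular issues were found; it does not guarantee that the model will fit well.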

Parameters

Label | Explanation | Data Type
Input Features

The layer containing the dependent and independent variables.

Record Set
Dependent Variable

The numeric field containing the observed values to be modeled.

Field
Model Type

Specifies the type of data that will be modeled.

  • Continuous (Gaussian): The Dependent Variable value is continuous. The Gaussian model will be used, and the tool will perform ordinary least squares regression. This is the default.
  • Binary (Logistic): The Dependent Variable value represents presence or absence. This can be either conventional ones and zeros, or string values mapped to zeros and ones in the Map Dependent Variables parameter. The logistic regression model will be used.
  • Count (Poisson): The Dependent Variable value is discrete and represents events, for example, crime counts, disease incidents, or traffic accidents. The Poisson regression model will be used.
String
Explanatory Variable(s)

A list of fields representing independent explanatory variables in the regression model.

Field
Output Features Name

The name of the feature class that will be created containing the dependent variable estimates and residuals.

String
Generate Coefficient Table
(Optional)

Specifies whether an output table with coefficient values will be generated.

  • Checked—A table with coefficient values will be generated.
  • Unchecked—A table with coefficient values will not be generated. This is the default.
Boolean
Input Prediction Features
(Optional)

A layer containing features representing locations where estimates will be computed. Each feature in this dataset should contain values for all the explanatory variables specified. The dependent variable for these features will be estimated using the model calibrated for the input layer data.

Record Set
Match Explanatory Variables
(Optional)

Matches the explanatory variables in the Input Prediction Features parameter to corresponding explanatory variables from the Input Features parameter.

Value Table
Map Dependent Variables
(Optional)

Two strings representing the values used to map to 0 (absence) and 1 (presence) for binary regression. By default, 0 and 1 will be used. For example, to predict an arrest with field values of Arrest and No Arrest, enter No Arrest for False Value (0) and Arrest for True Value (1).

Value Table
Data Store
(Optional)

Specifies the ArcGIS Data Store where the output will be saved. The default is Spatiotemporal big data store. All results stored in a spatiotemporal big data store will be stored in WGS84. Results stored in a relational data store will maintain their coordinate system.

  • Spatiotemporal big data store: Output will be stored in a spatiotemporal big data store. This is the default.
  • Relational data store: Output will be stored in a relational data store.
String

Derived Output

Label | Explanation | Data Type
Output

The output feature service containing the dependent variable estimates for each input feature.

Record Set
Output Predicted Features

An output layer containing the input variables and the predicted dependent variable values.

Record Set
Output Table of Coefficients

An output table containing the coefficients from the model fit. The output is created when the Generate Coefficient Table parameter is checked.

Record Set
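The output features contain the dependent variable estimates and residuals, and the usage notes suggest reviewing over- and underpredictions. As a rough illustration, the sketch below labels records by the sign of the residual. It assumes a residual is the observed value minus the estimated value, a convention this page does not state, and the function name classify_residuals is hypothetical.

```python
def classify_residuals(observed, estimated):
    """Label each record by the sign of its residual (observed - estimated)."""
    labels = []
    for obs, est in zip(observed, estimated):
        residual = obs - est
        if residual > 0:
            labels.append("underprediction")   # the model estimated too low
        elif residual < 0:
            labels.append("overprediction")    # the model estimated too high
        else:
            labels.append("exact")
    return labels
```

Groups of features that share the same label may point to an explanatory variable that is missing from the model.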

arcpy.geoanalytics.GeneralizedLinearRegression(input_features, dependent_variable, model_type, explanatory_variables, output_features_name, {generate_coefficient_table}, {input_features_to_predict}, {explanatory_variables_to_match}, {dependent_variable_mapping}, {data_store})
Name | Explanation | Data Type
input_features

The layer containing the dependent and independent variables.

Record Set
dependent_variable

The numeric field containing the observed values to be modeled.

Field
model_type

Specifies the type of data that will be modeled.

  • CONTINUOUS: The dependent_variable value is continuous. The Gaussian model will be used, and the tool will perform ordinary least squares regression. This is the default.
  • BINARY: The dependent_variable value represents presence or absence. This can be either conventional ones and zeros, or string values mapped to zeros and ones in the dependent_variable_mapping parameter. The logistic regression model will be used.
  • COUNT: The dependent_variable value is discrete and represents events, for example, crime counts, disease incidents, or traffic accidents. The Poisson regression model will be used.
String
explanatory_variables
[explanatory_variables,...]

A list of fields representing independent explanatory variables in the regression model.

Field
output_features_name

The name of the feature class that will be created containing the dependent variable estimates and residuals.

String
generate_coefficient_table
(Optional)

Specifies whether an output table with coefficient values will be generated.

  • CREATE_TABLE: A table with coefficient values will be generated.
  • NO_TABLE: A table with coefficient values will not be generated. This is the default.
Boolean
input_features_to_predict
(Optional)

A layer containing features representing locations where estimates will be computed. Each feature in this dataset should contain values for all the explanatory variables specified. The dependent variable for these features will be estimated using the model calibrated for the input layer data.

Record Set
explanatory_variables_to_match
[[Field from Prediction Locations, Field from Input Features],...]
(Optional)

Matches the explanatory variables in the input_features_to_predict parameter to corresponding explanatory variables from the input_features parameter—for example, [["LandCover2000", "LandCover2010"], ["Income", "PerCapitaIncome"]].

Value Table
dependent_variable_mapping
[dependent_variable_mapping,...]
(Optional)

Two strings representing the values used to map to 0 (absence) and 1 (presence) for binary regression. By default, 0 and 1 will be used. For example, to predict an arrest with field values of Arrest and No Arrest, enter No Arrest for False Value (0) and Arrest for True Value (1).

Value Table
data_store
(Optional)

Specifies the ArcGIS Data Store where the output will be saved. The default is SPATIOTEMPORAL_DATA_STORE. All results stored in a spatiotemporal big data store will be stored in WGS84. Results stored in a relational data store will maintain their coordinate system.

  • SPATIOTEMPORAL_DATA_STORE: Output will be stored in a spatiotemporal big data store. This is the default.
  • RELATIONAL_DATA_STORE: Output will be stored in a relational data store.
String

Derived Output

Name | Explanation | Data Type
output

The output feature service containing the dependent variable estimates for each input feature.

Record Set
output_predicted_features

An output layer containing the input variables and the predicted dependent variable values.

Record Set
coefficient_table

An output table containing the coefficients from the model fit. The output is created when the generate_coefficient_table parameter is set to CREATE_TABLE.

Record Set
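When only some prediction fields are named differently from the input fields, the explanatory_variables_to_match value can be built programmatically. The sketch below is a hypothetical helper (build_match_table is not part of arcpy) and assumes the pair order given in the syntax above, that is, the prediction field first and the input field second. Verify the order against your own data before relying on it.

```python
def build_match_table(explanatory_variables, prediction_field_for):
    """Return [[prediction field, input field], ...] in explanatory order."""
    table = []
    for input_field in explanatory_variables:
        # Fall back to the same name when no rename is listed.
        prediction_field = prediction_field_for.get(input_field, input_field)
        table.append([prediction_field, input_field])
    return table

# Field names follow the example in the parameter description.
matches = build_match_table(
    ["LandCover2010", "PerCapitaIncome", "Population"],
    {"LandCover2010": "LandCover2000", "PerCapitaIncome": "Income"},
)
```

The helper only pairs names. It does not check that the paired fields have the same type, which the tool requires.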

Code sample

GeneralizedLinearRegression example (stand-alone script)

The following stand-alone script demonstrates how to use the GeneralizedLinearRegression function.

In this script, you create a model and predict if an arrest was made for given crimes.

# Description: Run GLR on crime data and predict whether an arrest was made for a reported crime.
#
# Requirements: ArcGIS GeoAnalytics Server

# Import system modules
import arcpy

# Set local variables
trainingDataset = "https://analysis.org.com/server/rest/services/Hosted/old_crimes/FeatureServer/0"
predictionDataset = "https://analysis.org.com/server/rest/services/Hosted/new_crimes/FeatureServer/0"
outputTrainingName = "training"

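# Run GLR
arcpy.geoanalytics.GeneralizedLinearRegression(
    trainingDataset,                           # input_features
    "ArrestMade",                              # dependent_variable
    "BINARY",                                  # model_type
    ["CRIME_TYPE", "WARD", "DAY_OF_MONTH"],    # explanatory_variables
    outputTrainingName,                        # output_features_name
    "NO_TABLE",                                # generate_coefficient_table
    predictionDataset,                         # input_features_to_predict
    [["CRIME_TYPE", "CRIME_TYPE"], ["WARD", "WARD"],
     ["DAY_OF_MONTH", "DAY_OF_MON"]],          # explanatory_variables_to_match
    [["Arrest", "NoArrest"]],                  # dependent_variable_mapping
    "SPATIOTEMPORAL_DATA_STORE")               # data_store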
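The positional call above can be hard to read when optional parameters are skipped. As an alternative, the sketch below collects the arguments by name for a continuous model that also writes a coefficient table. The helper build_glr_kwargs, the layer URL, and the field names are hypothetical. The keyword names are taken from the syntax on this page, and passing them as keyword arguments is assumed to work as it does for other geoprocessing functions.

```python
VALID_MODEL_TYPES = {"CONTINUOUS", "BINARY", "COUNT"}

def build_glr_kwargs(input_features, dependent_variable, model_type,
                     explanatory_variables, output_features_name,
                     create_table=False):
    """Validate the basic arguments and return them as a keyword dictionary."""
    if model_type not in VALID_MODEL_TYPES:
        raise ValueError(f"model_type must be one of {sorted(VALID_MODEL_TYPES)}")
    if not explanatory_variables:
        raise ValueError("at least one explanatory variable is required")
    return {
        "input_features": input_features,
        "dependent_variable": dependent_variable,
        "model_type": model_type,
        "explanatory_variables": list(explanatory_variables),
        "output_features_name": output_features_name,
        "generate_coefficient_table": "CREATE_TABLE" if create_table else "NO_TABLE",
    }

# Hypothetical layer URL and field names, for illustration only.
kwargs = build_glr_kwargs(
    "https://example.com/server/rest/services/Hosted/sales/FeatureServer/0",
    "TotalSales", "CONTINUOUS", ["Population", "Income"],
    "sales_model", create_table=True)

# Running the tool requires arcpy and a GeoAnalytics Server connection:
# import arcpy
# arcpy.geoanalytics.GeneralizedLinearRegression(**kwargs)
```

Validating the arguments locally catches simple mistakes, such as a misspelled model type, before any data is sent to the server.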

Environments

Special cases

Output Coordinate System

The coordinate system that will be used for analysis. Analysis will be completed in the input coordinate system unless otherwise specified by this parameter. For GeoAnalytics Tools, final results will be stored in the spatiotemporal data store in WGS84.

Licensing information

  • Basic: Requires ArcGIS GeoAnalytics Server
  • Standard: Requires ArcGIS GeoAnalytics Server
  • Advanced: Requires ArcGIS GeoAnalytics Server
