Generalized Linear Regression (GeoAnalytics Desktop)

Summary

Performs generalized linear regression (GLR) to generate predictions or to model a dependent variable in terms of its relationship to a set of explanatory variables. This tool can be used to fit continuous (OLS), binary (logistic), and count (Poisson) models.
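
For reference, the three model types correspond to the standard generalized linear model families sketched below. The notation ($y_i$ for the dependent variable of feature $i$, $x_{ij}$ for its explanatory variables, $\beta_j$ for the coefficients, $\mu_i$ for the expected value) is introduced here for illustration and is not taken from the tool itself.

```latex
\begin{aligned}
\text{Continuous (Gaussian):}\quad & \mu_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}, & \mu_i &= E[y_i] \\
\text{Binary (logistic):}\quad & \log\frac{\mu_i}{1-\mu_i} = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}, & \mu_i &= P(y_i = 1) \\
\text{Count (Poisson):}\quad & \log \mu_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}, & \mu_i &= E[y_i]
\end{aligned}
```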

Usage

  • This tool has two modes of operation. In the first, you evaluate the performance of different models as you explore different explanatory variables and tool settings. In the second, once a good model has been found, you apply that model to a new dataset to make predictions.

  • Use the Input Features parameter with a field representing the phenomenon you are modeling (the Dependent Variable parameter) and one or more fields representing the explanatory variables.

  • The Generalized Linear Regression tool also produces output features and diagnostics. Output feature layers are automatically added to the map with a rendering scheme applied to model residuals. A full explanation of each output is provided below.

  • Use the model type (Continuous, Binary, or Count) that matches your dependent variable. An incorrect model type will produce inaccurate regression results.

  • Model summary results and diagnostics are written to the messages window, and charts will be created below the output feature class. The diagnostics reported depend on the Model Type parameter value. The three model type options are as follows:

    • Use the Continuous (Gaussian) model type if the dependent variable can accept a wide range of values such as temperature or total sales. Ideally, the dependent variable will be normally distributed.
    • Use the Binary (logistic) model type if the dependent variable can accept one of two possible values, such as success and failure or presence and absence. The field containing the dependent variable must be either a numeric field or a text field. If the field is numeric, it should contain only ones and zeros. If the field is text, it should contain only two distinct values, and you must use the Map Dependent Variables parameter to map those text values to ones and zeros. Both values must occur in your data; the tool cannot fit a model when every feature has the same value.

    • Use the Count (Poisson) model type if the dependent variable is discrete and represents the number of occurrences of an event such as a count of crimes. Count models can also be used if the dependent variable represents a rate and the denominator of the rate is a fixed value such as sales per month or number of people with cancer per 10,000 in the population. The Count model assumes that the mean and variance of the dependent variable are equal, and the values of the dependent variable cannot be negative or contain decimals.

    The Dependent Variable and Explanatory Variables parameter values should be numeric fields containing a range of values. The tool cannot solve the model when a variable has the same value for every feature (for example, if all the values for a field are 9.0).

  • Features with one or more null values or empty string values in prediction or explanatory fields will be excluded from the output. You can modify values using the Calculate Field tool if necessary.

  • Review the over- and underpredictions evident in the regression residuals to see whether they provide information about potential missing variables from your regression model.

  • You can use the regression model that has been created to make predictions for other features. Creating these predictions requires that each prediction feature has values for each of the explanatory variables provided. If the field names from the Input Features and Input Prediction Features parameters do not match, a variable matching parameter is provided. When matching the explanatory variables, the fields from the Input Features and Input Prediction Features parameters must be of the same type (double fields must be matched with double fields, for example).

  • You can improve the processing speed of the Generalized Linear Regression tool by doing one or more of the following:

    • Set the processing extent for analysis so you only analyze data of interest.
    • Don't output a coefficient table.
    • Use data that is local to where the analysis is being run.

  • This geoprocessing tool is powered by Spark. Analysis is completed on your desktop machine using multiple cores in parallel. See Considerations for GeoAnalytics Desktop tools to learn more about running analysis.

  • When running GeoAnalytics Desktop tools, the analysis is completed on your desktop machine. For optimal performance, data should be available on your desktop. If you are using a hosted feature layer, it is recommended that you use ArcGIS GeoAnalytics Server. If your data isn't local, it will take longer to run a tool. To use your ArcGIS GeoAnalytics Server to perform analysis, see GeoAnalytics Tools.

  • The GeoAnalytics implementation of GLR has the following limitations:

    • It is a global regression model and does not take the spatial distribution of data into account.
    • Analysis does not apply Moran's I test on the residuals.
    • Feature datasets (points, lines, polygons, and tables) are supported as input; rasters are not supported.
    • You cannot classify values into multiple classes.
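
Several of the usage notes above describe conditions on the data: the dependent variable must vary and must suit the model type, features with null or empty values are excluded, and residuals show over- and underpredictions. The sketch below illustrates those conditions in plain Python on in-memory values. It is not part of the tool and does not call ArcPy; every function name is hypothetical, and the residual sign convention (observed minus predicted) is an assumption made for this example.

```python
from statistics import mean, variance


def check_dependent_values(values, model_type):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if len(set(values)) < 2:
        problems.append("no variation")
    if model_type == "BINARY" and not set(values) <= {0, 1}:
        problems.append("not binary")
    if model_type == "COUNT" and any(v < 0 or v != int(v) for v in values):
        problems.append("not a count")
    return problems


def dispersion_ratio(values):
    """Sample variance divided by the mean (undefined when the mean is zero).

    A ratio near 1 is consistent with the Count model's assumption that
    the mean and variance of the dependent variable are equal.
    """
    return variance(values) / mean(values)


def split_complete_records(records, fields):
    """Separate records with a value in every field from those without."""
    kept, dropped = [], []
    for record in records:
        values = [record.get(name) for name in fields]
        if any(v is None or v == "" for v in values):
            dropped.append(record)
        else:
            kept.append(record)
    return kept, dropped


def classify_residuals(observed, predicted):
    """Label each residual, assuming residual = observed - predicted."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    labeled = []
    for obs, pred in zip(observed, predicted):
        residual = obs - pred
        if residual > 0:
            label = "underprediction"
        elif residual < 0:
            label = "overprediction"
        else:
            label = "exact"
        labeled.append((residual, label))
    return labeled
```

For example, check_dependent_values([9.0, 9.0, 9.0], "CONTINUOUS") reports the no-variation problem described above, and split_complete_records separates the rows that would be excluded from the output so you can inspect or repair them first.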

Syntax

arcpy.gapro.GeneralizedLinearRegression(input_features, dependent_variable, model_type, explanatory_variables, output_features, {input_features_to_predict}, {explanatory_variables_to_match}, {dependent_variable_mapping}, {output_predicted_features}, {coefficient_table})
Parameter    Explanation    Data Type
input_features

The layer containing the dependent and independent variables.

Table View
dependent_variable

The numeric field containing the observed values to be modeled.

Field
model_type

Specifies the type of data that will be modeled.

  • CONTINUOUS The dependent_variable is continuous. The Gaussian model will be used, and the tool will perform ordinary least squares regression. This is the default.
  • BINARY The dependent_variable represents presence or absence. This can be either conventional 1s and 0s, or string values mapped to 0s and 1s in the dependent_variable_mapping parameter. The logistic regression model will be used.
  • COUNT The dependent_variable is discrete and represents events, for example, crime counts, disease incidents, or traffic accidents. The Poisson regression model will be used.
String
explanatory_variables
[explanatory_variables,...]

A list of fields representing independent explanatory variables in the regression model.

Field
output_features

The name of the feature class that will be created containing the dependent variable estimates and residuals.

Table; Feature Class
input_features_to_predict
(Optional)

A layer containing features representing locations where estimates will be computed. Each feature in this dataset should contain values for all the explanatory variables specified. The dependent variable for these features will be estimated using the model calibrated for the input layer data.

Table View
explanatory_variables_to_match
[[Field from Prediction Locations, Field from Input Features],...]
(Optional)

Matches the explanatory variables in the input_features_to_predict parameter to corresponding explanatory variables from the input_features parameter—for example, [["LandCover2000", "LandCover2010"], ["Income", "PerCapitaIncome"]].

Value Table
dependent_variable_mapping
[dependent_variable_mapping,...]
(Optional)

Two strings representing the values used to map to 0 (absence) and 1 (presence) for binary regression. By default, 0 and 1 will be used. For example, to predict an arrest with field values of Arrest and No Arrest, you would enter No Arrest for False Value (0) and Arrest for True Value (1).

Value Table
output_predicted_features
(Optional)

The output feature class with the dependent variable estimates for each input_features_to_predict value.


Table; Feature Class
coefficient_table
(Optional)

The output table containing the coefficients from the model fit.

Table

Code sample

GeneralizedLinearRegression example (stand-alone script)

The following stand-alone script demonstrates how to use the GeneralizedLinearRegression tool.

In this script, you create a model and use it to predict whether an arrest was made for each reported crime.

# Name: GeneralizedLinearRegression.py
# Description: Run GLR on crime data and predict whether an arrest was made
#              for each reported crime.
#
# Requirements: Advanced License

# Import system modules
import arcpy
arcpy.env.workspace = "c:/data/city.gdb"

# Set local variables
trainingDataset = "old_crimes"
predictionDataset = "new_crimes"
outputTrainingName = "training"
outputPredictedName = "predicted"

# Run Generalized Linear Regression. Keyword arguments use the parameter
# names from the Syntax section so each value reaches the intended parameter.
arcpy.gapro.GeneralizedLinearRegression(
    input_features=trainingDataset,
    dependent_variable="ArrestMade",
    model_type="BINARY",
    explanatory_variables="CRIME_TYPE;WARD;DAY_OF_MONTH",
    output_features=outputTrainingName,
    input_features_to_predict=predictionDataset,
    explanatory_variables_to_match="CRIME_TYPE CRIME_TYPE;WARD WARD;DAY_OF_MONTH DAY_OF_MON",
    dependent_variable_mapping="Arrest NoArrest",
    output_predicted_features=outputPredictedName)
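
The explanatory variable matching used in the script can also be assembled programmatically. The following is a minimal sketch in plain Python, assuming you already have the field names and types of both datasets as dictionaries. The helper name build_match_pairs and its renames argument are hypothetical and are not part of ArcPy. Pairs are ordered as [field from prediction locations, field from input features], following the explanatory_variables_to_match description, and the type check mirrors the requirement that matched fields be of the same type.

```python
def build_match_pairs(input_fields, prediction_fields, renames=None):
    """Build [[prediction_field, input_field], ...] pairs.

    input_fields and prediction_fields map field name -> field type.
    renames maps an input field name to the differently named
    prediction field that should be matched to it.
    """
    renames = renames or {}
    pairs = []
    for input_name, input_type in input_fields.items():
        prediction_name = renames.get(input_name, input_name)
        if prediction_name not in prediction_fields:
            raise ValueError(f"no prediction field for {input_name!r}")
        if prediction_fields[prediction_name] != input_type:
            raise ValueError(f"type mismatch for {input_name!r}")
        pairs.append([prediction_name, input_name])
    return pairs


# Field names taken from the explanatory_variables_to_match example;
# the "Double" types are illustrative.
pairs = build_match_pairs(
    input_fields={"LandCover2010": "Double", "PerCapitaIncome": "Double"},
    prediction_fields={"LandCover2000": "Double", "Income": "Double"},
    renames={"LandCover2010": "LandCover2000", "PerCapitaIncome": "Income"},
)
print(pairs)
```

Raising an error on a missing or mistyped field before the tool runs makes a field mismatch easier to diagnose than a failure partway through the analysis.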

Licensing information

  • Basic: No
  • Standard: No
  • Advanced: Yes
