Summary
Performs generalized linear regression (GLR) to generate predictions or to model a dependent variable in terms of its relationship to a set of explanatory variables. This tool can be used to fit continuous (OLS), binary (logistic), and count (Poisson) models.
Usage
This tool can be used in two operation modes. You can evaluate the performance of different models as you explore different explanatory variables and tool settings. Once a good model has been found, you can fit the model to a new dataset.
Use the Input Features parameter with a field representing the phenomenon you are modeling (the Dependent Variable parameter) and one or more fields representing the explanatory variables.
The Generalized Linear Regression tool also produces output features and diagnostics. Output feature layers are automatically added to the map with a rendering scheme applied to model residuals. A full explanation of each output is provided below.
Use the correct model type (Continuous, Binary, or Count) for your data; the wrong model type produces inaccurate regression results.
Model summary results and diagnostics are written to the messages window, and charts will be created below the output feature class. The diagnostics reported depend on the Model Type parameter. The three model type options are as follows:
- Use the Continuous (Gaussian) model type if the dependent variable can accept a wide range of values such as temperature or total sales. Ideally, the dependent variable will be normally distributed.
- Use a Binary (logistic) model type if the dependent variable can accept one of two possible values, such as success and failure or presence and absence. The field containing the dependent variable must be numeric and contain only ones and zeros. There must be variation of ones and zeros in your data.
- Consider using a Count (Poisson) model type if the dependent variable is discrete and represents the number of occurrences of an event such as a count of crimes. Count models can also be used if the dependent variable represents a rate and the denominator of the rate is a fixed value such as sales per month or number of people with cancer per 10,000 in the population. A Count model assumes that the mean and variance of the dependent variable are equal, and the values of the dependent variable cannot be negative or contain decimals.
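The criteria in the list above can be sketched as a small pre-check on the dependent variable. This is an illustrative helper only, not part of the tool: the function names suggest_model_type and dispersion_ratio are hypothetical, and the returned labels simply reuse the model type names from this page.

```python
from statistics import mean, variance


def suggest_model_type(values):
    """Suggest a model type label for a list of numeric dependent-variable values.

    Hypothetical helper: it only applies the rules of thumb described above.
    """
    distinct = set(values)
    if len(distinct) < 2:
        # A field whose values are all identical cannot be modeled.
        raise ValueError("dependent variable has no variation")
    if distinct == {0, 1}:
        # Only zeros and ones, and both are present.
        return "Binary"
    if all(v >= 0 and float(v).is_integer() for v in values):
        # Non-negative whole numbers, such as counts of events.
        return "Count"
    return "Continuous"


def dispersion_ratio(values):
    """Return sample variance divided by mean.

    A value near 1 is consistent with the Count model assumption that
    the mean and variance of the dependent variable are equal.
    """
    m = mean(values)
    if m == 0:
        raise ValueError("mean is zero; ratio is undefined")
    return variance(values) / m
```

A ratio far from 1 suggests that the equal mean and variance assumption of the Count model does not hold for the data.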
The Dependent Variable and Explanatory Variable parameters should be numeric fields containing a range of values. The tool cannot solve a model when every value of a variable is the same (if all the values for a field are 9.0, for example).
Features with one or more null values or empty string values in prediction or explanatory fields will be excluded from the output. You can modify values using the Calculate Field tool if necessary.
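Both conditions can be checked before running the tool. The following is a minimal sketch that assumes the attribute rows have already been read into a list of dictionaries; the helper names usable_records and constant_fields, and the sample field names, are hypothetical.

```python
def usable_records(records, fields):
    """Keep only records that have a value (not None, not "") in every listed field."""
    return [
        r for r in records
        if all(r.get(f) not in (None, "") for f in fields)
    ]


def constant_fields(records, fields):
    """Return the fields whose values are identical across all records."""
    return [f for f in fields if len({r[f] for r in records}) <= 1]


# Illustrative rows only; the field names are made up for this sketch.
rows = [
    {"WARD": 1, "DAY": 9.0},
    {"WARD": None, "DAY": 9.0},   # null value: would be excluded
    {"WARD": 3, "DAY": 9.0},
    {"WARD": 2, "DAY": ""},       # empty string: would be excluded
]
kept = usable_records(rows, ["WARD", "DAY"])
flagged = constant_fields(kept, ["WARD", "DAY"])  # DAY never varies
```

Fields reported by constant_fields would need to be dropped or replaced before fitting, since the tool cannot solve with them.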
Review the over- and underpredictions evident in the regression residuals to see whether they provide information about potential missing variables from your regression model.
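As a rough illustration of this review, the sketch below labels each feature as over- or underpredicted. It assumes a residual is the observed value minus the estimated value; the function name classify_residuals and the tolerance argument are hypothetical and not part of the tool output.

```python
def classify_residuals(observed, estimated, tolerance=0.0):
    """Label each observation by the sign of its residual (observed - estimated)."""
    labels = []
    for obs, est in zip(observed, estimated):
        residual = obs - est
        if residual > tolerance:
            labels.append("underprediction")   # model estimated too low
        elif residual < -tolerance:
            labels.append("overprediction")    # model estimated too high
        else:
            labels.append("close")
    return labels
```

If features with the same label share a characteristic, that characteristic may point to an explanatory variable missing from the model.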
You can use the regression model that has been created to make predictions for other features. Each prediction feature must have a value for every explanatory variable. If the field names in the Input Features and Input Prediction Features parameters do not match, a variable matching parameter is provided so you can pair them. Matched fields must be of the same type (double fields must be matched with double fields, for example).
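The matching rules can be sketched as follows. This is an illustrative helper under stated assumptions: build_variable_matches is a hypothetical name, field types are supplied as plain dictionaries of field name to type string, and the pairs are returned in the [prediction field, input field] order used by the explanatory_variables_to_match parameter.

```python
def build_variable_matches(explanatory_variables, input_types,
                           prediction_types, renames=None):
    """Pair each explanatory variable with its field in the prediction features.

    explanatory_variables: field names used to fit the model.
    input_types / prediction_types: dicts mapping field name -> type string.
    renames: optional dict mapping an input field name to the differently
             named field in the prediction features.
    """
    renames = renames or {}
    pairs = []
    for name in explanatory_variables:
        pred_name = renames.get(name, name)
        if pred_name not in prediction_types:
            raise KeyError(f"prediction features have no field {pred_name!r}")
        if input_types[name] != prediction_types[pred_name]:
            raise TypeError(
                f"{name!r} is {input_types[name]} but "
                f"{pred_name!r} is {prediction_types[pred_name]}"
            )
        pairs.append([pred_name, name])
    return pairs
```

In an ArcGIS Pro session, the two type dictionaries could be filled from arcpy.ListFields, using the name and type properties of each field.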
You can improve the processing speed of the Generalized Linear Regression tool by doing one or more of the following:
- Set the processing extent for analysis so you only analyze data of interest.
- Don't output a coefficient table.
- Use data that is local to where the analysis is being run.
This geoprocessing tool is powered by Spark. Analysis is completed on your desktop machine using multiple cores in parallel. See Considerations for GeoAnalytics Desktop tools to learn more about running analysis.
When running GeoAnalytics Desktop tools, the analysis is completed on your desktop machine. For optimal performance, data should be available on your desktop. If you are using a hosted feature layer, it is recommended that you use ArcGIS GeoAnalytics Server. If your data isn't local, it will take longer to run a tool. To use your ArcGIS GeoAnalytics Server to perform analysis, see GeoAnalytics Tools.
The GeoAnalytics implementation of GLR has the following limitations:
- It is a global regression model and does not take the spatial distribution of data into account.
- Analysis does not apply Moran's I test on the residuals.
- Feature datasets (points, lines, polygons, and tables) are supported as input; rasters are not supported.
- You cannot classify values into multiple classes.
Syntax
GeneralizedLinearRegression(input_features, dependent_variable, model_type, explanatory_variables, output_features, {input_features_to_predict}, {explanatory_variables_to_match}, {dependent_variable_mapping}, {output_predicted_features}, {coefficient_table})
Parameter | Explanation | Data Type |
input_features | The layer containing the dependent and independent variables. | Table View |
dependent_variable | The numeric field containing the observed values to be modeled. | Field |
model_type | Specifies the type of data that will be modeled. | String |
explanatory_variables [explanatory_variables,...] | A list of fields representing independent explanatory variables in the regression model. | Field |
output_features | The name of the feature class that will be created containing the dependent variable estimates and residuals. | Table; Feature Class |
input_features_to_predict (Optional) | A layer containing features representing locations where estimates should be computed. Each feature in this dataset should contain values for all the explanatory variables specified. The dependent variable for these features will be estimated using the model calibrated for the input layer data. | Table View |
explanatory_variables_to_match [[Field from Prediction Locations, Field from Input Features],...] (Optional) | Matches the explanatory variables in the input_features_to_predict parameter to corresponding explanatory variables from the input_features parameter—for example, [["LandCover2000", "LandCover2010"], ["Income", "PerCapitaIncome"]] | Value Table |
dependent_variable_mapping [dependent_variable_mapping,...] (Optional) | Two strings representing the values used to map to 0 (absence) and 1 (presence) for binary regression. By default 0 and 1 will be used. For example, if you wanted to predict an arrest and had fields with values of Arrest and No Arrest, you would enter No Arrest for False Value (0) and Arrest for True Value (1). | Value Table |
output_predicted_features (Optional) | The output feature class with the dependent variable estimates for each feature in input_features_to_predict. | Table; Feature Class |
coefficient_table (Optional) | The output table that will contain the coefficients of the fitted model. | Table |
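To illustrate what the dependent_variable_mapping parameter describes, the sketch below recodes two text values to 0 and 1. It is a conceptual example only and does not show how the tool performs the mapping internally; the function name map_dependent_values is hypothetical.

```python
def map_dependent_values(values, false_value, true_value):
    """Recode two category values to 0 (absence) and 1 (presence)."""
    mapping = {false_value: 0, true_value: 1}
    recoded = []
    for value in values:
        if value not in mapping:
            # Binary regression expects exactly two possible values.
            raise ValueError(f"unexpected value: {value!r}")
        recoded.append(mapping[value])
    return recoded
```

For the arrest example in the table, "No Arrest" would be passed as false_value and "Arrest" as true_value.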
Code sample
The following Python window script demonstrates how to use the GeneralizedLinearRegression tool.
In this script, you create a model and predict whether an arrest was made for a reported crime.
#-------------------------------------------------------------------------------
# Name: GeneralizedLinearRegression.py
# Description: Run GLR on crime data and predict whether an arrest was made for a reported crime.
#
# Requirements: Advanced License

# Import system modules
import arcpy

arcpy.env.workspace = "c:/data/city.gdb"

# Set local variables
trainingDataset = "old_crimes"
predictionDataset = "new_crimes"
outputTrainingName = "training"
outputPredictedName = "predicted"

# Run Generalized Linear Regression.
# GeoAnalytics Desktop tools are called through the gapro toolbox alias.
# Optional parameters are passed by keyword so that each value lands on the
# parameter named in the Syntax section.
arcpy.gapro.GeneralizedLinearRegression(
    trainingDataset, "ArrestMade", "BINARY", "CRIME_TYPE;WARD;DAY_OF_MONTH",
    outputTrainingName,
    input_features_to_predict=predictionDataset,
    explanatory_variables_to_match="CRIME_TYPE CRIME_TYPE;WARD WARD;DAY_OF_MONTH DAY_OF_MON",
    dependent_variable_mapping="Arrest NoArrest",
    output_predicted_features=outputPredictedName)
Licensing information
- Basic: No
- Standard: No
- Advanced: Yes