How Generalized Linear Regression works

Regression analysis may be the most commonly used statistical technique in the social sciences. Regression is used to evaluate relationships between two or more feature attributes. Identifying and measuring relationships allows you to better understand what's going on in a place, predict where something is likely to occur, or examine why things occur where they do. Generalized Linear Regression creates a model of the variable or process you are trying to understand or predict; you can use that model to examine and quantify relationships among features.

Note:

This tool is new in ArcGIS Pro 2.3 and includes the functionality of Ordinary Least Squares (OLS). This tool includes the additional models of Count (Poisson) and Binary (Logistic) which allow the tool to be applied to a wider range of problems.

Potential applications

Generalized Linear Regression can be used for a variety of applications, including the following:

  • What demographic characteristics contribute to high rates of public transportation usage?
  • Is there a positive relationship between vandalism and burglary?
  • Which variables effectively predict 911 call volume? Given future projections, what is the expected demand for emergency response resources?
  • What variables affect low birth rates?

Inputs

To run the Generalized Linear Regression tool, provide Input Features with a field representing the Dependent Variable and one or more fields representing the Explanatory Variable(s) or, optionally, Distance Features. These fields must be numeric and have a range of values. Features that contain missing values in the dependent or explanatory variables will be excluded from the analysis; however, you can use the Fill Missing Values tool to complete the dataset before running the Generalized Linear Regression tool. Next, you must choose a Model Type based on the data you are analyzing. It is important to use an appropriate model for your data. Descriptions of the model types and how to determine the appropriate one for your data are below.
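The exclusion rule for missing values can be pictured with a small sketch. This is an illustration only, assuming the attribute table has been read into a list of dictionaries; the record layout, the field names, and the `drop_incomplete` helper are hypothetical and are not part of the tool.

```python
import math


def drop_incomplete(records, fields):
    """Keep only records where every listed field holds a finite number."""
    def is_valid(value):
        # bool is a subclass of int, so rule it out explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        return math.isfinite(value)

    return [r for r in records if all(is_valid(r.get(f)) for f in fields)]


# Hypothetical records: one dependent variable and two explanatory variables.
records = [
    {"calls": 12, "population": 3400.0, "income": 41000.0},
    {"calls": None, "population": 2900.0, "income": 38500.0},
    {"calls": 7, "population": float("nan"), "income": 52000.0},
    {"calls": 20, "population": 6100.0, "income": 29000.0},
]
complete = drop_incomplete(records, ["calls", "population", "income"])
print(len(records) - len(complete), "records would be excluded")
```

Counting the excluded records before running the analysis is a quick way to decide whether filling the missing values first is worth the effort.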

Model type

Generalized Linear Regression provides three types of regression models: Continuous, Binary, and Count. These types of regression are known in the statistical literature as Gaussian, Logistic, and Poisson, respectively. Choose the Model Type for your analysis based on how your Dependent Variable was measured or summarized and on the range of values it contains.

Continuous (Gaussian)

Use the Continuous (Gaussian) Model Type if your Dependent Variable can take on a wide range of values, such as temperature or total sales. Ideally, your dependent variable will be normally distributed. You can create a histogram of your dependent variable to check this. For a normal distribution, the histogram is a symmetrical bell curve: most of the values are clustered near the mean, few values depart radically from it, and there are as many values on the left side of the mean as on the right (the mean and median are the same). If the histogram has this shape, use a Gaussian model type. If your Dependent Variable does not appear to be normally distributed, consider reclassifying it as a binary variable. For example, if your dependent variable is average household income, you can recode it as a binary variable where 1 indicates above the national median income and 0 indicates below the national median income. A continuous field can be reclassified as a binary field using the Reclassify helper function in the Calculate Field tool.
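Besides the histogram, a rough numeric check is to compare the mean with the median and look at the skewness. The sketch below also shows the binary recoding described above. It is an illustration outside the tool: the income values, the threshold, and the helper names are made up for the example.

```python
import statistics


def skewness(values):
    """Skewness as the third standardized moment (0 for a symmetric sample)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def recode_binary(values, threshold):
    """Return 1 where the value is above the threshold, otherwise 0."""
    return [1 if v > threshold else 0 for v in values]


# Hypothetical average household incomes with a long right tail.
incomes = [31000, 35000, 38000, 42000, 47000, 55000, 68000, 120000, 250000]
print("mean:", round(statistics.fmean(incomes)))
print("median:", statistics.median(incomes))
print("skewness:", round(skewness(incomes), 2))

national_median = 50000  # assumed value, for illustration only
above_median = recode_binary(incomes, national_median)
print(above_median)
```

A mean well above the median together with a clearly positive skewness suggests a right-skewed variable, which is the situation where recoding to a binary field may be worth considering.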

Binary (Logistic)

Use a Binary (Logistic) Model Type if your Dependent Variable can take on one of two possible values such as success and failure or presence and absence. The field containing your Dependent Variable must be numeric and contain only ones and zeros. Results will be easier to interpret if you code the event of interest, such as success or presence of an animal, as 1, as the regression will model the probability of 1. There must be variation of the ones and zeros in your data. If you create a histogram of your Dependent Variable, it should only show ones and zeros.
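The two requirements above (only ones and zeros, and some of each) are easy to verify before running the tool. The following is a minimal sketch, assuming the dependent variable has been read into a Python list; `check_binary` is a hypothetical helper, not a tool function.

```python
def check_binary(values):
    """Return (ok, message) for a candidate binary dependent variable."""
    distinct = set(values)
    unexpected = sorted(distinct - {0, 1})
    if unexpected:
        return False, f"unexpected values: {unexpected}"
    if len(distinct) < 2:
        return False, "no variation: every value is the same"
    share = sum(values) / len(values)
    return True, f"share of ones: {share:.2f}"


# Hypothetical presence (1) / absence (0) observations.
presence = [1, 0, 0, 1, 1, 0, 1, 0]
ok, message = check_binary(presence)
print(ok, message)
```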

Count (Poisson)

Consider using a Count (Poisson) Model Type if your Dependent Variable is discrete and represents the number of occurrences of an event such as a count of crimes. Count models can also be used if your Dependent Variable represents a rate and the denominator of the rate is a fixed value such as sales per month or number of people with cancer per 10,000 population. A Count (Poisson) model assumes that the mean and variance of the Dependent Variable are equal, and the values of your Dependent Variable cannot be negative or contain decimals.
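A quick way to screen a count variable against these assumptions is to confirm that the values are non-negative whole numbers and to compare the sample variance with the mean. The sketch below is an illustration only; the crime counts and the `check_counts` helper are hypothetical.

```python
import statistics


def check_counts(values):
    """Validate count data and return (mean, variance, variance / mean)."""
    for v in values:
        if v < 0 or v != int(v):
            raise ValueError(f"not a non-negative whole number: {v!r}")
    mean = statistics.fmean(values)
    variance = statistics.variance(values)  # sample variance (n - 1)
    ratio = variance / mean if mean > 0 else float("nan")
    return mean, variance, ratio


# Hypothetical number of crimes per census block.
crimes = [0, 2, 1, 3, 0, 4, 2, 1, 5, 2]
mean, variance, ratio = check_counts(crimes)
print(f"mean={mean:.2f} variance={variance:.2f} ratio={ratio:.2f}")
```

A ratio near 1 is consistent with the equal mean and variance assumption. A ratio far above 1 points to overdispersion, in which case the results of a Poisson model should be read with caution.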

Distance Features

Although Generalized Linear Regression is not an inherently spatial method, one way to leverage the power of space in your analysis is using distance features. For example, if you are modeling the performance of a series of retail stores, a variable representing the distance to highway on-ramps or the distance to a closest competitor may be critical to producing accurate predictions. Similarly, if you're modeling air quality, an explanatory variable representing distance to major sources of pollution or distance to major roadways is crucial. Distance features are used to automatically create explanatory variables by calculating a distance from the provided features to the Input Features. Distances will be calculated from each of the input Explanatory Distance Features to the nearest Input Features. If the input Explanatory Distance Features are polygons or lines, the distance attributes are calculated as the distance between the closest segments of the pair of features. However, distances are calculated differently for polygons and lines. See How proximity tools calculate distance for details.
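For point data, the idea behind a distance-based explanatory variable can be sketched in a few lines. This is a simplified illustration, assuming planar (projected) coordinates and point features only. The tool's own handling of lines and polygons follows the proximity rules referenced above and is not reproduced here. The coordinates and the `nearest_distance` helper are hypothetical.

```python
import math


def nearest_distance(point, candidates):
    """Planar distance from one (x, y) point to the closest candidate point."""
    return min(math.dist(point, c) for c in candidates)


# Hypothetical projected coordinates, in the same linear unit.
stores = [(0.0, 0.0), (10.0, 5.0), (3.0, 4.0)]
on_ramps = [(0.0, 3.0), (9.0, 5.0)]

# One new explanatory value per store: distance to the nearest on-ramp.
dist_to_ramp = [nearest_distance(s, on_ramps) for s in stores]
print(dist_to_ramp)
```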

Prediction

You can use the regression model that has been created to make predictions for other features (either points or polygons). Creating these predictions requires that each of the Prediction Locations has values for each of the Explanatory Variable(s) provided as well as any Explanatory Distance Features for the area of interest. If the field names from the Input Features and Prediction Locations parameters do not match, a variable matching parameter is provided. When matching the explanatory variables, the fields from the Input Features and Prediction Locations parameters must be of the same type (double fields must be matched with double fields, for example). Any Explanatory Distance Features must also be matched.
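The matching requirement can be checked ahead of time with a small script. The sketch below assumes each dataset's schema is available as a dictionary of field name to field type. The field names, the type labels, and the `check_matches` helper are hypothetical, and the tool performs its own validation.

```python
def check_matches(input_fields, prediction_fields, matches):
    """Return a list of problems found when pairing explanatory variables.

    input_fields, prediction_fields: dict of field name -> field type.
    matches: dict of input field name -> prediction field name; fields
    not listed are assumed to share the same name in both datasets.
    """
    problems = []
    for train_name, train_type in input_fields.items():
        pred_name = matches.get(train_name, train_name)
        if pred_name not in prediction_fields:
            problems.append(f"{train_name}: no field '{pred_name}' to match")
        elif prediction_fields[pred_name] != train_type:
            pred_type = prediction_fields[pred_name]
            problems.append(
                f"{train_name}: type {train_type} does not match "
                f"{pred_name} ({pred_type})"
            )
    return problems


# Hypothetical schemas for the Input Features and the Prediction Locations.
input_fields = {"POP_DENS": "Double", "MED_INC": "Double"}
prediction_fields = {"POP_DENS": "Double", "INCOME_2030": "Long"}
problems = check_matches(input_fields, prediction_fields,
                         {"MED_INC": "INCOME_2030"})
print(problems)
```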

Outputs

The Generalized Linear Regression tool produces a variety of different outputs. A summary of the GLR model and statistical summaries are available as a message at the bottom of the Geoprocessing pane during tool execution. To access the messages, hover the pointer over the progress bar, click the pop-out button, or expand the messages section in the Geoprocessing pane. You can also access messages of a previously run Generalized Linear Regression tool via the geoprocessing history. The tool also generates Output Features, charts and optionally Output Predicted Features. The Output Features and associated charts are automatically added to the Contents pane with a hot and cold rendering scheme applied to model residuals. The diagnostics and charts generated depend on the Model Type of the Input Features and are described below.

Continuous (Gaussian)

Interpreting messages and diagnostics

  • AICc—This is a measure of model performance and can be used to compare regression models. Taking into account model complexity, the model with the lower AICc value provides a better fit to the observed data. AICc is not an absolute measure of goodness of fit but is useful for comparing models with different explanatory variables as long as they apply to the same dependent variable. If the AICc values for two models differ by more than 3, the model with the lower AICc value is held to be better. Comparing the GWR AICc value to the GLR AICc value is one way to assess the benefits of moving from a global model (GLR) to a local regression model (GWR).
  • R2—The R-Squared is a measure of goodness of fit. Its value varies from 0.0 to 1.0, with higher values being preferable. It may be interpreted as the proportion of dependent variable variance accounted for by the regression model. The denominator for the R2 computation is the sum of squared deviations of the dependent variable values from their mean. Adding an extra explanatory variable to the model does not alter the denominator but does alter the numerator; this gives the impression of improvement in model fit that may not be real. See Adjusted R2 below.
  • Adjusted R2—Because of the problem described above for the R2 value, calculations for the adjusted R-squared value normalize the numerator and denominator by their degrees of freedom. This has the effect of compensating for the number of variables in a model, and consequently, the Adjusted R2 value is almost always less than the R2 value. However, in making this adjustment, you lose the interpretation of the value as a proportion of the variance explained. In GWR, the effective number of degrees of freedom is a function of the neighborhood used, so the adjustment may be quite marked in comparison to a global model such as GLR. For this reason, AICc is preferred as a means of comparing models.
  • Joint F-Statistic and Joint Wald Statistic—Both the Joint F-Statistic and Joint Wald Statistic are measures of overall model statistical significance. The Joint F-Statistic is trustworthy only when the Koenker (BP) statistic (see below) is not statistically significant. If the Koenker (BP) statistic is significant, consult the Joint Wald Statistic to determine overall model significance. The null hypothesis for both tests is that the explanatory variables in the model are not effective. For a 95 percent confidence level, a p-value (probability) less than 0.05 indicates a statistically significant model.
  • Koenker (BP) statistic (Koenker's studentized Breusch-Pagan statistic)—This is a test to determine whether the explanatory variables in the model have a consistent relationship to the dependent variable both in geographic space and in data space. When the model is consistent in geographic space, the spatial processes represented by the explanatory variables behave the same everywhere in the study area (the processes are stationary). When the model is consistent in data space, the variation in the relationship between predicted values and each explanatory variable does not change with changes in explanatory variable magnitudes (there is no heteroscedasticity in the model). Suppose you want to predict crime, and one of your explanatory variables is income. The model would have problematic heteroscedasticity if the predictions were more accurate for locations with small median incomes than they were for locations with large median incomes. The null hypothesis for this test is that the model is stationary. For a 95 percent confidence level, a p-value (probability) less than 0.05 indicates statistically significant heteroscedasticity or nonstationarity. When results from this test are statistically significant, consult the robust coefficient standard errors and probabilities to assess the effectiveness of each explanatory variable. Regression models with statistically significant nonstationarity are often good candidates for Geographically Weighted Regression (GWR) analysis.
  • Jarque-Bera—This indicates whether the residuals (the observed or known dependent variable values minus the predicted or estimated values) are normally distributed. The null hypothesis for this test is that the residuals are normally distributed, so if you were to construct a histogram of those residuals, they would resemble the classic bell curve, or Gaussian distribution. When the p-value (probability) for this test is small (less than 0.05 for a 95 percent confidence level, for example), the residuals are not normally distributed, indicating your model is biased. If you also have statistically significant spatial autocorrelation of your residuals, the bias may be the result of model misspecification (a key variable is missing from the model). Results from a misspecified regression model are not trustworthy. A statistically significant Jarque-Bera test can also occur if you are modeling nonlinear relationships, if your data includes influential outliers, or when there is strong heteroscedasticity.
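Several of the diagnostics above follow standard textbook formulas, which the sketch below reproduces for a small made-up dataset. It is an illustration under stated assumptions rather than the tool's implementation: R2 uses the total sum of squares about the mean, Adjusted R2 uses the usual degrees-of-freedom correction with k explanatory variables, and the Jarque-Bera p-value uses the asymptotic chi-square distribution with 2 degrees of freedom. The tool may compute or report these values differently.

```python
import math
import statistics


def r_squared(observed, predicted):
    """1 - SSE / SST, with SST taken about the mean of the observed values."""
    mean = statistics.fmean(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean) ** 2 for o in observed)
    return 1.0 - sse / sst


def adjusted_r_squared(r2, n, k):
    """Degrees-of-freedom correction; k excludes the intercept."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def jarque_bera(residuals):
    """Return (statistic, asymptotic p-value) for a list of residuals."""
    n = len(residuals)
    mean = statistics.fmean(residuals)
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    stat = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # Survival function of a chi-square with 2 degrees of freedom.
    return stat, math.exp(-stat / 2.0)


# Hypothetical observed and predicted values from a one-variable model.
observed = [3.0, 5.0, 7.0, 9.0, 11.0]
predicted = [2.8, 5.3, 6.9, 9.2, 10.8]
residuals = [o - p for o, p in zip(observed, predicted)]

r2 = r_squared(observed, predicted)
adj_r2 = adjusted_r_squared(r2, n=len(observed), k=1)
jb_stat, jb_p = jarque_bera(residuals)
print(f"R2={r2:.4f} adjusted R2={adj_r2:.4f} JB={jb_stat:.3f} p={jb_p:.3f}")
```

With only five observations the asymptotic Jarque-Bera p-value is not reliable; the tiny sample is used here only to keep the arithmetic easy to follow.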

Output charts

The charts created with this tool for the Continuous Model Type include a scatter plot matrix of the variables used in the model, a histogram of model residuals, and a plot of the residuals and predictions.

Binary (Logistic)

Interpreting messages and diagnostics

  • AICc—This is a measure of model performance and can be used to compare regression models. Taking into account model complexity, the model with the lower AICc value provides a better fit to the observed data. AICc is not an absolute measure of goodness of fit but is useful for comparing models with different explanatory variables as long as they apply to the same dependent variable. If the AICc values for two models differ by more than 3, the model with the lower AICc value is held to be better. Comparing the GWR AICc value to the GLR AICc value is one way to assess the benefits of moving from a global model (GLR) to a local regression model (GWR).
  • % deviance explained—The proportion of the dependent variable variance that is accounted for by the explanatory variables.
  • Joint Wald Statistic—The Joint Wald Statistic is a measure of overall model statistical significance. The null hypothesis for this test is that the explanatory variables in the model are not effective. For a 95 percent confidence level, a p-value (probability) less than 0.05 indicates a statistically significant model.
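For a binary model, deviance explained is commonly defined as one minus the ratio of the fitted model's deviance to the deviance of an intercept-only (null) model. The sketch below applies that textbook definition to made-up outcomes and predicted probabilities. It assumes this is the relevant definition; the tool's exact computation is not documented here, and the helper names are hypothetical.

```python
import math


def binomial_deviance(y, p):
    """-2 * log-likelihood of 0/1 outcomes y under probabilities p."""
    return -2.0 * sum(
        math.log(pi) if yi == 1 else math.log(1.0 - pi)
        for yi, pi in zip(y, p)
    )


def percent_deviance_explained(y, p):
    """Compare the fitted model with a null model that predicts the base rate."""
    base_rate = sum(y) / len(y)
    null_deviance = binomial_deviance(y, [base_rate] * len(y))
    return 100.0 * (1.0 - binomial_deviance(y, p) / null_deviance)


# Hypothetical outcomes and predicted probabilities of the value 1.
y = [1, 0, 1, 1, 0, 0]
p = [0.9, 0.2, 0.8, 0.7, 0.3, 0.1]
explained = percent_deviance_explained(y, p)
print(f"{explained:.1f}% deviance explained")
```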

Output charts

The charts created with this tool for the Binary Model Type include a scatter plot matrix of the variables used in the model, a box plot displaying the distribution of the explanatory variables, a histogram of model residuals, and a prediction performance table.

Count (Poisson)

Interpreting messages and diagnostics

  • AICc—This is a measure of model performance and can be used to compare regression models. Taking into account model complexity, the model with the lower AICc value provides a better fit to the observed data. AICc is not an absolute measure of goodness of fit but is useful for comparing models with different explanatory variables as long as they apply to the same dependent variable. If the AICc values for two models differ by more than 3, the model with the lower AICc value is held to be better. Comparing the GWR AICc value to the GLR AICc value is one way to assess the benefits of moving from a global model (GLR) to a local regression model (GWR).
  • % deviance explained—The proportion of the dependent variable variance that is accounted for by the explanatory variables.
  • Joint Wald Statistic—The Joint Wald Statistic is a measure of overall model statistical significance. The null hypothesis for this test is that the explanatory variables in the model are not effective. For a 95 percent confidence level, a p-value (probability) less than 0.05 indicates a statistically significant model.
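For count data, the same ratio can be formed with the Poisson deviance. The sketch below uses the standard Poisson deviance formula and an intercept-only null model that predicts the mean count everywhere. The counts, the predicted values, and the helper names are hypothetical, and the tool's own computation may differ in detail.

```python
import math


def poisson_deviance(y, mu):
    """2 * sum(y * ln(y / mu) - (y - mu)), with y * ln(y / mu) = 0 when y = 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - (yi - mi)
    return 2.0 * total


def percent_deviance_explained(y, mu):
    """Compare the fitted model with a null model that predicts the mean count."""
    mean = sum(y) / len(y)
    null_deviance = poisson_deviance(y, [mean] * len(y))
    return 100.0 * (1.0 - poisson_deviance(y, mu) / null_deviance)


# Hypothetical observed counts and model-predicted expected counts.
counts = [0, 2, 1, 3, 4]
expected = [0.5, 1.8, 1.2, 2.9, 3.6]
explained = percent_deviance_explained(counts, expected)
print(f"{explained:.1f}% deviance explained")
```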

Output charts

The charts created with this tool for the Count Model Type include a scatter plot matrix of the variables used in the model, a histogram of model residuals, and a plot of the residuals and predictions.

Additional resources

There are a number of resources to help you learn more about Generalized Linear Regression and Geographically Weighted Regression. Start with Regression analysis basics or work through the Regression Analysis tutorial.

The following are also helpful resources:

Fox, J. (1991). Regression Diagnostics. Sage, Newbury Park, CA.

Menard, S. (2002). Applied logistic regression analysis (Vol. 106). Sage.

Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society, Series A, 135, 370-384.