What they don't tell you about regression analysis

Regression analysis is used to understand, model, predict, and explain complex phenomena. It helps you answer questions such as "Why are there places in the United States with test scores that are consistently above the national average?" or "Why are there areas of the city with such high rates of residential burglary?" You might use regression analysis to explain childhood obesity, for example, using a set of related variables such as income, education, and accessibility to healthy food.

Typically, regression analysis helps you answer these why questions so that you can do something about them. If, for example, you discover that childhood obesity is lower in schools that serve fresh fruits and vegetables at lunch, you can use that information to guide policy and make decisions about school lunch programs. Likewise, knowing the variables that help explain high crime rates can allow you to make predictions about future crime so that prevention resources can be allocated more effectively.

These are the things they do tell you about regression analysis.

What they don't tell you about regression analysis is that it isn't always easy to find a set of explanatory variables that will allow you to answer your question or to explain the complex phenomenon you are trying to model. Childhood obesity, crime, test scores, and almost all the things that you might want to model using regression analysis are complicated issues that rarely have simple answers. Chances are, if you have ever tried to build your own regression model, this is nothing new to you.

Fortunately, when you run the Generalized Linear Regression (GLR) tool, you are presented with a set of diagnostics that help you figure out whether you have a properly specified model; a properly specified model is one you can trust. This document walks through six checks your model needs to pass before you can have confidence in it, explains the diagnostics behind each check, and describes techniques for solving some of the most common regression analysis problems.

Tip:

Once you understand the information presented below, you might decide to use the Exploratory Regression tool to help you find a model that meets all the requirements of the GLR method.

Get started

Choosing the variable you want to understand, predict, or model is your first task. This variable is known as the dependent variable. Childhood obesity, crime, and test scores would be the dependent variables being modeled in the examples described above.

Next, you have to decide which factors might help explain your dependent variable. These variables are known as the explanatory variables. In the childhood obesity example, the explanatory variables might be things such as income, education, and accessibility to healthy food. You will need to do your research here to identify all the explanatory variables that might be important; consult theory and existing literature, talk to experts, and always rely on your common sense. The preliminary research you do up front will increase your chances of finding a good model.

With the dependent variable and the candidate explanatory variables selected, you are ready to run your analysis. Start with Generalized Linear Regression (GLR) or Exploratory Regression, because these tools perform important diagnostic tests that tell you whether you've found a useful model or still have some work to do.

The GLR tool generates several outputs, including a map of the regression residuals, charts, and a summary report. The regression residuals map shows the under- and overpredictions from your model, and analyzing this map is an important step in finding a good model. The summary report is largely numeric and includes all the diagnostics you will use when going through the six checks below.

GLR outputs
Output from the GLR tool includes a summary report, charts, and a residuals map.

Six diagnostic checks

Check 1: Are these explanatory variables helping my model?

After consulting theory and existing research, you will have identified a set of candidate explanatory variables. You'll have good reasons for including each one in your model. However, after running your model, you'll find that some of your explanatory variables are statistically significant and some are not.

How will you know which explanatory variables are significant? The GLR tool calculates a coefficient for each explanatory variable in the model and performs a statistical test to determine whether that variable is helping your model or not. The statistical test computes the probability that the coefficient is actually zero. If the coefficient is zero (or very near zero), the associated explanatory variable is not helping your model. When the statistical test returns a small probability (p-value) for a particular explanatory variable, however, it indicates that it is unlikely (there is a small probability) that the coefficient is zero. When the probability is smaller than 0.05, an asterisk next to the probability on the GLR summary report indicates the associated explanatory variable is important to your model (in other words, its coefficient is statistically significant at the 95 percent confidence level). So you are looking for explanatory variables associated with statistically significant probabilities (look for ones with asterisks).
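If you want to reproduce this check outside ArcGIS, the sketch below uses statsmodels to fit an ordinary least-squares model and print the coefficient p-values. The file name and the column names (OBESITY, INCOME, EDUCATION, FOOD_ACCESS) are hypothetical placeholders for your own data, and the numbers will not match the GLR report exactly.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical table exported from your feature class; adjust the file and column names.
df = pd.read_csv("childhood_obesity.csv")

y = df["OBESITY"]                                                # dependent variable
X = sm.add_constant(df[["INCOME", "EDUCATION", "FOOD_ACCESS"]])  # explanatory variables plus intercept

model = sm.OLS(y, X).fit()

print(model.params)    # coefficient estimates
print(model.pvalues)   # p-values; values below 0.05 correspond to the asterisks in the GLR report
```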

The GLR tool computes both the probability and the robust probability for each explanatory variable. With spatial data, it is not unusual for the relationships you are modeling to vary across the study area. These relationships are characterized as nonstationary. When the relationships are nonstationary, you can only trust robust probabilities to tell you whether an explanatory variable is statistically significant.

How will you know if the relationships in your model are nonstationary? Another statistical test included in the GLR summary report is the Koenker (Koenker's studentized Breusch-Pagan) statistic for nonstationarity. An asterisk next to the Koenker p-value indicates the relationships you are modeling exhibit statistically significant nonstationarity, so be sure to consult the robust probabilities.

Koenker diagnostic
When the Koenker p-value is statistically significant (asterisk), you can only trust the robust probabilities.
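If you are working outside ArcGIS, you can approximate both diagnostics with statsmodels: the Koenker diagnostic is a studentized Breusch-Pagan test, and heteroskedasticity-robust standard errors play the role of the robust probabilities. The sketch below reuses the fitted model from the earlier example; the HC3 covariance choice is an assumption, and the values will differ from the GLR report.

```python
from statsmodels.stats.diagnostic import het_breuschpagan

# Studentized (Koenker) Breusch-Pagan test on the model residuals.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print("Koenker/Breusch-Pagan p-value:", lm_pvalue)

# If the test is significant, rely on heteroskedasticity-robust probabilities instead.
robust = model.get_robustcov_results(cov_type="HC3")
print(robust.pvalues)
```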

Typically, you will remove explanatory variables from your model if they are not statistically significant. However, if theory indicates a variable is very important, or if a particular variable is the focus of your analysis, you might retain it even if it's not statistically significant.

Note:

In the process of looking for a properly specified GLR model, you will likely try a variety of explanatory variables. Be aware that explanatory variable coefficients (and their statistical significance) can change radically depending on the combination of variables you include in your model.

Check 2: Are the relationships what I expected?

Not only is it important to determine whether an explanatory variable is actually helping your model, but you also want to check the sign (+/-) associated with each coefficient to make sure the relationship is what you were expecting. The sign of the explanatory variable coefficient indicates whether the relationship is positive or negative. Suppose you were modeling crime, for example, and one of your explanatory variables is average neighborhood income. If the coefficient for the income variable is a negative number, it means that crimes tend to decrease as neighborhood incomes increase (a negative relationship). If you were modeling childhood obesity and the accessibility to fast food variable had a positive coefficient, it would indicate that childhood obesity tends to increase as access to fast food increases (a positive relationship).

When you create your list of candidate explanatory variables, note for each variable the relationship (positive or negative) you are expecting. You would have a hard time trusting a model that reports relationships that don't match theory or common sense. Suppose you were building a model to predict forest fire frequency and your regression model returned a positive coefficient for the precipitation variable. You probably wouldn't expect forest fires to increase in locations that get a lot of rain.

Unexpected coefficient signs often indicate other problems with your model that will surface as you continue working through the other diagnostic checks. You can only trust the sign and strength of your explanatory variable coefficients if your model passes all of these checks. If you do find a model that passes every check despite the unexpected coefficient sign, you may have discovered an opportunity to learn something new. Perhaps there is a positive relationship between forest fire frequency and precipitation because the primary source of forest fires in your study area is lightning. It may be worthwhile to obtain lightning data for your study area to see if it improves model performance.

Check 3: Are any of the explanatory variables redundant?

When choosing explanatory variables to include in your analysis, look for variables that explore different aspects of what you are trying to model; avoid variables that provide similar information. For example, if you were trying to model home values, you probably wouldn't include explanatory variables for both home square footage and number of bedrooms. Both of these variables relate to the size of the home, and including both could make your model unstable. Ultimately, you cannot trust a model that includes redundant variables.

How will you know if two or more variables are redundant? Fortunately, whenever you have two or more explanatory variables, the GLR tool computes a Variance Inflation Factor (VIF) for each of them. The VIF value is a measure of variable redundancy and can help you decide which variables can be removed from your model without jeopardizing explanatory power. As a general rule, a VIF value above 7.5 is problematic. If you have two or more variables with VIF values above 7.5, remove them one at a time and rerun GLR until the redundancy is gone. Keep in mind that you do not want to remove all the variables with high VIF values. In the example of modeling home values, square footage and number of bedrooms would likely both have inflated VIF values. As soon as you remove one of those two variables, however, the redundancy is eliminated. Including a variable that reflects home size is important; you just don't want to model that aspect of home values redundantly.
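To compute VIF values yourself outside ArcGIS, statsmodels provides a helper function. The sketch below assumes the design matrix X (with its intercept column) from the earlier example; the 7.5 cutoff mirrors the rule of thumb above.

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each explanatory variable (skip the intercept column).
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    flag = "  <-- possibly redundant" if vif > 7.5 else ""
    print(f"{name}: VIF = {vif:.2f}{flag}")
```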

Check 4: Is my model biased?

This may seem like a difficult question, but the answer is actually very simple. When you have a properly specified GLR model, the model residuals (the over- and underpredictions) are normally distributed with a mean of zero (as in a bell curve). When your model is biased, however, the distribution of the residuals is unbalanced, as shown below. You cannot fully trust predicted results when the model is biased. There are several strategies to help you correct this problem.

Normal and skewed distributions

A statistically significant Jarque-Bera diagnostic (look for the asterisk) indicates that your model is biased. Sometimes your model is doing a good job for low values but is not predicting well for high values (or vice versa). With the childhood obesity example, this would mean that in locations with low childhood obesity, the model is working well, but in areas with high childhood obesity, the predictions are off. Model bias can also be the result of outliers that are influencing model estimation.
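Outside ArcGIS, the same Jarque-Bera test can be run on the residuals of the fitted model from the earlier sketch:

```python
from statsmodels.stats.stattools import jarque_bera

jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(model.resid)
print("Jarque-Bera p-value:", jb_pvalue)  # below 0.05 suggests non-normal residuals (model bias)
print("Residual skewness:", skew)         # far from 0 indicates an unbalanced residual distribution
```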

To help you resolve model bias, create a scatterplot matrix for all your model variables. A nonlinear relationship between your dependent variable and one of your explanatory variables is a common cause of model bias. These relationships might look like a curved line in the scatterplot matrix. Linear relationships look like diagonal lines.

Curvilinear relationship

If you see that your dependent variable has a nonlinear relationship with one of your explanatory variables, you have some work to do. GLR is a linear regression method that assumes the relationships you are modeling are linear. When they aren't, you can try transforming your variables to see if this creates relationships that are more linear. Common transformations include log and exponential. If some of your explanatory variables are strongly skewed, you might be able to remove model bias by transforming them as well.

Transformations
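A scatterplot matrix and a couple of candidate transformations can be sketched with pandas and NumPy; which variables to transform, and whether a log or exponential transform (or another choice) straightens the relationship, will depend on your own data. The column names below are the hypothetical ones used earlier.

```python
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Visual check for curved (nonlinear) relationships among the model variables.
scatter_matrix(df[["OBESITY", "INCOME", "EDUCATION", "FOOD_ACCESS"]], figsize=(8, 8))
plt.show()

# Common transformations; refit the model with the transformed variable and recheck Jarque-Bera.
df["LOG_INCOME"] = np.log(df["INCOME"])       # log transform (values must be positive)
df["EXP_ACCESS"] = np.exp(df["FOOD_ACCESS"])  # exponential transform; use with care on large values
```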

A scatterplot matrix will also reveal data outliers. To see whether an outlier is impacting your model, try running Generalized Linear Regression (GLR) both with and without the outlier, and check how much it changes model performance and whether removing it corrects model bias. In some instances (especially if you think the outliers represent bad data), you might be able to drop them from your analysis.

Data outlier

Check 5: Have I found all the key explanatory variables?

Often you start an analysis with hypotheses about which variables are going to be important predictors. Maybe you believe 5 particular variables will produce a good model, or maybe you have a firm list of 10 variables you think might be related. While it is important to approach regression analysis with a hypothesis, it is also important to allow your creativity and insight to help you investigate further. Resist the inclination to limit yourself to your initial variable list, and try to consider all the possible variables that might impact what you are modeling. Create thematic maps of each of your candidate explanatory variables and compare those to a map of your dependent variable. Review the relevant literature. Use your intuition to look for relationships in your mapped data. Try to come up with as many candidate spatial variables as you can, such as distance from the urban center, proximity to major highways, or access to large bodies of water. These kinds of variables will be especially important for analyses where you believe geographic processes influence relationships in your data. Until you find explanatory variables that effectively capture the spatial structure in your dependent variable, in fact, your model will be missing key explanatory variables and you will not be able to pass all the diagnostic checks outlined here.

Statistically significant spatial autocorrelation of your model residuals is evidence that you are missing one or more key explanatory variables. In regression analysis, spatially autocorrelated residuals usually take the form of clustering: the overpredictions cluster together and the underpredictions cluster together. How will you know if your residuals exhibit statistically significant spatial autocorrelation? Running the Spatial Autocorrelation tool on your regression residuals will tell you. A statistically significant z-score indicates you are missing key explanatory variables from your model.
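In ArcGIS this check is the Spatial Autocorrelation (Global Moran's I) tool run on the regression residuals. A rough equivalent outside ArcGIS uses libpysal and esda; the residual file name, the StdResid field, and the choice of 8-nearest-neighbor weights are all assumptions you should adapt to your own data.

```python
import geopandas as gpd
from libpysal.weights import KNN
from esda.moran import Moran

gdf = gpd.read_file("glr_residuals.shp")   # hypothetical residual output from the regression
w = KNN.from_dataframe(gdf, k=8)           # spatial weights: 8 nearest neighbors (one reasonable choice)
w.transform = "r"                          # row-standardize the weights

mi = Moran(gdf["StdResid"], w)             # field holding the standardized residuals (assumed name)
print("Moran's I:", mi.I)
print("z-score:", mi.z_norm, "p-value:", mi.p_norm)  # significant clustering suggests missing variables
```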

Finding those missing explanatory variables is often as much an art as a science. Try these strategies to see if they are helpful:

Examine the GLR residual map

The standard output from the Generalized Linear Regression (GLR) tool is a map of the model residuals. Dark purple areas indicate that the actual values (your dependent variable) are larger than your model predicted they would be. Darker turquoise areas show where the actual values are lower than predicted. Sometimes just seeing the residual map will give you an idea about what might be missing. If you notice that you are consistently overpredicting in urbanized areas, for example, you might want to consider adding a variable that reflects distance to urban centers. If it looks like overpredictions are associated with mountain peaks or valley bottoms, perhaps you need an elevation variable. Do you see regional clusters, or can you recognize trends in your data? If so, creating a dummy variable to capture these regional differences may be effective. The classic example for a dummy variable is one that distinguishes urban and rural features. By assigning all rural features a value of 1 and all other features a value of 0, you may be able to capture spatial relationships in the landscape that could be important to your model. Sometimes creating a hot spot map of model residuals will help you visualize broad regional patterns.
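Building a regional dummy variable is straightforward once you know which features belong to each regime. The sketch below assumes a hypothetical AREA_TYPE attribute; in practice the regime might come from a spatial selection or a hot spot map of the residuals.

```python
import numpy as np

# Dummy variable: 1 for rural features, 0 for everything else.
df["RURAL"] = np.where(df["AREA_TYPE"] == "rural", 1, 0)

# Add the new dummy to the explanatory variable list and refit the model to see
# whether it captures the regional pattern in the residuals.
```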

Figuring out the missing spatial variables not only has the potential to improve your model, but this process can also help you better understand the phenomenon you are modeling in new and innovative ways.

Note:

While spatial regime dummy variables are helpful to include in your GLR model, you will want to remove them when you run Geographically Weighted Regression (GWR) to avoid problems with local multicollinearity.

Examine nonstationarity

You can also try running Geographically Weighted Regression and creating coefficient surfaces for each of your explanatory variables or maps of the local R2 values. Select the GLR model that is performing well (one with a high adjusted R2 value that is passing all or most of the other diagnostic checks). Because GWR creates a regression equation for each feature in your study area, the coefficient surfaces illustrate how the relationships between the dependent variable and each explanatory variable fluctuate geographically; the map of local R2 values shows variations in model explanatory power. Sometimes seeing these geographic variations will spark ideas about what variables might be missing: a dip in explanatory power near major freeways, a decline with distance from the coast, a change in the sign of the coefficients near an industrial region, or a strong east-to-west trend or boundary—all these would be information about spatial variables that may improve your model.

Coefficient surfaces illustrating nonstationarity

When you examine the coefficient surfaces, look for explanatory variables with coefficients that change sign from positive to negative. This is important because GLR will likely discount the predictive potential of these highly nonstationary variables. Consider, for example, the relationship between childhood obesity and access to healthy food options. It may be that in low-income areas with less access to cars, being far away from a supermarket is a real barrier to making healthy food choices. In high-income areas with more access to vehicles, however, having a supermarket within walking distance might actually be undesirable; the distance to the supermarket might not act as a barrier to buying healthy foods at all. While GWR is capable of modeling these types of complex relationships, GLR is not. GLR is a global model and expects variable relationships to be consistent (stationary) across the study area. When coefficients change sign, they cancel each other out. Think of it as (+1) + (-1) = 0. When you find variables where the coefficients are changing dramatically, especially if they are changing signs, you should keep them in your model even if they are not statistically significant. These types of variables will be effective when you move to GWR.
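If you want to explore coefficient surfaces outside ArcGIS, the mgwr package fits a GWR model and returns one coefficient per feature and per variable, plus local R2 values. This is not the ArcGIS GWR tool and will not reproduce its report exactly; the coordinates, columns, and bandwidth search below are illustrative and reuse the gdf GeoDataFrame from the spatial autocorrelation sketch.

```python
from mgwr.sel_bw import Sel_BW
from mgwr.gwr import GWR

coords = list(zip(gdf.geometry.x, gdf.geometry.y))   # feature coordinates (point data assumed)
y = gdf[["OBESITY"]].values                          # dependent variable, shape (n, 1)
X_local = gdf[["INCOME", "FOOD_ACCESS"]].values      # explanatory variables, shape (n, k)

bw = Sel_BW(coords, y, X_local).search()             # choose a bandwidth
results = GWR(coords, y, X_local, bw).fit()

# Join these back to the features and map each column to see where the
# relationships change sign or strength across the study area.
gdf["coef_income"] = results.params[:, 1]            # local coefficients (column 0 is the intercept)
gdf["local_R2"] = results.localR2.flatten()          # local model fit per feature
```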

Try fitting GLR to smaller subset study areas

GWR is tremendously useful when dealing with nonstationarity, and it can be tempting to move directly to GWR without first finding a properly specified GLR model. Unfortunately, GWR doesn't have all the diagnostics to help you figure out whether your explanatory variables are statistically significant, whether your residuals are normally distributed, or, ultimately, whether you have a good model. GWR will not fix an improperly specified model; you can only move to it with confidence when the sole reason your GLR model fails the six checks is nonstationarity. Evidence of nonstationarity would be finding explanatory variables that have a strong positive relationship in some parts of the study area and a strong negative relationship in others. Sometimes the issue isn't with individual explanatory variables but with the set of explanatory variables used in the model. It may be that one set of variables provides a good model for one part of the study area, but a different set works best everywhere else. To see if this is the case, you can select several smaller subset study areas and try to fit GLR models to each of them (a minimal sketch follows the tip below). Select your subset areas based on the processes you think may be related to your model (high- versus low-income areas, old versus new housing). Alternatively, select areas based on the GWR map of local R2 values; the locations with poor model performance might be modeled better using a different set of explanatory variables.

Tip:

The Spatially Constrained Multivariate Clustering tool can be very helpful for identifying subregions in your broader study area.
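One way to test this idea is simply to loop over the candidate subregions and fit a separate model to each; the REGION field below is a hypothetical label (it could, for example, hold the cluster IDs produced by Spatially Constrained Multivariate Clustering), and the other column names are the illustrative ones used earlier.

```python
import statsmodels.api as sm

for region, subset in df.groupby("REGION"):
    y_sub = subset["OBESITY"]
    X_sub = sm.add_constant(subset[["INCOME", "EDUCATION", "FOOD_ACCESS"]])
    res = sm.OLS(y_sub, X_sub).fit()
    print(f"Region {region}: adjusted R2 = {res.rsquared_adj:.2f}")
    print(res.pvalues.round(3))   # which variables are significant in this subregion?
```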

If you do find properly specified GLR models in several small study areas, you can conclude that nonstationarity is the culprit and move to GWR using the full set of explanatory variables you found from all subset area models. If you don't find properly specified models in the smaller subset areas, it may be that you are trying to model something that is too complex to be reduced to a simple series of numeric measurements and linear relationships. In that case, you probably need to explore alternative analytic methods.

All of this can be a bit of work, but it is also a good exercise in exploratory data analysis and will help you understand your data better and find new variables to use. It may even result in a good model.

Check 6: How well am I explaining my dependent variable?

Now it's time to evaluate model performance. The adjusted R2 value is an important measure of how well your explanatory variables are modeling your dependent variable. The R2 value is also one of the first things most people learn about regression analysis. So why is this important check left until the end? What you might not know is that you cannot trust your R2 value unless you have passed all the other checks listed above. If your model is biased, it may be performing well in some areas or with a particular range of your dependent variable values, but otherwise not performing well at all. The R2 value doesn't reflect that. Likewise, if you have spatial autocorrelation of your residuals, you cannot trust the coefficient relationships from your model. With redundant explanatory variables, you can get extremely high R2 values, but your model will be unstable; it will not reflect the true relationships you are trying to model and might produce completely different results with the addition of even a single observation.

Once you have gone through the other checks and feel confident that you have met all the necessary criteria, however, it is time to figure out how well your model is explaining the values for your dependent variable by assessing the adjusted R2 value. R2 values range between 0 and 1 and represent a percentage. Suppose you are modeling crime rates and find a model that passes all five of the previous checks with an adjusted R2 value of 0.65. This tells you that the explanatory variables in your model are explaining 65 percent of the variation in the crime rate dependent variable. Adjusted R2 values have to be judged somewhat subjectively. In some areas of science, explaining 23 percent of a complex phenomenon will be very exciting. In other fields, an R2 value may need to be closer to 80 or 90 percent before it gets anyone's attention. Either way, the adjusted R2 value will help you judge how well your model is performing.

Another important diagnostic to help you assess model performance is the corrected Akaike's information criterion (AICc). The AICc value is a useful measure for comparing multiple models. For example, you might want to try modeling student test scores using several different sets of explanatory variables. In one model you might use only demographic variables, while in another model you might select variables relating to the school and classroom, such as per-student spending and teacher-to-student ratios. As long as the dependent variable for all the models being compared is the same (in this case, student test scores), you can use the AICc values from each model to determine which performs better. The model with the smaller AICc value provides a better fit to the observed data.
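statsmodels reports AIC directly; one common form of the small-sample correction (AICc) can be applied on top of it when comparing models of the same dependent variable. The table and column names below (a hypothetical df of school data with TEST_SCORE, INCOME, EDUCATION, SPEND_PER_STUDENT, and TEACHER_RATIO fields) are illustrative, and the values will not match the GLR report exactly because parameter-counting conventions differ slightly.

```python
import statsmodels.api as sm

def aicc(results):
    """One common form of the corrected AIC: AIC + 2k(k+1) / (n - k - 1)."""
    n = results.nobs
    k = results.df_model + 1   # explanatory variables plus the intercept
    return results.aic + (2 * k * (k + 1)) / (n - k - 1)

demographic = sm.OLS(df["TEST_SCORE"],
                     sm.add_constant(df[["INCOME", "EDUCATION"]])).fit()
school = sm.OLS(df["TEST_SCORE"],
                sm.add_constant(df[["SPEND_PER_STUDENT", "TEACHER_RATIO"]])).fit()

print("Demographic model AICc:", aicc(demographic))
print("School model AICc:    ", aicc(school))   # the smaller AICc indicates the better fit
```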

And don't forget ...

Keep in mind as you go through these steps of building a properly specified regression model that the goal of your analysis is ultimately to understand your data and use that understanding to solve problems and answer questions. The truth is that you could try a number of models (with and without transformed variables), explore several small study areas, and analyze your coefficient surfaces—and still not find a properly specified GLR model. But—and this is important—you will still be contributing to the body of knowledge on the phenomenon you are modeling. If the model that you hypothesized would be an accurate predictor turns out not to be significant at all, discovering this is very helpful information. If one of the variables you thought would be strong has a positive relationship in some areas and a negative relationship in others, knowing about this certainly increases your understanding of the issue. The work that you do here, trying to find a good model using GLR and then applying GWR to explore regional variation among the variables in your model, is always going to be valuable.

For more information about regression analysis as well as hands-on tutorials, see https://www.esriurl.com/spatialstats.