Finding a properly specified ordinary least squares (OLS) model can be difficult, especially when there are many candidate explanatory variables that might contribute to the variable you are trying to model (your dependent variable). The Exploratory Regression tool can help. It is a data mining tool that tries all possible combinations of explanatory variables to see which models pass all of the necessary OLS diagnostics. By evaluating every combination of the candidate explanatory variables, you greatly increase your chances of finding the best model to solve your problem or answer your question. Exploratory Regression is similar to Stepwise Regression (found in many statistical software packages), but rather than looking only for models with high Adjusted R2 values, it looks for models that meet all of the requirements and assumptions of the OLS method.
Using the Exploratory Regression tool
When you run the Exploratory Regression tool, you specify a minimum and maximum number of explanatory variables each model should contain, along with threshold criteria for Adjusted R2, coefficient p-values, Variance Inflation Factor (VIF) values, Jarque-Bera p-values, and spatial autocorrelation p-values. Exploratory Regression runs OLS on every possible combination of the Candidate Explanatory Variables for models with at least the Minimum Number of Explanatory Variables and not more than the Maximum Number of Explanatory Variables. Each model it tries is assessed against your Search Criteria. When it finds a model:
- That exceeds your specified Adjusted R2 threshold
- With coefficient p-values, for all explanatory variables, smaller than the cutoff you specified
- With coefficient VIF values, for all explanatory variables, less than your specified threshold
- Returning a Jarque-Bera p-value larger than the minimum you specified
It then runs the Spatial Autocorrelation (Global Moran’s I) tool on that model’s residuals. If the spatial autocorrelation p-value is also larger than you specified in the tool’s search criteria (Minimum Acceptable Spatial Autocorrelation p-value), the model is listed as a passing model. The Exploratory Regression tool will also test regression residuals using the Spatial Autocorrelation tool for models with the three highest Adjusted R2 results.
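The staged screening described above can be sketched in Python as follows. This is only an illustration of the logic, not the tool's implementation. The `Criteria` and `Diagnostics` records and all function names are hypothetical, the threshold values are illustrative assumptions (only the 7.5 VIF cutoff appears elsewhere in this topic), and the diagnostics are assumed to be supplied by whatever regression software you use rather than computed here.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Iterator, Sequence, Tuple


@dataclass
class Criteria:
    # Illustrative thresholds (assumptions, not confirmed tool defaults).
    min_adj_r2: float = 0.5
    max_coef_p: float = 0.05
    max_vif: float = 7.5
    min_jb_p: float = 0.1
    min_sa_p: float = 0.1


@dataclass
class Diagnostics:
    # Hypothetical container for one model's OLS diagnostics.
    adj_r2: float
    coef_p_values: Sequence[float]  # one p-value per explanatory variable
    vifs: Sequence[float]           # one VIF per explanatory variable
    jb_p: float                     # Jarque-Bera p-value


def candidate_models(
    variables: Sequence[str], min_vars: int, max_vars: int
) -> Iterator[Tuple[str, ...]]:
    """Yield every combination of candidate variables within the size range."""
    for k in range(min_vars, max_vars + 1):
        yield from combinations(variables, k)


def passes_first_stage(d: Diagnostics, c: Criteria) -> bool:
    """Check the four criteria that are evaluated before spatial autocorrelation."""
    return (
        d.adj_r2 > c.min_adj_r2
        and all(p < c.max_coef_p for p in d.coef_p_values)
        and all(v < c.max_vif for v in d.vifs)
        and d.jb_p > c.min_jb_p
    )


def is_passing_model(d: Diagnostics, sa_p: float, c: Criteria) -> bool:
    """A model passes when it clears the first stage and its residuals'
    spatial autocorrelation p-value exceeds the minimum."""
    return passes_first_stage(d, c) and sa_p > c.min_sa_p
```

In this sketch, the spatial autocorrelation p-value is a separate argument to reflect that it is only needed for models that have already cleared the other four checks.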
Models listed under Passing Models meet your specified search criteria. If you accept the default values for the Maximum Coefficient p value Cutoff, Maximum VIF Value Cutoff, Minimum Acceptable Jarque Bera p value, and Minimum Acceptable Spatial Autocorrelation p value, your passing models will also be properly specified OLS models. A properly specified OLS model has:
- Explanatory variables where all of the coefficients are statistically significant
- Coefficients reflecting the expected, or at least a justifiable, relationship between each explanatory variable and the dependent variable
- Explanatory variables that get at different aspects of what you are trying to model (none is redundant; all VIF values are smaller than 7.5)
- Normally distributed residuals indicating your model is free from bias (the Jarque-Bera p-value is not statistically significant)
- Randomly distributed over- and underpredictions, indicating model residuals are free from spatial clustering (the spatial autocorrelation p-value is not statistically significant)
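To give a feel for what a VIF cutoff of 7.5 means, the following minimal Python sketch uses the standard definition VIF = 1 / (1 - R2), where R2 comes from regressing one explanatory variable on the others. It covers only the simplest case of two explanatory variables, where that R2 equals the squared Pearson correlation between them. The function names are hypothetical and the code is not part of the tool.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def vif_two_variables(x1: Sequence[float], x2: Sequence[float]) -> float:
    """VIF for either variable in a model with exactly two explanatory variables."""
    r_squared = pearson_r(x1, x2) ** 2
    return 1.0 / (1.0 - r_squared)


def max_r2_for_vif(cutoff: float) -> float:
    """Largest R2 between a variable and the others that keeps VIF below the cutoff."""
    return 1.0 - 1.0 / cutoff
```

Under this definition, `max_r2_for_vif(7.5)` is roughly 0.87, so a variable exceeds the cutoff once the other explanatory variables account for about 87 percent of its variation.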
When you specify an Output Results Table, models that meet your Maximum VIF Value Cutoff and for which all explanatory variables meet the Maximum Coefficient p value Cutoff will be written to a table. This table is helpful when you want to examine more than just those models included in the text report file.
Some cautions
Be aware that the Exploratory Regression tool, like similar methods such as Stepwise Regression, is controversial. At the risk of oversimplifying, there are two schools of thought on this: the scientific method viewpoint and the data miner’s viewpoint.
Scientific method viewpoint
A strong proponent of the scientific method might object to exploratory regression methods. From this perspective, you should formalize your hypotheses before exploring your data to avoid creating models that fit only your data but don’t reflect broader processes. Models that overfit one particular dataset may not be relevant to other datasets; in fact, sometimes even adding new observations will cause an overfit model to become unstable (performance might decrease, or explanatory variable coefficient significance may wane). When your model isn’t robust, even to new observations, it is not capturing the key processes behind what you are trying to model.
In addition, regression statistics are based on probability theory, and when you run thousands of models, you greatly increase your chances of inappropriately rejecting the null hypothesis (a Type I error). When you select a 95 percent confidence level, for example, you are accepting a particular risk: if you could resample your data 100 times, you would expect about 5 of those 100 samples to produce false positives. P-values are computed for each coefficient; the null hypothesis is that the coefficient is actually zero and, consequently, that the explanatory variable associated with it is not helping your model. Probability theory indicates that in about 5 out of 100 samples, a p-value might be statistically significant only because you happened to select observations that falsely support that conclusion. When you run only one model, a 95 percent confidence level seems conservative. As you increase the number of models you try, you diminish your ability to draw conclusions from your results. The Exploratory Regression tool can try thousands of models in just a few minutes. The number of models tried is reported in the Global Summary section of the Output Report File.
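A rough sense of the scale involved can be had from the short Python sketch below. It counts the combinations an exhaustive search would try and computes the chance of at least one false positive across many tests. The function names are hypothetical, and the false-positive formula assumes the tests are independent, which is not true of models fitted to the same data, so treat the numbers as an illustration of the trend rather than an exact error rate.

```python
from math import comb


def count_models(n_candidates: int, min_vars: int, max_vars: int) -> int:
    """Number of variable combinations with between min_vars and max_vars members."""
    return sum(comb(n_candidates, k) for k in range(min_vars, max_vars + 1))


def chance_of_any_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among n_tests independent
    tests when every null hypothesis is true (simplifying assumption)."""
    return 1.0 - (1.0 - alpha) ** n_tests
```

For example, `count_models(20, 1, 5)` gives 21699 models for 20 candidate variables, and under the independence assumption the chance of at least one false positive already exceeds one half after 14 tests at the 0.05 level.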
Data miner's viewpoint
Researchers from the data mining school of thought, on the other hand, would likely feel it is impossible to know a priori all of the factors that contribute to any given real-world outcome. Often the questions we are trying to answer are complex, and theory on our particular topic may not exist, or might be out of date. Data miners are big proponents of inductive analyses such as those provided by exploratory regression. They encourage thinking outside of the box and using exploratory regression methods for hypothesis development.
Recommendations
We feel that Exploratory Regression, when used with discretion, is a valuable data mining tool that can help you find a properly specified OLS model. Our recommendation is that you always select candidate explanatory variables that are supported by theory, guidance from experts, and common sense. Calibrate your regression model using a portion of your data and validate it with the remainder, or validate the model on additional datasets. If you do plan to draw inferences from your results, at minimum you will want to perform a sensitivity analysis such as bootstrapping.
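The calibrate-and-validate and bootstrapping suggestions above can be illustrated with a minimal Python sketch. It is deliberately simplified: it fits a model with a single explanatory variable using the closed-form OLS solution, all function names and default values (70 percent calibration share, 1000 resamples) are assumptions made for illustration, and the random splitting and resampling ignore any spatial structure in the data.

```python
import random
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Closed-form OLS for one explanatory variable; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("explanatory variable has no variation")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def holdout_rmse(xs, ys, calibrate_fraction: float = 0.7, seed: int = 0) -> float:
    """Calibrate on a random portion of the data and report RMSE on the remainder."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    cut = int(len(idx) * calibrate_fraction)
    calib, valid = idx[:cut], idx[cut:]
    b0, b1 = fit_line([xs[i] for i in calib], [ys[i] for i in calib])
    squared = [(ys[i] - (b0 + b1 * xs[i])) ** 2 for i in valid]
    return (sum(squared) / len(squared)) ** 0.5


def bootstrap_slope_interval(
    xs, ys, n_resamples: int = 1000, level: float = 0.95, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the slope, resampling observations
    with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    slopes: List[float] = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            _, slope = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        except ValueError:
            continue  # skip a degenerate resample with no variation in x
        slopes.append(slope)
    slopes.sort()
    tail = (1.0 - level) / 2.0
    lower = slopes[int(tail * len(slopes))]
    upper = slopes[int((1.0 - tail) * len(slopes)) - 1]
    return lower, upper
```

If the bootstrap interval for a coefficient is wide or includes zero, or if the holdout error is much larger than the calibration error, that is a hint the model may not be robust to new observations.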
Using the Exploratory Regression tool does have advantages over using other exploratory methods that only assess model performance in terms of Adjusted R2 values. The Exploratory Regression tool is looking for models that pass all of the OLS diagnostics described above.