What is EBK Regression Prediction?—ArcGIS Pro

Available with Geostatistical Analyst license.

Introduction

EBK Regression Prediction is a geostatistical interpolation method that uses Empirical Bayesian Kriging (EBK) with explanatory variable rasters that are known to affect the value of the data you are interpolating. This approach combines kriging with regression analysis to make predictions that can be more accurate than either regression or kriging achieves on its own.

Learn more about empirical Bayesian kriging

Learn more about the basics of regression analysis

Basics of regression kriging models

As their name implies, regression kriging models are a hybrid of ordinary least-squares regression and simple kriging. Both regression and kriging models predict the dependent variable by splitting it into a mean (average) value and an error term:

Dependent variable = (mean) + (error)

Ordinary least squares (OLS) models the mean value as a weighted sum of the explanatory variables (the regression equation) and assumes that the error term is random, uncorrelated noise. Simple kriging models the error term using a semivariogram/covariance model and assumes that the mean value is constant. In this sense, OLS puts all of its modeling effort into the mean value, and simple kriging puts all of its modeling effort into the error term. Regression kriging models, however, simultaneously estimate both a regression model for the mean value and a semivariogram/covariance model for the error term. By operating on both components at the same time, regression kriging models can make more accurate predictions than either regression or kriging can achieve alone. In fact, both OLS regression and simple kriging are special cases of regression kriging: OLS is the case in which the error term has no spatial autocorrelation, and simple kriging is the case in which the mean is a constant.
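The decomposition into a mean and an error term can be illustrated with a minimal NumPy sketch. It is a simplified two-step approximation (OLS for the mean, then simple kriging of the residuals), not the simultaneous estimation that regression kriging models perform and not the tool's implementation. The function names, the exponential covariance, and its fixed sill and range values are assumptions made for illustration only.

```python
import numpy as np

def exponential_covariance(h, sill=1.0, range_param=2.0):
    """Covariance that decays with distance h (parameters are assumed, not fitted)."""
    return sill * np.exp(-h / range_param)

def pairwise_distances(a, b):
    """Euclidean distances between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def regression_kriging_predict(coords, X, y, new_coords, new_X):
    # Mean term: weighted sum of the explanatory variables, fitted by OLS.
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta

    # Error term: simple kriging of the residuals (their mean is taken as zero).
    C = exponential_covariance(pairwise_distances(coords, coords))
    C += 1e-10 * np.eye(len(coords))  # tiny jitter for numerical stability
    c0 = exponential_covariance(pairwise_distances(new_coords, coords))
    weights = np.linalg.solve(C, c0.T)
    kriged_error = weights.T @ residuals

    # Dependent variable = (mean) + (error)
    new_design = np.column_stack([np.ones(len(new_X)), new_X])
    return new_design @ beta + kriged_error

# Hypothetical data: 30 points with one explanatory variable.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(30, 2))
X = rng.normal(size=(30, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(scale=0.5, size=30)

# Predicting back at the measured locations reproduces the measurements,
# because this sketch uses no nugget effect.
fitted = regression_kriging_predict(coords, X, y, coords, X)
```

Setting the covariance to zero for all nonzero distances would leave only the OLS prediction, and dropping the explanatory variables from the design matrix would leave kriging around a constant mean, which mirrors the two special cases described above.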

Take care in choosing the explanatory variable rasters. Each explanatory variable should be known to affect the value of the dependent variable. It is recommended that you choose explanatory variables the same way you would choose them for ordinary least-squares regression. However, you do not need to check whether the explanatory variables are correlated with each other. This is explained further in the next section.

Principal component analysis

Before building the regression kriging model, the explanatory variable rasters are transformed into their principal components, and these principal components are used as the explanatory variables in the regression model. The principal components are linear combinations (weighted sums) of the explanatory variables and are calculated such that each principal component is uncorrelated with every other principal component. Because they are mutually uncorrelated, using principal components solves the problem of multicollinearity (explanatory variables that are correlated with each other) in the regression model.

Each principal component captures a certain proportion of the total variability of the explanatory variables. In many cases, most of the information contained in all explanatory variables can be captured in only a few principal components. By discarding the least useful principal components, the model calculation becomes more stable and efficient without significant loss of accuracy. You can control how much variation the principal components must account for by using the Minimum cumulative percent of variance parameter.
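As a rough illustration of how a cumulative variance threshold selects principal components, the following NumPy sketch keeps the smallest number of components whose cumulative share of variance reaches a given percentage. The function name, the threshold value of 95, the choice to center but not rescale the variables, and the sample data are assumptions for illustration; the tool's internal calculation may differ.

```python
import numpy as np

def principal_components(values, min_cumulative_percent=95.0):
    """Return component scores and the cumulative percent of variance kept.

    values: array of shape (n_points, n_variables), one column per
    explanatory variable sampled at the input points.
    """
    centered = values - values.mean(axis=0)
    covariance = np.cov(centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)

    # eigh returns ascending eigenvalues; sort so the largest variance comes first.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    cumulative = 100.0 * np.cumsum(eigenvalues) / eigenvalues.sum()
    # Smallest number of components whose cumulative percent meets the threshold.
    keep = int(np.searchsorted(cumulative, min_cumulative_percent)) + 1
    keep = min(keep, len(eigenvalues))
    return centered @ eigenvectors[:, :keep], cumulative[:keep]

# Hypothetical explanatory variables: x2 is almost a copy of x1 (multicollinear).
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)
x3 = rng.normal(size=200)
scores, cumulative = principal_components(np.column_stack([x1, x2, x3]))
```

In this example the two nearly identical variables collapse into one component, so two components are enough to pass the threshold, and the resulting score columns are uncorrelated with each other.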

Why the explanatory variables must be rasters

In this tool, all explanatory variables must be supplied as rasters, and the regression kriging model is constructed by extracting the values of the explanatory variable rasters that fall under each input point. You may be wondering why the explanatory variables cannot be fields on the same point feature class that stores the dependent variable. To make a prediction at a new location, the explanatory variables must be measured at the new location to calculate the prediction from the regression kriging model. If the explanatory variables were fields on the input dependent variable features, you would only be able to make predictions at the input point locations. To actually interpolate (predict values at new locations), the explanatory variables must be measured at the locations where you want to interpolate. The most natural way to specify the explanatory variables at every prediction location is to store the explanatory variables as rasters.
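The idea of reading an explanatory variable from the raster cell under a point can be sketched in plain Python. The grid layout (row 0 at the top), the origin and cell size arguments, and the function name are hypothetical and chosen only for illustration; they do not describe how the tool reads rasters.

```python
def sample_raster(grid, x_min, y_max, cell_size, x, y):
    """Return the value of the cell that contains (x, y), or None if outside.

    grid is a list of rows, with row 0 at the top (y_max) of the raster.
    """
    col = int((x - x_min) // cell_size)
    row = int((y_max - y) // cell_size)
    if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
        return grid[row][col]
    return None

# Hypothetical 3 x 3 elevation raster covering x 0-30 and y 0-30, 10-unit cells.
elevation = [
    [120, 125, 130],
    [110, 115, 118],
    [100, 104, 109],
]

at_input_point = sample_raster(elevation, 0, 30, 10, x=5, y=25)    # top-left cell
at_new_location = sample_raster(elevation, 0, 30, 10, x=25, y=5)   # bottom-right cell
outside_raster = sample_raster(elevation, 0, 30, 10, x=35, y=5)    # no cell here
```

Because the raster has a value for every cell, the same lookup works at the input points and at any new location inside the raster extent, which is what a field on the input points cannot provide.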

If your explanatory variables are not in raster format, but instead are stored as fields in the input dependent variable features, you should convert every explanatory variable to a raster using one of the available interpolation methods. However, it should be noted that EBK Regression Prediction assumes that the explanatory variables are measured values (rather than interpolated predictions), so any error introduced while interpolating the explanatory variables will not be properly accounted for in the subsequent calculations. In practice, this means that predictions could be biased and standard errors could be underestimated.

Creating and evaluating local models

One of the biggest advantages of EBK Regression Prediction compared to other regression kriging models is that the models are calculated locally. This allows the model to vary from one area to another and account for local effects. For example, the relationships between the explanatory variables and the dependent variable may differ between regions, and EBK Regression Prediction can model these regional changes.

EBK Regression Prediction accounts for these local effects by dividing the input data into subsets of a given size before doing any modeling. The number of points in each local subset is controlled by the Maximum number of points in each local model parameter. The regression kriging model is calculated for each of these local subsets independently, and these local models are mixed together to produce the final prediction map. Alternatively, the local subsets can be defined by using the Subset polygon features parameter. If polygon features are provided for this parameter, each polygon feature will define a single subset, and all points contained within a single polygon feature will be processed as a subset. In this case, each polygon must contain at least 20 points and no more than 1,000 points.
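Before supplying subset polygons, it can be useful to confirm that each polygon holds between 20 and 1,000 points. The sketch below assumes you have already determined which polygon each point falls in (for example, with a spatial join) and have the result as a simple list of polygon identifiers; the function name and sample identifiers are hypothetical.

```python
from collections import Counter

MIN_POINTS = 20     # lower limit per subset polygon, as stated above
MAX_POINTS = 1000   # upper limit per subset polygon, as stated above

def find_invalid_subsets(polygon_ids):
    """Return {polygon_id: point_count} for polygons outside the allowed range.

    polygon_ids holds one entry per input point: the identifier of the
    polygon that contains the point.
    """
    counts = Counter(polygon_ids)
    return {
        polygon_id: count
        for polygon_id, count in counts.items()
        if not MIN_POINTS <= count <= MAX_POINTS
    }

# Hypothetical assignment: polygon A has 25 points, B has 12, C has 1,001.
point_polygons = ["A"] * 25 + ["B"] * 12 + ["C"] * 1001
invalid = find_invalid_subsets(point_polygons)
```

Here polygons B and C would need to be merged with a neighbor or split before being used as subset polygon features.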

The Output diagnostic feature class parameter can be used to produce model diagnostics for each of these local models. Using this parameter will create a polygon feature class in which each polygon contains all the points that contribute to that local model. For example, if there are five subsets, five polygons will be created, and each polygon will show the region of each subset. The polygon feature class will also contain various fields showing diagnostic information about how well the local model fits the subset. If subset polygon features are provided, the output diagnostic feature class will have the same geometry as the subset polygons.

Transformations and semivariogram models

A variety of transformations and semivariogram models are available for EBK Regression Prediction.

The following transformation options are available:

None—No transformation is applied to the dependent variable.
Empirical—A nonparametric kernel mixture is applied to the dependent variable. This option is recommended when the dependent variable is not normally distributed.
Log empirical—A logarithmic transformation is applied to the dependent variable before the Empirical transformation is applied. This option will ensure that every prediction is greater than zero, so this option is recommended when the dependent variable cannot be negative, such as rainfall measurements.
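The reason a logarithmic transformation keeps predictions above zero can be shown with a small sketch: modeling happens on the log scale, and any value on that scale maps back to a positive number. This sketch covers only the logarithmic step with a naive back-transformation; it does not implement the Empirical (kernel mixture) transformation or any bias correction, and the function names and sample values are illustrative assumptions.

```python
import math

def to_log_scale(values):
    """Transform strictly positive measurements to the log scale."""
    return [math.log(v) for v in values]

def from_log_scale(values):
    """Back-transform log-scale values; the result is always greater than zero."""
    return [math.exp(v) for v in values]

# Hypothetical rainfall measurements (all positive).
rainfall = [0.2, 1.5, 3.0, 12.4]
log_rainfall = to_log_scale(rainfall)

# Even a strongly negative prediction on the log scale back-transforms
# to a small positive value rather than a negative one.
log_scale_predictions = [-3.2, 0.0, 1.7]
predictions = from_log_scale(log_scale_predictions)
```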

The following semivariogram models are available:

Exponential—This semivariogram model assumes that the spatial autocorrelation of the error term diminishes relatively quickly compared to the other options. This is the default.
Nugget—This semivariogram model assumes that the error term is spatially independent. Using this option is equivalent to using ordinary least-squares regression, so this option is rarely useful for the actual interpolation. Instead, it can serve as a baseline to see how much improvement you get by using regression kriging compared to ordinary least-squares regression.
Whittle—This semivariogram model assumes that the spatial autocorrelation of the error term diminishes relatively slowly compared to other options.
K-Bessel—This semivariogram model allows the spatial autocorrelation of the error term to diminish slowly, quickly, or anywhere between. Because it is flexible, it will almost always give the most accurate predictions, but it requires the estimation of an additional parameter, so it takes longer to calculate. If you are unsure which semivariogram to use, and you are willing to wait longer to get the most accurate results, this is the recommended option.
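To make the difference between a spatially independent error term and a spatially autocorrelated one concrete, the sketch below writes the Nugget and Exponential models as functions of distance. The parameter names and the factor of 3 in the exponent (a common textbook convention in which the semivariogram reaches about 95 percent of the sill at the stated range) are assumptions, not a statement of the tool's exact formulas. Whittle and K-Bessel are omitted because they require modified Bessel functions.

```python
import math

def nugget_semivariogram(h, sill=1.0):
    """Spatially independent error: full variance at any nonzero distance."""
    return 0.0 if h == 0 else sill

def exponential_semivariogram(h, partial_sill=1.0, practical_range=10.0, nugget=0.0):
    """Semivariance rises quickly at short distances, then levels off."""
    if h == 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-3.0 * h / practical_range))

# Semivariance at increasing distances for both models.
distances = [0, 1, 5, 10, 20]
nugget_values = [nugget_semivariogram(h) for h in distances]
exponential_values = [exponential_semivariogram(h) for h in distances]
```

With the Nugget model, nearby points are no more alike than distant ones, so the error term carries no information for interpolation and only the regression part of the model contributes, which is why that option behaves like ordinary least-squares regression.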

References

Chilès, J-P., and P. Delfiner (1999). Chapter 4 of Geostatistics: Modeling Spatial Uncertainty. New York: John Wiley & Sons, Inc.
Krivoruchko K. (2012). "Empirical Bayesian Kriging," ArcUser Fall 2012.
Krivoruchko K. (2012). "Modeling Contamination Using Empirical Bayesian Kriging," ArcUser Fall 2012.
Krivoruchko K. and Gribov A. (2014). "Pragmatic Bayesian kriging for non-stationary and moderately non-Gaussian data," Mathematics of Planet Earth. Proceedings of the 15th Annual Conference of the International Association for Mathematical Geosciences, Springer 2014, pp. 61-64.
Krivoruchko K. and Gribov A. (2019). "Evaluation of empirical Bayesian kriging," Spatial Statistics Volume 32. https://doi.org/10.1016/j.spasta.2019.100368.
Pilz, J., and G. Spöck (2007). "Why Do We Need and How Should We Implement Bayesian Kriging Methods," Stochastic Environmental Research and Risk Assessment 22 (5):621–632.