EBK Regression Prediction (Geostatistical Analyst)

Available with Geostatistical Analyst license.

Summary

EBK Regression Prediction is a geostatistical interpolation method that uses Empirical Bayesian Kriging with explanatory variable rasters that are known to affect the values you are interpolating. By combining kriging with regression analysis, it can make predictions that are more accurate than either regression or kriging achieves on its own.

Learn more about EBK Regression Prediction

Usage

  • This tool only supports prediction map outputs. To create standard error, quantile, or probability maps, output a geostatistical layer and convert it to a raster (or multiple rasters) using GA Layer To Rasters.

  • This kriging method can handle moderately nonstationary input data.

  • Only Standard Circular and Smooth Circular Search neighborhoods are allowed for this interpolation method.

  • If any of your Input explanatory variable rasters have many NoData cells, the Output geostatistical layer may fail to visualize in the map. This is not a problem, and the calculations have been performed correctly. To visualize the output, convert your geostatistical layer to a raster using GA Layer To Rasters or GA Layer To Grid. You can also choose to output a raster directly from this tool using the Output prediction raster parameter.

  • If the Input dependent variable features are in a geographic coordinate system, all distances will be calculated using chordal distances. For more information on chordal distances, see the Distance calculations for data in geographic coordinates section of the What is Empirical Bayesian Kriging help topic.
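
As a rough illustration of what a chordal distance is, the sketch below computes the straight-line distance through the Earth between two longitude/latitude points and compares it with the great-circle distance along the surface. It assumes a spherical Earth with a mean radius of 6,371 km, and the function names are hypothetical; it is not the tool's internal implementation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth, mean radius


def to_cartesian(lon_deg, lat_deg, radius=EARTH_RADIUS_KM):
    """Convert longitude/latitude in degrees to 3D Cartesian coordinates."""
    lon = math.radians(lon_deg)
    lat = math.radians(lat_deg)
    return (
        radius * math.cos(lat) * math.cos(lon),
        radius * math.cos(lat) * math.sin(lon),
        radius * math.sin(lat),
    )


def chordal_distance(p1, p2, radius=EARTH_RADIUS_KM):
    """Straight-line (through-the-sphere) distance between two (lon, lat) points."""
    a = to_cartesian(p1[0], p1[1], radius)
    b = to_cartesian(p2[0], p2[1], radius)
    return math.dist(a, b)


def great_circle_distance(p1, p2, radius=EARTH_RADIUS_KM):
    """Distance along the surface, derived from the chord length."""
    chord = chordal_distance(p1, p2, radius)
    # chord = 2 R sin(theta / 2), so theta = 2 asin(chord / (2 R))
    theta = 2.0 * math.asin(min(1.0, chord / (2.0 * radius)))
    return radius * theta


if __name__ == "__main__":
    a = (-117.19, 34.06)  # hypothetical (lon, lat) points
    b = (-116.19, 34.06)
    print(chordal_distance(a, b), great_circle_distance(a, b))
```

For nearby points the two distances are almost identical; the chord becomes noticeably shorter than the surface distance only as the points move far apart.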

Parameters

Label | Explanation | Data Type
Input dependent variable features

The input point features containing the field that will be interpolated.

Feature Layer
Dependent variable field

The field of the Input dependent variable features containing the values of the dependent variable. This is the field that will be interpolated.

Field
Input explanatory variable rasters

Input rasters representing the explanatory variables that will be used to build the regression model. These rasters should represent variables that are known to influence the values of the dependent variable. For example, when interpolating temperature data, an elevation raster should be used as an explanatory variable because temperature is influenced by elevation. You can use up to 62 explanatory rasters.

Raster Layer; Mosaic Layer
Output geostatistical layer

The output geostatistical layer displaying the result of the interpolation.

Geostatistical Layer
Output prediction raster
(Optional)

The output raster displaying the result of the interpolation. The default cell size will be the maximum of the cell sizes of the Input explanatory variable rasters. To use a different cell size, use the cell size environmental setting.

Raster Dataset
Output diagnostic feature class
(Optional)

Output polygon feature class that shows the regions of each local model and contains fields with diagnostic information for the local models. For each subset, a polygon will be created that surrounds the points in the subset so you can easily identify which points were used in each subset. For example, if there are 10 local models, there will be 10 polygons in this output. The feature class will contain the following fields:

  • Number of Principal Components (PrincComps)—The number of principal components that were used as explanatory variables. The value will always be less than or equal to the number of explanatory variable rasters.
  • Percent of Variance (PercVar)—The percent of variance captured by the principal components. This value will be greater than or equal to the value specified in the Minimum cumulative percent of variance parameter below.
  • Root Mean Square Error (RMSE)—The square root of the average squared cross-validation errors. The smaller this value, the better the model fits.
  • Percent 90 Interval (Perc90)—The percent of data points that fall within a 90 percent cross-validation confidence interval. Ideally, this number should be close to 90. A value significantly smaller than 90 indicates that standard errors are being underestimated. A value significantly larger than 90 indicates that standard errors are being overestimated.
  • Percent 95 Interval (Perc95)—The percent of data points that fall within a 95 percent cross-validation confidence interval. Ideally, this number should be close to 95. A value significantly smaller than 95 indicates that standard errors are being underestimated. A value significantly larger than 95 indicates that standard errors are being overestimated.
  • Mean Absolute Error (MeanAbsErr)—The average of the absolute values of the cross-validation errors. This value should be as small as possible. It is similar to Root Mean Square Error, but it is less influenced by extreme values.
  • Mean Error (MeanError)—The average of the cross-validation errors. This value should be close to zero. A value significantly different than zero indicates that the predictions are biased.
  • Continuous Ranked Probability Score (CRPS)—The continuous ranked probability score is a diagnostic that measures the deviation from the predictive cumulative distribution function to each observed data value. This value should be as small as possible. This diagnostic has advantages over cross-validation diagnostics because it compares the data to a full distribution rather than to single-point predictions.
Feature Class
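
To make the cross-validation fields above more concrete, here is a minimal sketch of how such summaries could be computed from lists of measured values, cross-validated predictions, and prediction standard errors. The function name is hypothetical. The sketch assumes errors are defined as predicted minus measured and that the 90 and 95 percent intervals are symmetric normal intervals (prediction plus or minus 1.645 or 1.960 standard errors); the tool itself may derive its intervals differently.

```python
import math


def cross_validation_diagnostics(observed, predicted, std_errors):
    """Summarize cross-validation errors for one local model (illustrative only)."""
    n = len(observed)
    # Assumption: error = predicted - measured.
    errors = [p - o for p, o in zip(predicted, observed)]

    mean_error = sum(errors) / n
    mean_abs_error = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    def percent_inside(z):
        # Assumption: symmetric normal interval, prediction +/- z * standard error.
        inside = sum(1 for e, s in zip(errors, std_errors) if abs(e) <= z * s)
        return 100.0 * inside / n

    return {
        "MeanError": mean_error,
        "MeanAbsErr": mean_abs_error,
        "RMSE": rmse,
        "Perc90": percent_inside(1.645),
        "Perc95": percent_inside(1.960),
    }


if __name__ == "__main__":
    # Hypothetical values for four points.
    print(cross_validation_diagnostics([10, 12, 14, 16], [11, 11, 14, 19], [1, 1, 1, 1]))
```

In this toy input a single large error (3 units) pulls RMSE well above the mean absolute error, which is the sensitivity to extreme values described in the field list.
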
Dependent variable measurement error field
(Optional)

A field that specifies the measurement error for each point in the dependent variable features. For each point, the value of this field should correspond to one standard deviation of the measured value of the point. Use this field if the measurement error values are not the same at each point.

A common source of nonconstant measurement error is when the data is measured with different devices. One device might be more precise than another, which means that it will have a smaller measurement error. For example, one thermometer rounds to the nearest degree and another thermometer rounds to the nearest tenth of a degree. The variability of measurements is often provided by the manufacturer of the measuring device, or it may be known from empirical practice.

Leave this parameter empty if there are no measurement error values or the measurement error values are unknown.

Field
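
As a small worked example of the thermometer case, the sketch below estimates a one-standard-deviation measurement error from a device's rounding resolution. It assumes rounding is the only source of error and that the rounding error is uniformly distributed over one resolution step, which gives a standard deviation of the step width divided by the square root of 12. The function name is hypothetical, and a manufacturer's stated precision should take priority when it is available.

```python
import math


def rounding_standard_deviation(resolution):
    """Standard deviation of a rounding error that is uniform over one resolution step."""
    return resolution / math.sqrt(12.0)


# Thermometer that rounds to the nearest degree: about 0.289 degrees.
coarse = rounding_standard_deviation(1.0)

# Thermometer that rounds to the nearest tenth of a degree: about 0.0289 degrees.
fine = rounding_standard_deviation(0.1)
```

Values like these could populate the measurement error field, with each point receiving the value for the device that measured it.
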
Minimum cumulative percent of variance
(Optional)

Defines the minimum cumulative percent of variance from the principal components of the explanatory variable rasters. Before building the regression model, the principal components of the explanatory variables are calculated, and these principal components are used as explanatory variables in the regression. Each principal component captures a certain percent of the variance of the explanatory variables, and this parameter controls the minimum percent of variance that must be captured by the principal components of each local model. For example, if a value of 75 is provided, the software will use the minimum number of principal components that are necessary to capture at least 75 percent of the variance of the explanatory variables.

Principal components are mutually uncorrelated, so using them as explanatory variables solves the problem of multicollinearity (explanatory variables that are correlated with each other). Most of the information contained in all explanatory variables can frequently be captured in just a few principal components. Discarding the least useful principal components makes the model calculation more stable and efficient without significant loss of accuracy.

To calculate principal components, there must be variability in the explanatory variables, so if any of your Input explanatory variable rasters contain constant values within a subset, these constant rasters will not be used to compute principal components for that subset. If all explanatory variable rasters in a subset contain constant values, the Output diagnostic feature class will report that zero principal components were used and that they captured zero percent of the variability.

Double
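
The selection rule described for the Minimum cumulative percent of variance parameter can be sketched as follows. The function takes the variance captured by each principal component (hypothetical numbers here) and returns the smallest number of leading components whose cumulative share reaches the threshold. It only illustrates the rule as stated above and does not perform the principal component analysis itself.

```python
def components_needed(component_variances, min_cumulative_percent):
    """Return (count, captured_percent) for the fewest leading components
    whose cumulative variance reaches the threshold."""
    ordered = sorted(component_variances, reverse=True)
    total = sum(ordered)
    if total <= 0:
        # No variability at all: zero components, zero percent captured.
        return 0, 0.0
    cumulative = 0.0
    for count, variance in enumerate(ordered, start=1):
        cumulative += variance
        percent = 100.0 * cumulative / total
        if percent >= min_cumulative_percent:
            return count, percent
    # Floating-point rounding kept the sum just under the threshold: use all components.
    return len(ordered), 100.0


if __name__ == "__main__":
    variances = [5.0, 3.0, 1.5, 0.5]  # hypothetical component variances
    print(components_needed(variances, 75))
    print(components_needed(variances, 97.5))
```

With these numbers, a threshold of 75 keeps two components (80 percent captured), while 97.5 requires all four.
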
Subset polygon features
(Optional)

Polygon features defining where the local models will be calculated. The points inside each polygon will be used for the local models. This parameter is useful when you know that the values of the dependent variable change according to known regions. For example, these polygons may represent administrative health districts where health policy changes in different districts.

You can also use the Generate Subset Polygons tool to create subset polygons. The polygons created by this tool will be non-overlapping and compact.

Feature Layer
Dependent variable transformation type
(Optional)

Type of transformation to be applied to the input data.

  • None —Do not apply any transformation. This is the default.
  • Empirical —Multiplicative Skewing transformation with Empirical base function.
  • Log empirical —Multiplicative Skewing transformation with Log Empirical base function. All data values must be positive. If this option is chosen, all predictions will be positive.
String
Semivariogram model type
(Optional)

The semivariogram model that will be used for the interpolation.

  • Exponential —Exponential semivariogram
  • Nugget —Nugget semivariogram
  • Whittle —Whittle semivariogram
  • K-Bessel —K-Bessel semivariogram
String
Maximum number of points in each local model
(Optional)

The input data will automatically be divided into subsets that do not have more than this number of points. If Subset polygon features are supplied, the value of this parameter will be ignored.

Long
Local model area overlap factor
(Optional)

A factor representing the degree of overlap between local models (also called subsets). Each input point can fall into several subsets, and the overlap factor specifies the average number of subsets that each point will fall into. A high value of the overlap factor makes the output surface smoother, but it also increases processing time. Values must be between 1 and 5. If Subset polygon features are supplied, the value of this parameter will be ignored.

Double
Number of simulations
(Optional)

The number of simulated semivariograms of each local model. Using more simulations will make the model calculations more stable, but the model will take longer to calculate.

Long
Search neighborhood
(Optional)

Defines which surrounding points will be used to control the output. Standard Circular is the default.

Standard Circular

  • Max neighbors—The maximum number of neighbors that will be used to estimate the value at the unknown location.
  • Min neighbors—The minimum number of neighbors that will be used to estimate the value at the unknown location.
  • Sector Type—The geometry of the neighborhood.
    • One sector—Single ellipse.
    • Four sectors—Ellipse divided into four sectors.
    • Four sectors shifted—Ellipse divided into four sectors and shifted 45 degrees.
    • Eight sectors—Ellipse divided into eight sectors.
  • Angle—The angle of rotation for the axis (circle) or semimajor axis (ellipse) of the moving window.
  • Radius—The length of the radius of the search circle.

Smooth Circular

  • Smoothing factor—The Smooth Interpolation option creates an outer ellipse and an inner ellipse at a distance equal to the Major Semiaxis multiplied by the Smoothing factor. The points that fall outside the smallest ellipse but inside the largest ellipse are weighted using a sigmoidal function with a value between zero and one.
  • Radius—The length of the radius of the search circle.
Geostatistical Search Neighborhood

arcpy.ga.EBKRegressionPrediction(in_features, dependent_field, in_explanatory_rasters, out_ga_layer, {out_raster}, {out_diagnostic_feature_class}, {measurement_error_field}, {min_cumulative_variance}, {in_subset_features}, {transformation_type}, {semivariogram_model_type}, {max_local_points}, {overlap_factor}, {number_simulations}, {search_neighborhood})
Name | Explanation | Data Type
in_features

The input point features containing the field that will be interpolated.

Feature Layer
dependent_field

The field of the Input dependent variable features containing the values of the dependent variable. This is the field that will be interpolated.

Field
in_explanatory_rasters
[[in_explanatory_raster,…],...]

Input rasters representing the explanatory variables that will be used to build the regression model. These rasters should represent variables that are known to influence the values of the dependent variable. For example, when interpolating temperature data, an elevation raster should be used as an explanatory variable because temperature is influenced by elevation. You can use up to 62 explanatory rasters.

Raster Layer; Mosaic Layer
out_ga_layer

The output geostatistical layer displaying the result of the interpolation.

Geostatistical Layer
out_raster
(Optional)

The output raster displaying the result of the interpolation. The default cell size will be the maximum of the cell sizes of the Input explanatory variable rasters. To use a different cell size, use the cell size environmental setting.

Raster Dataset
out_diagnostic_feature_class
(Optional)

Output polygon feature class that shows the regions of each local model and contains fields with diagnostic information for the local models. For each subset, a polygon will be created that surrounds the points in the subset so you can easily identify which points were used in each subset. For example, if there are 10 local models, there will be 10 polygons in this output. The feature class will contain the following fields:

  • Number of Principal Components (PrincComps)—The number of principal components that were used as explanatory variables. The value will always be less than or equal to the number of explanatory variable rasters.
  • Percent of Variance (PercVar)—The percent of variance captured by the principal components. This value will be greater than or equal to the value specified in the Minimum cumulative percent of variance parameter below.
  • Root Mean Square Error (RMSE)—The square root of the average squared cross-validation errors. The smaller this value, the better the model fits.
  • Percent 90 Interval (Perc90)—The percent of data points that fall within a 90 percent cross-validation confidence interval. Ideally, this number should be close to 90. A value significantly smaller than 90 indicates that standard errors are being underestimated. A value significantly larger than 90 indicates that standard errors are being overestimated.
  • Percent 95 Interval (Perc95)—The percent of data points that fall within a 95 percent cross-validation confidence interval. Ideally, this number should be close to 95. A value significantly smaller than 95 indicates that standard errors are being underestimated. A value significantly larger than 95 indicates that standard errors are being overestimated.
  • Mean Absolute Error (MeanAbsErr)—The average of the absolute values of the cross-validation errors. This value should be as small as possible. It is similar to Root Mean Square Error, but it is less influenced by extreme values.
  • Mean Error (MeanError)—The average of the cross-validation errors. This value should be close to zero. A value significantly different than zero indicates that the predictions are biased.
  • Continuous Ranked Probability Score (CRPS)—The continuous ranked probability score is a diagnostic that measures the deviation from the predictive cumulative distribution function to each observed data value. This value should be as small as possible. This diagnostic has advantages over cross-validation diagnostics because it compares the data to a full distribution rather than to single-point predictions.
Feature Class
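
For intuition about the CRPS field, the sketch below evaluates the score for a single observed value against a predictive distribution represented by a list of samples. It uses the standard sample-based identity (mean absolute difference between samples and the observation, minus half the mean absolute difference between pairs of samples). The function name and inputs are hypothetical, and how the tool represents its predictive distribution and aggregates the score over points is not stated here.

```python
def crps_from_samples(samples, observed):
    """Continuous ranked probability score of one observation against
    an empirical predictive distribution given as samples."""
    n = len(samples)
    # Average distance from the samples to the observed value.
    accuracy = sum(abs(x - observed) for x in samples) / n
    # Half the average distance between all pairs of samples (spread of the forecast).
    spread = sum(abs(a - b) for a in samples for b in samples) / (2.0 * n * n)
    return accuracy - spread


if __name__ == "__main__":
    # Hypothetical predictive samples and measured value.
    print(crps_from_samples([1.0, 3.0], 2.0))
```

When the predictive distribution collapses to a single value, the score reduces to the absolute error, which is one way to see how it generalizes single-point diagnostics.
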
measurement_error_field
(Optional)

A field that specifies the measurement error for each point in the dependent variable features. For each point, the value of this field should correspond to one standard deviation of the measured value of the point. Use this field if the measurement error values are not the same at each point.

A common source of nonconstant measurement error is when the data is measured with different devices. One device might be more precise than another, which means that it will have a smaller measurement error. For example, one thermometer rounds to the nearest degree and another thermometer rounds to the nearest tenth of a degree. The variability of measurements is often provided by the manufacturer of the measuring device, or it may be known from empirical practice.

Leave this parameter empty if there are no measurement error values or the measurement error values are unknown.

Field
min_cumulative_variance
(Optional)

Defines the minimum cumulative percent of variance from the principal components of the explanatory variable rasters. Before building the regression model, the principal components of the explanatory variables are calculated, and these principal components are used as explanatory variables in the regression. Each principal component captures a certain percent of the variance of the explanatory variables, and this parameter controls the minimum percent of variance that must be captured by the principal components of each local model. For example, if a value of 75 is provided, the software will use the minimum number of principal components that are necessary to capture at least 75 percent of the variance of the explanatory variables.

Principal components are mutually uncorrelated, so using them as explanatory variables solves the problem of multicollinearity (explanatory variables that are correlated with each other). Most of the information contained in all explanatory variables can frequently be captured in just a few principal components. Discarding the least useful principal components makes the model calculation more stable and efficient without significant loss of accuracy.

To calculate principal components, there must be variability in the explanatory variables, so if any of your Input explanatory variable rasters contain constant values within a subset, these constant rasters will not be used to compute principal components for that subset. If all explanatory variable rasters in a subset contain constant values, the Output diagnostic feature class will report that zero principal components were used and that they captured zero percent of the variability.

Double
in_subset_features
(Optional)

Polygon features defining where the local models will be calculated. The points inside each polygon will be used for the local models. This parameter is useful when you know that the values of the dependent variable change according to known regions. For example, these polygons may represent administrative health districts where health policy changes in different districts.

You can also use the Generate Subset Polygons tool to create subset polygons. The polygons created by this tool will be non-overlapping and compact.

Feature Layer
transformation_type
(Optional)

Type of transformation to be applied to the input data.

  • NONE —Do not apply any transformation. This is the default.
  • EMPIRICAL —Multiplicative Skewing transformation with Empirical base function.
  • LOGEMPIRICAL —Multiplicative Skewing transformation with Log Empirical base function. All data values must be positive. If this option is chosen, all predictions will be positive.
String
semivariogram_model_type
(Optional)

The semivariogram model that will be used for the interpolation.

Learn more about the semivariogram models in EBK Regression Prediction

  • EXPONENTIAL —Exponential semivariogram
  • NUGGET —Nugget semivariogram
  • WHITTLE —Whittle semivariogram
  • K_BESSEL —K-Bessel semivariogram
String
max_local_points
(Optional)

The input data will automatically be divided into subsets that do not have more than this number of points. If Subset polygon features are supplied, the value of this parameter will be ignored.

Long
overlap_factor
(Optional)

A factor representing the degree of overlap between local models (also called subsets). Each input point can fall into several subsets, and the overlap factor specifies the average number of subsets that each point will fall into. A high value of the overlap factor makes the output surface smoother, but it also increases processing time. Values must be between 1 and 5. If Subset polygon features are supplied, the value of this parameter will be ignored.

Double
number_simulations
(Optional)

The number of simulated semivariograms of each local model. Using more simulations will make the model calculations more stable, but the model will take longer to calculate.

Long
search_neighborhood
(Optional)

Defines which surrounding points will be used to control the output. Standard Circular is the default.

The following are Search Neighborhood classes: SearchNeighborhoodStandardCircular and SearchNeighborhoodSmoothCircular.

Standard Circular

  • radius—The length of the radius of the search circle.
  • angle—The angle of rotation for the axis (circle) or semimajor axis (ellipse) of the moving window.
  • nbrMax—The maximum number of neighbors that will be used to estimate the value at the unknown location.
  • nbrMin—The minimum number of neighbors that will be used to estimate the value at the unknown location.
  • sectorType—The geometry of the neighborhood.
    • ONE_SECTOR—Single ellipse.
    • FOUR_SECTORS—Ellipse divided into four sectors.
    • FOUR_SECTORS_SHIFTED—Ellipse divided into four sectors and shifted 45 degrees.
    • EIGHT_SECTORS—Ellipse divided into eight sectors.

Smooth Circular

  • radius—The length of the radius of the search circle.
  • smoothFactor—The Smooth Interpolation option creates an outer ellipse and an inner ellipse at a distance equal to the Major Semiaxis multiplied by the Smoothing factor. The points that fall outside the smallest ellipse but inside the largest ellipse are weighted using a sigmoidal function with a value between zero and one.
Geostatistical Search Neighborhood

Code sample

EBKRegressionPrediction example 1 (Python window)

Interpolates a point feature class using explanatory variable rasters.

import arcpy
arcpy.EBKRegressionPrediction_ga("HousingSales_Points", "SalePrice",
                ["AREASQFEET", "NUMBATHROOMS", "NUMBEDROOMS","TOTALROOMS"],
                "out_ga_layer", None, None, None, 95, None, "LOGEMPIRICAL",
                "EXPONENTIAL", 100, 1, 100, None)
EBKRegressionPrediction example 2 (stand-alone script)

Interpolates a point feature class using explanatory variable rasters.

# Name: EBKRegressionPrediction_Example_02.py
# Description: Interpolates housing prices using EBK Regression Prediction
# Requirements: Geostatistical Analyst Extension
# Author: Esri

# Import system modules
import arcpy

# Set environment settings
arcpy.env.workspace = "C:/gaexamples/data.gdb"

# Set local variables
inDepFeatures = "HousingSales_Points"
inDepField = "SalePrice"
inExplanRasters = ["AREASQFEET", "NUMBATHROOMS", "NUMBEDROOMS", "TOTALROOMS"]
outLayer = "outEBKRP_layer"
outRaster = "outEBKRP_raster"
outDiagFeatures = "outEBKRP_features"
inDepMeField = ""        # no measurement error field
minCumVariance = 97.5
inSubsetFeatures = ""    # no subset polygons; subsets are created automatically
depTransform = ""        # empty string uses the default (no transformation)
semiVariogram = "K_BESSEL"
maxLocalPoints = 50
overlapFactor = 1
numberSimulations = 200
radius = 100000
searchNeighborhood = arcpy.SearchNeighborhoodStandardCircular(radius)

# Check out the ArcGIS Geostatistical Analyst extension license
arcpy.CheckOutExtension("GeoStats")

# Execute EBKRegressionPrediction
arcpy.EBKRegressionPrediction_ga(inDepFeatures, inDepField, inExplanRasters,
                outLayer, outRaster, outDiagFeatures, inDepMeField, minCumVariance,
                inSubsetFeatures, depTransform, semiVariogram, maxLocalPoints,
                overlapFactor, numberSimulations, searchNeighborhood)

Licensing information

  • Basic: Requires Geostatistical Analyst
  • Standard: Requires Geostatistical Analyst
  • Advanced: Requires Geostatistical Analyst
