Geographically Weighted Regression (GWR) (Spatial Statistics)

Summary

Performs Geographically Weighted Regression (GWR), a local form of linear regression used to model spatially varying relationships.

Legacy:
An enhanced version of this tool has been added to ArcGIS Pro 2.3. This is the tool documentation for the older deprecated tool. It is recommended that you upgrade and use the new Geographically Weighted Regression tool available in ArcGIS Pro 2.3 or later.

Illustration

Geographically Weighted Regression
GWR is a local regression model. Coefficients are allowed to vary.

Usage

  • GWR constructs a separate equation for every feature in the dataset incorporating the dependent and explanatory variables of features within the bandwidth of each target feature. The shape and extent of the bandwidth is dependent on user input for the Kernel type, Bandwidth method, Distance, and Number of neighbors parameters with one restriction: when the number of neighboring features will exceed 1000, only the closest 1000 are incorporated into each local equation.

  • GWR should be applied to datasets with several hundred features for best results. It is not an appropriate method for small datasets. The tool does not work with multipoint data.

  • Note:

    The GWR tool produces a variety of outputs. A summary of the GWR model is available as messages at the bottom of the Geoprocessing pane during tool execution. You can access the messages by hovering over the progress bar, clicking the pop-out button, or expanding the messages section in the Geoprocessing pane. You can also access the messages of a previously run Geographically Weighted Regression tool via the geoprocessing history.

    The GWR tool also produces an Output feature class and a table with the tool execution summary report diagnostic values. The name of this table is automatically generated using the output feature class name with the _supp suffix. The Output feature class is automatically added to the table of contents with a hot/cold rendering scheme applied to model residuals. A full explanation of each output is provided in Interpreting GWR results.

  • The _supp file is always created in the same location as the Output feature class unless the output feature class is created inside a feature dataset. When the output feature class is inside a feature dataset, the _supp table is created in the geodatabase containing the feature dataset.

  • It is recommended that you use data in a projected coordinate system rather than a geographic coordinate system. This is especially important when distance is a component of the analysis, as it is for GWR when you select Fixed for Kernel type.

  • Some of the GWR tool computations take advantage of multiple CPUs to increase performance and will automatically use up to eight threads/CPUs for processing.

  • You should always begin regression analysis with Ordinary Least Squares (OLS) regression. First find a properly specified OLS model. Then use the same explanatory variables to run GWR (excluding any dummy explanatory variables representing different spatial regimes).

  • Dependent and explanatory variables should be numeric fields containing a variety of values. Linear regression methods, such as GWR, are not appropriate for predicting binary outcomes (for example, all of the values for the dependent variable are either 1 or 0).

  • In global regression models, such as Ordinary Least Squares Regression (OLS), results are unreliable when two or more variables exhibit multicollinearity (when two or more variables are redundant or together tell the same story). GWR builds a local regression equation for each feature in the dataset. When the values for a particular explanatory variable cluster spatially, it is likely that there are problems with local multicollinearity. The condition number field (COND) in the output feature class indicates when results are unstable due to local multicollinearity. In general, be skeptical of results for features with a condition number greater than 30, equal to Null or, for shapefiles, equal to -1.7976931348623158e+308.

  • Use caution when including nominal or categorical data in a GWR model. Where categories cluster spatially, there is risk of encountering local multicollinearity issues. The condition number included in the GWR output indicates when local collinearity is a problem (a condition number less than zero, greater than 30, or set to Null). Results in the presence of local multicollinearity are unstable.

  • Do not use artificial explanatory variables to represent different spatial regimes in a GWR model (for example, census tracts outside the urban core are assigned a value of 1, while all others are assigned a value of 0). Because GWR allows explanatory variable coefficients to vary, these spatial regime explanatory variables are unnecessary, and if included, will create problems with local multicollinearity.

  • To better understand regional variation among the coefficients of your explanatory variables, examine the optional raster coefficient surfaces created by GWR. These raster surfaces are created in the Coefficient raster workspace. For polygon data, you can use graduated color or cold-to-hot rendering on each coefficient field in the Output feature class to examine changes across your study area.

  • You can use GWR for prediction by supplying a Prediction locations feature class (often this feature class is the same as the Input feature class), the Prediction explanatory variables, and an Output prediction feature class. There must be a one-to-one correspondence between the fields used to calibrate the regression model (the values entered for the Explanatory variables parameter) and the fields used for prediction (the values entered for the Prediction explanatory variables parameter). The order of these variables must be the same. Suppose, for example, you are modeling traffic accidents as a function of speed limits, road conditions, number of lanes, and number of cars. You can predict the impact that changing speed limits or improving roads might have on traffic accidents by creating new variables with the amended speed limits and road conditions. The existing variables would be used to calibrate the regression model and would be entered for the Explanatory variables parameter. The amended variables would be used for predictions and would be entered as your Prediction explanatory variables.

  • If a Prediction locations feature class is provided but no Prediction explanatory variables are specified, the Output prediction feature class is created with computed coefficients for each location only (no predictions).

  • A regression model is incorrectly specified if it is missing a key explanatory variable. Statistically significant spatial autocorrelation of the regression residuals or unexpected spatial variation among the coefficients of one or more explanatory variables suggests that your model is incorrectly specified. You should make every effort (through OLS residual analysis and GWR coefficient variation analysis, for example) to discover what these key missing variables are so they can be included in the model.

  • Always question whether it makes sense for an explanatory variable to be nonstationary. For example, suppose you are modeling the density of a particular plant species as a function of several variables including ASPECT. If you find that the coefficient for the ASPECT variable changes across the study area, you are likely seeing evidence of a key missing explanatory variable (perhaps prevalence of competing vegetation, for example). You should make every effort to include all key explanatory variables in your regression model.

  • Caution:

    When using shapefiles, keep in mind that they cannot store null values. Tools or other procedures that create shapefiles from nonshapefile inputs may, consequently, store null values as zero or as some very small negative number (-DBL_MAX = -1.7976931348623158e+308). This can lead to unexpected results. For more information, see Geoprocessing considerations for shapefile output.

  • When the result of a computation is infinity or undefined, the result for nonshapefiles will be Null; for shapefiles, the result will be -DBL_MAX = -1.7976931348623158e+308.

  • When you select either Akaike Information Criterion or Cross Validation for the Bandwidth method parameter, GWR will find the optimal distance (for a fixed kernel) or optimal number of neighbors (for an adaptive kernel). Problems with local multicollinearity, however, will prevent both the Akaike Information Criterion and Cross Validation bandwidth methods from resolving an optimal distance or number of neighbors. If an error occurs indicating severe model design problems, try specifying a particular distance or neighbor count. Then examine the condition numbers in the output feature class to see which features are associated with local collinearity problems.

  • Severe model design errors, or errors indicating local equations do not include enough neighbors, often indicate a problem with global or local multicollinearity. To determine where the problem is, run your model using OLS and examine the VIF value for each explanatory variable. If some of the VIF values are large (above 7.5, for example), global multicollinearity is preventing GWR from solving. More likely, however, local multicollinearity is the problem. Try creating a thematic map for each explanatory variable. If the map reveals spatial clustering of identical values, consider removing those variables from the model or combining those variables with other explanatory variables to increase value variation. If, for example, you are modeling home values and have variables for bedrooms and bathrooms, you may want to combine these to increase value variation, or to represent them as bathroom/bedroom square footage. Avoid using spatial regime dummy variables, spatially clustering categorical or nominal variables, or variables with very few possible values when constructing GWR models.

  • GWR is a linear model subject to the same requirements as OLS. Review the How regression models go bad section in Regression analysis basics to ensure that your GWR model is properly specified.
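
The condition number guidance above can be applied as a simple screening step once the COND values have been read from the output feature class. The following is a minimal sketch in plain Python and is not part of the tool. The function name is_unstable and the sample values are hypothetical, and reading the COND field from the feature class is left out.

```python
import sys

# Shapefiles cannot store Null, so undefined results appear as -DBL_MAX.
SHAPEFILE_NULL = -sys.float_info.max  # -1.7976931348623158e+308


def is_unstable(cond, threshold=30.0):
    """Return True when a COND value suggests local multicollinearity."""
    if cond is None:            # Null in a geodatabase feature class
        return True
    if cond == SHAPEFILE_NULL:  # stand-in for Null in a shapefile
        return True
    return cond < 0 or cond > threshold


# Hypothetical COND values keyed by feature ID
cond_by_feature = {1: 12.4, 2: 31.7, 3: None, 4: SHAPEFILE_NULL, 5: 29.9}
flagged = sorted(fid for fid, c in cond_by_feature.items() if is_unstable(c))
print(flagged)
```

Features flagged this way are candidates for closer inspection; the threshold of 30 is the rule of thumb given above, not a hard limit.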

Syntax

GeographicallyWeightedRegression(in_features, dependent_field, explanatory_field, out_featureclass, kernel_type, bandwidth_method, {distance}, {number_of_neighbors}, {weight_field}, {coefficient_raster_workspace}, {cell_size}, {in_prediction_locations}, {prediction_explanatory_field}, {out_prediction_featureclass})
Parameter | Explanation | Data Type
in_features

The feature class containing the dependent and independent variables.

Feature Layer
dependent_field

The numeric field containing the values that will be modeled.

Field
explanatory_field
[explanatory_field,...]

A list of fields representing independent explanatory variables in the regression model.

Field
out_featureclass

The output feature class that will receive dependent variable estimates and residuals.

Feature Class
kernel_type

Specifies whether the kernel is constructed as a fixed distance, or if it is allowed to vary in extent as a function of feature density.

  • FIXED: The spatial context (the Gaussian kernel) used to solve each local regression analysis is a fixed distance.
  • ADAPTIVE: The spatial context (the Gaussian kernel) is a function of a specified number of neighbors. Where feature distribution is dense, the spatial context is smaller; where feature distribution is sparse, the spatial context is larger.
String
bandwidth_method

Specifies how the extent of the kernel will be determined. When AICc or CV is selected, the tool will find the optimal distance or number of neighbors. Typically, you will select either AICc or CV when you aren't sure what to use for the distance or number_of_neighbors parameter. Once the tool determines the optimal distance or number of neighbors, however, you'll use the BANDWIDTH_PARAMETER option.

  • AICc: The extent of the kernel is determined using the Akaike Information Criterion (AICc).
  • CV: The extent of the kernel is determined using Cross Validation.
  • BANDWIDTH_PARAMETER: The extent of the kernel is determined by a fixed distance or a fixed number of neighbors. You must specify a value for either the distance or number_of_neighbors parameter.
String
distance
(Optional)

The distance to use when kernel_type is FIXED and bandwidth_method is BANDWIDTH_PARAMETER.

Double
number_of_neighbors
(Optional)

The exact number of neighbors to include in the local bandwidth of the Gaussian kernel when kernel_type is ADAPTIVE and bandwidth_method is BANDWIDTH_PARAMETER.

Long
weight_field
(Optional)

The numeric field containing a spatial weighting for individual features. This weight field allows some features to be more important in the model calibration process than others. This is useful when the number of samples taken at different locations varies, values for the dependent and independent variables are averaged, and places with more samples are more reliable (should be weighted higher). If you have an average of 25 different samples for one location but an average of only 2 samples for another location, for example, you can use the number of samples as your weight field so that locations with more samples have a larger influence on model calibration than locations with few samples.

Field
coefficient_raster_workspace
(Optional)

The full path to the workspace where the coefficient rasters will be created. When this workspace is provided, rasters are created for the intercept and every explanatory variable.

Workspace
cell_size
(Optional)

The cell size (a number) or reference to the cell size (a path to a raster dataset) to use when creating the coefficient rasters.

The default cell size is the shortest of the width or height of the extent specified in the geoprocessing environment output coordinate system, divided by 250.

Analysis Cell Size
in_prediction_locations
(Optional)

A feature class containing features representing locations where estimates should be computed. Each feature in this dataset should contain values for all of the explanatory variables specified; the dependent variable for these features will be estimated using the model calibrated for the input feature class data.

Feature Layer
prediction_explanatory_field
[prediction_explanatory_field,...]
(Optional)

A list of fields representing explanatory variables in the Prediction locations feature class. These field names should be provided in the same order (a one-to-one correspondence) as those listed for the input feature class Explanatory variables parameter. If no prediction explanatory variables are given, the output prediction feature class will only contain computed coefficient values for each prediction location.

Field
out_prediction_featureclass
(Optional)

The output feature class to receive dependent variable estimates for each feature in the Prediction locations feature class.

Feature Class
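
To make the kernel_type options more concrete, the sketch below computes Gaussian weights for one target feature under a fixed distance and under an adaptive neighbor count. It is an illustration only. The weight formula exp(-0.5 * (d / b)^2) is a common textbook form of the Gaussian kernel and is an assumption here, not a statement of how the tool computes weights. Taking the adaptive bandwidth as the distance to the k-th nearest neighbor is likewise an assumption. The function names and sample distances are hypothetical.

```python
import math


def gaussian_weight(d, bandwidth):
    """Assumed Gaussian kernel: the weight decays smoothly with distance."""
    return math.exp(-0.5 * (d / bandwidth) ** 2)


def fixed_weights(distances, distance):
    """FIXED: the same bandwidth distance is used at every target feature."""
    return [gaussian_weight(d, distance) for d in distances]


def adaptive_weights(distances, number_of_neighbors):
    """ADAPTIVE: bandwidth is taken as the distance to the k-th nearest neighbor."""
    bandwidth = sorted(distances)[number_of_neighbors - 1]
    return [gaussian_weight(d, bandwidth) for d in distances]


# Hypothetical distances (in map units) from one target feature to its neighbors
dists = [100.0, 250.0, 400.0, 900.0]
w_fixed = fixed_weights(dists, distance=400.0)
w_adapt = adaptive_weights(dists, number_of_neighbors=2)
```

Under these assumptions, a target feature in a dense area reaches its k-th neighbor sooner, so its adaptive bandwidth is smaller, which matches the behavior described for ADAPTIVE.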

Derived Output

Name | Explanation | Data Type
out_table

The table with the tool execution summary report diagnostic values.

Table
out_regression_rasters

The workspace where all of the coefficient rasters will be created.

Raster Layer

Code sample

GeographicallyWeightedRegression example (Python window)

The following Python window script demonstrates how to use the GeographicallyWeightedRegression tool.

import arcpy
arcpy.env.workspace = "c:/data"
arcpy.GeographicallyWeightedRegression_stats("CallData.shp", "Calls", "BUS_COUNT;RENTROCC00;NoHSDip",
                                             "CallsGWR.shp", "ADAPTIVE", "BANDWIDTH_PARAMETER",
                                             "#", "25", "#", "CoefRasters", "135", "PredictionPoints",
                                             "#", "GWRCallPredictions.shp")
GeographicallyWeightedRegression example (stand-alone script)

The following stand-alone Python script demonstrates how to use the GeographicallyWeightedRegression tool.

# Model 911 emergency calls using GWR

# Import system modules
import arcpy

# Set property to overwrite existing outputs
arcpy.env.overwriteOutput = True

# Local variables...
workspace = r"C:\Data"

try:
    # Set the current workspace (to avoid having to specify the full path to the 
    # feature classes each time)
    arcpy.env.workspace = workspace

    # 911 Calls as a function of {number of businesses, number of rental units,
    # number of adults who didn't finish high school}
    # Process: Geographically Weighted Regression... 
    gwr = arcpy.GeographicallyWeightedRegression_stats("CallData.shp", "Calls",
                        "BUS_COUNT;RENTROCC00;NoHSDip",
                        "CallsGWR.shp", "ADAPTIVE", "BANDWIDTH_PARAMETER", "#", "25", "#",
                        "CoefRasters", "135", "PredictionPoints", "#", "GWRCallPredictions.shp")

    # Create Spatial Weights Matrix to use with Global Moran's I tool
    # Process: Generate Spatial Weights Matrix... 
    swm = arcpy.GenerateSpatialWeightsMatrix_stats("CallsGWR.shp", "UniqID",
                        "CallData25Neighs.swm",
                        "K_NEAREST_NEIGHBORS",
                        "#", "#", "#", 25) 
                        
    # Calculate Moran's Index of Spatial Autocorrelation for
    # GWR residuals using a SWM file.
    # Process: Spatial Autocorrelation (Morans I)...      
    moransI = arcpy.SpatialAutocorrelation_stats("CallsGWR.shp", "StdResid",
                        "NO_REPORT", "GET_SPATIAL_WEIGHTS_FROM_FILE", 
                        "EUCLIDEAN_DISTANCE", "NONE", "#", 
                        "CallData25Neighs.swm")

except arcpy.ExecuteError:
    # If an error occurred when running the tool, print out the error message.
    print(arcpy.GetMessages())
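
When prediction explanatory variables are supplied, the one-to-one correspondence described in the Usage section can be checked before the tool is called. The following is a minimal sketch in plain Python. The helper build_field_args and the prediction field name BUS_COUNT_NEW are hypothetical, and the helper can only confirm that the two lists have the same length; confirming that the fields are listed in matching order remains up to you.

```python
def build_field_args(explanatory_fields, prediction_fields):
    """Check the field counts match and return semicolon-delimited strings."""
    if len(explanatory_fields) != len(prediction_fields):
        raise ValueError(
            "Expected {} prediction fields, got {}".format(
                len(explanatory_fields), len(prediction_fields)))
    return ";".join(explanatory_fields), ";".join(prediction_fields)


# Calibration fields are taken from the examples above;
# BUS_COUNT_NEW is a hypothetical amended variable.
explanatory = ["BUS_COUNT", "RENTROCC00", "NoHSDip"]
prediction = ["BUS_COUNT_NEW", "RENTROCC00", "NoHSDip"]
expl_arg, pred_arg = build_field_args(explanatory, prediction)
print(expl_arg)
print(pred_arg)
```

The two returned strings use the same semicolon-delimited form as the examples above and could be passed as the explanatory_field and prediction_explanatory_field arguments.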

Environments

Output Coordinate System

Feature geometry is projected to the output coordinate system after analysis is complete. Consequently, the value entered for the Distance parameter should be specified in the same units as the Input feature class. Values entered for Output cell size should be specified in the same units as the output coordinate system.
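
The default described for the cell_size parameter (the shorter of the extent width or height, divided by 250) is simple arithmetic, sketched below in plain Python. The function name default_cell_size and the extent coordinates are hypothetical; the result is in the units of the coordinate system the extent is expressed in.

```python
def default_cell_size(xmin, ymin, xmax, ymax):
    """Shorter of the extent width or height, divided by 250."""
    width = xmax - xmin
    height = ymax - ymin
    return min(width, height) / 250.0


# Hypothetical extent in meters: 50 km wide and 30 km tall
cell = default_cell_size(500000.0, 4100000.0, 550000.0, 4130000.0)
print(cell)  # 120.0
```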

Related topics