An important component of many GIS analysis workflows is the comparison of two variables across a study area to determine if the variables are related and how they are related. For example, is there a relationship between diabetes and obesity in certain areas? Historically, this type of question has been answered through either careful cartographic comparison or linear regression analysis. Cartographic comparison can be subjective, and regression analysis can only detect simple relationships.
The Local Bivariate Relationships tool allows you to quantify the relationship between two variables on the same map by determining if the values of one variable are dependent on or are influenced by the values of another variable and if those relationships vary over geographic space. The tool calculates an entropy statistic in each local neighborhood that quantifies the amount of shared information between the two variables. Unlike other statistics that can often only capture linear relationships (such as linear regression), entropy can capture any structural relationships between the two variables, including exponential, quadratic, sinusoidal, and even complex relationships that cannot be represented by typical mathematical functions. This tool accepts polygons or points and creates an output feature class summarizing the significance and form of the relationships of each input feature. In addition, custom pop-ups and a variety of diagnostics, charts, and messages are provided.
The tool can be used for the following types of applications:
- The Centers for Disease Control and Prevention (CDC) states, "People who have obesity, compared to those with a normal or healthy weight, are at increased risk for many serious diseases and health conditions, including type 2 diabetes." The CDC could use this tool to quantify the strength of the relationship between obesity and diabetes and determine if the relationship is consistent across the study area.
- A public health official could explore the relationships between air pollution levels and socioeconomic factors to detect potential environmental injustice.
What does it mean for two variables to be related to each other? There are many types of relationships between variables, but in the simplest sense, two variables are related if information can be learned about one variable by observing values of the other variable. For example, information can be gained about diabetes risk by observing information about obesity. This is called dependence between two variables, and, conversely, if no information can be gained about one variable by observing the other variable, the variables are called independent.
One way to measure the degree of the relationship between variables is with entropy. Entropy is a fundamental concept of information theory, and it is used to quantify the amount of uncertainty in a random variable. In general, the less predictable the variable, the higher the entropy. Entropy is widely applicable and can be calculated for individual random variables and a joint entropy can be calculated between two or more variables. The joint entropy of two variables is equal to the entropy of the first variable plus the entropy of the second variable, minus the mutual information of the two variables. The mutual information serves as a useful quantification of the level of dependence between the variables because it directly measures how much information can be gained about one variable by observing values of the other variable.
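The relationship between individual entropy, joint entropy, and mutual information can be checked on a small example. The following is a minimal sketch, assuming two discrete variables whose joint probabilities are known exactly; the probability tables and function names are made up for illustration and are not part of the tool.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """Mutual information from a joint table where joint[i][j] = P(X=i, Y=j)."""
    p_x = [sum(row) for row in joint]          # marginal distribution of X
    p_y = [sum(col) for col in zip(*joint)]    # marginal distribution of Y
    h_xy = entropy([p for row in joint for p in row])
    # Rearranged from: H(X, Y) = H(X) + H(Y) - I(X; Y)
    return entropy(p_x) + entropy(p_y) - h_xy

# Hypothetical tables: in the first, X and Y usually agree; in the second,
# every pairing is equally likely, so observing X says nothing about Y.
dependent = [[0.4, 0.1],
             [0.1, 0.4]]
independent = [[0.25, 0.25],
               [0.25, 0.25]]

print(mutual_information(dependent))
print(mutual_information(independent))
```

The dependent table yields a positive mutual information, and the independent table yields zero, because its joint entropy equals the sum of the two individual entropies.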
To estimate the mutual information, the entropy of each individual variable and their joint entropy must all be estimated. However, these values depend on the underlying distributions of the variables, which are almost never known in practice. Fortunately, research has shown that the joint entropy of multiple variables can be estimated using power-weighted minimum spanning trees as a proxy for the joint distribution of the variables (Guo, 2010). This allows estimation of the joint entropy without knowing the individual distributions of the two variables. Being able to estimate the joint entropy is useful, but to determine whether the two variables are related, you need the mutual information between them. While you cannot directly estimate the mutual information without knowing the distributions of the two variables, it is possible to use permutations to construct a null hypothesis test for statistically significant relationships.
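To illustrate the idea of a power-weighted minimum spanning tree, the following sketch computes the sum of the tree's edge lengths, each raised to a power. It is an illustration only: the function name and the gamma parameter are assumptions, and the constants that convert this length into an entropy value in the cited approach are not reproduced here. The sketch relies only on the general idea that data with more structure tends to produce a shorter tree.

```python
import math

def mst_power_length(points, gamma=1.0):
    """Sum of (edge length ** gamma) over the Euclidean minimum spanning tree.

    Uses Prim's algorithm on the complete graph; gamma is assumed to be > 0.
    """
    n = len(points)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best = [math.inf] * n   # shortest known distance from the tree to each point
    best[0] = 0.0           # start the tree at the first point
    total = 0.0
    for _ in range(n):
        # Add the point outside the tree that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u] ** gamma
        # Update the distances of the remaining points to the grown tree.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return total

# Hypothetical (explanatory, dependent) pairs forming a right triangle.
triangle = [(0, 0), (3, 4), (3, 0)]
print(mst_power_length(triangle))             # edges of length 3 and 4
print(mst_power_length(triangle, gamma=2.0))  # 3**2 + 4**2
```

Because raising lengths to a positive power preserves their order, the tree itself does not depend on gamma; only the summed length does.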
Test for significant relationships using permutations
As shown in the previous section, the question of whether two variables are related is equivalent to asking whether their joint entropy (which can be estimated) is significantly less than the sum of the individual entropies of the two variables (which cannot be estimated). Said another way, is the estimated joint entropy of the data significantly less than it would be if the two variables were independent?
To make this determination, permutations are performed on the variables by randomly reassigning each value of the first variable to a new value of the second variable. By randomizing the pairings, the permuted datasets will share no mutual information, but the individual entropies of the two variables will not change. By generating many permuted datasets and estimating the joint entropy of each of them, you can build a distribution of the joint entropy under the null hypothesis that the two variables are independent and not related. The joint entropy estimated from the actual data can then be compared to this distribution, and a pseudo p-value can be calculated based on the proportion of the permutations that have a joint entropy smaller than the joint entropy of the actual data.
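The permutation test can be sketched as follows. This is a simplified illustration, not the tool's implementation: it uses the total length of the minimum spanning tree as a stand-in for the joint entropy estimate, and it assumes the pseudo p-value is computed as (count + 1) / (permutations + 1), where count is the number of permuted datasets whose statistic is less than or equal to the observed one. The function names and the seed parameter are hypothetical.

```python
import math
import random

def mst_length(points):
    """Total edge length of the Euclidean minimum spanning tree (Prim's algorithm)."""
    n = len(points)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best = [math.inf] * n
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], math.dist(points[u], points[v]))
    return total

def pseudo_p_value(x, y, n_permutations=99, seed=0):
    """Share of permuted datasets whose statistic is at most the observed one."""
    rng = random.Random(seed)
    observed = mst_length(list(zip(x, y)))
    y_perm = list(y)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(y_perm)   # break the pairing; each variable's values are unchanged
        if mst_length(list(zip(x, y_perm))) <= observed:
            count += 1
    # Assumed convention: the observed data counts as one of the arrangements.
    return (count + 1) / (n_permutations + 1)

# Hypothetical data: y is fully determined by x.
x = [i / 19 for i in range(20)]
y = list(x)
print(pseudo_p_value(x, y, n_permutations=99))
```

A small pseudo p-value means that few random pairings produce a statistic as low as the observed data, which is evidence against independence.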
Test for local spatial relationships
The procedure described above for testing for significant relationships between two variables can be applied to any continuous bivariate data. To turn this into a test for local spatial relationships, this hypothesis test is performed on each input feature using neighborhoods. This allows you to map the results and identify local areas where the variables have a significant relationship.
All values of the Dependent Variable and Explanatory Variable parameters are first rescaled to be between 0 and 1 by subtracting the minimum value of the entire dataset and dividing by the range (maximum minus minimum) of the entire dataset. The following steps are then performed for each input feature:
- Find the closest neighboring features. The Number of Neighbors parameter specifies how many neighbors will be used. The input feature is counted as a neighbor of itself.
- Merge the values of the two rescaled variables from the neighbors into a single dataset.
- Construct the minimum spanning tree and estimate the joint entropy.
- Randomly permute the values of the two variables and estimate the joint entropy for each permutation. The Number of Permutations parameter specifies how many permutations will be performed.
- Calculate the pseudo p-value and determine if the variables have a statistically significant relationship.
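The rescaling and the first two steps above can be sketched as follows. This is a minimal illustration that assumes point features with planar coordinates and straight-line distances; the function names are hypothetical, and the tool's own neighbor search (for example, for polygons) may differ.

```python
import math

def rescale(values):
    """Rescale values to the range 0 to 1 using the dataset minimum and range."""
    lo, hi = min(values), max(values)
    span = hi - lo
    # A constant variable has no range; map it to 0 to avoid dividing by zero.
    return [(v - lo) / span if span else 0.0 for v in values]

def local_neighborhoods(coords, dependent, explanatory, n_neighbors):
    """For each feature, return the rescaled (explanatory, dependent) pairs
    of its nearest neighbors. The feature counts as a neighbor of itself."""
    dep = rescale(dependent)
    expl = rescale(explanatory)
    neighborhoods = []
    for here in coords:
        # Sort all features by distance; the feature itself is at distance 0.
        order = sorted(range(len(coords)), key=lambda j: math.dist(here, coords[j]))
        nearest = order[:n_neighbors]
        neighborhoods.append([(expl[j], dep[j]) for j in nearest])
    return neighborhoods

# Hypothetical features along a line, with made-up attribute values.
coords = [(0, 0), (1, 0), (2, 0), (10, 0)]
hoods = local_neighborhoods(coords, [1, 2, 3, 4], [10, 20, 30, 40], n_neighbors=2)
print(hoods[0])
```

Each returned neighborhood is the merged dataset on which the minimum spanning tree and permutation steps would then be run.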
Because this procedure performs a different hypothesis test on each input feature, you can use the Apply False Discovery Rate (FDR) Correction parameter to control the proportion of false positive results (Type I errors).
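As an illustration of how a false discovery rate correction works, the following sketch applies the Benjamini-Hochberg procedure to a list of p-values. This is an assumption made for the example: the text above does not state which FDR procedure the tool uses, and the function name is hypothetical.

```python
def fdr_significant(p_values, alpha=0.05):
    """Flag p-values that remain significant under the Benjamini-Hochberg
    procedure (assumed here for illustration)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    # Find the largest rank whose p-value is at or below alpha * rank / n.
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / n:
            cutoff_rank = rank
    # Every test ranked at or before the cutoff is significant.
    significant = [False] * n
    for i in order[:cutoff_rank]:
        significant[i] = True
    return significant

# Hypothetical pseudo p-values for four features.
print(fdr_significant([0.01, 0.04, 0.03, 0.5], alpha=0.05))
```

In this example, three of the four p-values are below 0.05 on their own, but only the smallest passes the corrected threshold.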
Classify the local relationships
Identifying areas where two variables have a statistically significant relationship is very important. To put this information to use, it helps to identify the type of relationship between the variables based on how well the explanatory variable can predict the value of the dependent variable.
Each feature will be classified into one of the following relationship types:
- Not Significant—The relationship between the variables is not statistically significant.
- Positive Linear—The dependent variable increases linearly as the explanatory variable increases.
- Negative Linear—The dependent variable decreases linearly as the explanatory variable increases.
- Concave—The dependent variable changes by a concave curve as the explanatory variable increases. In general, concave curves bend or arc downward.
- Convex—The dependent variable changes by a convex curve as the explanatory variable increases. In general, convex curves bend or arc upward.
- Undefined Complex—The variables are significantly related, but the type of relationship cannot be reliably described by any of the other categories.
The following images are examples of a Concave relationship:
The following images are examples of a Convex relationship:
Classify each significant feature using the following steps:
- Estimate an ordinary linear regression model to predict the dependent variable based on the explanatory variable and calculate the corrected Akaike information criterion (AICc) for the model.
- Estimate a second linear regression model to predict the dependent variable based on the explanatory variable and the square of the explanatory variable (a quadratic regression model), and calculate the AICc.
- Compare the AICc values for the two regression models and choose the one that best represents the relationship. The AICc of the quadratic regression model must be at least 3 less than the AICc of the linear regression model for the quadratic model to be chosen. Otherwise, the linear model will be chosen.
- For the chosen model, calculate the adjusted R² value. If it is less than 0.05, the chosen model explains less than 5 percent of the data variation, and the relationship is classified as Undefined Complex.
- If the adjusted R² is greater than 0.05, classify according to the following rules:
  - If the linear model is chosen and the coefficient is positive, classify as Positive Linear.
  - If the linear model is chosen and the coefficient is negative, classify as Negative Linear.
  - If the quadratic model is chosen and the coefficient of the squared term is positive, classify as Convex.
  - If the quadratic model is chosen and the coefficient of the squared term is negative, classify as Concave.
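The classification steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: the AICc is computed from the residual sum of squares under a Gaussian error model, counting the regression coefficients plus the error variance as parameters, and an adjusted R² of exactly 0.05 is treated as classifiable. All function names are hypothetical.

```python
import math

def solve(a, b):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit(x, y, degree):
    """Least-squares polynomial fit. Returns (coefficients, AICc, adjusted R^2).
    Coefficients are ordered intercept, linear term, squared term."""
    n, k = len(x), degree + 1
    design = [[xi ** p for p in range(k)] for xi in x]
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    coef = solve(xtx, xty)
    rss = sum((yi - sum(c * v for c, v in zip(coef, r))) ** 2
              for r, yi in zip(design, y))
    rss = max(rss, 1e-300)  # guard against log(0) for a perfect fit
    mean_y = sum(y) / n
    tss = sum((yi - mean_y) ** 2 for yi in y)
    params = k + 1  # assumed convention: coefficients plus the error variance
    aicc = (n * math.log(rss / n) + 2 * params
            + 2 * params * (params + 1) / (n - params - 1))
    adj_r2 = 1 - (rss / (n - k)) / (tss / (n - 1))
    return coef, aicc, adj_r2

def classify(x, y):
    """Classify a significant relationship; needs more than 5 observations."""
    linear, quadratic = fit(x, y, 1), fit(x, y, 2)
    # The quadratic model must beat the linear model by at least 3 AICc units.
    use_quadratic = quadratic[1] <= linear[1] - 3
    coef, _, adj_r2 = quadratic if use_quadratic else linear
    if adj_r2 < 0.05:
        return "Undefined Complex"
    if use_quadratic:
        return "Convex" if coef[2] > 0 else "Concave"
    return "Positive Linear" if coef[1] > 0 else "Negative Linear"

# Hypothetical rescaled neighborhood values with a small deterministic wobble.
xs = [i / 10 for i in range(11)]
wobble = [0.01 * (-1) ** i for i in range(11)]
print(classify(xs, [2 * x + e for x, e in zip(xs, wobble)]))
print(classify(xs, [(x - 0.5) ** 2 + e for x, e in zip(xs, wobble)]))
```

The first dataset follows a rising straight line and the second an upward-bending parabola, so they illustrate the Positive Linear and Convex categories.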
The output of this tool is a feature class symbolized by the relationship type, along with summary statistics printed in geoprocessing messages. The output features contain informational fields as well as pop-ups that visualize the relationship using scatter plots.
Scatter plot pop-ups
If specified, custom scatter plot pop-ups are generated for each output feature and can be viewed by clicking the feature in the map. The following image shows a scatter plot pop-up for a feature with a positive linear relationship:
The rescaled explanatory variable is displayed on the x-axis and the rescaled dependent variable on the y-axis. The single highlighted point in the scatter plot is the feature; all other points are neighbors of the feature.
Hover over a point in the scatter plot to see the Source ID of the feature and the rescaled values of the dependent and explanatory variables along with their raw (original scale) values in parentheses.
Summary information about the statistical significance and types of relationships appears as geoprocessing messages. An example of these messages is shown below.
The Categorical Summary section of the message lists the number of features and percentages for each relationship type. The Entropy Results Summary section lists the minimum (Min), maximum (Max), average (Mean), and median (Median) of the entropy and p-values of the input features. The FDR Comparison section lists the number and percent of statistically significant relationships with and without applying a false discovery rate correction.
Geoprocessing messages appear at the bottom of the Geoprocessing pane during tool execution. You can also access the messages using the geoprocessing history by hovering over the progress bar and clicking the pop-out button or expanding the messages section in the Geoprocessing pane.
The tool output contains various fields that provide information about how and why each feature was categorized into its relationship type.
The following fields provide information about whether the relationship between the dependent and explanatory variables is statistically significant:
- Entropy—The estimated entropy value for the feature.
- P-values—The pseudo p-value of the test for a significant relationship between the dependent and explanatory variables. This value is not adjusted for false discovery rate.
- Local Bivariate Relationship Confidence Level—The highest level of confidence that is satisfied for the feature. The possible values of this field are 90% Confident, 95% Confident, 99% Confident, and Not Significant. The confidence level adjusts for false discovery rate if the Apply False Discovery Rate (FDR) Correction parameter is used.
The following fields provide information about classifying the type of relationship between the dependent and explanatory variables:
- Type of Relationship—The type of relationship between the dependent and explanatory variables.
- AICc (Linear)—The corrected Akaike information criterion for the linear model.
- R-squared (Linear)—The R² value for the linear model.
- AICc (Polynomial)—The corrected Akaike information criterion for the polynomial model.
- R-squared (Polynomial)—The R² value for the polynomial model.
AICc and R² values will be null for features that do not have a statistically significant relationship between the dependent and explanatory variables.
Regression coefficients and significance
The following fields provide information about the coefficients of the linear and polynomial models that are used to classify the relationship:
- Intercept—The intercept of the linear model.
- Coefficient (Linear)—The coefficient of the linear term of the linear model.
- Polynomial Intercept—The intercept of the polynomial model.
- Polynomial Coefficient (Linear)—The coefficient of the linear term of the polynomial model.
- Polynomial Coefficient (Squared)—The coefficient of the squared term of the polynomial model.
- Significance of Coefficients (Linear)—A two-character code indicating whether the intercept and coefficient are statistically significant at a 90 percent confidence level. An underscore (_) indicates that the value is not statistically significant, and an asterisk (*) indicates that the value is statistically significant. For example, *_ indicates that the intercept is statistically significant, but the linear coefficient is not. Similarly, _* indicates that the intercept is not statistically significant, but the linear coefficient is.
- Significance of Coefficients (Polynomial)—A three-character code indicating whether the intercept, linear coefficient, and squared coefficient of the polynomial model are statistically significant at a 90 percent confidence level. For example, *_* indicates that the intercept is statistically significant, the linear coefficient is not statistically significant, and the squared coefficient is statistically significant.
All fields related to regression coefficients will be null or empty strings for each feature that does not have a statistically significant relationship between the dependent and explanatory variables.
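When working with the output table in a script, the significance codes described above can be decoded into named flags. The following is a small sketch; the function name and the term labels are hypothetical and chosen only for this example.

```python
LINEAR_TERMS = ("intercept", "linear")
POLYNOMIAL_TERMS = ("intercept", "linear", "squared")

def decode_significance(code, terms):
    """Translate a code such as '*_' into a mapping of term name to significance.

    Returns None for null or empty codes, which occur for features without a
    statistically significant relationship.
    """
    if not code:
        return None
    if len(code) != len(terms) or set(code) - {"*", "_"}:
        raise ValueError(f"unexpected significance code: {code!r}")
    # An asterisk marks a significant value; an underscore marks a nonsignificant one.
    return {term: char == "*" for term, char in zip(terms, code)}

print(decode_significance("*_", LINEAR_TERMS))
print(decode_significance("*_*", POLYNOMIAL_TERMS))
```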
Consider the following tips when using the Local Bivariate Relationships tool:
- Use the Scaling Factor parameter to control how sensitive the tool is to subtle relationships. Scaling factors close to zero will only detect strong relationships between the variables, and scaling factors close to one can additionally detect weaker relationships. The default value of 0.5 is a compromise that will detect strong and moderate relationships.
- The Number of Neighbors parameter value you choose has several important implications. Using a larger number of neighbors provides more data for each hypothesis test, which increases the likelihood of detecting a significant relationship. However, using many neighbors will make the test less local because it has to search farther to find neighbors, so it may fail to detect relationships that are very local. Large numbers of neighbors also rapidly increase the execution time of the tool.
- The Number of Permutations parameter value you choose is a balance between precision and increased processing time. Increasing the number of permutations increases precision by increasing the range of possible values for the pseudo p-value. For example, with 99 permutations, the precision of the pseudo p-value is 0.01, and for 999 permutations, the precision is 0.001. These values are computed by dividing 1 by the number of permutations plus 1: 1/(1+99) and 1/(1+999). A lower number of permutations can be used when first exploring a problem, but it is a best practice to increase the permutations to the highest number feasible for final results. It is also recommended that you use a larger number of permutations when using a larger number of neighbors.
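The precision arithmetic in the last tip can be reproduced with a short calculation. This is a minimal sketch; the function name is hypothetical.

```python
def pseudo_p_precision(n_permutations):
    """Smallest step between possible pseudo p-values: 1 / (permutations + 1)."""
    return 1 / (n_permutations + 1)

for n in (99, 999):
    print(n, pseudo_p_precision(n))
```

The precision is also the smallest pseudo p-value the test can report, which is worth keeping in mind when choosing a number of permutations for final results.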
- Guo, D. "Local entropy map: a nonparametric approach to detecting spatially varying multivariate relationships." International Journal of Geographical Information Science 24, no. 9 (September 2010): 1367-1389.