Given a set of features (Input Feature Class) and an analysis field (Input Field), the Cluster and Outlier Analysis tool identifies spatial clusters of features with high or low values. The tool also identifies spatial outliers. To do this, the tool calculates a local Moran's I value, a z-score, and a pseudo p-value for each feature, and a code representing the cluster type for each statistically significant feature. The z-scores and pseudo p-values represent the statistical significance of the computed index values.
A positive value for I indicates that a feature has neighboring features with similarly high or low attribute values; this feature is part of a cluster. A negative value for I indicates that a feature has neighboring features with dissimilar values; this feature is an outlier. In either instance, the p-value for the feature must be small enough for the cluster or outlier to be considered statistically significant. For more information on determining statistical significance, see "What is a z-score? What is a p-value?". Note that the local Moran's I index (I) is a relative measure and can only be interpreted within the context of its computed z-score or p-value. The z-scores and p-values reported in the output feature class are not corrected for multiple testing or spatial dependency.
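The sign interpretation above can be illustrated with a minimal sketch. It assumes one common formulation of the statistic, I_i = (x_i - mean) / S_i^2 * sum_j w_ij (x_j - mean), where S_i^2 is the variance of the values excluding feature i. The function name `local_morans_i`, the dictionary-based weights layout, and the sample values are hypothetical and are not part of the tool's interface.

```python
from statistics import fmean

def local_morans_i(values, weights):
    """Return one local Moran's I value per feature.

    values  -- list of numeric attribute values, one per feature
    weights -- dict mapping feature index i to {neighbor index j: weight w_ij}
               (hypothetical layout chosen for this sketch)
    """
    n = len(values)
    mean = fmean(values)
    deviations = [v - mean for v in values]
    total_sq = sum(d * d for d in deviations)
    result = []
    for i in range(n):
        # Variance term that leaves out feature i itself.
        s2_i = (total_sq - deviations[i] ** 2) / (n - 1)
        # Weighted sum of the neighbors' deviations from the mean.
        lag = sum(w * deviations[j] for j, w in weights.get(i, {}).items() if j != i)
        result.append(deviations[i] / s2_i * lag)
    return result

# Five features on a line; each feature's neighbors are the adjacent features.
values = [10, 9, 8, 1, 9]
weights = {
    0: {1: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 1.0},
    3: {2: 1.0, 4: 1.0},
    4: {3: 1.0},
}
local_i = local_morans_i(values, weights)
for index, i_value in enumerate(local_i):
    print(index, round(i_value, 3))
```

In this toy dataset, feature 0 (a high value next to another high value) gets a positive I, while feature 3 (a low value between high values) gets a negative I. The sign alone says nothing about significance; that still depends on the z-score or p-value.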
The cluster/outlier type (COType) field distinguishes between a statistically significant cluster of high values (HH), cluster of low values (LL), outlier in which a high value is surrounded primarily by low values (HL), and outlier in which a low value is surrounded primarily by high values (LH). Statistical significance is set at the 95 percent confidence level. When no FDR correction is applied, features with p-values smaller than 0.05 are considered statistically significant. The FDR correction reduces this p-value threshold from 0.05 to a value that better reflects the 95 percent confidence level given multiple testing.
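The following sketch shows how a COType code and an FDR-adjusted threshold could be derived. It assumes a Benjamini-Hochberg style procedure for the FDR cutoff and a simple decision rule based on the sign of I and on whether the feature's value lies above the mean; the tool's exact procedure may differ. The function names and the sample p-values are hypothetical.

```python
def fdr_threshold(p_values, alpha=0.05):
    """Benjamini-Hochberg style cutoff (assumed here, not confirmed by the tool).

    Returns the largest sorted p-value p with p <= rank / m * alpha,
    or 0.0 when no p-value passes.
    """
    m = len(p_values)
    cutoff = 0.0
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= rank / m * alpha:
            cutoff = p
    return cutoff

def cluster_outlier_type(local_i, value, mean, significant):
    """Return HH, LL, HL, or LH, or None when no code is assigned."""
    if not significant or local_i == 0:
        return None
    high = value > mean
    if local_i > 0:
        return "HH" if high else "LL"   # similar neighbors: cluster
    return "HL" if high else "LH"       # dissimilar neighbors: outlier

p_values = [0.001, 0.012, 0.035, 0.045, 0.2]

# Without FDR correction: p-values smaller than 0.05 are significant.
uncorrected = [p < 0.05 for p in p_values]

# With FDR correction: the threshold drops below 0.05.
cutoff = fdr_threshold(p_values)
corrected = [p <= cutoff for p in p_values]

print(cutoff, sum(uncorrected), sum(corrected))
print(cluster_outlier_type(local_i=2.1, value=90, mean=50, significant=True))
print(cluster_outlier_type(local_i=-1.4, value=12, mean=50, significant=True))
```

With these sample p-values, four features pass the uncorrected 0.05 threshold but only two pass the reduced threshold, which is the effect described above.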
This tool creates a new output feature class with the following attributes for each feature in the input feature class: local Moran's I index, z-score, p-value, and COType.
When this tool runs, the output feature class is automatically added to the table of contents (TOC) with default rendering applied to the COType field. The rendering applied is defined by a layer file in <ArcGIS Pro>\Resources\ArcToolBox\Templates\Layers. You can reapply the default rendering, if needed, by using the Apply Symbology From Layer tool.
Permutations are used to estimate how likely it would be to find the observed spatial distribution of your values by chance, by comparing your values to a set of randomly generated values. Even under complete spatial randomness (CSR), some degree of clustering will always be observed simply due to chance. The permutation process generates many random datasets and compares the Local Moran's I of each one to the Local Moran's I of your original data. To do this, each permutation randomly rearranges the neighborhood values around each feature and calculates the Local Moran's I value of this random data. The distribution of the Local Moran's I values generated from permutations shows the range of values that could reasonably be due to randomness. If there is a statistically significant spatial pattern in your data, you expect the Local Moran's I values generated from permutations to display less clustering than the Local Moran's I value from your original data. A pseudo p-value is then calculated as the proportion of Local Moran's I values generated from permutations that display more clustering than your original data. If this proportion (the pseudo p-value) is small (less than 0.05), you can conclude that your data displays statistically significant clustering.
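The permutation procedure described above can be sketched for a single feature as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: neighbors are weighted equally, the feature's own value is held fixed while its neighborhood is refilled at random from the remaining values, and the pseudo p-value is taken as (count of permuted values at least as extreme as the observed one + 1) / (permutations + 1). The function name, the `neighbors` layout, and the sample data are hypothetical.

```python
import random
from statistics import fmean

def pseudo_p_value(values, neighbors, i, permutations=99, seed=0):
    """Return (observed local I, pseudo p-value) for feature i.

    values    -- list of numeric attribute values
    neighbors -- dict mapping a feature index to a list of neighbor indexes
    """
    rng = random.Random(seed)
    n = len(values)
    mean = fmean(values)
    dev = [v - mean for v in values]
    # Variance term that leaves out feature i itself.
    s2_i = sum(d * d for j, d in enumerate(dev) if j != i) / (n - 1)

    def local_i(neighbor_devs):
        return dev[i] / s2_i * sum(neighbor_devs)

    observed = local_i([dev[j] for j in neighbors[i]])
    others = [dev[j] for j in range(n) if j != i]
    k = len(neighbors[i])

    extreme = 0
    for _ in range(permutations):
        # Randomly refill the neighborhood of feature i from all other values.
        permuted = local_i(rng.sample(others, k))
        if observed >= 0 and permuted >= observed:
            extreme += 1
        elif observed < 0 and permuted <= observed:
            extreme += 1
    return observed, (extreme + 1) / (permutations + 1)

# Four high values that neighbor each other, followed by sixteen low values.
values = [50, 48, 52, 49] + list(range(1, 17))
neighbors = {0: [1, 2, 3]}
observed, p = pseudo_p_value(values, neighbors, i=0, permutations=99)
print(round(observed, 3), p)
```

Because feature 0 sits among the only other high values in the dataset, almost no random neighborhood matches the observed clustering, so the pseudo p-value stays close to its smallest possible value for 99 permutations.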
Choosing the number of permutations is a trade-off between precision and processing time. Increasing the number of permutations increases precision by increasing the number of possible values that the pseudo p-value can take. For example, with 99 permutations, the precision of the pseudo p-value is 0.01, and with 999 permutations, the precision is 0.001. These values are computed by dividing one by the number of permutations plus one: 1/(1+99) and 1/(1+999). A lower number of permutations can be used when first exploring a problem, but it is best practice to increase the permutations to the highest number feasible for final results.
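The precision arithmetic above can be checked with a few lines of Python. The helper name `pseudo_p_precision` is hypothetical and only restates the 1/(1 + permutations) rule from the text.

```python
def pseudo_p_precision(permutations):
    """Smallest step between possible pseudo p-values: 1 / (permutations + 1)."""
    return 1 / (permutations + 1)

# The two permutation counts used as examples in the text.
for permutations in (99, 999):
    print(permutations, pseudo_p_precision(permutations))
```

The same value is also the smallest pseudo p-value that can be reported, so a run with 99 permutations cannot distinguish a p-value of 0.01 from anything smaller.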
Best practice guidelines
- Results are only reliable if the input feature class contains at least 30 features.
- This tool requires an input field such as a count, rate, or other numeric measurement. If you are analyzing point data, where each point represents a single event or incident, you might not have a specific numeric attribute to evaluate (a severity ranking, count, or other measurement). If you are interested in finding locations with many incidents (hot spots) and/or locations with very few incidents (cold spots), you will need to aggregate your incident data prior to analysis. The Hot Spot Analysis (Getis-Ord Gi*) tool is also effective for finding hot and cold spots. Only the Cluster and Outlier Analysis (Anselin Local Moran's I) tool, however, will identify statistically significant spatial outliers (a high value surrounded by low values or a low value surrounded by high values).
- Select an appropriate conceptualization of spatial relationships.
  - When you select the Space time window conceptualization, you can identify space-time clusters and outliers. See Space-time cluster analysis for more information.
- Select an appropriate distance band or threshold distance.
  - All features should have at least one neighbor.
  - No feature should have all other features as a neighbor.
  - Especially if the values for the input field are skewed, each feature should have about eight neighbors.
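The feature-count and neighbor guidelines above can be checked before running the analysis. The sketch below assumes point features with planar coordinates and a fixed distance band; the function names, the report keys, and the sample grid are hypothetical, and the brute-force neighbor search is only suitable for small datasets.

```python
from math import dist
from statistics import fmean

def neighbor_counts(points, distance_band):
    """Count, for each point, the other points within the distance band."""
    counts = []
    for i, p in enumerate(points):
        counts.append(
            sum(1 for j, q in enumerate(points)
                if j != i and dist(p, q) <= distance_band)
        )
    return counts

def check_distance_band(points, distance_band):
    """Summarize how a candidate distance band fares against the guidelines."""
    counts = neighbor_counts(points, distance_band)
    n = len(points)
    return {
        "enough_features": n >= 30,               # at least 30 features
        "all_have_a_neighbor": min(counts) >= 1,  # no isolated features
        "none_has_all_others": max(counts) < n - 1,
        "mean_neighbors": fmean(counts),          # compare with about eight
    }

# A 6 x 6 grid of points spaced 1 unit apart (36 features).
points = [(x, y) for x in range(6) for y in range(6)]
report = check_distance_band(points, distance_band=1.5)
print(report)
```

On this grid, a band of 1.5 units gives interior points eight neighbors and corner points three, so every check passes. Trying several candidate bands this way is a quick way to narrow down a reasonable threshold distance.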
The Cluster and Outlier Analysis (Anselin Local Moran's I) tool identifies concentrations of high values, concentrations of low values, and spatial outliers. It can help you answer questions such as these:
- Where are the sharpest boundaries between affluence and poverty in a study area?
- Are there locations in a study area with anomalous spending patterns?
- Where in the study area are rates of diabetes unexpectedly high?
Applications can be found in many fields including economics, resource management, biogeography, political geography, and demographics.
Anselin, Luc. "Local Indicators of Spatial Association—LISA," Geographical Analysis 27(2): 93–115, 1995.
Mitchell, Andy. The ESRI Guide to GIS Analysis, Volume 2. ESRI Press, 2005.