How Optimized Outlier Analysis Works

Optimized Outlier Analysis executes the Cluster and Outlier Analysis (Anselin Local Moran's I) tool using parameters derived from characteristics of your input data. Similar to the way that the automatic setting on a digital camera will use lighting and subject versus ground readings to determine an appropriate aperture, shutter speed, and focus, the Optimized Outlier Analysis tool interrogates your data to obtain the settings that will yield optimal analysis results. If, for example, the Input Features dataset contains incident point data, the tool will aggregate the incidents into weighted features. Using the distribution of the weighted features, the tool will identify an appropriate scale of analysis. The classification type reported in the Output Features will be automatically adjusted for multiple testing and spatial dependence using the False Discovery Rate (FDR) correction method.

Each decision the tool makes in order to give you the best results possible is reported as a message during tool execution, and the rationale for these decisions is documented below.

Just like your camera has a manual mode that allows you to override the automatic settings, the Cluster and Outlier Analysis (Anselin Local Moran's I) tool gives you full control over all parameter options. Running the Optimized Outlier Analysis tool and noting the parameter settings it uses may help you refine the parameters you provide to the full control Cluster and Outlier Analysis (Anselin Local Moran's I) tool.
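If you work in Python, a call along the following lines can run the tool and print the settings it derives. This is a minimal sketch: the workspace and layer names are hypothetical, and the parameter order and keyword values shown are assumptions that should be confirmed against the Optimized Outlier Analysis tool reference for your ArcGIS version.

```python
import arcpy

# Hypothetical workspace and layer names, for illustration only.
arcpy.env.workspace = r"C:\data\analysis.gdb"

# Run Optimized Outlier Analysis on incident points (no Analysis Field),
# letting the tool aggregate the incidents into a fishnet grid.
# Parameter order and keyword values are assumptions; verify them against
# the tool's documentation before use.
arcpy.stats.OptimizedOutlierAnalysis(
    "Incidents",                                  # Input Features
    "Incidents_Outliers",                         # Output Features
    None,                                         # Analysis Field (omitted for incident data)
    "COUNT_INCIDENTS_WITHIN_FISHNET_POLYGONS",    # Incident Data Aggregation Method
)

# The derived settings (cell size, scale of analysis, and so on) appear
# in the geoprocessing messages.
print(arcpy.GetMessages())
```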

The workflow for the Optimized Outlier Analysis tool includes the following components. The calculations and algorithms used within each of these components are described below.

Initial data assessment

In this component, the Input Features and the optional Analysis Field, Bounding Polygons Defining Where Incidents Are Possible, and Incident Data Aggregation Method are scrutinized to ensure there are sufficient features and adequate variation in the values to be analyzed. If the tool encounters records with corrupt or missing geometry, or if an Analysis Field is specified and null values are present, the associated records will be listed as bad records and excluded from analysis.

The Optimized Outlier Analysis tool uses the Anselin Local Moran's I statistic and, as with many statistical methods, the results are not reliable when there are fewer than 30 features. If you provide polygon Input Features, or point Input Features and an Analysis Field, you will need a minimum of 30 features to use this tool. The minimum number of Polygons For Aggregating Incidents Into Points is also 30. The feature layer representing Bounding Polygons Defining Where Incidents Are Possible may include one or more polygons.

The Anselin Local Moran's I statistic also requires values to be associated with each feature it analyzes. When the Input Features you provide represent incident data (when you don't provide an Analysis Field), the tool will aggregate the incidents and the incident counts will serve as the values to be analyzed. After the aggregation process completes, there still must be a minimum of 30 features, so with incident data you will want to start with more than 30 features. The table below documents the minimum number of features for each Incident Data Aggregation Method:

Minimum Number of Incidents | Aggregation Method | Minimum Number of Features After Aggregation
60 | Count incidents within fishnet grid and Count incidents within hexagon grid, without specifying Bounding Polygons Defining Where Incidents Are Possible | 30
30 | Count incidents within fishnet grid and Count incidents within hexagon grid, when you provide a feature class for the Bounding Polygons Defining Where Incidents Are Possible parameter | 30
30 | Count incidents within aggregation polygons | 30
60 | Snap nearby incidents to create weighted points | 30

The Anselin Local Moran's I statistic was also designed for an Analysis Field with a variety of different values. For example, the statistic is not appropriate for binary data. The Optimized Outlier Analysis tool will check the Analysis Field to make sure that the values have at least some variation.
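As a rough illustration, the minimum feature count and variation checks described above amount to logic like the following minimal sketch (plain Python with hypothetical names, not the tool's internal code):

```python
def validate_analysis_values(values, min_features=30):
    """Sketch of the initial data assessment checks: enough features and
    at least some variation in the values to be analyzed."""
    if len(values) < min_features:
        raise ValueError(
            f"At least {min_features} features are required; got {len(values)}."
        )
    if len(set(values)) < 2:
        raise ValueError(
            "The values show no variation; the Anselin Local Moran's I "
            "statistic is not appropriate for constant values."
        )
    return True

# Example: incident counts after aggregation (36 values with variation)
validate_analysis_values([3, 7, 1, 0, 5, 2] * 6)
```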

Locational outliers are features that are much farther away from neighboring features than the majority of features in the dataset. Think of an urban environment with large, densely populated cities in the center and smaller, less densely populated cities at the periphery. If you computed the average nearest neighbor distance for these cities, you would find that the result would be smaller if you excluded the peripheral locational outliers and focused only on the cities near the urban center. This is an example of how locational outliers can have a strong impact on spatial statistics such as Average Nearest Neighbor. Because the Optimized Outlier Analysis tool uses the average and median nearest neighbor calculations both for aggregation and to identify an appropriate scale of analysis, the Initial Data Assessment component of the tool will also identify any locational outliers in the Input Features or Polygons For Aggregating Incidents Into Points and will report the number it encounters. To do this, the tool computes the distance from each feature to its closest noncoincident neighbor and evaluates the distribution of all of these distances. Features whose nearest neighbor distance is more than three standard deviations above the mean nearest neighbor distance are considered locational outliers.
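Under that reading of the rule, a minimal sketch of the locational outlier check might look like the following (plain Python with a brute-force nearest neighbor search, not the tool's internal implementation):

```python
import math

def nearest_neighbor_distances(points):
    """For each point, the distance to its closest noncoincident neighbor.
    Brute force for clarity; assumes at least two distinct locations."""
    dists = []
    for xi, yi in points:
        nearest = min(
            math.hypot(xi - xj, yi - yj)
            for xj, yj in points
            if (xj, yj) != (xi, yi)
        )
        dists.append(nearest)
    return dists

def locational_outliers(points, threshold=3.0):
    """Indexes of features whose nearest neighbor distance is more than
    `threshold` standard deviations above the mean of those distances."""
    d = nearest_neighbor_distances(points)
    mean = sum(d) / len(d)
    std = math.sqrt(sum((x - mean) ** 2 for x in d) / len(d))
    return [i for i, x in enumerate(d) if x > mean + threshold * std]
```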

Incident aggregation

For incident data, the next component in the workflow aggregates your data. There are three possible approaches, based on the Incident Data Aggregation Method you select. The algorithms for each of these approaches are described below.

  • Count incidents within fishnet grid or Count incidents within hexagon grid:
    1. Collapse coincident points yielding a single point at each unique location in the dataset, using the same method employed by the Collect Events tool.
    2. Compare the density of the N Input Features to the density of N random features based on the minimum bounding polygon of the Input Features (in geodesic meters). The average nearest neighbor distance for a random set of N points in the given minimum bounding polygon is computed. If twice this average nearest neighbor distance for the random feature distribution is less than the max extent of the study area divided by 100, then the dataset is considered dense and the grid Cell Size used is max extent divided by 100.
    3. If the dataset is not considered dense using the method above, the Cell Size distance used is 2 times the larger of either the average or the median nearest neighbor distance. The average nearest neighbor distance (ANN) for all of the unique location points, excluding locational outliers, is computed by summing the distance to each feature's nearest neighbor and dividing by the number of features (N). The median nearest neighbor distance (MNN) is computed by sorting the nearest neighbor distances smallest to largest and selecting the distance that falls in the middle of the sorted list (also excluding locational outliers). Whichever distance is larger (ANN or MNN) is multiplied by 2 and used as the grid Cell Size (see the sketch following this list).
    4. Construct a fishnet or hexagon polygon grid using the optimized Cell Size and overlay the grid with the incident points.
    5. Count the incidents in each polygon cell.
    6. When you provide Bounding Polygons Defining Where Incidents Are Possible, all polygon cells within the bounding polygons are retained. When you do not provide Bounding Polygons Defining Where Incidents Are Possible, polygon cells with zero incidents are removed.
    7. If the aggregation process results in fewer than 30 polygon cells, or if the counts in all the polygon cells are identical, you will get a message indicating the Input Features you provided are not appropriate for the Incident Data Aggregation Method selected; otherwise, the aggregation component for this method completes successfully.
  • Count incidents within aggregation polygons:
    1. For this Incident Data Aggregation Method, a Polygons For Aggregating Incidents Into Points feature layer is required. These aggregation polygons overlay the incident points.
    2. Count the incidents within each polygon.
    3. Ensure there is sufficient variation in the incident counts for analysis. If the aggregation process results in all polygons having the same number of incidents, you will get a message indicating the data is not appropriate for the Incident Data Aggregation Method you selected.
  • Snap nearby incidents to create weighted points:
    1. Collapse coincident points yielding a single point at each unique location in the dataset, using the same method employed by the Collect Events tool. Count the number of unique location features (UL).
    2. Compute both the average and the median nearest neighbor distances on all of the unique location points, excluding locational outliers. The average nearest neighbor distance (ANN) is computed by summing the distance to each feature's nearest neighbor and dividing by the number of features (N). The median nearest neighbor distance (MNN) is computed by sorting the nearest neighbor distances smallest to largest and selecting the distance that falls in the middle of the sorted list.
    3. Set the initial snap distance (SD) to the smaller of either ANN or MNN.
    4. Adjust the snap distance to account for coincident points. Scalar = (UL/N), where N is the number of features in the Input Features layer. The adjusted snap distance becomes SD * Scalar.
    5. Integrate the incident points in three iterations, first using the adjusted snap distance times 0.10, then using the adjusted snap distance times 0.25, and finally integrating with a snap distance equal to the fully adjusted snap distance (see the sketch following this list). Performing the integrate step in three passes minimizes distortion of the original point locations.
    6. Collapse the snapped points yielding a single point at each location with a weight to indicate the number of incidents that were snapped together. This part of the aggregation process uses the Collect Events method.
    7. If the aggregation process results in fewer than 30 weighted points, or if the counts for all of the points are identical, you will get a message indicating the Input Features you provided are not appropriate for the Incident Data Aggregation Method selected; otherwise, the aggregation component for this method completes successfully.
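The distance calculations in the fishnet/hexagon and snapping methods above reduce to simple arithmetic on the average and median nearest neighbor distances. The following minimal sketch illustrates both (plain Python; the nearest neighbor distances, maximum extent, and random-point average are assumed to be computed elsewhere, and this is not the tool's internal code):

```python
import statistics

def fishnet_cell_size(nn_dists, max_extent, random_ann):
    """Grid Cell Size rule from the fishnet/hexagon aggregation steps.

    nn_dists   -- nearest neighbor distance of each unique-location point,
                  with locational outliers already excluded
    max_extent -- maximum extent of the study area
    random_ann -- average nearest neighbor distance expected for the same
                  number of random points in the minimum bounding polygon
    """
    if 2 * random_ann < max_extent / 100:       # dataset is considered dense
        return max_extent / 100
    ann = statistics.mean(nn_dists)             # average nearest neighbor distance
    mnn = statistics.median(nn_dists)           # median nearest neighbor distance
    return 2 * max(ann, mnn)

def snap_distances(nn_dists, unique_locations, total_features):
    """Snap distances for 'Snap nearby incidents to create weighted points':
    the three increasing distances used by the three Integrate passes."""
    ann = statistics.mean(nn_dists)
    mnn = statistics.median(nn_dists)
    sd = min(ann, mnn)                                    # initial snap distance
    adjusted = sd * (unique_locations / total_features)   # coincident-point adjustment
    return [adjusted * 0.10, adjusted * 0.25, adjusted]
```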

Scale of analysis

This next component of the Optimized Outlier Analysis workflow is applied to weighted features, either because you provided Input Features with an Analysis Field or because the Incident Data Aggregation Method has created weights from incident counts. The next step is to identify an appropriate scale of analysis. The ideal scale of analysis is a distance that matches the scale of the question you are asking (if you are looking for clusters and outlier areas of a disease outbreak and know that the mosquito vector has a range of 10 miles, for example, using a 10-mile distance would be most appropriate). When you can't justify any specific distance for your scale of analysis, the Optimized Outlier Analysis tool employs the strategies described below.

The first strategy tried is Incremental Spatial Autocorrelation. Whenever you see spatial clustering in the landscape, you are seeing evidence of underlying spatial processes at work. The Incremental Spatial Autocorrelation tool performs the Global Moran's I statistic for a series of increasing distances, measuring the intensity of spatial clustering for each distance. Locational outliers are excluded from the calculations of the beginning and increment distances used in Incremental Spatial Autocorrelation. The intensity of clustering is determined by the z-score returned. Typically, as the distance increases, so does the z-score, indicating intensification of clustering. At some particular distance, however, the z-score generally peaks. Peaks reflect distances where the spatial processes promoting clustering are most pronounced. The Optimized Outlier Analysis tool looks for peak distances using Incremental Spatial Autocorrelation. If a peak distance is found, this distance becomes the scale of analysis. If multiple peak distances are found, the first peak distance is selected.
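Conceptually, the peak search scans the z-score returned at each test distance and takes the first distance whose z-score is higher than the z-scores on either side. The exact peak criterion the tool uses may be more involved, but a minimal sketch of the idea looks like this:

```python
def first_peak_distance(distances, z_scores):
    """Return the first distance at which the Global Moran's I z-score
    forms a local peak (higher than the z-scores immediately before and
    after it), or None if no peak is found."""
    for i in range(1, len(z_scores) - 1):
        if z_scores[i] > z_scores[i - 1] and z_scores[i] > z_scores[i + 1]:
            return distances[i]
    return None

# Example: z-scores rise, peak at 3,000 meters, then fall
print(first_peak_distance(
    [1000, 2000, 3000, 4000, 5000],
    [2.1, 3.4, 4.8, 4.2, 3.9],
))  # -> 3000
```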

When no peak distance is found, Optimized Outlier Analysis examines the spatial distribution of the features and computes the average distance that would yield K neighbors for each feature. K is computed as 0.05 * N, where N is the number of features in the Input Features layer. K will be adjusted so it is never smaller than three or larger than 30. If the average distance that would yield K neighbors exceeds one standard distance, the scale of analysis will be set to one standard distance; otherwise, it will reflect the K neighbor average distance.
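A minimal sketch of this fallback, assuming the average K-neighbor distance and the standard distance are computed elsewhere (not the tool's internal code):

```python
def fallback_scale_of_analysis(n_features, avg_distance_for_k, standard_distance):
    """Fallback scale of analysis when no peak distance is found.

    n_features          -- number of features in the Input Features layer
    avg_distance_for_k  -- function returning the average distance that
                           would give each feature k neighbors (assumed
                           computed from the data elsewhere)
    standard_distance   -- one standard distance of the features
    """
    k = round(0.05 * n_features)
    k = min(max(k, 3), 30)               # K is never smaller than 3 or larger than 30
    d = avg_distance_for_k(k)
    return min(d, standard_distance)     # capped at one standard distance
```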

The Incremental Spatial Autocorrelation step can take a long time to finish for large, dense datasets. Consequently, when a feature with 500 or more neighbors is encountered, the incremental analysis is skipped, and the average distance that would yield 30 neighbors is computed and used for the scale of analysis.

The distance reflecting the scale of analysis will be reported as messages during tool execution and will be used to perform the cluster and outlier analysis. This distance corresponds to the Distance Band or Threshold Distance parameter used by the Cluster and Outlier Analysis (Anselin Local Moran's I) tool.

For features with no neighbors at this distance, the Distance Band is extended to include their nearest neighbor.

Cluster and outlier analysis

At this point in the Optimized Outlier Analysis workflow, all of the checks and parameter settings have been made. The next step is to run the Anselin Local Moran's I statistic. Details about the mathematics for this statistic are outlined in How Cluster and Outlier Analysis (Anselin Local Moran's I) works. Results from the Anselin Local Moran's I statistic will be automatically corrected for multiple testing and spatial dependence using the False Discovery Rate (FDR) correction method. Messages written during tool execution summarize the number of features identified as statistically significant high or low outliers as well as high or low clusters, after the FDR correction is applied.
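The FDR correction essentially ranks the feature p-values and tightens the significance threshold accordingly. The Benjamini-Hochberg procedure below is a common form of FDR correction and is shown only as an illustration; the tool's exact implementation may differ:

```python
def fdr_significant(p_values, alpha=0.05):
    """Benjamini-Hochberg false discovery rate procedure (illustrative only).

    Returns a boolean list marking which features remain statistically
    significant after the correction."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / n) * alpha
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= (rank / n) * alpha:
            cutoff_rank = rank
    significant = [False] * n
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant
```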

Output

The last component of the Optimized Outlier Analysis tool is to create the Output Features. If the Input Features represent incident data requiring aggregation, the Output Features will reflect the aggregated weighted features (fishnet or hexagon polygon cells, the aggregation polygons you provided for the Polygons For Aggregating Incidents Into Points parameter, or weighted points). Each feature will have a local Moran's I index value (LMiIndex), z-score, p-value, cluster/outlier type (COType), and the number of neighbors included in its calculation.

Additional resources

Anselin, Luc. "Local Indicators of Spatial Association-LISA," Geographical Analysis 27(2): 93-115, 1995.

The spatial statistics resource page has short videos, tutorials, web seminars, articles, and a variety of other materials to help you get started with spatial statistics.