The Spatial Outlier Detection tool identifies global or local spatial outliers in point features. A global outlier is a point that is far away from all other points in the feature class. A local outlier is a point that is farther away from its neighbors than would be expected by the density of points in the local area. In addition to classifying input points as outliers or inliers, the tool can produce a raster surface with the calculated local outlier factor (LOF) across the study area, which may assist in determining how new observations will be classified given the spatial distribution of your data. Furthermore, the tool can optimize the selection of needed parameters, such as the number of neighbors and the percent of locations considered outliers.
Potential applications of this tool include the following scenarios:
- An organization maintains air quality monitoring stations that are used for air quality surface interpolation and wants to find the most isolated monitors to determine where supplemental data gathering will be necessary.
- Blood donation drives are often hosted near clusters of potential donors to minimize the travel needed by each donor, but important donors that live far away may require further communication and incentives to help spur voluntary donations. A coordinator may identify these candidate donors that are considered spatial outliers and send a mailer with additional incentives for traveling farther to a blood donation drive.
Global and local spatial outliers
Outliers in space are defined as points whose locations are not typical of the patterns of the rest of the points in the dataset. In the simplest case, this means that a point is far apart from the rest of the points, and this is called a global outlier. For example, a map of emergency rooms across a state may identify emergency rooms in low population areas as global outliers because there are larger distances between them compared to high population areas. However, sometimes it is more meaningful to detect points whose location deviates from the patterns of the points in its area, and this is called a local outlier. Using the same example of emergency rooms in a state, a local spatial outlier is a hospital that is far from other emergency rooms, taking into account the changing density of the emergency rooms across the state. This could identify emergency rooms in high population areas that service more people than surrounding emergency rooms, which could identify areas with lower access to emergency care.
The following image shows a typical result of spatial outlier detection with outliers colored in orange and inliers colored in gray. Global outliers are shown on the right, and local outliers for a small section of Washington state are shown on the left. The local outliers do not appear to be outliers when looking at all points across the country, but they are far from the cluster of points in their local area.
The tool provides an output feature layer highlighting the features designated as spatial outliers. Outliers are symbolized in orange, and inliers are symbolized in semitransparent gray, allowing inlier spatial density to be assessed visually.
The feature layer includes two charts: a bar chart showing the count of outliers and inliers and a histogram showing the distribution of LOF (for local outliers) or neighbor distance values (for global outliers).
The bar chart provides an immediate count of outliers and inliers and can be an effective way of selecting all outliers in the output.
The histogram showing the distribution of LOF or neighbor distance values includes the average value and the threshold used to distinguish outliers and inliers.
Additionally, if a value is entered in the Output Prediction Raster parameter, an output raster is produced showing the calculated LOF or neighbor distance values for each cell in the study area.
Detecting global spatial outliers
Global outliers are simpler to detect than local outliers. For global outlier detection, the tool calculates, for each point, the distance to one of its closest neighbors, called the neighbor distance. By default, the closest neighbor is used, but you can change the number using the Number of Neighbors parameter. Providing a value of 3, for example, will calculate the distance to the third nearest neighbor for each point. Points with the largest neighbor distances are farthest from their closest neighbors, and any point with a neighbor distance above a certain threshold will be detected as a global outlier.
The threshold for detection is determined by the distribution of the neighbor distances and the value of the Detection Sensitivity parameter. The threshold is calculated by adding a number of interquartile ranges (the range of the middle 50 percent of the data) to the third quartile, which you can visualize using a box plot. For the High sensitivity option, one interquartile range is added to the third quartile. For Medium sensitivity, 1.5 interquartile ranges are added. For Low sensitivity, two interquartile ranges are added. Note that higher sensitivities result in lower thresholds, allowing shorter neighbor distances to be detected as global outliers.
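The neighbor distance and threshold described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the tool's implementation: the function names and sample coordinates are hypothetical, and the quartile method (inclusive interpolation) is an assumption, since the quartile calculation used by the tool is not specified here.

```python
import math
import statistics

# Number of interquartile ranges added to the third quartile per sensitivity.
IQR_MULTIPLIER = {"HIGH": 1.0, "MEDIUM": 1.5, "LOW": 2.0}


def neighbor_distances(points, k=1):
    """Return the distance from each point to its k-th nearest neighbor."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result


def global_outliers(points, k=1, sensitivity="MEDIUM"):
    """Flag points whose neighbor distance is above Q3 + m * IQR."""
    dists = neighbor_distances(points, k)
    # Assumption: inclusive quartiles; the tool's quartile method may differ.
    q1, _, q3 = statistics.quantiles(dists, n=4, method="inclusive")
    threshold = q3 + IQR_MULTIPLIER[sensitivity] * (q3 - q1)
    return [d > threshold for d in dists], threshold


# Hypothetical data: a 3 x 3 unit grid plus one distant point.
points = [(x, y) for x in range(3) for y in range(3)] + [(10.0, 10.0)]
flags, threshold = global_outliers(points, k=1, sensitivity="MEDIUM")
print(flags, threshold)
```

In this sample, every grid point is one unit from its nearest neighbor, so only the distant point has a neighbor distance above the threshold.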
Detecting local spatial outliers
Identifying abnormal, or outlying, locations is often more important than identifying typical, or clustered, locations. An example is the investigation of potentially fraudulent financial transactions, which often occur in abnormal locations that differ from typical spatial patterns of transactions.
Despite this need, most approaches that attempt to identify outliers focus on first identifying clusters and then treating the remaining features as spatial outliers. For example, the Density-based Clustering tool is proficient at defining and identifying spatial clustering using a variety of approaches, but its identification of outliers is limited to features that did not satisfy the criteria for a cluster, which are designated as noise features. Consequently, relying solely on clustering approaches to identify spatial outliers has at least two drawbacks. First, clustering approaches, by design, are focused on defining and identifying clusters, not outliers. Second, the designation of an outlier is often binary, with no tolerance or quantified measure of how outlying an observation is.
The local outlier factor (LOF) addresses these drawbacks by focusing on identifying outliers and by providing a measurement of how outlying a feature may be. Furthermore, this approach uses local density patterns to compare the density of a feature's neighborhood in relation to the neighborhoods of other features in its vicinity. This allows a distinction between global outliers, points that are abnormal in the context of the entire study area, and local outliers, points that are abnormal in the context of their immediate vicinity. The focus on local outliers helps shed light on more complex local phenomena that require closer investigation, such as the transaction history scenario that was previously mentioned.
Defining criteria for detecting local spatial outliers
To measure and identify spatial outliers, the tool requires values for two parameters: Number of Neighbors, which is evaluated for each feature, and Percent of Locations Considered Outliers, which applies to the study area. These criteria determine the size of the neighborhood in the LOF calculation and the threshold for designating outliers and inliers.
- The Number of Neighbors parameter establishes a neighborhood for each feature. The LOF calculation uses this neighborhood to calculate a reachability distance and a local reachability density, which forms the basis of comparison to estimate how outlying a feature is spatially from features in its immediate vicinity.
- The Percent of Locations Considered Outliers parameter establishes a threshold for designating features as outliers or inliers. This threshold uses the calculated LOF values for all features in the input data, establishing the number of features with the highest LOF values that are designated as outliers.
Where possible, it is recommended that you use domain knowledge to help set the values of these parameters, such as in the following examples:
- A transportation engineer may have inherent domain knowledge about how many crashes in an intersection indicate a systemic safety problem and may use this value as the number of neighbors evaluated when detecting spatial outliers.
- A coordinator for a blood donation drive has a list of potential donation volunteers. The coordinator has a budget to incentivize the 10 percent most remote volunteers to compensate for their travel time to a blood donation site, and they use 10 percent for the percent of locations considered outliers to help plan the sites and incentives for the blood donation drive.
Additionally, if the Number of Neighbors and Percent of Locations Considered Outliers parameter values are not known, or if you want to explore data-driven values for these parameters, the tool can search for parameter values using the spatial distribution of the data. For more information on this approach, see the Data-driven parameter selection section below.
Local outlier factor
The local outlier factor calculation is the main mechanism for identifying and describing spatial outliers. It is characterized by four main steps: establishing a neighborhood, finding the reachability distance, calculating the local reachability density, and calculating the local outlier factor itself. Each step is described in the sections below.
Establish a neighborhood and find the reachability distance
A local neighborhood is established for each location using a specified minimum number of features. This approach is commonly referred to as k-nearest neighbors, where k corresponds to the specified minimum number of features in the vicinity of the currently analyzed feature. As an example, the illustration below displays a scenario for feature A, where the number of neighbors, k, is equal to 4.
Once a feature's neighborhood is established, a reachability distance is found between the feature and each of its neighbors. For feature A and a neighbor B, the reachability distance is the larger of the distance between A and B and the distance from B to its kth nearest neighbor.
The following illustration shows the reachability distance for point A in a scenario where k = 4.
In the same manner, each feature has reachability distances defined by its k nearest neighbors.
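The neighborhood and reachability distance steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's code: the function names and coordinates are hypothetical, and distances are planar Euclidean.

```python
import math


def k_nearest(points, i, k):
    """Return indices of the k nearest neighbors of points[i], excluding i."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def k_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor."""
    kth = k_nearest(points, i, k)[-1]
    return math.dist(points[i], points[kth])


def reachability_distance(points, a, b, k):
    """Larger of dist(A, B) and the distance from B to its k-th neighbor."""
    return max(math.dist(points[a], points[b]), k_distance(points, b, k))


# Hypothetical coordinates: feature A (index 0) set apart from a tight cluster.
points = [(5.0, 0.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
k = 4
neighbors_of_a = k_nearest(points, 0, k)
reach = {b: reachability_distance(points, 0, b, k) for b in neighbors_of_a}
print(neighbors_of_a, reach)
```

Because A is far from the cluster, its reachability distances equal the straight-line distances to its neighbors. Inside the cluster the other term can dominate: the reachability distance from the center point (index 5) to the corner at index 1 is the corner's distance to its fourth nearest neighbor, which is larger than the distance between the two points.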
Find the local reachability density
Once a reachability distance is found for each feature, the average of the reachability distances for all features in the feature's neighborhood is calculated. This average is used to determine the local reachability density, which is a measurement of spatial density for the feature's neighborhood. The calculation for the local reachability density corresponds to the inverse of the average reachability distance for all features in a feature's neighborhood.
Another way to conceptualize local reachability density is to calculate the reachability distance for all features, B1 through B4, that belong to the neighborhood of feature A, as shown in the image below.
Then sum these distances, divide the total by the number of features (4, in this case), and take the inverse (divide 1 by this average).
You may further conceptualize that as the average reachability distance for features increases, the local reachability density decreases. Conversely, as the average reachability distance for features decreases, the local reachability density increases.
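As a small worked example of this arithmetic, the following sketch uses made-up reachability distances from feature A to B1 through B4; the values are hypothetical and chosen only to keep the numbers simple.

```python
# Hypothetical reachability distances from feature A to B1 through B4.
reach_distances = [4.0, 4.0, 5.0, 7.0]

# Average reachability distance: 20 / 4 = 5.0
average_reach = sum(reach_distances) / len(reach_distances)

# Local reachability density is the inverse of that average: 1 / 5 = 0.2
local_reachability_density = 1.0 / average_reach

# Doubling every distance (a sparser neighborhood) halves the density.
sparser = [2 * d for d in reach_distances]
sparser_density = 1.0 / (sum(sparser) / len(sparser))

print(local_reachability_density, sparser_density)
```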
Calculate the local outlier factor
With local reachability densities calculated for all features, the final step in the local outlier factor calculation is to compute, for each feature, the ratio of the local reachability density of each of its neighbors to the local reachability density of the feature itself. The average of these ratios is the local outlier factor.
To conceptualize how this helps detect whether a feature is a spatial outlier, consider that as the local reachability density of a feature decreases (in other words, the neighborhood of a feature is sparse) and the local reachability density of its neighbors increases (in other words, the neighborhood of a feature's neighbor is more dense), the local outlier factor increases: the feature is more outlying because its spatial density is low and the spatial densities of its neighboring features are higher.
With local outlier factors calculated for all features, the tool then uses the Percent of Locations Considered Outliers parameter value to designate features as outliers and inliers. Consequently, the selection of an appropriate percentage is an important criterion when defining and interpreting the analysis results.
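Putting the four steps together, the following is a minimal sketch of a LOF calculation in the style of Breunig et al. (2000). It is not the tool's implementation: the function names and coordinates are hypothetical, distances are planar Euclidean, and coincident points (which would produce a zero average reachability distance) are not handled.

```python
import math


def k_nearest(points, i, k):
    """Return indices of the k nearest neighbors of points[i], excluding i."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def local_outlier_factors(points, k):
    """Return one LOF value per point (assumes no coincident points)."""
    n = len(points)
    neighbors = [k_nearest(points, i, k) for i in range(n)]
    # Distance from each point to its k-th nearest neighbor.
    k_dist = [math.dist(points[i], points[neighbors[i][-1]]) for i in range(n)]

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for a in range(n):
        reach = [
            max(math.dist(points[a], points[b]), k_dist[b]) for b in neighbors[a]
        ]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio of each neighbor's density to the feature's own density.
    return [sum(lrd[b] for b in neighbors[a]) / (k * lrd[a]) for a in range(n)]


# Hypothetical coordinates: feature A (index 0) set apart from a tight cluster.
points = [(5.0, 0.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
lof = local_outlier_factors(points, k=4)
print([round(v, 2) for v in lof])
```

In this sample, the isolated feature receives a LOF well above 1 because its neighbors sit in a much denser neighborhood than it does, while the clustered features have LOF values close to 1.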
Data-driven parameter selection
The Number of Neighbors and Percent of Locations Considered Outliers parameters have an important influence on the result of the LOF calculation and the detected spatial outliers. While it is recommended that domain knowledge drive the selection of these parameter values, not every analysis question has a clear value for these criteria.
If logical values for the number of neighbors or percent of locations considered outliers are not known before executing the analysis, or if you want to evaluate data-driven results, the tool can automatically search for appropriate parameter values based on the spatial distribution of the input features. To do this, the tool performs a search by comparing combinations of the number of neighbors parameter, denoted k, and the percent of locations considered outliers, denoted c, which is converted to a number of locations considered outliers search parameter, denoted n.
For each parameter value pair, [(c1, k1), (c2, k2), …], the local outlier factor is calculated. The resulting LOF values are ranked from highest to lowest, and the mean of the log(LOF) of the top n outliers is compared to the mean of the log(LOF) of the following n inliers (those with the next-highest LOF values) using a t-statistic, T(ci, kj).
Keep the following in mind before you proceed:
- Given a value of c, the tool identifies the k that maximizes the significance of the t-statistic, that is, the number of neighbors that maximizes the difference in LOF between the outlier group and the inlier group.
- The tool identifies the value of c that maximizes the t-statistic after adjusting for the size of n.
The search occurs over a domain of k and c values established by the number of input points, and each of the decisions the tool makes for chosen parameter values is reported as a message following tool execution.
For input datasets with many features, only a subset of the possible values for the number of neighbors and LOF threshold is evaluated by the tool.
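The comparison between the top n outliers and the following n inliers can be sketched as below. This is a hedged illustration only: the exact form of the t-statistic used by the tool is not given here, so the sketch assumes a two-sample statistic with unpooled sample variances, and it does not include the adjustment for the size of n that is applied when choosing c. The function name and the LOF values are hypothetical.

```python
import math
import statistics


def separation_t_statistic(lof_values, n_outliers):
    """t-statistic comparing log(LOF) of the top n values with the next n.

    Assumption: (mean_top - mean_next) / sqrt((var_top + var_next) / n).
    LOF values must be positive, and the two groups must not both be constant.
    """
    ranked = sorted(lof_values, reverse=True)
    if n_outliers < 2 or len(ranked) < 2 * n_outliers:
        raise ValueError("need n >= 2 and at least 2 * n LOF values")
    top = [math.log(v) for v in ranked[:n_outliers]]
    following = [math.log(v) for v in ranked[n_outliers:2 * n_outliers]]
    spread = (statistics.variance(top) + statistics.variance(following)) / n_outliers
    return (statistics.mean(top) - statistics.mean(following)) / math.sqrt(spread)


# Hypothetical LOF results for two candidate values of k on the same dataset.
lof_by_k = {
    3: [3.2, 2.9, 1.2, 1.1, 1.0, 1.0, 0.9, 0.9],
    5: [1.8, 1.5, 1.4, 1.2, 1.0, 1.0, 0.9, 0.9],
}
n = 2  # number of locations considered outliers, derived from c
t_by_k = {k: separation_t_statistic(v, n) for k, v in lof_by_k.items()}
best_k = max(t_by_k, key=t_by_k.get)
print(best_k)
```

For a fixed c, the candidate k with the largest statistic is the one that separates the outlier group from the inlier group most clearly; in this made-up example, that is k = 3.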
Considerations and interpretations of outputs
There are several important considerations when interpreting the output of this tool.
- The LOF values calculated for an input dataset cannot be used to compare with calculated LOF values in a different dataset. The LOF calculations are dependent on the spatial distribution of the input features in a dataset; consequently, any differences in separate datasets will result in different calculated local reachability densities and LOF values.
- The calculated LOF results may differ between a point in the output features and a cell in the output prediction raster coinciding with the point. The reason for this difference is that the point's neighborhood includes the neighbors in its vicinity, and does not include itself; however, the raster cell coinciding with the point includes the point as one of its neighbors.
- Small differences in values submitted for the Percent of Locations Considered Outliers parameter may result in the same output percent of locations considered outliers. This can occur when similarities in spatial distribution for features result in the same LOF value for multiple features and the same LOF threshold is established even if the percent is different by a small margin.
- Consider a simple dataset with 10 features whose LOF calculation results in the following LOF values: [0, 1, 2, 3, 4, 5, 9, 9, 9, 9]. In this example, a value of 10 percent for percent of locations considered outliers would result in selecting the top 10 percent of LOF values, which corresponds to an LOF threshold of 9. Similarly, passing a value of 40 percent for percent of locations considered outliers would result in selecting the top 40 percent of LOF values, though this will still set an LOF threshold of 9. Therefore, the count of features designated as outliers will be the same for percentages from 10 percent through 40 percent.
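The tie behavior in the example above can be reproduced with a short sketch. The rounding rule and the use of "greater than or equal to the threshold" are assumptions chosen to match the example, not a description of the tool's internals, and the function name is hypothetical.

```python
def designate_outliers(lof_values, percent):
    """Return (threshold, flags) where flags mark values at or above threshold.

    Assumption: the threshold is the LOF value ranked at the requested
    percent, and every feature tied with it is also designated an outlier.
    """
    ranked = sorted(lof_values, reverse=True)
    n_top = max(1, round(len(ranked) * percent / 100))
    threshold = ranked[n_top - 1]
    return threshold, [v >= threshold for v in lof_values]


lof_values = [0, 1, 2, 3, 4, 5, 9, 9, 9, 9]
results = {}
for percent in (10, 20, 30, 40, 50):
    threshold, flags = designate_outliers(lof_values, percent)
    results[percent] = (threshold, sum(flags))
    print(percent, threshold, sum(flags))
```

For 10 through 40 percent, the threshold stays at 9 and four features are designated as outliers; the count changes only once the requested percent reaches past the tied values.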
For more information about the local outlier factor and optimizing parameters, see the following references:
- Breunig, M. M., Kriegel, H. P., Ng, R. T., Sander, J. (2000). "LOF: identifying density-based local outliers." Proceedings of the 2000 ACM SIGMOD international conference on Management of data (pp. 93-104).
- Xu, Z., Kakde, D., Chaudhuri, A. (2019). "Automatic Hyperparameter Tuning Method for Local Outlier Factor, with Applications to Anomaly Detection." 2019 IEEE International Conference on Big Data (pp. 4201-4207).