The Density-based Clustering tool works by detecting areas where points are concentrated and where they are separated by areas that are empty or sparse. Points that are not part of a cluster are labeled as noise.
This tool uses unsupervised machine learning clustering algorithms which automatically detect patterns based purely on spatial location and the distance to a specified number of neighbors. These algorithms are considered unsupervised because they do not require any training on what it means to be a cluster.
Clustering, grouping and classification techniques are some of the most widely used methods in machine learning. The Multivariate Clustering and the Spatially Constrained Multivariate Clustering tool also utilize unsupervised machine learning methods to determine natural clusters in your data. These classification methods are considered unsupervised as they do not require a set of pre-classified features to guide or train on in order to find clusters in your data.
Some of the ways this tool might be applied are as follows:
- Urban water supply networks are an important hidden underground asset. The clustering of pipe ruptures and bursting can indicate looming problems. Using the Density-based Clustering tool, an engineer can find where these clusters are and take pre-emptive action on high-danger zones within water supply networks.
- Suppose you have position data for all successful and unsuccessful shots for NBA players. Density-based Clustering can show you the different patterns of successful versus failed shot positions for each player. This information can then be used to inform game strategy.
- Say you are studying a particular pest-borne disease and have a point dataset representing households in your study area, some of which are infested, some of which are not. By using the Density-based Clustering tool, you can determine the largest clusters of infested households to help pinpoint an area to begin treatment and extermination of pests.
- Geo-locating tweets following natural hazards or terror attacks can be clustered and rescue and evacuation needs can be informed based on the size and location of the clusters identified.
The Density-based Clustering tool provides three different Clustering Methods with which to find clusters in your point data:
- Defined distance (DBSCAN)—Uses a specified distance to separate dense clusters from sparser noise. The DBSCAN algorithm is the fastest of the clustering methods, but is only appropriate if there is a very clear Search Distance to use, and that works well for all potential clusters. This requires that all meaningful clusters have similar densities.
- Self-adjusting (HDBSCAN)—Uses a range of distances to separate clusters of varying densities from sparser noise. The HDBSCAN algorithm is the most data-driven of the clustering methods, and thus requires the least user input.
- Multi-scale (OPTICS)—Uses the distance between neighboring features to create a reachability plot which is then used to separate clusters of varying densities from noise. The OPTICS algorithm offers the most flexibility in fine-tuning the clusters that are detected, though it is computationally intensive, particularly with a large Search Distance.
This tool takes Input Point Features, a path for the Output Features and a value representing the minimum number of features required to be considered a cluster. Depending on the Clustering Method selected there may be additional parameters to specify as described below.
Minimum Features per Cluster (Required by all methods)
This parameter determines the minimum number of features required to consider a grouping of points a cluster. For instance, if you have a number of different clusters, ranging in size from 10 points to 100 points, and you choose a Minimum Features per Cluster of 20, all clusters that have less than 20 points will either be considered noise (because they do not form a grouping large enough to be considered a cluster) or they will be merged with nearby clusters in order to satisfy the minimum number of features required. In contrast, choosing a Minimum Features Per Cluster smaller than what you consider your smallest meaningful cluster may lead to meaningful clusters being divided into smaller clusters. In other words, the smaller the Minimum Features Per Cluster, the more clusters will be detected. The larger the Minimum Features Per Cluster, the fewer clusters will be detected.
The ideal Minimum Features Per Cluster depends on what you are trying to capture and your analysis question. It should be set to the smallest size grouping you wish to consider a meaningful cluster. Increasing the Minimum Features Per Cluster may result in merging some of the smaller clusters together.
The Minimum Features Per Cluster parameter is also important in the calculation of the core-distance, which is a measurement used by all three methods to find clusters. Conceptually, the core-distance or each point, is a measurement of the distance that is required to travel from each point to the defined minimum number of features. So, if a large Minimum Features per Cluster is chosen, then the corresponding core-distance will be larger. If a small Minimum Features per Cluster is chosen, then the corresponding core-distance will be smaller. At the borders of clusters, points will have large core-distances, (and likely not fall within a cluster). The core-distance, is related to the Search Distance parameter, which is used by both the Defined distance (DBSCAN) and Multi-scale (OPTICS) methods.
Search Distance (DBSCAN and OPTICS)
For Defined distance (DBSCAN), if the Minimum Features per Cluster cannot be found within the Search Distance from a particular point, then that point will be marked as noise. In other words, if the core-distance (the distance required to reach the minimum number of features) for a feature is greater than the Search Distance, the point is marked as noise. The Search Distance, when using Defined distance (DBSCAN), is treated as a search cut-off.
For Multi-scale (OPTICS), the Search Distance is treated as the maximum distance that will be compared to the core-distance. Multi-scale (OPTICS) uses a concept of a maximum reachability distance, which is the distance from a point to its nearest neighbor that has not yet been visited by the search (Note: OPTICS is an ordered algorithm that starts with the feature at OID 0 and goes from that point to the next to create a plot. The order of the points is fundamental to the results). Multi-scale (OPTICS) will search all neighbor distances within the specified Search Distance, comparing each of them to the core-distance. If any distance is smaller than the core-distance, then that feature is assigned that core-distance as its reachability distance. If all of the distances are larger than the core-distance, then the smallest of those distances is assigned as the reachability distance. These distances are then used to create the reachability plot, which is an ordered plot of the reachability distances that is used to detect clusters.
For both Defined distance (DBSCAN) and Multi-scale (OPTICS), if no distance is specified, the default Search Distance is the highest core-distance found in the dataset, excluding those core-distances in the top 1% (in other words excluding the most extreme core-distances).
Reachability plot (OPTICS)
After all of the reachability distances have been calculated for the entire dataset, a reachability plot is constructed that orders the distances and reveals the clustering structure of the points.
The valleys in the reachability plot imply that a short distance needs to be traveled from one point to the next. Thus, valleys represent distinct clusters in the point pattern. The denser a cluster, the lower the reachability distances will be, and the lower the valley on the plot (the pink cluster, for instance, is the most dense in the above example). Less dense clusters have higher reachability distances and higher valleys on the plot (the dark green cluster, for instance, is the least dense in the above example). The peaks represent the distances needed to travel from cluster to cluster, or from cluster to noise to cluster again, depending on the configuration of the set of points.
Plateaus can occur in a reachability plot when the Search Distance is less than the largest core-distance. A key aspect of using the OPTICS clustering method is determining how to detect clusters from the reachability plot, which is done using the Cluster Sensitivity parameter.
Cluster Sensitivity (OPTICS)
The Cluster Sensitivity parameter determines how the shape (both slope and height) of peaks within the reachability plot will be used to separate clusters. A very high Cluster Sensitivity (close to 100) will treat even the smallest peak as a separation between clusters, resulting in a higher number of clusters. A very low Cluster Sensitivity (close to 0) will treat only the steepest, highest peaks as a separation between clusters, resulting in a lower number of clusters.
The default Cluster Sensitivity is calculated as the threshold at which adding more clusters does not add additional information, done using the Kullback-Leibler Divergence between the original reachability plot and the smoothed reachability plot obtained after clustering.
Comparing the methods
While only Multi-scale (OPTICS) uses the reachability plot to detect clusters, the plot can be used to help explain, conceptually, how these methods differ from each other. For the purposes of illustration, the reachability plot below will be used to explain the differences in the 3 methods. The plot reveals clusters of varying densities and separation distances. We'll explore the results of using each of the clustering methods on this illustrative data.
For Defined distance (DBSCAN), you can imagine drawing a line across the reachability plot at the specified Search Distance. Areas below the Search Distance represent clusters while peaks above the Search Distance represent noise points. Defined distance (DBSCAN) is the fastest of the clustering methods, but is only appropriate if there is a very clear Search Distance to use as a cut-off, and that Search Distance works well for all clusters. This requires that all meaningful clusters have similar densities.
For Self-adjusting (HDBSCAN), the reachability distances can be thought of as nested levels of clusters. Each level of clustering would result in a different set of clusters being detected. Self-adjusting (HDBSCAN) chooses which level of clusters within each series of nested clusters will optimally create the most stable clusters that incorporate as many members as possible without incorporating noise. To learn more about the algorithm, check out the great documentation from the creators of HDBSCAN. Self-adjusting (HDBSCAN) is the most data-driven of the clustering methods, and thus requires the least user input.
For Multi-scale (OPTICS), the work of detecting clusters is based not on a particular distance, but instead on the peaks and valleys within the plot. Let's say that each peak has a level of either Small, Medium, or Large.
Choosing a very high Cluster Sensitivity essentially means that all peaks, from small to large, will act as separations between clusters (resulting in more clusters).
Choosing a moderate Cluster Sensitivity will result in both the Medium and Large peaks being used, but not the small peaks.
Choosing a very low Cluster Sensitivity will result in only the Large peaks being used, resulting in the lowest number of clusters detected.
Multi-scale (OPTICS) offers the most flexibility in fine-tuning the clusters that are detected, though it is also the slowest of the three clustering methods.
This tool produces an output feature class with a new integer field CLUSTER_ID showing you which cluster each feature is a member of. Default rendering is based on the COLOR_ID field. Multiple clusters will be assigned each color. Colors will be assigned and repeated so that each cluster is visually distinct from its neighboring clusters.
If the Clustering Method chosen was Self-adjusting (HDBSCAN), the output feature class will also contain fields PROB, which is the probability the feature belongs in its assigned group, OUTLIER, designating the feature maybe an outlier within its own cluster (when the value is higher more the feature is more likely to be an outlier) and EXEMPLAR which denotes the features that are the most prototypical or most representative of each cluster.
This tool also creates messages and charts to help you understand the characteristics of the clusters identified. You may access the messages by hovering over the progress bar, clicking on the pop-out button, and expanding the Messages section in the Geoprocessing pane. You may also access the messages for a previous run of the Density-based Clustering tool via the geoprocessing history. The charts created can be accessed from the Contents pane.
In addition to the reachability plot that is created when Multi-scale (OPTICS) is chosen, all clustering methods will also create a Bar Chart that shows all unique Cluster IDs. This chart can be used to easily select all of the features that fall within a specified cluster, and to explore the size of each of the clusters.
For more information about DBSCAN:
- Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996, August). A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd (Vol. 96, No. 34, pp. 226-231).
For more information about HDBSCAN:
- Campello, R. J., Moulavi, D., & Sander, J. (2013, April). Density-based clustering based on hierarchical density estimates. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 160-172). Springer, Berlin, Heidelberg.
For more information about OPTICS:
- Ankerst, M., Breunig, M. M., Kriegel, H. P., & Sander, J. (1999, June). OPTICS: ordering points to identify the clustering structure. In ACM Sigmod record (Vol. 28, No. 2, pp. 49-60). ACM.