How Density-based Clustering works

The Density-based Clustering tool works by detecting areas where points are concentrated and where they are separated by areas that are empty or sparse. Points that are not part of a cluster are labeled as noise. Optionally, the time of the points can be used to find groups of points that cluster together in space and time.

This tool uses unsupervised machine learning clustering algorithms which automatically detect patterns based purely on spatial location and the distance to a specified number of neighbors. These algorithms are considered unsupervised because they do not require any training on what it means to be a cluster.

Tip:

Clustering, grouping and classification techniques are some of the most widely used methods in machine learning. The Multivariate Clustering and the Spatially Constrained Multivariate Clustering tool also utilize unsupervised machine learning methods to determine natural clusters in your data. These classification methods are considered unsupervised as they do not require a set of pre-classified features to guide or train on in order to find clusters in your data.

Potential applications

Some of the ways this tool can be applied are as follows:

  • Urban water supply networks are an important hidden underground asset. The clustering of pipe ruptures and bursting can indicate looming problems. Using the Density-based Clustering tool, an engineer can find where these clusters are and take pre-emptive action on high-danger zones within water supply networks.
  • Suppose you have position data for all successful and unsuccessful shots for NBA players. The Density-based Clustering tool can show you the different patterns of successful versus failed shot positions for each player. This information can then be used to inform game strategy.
  • Say you are studying a particular pest-borne disease and have a point dataset representing households in your study area, some of which are infested, some of which are not. By using the Density-based Clustering tool, you can determine the largest clusters of infested households to help pinpoint an area to begin treatment and extermination of pests.
  • Geo-locating tweets following natural hazards or terror attacks can be clustered and rescue and evacuation needs can be informed based on the size and location of the clusters identified.

Clustering Methods

The Density-based Clustering tool's Clustering Methods parameter provides three options with which to find clusters in your point data:

  • Defined distance (DBSCAN)—Uses a specified distance to separate dense clusters from sparser noise. The DBSCAN algorithm is the fastest of the clustering methods, but is only appropriate if there is a very clear Search Distance to use, and that works well for all potential clusters. This requires that all meaningful clusters have similar densities. This method also allows you to use the Time Field and Search Time Interval parameters to find clusters of points in space and time.
  • Self-adjusting (HDBSCAN)—Uses a range of distances to separate clusters of varying densities from sparser noise. The HDBSCAN algorithm is the most data-driven of the clustering methods, and thus requires the least user input.
  • Multi-scale (OPTICS)—Uses the distance between neighboring features to create a reachability plot, which is then used to separate clusters of varying densities from noise. The OPTICS algorithm offers the most flexibility in fine-tuning the clusters that are detected, though it is computationally intensive, particularly with a large Search Distance. This method also allows you to use the Time Field and Search Time Interval parameters to find clusters of points in space and time.

This tool takes Input Point Features, a path for the Output Features and a value representing the minimum number of features required to be considered a cluster. Depending on the Clustering Method selected there may be additional parameters to specify as described below.

Minimum Features per Cluster

This parameter determines the minimum number of features required to consider a grouping of points a cluster. For instance, if you have a number of different clusters, ranging in size from 10 points to 100 points, and you choose a Minimum Features per Cluster of 20, all clusters that have less than 20 points will either be considered noise (because they do not form a grouping large enough to be considered a cluster) or they will be merged with nearby clusters in order to satisfy the minimum number of features required. In contrast, choosing a Minimum Features per Cluster smaller than what you consider your smallest meaningful cluster may lead to meaningful clusters being divided into smaller clusters. In other words, the smaller the Minimum Features per Cluster, the more clusters will be detected. The larger the Minimum Features per Cluster, the fewer clusters will be detected.

Tip:

The ideal Minimum Features per Cluster depends on what you are trying to capture and your analysis question. It should be set to the smallest size grouping you wish to consider a meaningful cluster. Increasing the Minimum Features Per Cluster may result in merging some of the smaller clusters together.

The Minimum Features per Cluster parameter is also important in the calculation of the core-distance, which is a measurement used by all three methods to find clusters. Conceptually, the core-distance for each point is a measurement of the distance that is required to travel from each point to the defined minimum number of features. So, if a large Minimum Features per Cluster is chosen, then the corresponding core-distance will be larger. If a small Minimum Features per Cluster is chosen, then the corresponding core-distance will be smaller. The core-distance is related to the Search Distance parameter, which is used by both the Defined distance (DBSCAN) and Multi-scale (OPTICS) methods.

core-distance graphic

Illustration of the core-distance, measured as the distance from a particular feature that must be traveled to create a cluster with a minimum of 4 features including itself.

Search Distance (DBSCAN and OPTICS)

For Defined distance (DBSCAN), if the Minimum Features per Cluster can be found within the Search Distance from a particular point, that point will be marked as a core-point and included in a cluster, along with all points within the core-distance. A border-point is a point that is within the search distance of a core-point but does not itself have the minimum number of features within the search distance. Each resulting cluster is composed of core-points and border-points, where core-points tend to fall in the middle of the cluster and border-points fall on the exterior. If a point does not have the minimum number of features within the search distance and is not a within the search distance of another core-point (in other words, it is neither a core-point nor a border-point), it is marked as a noise point and not included in any cluster.

Illustration of core-point, border-point, and noise point

Clustering four minimum features per cluster is shown. The core-point has four neighbors within the search distance (including itself). The border-point only has three features within the search distance, but it is a neighbor of a core-point, so it is included in the cluster. The noise point does not have four features within the search distance, and it is not a neighbor of a core-point, so it is not included in a cluster.

For Multi-scale (OPTICS), the Search Distance value is treated as the maximum distance that will be compared to the core-distance. Multi-scale (OPTICS) uses a concept of a minimum reachability distance, which is the distance from a point to its nearest neighbor that has not yet been visited by the search. (Note: OPTICS is an ordered algorithm that starts with the feature with the smallest ID and goes from that point to the next to create a plot. The order of the points is fundamental to the results.) Multi-scale (OPTICS) will search all neighbor distances within the specified Search Distance, comparing each of them to the core-distance. If any distance is smaller than the core-distance, that feature is assigned that core-distance as its reachability distance. If all of the distances are larger than the core-distance, the smallest of those distances is assigned as the reachability distance. When no further points are inside the search distance, the process restarts at a new point that has not previously been visited. At each iteration, reachability distances are recalculated and sorted. The smallest of the distances is used for the final reachability distance of each point. These reachability distances are then used to create the reachability plot, which is an ordered plot of the reachability distances that is used to detect clusters.

Illustration of the reachability distance

When all features within the core-distance have already been visited, the Reachability Distance assigned to a feature is the smallest distance between the selected feature and all other features within the Search Distance threshold.

For both Defined distance (DBSCAN) and Multi-scale (OPTICS), if no distance is specified, the default Search Distance is the highest core-distance found in the dataset, excluding those core-distances in the top 1% (in other words excluding the most extreme core-distances).

Clustering in space and time (DBSCAN and OPTICS)

In two of the clustering methods, the time of each point can be provided in the Time Field parameter. If provided, the tool will find clusters of points that are close to each other in space and time. The Search Time Interval parameter is analogous to the search distance and is used to determine the core-points, border-points, and noise points of the space-time clusters.

For Defined distance (DBSCAN), when searching for cluster members, the Minimum Features per Cluster must be found within the Search Distance and Search Time Interval values to be a core-point of a space-time cluster.

In the following image, the search distance is 1 mile, the search time interval is 3 days, and the minimum number of features is 4. The point P is a core point because there are four points within the search distance and time interval (including itself). The point Q is a border point because there are only three points within the search distance and time interval, but the point is within the search distance and time interval of a core-point of the cluster (the point with time stamp 1/5/2021). The points C and D are noise points because they are neither core-points nor border-points of the cluster, and all other points (including P and Q) form into a single cluster.

DBSCAN with time

For Multi-scale (OPTICS), all points outside of the Search Time Interval range of a point will be excluded when the point calculates its core-distance, searches all neighbor distances within the specified Search Distance value, and calculates the reachability distance.

In the following image, the search distance is 2 miles, the search time interval is 3 days, and the minimum number of features is 4. The core-distance for point P is calculated skipping point q3 because it is outside of the search time interval. In this case, the core-distance of P is equal to the reachability distance of P.

OPTICS with time

The next image shows the calculation of the reachability distance when some neighboring points have already been visited. Points q1, q2, and q4 were skipped because they were previously visited, and points q3 and q5 were skipped because they are outside of the search time window.

OPTICS with time

Reachability plot (OPTICS)

After all of the reachability distances have been calculated for the entire dataset, a reachability plot is constructed that orders the distances and reveals the clustering structure of the points.

Reachability Chart example

The valleys in the reachability plot imply that a short distance needs to be traveled from one point to the next. Thus, valleys represent distinct clusters in the point pattern. The denser a cluster, the lower the reachability distances will be, and the lower the valley on the plot (the pink cluster, for instance, is the most dense in the above example). Less dense clusters have higher reachability distances and higher valleys on the plot (the dark green cluster, for instance, is the least dense in the above example). The peaks represent the distances needed to travel from cluster to cluster, or from cluster to noise to cluster again, depending on the configuration of the set of points.

The reachability distances of peaks versus valleys

Reachability distances form peaks and valleys in the reachability plot. As longer distances are measured between points, a peak results in the reachability plot.

Plateaus can occur in a reachability plot when the Search Distance is less than the largest core-distance. A key aspect of using the OPTICS clustering method is determining how to detect clusters from the reachability plot, which is done using the Cluster Sensitivity parameter.

Cluster Sensitivity (OPTICS)

The Cluster Sensitivity parameter determines how the shape (both slope and height) of peaks within the reachability plot will be used to separate clusters. A very high Cluster Sensitivity (close to 100) will treat even the smallest peak as a separation between clusters, resulting in a higher number of clusters. A very low Cluster Sensitivity (close to 0) will treat only the steepest, highest peaks as a separation between clusters, resulting in a lower number of clusters.

Illustration of cluster sensitivity

Low Cluster Sensitivity versus High Cluster Sensitivity

The default Cluster Sensitivity is calculated as the threshold at which adding more clusters does not add additional information, done using the Kullback-Leibler Divergence between the original reachability plot and the smoothed reachability plot obtained after clustering.

Comparing the methods

While only Multi-scale (OPTICS) uses the reachability plot to detect clusters, the plot can be used to help explain, conceptually, how these methods differ from each other. For the purposes of illustration, the reachability plot below will be used to explain the differences in the 3 methods. The plot reveals clusters of varying densities and separation distances. We'll explore the results of using each of the clustering methods on this illustrative data.

Conceptual reachability plot

A conceptual reachability plot with clusters of points with varying densities and distances between them.

For Defined distance (DBSCAN), you can imagine drawing a line across the reachability plot at the specified Search Distance. Areas below the Search Distance represent clusters while peaks above the Search Distance represent noise points. Defined distance (DBSCAN) is the fastest of the clustering methods, but is only appropriate if there is a very clear Search Distance to use as a cut-off, and that Search Distance works well for all clusters. This requires that all meaningful clusters have similar densities.

Illustration of Search Distance in the DBSCAN algorithm

Illustration of Search Distance in the DBSCAN algorithm

For Self-adjusting (HDBSCAN), the reachability distances can be thought of as nested levels of clusters. Each level of clustering would result in a different set of clusters being detected. Self-adjusting (HDBSCAN) chooses which level of clusters within each series of nested clusters will optimally create the most stable clusters that incorporate as many members as possible without incorporating noise. To learn more about the algorithm, check out the great documentation from the creators of HDBSCAN. Self-adjusting (HDBSCAN) is the most data-driven of the clustering methods, and thus requires the least user input.

Illustration of the hierarchical levels of HDBSCAN

Illustration of the hierarchical levels used by the HDBSCAN algorithm to find the optimal clusters to maximize stability

For Multi-scale (OPTICS), the work of detecting clusters is based not on a particular distance, but instead on the peaks and valleys within the plot. Let's say that each peak has a level of either Small, Medium, or Large.

Illustration of the intensity of the peaks in the reachability plot

Illustration of the intensity of the peaks in the reachability plot

Choosing a very high Cluster Sensitivity essentially means that all peaks, from small to large, will act as separations between clusters (resulting in more clusters).

Illustration of high cluster sensitivity

Illustration of the impact of a high Cluster Sensitivity used with OPTICS and the corresponding clusters

Choosing a moderate Cluster Sensitivity will result in both the Medium and Large peaks being used, but not the small peaks.

Illustration of moderate cluster sensitivity

Illustration of the impact of a moderate Cluster Sensitivity used with OPTICS and the corresponding clusters

Choosing a very low Cluster Sensitivity will result in only the Large peaks being used, resulting in the lowest number of clusters detected.

Illustration of low cluster sensitivity

Illustration of the impact of a low Cluster Sensitivity parameter used with OPTICS and the corresponding clusters

Multi-scale (OPTICS) offers the most flexibility in fine-tuning the clusters that are detected, though it is also the slowest of the three clustering methods.

Results

This tool produces an output feature class with a new integer field CLUSTER_ID showing you which cluster each feature is a member of. Default rendering is based on the COLOR_ID field. Multiple clusters will be assigned each color. Colors will be assigned and repeated so that each cluster is visually distinct from its neighboring clusters.

If the Clustering Method chosen was Self-adjusting (HDBSCAN), the output feature class will also contain fields PROB, which is the probability the feature belongs in its assigned group, OUTLIER, designating the feature maybe an outlier within its own cluster (when the value is higher more the feature is more likely to be an outlier) and EXEMPLAR which denotes the features that are the most prototypical or most representative of each cluster.

This tool also creates messages and charts to help you understand the characteristics of the clusters identified. You may access the messages by hovering over the progress bar, clicking on the pop-out button, and expanding the Messages section in the Geoprocessing pane. You may also access the messages for a previous run of the Density-based Clustering tool via the geoprocessing history. The charts created can be accessed from the Contents pane.

In addition to the reachability plot that is created when Multi-scale (OPTICS) is chosen, all clustering methods will also create a Bar Chart that shows all unique Cluster IDs. This chart can be used to easily select all of the features that fall within a specified cluster, and to explore the size of each of the clusters.

Additional resources

For more information about DBSCAN:

  • Birant, D. & Kut, A. (2007, January). "ST-DBSCAN: An algorithm for clustering spatial–temporal data." In Data & Knowledge Engineering (Vol 60, No. 1, pp. 208-221). https://doi.org/10.1016/j.datak.2006.01.013
  • Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996, August). "A density-based algorithm for discovering clusters in large spatial databases with noise." In Kdd (Vol. 96, No. 34, pp. 226-231).

For more information about HDBSCAN:

  • Campello, R. J., Moulavi, D., & Sander, J. (2013, April). "Density-based clustering based on hierarchical density estimates." In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 160-172). Springer, Berlin, Heidelberg.

For more information about OPTICS:

  • Agrawal, K. P., Garg, S., Sharma, S., & Patel, P. (2016, November). "Development and validation of OPTICS based spatio-temporal clustering technique." In Information Sciences (Vol 369, pp. 388-401). https://doi.org/10.1016/j.ins.2016.06.048.
  • Ankerst, M., Breunig, M. M., Kriegel, H. P., & Sander, J. (1999, June). "OPTICS: ordering points to identify the clustering structure." In ACM Sigmod record (Vol. 28, No. 2, pp. 49-60). ACM.