Density-based Clustering (Spatial Statistics)

Summary

Finds clusters of point features within surrounding noise based on their spatial distribution.

Learn more about how Density-based Clustering works

Illustration

Density-based Clustering Diagram

Usage

  • This tool extracts clusters from your Input Point Features and identifies any surrounding noise.

  • There are three Clustering Method options. The Defined distance (DBSCAN) algorithm finds clusters of points that are in close proximity based on a specified search distance. The Self-adjusting (HDBSCAN) algorithm finds clusters of points similar to DBSCAN but uses varying distances, allowing for clusters with varying densities based on cluster probability (or stability). The Multi-scale (OPTICS) algorithm orders the input points based on the smallest distance to the next feature. A reachability plot is then constructed, and clusters are obtained based on the fewest features to be considered a cluster, a search distance, and characteristics of the reachability plot (such as the slope and height of peaks).

  • This tool produces an output feature class with a new integer field, CLUSTER_ID, showing the cluster each feature falls into. Default rendering is based on the COLOR_ID field. Each color may be assigned to more than one cluster; colors are repeated in such a way that each cluster is visually distinct from its neighboring clusters.

  • This tool also creates messages and charts to help you understand the characteristics of the identified clusters. You can access the messages by hovering over the progress bar, clicking the pop-out button, or expanding the messages section in the Geoprocessing pane. You can also access the messages for a previous run of the Density-based Clustering tool via the geoprocessing history. The charts created can be accessed from the Contents pane.

  • For more information about the output messages and charts and to learn more about the algorithms behind this tool, see How Density-based Clustering works.

  • If Self-adjusting (HDBSCAN) is chosen for the Clustering Method parameter, the output feature class will also contain the fields PROB, which is the probability the feature belongs in its assigned group, OUTLIER, which designates the feature may be an outlier within its own cluster (a high value indicates the feature is more likely to be an outlier), and EXEMPLAR, which denotes the features that are the most prototypical or most representative of each cluster.

  • If Multi-scale (OPTICS) is chosen for the Clustering Method parameter, the output feature class will also contain the fields REACHORDER, which is how the Input Point Features were ordered for analysis, and REACHDIST, which is the distance between each feature and its closest unvisited neighbor.

  • For both Defined distance (DBSCAN) and Multi-scale (OPTICS), the default Search Distance is the highest core distance found in the dataset, excluding those core distances in the top 1 percent (that is, excluding the most extreme core distances).

  • When the Input Features are not projected (that is, when coordinates are given in degrees, minutes, and seconds) or when the output coordinate system is set to a geographic coordinate system, distances are computed using chordal measurements. Chordal distance measurements are used because they can be computed quickly and provide good estimates of true geodesic distances, at least for points within about 30 degrees of each other. Chordal distances are based on an oblate spheroid. Given any two points on the earth's surface, the chordal distance between them is the length of a line, passing through the three-dimensional earth, to connect those two points. Chordal distances are reported in meters.

    Caution:

    It is best practice to project your data, especially if your study area extends beyond 30 degrees. Chordal distances are not a good estimate of geodesic distances beyond 30 degrees.

  • This tool includes z-values in its calculations if z-values are present, and the result will be 3D.

  • This tool supports parallel processing and uses 50 percent of available processors by default. The number of processors can be increased or decreased using the Parallel Processing Factor environment.
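
The chordal distance described in the usage notes above can be illustrated with a short sketch. It assumes the WGS84 ellipsoid (the spheroid the tool actually uses is not stated here), and `to_ecef` and `chordal_distance` are hypothetical helper names; this is not the tool's implementation, only a way to see what "a line passing through the three-dimensional earth" means numerically.

```python
import math

# WGS84 ellipsoid constants (an assumption; the tool's spheroid is not specified here)
WGS84_A = 6378137.0                  # semi-major axis in meters
WGS84_F = 1 / 298.257223563          # flattening
WGS84_E2 = WGS84_F * (2 - WGS84_F)   # first eccentricity squared


def to_ecef(lat_deg, lon_deg, height=0.0):
    """Convert geodetic coordinates (degrees, meters) to earth-centered XYZ in meters."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    # Prime vertical radius of curvature at this latitude
    n = WGS84_A / math.sqrt(1 - WGS84_E2 * math.sin(lat) ** 2)
    x = (n + height) * math.cos(lat) * math.cos(lon)
    y = (n + height) * math.cos(lat) * math.sin(lon)
    z = (n * (1 - WGS84_E2) + height) * math.sin(lat)
    return x, y, z


def chordal_distance(p1, p2):
    """Straight-line distance through the earth between two (lat, lon) points, in meters."""
    return math.dist(to_ecef(*p1), to_ecef(*p2))


# Two points on the equator, 30 degrees of longitude apart
print(chordal_distance((0.0, 0.0), (0.0, 30.0)))
```

Along the equator, the chord for a 30 degree separation is roughly 1 percent shorter than the corresponding arc, while at 90 degrees it is roughly 10 percent shorter, which is consistent with the caution about study areas that extend beyond 30 degrees.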

Syntax

DensityBasedClustering(in_features, output_features, cluster_method, min_features_cluster, {search_distance}, {cluster_sensitivity})
Parameter | Explanation | Data Type
in_features

The point feature class for which density-based clustering will be performed.

Feature Layer
output_features

The output feature class to receive the cluster results.

Feature Class
cluster_method

Specifies the method used to define clusters.

  • DBSCAN: Uses a specified distance to separate dense clusters from sparser noise. DBSCAN is the fastest of the clustering methods but is only appropriate if there is a very clear distance to use that works well to define all clusters that may be present. This results in clusters that have similar densities.
  • HDBSCAN: Uses varying distances to separate clusters of varying densities from sparser noise. HDBSCAN is the most data-driven of the clustering methods and requires the least user input.
  • OPTICS: Uses the distance between neighbors and a reachability plot to separate clusters of varying densities from noise. OPTICS offers the most flexibility in fine-tuning the clusters that are detected, though it is computationally intensive, particularly with a large Search Distance.
String
min_features_cluster

The minimum number of features to be considered a cluster. Any cluster with fewer features than the number provided will be considered noise.

Long
search_distance
(Optional)

The maximum distance to consider.

For Defined distance (DBSCAN), the Minimum Features per Cluster specified must be found within this distance for cluster membership. Individual clusters will be separated by at least this distance. If a feature is located farther than this distance from the next closest feature in the cluster, it will not be included in the cluster.

For Multi-scale (OPTICS), this parameter is optional and is used as the maximum search distance when creating the reachability plot. For OPTICS, the reachability plot, combined with the Cluster Sensitivity parameter, determines cluster membership. If no distance is specified, the tool will search all distances, which will increase processing time.

If left blank, the default distance used will be the highest core distance found in the dataset, excluding those core distances in the top 1 percent (excluding the most extreme core distances).

Linear Unit
cluster_sensitivity
(Optional)

An integer between 0 and 100 that determines the compactness of clusters. A number close to 100 will result in a higher number of dense clusters. A number close to 0 will result in fewer, less compact clusters. If left blank, the tool uses the Kullback-Leibler divergence to find the sensitivity value beyond which adding more clusters adds no further information.

Long
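
To make the interaction between min_features_cluster and search_distance more concrete, the following is a minimal sketch of the classic DBSCAN procedure in plain Python. It is not the tool's implementation: the function name `dbscan`, the noise label of -1, and the choice to count a point as one of its own neighbors are all assumptions made for illustration.

```python
import math

NOISE = -1  # assumed label for features that belong to no cluster


def dbscan(points, search_distance, min_features):
    """Label each point with a cluster ID (1, 2, ...) or NOISE."""

    def neighbors(i):
        # All points within search_distance of point i (including i itself)
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= search_distance]

    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        seeds = neighbors(i)
        if len(seeds) < min_features:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        # Point i is dense enough to start a new cluster
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: reachable but not dense
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_features:
                queue.extend(j_neighbors)  # dense point: keep growing the cluster
    return labels


# Two tight groups of four points and one isolated point
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (5, 5)]
print(dbscan(pts, search_distance=1.5, min_features=3))
# prints [1, 1, 1, 1, 2, 2, 2, 2, -1]
```

Raising min_features to 5 in this example turns every point into noise, because no point has five features within the search distance.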

Code sample

DensityBasedClustering example 1 (Python window)

The following Python window script demonstrates how to use the DensityBasedClustering tool.

import arcpy
arcpy.env.workspace = r"C:\Analysis"
arcpy.DensityBasedClustering_stats("Chicago_Arson", "Arson_HDB", "HDBSCAN", 15)

DensityBasedClustering example 2 (stand-alone script)

The following stand-alone Python script demonstrates how to use the DensityBasedClustering tool.

# Clustering crime incidents in a downtown area using the Density-based Clustering tool

# Import system modules
import arcpy

# Overwrite existing output, by default
arcpy.env.overwriteOutput = True

# Local variables...
workspace = r"E:\working\data.gdb"
arcpy.env.workspace = workspace

# Run Density-based Clustering with the HDBSCAN Cluster Method using a minimum 
# of 15 features per cluster
arcpy.stats.DensityBasedClustering("Chicago_Arson", "Arson_HDB", "HDBSCAN", 15)

# Run Density-based Clustering again using OPTICS with a Search Distance and 
# Cluster Sensitivity to create tighter clusters
arcpy.stats.DensityBasedClustering("Chicago_Arson", "Arson_Optics", "OPTICS", 
                                   15, "1200 Meters", 70)
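
After a run finishes, the CLUSTER_ID field described in the usage notes can be tallied to see how many features fell into each cluster. The sketch below is one possible way to do that and is not part of the original sample. `summarize_clusters` and `read_cluster_ids` are hypothetical helper names, and treating a CLUSTER_ID of -1 as noise is an assumption that should be checked against your own output table.

```python
from collections import Counter

NOISE_ID = -1  # assumed noise label; check the output table to confirm


def summarize_clusters(cluster_ids, noise_id=NOISE_ID):
    """Return (feature count per cluster, number of noise features)."""
    counts = Counter(cluster_ids)
    noise = counts.pop(noise_id, 0)
    return dict(counts), noise


def read_cluster_ids(feature_class):
    """Read the CLUSTER_ID values from an output feature class (requires arcpy)."""
    import arcpy  # imported here so the summary helper also works without ArcGIS
    with arcpy.da.SearchCursor(feature_class, ["CLUSTER_ID"]) as cursor:
        return [row[0] for row in cursor]


# Plain values standing in for the CLUSTER_ID column of an output table
clusters, noise = summarize_clusters([1, 1, 2, -1, 2, 2, -1])
print(clusters, noise)
```

In an ArcGIS Python environment with the workspace set as in the script above, `summarize_clusters(read_cluster_ids("Arson_HDB"))` would apply the same tally to the HDBSCAN output.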

Licensing information

  • Basic: Yes
  • Standard: Yes
  • Advanced: Yes
