Density-based Clustering

Summary

Finds clusters of point features within surrounding noise based on their spatial distribution.

Learn more about how Density-based Clustering works.

Illustration

Density-based Clustering Diagram

Usage

  • This tool extracts clusters from your Input Point Features and identifies any surrounding noise.

  • There are three Clustering Method options. Defined distance uses the DBSCAN algorithm and finds clusters of points that are in close proximity based on a specified search distance. Self-adjusting uses the HDBSCAN algorithm, which finds clusters in a similar way to DBSCAN but uses varying distances, allowing for clusters of varying densities based on cluster probability (or stability). Multi-scale uses the OPTICS algorithm, which orders the input points based on the smallest distance to the next feature. A reachability plot is then constructed, and clusters are obtained based on the minimum number of features to be considered a cluster, a search distance, and characteristics of the reachability plot (such as the slope and height of peaks).

  • This tool produces an output feature class with a new integer field, CLUSTER_ID, that identifies the cluster each feature falls into. Default rendering is based on the COLOR_ID field. The same color may be assigned to more than one cluster; colors are assigned and repeated so that each cluster is visually distinct from its neighboring clusters.

  • This tool also creates messages and charts to help you understand the characteristics of the clusters identified. You may access the messages by hovering over the progress bar, clicking on the pop-out button, or expanding the messages section in the Geoprocessing pane. You may also access the messages for a previous run of the Density-based Clustering tool via the Geoprocessing history. The charts created can be accessed by selecting the List By Charts tab in the Contents pane.

  • For more information about the output messages and charts, see Learn more about how Density-based Clustering works.

  • If the Clustering Method chosen was Self-adjusting (HDBSCAN), the output feature class will also contain the following fields: PROB, which is the probability that the feature belongs in its assigned cluster; OUTLIER, which indicates that the feature may be an outlier within its own cluster (the higher the value, the more likely the feature is an outlier); and EXEMPLAR, which denotes the features that are the most prototypical or representative of each cluster.

  • If the Clustering Method chosen was Multi-scale (OPTICS), the output feature class will also contain the fields REACHORDER, which is how the Input Point Features were ordered for analysis, and REACHDIST, which is the distance between each feature and its closest unvisited neighbor.

  • For both Defined distance (DBSCAN) and Multi-scale (OPTICS), the default Search Distance is the highest core-distance found in the dataset, excluding those core-distances in the top 1% (i.e. excluding the most extreme core-distances).

  • When the Input Features are not projected (that is, when coordinates are given in degrees, minutes, and seconds) or when the output coordinate system is set to a Geographic Coordinate System, distances are computed using chordal measurements. Chordal distance measurements are used because they can be computed quickly and provide very good estimates of true geodesic distances, at least for points within about thirty degrees of each other. Chordal distances are based on an oblate spheroid. Given any two points on the earth's surface, the chordal distance between them is the length of a line, passing through the three-dimensional earth, to connect those two points. Chordal distances are reported in meters.

    Caution:

    It is best practice to project your data, especially if your study area extends beyond 30 degrees. Chordal distances are not a good estimate of geodesic distances beyond 30 degrees.

  • If z-values are present, this tool includes them in its calculations, and the result will be 3D.
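
The chordal distance described above can be illustrated with a short sketch. It assumes the WGS84 spheroid and points lying on the spheroid surface (no z-values). The function names `to_ecef` and `chordal_distance` are hypothetical helpers, not part of arcpy, and the tool's own implementation may differ.

```python
import math

# WGS84 spheroid constants (assumption: the spheroid the tool uses
# depends on the coordinate system of the data).
WGS84_A = 6378137.0                 # semi-major axis, meters
WGS84_F = 1 / 298.257223563         # flattening
WGS84_E2 = WGS84_F * (2 - WGS84_F)  # first eccentricity squared


def to_ecef(lat_deg, lon_deg):
    """Convert a latitude/longitude on the spheroid surface to 3D
    earth-centered Cartesian coordinates in meters."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    # Radius of curvature in the prime vertical at this latitude.
    n = WGS84_A / math.sqrt(1 - WGS84_E2 * math.sin(lat) ** 2)
    x = n * math.cos(lat) * math.cos(lon)
    y = n * math.cos(lat) * math.sin(lon)
    z = n * (1 - WGS84_E2) * math.sin(lat)
    return x, y, z


def chordal_distance(lat1, lon1, lat2, lon2):
    """Length in meters of the straight line through the earth that
    connects two points on the spheroid surface."""
    return math.dist(to_ecef(lat1, lon1), to_ecef(lat2, lon2))
```

For two points one degree apart on the equator, the chord is only slightly shorter than the path along the surface. The gap widens as the separation grows, which is consistent with the caution about study areas that extend beyond 30 degrees.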

Syntax

DensityBasedClustering_stats (in_features, output_features, cluster_method, min_features_cluster, {search_distance}, {cluster_sensitivity})
Parameter | Explanation | Data Type
in_features

The point feature class for which density-based clustering will be performed.

Feature Layer
output_features

The output feature class to receive the cluster results.

Feature Class
cluster_method

The method used to define clusters.

  • DBSCAN: Uses a specified distance to separate dense clusters from sparser noise. DBSCAN is the fastest of the clustering methods, but is only appropriate if there is a very clear distance to use that works well to define all clusters that may be present. This results in clusters that have similar densities.
  • HDBSCAN: Uses varying distances to separate clusters of varying densities from sparser noise. HDBSCAN is the most data-driven of the clustering methods, and requires the least user input.
  • OPTICS: Uses the distance between neighbors and a reachability plot to separate clusters of varying densities from noise. OPTICS offers the most flexibility in fine-tuning the clusters that are detected, though it is computationally intensive, particularly with a large Search Distance.
String
min_features_cluster

The minimum number of features to be considered a cluster. Any cluster with fewer features than the number provided will be considered noise.

Long
search_distance
(Optional)

The maximum distance to consider.

For Defined distance (DBSCAN), the Minimum Features per Cluster specified must be found within this distance for cluster membership. Individual clusters will be separated by at least this distance. If a feature is located farther than this distance from the next closest feature in the cluster, it will not be included in the cluster.

For Multi-scale (OPTICS), this parameter is optional and is used as the maximum search distance when creating the reachability plot. For OPTICS, the reachability plot, combined with the Cluster Sensitivity parameter, determines cluster membership. If no distance is specified, the tool will search all distances, which will greatly increase processing time.

If left blank, the default distance used will be the highest core distance found in the dataset, excluding those core distances in the top 1% (excluding the most extreme core distances).

Linear Unit
cluster_sensitivity
(Optional)

An integer between 1 and 100 that determines the compactness of clusters. A number closer to 100 will result in a higher number of denser clusters. A number closer to 1 will result in fewer, less compact clusters.

Long
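
The default search distance described above (the highest core distance after excluding the top 1%) can be sketched in plain Python. This is an illustration under stated assumptions, not the tool's implementation: the core distance is taken to be the distance a point must travel to reach `min_features` features counting itself, and the top 1% is dropped by discarding `n // 100` of the largest values. The function names and the `trim_percent` parameter are hypothetical.

```python
import math


def core_distances(points, min_features):
    """Core distance of every point: the distance needed to reach
    min_features features, counting the point itself (assumption)."""
    if not 1 <= min_features <= len(points):
        raise ValueError("min_features must be between 1 and len(points)")
    result = []
    for p in points:
        # Sorted distances from p to all points; index 0 is p itself (0.0).
        dists = sorted(math.dist(p, q) for q in points)
        result.append(dists[min_features - 1])
    return result


def default_search_distance(points, min_features, trim_percent=1):
    """Highest core distance after discarding the largest trim_percent
    percent of core distances (assumption: count is rounded down)."""
    ordered = sorted(core_distances(points, min_features))
    n_drop = len(ordered) * trim_percent // 100
    return ordered[len(ordered) - n_drop - 1]
```

Because `math.dist` accepts coordinates of any matching dimension, the same sketch works for (x, y) and (x, y, z) tuples. The brute-force neighbor search is quadratic in the number of points, so it is only suitable for small illustrative datasets.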

Code sample

DensityBasedClustering example 1 (Python window)

The following Python window script demonstrates how to use the DensityBasedClustering tool.

import arcpy
arcpy.env.workspace = r"C:\Analysis"
arcpy.DensityBasedClustering_stats("Chicago_Arson", "Arson_HDB", "HDBSCAN", 15)

DensityBasedClustering example 2 (stand-alone script)

The following stand-alone Python script demonstrates how to use the DensityBasedClustering tool.

# Clustering crime incidents in a downtown area using the Density-based Clustering tool

# Import system modules
import arcpy

# Overwrite existing output, by default
arcpy.env.overwriteOutput = True

# Local variables...
workspace = r"E:\working\data.gdb"
arcpy.env.workspace = workspace

# Run Density-based Clustering with the HDBSCAN Cluster Method using a minimum 
# of 15 features per cluster
arcpy.stats.DensityBasedClustering("Chicago_Arson", "Arson_HDB", "HDBSCAN", 15)

# Run Density-based Clustering again using OPTICS with a Search Distance and 
# Cluster Sensitivity to create tighter clusters
arcpy.stats.DensityBasedClustering("Chicago_Arson", "Arson_Optics", "OPTICS", 
                                   15, "1200 Meters", 70)

Licensing information

  • ArcGIS Desktop Basic: Yes
  • ArcGIS Desktop Standard: Yes
  • ArcGIS Desktop Advanced: Yes
