Multivariate Clustering (Spatial Statistics)—ArcGIS Pro

Summary

Finds natural clusters of features based solely on feature attribute values.

Learn more about how Multivariate Clustering works

Usage

This tool produces an output feature class with the fields used in the analysis plus a new integer field named CLUSTER_ID. Default rendering is based on the CLUSTER_ID field and specifies which cluster each feature is a member of. If you indicate that you want three clusters, for example, each record will contain a 1, 2, or 3 for the CLUSTER_ID field. The output feature class will also contain a binary field called IS_SEED. The IS_SEED field indicates which features were used as starting points to grow clusters. The number of nonzero values in the IS_SEED field will match the value you entered for the Number of Clusters parameter.
The Input Features parameter value can be points, lines, or polygons.
This tool creates messages and charts to help you understand the characteristics of the clusters identified. To access the messages, hover over the progress bar, click the pop-out button, or expand the View details section in the Geoprocessing pane. You can also access the messages for a previous run of the Multivariate Clustering tool via the geoprocessing history. Access the charts from the Contents pane.
For more information about the output messages and charts, see How Multivariate Clustering works.
The fields for the Analysis Fields parameter must be numeric and should contain a variety of values. Fields with no variation (that is, the same or very similar value for every record) will be dropped from the analysis but will be included in the Output Features parameter value. Categorical fields can be used with the Multivariate Clustering tool if they are represented as numeric dummy variables (a value of one for all features in a category and zero for all other features).
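The dummy-variable encoding described above can be sketched in plain Python. This is a minimal illustration, not part of the tool: the `dummy_columns` helper and the `land_use` values are hypothetical, and in practice each resulting column would be written to its own numeric field before running the tool.

```python
def dummy_columns(values):
    """Return {category: [0/1 flags]}: one numeric column per category."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


# Hypothetical categorical attribute, one entry per feature.
land_use = ["residential", "commercial", "residential", "industrial"]

dummies = dummy_columns(land_use)
for name, column in dummies.items():
    print(name, column)
```

Each feature has a value of 1 in exactly one of the generated columns and 0 in the others.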
The Multivariate Clustering tool constructs nonspatial clusters. For some applications, you may want to impose contiguity or other proximity requirements on the clusters created. In those cases, use the Spatially Constrained Multivariate Clustering tool to create clusters that are spatially contiguous.
For this tool, a best practice is to start with a single variable for the Analysis Fields parameter and add variables as necessary. Results are easier to interpret with fewer analysis fields. It is also easier to determine which variables are the best discriminators when there are fewer fields.
There are three options for the Initialization Method parameter: Optimized seed locations, User defined seed locations, and Random seed locations. Seeds are the features used to grow individual clusters. If, for example, you enter 3 for the Number of Clusters parameter, the analysis will begin with three seed features. The default option, Optimized seed locations, randomly selects the first seed and makes sure that the subsequent seeds selected represent features that are far away from each other in data space (attribute values). Selecting initial seeds that capture different areas of data space improves performance. Sometimes you know that specific features reflect distinct characteristics that you want represented by different clusters. In that case, you can provide those locations by creating a seed field to identify those distinctive features. Use zero for all but the initial seed features for the seed field you create; use a value of 1 for the initial seed features. Then select the User defined seed locations option for the Initialization Method parameter. To perform a sensitivity analysis to see which features are always found in the same cluster, use the Random seed locations option for the Initialization Method parameter. For this option, the seed features are randomly selected.

Note:
When using random seeds, you can set a seed value for the random number generator through the Random number generator environment setting. However, the random generator type used by this tool is always Mersenne Twister.
Any values of 1 in the Initialization Field parameter will be interpreted as a seed. If you specify seed locations, the Number of Clusters parameter will be disabled and the tool will find as many clusters as there are nonzero entries in the Initialization Field parameter.
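The idea behind optimized seeding (a random first seed, then seeds that are far apart in data space) can be sketched with a farthest-first rule. This is an assumption-laden illustration only: the tool's actual selection procedure is not documented here, and `pick_seeds`, `rows`, and the seed value 42 are hypothetical. Python's `random` module is itself based on the Mersenne Twister generator, and a fixed seed value makes the first pick reproducible.

```python
import math
import random


def pick_seeds(rows, k, rng):
    """Pick k seed indices: the first at random, the rest farthest-first."""
    seeds = [rng.randrange(len(rows))]
    while len(seeds) < k:
        # Distance from each row to its nearest already-chosen seed.
        nearest = [min(math.dist(row, rows[s]) for s in seeds) for row in rows]
        # The next seed is the row that is farthest from all current seeds.
        seeds.append(max(range(len(rows)), key=nearest.__getitem__))
    return seeds


# Hypothetical attribute values (two analysis fields) for six features.
rows = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0), (9.8, 0.3)]

rng = random.Random(42)  # fixed seed value for a repeatable first pick
seeds = pick_seeds(rows, 3, rng)

# Equivalent of a binary seed field: 1 for seed features, 0 otherwise.
is_seed = [1 if i in seeds else 0 for i in range(len(rows))]
print(seeds, is_seed)
```

With three well-separated groups of attribute values, this rule places one seed in each group regardless of which feature is picked first.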
If you don't know the Number of Clusters parameter value most appropriate for your data, you can experiment with different numbers of clusters, noting which values provide the best clustering differentiation. If you leave the Number of Clusters parameter empty, the tool will evaluate the optimal number of clusters by computing a pseudo F-statistic for clustering solutions with 2 through 30 clusters and report the optimal number of clusters in the messages window. If you specify an optional Output Table for Evaluating Number of Clusters parameter value, a chart will be created showing the pseudo F-statistic values for solutions with 2 through 30 clusters. The largest pseudo F-statistic values indicate solutions that perform best at maximizing both within-cluster similarities and between-cluster differences. If no other criteria are used for the Number of Clusters parameter value, use a number associated with one of the largest pseudo F-statistic values.
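To make the pseudo F-statistic concrete, the sketch below computes it in its common Calinski-Harabasz form: between-cluster sum of squares divided by (k - 1), over within-cluster sum of squares divided by (n - k). The formula is the standard definition rather than one given on this page, and `pseudo_f`, `rows`, and the label lists are hypothetical. The function assumes at least two clusters, more features than clusters, and some within-cluster variation.

```python
def pseudo_f(rows, labels):
    """Calinski-Harabasz pseudo F-statistic for one clustering solution."""
    n = len(rows)
    clusters = sorted(set(labels))
    k = len(clusters)
    dims = range(len(rows[0]))
    grand_mean = [sum(r[d] for r in rows) / n for d in dims]

    ssw = 0.0  # within-cluster sum of squares
    ssb = 0.0  # between-cluster sum of squares
    for c in clusters:
        members = [r for r, lab in zip(rows, labels) if lab == c]
        center = [sum(r[d] for r in members) / len(members) for d in dims]
        ssw += sum((r[d] - center[d]) ** 2 for r in members for d in dims)
        ssb += len(members) * sum((center[d] - grand_mean[d]) ** 2 for d in dims)

    return (ssb / (k - 1)) / (ssw / (n - k))


# Hypothetical single analysis field for four features.
rows = [(1.0,), (2.0,), (9.0,), (10.0,)]

good = pseudo_f(rows, [1, 1, 2, 2])  # similar values grouped together
bad = pseudo_f(rows, [1, 2, 1, 2])   # dissimilar values grouped together
print(good, bad)
```

The solution that groups similar values together yields a much larger statistic, which is why larger pseudo F-statistic values indicate better-separated clusters.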
This tool uses either the K means or K medoids algorithm to partition features into clusters. When Random seed locations is selected for the Initialization Method parameter, the algorithm incorporates heuristics and may return a different result each time you run the tool (even when you use the same data and tool parameters). This is because there is a random component to finding the initial seed features used to grow the clusters. Because of this heuristic solution, determining the optimal number of clusters is more involved, and the pseudo F-statistic may be different each time the tool is run due to different initial seed features. When a distinct pattern exists in your data, however, solutions from one run to the next will be more consistent. Consequently, to help determine the optimal number of clusters, for each number of clusters 2 through 30, the tool solves 10 times and uses the highest of the 10 pseudo F-statistic values.
The K means and K medoids options generally produce similar results. However, K medoids is more robust to noise and outliers in the Input Features parameter value. K means is generally faster than K medoids and is recommended for large data sets.
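The difference in robustness can be seen from how each method represents a cluster: K means uses the mean of the members, while K medoids uses an actual member that minimizes total distance to the others. The following sketch uses a hypothetical single-field cluster (`incomes`) with one outlier; the helper names are illustrative and not part of the tool.

```python
def centroid(values):
    """Cluster center as used by K means: the arithmetic mean."""
    return sum(values) / len(values)


def medoid(values):
    """Cluster center as used by K medoids: the member with the smallest
    total distance to all other members."""
    return min(values, key=lambda v: sum(abs(v - other) for other in values))


# Hypothetical cluster of income values with one extreme outlier.
incomes = [40, 42, 45, 47, 400]

print(centroid(incomes))  # 114.8, pulled toward the outlier
print(medoid(incomes))    # 45, stays among the typical values
```

The medoid requires comparing every member with every other member, which is consistent with K means being the faster choice for large data sets.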
The cluster number assigned to a set of features may change from one run to the next. For example, if you partition features into two clusters based on an income variable, the first time you run the analysis you might see the high income features labeled as cluster 2 and the low income features labeled as cluster 1. The second time you run the same analysis, the high income features might be labeled as cluster 1. It's also possible that some of the middle income features will switch cluster membership from one run to another.
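Because cluster numbers are arbitrary, comparing two runs by raw CLUSTER_ID values can be misleading. One simple approach, sketched below under the assumption that the features are listed in the same order in both runs, is to renumber clusters by order of first appearance before comparing. The `canonical` helper and the run lists are hypothetical.

```python
def canonical(labels):
    """Renumber cluster labels 1, 2, 3, ... in order of first appearance."""
    mapping = {}
    return [mapping.setdefault(lab, len(mapping) + 1) for lab in labels]


# Hypothetical CLUSTER_ID values for the same five features in three runs.
run_a = [1, 1, 2, 2, 1]
run_b = [2, 2, 1, 1, 2]  # same grouping as run_a, numbers swapped
run_c = [2, 2, 1, 2, 2]  # the fourth feature changed cluster membership

print(canonical(run_a) == canonical(run_b))  # True: only the numbering differs

# Positions of features whose membership differs between run_a and run_c.
changed = [i for i, (a, b) in enumerate(zip(canonical(run_a), canonical(run_c)))
           if a != b]
print(changed)  # [3]
```

This comparison works best when the two solutions are broadly similar; if the first few features themselves change membership, the renumbering can shift and a pairwise co-membership comparison may be more appropriate.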

Parameters

Label	Explanation	Data Type
Input Features	The feature class or feature layer for which clusters will be created.	Feature Layer
Output Features	The output feature class that will be created containing all features, the analysis fields specified, and a field indicating to which cluster each feature belongs.	Feature Class
Analysis Fields	A list of fields that will be used to distinguish one cluster from another.	Field
Clustering Method (Optional)	Specifies the clustering algorithm that will be used. The K means and K medoids options generally produce similar results. However, K medoids is more robust to noise and outliers in the Input Features parameter value. K means is generally faster than K medoids and is recommended for large data sets. K means—The Input Features parameter value will be clustered using the K means algorithm. This is the default. K medoids—The Input Features parameter value will be clustered using the K medoids algorithm.	String
Initialization Method (Optional)	Specifies how initial seeds used to grow clusters will be obtained. If you indicate you want three clusters, for example, the analysis will begin with three seeds. Optimized seed locations—Seed features will be selected to optimize analysis results and performance. This is the default. User defined seed locations—Nonzero entries in the Initialization Field parameter value will be used as starting points to grow clusters. Random seed locations—Initial seed features will be randomly selected.	String
Initialization Field (Optional)	The numeric field identifying seed features. Features with a value of 1 for this field will be used to grow clusters. Each seed results in a cluster, so at least two seed features must be provided.	Field
Number of Clusters (Optional)	The number of clusters that will be created. If you leave this parameter empty, the tool will evaluate the optimal number of clusters by computing a pseudo F-statistic for clustering solutions with 2 through 30 clusters. This parameter is disabled if the seed locations were provided in an initialization field.	Long
Output Table for Evaluating Number of Clusters (Optional)	The table containing the pseudo F-statistic for clustering solutions 2 through 30, calculated to evaluate the optimal number of clusters. The chart created from this table can be accessed in the stand-alone tables section of the Contents pane.	Table

arcpy.stats.MultivariateClustering(in_features, output_features, analysis_fields, {clustering_method}, {initialization_method}, {initialization_field}, {number_of_clusters}, {output_table})

Name	Explanation	Data Type
in_features	The feature class or feature layer for which clusters will be created.	Feature Layer
output_features	The output feature class that will be created containing all features, the analysis fields specified, and a field indicating to which cluster each feature belongs.	Feature Class
analysis_fields [analysis_field,...]	A list of fields that will be used to distinguish one cluster from another.	Field
clustering_method (Optional)	Specifies the clustering algorithm that will be used. The K_MEANS and K_MEDOIDS options generally produce similar results. However, K_MEDOIDS is more robust to noise and outliers in the in_features parameter value. K_MEANS is generally faster than K_MEDOIDS and is recommended for large data sets. K_MEANS—The in_features parameter value will be clustered using the K means algorithm. This is the default. K_MEDOIDS—The in_features parameter value will be clustered using the K medoids algorithm.	String
initialization_method (Optional)	Specifies how initial seeds to grow clusters are obtained. If you indicate you want three clusters, for example, the analysis will begin with three seeds. OPTIMIZED_SEED_LOCATIONS—Seed features will be selected to optimize analysis results and performance. This is the default. USER_DEFINED_SEED_LOCATIONS—Nonzero entries in the initialization_field parameter value will be used as starting points to grow clusters. RANDOM_SEED_LOCATIONS—Initial seed features will be randomly selected.	String
initialization_field (Optional)	The numeric field identifying seed features. Features with a value of 1 for this field will be used to grow clusters. Each seed results in a cluster, so at least two seed features must be provided.	Field
number_of_clusters (Optional)	The number of clusters that will be created. If you leave this parameter empty, the tool will evaluate the optimal number of clusters by computing a pseudo F-statistic for clustering solutions with 2 through 30 clusters. This parameter is disabled if the seed locations were provided in an initialization field.	Long
output_table (Optional)	The table containing the pseudo F-statistic for clustering solutions 2 through 30, calculated to evaluate the optimal number of clusters. The chart created from this table can be accessed in the stand-alone tables section of the Contents pane.	Table

Code sample

MultivariateClustering example 1 (Python window)

The following Python window script demonstrates how to use the MultivariateClustering function.

import arcpy
arcpy.env.workspace = r"C:\Analysis"
arcpy.stats.MultivariateClustering(
    "District_Vandalism", "outVandalism", ["TOTPOP", "VACANT_CY", "UNEMP"],
    "K_MEANS", "OPTIMIZED_SEED_LOCATIONS", None, 5)

MultivariateClustering example 2 (stand-alone script)

The following stand-alone Python script demonstrates how to use the MultivariateClustering function.

# Clustering Vandalism data in a metropolitan area
# using the Multivariate Clustering tool.

# Import system modules
import arcpy

# Set environment property to overwrite existing output, by default.
arcpy.env.overwriteOutput = True

# Set the current workspace (to avoid having to specify the full path to the
# feature classes each time).
arcpy.env.workspace = r"C:\GA"

# Join the Vandalism2006 features to the ReportingDistricts features so that
# each district gets a Join_Count of vandalism incidents.
# Process: Spatial Join
fieldMappings = arcpy.FieldMappings()
fieldMappings.addTable("ReportingDistricts.shp")
fieldMappings.addTable("Vandalism2006.shp")

sj = arcpy.analysis.SpatialJoin(
    "ReportingDistricts.shp", "Vandalism2006.shp", "Dist_Vand.shp",
    "JOIN_ONE_TO_ONE", "KEEP_ALL", fieldMappings, "COMPLETELY_CONTAINS")

# Use the Multivariate Clustering tool to create clusters based on different
# variables or analysis fields. The input is the spatial join output above.
# Process: Multivariate Clustering
ga = arcpy.stats.MultivariateClustering(
    "Dist_Vand.shp", "outVandalism",
    ["Join_Count", "TOTPOP", "VACANT_CY", "UNEMP"], "K_MEANS",
    "OPTIMIZED_SEED_LOCATIONS", None, 5)
    
# Use the Summary Statistics tool to get the mean of each analysis field,
# grouped by the CLUSTER_ID field written by Multivariate Clustering.
# Process: Summary Statistics
sum_stats = arcpy.analysis.Statistics(
    "outVandalism", "outSS",
    [["Join_Count", "MEAN"],
     ["VACANT_CY", "MEAN"],
     ["TOTPOP", "MEAN"],
     ["UNEMP", "MEAN"]],
    "CLUSTER_ID")

Environments

Output Coordinate System, Geographic Transformations, Current Workspace, Scratch Workspace, Qualified Field Names, Output has M values, M Resolution, M Tolerance, Output has Z values, Default Output Z Value, Z Resolution, Z Tolerance, XY Resolution, XY Tolerance, Random number generator

Special cases

Output Coordinate System: Feature geometry is projected to the Output Coordinate System prior to analysis. All mathematical computations are based on the Output Coordinate System spatial reference. When the Output Coordinate System is based on degrees, minutes, and seconds, geodesic distances are estimated using chordal distances.

Random number generator: The Random Generator Type used is always Mersenne Twister.

Licensing information

Basic: Yes
Standard: Yes
Advanced: Yes