Multivariate Clustering (Spatial Statistics)

Summary

Finds natural clusters of features based solely on feature attribute values.

Learn more about how the Multivariate Clustering tool works

Illustration

Multivariate Clustering diagram

Usage

  • This tool produces an output feature class with the fields used in the analysis plus a new integer field named CLUSTER_ID. Default rendering is based on the CLUSTER_ID field and specifies which cluster each feature is a member of. If you indicate that you want three clusters, for example, each record will contain a 1, 2, or 3 for the CLUSTER_ID field. The output feature class will also contain a binary field called IS_SEED. The IS_SEED field indicates which features were used as starting points to grow clusters. The number of nonzero values in the IS_SEED field will match the value you entered for the Number of Clusters parameter.

  • Input Features can be points, lines, or polygons.

  • This tool creates messages and charts to help you understand the characteristics of the clusters identified. You may access the messages by hovering over the progress bar, clicking the pop-out button, or expanding the View details section in the Geoprocessing pane. You may also access the messages for a previous run of the Multivariate Clustering tool via the geoprocessing history. The charts created can be accessed from the Contents pane.

  • For more information about the output messages and charts, see How Multivariate Clustering works.

  • The Analysis Fields must be numeric and should contain a variety of values. Fields with no variation (that is, the same or very similar value for every record) will be dropped from the analysis but will be included in the Output Features. Categorical fields may be used with the Multivariate Clustering tool if they are represented as numeric dummy variables (a value of one for all features in a category and zeros for all other features).

  • The Multivariate Clustering tool will construct nonspatial clusters. For some applications you may want to impose contiguity or other proximity requirements on the clusters created. In those cases, you would use the Spatially Constrained Multivariate Clustering tool to create clusters that are spatially contiguous.

  • Although it is tempting to include as many Analysis Fields as possible, this tool works best when you start with a single variable and then add more. Results are easier to interpret with fewer analysis fields, and it is easier to determine which variables are the best discriminators.

  • There are three options for the Initialization Method: Optimized seed locations, User defined seed locations, and Random seed locations. Seeds are the features used to grow individual clusters. If, for example, you enter a 3 for the Number of Clusters parameter, the analysis will begin with three seed features. The default option, Optimized seed locations, randomly selects the first seed and makes sure that the subsequent seeds selected represent features that are far away from each other in data space (attribute values). Selecting initial seeds that capture different areas of data space improves performance. Sometimes you know that specific features reflect distinct characteristics that you want represented by different clusters. In that case, you can provide those locations by creating a seed field to identify those distinctive features. The seed field you create should have zeros for all but the initial seed features; the initial seed features should have a value of 1. You will then select User defined seed locations for the Initialization Method parameter. If you are interested in doing a sensitivity analysis to see which features are always found within the same cluster, you might select the Random seed locations option for the Initialization Method parameter. For this option, the seed features are randomly selected.

    Note:

    When using random seeds, you may wish to choose a seed to initiate the random number generator through the Random Number Generator Environment setting. However, the Random Number Generator used by this tool is always Mersenne Twister.

  • Any value of 1 in the Initialization Field will be interpreted as a seed. If you choose to specify seed locations, the Number of Clusters parameter will be disabled and the tool will find as many clusters as there are nonzero entries in the Initialization Field.

  • Sometimes you know the Number of Clusters most appropriate for your data. If you don't, you may have to experiment with different numbers of clusters, noting which values provide the best clustering differentiation. When you leave the Number of Clusters parameter empty, the tool will evaluate the optimal number of clusters by computing a pseudo F-statistic for clustering solutions with 2 through 30 clusters and report the optimal number of clusters in the messages window. When you specify an optional Output Table for Evaluating Number of Clusters, a chart will be created showing the pseudo F-statistic values for solutions with 2 through 30 clusters. The largest pseudo F-statistic values indicate solutions that perform best at maximizing both within-cluster similarities and between-cluster differences. If no other criteria guide your choice for Number of Clusters, use a number associated with one of the largest pseudo F-statistic values.

    Pseudo F-statistic chart for finding the optimal number of clusters

  • This tool uses either the K means or K medoids algorithm to partition features into clusters. When Random seed locations is selected for the Initialization Method, the algorithm incorporates heuristics and may return a different result each time you run the tool (even using the same data and the same tool parameters). This is because there is a random component to finding the initial seed features used to grow the clusters. Because of this heuristic solution, determining the optimal number of clusters is more involved, and the pseudo F-Statistic may be different each time the tool is run due to different initial seed features. When a distinct pattern exists in your data, however, solutions from one run to the next will be more consistent. Consequently, to help determine the optimal number of clusters, for each number of clusters 2 through 30, the tool solves 10 times and uses the highest of the ten pseudo F-statistic values.

  • K means and K medoids are both popular clustering algorithms and will generally produce similar results. However, K medoids is more robust to noise and outliers in the Input Features. K means is generally faster than K medoids and is preferred for large data sets.

  • The cluster number assigned to a set of features may change from one run to the next. For example, suppose you partition features into two clusters based on an income variable. The first time you run the analysis you might see the high income features labeled as cluster 2 and the low income features labeled as cluster 1; the second time you run the same analysis, the high income features might be labeled as cluster 1. You might also see that some of the middle income features switch cluster membership from one run to another.
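Because cluster numbers can change between runs, comparing CLUSTER_ID values directly can be misleading. The following is a minimal sketch, in plain Python and independent of ArcPy, of one way to compare runs by co-membership (which features share a cluster) rather than by cluster number. The function names and the sample CLUSTER_ID lists are hypothetical and only illustrate the idea.

```python
from itertools import combinations


def co_membership(labels):
    """Return the set of feature-index pairs that share a cluster."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def same_partition(labels_a, labels_b):
    """True when two runs group the features identically, ignoring cluster numbers."""
    return (len(labels_a) == len(labels_b)
            and co_membership(labels_a) == co_membership(labels_b))


def stable_pairs(runs):
    """Pairs of features that fall in the same cluster in every run."""
    return set.intersection(*(co_membership(labels) for labels in runs))


# Hypothetical CLUSTER_ID values for six features from three runs.
run_1 = [1, 1, 2, 2, 2, 1]
run_2 = [2, 2, 1, 1, 1, 2]  # same grouping as run_1, cluster numbers swapped
run_3 = [1, 1, 2, 2, 1, 1]  # the feature at index 4 changed cluster

print(same_partition(run_1, run_2))
print(same_partition(run_1, run_3))
print(sorted(stable_pairs([run_1, run_2, run_3])))
```

The same comparison can support a sensitivity analysis with the Random seed locations option: pairs that survive the intersection across many runs are features that are always found within the same cluster.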

Syntax

arcpy.stats.MultivariateClustering(in_features, output_features, analysis_fields, {clustering_method}, {initialization_method}, {initialization_field}, {number_of_clusters}, {output_table})
Parameter | Explanation | Data Type
in_features

The feature class or feature layer for which you want to create clusters.

Feature Layer
output_features

The new output feature class created containing all features, the analysis fields specified, and a field indicating to which cluster each feature belongs.

Feature Class
analysis_fields
[analysis_field,...]

A list of fields you want to use to distinguish one cluster from another.

Field
clustering_method
(Optional)

The clustering algorithm used. K_MEANS is the default.

K_MEANS and K_MEDOIDS are both popular clustering algorithms and will generally produce similar results. However, K_MEDOIDS is more robust to noise and outliers in the in_features. K_MEANS is generally faster than K_MEDOIDS and is preferred for large data sets.

  • K_MEANS: The in_features will be clustered using the K means algorithm. This is the default.
  • K_MEDOIDS: The in_features will be clustered using the K medoids algorithm.
String
initialization_method
(Optional)

Specifies how initial seeds to grow clusters are obtained. If you indicate you want three clusters, for example, the analysis will begin with three seeds.

  • OPTIMIZED_SEED_LOCATIONS: Seed features will be selected to optimize analysis results and performance. This is the default.
  • USER_DEFINED_SEED_LOCATIONS: Nonzero entries in the initialization_field will be used as starting points to grow clusters.
  • RANDOM_SEED_LOCATIONS: Initial seed features will be randomly selected.
String
initialization_field
(Optional)

The numeric field identifying seed features. Features with a value of 1 for this field will be used to grow clusters. All other features should contain zeros.

Field
number_of_clusters
(Optional)

The number of clusters to create. When you leave this parameter empty, the tool will evaluate the optimal number of clusters by computing a pseudo F-statistic for clustering solutions with 2 through 30 clusters.

This parameter is disabled if the seed locations were provided in an initialization field.

Long
output_table
(Optional)

If specified, the table created contains the pseudo F-statistic for clustering solutions 2 through 30, calculated to evaluate the optimal number of clusters. The chart created from this table can be accessed in the stand-alone tables section of the Contents pane.

Table
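To make the pseudo F-statistic more concrete, the following is a minimal sketch in plain Python. It assumes the common definition of the statistic as the ratio of between-cluster variance to within-cluster variance, (SSB / (k - 1)) / (SSW / (n - k)), and applies no standardization to the fields. The exact computation used by the tool is described in How Multivariate Clustering works; the function name and sample values here are hypothetical.

```python
def pseudo_f(rows, labels):
    """Between-cluster variance divided by within-cluster variance.

    rows   -- list of numeric tuples, one per feature (the analysis fields)
    labels -- cluster number for each feature, in the same order as rows
    """
    n = len(rows)
    cluster_ids = sorted(set(labels))
    k = len(cluster_ids)
    if len(labels) != n or not 1 < k < n:
        raise ValueError("need one label per row, and 1 < clusters < features")

    def centroid(vectors):
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Total sum of squares around the overall mean.
    grand_centroid = centroid(rows)
    total_ss = sum(squared_distance(row, grand_centroid) for row in rows)

    # Within-cluster sum of squares around each cluster mean.
    within_ss = 0.0
    for cluster_id in cluster_ids:
        members = [row for row, label in zip(rows, labels) if label == cluster_id]
        center = centroid(members)
        within_ss += sum(squared_distance(row, center) for row in members)

    if within_ss == 0:
        return float("inf")  # every feature sits exactly on its cluster mean
    between_ss = total_ss - within_ss
    return (between_ss / (k - 1)) / (within_ss / (n - k))


# Hypothetical single-field data with two obvious groups.
rows = [(0.0,), (2.0,), (10.0,), (12.0,)]
well_separated = pseudo_f(rows, [1, 1, 2, 2])
mixed = pseudo_f(rows, [1, 2, 1, 2])
print(well_separated, mixed)
```

A grouping that keeps similar features together and dissimilar features apart yields a larger value, which is why the largest pseudo F-statistic values point to the better candidate numbers of clusters.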

Code sample

MultivariateClustering example 1 (Python window)

The following Python window script demonstrates how to use the MultivariateClustering tool.

import arcpy
arcpy.env.workspace = r"C:\Analysis"
arcpy.MultivariateClustering_stats("District_Vandalism", "outVandalism",
                                   ["TOTPOP", "VACANT_CY", "UNEMP"], "K_MEANS",
                                   "OPTIMIZED_SEED_LOCATIONS", None, 5)

MultivariateClustering example 2 (stand-alone script)

The following stand-alone Python script demonstrates how to use the MultivariateClustering tool.

# Clustering Vandalism data in a metropolitan area
# using the Multivariate Clustering Tool

# Import system modules
import arcpy

# Set environment property to overwrite existing output, by default
arcpy.env.overwriteOutput = True

try:
    # Set the current workspace (to avoid having to specify the full path to
    # the feature classes each time)
    arcpy.env.workspace = r"C:\GA"

    # Join the Vandalism2006 features to the ReportingDistricts features so
    # that each district receives a count of joined features (Join_Count)
    # Process: Spatial Join
    fieldMappings = arcpy.FieldMappings()
    fieldMappings.addTable("ReportingDistricts.shp")
    fieldMappings.addTable("Vandalism2006.shp")

    sj = arcpy.SpatialJoin_analysis("ReportingDistricts.shp", 
                                    "Vandalism2006.shp", "Dist_Vand.shp", 
                                    "JOIN_ONE_TO_ONE","KEEP_ALL", fieldMappings, 
                                    "COMPLETELY_CONTAINS")

    # Use the Multivariate Clustering tool to create clusters based on the
    # analysis fields, using the spatial join output as input
    # Process: Multivariate Clustering
    ga = arcpy.MultivariateClustering_stats("Dist_Vand.shp", "outVandalism",
                                            ["Join_Count", "TOTPOP", "VACANT_CY", "UNEMP"],
                                            "K_MEANS", "OPTIMIZED_SEED_LOCATIONS",
                                            None, 5)
    
    # Use the Summary Statistics tool to get the mean of each analysis field
    # for every cluster (CLUSTER_ID is the cluster field written by the tool)
    # Process: Summary Statistics
    SumStat = arcpy.Statistics_analysis("outVandalism", "outSS",
                                        [["Join_Count", "MEAN"],
                                         ["VACANT_CY", "MEAN"],
                                         ["TOTPOP", "MEAN"],
                                         ["UNEMP", "MEAN"]],
                                        "CLUSTER_ID")

except arcpy.ExecuteError:
    # If an error occurred when running the tool, print out the error message.
    print(arcpy.GetMessages())
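
The final step of the script summarizes each cluster by the mean of its analysis fields. As a rough illustration of what that summary contains, the following plain Python sketch groups records by the CLUSTER_ID field and averages each field. It does not use ArcPy, and the function name and record values are hypothetical.

```python
from collections import defaultdict


def cluster_means(records, fields, cluster_field="CLUSTER_ID"):
    """Return {cluster number: {field: mean value}} for the given records."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[cluster_field]].append(record)
    return {
        cluster_id: {
            field: sum(r[field] for r in members) / len(members)
            for field in fields
        }
        for cluster_id, members in sorted(grouped.items())
    }


# Hypothetical output records; real values come from the output feature class.
records = [
    {"CLUSTER_ID": 1, "TOTPOP": 1000, "UNEMP": 50},
    {"CLUSTER_ID": 1, "TOTPOP": 3000, "UNEMP": 150},
    {"CLUSTER_ID": 2, "TOTPOP": 500, "UNEMP": 10},
]
means = cluster_means(records, ["TOTPOP", "UNEMP"])
for cluster_id, field_means in means.items():
    print(cluster_id, field_means)
```

Comparing these per-cluster means side by side is one way to see which fields best distinguish the clusters.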

Environments

Output Coordinate System

Feature geometry is projected to the Output Coordinate System prior to analysis. All mathematical computations are based on the Output Coordinate System spatial reference. When the Output Coordinate System is based on degrees, minutes, and seconds, geodesic distances are estimated using chordal distances.

Random number generator

The Random Generator Type used is always Mersenne Twister.
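
The effect of fixing a seed for the random number generator can be illustrated with Python's standard random module, which also uses Mersenne Twister as its core generator. The sketch below is only an analogy: it does not reproduce the tool's seed selection or the random stream used by ArcGIS, and the function name and feature IDs are hypothetical.

```python
import random


def pick_random_seeds(feature_ids, number_of_clusters, rng_seed):
    """Pick seed features at random, reproducibly for a given rng_seed."""
    rng = random.Random(rng_seed)  # Python's generator is Mersenne Twister
    return rng.sample(list(feature_ids), number_of_clusters)


feature_ids = range(1, 101)  # hypothetical object IDs 1 through 100

first = pick_random_seeds(feature_ids, 3, rng_seed=42)
second = pick_random_seeds(feature_ids, 3, rng_seed=42)
print(first == second)  # the same seed value gives the same seed features
```

With a fixed seed value, repeated runs start from the same seed features, which makes results from the Random seed locations option easier to reproduce and compare.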

Licensing information

  • Basic: Yes
  • Standard: Yes
  • Advanced: Yes
