Grouping Analysis (Spatial Statistics)

Summary

Groups features based on feature attributes and optional spatial or temporal constraints.

Legacy:

This is a deprecated tool. The algorithm behind this tool has been enhanced and new functionality has been added to these methods. To simplify the new features and methods, this tool has been replaced by two new tools. Use the Spatially Constrained Multivariate Clustering tool if you would like to create spatially constrained groups. Use the Multivariate Clustering tool to create groups with no spatial constraints.

Illustration

Grouping Analysis diagram

Usage

    Legacy:

    The algorithm behind the Grouping Analysis tool has been enhanced and new functionality has been added to these methods at ArcGIS Pro 2.1. To simplify the new features and methods, two new tools have been created to replace the Grouping Analysis tool. Use the Spatially Constrained Multivariate Clustering tool if you would like to create spatially contiguous groups. Use the Multivariate Clustering tool to create groups with no spatial constraints.

  • This tool produces an output feature class with the fields used in the analysis plus a new integer field named SS_GROUP. Default rendering is based on the SS_GROUP field and shows you which group each feature falls into. If you indicate that you want three groups, for example, each record will contain a 1, 2, or 3 for the SS_GROUP field. When No spatial constraint is selected for the Spatial Constraints parameter, the output feature class will also contain a new binary field called SS_SEED. The SS_SEED field indicates which features were used as starting points to grow groups. The number of nonzero values in the SS_SEED field will match the value you entered for the Number of Groups parameter.

  • This tool will optionally create a PDF report file when you specify a path for the Output Report File parameter. This report contains a variety of tables and graphs to help you understand the characteristics of the groups identified. The path to the PDF report is included with the messages summarizing the tool execution parameters. Clicking that path opens the report file. You can access the messages by hovering over the progress bar, clicking the pop-out button, or expanding the messages section in the Geoprocessing pane. You can also access the messages for a previous run of Grouping Analysis through the Geoprocessing History.

    Note:

    Creating the report file can add substantial processing time. Consequently, while Grouping Analysis will create the Output Feature Class showing group membership, the PDF report file will not be created if you specify more than 15 groups or more than 15 variables.

  • When the Input Feature Class is not projected (that is, when coordinates are given in degrees, minutes, and seconds) or when the output coordinate system is set to a Geographic Coordinate System, distances are computed using chordal measurements. Chordal distance measurements are used because they can be computed quickly and provide very good estimates of true geodesic distances, at least for points within about thirty degrees of each other. Chordal distances are based on an oblate spheroid. Given any two points on the earth's surface, the chordal distance between them is the length of a line, passing through the three-dimensional earth, to connect those two points. Chordal distances are reported in meters.

    Caution:

    Be sure to project your data if your study area extends beyond 30 degrees. Chordal distances are not a good estimate of geodesic distances beyond 30 degrees.
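
The relationship between chordal and surface distance can be illustrated with a short sketch. It assumes a sphere with a mean radius of 6,371,000 meters rather than the oblate spheroid the tool uses, so the numbers will differ slightly from the tool's; the function names are hypothetical and are not part of arcpy.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumption: mean spherical radius, not the tool's spheroid


def to_cartesian(lat_deg, lon_deg, radius=EARTH_RADIUS_M):
    """Convert latitude/longitude in degrees to 3D Cartesian coordinates in meters."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (
        radius * math.cos(lat) * math.cos(lon),
        radius * math.cos(lat) * math.sin(lon),
        radius * math.sin(lat),
    )


def chordal_distance(p, q):
    """Straight-line distance through the sphere between two (lat, lon) points."""
    return math.dist(to_cartesian(*p), to_cartesian(*q))


def great_circle_distance(p, q, radius=EARTH_RADIUS_M):
    """Surface distance on the sphere, recovered from the chord length."""
    chord = chordal_distance(p, q)
    return 2.0 * radius * math.asin(min(1.0, chord / (2.0 * radius)))


def relative_error(p, q):
    """Fraction by which the chord underestimates the surface distance."""
    surface = great_circle_distance(p, q)
    return (surface - chordal_distance(p, q)) / surface
```

On this spherical model, the chord underestimates the surface distance by roughly 1 percent for points 30 degrees apart and by roughly 10 percent for points 90 degrees apart, which is consistent with the 30-degree guidance above.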

  • The Unique ID Field provides a way for you to link records in the Output Feature Class back to data in the original input feature class. Consequently, the Unique ID Field values must be unique for every feature and typically should be a permanent field that remains with the feature class. If you don't have a Unique ID Field in your dataset, you can easily create one by adding a new integer field to your feature class table and calculating the field values to be equal to the FID/OID field. You cannot use the FID/OID field directly for the Unique ID Field parameter.

  • The Analysis Fields should be numeric and should contain a variety of values. Fields with no variation (that is, the same value for every record) will be dropped from the analysis but will be included in the Output Feature Class. Categorical fields may be used with the Grouping Analysis tool if they are represented as dummy variables (a value of one for all features in a category and zeros for all other features).
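
As a minimal sketch of the dummy-variable encoding described above, the following plain-Python helpers turn one categorical column into 0/1 indicator columns and flag columns with no variation. The `land_use` values and the helper names are hypothetical; in practice you would write the indicator values to new numeric fields in your feature class.

```python
def dummy_fields(values):
    """Return {category: [0 or 1, ...]} with one indicator list per distinct category."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


def has_variation(column):
    """True when a field holds more than one distinct value."""
    return len(set(column)) > 1


# Hypothetical categorical attribute, one value per feature.
land_use = ["residential", "commercial", "residential", "industrial"]
indicators = dummy_fields(land_use)
# indicators["residential"] is [1, 0, 1, 0]
```

A column for which `has_variation` returns False corresponds to a field that the tool would drop from the analysis.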

  • The Grouping Analysis tool will construct groups with or without space or time constraints. For some applications you may not want to impose contiguity or other proximity requirements on the groups created. In those cases, you will set the Spatial Constraints parameter to No spatial constraint.

  • For some analyses, you will want groups to be spatially contiguous. The contiguity options are enabled for polygon feature classes and indicate features can only be part of the same group if they share an edge (Contiguity edges only) or if they share either an edge or a vertex (Contiguity edges corners) with another member of the group.

  • The Delaunay triangulation and K nearest neighbors options are appropriate for point or polygon features when you want to ensure all group members are proximal. These options indicate that a feature will only be included in a group if at least one other feature is a natural neighbor (Delaunay triangulation) or a K nearest neighbor. K is the number of neighbors to consider and is specified using the Number of Neighbors parameter.
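
To make the K nearest neighbors relationship concrete, here is a brute-force sketch that lists the K closest points for every point using Euclidean distance. It is an illustration only: the function name is hypothetical, the coordinates are assumed to be projected, and the tool's own neighbor search is not shown in this documentation.

```python
import math


def k_nearest_neighbors(points, k):
    """Map each point index to the indices of its k nearest other points."""
    neighbors = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Sort the remaining points by straight-line distance to point i.
        others.sort(key=lambda j: math.dist(p, points[j]))
        neighbors[i] = others[:k]
    return neighbors


# Three hypothetical point locations on a line.
example = k_nearest_neighbors([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)], k=1)
# example is {0: [1], 1: [0], 2: [1]}
```

Note that the relationship is not symmetric: point 2 lists point 1 as its nearest neighbor, but point 1 does not list point 2.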

  • In order to create groups with both space and time constraints, use the Generate Spatial Weights Matrix tool to first create a spatial weights matrix file (.swm) defining the space-time relationships among your features. Next run Grouping Analysis, setting the Spatial Constraints parameter to Get spatial weights from file and the Spatial Weights Matrix File parameter to the SWM file you created.

  • In order to create three-dimensional groups that take into consideration the z-values of your features, use the Generate Spatial Weights Matrix tool with the Use Z values parameter checked on to first create a spatial weights matrix file (.swm) defining the 3D relationships among your features. Next, run Grouping Analysis, setting the Spatial Constraints parameter to Get spatial weights from file and the Spatial Weights Matrix File parameter to the SWM file you created.

  • Additional Spatial Constraints, such as fixed distance, may be imposed by using the Generate Spatial Weights Matrix tool to first create an SWM file and then providing the path to that file for the Spatial Weights Matrix File parameter.

    Note:

    Even though you may create a spatial weights matrix (SWM) file to define spatial constraints, there is no actual weighting being applied. The SWM only defines which features are contiguous or proximal. Imposing a spatial constraint determines which features can and cannot be members of the same group. If you select Contiguity edges only, for example, all the features in a single group will have at least one edge in common with another feature in the group. This keeps the resultant groups spatially contiguous.

  • Defining a spatial constraint ensures compact, contiguous, or proximal groups. Including spatial variables in your list of Analysis Fields can also encourage these group attributes. Examples of spatial variables would be distance to freeway on-ramps, accessibility to job openings, proximity to shopping opportunities, measures of connectivity, and even coordinates (X, Y). Including variables representing time, day of the week, or temporal distance can encourage temporal compactness among group members.

  • When there is a distinct spatial pattern to your features (an example would be three separate, spatially distinct clusters), it can complicate the spatially constrained grouping algorithm. Consequently, the grouping algorithm first determines if there are any disconnected groups. If the number of disconnected groups is larger than the Number of Groups specified, the tool cannot solve and will fail with an appropriate error message. If the number of disconnected groups is exactly the same as the Number of Groups specified, the spatial configuration of the features alone determines group results, as shown in (A) below. If the Number of Groups specified is larger than the number of disconnected groups, grouping begins with the disconnected groups already determined. For example, if there are three disconnected groups and the Number of Groups specified is 4, one of the three groups will be divided to create a fourth group, as shown in (B) below.

    Disconnected groups
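
The check for disconnected groups described above can be sketched as counting connected components in the neighbor relationships. This is a plain-Python illustration under stated assumptions: neighbors are given as a dictionary of feature IDs, a feature with no neighbors is counted as its own component, and the function names and error text are hypothetical rather than the tool's.

```python
def count_disconnected_groups(neighbors):
    """Count connected components in a {feature_id: [neighbor_ids]} mapping."""
    # Treat every neighbor relationship as an undirected link.
    links = {fid: set(nbrs) for fid, nbrs in neighbors.items()}
    for fid, nbrs in neighbors.items():
        for n in nbrs:
            links.setdefault(n, set()).add(fid)

    seen, components = set(), 0
    for start in links:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:  # depth-first walk of one component
            fid = stack.pop()
            if fid in seen:
                continue
            seen.add(fid)
            stack.extend(links[fid] - seen)
    return components


def check_group_count(neighbors, number_of_groups):
    """Raise if there are more disconnected groups than requested groups."""
    components = count_disconnected_groups(neighbors)
    if components > number_of_groups:
        raise ValueError(
            f"{components} disconnected groups exceed the {number_of_groups} groups requested"
        )
    return components
```

For example, `check_group_count({1: [2], 2: [1], 3: [4], 4: [3]}, 4)` returns 2, meaning grouping would start from two predetermined groups and divide them further.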

  • In some cases, the Grouping Analysis tool will not be able to meet the spatial constraints imposed, and some features will not be included with any group (the SS_GROUP value will be -9999 with hollow rendering). This happens if there are features with no neighbors. To avoid this, use K nearest neighbors, which ensures all features have neighbors. Increasing the Number of Neighbors parameter will help resolve issues with disconnected groups.

  • While there is a tendency to want to include as many Analysis Fields as possible, for this tool, it works best to start with a single variable and build. Results are much easier to interpret with fewer analysis fields. It is also easier to determine which variables are the best discriminators when there are fewer fields.

  • When you select No spatial constraint for the Spatial Constraints parameter, you have three options for the Initialization Method: Find seed locations, Get seeds from field, and Use random seeds. Seeds are the features used to grow individual groups. If, for example, you enter a 3 for the Number of Groups parameter, the analysis will begin with three seed features. The default option, Find seed locations, randomly selects the first seed and makes sure that the subsequent seeds selected represent features that are far away from each other in data space. Selecting initial seeds that capture different areas of data space improves performance. Sometimes you know that specific features reflect distinct characteristics that you want represented by different groups. In that case, create a seed field to identify those distinctive features. The seed field you create should have zeros for all but the initial seed features; the initial seed features should have a value of 1. You will then select Get seeds from field for the Initialization Method parameter. If you are interested in doing some kind of sensitivity analysis to see which features are always found in the same group, you might select the Use random seeds option for the Initialization Method parameter. For this option, all of the seed features are randomly selected.

    Note:

    When using random seeds, you may wish to choose a seed to initiate the random number generator through the Random Number Generator Environment setting. However, the Random Number Generator used by this tool is always Mersenne Twister.

  • Any values of 1 in the Initialization Field will be interpreted as a seed. If there are more seed features than Number of Groups, the seed features will be randomly selected from those identified by the Initialization Field. If there are fewer seed features than specified by Number of Groups, the additional seed features will be selected so they are far away (in data space) from those identified by the Initialization Field.
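
The seeding rules above can be sketched as follows. The source says only that additional seeds are chosen to be far away in data space; this sketch swaps in a simple farthest-first rule as a stand-in, so it should not be read as the tool's exact procedure. The function name and the `preset` argument (row indices flagged with 1 in the Initialization Field) are hypothetical, and `number_of_groups` is assumed to be at least 1 and no larger than the number of rows.

```python
import math
import random


def find_seeds(rows, number_of_groups, preset=(), rng=None):
    """Pick seed row indices, honoring preset seeds and spreading the rest out."""
    rng = rng or random.Random()
    seeds = list(preset)
    if len(seeds) > number_of_groups:
        # Too many flagged features: choose randomly among them.
        return rng.sample(seeds, number_of_groups)
    if not seeds:
        # No flagged features: the first seed is random.
        seeds.append(rng.randrange(len(rows)))
    while len(seeds) < number_of_groups:
        candidates = [i for i in range(len(rows)) if i not in seeds]
        # Assumed rule: take the row whose nearest existing seed is farthest away.
        best = max(
            candidates,
            key=lambda i: min(math.dist(rows[i], rows[s]) for s in seeds),
        )
        seeds.append(best)
    return seeds


# Hypothetical analysis field values for four features.
rows = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.0, 10.0)]
seeds = find_seeds(rows, 3, preset=[0])
# seeds is [0, 3, 2]
```

With no preset seeds the first seed is random, which is one reason results can differ between runs.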

  • Sometimes you know the Number of Groups most appropriate for your data. In the case that you don't, however, you may have to try different numbers of groups, noting which values provide the best group differentiation. When you check the Evaluate Optimal Number of Groups parameter, a pseudo F-statistic will be computed for grouping solutions with 2 through 15 groups. If no other criteria guide your choice for Number of Groups, use a number associated with one of the largest pseudo F-statistic values. The largest F-statistic values indicate solutions that perform best at maximizing both within-group similarities and between-group differences. When you specify an optional Output Report File, that PDF report will include a graph showing the F-statistic values for solutions with 2 through 15 groups.
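
This page does not give the formula for the pseudo F-statistic. The sketch below assumes the common Calinski-Harabasz form, the ratio of between-group variance to within-group variance, each divided by its degrees of freedom. The function name is hypothetical, and the inputs are assumed to be already standardized analysis field values.

```python
def pseudo_f_statistic(rows, labels):
    """Calinski-Harabasz style pseudo F for rows (tuples of numbers) and group labels."""
    n = len(rows)
    groups = sorted(set(labels))
    k = len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 groups and more rows than groups")
    dims = len(rows[0])
    overall = [sum(r[d] for r in rows) / n for d in range(dims)]

    between = within = 0.0
    for g in groups:
        members = [r for r, lab in zip(rows, labels) if lab == g]
        center = [sum(r[d] for r in members) / len(members) for d in range(dims)]
        # Between-group sum of squares: group size times squared distance to overall mean.
        between += len(members) * sum((c - o) ** 2 for c, o in zip(center, overall))
        # Within-group sum of squares: squared distance of each member to its group mean.
        within += sum((r[d] - center[d]) ** 2 for r in members for d in range(dims))

    if within == 0.0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))
```

For the one-field rows `[(0,), (1,), (10,), (11,)]`, the labels `[1, 1, 2, 2]` give a value of 200, while the poorly separated labels `[1, 2, 1, 2]` give 0.02, which illustrates why larger values indicate better group differentiation.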

  • Regardless of the Number of Groups you specify, the tool will stop if division into additional groups becomes arbitrary. Suppose, for example, that your data consists of three spatially clustered polygons and a single analysis field. If all the features in a cluster have the same analysis field value, it becomes arbitrary how any one of the individual clusters is divided after three groups have been created. If you specify more than three groups in this situation, the tool will still only create three groups. As long as at least one of the analysis fields in a group has some variation of values, division into additional groups can continue.

    No more groups will be created
    Groups will not be divided further if there is no variation in the analysis field values.

  • When you include a spatial or space-time constraint in your analysis, the pseudo F-Statistics are comparable (as long as the Input Features and Analysis Fields don't change). Consequently, you can use the F-Statistic values to determine not only optimal Number of Groups but also to help you make choices about the most effective Spatial Constraints option, Distance Method, and Number of Neighbors.

  • When No spatial constraint is selected for the Spatial Constraints parameter and the Initialization Method is Find seed locations or Use random seeds, features are partitioned into groups with a K-Means algorithm. This algorithm incorporates heuristics and may return a different result each time you run the tool, even with the same data and the same tool parameters, because there is a random component to finding the initial seed features used to grow the groups.

  • When a spatial constraint is imposed, there is no random component to the algorithm, so a single pseudo F-Statistic can be computed for groups 2 through 15, and the highest F-Statistic values can be used to determine the optimal Number of Groups for your analysis. Because the No spatial constraint option is a heuristic solution, however, determining the optimal number of groups is more involved. The F-Statistic may be different each time the tool is run, due to different initial seed features. When a distinct pattern exists in your data, however, solutions from one run to the next will be more consistent. Consequently, to help determine the optimal number of groups when the No spatial constraint option is selected, the tool solves the grouping analysis 10 times for 2, 3, 4, and up to 15 groups. Information about the distribution of these 10 solutions is then reported (min, max, mean, and median) to help you determine an optimal number of groups for your analysis.
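
The per-group-count summary described above (min, max, mean, and median over repeated runs) can be reproduced for your own experiments with the standard library. The F-statistic values below are made-up numbers for illustration, and the function names are hypothetical.

```python
import statistics


def summarize_runs(f_stats_by_group_count):
    """Summarize repeated pseudo F-statistic values for each number of groups."""
    summary = {}
    for group_count, values in f_stats_by_group_count.items():
        summary[group_count] = {
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
        }
    return summary


def most_promising(summary, key="mean"):
    """Return the number of groups with the largest summary value."""
    return max(summary, key=lambda group_count: summary[group_count][key])


# Made-up values: three runs each for 2, 3, and 4 groups.
runs = {2: [10.0, 12.0, 11.0], 3: [30.0, 18.0, 30.0], 4: [20.0, 21.0, 19.0]}
summary = summarize_runs(runs)
best = most_promising(summary)
# best is 3
```

A wide gap between the minimum and maximum for a given number of groups suggests the solution depends heavily on the initial seeds.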

  • The Grouping Analysis tool returns three derived output values for potential use in custom models and scripts. These are the pseudo F-Statistic for the Number of Groups (Output_FStat), the largest pseudo F-Statistic for groups 2 through 15 (Max_FStat), and the number of groups associated with the largest pseudo F-Statistic value (Max_FStat_Group). When you do not elect to Evaluate Optimal Number of Groups, all of the derived output variables are set to None.

  • The group number assigned to a set of features may change from one run to the next. For example, suppose you partition features into two groups based on an income variable. The first time you run the analysis you might see the high income features labeled as group 2 and the low income features labeled as group 1; the second time you run the same analysis, the high income features might be labeled as group 1. You might also see that some of the middle income features switch group membership from one run to another when No spatial constraint is specified.
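
Because group numbers are arbitrary labels, comparing two runs by checking whether SS_GROUP values are equal can be misleading. A minimal sketch of a label-independent comparison follows; it assumes both label lists are in the same feature order (for example, sorted by the Unique ID Field), and the function name is hypothetical.

```python
def same_partition(labels_a, labels_b):
    """True when two labelings group the features identically, ignoring label names."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must describe the same features")
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each label in one run must map to exactly one label in the other run.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True
```

For example, `same_partition([1, 1, 2, 2], [2, 2, 1, 1])` is True because only the labels were swapped, while `same_partition([1, 1, 2, 2], [1, 2, 2, 2])` is False because one feature changed group membership.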

  • While you can select to create a very large number of different groups, in most scenarios you will likely be partitioning features into just a few groups. Because the graphs and maps become difficult to interpret with lots of groups, no report is created when you enter a value larger than 15 for the Number of Groups parameter or select more than 15 Analysis Fields. You can increase this limitation on the maximum number of groups, however.

    Dive-in:

    Because you have the Python source code for the Grouping Analysis tool, you may override the 15 variables or 15 groups report limitation, if desired. This upper limit is set by two variables in both the Partition.py script file and the tool's validation code inside the Spatial Statistics Toolbox:

    maxNumGroups = 15
    maxNumVars = 15

  • For more information about the Output Report File, see Learn more about how Grouping Analysis works.

Syntax

arcpy.stats.GroupingAnalysis(Input_Features, Unique_ID_Field, Output_Feature_Class, Number_of_Groups, Analysis_Fields, Spatial_Constraints, {Distance_Method}, {Number_of_Neighbors}, {Weights_Matrix_File}, {Initialization_Method}, {Initialization_Field}, {Output_Report_File}, {Evaluate_Optimal_Number_of_Groups})
Parameter | Explanation | Data Type
Input_Features

The feature class or feature layer for which you want to create groups.

Feature Layer
Unique_ID_Field

An integer field containing a different value for every feature in the input feature class. If you don't have a Unique ID field, you can create one by adding an integer field to your feature class table and calculating the field values to equal the FID or OBJECTID field.

Field
Output_Feature_Class

The new output feature class created containing all features, the analysis fields specified, and a field indicating to which group each feature belongs.

Feature Class
Number_of_Groups

The number of groups to create. The Output_Report_File parameter will be disabled for more than 15 groups.

Long
Analysis_Fields
[analysis_field,...]

A list of fields you want to use to distinguish one group from another. The Output_Report_File parameter will be disabled for more than 15 fields.

Field
Spatial_Constraints

Specifies if and how spatial relationships among features should constrain the groups created.

  • CONTIGUITY_EDGES_ONLY: Groups contain contiguous polygon features. Only polygons that share an edge can be part of the same group.
  • CONTIGUITY_EDGES_CORNERS: Groups contain contiguous polygon features. Only polygons that share an edge or a vertex can be part of the same group.
  • DELAUNAY_TRIANGULATION: Features in the same group will have at least one natural neighbor in common with another feature in the group. Natural neighbor relationships are based on Delaunay Triangulation. Conceptually, Delaunay Triangulation creates a nonoverlapping mesh of triangles from feature centroids. Each feature is a triangle node and nodes that share edges are considered neighbors.
  • K_NEAREST_NEIGHBORS: Features in the same group will be near each other; each feature will be a neighbor of at least one other feature in the group. Neighbor relationships are based on the nearest K features where you specify an integer value, K, for the Number_of_Neighbors parameter.
  • GET_SPATIAL_WEIGHTS_FROM_FILE: Spatial, and optionally temporal, relationships are defined by a spatial weights file (.swm). Create the spatial weights matrix file using the Generate Spatial Weights Matrix tool or the Generate Network Spatial Weights tool.
  • NO_SPATIAL_CONSTRAINT: Features will be grouped using data space proximity only. Features do not have to be near each other in space or time to be part of the same group.
String
Distance_Method
(Optional)

Specifies how distances are calculated from each feature to neighboring features.

  • EUCLIDEAN: The straight-line distance between two points (as the crow flies).
  • MANHATTAN: The distance between two points measured along axes at right angles (city block); calculated by summing the (absolute) differences between the x- and y-coordinates.
String
Number_of_Neighbors
(Optional)

This parameter may be specified whenever the Spatial_Constraints parameter is K_NEAREST_NEIGHBORS or one of the contiguity methods (CONTIGUITY_EDGES_ONLY or CONTIGUITY_EDGES_CORNERS). For K_NEAREST_NEIGHBORS, the default number of neighbors is 8 and the value cannot be smaller than 2. This value reflects the exact number of nearest neighbor candidates to consider when building groups. A feature will not be included in a group unless one of the other features in that group is a K nearest neighbor. The default for CONTIGUITY_EDGES_ONLY and CONTIGUITY_EDGES_CORNERS is 0. For the contiguity methods, this value reflects the minimum number of neighbor candidates to consider. Additional nearby neighbors for features with fewer than the Number_of_Neighbors specified will be based on feature centroid proximity.

Long
Weights_Matrix_File
(Optional)

The path to a file containing spatial weights that define spatial relationships among features.

File
Initialization_Method
(Optional)

Specifies how initial seeds are obtained when the Spatial_Constraints parameter is set to NO_SPATIAL_CONSTRAINT. Seeds are used to grow groups. If you indicate you want three groups, for example, the analysis will begin with three seeds.

  • FIND_SEED_LOCATIONS: Seed features will be selected to optimize performance.
  • GET_SEEDS_FROM_FIELD: Nonzero entries in the Initialization Field will be used as starting points to grow groups.
  • USE_RANDOM_SEEDS: Initial seed features will be randomly selected.
String
Initialization_Field
(Optional)

The numeric field identifying seed features. Features with a value of 1 for this field will be used to grow groups.

Field
Output_Report_File
(Optional)

The full path for the PDF report file to be created summarizing group characteristics. This report provides a number of graphs to help you compare the characteristics of each group. Creating the report file can add substantial processing time.

File
Evaluate_Optimal_Number_of_Groups
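(Optional)

Specifies whether the tool evaluates the optimal number of groups by computing a pseudo F-statistic for grouping solutions with 2 through 15 groups.

  • EVALUATE: Groupings from 2 to 15 will be evaluated.
  • DO_NOT_EVALUATE: No evaluation of the number of groups will be performed. This is the default.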
Boolean

Derived Output

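Name | Explanation | Data Type
Output_FStat

The pseudo F-statistic for the specified Number of Groups.

Max_FStat_Group

The number of groups associated with the largest pseudo F-statistic value.

Max_FStat

The largest pseudo F-statistic for solutions with 2 through 15 groups.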

Code sample

GroupingAnalysis example 1 (Python window)

The following Python window script demonstrates how to use the GroupingAnalysis tool.

import arcpy
import arcpy.stats as SS
arcpy.env.workspace = r"C:\GA"
SS.GroupingAnalysis("Dist_Vandalism.shp", "TARGET_FID", "outGSF.shp", "4",
                    "Join_Count;TOTPOP_CY;VACANT_CY;UNEMP_CY",
                    "NO_SPATIAL_CONSTRAINT", "EUCLIDEAN", "", "", "FIND_SEED_LOCATIONS", "",
                    "outGSF.pdf", "DO_NOT_EVALUATE")

GroupingAnalysis example 2 (stand-alone script)

The following stand-alone Python script demonstrates how to use the GroupingAnalysis tool.

# Grouping Analysis of Vandalism data in a metropolitan area
# using the Grouping Analysis Tool

# Import system modules
import arcpy, os
import arcpy.stats as SS

# Set the geoprocessing environment to overwrite existing output
arcpy.env.overwriteOutput = True

try:
    # Set the current workspace (to avoid having to specify the full path to
    # the feature classes each time)
    arcpy.env.workspace = r"C:\GA"

    # Join the vandalism incident points to the reporting district polygons
    # Process: Spatial Join
    fieldMappings = arcpy.FieldMappings()
    fieldMappings.addTable("ReportingDistricts.shp")
    fieldMappings.addTable("Vandalism2006.shp")

    sj = arcpy.SpatialJoin_analysis("ReportingDistricts.shp", "Vandalism2006.shp", "Dist_Vand.shp",
                               "JOIN_ONE_TO_ONE",
                               "KEEP_ALL",
                               fieldMappings,
                               "COMPLETELY_CONTAINS", "", "")
    
    # Use Grouping Analysis tool to create groups based on different variables or analysis fields
    # Process: Group Similar Features  
    ga = SS.GroupingAnalysis("Dist_Vand.shp", "TARGET_FID", "outGSF.shp", "4",
                             "Join_Count;TOTPOP_CY;VACANT_CY;UNEMP_CY",
                             "NO_SPATIAL_CONSTRAINT", "EUCLIDEAN", "", "", "FIND_SEED_LOCATIONS", "",
                             "outGSF.pdf", "DO_NOT_EVALUATE")
    
    # Use Summary Statistic tool to get the Mean of variables used to group
    # Process: Summary Statistics
    # The case field is SS_GROUP, the group field written by Grouping Analysis
    SumStat = arcpy.Statistics_analysis("outGSF.shp", "outSS",
                                        [["Join_Count", "MEAN"], ["VACANT_CY", "MEAN"],
                                         ["TOTPOP_CY", "MEAN"], ["UNEMP_CY", "MEAN"]],
                                        "SS_GROUP")

except arcpy.ExecuteError:
    # If an error occurred when running a tool, print out the error messages.
    print(arcpy.GetMessages())
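
Keyword arguments such as Spatial_Constraints are passed as plain strings, so a misspelled keyword may only surface as an error when the tool executes. As a hedged convenience, a script could check the value against the keywords listed in the Syntax section before calling the tool. The helper below is hypothetical and is not part of arcpy.

```python
import difflib

# Keywords copied from the Spatial_Constraints parameter description above.
SPATIAL_CONSTRAINTS = (
    "CONTIGUITY_EDGES_ONLY",
    "CONTIGUITY_EDGES_CORNERS",
    "DELAUNAY_TRIANGULATION",
    "K_NEAREST_NEIGHBORS",
    "GET_SPATIAL_WEIGHTS_FROM_FILE",
    "NO_SPATIAL_CONSTRAINT",
)


def check_keyword(value, allowed=SPATIAL_CONSTRAINTS):
    """Return value if it is an allowed keyword; otherwise raise with a suggestion."""
    if value in allowed:
        return value
    suggestion = difflib.get_close_matches(value, allowed, n=1)
    hint = f" Did you mean {suggestion[0]!r}?" if suggestion else ""
    raise ValueError(f"Unknown keyword {value!r}.{hint}")
```

Calling `check_keyword("NO_SPATIAL_CONSTRAINT")` returns the keyword unchanged, while a near miss raises a ValueError that names the closest valid keyword.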

Environments

Output Coordinate System

Feature geometry is projected to the Output Coordinate System prior to analysis. All mathematical computations are based on the Output Coordinate System spatial reference. When the Output Coordinate System is based on degrees, minutes, and seconds, geodesic distances are estimated using chordal distances.

Random number generator

The Random Generator Type used is always Mersenne Twister.

Licensing information

  • Basic: Yes
  • Standard: Yes
  • Advanced: Yes