Whenever we look at the world around us, it is very natural for us to organize, group, differentiate, and catalog what we see to help us make better sense of it; this type of mental classification process is fundamental to learning and comprehension. Similarly, to help you learn about and better comprehend your data, you can use the Multivariate Clustering tool. Given the number of clusters to create, it will look for a solution where all the features within each cluster are as similar as possible, and all the clusters themselves are as different as possible. Feature similarity is based on the set of attributes that you specify for the Analysis Fields parameter, and clusters are created using the K-Means algorithm.
Clustering, grouping, and classification techniques are some of the most widely used methods in machine learning. The Multivariate Clustering tool utilizes unsupervised machine learning methods to determine natural clusters in your data. These classification methods are considered unsupervised as they do not require a set of preclassified features to guide or train the method to find the clusters in your data.
While hundreds of cluster analysis algorithms such as these exist, the optimization problem they address is NP-hard. This means that the only way to guarantee a solution that perfectly maximizes both within-group similarity and between-group difference is to try every possible partition of the features you want to cluster. While this might be feasible with a handful of features, the problem quickly becomes intractable.
Not only is it intractable to ensure that you've found an optimal solution, it is also unrealistic to try to identify a clustering algorithm that will perform best for all possible types of data and scenarios. Clusters come in all different shapes, sizes, and densities; attribute data can include a variety of ranges, symmetry, continuity, and measurement units. This explains why so many different cluster analysis algorithms have been developed over the past 50 years. It is most appropriate, therefore, to think of these tools as exploratory tools that can help you learn more about underlying structures in your data.
Some of the ways that this tool might be applied are as follows:
- Suppose you have salmonella samples from farms around your state with attributes including the type/class, location, and date/time. To better understand how the bacteria are transmitted and spread, you can use the Multivariate Clustering tool to partition the samples into individual "outbreaks". Even though the analysis itself is not spatial, you may discover a spatial pattern in your results as an outbreak spreads. Once the clusters are determined, you can use other spatial pattern analysis tools such as Standard Deviational Ellipse, Mean Center, or Near to analyze each outbreak.
- If you've collected data on animal sightings to better understand their territories, the Multivariate Clustering tool might be helpful. Understanding where and when salmon congregate at different life stages, for example, could assist with designing protected areas that may help ensure successful breeding.
- Clustering customers by their buying patterns, demographic characteristics, travel patterns, or other behavioral attributes may help you design an efficient marketing strategy for your company's products.
This tool takes point, polyline, or polygon Input Features, a path for the Output Features, one or more Analysis Fields, and an integer value representing the Number of Clusters to create. There are also a number of optional parameters, including options for Initialization Method and an Output Table for Evaluating Optimal Number of Clusters.
Select fields that are numeric, reflecting ratio, interval, or ordinal measurement systems. While nominal data can be represented using dummy (binary) variables, these generally do not work as well as other numeric variable types. For example, you could create a variable called Rural and assign each feature (each census tract, for example) a 1 if it is mostly rural and a 0 if it is mostly urban. A better representation for this variable would be the amount or proportion of rural acreage associated with each feature.
The values of the Analysis Fields are standardized by the tool because variables with large variances (where data values are very spread out around the mean) tend to have a larger influence on the clusters than variables with small variances. Standardization of the attribute values involves a z-transform, where the mean for all values is subtracted from each value and divided by the standard deviation for all values. Standardization puts all the attributes on the same scale, even when they are represented by very different types of numbers: rates (numbers from 0 to 1.0), population (with values larger than 1 million), and distances (kilometers, for example).
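The z-transform described above is straightforward to sketch. Below is a minimal numpy illustration using made-up attribute values on very different scales (a rate, a population count, and a distance); it is not the tool's implementation, just the same standardization idea:

```python
import numpy as np

# Hypothetical attribute values for five features, on wildly different scales.
rates = np.array([0.12, 0.40, 0.33, 0.08, 0.55])
population = np.array([1_200_000.0, 450_000.0, 980_000.0, 60_000.0, 2_100_000.0])
distance_km = np.array([3.2, 18.5, 7.7, 42.0, 1.1])

def z_transform(values):
    """Subtract the mean of all values, then divide by their standard deviation."""
    return (values - values.mean()) / values.std()

standardized = np.column_stack([z_transform(v) for v in (rates, population, distance_km)])

# After standardization, every column has mean ~0 and standard deviation 1,
# so no single attribute dominates the similarity calculations.
print(standardized.mean(axis=0).round(10))
print(standardized.std(axis=0).round(10))
```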
You should select variables that you think will distinguish one cluster of features from another. Suppose, for example, you are interested in clustering school districts by student performance on standardized achievement tests. You might select Analysis Fields that include overall test scores, results for particular subjects such as math or reading, the proportion of students meeting some minimum test score threshold, and so forth. When you run the Multivariate Clustering tool, an R2 value is computed for each variable and reported in the messages window. In the summary below, for example, school districts are grouped based on student test scores, the percentage of adults in the area who didn't finish high school, per-student spending, and average student-to-teacher ratios. Notice that the TestScores variable has the highest R2 value, indicating that this variable divides the school districts into clusters most effectively. The R2 value reflects how much of the variation in the original TestScores data was retained after the clustering process, so the larger the R2 value is for a particular variable, the better that variable is at discriminating among your features.
R2 is computed as:
(TSS - ESS) / TSS
where TSS is the total sum of squares and ESS is the explained sum of squares. TSS is calculated by squaring and then summing deviations from the global mean value for a variable. ESS is calculated the same way, except deviations are cluster by cluster: every value is subtracted from the mean value for the cluster it belongs to and is then squared and summed.
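A minimal numpy sketch of this R2 computation, using made-up test-score values and an illustrative three-cluster assignment (neither comes from the tool):

```python
import numpy as np

# Hypothetical test scores for nine school districts and the cluster
# each district was assigned to (cluster IDs are illustrative).
scores = np.array([52.0, 55.0, 50.0, 71.0, 74.0, 69.0, 90.0, 88.0, 93.0])
cluster_id = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# TSS: squared deviations from the global mean, summed over all features.
tss = ((scores - scores.mean()) ** 2).sum()

# ESS: squared deviations from each cluster's own mean, summed cluster by cluster.
ess = sum(((scores[cluster_id == c] - scores[cluster_id == c].mean()) ** 2).sum()
          for c in np.unique(cluster_id))

r2 = (tss - ess) / tss
print(round(r2, 3))  # → 0.983: the clustering retains most of the variation
```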
Number of clusters
Sometimes you will know the number of clusters most appropriate to your question or problem and you would enter that number for the Number of Clusters parameter. In many cases, however, you won't have any criteria for selecting a specific number of clusters; instead, you just want the number that best distinguishes feature similarities and differences. To help you in this situation, you can leave the Number of Clusters parameter blank and let the Multivariate Clustering tool assess the effectiveness of dividing your features into 2, 3, 4, and up to 30 clusters. The clustering effectiveness is measured using the Calinski-Harabasz pseudo F-statistic, which is a ratio of between-cluster variance to within-cluster variance. In other words, it is a ratio reflecting within-group similarity and between-group difference.
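One common formulation of the Calinski-Harabasz pseudo F-statistic is (R2 / (nc - 1)) / ((1 - R2) / (n - nc)), where n is the number of features, nc is the number of clusters, and R2 = (TSS - ESS) / TSS as defined above. A minimal numpy sketch of that formulation, using an illustrative one-variable data set and cluster assignment:

```python
import numpy as np

# One standardized variable for nine features, and an illustrative
# assignment of those features into three clusters.
values = np.array([-1.2, -1.0, -1.1, 0.0, 0.1, -0.1, 1.0, 1.1, 1.2])
cluster_id = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

n = len(values)                   # number of features
nc = len(np.unique(cluster_id))   # number of clusters

tss = ((values - values.mean()) ** 2).sum()
ess = sum(((values[cluster_id == c] - values[cluster_id == c].mean()) ** 2).sum()
          for c in np.unique(cluster_id))
r2 = (tss - ess) / tss

# Higher values mean tighter clusters that are farther apart.
pseudo_f = (r2 / (nc - 1)) / ((1 - r2) / (n - nc))
print(round(pseudo_f, 1))  # → 363.0 for this tightly separated example
```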
The Multivariate Clustering tool uses the K-Means algorithm by default. The goal of K-Means is to partition features so that the differences among the features within each cluster, summed over all clusters, are minimized. Because the underlying problem is NP-hard, a greedy heuristic is employed to cluster features. The greedy algorithm always converges to a local minimum but will not always find the global (most optimal) minimum.
The K-Means algorithm works by first identifying seeds used to grow each cluster. Consequently, the number of seeds always matches the Number of Clusters. The first seed is selected randomly. Selection of the remaining seeds, while still employing a random component, applies a weighting that favors seeds farthest in data space from the existing set of seed features (this part of the algorithm is called K-Means++). Because seed selection has a random component, whether you choose Optimized seed locations or Random seed locations for the Initialization Method, you might get variations in clustering results from one run of the tool to the next.
Once the seeds are identified, all features are assigned to the closest seed feature (closest in data space). For each cluster of features, a mean data center is computed, and each feature is reassigned to the closest center. The process of computing a mean data center for each cluster and then reassigning features to the closest center continues until cluster membership stabilizes (up to a maximum of 100 iterations).
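The seed, assign, and recompute steps above can be sketched in a few lines of numpy. This is a simplified illustration with made-up 2-D attribute data; it uses plain random seed selection rather than the tool's weighted K-Means++ scheme:

```python
import numpy as np

rng = np.random.default_rng(42)

# Sixty illustrative features with two standardized attributes, drawn
# around three loose centers in data space.
data = np.vstack([
    rng.normal((-3.0, -3.0), 0.4, size=(20, 2)),
    rng.normal((0.0, 3.0), 0.4, size=(20, 2)),
    rng.normal((3.0, -1.0), 0.4, size=(20, 2)),
])

def k_means(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Seed step: plain random selection here; the tool's default weights
    # later seeds toward features far from the existing seeds.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Assign every feature to the closest center in data space.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # cluster membership has stabilized
        labels = new_labels
        # Recompute each cluster's mean data center (keep the old center
        # if a cluster happens to empty out).
        centers = np.vstack([points[labels == c].mean(axis=0) if np.any(labels == c)
                             else centers[c] for c in range(k)])
    return labels, centers

labels, centers = k_means(data, k=3)
print(np.bincount(labels, minlength=3))
```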
Like the K-Means algorithm, K-Medoids works by first identifying seed features used to grow each cluster. Each of the seed features is an actual feature in the Input Features. These seed features are called medoids. All features are assigned to the closest medoid (closest in data space). This is the initial cluster solution. The sum of the distances (in data space) between each medoid and its non-medoid features is calculated. To refine this solution, within each cluster, the medoid is swapped with each non-medoid feature and the sum of the distances is recalculated. If the swap increases the sum of the distances, it is undone; otherwise, the swapped feature becomes the new medoid. The process of finding new medoids and then reassigning features to the closest medoid continues until cluster membership stabilizes.
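The swap-based refinement described above can be sketched as a simplified PAM-style search. The one-dimensional values and the random initialization below are illustrative, not the tool's implementation:

```python
import numpy as np

# Nine illustrative 1-D standardized attribute values in three natural groups.
values = np.array([-2.1, -2.0, -1.9, 0.0, 0.1, 0.2, 2.0, 2.1, 2.3])

def total_cost(medoid_idx, values):
    """Sum of data-space distances from every feature to its closest medoid."""
    dists = np.abs(values[:, None] - values[np.array(medoid_idx)][None, :])
    return dists.min(axis=1).sum()

def k_medoids(values, k=3, seed=0):
    rng = np.random.default_rng(seed)
    # Initial medoids are actual features, chosen at random here.
    medoids = list(rng.choice(len(values), size=k, replace=False))
    cost = total_cost(medoids, values)
    improved = True
    while improved:
        improved = False
        # Trial-swap each medoid with each non-medoid feature; keep the swap
        # only if it lowers the total distance, otherwise undo it.
        for i in range(k):
            for j in range(len(values)):
                if j in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = j
                trial_cost = total_cost(trial, values)
                if trial_cost < cost:
                    medoids, cost = trial, trial_cost
                    improved = True
    return np.sort(values[medoids])

print(k_medoids(values))  # settles on one medoid per natural group
```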
K-Means and K-Medoids are both popular clustering algorithms and will generally produce similar results. However, K-Medoids is more robust to noise and outliers in the Input Features, while K-Means is generally faster and is preferred for large data sets.
A number of outputs are created by the Multivariate Clustering tool. The messages can be accessed from the Geoprocessing pane by hovering over the progress bar, clicking the tool progress button, or expanding the messages section at the bottom of the Geoprocessing pane. You can also access the messages from a previous run of Multivariate Clustering via the Geoprocessing History.
The default output for the Multivariate Clustering tool is a new output feature class containing the fields used in the analysis, plus a new integer field called CLUSTER_ID identifying which cluster each feature belongs to. This output feature class is added to the table of contents with a unique color rendering scheme applied to the CLUSTER_ID field. The IS_SEED field indicates which features were chosen as seeds and used to grow clusters.
Multivariate Clustering chart outputs
Multiple types of charts are created to summarize the clusters. Box plots show the characteristics of each cluster as well as the characteristics of each variable used in the analysis. The graphic below shows how to interpret box plots and their summary values for each Analysis Field and cluster created: minimum data value, 1st quartile, global median, 3rd quartile, maximum data value, and data outliers (values more than 1.5 times the interquartile range below the 1st quartile or above the 3rd quartile). Hover over the box plot on the chart to see these values as well as the interquartile range value. Any point marks falling outside the lower or upper whisker represent data outliers.
The interquartile range (IQR) is the 3rd quartile minus the 1st quartile. Low outliers are values more than 1.5 * IQR below the 1st quartile (less than Q1 - 1.5 * IQR), and high outliers are values more than 1.5 * IQR above the 3rd quartile (greater than Q3 + 1.5 * IQR). Outliers appear in the box plots as a point symbol.
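These outlier fences are straightforward to compute. A minimal numpy sketch using illustrative rent values (the data, including the single high outlier, are made up):

```python
import numpy as np

# Illustrative median-rent values for one cluster, including one
# unusually expensive tract.
rents = np.array([354, 410, 455, 470, 502, 515, 540, 565, 598, 640, 813, 1900])

q1, q3 = np.percentile(rents, [25, 75])
iqr = q3 - q1  # interquartile range

# Values beyond 1.5 * IQR below Q1 or above Q3 are flagged as outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

outliers = rents[(rents < low_fence) | (rents > high_fence)]
print(outliers)  # → [1900]
```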
The default parallel box plot chart summarizes both the clusters and the variables within them. For example, the Multivariate Clustering tool was run on census tracts to create four clusters. From the chart below, notice that cluster 2 (red) reflects tracts with about average rents, the highest values for female-headed households with children (FHH_CHILD), the highest values for number of housing units (HSE_UNITS), and the highest values for children under the age of 5. Cluster 4 (goldenrod) reflects tracts with the highest median rents, almost the lowest number of female-headed households with children, and more than the average number of housing units. Cluster 3 (green) reflects tracts with the fewest female-headed households with children, the fewest children under the age of 5, the fewest housing units, and almost the lowest rents (not as low as cluster 1). Hover over each node of the mean lines to see the cluster's average value for each Analysis Field.
After inspecting the global summary of the analysis with the parallel box plots above, you can inspect each cluster's box plots for each variable by switching to Show as multiple box plots in the Chart Properties pane. With this view of the data, it is easy to see which group has the highest and lowest range of values within each variable. Box plots will be created for each cluster for each variable so you can see how each cluster's values relate to the other clusters created. Hover over each variable's box plot to see the Minimum, Maximum, and Median values for each variable in each cluster. In the chart below, for example, you see that Cluster 4 (gold) has the highest values for the MEDIANRENT variable and contains tracts with a range of values from 354 to 813.
A bar chart is also created showing the number of features per cluster. Selecting each bar will also select that cluster's features in the map, which may be useful for further analysis.
When you leave the Number of Clusters parameter blank, the tool will evaluate the optimal number of clusters based on your data. If you specify a path for the Output Table for Evaluating Optimal Number of Clusters parameter, a chart will be created showing the pseudo F-statistic values calculated. The highest peak on the graph indicates the number of clusters that will be most effective at distinguishing the features based on the variables you specified. In the chart below, the F-statistic associated with four groups is highest; five groups, also with a high pseudo F-statistic, would be another good choice.
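The evaluation can be thought of as a loop over candidate cluster counts, scoring each solution with the pseudo F-statistic and keeping the count with the highest value. In the sketch below, the data, the deterministic farthest-point seeding, and the small range of candidate counts are all illustrative simplifications of what the tool does:

```python
import numpy as np

# Nine 1-D standardized values forming three tight, well-separated groups
# (an illustrative stand-in for real Analysis Field data).
data = np.array([-10.1, -10.0, -9.9, -0.1, 0.0, 0.1, 9.9, 10.0, 10.1])

def farthest_point_seeds(values, k):
    """Deterministic stand-in for seeding: start with the first feature,
    then repeatedly pick the feature farthest from all existing seeds."""
    seeds = [0]
    for _ in range(k - 1):
        min_dist = np.min(np.abs(values[:, None] - values[seeds][None, :]), axis=1)
        seeds.append(int(min_dist.argmax()))
    return values[seeds]

def k_means(values, k, max_iter=100):
    centers = farthest_point_seeds(values, k)
    for _ in range(max_iter):
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        new_centers = np.array([values[labels == c].mean() if np.any(labels == c)
                                else centers[c] for c in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def pseudo_f(values, labels):
    """Calinski-Harabasz pseudo F for one clustering solution."""
    n, nc = len(values), len(np.unique(labels))
    tss = ((values - values.mean()) ** 2).sum()
    ess = sum(((values[labels == c] - values[labels == c].mean()) ** 2).sum()
              for c in np.unique(labels))
    return ((tss - ess) / (nc - 1)) / (ess / (n - nc))

# Score each candidate cluster count and keep the peak.
scores = {k: pseudo_f(data, k_means(data, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k)  # → 3 for this data
```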
While there is a tendency to include as many Analysis Fields as possible, for the Multivariate Clustering tool it works best to start with a single variable and build up. Results are easier to interpret with fewer analysis fields, and it is also easier to determine which variables are the best discriminators when there are fewer fields.
In many scenarios, you will likely run the Multivariate Clustering tool a number of times looking for the optimal Number of Clusters and most effective combination of Analysis Fields that best separate your features into clusters.
If the tool returns 30 as the optimal number of clusters, be sure to look at the chart of the F-statistics. Choosing the number of clusters and interpreting the F-statistic chart is an art form, and a lower number of clusters may be more appropriate for your analysis.