The Time Series Clustering tool identifies the locations in a space-time cube that are most similar and partitions them into distinct clusters in which members of each cluster have similar time series characteristics. Time series can be clustered so that they have similar values across time or similar behaviors or profiles across time (increases and decreases at the same points in time). The tool takes a space-time NetCDF cube as input. The cube must have been created using either the Create Space Time Cube By Aggregating Points tool or the Create Space Time Cube From Defined Locations tool. The time series of the Analysis Variable for each location in the cube are compared, and similarity metrics are calculated. The time series are then clustered using k-medoids. The tool generates a 2D map displaying each location in the cube symbolized by its cluster membership, along with messages and optional charts that summarize the time series in each cluster.
- An analyst has created a space-time cube representing several years of 911 calls and can use the Time Series Clustering tool to determine which neighborhoods have similar temporal trends in terms of high and low call patterns.
- Demographers might use this tool to evaluate which countries have similar patterns of population growth, both in terms of value and profile of time series.
- A large retailer might use this tool to find stores that have similar purchasing patterns or total sales. This information can then be used to help the retailer predict retail demand and ensure that stores have sufficient inventory.
This tool creates a number of outputs. A 2D map showing each location in the Input Space Time Cube, symbolized by its cluster membership, allows you to explore any spatial patterns. Even though the k-medoids clustering technique used in this tool does not take spatial relationships into account when performing the clustering, spatial patterns may still be present. In addition, messages summarizing the analysis results and Mann-Kendall trend statistics for each cluster are written at the bottom of the Geoprocessing pane during tool execution. You can access the messages by hovering over the progress bar, clicking the pop-out button, or expanding the messages section in the Geoprocessing pane. You can also access the messages for a previously run tool via the Geoprocessing History.
The default output of the Time Series Clustering tool is a new output feature class containing the CLUSTER_ID field, which indicates the cluster to which each location belongs. This output feature class is added to the table of contents with a unique color rendering scheme applied to the CLUSTER_ID field. The feature class also contains the CENTER_REP field, which indicates which location in the space-time cube is most representative of each cluster. For each cluster, this field contains a 1 for the location whose time series is the medoid of that cluster; all other locations contain a 0.
Time Series Clustering chart outputs
Optional charts are produced when you choose to create the Output Table for Charts. The Average Time Series per Cluster chart displays the average of all the time series for each cluster. The average time series is calculated by averaging the value of the Analysis Variable at each time step. The Time Series Cluster Medoids chart displays the medoids of the clusters at each time step. At each time step of this chart, half of the values in the cluster will be above the medoid and half will be below. Unlike the line graphs in the Average Time Series per Cluster chart, the time series represented by the medoids is an actual location in the Input Space Time Cube. This location is indicated by a 1 in the CENTER_REP field in the Output Features.
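The difference between the two charts can be illustrated with a small sketch. The data below is invented, and Euclidean distance is an assumed stand-in for the dissimilarity measure; the tool's own calculation may differ. The sketch only shows that an average series is computed per time step and need not match any location, while a medoid is always one of the cluster's own members.

```python
import math

# Hypothetical cluster of three locations, each with three time steps.
cluster = [
    [2.0, 4.0, 6.0],
    [3.0, 5.0, 7.0],
    [10.0, 6.0, 8.0],
]

# Average time series: mean of the values at each time step.
average_series = [sum(step) / len(step) for step in zip(*cluster)]

def distance(a, b):
    """Euclidean distance, assumed here purely for illustration."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Medoid: the member with the smallest total distance to all members.
totals = [sum(distance(s, other) for other in cluster) for s in cluster]
medoid_index = totals.index(min(totals))
medoid_series = cluster[medoid_index]

print("average:", average_series)
print("medoid index:", medoid_index, "series:", medoid_series)
```

In this made-up cluster, the average series does not coincide with any of the three locations, whereas the medoid is the second location's actual time series.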
Comparing time series
Determining how different two numbers are is a straightforward task: ten is twice as many as five, and four and seven are three units apart. Because time series are composed of many values across time, they are not as easy to compare. They must be summarized using a different kind of metric, and the metric you use depends on the time series characteristic you are interested in.
The Characteristic of Interest parameter specifies which time series characteristic defines what it means for two time series to be similar. Clustering can be based on the Value of the Analysis Variable over time or on the Profile of the time series for the Analysis Variable. For Profile, the cosine similarity between each pair of locations is calculated; it captures whether the time series increase or decrease at the same times rather than comparing their magnitudes. For example, a time series with the values 10, 20, 10, and 40 and a time series with the values 100, 200, 100, and 400 would be considered similar when using Profile. However, these two time series would be considered very different from one another when using Value.
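The example above can be checked with a short sketch. The cosine similarity follows the standard definition. The Euclidean distance is only an assumed stand-in for a magnitude-based comparison, since this document does not state which measure the Value option uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length series."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Assumed magnitude-based measure; not confirmed as the tool's metric."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

small = [10, 20, 10, 40]
large = [100, 200, 100, 400]

profile_similarity = cosine_similarity(small, large)
value_distance = euclidean_distance(small, large)

print(f"cosine similarity:  {profile_similarity:.3f}")
print(f"euclidean distance: {value_distance:.1f}")
```

Because the second series is the first multiplied by 10, the cosine similarity is at its maximum of 1, while the magnitude-based distance is large relative to the values in the first series.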
For either option, each location in the space-time cube is compared to every other location, and a dissimilarity measure is calculated that results in a dissimilarity matrix that stores the relationship between every combination of locations. This matrix is then clustered using k-medoids. It may seem strange to calculate a dissimilarity matrix when you are interested in clustering similar time series together; however, by convention, the k-medoids algorithm works on a dissimilarity matrix, the inverse of a similarity.
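The workflow described above can be sketched as follows. This is a minimal illustration, not the tool's implementation: the conversion of cosine similarity to a dissimilarity (1 minus the similarity), the farthest-first initialization, the simple alternating update, the function names, and the sample data are all assumptions made for the example.

```python
import math

def cosine_dissimilarity(a, b):
    """1 - cosine similarity; an assumed way to turn similarity into dissimilarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return max(0.0, 1.0 - dot / norms)

def dissimilarity_matrix(series, measure):
    """Compare every location with every other location."""
    return [[measure(a, b) for b in series] for a in series]

def assign(dissim, medoids):
    """Label each location with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dissim[i][medoids[c]])
            for i in range(len(dissim))]

def k_medoids(dissim, k, max_iter=100):
    """Simple alternating k-medoids that works only on the dissimilarity matrix."""
    n = len(dissim)
    # Start with the most central location, then add the farthest ones.
    medoids = [min(range(n), key=lambda i: sum(dissim[i]))]
    while len(medoids) < k:
        medoids.append(max(range(n),
                           key=lambda i: min(dissim[i][m] for m in medoids)))
    for _ in range(max_iter):
        labels = assign(dissim, medoids)
        updated = []
        for c in range(k):
            members = [i for i, label in enumerate(labels) if label == c]
            if not members:  # keep the old medoid if a cluster empties
                updated.append(medoids[c])
                continue
            # New medoid: member with the smallest total dissimilarity.
            updated.append(min(members,
                               key=lambda i: sum(dissim[i][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return assign(dissim, medoids), medoids

# Hypothetical locations: three rising series and three falling series.
series = [
    [1, 2, 3, 4], [2, 4, 6, 9], [10, 20, 31, 40],
    [4, 3, 2, 1], [9, 6, 4, 2], [40, 31, 20, 10],
]
dissim = dissimilarity_matrix(series, cosine_dissimilarity)
labels, medoids = k_medoids(dissim, k=2)
print("labels:", labels)
print("medoid locations:", medoids)
```

Note that the clustering step never looks at the time series themselves, only at the matrix of pairwise dissimilarities, which is why the same routine can serve any choice of dissimilarity measure.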
Optimal number of clusters
When you leave the Number of Clusters parameter blank, the tool will evaluate the optimal number of clusters based on your data using the spectral gap heuristic. The optimal number of clusters will be reported in the messages window. One of the benefits of using a dissimilarity matrix is that it contains a great deal of information about how similar the various locations of the space-time cube are to each other. The spectral gap heuristic transforms the dissimilarity matrix and summarizes its transformation into a series of eigenvalues. Eigenvalues are numbers that satisfy certain equations defined on the matrix and summarize the information contained in the matrix. Conceptually, an eigenvalue is calculated for each possible number of clusters, one through the total number of locations in the space-time cube. The spectral gap algorithm orders these eigenvalues by number of clusters and looks for a jump or gap in their values. A gap indicates that a larger number of clusters is less optimal.
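One common form of the spectral gap (eigengap) heuristic, following the spectral clustering references listed below, can be sketched with NumPy. The Gaussian kernel used to transform the dissimilarity matrix, the `sigma` value, the normalized Laplacian, the function name, and the sample data are all assumptions for illustration; the tool's exact transformation is not described here.

```python
import numpy as np

def estimate_cluster_count(dissim, sigma=1.0, max_k=None):
    """Pick the cluster count at the largest gap between sorted eigenvalues."""
    d = np.asarray(dissim, dtype=float)
    # Assumed transformation: dissimilarity -> affinity via a Gaussian kernel.
    affinity = np.exp(-(d ** 2) / (2.0 * sigma ** 2))
    degree = affinity.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalized graph Laplacian.
    laplacian = np.eye(len(d)) - inv_sqrt[:, None] * affinity * inv_sqrt[None, :]
    eigenvalues = np.linalg.eigvalsh(laplacian)  # returned in ascending order
    max_k = max_k or len(d) - 1
    # gaps[i] is the jump from the (i+1)-th to the (i+2)-th smallest eigenvalue.
    gaps = np.diff(eigenvalues[: max_k + 1])
    return int(np.argmax(gaps)) + 1, eigenvalues

# Hypothetical data: nine locations forming three well-separated groups.
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2])
dissim = np.abs(points[:, None] - points[None, :])

best_k, eigenvalues = estimate_cluster_count(dissim)
print("estimated number of clusters:", best_k)
print("smallest eigenvalues:", np.round(eigenvalues[:4], 3))
```

In this sketch, the first few eigenvalues stay close to zero while the groups remain well separated, and the first large jump marks the point beyond which adding clusters is less well supported.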
For more information about k-medoids, see the following:
- Kaufman, L., and P. J. Rousseeuw, Finding groups in data: an introduction to cluster analysis (Vol. 344). John Wiley & Sons, 2009.
For more information about cosine similarity, see the following:
- Leydesdorff, L., "Similarity measures, author cocitation analysis, and information theory," Journal of the Association for Information Science and Technology 56, no. 7 (2005): 769-772.
For more information about spectral gap, see the following:
- Ng, A. Y., M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," In Advances in neural information processing systems (2002): 849-856.
- Von Luxburg, U., "A tutorial on spectral clustering," Statistics and computing 17, no. 4 (2007): 395-416.