Time Series Clustering

Summary

Partitions a collection of time series, stored in a space-time cube, based on the similarity of time series characteristics. Time series can be clustered so they have similar values in time or similar behaviors or profiles across time (increase or decrease at the same points in time). The output of this tool is a 2D map displaying each location in the cube symbolized by cluster membership, together with messages describing the clusters. Optional charts provide information about the representative time series signature for each cluster.

Learn more about how the Time Series Clustering tool works

Usage

  • This tool accepts netCDF files created by the Create Space Time Cube By Aggregating Points or Create Space Time Cube From Defined Features tool.

  • This tool compares the time series at each location to the time series at every other location in the Input Space Time Cube. The Characteristic of Interest is compared for each pair of locations, and the results are summarized as a dissimilarity matrix. The locations are then clustered from this matrix using the k-medoids algorithm.

  • The Output Features will be added to the Contents pane with rendering based on the CLUSTER_ID field and indicate which cluster each location fell into. If you specify that you want three clusters, for example, each record will contain a value of 1, 2, or 3 for the CLUSTER_ID field. The CENTER_REP field, which contains a value of 1 for the location that was most representative of each cluster's time series, is also added.

  • This tool creates messages and optional charts to help you understand the characteristics of the identified clusters. You can access the messages by hovering over the progress bar, clicking the pop-out button, or expanding the messages section in the Geoprocessing pane. You can also access the messages for a previous run of the Time Series Clustering tool using Geoprocessing History. The charts created when you specify an Output Table for Charts can be accessed by clicking the List By Charts tab in the Contents pane.

  • When you specify an optional Output Table for Charts, charts will be created showing the Average Time Series per Cluster and the Time Series Cluster Medoids.

  • For more information about the output messages and charts, see How Time Series Clustering works.

  • Sometimes you know the Number of Clusters most appropriate for your data. In cases where you don't, you may have to experiment with different numbers of clusters, noting which values provide the best clustering differentiation. If you leave the Number of Clusters parameter empty, the tool will evaluate the optimal number of clusters using the spectral gap heuristic and report the optimal number of clusters in the messages window.

  • Similarity can be based on either the Value of the time series across time or the Profile of each time series. For Profile, the cosine similarity between each pair of locations is calculated and captures whether the time series are increasing or decreasing at the same time rather than trying to match magnitudes. For example, a time series with the values 10, 20, 10, and 40 and a time series with values 100, 200, 100, and 400 would be considered similar when using Profile. However, these two time series would be considered very different from one another when using Value.

  • The cluster number assigned to a location may change from one run to the next because the algorithm randomly selects initial seeds to begin growing clusters. For example, suppose you partition locations into two clusters based on annual population growth. The first time you run the analysis you might see the high growth locations labeled as cluster 2 and the low growth locations labeled as cluster 1; the second time you run the same analysis, the high growth locations might be labeled as cluster 1. You might also see that some of the average or middle growth locations switch cluster membership from one run to another. This is due to a random component in k-medoids.
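The difference between Value and Profile described above can be illustrated with a small sketch in plain Python. It is not the tool's implementation: the helper names are hypothetical, Euclidean distance is an assumed stand-in for the Value comparison (the measure used for Value is not stated here), and a third series with a different shape is added for contrast.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length time series."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Assumed stand-in for a Value comparison: distance between magnitudes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

series_a = [10, 20, 10, 40]
series_b = [100, 200, 100, 400]   # same shape as series_a, ten times larger
series_c = [40, 10, 20, 10]       # similar magnitudes to series_a, different shape

# Profile view: series_a and series_b rise and fall together.
profile_ab = cosine_similarity(series_a, series_b)   # approximately 1.0
profile_ac = cosine_similarity(series_a, series_c)   # about 0.545

# Value view: series_a is far from series_b but close to series_c.
value_ab = euclidean_distance(series_a, series_b)    # about 422.1
value_ac = euclidean_distance(series_a, series_c)    # about 44.7

print(profile_ab, profile_ac, value_ab, value_ac)
```

In this sketch, series_a pairs with series_b when shape is compared and with series_c when magnitudes are compared, which is the distinction the PROFILE and VALUE options are meant to capture.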

Syntax

TimeSeriesClustering_stpm (in_cube, analysis_variable, output_features, characteristic_of_interest, cluster_count, output_table_for_charts)
Parameter | Explanation | Data Type
in_cube

The netCDF cube to be analyzed. This file must have an .nc extension and must have been created using the Create Space Time Cube By Aggregating Points or Create Space Time Cube From Defined Features tool.

File
analysis_variable

The numeric variable in the netCDF file, changing over time, that will be used to distinguish one cluster from another.

String
output_features

The new output feature class containing all locations in the space-time cube and a field indicating cluster membership. This feature class will be a two-dimensional representation of the clusters in your data.

Feature Class
characteristic_of_interest

The aspect of the time series that will be used to define what it means to be similar. Choose whether clustering of time series should be based on values or time series profiles.

  • VALUE: Locations that have time series with similar values of the Analysis Variable at the same points in time will be clustered together.
  • PROFILE: Time series that have similar behaviors, fluctuations, and complexity will be clustered together. Clustering is based on the common form or shape of each location's time series.
String
cluster_count

The number of clusters to create. When left empty, the tool will evaluate the optimal number of clusters using the spectral gap heuristic. The optimal number of clusters will be reported in the messages window.

Long
output_table_for_charts

If specified, this table contains the representative time series for each cluster based on both the average for each time series cluster and the medoid time series. Charts created from this table can be accessed by clicking the List By Charts tab in the Contents pane.

Table
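The two representative series stored in the output table, the cluster average and the cluster medoid, can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the tool's code: the function names and sample data are hypothetical, and Euclidean distance is assumed when picking the medoid.

```python
import math

def group_by_cluster(series, labels):
    """Collect the time series belonging to each cluster ID."""
    grouped = {}
    for values, cluster_id in zip(series, labels):
        grouped.setdefault(cluster_id, []).append(values)
    return grouped

def average_series(series, labels):
    """Mean value of the cluster members at each time step."""
    return {
        cluster_id: [sum(step) / len(step) for step in zip(*members)]
        for cluster_id, members in group_by_cluster(series, labels).items()
    }

def medoid_series(series, labels):
    """Member series with the smallest total distance to the other members."""
    return {
        cluster_id: min(
            members,
            key=lambda m: sum(math.dist(m, other) for other in members),
        )
        for cluster_id, members in group_by_cluster(series, labels).items()
    }

# Hypothetical data: five locations, three time steps, two clusters.
series = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [10, 10, 10], [11, 11, 11], [15, 15, 15]]
labels = [1, 1, 1, 2, 2, 2]

averages = average_series(series, labels)
medoids = medoid_series(series, labels)
print(averages)
print(medoids)
```

For cluster 2 in this sample, the average is a series that no location actually has, while the medoid is always one of the member series.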

Code sample

TimeSeriesClustering example 1 (Python window)

The following Python script demonstrates how to use the TimeSeriesClustering tool.

import arcpy
arcpy.env.workspace = r"C:\Analysis"
arcpy.stpm.TimeSeriesClustering("COUNTY_CRIME.NC", "COUNT", 
    "COUNTY_CRIME_PATTERNS", "PROFILE", 5, "CRIME_CHARTS")
TimeSeriesClustering example 2 (stand-alone script)

The following Python script demonstrates how to use the TimeSeriesClustering tool to cluster similar store locations.

# Creating clusters of store locations with similar sales volumes over time

# Import system modules
import arcpy

# Set property to overwrite existing output, by default
arcpy.env.overwriteOutput = True

# Local variables...
workspace = r"C:\Analysis"
arcpy.env.workspace = workspace

# Create clusters of store locations with similar sales volumes. Automatically 
# determine the optimal number of clusters.
arcpy.stpm.TimeSeriesClustering("OH_STORES.NC", "SALES_SUM_ZERO", "SALES_TRENDS", 
    "VALUE", None, "SALES_CHARTS")

# Create a feature class containing all the bins in the input space time cube.
arcpy.stpm.VisualizeSpaceTimeCube3D("OH_STORES.NC", "SALES_SUM_ZERO", "VALUE", 
    "SALES_BINS")

# Join the clustering results to the bins so each bin now has a cluster ID.
# AddJoin returns the joined layer, which is used as the input below.
joined_bins = arcpy.management.AddJoin("SALES_BINS", "LOCATION", "SALES_TRENDS", 
    "LOCATION", "KEEP_ALL")

# Summarize the bins using Summary Statistics with Cluster ID as a case field 
# to get the minimum, maximum and average sales for each cluster.
arcpy.analysis.Statistics(joined_bins, "SALES_DATA_STATISTICS", 
    "SALES_BINS.VALUE MIN;SALES_BINS.VALUE MEAN;SALES_BINS.VALUE MAX", 
    "SALES_TRENDS.CLUSTER_ID")

Environments

Random number generator

The Random Generator Type used is always Mersenne Twister.
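
Because k-medoids begins from randomly selected seeds, the random number stream determines which cluster receives which number. The following is a minimal plain-Python sketch of that effect, not the tool's implementation: the `k_medoids` function, its `seed` argument, and the sample data are hypothetical, Euclidean distance is assumed for the dissimilarity matrix, and a simple alternating update is used. Python's `random` module is itself a Mersenne Twister generator, so a fixed seed reproduces the same initial medoids.

```python
import math
import random

def k_medoids(dissimilarity, k, seed, max_iter=100):
    """Cluster items from a square dissimilarity matrix; returns (labels, medoids)."""
    rng = random.Random(seed)            # seeded Mersenne Twister stream
    n = len(dissimilarity)
    medoids = rng.sample(range(n), k)    # random initial seeds
    labels = [0] * n
    for _ in range(max_iter):
        # Assign each item to the cluster of its nearest medoid.
        labels = [
            min(range(k), key=lambda c: dissimilarity[i][medoids[c]])
            for i in range(n)
        ]
        # Move each medoid to the member with the smallest total dissimilarity.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:              # keep the old medoid if a cluster is empty
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(
                min(members, key=lambda m: sum(dissimilarity[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return [label + 1 for label in labels], medoids   # cluster IDs start at 1

# Hypothetical data: three low-growth and three high-growth locations.
series = [
    [1, 1, 2, 2], [1, 2, 2, 3], [2, 2, 3, 3],
    [10, 11, 12, 13], [11, 12, 13, 14], [12, 12, 14, 15],
]
dissimilarity = [[math.dist(a, b) for b in series] for a in series]

# The same seed gives the same labels on every run.
labels_first, _ = k_medoids(dissimilarity, k=2, seed=42)
labels_again, _ = k_medoids(dissimilarity, k=2, seed=42)

# Different seeds keep the same grouping but may swap which group is cluster 1.
labelings = {tuple(k_medoids(dissimilarity, k=2, seed=s)[0]) for s in range(20)}
print(labels_first, labels_again, labelings)
```

With well-separated groups like these, the grouping itself is stable and only the cluster numbers can differ between seeds; locations near the boundary between clusters could also change membership, as noted in the usage section.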

Licensing information

  • ArcGIS Desktop Basic: Yes
  • ArcGIS Desktop Standard: Yes
  • ArcGIS Desktop Advanced: Yes