Producing signature files, class, and cluster analysis

Available with Spatial Analyst license.

With the ArcGIS Spatial Analyst extension, you can create a classification by grouping raster cells into classes or clusters. A class is usually a known category, such as forests, residential areas, or water bodies, while a cluster is a grouping of cells based on the statistics of their attributes. A signature is a subset of cells that are representative of a class or cluster. The statistics of signatures are stored in a signature file that will be used to classify all cells in the intersection of the input bands.

What is a class?

A class corresponds to a meaningful grouping of locations. For example, forest, water, and high wheat yield are all classes.

Each location is characterized by a set or vector of values, one value for each variable, or input band. Each location can be visualized as a point in a multidimensional attribute space whose axes correspond to the variables in the input bands. A grouping of points in this multidimensional attribute space is referred to as a cluster, and in this case, since the cluster refers to something meaningful, it can also be considered a class. Two locations belong to the same cluster if their attributes (vector of band values) are similar.
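
As a minimal illustration of attribute space, the following Python sketch (using NumPy) turns two small bands into one attribute vector per cell and compares locations by their distance in that space. The function names `to_attribute_vectors` and `attribute_distance` and the sample values are hypothetical and are not part of the Spatial Analyst tools.

```python
import numpy as np

def to_attribute_vectors(bands):
    """Stack 2-D band arrays into an (n_cells, n_bands) array of attribute vectors."""
    stacked = np.stack(bands, axis=-1)  # shape: (rows, cols, n_bands)
    return stacked.reshape(-1, stacked.shape[-1])

def attribute_distance(a, b):
    """Euclidean distance between two attribute vectors (not a spatial distance)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))

# Two hypothetical 2 x 2 bands covering the same four cell locations.
band1 = np.array([[10, 12], [200, 198]])
band2 = np.array([[50, 52], [20, 22]])

vectors = to_attribute_vectors([band1, band2])
# Cells 0 and 1 have similar band values; cell 2 is far away in attribute space.
near = attribute_distance(vectors[0], vectors[1])
far = attribute_distance(vectors[0], vectors[2])
```

Here `near` is much smaller than `far`, so the first two cells would fall in the same cluster while the third would not.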

Known classes may form clusters in attribute space if the classes can be separated, or distinguished, by their attribute values. Locations corresponding to natural clusters in attribute space can be interpreted as naturally occurring classes of strata.

Identifying classes for supervised classification

In a supervised classification, you know what classes you want to divide the study site into, and you have sample locations in the study site that are representative of each class. For example, if you are creating a land-use map from a satellite image, the classes might be urban, water, forest, fields, and roads. The goal is to assign each location in the study area to a known class. The more sample locations that can be identified as belonging to a class and the more homogeneous the cell values are within a class, the better the ensuing classification will be. The actual locations identifying the known class locations are called the training samples.

The training samples can be identified on a polygon layer or on a raster. When defining the training samples, you can use an existing raster as a reference. Generally, a color composite of the first three layers in the raster is displayed as a background and used as a reference to identify the areas to encircle when producing training samples.

Creating clusters in an unsupervised classification

The first step in an unsupervised classification is to create clusters. Statistically, clusters are naturally occurring groupings in the data. The Iso Cluster tool requires input raster bands, the number of classes, the name of the output signature file, the number of iterations, minimum class size, and the interval at which to take the sample points from which to calculate the clusters (the final three parameters are discussed below).

The tool returns a signature file containing the multivariate statistics for a subset of the cells for the identified clusters. The resultant calculations identify which cell location belongs to which cluster, the mean value for the cluster, and the variance-covariance matrix. This information is stored in an ASCII signature file. The signature file is essential in the clustering and classification of the remaining unsampled cells.

Storing class or cluster statistics: the signature file

The signature file is an ASCII file that stores the multivariate statistics for each class or cluster of interest. The file includes the mean for each class or cluster, the number of cells in the class or cluster, and the variance-covariance matrix for the class or cluster.

The signature file can be displayed with any text editor.

For any class or cluster, the values on the diagonal of the variance-covariance matrix, running from the upper left to the lower right, are the variances of the individual input raster bands. The band is identified by the row and column that intersect at that position. All other values in the matrix are covariances between pairs of bands.
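
The statistics described above can be computed along these lines. This is a sketch in Python with NumPy, not the code that writes the signature file: the function name `signature_statistics` and the sample data are hypothetical, and the use of the sample covariance (dividing by n - 1, the `np.cov` default) is an assumption about how the statistics are normalized.

```python
import numpy as np

def signature_statistics(vectors, labels):
    """Return count, mean vector, and variance-covariance matrix per class or cluster.

    vectors: (n_cells, n_bands) attribute vectors of the sampled cells.
    labels:  (n_cells,) class or cluster ID of each sampled cell.
    Each class needs at least two cells for the covariance to be defined.
    """
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    stats = {}
    for label in np.unique(labels):
        members = vectors[labels == label]
        stats[int(label)] = {
            "count": len(members),
            "mean": members.mean(axis=0),
            # rowvar=False: each column is one band (variable).
            "covariance": np.atleast_2d(np.cov(members, rowvar=False)),
        }
    return stats

# Hypothetical two-band samples: three cells in class 1, two cells in class 2.
samples = [[1, 2], [3, 6], [5, 10], [10, 1], [12, 3]]
classes = [1, 1, 1, 2, 2]
stats = signature_statistics(samples, classes)

# The diagonal holds the per-band variances; off-diagonal values are covariances.
band_variances = np.diag(stats[1]["covariance"])
```

For class 1 in this example, the mean vector is (3, 6), the band variances on the diagonal are 4 and 16, and the covariance between the two bands is 8.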

How clusters are determined for an unsupervised classification

The name of the algorithm used for creating clusters in an unsupervised classification is Iso Cluster. The Iso prefix of the isodata clustering algorithm stands for Iterative Self Organizing (ISO), a method of performing clustering. Clusters are calculated using a subset of the cells in the study area. All cluster calculations are performed on the cell values in multivariate attribute space and are not based on any spatial characteristics. That is, the mean is derived from the attribute values for the different input bands. The variance and covariance values are calculated from the variation within and between bands.

The following example illustrates a k-means, or ISO, clustering approach using a two-band raster. The same methodology applies to any number of input bands, that is, in n-dimensional attribute space. The discussion is conceptual and is intended to give a better understanding of the ISO clustering approach.

  • An empty graph is made with the range of values in the first band plotted on the x-axis and the range of values in the second band plotted on the y-axis.
  • A 45-degree line is drawn and divided into as many segments as the number of classes you specify. The center point of each of these line segments is the initial mean value for the corresponding class.

Mean values for classes are determined.

  • Each sample cell is plotted on the graph, and the distance from the point to each mean center point on the 45-degree line is determined. The distance is calculated in attribute space using the Pythagorean theorem. The sample point is assigned to the cluster represented by the closest mean center point.

Distance from each point to the mean center point is calculated.

  • The next sample point is plotted, and the above procedure is repeated for all sample points.

Distance is calculated for all sample points.

  • The above process is iterated. Before each new iteration, a new mean center point is calculated for each cluster from the values of the cells that were assigned to that cluster in the previous iteration. With the new mean center point for each cluster, the previous two steps are repeated.

New mean center points for each class are calculated.

  • The means are updated, and the previous step is repeated. The iteration process for updating the mean values continues until the user-defined number of iterations is reached or until fewer than 2 percent of the cells change from one cluster to another, relative to the new means, within an iteration.
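
The steps above can be sketched as follows. This is a conceptual sketch in Python with NumPy, not the implementation of the Iso Cluster tool. The function names (`initial_means`, `assign`, `iso_cluster_sketch`) are hypothetical, the diagonal is assumed to run from the minimum to the maximum value of each band (a true 45-degree line only when the band ranges are equal), and the minimum class size parameter is not modeled.

```python
import numpy as np

def initial_means(samples, n_classes):
    """Centers of n_classes equal segments along the diagonal of the value range."""
    lo, hi = samples.min(axis=0), samples.max(axis=0)
    fractions = (np.arange(n_classes) + 0.5) / n_classes  # segment midpoints
    return lo + fractions[:, None] * (hi - lo)

def assign(samples, means):
    """Index of the closest mean (Euclidean distance in attribute space) per sample."""
    distances = np.linalg.norm(samples[:, None, :] - means[None, :, :], axis=2)
    return distances.argmin(axis=1)

def iso_cluster_sketch(samples, n_classes, max_iterations=20, change_threshold=0.02):
    samples = np.asarray(samples, dtype=float)
    means = initial_means(samples, n_classes)
    labels = assign(samples, means)
    for _ in range(max_iterations):
        # Recompute each mean from the cells assigned in the previous pass.
        for k in range(n_classes):
            members = samples[labels == k]
            if len(members):  # an empty cluster keeps its previous mean
                means[k] = members.mean(axis=0)
        new_labels = assign(samples, means)
        changed = np.mean(new_labels != labels)  # fraction of cells that moved
        labels = new_labels
        if changed < change_threshold:  # fewer than 2 percent changed cluster
            break
    return labels, means

# Hypothetical two-band sample cells forming two obvious groups.
cells = [[1, 1], [2, 1], [1, 2], [9, 9], [10, 9], [9, 10]]
labels, means = iso_cluster_sketch(cells, n_classes=2)
```

With these sample cells, the first three cells end up in cluster 0 and the last three in cluster 1, and the loop stops after one update because no cell changes cluster.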

Clustering is sensitive to the range of values within each band. This range of values determines the values on the x- and y-axes from which the Euclidean distances between means and sample points are calculated. To have the attributes of each band considered equally, the value range for each band should be similar, whether performing a supervised classification or unsupervised clustering. When the value range of one band is small relative to the other bands, the Euclidean distance in multivariate space may be so small that several clusters end up with a mean of zero. If any cluster has a mean of zero, the final classification and any other multivariate tool that depends on a signature file will fail. Ideally, all bands should be normalized to the same value range.
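
One way to bring bands to a common value range is a linear (min-max) rescale, sketched below in Python with NumPy. The function name `rescale_band` and the target range of 0 to 255 are assumptions chosen for illustration; other normalization schemes may suit a given dataset better.

```python
import numpy as np

def rescale_band(band, new_min=0.0, new_max=255.0):
    """Linearly rescale a band so its values span new_min to new_max."""
    band = np.asarray(band, dtype=float)
    old_min, old_max = band.min(), band.max()
    if old_max == old_min:
        raise ValueError("band has a single value and cannot be rescaled")
    return (band - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

# Hypothetical bands with very different value ranges.
narrow_band = np.array([[0.1, 0.2], [0.3, 0.5]])
wide_band = np.array([[0, 1000], [500, 250]])

narrow_scaled = rescale_band(narrow_band)
wide_scaled = rescale_band(wide_band)
```

After rescaling, both bands span the same range, so neither dominates the Euclidean distances used for clustering.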
