Dimension Reduction (Spatial Statistics)

Summary

Reduces the number of dimensions of a set of continuous variables by aggregating the highest possible amount of variance into fewer components using Principal Component Analysis (PCA) or Reduced-Rank Linear Discriminant Analysis (LDA). The variables are specified as fields in an input table or feature layer, and new fields representing the new variables are saved in the output table or feature class. The number of new fields will be fewer than the number of original variables while maintaining the highest possible amount of variance from all the original variables.

Dimension reduction is commonly used to explore multivariate relationships between variables and to reduce the computational cost of machine learning algorithms in which the required memory and processing time depend on the number of dimensions of the data. Using the components in place of the original data in analysis or machine learning algorithms can often provide comparable (or better) results while consuming fewer computational resources.

Learn more about how Dimension Reduction works

Illustration

Dimension Reduction illustration
Eight variables are reduced to three components.

Usage

  • At least two numeric fields must be provided in the Analysis Fields parameter because the data must have at least two dimensions to have its dimensions reduced.

  • There are two options for the Dimension Reduction Method parameter:

    • Principal Component Analysis (PCA)—This method sequentially builds components that each capture as much of the total variance and correlations between the original variables as possible. The Scale Data parameter can be used to put each original variable in the same scale so that each variable is given equal importance in the principal components. If the data is not scaled, variables with larger values will account for most of the total variance and will be overrepresented in the first several components. This method is recommended when you intend to perform an analysis or machine learning method in which the components are used to predict the value of a continuous variable.
    • Reduced-Rank Linear Discriminant Analysis (LDA)—This method builds components that maximize the separability of the analysis variables and different levels of a categorical variable provided in the Categorical Field parameter. The components will maintain as much between-category variance as possible so that the resulting components are most effective at classifying each record into one of the categories. This method automatically scales the data and is recommended when you intend to perform an analysis or machine learning method in which the components are used to classify the category of a categorical variable.
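The effect of the Scale Data parameter on total variance can be illustrated with a small standard-library sketch. It does not run PCA and does not reproduce the tool's internals; it only compares each variable's share of the total variance before and after scaling. The sample values and the helper names (center, standardize, variance_shares) are hypothetical.

```python
from statistics import mean, stdev, variance

def center(values):
    """Shift values to have a mean of zero (done for both scale options)."""
    m = mean(values)
    return [v - m for v in values]

def standardize(values):
    """Shift to mean zero and scale to a variance of one."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def variance_shares(columns):
    """Return each column's share of the total variance."""
    variances = [variance(col) for col in columns]
    total = sum(variances)
    return [v / total for v in variances]

# Hypothetical sample data: one variable with large values, one with small values
elevation_m = [120.0, 340.0, 560.0, 210.0, 430.0]
slope_ratio = [0.02, 0.05, 0.04, 0.03, 0.06]

# Without scaling, the large-valued variable holds nearly all of the variance
unscaled = variance_shares([center(elevation_m), center(slope_ratio)])

# With scaling, each variable contributes equally
scaled = variance_shares([standardize(elevation_m), standardize(slope_ratio)])

# Scaling also removes the effect of linear units (meters versus feet)
elevation_ft = [v * 3.28084 for v in elevation_m]
same_after_scaling = all(
    abs(a - b) < 1e-9
    for a, b in zip(standardize(elevation_m), standardize(elevation_ft))
)

print(unscaled, scaled, same_after_scaling)
```

In this sketch, the unscaled elevation values account for almost all of the total variance, which is why unscaled variables with larger values tend to dominate the first components.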

  • The geoprocessing messages display the percent and cumulative percent of variance maintained by each component.

  • The number of components that will be created depends on whether you specify values for the Minimum Percent Variance to Maintain and Minimum Number of Components parameters.

    • If one parameter is specified and the other is not, the value of the specified parameter determines the number of components. The number of components will be equal to the smallest number needed to satisfy the specified minimum.
    • If both parameters are specified, the larger of the two resulting numbers of components is used.
    • If neither parameter is specified, the number of components is determined using several statistical methods, and the tool will use the largest number of components recommended by any of the methods. For both dimension reduction methods, the methods include the Broken-Stick Method and Bartlett's Test of Sphericity. For PCA, a permutation test is also performed if the Number of Permutations parameter value is greater than zero.

    Information about the results of each test is displayed as geoprocessing messages.
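The rules for the two minimum parameters can be sketched in plain Python. This is an illustration of the logic described above, not the tool's implementation. The function names, the sample percentages, and the decision to cap the count at the number of available components are assumptions.

```python
from itertools import accumulate

def components_for_variance(percent_variance, min_variance):
    """Smallest number of leading components whose cumulative percent
    of variance reaches min_variance."""
    for count, cumulative in enumerate(accumulate(percent_variance), start=1):
        if cumulative >= min_variance - 1e-9:  # small tolerance for float sums
            return count
    return len(percent_variance)

def choose_component_count(percent_variance, min_variance=None, min_components=None):
    """Apply the documented rules: one minimum decides alone,
    and the larger count wins when both are given."""
    candidates = []
    if min_variance is not None:
        candidates.append(components_for_variance(percent_variance, min_variance))
    if min_components is not None:
        # Assumption: the count cannot exceed the number of available components
        candidates.append(min(min_components, len(percent_variance)))
    if not candidates:
        # With neither minimum, the tool relies on statistical tests instead
        raise ValueError("specify min_variance, min_components, or both")
    return max(candidates)

# Hypothetical percent of variance maintained by each of five components
percent_variance = [55.0, 25.0, 12.0, 5.0, 3.0]

print(choose_component_count(percent_variance, min_variance=90))
print(choose_component_count(percent_variance, min_components=2))
print(choose_component_count(percent_variance, min_variance=50, min_components=4))
```

With these sample percentages, a 90 percent minimum needs three components (55 + 25 + 12 = 92), while a 50 percent minimum combined with a four-component minimum yields four components.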

  • If a table is created by the Output Eigenvalues Table parameter, a Scree Plot chart is created in the output table to visualize the variance maintained by each component.

  • If a table is created by the Output Eigenvectors Table parameter, a bar chart is created in the output table to visualize each of the eigenvectors.

  • For additional information about PCA and Reduced-Rank LDA, see the following reference:

    • James, G., Witten, D., Hastie, T., Tibshirani, R. (2014). "An Introduction to Statistical Learning: with Applications in R." Springer Publishing Company, Incorporated. https://doi.org/10.1007/978-1-4614-7138-7

    For additional information about the methods for determining the number of components, see the following reference:

    • Peres-Neto, P., Jackson, D., Somers, K. (2005). "How many principal components? Stopping rules for determining the number of non-trivial axes revisited." Computational Statistics & Data Analysis. 49.4: 974-997. https://doi.org/10.1016/j.csda.2004.06.015
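As a rough illustration of the Broken-Stick Method named in the usage notes, the following sketch applies the textbook form of the rule: the k-th of p components is kept while its observed proportion of variance exceeds the expected proportion (1/p) × (1/k + 1/(k+1) + ... + 1/p). How the tool applies the rule, including tie handling, is not documented here, so those details are assumptions. The function names and eigenvalues are hypothetical.

```python
def broken_stick_proportions(p):
    """Expected proportion of variance for each of p components
    under the broken-stick model."""
    return [sum(1.0 / i for i in range(k, p + 1)) / p for k in range(1, p + 1)]

def broken_stick_count(eigenvalues):
    """Count the leading components whose observed proportion of variance
    exceeds the broken-stick expectation, stopping at the first failure."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    observed = [value / total for value in ordered]
    expected = broken_stick_proportions(len(ordered))
    count = 0
    for obs, exp in zip(observed, expected):
        if obs <= exp:  # assumption: a tie does not count as exceeding
            break
        count += 1
    return count

# Hypothetical eigenvalues for five components
eigenvalues = [5.0, 3.0, 0.5, 0.3, 0.2]
print(broken_stick_count(eigenvalues))
```

For these eigenvalues, the first two components exceed their broken-stick expectations (about 0.457 and 0.257) and the third does not, so the sketch recommends two components.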

Syntax

arcpy.stats.DimensionReduction(in_table, output_data, fields, {method}, {scale}, {categorical_field}, {min_variance}, {min_components}, {append_fields}, {output_eigenvalues_table}, {output_eigenvectors_table}, {number_of_permutations})
Parameter | Explanation | Data Type
in_table

The table or features containing the fields with the dimension that will be reduced.

Table View
output_data

The output table or feature class containing the resulting components of the dimension reduction.

Table
fields
[fields,...]

The fields representing the data with the dimension that will be reduced.

Field
method
(Optional)

Specifies the method to be used to reduce the dimensions of the analysis fields.

  • PCA: The analysis fields will be partitioned into components that each maintain the maximum proportion of the total variance. This is the default.
  • LDA: The analysis fields will be partitioned into components that each maintain the maximum between-category separability of a categorical variable.
String
scale
(Optional)

Specifies whether the values of each analysis field will be scaled to have a variance equal to one. This scaling ensures that each analysis field is given equal priority in the components. Scaling also removes the effect of linear units; for example, the same data measured in meters and feet will result in equivalent components. The values of the analysis fields will be shifted to have a mean of zero for both options.

  • SCALE_DATA: The values of each analysis field will be scaled to have a variance of one. This is the default.
  • NO_SCALE_DATA: The values of each analysis field will be shifted to have a mean of zero, but the variance will not be scaled.
Boolean
categorical_field
(Optional)

The field representing the categorical variable for LDA. The components will maintain the maximum amount of information needed to classify each input record into these categories.

Field
min_variance
(Optional)

The minimum percent of total variance of the analysis fields that must be maintained in the components. The total variance depends on whether the analysis fields were scaled using the Scale Data parameter.

Long
min_components
(Optional)

The minimum number of components.

Long
append_fields
(Optional)

Specifies whether all fields from the input table or features will be copied and appended to the output table or features. The fields provided in the fields parameter will be copied to the output regardless of the value of this parameter.

  • APPEND: All fields from the input table or features will be appended to the output.
  • NO_APPEND: Only the analysis fields will be included in the output. This is the default.
Boolean
output_eigenvalues_table
(Optional)

The output table containing the eigenvalues of each component.

Table
output_eigenvectors_table
(Optional)

The output table containing the eigenvectors of each component.

Table
number_of_permutations
(Optional)

Specifies the number of permutations to be used when determining the optimal number of components. The default value is 0, which indicates that no permutation test will be performed. The provided value must be equal to 0, 99, 199, 499, or 999. If any other value is provided, 0 will be used and no permutation test will be performed.

Long
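The documented rule for number_of_permutations can be mirrored in a small pre-check before calling the tool, so that an unsupported value is noticed in the script rather than silently replaced by 0. This is a sketch only; the helper names are hypothetical and are not part of arcpy. The semicolon-delimited field string follows the format used in the code samples.

```python
# Values accepted by the number_of_permutations parameter, per the description above
ALLOWED_PERMUTATIONS = (0, 99, 199, 499, 999)

def effective_permutations(requested):
    """Return the value the tool is documented to use:
    the requested value if it is allowed, otherwise 0."""
    return requested if requested in ALLOWED_PERMUTATIONS else 0

def join_fields(field_names):
    """Build the semicolon-delimited string used for the fields parameter."""
    return ";".join(field_names)

print(effective_permutations(99))
print(effective_permutations(100))
print(join_fields(["age_group_1", "age_group_2", "age_group_3"]))
```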

Code sample

DimensionReduction example 1 (Python window)

The following Python script demonstrates how to use the DimensionReduction tool.

arcpy.stats.DimensionReduction("DemographicData",
           "DemographicData_DimensionReduction",
           "age_group_1;age_group_2;age_group_3;age_group_4;age_group_5",
           "PCA", "NO_SCALE_DATA", None, None, 3,
           "NO_APPEND", "EigenValueTable", None, 99)

DimensionReduction example 2 (stand-alone script)

The following stand-alone Python script demonstrates how to use the DimensionReduction tool with the Reduced-Rank LDA method and a categorical field:

# Import system modules 
import arcpy

# Overwrite existing output, by default
arcpy.env.overwriteOutput = True

# Set the workspace
arcpy.env.workspace = r"c:\projects\dimensionreduction.gdb"

# Reduce the fields of population by age group using the Reduced-Rank LDA method;
# use "State" as the categorical field and write the eigenvectors table.
arcpy.stats.DimensionReduction("DemographicData",
           "DemographicData_DimensionReduction",
           "age_group_1;age_group_2;age_group_3;age_group_4;age_group_5",
           "LDA", "SCALE_DATA", "State", None, None,
           "APPEND", None, "EigenVectorTable", 0)

Licensing information

  • Basic: Yes
  • Standard: Yes
  • Advanced: Yes
