Similarity Search (Spatial Statistics)—ArcGIS Pro

Summary

Identifies which candidate features are most similar or most dissimilar to one or more input features based on feature attributes.

Learn more about how Similarity Search works

Usage

You will provide a layer containing the Input Features To Match values and a second layer containing the Candidate Features values from which matches will be obtained. Often these values will be in the same feature layer. One option is to create two separate datasets. The other option is to create layers with two different definition queries, which may be easier. For example, if you have a file with all crime incidents that have occurred over the past month and you want to find all of the crimes that are most similar to the latest carjacking, you can do the following:
- Copy the layer displaying all crime incidents to the Contents pane to make a duplicate layer. Then rename the layer.
- On the renamed layer, make a selection or set a definition query representing the latest carjacking feature. Use this layer for the Input Features To Match parameter.
- Apply a selection or set a definition query for the original layer so it excludes the latest carjacking. Use this layer for the Candidate Features parameter.
If there is more than one Input Features To Match value, matching is based on the averaged Attributes of Interest values. For example, if there are two Input Features To Match values and one of the Attributes of Interest values is a population variable, the tool will check for Candidate Features values with populations that are most like the average population values. If the population values are 100 and 102, for example, the tool will check for candidates with populations near 101.
Note:
When you provide more than one Input Features To Match value, make sure the features have similar Attributes of Interest values. If, for example, the population value for one of the inputs is 100 and the other input is 100,000, the tool will check for matches with populations near the average of those two values: 50,050. This averaged value is not close to the population for either of the Input Features To Match values.
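The averaging described above can be illustrated with a minimal sketch. It is not the tool's implementation; the `averaged_target` helper and the `POP` field name are hypothetical and exist only for this example.

```python
def averaged_target(input_features, attributes):
    """Average each attribute of interest across the input features."""
    return {
        name: sum(feature[name] for feature in input_features) / len(input_features)
        for name in attributes
    }

# Hypothetical input features with a single population attribute.
close_inputs = [{"POP": 100}, {"POP": 102}]
distant_inputs = [{"POP": 100}, {"POP": 100000}]

print(averaged_target(close_inputs, ["POP"]))    # {'POP': 101.0}
print(averaged_target(distant_inputs, ["POP"]))  # {'POP': 50050.0}
```

The second result shows why dissimilar inputs are a poor choice: the averaged target (50,050) resembles neither of the features it was built from.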
Output Features values always contain points unless the Input Features To Match and Candidate Features values are both polygons or both polylines. Creating polygon or polyline Output Features values can slow performance for large datasets. Check the Collapse Output To Points parameter to force point geometries for improved performance.
With the Most Or Least Similar parameter, you can search for features using the Most similar or Least similar option to match to the Input Features To Match values. It may be helpful to see both. If you enter 3 for the Number of Results parameter and choose Both for the Most Or Least Similar parameter, for example, the tool will return the three most similar and the three least similar candidate features.
Any given solution match in the Output Features parameter will be a solution that is either most similar or least similar to the target Input Features To Match parameter; a single solution cannot be both (and solution matches won't be duplicated in the Output Features parameter). Consequently, when you choose Both for the Most Or Least Similar parameter, the maximum number of resulting matches possible (Number of Results value) will be half the number of the Candidate Features values. When you enter a value for Number of Results that is too large, the tool will adjust it to the maximum possible.
To explore the spatial pattern of similarity, you can rank similarity for all the Candidate Features values. To do this, enter 0 for the Number of Results parameter. The tool will identify the number of valid features in the candidates dataset and write all of them to the Output Features parameter in rank order from most to least similar.
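The Number of Results rules above can be sketched as a small helper. This is an illustration under stated assumptions, not the tool's code: the function name is hypothetical, an odd number of candidates is assumed to be halved with floor division, and entering 0 together with Both is assumed to be capped the same way.

```python
def effective_number_of_results(requested, n_candidates, mode="MOST_SIMILAR"):
    """Return the number of matches per ranking after adjustment.

    Assumptions (not confirmed by the documentation): floor division is used
    when halving an odd candidate count, and 0 with BOTH means the capped maximum.
    """
    if requested < 0:
        raise ValueError("requested must be zero or a positive integer")
    # With BOTH, a candidate cannot appear in both rankings, so each ranking
    # can hold at most half of the candidates.
    maximum = n_candidates // 2 if mode == "BOTH" else n_candidates
    if requested == 0 or requested > maximum:
        return maximum
    return requested

print(effective_number_of_results(3, 10, "BOTH"))   # 3
print(effective_number_of_results(8, 10, "BOTH"))   # 5
print(effective_number_of_results(0, 10))           # 10
```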
The Match Method parameter options are Attribute values, Ranked attribute values, and Attribute profiles.
- Attribute values—The most similar candidates will have the smallest sum of squared differences for all of the Attributes of Interest values; all values are standardized before differences are calculated.
- Ranked attribute values—The most similar candidates will have the smallest sum of squared rank differences for all of the Attributes of Interest values. The Output Features parameter reports these sums in the SIMINDEX (Sum of Squared Rank Differences) field.
- Attribute profiles—The cosine similarity is measured. Cosine similarity checks for the same relationships among standardized attribute values rather than trying to match magnitudes. For example, there are four Attributes of Interest values called A1, A2, A3, and A4. A2 is twice as large as A1, A3 is almost equal to A2, and A4 is three times larger than A3. For Attribute profiles, the tool will check for candidates with those same attribute relationships: twice as large, then almost equal, then three times larger. Because this method is checking attribute relationships, you must specify a minimum of two Attributes of Interest values for this method. You can use the cosine similarity method (Attribute profiles) to find places such as Los Angeles but at a smaller scale overall. The cosine similarity index ranges from 1.0 (perfect similarity) to -1.0 (perfect dissimilarity). The cosine similarity index is written to the SIMINDEX (Cosine Similarity) field of the Output Features parameter.
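To make the difference between matching on magnitudes and matching on profiles concrete, here is a minimal sketch in plain Python. It is not the tool's implementation. Two simplifications are assumptions of this example: values are standardized as z-scores computed across the target and all candidates together, and the cosine is computed on the raw values rather than on standardized ones. All data values are made up for illustration.

```python
import math

def zscores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std if std else 0.0 for v in values]

def sum_squared_differences(target, candidates):
    """Sum of squared standardized differences between the target and each candidate."""
    rows = [target] + candidates
    # Standardize each attribute (column) across the target and all candidates.
    columns = [zscores(list(column)) for column in zip(*rows)]
    std_rows = [list(row) for row in zip(*columns)]
    std_target, std_candidates = std_rows[0], std_rows[1:]
    return [sum((c - t) ** 2 for c, t in zip(cand, std_target))
            for cand in std_candidates]

def cosine_similarity(a, b):
    """Cosine of the angle between two attribute vectors (1.0 means same profile)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical attribute values A1..A4 for a target and three candidates.
target = [10.0, 20.0, 21.0, 63.0]
candidates = [
    [1.0, 2.0, 2.1, 6.3],      # same proportions, one tenth the magnitude
    [11.0, 19.0, 22.0, 60.0],  # similar magnitudes
    [63.0, 21.0, 20.0, 10.0],  # reversed profile
]

ssd = sum_squared_differences(target, candidates)
cosines = [cosine_similarity(target, c) for c in candidates]
closest_by_value = ssd.index(min(ssd))             # candidate with similar magnitudes
closest_by_profile = cosines.index(max(cosines))   # candidate with the same proportions
print(closest_by_value, closest_by_profile)        # 1 0
```

In this sketch, the value-based index prefers the candidate with similar magnitudes, while the cosine index prefers the scaled-down candidate with identical attribute relationships, which mirrors the "Los Angeles but at a smaller scale" idea above.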
The Attributes of Interest values must be numeric and must be present (same field name and same field type) in both the Input Features To Match and the Candidate Features datasets. For the Attributes of Interest parameter, the tool will list all numeric fields found in the Input Features To Match dataset. If the tool doesn't find corresponding fields for the Candidate Features, a warning appears indicating the missing attributes were dropped from the analysis. If all of the Attributes of Interest values are dropped, no matching can occur and an error indicating the tool cannot perform the analysis is issued.
All of the attributes used for matching are written to the Output Features parameter. The Fields To Append To Output parameter allows you to include other fields in the output table. Because numeric Attributes of Interest fields are not effective identifiers, you can append a name or other identifier field for each solution match. If you need to decide among several matching solutions, you can append other nonnumeric attributes as well. If the solution you are seeking must be one of several land-use types, for example, appending a categorical land-use attribute will help you find solutions that meet this requirement. You can also include additional numeric attributes in the output table for reference purposes only. For example, suppose you are looking for suitable habitat for a particular animal. You can use known locations where the species is successful for the Input Features To Match parameter. You can select Attributes of Interest values that relate to species success. In addition, you can append a numeric area attribute to the Output Features values, not because you want to actually match on the area value of the target, but because ultimately you are looking for solutions with the largest areas possible.

All of the Input Features To Match values and solution matches are written to the Output Features parameter along with the Attributes of Interest and Fields To Append To Output fields. In addition, the following fields are included in the Output Features:


Field Name	Field Alias	Description	Notes
MATCH_ID	MATCH_ID	All of the target features in the Input Features To Match layer are listed first with their OID or FID identifier written to the MATCH_ID field. Solution matches have NULL values for this field.	When the Output Features value is a shapefile, NULL values are represented by a very large negative number (such as -21474836).
CAND_ID	CAND_ID	All of the solution matches are listed next and this value is their OID or FID identifier. The target features in the Input Features To Match layer have NULL values for this field.	When the Output Features value is a shapefile, NULL values are represented by a very large negative number (such as -21474836).
SIMRANK	Similarity Rank	When you choose Most similar or Both for the Most Or Least Similar parameter, all of the solution matches are ranked from most similar to least similar. The most similar solution match has a rank value of 1.	This field is only included in the Output Features value when you choose Most similar or Both for the Most Or Least Similar parameter.
DSIMRANK	Dissimilarity Rank	When you choose Least similar or Both for the Most Or Least Similar parameter, all of the solution matches are ranked from least similar to most similar. The solution that is least similar gets a rank value of 1.	This field is only included in the Output Features value when you choose Least similar or Both for the Most Or Least Similar parameter.
SIMINDEX	Sum of Squared Value Differences, Sum of Squared Rank Differences, or Cosine Similarity	This field quantifies how similar each solution match is to the target feature. When you specify Attribute values for the Match Method parameter, the field alias is Sum of Squared Value Differences. When you specify Ranked attribute values for the Match Method parameter, the field alias is Sum of Squared Rank Differences. When you specify Attribute profiles for the Match Method parameter, the field alias is Cosine Similarity. For more information about how these indices are computed, see How Similarity Search works.	If there is only one Input Features To Match value, the target feature is this feature. When more than one Input Features To Match value is specified, the target feature is a temporary feature created with averaged values for all the Attributes Of Interest values.
LABELRANK	Render Rank	This field is used for display purposes only. The tool uses this field to provide default rendering of the analysis results.
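As a rough illustration of how the fields above can be used once the output table has been read into Python, the sketch below separates target rows from solution matches and orders the matches by SIMRANK. The rows are hypothetical dictionaries, not real tool output, and `split_output_rows` is a helper invented for this example. It assumes NULL values arrive as `None`, which would not hold for shapefile output.

```python
def split_output_rows(rows):
    """Separate target rows from solution matches and sort matches by SIMRANK."""
    # Target features carry a MATCH_ID; solution matches carry a CAND_ID.
    targets = [row for row in rows if row["MATCH_ID"] is not None]
    matches = [row for row in rows if row["CAND_ID"] is not None]
    matches.sort(key=lambda row: row["SIMRANK"])
    return targets, matches

# Hypothetical rows. The SIMRANK value for target rows is not described in
# this topic, so None is used here purely as a placeholder.
rows = [
    {"MATCH_ID": 7, "CAND_ID": None, "SIMRANK": None},
    {"MATCH_ID": None, "CAND_ID": 31, "SIMRANK": 2},
    {"MATCH_ID": None, "CAND_ID": 12, "SIMRANK": 1},
]

targets, matches = split_output_rows(rows)
print([row["CAND_ID"] for row in matches])  # [12, 31]
```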

The Output Features layer is automatically added to the table of contents with default rendering applied to the LABELRANK field. The rendering applied is defined by a layer file in <ArcGIS Pro>\Resources\ArcToolBox\Templates\Layers. You can reapply the default rendering, if needed, using the Apply Symbology From Layer tool.
Note:
The default sample size is 10,000 records. When the Number Of Results value is larger than this default, you will need to increase the sample size to render all of the results. To increase the sample size, open the Symbology pane and click the Advanced symbology options tab. Expand Sample size and change the Maximum sample size value.

Syntax

arcpy.stats.SimilaritySearch(Input_Features_To_Match, Candidate_Features, Output_Features, Collapse_Output_To_Points, Most_Or_Least_Similar, Match_Method, Number_Of_Results, Attributes_Of_Interest, {Fields_To_Append_To_Output})

Parameter	Explanation	Data Type
Input_Features_To_Match	The layer, or a selection on a layer, containing the features you want to match; you are searching for other features that look like these features. When more than one feature is provided, matching is based on attribute averages. Tip: When the Input Features To Match and Candidate Features values are from a single dataset layer, you can do the following: Copy the layer to the Contents pane, making a duplicate layer. Rename the duplicate layer. On the renamed layer, make a selection or set a definition query for the reference features you want to match. Use the new layer created for the Input Features To Match parameter. Apply a selection or set a definition query on the original layer so it excludes the reference features. This will be the layer to use for the Candidate Features parameter.	Feature Layer
Candidate_Features	The layer, or a selection on a layer, containing candidate matching features. The tool will check for features most similar (or most dissimilar) to the Input_Features_To_Match values among these candidates.	Feature Layer
Output_Features	The output feature class containing a record for each of the Input_Features_To_Match values and for all the solution-matching features found.	Feature Class
Collapse_Output_To_Points	Specifies whether the geometry for the Output_Features parameter will be collapsed to points or will match the original geometry (lines or polygons) of the input features if the Input_Features_To_Match and Candidate_Features parameter values are both either lines or polygons. This parameter is only available with a Desktop Advanced license. Choosing COLLAPSE will improve tool performance for large line and polygon datasets. COLLAPSE —The line and polygon features will be represented as feature centroids (points). NO_COLLAPSE —The output geometry will match the line or polygon geometry of the input features. This is the default.	Boolean
Most_Or_Least_Similar	Specifies whether features that are most similar or most dissimilar to the Input_Features_To_Match values will be identified. MOST_SIMILAR —Features that are most similar will be identified. This is the default. LEAST_SIMILAR —Features that are most dissimilar will be identified. BOTH —Features that are most similar and features that are most dissimilar will both be identified.	String
Match_Method	Specifies whether matching will be based on values, ranks, or cosine relationships. ATTRIBUTE_VALUES —Matching will be based on the sum of squared standardized attribute value differences for all of the Attributes Of Interest values. This is the default. RANKED_ATTRIBUTE_VALUES —Matching will be based on the sum of squared rank differences for all of the Attributes Of Interest values. ATTRIBUTE_PROFILES —Matching will be computed as a function of cosine similarity for all of the Attributes Of Interest values.	String
Number_Of_Results	The number of solution matches to find. Entering zero or a number larger than the total number of Candidate_Features values will return rankings for all the candidate features. The default is 10.	Long
Attributes_Of_Interest [Attributes_Of_Interest,...]	The numeric attributes representing the matching criteria.	Field
Fields_To_Append_To_Output [Fields_To_Append_To_Output,...] (Optional)	The fields to include with the Output_Features parameter. These fields are not used to determine similarity; they are only included in the Output_Features parameter for reference.	Field
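The keyword values and parameter order in the table above can be checked before the tool is called. The following is a minimal sketch, not part of arcpy: `build_similarity_search_args` is a hypothetical helper that validates the keywords listed above, enforces the two-attribute minimum for ATTRIBUTE_PROFILES, and joins field lists with semicolons in the same way as the code samples in this topic.

```python
VALID_COLLAPSE = {"COLLAPSE", "NO_COLLAPSE"}
VALID_SIMILARITY = {"MOST_SIMILAR", "LEAST_SIMILAR", "BOTH"}
VALID_METHODS = {"ATTRIBUTE_VALUES", "RANKED_ATTRIBUTE_VALUES", "ATTRIBUTE_PROFILES"}

def build_similarity_search_args(inputs, candidates, output, attributes,
                                 append_fields=(), collapse="NO_COLLAPSE",
                                 most_or_least="MOST_SIMILAR",
                                 method="ATTRIBUTE_VALUES", number_of_results=10):
    """Validate keyword values and return arguments in the documented order."""
    if collapse not in VALID_COLLAPSE:
        raise ValueError(f"Unknown Collapse_Output_To_Points value: {collapse}")
    if most_or_least not in VALID_SIMILARITY:
        raise ValueError(f"Unknown Most_Or_Least_Similar value: {most_or_least}")
    if method not in VALID_METHODS:
        raise ValueError(f"Unknown Match_Method value: {method}")
    if not attributes:
        raise ValueError("At least one attribute of interest is required")
    if method == "ATTRIBUTE_PROFILES" and len(attributes) < 2:
        raise ValueError("ATTRIBUTE_PROFILES needs at least two attributes")
    # Multivalue field parameters are passed as semicolon-delimited strings.
    return (inputs, candidates, output, collapse, most_or_least, method,
            number_of_results, ";".join(attributes), ";".join(append_fields))

args = build_similarity_search_args(
    "Crime_selection", "AllCrime", "CJMatches",
    ["HEIGHT", "WEIGHT", "SEVERITY", "DST2CHPSHP"], ["Name", "WEAPON"],
    number_of_results=4)
print(args[7])  # HEIGHT;WEIGHT;SEVERITY;DST2CHPSHP
```

In an ArcGIS Pro Python environment, the resulting tuple could then be unpacked into the call, for example `arcpy.stats.SimilaritySearch(*args)`.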

Code sample

SimilaritySearch example 1 (Python window)

The following Python window script demonstrates how to use the SimilaritySearch function.

import arcpy
import arcpy.stats as SS
arcpy.env.workspace = r"C:\Analysis"
SS.SimilaritySearch("Crime_selection", "AllCrime", "c:\\Analysis\\CrimeMatches", 
                    "NO_COLLAPSE", "MOST_SIMILAR", "ATTRIBUTE_VALUES", 4, 
                    "HEIGHT;WEIGHT;SEVERITY;DST2CHPSHP", "Name;WEAPON")

SimilaritySearch example 2 (stand-alone script)

The following stand-alone Python script demonstrates how to use the SimilaritySearch function.

# Similarity Search of crime data in a metropolitan area

# Import system modules
import arcpy
import os
import arcpy.stats as SS

# Set property to overwrite existing output
arcpy.env.overwriteOutput = True

try:
    # Set the current workspace (to avoid having to specify the full path to
    # the feature classes each time)
    arcpy.env.workspace = r"C:\Analysis"

    # Make a layer from the crime feature class
    arcpy.MakeFeatureLayer_management("AllCrime", "Crime_selection") 

    # Select the target crime to match
    # Process: Select By Attribute
    arcpy.SelectLayerByAttribute_management("Crime_selection", "NEW_SELECTION",
                                            '"OBJECTID" = 1230043')

    # Use Similarity Search to find the four crimes most similar to the
    # selected crime, based on the attributes of interest
    # Process: Similarity Search
    SS.SimilaritySearch("Crime_selection", "AllCrime", "CJMatches", 
                        "NO_COLLAPSE", "MOST_SIMILAR", "ATTRIBUTE_VALUES", 4,
                        "HEIGHT;WEIGHT;SEVERITY;DST2CHPSHP", "Name;WEAPON")
    
except arcpy.ExecuteError:
    # If an error occurred when running the tool, print out the error message.
    print(arcpy.GetMessages())

Environments

Output Coordinate System, Geographic Transformations, Current Workspace, Scratch Workspace, Qualified Field Names, Output has M values, M Resolution, M Tolerance, Output has Z values, Default Output Z Value, Z Resolution, Z Tolerance, XY Resolution, XY Tolerance

Licensing information

Basic: Yes
Standard: Yes
Advanced: Yes
