Similarity Search (Spatial Statistics)—ArcGIS Pro

Summary

Identifies which candidate features are most similar or most dissimilar to one or more input features based on feature attributes.

Learn more about how Similarity Search works

Usage

You will provide a layer containing the Input Features To Match and a second layer containing the Candidate Features from which matches will be obtained. Often your Input Features To Match and your Candidate Features will be in the same feature layer. While one option is to create two separate datasets, you don't have to do this. It may be easier to create layers with two different definition queries instead. Suppose you have a file with all crime incidents that have occurred over the past month. If you want to find all of the crimes that are most similar to the latest carjacking, you could do the following:
- Copy the layer displaying all crime incidents and paste it in Contents to make a duplicate layer. Then rename the layer.
- On the renamed layer, make a selection or set a definition query representing the latest carjacking feature. Use this layer for the Input Features To Match parameter.
- Apply a selection or set a definition query for the original layer so it excludes the latest carjacking. Use this layer for the Candidate Features parameter.
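The two definition queries in the steps above can be sketched in Python. This is only an illustration: it assumes the incidents have an integer OBJECTID field, the ID 1230043 is a hypothetical stand-in for the latest carjacking, and the helper name build_layer_queries is invented for this example. The resulting strings could be supplied as the where_clause argument of arcpy.management.MakeFeatureLayer to create the two layers.

```python
def build_layer_queries(id_field, target_id):
    """Return (input_query, candidate_query) as SQL where clauses.

    input_query keeps only the feature to match; candidate_query keeps
    every other feature. Hypothetical helper, not part of arcpy.
    """
    input_query = f"{id_field} = {target_id}"
    candidate_query = f"{id_field} <> {target_id}"
    return input_query, candidate_query


# Hypothetical ID for the latest carjacking incident
input_query, candidate_query = build_layer_queries("OBJECTID", 1230043)
print(input_query)      # OBJECTID = 1230043
print(candidate_query)  # OBJECTID <> 1230043
```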
If there is more than one Input Features To Match, matching is based on averaged Attributes of Interest values. So, for example, if there are two Input Features To Match and one of the Attributes of Interest is a population variable, the tool will look for Candidate Features with populations that are most like the average population values. If the population values are 100 and 102, for example, the tool will look for candidates with populations near 101.
Note:
When you have more than one Input Features To Match, you will want to select Attributes of Interest with similar values. If, for example, the population value for one of the inputs is 100 and the other input is 100,000, the tool will look for matches with populations near the average of those two values: 50,050. Notice that this averaged value is nothing like the population for either of the Input Features To Match.
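The averaging described above can be reproduced with a few lines of Python. This is a minimal sketch, assuming each input feature is represented as a dictionary of attribute values; the function name averaged_target and the POP field are hypothetical and not part of the tool.

```python
def averaged_target(input_features, attributes):
    """Average each attribute across all input features to match."""
    count = len(input_features)
    return {
        attr: sum(feature[attr] for feature in input_features) / count
        for attr in attributes
    }


# Similar inputs: the average is representative of both features
similar = averaged_target([{"POP": 100}, {"POP": 102}], ["POP"])
print(similar["POP"])     # 101.0

# Dissimilar inputs: the average resembles neither feature
dissimilar = averaged_target([{"POP": 100}, {"POP": 100000}], ["POP"])
print(dissimilar["POP"])  # 50050.0
```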
Output Features will always contain points unless the Input Features To Match and the Candidate Features are both polygons or both polylines. Creating polygon or polyline Output Features can slow performance for large datasets, so you can check the Collapse Output To Points parameter to force point geometries for improved performance.
With the Most Or Least Similar parameter, you can search for features that are either Most similar or Least similar to the Input Features To Match. In some cases you will want to see both ends of the spectrum. If you enter 3 for the Number of Results parameter and Both for the Most Or Least Similar parameter, for example, the tool will return the three most similar and the three least similar candidate features.
Any given solution match in the Output Features will either be a solution that is most similar or least similar to the target Input Features To Match; a single solution cannot be both (and solution matches won't be duplicated in the Output Features). Consequently, when you select Both for the Most Or Least Similar parameter, the maximum number of resulting matches possible (Number of Results) will be half the number of Candidate Features. When you enter a Number of Results value that is too large, the tool will adjust it to the maximum possible.
Sometimes, in order to explore the spatial pattern of similarity, you will want to rank similarity for all the Candidate Features. An easy way to indicate that you want all of the Candidate Features to be ranked is to enter 0 for the Number of Results parameter. The tool will then determine the number of valid features in the candidates dataset and write all of them to the Output Features in rank order from most to least similar.
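The rules for the Number of Results parameter described above can be summarized in a short sketch. It is an approximation of the documented behavior, not the tool's implementation: the function name is hypothetical, rounding down for an odd number of candidates is an assumption, and so is treating 0 combined with Both as "half of the candidates at each end".

```python
def effective_number_of_results(requested, n_candidates, most_or_least):
    """Return the number of results the tool would be expected to use."""
    # With BOTH, a candidate cannot be both most and least similar, so each
    # end of the spectrum holds at most half of the candidates.
    # Rounding down for odd counts is an assumption.
    if most_or_least == "BOTH":
        limit = n_candidates // 2
    else:
        limit = n_candidates
    # 0 means "rank everything"; oversized requests are adjusted to the limit.
    if requested == 0 or requested > limit:
        return limit
    return requested


print(effective_number_of_results(3, 100, "BOTH"))          # 3
print(effective_number_of_results(80, 100, "BOTH"))         # 50
print(effective_number_of_results(0, 100, "MOST_SIMILAR"))  # 100
```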
For the Match Method parameter, you may select Attribute values, Ranked attribute values, or Attribute profiles.
- For Attribute values, the most similar candidates will have the smallest sum of squared differences for all the Attributes of Interest; all values are standardized before differences are calculated.
- For Ranked attribute values, the most similar candidates will have the smallest sum of squared rank differences for all the Attributes of Interest. The Output Features reports these sums in the SIMINDEX (Sum of Squared Rank Differences) field.
- For Attribute profiles, the cosine similarity is measured. Cosine similarity looks for the same relationships among standardized attribute values rather than trying to match magnitudes. Suppose there are four Attributes of Interest called A1, A2, A3, and A4, and that A2 is twice as large as A1, A3 is almost equal to A2, and A4 is three times larger than A3. For the Attribute profiles Match Method the tool will be looking for candidates with those same attribute relationships: twice as large, then almost equal, then three times larger. Because this method is looking at attribute relationships, you must specify a minimum of two Attributes of Interest for this method. You might use the cosine similarity method (Attribute profiles) to find places like Los Angeles, but at a smaller scale overall. The cosine similarity index ranges from 1.0 (perfect similarity) to -1.0 (perfect dissimilarity). The cosine similarity index is written to the Output Features SIMINDEX (Cosine Similarity) field.
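To make the difference between matching on magnitudes and matching on profiles concrete, here is a minimal Python sketch of a sum of squared differences and a cosine similarity. It is a simplification under stated assumptions: the tool standardizes attribute values first, while this sketch works on raw values, and the function names and sample numbers are invented for illustration. Both vectors must contain at least one nonzero value for the cosine calculation.

```python
import math


def sum_squared_differences(target, candidate):
    """Smaller values mean the magnitudes are more alike."""
    return sum((t - c) ** 2 for t, c in zip(target, candidate))


def cosine_similarity(target, candidate):
    """1.0 means identical profile; -1.0 means opposite profile."""
    dot = sum(t * c for t, c in zip(target, candidate))
    norm_t = math.sqrt(sum(t * t for t in target))
    norm_c = math.sqrt(sum(c * c for c in candidate))
    return dot / (norm_t * norm_c)


# A1..A4 following the text: A2 is twice A1, A3 equals A2 (for simplicity),
# and A4 is three times A3.
target = [1.0, 2.0, 2.0, 6.0]
same_profile = [10.0, 20.0, 20.0, 60.0]   # same relationships, 10x the scale
opposite = [-1.0, -2.0, -2.0, -6.0]       # reversed relationships

print(sum_squared_differences(target, same_profile))      # 3645.0
print(round(cosine_similarity(target, same_profile), 6))  # 1.0
print(round(cosine_similarity(target, opposite), 6))      # -1.0
```

The candidate at ten times the scale is a poor match on magnitudes but a perfect match on profile, which is the behavior described for finding places like Los Angeles at a smaller scale.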
The Attributes of Interest must be numeric and must be present (same field name and same field type) in both the Input Features To Match and the Candidate Features datasets. For the Attributes of Interest parameter, the tool will list all numeric fields found in the Input Features To Match dataset. If the tool doesn't find corresponding fields for the Candidate Features, you will see a warning indicating the missing attributes were dropped from the analysis. If all of the Attributes of Interest are dropped, the tool has nothing to use for matching and you will get an error indicating the tool cannot perform the analysis.
All of the attributes used for matching are written to the Output Features. The Fields To Append To Output parameter allows you to include other fields in the output table, if desired. Because numeric Attributes of Interest fields are probably not effective identifiers, you may want to append a name or other identifier field for each solution match. If you need to decide among several matching solutions, you may want to append other nonnumeric attributes as well. If the solution you are seeking must be one of several land-use types, for example, appending a categorical land-use attribute will help you home in on solutions that meet this requirement. Sometimes you will want to include additional numeric attributes in the output table for reference purposes only. Suppose, for example, you are looking for suitable habitat for a particular animal. You can use known locations where the species is successful for the Input Features To Match. You can select Attributes of Interest that relate to species success. In addition, you might append a numeric area attribute to the Output Features, not because you want to actually match on the area value of the target, but because ultimately you are looking for solutions with the largest areas possible.

All of the Input Features To Match and solution matches are written to the Output Features along with Attributes of Interest and the Fields To Append To Output. In addition, the following fields are included in the Output Features:


Field Name	Field Alias	Description	Notes
MATCH_ID	MATCH_ID	All of the target features in the Input Features To Match layer are listed first with their OID or FID identifier written to the MATCH_ID field. Solution matches have NULL values for this field.	When the Output Features is a shapefile, NULL values are represented by a very large negative number (such as -21474836).
CAND_ID	CAND_ID	All of the solution matches are listed next and this value is their OID or FID identifier. The target features in the Input Features To Match layer have NULL values for this field.	When the Output Features is a shapefile, NULL values are represented by a very large negative number (such as -21474836).
SIMRANK	Similarity Rank	When you select Most similar or Both for the Most Or Least Similar parameter, all of the solution matches are ranked from most similar to least similar. The most similar solution match has a rank value of 1.	This field is only included in the Output Features when you select Most similar or Both for the Most Or Least Similar parameter.
DSIMRANK	Dissimilarity Rank	When you select Least similar or Both for the Most Or Least Similar parameter, all of the solution matches are ranked from least similar to most similar. The solution that is least similar gets a rank value of 1.	This field is only included in the Output Features when you select Least similar or Both for the Most Or Least Similar parameter.
SIMINDEX	Sum of Squared Value Differences, Sum of Squared Rank Differences, or Cosine Similarity	This field quantifies how similar each solution match is to the target feature. When you specify Attribute values for the Match Method, the field alias is Sum of Squared Value Differences. When you specify Ranked attribute values for the Match Method, the field alias is Sum of Squared Rank Differences. When you specify Attribute profiles for the Match Method, the field alias is Cosine Similarity. For more information about how these indices are computed, see How Similarity Search Works.	If there is only one Input Features To Match, the target feature is this feature. When more than one Input Features To Match is specified, the target feature is a temporary feature created with averaged values for all the Attributes Of Interest.
LABELRANK	Render Rank	This field is used for display purposes only. The tool uses this field to provide default rendering of the analysis results.
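
The MATCH_ID and CAND_ID fields described in the table can be used to separate target features from solution matches. The sketch below assumes the output rows have already been read into Python dictionaries (for example, with a search cursor) and that NULL values arrive as None, as they would from a geodatabase feature class rather than a shapefile. The sample rows and the function name are hypothetical.

```python
def split_output_rows(rows):
    """Separate target features from solution matches, best match first."""
    # Targets carry a MATCH_ID; solution matches carry a CAND_ID.
    targets = [row for row in rows if row["MATCH_ID"] is not None]
    matches = [row for row in rows if row["CAND_ID"] is not None]
    # SIMRANK 1 is the most similar solution match.
    matches.sort(key=lambda row: row["SIMRANK"])
    return targets, matches


# Hypothetical rows: one target feature followed by two solution matches
rows = [
    {"MATCH_ID": 7, "CAND_ID": None, "SIMRANK": None},
    {"MATCH_ID": None, "CAND_ID": 42, "SIMRANK": 2},
    {"MATCH_ID": None, "CAND_ID": 15, "SIMRANK": 1},
]
targets, matches = split_output_rows(rows)
print([row["MATCH_ID"] for row in targets])  # [7]
print([row["CAND_ID"] for row in matches])   # [15, 42]
```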

The Output Features layer is automatically added to the table of contents with default rendering applied to the LABELRANK field. The rendering applied is defined by a layer file in <ArcGIS Pro>\Resources\ArcToolBox\Templates\Layers. You can reapply the default rendering, if needed, by using the Apply Symbology From Layer tool.
Note:
The default sample size is 10,000 records. When the Number Of Results is larger than this default, you will need to increase the sample size to render all of the results. To increase the sample size, click on the Symbology pane and choose Advanced. Expand the Sample size option and change the Maximum sample size value.

Syntax

SimilaritySearch(Input_Features_To_Match, Candidate_Features, Output_Features, Collapse_Output_To_Points, Most_Or_Least_Similar, Match_Method, Number_Of_Results, Attributes_Of_Interest, {Fields_To_Append_To_Output})

Parameter	Explanation	Data Type
Input_Features_To_Match	The layer (or a selection on a layer) containing the features you want to match; you are searching for other features that look like these features. When more than one feature is provided, matching is based on attribute averages. Tip: When your Input Features To Match and Candidate Features come from a single dataset layer, you can copy and paste the layer to Contents, making a duplicate layer. Rename the duplicate layer. On the renamed layer, make a selection or set a definition query for the reference features you want to match. Use the renamed layer for the Input Features To Match parameter. Apply a selection or set a definition query on the original layer so it excludes the reference features. This will give you the layer to use for the Candidate Features parameter.	Feature Layer
Candidate_Features	The layer (or a selection on a layer) containing candidate matching features. The tool will look for features most like (or most unlike) the Input_Features_To_Match among these candidates.	Feature Layer
Output_Features	The output feature class contains a record for each of the Input_Features_To_Match and for all the solution-matching features found.	Feature Class
Collapse_Output_To_Points	When the Input_Features_To_Match and the Candidate_Features are both either lines or polygons, you may choose whether you want the geometry for the Output_Features to be collapsed to points or to match the original geometry (lines or polygons) of the input features. This option is only available with a Desktop Advanced license. Choosing COLLAPSE for large line or polygon datasets will improve tool performance. COLLAPSE —The line and polygon features will be represented as feature centroids (points). NO_COLLAPSE —The output geometry will match the line or polygon geometry of the input features. This is the default.	Boolean
Most_Or_Least_Similar	Choose whether you are interested in features that are most alike or most different to the Input_Features_To_Match. MOST_SIMILAR —Find the features that are most alike. LEAST_SIMILAR —Find the features that are most different. BOTH —Find both the features that are most alike and the features that are most different.	String
Match_Method	Choose whether matching should be based on values, ranks, or cosine relationships. ATTRIBUTE_VALUES —Similarity or dissimilarity will be based on the sum of squared standardized attribute value differences for all the Attributes Of Interest. RANKED_ATTRIBUTE_VALUES —Similarity or dissimilarity will be based on the sum of squared rank differences for all the Attributes Of Interest. ATTRIBUTE_PROFILES —Similarity or dissimilarity will be computed as a function of cosine similarity for all the Attributes Of Interest.	String
Number_Of_Results	The number of solution matches to find. Entering zero or a number larger than the total number of Candidate_Features will return rankings for all the candidate features.	Long
Attributes_Of_Interest [field,...]	A list of numeric attributes representing the matching criteria.	Field
Fields_To_Append_To_Output [field,...] (Optional)	An optional list of attributes to include with the Output_Features. You might want to include a name identifier, categorical field, or date field, for example. These fields are not used to determine similarity; they are only included in the Output_Features for your reference.	Field
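
The code samples on this page pass the multivalue parameters Attributes_Of_Interest and Fields_To_Append_To_Output as semicolon-delimited strings. When the field names are held in Python lists, such a string can be built as sketched below; the helper name to_multivalue is hypothetical, and the field names are the ones used in the samples.

```python
def to_multivalue(fields):
    """Join field names into a semicolon-delimited multivalue string."""
    return ";".join(fields)


attributes_of_interest = to_multivalue(
    ["HEIGHT", "WEIGHT", "SEVERITY", "DST2CHPSHP"])
fields_to_append = to_multivalue(["Name", "WEAPON"])

print(attributes_of_interest)  # HEIGHT;WEIGHT;SEVERITY;DST2CHPSHP
print(fields_to_append)        # Name;WEAPON
```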

Code sample

SimilaritySearch example 1 (Python window)

The following Python window script demonstrates how to use the SimilaritySearch tool.

import arcpy
import arcpy.stats as SS
arcpy.env.workspace = r"C:\Analysis"
SS.SimilaritySearch("Crime_selection", "AllCrime", r"C:\Analysis\CrimeMatches",
                    "NO_COLLAPSE", "MOST_SIMILAR", "ATTRIBUTE_VALUES", 4,
                    "HEIGHT;WEIGHT;SEVERITY;DST2CHPSHP", "Name;WEAPON")

SimilaritySearch example 2 (stand-alone Python script)

The following stand-alone Python script demonstrates how to use the SimilaritySearch tool.

# Similarity Search of crime data in a metropolitan area

# Import system modules
import arcpy
import arcpy.stats as SS

# Set property to overwrite existing output
arcpy.env.overwriteOutput = True

try:
    # Set the current workspace (to avoid having to specify the full path to
    # the feature classes each time)
    arcpy.env.workspace = r"C:\Analysis"

    # Make a layer from the crime feature class
    arcpy.MakeFeatureLayer_management("AllCrime", "Crime_selection") 

    # Select the target crime to match
    # Process: Select By Attribute
    arcpy.SelectLayerByAttribute_management("Crime_selection", "NEW_SELECTION",
                                            '"OBJECTID" = 1230043')

    # Use Similarity Search to find the four crimes most similar to the
    # selected incident, based on the attributes of interest
    # Process: Similarity Search
    SS.SimilaritySearch("Crime_selection", "AllCrime", "CJMatches", "NO_COLLAPSE",
                        "MOST_SIMILAR", "ATTRIBUTE_VALUES", 4,
                        "HEIGHT;WEIGHT;SEVERITY;DST2CHPSHP", "Name;WEAPON")
    
except arcpy.ExecuteError:
    # If an error occurred when running the tool, print out the error message.
    print(arcpy.GetMessages())

Environments

Output Coordinate System, Geographic Transformations, Current Workspace, Scratch Workspace, Qualified Field Names, Output has M values, M Resolution, M Tolerance, Output has Z values, Default Output Z Value, Z Resolution, Z Tolerance, XY Resolution, XY Tolerance

Licensing information

Basic: Yes
Standard: Yes
Advanced: Yes