Summary
Identifies which candidate features are most similar or most dissimilar to one or more input features based on feature attributes.
Illustration
Usage
You will provide a layer containing the Input Features To Match values and a second layer containing the Candidate Features values from which matches will be obtained. Often these values will be in the same feature layer. One option is to create two separate datasets. The other option is to create layers with two different definition queries, which may be easier. For example, if you have a file with all crime incidents that have occurred over the past month and you want to find all of the crimes that are most similar to the latest carjacking, you can do the following:
- Copy the layer displaying all crime incidents to the Contents pane to make a duplicate layer. Then rename the layer.
- On the renamed layer, make a selection or set a definition query representing the latest carjacking feature. Use this layer for the Input Features To Match parameter.
- Apply a selection or set a definition query for the original layer so it excludes the latest carjacking. Use this layer for the Candidate Features parameter.
If there is more than one Input Features To Match value, matching is based on the averaged Attributes of Interest values. For example, if there are two Input Features To Match values and one of the Attributes of Interest values is a population variable, the tool will check for Candidate Features values with populations that are most like the average population values. If the population values are 100 and 102, for example, the tool will check for candidates with populations near 101.
Note:
When you have more than one Input Features To Match value, select similar values from the Attributes of Interest parameter. If, for example, the population value for one of the inputs is 100 and the other input is 100,000, the tool will check for matches with populations near the average of those two values: 50,050. This averaged value is not close to the population for either of the Input Features To Match values.
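The effect of averaging can be checked with a few lines of Python. This is only a sketch of the arithmetic described above, assuming a plain arithmetic mean; the function name `averaged_target` is hypothetical and is not part of the tool.

```python
from statistics import mean


def averaged_target(values):
    """Return the value to match for one attribute, assuming the target
    is the plain arithmetic mean of the input feature values."""
    return mean(values)


# Similar inputs: the averaged target stays close to both inputs.
print(averaged_target([100, 102]))      # 101

# Dissimilar inputs: the averaged target resembles neither input.
print(averaged_target([100, 100000]))   # 50050
```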
Output Features values always contain points unless the Input Features To Match and Candidate Features values are both polygons or both polylines. Creating polygon or polyline Output Features values can slow performance for large datasets. Check the Collapse Output To Points parameter to force point geometries for improved performance.
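The geometry rule above can be written as a small decision function. This is a sketch of the rule as stated in this topic, not code taken from the tool; the function name and the geometry labels ("Point", "Polyline", "Polygon") are illustrative assumptions.

```python
def output_geometry(input_type, candidate_type, collapse_to_points):
    """Return the expected Output Features geometry type.

    The geometry labels are illustrative strings, not values read from
    the tool or from a dataset.
    """
    both_same = input_type == candidate_type
    if (not collapse_to_points and both_same
            and input_type in ("Polyline", "Polygon")):
        return input_type
    # Mixed geometry types, point inputs, or a collapsed output all
    # produce points.
    return "Point"


print(output_geometry("Polygon", "Polygon", False))   # Polygon
print(output_geometry("Polygon", "Polyline", False))  # Point
print(output_geometry("Polygon", "Polygon", True))    # Point
```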
With the Most Or Least Similar parameter, you can search for the candidate features that are most similar or least similar to the Input Features To Match values. It may be helpful to see both. If you enter 3 for the Number of Results parameter and choose Both for the Most Or Least Similar parameter, for example, the tool will return the three most similar and the three least similar candidate features.
Any given solution match in the Output Features parameter will be a solution that is either most similar or least similar to the target Input Features To Match parameter; a single solution cannot be both (and solution matches won't be duplicated in the Output Features parameter). Consequently, when you choose Both for the Most Or Least Similar parameter, the maximum number of resulting matches possible (Number of Results value) will be half the number of the Candidate Features values. When you enter a value for Number of Results that is too large, the tool will adjust it to the maximum possible.
To explore the spatial pattern of similarity, you can rank similarity for all the Candidate Features values. To do this, enter 0 for the Number of Results parameter. The tool will identify the number of valid features in the candidates dataset and write all of them to the Output Features parameter in rank order from most to least similar.
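The rules for the Number of Results value can be summarized in a short function. This is a sketch under stated assumptions, not the tool's implementation: the function name is hypothetical, odd candidate counts are assumed to round down when halved, and entering 0 with the Both option is assumed to return the maximum possible number of matches.

```python
def effective_number_of_results(requested, candidate_count, option):
    """Return the number of matches reported for each ranking.

    option is one of the display names "Most similar", "Least similar",
    or "Both".
    """
    if option == "Both":
        # A candidate cannot be both most and least similar, so each
        # ranking can use at most half of the candidates (assumed floor).
        maximum = candidate_count // 2
    else:
        maximum = candidate_count
    # Zero means "rank everything"; oversized requests are adjusted down.
    if requested == 0 or requested > maximum:
        return maximum
    return requested


print(effective_number_of_results(3, 100, "Both"))          # 3
print(effective_number_of_results(80, 100, "Both"))         # 50
print(effective_number_of_results(0, 100, "Most similar"))  # 100
```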
The Match Method parameter options are Attribute values, Ranked attribute values, and Attribute profiles.
- Attribute values—The most similar candidates will have the smallest sum of squared differences for all of the Attributes of Interest values; all values are standardized before differences are calculated.
- Ranked attribute values—The most similar candidates will have the smallest sum of squared rank differences for all of the Attributes of Interest values. The Output Features parameter reports these sums in the SIMINDEX (Sum of Squared Rank Differences) field.
- Attribute profiles—The cosine similarity is measured. Cosine similarity checks for the same relationships among standardized attribute values rather than trying to match magnitudes. For example, there are four Attributes of Interest values called A1, A2, A3, and A4. A2 is twice as large as A1, A3 is almost equal to A2, and A4 is three times larger than A3. For Attribute profiles, the tool will check for candidates with those same attribute relationships: twice as large, then almost equal, then three times larger. Because this method is checking attribute relationships, you must specify a minimum of two Attributes of Interest values for this method. You can use the cosine similarity method (Attribute profiles) to find places such as Los Angeles but at a smaller scale overall. The cosine similarity index ranges from 1.0 (perfect similarity) to -1.0 (perfect dissimilarity). The cosine similarity index is written to the SIMINDEX (Cosine Similarity) field of the Output Features parameter.
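The three match methods can be illustrated with plain Python. This is a minimal sketch, not the tool's implementation. It assumes that standardization means z-scores computed over the target and candidates together, and that ranks are assigned within each attribute across all features; this topic does not specify either detail. All function names and data values are hypothetical.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each attribute column (assumed form of standardization)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) for col in columns]
    return [
        [(v - m) / s if s else 0.0 for v, m, s in zip(row, means, stdevs)]
        for row in rows
    ]


def rank_columns(rows):
    """Replace each value with its rank within its column (1 = smallest).
    Ties are broken by row order, which is a simplification."""
    columns = list(zip(*rows))
    ranked = []
    for col in columns:
        order = sorted(range(len(col)), key=lambda i: col[i])
        ranks = [0] * len(col)
        for rank, i in enumerate(order, start=1):
            ranks[i] = rank
        ranked.append(ranks)
    return [list(r) for r in zip(*ranked)]


def sum_squared_differences(target, candidate):
    """Smaller values mean the candidate is more similar to the target."""
    return sum((t - c) ** 2 for t, c in zip(target, candidate))


def cosine_similarity(a, b):
    """1.0 means the same attribute profile; -1.0 means the opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical data: the first row is the target, the rest are candidates.
rows = [[100, 10], [102, 11], [300, 50]]

# Attribute values: compare standardized values.
z = standardize(rows)
print(sum_squared_differences(z[0], z[1]) < sum_squared_differences(z[0], z[2]))  # True

# Ranked attribute values: compare ranks instead of values.
r = rank_columns(rows)
print(sum_squared_differences(r[0], r[1]))  # 2

# Attribute profiles: same relationships at twice the magnitude.
print(round(cosine_similarity([1, 2, 2, 6], [2, 4, 4, 12]), 6))  # 1.0
```

The last line mirrors the A1 through A4 example: the second vector has the same relationships among attributes at twice the magnitude, so its cosine similarity is 1.0. For simplicity the vectors in that line are not standardized first, whereas the tool compares standardized values.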
The Attributes of Interest values must be numeric and must be present (same field name and same field type) in both the Input Features To Match and the Candidate Features datasets. For the Attributes of Interest parameter, the tool will list all numeric fields found in the Input Features To Match dataset. If the tool doesn't find corresponding fields for the Candidate Features, a warning appears indicating the missing attributes were dropped from the analysis. If all of the Attributes of Interest values are dropped, no matching can occur and an error indicating the tool cannot perform the analysis is issued.
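The field check described above can be sketched as follows. This is an illustration of the stated rule, not the tool's code: the function name is hypothetical, and each dataset's schema is represented as a simple dictionary of field name to field type (in practice, these could be built from `arcpy.ListFields`).

```python
def usable_attributes(attributes_of_interest, input_fields, candidate_fields):
    """Keep only attributes present in both datasets with the same type.

    input_fields and candidate_fields map field names to field type
    labels such as "Double" or "Integer" (illustrative values).
    """
    kept, dropped = [], []
    for name in attributes_of_interest:
        same_type = (name in candidate_fields
                     and candidate_fields[name] == input_fields.get(name))
        (kept if same_type else dropped).append(name)
    if dropped:
        # Corresponds to the warning described above.
        print("Warning: attributes dropped from the analysis: "
              + ", ".join(dropped))
    if not kept:
        # Corresponds to the error issued when nothing is left to match on.
        raise ValueError("No Attributes of Interest remain; "
                         "the analysis cannot be performed.")
    return kept


input_schema = {"POP": "Double", "INCOME": "Double"}
candidate_schema = {"POP": "Double"}
print(usable_attributes(["POP", "INCOME"], input_schema, candidate_schema))
```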
All of the attributes used for matching are written to the Output Features parameter. The Fields To Append To Output parameter allows you to include other fields in the output table. Because numeric Attributes of Interest fields are not effective identifiers, you can append a name or other identifier field for each solution match. If you need to decide among several matching solutions, you can append other nonnumeric attributes as well. If the solution you are seeking must be one of several land-use types, for example, appending a categorical land-use attribute will help you find solutions that meet this requirement. You can also include additional numeric attributes in the output table for reference purposes only. For example, you are looking for suitable habitat for a particular animal. You can use known locations where the species is successful for the Input Features To Match parameter. You can select Attributes of Interest values that relate to species success. In addition, you can append a numeric area attribute to the Output Features values, not because you want to actually match on the area value of the target, but because ultimately you are looking for solutions with the largest areas possible.
All of the Input Features To Match values and solution matches are written to the Output Features parameter along with Attributes of Interest and the Fields To Append To Output parameters. In addition, the following fields are included in the Output Features:
Field Name | Field Alias | Description | Notes |
MATCH_ID | MATCH_ID | All of the target features in the Input Features To Match layer are listed first with their OID or FID identifier written to the MATCH_ID field. Solution matches have NULL values for this field. | When the Output Features value is a shapefile, NULL values are represented by a very large negative number (such as -21474836). |
CAND_ID | CAND_ID | All of the solution matches are listed next and this value is their OID or FID identifier. The target features in the Input Features To Match layer have NULL values for this field. | When the Output Features value is a shapefile, NULL values are represented by a very large negative number (such as -21474836). |
SIMRANK | Similarity Rank | When you choose Most similar or Both for the Most Or Least Similar parameter, all of the solution matches are ranked from most similar to least similar. The most similar solution match has a rank value of 1. | This field is only included in the Output Features value when you choose Most similar or Both for the Most Or Least Similar parameter. |
DSIMRANK | Dissimilarity Rank | When you choose Least similar or Both for the Most Or Least Similar parameter, all of the solution matches are ranked from least similar to most similar. The solution that is least similar gets a rank value of 1. | This field is only included in the Output Features value when you choose Least similar or Both for the Most Or Least Similar parameter. |
SIMINDEX | Sum of Squared Value Differences, Sum of Squared Rank Differences, or Cosine Similarity | This field quantifies how similar each solution match is to the target feature. The field alias depends on the Match Method parameter: Sum of Squared Value Differences for Attribute values, Sum of Squared Rank Differences for Ranked attribute values, and Cosine Similarity for Attribute profiles. | If there is only one Input Features To Match value, the target feature is this feature. When more than one Input Features To Match value is specified, the target feature is a temporary feature created with averaged values for all the Attributes of Interest values. |
LABELRANK | Render Rank | This field is used for display purposes only. The tool uses this field to provide default rendering of the analysis results. | |
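The output fields can be used to separate target features from solution matches. The sketch below works on rows represented as dictionaries (for example, rows read with `arcpy.da.SearchCursor` and converted by the caller) and assumes a run that produced the SIMRANK field. The function names and sample rows are hypothetical, and treating any negative identifier as NULL is an assumption based on the shapefile behavior described in the table.

```python
def is_null(value):
    """Treat None and negative identifiers as NULL.

    OID and FID values are not negative, so a negative number is assumed
    to be the shapefile placeholder for NULL.
    """
    return value is None or value < 0


def split_output(rows):
    """Return (targets, matches), with matches sorted by SIMRANK."""
    targets = [row for row in rows if not is_null(row["MATCH_ID"])]
    matches = [row for row in rows if not is_null(row["CAND_ID"])]
    matches.sort(key=lambda row: row["SIMRANK"])
    return targets, matches


# Hypothetical rows: one target and two solution matches.
rows = [
    {"MATCH_ID": 7, "CAND_ID": None, "SIMRANK": None},
    {"MATCH_ID": -21474836, "CAND_ID": 12, "SIMRANK": 2},
    {"MATCH_ID": None, "CAND_ID": 5, "SIMRANK": 1},
]
targets, matches = split_output(rows)
print([row["MATCH_ID"] for row in targets])  # [7]
print([row["CAND_ID"] for row in matches])   # [5, 12]
```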
The Output Features layer is automatically added to the Contents pane with default rendering applied to the LABELRANK field. The rendering applied is defined by a layer file in <ArcGIS Pro>\Resources\ArcToolBox\Templates\Layers. You can reapply the default rendering, if needed, using the Apply Symbology From Layer tool.
Note:
The default sample size is 10,000 records. When the Number of Results value is larger than this default, you will need to increase the sample size to render all of the results. To increase the sample size, open the Symbology pane and click the Advanced symbology options tab. Expand Sample size and change the Maximum sample size value.
Syntax
arcpy.stats.SimilaritySearch(Input_Features_To_Match, Candidate_Features, Output_Features, Collapse_Output_To_Points, Most_Or_Least_Similar, Match_Method, Number_Of_Results, Attributes_Of_Interest, {Fields_To_Append_To_Output})
Parameter | Explanation | Data Type |
Input_Features_To_Match | The layer, or a selection on a layer, containing the features you want to match; you are searching for other features that look like these features. When more than one feature is provided, matching is based on attribute averages. Tip: When the Input Features To Match and Candidate Features values are from a single dataset layer, you can duplicate the layer and apply separate selections or definition queries to the two copies, as described in the Usage section. | Feature Layer |
Candidate_Features | The layer, or a selection on a layer, containing candidate matching features. The tool will check for features most similar (or most dissimilar) to the Input_Features_To_Match values among these candidates. | Feature Layer |
Output_Features | The output feature class containing a record for each of the Input_Features_To_Match values and for all the solution-matching features found. | Feature Class |
Collapse_Output_To_Points | Specifies whether the geometry for the Output_Features parameter will be collapsed to points or will match the original geometry (lines or polygons) of the input features if the Input_Features_To_Match and Candidate_Features parameter values are both either lines or polygons. This parameter is only available with a Desktop Advanced license. Choosing COLLAPSE will improve tool performance for large line and polygon datasets. | Boolean |
Most_Or_Least_Similar | Specifies whether features that are most similar or most dissimilar to the Input_Features_To_Match values will be identified. | String |
Match_Method | Specifies whether matching will be based on values, ranks, or cosine relationships. | String |
Number_Of_Results | The number of solution matches to find. Entering zero or a number larger than the total number of Candidate_Features values will return rankings for all the candidate features. The default is 10. | Long |
Attributes_Of_Interest [Attributes_Of_Interest,...] | The numeric attributes representing the matching criteria. | Field |
Fields_To_Append_To_Output [Fields_To_Append_To_Output,...] (Optional) | The fields to include with the Output_Features parameter. These fields are not used to determine similarity; they are only included in the Output_Features parameter for reference. | Field |
Code sample
The following Python window script demonstrates how to use the SimilaritySearch function.
import arcpy
import arcpy.stats as SS
arcpy.env.workspace = r"C:\Analysis"
SS.SimilaritySearch("Crime_selection", "AllCrime", "c:\\Analysis\\CrimeMatches",
                    "NO_COLLAPSE", "MOST_SIMILAR", "ATTRIBUTE_VALUES", 4,
                    "HEIGHT;WEIGHT;SEVERITY;DST2CHPSHP", "Name;WEAPON")
The following stand-alone Python script demonstrates how to use the SimilaritySearch function.
# Similarity Search of crime data in a metropolitan area
# Import system modules
import arcpy
import arcpy.stats as SS
# Set property to overwrite existing output
arcpy.env.overwriteOutput = True
try:
    # Set the current workspace (to avoid having to specify the full path to
    # the feature classes each time)
    arcpy.env.workspace = r"C:\Analysis"
    # Make a layer from the crime feature class
    arcpy.MakeFeatureLayer_management("AllCrime", "Crime_selection")
    # Select the target crime to match
    # Process: Select By Attribute
    arcpy.SelectLayerByAttribute_management("Crime_selection", "NEW_SELECTION",
                                            '"OBJECTID" = 1230043')
    # Use Similarity Search to find the four crimes most similar to the
    # selected crime, based on the attributes of interest
    # Process: Similarity Search
    SS.SimilaritySearch("Crime_selection", "AllCrime", "CJMatches",
                        "NO_COLLAPSE", "MOST_SIMILAR", "ATTRIBUTE_VALUES", 4,
                        "HEIGHT;WEIGHT;SEVERITY;DST2CHPSHP", "Name;WEAPON")
except arcpy.ExecuteError:
    # If an error occurred when running the tool, print out the error message.
    print(arcpy.GetMessages())
Environments
Licensing information
- Basic: Yes
- Standard: Yes
- Advanced: Yes