Tiled processing of large datasets

To improve the performance and scalability of feature overlay tools such as Union and Intersect, operational logic called adaptive subdivision processing is used. This logic is triggered when the data cannot be processed within the available virtual memory. To stay within that limit, which improves performance, processing is done incrementally on subdivisions of the original extent. Features that straddle the edges of these subdivisions (also called tiles) are split at the tile edge and reassembled into a single feature during the last stage of processing. The vertices introduced at these tile edges remain in the output features. Tile boundaries may also be left in the output feature class when a feature being processed is so large that the subdivision process cannot reassemble it into its original state within the available virtual memory.

The adaptive subdivision processing logic is dynamic, driven by the available virtual memory and the complexity of the data. Running the same overlay operation on the same machine may divide the data into tiles differently if the amount of available memory changes in any way.

An optimization to adaptive subdivision processing is the parallel processing of tiles. When parallel processing is enabled, the tiles processed on different cores share the total amount of available virtual memory on the system, so each tile must be smaller than it would be if processed sequentially. Adaptive subdivision processing tracks the size of each tile to ensure that the tiles processed concurrently do not exceed the total available virtual memory. Parallel processing may not be faster in all cases; the size and complexity of the data must exceed the overhead of managing the parallel processes for there to be performance gains. See the individual tool documentation for how to enable parallel processing for the tools that support it.
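
In Python, parallel processing is typically controlled through the Parallel Processing Factor environment. The following is a minimal sketch, assuming a tool that honors this environment; the dataset paths are hypothetical.

    import arcpy

    # Let tools that honor this environment use up to half of the
    # available cores; tile sizes are then managed so that concurrent
    # tiles stay within the available virtual memory.
    arcpy.env.parallelProcessingFactor = "50%"

    # Hypothetical inputs and output
    arcpy.analysis.Union(
        [r"C:\data\inputs.gdb\parcels", r"C:\data\inputs.gdb\zoning"],
        r"C:\data\results.gdb\parcels_zoning_union",
    )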

Advantages of subdividing data

Tools perform best when processing can be done within a machine's available virtual memory (free memory not being used by the system or other applications). This may not always be possible when working with datasets that contain a large number of features, complex features with complex feature interactions, or features with hundreds of thousands or millions of vertices (or a combination of all three). Without tiling, available memory is quickly exhausted and the operating system starts to page (use secondary storage as if it were main memory). Paging degrades performance, and at some point the system runs out of resources and the operation fails. Tiling helps avoid paging. In extreme cases, memory management is handed over to the operating system, allowing it to use all of the resources at its disposal in an attempt to complete the process.

When 64-bit applications became mainstream, the common impression was that everything would be faster. This is not always the case. 64-bit processing allows you to do more at once, which can give the impression of increased performance, but some operations are slower when measured per increment of work. Another common belief is that loading all of your data into memory should make processing faster. While this may be true for simple data, it is not the case for large and complex spatial operations. For overlay tool processing, holding all the data in memory and analyzing it at once is equivalent to processing the single top-level tile noted earlier, which is slower than tiling the data more extensively. Tiling can be reduced, but not eliminated, to gain performance. 64-bit processing also allows the parallel processing of tiles, which can improve performance in many cases.

Processing tiles

Every process starts with a single tile that spans the entire extent of the data. If the data in that tile is too large to be processed within the available virtual memory, the tile is subdivided into four equal tiles. Processing then begins on each subtile, which is further subdivided if its data is also too large. This continues until the data in each tile can be processed within the available virtual memory. See the example below:

Extent of input datasets
Start with the footprint of all the input features.

Geoprocessing tile level 1
The process begins with a tile that spans the entire extent of all datasets. For reference, this is called tile level 1.

Geoprocessing tile level 2
If the data is too large to process in memory, the level 1 tile is subdivided into four equal tiles. These four subtiles are called level 2 tiles.

Adaptive geoprocessing tiles
Based on the size of the data in each tile, some tiles are further subdivided, while others are not.
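
Conceptually, the subdivision is a simple recursive quadtree. The following is an illustrative sketch of the idea only, not the tools' actual implementation; features are assumed to be (bounding box, geometry) pairs, and fits_in_memory and process_tile are hypothetical stand-ins for the real memory check and overlay work.

    def bbox_intersects(a, b):
        """Return True if two (xmin, ymin, xmax, ymax) extents overlap."""
        return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

    def subdivide(extent, features, fits_in_memory, process_tile):
        """Process features tile by tile, splitting any tile whose data
        cannot be processed in memory into four equal subtiles."""
        if fits_in_memory(features):
            process_tile(extent, features)  # this tile is small enough
            return
        xmin, ymin, xmax, ymax = extent
        xmid, ymid = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
        for quad in ((xmin, ymin, xmid, ymid), (xmid, ymin, xmax, ymid),
                     (xmin, ymid, xmid, ymax), (xmid, ymid, xmax, ymax)):
            # Features straddling a tile edge fall into more than one
            # quadrant; the real logic splits them at the edge and
            # reassembles the pieces during the last stage of processing.
            subset = [f for f in features if bbox_intersects(f[0], quad)]
            if subset:
                subdivide(quad, subset, fits_in_memory, process_tile)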

Tools that use subdivisions

The following tools from the Analysis Tools toolbox use subdivision logic when dealing with large data:

  • Buffer—When using the dissolve option.
  • Clip—Not in all scenarios. An internal batch process is now used in many situations.
  • Erase
  • Identity
  • Intersect
  • Split
  • Symmetrical Difference
  • Union
  • Update

The following tools from the Data Management toolbox also use subdivision logic when dealing with large datasets:

  • Dissolve
  • Feature To Line
  • Feature To Polygon
  • Polygon To Line

Progress indication during adaptive subdivision

Adaptive subdivision is dynamic, as it is based on the resources available on the machine where the process is run, so there is no set number of steps to complete the analysis. If a tile cannot be processed within the available memory, it is divided and the resulting tiles are processed; one or more of these new tiles may in turn prove too large and be divided again. Since there is no way to efficiently determine the number of tiles needed before run time, there is no way to provide a standard countdown progress message. Therefore, the messages and progress bar you see for tools using subdivision logic only indicate progress for the individual tiles being processed.
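
When a tool is run from Python, the messages, including any per-tile progress reported while the tool ran, can be reviewed afterward. A small sketch, assuming a hypothetical Intersect run:

    import arcpy

    # Hypothetical inputs and output
    result = arcpy.analysis.Intersect(
        [r"C:\data\inputs.gdb\buffers", r"C:\data\inputs.gdb\boundary"],
        r"C:\data\results.gdb\overlaps",
    )

    # Print all geoprocessing messages from the run.
    print(result.getMessages())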

Process fails with an out of memory error or performance degradation occurs

Subdivision may not help when processing extremely large features (features with many millions of vertices). Splitting and reassembling extremely large features multiple times across tile boundaries is costly in terms of memory and may cause out of memory errors or poor performance. How large is too large? It depends on how much virtual memory (free memory not being used by the system or other applications) is available on the machine running the process. The less memory that is available, the smaller a feature needs to be before it is considered too large and causes problems. A large feature may cause slow performance or an out of memory error on one machine configuration and not on another, and it may do so on the same machine one time and not another, depending on the resources being used by other applications. Examples of very large features with many vertices are road casings for an entire city or a polygon representing a complex river estuary.

A few methods are available for dealing with features that are too large. One recommended technique is to use the Dice tool to divide large features into smaller features before processing.
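
A minimal sketch of this approach, assuming hypothetical paths and a limit of 10,000 vertices per output feature:

    import arcpy

    # Divide any feature with more than 10,000 vertices into smaller
    # features before running the overlay operation.
    arcpy.management.Dice(
        r"C:\data\inputs.gdb\road_casings",
        r"C:\data\results.gdb\road_casings_diced",
        10000,
    )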

Another technique is to use the Multipart To Singlepart tool on the large features. Before running the tool, make sure the feature class has a unique identifier field so that each resulting part carries the identifier of the original feature it belonged to. If a large feature has many parts (such as individual features with hundreds of thousands of parts), converting it to single part allows you to process the parts individually. If required, after running the overlay tool of your choice on the single-part version of the data, you can re-create the multipart features by dissolving on the unique identifier field.
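
A sketch of this workflow, assuming hypothetical paths and a unique identifier field named FEATURE_ID:

    import arcpy

    # Break each multipart feature into its individual parts; each part
    # keeps the FEATURE_ID value of the original feature it belonged to.
    arcpy.management.MultipartToSinglepart(
        r"C:\data\inputs.gdb\estuaries",
        r"C:\data\work.gdb\estuaries_single",
    )

    # ...run the overlay tool of your choice on the single-part data...

    # Re-create the multipart features by dissolving on the identifier.
    arcpy.management.Dissolve(
        r"C:\data\work.gdb\estuaries_single_overlay",  # hypothetical overlay output
        r"C:\data\results.gdb\estuaries_multipart",
        "FEATURE_ID",
        multi_part="MULTI_PART",
    )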

Out of memory errors or performance degradation may also occur when processing areas of extreme overlap. An example of extreme overlap is to run the Union or Intersect tool on the results of a buffer operation that was run on a very dense set of points. The buffer output will contain many overlapping buffer features. Selecting any one point in an area of densely packed buffer features may return thousands, tens of thousands, or even hundreds of thousands of overlapping buffers. Running the Intersect or the Union tool on this buffer output will create many new features representing every unique overlap relationship.

For example, two feature layers are illustrated below. One contains 10 features (buffers around points), and the other contains one feature (a square).

Input 1
Input 1 contains 10 overlapping polygon features, labeled by OID.

Input 2
Input 2 contains one polygon feature, labeled by OID.

The illustration below shows the result of running Intersect on the two feature layers above. Note that all overlaps (intersections) among all features are calculated, regardless of which input feature layer they belong to. This can result in many more features than are found in the original inputs: there are more polygons in the output (167 features) than in the inputs combined (11 features). This increase in the number of new features in the Intersect tool output can grow very quickly, depending on the complexity of the overlap in the inputs.

Intersect output
The Intersect tool output shows 167 output features labeled by OID.

To avoid out of memory errors or performance degradation with large, complex datasets, you may need to remove some of the overlap complexity from your data and run the overlay process iteratively. You can also review the Pairwise Overlay tools to see whether one of them provides functionality similar to what you are trying to accomplish.

Note:
Using the pairwise tools may require changes to your workflow due to differences in the functionality offered and differences in their outputs. See the individual tool documentation for details.
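
For reference, a minimal sketch of the pairwise alternative to the Intersect example above, with hypothetical paths:

    import arcpy

    # Pairwise Intersect computes intersections between pairs of features
    # from the inputs rather than calculating every overlap relationship
    # among all features at once.
    arcpy.analysis.PairwiseIntersect(
        [r"C:\data\inputs.gdb\buffers", r"C:\data\inputs.gdb\boundary"],
        r"C:\data\results.gdb\overlaps_pairwise",
    )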

An out of memory error can also occur if a second application or process is run while a tool is processing. The second process may consume a portion of the available memory the subdivision process expected to use, causing the subdivision process to attempt to use more memory than is actually available. It is recommended that you perform no other operations on a machine while processing large datasets.

Data formats for large data

Enterprise and file geodatabases support very large datasets. It is recommended that you use these as the output workspace when processing very large or complex datasets. For enterprise geodatabases, see your database administrator for details on data loading policies. Performing unplanned or unapproved data loading operations may be restricted.
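
For example, a minimal sketch that creates a file geodatabase and directs output there, with hypothetical paths:

    import arcpy

    # Create a file geodatabase to hold the output of a large overlay job.
    arcpy.management.CreateFileGDB(r"C:\data", "results.gdb")

    # Send subsequent tool output to the new file geodatabase by default.
    arcpy.env.workspace = r"C:\data\results.gdb"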