Apache Arrow in ArcGIS

Apache Arrow is an open-source, in-memory, columnar data representation that is cross-platform and cross-language and allows you to efficiently transfer data between resources. Many big data projects interface with Arrow, making it a convenient option for reading and writing columnar file formats across languages and platforms. For more information, see the Apache Arrow documentation on use cases and on projects and products that use Apache Arrow.

The primary tabular data representation in Arrow is the Arrow table. The Arrow table is a two-dimensional tabular representation in which columns are Arrow chunked arrays. The interface for Arrow in Python is PyArrow. For more information, see the Apache Arrow and PyArrow library documentation.

Tables and feature data

You can convert tables and feature classes to an Arrow table using the TableToArrowTable function in the data access (arcpy.da) module.

Create an Arrow table from a feature class.

import arcpy

infc = r'C:\data\usa.gdb\cities'
arrow_table = arcpy.da.TableToArrowTable(infc)

To convert an Arrow table to a table or feature class, use the Copy Rows or Copy Features tool.

Create a feature class from an Arrow table.

outfc = arcpy.management.CopyFeatures(arrow_table, r'C:\data\usa.gdb\cities_new')

Arrow tables can be used as input to any geoprocessing tool that accepts a table or feature class, with the exception of tools that modify the input, such as the Calculate Field tool. While a geoprocessing tool can accept an Arrow table as input, the output will not be an Arrow table and will instead be a table or feature class.

Schema

Arrow tables must follow a specific schema to be recognized by a geoprocessing tool. The schema is composed of the field names, their data types, and accompanying metadata. The metadata is stored as a JSON-encoded object.

An Object ID field must be of PyArrow data type int64 with the following metadata key/value pair:

{
    'esri.oid': 'esri.int64'
}

Define a field containing Object IDs.

import pyarrow

pyarrow.field(
    "OBJECTID",
    pyarrow.int64(),
    metadata={'esri.oid': 'esri.int64'}
)

The geometry column must be of PyArrow binary or string data type, depending on the geometry data's encoding. The geometry encoding is specified in the metadata under the esri.encoding key. The following are supported geometry encodings:

  • EsriShape and Well Known Binary (WKB) for binary fields
  • EsriJSON, GeoJSON, and Well Known Text (WKT) for string fields

In addition, the coordinate system for the geometry must be specified using the esri.sr_wkt key, for which the value must be a WKT coordinate system string. If the coordinate system is unknown, leave the string empty. The general structure of the metadata is as follows:

{
    'esri.encoding': '<EsriShape, EsriJSON, GeoJSON, WKB, or WKT>',
    'esri.sr_wkt': '<a WKT coordinate system string>'
}

Define a geometry field containing Esri Shape binary geometry with the spatial reference GCS North American 1983.

import pyarrow

f = pyarrow.field(
    "SHAPE",
    pyarrow.binary(),
    metadata={
        'esri.encoding': 'EsriShape',
        # Adjacent string literals are concatenated into one WKT string
        'esri.sr_wkt': 'GEOGCS["GCS_North_American_1983",DATUM["D_North_American_1983",'
        'SPHEROID["GRS_1980",6378137.0,298.257222101]],PRIMEM["Greenwich",0.0],'
        'UNIT["Degree",0.0174532925199433]];-400 -400 1000000000;-100000 10000;'
        '-100000 10000;8.98315284119521E-09;0.001;0.001;IsHighPrecision'
    }
)

When you use an Arrow table as input to a geoprocessing tool that requires a feature class or feature layer, the geometry column must be specified as described above for the tool to recognize the geometry data; otherwise, the tool will fail. In the special case in which the geometry encoding is EsriShape, the esri.encoding key can be omitted. While various geometry encodings are supported, EsriShape is recommended for lossless geometry transport and best performance.

The Object ID field is optional. Because the uniqueness of column values is not guaranteed in an Arrow table, any existing Object ID field values will be overwritten by new IDs when the table is used in a geoprocessing tool. If there is no Object ID field in the Arrow table, an Object ID field will be generated automatically during geoprocessing.

To review the current schema of an Arrow table, use the table's schema property.

You can use knowledge of the schema to assemble an Arrow table that is compatible with geoprocessing tools or, if necessary, to adapt existing data from outside sources for use with geoprocessing tools.

Create an Arrow table from scratch and convert it to a geodatabase table.

import arcpy
import pyarrow as pa

# Get spatial reference
sr = arcpy.SpatialReference(3857)  # WGS 1984 Web Mercator (auxiliary sphere)

# Specify fields for schema
fields = [
    pa.field("SHAPE", pa.string(), metadata={'esri.encoding': 'WKT', 'esri.sr_wkt': sr.exportToString()}),
    pa.field('NAME', pa.string()),
    pa.field('STATE', pa.string()),
    pa.field('POP2010', pa.int32()),
]

# Specify data (largest US city and smallest US state capital)
arrays = [
    pa.array([
        'POINT (-8238770.1834999993 4969744.1656000018)',
        'POINT (-8078649.3640999999 5506675.5481000021)',
    ]),
    pa.array(['New York City', 'Montpelier']),
    pa.array(['NY', 'VT']),
    pa.array([8175133, 7855], type=pa.int32()),
]

# Create Arrow table from data and schema
patable = pa.Table.from_arrays(
    arrays=arrays,
    schema=pa.schema(fields)
)

# Convert Arrow table to geodatabase table
cities = arcpy.management.CopyRows(
    patable, r'C:\data\usa.gdb\smallest_largest_city')

Type conversions

When converting a table or feature class to an Arrow table using the TableToArrowTable function, the data types of the created Arrow table's columns (pyarrow.ChunkedArray objects) are determined from the field types of the input table or feature class. Additionally, the geometry encoding in the output Arrow table's geometry column can be specified with the geometry_encoding parameter.

Field type        PyArrow data type  PyArrow metadata
Short             int16
Long              int32
Big Integer       int64
Float             float
Double            double
Text              string
Date              timestamp[ms]
Date Only         date64[ms]
Time Only         time32[ms]
Timestamp Offset  string             {b'esri.interop.type': b'esri.timestamp_offset'}
GUID              string             {b'esri.interop.type': b'esri.guid'}
Global ID         string not null    {b'esri.interop.type': b'esri.global_id'}
Object ID         int64              {b'esri.oid': b'esri.int64'}
Blob              binary             {b'esri.interop.type': b'esri.blob'}
Geometry          binary or string   {b'esri.encoding': <b'EsriShape', b'WKB', b'EsriJSON', b'GeoJSON', or b'WKT'>, b'esri.sr_wkt': b'<Spatial Reference WKT>'}

Other field types not listed above are not converted and will be dropped.

Note:

Text fields are trimmed at 5,000 characters when converted to an Arrow table.

Note:

Because the uniqueness of column values is not guaranteed in an Arrow table, the user and the receiving database are responsible for ensuring the uniqueness of values stored in Global ID fields. Use the preserveGlobalIds environment setting to control how Global ID values are handled in the target database.

When converting an Arrow table to a table or feature class using a geoprocessing tool, the field types of the output table or feature class are determined by the data types of the input Arrow table's columns. An Object ID field will automatically be added to the output table or feature class if it does not already exist.

PyArrow data type  PyArrow metadata                                                                                      Field type
bool                                                                                                                     Short (Small Integer)
int8                                                                                                                     Short (Small Integer)
int16                                                                                                                    Short (Small Integer)
int32                                                                                                                    Long (Integer)
int64                                                                                                                    Big Integer
uint8                                                                                                                    Short (Small Integer)
uint16                                                                                                                   Long (Integer)
uint32                                                                                                                   Big Integer
uint64                                                                                                                   Double
float32                                                                                                                  Float (Single)
float64                                                                                                                  Double
string                                                                                                                   Text (String)
utf8                                                                                                                     Text (String)
timestamp                                                                                                                Date
date32                                                                                                                   Date Only
date64                                                                                                                   Date Only
time32                                                                                                                   Time Only
time64                                                                                                                   Time Only
int64              {b'esri.oid': b'esri.int64'}                                                                          Object ID
binary             {b'esri.interop.type': b'esri.blob'}                                                                  Blob
binary             {b'esri.encoding': <b'EsriShape' or b'WKB'>, b'esri.sr_wkt': b'<Spatial Reference WKT>'}              Geometry
string             {b'esri.encoding': <b'EsriJSON', b'GeoJSON', or b'WKT'>, b'esri.sr_wkt': b'<Spatial Reference WKT>'}  Geometry

Any Arrow data types not listed above will not be converted and will be dropped.