Big data connections

A big data connection (BDC) allows you to quickly connect to data sources to visualize and analyze large datasets. A BDC provides functionality and flexibility to work with your data and its formatting.

A BDC references a folder of one or more datasets. Datasets in a BDC are used as input feature data (points, polylines, polygons, and tabular data) to geoprocessing tools. When you create a BDC, a .bdc file is created. This file points to a directory of datasets and outlines the datasets and their schemas, including geometry and time information. You can browse for BDC datasets in geoprocessing tools and view BDC datasets on the map. The following are examples of when a BDC is appropriate:

  • You have multiple shapefiles representing a large area. Each shapefile represents a subset of the area, and you want to use all of the shapefiles together.
  • You receive a new .csv file daily with temperature measurements. You want to include the new .csv file as part of a dataset with your existing .csv files.
  • You use data that has multiple fields representing the time of an event. You want to use all the fields to represent the time.
  • You have parquet files to use.

The following are reasons to use a BDC as input to geoprocessing tools:

  • You can represent multiple datasets of the same schema and file type as a single dataset.
  • A BDC accesses the data when the analysis is run, so you can continue to add data to an existing dataset in your BDC without reregistering or publishing your data.
  • You can modify the BDC to remove, add, or update which datasets are visible.
  • BDCs are flexible in how time and geometry can be defined and allow for multiple time formats on a single dataset.

Supported data formats

Big data connections support the following datasets:

  • Delimited files (such as .csv, .tsv, and .txt)
  • Shapefiles (.shp)
  • Parquet files (.gz.parquet)
    Note:

    Only unencrypted parquet files are supported.

  • ORC files (.orc.crc)

If you are using a BDC in GeoAnalytics Desktop tools, all input formats are supported. If you are using BDC datasets in any other geoprocessing tool, only delimited files and shapefiles are supported.

Learn more about analysis with BDC files

Big data connection terminology

The table below lists common terms for working with BDCs.

Term | Description
Big data connection | The item representing the BDC file. You can expand this item to view its datasets and browse to it for use in geoprocessing tools. This item is the ArcGIS Pro interface for your BDC file.
Big data connection file | The file (.bdc) that is created and stored when you create a BDC using the Create Big Data Connection tool. The file contains information about contained datasets and schemas, as well as geometry and time properties. When you view this file in ArcGIS Pro, it is a BDC item. Learn more about big data connection files
Big data connection dataset | A dataset in your BDC. You can add this dataset to a map or use it as input to geoprocessing tools.
Source location | The folder location registered as a BDC. This location contains one or more folders representing BDC datasets. Big data connection tools do not modify this folder.
Source data | The datasets registered in the BDC. Using a BDC or running big data connection tools does not modify this data.

Use a BDC

To prepare and use a BDC, complete the following steps:

  1. Structure your input data.
  2. Configure a BDC.
  3. Visualize a BDC dataset.
  4. Use BDC datasets in analysis.

Structure your input data

To use your datasets as inputs in a BDC, the data must be correctly structured. Organize your datasets as subfolders under a single source folder, and register that source folder. The name of each subfolder becomes the name of a dataset.

Figure: One source folder that contains three subfolders, each representing a dataset.

The image above represents the correct structure of a BDC. The source folder is registered, and each subfolder in the source folder represents a dataset. In this example, you would register the source folder, and three datasets would be included in the BDC: Dataset-1, Dataset-2, and Dataset-3.

In the dataset subfolders, you can structure your data as desired. If your subfolders contain multiple folders or files, all of the contents of the subfolders are read as a single dataset, and they must share the same schema and file type.

Note:

All files in a dataset folder must have the same schema. If a file has a different schema, it will not be used correctly in visualization and analysis.

The following image shows three datasets with different structures and file contents:

Figure: Example source folder with three dataset folders and their contents.

In this example, the same three dataset folders have different content. Each dataset is described below:

  • Dataset-1: This dataset is composed of a single file, D1-1. When Dataset-1 is used for visualization or analysis, that single shapefile is used.
  • Dataset-2: This dataset is composed of two text files, D2-1 and D2-2. When Dataset-2 is used for visualization or analysis, both text files are used.
  • Dataset-3: This dataset is composed of two folders, D3-Folder-1 and D3-Folder-2, each containing a single file (D3-1 and D3-2, respectively). When Dataset-3 is used for visualization or analysis, both D3-1 and D3-2 are used.

These are examples of how you can structure your data. The number of files or folders doesn't change how the data is used for visualization or analysis. There is no advantage to adding subfolders to, or removing them from, a dataset folder; structuring the folders at that level is optional.
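The folder rules above can be sketched in plain Python. This is an illustration only, not how ArcGIS Pro discovers datasets: the function name `list_bdc_datasets` is hypothetical, and the sketch assumes that files placed directly in the source folder are ignored. It treats each immediate subfolder of the source folder as one dataset and gathers every file beneath it, however deeply nested.

```python
from pathlib import Path


def list_bdc_datasets(source_folder):
    """Map each dataset subfolder name to the sorted files found beneath it.

    Hypothetical helper for illustration; it mirrors the folder rules
    described in this topic and does not call any ArcGIS API.
    """
    datasets = {}
    for subfolder in sorted(Path(source_folder).iterdir()):
        if not subfolder.is_dir():
            # Assumption: loose files in the source folder are not datasets.
            continue
        # Nested folders are flattened: every file below the subfolder
        # is treated as part of the same dataset.
        files = sorted(
            p.relative_to(subfolder).as_posix()
            for p in subfolder.rglob("*")
            if p.is_file()
        )
        datasets[subfolder.name] = files
    return datasets
```

For a layout like the example above, the result would have one key per dataset folder (Dataset-1, Dataset-2, and Dataset-3), with the files from both D3-Folder-1 and D3-Folder-2 listed together under Dataset-3.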

Configure a BDC

To get started, create a BDC using the Create Big Data Connection geoprocessing tool.

You may run into one of two issues when discovering datasets in your BDC:

  • Datasets that you expected are missing. In this case, verify that the path you specified as the source folder is correct, that the folder contains the dataset subfolders, and that the data is in a supported format.
  • One or more datasets fail to register. If datasets fail to register, one of the following issues may be the cause:

    Issue | Solution | Example
    The dataset is not in the expected format. | Open the file to see if it looks as expected. If the data is structured incorrectly, update it and try again. | A .csv file has a few lines and a summary of the data, and then only empty lines.
    The schemas of datasets in a folder do not match. | All files in a dataset folder must have the same schema. Open the files to compare the schemas. Resolve any mismatched schemas and try to register the dataset again. | You have one .csv file with 10 fields, and another with 8.
    The file types in a dataset folder do not match. | All files in a dataset folder must have the same extension (file type). Check the file types in the data source location and remove or relocate any misplaced files. | A shapefile is in the same folder as a parquet file.
    You have an unrecognized field format. | This is unlikely but may occur if an ORC or parquet file uses an unexpected field format. Ensure that you use valid field formats. | You have a parquet file with an unknown field format.

If you create a BDC using a delimited file and the field names from the header row are missing, the header row may be invalid. Ensure that every field has a header and that no header is empty. If needed, you can update the field names using the Update Big Data Connection Dataset Properties tool.
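Some of the issues above can be caught before you create the BDC. The following is a minimal sketch of such a pre-check for a delimited dataset folder, assuming UTF-8 files whose first row is the header. The function name `check_dataset_folder` and its messages are hypothetical, and the sketch does not reproduce the validation that the Create Big Data Connection tool performs. It is not suitable for shapefile datasets, which consist of several sidecar files with different extensions.

```python
import csv
from pathlib import Path

# Extensions listed in this topic as delimited files.
DELIMITED_EXTENSIONS = {".csv", ".tsv", ".txt"}


def check_dataset_folder(dataset_folder, delimiter=","):
    """Return a list of problems found in one delimited dataset folder.

    Hypothetical pre-check for illustration only.
    """
    files = sorted(p for p in Path(dataset_folder).rglob("*") if p.is_file())
    problems = []

    # All files in a dataset folder must share one file type.
    extensions = {p.suffix.lower() for p in files}
    if len(extensions) > 1:
        problems.append(f"mixed file types: {sorted(extensions)}")
        return problems  # comparing headers across file types is meaningless
    if not extensions <= DELIMITED_EXTENSIONS:
        return problems  # this sketch only inspects delimited files

    headers = []
    for path in files:
        with path.open(newline="", encoding="utf-8") as handle:
            header = next(csv.reader(handle, delimiter=delimiter), [])
        # Every field needs a non-empty header.
        if not header or any(not name.strip() for name in header):
            problems.append(f"{path.name}: missing or empty header field")
        headers.append(tuple(header))

    # Matching schemas are approximated here as identical header rows.
    if len(set(headers)) > 1:
        problems.append("schemas differ between files")
    return problems
```

An empty list means that none of these particular issues were detected; it does not guarantee that the dataset will register.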

When you create a BDC, the schema, geometry, and time are discovered for each of your datasets. You can often change how a dataset represents those values. To verify that each dataset correctly represents the geometry, time, and fields, use the Describe Dataset geoprocessing tool. After reviewing your datasets, you may want to make one or more of the following changes to one or more datasets in your BDC:

  • Change the field names of delimited datasets.
  • Modify which fields are visible for analysis.
  • Change the fields used to represent geometry or time.
  • Add a filter to a dataset.
  • Add an alias to a dataset.
  • Remove datasets from the BDC that you aren't interested in analyzing.
  • Refresh the BDC to include a newly added dataset (a new subfolder under the source folder).

To make these optional changes, you can use any combination of the big data connection geoprocessing tools, such as the Update Big Data Connection Dataset Properties tool.

Visualize a BDC dataset

Delimited- and shapefile-based BDC datasets can be visualized on a map.

Note:
BDC datasets using parquet and ORC source files cannot be visualized.

To add your dataset to the map, locate the BDC item in the Catalog pane, click to expand the datasets, and add the dataset to the map.

Big data connection datasets have a simplified experience in the map viewer and have the following limitations:

  • When visualizing BDC datasets, the time properties in the BDC dataset properties are not automatically set in the new layer. To visualize the dataset with time, set the layer's time properties after adding the dataset to the map.
  • Drawing delimited files will zoom to the full extent of the BDC dataset's spatial reference.
  • If you add new records to an existing BDC dataset, for example, adding new rows to a CSV file in an existing BDC, the new records will not draw until you restart ArcGIS Pro.
  • If you add new files to an existing BDC dataset, for example, adding a new CSV file to an existing BDC dataset, the new records will not draw until you restart ArcGIS Pro.

Use BDC datasets in analysis

When BDC datasets are used as input to GeoAnalytics Desktop tools, analysis is optimized to read the data and run in parallel across the cores of your machine. For all other geoprocessing tools, BDC dataset reading and processing is not optimized to run in parallel; it is sequential and single-threaded.

Big data connection datasets based on delimited files or shapefiles can be used in most geoprocessing tools.

Note:
BDC datasets using parquet and ORC source files can only be used in GeoAnalytics Desktop tools.
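The format rules stated in this topic can be summarized as a small lookup. The sketch below is only a restatement of those rules in Python; the dictionary layout, the key names, and the `can_use` function are hypothetical and are not part of any ArcGIS API.

```python
# Capability matrix as described in this topic. For "other_tools", True
# means "most geoprocessing tools"; individual tool exceptions still apply.
BDC_FORMAT_SUPPORT = {
    "delimited": {"visualize": True, "geoanalytics_desktop": True, "other_tools": True},
    "shapefile": {"visualize": True, "geoanalytics_desktop": True, "other_tools": True},
    "parquet": {"visualize": False, "geoanalytics_desktop": True, "other_tools": False},
    "orc": {"visualize": False, "geoanalytics_desktop": True, "other_tools": False},
}


def can_use(source_format, usage):
    """Return whether a BDC dataset of the given source format supports a usage.

    Hypothetical helper; raises KeyError for an unknown format or usage.
    """
    return BDC_FORMAT_SUPPORT[source_format.lower()][usage]
```

For example, `can_use("parquet", "visualize")` returns `False`, while `can_use("shapefile", "other_tools")` returns `True`.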

You cannot apply a selection to a BDC dataset when it's used as input to a GeoAnalytics Desktop tool.

To use a BDC dataset in a geoprocessing tool, either add the BDC dataset to a map and select the layer name from the parameter choice list, or use the browse button to browse to a BDC workspace and select the input dataset.

The following tools do not support BDC datasets as input:

  • Service-based tools, including GeoAnalytics Server, standard feature analysis, and ArcGIS Online analysis tools
  • Tools that modify the input dataset, such as Calculate Field and Near
