A big data connection (BDC) allows you to quickly connect to data sources to visualize and analyze large datasets. A BDC provides functionality and flexibility to work with your data and its formatting.
A BDC references a folder of one or more datasets. Datasets in a BDC are used as input feature data (points, polylines, polygons, and tabular data) to geoprocessing tools. When you create a BDC, a .bdc file is created. This file points to a directory of datasets and outlines the datasets and their schemas, including geometry and time information. You can browse for BDC datasets in geoprocessing tools and view BDC datasets on the map. The following are examples of when a BDC is appropriate:
- You have multiple shapefiles representing a large area. Each shapefile represents a subset of the area, and you want to use all of the shapefiles together.
- You receive a new .csv file daily with temperature measurements. You want to include the new .csv file as part of a dataset with your existing .csv files.
- You use data that has multiple fields representing the time of an event. You want to use all the fields to represent the time.
- You have parquet files to use.
The following are reasons to use a BDC as input to geoprocessing tools:
- You can represent multiple datasets of the same schema and file type as a single dataset.
- A BDC accesses the data when the analysis is run, so you can continue to add data to an existing dataset in your BDC without reregistering or publishing your data.
- You can modify the BDC to remove, add, or update which datasets are visible.
- BDCs are flexible in how time and geometry can be defined and allow for multiple time formats on a single dataset.
Supported data formats
Big data connections support the following datasets:
- Delimited files (such as .csv, .tsv, and .txt)
- Shapefiles (.shp)
- Parquet files (.gz.parquet). Only unencrypted parquet files are supported.
- ORC files (orc.crc)
If you are using a BDC in GeoAnalytics Desktop tools, all input formats are supported. If you are using BDC datasets in any other geoprocessing tool, only delimited files and shapefiles are supported.
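The format rule above can be sketched as a small lookup. This is a minimal illustration, not part of ArcGIS Pro: the function name `is_supported` and the two tuples are hypothetical, and the suffixes are copied as written in the list above.

```python
# Suffixes copied from the supported formats list above.
# Splitting them into two groups is an illustrative choice, not an ArcGIS API.
ALL_TOOLS = (".csv", ".tsv", ".txt", ".shp")
GEOANALYTICS_DESKTOP_ONLY = (".gz.parquet", "orc.crc")


def is_supported(file_name: str, geoanalytics_desktop_tool: bool) -> bool:
    """Return True if the file's suffix is listed as usable with the given kind of tool."""
    name = file_name.lower()
    if name.endswith(ALL_TOOLS):
        # Delimited files and shapefiles are listed for every geoprocessing tool.
        return True
    # Parquet and ORC are listed only for GeoAnalytics Desktop tools.
    return geoanalytics_desktop_tool and name.endswith(GEOANALYTICS_DESKTOP_ONLY)
```

For example, `is_supported("part-0.gz.parquet", geoanalytics_desktop_tool=False)` returns `False`, while the same file with `geoanalytics_desktop_tool=True` returns `True`. The check looks only at file names; it does not detect whether a parquet file is encrypted.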
Big data connection terminology
The table below lists common terms for working with BDCs.

| Term | Description |
| --- | --- |
| Big data connection | The item representing the BDC file. You can expand this item to view its datasets and browse to it for use in geoprocessing tools. This item is the ArcGIS Pro interface for your BDC file. |
| Big data connection file | The file (.bdc) that is created and stored when you create a BDC using the Create Big Data Connection tool. The file contains information about the datasets it references and their schemas, as well as geometry and time properties. When you view this file in ArcGIS Pro, it is a BDC item. |
| Big data connection dataset | A dataset in your BDC. You can add this dataset to a map or use it as input to geoprocessing tools. |
| Source folder | The folder location registered as a BDC. This location contains one or more folders representing BDC datasets. Big data connection tools do not modify this folder. |
| Source data | The datasets registered in the BDC. When you use a BDC, the source data is not modified. Big data connection tools do not modify this data. |
Structure your input data
To use your datasets as inputs in a BDC, the data must be correctly structured. To prepare your data for a BDC, format your datasets as subfolders under a single source folder that you register. In this source folder, the names of the subfolders represent the dataset names.
The image above represents the correct structure of a BDC. The source folder is registered, and each subfolder in the source folder represents a dataset. In this example, you would register the source folder, and three datasets would be included in the BDC: Dataset-1, Dataset-2, and Dataset-3.
In the dataset subfolders, you can structure your data as desired. If your subfolders contain multiple folders or files, all of the contents of the subfolders are read as a single dataset, and they must share the same schema and file type.
All files in a dataset folder must have the same schema. A file with a different schema will not be used correctly in visualization and analysis.
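One way to check the shared-schema requirement before registering a source folder is to compare header rows. The sketch below is an illustration under stated assumptions: it covers only .csv files, assumes each file's first row is a header, and compares field names only, not field types. The function name `files_with_mismatched_header` is hypothetical and is not an ArcGIS Pro tool.

```python
import csv
from pathlib import Path


def files_with_mismatched_header(dataset_folder):
    """Return names of .csv files whose header differs from the first file's header.

    Assumption: every .csv file under the dataset folder has a header row.
    Files are searched recursively, because nested folders belong to the
    same dataset.
    """
    reference = None
    mismatched = []
    for path in sorted(Path(dataset_folder).rglob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            # An empty file yields an empty header.
            header = next(csv.reader(handle), [])
        if reference is None:
            reference = header  # first file sets the expected field names
        elif header != reference:
            mismatched.append(path.name)
    return mismatched
```

An empty result means every .csv file in the dataset folder has the same field names in the same order as the first file.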
In this example, the same three dataset folders have different content. Each dataset is described below:
- Dataset-1: This dataset is composed of a single file, D1-1. When Dataset-1 is used for visualization or analysis, a single shapefile will be used.
- Dataset-2: This dataset is composed of two text files, D2-1 and D2-2. When Dataset-2 is used for visualization or analysis, both text files will be used.
- Dataset-3: This dataset is composed of two folders, D3-Folder-1 and D3-Folder-2, each containing a single file (D3-1 and D3-2, respectively). When Dataset-3 is used for visualization or analysis, both D3-1 and D3-2 will be used.
These are examples of how you can structure your data. The number of files or folders doesn't change how the data is used for visualization or analysis. There is no advantage to adding a subfolder to or removing subfolders from each dataset folder; structuring the folders at that level is optional.
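The folder rules described in this section can be sketched in a few lines of Python. This is a hedged illustration of how the layout is interpreted, not how ArcGIS Pro reads a BDC: the function name `datasets_in` is hypothetical, and the sketch only walks the file system.

```python
from pathlib import Path


def datasets_in(source_folder):
    """Map each dataset name (a subfolder of the source folder) to its file names.

    Files are collected recursively, so a dataset's contents are the same
    whether or not they sit in nested folders.
    """
    datasets = {}
    for entry in sorted(Path(source_folder).iterdir()):
        if entry.is_dir():
            datasets[entry.name] = sorted(
                p.name for p in entry.rglob("*") if p.is_file()
            )
    return datasets
```

With a layout like the example above, Dataset-2 (two files directly in the folder) and Dataset-3 (two files in separate nested folders) would each map to two files, which matches the point that folder depth does not change what a dataset contains.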
To start using big data connections, see Use big data connections.