How Entity Recognition works

Entity recognition, also known as Named Entity Recognition (NER), is the process of extracting information from sentences, paragraphs, textual reports, or other forms of unstructured text. The Text Analysis toolset in the GeoAI toolbox contains tools that use natural language processing (NLP) techniques for training entity recognition models and using these models to extract entities from unstructured text.

The Train Entity Recognition Model tool trains NLP models for extracting a predefined set of entities (such as organizations, persons, date, or country) provided as part of a training dataset. Trained entity recognition models can be used with the Extract Entities Using Deep Learning tool to extract those entities from raw text into a structured format.

Potential applications

The following are potential applications for these tools:

  • Extracting data such as location, reporting officer, or weapon used from crime reports. This can help display trends and aid in planning remediation efforts.
  • Extracting information such as location or whether assistance is required from tweets or social media posts. This can help identify areas where immediate help is required during a natural disaster.
Entity Recognition Model flow chart
This entity recognition model is for extracting entities such as date, time, and location from text reports.

Entity recognition models in ArcGIS use one of two back ends: one based on the transformer architecture, and another based on the spaCy library, which follows the embed, encode, attend, predict framework. The spaCy back end works in the following four steps:

  • Embed—In this process, the input text is transformed into dense word embeddings. These embeddings capture semantic information from the input text and are much easier to work with for the model.
    Embed workflow
  • Encode—In this process, the context is encoded into a word vector. This is done using residual trigram convolutional neural networks (CNN).
    Encode workflow
  • Attend—The matrix output from the previous step is reduced into a single vector that will be passed for prediction to a standard feed-forward network. This step has a similar effect as the attention mechanism.
    Attend workflow
  • Predict—The final step in the model is making a prediction given the input text. Here the vector from the attention layer is passed to a multilayer perceptron to output the entity label ID.
    Predict workflow
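
The four steps above can be sketched in plain Python. This is a toy illustration only, not the spaCy or ArcGIS implementation: the hash-derived embeddings, the averaging window that stands in for learned trigram convolution filters, the sum-based attention scores, the fixed weights, and the label set are all assumptions made for the example.

```python
import hashlib
import math

DIM = 4                              # toy embedding width (assumption)
LABELS = ["O", "DATE", "LOCATION"]   # hypothetical label set

# Fixed, made-up weights for a tiny two-layer perceptron.
W_HIDDEN = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.2, -0.1], [0.3, 0.3, -0.5, 0.2]]
W_OUT = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1], [0.1, 0.2, -0.6]]


def embed(tokens):
    """Embed: map each token to a dense vector (here derived from a hash)."""
    vectors = []
    for token in tokens:
        digest = hashlib.md5(token.lower().encode("utf-8")).digest()
        vectors.append([byte / 255.0 - 0.5 for byte in digest[:DIM]])
    return vectors


def encode(vectors):
    """Encode: mix each vector with its left and right neighbors (a trigram
    window), then add the result back to the input (a residual connection)."""
    zero = [0.0] * DIM
    padded = [zero] + vectors + [zero]
    encoded = []
    for i in range(1, len(padded) - 1):
        window = [sum(col) / 3.0 for col in zip(padded[i - 1], padded[i], padded[i + 1])]
        encoded.append([x + w for x, w in zip(padded[i], window)])
    return encoded


def attend(matrix):
    """Attend: reduce the matrix to a single vector using softmax weights."""
    scores = [sum(row) for row in matrix]
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * row[d] for w, row in zip(weights, matrix)) for d in range(DIM)]


def predict(vector):
    """Predict: pass the vector through a small perceptron, return a label ID."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, vector))) for row in W_HIDDEN]
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in W_OUT]
    return scores.index(max(scores))


tokens = "Flooding reported in Jakarta on Monday".split()
label_id = predict(attend(encode(embed(tokens))))
print(LABELS[label_id])
```

Because the weights are arbitrary, the printed label carries no meaning; the point is the shape of the data as it moves through the pipeline, from one vector per token, to a context-aware matrix, to a single vector, to a label ID.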

The transformer-based back end uses the architecture proposed by Vaswani et al. in the paper “Attention Is All You Need.” This allows the models to be more accurate and parallelizable, while requiring less labeled data for training. Internally, transformer-based entity recognition models in ArcGIS have two components:

  • Encoder—The encoder serves as the model’s backbone and transforms the input text to a feature representation in the form of fixed-size vectors. The model uses well-known encoders such as BERT, ALBERT, and RoBERTa that are based on the transformer architecture and are pretrained on huge volumes of text.
  • Token Level Classifier—A token-level classifier serves as the model’s head and classifies the feature representation of each token into multiple categories representing the various entities. A classifier is often a simple linear layer in the neural network.
Entity recognition model components
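
As a rough sketch of the token-level classifier, the following plain Python applies one linear layer to each token's feature vector and picks the highest-scoring label. The feature values, weights, and label names are invented for illustration; in a real model the features would come from the pretrained encoder and the weights would be learned.

```python
LABELS = ["O", "B-LOC", "I-LOC"]  # hypothetical label set


def linear(features, weights, bias):
    """Apply a single linear layer: one score per label."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def classify_tokens(token_features, weights, bias):
    """Classify every token independently from its feature vector."""
    labels = []
    for features in token_features:
        scores = linear(features, weights, bias)
        labels.append(LABELS[scores.index(max(scores))])
    return labels


tokens = ["Meet", "in", "New", "York"]
# Made-up encoder outputs, one 3-dimensional vector per token.
token_features = [
    [0.9, 0.1, 0.0],
    [0.8, 0.1, 0.1],
    [0.1, 0.7, 0.2],
    [0.0, 0.2, 0.8],
]
# Identity weights and zero bias keep the arithmetic easy to follow.
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
bias = [0.0, 0.0, 0.0]

print(list(zip(tokens, classify_tokens(token_features, weights, bias))))
# [('Meet', 'O'), ('in', 'O'), ('New', 'B-LOC'), ('York', 'I-LOC')]
```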


NLP models can be powerful tools for automating the analysis of huge volumes of unstructured text. As with other types of models, take care to apply them to relevant tasks, with an appropriate level of human oversight and with transparency about the type of model and the datasets used to train it.

Use entity recognition models

You can use the Extract Entities Using Deep Learning tool to apply a trained entity recognition model to unstructured text and extract useful information from it in the form of structured data. You can use pretrained entity recognition models from ArcGIS Living Atlas of the World or train your own models using the Train Entity Recognition Model tool.

The input to the Extract Entities Using Deep Learning tool is a folder containing the text files on which named entity recognition will be performed. The input model can be an Esri model definition JSON file (.emd), or a deep learning model package (.dlpk). The model contains the path to the deep learning model file (containing model weights) and other model parameters. Some models may have additional model arguments.

The tool creates a table containing the extracted entities from each text file in the input folder. If a locator is provided and the model can extract addresses, a feature class will be produced instead by geocoding the extracted addresses. If a text file contains multiple addresses, a feature is created by geocoding each address and replicating the other entities for that text file.
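
The replication behavior for multiple addresses can be sketched as follows. This is an illustration of the idea only: the function name, the field names, and the choice to join multiple values of a non-address entity with commas are assumptions, and no geocoding is performed.

```python
def build_rows(file_name, entities, address_field="Address"):
    """Return one output row per extracted address.

    entities maps an entity name to the list of strings extracted for it.
    Non-address entities are copied onto every row for the file.
    """
    addresses = entities.get(address_field, [])
    others = {name: ", ".join(values)
              for name, values in entities.items() if name != address_field}
    if not addresses:
        # No address extracted: a single table row for the file.
        return [{"File": file_name, **others}]
    return [{"File": file_name, address_field: address, **others}
            for address in addresses]


report = {
    "Address": ["380 New York St, Redlands", "100 Main St, Riverside"],
    "Crime": ["Burglary"],
    "Date": ["May 3"],
}
rows = build_rows("report1.txt", report)
for row in rows:
    print(row)
```

Here the single report yields two rows, one per address, and both rows carry the same Crime and Date values.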

While the tool can run on CPUs, a GPU is recommended for processing because deep learning is computationally expensive. To run this tool using a GPU, set the Processor Type environment setting to GPU. If you have more than one GPU, specify the GPU ID environment setting instead.

Train entity recognition models

You can use the Train Entity Recognition Model tool to train NLP models for named entity recognition. This tool uses a machine learning approach and trains the model by providing it training samples consisting of pairs of input text and labeled entities in that text. Training NLP models is a computationally intensive task, so a GPU is recommended.

The training data is provided as a folder containing a standard NER dataset in the form of .json or .csv files. The following dataset types are supported:

  • ner_json—The folder must contain a .json file with text and the labeled entities formatted using spaCy’s JSON training format.
  • IOB—The inside, outside, beginning (IOB) format proposed by Ramshaw and Marcus in the paper "Text Chunking using Transformation-Based Learning." In this format, the folder must contain the following two .csv files:
    • token.csv—This file must contain text as input chunks.
    • tags.csv—This file must contain IOB tags for the text chunks.
  • BILUO—An extension of the IOB format that contains the following tags: B for 'beginning', I for 'inside', L for 'last', U for 'unit', and O for 'outside'. When using this format, the folder containing training data must contain the following two .csv files:
    • token.csv—This file must contain text as input chunks.
    • tags.csv—This file must contain BILUO tags for the text chunks.

For more information about these formats and labeling data in these formats, visit the Labeling text using Doccano guide.
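
To make the difference between the two tagging schemes concrete, the sketch below tags the same tokens both ways. It assumes the common convention in which every entity begins with a B- tag (often called IOB2), and it represents entities as hypothetical (start, end, label) token spans; it does not show how the .csv files themselves are laid out.

```python
def tag_tokens(tokens, spans, scheme="IOB"):
    """Tag tokens from entity spans.

    spans is a list of (start, end, label) token index ranges, end exclusive.
    scheme is either "IOB" or "BILUO".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        length = end - start
        for offset in range(length):
            if scheme == "BILUO" and length == 1:
                prefix = "U"      # single-token entity
            elif offset == 0:
                prefix = "B"      # first token of the entity
            elif scheme == "BILUO" and offset == length - 1:
                prefix = "L"      # last token of a multi-token entity
            else:
                prefix = "I"      # any other token inside the entity
            tags[start + offset] = f"{prefix}-{label}"
    return tags


tokens = ["Fire", "reported", "in", "New", "York", "City", "on", "Monday"]
spans = [(3, 6, "LOC"), (7, 8, "DATE")]

print(tag_tokens(tokens, spans, "IOB"))
# ['O', 'O', 'O', 'B-LOC', 'I-LOC', 'I-LOC', 'O', 'B-DATE']
print(tag_tokens(tokens, spans, "BILUO"))
# ['O', 'O', 'O', 'B-LOC', 'I-LOC', 'L-LOC', 'O', 'U-DATE']
```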

When training an entity recognition model, you can choose to either train the model from scratch, or further fine-tune an already trained model.

If you already have access to a pretrained entity recognition model with the same set of target entities as there are in the training samples, you may choose to further fine-tune it on the new training data. Fine-tuning an existing model is often quicker than training a new model, and this process also requires fewer training samples. When fine-tuning a pretrained model, ensure that you keep the same backbone model that was used in the pretrained model.

The pretrained model can be an Esri model definition file (.emd) or a deep learning package file (.dlpk). The output model is also saved in these formats in the specified Output Model folder.

Entity recognition models in ArcGIS treat address entities differently from other entities. If an entity is to be treated as a location, it should be specified as an address entity in the tool parameters. During inference, these entities are geocoded using the specified locator and a feature class is produced as a result of the entity extraction process. If a locator is not provided, or the trained model does not extract address entities, a table containing the extracted entities is produced instead.

Training deep learning models is an iterative process in which the input training data is passed through the neural network several times. Each training pass through the entirety of the training data is known as an epoch. The Max Epochs parameter specifies the maximum number of times the training data is seen by the model while it is being trained. The appropriate value depends on the model you are training, the complexity of the task, and the number of training samples that you have. If you have a large number of training samples, you can use a small value. In general, it is good practice to keep training for more epochs as long as the validation loss continues to go down.

The Model Backbone parameter specifies the preconfigured neural network that serves as the encoder for the model and extracts feature representations of the input text. The tool supports BERT-, ALBERT-, RoBERTa-, and spaCy-based encoders. The BERT, ALBERT, and RoBERTa encoders are based on the transformer architecture and are pretrained on large volumes of text in a semisupervised manner; the spaCy encoder follows the embed, encode, attend, predict framework.

Model training happens in batches, and the Batch Size parameter specifies the number of training samples that are processed for training at one time. Increasing the batch size can improve the performance of the tool, but as the batch size increases, more memory is used. If an out of memory error occurs while training the model, use a smaller batch size.
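
The relationship between batch size and the number of training steps per epoch can be shown with a short sketch. The helper name and the sample data are placeholders, not part of the tool.

```python
import math


def make_batches(samples, batch_size):
    """Split the samples into consecutive batches; the last may be smaller."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


samples = [f"sentence {n}" for n in range(10)]
batches = make_batches(samples, batch_size=4)

print([len(batch) for batch in batches])          # [4, 4, 2]
print(math.ceil(len(samples) / 4), "steps per epoch")
```

A larger batch size means fewer steps per epoch, but each step holds more samples in memory at once, which is why reducing the batch size is the usual response to an out of memory error.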

The Learning Rate parameter is one of the most important hyperparameters. It is the rate at which the model weights are adjusted during training. If you specify a low learning rate, the model improves slowly and may take a long time to train, leading to wasted time and resources. A high learning rate can be counterproductive: the model weights may be adjusted so drastically that the model does not learn well and produces erroneous results. It is often best to not specify a value for the Learning Rate parameter, as the tool uses an automated learning rate finder based on the paper "Cyclical Learning Rates for Training Neural Networks" by Leslie N. Smith.
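
The effect of low and high learning rates can be seen on a deliberately simple problem. The sketch below runs plain gradient descent on f(w) = w², whose minimum is at w = 0. It illustrates the general behavior only; it is not the tool's learning rate finder, and the function name and the chosen rates are assumptions for the example.

```python
def descend(learning_rate, steps=20, start=1.0):
    """Run gradient descent on f(w) = w**2 and return the final weight."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w              # derivative of w**2
        w -= learning_rate * gradient   # weight update scaled by the rate
    return w


for rate in (0.01, 0.4, 1.1):
    print(rate, descend(rate))
```

With a rate of 0.01 the weight is still far from 0 after 20 steps, with 0.4 it is essentially at the minimum, and with 1.1 each update overshoots so far that the weight moves away from the minimum instead of toward it.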

The tool uses a portion of the training data (10 percent by default) as a validation set. The Validation Percentage parameter allows you to adjust the amount of training data to be used for validation.
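
A holdout split of this kind can be sketched as follows. The random shuffle, the seed, and the function name are assumptions for the example; the documentation does not state how the tool selects the validation samples.

```python
import random


def split_samples(samples, validation_percentage=10, seed=0):
    """Shuffle the samples and hold out a percentage of them for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_percentage / 100)
    return shuffled[n_validation:], shuffled[:n_validation]


samples = list(range(50))
train, validation = split_samples(samples)
print(len(train), len(validation))  # 45 5
```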

By default, the tool uses an early stopping technique that causes model training to stop when the model is no longer improving over subsequent training epochs. You can turn this behavior off by unchecking the Stop when model stops improving check box.
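
One common form of early stopping, combined with a maximum epoch count, is sketched below. The patience value (the number of epochs without improvement that are tolerated) and the use of validation loss as the measure of improvement are assumptions; the tool's exact stopping criterion is not described here.

```python
def train_with_early_stopping(validation_losses, max_epochs, patience=2):
    """Simulate training: stop at max_epochs, or earlier when the validation
    loss has not improved for `patience` consecutive epochs.

    validation_losses stands in for the loss measured after each epoch.
    Returns (epochs_run, best_loss).
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for loss in validation_losses[:max_epochs]:
        epochs_run += 1
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return epochs_run, best_loss


losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.5]
print(train_with_early_stopping(losses, max_epochs=10))  # (5, 0.6)
print(train_with_early_stopping(losses, max_epochs=3))   # (3, 0.6)
```

In the first call, training stops after the fifth epoch because the loss failed to improve twice in a row, so the lower loss at the sixth epoch is never reached. This is the trade-off of early stopping: it saves time but can end training before a later improvement.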

You can also specify whether the backbone layers in the pretrained model will be frozen, so that the weights and biases remain as originally designed. By default, the layers of the backbone model are not frozen, and the weights and biases of the Model Backbone value can be altered to fit the training samples. This takes more time to process but typically produces better results.
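
The difference between frozen and unfrozen backbone layers comes down to which weights a training step is allowed to change. The sketch below uses a hypothetical naming scheme ("backbone." and "head." prefixes) and a plain gradient descent update to show the idea.

```python
def sgd_step(params, grads, learning_rate, frozen_prefixes=()):
    """Apply one gradient descent update, skipping frozen parameters."""
    updated = {}
    for name, value in params.items():
        if name.startswith(tuple(frozen_prefixes)):
            updated[name] = value  # frozen: keep the pretrained weight
        else:
            updated[name] = value - learning_rate * grads[name]
    return updated


params = {"backbone.w": 1.0, "head.w": 1.0}
grads = {"backbone.w": 0.5, "head.w": 0.5}

# Frozen backbone: only the classifier head changes.
print(sgd_step(params, grads, 0.1, frozen_prefixes=("backbone.",)))
# Unfrozen backbone: every weight is adjusted to fit the training samples.
print(sgd_step(params, grads, 0.1))
```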


Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. "Attention Is All You Need." December 6, 2017.

Honnibal, Matthew. "Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models."

Freksen, Casper B., Lior Kamma, and Kasper Green Larsen. "Fully Understanding the Hashing Trick." May 22, 2018.


Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." May 24, 2019.

Smith, Leslie N. "Cyclical Learning Rates for Training Neural Networks." April 4, 2017.

Ramshaw, Lance, and Mitch Marcus. "Text Chunking using Transformation-Based Learning." 1995. In Third Workshop on Very Large Corpora.