Entity recognition, also known as Named Entity Recognition (NER), is the process of extracting named entities, such as organizations, persons, dates, or locations, from sentences, paragraphs, textual reports, or other forms of unstructured text. The Text Analysis toolset in the GeoAI toolbox contains tools that use natural language processing (NLP) techniques to train entity recognition models and use these models to extract entities from unstructured text.
The Train Entity Recognition Model tool trains NLP models to extract a predefined set of entities (such as organizations, persons, dates, or countries) provided as part of a training dataset. Trained entity recognition models can be used with the Extract Entities Using Deep Learning tool to extract those entities from raw text into a structured format.
Potential applications
The following are potential applications for this tool:
- Extracting data such as location, reporting officer, or weapon used from crime reports. This can help display trends and aid in planning remediation efforts.
- Information such as location or whether assistance is required can be extracted from tweets or social media posts. This can help identify areas where immediate help is required during a natural disaster.
Entity recognition models in ArcGIS are based on two back ends: one based on the Transformer architecture and the other based on the spaCy library, which works on the Embed, encode, attend, predict framework (a conceptual sketch follows the list below).
- Embed—In this process, the input text is transformed into dense word embeddings. These embeddings capture semantic information from the input text and are much easier to work with for the model.
- Encode—In this process, the context is encoded into a word vector. This is done using residual trigram convolutional neural networks (CNNs).
- Attend—The matrix output from the previous step is reduced into a single vector that is passed to a standard feed-forward network for prediction. This step has a similar effect to the attention mechanism.
- Predict—The final step in the model is making a prediction given the input text. Here the vector from the attention layer is passed to a multilayer perceptron to output the entity label ID.
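The following is a conceptual sketch, in PyTorch, of these four stages. It only illustrates the framework described above, not the implementation used by ArcGIS; the layer sizes and label count are arbitrary.

```python
# Conceptual sketch of the embed, encode, attend, predict pipeline (illustration only).
import torch
import torch.nn as nn

class EmbedEncodeAttendPredict(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_labels):
        super().__init__()
        # Embed: map token IDs to dense word embeddings.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Encode: contextualize embeddings with a trigram (kernel size 3) CNN.
        self.encode = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1)
        # Attend: score each position so the matrix can be reduced to one vector.
        self.attend = nn.Linear(embed_dim, 1)
        # Predict: a small feed-forward network outputs the entity label ID.
        self.predict = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, num_labels)
        )

    def forward(self, token_ids):                       # token_ids: (batch, seq)
        x = self.embed(token_ids)                       # (batch, seq, embed_dim)
        h = self.encode(x.transpose(1, 2)).transpose(1, 2)
        h = torch.relu(h) + x                           # residual connection
        weights = torch.softmax(self.attend(h), dim=1)  # (batch, seq, 1)
        summary = (weights * h).sum(dim=1)              # (batch, embed_dim)
        return self.predict(summary)                    # (batch, num_labels)
```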
The transformer-based back end uses the architecture proposed by Vaswani et al. in the paper “Attention Is All You Need.” This allows the models to be more accurate and parallelizable, while requiring less labeled data for training. Internally, transformer-based entity recognition models in ArcGIS have two components:
- Encoder—The encoder serves as the model’s backbone and transforms the input text to a feature representation in the form of fixed-size vectors. The model uses well-known encoders such as BERT, ALBERT, and RoBERTa that are based on the transformer architecture and are pretrained on huge volumes of text.
- Token Level Classifier—A token-level classifier serves as the model’s head and classifies the feature representation of each token into multiple categories representing the various entities. A classifier is often a simple linear layer in the neural network.
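The sketch below shows how an encoder and a token-level classifier fit together, using the Hugging Face transformers library as a stand-in. The model name, sample text, and label set are placeholders; ArcGIS manages these components internally, so this is illustrative only.

```python
# Minimal sketch of an encoder backbone plus a token-level classification head.
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

labels = ["O", "B-Address", "I-Address", "B-Organization", "I-Organization"]  # example label set

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoder = AutoModel.from_pretrained("bert-base-cased")           # pretrained transformer encoder
classifier = nn.Linear(encoder.config.hidden_size, len(labels))  # token-level head (linear layer)

inputs = tokenizer("Reported near 380 New York St, Redlands.", return_tensors="pt")
features = encoder(**inputs).last_hidden_state   # (1, num_tokens, hidden_size) feature vectors
logits = classifier(features)                    # (1, num_tokens, num_labels)
predicted = [labels[i] for i in logits.argmax(dim=-1)[0].tolist()]  # one label per token
```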
Entity recognition models in ArcGIS also support the Mistral model backbone. Mistral is a large language model based on a transformer architecture and operates as a decoder-only model. The key components of Mistral's architecture include the following:
- Sliding Window Attention—Efficiently handles long texts by processing them in smaller, overlapping segments, reducing computational cost and memory usage while preserving important context.
- Grouped Query Attention—Improves efficiency by clustering similar queries together, which minimizes the amount of attention computations and speeds up processing.
- Byte-Fallback BPE (Byte Pair Encoding) Tokenizer—Converts text into tokens for the model to process.
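The following minimal sketch shows the idea behind sliding window attention: each position attends only to nearby, earlier positions instead of the full sequence. The window size and sequence length are arbitrary example values, not Mistral's actual configuration.

```python
# Illustrative sliding window attention mask: True where a query position may
# attend to a key position (not in the future and within the window).
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    positions = torch.arange(seq_len)
    causal = positions[None, :] <= positions[:, None]          # no attending to future tokens
    in_window = positions[:, None] - positions[None, :] < window  # stay within the window
    return causal & in_window

print(sliding_window_mask(seq_len=6, window=3).int())
```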
Note:
NLP models can be powerful tools for automating the analysis of huge volumes of unstructured text. As with other types of models, care should be taken to ensure that they are applied to relevant tasks, with the appropriate level of human oversight and transparency about the type of model and the training datasets used to train it.
Use entity recognition models
You can use the Extract Entities Using Deep Learning tool to apply a trained entity recognition model to unstructured text and extract useful information from it in the form of structured data. You can use pretrained entity recognition models from ArcGIS Living Atlas of the World or train your own models using the Train Entity Recognition Model tool.
The input to the Extract Entities Using Deep Learning tool is a folder containing the text files or a text column in a feature class or table on which named entity recognition will be performed. The input model can be an Esri model definition JSON file (.emd), or a deep learning model package (.dlpk). The model contains the path to the deep learning model file (containing model weights) and other model parameters. Some models may have additional model arguments.
The tool creates a table containing the extracted entities from each text. If a locator is provided and the model can extract addresses, a feature class will be produced instead by geocoding the extracted addresses. If a text file contains multiple addresses, a feature is created by geocoding each address and replicating the other entities for that text file.
While the tool can run on CPUs, a GPU is recommended for processing because deep learning is computationally expensive. To run this tool using a GPU, set the Processor Type environment setting to GPU. If you have more than one GPU, also specify the GPU ID environment setting to select which GPU to use.
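The following is a minimal sketch of running the tool from Python. The geoprocessing function and parameter names are assumptions based on the tool and parameter names described in this topic; check the Extract Entities Using Deep Learning tool reference for the exact signature.

```python
# Sketch only: function and parameter names below are assumptions, not confirmed syntax.
import arcpy

arcpy.env.processorType = "GPU"   # use the GPU for inference
arcpy.env.gpuId = 0               # select a specific GPU when more than one is present

arcpy.geoai.ExtractEntitiesUsingDeepLearning(
    in_folder="C:/data/crime_reports",                        # folder of input text files
    out_table="C:/data/results.gdb/extracted_entities",       # output table or feature class
    in_model_definition="C:/models/ner_model/ner_model.dlpk", # trained .dlpk or .emd model
)
```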
Train entity recognition models
You can use the Train Entity Recognition Model tool to train NLP models for named entity recognition. The tool uses a machine learning approach and trains the model by providing it with training samples consisting of pairs of input text and the labeled entities in that text. With the Mistral model backbone, the tool uses in-context learning, guiding the model's understanding and responses through input prompts and by providing the model with specific examples that help it infer the desired output. Training NLP models is a computationally intensive task, so a GPU is recommended.
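The sketch below illustrates the idea of in-context learning: a prompt contains a few labeled examples that guide the model toward the desired output. The prompt format, entities, and example texts are invented for illustration and do not show the prompt ArcGIS builds internally.

```python
# Illustrative few-shot prompt for entity extraction with a decoder-only LLM such as Mistral.
prompt = """Extract the entities Location and Date from the text.

Text: A burglary was reported on Oak Street on 12 May 2023.
Entities: {"Location": "Oak Street", "Date": "12 May 2023"}

Text: Flooding closed Main Avenue on 3 March 2024.
Entities:"""
# Given the examples above, the model is expected to continue with:
# {"Location": "Main Avenue", "Date": "3 March 2024"}
```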
The input can be either of the following:
- A feature class or table containing a text field with the input text for the model and fields containing the labeled entities. The selected text field is used as the input text for the model, and the remaining fields are treated as labels for the named entities.
- A folder containing training data in the form of standard datasets for NER tasks. The training data must be in .json or .csv files. The file format determines the dataset type of the input.
- When the input is a folder, the following dataset types are supported (sample token and tag files follow this list):
- ner_json—The training data folder should contain a .json file with text and the labeled entities formatted using the spaCy JSON training format.
- IOB—The IOB (I - inside, O - outside, B - beginning tags) format proposed by Ramshaw and Marcus in the paper Text Chunking using Transformation-Based Learning.
The training data folder should contain the following two .csv files:
- tokens.csv—Contains text as input chunks
- tags.csv—Contains IOB tags for the text chunks
- BILUO—An extension of the IOB format that additionally contains L - last and U - unit tags.
The training data folder should contain the following two .csv files:
- tokens.csv—Contains text as input chunks
- tags.csv—Contains BILUO tags for the text chunks
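As an illustration, the following shows how one tokenized chunk might be tagged in the two formats. The sentence, entity names, and row layout are examples only; see the dataset type descriptions above and the tool documentation for the exact file layout expected.

```
tokens.csv (one tokenized text chunk per row):
Flooding,reported,near,Central,Park,in,New,York

tags.csv with IOB tags for the same chunk:
O,O,O,B-Location,I-Location,O,B-City,I-City

tags.csv with BILUO tags for the same chunk:
O,O,O,B-Location,L-Location,O,B-City,L-City
```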
When training an entity recognition model, you can train the model from scratch, further fine-tune an already trained model, or use in-context learning.
If you already have access to a pretrained entity recognition model with the same set of target entities as in the training samples, you can fine-tune it further on the new training data. Fine-tuning an existing model is often quicker than training a new model, and the process also requires fewer training samples. When fine-tuning a pretrained model, ensure that you keep the same backbone model that was used in the pretrained model.
The pretrained model can be an Esri model definition file (.emd) or a deep learning package file (.dlpk). The output model is also saved in these formats in the specified Output Model folder.
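The following is a minimal sketch of fine-tuning an existing model from Python. The geoprocessing function and parameter names are assumptions based on the parameter descriptions in this topic; check the Train Entity Recognition Model tool reference for the exact signature.

```python
# Sketch only: function and parameter names below are assumptions, not confirmed syntax.
import arcpy

arcpy.geoai.TrainEntityRecognitionModel(
    in_folder="C:/data/ner_training",                           # labeled training data
    out_model="C:/models/crime_reports_ner",                    # output model folder
    pretrained_model_file="C:/models/base_ner/base_ner.dlpk",   # existing model to fine-tune
    model_backbone="bert-base-cased",                           # keep the pretrained model's backbone
    max_epochs=10,
)
```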
Entity recognition models in ArcGIS treat address entities differently from other entities. If an entity is to be treated as a location, it should be specified as an address entity in the tool parameters. During inference, these entities are geocoded using the specified locator and a feature class is produced as a result of the entity extraction process. If a locator is not provided, or the trained model does not extract address entities, a table containing the extracted entities is produced instead.
Training deep learning models is an iterative process in which the input training data is passed through the neural network several times. Each training pass through the entirety of the training data is known as an epoch. The Max Epochs parameter specifies the maximum number of times the training data is seen by the model while it is being trained. The appropriate value depends on the model you are training, the complexity of the task, and the number of training samples you have. If you have a large number of training samples, you can use a small value. In general, it is good practice to keep training for more epochs as long as the validation loss continues to go down.
The Model Backbone parameter specifies the preconfigured neural network that serves as the encoder for the model and extracts feature representations of the input text. This parameter supports BERT-, ALBERT-, and RoBERTa-based encoders that are based on the transformer architecture and are pretrained on large volumes of text in a semisupervised manner. The Model Backbone parameter also supports the Mistral large language model (LLM). Mistral is a decoder-only transformer that uses Sliding Window Attention for efficient long-text processing, Grouped Query Attention to streamline computations, and a Byte-Fallback BPE tokenizer to handle diverse text inputs.
Model training happens in batches, and the Batch Size parameter specifies the number of training samples that are processed for training at one time. Increasing the batch size can improve the performance of the tool, but as the batch size increases, more memory is used. If an out of memory error occurs while training the model, use a smaller batch size.
The Learning Rate parameter is one of the most important hyperparameters. It is the rate at which the model weights are adjusted during training. If you specify a low learning rate, the model improves slowly and may take a long time to train, leading to wasted time and resources. A high learning rate may be counterproductive, and the model may not learn well. With high learning rates, the model weights may be adjusted drastically, causing it to produce erroneous results. It is often best to not specify a value for the Learning Rate parameter, as the tool uses an automated learning rate finder based on the paper Cyclical Learning Rates for Training Neural Networks by Leslie N. Smith.
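The sketch below shows the idea behind an automated learning rate search in the spirit of Smith's paper: the learning rate is increased a little after every batch while the loss is recorded, and a value just below the point where the loss starts to diverge is chosen. The model, loss function, and batches are assumed to be supplied by the caller; this illustrates the concept, not the exact search the tool performs.

```python
# Illustrative learning rate range test (not the tool's internal implementation).
import torch

def lr_range_test(model, loss_fn, batches, low=1e-7, high=1.0, steps=100):
    optimizer = torch.optim.SGD(model.parameters(), lr=low)
    multiplier = (high / low) ** (1.0 / steps)   # geometric increase per batch
    history = []
    for _, (inputs, targets) in zip(range(steps), batches):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        history.append((optimizer.param_groups[0]["lr"], loss.item()))
        for group in optimizer.param_groups:     # raise the learning rate for the next batch
            group["lr"] *= multiplier
    return history                               # inspect where the loss starts to diverge

# Usage (with your own model, loss, and data loader):
# history = lr_range_test(my_model, torch.nn.CrossEntropyLoss(), my_batches)
```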
The tool uses a portion of the training data as a validation set. The Validation Percentage parameter allows you to adjust the amount of training data to be used for validation. For the Mistral model, at least 50 percent of the data must be reserved for validation, as Mistral requires a smaller training set and relies on a larger validation set to assess model performance.
By default, the tool uses an early stopping technique that causes model training to stop when the model is no longer improving over subsequent training epochs. You can turn this behavior off by unchecking the Stop when model stops improving check box.
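The following minimal sketch shows the early stopping idea: training halts once the validation loss has not improved for a set number of consecutive epochs. The loss values and patience below are illustrative, not values used by the tool.

```python
# Illustrative early stopping loop driven by per-epoch validation losses.
validation_losses = [0.80, 0.62, 0.55, 0.56, 0.57, 0.58, 0.40]  # example values, one per epoch
best_loss = float("inf")
patience, epochs_without_improvement = 3, 0

for epoch, loss in enumerate(validation_losses, start=1):
    if loss < best_loss:
        best_loss, epochs_without_improvement = loss, 0
    else:
        epochs_without_improvement += 1
    if epochs_without_improvement >= patience:
        print(f"Stopping after epoch {epoch}: no improvement for {patience} epochs")
        break
```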
You can also specify whether the backbone layers in the pretrained model will be frozen, so that the weights and biases remain as originally designed. By default, the layers of the backbone model are not frozen, and the weights and biases of the Model Backbone value can be altered to fit the training samples. This takes more time to process but typically produces better results.
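The sketch below shows what freezing the backbone means in practice, using a Hugging Face token classification model as a stand-in: the encoder's parameters stop receiving gradient updates and only the classification head remains trainable. The model name and label count are placeholders.

```python
# Illustrative backbone freezing with a Hugging Face token classification model.
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=5)

for param in model.base_model.parameters():  # the encoder backbone
    param.requires_grad = False              # its weights and biases stay fixed

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the classification head remains trainable
```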
Resources
See the following for more information:
"Attention Is All You Need" by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin on arXivLabs at Cornell University
Honnibal, Matthew. "Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models." https://explosion.ai/blog/deep-learning-formula-nlp.
"Fully Understanding the Hashing Trick." May 22, 2018. https://arxiv.org/pdf/1805.08539.pdf.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." May 24, 2019. https://arxiv.org/pdf/1810.04805.pdf.
Smith, Leslie N. "Cyclical Learning Rates for Training Neural Networks." April 4, 2017. https://doi.org/10.48550/arxiv.1506.01186.
Ramshaw, Lance, and Mitch Marcus. "Text Chunking using Transformation-Based Learning." 1995. In Third Workshop on Very Large Corpora.
"Mistral." https://docs.mistral.ai/getting-started/open_weight_models/.