How Text Transformation works

Text transformation is the process of translating or converting a sequence of text to another sequence of the same or different length. This is accomplished by using sequence-to-sequence (or Seq2seq) models in the natural language processing (NLP) domain. The Text Analysis toolset in the GeoAI toolbox contains tools for training text transformation models as well as for using the models for transforming text from one form to another.

The Train Text Transformation Model tool trains NLP models for sequence-to-sequence tasks. The trained models can then be used with the Transform Text Using Deep Learning tool to transform, translate, or summarize text.

Potential applications

The following are potential applications for this tool:

  • Incorrect street addresses with spelling mistakes and nonstandard formatting can be corrected and standardized. This can increase the accuracy when geocoding these addresses.
  • Text in a foreign language can be translated to allow you to understand it better (within the limits of machine translation) or process it further.
  • Legal descriptions of parcel boundaries (such as metes and bounds) can be transformed to traverse files and be automatically processed to derive the geometry of parcels.

[Figure: Text transformation model flow chart]
A text transformation model for address standardization can correct spelling mistakes and standardize street addresses.

Text transformation models in ArcGIS are based on the Transformer architecture proposed by Vaswani et al. in the seminal paper “Attention Is All You Need.” This architecture allows the models to be more accurate and parallelizable while requiring less labeled data for training.

Internally, text transformation models in ArcGIS are encoder-decoder models. The encoder’s attention layers receive all the words in the input text, but the decoder’s attention layers can only access the words before the word being processed. The decoder then uses the encoded feature representations to produce the output sequence of tokens. Some well-known encoder-decoder models are BART and T5.
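The difference in what the encoder and decoder attention layers can see is commonly expressed as an attention mask. The following is a minimal sketch in plain Python, not the implementation used by ArcGIS. The function names and the example tokens are hypothetical, and the sketch assumes the usual convention that the decoder input is shifted right by a start token, so that each position sees only the tokens produced before it.

```python
def encoder_mask(n):
    """Full visibility: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def decoder_mask(n):
    """Causal visibility: position i may attend only to positions 0..i.

    With the decoder input shifted right by a start token, positions 0..i
    hold exactly the tokens produced before the one being predicted.
    """
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


# Hypothetical decoder input for an address standardization example.
decoder_input = ["<s>", "940", "N", "Pennsylvania"]

print("Encoder mask (1 = visible):")
for row in encoder_mask(len(decoder_input)):
    print(row)

print("Decoder mask (1 = visible):")
for row in decoder_mask(len(decoder_input)):
    print(row)
```

Each row corresponds to the position being processed, and each column to a position it may attend to. The lower-triangular decoder mask is what prevents the decoder from looking at tokens it has not yet produced.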

Encoder and decoder are described as follows:

  • Encoder—The encoder transforms the input text to a numerical representation in the form of fixed-length feature vectors. This numerical representation retains the semantic meaning of the input text.
  • Decoder—The decoder takes the encoded feature vectors from the encoder and combines them with the input sequence to produce an output sequence of tokens.

[Figure: Components in a text transformation model]

NLP models can be effective tools when automating the analysis of large volumes of unstructured text. As with other types of models, ensure that they are applied to relevant tasks with the appropriate level of human oversight and with transparency about the type of model and the datasets used to train it.

Use text transformation models

The Transform Text Using Deep Learning tool can be used to apply a trained text transformation model to unstructured text and transform it to a different language or format. You can use pretrained text transformation models from ArcGIS Living Atlas of the World or train custom models using the Train Text Transformation Model tool.

The input to the Transform Text Using Deep Learning tool is a feature class or table containing the text to be transformed. The input model can be an Esri model definition JSON file (.emd) or a deep learning model package (.dlpk). The model contains the path to the deep learning model file (containing model weights) and other model parameters. Certain models may have additional model arguments. The tool creates a field in the input table containing the transformed text.

While the tool can run on CPUs, a GPU is recommended for processing as deep learning is computationally intensive. To run this tool using a GPU, set the Processor Type environment setting to GPU. If you have more than one GPU, specify the GPU ID environment setting instead.

Train text transformation models

The Train Text Transformation Model tool can be used to train NLP models for text transformation. This tool uses a machine learning approach and trains the model by providing it with training samples consisting of pairs of input text and the target transformed output. Training NLP models is a computationally intensive task, and a GPU is recommended for this.

The training data is provided in the form of an input table that contains a text field that acts as the predictor variable and a label field that contains the target label for each input text in the table.
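As an illustration of this layout, the sketch below builds a small table of training pairs and reads it back as (input, target) tuples. The field names input_text and target_text and the sample addresses are hypothetical, and a CSV held in memory stands in for the input table. The actual tool lets you choose which fields hold the text and the label.

```python
import csv
import io

# Hypothetical training pairs for address standardization:
# the text field holds the raw input, the label field holds the target output.
rows = [
    {"input_text": "940 n pensylvania av, apt 3",
     "target_text": "940 N Pennsylvania Ave, Apt 3"},
    {"input_text": "12 mian street",
     "target_text": "12 Main St"},
    {"input_text": "77 w. oak boulevrd",
     "target_text": "77 W Oak Blvd"},
]

# Write the table to an in-memory CSV file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["input_text", "target_text"])
writer.writeheader()
writer.writerows(rows)

# Read it back as (input, target) pairs, one pair per training sample.
buffer.seek(0)
pairs = [(r["input_text"], r["target_text"]) for r in csv.DictReader(buffer)]

for source, target in pairs:
    print(f"{source!r} -> {target!r}")
```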

When training a text transformation model, you can choose to either train the model from scratch or fine-tune a trained model. In general, language models using the transformer architecture are considered few-shot learners.

However, if you have access to a pretrained text transformation model that performs a similar task, you can fine-tune it on the new training data. Fine-tuning an existing model is often quicker than training a new model, and this process also requires fewer training samples. When fine-tuning a pretrained model, ensure that you keep the same backbone model that was used in the pretrained model.

The pretrained model can be an Esri Model Definition file or a deep learning package file. The output model is also saved in these formats in the specified Output Model folder.

Training deep learning models is an iterative process in which the input training data is passed through the neural network several times. Each training pass through the entire training data is known as an epoch. The Max Epochs parameter specifies the maximum number of times the training data is seen by the model while it is being trained. The appropriate value depends on the model you are training, the complexity of the task, and the number of training samples. If you have a lot of training samples, you can use a small value. In general, it is a good idea to keep training for more epochs as long as the validation loss continues to go down.

The Model Backbone parameter specifies the preconfigured neural network that serves as the encoder for the model and extracts feature representations of the input text. The tool supports T5-based encoders, which are built on the transformer architecture and pretrained on large volumes of text in a semisupervised manner, giving them a good understanding of language.

Model training occurs in batches, and the Batch Size parameter specifies the number of training samples that are processed for training at one time. Increasing the batch size can improve the performance of the tool. However, as the batch size increases, more memory is used. If an out of memory error occurs while training the model, use a smaller batch size.
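The relationship between the number of training samples, the batch size, and the number of batches in one epoch can be sketched as follows. This is a generic illustration with a hypothetical helper name (iter_batches) and made-up sample counts, not the tool's internal data loader.

```python
import math


def iter_batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


samples = list(range(10))   # stand-in for 10 training samples
batch_size = 4

batches = list(iter_batches(samples, batch_size))
steps_per_epoch = math.ceil(len(samples) / batch_size)

print("Batches in one epoch:", batches)
print("Steps per epoch:", steps_per_epoch)
```

A larger batch size means fewer steps per epoch, but each step holds more samples in memory at once, which is why reducing the batch size is the usual response to an out of memory error.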

The Learning Rate parameter is one of the most important hyperparameters. It is the rate at which the model weights are adjusted during training. If you specify a low learning rate, the model improves slowly and may take a long time to train, wasting time and resources. A high learning rate may be counterproductive, and the model may not learn well. With high learning rates, the model weights may be adjusted drastically, causing the model to produce erroneous results. It is often best to not specify a value for the Learning Rate parameter, as the tool uses an automated learning rate finder based on the "Cyclical Learning Rates for Training Neural Networks" paper by Leslie N. Smith.
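The effect of a low, moderate, and high learning rate can be seen on a toy problem. The sketch below minimizes f(w) = w² with plain gradient descent. It is only an illustration of the behavior described above, not the tool's learning rate finder, and the three rates are arbitrary values chosen for this toy function.

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimize f(w) = w**2 and return the weight after the given steps."""
    w = start
    for _ in range(steps):
        gradient = 2 * w                  # derivative of w**2
        w -= learning_rate * gradient     # weight update scaled by the rate
    return w


final_low = gradient_descent(0.001)      # improves, but very slowly
final_moderate = gradient_descent(0.1)   # approaches the minimum at w = 0
final_high = gradient_descent(1.1)       # overshoots and moves away from 0

print("low rate:     ", final_low)
print("moderate rate:", final_moderate)
print("high rate:    ", final_high)
```

With the low rate, the weight barely moves from its starting value in 20 steps. With the high rate, each update overshoots the minimum by more than the previous error, so the weight grows instead of shrinking.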

The tool uses a part of the training data (10 percent by default) as a validation set. The Validation Percentage parameter allows you to adjust the amount of training data to be used for validation.
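A validation split of this kind can be sketched as a shuffled hold-out. The function name, the fixed seed, and the rounding rule below are assumptions made for illustration; the source does not describe how the tool selects the validation samples.

```python
import random


def split_train_validation(samples, validation_percentage=10, seed=0):
    """Shuffle the samples and hold out a percentage for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded for a repeatable split
    n_validation = round(len(shuffled) * validation_percentage / 100)
    return shuffled[n_validation:], shuffled[:n_validation]


train, validation = split_train_validation(range(200))
print(len(train), "training samples,", len(validation), "validation samples")
```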

By default, the tool uses an early stopping technique that causes model training to stop when the model is no longer improving over subsequent training epochs. You can turn this behavior off by unchecking the Stop when model stops improving parameter.
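Early stopping is typically implemented by tracking the best validation loss seen so far and stopping once it has not improved for a number of epochs. The sketch below shows one common variant. The patience value and the simulated losses are hypothetical, and the tool's exact stopping criterion is not stated in the source.

```python
def run_with_early_stopping(validation_losses, max_epochs, patience=2):
    """Return (epochs_run, best_loss) for a sequence of per-epoch losses.

    Training stops at max_epochs, or earlier once the validation loss has
    failed to improve for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, loss in enumerate(validation_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return epochs_run, best_loss


# Simulated validation losses, one value per epoch.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.5]

print(run_with_early_stopping(losses, max_epochs=10))
print(run_with_early_stopping(losses, max_epochs=3))
```

In the first call, training stops before reaching the maximum number of epochs because the loss stalls. In the second, the Max Epochs limit ends training first.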

You can also specify whether the backbone layers in the pretrained model will be frozen, so that the weights and biases remain as originally designed. By default, the layers of the backbone model are not frozen, and the weights and biases of the Model Backbone value can be altered to fit the training samples. This takes more time to process but typically produces better results.

Text data often contains noise in the form of HTML tags and URLs. You can use the Remove HTML Tags and Remove URLs parameters to preprocess the text and remove the tags and URLs before processing.
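This kind of cleanup can be approximated with regular expressions. The following is a simplified sketch, not the preprocessing the tool actually performs. The patterns and the function name clean_text are assumptions, and the tag pattern will not handle every HTML edge case.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")                # simple HTML tag matcher
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")  # http(s) and www links


def clean_text(text, remove_html_tags=True, remove_urls=True):
    """Strip HTML tags and URLs, then collapse runs of whitespace."""
    if remove_html_tags:
        # Tags are removed first so URLs inside attributes go with them.
        text = TAG_PATTERN.sub(" ", text)
    if remove_urls:
        text = URL_PATTERN.sub(" ", text)
    return " ".join(text.split())


raw = "<p>Parcel at 12 Main St.</p> See https://example.com/parcels?id=7 for details"
print(clean_text(raw))
print(clean_text(raw, remove_urls=False))
```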


Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. "Attention Is All You Need." December 6, 2017.

Raffel, Colin, et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." July 20, 2020.

"Encoder models."

Brown, Tom B., et al. "Language Models are Few-Shot Learners." July 22, 2020.

Smith, Leslie N. "Cyclical Learning Rates for Training Neural Networks." April 4, 2017.