How Transformers work

The Text Analysis tools in the GeoAI toolbox use the recently developed transformer architecture to perform natural language processing (NLP) tasks. This architecture makes the tools more accurate and parallelizable, while requiring less labeled data for training.

Transformers are language models trained in a self-supervised fashion on large amounts of data in the form of raw text. You can fine-tune these pretrained models for your data using transfer learning. The architecture was proposed in the paper Attention Is All You Need and is used to solve tasks such as text classification, text translation, text generation, and entity recognition.

A transformer consists of two main components: an encoding component and a decoding component. The encoding component is a stack of encoders that receives an input and encodes it into a representation in the form of features. The decoding component is a stack of decoders that uses the representation from the encoding component, along with other inputs, to generate an output in the form of a target sequence.
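The data flow between the two components can be sketched in a few lines of Python. This is only an illustration under stated assumptions: run_encoders, run_decoders, toy_encoder, and toy_decoder are hypothetical names, and the toy layers are arithmetic stand-ins rather than real attention and feed forward sublayers.

```python
def run_encoders(encoders, inputs):
    """Pass the input through every encoder in the stack, in order."""
    features = inputs
    for encoder in encoders:
        features = encoder(features)
    return features


def run_decoders(decoders, features, other_inputs):
    """Each decoder sees the encoder features plus the running decoder state."""
    state = other_inputs
    for decoder in decoders:
        state = decoder(state, features)
    return state


# Hypothetical stand-in layers (assumption): real layers are learned, not fixed arithmetic.
def toy_encoder(features):
    return [value + 1.0 for value in features]


def toy_decoder(state, features):
    # Mix a summary of the encoder features into the decoder state.
    context = sum(features) / len(features)
    return [value + context for value in state]


features = run_encoders([toy_encoder, toy_encoder], [1.0, 2.0, 3.0])
target = run_decoders([toy_decoder, toy_decoder], features, [0.0])
print(features, target)
```

The point of the sketch is the wiring: the encoder stack runs once over the input, and every decoder in the stack receives the same encoded features alongside its own state.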

Transformer architecture
A high-level view depicts components of a Transformer.

Depending on the task, encoders and decoders can be used in the following ways:

  • The encoding component can be used for tasks such as text classification and named entity recognition. Some well-known encoding models are ALBERT, BERT, and RoBERTa.
  • The decoding component can be used for tasks such as text generation. Some well-known decoding models are CTRL and Transformer-XL.
  • Together, the components can be used for tasks such as text translation and summarization. Some well-known encoding-decoding models are BART and T5.

Encoder layers
A high-level view depicts components of an encoder.

All the encoders in the encoding component have an identical structure. Each encoder has two sublayers:

  • Self-attention layer—This layer allows models to find other words in the input sequence to enhance encoding of a current word.

    As an example, consider the following input sequence: "The man couldn't cross the road as he was injured." Here, the word "he" refers to the man who couldn't cross the road. A self-attention layer in this case will allow the model to associate the word "he" with the word "man".

  • Feed forward layer—This layer accepts the output from the attention layer and transforms it into a form that the next encoder can accept.
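A minimal sketch of the self-attention idea, in plain Python, may help make the example sentence concrete. It uses scaled dot-product attention as described in Attention Is All You Need, with two simplifying assumptions: the queries, keys, and values are the input vectors themselves (a trained model would first apply learned projections), and the 2-dimensional embeddings are hand-picked, hypothetical values chosen so that "he" and "man" point in similar directions.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections (assumption)."""
    scale = math.sqrt(len(vectors[0]))
    outputs, weights = [], []
    for query in vectors:
        row = softmax([dot(query, key) / scale for key in vectors])
        weights.append(row)
        # Each output is a weighted average of all the input vectors.
        outputs.append([
            sum(w * value[i] for w, value in zip(row, vectors))
            for i in range(len(query))
        ])
    return outputs, weights


# Hypothetical embeddings for three words of the example sentence.
tokens = ["man", "road", "he"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

outputs, weights = self_attention(embeddings)
he_weights = dict(zip(tokens, weights[tokens.index("he")]))
print(he_weights)
```

With these made-up vectors, the row of weights for "he" gives more weight to "man" than to "road", which is the association described above. In a real model the embeddings and projections are learned during training rather than chosen by hand.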

All the decoders in the decoding component have an additional attention layer, also referred to as encoder-decoder attention, in addition to the self-attention and feed forward layers. This encoder-decoder attention layer allows the decoder to focus on certain sections of the input sequence.

Decoder layers
A high-level view depicts components of a decoder.
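Encoder-decoder attention can be sketched the same way as self-attention, except that the queries come from the decoder while the keys and values come from the encoder output. The following is a simplified illustration, not a production implementation: cross_attention is a hypothetical name, learned projections are omitted, and the vectors are made-up values.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_states, encoder_outputs):
    """Queries come from the decoder; keys and values come from the encoder."""
    width = len(encoder_outputs[0])
    scale = math.sqrt(width)
    attended, weights = [], []
    for query in decoder_states:
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in encoder_outputs
        ]
        row = softmax(scores)
        weights.append(row)
        # Weighted average of the encoder outputs for this decoder position.
        attended.append([
            sum(w * value[i] for w, value in zip(row, encoder_outputs))
            for i in range(width)
        ])
    return attended, weights


# Hypothetical values: three encoded input positions, one decoder position.
encoder_outputs = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
decoder_states = [[1.0, 0.0]]

attended, weights = cross_attention(decoder_states, encoder_outputs)
focus = max(range(len(encoder_outputs)), key=lambda i: weights[0][i])
print(focus, weights[0])
```

Here the single decoder position places the most weight on the first input position, because that encoder output is most similar to the decoder's query. This is the sense in which the decoder "focuses" on certain sections of the input sequence.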


See the following resources for additional information:
