
Building a Simple Transformer for Text Classification in Tensorflow



Introduction

Machine learning and natural language processing (NLP) have witnessed a paradigm shift in recent years with the introduction of Transformer models. These models have revolutionized NLP tasks such as machine translation, sentiment analysis, and text classification, achieving state-of-the-art results. In this comprehensive guide, we will delve into the world of Transformers, explaining their core concepts and then walking through the process of building a simple Transformer model for text classification using Tensorflow.


This blog aims to help you understand the fundamentals of Transformer models and how to implement one from scratch. By the end of this guide, you will have a strong foundation in Transformers and be able to create your own custom models for text classification tasks.


Why Transformer Models?


Understanding the Limitations of Traditional NLP Models

Before the emergence of Transformer models, traditional NLP models like RNNs (Recurrent Neural Networks) and CNNs (Convolutional Neural Networks) were widely used for sequence and text data tasks. However, these models had some limitations:

  • Sequential Processing: RNNs process data one token at a time, which makes them slow and hard to parallelize. Because GPU acceleration could not be exploited effectively, large models were impractical to train and training times were significantly high.

  • Limited Context: CNNs capture only a limited context window, while understanding the meaning of a word often depends on the whole sentence. In practice, CNN-based models could not match the performance of RNN-based models in certain scenarios.

  • Long Dependencies: Both RNNs and CNNs struggle to capture long-range dependencies in text. Hybrid CNN-RNN models that tried to address this also could not exploit GPU acceleration effectively, which limited the size of the models that could be trained.

Advantages of Transformers

Transformers introduced a novel architecture that addressed these limitations:

  • Parallel Processing: Transformers can process input data in parallel, significantly speeding up training and inference.

  • Attention Mechanism: The self-attention mechanism allows Transformers to consider all words in a sequence simultaneously, capturing long-range dependencies effectively.

  • State-of-the-Art Performance: Transformers achieved state-of-the-art results on a wide range of NLP tasks, thanks to their ability to model context effectively.

In the following sections, we will delve into the core concepts of Transformers, preparing us to build a simple Transformer model for text classification.


Introduction to Transformers

Transformers are a type of deep learning model introduced by Vaswani et al. in the paper "Attention Is All You Need." These models have two key components: the self-attention mechanism and feedforward neural networks. Let's understand these components:

Self-Attention Mechanism

Self-attention, also known as scaled dot-product attention, is at the heart of the Transformer model. It allows the model to weigh the importance of different words in a sequence when processing each word. Here's a simplified breakdown of how self-attention works (a short code sketch follows the list):

  1. Query, Key, and Value: For each word in the input sequence, three vectors are generated: Query, Key, and Value. They are produced by linear projections whose weight matrices are learned during training.

  2. Attention Scores: The similarity between the Query of a word and the Key of every other word in the sequence is computed using dot products.

  3. Scaled Attention: The attention scores are divided by the square root of the key dimension; without this scaling, large dot products would push the softmax into regions with extremely small gradients.

  4. Softmax: The scaled attention scores are passed through a softmax function to obtain a distribution over all words in the sequence, indicating their importance.

  5. Weighted Sum: Finally, the Value vectors are multiplied by the attention scores and summed up to obtain the output for each word.
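Putting the five steps together, scaled dot-product attention can be written in a few lines of TensorFlow. The function below is a simplified sketch for illustration only (it omits masking and the learned projections); the model we build later relies on Keras' built-in attention layer instead.

import tensorflow as tf

def scaled_dot_product_attention(query, key, value):
    # 2. Attention scores: dot product between each Query and every Key
    scores = tf.matmul(query, key, transpose_b=True)   # (batch, seq_len, seq_len)
    # 3. Scale by the square root of the key dimension
    d_k = tf.cast(tf.shape(key)[-1], tf.float32)
    scores = scores / tf.math.sqrt(d_k)
    # 4. Softmax turns the scores into weights that sum to 1 for each Query
    weights = tf.nn.softmax(scores, axis=-1)
    # 5. Weighted sum of the Value vectors gives the output for each word
    return tf.matmul(weights, value), weights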

Multi-Head Attention

In practice, Transformers use multi-head attention, which means the self-attention mechanism is applied several times in parallel, each head with its own learned Query, Key, and Value transformations. This allows the model to attend to different parts of the input sequence and capture different kinds of relationships at the same time.
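Keras ships a ready-made implementation, layers.MultiHeadAttention, which we will reuse inside our Transformer block later. A minimal self-attention call looks like this (the shapes and values below are illustrative only):

import tensorflow as tf
from tensorflow.keras import layers

# Two heads, each projecting queries/keys/values to 64 dimensions (illustrative values)
mha = layers.MultiHeadAttention(num_heads=2, key_dim=64)

# Self-attention: query, key and value are all the same sequence
x = tf.random.uniform((1, 200, 64))   # (batch, sequence length, embedding dim)
attended = mha(x, x)                  # output has the same shape as x
print(attended.shape)                 # (1, 200, 64)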


Positional Encoding

One limitation of Transformers is that they do not have an inherent sense of word order. To overcome this, positional encoding is added to the input embeddings. Positional encodings can be fixed sinusoidal functions, as in the original paper, or learned embeddings; in this tutorial we use learned position embeddings that convey the position of each word in the sequence.
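At its core, this is just a second embedding table indexed by position, whose output is added to the token embeddings; the TokenAndPositionEmbedding layer used later in this tutorial wraps exactly this idea. A rough sketch, using the same hyperparameter values as the model below:

import tensorflow as tf
from tensorflow.keras import layers

maxlen, vocab_size, embed_dim = 200, 20000, 64   # same values as in the model below

token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

tokens = tf.random.uniform((1, maxlen), maxval=vocab_size, dtype=tf.int32)  # dummy batch
positions = tf.range(start=0, limit=maxlen, delta=1)   # 0, 1, ..., maxlen - 1
x = token_emb(tokens) + pos_emb(positions)             # position information added to each token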


Now that we have a solid understanding of the core concepts of Transformers, let's move on to building a simple Transformer model for text classification using Tensorflow.


Preparing the Dataset


Choosing a Text Classification Dataset

To build and train our text classification model, we need a dataset. For this tutorial, we will use the well-known IMDb movie reviews dataset. It contains movie reviews classified as positive or negative sentiment.


You can download the dataset from the official IMDb website or use the Tensorflow library to load it, as shown in the code below:

from tensorflow import keras

vocab_size = 20000  # Only consider the top 20k words
maxlen = 200  # Only consider the first 200 words of each movie review
# Load the IMDb reviews as sequences of integer word indices
(x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(num_words=vocab_size)
print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")
# Pad/truncate every review to exactly maxlen tokens
x_train = keras.utils.pad_sequences(x_train, maxlen=maxlen)
x_val = keras.utils.pad_sequences(x_val, maxlen=maxlen)


Building Blocks of a Transformer Model

To construct our Transformer model, we need to implement several key components, each of which can be written as a custom Keras layer (a sketch follows the list):

  • Self-Attention Layer: Responsible for capturing dependencies between words.

  • Multi-Head Attention Layer: Enables the model to focus on different parts of the input sequence using the Self-Attention Layer.

  • Position-wise Feedforward Layer: Adds depth to the model and allows it to learn complex representations.
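The model code in the next section refers to two custom layers, TokenAndPositionEmbedding and TransformerBlock, which are not built into Keras. A minimal implementation of both, closely following the standard Keras text-classification-with-Transformer example, is sketched below:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TransformerBlock(layers.Layer):
    """Multi-head self-attention followed by a position-wise feed-forward network."""
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim)]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=False):
        attn_output = self.att(inputs, inputs)                 # self-attention over the sequence
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)           # residual connection + layer norm
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)              # residual connection + layer norm

class TokenAndPositionEmbedding(layers.Layer):
    """Learned token embeddings plus learned position embeddings."""
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions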


Building the Transformer Model

# Embedding size for each token
embed_dim = 64
# Number of attention heads
num_heads = 2
# Hidden layer size in the feed-forward network inside the transformer block
ff_dim = 32

# Define the input layer
inputs = layers.Input(shape=(maxlen,))

# Token + position embedding layer
embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
x = embedding_layer(inputs)

# Define the transformer block (multi-head self-attention + feed-forward network)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)

# Pool over the sequence dimension and classify
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(2, activation="softmax")(x)

# Define the final model
model = keras.Model(inputs=inputs, outputs=outputs)
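Before training, you can print a layer-by-layer overview to confirm the shapes flowing through the network:

# Print a summary of the layers, output shapes and parameter counts
model.summary()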

Training the Transformer Model

Loss Function and Optimization

For this text classification setup, where the final layer produces two softmax outputs and the labels are integers (0 or 1), we use sparse categorical cross-entropy as the loss function. We also choose an optimization algorithm, such as Adam, to update the model's parameters during training.

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

history = model.fit(x_train, y_train, batch_size=32, epochs=2, validation_data=(x_val, y_val))

Evaluating the Model

Metrics for Text Classification

Text classification models are typically evaluated using the following metrics:

  • Accuracy: The proportion of correctly classified instances.

Within just two epochs, the model reached about 82% accuracy on the validation set, which illustrates the potential of Transformer models compared to models built with RNNs, CNNs, and similar architectures.
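You can reproduce this number by evaluating the trained model on the validation split; the exact accuracy will vary slightly between runs:

# Evaluate the trained model on the held-out validation data
val_loss, val_acc = model.evaluate(x_val, y_val, batch_size=32)
print(f"Validation accuracy: {val_acc:.2%}")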


Conclusion

In this extensive guide, we've explored the world of Transformers and walked through the process of building a simple Transformer model for text classification using Tensorflow. We covered essential concepts such as self-attention, multi-head attention, and positional encoding, which form the core of Transformers.


By implementing the key components and training the model on the IMDb dataset, you've gained valuable hands-on experience in building and evaluating Transformer models for text classification tasks. Remember that Transformers have shown remarkable results across various NLP tasks, making them a powerful tool in your machine learning toolkit.


If you need help with your project or want a consultation on Transformer-based models, feel free to contact us at contact@codersarts.com
