How to Create Word Embeddings With TensorFlow

Derrick Mwiti
6 min read

Context length is one of the biggest limitations of GPT models such as ChatGPT. The amount of text you can fit in a prompt is capped because these models can only accept a certain number of tokens.

The solution? Embeddings.

What are Word Embeddings?

Word embedding is a technique for representing the words in a document as dense vectors of real numbers. Each word in the vocabulary is mapped to a vector, and semantically similar words end up close to each other in the vector space.

Embeddings visualized using an embedding projector

For instance, suppose you want to ask a question about one of Lex's videos, which are over two hours long. The first step is to transcribe the video. The transcription is longer than the input context a GPT model can accept, so the solution is to break it up into shorter sentences and create embeddings for each one.

Next, you create an embedding for the input question. Then you compare the question's embedding to the embeddings of the transcription and return the top few, say the 3 most similar. Instead of passing the entire transcribed text to the model, you pass only the text behind those most similar embeddings as context. With that, you can talk to a PDF, a transcribed video, and so on, without passing the entire document to the model, which would not fit in its context window.
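
To make the retrieval step concrete, here is a minimal NumPy sketch of the similarity search. The names question_embedding, chunk_embeddings, and chunks are hypothetical placeholders for vectors and text you have already produced with an embedding model; only the cosine-similarity ranking is shown.

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def top_k_chunks(question_embedding, chunk_embeddings, chunks, k=3):
    # Score every transcript chunk against the question and keep the k most similar
    scores = [cosine_similarity(question_embedding, emb) for emb in chunk_embeddings]
    best = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in best]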

How to Represent Words as Numbers

Before creating word embeddings, you must convert the words to some numerical representation. For example, consider the sentence "The cat sat on the mat". Each word can be one-hot encoded as a vector the length of the vocabulary, with 1 marking the position of that word and 0 everywhere else.

One-hot encoding example

The above approach is inefficient because it leads to a vector with many zeros, a sparse matrix. The alternative is to represent each word with a unique integer. In "The cat sat on the mat" you can define the words as:

  • The → 1
  • cat → 2
  • sat → 3
  • on → 4
  • the → 1
  • mat → 5

Therefore, the sentence will be numerically represented as [1, 2, 3, 4, 1, 5], which is a dense vector.
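
As a quick illustration, that mapping can be reproduced with a few lines of plain Python. This toy encoder is only for intuition; the TextVectorization layer used later in this post does the same job for you.

sentence = "The cat sat on the mat"

# Assign each new word the next free integer; lowercasing makes "The" and "the" share an index
vocab = {}
encoded = []
for word in sentence.lower().split():
    if word not in vocab:
        vocab[word] = len(vocab) + 1
    encoded.append(vocab[word])

print(vocab)    # {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
print(encoded)  # [1, 2, 3, 4, 1, 5]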

Creating Word Embeddings in TensorFlow

A word embedding represents each word in a text corpus with floating point values while capturing the relationships between words. These relationships are learned when the embeddings are trained. The size (dimensionality) of the embedding vector is a hyperparameter you choose.

A 4-dimensional word embedding

The Embedding layer is used for learning word embeddings in TensorFlow. Here's a demonstration using the IMDB dataset.  You can follow along with the code on Kaggle.
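
Before walking through the full example, here is a minimal sketch of what the layer does on its own: it maps integer word indices to trainable floating point vectors, four-dimensional ones in this case. The vocabulary size of 1,000 is just an illustrative value.

import tensorflow as tf

# An embedding table for a 1,000-word vocabulary, 4 floats per word
embedding_layer = tf.keras.layers.Embedding(input_dim=1000, output_dim=4)

# Look up the (randomly initialized, not yet trained) vectors for three word indices
print(embedding_layer(tf.constant([1, 2, 3])).shape)  # (3, 4)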

Load and Process the Data

First, import all the required packages and load the data.

from numpy import array
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, GlobalAveragePooling1D
import nltk
from nltk.corpus import stopwords
import tensorflow as tf
from sklearn.model_selection import train_test_split
import pandas as pd
from datasets import load_dataset

# Load the first 5,000 reviews of the IMDB training split into a DataFrame
dataset = load_dataset("imdb", split="train[:5000]")
df = pd.DataFrame(dataset)
df.head()

Clean the Text Data

The text data contains unnecessary items, such as punctuation marks and other special characters, that must be removed, and the reviews should also be converted to lowercase; both steps are handled later by the TextVectorization layer. Removing common English words (stopwords) reduces the size of the data and can improve the model's performance. You'll also need to decide how large a vocabulary to keep from the text corpus.

Remove stopwords from the reviews using NLTK:

nltk.download('stopwords')

def remove_stop_words(review):
    # Drop common English words that carry little signal for sentiment
    stop_words = set(stopwords.words('english'))
    cleaned_review = ' '.join(word for word in review.split() if word not in stop_words)
    return cleaned_review

df['review'] = df['text'].apply(remove_stop_words)

Split the dataset into a training and testing set:

docs = df['review']
labels = array(df['label'])
X_train, X_test, y_train, y_test = train_test_split(docs, labels, test_size=0.20)

Text Preprocessing with TensorFlow

Next, convert the reviews to a numerical representation using the TextVectorization layer. It takes the following arguments:

  • standardize specifies how the text data is preprocessed. For example, the lower_and_strip_punctuation option lowercases the text and removes punctuation.
  • max_tokens dictates the maximum size of the vocabulary.
  • output_mode determines the output of the vectorization layer. Setting it to int outputs integer token indices.
  • output_sequence_length indicates the maximum length of the output sequence. This ensures that all sequences have the same length.

max_features = 5000  # Maximum vocabulary size
batch_size = 32
max_len = 100  # Sequence length to pad the outputs to
vectorize_layer = tf.keras.layers.TextVectorization(
    standardize='lower_and_strip_punctuation',
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=max_len)
vectorize_layer.adapt(X_train, batch_size=None)
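
After adapt, you can peek at the learned vocabulary to confirm the layer has indexed the training reviews; the first two entries are the padding and out-of-vocabulary tokens, followed by the most frequent words (the exact tokens depend on your data).

print(vectorize_layer.get_vocabulary()[:10])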

Apply the layer to the training and testing data and bundle the dataset as TensorFlow datasets.

X_train_padded = vectorize_layer(X_train)
X_test_padded = vectorize_layer(X_test)

training_data = tf.data.Dataset.from_tensor_slices((X_train_padded, y_train))
validation_data = tf.data.Dataset.from_tensor_slices((X_test_padded, y_test))
training_data = training_data.batch(batch_size)
validation_data = validation_data.batch(batch_size)
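
A quick sanity check on one batch confirms the shapes: 32 reviews per batch, each padded or truncated to 100 tokens, with one label per review.

for reviews, sentiments in training_data.take(1):
    print(reviews.shape, sentiments.shape)  # (32, 100) (32,)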

Define Keras Embedding Layer

With the data in the proper format, the next step is to create a Keras model to train the embedding layer. The Embedding layer expects:

  • The size of the vocabulary, defined here as 5000
  • The dimension of the dense embeddings, in this case, 8
  • The length of the input sequences, 100 in this example

Compile the model with the Adam optimizer and the binary cross-entropy loss, since this is a binary classification task. Then fit the model on the training data.

model = Sequential()
# Learn an 8-dimensional embedding for each of the 5,000 vocabulary words
model.add(Embedding(max_features, 8, input_length=max_len))
# Average the word embeddings into a single vector per review
model.add(GlobalAveragePooling1D())
model.add(Dense(16, activation='relu'))
# Single sigmoid unit for the positive/negative sentiment probability
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

print(model.summary())

model.fit(training_data, validation_data=validation_data, epochs=2)

The final step is to evaluate the model:

loss, accuracy = model.evaluate(training_data, verbose=1)
print('Training Accuracy is {}'.format(accuracy * 100))
loss, accuracy = model.evaluate(validation_data)
print('Testing Accuracy is {}'.format(accuracy * 100))
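
If you want to visualize the learned embeddings in the Embedding Projector, as in the figure at the top of this post, one common approach is to export the Embedding layer's weights together with the vocabulary as TSV files. A minimal sketch:

import io

# The first layer is the Embedding layer; its weight matrix has shape (vocab_size, embedding_dim)
weights = model.layers[0].get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

with io.open('vectors.tsv', 'w', encoding='utf-8') as out_vectors, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as out_metadata:
    for index, word in enumerate(vocab):
        if index == 0:
            continue  # skip the padding token
        out_vectors.write('\t'.join(str(x) for x in weights[index]) + '\n')
        out_metadata.write(word + '\n')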

Final Thoughts

Apart from training word embeddings from scratch, you can use pre-trained ones such as Word2Vec and GloVe. Once you have trained your model, the next step is to deploy it. Deploying models is complex because you have to monitor their performance on real-world data and act on what you see; for example, you have to retrain the model if it starts performing poorly on new data. You also have to consider the latency and throughput your application requires. Check out ML School if you'd like to delve into the intricacies of deploying a machine learning model for real-world use.
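
As a rough illustration of the pre-trained route mentioned above, the sketch below builds an embedding matrix from a GloVe file and loads it into a frozen Embedding layer. The file name glove.6B.100d.txt and the 100-dimensional vectors are assumptions based on the standard GloVe download, and the code reuses vectorize_layer and max_features from earlier.

import numpy as np
import tensorflow as tf

embedding_dim = 100  # must match the GloVe file used (an assumption here)

# Parse the GloVe file into a word -> vector dictionary
glove_vectors = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.split()
        glove_vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')

# Build an embedding matrix aligned with the TextVectorization vocabulary
vocab = vectorize_layer.get_vocabulary()
embedding_matrix = np.zeros((max_features, embedding_dim))
for index, word in enumerate(vocab):
    vector = glove_vectors.get(word)
    if vector is not None:
        embedding_matrix[index] = vector

# Frozen Embedding layer initialized with the pre-trained weights
pretrained_embedding = tf.keras.layers.Embedding(
    max_features, embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False)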

Got any questions? Post a comment below.
