
How to Create Word Embeddings With TensorFlow
Context length is one of the biggest problems with GPT models such as ChatGPT. There is a limitation on the number of words in your prompt because these models can only accept a certain number of tokens.
The solution? Embeddings.
What are Word Embeddings?
Word embedding is a technique for representing documents with dense vectors. The vocabulary in these documents is mapped to vectors of real numbers, and semantically similar words are mapped close to each other in the vector space.

For instance, suppose you want to ask a question about one of Lex's videos, which are over two hours long. The first step is to transcribe the video. The transcription is longer than the input context a GPT model can accept, so the solution is to break it up into shorter sentences and create an embedding for each.
Next, you create an embedding for the input question. Then you compare the question's embedding to the embeddings of the transcription and return the top, say, 3 most similar ones. Instead of passing the entire transcribed text to the model, you pass those similar chunks as the context. With that, you can talk to a PDF, a transcribed video, and so on, without feeding the entire document to the model, which would not fit in its context window.
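Here's a rough sketch of that retrieval step, assuming the transcript chunks and their embeddings are already available as NumPy arrays (chunks, chunk_embeddings, and question_embedding are placeholders for whatever embedding model and text splitting you use):
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two 1-D embedding vectors.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def top_k_chunks(question_embedding, chunk_embeddings, chunks, k=3):
    # Score every transcript chunk against the question and keep the k most similar.
    scores = [cosine_similarity(question_embedding, e) for e in chunk_embeddings]
    best = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in best]
The returned chunks are what you pass to the model as context alongside the question.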
How to Represent Words as Numbers
Before creating word embeddings, you must convert the words to some numerical representation. For example, consider the sentence, "The cat sat on the mat". Each word can be represented in a matrix with 0 indicating the absence of the word and 1 its presence.
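As a quick sketch, this is what that one-hot style matrix looks like for the example sentence, built with plain NumPy:
import numpy as np

sentence = "the cat sat on the mat".split()
vocab = sorted(set(sentence))  # ['cat', 'mat', 'on', 'sat', 'the']

# One row per word in the sentence, one column per vocabulary entry.
one_hot_matrix = np.zeros((len(sentence), len(vocab)), dtype=int)
for row, word in enumerate(sentence):
    one_hot_matrix[row, vocab.index(word)] = 1

print(one_hot_matrix)  # mostly zeros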

The above approach is inefficient because it leads to a vector with many zeros, a sparse matrix. The alternative is to represent each word with a unique integer. In "The cat sat on the mat" you can define the words as:
- The: 1
- cat: 2
- sat: 3
- on: 4
- the: 1
- mat: 5
Therefore, the sentence will be numerically represented as [1, 2, 3, 4, 1, 5], which is a dense vector.
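A minimal sketch of this integer encoding, assuming each new (lowercased) word simply gets the next free integer in order of appearance:
sentence = "The cat sat on the mat".lower().split()

# Assign every new word the next free integer, starting at 1.
word_to_id = {}
for word in sentence:
    if word not in word_to_id:
        word_to_id[word] = len(word_to_id) + 1

encoded = [word_to_id[word] for word in sentence]
print(encoded)  # [1, 2, 3, 4, 1, 5]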
Creating Word Embeddings in TensorFlow
A word embedding represents the words in a text corpus with floating-point vectors that capture the relationships between the different words. These relationships are learned when the embeddings are trained. The size of the embedding vector is set manually.
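As a quick standalone sketch, the snippet below creates an Embedding layer that maps each integer word ID to a 4-dimensional vector of floats; the values are random here because the layer has not been trained yet:
import tensorflow as tf

# Vocabulary of 6 IDs (0-5), each mapped to a 4-dimensional float vector.
embedding = tf.keras.layers.Embedding(input_dim=6, output_dim=4)

word_ids = tf.constant([[1, 2, 3, 4, 1, 5]])  # "the cat sat on the mat"
vectors = embedding(word_ids)
print(vectors.shape)  # (1, 6, 4): one 4-dimensional vector per word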

The Embedding layer is used for learning word embeddings in TensorFlow. Here's a demonstration using the IMDB dataset. You can follow along with the code on Kaggle.
Load and Process the Data
First, import all the required packages and load the data.
from numpy import array
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, GlobalAveragePooling1D
from datasets import load_dataset

# Load the first 5,000 IMDB reviews from the Hugging Face datasets hub.
dataset = load_dataset("imdb", split="train[:5000]")
df = pd.DataFrame(dataset)
df.head()

Clean the Text Data
The text data contains unnecessary items, such as punctuation marks and other special characters, that must be removed. After that, convert all the reviews to lowercase. Removing common English words – stopwords – reduces the size of the data and improves the model's performance. You also need to decide how large a vocabulary to keep from the text corpus.
Remove stopwords from the reviews using NLTK:
nltk.download('stopwords')

def remove_stop_words(review):
    # Drop common English words that carry little signal for classification.
    stop_words = set(stopwords.words('english'))
    cleaned_review = [word for word in review.split() if word not in stop_words]
    return ' '.join(cleaned_review)

df['review'] = df['text'].apply(remove_stop_words)
Split the dataset into a training and testing set:
docs = df['review']
labels = array(df['label'])
X_train, X_test, y_train, y_test = train_test_split(docs, labels, test_size=0.20)
Text Preprocessing with TensorFlow
Next, convert the reviews to a numerical representation using the TextVectorization layer. It expects:
- standardize specifies how the text data is processed. For example, the lower_and_strip_punctuation option will lowercase the data and strip punctuation.
- max_tokens dictates the maximum size of the vocabulary.
- output_mode determines the output of the vectorization layer. Setting it to int outputs integers.
- output_sequence_length indicates the maximum length of the output sequence. This ensures that all sequences have the same length.
max_features = 5000 # Maximum vocab size.
batch_size = 32
max_len = 100 # Sequence length to pad the outputs to.
vectorize_layer = tf.keras.layers.TextVectorization(
    standardize='lower_and_strip_punctuation',
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=max_len)
vectorize_layer.adapt(X_train, batch_size=None)
Apply the layer to the training and testing data and bundle the dataset as TensorFlow datasets.
X_train_padded = vectorize_layer(X_train)
X_test_padded = vectorize_layer(X_test)
training_data = tf.data.Dataset.from_tensor_slices((X_train_padded, y_train))
validation_data = tf.data.Dataset.from_tensor_slices((X_test_padded, y_test))
training_data = training_data.batch(batch_size)
validation_data = validation_data.batch(batch_size)

Define Keras Embedding Layer
With the data in the proper format, the next step is to create a Keras model to train the embedding layer. The Embedding layer expects:
- The size of the vocabulary, defined here as 5000
- The dimension of the dense embeddings, in this case, 8
- The length of the input sequences, 100 in this example
We compile the model using the Adam optimizer and the binary cross entropy loss since it's a binary classification task. Next, fit the data to the model.
model = Sequential()
model.add(Embedding(max_features, 8, input_length=max_len))
model.add(GlobalAveragePooling1D())
model.add(Dense(16, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
print(model.summary())
model.fit(training_data, validation_data=validation_data, epochs=2)
The final step is to evaluate the model:
loss, accuracy = model.evaluate(training_data, verbose=1)
print('Training Accuracy is {}'.format(accuracy*100))
loss, accuracy = model.evaluate(validation_data)
print('Testing Accuracy is {} '.format(accuracy*100))
Final Thoughts
Apart from training word embeddings from scratch, you can use pre-trained ones such as Word2Vec and GloVe. Once you have trained your model, the next step is to deploy it. Deploying models is complex because you have to monitor their performance on real-world data and make changes based on what you observe. For example, you have to retrain the model if its performance on that data starts to degrade. You also have to consider the latency and throughput the model requires. Check out ML School if you'd like to delve into the intricacies of deploying a machine learning model for real-world use.
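If you go the pre-trained route, the usual pattern is to load the vectors into a matrix and hand that matrix to the Embedding layer as its initial, frozen weights. Here is a minimal sketch, assuming you have downloaded a GloVe file such as glove.6B.100d.txt (the file path is an assumption) and reuse the vectorize_layer from the example above:
import numpy as np
import tensorflow as tf

embedding_dim = 100
vocab = vectorize_layer.get_vocabulary()  # vocabulary learned during adapt()

# Parse the GloVe file into a {word: vector} lookup.
glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:  # assumed local path
    for line in f:
        word, *values = line.split()
        glove[word] = np.asarray(values, dtype="float32")

# Build the embedding matrix row by row; words missing from GloVe stay as zeros.
embedding_matrix = np.zeros((len(vocab), embedding_dim))
for i, word in enumerate(vocab):
    if word in glove:
        embedding_matrix[i] = glove[word]

pretrained_embedding = tf.keras.layers.Embedding(
    len(vocab),
    embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False)  # freeze the pre-trained vectors
You can then drop this layer into the Sequential model above in place of the one trained from scratch.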
Got any questions? Post a comment below.