TensorFlow Recurrent Neural Networks (Complete guide with examples and code)

Thomas Tsuma


Recurrent Neural Networks (RNNs) are a class of neural networks that form associations between sequential data points. Take, for example, the average sales made per month over a certain period. The data has a natural progression from month to month, meaning that the sales for the first month are the only independent value; every later month depends on the sales made before it.

Such deep learning techniques have found use in the fields of natural language processing, time series analysis & prediction, speech recognition, and image captioning, among others.

What is a Recurrent Neural Network?

Recurrent Neural Networks are an improvement on feedforward networks.

Feedforward neural networks are a form of neural network in which the nodes form a strictly serial connection with no cycles. Information flows sequentially from the input layer through the hidden layers to the output layer. These networks have no recall ability: they do not store any previously seen information in memory, so they struggle with predictions that depend on what came before.

Traditional Feed-Forward Network

In contrast, information in a Recurrent Neural Network cycles through a loop. Recurrent Neural Networks receive an additional hidden input: the hidden state carried over from the previous timestep. Thus, when making predictions, each layer considers the current input together with the lessons learned from the previous inputs.

Standard Recurrent Neural Network

Both feedforward and recurrent neural networks apply weights to their inputs. Recurrent Neural Networks, however, apply weights to both the current input and the previous hidden state. They also adjust these weights through gradient descent during backpropagation through time, which is the next concept to uncover.

Backpropagation through time

Let's start with a definition of the concept of rudimentary backpropagation.

The steps for training a neural network are as follows:

  1. A forward pass from the input layer through the hidden layers to the output layer generates a prediction.
  2. The prediction is compared with the actual value using a loss function. The loss function quantifies the error of the prediction, which tells us how accurate the network is.
  3. Using this error, a second traversal is made backward from the output layer, calculating a gradient for each node.

Step three is what is referred to as backpropagation. Gradients essentially define the learning ability of a particular layer: a higher gradient results in a larger adjustment to the weights in that layer. Each node's gradient is computed from the gradients of the layer after it (the layer handled just before it in the backward pass), so when those gradients shrink, the adjustments made to earlier layers become smaller still.
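As a tiny illustration of how a gradient translates into a weight adjustment (the numbers below are made up):

# One plain gradient-descent update: larger gradients produce larger adjustments.
learning_rate = 0.01
gradient = 0.5        # gradient of the loss with respect to this weight
weight = 1.0
weight = weight - learning_rate * gradient
print(weight)         # 0.995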

Backpropagation through time is an algorithm that adjusts weights in neural networks with recall ability.

The backpropagation steps through time are as follows:

  1. Present a sequence of timesteps of input and output pairs to the network.
  2. Unroll the network, then calculate and accumulate errors across each timestep.
  3. Roll up the network and update weights.

Unrolling or unfolding is a method of simplifying an RNN by visualizing the different steps as a graph with no cycles. It is a requirement for Recurrent Neural Networks because each consecutive timestep requires the previous one to determine its output.
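To make unrolling concrete, here is a minimal NumPy sketch (the names and shapes are illustrative): the same cell is applied once per timestep, and each step feeds its hidden state into the next.

import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # One timestep of a vanilla RNN cell.
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

def unrolled_forward(inputs, W_x, W_h, b, h0):
    # Unrolling: loop over the timesteps, carrying the hidden state forward.
    h = h0
    states = []
    for x_t in inputs:                 # inputs is a list of per-timestep vectors
        h = rnn_step(x_t, h, W_x, W_h, b)
        states.append(h)
    return states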

Types of Recurrent Neural Networks

There are four main types of neural networks:

  • One to One: This neural network takes in one input and produces a single output. It is sometimes referred to as a vanilla neural network.
  • One to many: One to many neural networks have several outputs for a single input. A typical example is an image captioning RNN.
  • Many to one: A many to one RNN requires a sequence of inputs to generate a single output. This type of RNN is applicable in sentiment analysis. A sentence could contain several tokens whose combination can be determined to either be positive or negative.
  • Many to many: Many to many RNN models take a sequence of inputs and produce a sequence of outputs. A typical application is machine translation. (A minimal Keras sketch contrasting many to one and many to many follows this list.)
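In Keras terms, the difference comes down to the return_sequences parameter: a many to one model returns only the final hidden state, while a many to many model returns an output at every timestep. An illustrative sketch, with assumed layer sizes and input shapes:

import tensorflow as tf

timesteps, features = 60, 1

# Many to one: a whole sequence goes in, a single prediction comes out.
many_to_one = tf.keras.Sequential([
    tf.keras.Input(shape=(timesteps, features)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),
])

# Many to many: an output is produced for every timestep.
many_to_many = tf.keras.Sequential([
    tf.keras.Input(shape=(timesteps, features)),
    tf.keras.layers.LSTM(32, return_sequences=True),
    tf.keras.layers.Dense(1),
])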

Weaknesses of RNNs

Let's now talk about some of the challenges you will encounter when using RNNs.

1. Vanishing gradient problem

The vanishing gradient problem is the Short-Term Memory problem faced by standard RNNs:

  1. The gradient determines the learning ability of the neural network. The gradient, in turn, is set during backpropagation.
  2. A larger gradient means more ability to learn from specific inputs. So with decreasing gradients, the learning ability is depleted until it reaches zero.

An activation function converts the (input * weight) + bias into an output for the next layer. There are different activation functions. For this illustration, let us take into account the Sigmoid activation function.

Read more: Activation functions in JAX and Flax

The sigmoid activation function outputs a value between 0 and 1. If a series of layers were stacked with the sigmoid activation function, they would result in an exponential gradient reduction due to the chain rule of derivatives.

Backpropagation results in the neural network only being able to learn from a narrow range of inputs towards the end of the sequence, meaning that the input values from the start eventually hold little or no weight in determining the overall prediction.
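A tiny numeric illustration of why this happens: the derivative of the sigmoid never exceeds 0.25, so multiplying one such derivative per layer (the chain rule) shrinks the gradient exponentially. The numbers below are purely illustrative.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)          # at most 0.25

grad = 1.0
for _ in range(10):             # ten stacked sigmoid layers
    grad *= sigmoid_grad(0.0)
print(grad)                     # roughly 9.5e-07: almost nothing reaches the earliest layers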

Take this example whereby we are trying to classify user intent in a particular text:

If the text was, "Please go and get me a very big glass of water now!".

As the input size increases, the learning ability of the model from the initial values decreases. For demonstration, we could say in our case that only the text that comes after 'very' would be useful in the text classifier neural network.

2. Exploding gradient problem

In exploding gradients, the gradients accumulate and become so big that the updates made to the neural network weights are very large during training. This occurs when the gradients of the consecutive nodes are larger than 1.0.

Since each update makes the weights larger than before, they can keep growing until they overflow. At best, the gradients become so large that the network can no longer learn from the input data; at worst, the weights themselves become NaN values. Either way, the result is an unstable neural network that does not give accurate outputs.

Here's how to identify the exploding gradient problem in a neural network (a short monitoring sketch follows this list):

  1. The loss consistently remains poor; the network cannot learn to give accurate predictions.
  2. The loss changes by very large amounts after each update.
  3. The loss reaches a NaN value.
  4. The weights grow very large very quickly.
  5. The weights become NaN values.
  6. The error gradients are consistently above 1.0.
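One way to watch for this in TensorFlow is to monitor the global gradient norm during training. A minimal sketch, assuming a model, a loss_fn, and a batch (x_batch, y_batch) already exist:

import tensorflow as tf

with tf.GradientTape() as tape:
    preds = model(x_batch, training=True)      # forward pass
    loss = loss_fn(y_batch, preds)

grads = tape.gradient(loss, model.trainable_variables)
print(float(tf.linalg.global_norm(grads)))     # a norm that keeps growing (or turns NaN) signals exploding gradients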

Long-Short Term Memory (LSTM)

Considering the RNN weaknesses mentioned above, an improvement was necessary to overcome them. These weaknesses stem from the rapid decay or rapid growth of the gradients as the chain rule of derivatives is applied from node to node, and they cause standard RNNs to fail to learn dependencies spanning more than roughly 5 to 10 discrete timesteps.

Opensource LSTM image by Wikimedia

LSTMs overcome the issues of vanishing gradients and exploding gradients. They contain special units known as cells. Each cell comprises one or more memory units and three multiplicative units. These are referred to as the gates of the cells.

Let us break down the functionality of the cells.

  1. Input - This is the write gate. It controls which parts of the current input are written into the cell's memory.
  2. Output - This is the read gate. It controls how much of the stored memory is read out and passed on to the rest of the network.
  3. Reset - This is the forget gate. It gets rid of information within the cell that is no longer necessary.

The memory units can be referred to as the remember gate. This allows the LSTM network to retain information. The memory units are what account for the long-term recall ability of the LSTM neural network.
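To make the gates concrete, here is a minimal NumPy sketch of one LSTM timestep. The names, shapes, and stacked-parameter layout are illustrative rather than a particular library's implementation:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b stack the parameters for the forget, input, candidate, and output computations.
    z = W @ x_t + U @ h_prev + b
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # the three gates
    g = np.tanh(g)                                 # candidate memory
    c_t = f * c_prev + i * g                       # forget old memory, write new memory
    h_t = o * np.tanh(c_t)                         # expose part of the memory as the output
    return h_t, c_t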

Applications of LSTM

LSTMs have a wide range of applications. Let's mention a few:

  1. Handwriting recognition and generation.
  2. Language modeling and translation.
  3. Acoustic modeling of speech.
  4. Speech synthesis.
  5. Protein secondary structure prediction.
  6. Analysis of audio and video data.

Bidirectional LSTM

LSTMs are built on the logic of standard RNNs, so to define Bidirectional LSTMs, it makes sense to start with Bidirectional RNNs. These present each training sequence forwards and backwards to two separate RNNs, both connected to the same output layer. The implication is that each node in a Bidirectional RNN has sequential information about the points before and after it.

Nonetheless, they still face the same issues of exploding and vanishing gradients. The solution would be to create Bidirectional LSTMs that can access long-range contextual information in both input directions.

Applications of Bi-LSTMs include:

  1. Text classification.
  2. Speech classification.
  3. Forecasting models.

Time series analysis with LSTM in TensorFlow

There are different ways to perform time series analysis. For example, one could use statistical models such as ARIMA, SARIMA, and SARIMAX.

In this example, we will keep the theme of this article and implement a time series model using Recurrent Neural Networks. This project aims to predict the total loan amount a company could give out in a day.

The assumption is that the company makes a profit from the loans it gives. If true, then there is a positive correlation between the amount of the loans given and the revenue generated. Hence, by predicting future loans, we could predict how much the company could make.

Imports

First, let us go through this project's imports and their functionality.

import pandas as pd
import numpy as np
import tensorflow as tf
import keras
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

Pandas is used to load the dataset as a DataFrame along with other pre-processing steps.

Numpy is used to manipulate arrays and matrices.

TensorFlow is used during the creation and evaluation of the LSTM neural network.

We use the Keras Sequential API, which stacks layers so that data flows from the first layer, the input layer, through the hidden layers to the output layer.

Read more: How to build TensorFlow models with the Keras Functional API

Dense is used to make sure we have a fully connected neural network.

LSTM is the specific type of Recurrent Neural Network that we will be using.

Dropout is used to ensure that we do not have an overfitted model.

MinMaxScaler is used to normalize the dataset, scaling the values to the range 0 to 1.

Matplotlib is used to visualize our data.

Data pre-processing

Load the data and perform a couple of pre-processing steps.

loans = pd.read_csv("loans.csv")
loans = loans[['created_at','amount']]
loans['created_at'] = pd.DatetimeIndex(loans['created_at'])
loans = loans.groupby(['created_at']).amount.sum().reset_index()
loans.sort_values(by=['created_at'], inplace=True)
loans = loans.set_index('created_at')

The steps involved here are as follows:

  1. Load the data from a .csv file.
  2. Retrieve the created_at and amount fields from the dataset. We are doing a univariate analysis, so we only require the date and value we want to predict.
  3. Get the total loans given out on each day by summing the amounts per date.
  4. Since we are attempting to get sequential data, it is paramount that we ensure the data is stored in the proper order. We sort the values by the created_at date.
  5. Set the created_at field as the index.

Reduce the variance of the data by scaling it. Large, unscaled values can dominate training and make the network slow to converge.

scaler = MinMaxScaler(feature_range=(0,1))
scaled_loans = scaler.fit_transform(loans)

Next, prepare the data for loading into the LSTM model. We feed the neural network enough history that it can predict the next step. y_train is the target variable.

In this case, y_train is the value that follows each window of 60 values, and x_train holds each consecutive run of 60 values. Essentially, we take the loans given over sixty days and use them to predict the loan total on the sixty-first day.

x_train = []
y_train = []
# Slide a 60-step window over the scaled series: the window is the input, the next value is the target.
for i in range(60, len(scaled_loans)):
    x_train.append(scaled_loans[i-60:i, 0])
    y_train.append(scaled_loans[i, 0])

Next, convert x_train into a 3D array because the LSTM layer expects input of shape (samples, timesteps, features). The original x_train had only two dimensions.

y_train = np.array(y_train)
x_train = np.array(x_train)
x_train = np.reshape(x_train, (np.shape(x_train)[0], np.shape(x_train)[1], 1 ))

Create LSTM network in Keras

Let's design the LSTM network. We store the model in a variable called regressor. Next, define the layers of the Sequential model. The first layer acts as the input layer. We can stack LSTM layers to increase the capacity of the model; to do so, each intermediate LSTM layer needs its return_sequences parameter set to True.

We define units=50 for each LSTM layer so that there are 50 LSTM cells per layer. The first layer has input_shape set to the shape of a single x_train sample: 60 timesteps with 1 feature each.

regressor  = Sequential()
regressor.add(LSTM(units=50, return_sequences=True, input_shape=(np.shape(x_train)[1],1)))
regressor.add(Dropout(0.2))

regressor.add(LSTM(units=50, return_sequences=True))
regressor.add(Dropout(0.2))

regressor.add(LSTM(units=50, return_sequences=True))
regressor.add(Dropout(0.2))

regressor.add(LSTM(units=50))
regressor.add(Dropout(0.2))

regressor.add(Dense(units=1))   # a single output unit for the predicted value

regressor.summary()

For each successive LSTM layer in the hidden layers, you will notice that the input shape is not defined, since each layer takes its input from the preceding one. Also, every LSTM layer except the last has return_sequences set to True.

We use the Dense layer to ensure that we have a fully connected neural network.

The summary() function of Sequential gives details on the neural network, including:

  • The layers of the model.
  • The input shape at each layer.
  • The number of parameters at each layer.

It also gives us the total trainable and non-trainable parameters.

To convert this into a Bidirectional LSTM, wrap each LSTM layer with Bidirectional (imported from keras.layers) when building the model, so that every layer has access to both past and future timesteps:


regressor.add(Bidirectional(LSTM(units=50, return_sequences=True)))
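For reference, here is a sketch of how the full stack might look with every recurrent layer wrapped. It mirrors the regressor above and is illustrative rather than the article's exact model:

from keras.layers import Bidirectional

bi_regressor = Sequential()
bi_regressor.add(Bidirectional(LSTM(units=50, return_sequences=True), input_shape=(np.shape(x_train)[1], 1)))
bi_regressor.add(Dropout(0.2))
bi_regressor.add(Bidirectional(LSTM(units=50, return_sequences=True)))
bi_regressor.add(Dropout(0.2))
bi_regressor.add(Bidirectional(LSTM(units=50)))   # the last recurrent layer returns only the final state
bi_regressor.add(Dropout(0.2))
bi_regressor.add(Dense(units=1))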

Compile the LSTM model

The next step is to compile the model.

The first line defines a LearningRateScheduler callback that starts from a very small learning rate and increases it gradually with each epoch; the learning rate set on the optimizer is therefore only an initial value and is altered during fitting.

While compiling a Keras model, one of the parameters required is the optimizer. We set the optimizer function as Adam optimizer.

Read more: Optimizers in JAX and Flax

Using EarlyStopping acts as a sort of brake on model training. We plan on using 150 epochs, and it may be unnecessary to iterate through all of them if the model stops improving at some point. We monitor the loss and exit training if there is no improvement after 20 epochs.

ModelCheckpoint is used to save the model with the best performance.

lr_schedule = keras.callbacks.LearningRateScheduler(lambda epoch: 1e-7 * 10**(epoch / 20))   # gradually increases the learning rate each epoch

opt = tf.keras.optimizers.Adam(learning_rate=1e-7)   # initial learning rate; the scheduler adjusts it during training

regressor.compile(optimizer=opt, loss='mse', metrics=['mae', 'mape'])

early_stopping = tf.keras.callbacks.EarlyStopping(monitor='loss', mode='min', patience=20)

mc = tf.keras.callbacks.ModelCheckpoint('best_model.h5', monitor='loss', mode='min', verbose=0, save_best_only=True)

hist = regressor.fit(x_train, y_train, epochs=150, batch_size=32, callbacks=[mc, lr_schedule, early_stopping])

The last step is to fit x_train and y_train to the regressor model.

To make predictions on one single set of sixty values:

prediction = regressor.predict( np.array( [x_train[0],]))

Here we predict from the first window of sixty values. If, for example, we wanted to predict the next 30 steps, we would need to do the following:

  1. Make a prediction on the last 60 values.
  2. Drop the oldest of those 60 values, append the prediction, and repeat.

def predictSteps(x, steps):
  # Recursively predict `steps` values into the future, feeding each
  # prediction back into the end of the input window.
  prediction = regressor.predict(np.array([x,]))
  print(prediction)
  if steps == 0:
    return prediction
  pred = np.append(x[1:], prediction)   # drop the oldest value, append the new prediction
  return predictSteps(pred, steps - 1)

This recursive function predicts n steps into the future: x is the input window of sixty values, and steps is the number of steps to predict.

predictSteps(x_copy[-1], 10)   # x_copy is assumed to hold the prepared input windows; the last window seeds 10 future predictions

LSTM model evaluation

To see which metrics were tracked during training, access the keys of the history object:

print(hist.history.keys())

We can visualize the progression of the loss of our model using Matplotlib.

plt.plot(hist.history['loss'])

This plots the loss across the training epochs.

We can also try to determine the error in predictions made by the model.

def getError(actual, prediction):
  m = keras.metrics.MeanAbsolutePercentageError()
  n = keras.metrics.MeanAbsoluteError()
  m.update_state(actual, prediction)
  n.update_state(actual, prediction)
  err = m.result().numpy()
  err_1 = n.result().numpy()
  return ({'MAE':err_1, 'MAPE':err})

Here we compute both the mean absolute error (MeanAbsoluteError) and the mean absolute percentage error (MeanAbsolutePercentageError).

To use it, make predictions on the data you want to evaluate and store them in a variable called y_preds, then pass the actual values and y_preds into the getError() function.
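One way to produce y_preds, for example on the training windows themselves (a held-out test set would follow the same pattern), is this minimal sketch:

y_preds = regressor.predict(x_train)   # one prediction per 60-day window
y_preds = y_preds.reshape(-1)          # flatten to match the shape of y_train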

train_errors = getError(y_train, y_preds)
print(train_errors)

Intent classification with LSTM

Intent classification is a type of natural language processing problem that involves determining the aim of a particular text. For example, a person saying, "Please help me out." The intent here can be stated as "Making a request."

Intent classification can be used by a company trying to keep track of the products being referred to on their social media accounts. For example, a bank offering mortgages, business loans, personal loans, and savings accounts. When tracking the posts and comments of their user base, they may need to use intent classification to determine the product being addressed and direct the post or comment to the appropriate department.

Other classification algorithms include K-Nearest Neighbors, Random Forest, and SGDClassifier. These, however, classify intent based on word statistics rather than meaning.

LSTM Recurrent Neural Networks can memorize important information. Therefore, sequences of words are taken into account rather than just the word itself. This enables the word's meaning within a particular context to be considered. We accomplish this using embedding and encoding layers.

In this example, we will explore customer complaints about a company. We only want to determine the particular aspect of the company that the customers were complaining about. Hence, we will be doing a bivariate analysis. We will analyze the product and the customer complaint.

Imports

Let's start by making standard imports.

We added an Embedding layer to draw meaning from the words in the sentences.

SpatialDropout1D is used to avoid overfitting. It works similarly to Dropout but drops entire one-dimensional feature maps rather than the individual elements.

NLTK (the Natural Language Toolkit) is used to identify stopwords in text. Stopwords are common words in a language that do not add any value to the classification task.

import pandas as pd
import re
import tensorflow as tf
import keras
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout, Bidirectional, Embedding, SpatialDropout1D
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

Load dataset

Load the dataset into a variable and retrieve the required columns, in this case Consumer complaint narrative and Product. Finally, we remove the null values.

complaints = pd.read_csv("complaints.csv")
complaints = complaints[['Consumer complaint narrative','Product']]
complaints.dropna(inplace=True)

Data cleaning

We need to remove unnecessary symbols from the text data. symbols_regex matches characters that should be replaced with a space. bad_symbols_regex matches digits and other symbols that are fused to the text and therefore need to be removed without adding a space.

symbols_regex = re.compile('[/(){}\[\]\|@,;]')
bad_symbols_regex = re.compile('[^0-9a-z #+_]')

def clean_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)       # str.replace treats '\d+' literally, so use re.sub to strip digits
    text = symbols_regex.sub(' ', text)
    text = bad_symbols_regex.sub('', text)
    text = text.replace('x', '')          # drops the x characters used to mask personal data (e.g. XXXX)
    return text
complaints['Consumer complaint narrative'] = complaints['Consumer complaint narrative'].apply(clean_text)

The clean_text function performs the following operations:

  1. Convert the data to lowercase. Since "list" in the middle of a sentence and "List" at the beginning of a sentence should not be considered different words.
  2. Remove all the digits.
  3. Replace symbols with spaces.
  4. In the text, we see several instances of combinations of X being used to mask specific data such as phone numbers. These are not words with meaning; hence also need to be removed.
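The stopwords downloaded in the imports are not used in clean_text above. If you also wanted to drop them, here is a minimal, illustrative sketch:

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # keep only the tokens that are not English stopwords
    return ' '.join(word for word in text.split() if word not in stop_words)

complaints['Consumer complaint narrative'] = complaints['Consumer complaint narrative'].apply(remove_stopwords)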

Label exploration

Let's look at the number of complaints in each category.

complaints['Product'].value_counts().sort_values(ascending=False)

Text vectorization

Let's tokenize the sentences into individual words. We cap the vocabulary used by the TextVectorization layer with the max_tokens parameter.

vectorize_layer = tf.keras.layers.TextVectorization(standardize='lower_and_strip_punctuation', max_tokens=5000, output_mode='int', output_sequence_length=512)
complaints_text = complaints['Consumer complaint narrative'].values   # the cleaned complaint strings
vectorize_layer.adapt(complaints_text, batch_size=None)

X_train_padded = vectorize_layer(complaints_text)
X_train_padded = X_train_padded.numpy()

The layer converts the text into integer sequences and pads (or truncates) them so that every sequence has a fixed length of 512.

Since the neural network can only have numbers as its input, we use LabelEncoder to transform the target data into numbers.

le = LabelEncoder()
complaints['Product'] = le.fit_transform(complaints['Product'])
y = complaints['Product']

Next, separate the dataset into training and testing datasets, holding out 30 percent for testing. random_state is set to 42 so that the split is reproducible; without it, the split would change on every run and test examples could leak into training.

X = X_train_padded   # the vectorized, padded complaint narratives
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Create LSTM network

Other than the Embedding layer, the other layers have functionality detailed earlier in this article.

classifier = Sequential()
classifier.add(Embedding(5000, 100, input_length=X.shape[1]))   # vocabulary size matches max_tokens above
classifier.add(SpatialDropout1D(0.2))
classifier.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
classifier.add(Dense(17, activation='softmax'))

The Embedding layer represents the tokens as a dense vector. A word's position within a vector space depends on the words surrounding it. This is how we assign meaning to a word depending on the context in which it is used.

The final layer of the model has 17 cells, one for each of the 17 output classes. We use the softmax activation function because this is a multiclass labeling problem.

classifier.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])   # integer-encoded labels call for the sparse loss
history = classifier.fit(X_train, y_train, epochs=5, batch_size=64, validation_split=0.1, callbacks=[tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
y_preds = classifier.predict(X_test)

LSTM model evaluation

Let's evaluate the model by making predictions on the test set.  

classifier.evaluate(X_test,y_test)
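Since the model outputs one probability per class, here is a hedged sketch of mapping those probabilities back to product names using the LabelEncoder fitted earlier:

import numpy as np

predicted_ids = np.argmax(y_preds, axis=1)               # index of the highest-probability class per complaint
predicted_labels = le.inverse_transform(predicted_ids)   # map the indices back to product names
print(predicted_labels[:5])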

Final thoughts

We have explored the natural progression of concepts from traditional feedforward networks to Recurrent Neural Networks. The difference, in this case, was the looping mechanism in an RNN that gives it recall ability: it can use previous information to come up with its predictions. Feedforward networks take in only the current input to generate a prediction, while RNNs use both the current and previous inputs. This makes them better suited to predicting progressive data, such as in time series analysis.

We also went through the weaknesses of standard RNNs and how they impact their performance with regard to their short-term memory. This is solved by making use of LSTM RNNs. These contain four gates that enable them to have long-term recall ability.
