Recurrent Neural Network - Linear Model¶
Linear model of a recurrent neural network:
$$S_{k} = f(S_{k-1} * W_{rec} + X_k * W_x)$$
The model we are going to train counts how many 1's it sees in a binary input stream and outputs the total count at the end of the sequence. Since $f$ is the identity in this linear model, the solution is known in advance: $w_{rec} = w_x = 1$. The point here is to understand how an RNN is trained.
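To see why both weights equal to 1 solve the task, the recurrence can be run by hand on a short sequence. The following is a minimal sketch; the helper name `count_ones` and the example sequence are made up for illustration and are not part of the original code.

```python
import numpy as np

def count_ones(x, w_x=1.0, w_rec=1.0):
    """Run S_k = S_{k-1} * w_rec + x_k * w_x starting from S_0 = 0."""
    s = 0.0
    for x_k in x:
        s = s * w_rec + x_k * w_x
    return s

example = np.array([0, 1, 1, 0, 1])  # illustrative input containing three 1's
print(count_ones(example))
```

With the default weights the final state equals the number of 1's in the input. Any other choice of weights gives a weighted sum instead of a count.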
It can also be visualized as:
# Python imports
import numpy as np # Matrix and vector computation package
import matplotlib
import matplotlib.pyplot as plt # Plotting library
from matplotlib import cm # Colormaps
from matplotlib.colors import LogNorm # Log colormaps
# Allow matplotlib to plot inside this notebook
%matplotlib inline
# Set the seed of the numpy random number generator so that the tutorial is reproducible
np.random.seed(seed=1)
Prepare Dataset¶
# Create Dataset
nb_of_samples = 20
sequence_len = 10
# Create the sequences: each row of X is one random binary sequence
X = np.zeros((nb_of_samples, sequence_len))
for idx in range(nb_of_samples):
    X[idx, :] = np.around(np.random.rand(sequence_len)).astype(int)
# Create the targets for each sequence (the number of 1's it contains)
t = np.sum(X, axis=1)
Forward Step¶
# Define the forward step functions
def update_state(xk, sk, wx, wRec):
    """
    Compute state k from the previous state (sk) and current input (xk),
    by use of the input weights (wx) and recursive weights (wRec).
    """
    return xk * wx + sk * wRec

def forward_states(X, wx, wRec):
    """
    Unfold the network and compute all state activations given the input X,
    and input weights (wx) and recursive weights (wRec).
    Return the state activations in a matrix, the last column S[:,-1] contains the
    final activations.
    """
    # Initialize the matrix S that holds all states for all input sequences.
    # The initial state s0 is set to 0 here (it could also be made a parameter).
    S = np.zeros((X.shape[0], X.shape[1]+1))
    # Use the recurrence relation defined by update_state to update the states through time
    for k in range(0, X.shape[1]):
        # S[:, k+1] = S[:, k] * wRec + X[:, k] * wx
        S[:, k+1] = update_state(X[:, k], S[:, k], wx, wRec)
    return S

def cost(y, t):
    """
    Return the MSE between the targets t and the outputs y.
    """
    return ((t - y)**2).sum() / nb_of_samples
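Because the model is linear and the initial state is $S_0 = 0$, the recurrence can also be unrolled by hand. The closed form below is a worked derivation added for illustration (it does not appear in the original source); it follows from substituting $S_{k} = S_{k-1} w_{rec} + x_k w_x$ into itself $n$ times:

$$S_n = w_x \sum_{k=1}^{n} x_k \, w_{rec}^{\,n-k}$$

With $w_{rec} = w_x = 1$ this reduces to $S_n = \sum_{k=1}^{n} x_k$, the number of 1's in the sequence. The powers of $w_{rec}$ in this sum are what make the gradients grow or shrink over long sequences.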
Backward Propagation¶
Backpropagation starts from the gradient of the cost function $\xi$ with respect to the output, $\partial \xi / \partial y$. This gradient is then propagated backward through time (i.e. layer by layer).
The recurrence relation for the gradient at each state during backpropagation is:
$$\frac{\partial \xi}{\partial S_{k-1}} = \frac{\partial \xi}{\partial S_{k}} \frac{\partial S_{k}}{\partial S_{k-1}} = \frac{\partial \xi}{\partial S_{k}} w_{rec}$$
which starts at:
$$\frac{\partial \xi}{\partial y} = \frac{\partial \xi}{\partial S_{n}}$$
The gradients used to update the weights are accumulated as sums over all $n$ timesteps:
$$ \frac{\partial \xi}{\partial w_x} = \sum_{k=1}^{n} \frac{\partial \xi}{\partial S_{k}} x_k \\ \frac{\partial \xi}{\partial w_{rec}} = \sum_{k=1}^{n} \frac{\partial \xi}{\partial S_{k}} S_{k-1} $$
def output_gradient(y, t):
    """
    Compute the gradient of the MSE cost function with respect to the output y.
    """
    return 2.0 * (y-t) / nb_of_samples

def backward_gradient(X, S, grad_out, wRec):
    """
    Backpropagate the gradient computed at the output (grad_out) through the network.
    Accumulate the parameter gradients for wx and wRec for each layer by addition.
    Return the parameter gradients as a tuple, and the gradients at the output of each layer.
    """
    # Initialize the array that stores the gradients of the cost with respect to the states.
    grad_over_time = np.zeros((X.shape[0], X.shape[1]+1))  # the same size as S
    grad_over_time[:, -1] = grad_out  # the result of output_gradient()
    # Set the gradient accumulations to 0
    wx_grad = 0
    wRec_grad = 0
    for k in range(X.shape[1], 0, -1):
        # Compute the parameter gradients and accumulate the results
        wx_grad += np.sum(grad_over_time[:, k] * X[:, k-1])
        wRec_grad += np.sum(grad_over_time[:, k] * S[:, k-1])
        # Compute the gradient at the output of the previous layer
        grad_over_time[:, k-1] = grad_over_time[:, k] * wRec
    return (wx_grad, wRec_grad), grad_over_time
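A common way to gain confidence in a hand-written backward pass is to compare it against finite differences. The sketch below is an addition, not part of the original notebook: it re-implements the forward and backward passes in compact form under `_demo` names (all hypothetical) so that it runs on its own, and the data, weights and step size `eps` are arbitrary choices.

```python
import numpy as np

def forward_demo(X, wx, wRec):
    """Unrolled linear RNN; column k of S holds the state after k inputs."""
    S = np.zeros((X.shape[0], X.shape[1] + 1))
    for k in range(X.shape[1]):
        S[:, k + 1] = X[:, k] * wx + S[:, k] * wRec
    return S

def mse_demo(X, t, wx, wRec):
    """Mean squared error between the final state and the targets."""
    return ((t - forward_demo(X, wx, wRec)[:, -1]) ** 2).mean()

def bptt_demo(X, t, wx, wRec):
    """Gradients of the MSE with respect to wx and wRec via backpropagation through time."""
    S = forward_demo(X, wx, wRec)
    grad = 2.0 * (S[:, -1] - t) / X.shape[0]  # gradient at the output
    wx_grad, wRec_grad = 0.0, 0.0
    for k in range(X.shape[1], 0, -1):
        wx_grad += np.sum(grad * X[:, k - 1])
        wRec_grad += np.sum(grad * S[:, k - 1])
        grad = grad * wRec  # move one step back in time
    return wx_grad, wRec_grad

def numerical_demo(X, t, wx, wRec, eps=1e-6):
    """Central finite-difference approximation of the same two gradients."""
    d_wx = (mse_demo(X, t, wx + eps, wRec) - mse_demo(X, t, wx - eps, wRec)) / (2 * eps)
    d_wRec = (mse_demo(X, t, wx, wRec + eps) - mse_demo(X, t, wx, wRec - eps)) / (2 * eps)
    return d_wx, d_wRec

# Arbitrary demo data: 20 random binary sequences of length 10
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(20, 10)).astype(float)
t_demo = X_demo.sum(axis=1)

analytic = bptt_demo(X_demo, t_demo, 1.2, 1.1)
numeric = numerical_demo(X_demo, t_demo, 1.2, 1.1)
print(np.allclose(analytic, numeric, rtol=1e-4))
```

If the two gradient pairs agree to within the tolerance, the backward pass is consistent with the forward pass at the tested weights.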
The instability of gradient in RNN¶
Because
$$\frac{\partial \xi}{\partial S_{k-1}} = \frac{\partial \xi}{\partial S_{k}} \frac{\partial S_{k}}{\partial S_{k-1}} = \frac{\partial \xi}{\partial S_{k}} w_{rec}$$
the gradient of a state $S_k$ with respect to a state $m$ timesteps back, $S_{k-m}$, can be written as:
$$ \frac{\partial S_{k}}{\partial S_{k-m}} = \frac{\partial S_{k}}{\partial S_{k-1}} * \cdots * \frac{\partial S_{k-m+1}}{\partial S_{k-m}} = w_{rec}^m $$
So in our model the gradient grows exponentially if $|w_{rec}|>1$ (known as the exploding gradient problem) and shrinks exponentially if $|w_{rec}|<1$ (known as the vanishing gradient problem), as can be seen from the graphs plotted below.
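The effect of the factor $w_{rec}^m$ is easy to check numerically. The snippet below is a small illustration; the helper name `gradient_factor` and the chosen weights are assumptions made for this example only.

```python
def gradient_factor(w_rec, m):
    """Factor by which a gradient is scaled after m backward steps in the linear model."""
    return w_rec ** m

# Compare a shrinking, a stable and a growing recurrent weight over 10 timesteps
for w_rec in (0.5, 1.0, 2.0):
    print('w_rec = {}: factor after 10 steps = {}'.format(w_rec, gradient_factor(w_rec, 10)))
```

Over only 10 timesteps, $w_{rec} = 0.5$ scales the gradient by about 0.001, while $w_{rec} = 2$ scales it by 1024.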
# Define plotting functions
# Define points to annotate (wx, wRec, color)
points = [(2,1,'r'), (1,2,'b'), (1,-2,'g'), (1,0,'c'), (1,0.5,'m'), (1,-0.5,'y')]
def get_cost_surface(w1_low, w1_high, w2_low, w2_high, nb_of_ws, cost_func):
    """Define a vector of weights for which we want to plot the cost."""
    w1 = np.linspace(w1_low, w1_high, num=nb_of_ws)  # Weight 1
    w2 = np.linspace(w2_low, w2_high, num=nb_of_ws)  # Weight 2
    ws1, ws2 = np.meshgrid(w1, w2)  # Generate grid
    cost_ws = np.zeros((nb_of_ws, nb_of_ws))  # Initialize cost matrix
    # Fill the cost matrix for each combination of weights
    for i in range(nb_of_ws):
        for j in range(nb_of_ws):
            cost_ws[i,j] = cost_func(ws1[i,j], ws2[i,j])
    return ws1, ws2, cost_ws

def plot_surface(ax, ws1, ws2, cost_ws):
    """Plot the cost as a function of the weights."""
    surf = ax.contourf(ws1, ws2, cost_ws, levels=np.logspace(-0.2, 8, 30), cmap=cm.pink, norm=LogNorm())
    ax.set_xlabel('$w_{in}$', fontsize=15)
    ax.set_ylabel('$w_{rec}$', fontsize=15)
    return surf

def plot_points(ax, points):
    """Plot the annotation points on the given axis."""
    for wx, wRec, c in points:
        ax.plot(wx, wRec, c+'o', linewidth=2)
def get_cost_surface_figure(cost_func, points):
    """Plot the cost surfaces together with the annotated points."""
    # Plot figures
    fig = plt.figure(figsize=(10, 4))
    # Plot overview of cost function
    ax_1 = fig.add_subplot(1,2,1)
    ws1_1, ws2_1, cost_ws_1 = get_cost_surface(-3, 3, -3, 3, 100, cost_func)
    surf_1 = plot_surface(ax_1, ws1_1, ws2_1, cost_ws_1 + 1)
    plot_points(ax_1, points)
    ax_1.set_xlim(-3, 3)
    ax_1.set_ylim(-3, 3)
    # Plot zoom of cost function
    ax_2 = fig.add_subplot(1,2,2)
    ws1_2, ws2_2, cost_ws_2 = get_cost_surface(0, 2, 0, 2, 100, cost_func)
    surf_2 = plot_surface(ax_2, ws1_2, ws2_2, cost_ws_2 + 1)
    plot_points(ax_2, points)
    ax_2.set_xlim(0, 2)
    ax_2.set_ylim(0, 2)
    # Show the colorbar
    fig.subplots_adjust(right=0.8)
    cax = fig.add_axes([0.85, 0.12, 0.03, 0.78])
    cbar = fig.colorbar(surf_1, ticks=np.logspace(0, 8, 9), cax=cax)
    cbar.ax.set_ylabel('$\\xi$', fontsize=15, rotation=0, labelpad=20)
    cbar.set_ticklabels(['{:.0e}'.format(i) for i in np.logspace(0, 8, 9)])
    fig.suptitle('Cost surface', fontsize=15)
    return fig
def plot_gradient_over_time(points, get_grad_over_time):
    """Plot the gradients of the annotated points and how they evolve over time."""
    fig = plt.figure(figsize=(6.5, 4))
    ax = plt.subplot(111)
    # Plot points
    for wx, wRec, c in points:
        grad_over_time = get_grad_over_time(wx, wRec)
        x = np.arange(-grad_over_time.shape[1]+1, 1, 1)
        plt.plot(x, np.sum(grad_over_time, axis=0), c+'-', label='({0}, {1})'.format(wx, wRec), linewidth=1, markersize=8)
    plt.xlim(0, -grad_over_time.shape[1]+1)
    # Set up plot axis
    plt.xticks(x)
    plt.yscale('symlog')
    plt.yticks([10**8, 10**6, 10**4, 10**2, 0, -10**2, -10**4, -10**6, -10**8])
    plt.xlabel('timestep k', fontsize=12)
    plt.ylabel('$\\frac{\\partial \\xi}{\\partial S_{k}}$', fontsize=20, rotation=0)
    plt.grid()
    plt.title('Instability of gradient in backward propagation.\n(backpropagate from left to right)')
    # Set legend
    leg = plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), frameon=False, numpoints=1)
    leg.set_title('$(w_x, w_{rec})$', prop={'size':15})

def get_grad_over_time(wx, wRec):
    """Helper func to only get the gradient over time from wx and wRec."""
    S = forward_states(X, wx, wRec)
    grad_out = output_gradient(S[:,-1], t).sum()
    _, grad_over_time = backward_gradient(X, S, grad_out, wRec)
    return grad_over_time
# Plot cost surface and gradients
# Get and plot the cost surface figure with markers
fig = get_cost_surface_figure(lambda w1, w2: cost(forward_states(X, w1, w2)[:,-1] , t), points)
# Get the plots of the gradients changing by backpropagating.
plot_gradient_over_time(points, get_grad_over_time)
# Show figures
plt.show()
Resilient Backpropagation¶
One way to handle the unstable gradients is a technique called resilient backpropagation (Rprop). The Rprop algorithm can be defined as:
- Set the initial weight update value $\Delta$ to a nonzero value.
- For each parameter $w$ at iteration $i$:
  - If $sign(\partial \xi /\partial w(i)) \neq sign(\partial \xi /\partial w(i-1))$: multiply the weight update value $\Delta$ by $\eta^-$, with $\eta^-<1$:
    $\Delta(i)=\Delta(i-1) * \eta^-$
  - Else if $sign(\partial \xi /\partial w(i)) = sign(\partial \xi /\partial w(i-1))$: multiply the weight update value $\Delta$ by $\eta^+$, with $\eta^+>1$:
    $\Delta(i)=\Delta(i-1) * \eta^+$
  - Update the weight by $\Delta(i)$ in the direction opposite to the sign of the gradient:
    $w(i+1) = w(i) - sign(\partial \xi /\partial w(i)) * \Delta(i)$
The hyperparameters are usually set to $\eta^+ = 1.2$ and $\eta^- = 0.5$. Note that the weight update value $\Delta$ is similar to the velocity parameter in momentum. The difference is that $\Delta$ only reflects the size of the velocity for each parameter, while the direction is determined by the sign of the current gradient.
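The steps above can be tried on a one-dimensional problem before applying them to the RNN. The following is a minimal sketch, assuming a toy cost $(w - 3)^2$; the function name `rprop_minimize`, the starting point and the initial $\Delta$ are hypothetical choices, and practical Rprop variants usually also clip $\Delta$ to a minimum and maximum, which is omitted here.

```python
import numpy as np

def rprop_minimize(grad_fn, w0, delta0=0.1, eta_p=1.2, eta_n=0.5, n_iter=100):
    """Minimize a 1-D function with the basic Rprop rule, given its gradient grad_fn."""
    w, delta, prev_sign = w0, delta0, 0.0
    for _ in range(n_iter):
        sign = np.sign(grad_fn(w))
        if sign == prev_sign:
            delta *= eta_p   # same direction as before: take a larger step
        else:
            delta *= eta_n   # direction changed (or first step): take a smaller step
        w -= sign * delta    # only the sign of the gradient is used, not its size
        prev_sign = sign
    return w

# Toy cost (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3
w_final = rprop_minimize(lambda w: 2.0 * (w - 3.0), w0=-5.0)
print('Final w: {:.4f}'.format(w_final))
```

Because only the sign of the gradient enters the update, the step size is the same whether the gradient is tiny or enormous, which is what makes the method robust to vanishing and exploding gradients.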
# Define Rprop optimization function
def update_rprop(X, t, W, W_prev_sign, W_delta, eta_p, eta_n):
    """
    Update Rprop values in one iteration.
    X: input data.
    t: targets.
    W: Current weight parameters.
    W_prev_sign: Previous sign of the W gradient.
    W_delta: Rprop update values (Delta).
    eta_p, eta_n: Rprop hyperparameters.
    """
    # Perform forward and backward pass to get the gradients
    S = forward_states(X, W[0], W[1])
    grad_out = output_gradient(S[:, -1], t)
    W_grads, _ = backward_gradient(X, S, grad_out, W[1])
    W_sign = np.sign(W_grads)
    # Update the Delta for each weight parameter separately
    for i, _ in enumerate(W):
        if W_sign[i] == W_prev_sign[i]:
            W_delta[i] *= eta_p
        else:
            W_delta[i] *= eta_n
    return W_delta, W_sign
# Perform Rprop optimisation
# Set hyperparameters
eta_p = 1.2
eta_n = 0.5
# Set initial parameters
W = [-1.5, 2] # [wx, wRec]
W_delta = [0.001, 0.001] # Update values (Delta) for W
W_sign = [0, 0] # Previous sign of W
ls_of_ws = [(W[0], W[1])] # List of weights to plot
# Iterate over 500 iterations
for i in range(500):
    # Get the update values and sign of the last gradient
    W_delta, W_sign = update_rprop(X, t, W, W_sign, W_delta, eta_p, eta_n)
    # Update each weight parameter separately
    # (use j so the outer loop variable i is not shadowed)
    for j, _ in enumerate(W):
        W[j] -= W_sign[j] * W_delta[j]
    ls_of_ws.append((W[0], W[1]))  # Add weights to list to plot
print('Final weights are: wx = {0}, wRec = {1}'.format(W[0], W[1]))
test_inpt = np.array([[0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1]])  # plain 2-D array; np.matrix is no longer recommended
test_outpt = forward_states(test_inpt, W[0], W[1])[:,-1]
print('Target output: {:d} vs Model output: {:.2f}'.format(test_inpt.sum(), test_outpt[0]))
The sample code in this note comes from peterroelants.github.io, which provides more detail on neural networks and deep learning. It is very informative and highly recommended. This note is more of a personal memo.