28 December 2016

Recurrent Neural Network - Linear Model

Linear model of Recurrent Neural Network: $$S_{k} = f(S_{k-1} * W_{rec} + X_k * W_x)$$

We will train the model to count the number of 1's in a binary input stream and to output the total count at the end of the sequence. The solution is known in advance: with $f$ taken as the identity (hence "linear model"), $w_{rec} = w_x = 1$ makes the final state equal to the count. The point here is to understand the training process of an RNN.
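As a quick sanity check of that claim, here is a minimal sketch in plain Python that applies the recurrence with $f$ as the identity and both weights set to 1. The function name `count_ones` is only for illustration and is not part of the original code.

```python
def count_ones(bits, w_x=1.0, w_rec=1.0):
    """Apply S_k = S_{k-1} * w_rec + x_k * w_x over a binary sequence."""
    state = 0.0  # initial state S_0
    for x in bits:
        state = state * w_rec + x * w_x
    return state

print(count_ones([0, 1, 1, 0, 1]))  # 3.0
```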

It can also be visualized as:

# Python imports
import numpy as np # Matrix and vector computation package
import matplotlib
import matplotlib.pyplot as plt  # Plotting library
from matplotlib import cm  # Colormaps
from matplotlib.colors import LogNorm  # Log colormaps
# Allow matplotlib to plot inside this notebook
%matplotlib inline
# Set the seed of the numpy random number generator so that the tutorial is reproducible
np.random.seed(seed=1)  # the seed value is an assumption; the original call is missing here

Prepare Dataset

# Create Dataset
nb_of_samples = 20
sequence_len = 10
# Create Sequence
X = np.zeros((nb_of_samples, sequence_len))
for idx in range(nb_of_samples):
    X[idx, :] = np.around(np.random.rand(sequence_len)).astype(int)
# Create the targets of each sequence
t = np.sum(X, axis=1)

Forward Step

# Define the forward step functions
def update_state(xk, sk, wx, wRec):
    """
    Compute state k from the previous state (sk) and current input (xk),
    by use of the input weights (wx) and recursive weights (wRec).
    """
    return xk * wx + sk * wRec

def forward_states(X, wx, wRec):
    """
    Unfold the network and compute all state activations given the input X,
    and input weights (wx) and recursive weights (wRec).
    Return the state activations in a matrix, the last column S[:,-1] contains the
    final activations.
    """
    # Initialize the matrix S that holds all states for all input sequences.
    # The initial state s0 is set to 0 here (it could also be made a parameter)
    S = np.zeros((X.shape[0], X.shape[1]+1))
    # Use the recurrence relation defined by update_state to update the states through time
    for k in range(0, X.shape[1]):
        # S[:, k+1] = S[:, k] * wRec + X[:, k] * wx
        S[:, k+1] = update_state(X[:, k], S[:, k], wx, wRec)
    return S

def cost(y, t):
    """Return the MSE between the targets t and the outputs y."""
    return ((t - y)**2).sum() / nb_of_samples
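
Because the model is linear, unrolling the recurrence from $S_0 = 0$ gives a closed form for the final state, $S_n = w_x \sum_{k=1}^{n} x_k w_{rec}^{n-k}$. The sketch below is an addition to these notes, and the names `final_state_loop` and `final_state_closed_form` are hypothetical. It compares the step-by-step recurrence with the closed form on a small hand-written input.

```python
import numpy as np

def final_state_loop(X, wx, wRec):
    """Final state computed step by step, starting from S_0 = 0."""
    s = np.zeros(X.shape[0])
    for k in range(X.shape[1]):
        s = X[:, k] * wx + s * wRec
    return s

def final_state_closed_form(X, wx, wRec):
    """Final state from the unrolled sum wx * sum_k x_k * wRec**(n-k)."""
    n = X.shape[1]
    powers = float(wRec) ** np.arange(n - 1, -1, -1)  # wRec^(n-1), ..., wRec^0
    return wx * (X @ powers)

X_demo = np.array([[0., 1., 1., 0., 1.],
                   [1., 1., 0., 0., 0.]])
print(np.allclose(final_state_loop(X_demo, 0.5, 1.2),
                  final_state_closed_form(X_demo, 0.5, 1.2)))  # True
print(final_state_loop(X_demo, 1.0, 1.0))  # [3. 2.]
```

With $w_x = w_{rec} = 1$ every power equals 1, so the final state is simply the row sum, i.e. the count of 1's.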

Backward Propagation

Start by computing the gradient of the cost function $\xi$ with respect to the output, $\partial \xi / \partial y$. This gradient is then propagated backward through time (i.e. layer by layer through the unfolded network).

The recurrence relation for the gradient at each state during backpropagation is:

$$\frac{\partial \xi}{\partial S_{k-1}} = \frac{\partial \xi}{\partial S_{k}} \frac{\partial S_{k}}{\partial S_{k-1}} = \frac{\partial \xi}{\partial S_{k}} w_{rec}$$

It starts at the output: $$\frac{\partial \xi}{\partial y} = \frac{\partial \xi}{\partial S_{n}}$$

The gradients used to update the weights are accumulated over all $n$ states: $$ \begin{aligned} \frac{\partial \xi}{\partial w_x} &= \sum_{k=1}^{n} \frac{\partial \xi}{\partial S_{k}} x_k \\ \frac{\partial \xi}{\partial w_{rec}} &= \sum_{k=1}^{n} \frac{\partial \xi}{\partial S_{k}} S_{k-1} \end{aligned} $$
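
To see where these sums come from, it may help to write out the smallest non-trivial case by hand. The following is a worked example for $n = 2$, assuming $S_0 = 0$ and $f$ equal to the identity (as in the code in this note):

$$
\begin{aligned}
S_1 &= x_1 w_x, \qquad S_2 = S_1 w_{rec} + x_2 w_x = x_1 w_x w_{rec} + x_2 w_x \\
\frac{\partial \xi}{\partial w_x} &= \frac{\partial \xi}{\partial S_2} \left( x_2 + w_{rec} x_1 \right) = \frac{\partial \xi}{\partial S_2} x_2 + \frac{\partial \xi}{\partial S_1} x_1 \\
\frac{\partial \xi}{\partial w_{rec}} &= \frac{\partial \xi}{\partial S_2} x_1 w_x = \frac{\partial \xi}{\partial S_2} S_1
\end{aligned}
$$

The second equality in the middle line uses $\partial \xi / \partial S_1 = (\partial \xi / \partial S_2) w_{rec}$ from the recurrence above. The term $\frac{\partial \xi}{\partial S_1} S_0$ in the sum for $w_{rec}$ vanishes because $S_0 = 0$.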

def output_gradient(y, t):
    """Compute the gradient of the MSE cost function with respect to the output y."""
    return 2.0 * (y-t) / nb_of_samples

def backward_gradient(X, S, grad_out, wRec):
    """
    Backpropagate the gradient computed at the output (grad_out) through the network.
    Accumulate the parameter gradients for wx and wRec for each layer by addition.
    Return the parameter gradients as a tuple, and the gradients at the output of each layer.
    """
    # Initialize the array that stores the gradients of the cost with respect to the states.
    grad_over_time = np.zeros((X.shape[0], X.shape[1]+1))  # the same size as S
    grad_over_time[:, -1] = grad_out  # the result of output_gradient()
    # Set the gradient accumulations to 0
    wx_grad = 0
    wRec_grad = 0
    for k in range(X.shape[1], 0, -1):
        # Compute the parameter gradients and accumulate the results
        wx_grad += np.sum(grad_over_time[:, k] * X[:, k-1])
        wRec_grad += np.sum(grad_over_time[:, k] * S[:, k-1])
        # Compute the gradient at the output of the previous layer
        grad_over_time[:, k-1] = grad_over_time[:, k] * wRec
    return (wx_grad, wRec_grad), grad_over_time
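
A common way to gain confidence in a hand-written backward pass is to compare it against finite differences. The block below is a self-contained sketch of such a gradient check and is an addition to these notes. It re-implements the forward and backward steps in compact form under hypothetical names (`forward`, `mse`, `bptt_gradients`, `numerical_gradients`) and uses a small hand-written input rather than the dataset above.

```python
import numpy as np

def forward(X, wx, wRec):
    """Return all states; column 0 is the initial state S_0 = 0."""
    S = np.zeros((X.shape[0], X.shape[1] + 1))
    for k in range(X.shape[1]):
        S[:, k + 1] = X[:, k] * wx + S[:, k] * wRec
    return S

def mse(X, t, wx, wRec):
    """MSE between the final states and the targets t."""
    return np.mean((t - forward(X, wx, wRec)[:, -1]) ** 2)

def bptt_gradients(X, t, wx, wRec):
    """Gradients of the MSE w.r.t. wx and wRec via backpropagation through time."""
    S = forward(X, wx, wRec)
    grad_state = 2.0 * (S[:, -1] - t) / X.shape[0]  # d xi / d S_n
    wx_grad, wRec_grad = 0.0, 0.0
    for k in range(X.shape[1], 0, -1):
        wx_grad += np.sum(grad_state * X[:, k - 1])
        wRec_grad += np.sum(grad_state * S[:, k - 1])
        grad_state = grad_state * wRec  # d xi / d S_{k-1}
    return wx_grad, wRec_grad

def numerical_gradients(X, t, wx, wRec, eps=1e-6):
    """Central finite-difference approximation of the same two gradients."""
    wx_grad = (mse(X, t, wx + eps, wRec) - mse(X, t, wx - eps, wRec)) / (2 * eps)
    wRec_grad = (mse(X, t, wx, wRec + eps) - mse(X, t, wx, wRec - eps)) / (2 * eps)
    return wx_grad, wRec_grad

X_demo = np.array([[0., 1., 1., 0., 1.],
                   [1., 1., 0., 0., 0.],
                   [1., 0., 1., 1., 1.]])
t_demo = X_demo.sum(axis=1)
analytic = bptt_gradients(X_demo, t_demo, wx=0.5, wRec=1.2)
numeric = numerical_gradients(X_demo, t_demo, wx=0.5, wRec=1.2)
print(np.allclose(analytic, numeric, rtol=1e-4))  # True
```

The step size `eps` and the tolerance are illustrative choices. For long sequences or large $|w_{rec}|$ the cost grows quickly and the tolerance may need to be relaxed.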

The instability of gradient in RNN


$$\frac{\partial \xi}{\partial S_{k-1}} = \frac{\partial \xi}{\partial S_{k}} \frac{\partial S_{k}}{\partial S_{k-1}} = \frac{\partial \xi}{\partial S_{k}} w_{rec}$$

For example, the gradient of a state $S_k$ with respect to a state $m$ timesteps back, $S_{k-m}$, can be written as:

$$ \frac{\partial S_{k}}{\partial S_{k-m}} = \frac{\partial S_{k}}{\partial S_{k-1}} * \cdots * \frac{\partial S_{k-m+1}}{\partial S_{k-m}} = w_{rec}^m $$

So in our model the gradient grows exponentially if $|w_{rec}|>1$ (known as the exploding gradient) and shrinks exponentially if $|w_{rec}|<1$ (known as the vanishing gradient). The plots produced by the code below show this.
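
To put numbers on this, here is a small sketch (an addition to these notes; `state_gradient` is a hypothetical helper) that multiplies the one-step factor $w_{rec}$ over $m = 10$ timesteps, which matches the sequence length used in this note:

```python
def state_gradient(w_rec, m):
    """Multiply m one-step factors dS_k/dS_{k-1} = w_rec together."""
    grad = 1.0
    for _ in range(m):
        grad *= w_rec
    return grad

for w_rec in (2.0, 1.0, 0.5):
    print(w_rec, state_gradient(w_rec, 10))
# 2.0 1024.0
# 1.0 1.0
# 0.5 0.0009765625
```

Only $|w_{rec}| = 1$ keeps the magnitude of the gradient constant as it travels back through time.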

# Define plotting functions

# Define points to annotate (wx, wRec, color)
points = [(2,1,'r'), (1,2,'b'), (1,-2,'g'), (1,0,'c'), (1,0.5,'m'), (1,-0.5,'y')]

def get_cost_surface(w1_low, w1_high, w2_low, w2_high, nb_of_ws, cost_func):
    """Define a vector of weights for which we want to plot the cost."""
    w1 = np.linspace(w1_low, w1_high, num=nb_of_ws)  # Weight 1
    w2 = np.linspace(w2_low, w2_high, num=nb_of_ws)  # Weight 2
    ws1, ws2 = np.meshgrid(w1, w2)  # Generate grid
    cost_ws = np.zeros((nb_of_ws, nb_of_ws))  # Initialize cost matrix
    # Fill the cost matrix for each combination of weights
    for i in range(nb_of_ws):
        for j in range(nb_of_ws):
            cost_ws[i,j] = cost_func(ws1[i,j], ws2[i,j])
    return ws1, ws2, cost_ws

def plot_surface(ax, ws1, ws2, cost_ws):
    """Plot the cost in function of the weights."""
    surf = ax.contourf(ws1, ws2, cost_ws, levels=np.logspace(-0.2, 8, 30), cmap=cm.pink, norm=LogNorm())
    ax.set_xlabel('$w_{in}$', fontsize=15)
    ax.set_ylabel('$w_{rec}$', fontsize=15)
    return surf

def plot_points(ax, points):
    """Plot the annotation points on the given axis."""
    for wx, wRec, c in points:
        ax.plot(wx, wRec, c+'o', linewidth=2)

def get_cost_surface_figure(cost_func, points):
    """Plot the cost surfaces together with the annotated points."""
    # Plot figures
    fig = plt.figure(figsize=(10, 4))   
    # Plot overview of cost function
    ax_1 = fig.add_subplot(1,2,1)
    ws1_1, ws2_1, cost_ws_1 = get_cost_surface(-3, 3, -3, 3, 100, cost_func)
    surf_1 = plot_surface(ax_1, ws1_1, ws2_1, cost_ws_1 + 1)
    plot_points(ax_1, points)
    ax_1.set_xlim(-3, 3)
    ax_1.set_ylim(-3, 3)
    # Plot zoom of cost function
    ax_2 = fig.add_subplot(1,2,2)
    ws1_2, ws2_2, cost_ws_2 = get_cost_surface(0, 2, 0, 2, 100, cost_func)
    surf_2 = plot_surface(ax_2, ws1_2, ws2_2, cost_ws_2 + 1)
    plot_points(ax_2, points)
    ax_2.set_xlim(0, 2)
    ax_2.set_ylim(0, 2)
    # Show the colorbar
    cax = fig.add_axes([0.85, 0.12, 0.03, 0.78])
    cbar = fig.colorbar(surf_1, ticks=np.logspace(0, 8, 9), cax=cax)
    cbar.ax.set_ylabel('$\\xi$', fontsize=15, rotation=0, labelpad=20)
    cbar.set_ticklabels(['{:.0e}'.format(i) for i in np.logspace(0, 8, 9)])
    fig.suptitle('Cost surface', fontsize=15)
    return fig

def plot_gradient_over_time(points, get_grad_over_time):
    """Plot the gradients of the annotated points and how they evolve over time."""
    fig = plt.figure(figsize=(6.5, 4))  
    ax = plt.subplot(111)
    # Plot points
    for wx, wRec, c in points:
        grad_over_time = get_grad_over_time(wx, wRec)
        x = np.arange(-grad_over_time.shape[1]+1, 1, 1)
        plt.plot(x, np.sum(grad_over_time, axis=0), c+'-', label='({0}, {1})'.format(wx, wRec), linewidth=1, markersize=8)
    plt.xlim(0, -grad_over_time.shape[1]+1)
    # Set up plot axis
    plt.yscale('symlog')  # assumption: symmetric log scale, to match the power-of-ten ticks below
    plt.yticks([10**8, 10**6, 10**4, 10**2, 0, -10**2, -10**4, -10**6, -10**8])
    plt.xlabel('timestep k', fontsize=12)
    plt.ylabel('$\\frac{\\partial \\xi}{\\partial S_{k}}$', fontsize=20, rotation=0)
    plt.title('Instability of gradient in backward propagation.\n(backpropagate from left to right)')
    # Set legend
    leg = plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), frameon=False, numpoints=1)
    leg.set_title('$(w_x, w_{rec})$', prop={'size':15})

def get_grad_over_time(wx, wRec):
    """Helper func to only get the gradient over time from wx and wRec."""
    S = forward_states(X, wx, wRec)
    grad_out = output_gradient(S[:,-1], t).sum()
    _, grad_over_time = backward_gradient(X, S, grad_out, wRec)
    return grad_over_time

# Plot cost surface and gradients

# Get and plot the cost surface figure with markers
fig = get_cost_surface_figure(lambda w1, w2: cost(forward_states(X, w1, w2)[:,-1] , t), points)

# Get the plots of the gradients changing by backpropagating.
plot_gradient_over_time(points, get_grad_over_time)

Resilient Backpropagation

One way to handle the unstable gradients is by using a technique called resilient backpropagation (Rprop). The Rprop algorithm can be defined as:

  • Set initial weight update value $\Delta$ to a nonzero value.
  • For each parameter $w$:
    • If $sign(\partial \xi /\partial w(i)) \neq sign(\partial \xi /\partial w(i-1))$: multiply the weight update value $\Delta$ by $\eta^-$, with $\eta^-<1$.
    • Else if $sign(\partial \xi /\partial w(i)) = sign(\partial \xi /\partial w(i-1))$: multiply the weight update value $\Delta$ by $\eta^+$, with $\eta^+>1$.
    • Update the weight by $\Delta$ in the direction opposite to the sign of the current gradient: $w(i+1) = w(i) - sign(\partial \xi /\partial w(i)) \Delta$.

The hyperparameters are usually set to $\eta^+ = 1.2$ and $\eta^- = 0.5$. Note that the weight update value $\Delta$ is similar to the velocity parameter in momentum. The difference is that $\Delta$ only reflects the size of the step for each parameter, while the direction is determined by the sign of the current gradient.
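
Before applying this to the RNN, the rule can be tried on a one-dimensional toy problem. The sketch below is an addition to these notes: `rprop_minimize` is a hypothetical helper, and the quadratic cost $\xi(w) = (w - 3)^2$ is chosen only for illustration. It follows the simplified variant described above, without the minimum and maximum bounds on $\Delta$ that some Rprop formulations use.

```python
import numpy as np

def rprop_minimize(grad_func, w, delta=0.1, eta_p=1.2, eta_n=0.5, nb_iterations=100):
    """Minimize a 1-D function with the simplified Rprop rule described above."""
    prev_sign = 0.0
    for _ in range(nb_iterations):
        sign = np.sign(grad_func(w))
        if sign == prev_sign:
            delta *= eta_p  # same direction as last time: take a larger step
        else:
            delta *= eta_n  # direction changed: take a smaller step
        w -= sign * delta   # only the sign of the gradient is used
        prev_sign = sign
    return w

# Hypothetical toy problem: xi(w) = (w - 3)^2, so d xi / d w = 2 * (w - 3)
w_final = rprop_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
print(abs(w_final - 3.0) < 1e-3)  # True
```

Because the step size never depends on the magnitude of the gradient, the same loop behaves identically if the gradient is scaled by a large or a small constant, which is what makes the method robust to exploding and vanishing gradients.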

# Define Rprop optimization function
def update_rprop(X, t, W, W_prev_sign, W_delta, eta_p, eta_n):
    """
    Update Rprop values in one iteration.
    X: input data.
    t: targets.
    W: Current weight parameters.
    W_prev_sign: Previous sign of the W gradient.
    W_delta: Rprop update values (Delta).
    eta_p, eta_n: Rprop hyperparameters.
    """
    # Perform forward and backward pass to get the gradients
    S = forward_states(X, W[0], W[1])
    grad_out = output_gradient(S[:, -1], t)
    W_grads, _ = backward_gradient(X, S, grad_out, W[1])
    W_sign = np.sign(W_grads)
    # Update the Delta for each weight parameter separately
    for i, _ in enumerate(W):
        if W_sign[i] == W_prev_sign[i]:
            W_delta[i] *= eta_p
        else:
            W_delta[i] *= eta_n
    return W_delta, W_sign

# Perform Rprop optimisation

# Set hyperparameters
eta_p = 1.2
eta_n = 0.5

# Set initial parameters
W = [-1.5, 2]  # [wx, wRec]
W_delta = [0.001, 0.001]  # Update values (Delta) for W
W_sign = [0, 0]  # Previous sign of W

ls_of_ws = [(W[0], W[1])]  # List of weights to plot
# Iterate over 500 iterations
for i in range(500):
    # Get the update values and sign of the last gradient
    W_delta, W_sign = update_rprop(X, t, W, W_sign, W_delta, eta_p, eta_n)
    # Update each weight parameter separately
    for j, _ in enumerate(W):
        W[j] -= W_sign[j] * W_delta[j]
    ls_of_ws.append((W[0], W[1]))  # Add weights to list to plot

print('Final weights are: wx = {0},  wRec = {1}'.format(W[0], W[1]))
Final weights are: wx = 1.001355547207059,  wRec = 0.9996744737846419
test_inpt = np.array([[0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1]])
test_outpt = forward_states(test_inpt, W[0], W[1])[:,-1]
print('Target output: {:d} vs Model output: {:.2f}'.format(test_inpt.sum(), test_outpt[0]))
Target output: 5 vs Model output: 5.00

The sample code in this note comes from peterroelants.github.io, which provides more detail on neural networks and deep learning. It is very informative and highly recommended. This post is more of a personal memo.