My journey of pursuing Deeplearning.ai Course @Coursera - Course 1, Week 3

Course 1: Neural Networks and Deep Learning

Week 3: Shallow Neural Networks


Key Concepts

  • Understand hidden units and hidden layers
  • Be able to apply a variety of activation functions in a neural network
  • Build your first forward and backward propagation with a hidden layer
  • Apply random initialization to your neural network
  • Become fluent with Deep Learning notation and Neural Network representations
  • Build and train a neural network with one hidden layer




Part 1 : Neural Networks Overview



First, we input the features x, together with some parameters w and b, and this allows us to compute z^[1]. The new notation we introduce is a superscript in square brackets: [1] refers to quantities associated with this first stack of nodes, which is called a layer. Later, we'll use superscript [2] to refer to quantities associated with the next node, which forms another layer of the neural network.

So a superscript [i] in square brackets indicates the layer number.








Recall that for logistic regression we had a backward calculation in order to compute derivatives: da, dz, and so on. In the same way, a neural network ends up doing a backward calculation in which you compute da^[2] and dz^[2], which then allow you to compute dw^[2], db^[2], and so on. This right-to-left backward calculation is the one denoted with the red arrows.

Part 2: Neural Network Representation


The training set doesn't contain values for the hidden layer; it contains only the inputs and the outputs, which is why that middle layer is called "hidden".
Previously, we were using the vector x to denote the input features; an alternative notation for the values of the input features is a superscript square bracket zero, i.e. a^[0].






The next layer, the hidden layer, will in turn generate some set of activations, which I'm going to write as a^[1]. In particular, the first unit (or node) generates a value a^[1]_1, the second node generates a value a^[1]_2, and so on. So a^[1] is a four-dimensional vector; in Python it is a 4x1 matrix, i.e. a 4x1 column vector. It's four-dimensional because in this case we have four nodes, or four hidden units, in this hidden layer.

And then finally, the output layer generates some value a^[2], which is just a real number, and y hat takes on the value of a^[2]. This is analogous to how, in logistic regression, we have y hat = a; in logistic regression we only had that one output layer, so we didn't use the superscript square brackets. With our neural network, we're now going to use the superscript square brackets to indicate explicitly which layer each quantity came from.




This is a two-layer neural network, since we don't count the input layer.






Part 3 : Computing a Neural Network's Output




When we're vectorizing, one rule of thumb that might help you navigate this is that when we have different nodes in a layer, we stack them vertically. That's why we have z^[1]_1 through z^[1]_4: they correspond to the four different nodes in the hidden layer.
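Here is a minimal numpy sketch (my own, not the course's notebook code) of that stacking for a single example x, assuming 3 input features, 4 hidden units and a sigmoid output unit:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, n_h, n_y = 3, 4, 1                 # hypothetical layer sizes
x  = np.random.randn(n_x, 1)            # one example, stacked as a column vector
W1 = np.random.randn(n_h, n_x) * 0.01   # rows of W1 correspond to the 4 hidden units
b1 = np.zeros((n_h, 1))
W2 = np.random.randn(n_y, n_h) * 0.01
b2 = np.zeros((n_y, 1))

z1 = W1 @ x + b1                        # (4, 1): z^[1]_1 ... z^[1]_4 stacked vertically
a1 = np.tanh(z1)                        # a^[1], the hidden-layer activations
z2 = W2 @ a1 + b2                       # (1, 1)
a2 = sigmoid(z2)                        # y-hat = a^[2]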





Part 4: Vectorizing across multiple examples


So we need to vectorize this: 
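A hedged sketch of what that vectorization looks like in numpy: the m training examples are stacked as the columns of X, so Z^[1], A^[1], Z^[2], A^[2] each get one column per example (the sizes below are made up for illustration):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

m = 500                                  # number of training examples (example value)
n_x, n_h, n_y = 3, 4, 1
X  = np.random.randn(n_x, m)             # X = A^[0], shape (n_x, m), one column per example
W1 = np.random.randn(n_h, n_x) * 0.01
b1 = np.zeros((n_h, 1))
W2 = np.random.randn(n_y, n_h) * 0.01
b2 = np.zeros((n_y, 1))

Z1 = W1 @ X + b1                         # (n_h, m); b1 is broadcast across the m columns
A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2                        # (n_y, m)
A2 = sigmoid(Z2)                         # one prediction per column, i.e. per example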










Part 5: Explanation for Vectorized Implementation



Part 6: Activation functions


  • Avoid the sigmoid function for hidden layers.
  • Use tanh, ReLU, or Leaky ReLU instead (a short sketch of these functions follows below).
  • Exception: use the sigmoid function in the output layer when it is a binary classification problem.
  • In the hidden layers you can use any of these; ReLU is the most common default.
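Here is a minimal numpy sketch of these activation functions (my own illustration; alpha in Leaky ReLU is just a commonly used small slope):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))          # output in (0, 1); mainly for a binary-classification output layer

def tanh(z):
    return np.tanh(z)                    # output in (-1, 1); usually works better than sigmoid in hidden layers

def relu(z):
    return np.maximum(0, z)              # 0 for z < 0, z otherwise; the common default for hidden layers

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z) # small non-zero slope for z < 0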


Part 7: Why do you need non-linear activation functions?








Part 8 : Derivatives of activation functions


So, if g of z is the sigmoid function, then the slope of the function is d/dz of g of z, and from calculus this is the slope of g at z. If you are familiar with calculus and know how to take derivatives, then by differentiating the sigmoid function it is possible to show that the derivative is equal to g of z times 1 minus g of z. I'm not going to go through the calculus steps here, but if you know calculus, feel free to pause and try to prove this yourself. Let's just sanity check that this expression makes sense. First, if z is very large, say z equals 10, then g of z will be close to 1, and the formula tells us that d/dz g of z would be close to 1 times (1 minus 1), which is very close to 0. This is indeed correct, because when z is very large, the slope is close to 0. Conversely, if z is equal to minus 10, way out on the left, then g of z is close to 0, and the formula tells us d/dz g of z would be close to 0 times (1 minus 0), which is again very close to 0, which is also correct. Finally, if z is equal to 0, then g of z is equal to one-half (that's the value of the sigmoid at 0), and the derivative is one-half times (1 minus one-half), which equals one-quarter; that turns out to be the correct value of the derivative, i.e. the slope of the function, at z equal to 0.
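A quick numerical check of this (my own sketch, not from the course) using a centred finite difference confirms the three sanity checks above:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

eps = 1e-6
for z in (-10.0, 0.0, 10.0):
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)   # slope estimated numerically
    formula = sigmoid(z) * (1 - sigmoid(z))                       # g(z) * (1 - g(z))
    print(f"z={z:+5.1f}  numeric={numeric:.6f}  formula={formula:.6f}")
# prints ~0 at z = -10 and z = +10, and 0.25 at z = 0, as argued above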







Derivation of the derivative of Tanh(z) :- 

define: u = e^z - e^(-z), v = e^z + e^(-z)

then tanh(z) = u / v

then notice: u_prime = e^z + e^(-z) = v

and: v_prime = e^z - e^(-z) = u

apply quotient rule ( f = u/v -> f_prime = ( u_prime * v - u * v_prime ) / v^2 ):

you get: tanh_prime(z) = (v^2 - u^2) / v^2 = 1 - u^2/v^2 = 1 - ( tanh(z) ) ^2
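The same kind of numerical check (again my own sketch) confirms tanh_prime(z) = 1 - ( tanh(z) )^2:

import numpy as np

eps = 1e-6
for z in (-2.0, 0.0, 2.0):
    numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)   # finite-difference slope
    formula = 1 - np.tanh(z) ** 2                                 # the formula derived above
    print(f"z={z:+5.1f}  numeric={numeric:.6f}  formula={formula:.6f}")
# the slope is 1 at z = 0 and approaches 0 as |z| grows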








Part 9 : Gradient descent for Neural Networks











Part 10: Backpropagation intuition (optional)








One tip when implementing backprop: just make sure that the dimensions of your matrices match up. If you think through the dimensions of your various matrices, including w^[1], w^[2], z^[1], z^[2], a^[1], a^[2], and so on, and check that the dimensions of these matrix operations match up, that alone will already eliminate quite a lot of bugs in backprop.
(Correction to the slide: the text should be dw^[1] instead of dw^[2].)
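A small sketch (my own, not from the course) of that dimension check, for a two-layer network with n_x inputs, n_h hidden units, n_y outputs and m examples:

import numpy as np

n_x, n_h, n_y, m = 3, 4, 1, 500          # example sizes
W1, b1 = np.random.randn(n_h, n_x), np.zeros((n_h, 1))
W2, b2 = np.random.randn(n_y, n_h), np.zeros((n_y, 1))
X = np.random.randn(n_x, m)

Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2

assert Z1.shape == A1.shape == (n_h, m)  # hidden layer: one column per example
assert Z2.shape == (n_y, m)              # output layer
# each gradient matches its parameter: dW1 has W1's shape, db1 has b1's shape, and so on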


Forward Prop:




Back prop:
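Below is a hedged numpy sketch of the backward pass (and the gradient-descent update) for this two-layer network, assuming a tanh hidden layer, a sigmoid output unit and the cross-entropy cost averaged over m examples; the data and sizes here are made up for illustration:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, n_h, n_y, m = 3, 4, 1, 500
X = np.random.randn(n_x, m)
Y = (np.random.rand(n_y, m) > 0.5).astype(float)  # dummy binary labels

W1, b1 = np.random.randn(n_h, n_x) * 0.01, np.zeros((n_h, 1))
W2, b2 = np.random.randn(n_y, n_h) * 0.01, np.zeros((n_y, 1))

# forward pass (values cached for the backward pass)
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)

# backward pass
dZ2 = A2 - Y                                      # (n_y, m)
dW2 = (dZ2 @ A1.T) / m                            # same shape as W2
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)                # (1 - A1**2) is tanh'(Z1)
dW1 = (dZ1 @ X.T) / m                             # same shape as W1
db1 = np.sum(dZ1, axis=1, keepdims=True) / m

# gradient-descent update
learning_rate = 0.01
W1 -= learning_rate * dW1; b1 -= learning_rate * db1
W2 -= learning_rate * dW2; b2 -= learning_rate * db2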






Part 11: Random Initialization


Unlike in logistic regression, we need to initialize the weights to non-zero random values; if they all start at zero, every hidden unit computes the same function.






So here's what to do:-

You can set w1 = np.random.randn(2,2). This generates a (2,2) matrix of Gaussian random values. Then usually you multiply this by a very small number, such as 0.01, so you initialize the weights to very small random values. It turns out that b does not have the symmetry problem (what's called the symmetry-breaking problem), so it's okay to initialize b to just zeros. As long as w is initialized randomly, you start off with the different hidden units computing different things.

So you might be wondering, where did this constant come from and why is it 0.01? Why not 100 or 1000? It turns out that we usually prefer to initialize the weights to very small random values, because if you are using a tanh or sigmoid activation function, or even just have a sigmoid at the output layer, then if the weights are too large, when you compute the activation values (remember that z^[1] = w^[1]x + b^[1], and a^[1] is the activation function applied to z^[1]), some values of z will be either very large or very small. In that case you're more likely to end up on the flat parts of the tanh or sigmoid function, where the slope or gradient is very small, meaning gradient descent will be very slow. So learning will be very slow. To recap: if w is too large, you're more likely to end up, even at the very start of training, with very large values of z, which causes your tanh or sigmoid activation function to be saturated, thus slowing down learning. If you don't have any sigmoid or tanh activation functions throughout your neural network, this is less of an issue. But if you're doing binary classification and your output unit is a sigmoid function, then you just don't want the initial parameters to be too large. So that's why multiplying by 0.01 is something reasonable to try, or any other small number. And the same for w2: it can also be initialized randomly and multiplied by 0.01.
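Putting that together, a short sketch of the initialization (sizes here are just an example, matching the (2,2) case from the lecture plus a single output unit):

import numpy as np

n_x, n_h, n_y = 2, 2, 1                  # example sizes; the lecture's w1 was (2, 2)

W1 = np.random.randn(n_h, n_x) * 0.01    # small random Gaussian values break the symmetry
b1 = np.zeros((n_h, 1))                  # b has no symmetry problem, so zeros are fine
W2 = np.random.randn(n_y, n_h) * 0.01
b2 = np.zeros((n_y, 1))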



A BETTER RESOURCE FOR VISUALIZING BACKPROPAGATION





This site is too good : https://developers-dot-devsite-v2-prod.appspot.com/machine-learning/crash-course/backprop-scroll


Thus WEEK 3 is COMPLETED :)










