Week 4: Deep Neural Network
Key Concepts
- See deep neural networks as successive blocks placed one after another
- Build and train a deep L-layer Neural Network
- Analyze matrix and vector dimensions to check neural network implementations.
- Understand how to use a cache to pass information from forward propagation to back propagation.
- Understand the role of hyperparameters in deep learning
Part 1 : Deep L-layer neural network
It shows the different layers which will be used.
Part 2 : Forward Propagation in a Deep Network
So it seems that there is a for loop here. When implementing neural networks, we usually want to get rid of explicit for loops, but this is one place where I don't think there is any way to avoid one. When implementing forward propagation, it is perfectly okay to have a for loop that computes the activations for layer one, then layer two, then layer three, then layer four. As far as I know, there is no way to do this without a for loop that goes from 1 to capital L, that is, through the total number of layers in the neural network.
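The loop over layers described above can be sketched as follows. This is a minimal illustration, not the course's reference code: the function name `forward_propagation`, the dictionary keys `W1`, `b1`, ..., and the choice of ReLU for hidden layers with a sigmoid output are all assumptions made for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(X, parameters):
    """Forward prop through L layers (ReLU hidden, sigmoid output: an assumption)."""
    L = len(parameters) // 2          # each layer contributes one W and one b
    A = X                             # A[0] is the input X
    for l in range(1, L + 1):         # the explicit for loop over layers 1..L
        W = parameters["W" + str(l)]
        b = parameters["b" + str(l)]
        Z = W @ A + b                 # vectorized over all m examples at once
        A = sigmoid(Z) if l == L else relu(Z)
    return A

# Hypothetical network: 3 inputs, hidden layers of 4 and 2 units, 1 output unit.
rng = np.random.default_rng(0)
layer_dims = [3, 4, 2, 1]
parameters = {}
for l in range(1, len(layer_dims)):
    parameters["W" + str(l)] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
    parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))

X = rng.standard_normal((3, 5))       # 5 examples stacked as columns
AL = forward_propagation(X, parameters)
print(AL.shape)
```

Only the loop over layers is explicit; the computation inside each layer is still vectorized across all training examples.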
Part 3 : Getting your matrix dimensions right
In the video, at time 6:35, the correct formula should be:
$a^{[l]} = g^{[l]}(z^{[l]})$
Note that $a^{[l]}$ and $z^{[l]}$ both have dimensions $(n^{[l]}, 1)$.
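One way to check dimensions while debugging is to write down the expected shapes and compare them with real arrays. The sketch below assumes the usual convention that W[l] has shape (n[l], n[l-1]) and b[l] has shape (n[l], 1); the helper name `expected_shapes` and the layer sizes are hypothetical.

```python
import numpy as np

def expected_shapes(layer_dims, m):
    """Expected shapes of W, b and Z (same as A) for layers 1..L, with m examples."""
    shapes = {}
    for l in range(1, len(layer_dims)):
        shapes[l] = {
            "W": (layer_dims[l], layer_dims[l - 1]),
            "b": (layer_dims[l], 1),
            "Z": (layer_dims[l], m),  # A[l] has the same shape as Z[l]
        }
    return shapes

layer_dims = [2, 3, 5, 1]                         # n[0]=2, n[1]=3, n[2]=5, n[3]=1
shapes_single = expected_shapes(layer_dims, m=1)  # one example: z, a are (n[l], 1)
shapes_batch = expected_shapes(layer_dims, m=10)  # vectorized: Z, A are (n[l], m)

# Check layer 2 with real arrays: Z2 = W2 A1 + b2, with b2 broadcast across columns.
W2 = np.zeros(shapes_batch[2]["W"])
A1 = np.zeros(shapes_batch[1]["Z"])
b2 = np.zeros(shapes_batch[2]["b"])
Z2 = W2 @ A1 + b2
print(shapes_single[2]["Z"], Z2.shape)
```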
Part 4: Why deep representations?
So a deep neural network with multiple hidden layers might be able to have the earlier layers learn these lower-level simple features, and then have the later, deeper layers put together the simpler things it has detected in order to detect more complex things, like recognizing specific words or even phrases or sentences being uttered, in order to carry out speech recognition.
And what we see is that whereas the earlier layers are computing what seem like relatively simple functions of the input, such as where the edges are, by the time you get deep in the network you can actually do surprisingly complex things, such as detecting faces or detecting words, phrases, or sentences.
The network starts by picking up simple things and slowly builds up to more complex ones as we go deeper.
Let's say you're trying to compute the exclusive OR, or the parity, of all your input features. So you're trying to compute x1 XOR x2 XOR x3 XOR ... up to xn, if you have n input features. If you build an XOR tree, it first computes the XOR of x1 and x2, then takes x3 and x4 and computes their XOR, and so on. Technically, if you're only using AND, OR, or NOT gates, you might need a couple of layers to compute each XOR rather than just one, but you can still compute the XOR with a relatively small circuit. You can then keep building the XOR tree until eventually you have a circuit that outputs, let's call it y: y hat equals y, the exclusive OR, the parity of all these input bits. To compute this XOR, the depth of the network will be on the order of log n, since we just have an XOR tree. So the number of nodes, or circuit components, or gates in this network is not that large; you don't need that many gates to compute the exclusive OR.
But now suppose you are not allowed to use a neural network with multiple hidden layers, in this case on the order of log n hidden layers. If you're forced to compute this function with just one hidden layer, so that all the inputs go into the hidden units, which then output y, then this hidden layer will need to be exponentially large, because essentially you need to exhaustively enumerate the 2^n possible configurations of the input bits that result in the exclusive OR being either 1 or 0. So you end up needing a hidden layer that is exponentially large in the number of bits. I think technically you could do this with 2^(n-1) hidden units, but that is still on the order of 2^n, so it is going to be exponentially large in the number of bits.
So I hope this gives a sense that there are mathematical functions, that are much easier to compute with deep networks than with shallow networks. Actually, I personally found the result from circuit theory less useful for gaining intuitions, but this is one of the results that people often cite when explaining the value of having very deep representations.
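To make the depth argument concrete, here is a small sketch that computes parity with an XOR tree and counts its depth and gates. It is an illustration only: it counts XOR gates rather than the AND/OR/NOT gates each one would expand into, the function name `parity_tree` is hypothetical, and the shallow count of 2^(n-1) assumes one hidden unit per odd-parity input pattern.

```python
def parity_tree(bits):
    """XOR all bits by pairing neighbours level by level.

    Returns (parity, depth, gate_count), counting XOR gates only.
    """
    level = list(bits)
    depth = 0
    gates = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] ^ level[i + 1])   # one XOR gate per pair
            gates += 1
        if len(level) % 2 == 1:
            nxt.append(level[-1])                 # an unpaired bit passes through
        level = nxt
        depth += 1
    return level[0], depth, gates

bits = [1, 0, 1, 1, 0, 1, 0, 0]                   # n = 8 input bits
parity, depth, gates = parity_tree(bits)

# Shallow alternative (assumption): one hidden unit per odd-parity pattern.
shallow_units = 2 ** (len(bits) - 1)
print(parity, depth, gates, shallow_units)
```

For n = 8 the tree needs depth 3 and 7 XOR gates, while the single-hidden-layer count under this assumption is 128 units.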
Part 5 : Building blocks of deep neural networks
So for layer l, you have some parameters $W^{[l]}$ and $b^{[l]}$. For forward prop, you input the activations $a^{[l-1]}$ from the previous layer and output $a^{[l]}$. The way we did this previously was to compute $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$, and then $a^{[l]} = g^{[l]}(z^{[l]})$. That's how you go from the input $a^{[l-1]}$ to the output $a^{[l]}$. It turns out that for later use it is also useful to cache the value $z^{[l]}$, because storing it will be useful for the back propagation step later.
Then for the backward step, again focusing on the computation for this layer l, you implement a function that inputs $da^{[l]}$ and outputs $da^{[l-1]}$. Just to flesh out the details, the input is actually $da^{[l]}$ as well as the cache, so you have available the value of $z^{[l]}$ that you computed. In addition to outputting $da^{[l-1]}$, the function outputs $dW^{[l]}$ and $db^{[l]}$, the gradients you need in order to implement gradient descent for learning.
So this is the basic structure of how you implement the forward step, which we'll call the forward function, as well as the backward step, which we'll call the backward function. To summarize: in layer l, the forward function inputs $a^{[l-1]}$ and outputs $a^{[l]}$, using $W^{[l]}$ and $b^{[l]}$ to make this computation, and it also outputs a cache containing $z^{[l]}$. The backward function then inputs $da^{[l]}$ and outputs $da^{[l-1]}$. It tells you, given the derivatives with respect to these activations, $da^{[l]}$, what the derivatives are with respect to the activations from the previous layer, that is, how much you wish $a^{[l-1]}$ would change. Within this box you need to use $W^{[l]}$ and $b^{[l]}$, and along the way you end up computing $dz^{[l]}$; the backward function can also output $dW^{[l]}$ and $db^{[l]}$. I sometimes use red arrows to denote the backward iteration, so if you prefer, we could draw these arrows in red.
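The forward and backward functions for a single layer might look roughly like this. It is a sketch under stated assumptions: the names `layer_forward` and `layer_backward` are hypothetical, ReLU is assumed as the activation, and the cache also stores the previous activations because they are needed to compute dW.

```python
import numpy as np

def layer_forward(A_prev, W, b):
    """Forward function for one layer: inputs a[l-1], outputs a[l] and a cache."""
    Z = W @ A_prev + b
    A = np.maximum(0, Z)              # ReLU activation (assumption)
    cache = (A_prev, W, b, Z)         # Z plus the values needed by the backward step
    return A, cache

def layer_backward(dA, cache):
    """Backward function for one layer: inputs da[l], outputs da[l-1], dW[l], db[l]."""
    A_prev, W, b, Z = cache
    m = A_prev.shape[1]
    dZ = dA * (Z > 0)                 # element-wise product with g'(Z) for ReLU
    dW = (dZ @ A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

rng = np.random.default_rng(0)
A_prev = rng.standard_normal((3, 4))  # n[l-1] = 3 units, m = 4 examples
W = rng.standard_normal((2, 3))       # n[l] = 2 units
b = np.zeros((2, 1))

A, cache = layer_forward(A_prev, W, b)
dA = np.ones_like(A)                  # placeholder upstream gradient
dA_prev, dW, db = layer_backward(dA, cache)
print(dA_prev.shape, dW.shape, db.shape)
```

Each gradient has the same shape as the quantity it is a derivative of, which is a quick sanity check on the implementation.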
It turns out, as we'll see later, that inside these boxes we end up computing the dz's as well. So one iteration of training through a neural network involves starting with $a^{[0]}$, which is x, going through forward prop to compute y hat, and then using that to run back prop. Now you have all these derivative terms, so W gets updated as $W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}$ for each of the layers, where $\alpha$ is the learning rate, and similarly for b, since back prop has computed all these derivatives. So that's one iteration of gradient descent for your neural network. Now, before moving on, just one more implementational detail. Conceptually, it is useful to think of the cache as storing the value of z for the backward functions. But when you implement this, as you'll see in the programming exercise, you'll find that the cache can also be a convenient way to get the values of the parameters, $W^{[1]}$ and $b^{[1]}$, into the backward function. So in the exercise you actually store z as well as W and b in the cache; for layer 2, for example, it stores $z^{[2]}$, $W^{[2]}$, $b^{[2]}$. From an implementation standpoint, I just find this a convenient way to copy the parameters to where you need them later when computing back propagation.
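The parameter update at the end of one iteration can be sketched as below. The function name `update_parameters` and the dictionary layout (`W1`, `b1`, `dW1`, `db1`, ...) are assumptions for illustration; the rule itself is plain gradient descent, subtracting the learning rate times the gradient from each parameter.

```python
import numpy as np

def update_parameters(parameters, grads, learning_rate):
    """One gradient descent step for every layer's W and b."""
    L = len(parameters) // 2
    updated = {}
    for l in range(1, L + 1):
        updated["W" + str(l)] = parameters["W" + str(l)] - learning_rate * grads["dW" + str(l)]
        updated["b" + str(l)] = parameters["b" + str(l)] - learning_rate * grads["db" + str(l)]
    return updated

# Toy values chosen so the result is easy to verify by hand.
parameters = {"W1": np.ones((2, 2)), "b1": np.zeros((2, 1))}
grads = {"dW1": np.full((2, 2), 0.5), "db1": np.ones((2, 1))}
new_parameters = update_parameters(parameters, grads, learning_rate=0.1)
print(new_parameters["W1"][0, 0], new_parameters["b1"][0, 0])
```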
Part 6: Forward and Backward Propagation
The text that's written should be $dw^{[l]} = dz^{[l]} \, a^{[l-1]T}$.
The $a^{[l-1]}$ should be transposed.
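A quick shape check shows why the transpose is needed. In this sketch the layer sizes are hypothetical; with a single example, dz has shape (n[l], 1) and the previous activation has shape (n[l-1], 1), so only the transposed product yields a matrix shaped like W.

```python
import numpy as np

n_l, n_prev = 3, 4                      # hypothetical layer sizes
W = np.zeros((n_l, n_prev))
dz = np.arange(1, n_l + 1, dtype=float).reshape(n_l, 1)
a_prev = np.ones((n_prev, 1))

dw = dz @ a_prev.T                      # (3, 1) @ (1, 4) gives (3, 4), same as W
print(dw.shape == W.shape)

try:
    dz @ a_prev                         # (3, 1) @ (4, 1): inner dimensions do not match
    transpose_needed = False
except ValueError:
    transpose_needed = True
print(transpose_needed)
```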
Part 7: Parameters vs Hyperparameters
All of these are things that you need to tell your learning algorithm, and so these are parameters that control the ultimate parameters W and b. We call all of these things hyperparameters: things like the learning rate alpha, the number of iterations, the number of hidden layers, and so on are all parameters that control W and b. We call them hyperparameters because it is the hyperparameters that somehow determine the final values of the parameters W and b that you end up with.
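One way to picture the distinction in code: hyperparameters are values you choose up front, while the parameters W and b are arrays whose shapes (and, after training, values) follow from those choices. The dictionary keys and the helper `initialize_parameters` below are hypothetical names used only for this sketch.

```python
import numpy as np

# Hyperparameters: chosen by you before training (values here are arbitrary).
hyperparameters = {
    "learning_rate": 0.01,
    "num_iterations": 1000,
    "layer_dims": [4, 5, 3, 1],   # sets the number of layers and hidden units
}

def initialize_parameters(layer_dims, seed=0):
    """Create the learnable parameters W[l], b[l] implied by the layer sizes."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

# Parameters: learned during training; their shapes follow from layer_dims.
parameters = initialize_parameters(hyperparameters["layer_dims"])
num_learned = sum(p.size for p in parameters.values())
print(num_learned)
```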
Part 8 : What does this have to do with the brain?
Here is the correct set of formulas.
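$dZ^{[L]} = A^{[L]} - Y$
$dW^{[L]} = \frac{1}{m} dZ^{[L]} A^{[L-1]T}$
$db^{[L]} = \frac{1}{m} \text{np.sum}(dZ^{[L]}, \text{axis}=1, \text{keepdims}=\text{True})$
$dZ^{[L-1]} = W^{[L]T} dZ^{[L]} * g'^{[L-1]}(Z^{[L-1]})$
(Note that $*$ denotes element-wise multiplication.)
⋮
$dZ^{[1]} = W^{[2]T} dZ^{[2]} * g'^{[1]}(Z^{[1]})$
$dW^{[1]} = \frac{1}{m} dZ^{[1]} A^{[0]T}$
(Note that $A^{[0]T}$ is another way to denote the input features, which is also written as $X^T$.)
$db^{[1]} = \frac{1}{m} \text{np.sum}(dZ^{[1]}, \text{axis}=1, \text{keepdims}=\text{True})$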
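The formulas above can be checked numerically on a small two-layer network. This sketch assumes a tanh hidden layer, a sigmoid output with cross-entropy cost (which is what makes the output-layer term equal to A minus Y), and made-up data; the function names are hypothetical. It compares one analytic gradient entry against a centered finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, p):
    Z1 = p["W1"] @ X + p["b1"]
    A1 = np.tanh(Z1)                      # hidden activation (assumption)
    Z2 = p["W2"] @ A1 + p["b2"]
    A2 = sigmoid(Z2)                      # output activation (assumption)
    return A1, A2

def cost(X, Y, p):
    _, A2 = forward(X, p)
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m)

def backward(X, Y, p):
    m = X.shape[1]
    A1, A2 = forward(X, p)
    dZ2 = A2 - Y
    dW2 = (dZ2 @ A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (p["W2"].T @ dZ2) * (1 - A1 ** 2)   # tanh'(Z1) = 1 - A1^2, element-wise
    dW1 = (dZ1 @ X.T) / m                     # A[0] = X
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 6))
Y = (rng.random((1, 6)) > 0.5).astype(float)
p = {"W1": rng.standard_normal((4, 3)) * 0.5, "b1": np.zeros((4, 1)),
     "W2": rng.standard_normal((1, 4)) * 0.5, "b2": np.zeros((1, 1))}
grads = backward(X, Y, p)

# Centered finite difference on a single entry of W1.
eps = 1e-6
p_plus = {k: v.copy() for k, v in p.items()}
p_minus = {k: v.copy() for k, v in p.items()}
p_plus["W1"][0, 0] += eps
p_minus["W1"][0, 0] -= eps
numeric = (cost(X, Y, p_plus) - cost(X, Y, p_minus)) / (2 * eps)
print(abs(numeric - grads["W1"][0, 0]))
```

If the transpose on the weight matrix were dropped in the hidden-layer step, the matrix product would fail on shape grounds for these layer sizes, which is one practical way such errors surface.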
Note:
Question 7
During forward propagation, the forward function for layer l needs to know which activation function the layer uses (sigmoid, tanh, ReLU, etc.). During backpropagation, the corresponding backward function also needs to know which activation function layer l uses, since the gradient depends on it.
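One way to keep the forward and backward steps in agreement is to store each activation together with its derivative and record the choice in the cache. The sketch below is an assumption about how this could be organized; the names `ACTIVATIONS`, `activation_forward`, and `activation_backward` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each entry pairs an activation g with its derivative g'.
ACTIVATIONS = {
    "sigmoid": (sigmoid, lambda z: sigmoid(z) * (1 - sigmoid(z))),
    "tanh": (np.tanh, lambda z: 1 - np.tanh(z) ** 2),
    "relu": (lambda z: np.maximum(0, z), lambda z: (z > 0).astype(float)),
}

def activation_forward(Z, name):
    g, _ = ACTIVATIONS[name]
    return g(Z), (Z, name)            # the cache remembers which activation was used

def activation_backward(dA, cache):
    Z, name = cache
    _, g_prime = ACTIVATIONS[name]
    return dA * g_prime(Z)            # dZ = dA * g'(Z), element-wise

Z = np.array([[-1.0, 0.0, 2.0]])
dA = np.ones_like(Z)
dZ_relu = activation_backward(dA, activation_forward(Z, "relu")[1])
dZ_sigmoid = activation_backward(dA, activation_forward(Z, "sigmoid")[1])
dZ_tanh = activation_backward(dA, activation_forward(Z, "tanh")[1])
print(dZ_relu, dZ_sigmoid[0, 1], dZ_tanh[0, 1])
```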
Thus the Course completes :)