Training A Neural Network
Refer to Intro to Neural Networks
Big idea: TRAINING A NETWORK MEANS MINIMIZING ITS LOSS
Overall Process
Encode the outputs as numbers (0 represents Male, 1 represents Female)
Shift the data by its mean (normalization)
Calculate the loss: measure the mistakes between y_true and y_pred
Use backpropagation to quantify how much each particular weight contributes to those mistakes
Use an optimization algorithm that tells us how to change the weights and biases to minimize the loss, e.g. gradient descent (a rough sketch of the whole loop follows below)
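A runnable sketch of these steps on a made-up dataset, using a single sigmoid neuron rather than the full network from the diagram (the data values, learning rate, and epoch count are all illustrative assumptions, not taken from these notes):

```python
import numpy as np

# Toy data: [weight_lbs, height_in] per person; labels: 0 = Male, 1 = Female.
# All numbers here are made up for illustration.
data = np.array([[133.0, 65.0], [160.0, 72.0], [152.0, 70.0], [120.0, 60.0]])
y_true = np.array([1.0, 0.0, 0.0, 1.0])

data = data - data.mean(axis=0)   # shift the data by its mean (normalization)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

w = np.zeros(2)   # weights of a single sigmoid neuron
b = 0.0           # bias
eta = 0.1         # learning rate

for epoch in range(1000):
    for x, y in zip(data, y_true):
        y_pred = sigmoid(np.dot(w, x) + b)
        # Backpropagation for L = (y - y_pred)^2, via the chain rule
        dL_dypred = -2 * (y - y_pred)
        dypred_dz = y_pred * (1 - y_pred)
        # Gradient descent update for the weights and the bias
        w -= eta * dL_dypred * dypred_dz * x
        b -= eta * dL_dypred * dypred_dz

print(sigmoid(data @ w + b))   # predictions should move toward y_true
```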
Loss
Quantifies how "good" the network is at predicting
Training tries to minimize this
Mean Squared Error
Squared error = (y_true - y_pred)^2
MSE = (1/n) * Σ (y_true - y_pred)^2 - Takes the average of the squared errors
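A small sketch of this formula (assuming NumPy arrays; the helper name mse_loss is just illustrative):

```python
import numpy as np

def mse_loss(y_true, y_pred):
    # Mean squared error: average of (y_true - y_pred)^2 over all samples
    return ((y_true - y_pred) ** 2).mean()

# Perfect predictions give 0 loss; wrong ones give a positive loss.
print(mse_loss(np.array([1, 0, 0, 1]), np.array([1, 0, 0, 1])))  # 0.0
print(mse_loss(np.array([1, 0, 0, 1]), np.array([0, 0, 0, 0])))  # 0.5
```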
Adjusting weights and biases to decrease loss
Assuming we only have 1 item in the dataset (for simplicity):
MSE = (1/1) * Σ (1 - y_pred)^2
MSE = (1 - y_pred)^2
So L = (1 - y_pred)^2
We can write the loss as a multivariable function of the weights and biases:
L(w1,w2,w3,w4,w5,w6,b1,b2,b3)
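To make "loss as a function of the parameters" concrete, here is a small sketch that treats L as a Python function of all nine parameters, assuming the 2-input, 2-hidden, 1-output layout from the diagram and one fixed training sample (the sample values are placeholders, not from these notes):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# One fixed training sample (placeholder values); only the parameters vary below.
x1, x2, y_true = -2.0, -1.0, 1.0

def L(w1, w2, w3, w4, w5, w6, b1, b2, b3):
    h1 = sigmoid(w1 * x1 + w2 * x2 + b1)       # hidden neuron h1
    h2 = sigmoid(w3 * x1 + w4 * x2 + b2)       # hidden neuron h2
    y_pred = sigmoid(h1 * w5 + h2 * w6 + b3)   # output neuron o1
    return (y_true - y_pred) ** 2              # with y_true = 1 this is (1 - y_pred)^2

# Tweaking any one argument changes the loss.
print(L(1, 1, 1, 1, 1, 1, 0, 0, 0))
print(L(1.1, 1, 1, 1, 1, 1, 0, 0, 0))   # slightly different w1, slightly different loss
```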
Question: How would tweaking w1 affect the loss? How do we find this out?
Take the partial derivative of L with respect to w1
i.e. solve for ∂L/∂w1
Solve for ∂L/∂w1
∂L/∂w1 = (∂L/∂y_pred) * (∂y_pred/∂w1) - Chain Rule
So now, we must solve for ∂L/∂y_pred and ∂y_pred/∂w1
Solving for ∂L/∂y_pred (Easy!)
Remember L = (1 - y_pred)^2, so simply differentiate with respect to y_pred: ∂L/∂y_pred = 2(1 - y_pred) * (-1)
Result: ∂L/∂y_pred = -2(1 - y_pred)
Solving for ∂y_pred/∂w1 (Not as obvious)
Remember that y_pred is really just the output. From our neuron calculations before, the output is just o1
So to calculate y_pred...
y_pred = o1
o1 = f(h1 * w5 + h2 * w6 + b3) (Refer to the neural network diagram)
But even now, w1 does not appear in this equation, so we still can't differentiate with respect to it; we need to break things down even further.
Remember that we can also calculate h1 and h2 as...
h1 = f(w1 * x1 + w2 * x2 + b1)
h2 = f(w3 * x1 + w4 * x2 + b2)
Now we have w1! Since w1 only appears in h1 (meaning that w1 only affects h1), we can go through h1. So we can rewrite ∂y_pred/∂w1 in a solvable form now.
Result: ∂y_pred/∂w1 = (∂y_pred/∂h1) * (∂h1/∂w1)
If you want to write these two factors out explicitly...
∂y_pred/∂h1 = f'(h1 * w5 + h2 * w6 + b3) * w5
∂h1/∂w1 = f'(w1 * x1 + w2 * x2 + b1) * x1
Since we've seen f'(x) multiple times, we might as well solve for that too. f(x) = 1 / (1 + e^(-x))
f'(x) = e^(-x) / (1 + e^(-x))^2 = f(x) * (1 - f(x))
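A quick numerical check of that identity, assuming f is the sigmoid defined above:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def deriv_sigmoid(x):
    # f'(x) = f(x) * (1 - f(x))
    fx = sigmoid(x)
    return fx * (1 - fx)

# Compare against the e^-x / (1 + e^-x)^2 form at a few points; they agree.
for x in (-3.0, 0.0, 2.0):
    direct = np.exp(-x) / (1 + np.exp(-x)) ** 2
    print(x, deriv_sigmoid(x), direct)
```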
So to sum it up...
∂L/∂w1 = (∂L/∂y_pred) * (∂y_pred/∂h1) * (∂h1/∂w1)
Calculating ∂L/∂w1:
Result is 0.0214
This means that if w1 increases, the loss also increases (only by a bit, because the slope is pretty flat) - think of a linear graph with a small positive slope
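The notes don't record the inputs and starting weights behind the 0.0214 figure, so the values below (x1 = -2, x2 = -1, y_true = 1, all weights 1, all biases 0) are an assumption; plugging them into the three chain-rule factors gives roughly that number:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def deriv_sigmoid(x):
    fx = sigmoid(x)
    return fx * (1 - fx)

# Assumed sample and parameters (not shown in the notes): all weights 1, all biases 0.
x1, x2, y_true = -2.0, -1.0, 1.0
w1 = w2 = w3 = w4 = w5 = w6 = 1.0
b1 = b2 = b3 = 0.0

# Forward pass
sum_h1 = w1 * x1 + w2 * x2 + b1
h1 = sigmoid(sum_h1)
h2 = sigmoid(w3 * x1 + w4 * x2 + b2)
sum_o1 = h1 * w5 + h2 * w6 + b3
y_pred = sigmoid(sum_o1)

# The three chain-rule factors
dL_dypred = -2 * (y_true - y_pred)        # dL/dy_pred
dypred_dh1 = w5 * deriv_sigmoid(sum_o1)   # dy_pred/dh1
dh1_dw1 = x1 * deriv_sigmoid(sum_h1)      # dh1/dw1

dL_dw1 = dL_dypred * dypred_dh1 * dh1_dw1
print(dL_dw1)   # ~0.0215, matching the ~0.0214 quoted above up to rounding
```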
Optimization Algorithm
Using stochastic gradient descent (SGD)
w1 ← w1 - η * ∂L/∂w1
η is the learning rate (a constant)
If ∂L/∂w1 > 0 → w1 decreases → L decreases
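A minimal illustration of one SGD step, using a made-up learning rate and the gradient value computed above:

```python
eta = 0.1          # learning rate (illustrative value)
w1 = 1.0           # current weight (illustrative value)
dL_dw1 = 0.0214    # positive gradient, as computed above

w1 = w1 - eta * dL_dw1   # w1 <- w1 - eta * dL/dw1
print(w1)                # 0.99786: a positive gradient pushes w1 down, moving L downhill
```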