Training A Neural Network

Refer to Intro to Neural Networks

Big idea: TRAINING A NETWORK MEANS MINIMIZING ITS LOSS

Overall Process

  1. Encode the outputs as numbers (0 represents Male, 1 represents Female)

  2. Shift the input data by its mean (normalization)

  3. Calculate the loss function: measure any mistakes between $y_{true}$ and $y_{pred}$.

  4. Use backpropagation to quantify how much a particular weight contributes to the mistakes (i.e. the loss).

  5. Use an optimization algorithm that tells us how to change the weights and biases to minimize loss (e.g. gradient descent); a minimal sketch of this loop follows below.
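To make these steps concrete, here is a minimal sketch of the training loop in Python. The single-neuron model, the made-up data, and the learning rate are all assumptions for illustration, not the exact network from these notes:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# 1. Encode outputs as numbers (0 = Male, 1 = Female).
y_true = np.array([0.0, 1.0, 1.0, 0.0])

# 2. Shift the inputs by their mean (normalization). Feature values are made up.
X = np.array([[170.0, 70.0], [160.0, 55.0], [155.0, 50.0], [180.0, 80.0]])
X = X - X.mean(axis=0)

# A single sigmoid neuron stands in for the full network in this sketch.
rng = np.random.default_rng(0)
w = rng.normal(size=2)
b = 0.0
eta = 0.1  # learning rate

for epoch in range(1000):
    # 3. Forward pass and loss (MSE).
    y_pred = sigmoid(X @ w + b)
    loss = np.mean((y_true - y_pred) ** 2)

    # 4. Backpropagation: chain rule for this one-neuron case.
    dL_dpred = -2 * (y_true - y_pred) / len(X)
    dpred_dz = y_pred * (1 - y_pred)
    grad_w = X.T @ (dL_dpred * dpred_dz)
    grad_b = np.sum(dL_dpred * dpred_dz)

    # 5. Gradient descent update: move weights against the gradient.
    w -= eta * grad_w
    b -= eta * grad_b
```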

Loss

  • Quantifies how "good" the network is at predicting

  • Trying to minimize this

Mean Squared Error

$\text{Squared Error} = (y_{true} - y_{pred})^2$

$MSE = \frac{1}{n}∑ (y_{true} - y_{pred})^2$ - Takes the average of the squared errors
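As a quick sketch of the formula in Python (the sample labels and predictions are made up for illustration):

```python
import numpy as np

# Hypothetical labels and predictions, just to show the formula.
y_true = np.array([1, 0, 0, 1])
y_pred = np.array([0.8, 0.2, 0.6, 0.9])

mse = np.mean((y_true - y_pred) ** 2)  # average of the squared errors
print(mse)  # 0.1125
```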

Adjusting weights and biases to decrease loss

Assuming we only have 1 item in the dataset (for simplicity), and that its true label is $y_{true} = 1$:

$MSE = \frac{1}{1}∑ (1 - y_{pred})^2$

$MSE = (1 - y_{pred})^2$

So $L = (1 - y_{pred})^2$

We can write the loss as a multivariable function of the weights and biases:

$L(w_1, w_2, w_3, w_4, w_5, w_6, b_1, b_2, b_3)$

Question: How would tweaking $w_1$ affect the loss? How do we find this out?

  • Take the partial derivative of $L$ with respect to $w_1$

    • i.e. Solve for $\frac{∂L}{∂w_1}$

Solve for $\frac{∂L}{∂w_1}$

$\frac{∂L}{∂w_1} = \frac{∂L}{∂y_{pred}} \frac{∂y_{pred}}{∂w_1}$ - Chain Rule

So now, we must solve for $\frac{∂L}{∂y_{pred}}$ and $\frac{∂y_{pred}}{∂w_1}$

Solving for $\frac{∂L}{∂y_{pred}}$ (Easy!)

Remember $L = (1 - y_{pred})^2$, so we can differentiate directly:

Result: $\frac{∂L}{∂y_{pred}} = -2(1 - y_{pred})$

Solving for $\frac{∂y_{pred}}{∂w_1}$ (Not as obvious)

Remember that $y_{pred}$ is really just the output. From our neuron calculations before, the output is just $o_1$.

So to calculate $y_{pred}$...

$y_{pred} = o_1$

$o_1 = f(h_1 * w_5 + h_2 * w_6 + b_3)$ (Refer to the neural network diagram)

But even now, $w_1$ does not appear in this equation, so we still can't solve it properly. We need to break this down even further.

Remember that we can also calculate $h_1$ and $h_2$ as...

$h_1 = f(w_1 * x_1 + w_2 * x_2 + b_1)$

$h_2 = f(w_3 * x_1 + w_4 * x_2 + b_2)$

Now we have $w_1$! Since $w_1$ only appears in $h_1$ (meaning that $w_1$ only affects $h_1$), we can route the derivative through $h_1$, so we can rewrite $\frac{∂y_{pred}}{∂w_1}$ in a solvable form now.

Result: $\frac{∂y_{pred}}{∂w_1} = \frac{∂y_{pred}}{∂h_1}\frac{∂h_1}{∂w_1}$

If you want to simplify this further...

$\frac{∂y_{pred}}{∂h_1} = f'(h_1 * w_5 + h_2 * w_6 + b_3) * w_5$

$\frac{∂h_1}{∂w_1} = f'(w_1 * x_1 + w_2 * x_2 + b_1) * x_1$

Since we've seen $f'(x)$ multiple times, might as well solve for that too. $f(x) = \frac{1}{1+e^{-x}}$

$f'(x) = \frac{e^{-x}}{(1+e^{-x})^2} = f(x) * (1 - f(x))$
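As a small check, $f$ and $f'$ can be written directly in Python (assuming $f$ is the sigmoid from these notes):

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x))
    return 1 / (1 + np.exp(-x))

def d_sigmoid(x):
    # f'(x) = f(x) * (1 - f(x))
    fx = sigmoid(x)
    return fx * (1 - fx)

print(sigmoid(0.0))    # 0.5
print(d_sigmoid(0.0))  # 0.25
```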

So to sum it up...

$\frac{∂L}{∂w_1} = \frac{∂L}{∂y_{pred}}\frac{∂y_{pred}}{∂h_1}\frac{∂h_1}{∂w_1}$

Calculating $\frac{∂L}{∂w_1}$:

  • Result is 0.0214

  • Means that if $w_1$ increases, the loss also increases (only by a little, because the slope is pretty flat) - think of a nearly flat linear graph (a numeric check of this value is sketched below)
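To see where a number like 0.0214 can come from, here is a sketch of the chain-rule calculation in Python. The specific values (all weights 1, all biases 0, inputs $x_1 = -2$, $x_2 = -1$, $y_{true} = 1$) are assumptions for illustration, not values stated in these notes:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def d_sigmoid(x):
    fx = sigmoid(x)
    return fx * (1 - fx)

# Assumed values, purely for illustration.
w1 = w2 = w3 = w4 = w5 = w6 = 1.0
b1 = b2 = b3 = 0.0
x1, x2, y_true = -2.0, -1.0, 1.0

# Forward pass through the 2-2-1 network.
sum_h1 = w1 * x1 + w2 * x2 + b1
h1 = sigmoid(sum_h1)
sum_h2 = w3 * x1 + w4 * x2 + b2
h2 = sigmoid(sum_h2)
sum_o1 = w5 * h1 + w6 * h2 + b3
y_pred = sigmoid(sum_o1)

# Each factor of the chain rule.
dL_dypred = -2 * (y_true - y_pred)      # dL/dy_pred
dypred_dh1 = d_sigmoid(sum_o1) * w5     # dy_pred/dh1
dh1_dw1 = d_sigmoid(sum_h1) * x1        # dh1/dw1

dL_dw1 = dL_dypred * dypred_dh1 * dh1_dw1
print(dL_dw1)  # ~0.021 with these assumed values: a small positive slope, like the 0.0214 above
```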

Optimization Algorithm

  • Using stochastic gradient descent (SGD)

    • $w_1 ← w_1 - η\frac{∂L}{∂w_1}$

      • η is the learning rate, a constant that controls how big each update step is

    • If $\frac{∂L}{∂w_1} > 0$ → $w_1$ decreases → $L$ decreases (see the sketch of one update step below)
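A single SGD update for $w_1$ might look like this sketch, assuming a learning rate η of 0.1 and the gradient value from above (both numbers are illustrative):

```python
eta = 0.1          # learning rate (assumed value)
dL_dw1 = 0.0214    # gradient from the calculation above
w1 = 1.0           # current weight (assumed value)

w1 = w1 - eta * dL_dw1   # w1 <- w1 - eta * dL/dw1
print(w1)  # ~0.99786: w1 moves down slightly, which nudges L down
```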
