# Training A Neural Network

Refer to [Intro to Neural Networks](https://victorzhou.com/blog/intro-to-neural-networks/)

**Big idea:** Training a network means **minimizing its loss**

## Overall Process

1. Give the outputs numbers (0 represents Male, 1 represents Female)
2. Shift the data by its mean (normalization)
3. Calculate the **loss** function: measures the error between $$y\_{true}$$ and $$y\_{pred}$$
4. Use **backpropagation** to quantify how much each weight contributes to the loss
5. Use an **optimization** algorithm that tells us how to change the weights and biases to minimize loss (e.g. **gradient descent**)
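The steps above can be sketched end to end for a single neuron (a minimal illustration only, not the full network from the diagram; the data values below are made up):

```python
from math import exp

def sigmoid(z):
    return 1 / (1 + exp(-z))

# Steps 1-2: numeric labels (0 = Male, 1 = Female), inputs already mean-shifted (made-up data)
data = [(-2.0, 1), (25.0, 0), (17.0, 0), (-15.0, 1)]

w, b = 0.0, 0.0   # parameters of a single neuron
eta = 0.01        # learning rate

for epoch in range(1000):
    for x, y_true in data:
        y_pred = sigmoid(w * x + b)
        # Step 4: backpropagation -- gradient of L = (y_true - y_pred)^2
        dL_dypred = -2 * (y_true - y_pred)
        dypred_dz = y_pred * (1 - y_pred)
        # Step 5: gradient descent update
        w -= eta * dL_dypred * dypred_dz * x
        b -= eta * dL_dypred * dypred_dz

# Step 3: final loss (MSE) after training
mse = sum((y - sigmoid(w * x + b)) ** 2 for x, y in data) / len(data)
```

Each piece of this loop (the loss, the partial derivatives, the update rule) is derived in the sections below.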

## Loss

* Quantifies how "good" network is at predicting
* Trying to minimize this

### Mean Squared Error

$$Squared Error = (y\_{true} − y\_{pred})^2$$

$$MSE=\frac{1}{n}∑ (y\_{true} − y\_{pred})^2$$

* Takes the average of the squared error over all $$n$$ samples
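As a quick sketch, MSE in plain Python (the labels and predictions in the example are made up to illustrate):

```python
def mse_loss(y_true, y_pred):
    """Mean squared error: average of (y_true - y_pred)^2 over all samples."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

# Made-up example: four labels, with the network predicting 0 for every sample
print(mse_loss([1, 0, 0, 1], [0, 0, 0, 0]))  # → 0.5
```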

## Adjusting weights and biases to decrease loss

Assuming we only have 1 item in the dataset, with $$y\_{true} = 1$$ (for simplicity):

$$MSE=\frac{1}{1}∑ (1 − ypred)^2$$

$$MSE= (1 − ypred)^2$$

So $$L=(1 − ypred)^2$$

We can write **loss** as a *multivariable function of the weights and bias*:

$$L(w\_1, w\_2, w\_3, w\_4, w\_5, w\_6, b\_1, b\_2, b\_3)$$

**Question**: How would tweaking $$w\_1$$ affect loss? How do we find this out?

* Take partial derivative of $$L$$ with respect to $$w\_1$$
  * i.e. Solve for $$\frac{∂L}{∂w\_1}$$

## Solve for $$\frac{∂L}{∂w\_1}$$

$$\frac{∂L}{∂w\_1}$$ **=** $$\frac{∂L}{∂ypred} \frac{∂ypred}{∂w\_1}$$ - *Chain Rule*

So now, we must solve for $$\frac{∂L}{∂ypred}$$ and $$\frac{∂ypred}{∂w\_1}$$

### Solving for $$\frac{∂L}{∂ypred}$$ (Easy!)

Remember $$L=(1 − ypred)^2$$, so simply differentiate with respect to $$ypred$$

**Result**: $$\frac{∂L}{∂ypred} = -2(1-ypred)$$

### Solving for $$\frac{∂ypred}{∂w\_1}$$ (Not as obvious)

Remember that $$ypred$$ is really just the output. From our neuron calculations before, output is just $$o\_1$$

So to calculate $$ypred$$...

$$ypred = o\_1$$ $$o\_1 = f(h\_1\*w\_5 + h\_2 \* w\_6 + b\_3)$$ (Refer to the neural network diagram)

But even now, $$w\_1$$ does not appear in this equation, so we still can't solve it properly. We need to break this down even further.

Remember that we can also calculate $$h\_1$$ and $$h\_2$$ as... $$h\_1 = f(w\_1\*x\_1 + w\_2\*x\_2 + b\_1)$$ $$h\_2 = f(w\_3\*x\_1 + w\_4\*x\_2 + b\_2)$$

Now we have $$w\_1$$! Since $$w\_1$$ only appears in $$h\_1$$ (meaning that $$w\_1$$ only affects $$h\_1$$), we can apply the chain rule through $$h\_1$$ alone. So we can rewrite $$\frac{∂ypred}{∂w\_1}$$ in a solvable form now.

**Result**: $$\frac{∂ypred}{∂w\_1}$$ = $$\frac{∂ypred}{∂h\_1}\frac{∂h\_1}{∂w\_1}$$

If you want to simplify this further...

$$\frac{∂ypred}{∂h\_1}=f'(h\_1\*w\_5 + h\_2\*w\_6 + b\_3)\*w\_5$$

$$\frac{∂h\_1}{∂w\_1}=f'(w\_1\*x\_1+w\_2\*x\_2+b\_1)\*x\_1$$

Since we've seen $$f'(x)$$ multiple times, might as well solve for that too. $$f(x) = \frac{1}{1+e^{-x}}$$

$$f'(x) = \frac{e^{-x}}{(1+e^{-x})^2}=f(x)\*(1-f(x))$$
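This identity can be sanity-checked numerically against a finite-difference approximation (a quick sketch):

```python
from math import exp

def f(x):
    """Sigmoid: f(x) = 1 / (1 + e^-x)."""
    return 1 / (1 + exp(-x))

def f_prime(x):
    """Derivative via the identity f'(x) = f(x) * (1 - f(x))."""
    return f(x) * (1 - f(x))

# Compare against a central finite difference at a few points
h = 1e-6
for x in (-2.0, 0.0, 1.5):
    numeric = (f(x + h) - f(x - h)) / (2 * h)
    assert abs(numeric - f_prime(x)) < 1e-8
```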

So to sum it up...

## $$\frac{∂L}{∂w\_1}=\frac{∂L}{∂ypred}\frac{∂ypred}{∂h\_1}\frac{∂h\_1}{∂w\_1}$$

## Calculating $$\frac{∂L}{∂w\_1}$$:

* Result is 0.0214
* Means that if $$w\_1$$ increases, the loss also increases, but only by a bit, since the slope is small - think of a nearly flat linear graph
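Plugging in example values (inputs $$x = [-2, -1]$$, all weights set to 1, all biases 0, $$y\_{true} = 1$$; these values are assumed from the linked intro post, not stated above), the chain rule can be evaluated directly:

```python
from math import exp

def f(x):
    return 1 / (1 + exp(-x))

def f_prime(x):
    return f(x) * (1 - f(x))

# Assumed example setup: all weights = 1, all biases = 0
x1, x2 = -2, -1
w1 = w2 = w3 = w4 = w5 = w6 = 1
b1 = b2 = b3 = 0
y_true = 1

h1 = f(w1 * x1 + w2 * x2 + b1)
h2 = f(w3 * x1 + w4 * x2 + b2)
y_pred = f(w5 * h1 + w6 * h2 + b3)  # o1

# Chain rule: dL/dw1 = dL/dy_pred * dy_pred/dh1 * dh1/dw1
dL_dypred = -2 * (1 - y_pred)
dypred_dh1 = f_prime(w5 * h1 + w6 * h2 + b3) * w5
dh1_dw1 = f_prime(w1 * x1 + w2 * x2 + b1) * x1

dL_dw1 = dL_dypred * dypred_dh1 * dh1_dw1  # ≈ 0.0214
```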

## Optimization Algorithm

* Using [stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) (SGD)
  * $$w\_1 ← w\_1 − η\frac{∂L}{∂w\_1}$$
    * $$η$$ is the learning rate (a constant)
  * If $$\frac{∂L}{∂w\_1}>0 → w\_1$$ decreases $$→ L$$ decreases
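The update rule as a one-line sketch ($$η$$ and the starting weight below are illustrative values, not taken from the example network):

```python
eta = 0.1        # learning rate (illustrative value)
w1 = 0.5         # current weight (illustrative value)
dL_dw1 = 0.0214  # partial derivative computed above

# SGD step: move w1 against the gradient to reduce the loss
w1 = w1 - eta * dL_dw1
print(round(w1, 5))  # → 0.49786
```

Since $$\frac{∂L}{∂w\_1}$$ is positive here, the update decreases $$w\_1$$, which in turn decreases the loss.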
