Normal Equation
You've seen gradient descent and how it is used to find the optimal parameters that minimize the cost function. The normal equation is another method for finding the parameters that minimize $J(\theta)$.
Finding a Single Parameter:
If the cost function $J(\theta)$ has a single parameter, we can just set $\frac{d}{d\theta}J(\theta) = 0$ and solve for $\theta$. Done!
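For instance, with a made-up one-parameter cost (not from these notes) $J(\theta) = (\theta - 5)^2$:

$$\frac{d}{d\theta}J(\theta) = 2(\theta - 5) = 0 \quad\Longrightarrow\quad \theta = 5.$$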
Finding Two Parameters:
If the cost function $J(\theta_0, \theta_1)$ has two parameters, we can just set $\frac{\partial}{\partial\theta_0}J(\theta_0, \theta_1) = 0$ and $\frac{\partial}{\partial\theta_1}J(\theta_0, \theta_1) = 0$ and solve for $\theta_0$ and $\theta_1$. Done!
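Concretely, assuming the usual squared-error cost $J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\big(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\big)^2$ (not restated in these notes), setting both partial derivatives to zero gives two linear equations:

$$\frac{1}{m}\sum_{i=1}^{m}\big(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\big) = 0, \qquad \frac{1}{m}\sum_{i=1}^{m}\big(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\big)\,x^{(i)} = 0,$$

which can be solved simultaneously for $\theta_0$ and $\theta_1$.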
Finding Many Parameters:
Now imagine using the above approach for a cost function with MANY parameters. Essentially, what if $\theta \in \mathbb{R}^{n+1}$, where $n$ is the number of features and is VERY large? Setting every partial derivative to zero and solving the resulting system by hand would take a long time, right?
So here's an equation that solves for all of $\theta$ at once:

$$\theta = (X^TX)^{-1}X^Ty$$

where $X$ is the $m \times (n+1)$ design matrix whose rows are the training examples (with a leading 1 for the intercept term) and $y$ is the $m$-dimensional vector of target values.
Example of Using the Normal Equation
Note: Unlike gradient descent, feature scaling is not needed when using the normal equation.
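As an illustration, here is a minimal sketch in Python/NumPy (with a small, hypothetical dataset, not the course's original worked example) that computes $\theta$ directly from the formula above:

```python
import numpy as np

# Hypothetical training set: m = 4 examples, n = 1 feature
x = np.array([2104.0, 1416.0, 1534.0, 852.0])   # e.g., house sizes
y = np.array([460.0, 232.0, 315.0, 178.0])      # e.g., prices

# Design matrix X: prepend a column of ones for the intercept term theta_0
X = np.column_stack([np.ones_like(x), x])

# Normal equation: theta = (X^T X)^(-1) X^T y
# np.linalg.solve avoids forming the inverse explicitly (better numerically)
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # [theta_0, theta_1]
```

Note that no feature scaling was applied, consistent with the note above.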
Gradient Descent vs. Normal Equation
So if using the normal equation is so simple, why do we need gradient descent? Here are the pros and cons of each:
Gradient Descent:
Works well even when $n$ (the number of features) is large.

Normal Equation:
Slow when $n$ is very large, since calculating $(X^TX)^{-1}$ is $O(n^3)$.
General Rule of Thumb: If $n > 10{,}000$, use gradient descent instead of the normal equation. Also, even though the normal equation works for linear regression, some other learning algorithms (such as logistic regression) have no closed-form solution like this, so we would have to use gradient descent for them anyway.
What if $X^TX$ is not invertible?
Redundant features: some features are linearly dependent (e.g., one feature is a linear function of another), so just take the redundant features out.
Too many features ($n \geq m$, i.e., more features than training examples): delete some features or use regularization (see the sketch below).
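As a practical aside (my own addition, not stated in these notes): the Moore–Penrose pseudoinverse, e.g. NumPy's np.linalg.pinv, still returns a usable least-squares $\theta$ even when $X^TX$ is singular. A minimal sketch with a deliberately redundant feature:

```python
import numpy as np

# Hypothetical data where the third column is exactly twice the second,
# so the columns of X are linearly dependent and X^T X is singular.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x1), x1, 2.0 * x1])
y = np.array([2.0, 2.5, 3.5, 4.0])

# np.linalg.inv(X.T @ X) would fail (or be numerically meaningless) here,
# but the pseudoinverse still gives a valid least-squares solution.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)
```

Removing the redundant feature (or adding regularization) is still the cleaner fix, as the list above suggests.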