Computing Parameters Analytically
You've seen gradient descent and how it is used to find the parameter values $\theta$ that minimize the cost function $J(\theta)$. Using the normal equation is another method to find the optimal parameters that minimize $J(\theta)$.
If the cost function $J(\theta)$ has a single parameter $\theta$, we could just set $\frac{dJ}{d\theta} = 0$ and solve for $\theta$. Done!

If the cost function $J(\theta_0, \theta_1)$ has 2 parameters, we could just set $\frac{\partial J}{\partial \theta_0} = 0$ and $\frac{\partial J}{\partial \theta_1} = 0$ and solve for $\theta_0$ and $\theta_1$. Done!
Now imagine using the same approach for a cost function with MANY parameters. Essentially, what if the cost function is $J(\theta_0, \theta_1, \dots, \theta_n)$, where $n$ represents the number of features and is VERY large? Solving each of those equations one at a time would take a long time, right?
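To make that concrete, here is a sketch of what "solve for every parameter" means, assuming the usual squared-error cost for linear regression (the same $J(\theta)$ used with gradient descent):

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} = 0 \quad \text{for } j = 0, 1, \dots, n$$

That is $n + 1$ simultaneous equations in $n + 1$ unknowns, and the normal equation solves all of them in one step.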
So here's an equation, the normal equation, that solves for all of the $\theta_j$ values altogether:

$$\theta = (X^T X)^{-1} X^T y$$

where $X$ is the $m \times (n+1)$ design matrix whose rows are the training examples (with a leading 1 for the intercept term) and $y$ is the $m$-vector of target values.
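As a sanity check on the formula, here is a minimal NumPy sketch (the data, the helper name `normal_equation`, and the use of the pseudo-inverse instead of a plain inverse are illustrative choices, not part of the original notes):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form solution: theta = (X^T X)^{-1} X^T y.

    X is the m x (n+1) design matrix (first column of ones for the
    intercept term); y is the length-m vector of targets. Using the
    pseudo-inverse (pinv) keeps this working even when X^T X is
    singular or nearly so.
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Tiny made-up example: m = 4 examples, n = 1 feature.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])   # first column is the intercept term
y = np.array([6.0, 5.0, 7.0, 10.0])

theta = normal_equation(X, y)
print(theta)  # [intercept, slope] that minimize the squared error
```

In practice, `np.linalg.lstsq(X, y, rcond=None)` solves the same least-squares problem more stably, but the explicit form above mirrors the equation.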
Note: Unlike gradient descent, feature scaling is not needed when using the normal equation.
So if using the normal equation is so simple, why do we need gradient descent at all? Here are the pros and cons of each:
| Gradient Descent | Normal Equation |
| --- | --- |
| Need to choose alpha | No need to choose alpha |
| Needs many iterations | No need to iterate |
| $O(kn^2)$ | $O(n^3)$, since calculating $(X^T X)^{-1}$ is costly |
| Works well even if $n$ is large | Slow if $n$ is large |
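For contrast with the closed-form solution, here is a minimal batch gradient descent sketch for the same linear regression problem (the learning rate `alpha`, iteration count, and names are illustrative assumptions, not values from the notes). Notice that it needs both an alpha and a loop, exactly the two downsides listed in the table:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression.

    Unlike the normal equation, this needs a learning rate (alpha)
    and many iterations, but each iteration costs only O(m * n).
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        gradient = (X.T @ (X @ theta - y)) / m  # (1/m) * X^T (X theta - y)
        theta -= alpha * gradient
    return theta

# Reusing X and y from the normal-equation example above:
# theta_gd = gradient_descent(X, y, alpha=0.1, num_iters=5000)
# theta_gd should come out close to the closed-form theta.
```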
General Rule of Thumb: If $n > 10{,}000$ or so, then use gradient descent instead of the normal equation. Also, even though the normal equation works with linear regression, some algorithms (such as logistic regression) have no closed-form solution, so we would have to use gradient descent anyway.
What if $X^T X$ is noninvertible (singular)? The two common causes, and their fixes, are (see the sketch after this list):

- Redundant features (some features are linearly dependent, e.g., the same quantity measured in two different units), so just take those features out
- Too many features ($m \le n$, i.e., fewer training examples than features), so delete some features or use regularization
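As a quick illustration of the redundant-feature case, here is a small NumPy sketch (the duplicated feature and the data are made up for demonstration): with an exactly duplicated column, $X^T X$ is singular, but `np.linalg.pinv` still returns a usable least-squares solution.

```python
import numpy as np

# Design matrix with a redundant feature: the third column is exactly
# twice the second (the same quantity in two different units), so the
# columns are linearly dependent and X^T X is singular.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0]])
y = np.array([2.0, 3.0, 5.0])

A = X.T @ X
print(np.linalg.matrix_rank(A))   # 2, not 3 -> A is not invertible

# np.linalg.inv(A) fails here because A is singular; the pseudo-inverse
# still returns a least-squares solution for theta.
theta = np.linalg.pinv(A) @ X.T @ y
print(theta)

# The cleaner fix is the one listed above: drop the redundant column
# (or use regularization) so that X^T X becomes invertible.
```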