Computing Parameters Analytically
Normal Equation
You've seen gradient descent and how it is used to find the parameters that minimize the cost function. The normal equation is another method for finding those optimal parameters, and it minimizes the cost function analytically in a single step rather than iteratively.
Finding Single Parameter:
If the cost function $J(\theta)$ has a single parameter $\theta$, we could just set $\frac{dJ}{d\theta} = 0$ and solve for $\theta$. Done!
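For instance, with a made-up one-parameter quadratic cost (an illustrative example, not the course's exact $J$):

$$J(\theta) = (\theta - 3)^2 \quad\Rightarrow\quad \frac{dJ}{d\theta} = 2(\theta - 3) = 0 \quad\Rightarrow\quad \theta = 3$$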
Finding Two Parameters:
If the cost function $J(\theta_0, \theta_1)$ has 2 parameters, we could just set $\frac{\partial J}{\partial \theta_0} = 0$ and $\frac{\partial J}{\partial \theta_1} = 0$ and solve for $\theta_0$ and $\theta_1$. Done!
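Again as an illustrative sketch (a made-up quadratic cost, not the course's exact $J$), the two conditions become a system of two equations in two unknowns:

$$J(\theta_0, \theta_1) = (\theta_0 - 1)^2 + (\theta_0 + \theta_1 - 4)^2$$

$$\frac{\partial J}{\partial \theta_0} = 2(\theta_0 - 1) + 2(\theta_0 + \theta_1 - 4) = 0, \qquad \frac{\partial J}{\partial \theta_1} = 2(\theta_0 + \theta_1 - 4) = 0 \quad\Rightarrow\quad \theta_0 = 1,\ \theta_1 = 3$$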
Finding Many Parameters:
Now imagine using the above two methods for a cost function with MANY parameters. Essentially, what if $\theta \in \mathbb{R}^{n+1}$, where $n$ represents the # of features and is VERY large? It would take a long time, right?
So here's an equation, the normal equation, that solves for $\theta$ all at once:

$$\theta = (X^T X)^{-1} X^T y$$
Example of using Normal Equation

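A minimal sketch of the normal equation in NumPy, using a small housing-style training set (the numbers and the helper name `normal_equation` are illustrative assumptions, not taken from this page):

```python
import numpy as np

def normal_equation(X, y):
    """Return theta = (X^T X)^(-1) X^T y, solved without forming the inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative training set: 4 examples, 1 feature (house size in sq ft)
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])   # price in $1000s

# Prepend the column of ones for the intercept term theta_0
X = np.column_stack([np.ones_like(x), x])

theta = normal_equation(X, y)
print(theta)   # [theta_0, theta_1]
```

Using `np.linalg.solve` instead of explicitly computing the inverse is the usual choice here; it solves the same system but is more numerically stable.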
Note: Unlike gradient descent, feature scaling is not needed when using the normal equation.
Gradient Descent vs. Normal Equation
So if the normal equation is so simple, why do we still need gradient descent? Here are the pros and cons of each:
| Gradient Descent | Normal Equation |
| --- | --- |
| Need to choose alpha ($\alpha$) | No need to choose alpha |
| Needs many iterations | No need to iterate |
| Works well even if $n$ is large | Slow if $n$ is large, since calculating $(X^T X)^{-1}$ is costly (roughly $O(n^3)$) |
General Rule of Thumb: If $n$ is very large (roughly $n \geq 10{,}000$), use gradient descent instead of the normal equation. Also, even though the normal equation works with linear regression, some algorithms don't have a closed-form solution like this, so we would have to use gradient descent anyway.
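A small sketch comparing the two approaches on the same synthetic data (the data, learning rate, and iteration count below are made-up choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset: 100 examples, 3 features, plus an intercept column
m, n = 100, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])
true_theta = np.array([4.0, 2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=m)

# Normal equation: one linear solve, no alpha, no iterations
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent: many small steps with a hand-picked alpha
theta_gd = np.zeros(X.shape[1])
alpha, iters = 0.1, 2000
for _ in range(iters):
    theta_gd -= alpha * (X.T @ (X @ theta_gd - y)) / m

print(theta_ne)   # close to [4, 2, -1, 0.5]
print(theta_gd)   # converges to roughly the same values
```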
What if $X^T X$ is not invertible?
Redundant features (some features might be linearly dependent), so just take those features out
Too many features ($m \leq n$, i.e., more features than training examples), so delete some features or use regularization (see the sketch below)
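A small sketch of the redundant-feature case (made-up housing-style data): a feature that is just a scaled copy of another makes $X^T X$ singular, and dropping it restores invertibility.

```python
import numpy as np

rng = np.random.default_rng(1)

m = 5
size_sqft = rng.uniform(800, 2500, size=m)
size_sqm = size_sqft * 0.092903                  # redundant: a scaled copy of size_sqft
X = np.column_stack([np.ones(m), size_sqft, size_sqm])
y = 100.0 + 0.2 * size_sqft + rng.normal(size=m)

# The redundant column makes X^T X (mathematically) singular
print(np.linalg.matrix_rank(X.T @ X))            # 2, not 3

# Fix: drop the redundant feature, then the normal equation works again
X_fixed = X[:, :2]
theta = np.linalg.solve(X_fixed.T @ X_fixed, X_fixed.T @ y)
print(theta)
```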