Computing Parameters Analytically

Normal Equation

You've seen gradient descent and how it is used to find the optimal parameters that minimize the cost function. The normal equation is another method for finding the optimal parameters that minimize $J(\theta)$.

Finding a Single Parameter:

If the cost function $J(\theta)$ has a single parameter, we could just set $\frac{d}{d\theta}J(\theta) = 0$ and solve for $\theta$. Done!
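
For example, with a made-up one-parameter quadratic cost (purely illustrative, not from the notes above), the whole procedure is one line of calculus:

```latex
% Hypothetical cost with a single parameter
J(\theta) = (\theta - 5)^2 + 2
% Set the derivative to zero and solve
\frac{d}{d\theta}J(\theta) = 2(\theta - 5) = 0 \quad\Rightarrow\quad \theta = 5
```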

Finding Two Parameters:

If the cost function $J(\theta_0, \theta_1)$ has 2 parameters, we could just set $\frac{\partial}{\partial\theta_0}J(\theta_0, \theta_1) = 0$ and $\frac{\partial}{\partial\theta_1}J(\theta_0, \theta_1) = 0$ and solve for $\theta_0$ and $\theta_1$. Done!
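
Concretely, assuming the usual squared-error cost for linear regression (an assumption, since the cost isn't restated in this section), the two conditions become a pair of linear equations in $\theta_0$ and $\theta_1$:

```latex
% Assuming J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}(\theta_0 + \theta_1 x^{(i)} - y^{(i)})^2
\frac{\partial}{\partial\theta_0}J = \frac{1}{m}\sum_{i=1}^{m}\bigl(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\bigr) = 0
\qquad
\frac{\partial}{\partial\theta_1}J = \frac{1}{m}\sum_{i=1}^{m}\bigl(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\bigr)\,x^{(i)} = 0
```

Two linear equations, two unknowns; solving them directly gives the optimal $\theta_0$ and $\theta_1$.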

Finding Many Parameters:

Now imagine using the approach above for a cost function with MANY parameters. Essentially, what if $\theta \in \mathbb{R}^{n+1}$, where $n$ is the number of features and is VERY large? Solving that many equations one by one would take a long time, right?

So here's an equation that solves for the entire $\theta$ vector at once:

$$\theta = (X^TX)^{-1}X^Ty$$

Example of using Normal Equation
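
Here's a minimal NumPy sketch of that formula (the house-price numbers are made up for illustration, and $X$ is built with a leading column of ones for the intercept):

```python
import numpy as np

# Made-up training data: m = 4 examples, one feature (house size)
sizes = np.array([2104.0, 1416.0, 1534.0, 852.0])
prices = np.array([460.0, 232.0, 315.0, 178.0])

# Design matrix X: a column of ones for theta_0, then the feature column
X = np.column_stack([np.ones_like(sizes), sizes])   # shape (m, n+1)
y = prices

# Normal equation: theta = (X^T X)^{-1} X^T y
# np.linalg.solve is used instead of forming the inverse explicitly
theta = np.linalg.solve(X.T @ X, X.T @ y)

print(theta)       # [theta_0, theta_1]
print(X @ theta)   # predictions for the training examples
```

No learning rate and no iterations: the parameters come out in a single step.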

Note: Unlike gradient descent, feature scaling is not needed when using the normal equation.

Gradient Descent vs. Normal Equation

If the normal equation is so simple, why do we need gradient descent at all? Here are the pros and cons of each:

| Gradient Descent | Normal Equation |
| --- | --- |
| Need to choose alpha | No need to choose alpha |
| Needs many iterations | No need to iterate |
| $O(kn^2)$ | $O(n^3)$ since calculating $(X^TX)^{-1}$ is costly |
| Works well if $n$ is large | Slow if $n$ is large |

General Rule of Thumb: If $n > 10000$, use gradient descent instead of the normal equation. Also, even though the normal equation works for linear regression, some other algorithms have no such closed-form solution, so we would have to use gradient descent for them anyway.

What if $X^TX$ is not invertible?

  • Redundant features (some columns of $X$ are linearly dependent), so just take those features out

  • Too many features ($n \ge m$), so delete some features or use regularization (see the pseudo-inverse sketch below)
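
In practice, a common workaround (an assumption here, not something stated above) is to compute $\theta$ with the pseudo-inverse, which still returns a valid solution when $X^TX$ is singular:

```python
import numpy as np

def normal_equation_pinv(X, y):
    """theta = pinv(X^T X) X^T y; pinv works even when X^T X is singular."""
    return np.linalg.pinv(X.T @ X) @ (X.T @ y)

# Redundant-feature example: the third column is exactly twice the second,
# so X^T X is singular and np.linalg.inv(X.T @ X) would raise an error.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x1), x1, 2.0 * x1])
y = np.array([3.0, 5.0, 7.0, 9.0])

theta = normal_equation_pinv(X, y)
print(theta)      # one valid (minimum-norm) solution
print(X @ theta)  # still fits y, despite the redundant column
```

The pseudo-inverse keeps the computation from failing, but removing the redundant features (or regularizing) is still the cleaner fix.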
