Multivariate Linear Regression

This is linear regression, but with multiple variables. So we have a new hypothesis function!

Hypothesis: $h_\theta(x)=\theta_0+\theta_1x_1+\dots+\theta_{n-1}x_{n-1}+\theta_nx_n=\theta^Tx$

Parameters: $\theta_0, \theta_1, \dots, \theta_n = \theta \in \mathbb{R}^{n+1}$

Variables: $x_0, x_1, \dots, x_n = x \in \mathbb{R}^{n+1}$ (with the convention $x_0 = 1$, so that the vector form $\theta^Tx$ works out)

e.g. Think of each variable as a different feature: $x_1$ could be square footage, $x_2$ could be the number of floors, etc.
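For concreteness, here is a minimal sketch of the hypothesis in code (assuming NumPy; `hypothesis` and the example numbers are made up for illustration):

```python
import numpy as np

def hypothesis(theta, x):
    """h_theta(x) = theta^T x, where x already includes the intercept term x_0 = 1."""
    return theta @ x

# Hypothetical example: theta_0 = 50, theta_1 = 0.1 (per sq ft), theta_2 = 25 (per floor)
theta = np.array([50.0, 0.1, 25.0])
x = np.array([1.0, 2000.0, 2.0])   # [x_0 = 1, square footage, number of floors]
print(hypothesis(theta, x))        # 50 + 0.1*2000 + 25*2 = 300.0
```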

Gradient Descent for Multiple Variables

Repeat until convergence {

$\theta_0:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_0^{(i)}$

$\theta_1:=\theta_1-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_1^{(i)}$

$\theta_2:=\theta_2-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_2^{(i)}$

...

}

We can condense this down to:

Gradient Descent Algorithm for Multiple Variables:

Repeat until convergence {

$\theta_j:=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$ for $j=0,1,2,\dots,n$ (updating every $\theta_j$ simultaneously)

}
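The condensed rule above can be written as one vectorized update. A minimal sketch, assuming NumPy and that `X` already contains the $x_0 = 1$ column (the function and parameter names are illustrative, not from the course):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for multivariate linear regression.

    X: (m, n+1) matrix whose first column is all ones (x_0 = 1).
    y: (m,) vector of targets. alpha: learning rate.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        errors = X @ theta - y           # h_theta(x^(i)) - y^(i) for every example
        gradient = (X.T @ errors) / m    # (1/m) * sum of errors * x_j^(i), for all j at once
        theta -= alpha * gradient        # simultaneous update of every theta_j
    return theta
```

With feature scaling (next section), this kind of loop converges in far fewer iterations.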

Gradient Descent with Feature Scaling

Feature scaling speeds up gradient descent and helps it converge in far fewer iterations.

Let $x_i$ be a feature, such as the age of a house.

We generally want the range of each feature to stay roughly within one of these intervals:

$-1\le x_i\le1$ or $-0.5\le x_i\le 0.5$

Mean Normalization: $x_i := \frac{x_i-\mu_i}{s_i}$

This is one way to do feature scaling ($\mu_i$ is the mean of the feature and $s_i$ is its range, i.e. max minus min, or alternatively its standard deviation).
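A small sketch of mean normalization, assuming NumPy (`mean_normalize` is an illustrative name; it uses the range max − min for $s_i$):

```python
import numpy as np

def mean_normalize(X):
    """Mean-normalize each feature column: subtract the mean, divide by the range.

    X: (m, n) matrix of raw feature values (no x_0 column).
    Returns the scaled matrix along with mu and s so new examples can be scaled
    the same way before making predictions.
    """
    mu = X.mean(axis=0)                  # mu_i: mean of each feature
    s = X.max(axis=0) - X.min(axis=0)    # s_i: range of each feature
    return (X - mu) / s, mu, s
```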

Choosing a Learning Rate for Gradient Descent

Debugging Gradient Descent

Plot the number of iterations of gradient descent on the x-axis and $J(\theta)$ on the y-axis.

Automatic convergence test: if $J(\theta)$ decreases by less than $10^{-3}$ in one iteration, gradient descent has most likely converged (but usually just use the graph because it's easier to see).
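One way to produce this plot is to record $J(\theta)$ after every iteration. A sketch, assuming NumPy and matplotlib (the helper names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def cost(X, y, theta):
    """Squared-error cost J(theta) = (1 / 2m) * sum((h_theta(x^(i)) - y^(i))^2)."""
    m = len(y)
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)

def gradient_descent_with_history(X, y, alpha=0.01, num_iters=400):
    """Run batch gradient descent and record J(theta) after every iteration."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    J_history = []
    for _ in range(num_iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / m
        J_history.append(cost(X, y, theta))
    return theta, J_history

# Plot iterations (x-axis) against J(theta) (y-axis); the curve should fall and flatten out.
# theta, J_history = gradient_descent_with_history(X, y, alpha=0.1)
# plt.plot(J_history); plt.xlabel("iterations"); plt.ylabel(r"$J(\theta)$"); plt.show()
```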

Summary:

If $\alpha$ is too large → the loss function may not decrease every iteration and can even diverge.

If $\alpha$ is too small → the loss function will converge too slowly.

If $\alpha$ is good → the loss function will decrease on every iteration at a reasonable rate.
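In practice one way to choose $\alpha$ is to try several values spaced roughly 3x apart and compare their $J(\theta)$ curves. A sketch, assuming the `gradient_descent_with_history` helper above plus NumPy and matplotlib; the synthetic data is made up purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical synthetic data: 100 examples, 2 scaled features plus the x_0 = 1 column.
rng = np.random.default_rng(0)
features = rng.uniform(-1, 1, size=(100, 2))
X = np.column_stack([np.ones(100), features])
y = X @ np.array([2.0, 3.0, -1.0]) + rng.normal(0, 0.1, size=100)

# Try candidate learning rates roughly 3x apart and compare the J(theta) curves.
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    _, J_history = gradient_descent_with_history(X, y, alpha=alpha, num_iters=100)
    plt.plot(J_history, label=f"alpha = {alpha}")

plt.xlabel("iterations")
plt.ylabel(r"$J(\theta)$")
plt.legend()
plt.show()
```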

Creating Features for Polynomial Regression

We don't always have to stick to the features we're given; we can derive new features from the existing ones.

Example:

Let our hypothesis be $h_\theta(x)=\theta_0+\theta_1x_1$.

What if we don't want a linear function to fit the data? What if we want a parabola, a square root curve, or a cubic?

To get a square root function from this: $h_\theta(x)=\theta_0+\theta_1x_1+\theta_2\sqrt{x_1}$

Let $\sqrt{x_1}=x_2$. We have created a new feature $x_2$! So now we have: $h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2$.

Note: Feature scaling is very important here! Imagine you created a cubic feature instead: the new feature would be $x_1^3$, so its values would be very large (if $x_1$ ranges from 1 to 1,000, then $x_1^3$ ranges from 1 to $10^9$).
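A small sketch of building the derived feature, assuming NumPy (`add_sqrt_feature` is an illustrative name):

```python
import numpy as np

def add_sqrt_feature(x1):
    """Build [x_0, x_1, x_2] where x_0 = 1 and x_2 = sqrt(x_1), as in the example above.

    x1: 1-D array of raw values for a single non-negative feature.
    Remember to feature-scale the columns before running gradient descent.
    """
    x1 = np.asarray(x1, dtype=float)
    return np.column_stack([np.ones_like(x1), x1, np.sqrt(x1)])

# Hypothetical usage: square footage as the single raw feature.
X = add_sqrt_feature([500.0, 1200.0, 2000.0])
```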
