# Multivariate Linear Regression

This is linear regression, but with multiple variables. So we have a new hypothesis function!

## Hypothesis: $$h\_\theta(x)=\theta\_0+\theta\_1x\_1+...+\theta\_nx\_n=\theta^Tx$$

## Parameters: $$\theta\_0, \theta\_1, ... ,\theta\_n=\theta\in\R^{n+1}$$

## Variables: $$x\_0, x\_1,...,x\_n=x\in\R^{n+1}$$

Think of each variable as a different feature: $$x\_1$$ could be square footage, $$x\_2$$ could be number of floors, etc. By convention $$x\_0=1$$, which lets the intercept $$\theta\_0$$ fold into the vector product $$\theta^Tx$$.
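The vectorized form $$\theta^Tx$$ is a single dot product. A minimal sketch in NumPy (the parameter and feature values here are made up for illustration):

```python
import numpy as np

# Hypothetical parameters and features: prepend x_0 = 1 so the
# intercept theta_0 is absorbed into the dot product theta^T x.
theta = np.array([2.0, 0.5, -1.0])   # theta_0, theta_1, theta_2
features = np.array([3.0, 4.0])      # x_1, x_2

x = np.concatenate(([1.0], features))  # x_0 = 1 convention
h = theta @ x                          # h_theta(x) = theta^T x
print(h)  # 2.0 + 0.5*3.0 - 1.0*4.0 = -0.5
```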

## Gradient Descent for Multiple Variables

Repeat until convergence {

​ $$\theta\_0:=\theta\_0-\alpha\frac{1}{m}\sum\_{i=1}^m ((h\_\theta(x^{(i)})-y^{(i)})x\_0^{(i)})$$

​ $$\theta\_1:=\theta\_1-\alpha\frac{1}{m}\sum\_{i=1}^m ((h\_\theta(x^{(i)})-y^{(i)})x\_1^{(i)})$$

​ $$\theta\_2:=\theta\_2-\alpha\frac{1}{m}\sum\_{i=1}^m ((h\_\theta(x^{(i)})-y^{(i)})x\_2^{(i)})$$

​ ...

}

We can condense this down to:

### Gradient Descent Algorithm for Multiple Variables:

Repeat until convergence {

​ $$\theta\_j:=\theta\_j-\alpha\frac{1}{m}\sum\_{i=1}^m ((h\_\theta(x^{(i)})-y^{(i)})x\_j^{(i)})$$ for $$j=0,1,2,...,n$$ (updating all $$\theta\_j$$ simultaneously)

}
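The simultaneous update above can be sketched in NumPy as one vectorized step (the tiny dataset and learning rate below are made up for illustration):

```python
import numpy as np

def gradient_descent_step(theta, X, y, alpha):
    """One simultaneous update of all theta_j.

    X is the m x (n+1) design matrix with a leading column of ones,
    y is the m-vector of targets.
    """
    m = X.shape[0]
    errors = X @ theta - y            # h_theta(x^(i)) - y^(i) for all i
    gradient = (X.T @ errors) / m     # (1/m) * sum over i of error * x_j^(i)
    return theta - alpha * gradient   # every theta_j updated at once

# Hypothetical data generated from y = 1 + 2*x_1 exactly.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
theta = np.zeros(2)
for _ in range(2000):
    theta = gradient_descent_step(theta, X, y, alpha=0.1)
print(theta)  # converges toward [1, 2]
```

Because the whole gradient is one matrix product, there is no per-parameter loop and the update is simultaneous by construction.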

## Gradient Descent with Feature Scaling

Feature scaling **speeds up** gradient descent and helps it **converge** reliably.

Let $$x\_i$$ be a feature like age of house.

We generally want each feature to fall roughly within one of these ranges (rules of thumb, not hard requirements):

$$-1\le x\_i\le1 \quad\text{or}\quad -0.5\le x\_i\le 0.5$$

### Mean Normalization: $$x\_i := \frac{x\_i-\mu\_i}{s\_i}$$

This is one way to achieve feature scaling. Here $$\mu\_i$$ is the average value of feature $$i$$ over the training set, and $$s\_i$$ is the range (max minus min) or the standard deviation.
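A minimal sketch of mean normalization, using a made-up house-age feature and the range for $$s\_i$$:

```python
import numpy as np

# Hypothetical raw feature: house ages in years.
ages = np.array([2.0, 10.0, 25.0, 43.0])

mu = ages.mean()              # mu_i: mean of the feature
s = ages.max() - ages.min()   # s_i: range (std dev also works)
scaled = (ages - mu) / s

# After scaling, the feature is centered at zero and spans
# an interval of width exactly 1.
print(scaled)
```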

## Choosing a Learning Rate for Gradient Descent

### Debugging Gradient Descent

Plot the number of iterations of gradient descent on the x-axis and $$J(\theta)$$ on the y-axis. If gradient descent is working, $$J(\theta)$$ should decrease after every iteration.

![](https://868646840-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LztfBhQUrZzyA7O_ZkJ%2Fuploads%2Fgit-blob-d205d6f0e8f3022280d50c6b4c9e8a2d776e7a27%2Flearning_rates.png?alt=media)

**Automatic convergence test**: If $$J(\theta)$$ decreases by less than $$10^{-3}$$ in one iteration, it has most likely converged (but usually just use the graph because it's easier to see).
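The automatic convergence test can be sketched as a small helper; the cost history below is made up for illustration:

```python
def has_converged(cost_history, tol=1e-3):
    """Declare convergence when J(theta) drops by less than tol
    between two consecutive iterations."""
    if len(cost_history) < 2:
        return False
    return cost_history[-2] - cost_history[-1] < tol

# Hypothetical J(theta) values over a training run:
costs = [10.0, 4.0, 2.0, 1.5, 1.4995]
print(has_converged(costs))  # True: the last drop is 0.0005 < 1e-3
```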

**Summary**:

If $$\alpha$$ is too large → $$J(\theta)$$ may increase or diverge (yellow line).

If $$\alpha$$ is too small → $$J(\theta)$$ converges too slowly (green line).

If $$\alpha$$ is good → $$J(\theta)$$ decreases on every iteration and converges in a reasonable number of iterations.

## Creating Features for Polynomial Regression

We don't have to stick to the raw features in the equation. We can derive new features from existing ones.

**Examples**:

Let our hypothesis be $$h\_\theta(x)=\theta\_0+\theta\_1x\_1.$$

What if a straight line doesn't fit the data well? What if we want a parabola, a square-root curve, or a cubic?

To get a square root function from this: $$h\_\theta(x)=\theta\_0+\theta\_1x\_1+\theta\_2\sqrt{x\_1}$$

Let $$\sqrt{x\_1}=x\_2$$. We have created a new feature $$x\_2$$! So now we have: $$h\_\theta(x)=\theta\_0+\theta\_1x\_1+\theta\_2x\_2$$.

**Note**: Feature scaling is very important here! Imagine creating a cubic feature $$x\_1^3$$: if $$x\_1$$ ranges from 1 to 1,000, the new feature ranges from 1 to $$10^9$$, so scale it before running gradient descent.
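Feature creation plus scaling can be sketched together; the square-footage values below are made up for illustration:

```python
import numpy as np

# Hypothetical feature: square footage of houses.
x1 = np.array([500.0, 1200.0, 2000.0, 3500.0])

# Derived features: one per polynomial term we want in the hypothesis.
x2 = np.sqrt(x1)   # square-root term
x3 = x1 ** 3       # cubic term -- values explode without scaling

def mean_normalize(x):
    """Mean normalization: (x - mu) / range, as in the section above."""
    return (x - x.mean()) / (x.max() - x.min())

# x3 spans roughly 4e10 raw; after scaling its range is exactly 1,
# so it sits on the same footing as the other features.
x3_scaled = mean_normalize(x3)
```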
