e.g., think of each variable as a different feature: x_1 could be square footage, x_2 could be the number of floors, etc.
Gradient Descent for Multiple Variables
Repeat until convergence {
  θ_0 := θ_0 − α · (1/m) · Σ (h_θ(x^(i)) − y^(i)) · x_0^(i)
  θ_1 := θ_1 − α · (1/m) · Σ (h_θ(x^(i)) − y^(i)) · x_1^(i)
  θ_2 := θ_2 − α · (1/m) · Σ (h_θ(x^(i)) − y^(i)) · x_2^(i)
  ...
}
We can condense this down to:
Gradient Descent Algorithm for Multiple Variables:
Repeat until convergence {
  θ_j := θ_j − α · (1/m) · Σ (h_θ(x^(i)) − y^(i)) · x_j^(i)   for j = 0, 1, 2, ..., n
}
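A minimal NumPy sketch of this update rule, vectorized so every θ_j is updated simultaneously; the names X, y, theta, and alpha are illustrative, and X is assumed to already have a leading column of ones so that x_0 = 1:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression with multiple features.

    X is an (m, n+1) matrix whose first column is all ones (x_0 = 1),
    y is an (m,) vector of targets, alpha is the learning rate.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        predictions = X @ theta           # h_theta(x^(i)) for every example i
        errors = predictions - y          # h_theta(x^(i)) - y^(i)
        gradient = (X.T @ errors) / m     # (1/m) * sum over i of errors * x_j^(i)
        theta = theta - alpha * gradient  # simultaneous update of all theta_j
    return theta
```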
Gradient Descent with Feature Scaling
Feature scaling speeds up gradient descent and helps it converge in far fewer iterations.
Let x_i be a feature, such as the age of a house.
We generally want the range of each feature to stay within one of these intervals:
−1 ≤ x_i ≤ 1 or −0.5 ≤ x_i ≤ 0.5
Mean Normalization: x_i := (x_i − μ_i) / s_i
This is one way to do feature scaling (μ_i is the average value of feature i and s_i is its range).
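A small sketch of mean normalization; the function name and the choice of the range (max − min) for s_i are assumptions for illustration (standard deviation would also work):

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature column via (x_i - mu_i) / s_i so values land
    roughly in [-0.5, 0.5].

    mu_i is the column mean and s_i is the column range (max - min);
    mu and s are returned so new examples can be scaled the same way.
    """
    mu = X.mean(axis=0)
    s = X.max(axis=0) - X.min(axis=0)   # range of each feature
    return (X - mu) / s, mu, s
```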
Choosing a Learning Rate for Gradient Descent
Debugging Gradient Descent
Plot the number of iterations of gradient descent on the x-axis and the cost J(θ) on the y-axis.
Automatic convergence test: if J(θ) decreases by less than 10⁻³ in one iteration, gradient descent has most likely converged (but in practice it's usually easier to just look at the plot).
Summary:
If α is too large → J(θ) may increase on some iterations or even diverge (yellow line).
If α is too small → J(θ) converges too slowly (green line).
If α is well chosen → J(θ) decreases on every iteration and converges in a reasonable number of iterations.
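A sketch of how one might record J(θ) every iteration (for plotting against the iteration number) and apply the 10⁻³ convergence check from above; compute_cost and the other names are illustrative, not prescribed by the notes:

```python
import numpy as np

def compute_cost(X, y, theta):
    """Squared-error cost J(theta) = (1/2m) * sum((h_theta(x) - y)^2)."""
    m = len(y)
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)

def gradient_descent_with_history(X, y, alpha=0.01, num_iters=500, tol=1e-3):
    """Run gradient descent, recording J(theta) after every iteration so it
    can be plotted against the iteration number; stop early if J decreases
    by less than tol (the 1e-3 automatic convergence test)."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    history = [compute_cost(X, y, theta)]
    for _ in range(num_iters):
        theta -= alpha * (X.T @ (X @ theta - y)) / m
        history.append(compute_cost(X, y, theta))
        if history[-2] - history[-1] < tol:   # decreased by less than 1e-3
            break
    return theta, history   # plot history vs. iteration number to debug alpha
```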
Creating Features for Polynomial Regression
We don't have to stick to the features we're given; we can derive new features from the existing ones.
Examples:
Let our hypothesis be h_θ(x) = θ_0 + θ_1·x_1.
What if we don't want to fit the data with a straight line? What if we want a quadratic, square-root, or cubic function?
To get a square-root function from this, add a term: h_θ(x) = θ_0 + θ_1·x_1 + θ_2·√x_1
Let x_2 = √x_1. We have created a new feature x_2! So now we have: h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2.
Note: Feature scaling is very important here! Imagine if you created a cubic function: your new feature would then be x_1³, so its values would be very large.
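A sketch of deriving square-root and cubic features from a single feature x_1 and then scaling them before running gradient descent; the specific features, values, and function name are assumptions for illustration:

```python
import numpy as np

def add_polynomial_features(x1):
    """From a single feature x1, build the design matrix
    [1, x1, sqrt(x1), x1^3] — new features derived from x1."""
    x1 = np.asarray(x1, dtype=float)
    return np.column_stack([np.ones_like(x1), x1, np.sqrt(x1), x1 ** 3])

# Example: square footage from 500 to 4000 (illustrative values).
x1 = np.linspace(500, 4000, 8)
X = add_polynomial_features(x1)

# Feature scaling matters: x1^3 is enormously larger than sqrt(x1),
# so mean-normalize every column except the intercept.
mu = X[:, 1:].mean(axis=0)
s = X[:, 1:].max(axis=0) - X[:, 1:].min(axis=0)
X[:, 1:] = (X[:, 1:] - mu) / s
```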