Thursday, April 19, 2018

MACHINE LEARNING | NORMAL EQUATIONS

MACHINE LEARNING - DAY 5

NORMAL EQUATIONS


For previous blogs and notes you can click on the following links:

DAY 1   DAY 2   DAY 3   DAY 4


NORMAL EQUATIONS ROLE AND USE

The normal equation can be faster to compute than gradient descent, depending on the number of features, since it requires no iterations at all.

Formula for computing the features coefficient:

Θ = (XᵀX)⁻¹ Xᵀ Y

where,

X: the features matrix

Y: the output matrix

Xᵀ: transpose of the features matrix

(XᵀX)⁻¹: inverse of the product XᵀX

Let's consider an example: say x0, x1, x2, x3, and x4 are the features. Then X is the matrix of all the features and Y is the vector of size m × 1 containing the actual outputs.

Plugging these matrices into the formula gives the optimal value of Θ, and with it a good predictive model.
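To make this concrete, here is a minimal NumPy sketch of the normal equation (this snippet is not from the original post; the house-size data is made up):

import numpy as np

# Made-up toy data: m = 4 examples, one feature, plus the x0 = 1 bias column.
X = np.array([[1.0, 2104.0],
              [1.0, 1416.0],
              [1.0, 1534.0],
              [1.0,  852.0]])
Y = np.array([460.0, 232.0, 315.0, 178.0])

# Theta = (X^T X)^-1 X^T Y -- no iterations and no learning rate needed.
theta = np.linalg.inv(X.T @ X) @ X.T @ Y
print(theta)

Notice that Θ comes out in a single computation, with no loop and no α.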

NOTE: Feature scaling is not needed with the normal equation; features with ranges as different as
0 < x < 1000
0 < x < 0.00005
don’t cause any trouble here.

Question: Which should we choose, then: Gradient Descent or the Normal Equation? Which is better?

Answer:

GRADIENT DESCENT:
  • The alpha value needs to be chosen.
  • Needs many iterations.
  • Complexity: O(kn²).
  • Works well even when n, the number of features, is large.

NORMAL EQUATION:
  • No need for alpha.
  • No iterations required.
  • Complexity: O(n³), since the inverse of XᵀX needs to be calculated.
  • Slows down as the number of features increases.


So, now we know when to use Gradient Descent and when to use Normal Equations method.

Generally, once the number of features reaches around n = 10,000, the performance of the normal equation starts to degrade, since computing the inverse of such a large matrix (XᵀX is (n+1) × (n+1), i.e., 10,001 × 10,001) becomes time-consuming.

Question: Since computing Θ with the normal equation requires taking an inverse, how do we deal with non-invertibility, i.e., what do we do if XᵀX is a singular matrix with no inverse?

Answer:

NORMAL EQUATIONS NON-INVERTIBILITY

If a matrix is singular (also called degenerate or non-invertible), its inverse does not exist.

Fortunately, this happens in only a few cases:

1. Redundant features (linear dependency):

When two features are linearly related to each other, e.g.

x1 = 2.5 * x2

2. Too many features (m ≤ n):
       Delete some features, keep only the important ones, or use regularization.

       E.g. m = 10, n = 100: this creates a problem because we are trying to fit 100 + 1 parameters from just 10 records.
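As a hedged illustration of the redundant-feature case, the sketch below builds a feature matrix in which x1 = 2.5 * x2 and uses NumPy's pseudo-inverse, a common practical workaround when XᵀX is singular (the data is made up):

import numpy as np

# x1 = 2.5 * x2 (a redundant feature), so X^T X is singular and
# np.linalg.inv(X.T @ X) would raise an error.
x2 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones(4), 2.5 * x2, x2])  # columns: x0, x1, x2
Y = np.array([2.0, 4.0, 6.0, 8.0])

# The Moore-Penrose pseudo-inverse is defined even for singular
# matrices, so it still produces a usable theta.
theta = np.linalg.pinv(X.T @ X) @ X.T @ Y
print(theta)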

That’s all for Day 5. Next, we will learn about CLASSIFICATION AND REPRESENTATION along with LOGISTIC REGRESSION.


If you think this article helped you learn something new or refresh your knowledge, do share it with your friends. If you have any thoughts on this article, write them in the comment section, and let's keep learning new things every day.


Till then Happy Learning!!! 

Friday, April 6, 2018

OPTIMISING GRADIENT DESCENT

MACHINE LEARNING - DAY 4




If you missed the earlier tutorials, which cover the basics, you can catch up by clicking the following links in day-wise order:




Continuing our machine learning journey, let's move on to today's topic.

IMPROVING THE PERFORMANCE OF GRADIENT DESCENT

Technique 1: Feature Scaling

Feature Scaling means dividing the input values by some fixed value, e.g. the maximum input value. This makes every input value at most 1 in magnitude.

Gradient Descent works faster if each of the features lies roughly within the same range. This is because Θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.

Feature 1: 2,3,4,5
Feature 2: 180,200,170,120
Feature Scaling for feature 1: 2/5,3/5,4/5,5/5
Feature Scaling for feature 2: 180/200, 200/200, 170/200, 120/200

That will make both features lie within the range
0 ≤ x ≤ 1
which speeds up gradient descent.

Ideally, each feature should lie roughly within:
-1 ≤ x ≤ 1  or  -0.5 ≤ x ≤ 0.5

These are the most effective ranges, but somewhat wider ranges such as
0 ≤ x ≤ 3
still give acceptable results.
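As a small sketch of the scaling above, dividing each feature from the example by its maximum value:

import numpy as np

feature1 = np.array([2.0, 3.0, 4.0, 5.0])
feature2 = np.array([180.0, 200.0, 170.0, 120.0])

# Divide each feature by its maximum value so both land in the 0..1 range.
scaled1 = feature1 / feature1.max()  # [0.4, 0.6, 0.8, 1.0]
scaled2 = feature2 / feature2.max()  # [0.9, 1.0, 0.85, 0.6]
print(scaled1, scaled2)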

Technique 2: Mean Normalization

Mean Normalization means subtracting the average value μi from each input variable and then dividing by the standard deviation si:

xi := (xi − μi) / si
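A minimal NumPy sketch of this formula, using the feature-2 values from the earlier example:

import numpy as np

x = np.array([180.0, 200.0, 170.0, 120.0])

# xi := (xi - mu) / s
mu = x.mean()   # average value
s = x.std()     # standard deviation
x_norm = (x - mu) / s
print(x_norm)   # now has zero mean and unit standard deviation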

Techniques to check whether Gradient Descent is working properly:

Method 1: 

Plot the cost function against number of iterations.



In the above graph, after 350 iterations there is no considerable change in the value of the cost function, i.e., it took about 350 iterations to find accurate Θ values.

After each iteration, the value of the cost function should decrease, converging toward a minimum.
If it doesn't decrease, something is wrong with the gradient descent setup, usually the learning rate.
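As an illustration, here is a small matplotlib sketch of such a plot; the cost values are made up to stand in for a real J(Θ) history:

import matplotlib.pyplot as plt

# Made-up cost values standing in for J(theta) recorded after each
# iteration of gradient descent.
cost_history = [10.0, 6.2, 4.1, 3.0, 2.4, 2.1, 1.95, 1.90, 1.88, 1.87]

plt.plot(range(1, len(cost_history) + 1), cost_history)
plt.xlabel("Number of iterations")
plt.ylabel("Cost J(theta)")
plt.show()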



Method 2:

Declare convergence with a threshold:

Choose some small threshold value; if the decrease in J(Θ) between two iterations is less than that threshold, declare that we have reached the optimum.

NOTE: Method 1 is better than Method 2, since deciding the threshold value in Method 2 is very difficult.
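A tiny sketch of the Method 2 check; epsilon = 1e-3 is a made-up threshold, and choosing it well is exactly the difficulty mentioned in the note:

def has_converged(previous_cost, current_cost, epsilon=1e-3):
    # Declare convergence once one iteration improves the cost by less
    # than the threshold epsilon (1e-3 here is an arbitrary choice).
    return previous_cost - current_cost < epsilon

print(has_converged(2.1000, 2.0995))  # True: improvement is below 1e-3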


LEARNING RATE:

• It is mathematically proven that if α is sufficiently small, J(Θ) will decrease on every iteration.
• If α is too small, gradient descent takes very small steps and becomes very slow to reach the optimum value.
• If α is too large, J(Θ) may fail to decrease on every iteration and can even diverge.

CHOOSING THE LEARNING RATE (α):

Good starting values for α are 0.001, 0.01, and 0.1.
To explore values in between, multiply by roughly 3, i.e., 0.003, 0.03, 0.3, and pick the largest α for which J(Θ) still decreases on every iteration.
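As a hedged sketch of this trial-and-error process, the snippet below runs a simple one-parameter gradient descent (made-up data and helper) for each candidate α and prints the final cost:

import numpy as np

# Made-up toy data: y is roughly 2 * x, so theta should end up near 2.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.0])
m = len(x)

def final_cost(alpha, iterations=100):
    # One-parameter gradient descent; returns the final cost J(theta).
    theta = 0.0
    for _ in range(iterations):
        theta -= alpha * (1.0 / m) * np.sum((theta * x - y) * x)
    return (1.0 / (2 * m)) * np.sum((theta * x - y) ** 2)

# Try learning rates spaced by roughly a factor of 3, as suggested above,
# and keep the largest alpha whose final cost is still small.
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1]:
    print(alpha, final_cost(alpha))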

That’s all for Day 4. Next, we will learn about Normal Equations in Day 5, which will be uploaded soon.

If you feel this article helped you in any way, do not forget to share it, and if you have any thoughts or doubts, write them in the comment section.

Till then Happy Learning!!




Monday, April 2, 2018

MACHINE LEARNING | MULTIVARIATE REGRESSION

MACHINE LEARNING - DAY 3






If you missed the earlier tutorials, you can visit the following links:



Continuing our machine learning journey, let's move on to today's topic.

Linear Regression with Multiple Variables or Multiple Linear Regression

In the previous tutorial in Day 2, we saw how to predict the output based on a single input value, x i.e.,

hΘ(x) = Θ0 + Θ1x

Now, what if the output depends on more than one value, i.e., more than one feature?

For example, a house price depends on the square footage, the number of rooms, the location, etc.

This is where Multiple Linear Regression comes in handy.

Notations:

n: Number of features

m: number of training examples

x(i): the ith training example

xj(i): the jth value in the ith training example

For eg:

Height    Age    Profession
5’11      25      CA
5’9       29      Artist
5’5       21      Singer
6’2       32      Scientist

Here:
n = 3
m = 4

x(2) = [5’9, 29, Artist]

x2(2) = 29
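To make the notation concrete, here is how the table above could be indexed in Python (heights kept as strings; this representation is just for illustration):

# The table above as a list of training examples.
training_set = [
    ["5'11", 25, "CA"],
    ["5'9",  29, "Artist"],
    ["5'5",  21, "Singer"],
    ["6'2",  32, "Scientist"],
]

m = len(training_set)      # 4 training examples
n = len(training_set[0])   # 3 features

x_2 = training_set[1]      # x(2): the 2nd training example (index 1, 0-based)
x2_2 = training_set[1][1]  # x2(2) = 29: the 2nd value of the 2nd example
print(m, n, x_2, x2_2)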

GENERAL FORM OF MULTIVARIATE LINEAR REGRESSION

The general form of multivariate linear regression is :

hΘ(x) = Θ0 + Θ1x1 + Θ2x2 + …. + Θnxn

Θj: the parameters of the hypothesis

xj: the input values

Let x0 = 1. Now,

hΘ(x) = Θ0x0 + Θ1x1 + Θ2x2 + …. + Θnxn


Θ = [Θ0, Θ1, Θ2, …, Θn]

x = [x0, x1, x2, …, xn]

Treating both as column vectors, matrix multiplication gives:

hΘ(x) = Θᵀx
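A one-line NumPy sketch of this vectorized form, with made-up values for Θ and x:

import numpy as np

theta = np.array([1.0, 0.5, 2.0])  # [theta0, theta1, theta2], made-up values
x = np.array([1.0, 3.0, 4.0])      # [x0 = 1, x1, x2], made-up values

# h_theta(x) = theta^T x, computed as a dot product.
h = theta @ x
print(h)  # 1.0*1 + 0.5*3 + 2.0*4 = 10.5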

So that was multiple regression, or multivariate regression. Now let's move on to gradient descent for multivariate regression, which is how we find the parameter values used to predict the output for a given set of inputs.

GRADIENT DESCENT FOR MULTIVARIATE REGRESSION

The gradient descent update is of the same general form as in simple linear regression; we just have to repeat it for each of the n + 1 parameters.


Repeat until convergence:

Θj := Θj − α · (1/m) · Σi=1..m (hΘ(x(i)) − y(i)) · xj(i)    (simultaneously for j = 0, 1, …, n)
Multiple linear regression is essentially the same as simple linear regression; the only difference is that simple linear regression has a single input variable, while multiple linear regression has several, so the gradient descent update is performed for each parameter, as sketched below.
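Here is a minimal NumPy sketch of this procedure, under made-up toy data (not the post's official implementation):

import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=2000):
    # X is the m x (n+1) feature matrix whose first column is x0 = 1.
    # Applies the update rule above to every theta_j simultaneously.
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(iterations):
        errors = X @ theta - y                    # h_theta(x(i)) - y(i)
        theta -= alpha * (1.0 / m) * (X.T @ errors)
    return theta

# Made-up toy data following y = 1 + 2 * x1.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])
print(gradient_descent(X, y))  # approximately [1.0, 2.0]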

That’s all for Day 3. Next, we will learn how to make gradient descent work efficiently and how to choose the learning rate α in Day 4, which will be uploaded by April 6, 2018.
      
If you feel this article helped you in any way, do not forget to share it, and if you have any thoughts or doubts, write them in the comment section.

Till then Happy Learning!!