
Sunday, May 6, 2018

MACHINE LEARNING | MULTICLASS CLASSIFICATION

MACHINE LEARNING - DAY 9

MULTI-CLASS CLASSIFICATION: ONE-VS-ALL


For the basics and the terms used in this article, you can check the earlier articles.

Continuing our machine learning journey, today we'll learn about multi-class classification in logistic regression, also known as one-vs-all.

Till now we have discussed classification with only 2 possible outcomes, i.e., 1 or 0. Now let's see what happens when there are more possibilities.

For example:

- Weather: sunny, rainy, pleasant, windy. The outcome or the categorical value can be: 0, 1, 2, 3

- Health: ill, dizzy, well. The outcome or the categorical value can be: 0, 1, 2

The numbering doesn't matter. It can be 1, 2, 3, 4 or 0, 1, 2, 3. These are just values which categorize the given data or output into different categories.

y ∈ {0, 1, 2, …, n}

hΘ^(0)(x) = P(y = 0 | x; Θ)

hΘ^(1)(x) = P(y = 1 | x; Θ)

⋮

hΘ^(n)(x) = P(y = n | x; Θ)

prediction: max over i of hΘ^(i)(x)

STEPS OF COMPUTATION:

1. Plot the data





2. Take the classes one by one; the remaining classes together behave as a single class or category. The probability of the selected class is calculated against this combined class. Repeat this for every class.




CONCLUSION:

Train a logistic regression classifier hΘ^(i)(x) for each class i to predict the probability that y = i.

To make a prediction on a new x, pick the class i that maximizes hΘ^(i)(x); that class is the output.
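
To make this concrete, here is a minimal sketch of one-vs-all in Python with NumPy. The helper names (train_logistic, one_vs_all, predict) are illustrative, not from any particular library, and the training step uses plain gradient descent on the logistic cost:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_logistic(X, y, alpha=0.1, iters=1000):
        # Fit one binary logistic regression by gradient descent.
        # X: (m, n) matrix whose first column is all ones; y: (m,) 0/1 labels.
        theta = np.zeros(X.shape[1])
        m = len(y)
        for _ in range(iters):
            grad = X.T @ (sigmoid(X @ theta) - y) / m
            theta -= alpha * grad
        return theta

    def one_vs_all(X, y, num_classes):
        # Train one classifier per class: class i treats y == i as 1, the rest as 0.
        return np.array([train_logistic(X, (y == i).astype(float))
                         for i in range(num_classes)])

    def predict(Thetas, X):
        # For each example, pick the class whose classifier gives the highest probability.
        return np.argmax(sigmoid(X @ Thetas.T), axis=1)

For the weather example above, y would hold the values 0 to 3 and num_classes would be 4.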


That's all for day 9. Today we learned about multi-class classification and how to compute it.

In day 10, we will be learning about the issue known as overfitting, which arises from over-training the model, along with its solution, regularization.

If you think this article helped you learn something new or can help someone else, then do share it with your peers.

Till then Happy Learning!!!





MACHINE LEARNING | DECISION BOUNDARY

MACHINE LEARNING - DAY 7

DECISION BOUNDARY FOR LOGISTIC REGRESSION


For the basics and the terms used in this article, you can check the earlier articles.


DECISION BOUNDARY

The decision boundary is the curve that divides the data into 2 segments: one where y = 1 and the other where y = 0.

hΘ(x) = g(Θᵀx) = P(y = 1 | x; Θ)

g(z) = 1/(1 + e^(-z)), so hΘ(x) = 1/(1 + e^(-Θᵀx))


Suppose we want to predict y = 1; then

hΘ(x) ≥ 0.5

and, for a prediction of y = 0,

hΘ(x) < 0.5

Now, let's see when these values are possible.

1. y = 1 when:

g(z) ≥ 0.5,

which happens when z ≥ 0, so

hΘ(x) = g(Θᵀx) ≥ 0.5

y = 1 when Θᵀx ≥ 0.

2. y = 0 when:

g(z) < 0.5,

which happens when z < 0, so

hΘ(x) = g(Θᵀx) < 0.5

y = 0 when Θᵀx < 0.
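
As a quick illustration, here is the sigmoid and the 0.5 threshold rule as a small Python sketch (the function names are illustrative):

    import numpy as np

    def sigmoid(z):
        # g(z) = 1 / (1 + e^(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def predict(theta, x):
        # Predict y = 1 when h(x) = g(theta . x) >= 0.5,
        # which is exactly when theta . x >= 0.
        return 1 if sigmoid(theta @ x) >= 0.5 else 0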

Now, let's discuss the decision boundaries with the help of some examples.

Example 1:

hΘ(x) = g(Θ0 + Θ1x1 + Θ2x2)



Let Θ0 = -3, Θ1 = 1, Θ2 = 1

Θ = [-3;1;1]

The dimension of the Θ matrix is 3×1.

Predict y = 1 if

-3 + x1 + x2 ≥ 0 (i.e., g(z) ≥ 0.5, which holds when z ≥ 0), that is,

x1 + x2 ≥ 3.

And for y = 0,

x1 + x2 < 3.
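
A quick numeric check of this boundary (a small sketch; the predict helper here is illustrative):

    import numpy as np

    theta = np.array([-3.0, 1.0, 1.0])        # [Θ0, Θ1, Θ2]

    def predict(x1, x2):
        # z = -3 + x1 + x2; predict y = 1 exactly when x1 + x2 >= 3
        z = theta @ np.array([1.0, x1, x2])
        return 1 if z >= 0 else 0

    print(predict(2, 2))   # 2 + 2 >= 3 -> 1
    print(predict(1, 1))   # 1 + 1 <  3 -> 0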


NON - LINEAR DECISION BOUNDARIES

Sometimes the data points are arranged in such a manner that the curve separating them takes a more complex shape than a straight line.

Example

Hypothesis: hΘ(x) = g(Θ0 + Θ1x1 + Θ2x2 + Θ3x1² + Θ4x2²)




Let Θ0 = -1, Θ1 = 0, Θ2 = 0, Θ3 = 1, Θ4 = 1 (in upcoming lessons we'll see how to find the parameters automatically).

Θ = [-1;0;0;1;1]

The dimension of the Θ matrix is 5×1.

To predict:

y = 1 if,

-1 + x1² + x2² ≥ 0, i.e., x1² + x2² ≥ 1 (the boundary x1² + x2² = 1 is a circle with center at the origin).

NOTE: Decision boundaries depend upon the parameters, i.e., the Θ values.

Decision boundaries can vary depending upon the hypothesis. They can become more complex, or simpler, as the parameters and variables change.
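
A sketch of the circular boundary check with these parameter values (the predict helper is again illustrative):

    import numpy as np

    theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])   # [Θ0, Θ1, Θ2, Θ3, Θ4]

    def predict(x1, x2):
        # z = -1 + x1^2 + x2^2; predict y = 1 exactly when x1^2 + x2^2 >= 1,
        # i.e., when (x1, x2) lies on or outside the unit circle.
        features = np.array([1.0, x1, x2, x1**2, x2**2])
        return 1 if theta @ features >= 0 else 0

    print(predict(1, 1))       # 1 + 1 >= 1 -> 1 (outside the circle)
    print(predict(0.5, 0.5))   # 0.5 <  1 -> 0 (inside the circle)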

Points to remember:

- g(z) ≥ 0.5 when z ≥ 0

- z = 0: e⁰ = 1, so g(z) = 1/2

- z → ∞: e^(-z) → 0, so g(z) → 1

- z → -∞: e^(-z) → ∞, so g(z) → 0


That's all for day 7. Today we learned about decision boundaries in classification problems, especially in logistic regression.

In day 8, we will be learning about the cost function of logistic regression, which will help us figure out the parameters, i.e., the Θ values, automatically for the best fit, and we will also learn about the concept of multi-class classification in logistic regression.

If you think this article helped you learn something new or can help someone else, then do share it with your peers.

Till then Happy Learning!!!





Saturday, March 31, 2018

MACHINE LEARNING | LINEAR REGRESSION

MACHINE LEARNING - DAY 2



LINEAR REGRESSION

Notations:

m: number of training examples

x: input variable/feature

y: output variable/target variable

(x(i), y(i)): ith training example

Linear regression is mostly used in supervised learning. With a given data-set, our aim is to learn a function, or hypothesis, h: x -> y, so that h(x) is a "good" predictor for the corresponding value of y.





Hypothesis for linear regression:

hΘ(x) = Θ0 + Θ1x
Or

h(x) = Θ0 + Θ1x

Where,

h(x): hypothesis for the problem

Θ0: constant or the intercept

Θ1: the slope of the line

COST FUNCTION

The accuracy of the prepared hypothesis can be measured using a cost function. This takes the average of the squared differences between the results of the hypothesis on the inputs x and the actual outputs y.

Given below is the required cost function:

J(Θ0, Θ1) = (1/(2m)) Σ (hΘ(x(i)) - y(i))², summing over i = 1 to m

This function is called the Squared Error Function.

Now, hΘ(x(i)) - y(i) is the difference between the predicted result for the input x(i) and the real output y(i). Taking the summation over all training examples gives the total difference between the predicted outputs and the real outputs.

The 1/2 is taken to simplify the calculations, as we will see in gradient descent.
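
Here is a minimal sketch of this cost function in Python with NumPy (the function names are illustrative):

    import numpy as np

    def hypothesis(theta0, theta1, x):
        # h(x) = theta0 + theta1 * x
        return theta0 + theta1 * x

    def cost(theta0, theta1, x, y):
        # J(theta0, theta1) = (1 / 2m) * sum((h(x_i) - y_i)^2)
        m = len(y)
        errors = hypothesis(theta0, theta1, x) - y
        return np.sum(errors ** 2) / (2 * m)

    # A perfect fit (y = 2x with theta0 = 0, theta1 = 2) gives J = 0:
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 4.0, 6.0])
    print(cost(0.0, 2.0, x, y))   # 0.0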


AIM: To minimize the cost function J(Θ0, Θ1)

Hypothesis: hΘ(x) = Θ0 + Θ1x

Cost function: J(Θ0, Θ1) = (1/(2m)) Σ (hΘ(x(i)) - y(i))², as defined above

To minimize the cost function, i.e., to get the best fit, the line should pass through all the points in the training set. In such a case,

J(Θ0, Θ1) = 0

as the distance between each predicted value and the actual value is zero.

Note: for Θ0, Θ1 and J(Θ0, Θ1) we plot contour plots, since these are 3 values and hence need a 3D plot. The smallest circle in the contour plot, when shown in 2D, depicts the global minimum, which is the perfect fit for the given hypothesis.
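
Such a contour plot can be sketched with matplotlib, reusing x, y and cost() from the sketch above:

    import numpy as np
    import matplotlib.pyplot as plt

    # Evaluate J over a grid of (theta0, theta1) values
    t0, t1 = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(0, 4, 100))
    J = np.array([[cost(a, b, x, y) for a, b in zip(r0, r1)]
                  for r0, r1 in zip(t0, t1)])

    plt.contour(t0, t1, J, levels=30)
    plt.xlabel("theta0")
    plt.ylabel("theta1")
    plt.title("Contours of J(theta0, theta1)")
    plt.show()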

Now the question arises: how do we find the accurate (Θ0, Θ1) values?

The solution for the above question is a technique called Gradient Descent which is our next topic.

GRADIENT DESCENT

Gradient descent is used to find the Θi values, for i = 0, 1, …, n, that minimize the cost function J(Θ0, …, Θn).

Formula of Gradient Descent:

repeat until convergence: Θj := Θj - α · (∂/∂Θj) J(Θ0, Θ1)   (for j = 0 and j = 1)

where α is the learning rate.


Functioning of Gradient Descent:




Note: If α is too small then it will take a lot of time to converge to the global minimum.

Note: If α is too large then instead of converging to the global minimum it will start diverging.

So the choice of the learning rate α is very important.

At each iteration, gradient descent takes the partial derivatives of J(Θ0, Θ1) and updates the values of Θ0 and Θ1 simultaneously.




Now let's see how the parameter values (Θ0, Θ1) are found.



In the first case, point A has a positive slope, which is the value of the partial derivative of J(Θ0, Θ1), so the gradient formula decreases the values of Θ0 and Θ1:

Θ0 - (+ve) = decrease in the value of Θ0

Θ1 - (+ve) = decrease in the value of Θ1

and finally the point reaches point B, which is the global minimum.

Similarly, in the second case, point A has a negative slope, again the value of the partial derivative of J(Θ0, Θ1), so the gradient formula increases the values of Θ0 and Θ1:

Θ0 - (-ve) = increase in the value of Θ0

Θ1 - (-ve) = increase in the value of Θ1

and finally the point reaches point B, which is the global minimum.



As the point reaches point B, the derivative becomes 0, because the slope of the cost curve is flat at the minimum. The update then leaves the values of (Θ0, Θ1) unchanged, and those are the required values for our parameters.

In this way gradient descent helps in figuring out which values of (Θ0, Θ1) suit the hypothesis for the best fit.

This process is also called Batch Gradient Descent, since every computation of (Θ0, Θ1) looks upon the entire batch of data until it finds the global minimum.
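
Putting it together, here is a sketch of batch gradient descent for the two parameters (simultaneous update, fixed learning rate; the names are illustrative):

    import numpy as np

    def gradient_descent(x, y, alpha=0.1, iters=2000):
        # Batch gradient descent for h(x) = theta0 + theta1 * x.
        # Every step uses the whole training set (hence 'batch').
        m = len(y)
        theta0, theta1 = 0.0, 0.0
        for _ in range(iters):
            errors = theta0 + theta1 * x - y
            # Compute both partial derivatives before updating, so that
            # theta0 and theta1 are updated simultaneously.
            grad0 = np.sum(errors) / m
            grad1 = np.sum(errors * x) / m
            theta0 -= alpha * grad0
            theta1 -= alpha * grad1
        return theta0, theta1

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 4.0, 6.0])
    print(gradient_descent(x, y))   # approaches (0.0, 2.0)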



That's all for day 2. Next, in day 3, we will learn about linear regression with multiple variables.

If you feel this article helped you in any way, do not forget to share it, and if you have any thoughts or doubts about it, do write them in the comments section.

Till then Happy Learning..