Linear Regression and Logistic Regression/Classification

Sonu Ranjit Jacob
5 min read · Jun 26, 2020

Linear regression is a type of machine learning model that predicts a real-valued output, i.e. the output variable is continuous. It is based on supervised learning. In my first article, I gave the example of using machine learning to predict the cost of a house based on the area of the house in square feet.

The data is represented by the table shown. The following notation is used to denote the data: X represents the input training data and y represents the output value, and (X¹, y¹) represents one training example.

So we can think of the machine learning model as a function that maps X to y, i.e. what value of y do we get for a particular X. This is represented by h : X → y. Here, h is called the hypothesis.

The example above is an instance of linear regression, as the technique assumes a linear relationship between the cost of the house (y) and the area of the house (X). Hence the hypothesis function will be

y = h(X) = w0 + w1 · X

where,

y = predicted cost of house (output of the machine learning model)

X = the input variable (the area of the house in square feet)

w0, w1 = the parameters/weights that have to be learned by our machine learning model to correctly predict the cost of the house

This equation represents a straight line and can be “fitted” to the data as shown in the figure. Future predictions can be made by projecting a line from the x-axis up to the fitted line and reading off the corresponding y value.
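To make the hypothesis concrete, here is a minimal Python sketch. The weight values below are invented purely for illustration; in practice they are learned from the training data.

```python
# Linear hypothesis: y = h(X) = w0 + w1 * X
# The weights below are made-up placeholders, not learned values.

def predict_price(area_sqft, w0=50_000.0, w1=120.0):
    """Predict house cost from area using the linear hypothesis."""
    return w0 + w1 * area_sqft

print(predict_price(1500))  # 50000 + 120 * 1500 = 230000.0
```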

The parameters w0 and w1 are where the magic of machine learning lies. These parameters are learned by the machine learning model through techniques like gradient descent, which I will cover in my next article.

Moving on to classification: it is also a type of supervised learning, one which identifies which category a given example falls into, and it falls under the umbrella of pattern recognition. (One important thing to note is that this approach to classification is also called logistic regression, despite being used for classification rather than regression. This caused a great deal of confusion for me when I was starting out in machine learning!) In classification the input is assigned to a particular category, hence the output is discrete. For example, when predicting whether a person has a tumour or not, the output can only be either 1 (the tumour is present) or 0 (the tumour is absent).

Hence for classification, the hypothesis function used by the machine learning model to make predictions has to change. It does not make sense to use a linear relation between the input and output variables in this case. Instead, the output of the machine learning model will be the probability that the given example falls into a particular category. The sigmoid function is used as the hypothesis function to classify the data, as it outputs values between 0 and 1:

σ(z) = 1 / (1 + e^(−z))

Its plot is an S-shaped curve that approaches 0 for large negative z and 1 for large positive z.
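As a quick illustration, here is the sigmoid in Python (a minimal sketch; the sample inputs are arbitrary):

```python
import math

def sigmoid(z):
    """Sigmoid function: squashes any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))   # 0.5
print(sigmoid(4.0))   # ~0.982, close to 1
print(sigmoid(-4.0))  # ~0.018, close to 0
```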

(If you are wondering how probability is related to the sigmoid function, please refer to my explanation in the appendix.)

In the case of binary classification, we can decide on a threshold: if the output of the sigmoid function for a particular example is greater than 0.5, it is classified as 1, else it is classified as 0. The threshold can be adjusted according to how the data is distributed.
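Here is a hedged sketch of that thresholding rule. The weights w0 and w1 are invented for illustration; a trained model would learn them from data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# w0 and w1 are placeholder weights, not learned values.
def classify(x, w0=-3.0, w1=1.0, threshold=0.5):
    probability = sigmoid(w0 + w1 * x)  # probability that x belongs to class 1
    return 1 if probability > threshold else 0

print(classify(2.0))  # sigmoid(-1.0) ~ 0.27 -> classified as 0
print(classify(5.0))  # sigmoid(2.0)  ~ 0.88 -> classified as 1
```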

There can also be multiple classes in a classification problem; it need not be limited to two outputs. Consider a machine learning model which classifies cartoon characters by the TV show they belong to. Let each TV show be represented by a number, say “Tom and Jerry” is represented by 0, “Baby Looney Tunes” by 1 and “Mickey Mouse” by 2. The data will consist of a set of images of each character from each TV show, and the model is trained on a subset of this dataset.

If I give the model an image of Bugs Bunny, then the model will ideally predict the output as 1 (which is a discrete value), corresponding to the show “Baby Looney Tunes”. This is another example of classification.
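As a hedged sketch of how such a multi-class prediction might be read out (the probabilities below are invented; a real model would compute them from the image):

```python
# Mapping of class indices to TV shows, as in the example above.
SHOWS = {0: "Tom and Jerry", 1: "Baby Looney Tunes", 2: "Mickey Mouse"}

# Hypothetical per-class probabilities the model might output for an
# image of Bugs Bunny (invented numbers for illustration).
class_probabilities = [0.05, 0.90, 0.05]

# The predicted class is the one with the highest probability.
predicted_class = max(range(len(class_probabilities)),
                      key=lambda c: class_probabilities[c])
print(predicted_class, "->", SHOWS[predicted_class])  # 1 -> Baby Looney Tunes
```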

When designing machine learning models for regression, the metrics used to evaluate their performance are usually mean absolute error or mean squared error, while for classification the metrics used are accuracy, precision, recall and the confusion matrix. I will explain more about this in my post on Performance Metrics.
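For instance, assuming scikit-learn is available, these metrics could be computed on toy data like so (all values invented for illustration):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             confusion_matrix, mean_absolute_error,
                             mean_squared_error)

# Classification metrics operate on discrete labels.
y_true_cls = [1, 0, 1, 1, 0]
y_pred_cls = [1, 0, 0, 1, 0]
print(accuracy_score(y_true_cls, y_pred_cls))    # 0.8
print(precision_score(y_true_cls, y_pred_cls))   # 1.0
print(recall_score(y_true_cls, y_pred_cls))      # ~0.667
print(confusion_matrix(y_true_cls, y_pred_cls))  # [[2 0], [1 2]]

# Regression metrics operate on continuous values.
y_true_reg = [250_000.0, 180_000.0, 320_000.0]
y_pred_reg = [240_000.0, 200_000.0, 310_000.0]
print(mean_absolute_error(y_true_reg, y_pred_reg))  # ~13333.33
print(mean_squared_error(y_true_reg, y_pred_reg))   # 200000000.0
```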

In my next article I will explain how gradient descent is used to calculate the parameters for a regression model. See you then!

References:

  1. Notes from Andrew Ng’s course on Machine Learning (Coursera): http://cs229.stanford.edu/notes/cs229-notes1.pdf
  2. Derivative of the sigmoid function (Towards Data Science): https://towardsdatascience.com/derivative-of-the-sigmoid-function-536880cf918e

Appendix:

Why the sigmoid function is used for logistic regression: Suppose we want to find the probability of an example x being classified as 1, i.e.

P(y = 1 | x)

Using Bayes’ rule, the posterior probability is

P(y = 1 | x) = P(x | y = 1) · P(y = 1) / P(x)

On expanding the denominator we get

P(y = 1 | x) = P(x | y = 1) · P(y = 1) / [P(x | y = 1) · P(y = 1) + P(x | y = 0) · P(y = 0)]

Assuming the classes have uniform priors, P(y = 1) = P(y = 0) = 1/2, and the prior terms cancel. Substituting the above,

P(y = 1 | x) = P(x | y = 1) / [P(x | y = 1) + P(x | y = 0)]

On simplifying (dividing through by P(x | y = 1)) we get

P(y = 1 | x) = 1 / (1 + z)

This equation is of the form 1/(1 + z), where z is given by

z = P(x | y = 0) / P(x | y = 1)

As z is a ratio of probabilities it can take only positive values, but if we want to predict it with a regression model, the predicted quantity should be able to take any real number. Hence the log transformation is applied: define z′ = ln(P(x | y = 1) / P(x | y = 0)), so that z = e^(−z′) and z′ can be any real number.

So we get

P(y = 1 | x) = 1 / (1 + e^(−z′))

This is exactly the form of the sigmoid function, σ(t) = 1 / (1 + e^(−t)), and hence it is used as the hypothesis for logistic regression.
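As a quick numeric sanity check of this derivation (the class-conditional likelihoods below are invented for illustration):

```python
import math

# Invented class-conditional likelihoods for a single example x.
p_x_given_1 = 0.6   # P(x | y = 1)
p_x_given_0 = 0.2   # P(x | y = 0)

# Posterior under uniform priors, straight from Bayes' rule:
posterior = p_x_given_1 / (p_x_given_1 + p_x_given_0)

# Same quantity via the sigmoid of the log-likelihood ratio z':
z_prime = math.log(p_x_given_1 / p_x_given_0)
sigmoid_form = 1.0 / (1.0 + math.exp(-z_prime))

print(posterior, sigmoid_form)  # both 0.75
```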
