Gradient Descent

Sonu Ranjit Jacob
4 min read · Jul 3, 2020

In my last post, I mentioned that I would explain how this “training of the model” works. (If you haven’t read it, check it out here — Linear regression and Classification)

So far, we have seen that we feed some input data to a “machine learning model”, which in the simplest case is a simple linear equation, and it outputs the predicted value (linear regression) or class (classification/logistic regression) for that particular input.

Now, for a machine learning model, there are two types of parameters — internal parameters and hyperparameters.

Internal parameters are the parameters that the model learns when it is trained, i.e. they depend on the training data. This is where gradient descent comes in: gradient descent is used to estimate these parameters so that the resulting model fits the data. In the example where we predict the cost of a house based on its area in square feet, we know that the model is represented by the equation y = mx + c, or equivalently y = w₀ + w₁x, and the parameters (m, c), or (w₀, w₁) in the second form, will be determined using gradient descent.
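
As a concrete illustration, here is that hypothesis written out in Python. The weight values below are made-up placeholders rather than learned values; finding the right ones is exactly what gradient descent will do:

```python
# Hypothesis: predict house price from area in square feet.
# w0 (intercept) and w1 (slope) are the internal parameters that
# gradient descent will learn; these values are arbitrary placeholders.
def predict_price(area_sqft, w0=50_000.0, w1=150.0):
    return w0 + w1 * area_sqft

print(predict_price(1200))  # 50000 + 150 * 1200 = 230000.0
```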

First, we can measure the performance of our model using a cost function. For linear regression this is the mean squared error between the values predicted by the model and the actual values:

J(w₀, w₁) = (1/2m) · Σᵢ (ŷᵢ − yᵢ)²

where ŷᵢ is the predicted value, yᵢ is the actual value and the sum runs over all m training examples (the factor of ½ is a common convention that keeps the derivative tidy). Hence, if we substitute ŷᵢ with its hypothesis equation, we get

J(w₀, w₁) = (1/2m) · Σᵢ (w₀ + w₁xᵢ − yᵢ)²

This tells us how well our model fits the data. Our goal is to minimize the value of J(w₀, w₁).

If we plot the cost function for various values of w₀ and w₁, with w₀ on the x axis, w₁ on the y axis and J(w₀, w₁) on the z axis, we get a bowl-shaped (convex) surface.

The cost function attains its minimum at the lowest point of this surface. To reach it, we take the derivative of the cost function at the current point: the slope of the tangent there tells us which direction is downhill. We then take a step in the direction of steepest descent, and the size of each step is controlled by the learning rate, a parameter denoted α (alpha).
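
To make the step concrete with purely illustrative numbers: if the derivative of the cost with respect to w₁ at the current point is 4 and α = 0.1, then w₁ is decreased by 0.1 × 4 = 0.4; if the derivative were negative, the same rule would push w₁ upward instead. Either way, the move is downhill on the cost surface.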

The algorithm for gradient descent is given by:

Repeat until convergence:

wⱼ := wⱼ − α · ∂J(w₀, w₁)/∂wⱼ

where j = 0, 1 is the index of the parameter and α is the learning rate. Note that at each iteration both w₀ and w₁ should be updated simultaneously, using derivatives computed from the old values.

This is how we train the model. We start with an initial hypothesis, i.e. initial values of w₀ and w₁, and compute J(w₀, w₁). We then compute new values of w₀ and w₁ using the gradient of J(w₀, w₁). If the algorithm has not converged (i.e. the parameters are still changing by more than a negligible amount, or the cost is still decreasing), we plug the new values of w₀ and w₁ back in and repeat the update. We continue until convergence.
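
Putting the pieces together, here is a minimal sketch of batch gradient descent in Python for the one-feature house-price example. The data, learning rate and iteration count are invented for illustration, not taken from the article:

```python
import numpy as np

# Toy data: house areas (in thousands of sq ft, scaled so a fixed
# learning rate behaves well) and their prices. Values are made up.
x = np.array([0.8, 1.0, 1.2, 1.5, 1.8])
y = np.array([160_000.0, 200_000.0, 230_000.0, 290_000.0, 350_000.0])

w0, w1 = 0.0, 0.0   # initial guesses for the parameters
alpha = 0.1         # learning rate
m = len(x)

for _ in range(5000):
    y_hat = w0 + w1 * x                          # predictions ŷ
    grad_w0 = (1 / m) * np.sum(y_hat - y)        # ∂J/∂w0
    grad_w1 = (1 / m) * np.sum((y_hat - y) * x)  # ∂J/∂w1
    # Simultaneous update of both parameters.
    w0, w1 = w0 - alpha * grad_w0, w1 - alpha * grad_w1

cost = (1 / (2 * m)) * np.sum((w0 + w1 * x - y) ** 2)
print(f"w0 = {w0:,.0f}, w1 = {w1:,.0f}, cost = {cost:,.0f}")
```

In practice you would also track the cost between iterations and stop once it no longer decreases meaningfully, rather than running for a fixed number of steps.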

Another thing to note is that there are two common variants of gradient descent — batch gradient descent and stochastic gradient descent. In batch gradient descent, every update of the weights uses all the training examples (notice the summation over the whole training set in J(w₀, w₁)). In stochastic gradient descent, the parameters are updated using only one training example at a time, so there is no sum over the training set; instead, for each training example i in turn, the update is:

w₀ := w₀ − α · (ŷᵢ − yᵢ)
w₁ := w₁ − α · (ŷᵢ − yᵢ) · xᵢ
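
For comparison, here is a sketch of the stochastic version of the training loop above, again with invented data: the parameters are nudged after every single example rather than once per pass over the whole set.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.8, 1.0, 1.2, 1.5, 1.8])   # areas (toy data, thousands of sq ft)
y = np.array([160_000.0, 200_000.0, 230_000.0, 290_000.0, 350_000.0])

w0, w1 = 0.0, 0.0
alpha = 0.05  # learning rate

for epoch in range(2000):
    for i in rng.permutation(len(x)):    # visit examples in random order
        error = (w0 + w1 * x[i]) - y[i]  # ŷᵢ − yᵢ for one example
        # Update the parameters using this single example only.
        w0, w1 = w0 - alpha * error, w1 - alpha * error * x[i]

print(f"w0 = {w0:,.0f}, w1 = {w1:,.0f}")
```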

Other examples of internal parameters are the weights of a neural network or the support vectors in a support vector machine.

Hyperparameters refer to parameters which are tuned by us to improve the performance of the model. Examples include the learning rate for training a neural network, the value of k in k-nearest neighbours and the batch size. The best values for these parameters are found through experimentation.

To sum up, in this article I have explained how the learning part of machine learning works — through gradient descent.

References:

  1. Andrew Ng’s course on Machine Learning
  2. https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/
  3. https://towardsdatascience.com/difference-between-batch-gradient-descent-and-stochastic-gradient-descent-1187f1291aa1
