Linear regression is one of the most popular and best understood algorithms in the machine learning landscape. Since regression tasks belong to the most common machine learning problems in supervised learning, every Machine Learning Engineer should have a thorough understanding of how it works.
This blog post covers how the linear regression algorithm works, where it is used, how you can evaluate its performance and which tools & techniques should be used along with it.
Table of Contents:
- Example of a use case
- How it works
- Cost Function
- Gradient Descent
- Feature Normalization
- Polynomial Regression
Example of a Use Case:
A good example of the use of linear regression is the prediction of housing prices. We're going to use housing prices dataset from the city of Portland, Oregon. Some houses with different sizes and prices are plotted in the image above.
Imagine a friend of you want’s to sell his house of the size of 1250 square feet and he wants to know from you, for how much he can probably sell it for. One thing you could do is to fit a straight line to the data point in the graph, which is called regression.
Based on this line you can see that he is maybe able to sell it for $ 220,000.
Predicting housing prices is a regression problem because the term regression refers to the fact that we are prediction real valued output.
The other most common supervised learning problem is called classification, where we predict discrete valued outputs, like telling if a tumor is malignant (1) or benign (0). But we will discuss classification in another blog post.
Usually, in supervised learning, we have a dataset, which is called the training set. If we go back to the housing prices example, we have a training set of housing prices, that contains examples of different houses, details about the houses and their prices. Our job is to learn from this data how to predict housing prices.
How it works:
We feed the training set into our learning algorithm, which then outputs a function (h), based on what he has learned from the training set. This function is called the „hypothesis“.
The hypothesis is then able to estimate the price of a house, using its size as an input. This example is called „univariate“ regression, which means that you want to predict a single output value from a single input value. There is also „multivariate“ regression, where you have several input values.
The next thing we need to decide is how do we represent the hypothesis. There are different ways to represent it and I will use the one of Andrew Ng (for univariate regression):
Note that this is like the equation of a straight line.
After we’ve trained our learning algorithm and got a hypothesis, we need to examine how good our results are. This is done by the so called cost function.
The cost function measures the accuracy of the hypothesis outputs. It does this by comparing the predicted prices of the hypothesis with the actual prices of a house.
To explain it more detailed: The cost function takes an average (actually a more complex version of an average) of all the hypothesis results with inputs compared to the actual prices of the houses. Therefore we want, that the output of the cost function is as small as possible because this means improving how accurate our predictions are. The most used cost function in linear regression is the so-called “Mean Square Error (MSE)” cost function.
So we have a hypothesis and a technique to measure its accuracy, but how can we now actually improve the outputs of our hypothesis? We need to optimize the parameters θ0 and θ1 of the hypothesis, which are later fed into the cost function.
This is were gradient descent comes into play.
Gradient descent is an optimization algorithm that tweaks it’s parameters iteratively.
But what does that actually mean?
We already know that we want to find parameters that reduce the output of the cost function as good as possible. Heres an illustration of gradient descent:
In this picture, the horizontal axes represents the space of parameters θ0 and θ1. The cost function J(θ0, θ1) is some type of surface above them. So the hight of the surface represents the value of this surface at a certain point.
Like you already know, we want to find values of θ0 and θ1 that correspond to the minimum of the cost function (marked with the red arrow). This minimum of the cost function is called the global optimum. You can also see that gradient descent is a convex function, which is one of the main reasons why we use it for regression.
To start with finding the right values we initialize the values of θ0 and θ1 with some random numbers and Gradient Descent then starts at that initial point (somewhere around the top of our illustration) and then takes one step after another in the steepest downside direction (e.g. from the top to the bottom of the illustration), till it reaches the global optimum (where the cost function is as small as possible), which is marked with the red arrow.
How big these steps are is determined by the so called Learning Rate. It is important that the learning rate is neither too high or too low, because if it is too high it will take steps that are too big and will maybe not reach the global optimum and if it is too small it will simply take too much time till it reaches the global optimum.
There is a way to find out if you have chosen the right learning rate. You need to generate a plot, with the number of iterations of gradient descent vs. the cost function (MSE). If the cost function is increasing at some point, you need to decrease the learning rate.
There is a way to speed up the computation of gradient descent, called feature normalization. Instead of initializing the parameters with random numbers, we choose numbers that are roughly in the same range.
This works better because θ will descend quickly on small ranges and slowly on large ranges. Therefore gradient descent would take inefficient steps down to the global optimum when the variables are very uneven. We prevent this by having the input parameters in roughly the same range.
There are 2 techniques that help us with this task: Feature scaling and Mean normalization.
With Feature scaling, we divide the input values by the range (i.e. the maximum value minus the minimum value) of the input variables, which results in a new range of just 1.
With Mean normalization we subtract the average value for an input variable from the values for that input variable, resulting in a new average value for the input variable of just zero.
Let’s explain polynomial regression with an example. Imagine you have another dataset of houses and their corresponding prices and you want to again predict their prices. At the plotted dataset below, you can clearly see, that a straight line wouldn’t fit the data very well.
The thing is our hypothesis function does not have to be a straight linear line all the time, if that doesn’t fit the data well. We could also change the function so that it is a quadratic, cubic or square root function (or any other form). You can see an example below.
The is called polynomial regression. Unfortunately, explaining how this works in detail would go beyond the scope of this blog post, but we will definitely cover it in a future blog post.