Linear Regression

Niklas Donges
April 25, 2019 · AI & Machine Learning

Linear regression is one of the most popular and best-understood algorithms in the machine learning landscape. Since regression tasks are among the most common supervised learning problems, every machine learning engineer should have a thorough understanding of how it works.

This blog post covers how the linear regression algorithm works, where it is used, how you can evaluate its performance, and which tools and techniques you should use along with it.

 

Table of Contents:

  • Example of a use case
  • How it works
  • Cost Function
  • Gradient Descent
  • Feature Normalization
  • Polynomial Regression

 

Example of a Use Case:

 

[Figure: housing prices from the Portland, Oregon dataset plotted against house size]

A good example of a linear regression use case is predicting housing prices. We’re going to use a housing prices dataset from the city of Portland, Oregon. Some houses with different sizes and prices are plotted in the image above.

Imagine a friend of yours wants to sell his 1,250-square-foot house and asks you how much he can probably sell it for. One thing you could do is fit a straight line to the data points in the graph, which is called regression.

Based on this line, you can see that he may be able to sell it for about $220,000.
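To make this concrete, here is a minimal sketch of fitting such a line with NumPy. The numbers below are made-up stand-ins for the Portland data (the real values aren’t reproduced here), so treat the output as purely illustrative:

```python
import numpy as np

# Hypothetical (size in sq ft, price in $) pairs standing in for the
# Portland dataset -- not the actual data.
sizes = np.array([850, 1100, 1400, 1800, 2100, 2400], dtype=float)
prices = np.array([180000, 210000, 245000, 300000, 330000, 369000], dtype=float)

# np.polyfit with deg=1 fits a straight line by least squares and
# returns the coefficients from the highest degree down: [slope, intercept].
slope, intercept = np.polyfit(sizes, prices, deg=1)

# Estimate the selling price of the friend's 1,250 sq ft house.
print(f"Estimated price: ${intercept + slope * 1250:,.0f}")
```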

Predicting housing prices is a regression problem because the term regression refers to the fact that we are predicting a real-valued output.


The other most common supervised learning problem is called classification, where we predict discrete valued outputs, like telling if a tumor is malignant (1) or benign (0). But we will discuss classification in another blog post.

Usually, in supervised learning, we have a dataset called the training set. Going back to the housing prices example, we have a training set of housing prices that contains examples of different houses, details about those houses, and their prices. Our job is to learn from this data how to predict housing prices.

 

 

How it works:

[Figure: the training set feeds into the learning algorithm, which outputs the hypothesis h; h maps a house’s size to an estimated price]

We feed the training set into our learning algorithm, which then outputs a function (h) based on what it has learned from the training set. This function is called the “hypothesis”.

The hypothesis is then able to estimate the price of a house, using its size as an input. This example is called “univariate” regression, which means that we predict a single output value from a single input value. There is also “multivariate” regression, where you have several input values.

The next thing we need to decide is how to represent the hypothesis. There are different ways to represent it; I will use the one Andrew Ng uses for univariate regression:

hθ(x) = θ0 + θ1x

Note that this is just the equation of a straight line, where θ0 is the intercept and θ1 is the slope.
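In code, the hypothesis is just this straight line wrapped in a function; a minimal sketch:

```python
def hypothesis(x, theta0, theta1):
    """Univariate linear hypothesis h(x) = theta0 + theta1 * x.

    theta0 is the intercept, theta1 the slope, and x the input
    feature (here: the size of a house in square feet).
    """
    return theta0 + theta1 * x
```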

 

Cost Function:

After we’ve trained our learning algorithm and obtained a hypothesis, we need to examine how good its results are. This is done by the so-called cost function.

The cost function measures the accuracy of the hypothesis outputs by comparing the prices the hypothesis predicts with the actual prices of the houses.

In more detail: the cost function takes an average (actually a slightly more complex version of an average) of the squared differences between the hypothesis outputs and the actual prices of the houses. We therefore want the output of the cost function to be as small as possible, because a smaller cost means more accurate predictions. The most commonly used cost function in linear regression is the so-called “Mean Squared Error (MSE)” cost function.
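With Andrew Ng’s conventions, the MSE cost is J(θ0, θ1) = 1/(2m) · Σ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)², where m is the number of training examples; the extra factor of 1/2 is cosmetic and cancels when the cost is differentiated. A minimal NumPy sketch:

```python
import numpy as np

def mse_cost(x, y, theta0, theta1):
    """Mean Squared Error cost J(theta0, theta1).

    Averages the squared differences between the hypothesis outputs
    and the actual prices y; the 1/(2m) factor follows Andrew Ng's
    convention.
    """
    m = len(y)
    predictions = theta0 + theta1 * x  # hypothesis output for every example
    return np.sum((predictions - y) ** 2) / (2 * m)
```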

 

 

Gradient Descent:

So we have a hypothesis and a technique to measure its accuracy, but how can we actually improve its outputs? We need to optimize the parameters θ0 and θ1 of the hypothesis, which are fed into the cost function.

This is where gradient descent comes into play.

Gradient descent is an optimization algorithm that tweaks the model’s parameters iteratively.

But what does that actually mean?

We already know that we want to find parameters that reduce the output of the cost function as much as possible. Here’s an illustration of gradient descent:

[Figure: the cost function J(θ0, θ1) as a surface over the parameter space, with the global optimum marked by a red arrow]

In this picture, the horizontal axes represent the parameter space of θ0 and θ1, and the cost function J(θ0, θ1) forms a surface above them. The height of the surface at any point is the value of the cost function for those parameter values.

As you already know, we want to find the values of θ0 and θ1 that correspond to the minimum of the cost function (marked with the red arrow). This minimum is called the global optimum. You can also see that the cost surface is convex: it has a single global minimum and no local minima. The MSE cost of linear regression is always convex, which is one of the main reasons this cost function is used for regression.

To find the right values, we initialize θ0 and θ1 with random numbers. Gradient descent then starts at that initial point (somewhere around the top of our illustration) and takes one step after another in the direction of steepest descent, until it reaches the global optimum (where the cost function is as small as possible), marked with the red arrow.

How big these steps are is determined by the so-called learning rate. It is important that the learning rate is neither too high nor too low: if it is too high, gradient descent takes steps that are too big and may never reach the global optimum; if it is too low, it simply takes too long to get there.
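Here is a sketch of batch gradient descent for the univariate case, with the learning rate exposed as `alpha`; the cost after every iteration is recorded so it can be plotted later (see the check below):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, iterations=1000):
    """Batch gradient descent for univariate linear regression.

    alpha is the learning rate. Both parameters start at 0 here; any
    starting point works because the MSE cost surface is convex.
    Returns the fitted parameters plus the cost at every iteration.
    """
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    history = []
    for _ in range(iterations):
        error = (theta0 + theta1 * x) - y     # h(x) - y for every example
        grad0 = np.sum(error) / m             # dJ/d(theta0)
        grad1 = np.sum(error * x) / m         # dJ/d(theta1)
        theta0 -= alpha * grad0               # simultaneous update:
        theta1 -= alpha * grad1               # step downhill on the surface
        history.append(np.sum(error ** 2) / (2 * m))
    return theta0, theta1, history
```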

[Figure: the effect of choosing the learning rate too small vs. too large]

There is a way to find out whether you have chosen the right learning rate: plot the number of gradient descent iterations against the value of the cost function (MSE). If the cost is increasing at any point, you need to decrease the learning rate.
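Using the `gradient_descent` sketch from above (and assuming `x` and `y` stand for your training arrays, ideally normalized as described in the next section), the check looks like this:

```python
import matplotlib.pyplot as plt

theta0, theta1, history = gradient_descent(x, y, alpha=0.1)

plt.plot(history)           # cost after each iteration
plt.xlabel("Iteration")
plt.ylabel("Cost J (MSE)")
plt.show()                  # a rising or oscillating curve: decrease alpha
```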

 

Feature Normalization

There is a way to speed up the convergence of gradient descent, called feature normalization: instead of feeding the model the input features in their raw, very different ranges, we rescale them so that they all lie in roughly the same range.

This works better because θ descends quickly on small ranges and slowly on large ranges, so gradient descent takes inefficient steps toward the global optimum when the features are very uneven. Bringing the input features into roughly the same range prevents this.

There are two techniques that help us with this task: feature scaling and mean normalization.

With feature scaling, we divide the values of an input variable by its range (i.e. the maximum value minus the minimum value), which results in a new range of just 1.

With mean normalization, we subtract the average value of an input variable from each of its values, resulting in a new average of zero for that variable.
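Both techniques are one-liners in NumPy. A minimal sketch that applies mean normalization and feature scaling together (reusing the hypothetical `sizes` array from earlier):

```python
import numpy as np

def normalize(x):
    """Mean normalization plus feature scaling.

    Subtracting the mean centres the feature at zero; dividing by the
    range (max - min) shrinks its span to roughly one unit.
    """
    return (x - x.mean()) / (x.max() - x.min())

sizes_scaled = normalize(sizes)  # ready for faster gradient descent
```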

 

Polynomial Regression

Let’s explain polynomial regression with an example. Imagine you have another dataset of houses and their corresponding prices, and you again want to predict the prices. In the plotted dataset below, you can clearly see that a straight line wouldn’t fit the data very well.

[Figure: a housing dataset that a straight line fits poorly]

The thing is, our hypothesis function does not have to be a straight line if that doesn’t fit the data well. We could also change the function so that it is quadratic, cubic, a square root function, or any other form. You can see an example below.

 

[Figure: a curved, polynomial hypothesis fit to the same data]

This is called polynomial regression. Unfortunately, explaining how it works in detail would go beyond the scope of this blog post, but we will definitely cover it in a future post.
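Still, to give you a taste, here is a minimal sketch that fits a quadratic instead of a straight line, again using the hypothetical arrays from earlier purely for illustration:

```python
import numpy as np

# deg=2 fits h(x) = theta0 + theta1*x + theta2*x^2; np.polyfit returns
# the coefficients from the highest degree down.
theta2, theta1, theta0 = np.polyfit(sizes, prices, deg=2)

predicted = theta0 + theta1 * 1250 + theta2 * 1250 ** 2
print(f"Quadratic model estimate: ${predicted:,.0f}")
```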
