{"id":22479,"date":"2020-12-03T10:17:32","date_gmt":"2020-12-03T10:17:32","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/linear-regression-gradient-descent-for-beginners\/"},"modified":"2021-05-21T03:31:50","modified_gmt":"2021-05-21T03:31:50","slug":"linear-regression-gradient-descent-for-beginners","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/linear-regression-gradient-descent-for-beginners\/","title":{"rendered":"Linear Regression And Gradient Descent For Absolute Beginners"},"content":{"rendered":"\n<p class=\"has-normal-font-size\"><em>A simple explanation and implementation of gradient descent<\/em><\/p>\n\n\n\n<p id=\"41f7\">Let\u2019s say we have a fictional dataset of paired variables, mothers\u2019 heights and their daughters\u2019 heights:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1rxOC_xOwmIti3efm2npdMw.png\" alt=\"height of mother\/daughter pairs\"\/><figcaption>height of mother\/daughter pairs<\/figcaption><\/figure><\/div>\n\n\n\n<p id=\"72f5\">Given a new mother\u2019s height of 63, how do we predict* her daughter\u2019s height?<\/p>\n\n\n\n<p id=\"a13e\">The way you do it is by <a href=\"https:\/\/www.experfy.com\/blog\/ai-ml\/linear-regression\/\" target=\"_blank\" rel=\"noreferrer noopener\">linear regression<\/a>.<\/p>\n\n\n\n<p id=\"2a09\">First, you find the line of best fit. Then you use that line for your prediction*.<\/p>\n\n\n\n<p id=\"d1ba\">*Note: I used \u201cpredict\/prediction\u201d in this article. 
However, a reader pointed out in the comments below that the correct terminology is \u201cestimate\/estimation.\u201d<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1ardkI1_H4bHSOscCcgfOOw.png\" alt=\"Line of best fit, or \u201cregression line\"\/><figcaption>Line of best fit, or \u201cregression line\u201d<\/figcaption><\/figure><\/div>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>Linear regression is about finding the line of best fit for a dataset. This line can then be used to make predictions.<\/p><\/blockquote>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"3596\"><strong>How do you find the line of best fit?<\/strong><\/h2>\n\n\n\n<p id=\"664b\">This is where gradient descent comes in.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>Gradient descent is a tool to arrive at the line of best fit.<\/p><\/blockquote>\n\n\n\n<p id=\"58e1\">Before&nbsp;we dig into gradient descent, let\u2019s first look at another way of computing the line of best fit.<\/p>\n\n\n\n<p id=\"7c3c\"><strong>The statistics way of computing the line of best fit:<\/strong><\/p>\n\n\n\n<p id=\"cf1c\">A line can be represented by the formula:&nbsp;<code>y = mx + b<\/code>.<\/p>\n\n\n\n<p id=\"8dcf\">The formula for the slope&nbsp;<code>m<\/code>&nbsp;of the regression line is:<\/p>\n\n\n\n<p id=\"874c\"><code>m = r * (SD of y \/ SD of x)<\/code><\/p>\n\n\n\n<p id=\"c633\"><em>Translation<\/em>: the correlation coefficient between the x and y values (<code>r<\/code>), multiplied by the standard deviation of the y values (<code>SD of y<\/code>) divided by the standard deviation of the x values (<code>SD of x<\/code>).<\/p>\n\n\n\n<p id=\"0879\">The standard deviation of mothers\u2019 heights in the data above is approximately 4.07. The standard deviation of daughters\u2019 heights is approximately 5.5. 
The correlation coefficient between these two sets of variables is about 0.89.<\/p>\n\n\n\n<p id=\"06af\">So the line of best fit, or regression line, is:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">y = 0.89*(5.5 \/ 4.07)x + b<br>y = 1.2x + b<\/pre>\n\n\n\n<p id=\"8a6b\">We know that the regression line crosses the point of averages, so one point on the line is&nbsp;<code>(average of x values, average of y values)<\/code>, or&nbsp;<code>(63.5, 63.33)<\/code>.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">63.33 = 1.2*63.5 + b<br>b = -12.87<\/pre>\n\n\n\n<p id=\"4c63\">Therefore, the regression line as calculated using the correlation coefficient and standard deviations is approximately:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">y = 1.2x - 12.87<\/pre>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>The regression line computed using statistics is y = 1.2x - 12.87<\/p><\/blockquote>\n\n\n\n<p id=\"ff86\">Now, let\u2019s dig into gradient descent.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"a8e1\"><strong>The gradient descent way of computing the line of best fit:<\/strong><\/h2>\n\n\n\n<p id=\"9e53\">In gradient descent, you start with a random line. Then you change the parameters of the line (i.e. the slope and y-intercept) little by little to arrive at the line of best fit.<\/p>\n\n\n\n<p id=\"23c7\">How do you know when you\u2019ve arrived at the line of best fit?<\/p>\n\n\n\n<p id=\"4ca9\">For every line you try \u2014 line A, line B, line C, etc. \u2014 you calculate the sum of squares of the errors. 
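That comparison can be sketched in a few lines of Python. (This is a minimal illustration of my own; the sample points below are made up and are not the dataset from the figure.)

```python
def sse(points, m, b):
    """Sum of squared errors of the line y = m*x + b over (x, y) points."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)

# Made-up (mother, daughter) height pairs, for illustration only.
points = [(60, 60), (62, 61), (64, 65), (66, 66)]

flat_line = sse(points, 0.0, 63.0)   # a poor candidate: a horizontal line
reg_line = sse(points, 1.2, -12.87)  # the regression line computed above

print(flat_line > reg_line)  # the regression line has the smaller error sum
```

Whichever candidate line yields the smaller sum of squared errors is the better fit.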
If line B has a smaller value than line A, then line B is a better fit, and so on.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1NPq-trtaYvMT6nzkIrkK7g.png\" alt=\"Linear Regression And Gradient Descent For Absolute Beginners\"\/><figcaption>Error and square of error<\/figcaption><\/figure><\/div>\n\n\n\n<p id=\"e369\">Error is your actual value minus your predicted value. The line of best fit minimizes the sum of the squares of all the errors. In linear regression, the line of best fit we computed above using the correlation coefficient also happens to be the line with the least squared error. That\u2019s why the regression line is called the LEAST SQUARES REGRESSION LINE.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>The line of best fit is the least squares regression line<\/p><\/blockquote>\n\n\n\n<p id=\"6faf\">In the image below, line C is a better fit than line B, which is a better fit than line A.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1UePLtYAo0byq4oj6KEadeQ.png\" alt=\"Linear Regression And Gradient Descent For Absolute Beginners\"\/><figcaption>Line C is a better fit than B, which is a better fit than A<\/figcaption><\/figure><\/div>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"ef83\">At a high level, this is how gradient descent works:<\/h4>\n\n\n\n<p id=\"4ab5\">You start with a random line, let\u2019s say line A. You compute the sum of squared errors for that line. Then, you adjust your slope and y-intercept. You compute the sum of squared errors again for your new line. You continue adjusting until you reach a minimum, where the sum of squared errors is smallest and additional tweaks no longer produce a better result (for linear regression, the cost function is convex, so this minimum is also the global one). 
The way you adjust your slope and intercept will be covered in more detail shortly.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>Gradient descent is an algorithm that approaches the least squares regression line by minimizing the sum of squared errors over multiple iterations.<\/p><\/blockquote>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"cf02\">Gradient Descent Algorithm<\/h2>\n\n\n\n<p id=\"d9a8\">In machine learning terminology, the sum of squared errors is called the \u201ccost\u201d. The cost equation is:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1CAQ0BlIIC99rfJ4ePH_GSg.png\" alt=\"Linear Regression And Gradient Descent For Absolute Beginners\"\/><figcaption>cost equation<\/figcaption><\/figure><\/div>\n\n\n\n<p id=\"47cb\">Where:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1xgsItDj0tXu5TJ9mfJzx_Q.png\" alt=\"Linear Regression And Gradient Descent For Absolute Beginners\"\/><\/figure><\/div>\n\n\n\n<p id=\"24c3\">This equation is therefore roughly the \u201csum of squared errors,\u201d as it sums the squared differences between predicted and actual values.<\/p>\n\n\n\n<p id=\"7430\">The&nbsp;<code>1\/2m<\/code>&nbsp;term \u201caverages\u201d the squared error over the number of data points so that the number of data points doesn\u2019t affect the function. See&nbsp;<a href=\"https:\/\/datascience.stackexchange.com\/questions\/52157\/why-do-we-have-to-divide-by-2-in-the-ml-squared-error-cost-function\" rel=\"noopener\">this<\/a>&nbsp;explanation for why we divide by 2.<\/p>\n\n\n\n<p id=\"0d88\">In gradient descent, the goal is to minimize the cost function. We do this by trying different values of slope and intercept. 
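As a rough Python sketch of that cost equation (function and variable names are my own):

```python
def cost(theta0, theta1, xs, ys):
    """J(theta) = 1/(2m) * sum((h(x) - y)^2), where h(x) = theta0 + theta1*x."""
    m = len(xs)  # number of data points
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)
```

A line that passes through every point has a cost of 0; worse lines have larger costs.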
But which values should you try, and how do you go about changing them?<\/p>\n\n\n\n<p id=\"0ff7\">We change their values according to the gradient descent formula, which comes from taking the partial derivative of the cost function. The exact math can be found in this&nbsp;<a href=\"https:\/\/www.ritchieng.com\/one-variable-linear-regression\/\" rel=\"noopener\">link<\/a>.<\/p>\n\n\n\n<p id=\"b16c\">By taking the partial derivative, you arrive at the formula:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1zCNiUQ1UcWaUx8RIAgiZSQ.png\" alt=\"Gradient descent formula by taking partial derivative of the cost function\"\/><figcaption>Gradient descent formula, obtained by taking the partial derivative of the cost function<\/figcaption><\/figure><\/div>\n\n\n\n<p id=\"82ed\">This formula computes by how much you change your theta with each iteration.<\/p>\n\n\n\n<p id=\"515a\">The alpha (\u03b1) is called the learning rate. The learning rate determines how big the step is on each iteration. It\u2019s critical to have a good learning rate: if it\u2019s too large, your algorithm will not arrive at the minimum, and if it\u2019s too small, your algorithm will take forever to get there. For my example, I picked an alpha of 0.001.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"ecbb\">To summarize, the steps are:<\/h4>\n\n\n\n<ol class=\"wp-block-list\"><li>Estimate \u03b8<\/li><li>Compute cost<\/li><li>Tweak \u03b8<\/li><li>Repeat 2 and 3 until you reach convergence.<\/li><\/ol>\n\n\n\n<p id=\"29e0\">Here\u2019s my implementation of simple linear regression using gradient descent.<\/p>\n\n\n\n<p id=\"615a\">I started at 0 for both the slope and the intercept.<\/p>\n\n\n\n<p id=\"b1c7\">Note: In machine learning, we use theta to represent the vector [y-intercept, slope]. Theta0 = y-intercept. Theta1 = slope. 
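The four steps can be sketched as follows. (This is my own minimal version, not the embedded gist; the two gradient expressions come from the partial-derivative formula above.)

```python
def gradient_descent(xs, ys, alpha=0.001, iterations=500):
    """Fit y = theta1*x + theta0 by repeatedly stepping down the cost gradient."""
    theta0, theta1 = 0.0, 0.0  # step 1: initial estimate (the zero line)
    m = len(xs)
    for _ in range(iterations):
        # predicted minus actual for each point (used by steps 2 and 3)
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                             # dJ/d(theta0)
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/d(theta1)
        theta0 -= alpha * grad0  # step 3: tweak theta...
        theta1 -= alpha * grad1  # ...opposite the gradient, scaled by alpha
    return theta0, theta1  # [y-intercept, slope]
```

On a small, exactly linear toy dataset, a suitable alpha and enough iterations drive theta very close to the true line.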
That\u2019s why you see theta as the variable name in the implementation below.<\/p>\n\n\n<div class='gist '><br \/><\/div>\n\n\n<p id=\"c288\">Using this algorithm and the dataset of mothers\u2019 and daughters\u2019 heights above, I got to a cost of 3.4 after 500 iterations.<\/p>\n\n\n\n<p id=\"c31c\">The equation after 500 iterations is&nbsp;<code>y = 0.998x + 0.078<\/code>. The actual regression line is&nbsp;<code>y = 1.2x - 12.87<\/code>, with a cost of approximately 3.1.<\/p>\n\n\n\n<p id=\"697d\">With an estimate of [0, 0] as the initial value for [y-intercept, slope], it\u2019s impractical to get to&nbsp;<code>y = 1.2x - 12.87<\/code>. To get close to that without tons and tons of iterations, you\u2019d have to start with a better estimate.<\/p>\n\n\n\n<p id=\"cbdc\">For example, starting from [-10, 1] gets you roughly&nbsp;<code>y = 1.153x - 10<\/code>&nbsp;and a cost of 3.1 in fewer than 10 iterations.<\/p>\n\n\n\n<p id=\"c863\">Adjusting parameters like the learning rate and the starting estimate is commonplace in the world of machine learning.<\/p>\n\n\n\n<p id=\"ee3c\">There it is, the gist of gradient descent in linear regression.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>Gradient descent is an algorithm that approaches the least squares regression line by minimizing the sum of squared errors over multiple iterations.<\/p><\/blockquote>\n\n\n\n<p id=\"3230\">So far, I\u2019ve talked about simple linear regression, where you only have one independent variable (i.e. one set of&nbsp;<strong>x<\/strong>&nbsp;values). In theory, gradient descent can handle any number of variables.<\/p>\n\n\n\n<p id=\"631a\">I\u2019ve refactored my previous algorithm to handle n dimensions below.<\/p>\n\n\n<div class='gist '><br \/><\/div>\n\n\n<p id=\"6a91\">Everything is the same; the only exception is that instead of using&nbsp;<code>mx + b<\/code>&nbsp;(i.e. 
slope times variable x plus y-intercept) directly to get your prediction, you do a matrix multiplication. See&nbsp;<code>def get_prediction<\/code>&nbsp;in the gist above.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/11bbhoX_XTNfbslS4_dzUug.png\" alt=\"def get_prediction\"\/><figcaption><a href=\"https:\/\/algebra1course.wordpress.com\/2013\/02\/19\/3-matrix-operations-dot-products-and-inverses\/\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/algebra1course.wordpress.com\/2013\/02\/19\/3-matrix-operations-dot-products-and-inverses\/<\/a><\/figcaption><\/figure><\/div>\n\n\n\n<p id=\"36f9\">With the dot product, your algorithm can take in any number of variables to compute a prediction.<\/p>\n\n\n\n<p id=\"e2a6\">Thank you for reading! Comment below if you have questions!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Linear regression is about finding the line of best fit for a dataset. This line can then be used to make predictions. 
Gradient descent is a tool to arrive at the line of best fit.<\/p>\n","protected":false},"author":990,"featured_media":18091,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[656,1079,92],"ppma_author":[3687],"class_list":["post-22479","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-gradient-descent","tag-linear-regression","tag-machine-learning"],"authors":[{"term_id":3687,"user_id":990,"is_guest":0,"slug":"lily","display_name":"Lily Chen","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/Lily-Chen-150x150.jpeg","user_url":"http:\/\/slack.com","last_name":"Chen","first_name":"Lily","job_title":"","description":"Lily Chen is Frontend Infrastructure Engineer at Slack, the leading channel-based messaging platform,"}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22479","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/990"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=22479"}],"version-history":[{"count":1,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22479\/revisions"}],"predecessor-version":[{"id":23178,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22479\/revisions\/23178"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/18091"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=22479"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?
post=22479"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=22479"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=22479"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}