{"id":1555,"date":"2019-03-06T01:15:43","date_gmt":"2019-03-06T01:15:43","guid":{"rendered":"http:\/\/kusuaks7\/?p=1160"},"modified":"2023-08-21T13:44:37","modified_gmt":"2023-08-21T13:44:37","slug":"machine-learning-algorithms-in-laymans-terms-part-1","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/machine-learning-algorithms-in-laymans-terms-part-1\/","title":{"rendered":"Machine Learning Algorithms In Layman\u2019s Terms &#8211; Part 1"},"content":{"rendered":"<p id=\"380d\">This series of posts is me sharing with the world how I would explain all the machine learning topics I come across on a regular basis&#8230;to my grandma. Some get a bit in-depth, others less so, but all I believe are useful to a non-Data Scientist.<\/p>\n<p>The topics in this first part are:<\/p>\n<ul>\n<li id=\"f610\"><a href=\"#9c09\" rel=\"noopener\">Gradient Descent \/ Line of Best Fit<\/a><\/li>\n<li id=\"2886\"><a href=\"#cdab\" rel=\"noopener\">Linear Regression (includes regularization)<\/a><\/li>\n<li id=\"a3b7\"><a href=\"#ef31\" rel=\"noopener\">Logistic Regression<\/a><\/li>\n<\/ul>\n<p id=\"1251\">In the upcoming parts of this series, I\u2019ll be going over:<\/p>\n<ul>\n<li id=\"c9b6\">Decision Trees<\/li>\n<li id=\"fdf4\">Random Forest<\/li>\n<li id=\"41e8\">SVM<\/li>\n<li id=\"668c\">Naive Bayes<\/li>\n<li id=\"e1bd\">RNNs &amp; CNNs<\/li>\n<li id=\"49e9\">K-NN<\/li>\n<li id=\"838f\">K-Means<\/li>\n<li id=\"b411\">DBScan<\/li>\n<li id=\"7d0b\">Hierarchical Clustering<\/li>\n<li id=\"1668\">Agglomerative Clustering<\/li>\n<li id=\"c63b\">eXtreme Gradient Boosting<\/li>\n<li id=\"a8bf\">AdaBoost<\/li>\n<\/ul>\n<p id=\"4a53\">Before we start, a quick aside on the difference(s) between algorithms and models, taken from\u00a0<a 
href=\"https:\/\/www.quora.com\/What-is-the-difference-between-an-algorithm-and-a-model-in-machine-learning\" target=\"_blank\" rel=\"noopener noreferrer\" class=\"broken_link\">this great Quora post<\/a>:<\/p>\n<blockquote id=\"1d4b\"><p>\u201ca model is like a Vending Machine, which given an input (money), will give you some output (a soda can maybe)\u00a0.\u00a0.\u00a0. An\u00a0<strong>algorithm<\/strong>\u00a0is what is used to train a\u00a0<strong>model,\u00a0<\/strong>all the decisions a model is supposed to take based on the given input, to give an expected output. For example, an algorithm will decide based on the dollar value of the money given, and the product you chose, whether the money is enough or not, how much balance you are supposed to get [back], and so on.\u201d<\/p><\/blockquote>\n<p id=\"c8be\">To summarize, an algorithm is the mathematical life force behind a model. What differentiates models are the algorithms they employ, but without a model, an algorithm is just a mathematical equation hanging out with nothing to do.<\/p>\n<p id=\"42c5\">With that, onwards!<\/p>\n<h3 id=\"9c09\"><strong>Gradient Descent \/ Line of Best\u00a0Fit<\/strong><\/h3>\n<p id=\"3758\">(While this first one isn\u2019t traditionally thought of as a machine-learning algorithm, understanding gradient descent is vital<em>\u00a0<\/em>to understanding how many machine learning algorithms work and are optimized.)<\/p>\n<blockquote id=\"8618\"><p>Me-to-grandma:<\/p><\/blockquote>\n<p id=\"90f1\">\u201cBasically, gradient descent helps us get the most accurate predictions based on some data.<\/p>\n<p id=\"9cee\">Let me explain a bit more \u2013 let\u2019s say you have a big list of the height and weight of every person you know. And let\u2019s say you graph that data. 
It would probably look something like this:<\/p>\n<figure id=\"e205\" data-scroll=\"native\"><canvas width=\"75\" height=\"40\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/480\/1*Zv0P0I3MT6DO1mH0zaD-2w.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/480\/1*Zv0P0I3MT6DO1mH0zaD-2w.png\" \/><\/figure>\n<p style=\"text-align: center;\"><span style=\"font-size: 11px;\">Our fake height and weight data set (\u2026strangely geometric)<\/span><\/p>\n<p id=\"d5e4\">Now let\u2019s say there\u2019s a local guessing competition where the person to guess someone\u2019s weight correctly, given their height, gets a cash prize. Besides using your eyes to size the person up, you\u2019d have to rely pretty heavily on the list of heights and weights you have at your disposal, right?<\/p>\n<p id=\"59db\">So, based on the graph of your data above, you could probably make some pretty good predictions if only you had a line on the graph that showed the\u00a0<em>trend<\/em>\u00a0of the data. With such a line, if you were given someone\u2019s height, you could just find that height on the x-axis, go up until you hit your trend line, and then see what the corresponding weight is on the y-axis, right?<\/p>\n<p id=\"44d4\">But how in the world do you find that perfect line? You could probably do it manually, but it would take forever. 
That\u2019s where gradient descent comes in!<\/p>\n<figure id=\"edb4\" data-scroll=\"native\"><canvas width=\"75\" height=\"37\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/480\/1*4M95uZzY5Uk4wH5xeJ01kw.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/480\/1*4M95uZzY5Uk4wH5xeJ01kw.png\" \/><\/figure>\n<p style=\"text-align: center;\"><span style=\"font-size: 11px;\">Our \u201cline of best fit\u201d is in red\u00a0above.<\/span><\/p>\n<p id=\"9295\">It does this by trying to minimize something called RSS (the residual sum of squares), which is basically the sum of the squares of the differences between our dots and our line, i.e. how far away our real data (dots) is from our line (red line). We get a smaller and smaller RSS by changing where our line is on the graph, which makes intuitive sense \u2014 we want our line to be wherever it\u2019s closest to the majority of our dots.<\/p>\n<p id=\"bcdd\">We can actually take this further and graph each different line\u2019s parameters on something called a\u00a0<em>cost curve<\/em>. Using gradient descent, we can get to the bottom of our cost curve. 
At the bottom of our cost curve is our lowest RSS!<\/p>\n<figure id=\"2ec9\"><canvas width=\"75\" height=\"56\"><\/canvas><img decoding=\"async\" style=\"width: 640px; height: 480px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*rCCH2J1JHTdknga6qFpwyA.gif\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*rCCH2J1JHTdknga6qFpwyA.gif\" \/><\/figure>\n<p style=\"text-align: center;\"><span style=\"font-size: 11px;\">Gradient Descent visualized (using <\/span><span style=\"font-size: 11px;\">Matplotlib<\/span><span style=\"font-size: 11px;\">), from the incredible Data Scientist\u00a0<a href=\"https:\/\/www.linkedin.com\/feed\/update\/urn:li:ugcPost:6503460920944099328\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/www.linkedin.com\/feed\/update\/urn:li:ugcPost:6503460920944099328\" data->Bhavesh\u00a0Bhatt<\/a><\/span><\/p>\n<p id=\"1e69\">There are more granular aspects of gradient descent like the \u201cstep size\u201d or \u201clearning rate\u201d (i.e. how quickly we want to approach the bottom of our cost curve, our \u201cskateboard ramp\u201d; the gradient itself tells us which direction to step), but in essence: gradient descent gets our line of best fit by minimizing the distance between our dots and our line. Our line of best fit, in turn, allows us to make predictions!\u201d<\/p>\n<h3 id=\"cdab\">Linear Regression<\/h3>\n<blockquote id=\"cfae\"><p>Me-to-grandma:<\/p><\/blockquote>\n<p id=\"1011\">\u201cSuper simply, linear regression is a way we analyze the strength of the relationship between 1 variable (our \u201coutcome variable\u201d) and 1 or more other variables (our \u201cindependent variables\u201d).<\/p>\n<p id=\"e99f\">A hallmark of linear regression, like the name implies, is that the relationship between the independent variables and our outcome variable is\u00a0<em>linear<\/em>. 
For our purposes, all that means is that when we plot the independent variable(s) against the outcome variable, we can see the points start to take on a line-like shape, like they do below.<\/p>\n<p id=\"8946\">(If you can\u2019t plot your data, a good way to think about linearity is by answering the question: does a certain amount of change in my independent variable(s) result in the same amount of change in my outcome variable? If yes, your data is linear!)<\/p>\n<figure id=\"fd54\" data-scroll=\"native\"><canvas width=\"75\" height=\"50\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/480\/1*efZubzM6zO3byCyiBz-fjg.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/480\/1*efZubzM6zO3byCyiBz-fjg.png\" \/><\/figure>\n<p style=\"text-align: center;\"><span style=\"font-size: 11px;\">This looks a ton like what we did above! That\u2019s because the line of best fit we discussed before IS our \u201cregression line\u201d in linear regression. The line of best fit shows us the best possible linear relationship between our points. That, in turn, allows us to make predictions.<\/span><\/p>\n<p id=\"6ff7\">Another important thing to know about linear regression is that the outcome variable, or the thing that changes depending on how we change our other variables, is always\u00a0<em>continuous<\/em>. But what does that mean?<\/p>\n<p id=\"cac5\">Let\u2019s say we wanted to measure what effect elevation has on rainfall in New York State: our outcome variable (or the variable we care about seeing a change in) would be rainfall, and our independent variable would be elevation. With linear regression, that outcome variable would have to be specifically\u00a0<em>how many inches of rainfall<\/em>, as opposed to just a True\/False category indicating whether or not it rained at\u00a0<em>x<\/em>\u00a0elevation. 
That is because our outcome variable has to be continuous \u2014 meaning that it can be any number (including fractions) in a range of numbers.<\/p>\n<p id=\"4558\">The coolest thing about linear regression is that it can predict things using the line of best fit that we spoke about before! If we run a linear regression analysis on our rainfall vs. elevation scenario above, we can find the line of best fit like we did in the gradient descent section (this time shown in blue), and then we can use that line to make educated guesses as to how much rain one could reasonably expect at some elevation.\u201d<\/p>\n<h4 id=\"80a1\">Ridge &amp; LASSO Regression<\/h4>\n<blockquote id=\"2bfb\"><p>Me, continuing to hopefully-not-too-scared-grandma:<\/p><\/blockquote>\n<p id=\"ad4b\">\u201cSo linear regression\u2019s not that scary, right? It\u2019s just a way to see what effect something has on something else. Cool.<\/p>\n<p id=\"3b8e\">Now that we know about simple linear regression, there are even cooler linear regression-like things we can discuss, like ridge regression.<\/p>\n<p id=\"6adc\">Like gradient descent\u2019s relationship to linear regression, there\u2019s one back-story we need to cover to understand ridge regression, and that\u2019s\u00a0<strong>regularization.<\/strong><\/p>\n<p id=\"fe07\">Simply put, data scientists use regularization methods to make sure that their models only pay attention to independent variables that have a significant impact on their outcome variable.<\/p>\n<p id=\"d144\">You\u2019re probably wondering why we care if our model uses independent variables that don\u2019t have an impact. If they don\u2019t have an impact, wouldn\u2019t our regression just ignore them? The answer is no! We can get more into the details of machine learning later, but basically we create these models by feeding them a bunch of \u201ctraining\u201d data. Then, we see how good our models are by testing them on a bunch of \u201ctest\u201d data. 
So, if we train our model with a bunch of independent variables, with some that matter and some that don\u2019t, our model will perform super well on our training data (because we are tricking it into thinking all of what we fed it matters), but super poorly on our test data. This is because our model isn\u2019t<em>\u00a0flexible<\/em>\u00a0enough to work well on new data that doesn\u2019t have every. single. little. thing we fed it during the training phase. When this happens, we say that the model is \u201coverfit.\u201d<\/p>\n<p id=\"2872\">To understand over-fitting, let\u2019s look at a (lengthy) example:<\/p>\n<blockquote id=\"ac83\"><p>Let\u2019s say you\u2019re a new mother and your baby boy loves pasta. As the months go by, you make it a habit to feed your baby pasta with the kitchen window open because you like the breeze. Then your baby\u2019s cousin gets him a onesie, and you start a tradition of only feeding him pasta when he\u2019s in his special onesie. Then you adopt a dog who diligently sits beneath the baby\u2019s highchair to catch the stray noodles while he\u2019s eating his pasta. At this point, you only feed your baby pasta while he\u2019s wearing the special onesie\u00a0\u2026and the kitchen window\u2019s open\u00a0\u2026and the dog is underneath the highchair. As a new mom you naturally correlate your son\u2019s love of pasta with all of these features: the open kitchen window, the onesie, and the dog. Right now, your mental model of the baby\u2019s feeding habits is pretty complex!<\/p><\/blockquote>\n<blockquote id=\"ae51\"><p>One day, you take a trip to grandma\u2019s. You have to feed your baby dinner (pasta, of course) because you\u2019re staying the weekend. You go into a panic because there is no window in this kitchen, you forgot his onesie at home, and the dog is with the neighbors! 
You freak out so much that you forget all about feeding your baby his dinner and just put him to bed.<\/p><\/blockquote>\n<blockquote id=\"0e13\"><p>Wow. You performed pretty poorly when you were faced with a scenario you hadn\u2019t seen before. At home you were perfect at it, though! It doesn\u2019t make sense!<\/p><\/blockquote>\n<blockquote id=\"8968\"><p>After revisiting your mental model of your baby\u2019s eating habits and disregarding all the \u201cnoise,\u201d or things you think probably don\u2019t contribute to your boy\u00a0<strong>actually<\/strong>\u00a0loving pasta, you realize that the only thing that really matters is that it\u2019s cooked\u00a0<strong>by you.<\/strong><\/p><\/blockquote>\n<blockquote id=\"207b\"><p>The next night at grandma\u2019s you feed him his beloved pasta in her windowless kitchen while he\u2019s wearing just a diaper and there\u2019s no dog to be seen. And everything goes fine! Your idea of why he loves pasta is a lot simpler now.<\/p><\/blockquote>\n<p id=\"f798\">That is exactly what regularization can do for a machine learning model.<\/p>\n<p id=\"3039\">So, regularization helps your model only pay attention to what matters in your data and gets rid of the noise.<\/p>\n<figure id=\"218a\" data-scroll=\"native\"><canvas width=\"75\" height=\"46\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/480\/1*SsY0JE7brR7vwlyQ4iEm1w.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/480\/1*SsY0JE7brR7vwlyQ4iEm1w.png\" \/><\/figure>\n<p style=\"text-align: center;\"><span style=\"font-size: 11px;\">On the left: LASSO regression (you can see that the coefficients, represented by the red rungs, can equal zero when they cross the y-axis). On the right: Ridge regression (you can see that the coefficients approach, but never equal <\/span><span style=\"font-size: 11px;\">zero,<\/span><span style=\"font-size: 11px;\"> because they never cross the y-axis). 
Meta-credit: \u201c<a href=\"https:\/\/towardsdatascience.com\/regularization-in-machine-learning-76441ddcf99a\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/towardsdatascience.com\/regularization-in-machine-learning-76441ddcf99a\" data->Regularization in Machine Learning<\/a>\u201d by\u00a0<a href=\"https:\/\/towardsdatascience.com\/@prashantgupta17\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/towardsdatascience.com\/@prashantgupta17\" data->Prashant\u00a0Gupta<\/a><\/span><\/p>\n<p id=\"1965\">In all types of regularization, there is something called a\u00a0<strong>penalty term<\/strong>\u00a0(the Greek letter lambda: \u03bb). This penalty term is what mathematically shrinks the noise in our data.<\/p>\n<p id=\"00b3\">In ridge regression, sometimes known as \u201cL2 regression,\u201d the penalty term is the sum of the squared value of the coefficients of your variables. (Coefficients in linear regression are basically just numbers attached to each independent variable that tell you how much of an effect each will have on the outcome variable. Sometimes we refer to them as \u201cweights.\u201d) In ridge regression, your penalty term shrinks the coefficients of your independent variables, but never actually does away with them totally. This means that with ridge regression, noise in your data will always be taken into account by your model a\u00a0<em>little<\/em>\u00a0bit.<\/p>\n<p id=\"6bbd\">Another type of regularization is LASSO, or \u201cL1\u201d regularization. In LASSO regularization, instead of penalizing every feature in your data, you only penalize the\u00a0<em>high<\/em>\u00a0coefficient-features. Additionally, LASSO has the ability to shrink coefficients all the way to zero. This essentially deletes those features from your data set because they now have a \u201cweight\u201d of zero (i.e. 
they\u2019re essentially being multiplied by zero). With LASSO regression, your model has the potential to get rid of almost all of the noise in your dataset. This is super helpful in some scenarios!\u201d<\/p>\n<h3 id=\"ef31\">Logistic Regression<\/h3>\n<blockquote id=\"e5a4\"><p>Me-to-grandma:<\/p><\/blockquote>\n<p id=\"5ed9\">\u201cSo, cool, we have linear regression down. Linear regression = what effect some variable(s) has on another variable, assuming that 1) the outcome variable is continuous and 2) the relationship(s) between the variable(s) and the outcome variable is linear.<\/p>\n<p id=\"d094\">But what if your outcome variable is \u201ccategorical\u201d? That\u2019s where logistic regression comes in!<\/p>\n<p id=\"5a18\">Categorical variables are just variables that can only fall within a single category. Good examples are days of the week \u2014if you have a bunch of data points about things that happened on certain days of the week, there is no possibility that you\u2019ll ever get a datapoint that could have happened sometime between Monday and Tuesday. If something happened on Monday, it happened on Monday, end of story.<\/p>\n<p id=\"d18f\">But if we think of how our linear regression model works, how would it be possible for us to figure out a line of best fit for something\u00a0<em>categorical<\/em>? It would be impossible! That is why logistic regression models output a\u00a0<em>probability\u00a0<\/em>of your datapoint being in one category or another, rather than a regular numeric value. 
That\u2019s why logistic regression models are primarily used for\u00a0<strong>classification<\/strong>.<\/p>\n<figure id=\"1876\" data-scroll=\"native\"><canvas width=\"69\" height=\"75\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/480\/1*yVt81wWNhs-uING5oIJcjQ.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/480\/1*yVt81wWNhs-uING5oIJcjQ.png\" \/><\/figure>\n<p style=\"text-align: center;\"><span style=\"font-size: 11px;\">Scary<\/span><span style=\"font-size: 11px;\"> looking graph that\u2019s <\/span><span style=\"font-size: 11px;\">actually<\/span><span style=\"font-size: 11px;\"> super intuitive if you stare at it long enough. From\u00a0<a href=\"https:\/\/www.linkedin.com\/feed\/update\/urn:li:activity:6493249916125663232\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/www.linkedin.com\/feed\/update\/urn:li:activity:6493249916125663232\/\" data->Brandon Rohrer<\/a>\u00a0via LinkedIn.<\/span><\/p>\n<p id=\"d3e7\">But back to both linear regression and logistic regression being \u201clinear.\u201d If we can\u2019t come up with a line of best fit in logistic regression, where does the\u00a0<em>linear\u00a0<\/em>part of logistic regression come in? Well, in the world of logistic regression, the outcome variable has a linear relationship with the\u00a0<em>log-odds<\/em>\u00a0of the independent variables.<\/p>\n<p id=\"1880\">But what in the world are the log-odds? Okay, here we go\u2026<\/p>\n<h4 id=\"13e2\"><strong>Odds<\/strong><\/h4>\n<p id=\"2d10\">The core of logistic regression = odds.<\/p>\n<p id=\"21f8\">Intuitively, odds are something we understand \u2014they are the ratio of the probability of success to the probability of failure. In other words, they are the probability of something happening compared to the probability of something not happening.<\/p>\n<p id=\"b35e\">For a concrete example of odds, we can think of a class of students. 
Let\u2019s say the odds of women passing the test are 5:1, while the odds of men passing the test are 3:10. This means that, of 6 women, 5 are likely to pass the test, and that, of 13 men, 3 are likely to pass the test. The total class size here is 19 students (6 women + 13 men).<\/p>\n<h4 id=\"a0e8\">So\u2026aren\u2019t odds just the same as probability?<\/h4>\n<p id=\"0e61\">Sadly, no! While probability measures the\u00a0<em>ratio of the number of times something happened out of the total number of times everything happened<\/em>\u00a0(e.g. 10 heads out of 30 coin tosses), odds measures the\u00a0<em>ratio of the number of times something happened to the number of times something\u00a0<\/em><strong><em>didn\u2019t<\/em><\/strong><em>\u00a0happen<\/em>\u00a0(e.g. 10 heads to 20 tails).<\/p>\n<p id=\"b383\">That means that while probability will always be confined to a scale of 0\u20131, odds can continuously grow from 0 to positive infinity! This presents a problem for our logistic regression model, because we know that our expected output is a\u00a0<em>probability<\/em>\u00a0(i.e. a number from 0\u20131).<\/p>\n<h4 id=\"7f4f\">So, how do we get from odds to probability?<\/h4>\n<p id=\"366a\">Let\u2019s think of a classification problem\u2026say, predicting whether your favorite soccer team will beat another soccer team. You might say that the odds of your team losing are 1:6, or 0.17. And the odds of your team winning, because they\u2019re a great team, are 6:1, or 6. 
You could represent those odds on a number line like below:<\/p>\n<figure id=\"a53b\"><canvas width=\"75\" height=\"21\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*zm-ra24vMd3DxDqYl8hrWA.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*zm-ra24vMd3DxDqYl8hrWA.png\" \/><\/figure>\n<p style=\"text-align: center;\"><span style=\"font-size: 11px;\"><a href=\"https:\/\/www.youtube.com\/watch?v=ARfXDSkQf1Y\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/www.youtube.com\/watch?v=ARfXDSkQf1Y\" data->https:\/\/www.youtube.com\/watch?v=ARfXDSkQf1Y<\/a><\/span><\/p>\n<p id=\"153d\">Now, you wouldn\u2019t want your model to predict that your team will win on a future game just because the\u00a0<em>magnitude<\/em>\u00a0of the odds of them winning in the past is so much bigger than the\u00a0<em>magnitude\u00a0<\/em>of the odds of them losing in the past, right? There is so much more you want your model to take into account (maybe weather, maybe starting players, etc.)! So, to get the magnitude of the odds to be evenly distributed, or\u00a0<em>symmetrical<\/em>, we calculate something called the\u00a0<em>log-<\/em>odds.<\/p>\n<h4 id=\"7b2e\"><strong>Log-Odds<\/strong><\/h4>\n<figure id=\"ee15\" data-scroll=\"native\"><canvas width=\"75\" height=\"50\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/480\/1*waIWefnFADT5V-w83_o2qg.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/480\/1*waIWefnFADT5V-w83_o2qg.png\" \/><\/figure>\n<p style=\"text-align: center;\"><span style=\"font-size: 11px;\">What we mean by \u201cnormally distributed\u201d: the classic bell-shaped curve!<\/span><\/p>\n<p id=\"f247\">Log-odds is a shorthand way of referring to taking the\u00a0<em>natural logarithm<\/em>\u00a0of the odds. When you take the natural logarithm of something, you basically make it more normally distributed. 
When we make something more normally distributed, we are essentially putting it on a scale that\u2019s super easy to work with.<\/p>\n<p id=\"092f\">When we take the log-odds, we transform the scale of our odds from 0-positive infinity to negative infinity-positive infinity. You can see this well on the bell curve above.<\/p>\n<p id=\"9a6f\">Even though we still need our output to be between 0\u20131, the symmetry we achieve by taking the log-odds gets us closer to the output we want than we were before!<\/p>\n<h4 id=\"d904\">Logit Function<\/h4>\n<p id=\"a465\">The \u201clogit function\u201d is simply the math we do to get the log-odds!<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*JuzEcjLiw2vGeD8RrBk6Eg.png\" \/><\/p>\n<p style=\"text-align: center;\"><span style=\"font-size: 11px;\">Some scary math thingymabob. Er, I mean the logit function.<\/span><\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/0*OG5GqZwC-NPOIWHd\" \/><\/p>\n<p style=\"text-align: center;\"><span style=\"font-size: 11px;\">The logit <\/span><span style=\"font-size: 11px;\">function,<\/span><span style=\"font-size: 11px;\"> graphed<\/span><\/p>\n<p id=\"05fd\">The logit function puts our odds on a scale of negative infinity to positive infinity by taking their natural logarithm, as you can see above.<\/p>\n<h4 id=\"2942\">Sigmoid Function<\/h4>\n<p id=\"5ba7\">Okay, but we\u2019re still not at the point where our model is giving us a probability. Right now, all we have are numbers on a scale of negative infinity to positive infinity. Enter: the sigmoid function.<\/p>\n<p id=\"ce48\">The sigmoid function, named after the s-shape it assumes when graphed, is just the inverse of the logit function. By applying this inverse to our log-odds, we are mapping our values from negative infinity-positive infinity to 0\u20131. 
This, in turn, lets us get probabilities, which are exactly what we want!<\/p>\n<p id=\"f2e0\">As opposed to the graph of the logit function where our y-values range from negative infinity to positive infinity, the graph of our sigmoid function has y-values from 0\u20131. Yay!<\/p>\n<figure id=\"4b3c\"><canvas width=\"75\" height=\"46\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*ZABstiiKHmqTSUo5T9MUbQ.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*ZABstiiKHmqTSUo5T9MUbQ.png\" \/><\/figure>\n<p style=\"text-align: center;\"><span style=\"font-size: 11px;\"><a href=\"https:\/\/hackernoon.com\/introduction-to-machine-learning-algorithms-logistic-regression-cbdd82d81a36\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/hackernoon.com\/introduction-to-machine-learning-algorithms-logistic-regression-cbdd82d81a36\" data->The lovely sigmoid function<\/a>.<\/span><\/p>\n<p id=\"c681\">With this, we can now plug in any x-value and trace it back to its predicted y-value. That y-value will be the\u00a0<em>probability<\/em>\u00a0of that x-value being in one class or another.<\/p>\n<h4 id=\"1050\">Maximum Likelihood Estimation<\/h4>\n<p id=\"ee4f\">\u2026Not finished just yet.<\/p>\n<p id=\"ece6\">You remember how we found the line of best fit during linear regression by minimizing the RSS (a method sometimes called the \u201cordinary least squares,\u201d or OLS, method)? Here, we use something called\u00a0<strong>Maximum Likelihood Estimation<\/strong>\u00a0(MLE) to get our most accurate predictions.<\/p>\n<p id=\"dd7b\"><strong>MLE gets us the most accurate predictions by determining the parameters of the probability distribution that best describe our data.<\/strong><\/p>\n<p id=\"d698\">Why would we care about figuring out the distribution of our data? 
Because it\u2019s cool!\u00a0\u2026But really, it just makes our data easier to work with and makes our model\u00a0<em>generalizable<\/em>\u00a0to lots of different data.<\/p>\n<figure id=\"76e8\"><canvas width=\"75\" height=\"34\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*GQNX5bsiHnYeR9VTpXl3HA.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*GQNX5bsiHnYeR9VTpXl3HA.png\" \/><\/figure>\n<p style=\"text-align: center;\"><span style=\"font-size: 11px;\"><a href=\"https:\/\/www.youtube.com\/watch?v=BfKanl1aSG0\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/www.youtube.com\/watch?v=BfKanl1aSG0\" data->Logistic Regression Details, Part\u00a02<\/a><\/span><\/p>\n<p id=\"31a1\">Super generally, to get the MLE for our data, we take the data points on our s-curve and add up their log-likelihoods. Basically, we want to find the s-curve that maximizes the log-likelihood of our data. We just keep calculating the log-likelihood for every log-odds line (sort of like what we do with the RSS of each line-of-best-fit in linear regression) until we get the largest number we can.<\/p>\n<p id=\"9504\">(As an aside \u2014 we revert back to the world of natural logs because logs are the easiest form of number to work with sometimes. 
This is because logs are \u201cmonotonically increasing\u201d functions, which basically just means that as the input grows, the output only ever grows too; that way, maximizing the log-likelihood is the same as maximizing the likelihood itself.)<\/p>\n<p id=\"b204\">The estimates that we come up with in the MLE process are those that maximize something called the \u201clikelihood function\u201d (which we won\u2019t go into here).\u201d<\/p>\n<figure id=\"424b\"><canvas width=\"69\" height=\"75\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/0*eCDoAxTWHYdh2WWG\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/0*eCDoAxTWHYdh2WWG\" \/><\/figure>\n<p style=\"text-align: center;\"><span style=\"font-size: 11px;\"><a href=\"http:\/\/incolors.club\/collectiongdwn-great-job-funny-meme.htm\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"http:\/\/incolors.club\/collectiongdwn-great-job-funny-meme.htm\" data->http:\/\/incolors.club\/collectiongdwn-great-job-funny-meme.htm<\/a><\/span><\/p>\n<p id=\"a324\">And that\u2019s it! Now you know all about gradient descent, linear regression, and logistic regression.\u201d<\/p>\n<h3 id=\"2263\">Coming Up<\/h3>\n<p id=\"218c\">Coming up on Audrey-explains-machine-learning-algorithms-to-her-grandma: Decision Trees, Random Forest, and SVM. Stay tuned!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A soft skill that keeps coming to the forefront is the ability to explain complex machine learning algorithms to a non-technical person. An algorithm is the mathematical life force behind a model. What differentiates models are the algorithms they employ, but without a model, an algorithm is just a mathematical equation hanging out with nothing to do. 
An&nbsp;algorithm&nbsp;is what is used to train a&nbsp;model,&nbsp;all the decisions a model is supposed to take based on the given input, to give expected output.&nbsp;<\/p>\n","protected":false},"author":489,"featured_media":4048,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[92],"ppma_author":[3116],"class_list":["post-1555","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-machine-learning"],"authors":[{"term_id":3116,"user_id":489,"is_guest":0,"slug":"audrey-sage-lorberfeld","display_name":"Audrey Lorberfeld","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Lorberfeld","first_name":"Audrey","job_title":"","description":"Audrey Sage Lorberfeld is a technical librarian turned data scientist with a passion for democratizing information and making data work for people. She is experienced in data acquisition &amp; modeling, statistical analysis, machine learning (unsupervised, supervised, clustering, time series, NLP), and deep 
learning."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1555","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/489"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1555"}],"version-history":[{"count":3,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1555\/revisions"}],"predecessor-version":[{"id":31012,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1555\/revisions\/31012"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/4048"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1555"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1555"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1555"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1555"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}