{"id":626,"date":"2018-03-26T03:14:36","date_gmt":"2018-03-26T03:14:36","guid":{"rendered":"http:\/\/kusuaks7\/?p=231"},"modified":"2025-05-23T10:48:27","modified_gmt":"2025-05-23T10:48:27","slug":"gradient-descent-algorithm-and-its-variants","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/gradient-descent-algorithm-and-its-variants\/","title":{"rendered":"Gradient Descent Algorithm and Its Variants"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"626\" class=\"elementor elementor-626\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-7c09ac1b elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"7c09ac1b\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-6d38e39e\" data-id=\"6d38e39e\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-7947649f elementor-widget elementor-widget-text-editor\" data-id=\"7947649f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong><em>Ready to learn Machine Learning? <a href=\"https:\/\/www.experfy.com\/training\/courses\">Browse courses<\/a>\u00a0like\u00a0<a href=\"https:\/\/www.experfy.com\/training\/courses\/machine-learning-foundations-supervised-learning\">Machine Learning Foundations: Supervised Learning<\/a> developed by industry thought leaders and Experfy in Harvard Innovation Lab.<\/em><\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-a221d26 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"a221d26\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-b21012e\" data-id=\"b21012e\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-6c1ad90 elementor-widget elementor-widget-text-editor\" data-id=\"6c1ad90\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<center><\/center><strong>Optimization<\/strong>\u00a0refers to the task of minimizing\/maximizing an objective function\u00a0<em>f(x)<\/em>parameterized by\u00a0<em>x<\/em>. In machine\/deep learning terminology, it\u2019s the task of minimizing the cost\/loss function\u00a0<em>J(w)<\/em>\u00a0parameterized by the model\u2019s parameters\u00a0w\u2208Rdw\u2208Rd. 
Optimization algorithms (in the case of minimization) have one of the following goals:

- Find the global minimum of the objective function. This is feasible if the objective function is convex, i.e. any local minimum is a global minimum.
- Find the lowest possible value of the objective function within its neighborhood. This is usually the goal when the objective function is not convex, as is the case in most deep learning problems.

There are three kinds of optimization algorithms:

- An optimization algorithm that is not iterative and simply solves for one point.
- An optimization algorithm that is iterative in nature and converges to an acceptable solution regardless of the parameter initialization, such as gradient descent applied to logistic regression.
- An optimization algorithm that is iterative in nature and applied to a set of problems that have non-convex cost functions, such as neural networks.
  Therefore, parameter initialization plays a critical role in speeding up convergence and achieving lower error rates.

**Gradient descent** is the most common optimization algorithm in *machine learning* and *deep learning*. It is a first-order optimization algorithm, meaning it only takes the first derivative into account when updating the parameters. On each iteration, we update the parameters in the opposite direction of the gradient of the objective function $J(w)$ w.r.t. the parameters, since the gradient gives the direction of steepest ascent. The size of the step we take on each iteration toward the local minimum is determined by the learning rate $\alpha$.
Therefore, we follow the slope downhill until we reach a local minimum.

In this notebook, we'll cover the gradient descent algorithm and its variants: *Batch Gradient Descent, Mini-batch Gradient Descent, and Stochastic Gradient Descent*.

Let's first see how gradient descent and its associated steps work on logistic regression before going into the details of its variants. For the sake of simplicity, let's assume that the logistic regression model has only two parameters: weight $w$ and bias $b$.

1. Initialize weight $w$ and bias $b$ to any random numbers.
2. Pick a value for the learning rate $\alpha$.
   The learning rate determines how big the step will be on each iteration. To choose it, plot the cost function against different values of $\alpha$ and pick the largest value that still converges (the one right before the first value that fails to converge), so we get a fast learning algorithm that still converges.
   - If $\alpha$ is very small, the algorithm takes a long time to converge and becomes computationally expensive.
   - If $\alpha$ is too large, it may overshoot the minimum and fail to converge.
   - The most commonly used rates are *0.001, 0.003, 0.01, 0.03, 0.1, 0.3*.
3. Make sure to scale the data if its features are on very different scales. If we don't scale the data, the level curves (contours) will be narrower and taller, which means gradient descent will take longer to converge (see figure 3). Scale the data to have $\mu = 0$ and $\sigma = 1$.
   Below is the formula for scaling each example:

   $$x_i \leftarrow \frac{x_i - \mu}{\sigma}$$

4. On each iteration, take the partial derivative of the cost function $J(w, b)$ w.r.t. each parameter (the gradient):

   $$\frac{\partial}{\partial w} J(w, b) = \nabla_w J, \qquad \frac{\partial}{\partial b} J(w, b) = \nabla_b J$$

   The update equations are:

   $$w = w - \alpha \nabla_w J, \qquad b = b - \alpha \nabla_b J$$

   *Figure: An illustration of how the gradient descent algorithm uses the first derivative of the loss function to follow it downhill to its minimum.*

5. Continue the process until the cost function converges, that is, until the error curve becomes flat and no longer changes.
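To make steps 3 and 4 concrete, here is a minimal sketch of feature scaling followed by a single gradient computation and update for logistic regression with cross-entropy loss. The toy data, the variable names, and the use of NumPy are assumptions for illustration, not code from the original notebook.

```python
import numpy as np

# Toy dataset: 4 examples, 1 feature (assumed for illustration)
X = np.array([[0.5], [2.0], [-1.0], [3.0]])
y = np.array([0, 1, 0, 1])

# Step 3: scale features to mean 0 and standard deviation 1
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Steps 1-2: initialize parameters and pick a learning rate
w, b = np.random.randn(1), 0.0
alpha = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 4: gradient of the cross-entropy loss and one parameter update
m = len(y)
y_hat = sigmoid(X @ w + b)      # predicted probabilities
grad_w = X.T @ (y_hat - y) / m  # dJ/dw
grad_b = (y_hat - y).mean()     # dJ/db
w = w - alpha * grad_w
b = b - alpha * grad_b
```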
For the sake of illustration, let's assume we don't have a bias term. If the slope at the current value of $w$ is positive, we are to the right of the optimal $w^*$, so the update will be negative and will move us closer to $w^*$. However, if the slope is negative, the update will be positive and will increase the current value of $w$ toward the optimal $w^*$ (see figure 4). In addition, on each iteration the step is taken in the direction of *maximum* change, since the gradient is perpendicular to the level curves at each step.
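As a quick numeric check of this sign behavior, here is a minimal sketch using the illustrative loss $J(w) = (w - w^*)^2$ (this particular quadratic is an assumption chosen for demonstration, not from the original article):

```python
# J(w) = (w - w_star)**2, so dJ/dw = 2 * (w - w_star)
w_star = 2.0
alpha = 0.1

for w in (5.0, -1.0):  # start to the right and to the left of the optimum
    grad = 2 * (w - w_star)
    w_new = w - alpha * grad
    # Right of w*: grad > 0 so the update is negative; left of w*: grad < 0 so it's positive.
    # Either way, w_new is closer to w_star than w was.
    print(f"w = {w:+.1f}, grad = {grad:+.1f}, new w = {w_new:+.2f}")
```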
Now let's discuss the three variants of the gradient descent algorithm. The main difference between them is the amount of data we use when computing the gradients for each learning step. The trade-off between them is the accuracy of the gradient versus the time complexity of each parameter update (learning step).

### Batch Gradient Descent

Batch Gradient Descent sums over all examples on each iteration when performing a parameter update.
Therefore, for each update, we have to sum over all examples:

$$w = w - \alpha \nabla_w J$$

```python
for i in range(num_epochs):
    # one gradient computed over the entire training set per update
    grad = compute_gradient(data, params)
    params = params - learning_rate * grad
```

(A hypothetical implementation of `compute_gradient` is sketched after the pros and cons below.)

The main advantages:

- We can use a fixed learning rate during training without worrying about learning rate decay.
- It has a straight trajectory toward the minimum, and in theory it is guaranteed to converge to the global minimum if the loss function is convex, and to a local minimum if it is not.
- It gives an unbiased estimate of the gradient.
  The more examples we use, the lower the standard error.

The main disadvantages:

- Even with a vectorized implementation, it may be slow to go over all examples, especially on large datasets.
- Each learning step happens only after going over all examples, some of which may be redundant and contribute little to the update.
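The `compute_gradient` helper in the pseudocode above is left abstract. Below is one hypothetical way to fill it in for the two-parameter logistic regression model used earlier; the function signature and data layout are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_gradient(X, y, w, b):
    """Gradient of the cross-entropy loss averaged over all given examples."""
    m = len(y)
    y_hat = sigmoid(X @ w + b)
    return X.T @ (y_hat - y) / m, (y_hat - y).mean()

def batch_gradient_descent(X, y, w, b, learning_rate=0.1, num_epochs=100):
    # Every update uses the full training set (X, y)
    for _ in range(num_epochs):
        grad_w, grad_b = compute_gradient(X, y, w, b)
        w = w - learning_rate * grad_w
        b = b - learning_rate * grad_b
    return w, b
```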
### Mini-Batch Gradient Descent

Instead of going over all examples, Mini-batch Gradient Descent sums over a smaller number of examples given by the batch size. Learning therefore happens on each mini-batch of $b$ examples:

$$w = w - \alpha \nabla_w J(x^{\{i:i+b\}}, y^{\{i:i+b\}}; w, b)$$

- Shuffle the training dataset to avoid a pre-existing order of examples.
- Partition the training dataset into mini-batches of $b$ examples each. If the training set size is not divisible by the batch size, the remaining examples form their own, smaller batch.

```python
import numpy as np

for i in range(num_epochs):
    np.random.shuffle(data)
    # random_minibatches is sketched at the end of this section
    for batch in random_minibatches(data, batch_size=32):
        grad = compute_gradient(batch, params)
        params = params - learning_rate * grad
```

The batch size is something we can tune. It is usually chosen as a power of 2, such as 32, 64, 128, 256, or 512.
The reason is that some hardware, such as GPUs, achieves better runtime with batch sizes that are powers of 2.

The main advantages:

- Faster than the Batch version because each update goes through far fewer examples than the full training set.
- Randomly selecting examples helps avoid redundant or very similar examples that contribute little to the learning.
- With batch size < size of the training set, it adds noise to the learning process that helps improve the generalization error.
- Even though more examples yield an estimate with lower standard error, the return is less than linear compared to the computational burden we incur.

The main disadvantages:

- It won't converge exactly. On each iteration, the learning step may move back and forth due to the noise.
  Therefore, it wanders around the minimum region without ever settling at the minimum.
- Due to the noise, the learning steps have more oscillations (see figure 5), and we need to add learning rate decay to decrease the learning rate as we get closer to the minimum.

With large training datasets, we usually don't need more than 2-10 passes over all training examples (epochs). Note: with batch size $b = m$ (the training set size), we get Batch Gradient Descent.
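The `random_minibatches` helper referenced in the pseudocode above is not defined in the article. A minimal sketch, assuming `data` has already been shuffled and supports slicing (e.g. a NumPy array), could be:

```python
def random_minibatches(data, batch_size=32):
    """Partition already-shuffled data into mini-batches; the last batch may be smaller."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]
```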
### Stochastic Gradient Descent

Instead of going through all examples, Stochastic Gradient Descent (SGD) performs the parameter update on each single example $(x^{(i)}, y^{(i)})$. Learning therefore happens on every example:

$$w = w - \alpha \nabla_w J(x^{(i)}, y^{(i)}; w, b)$$

- Shuffle the training dataset to avoid a pre-existing order of examples.
- Partition the training dataset into its $m$ individual examples.

```python
import numpy as np

for i in range(num_epochs):
    np.random.shuffle(data)
    for example in data:
        grad = compute_gradient(example, params)
        params = params - learning_rate * grad
```

It shares most of its advantages and disadvantages with the mini-batch version. Below are the ones specific to SGD:

- It adds even more noise to the learning process than mini-batch, which helps improve the generalization error; however, it also increases the run time.
- We can't exploit vectorization over a single example, so it becomes very slow.
  Also, the variance of the gradient estimate becomes large since we only use one example for each learning step.

Below is a graph that shows the gradient descent variants and their directions toward the minimum:

As the figure above shows, the SGD direction is very noisy compared to mini-batch.
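To see this noise concretely, here is a small sketch comparing the spread of single-example gradient estimates against 32-example mini-batch estimates. The squared-error loss and synthetic data are assumptions chosen purely for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
w_true, w = 3.0, 0.0
x = rng.normal(size=9984)  # 312 * 32 examples
y = w_true * x + rng.normal(scale=0.5, size=x.shape)

# Per-example gradient of (w*x - y)**2 with respect to w
per_example_grads = 2 * (w * x - y) * x

# Mini-batch estimates average 32 per-example gradients each
batch_grads = per_example_grads.reshape(-1, 32).mean(axis=1)

print("std of single-example (SGD) gradients:", per_example_grads.std())
print("std of 32-example mini-batch gradients:", batch_grads.std())
# The mini-batch spread is roughly sqrt(32) times smaller
```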
## Challenges

Below are some challenges regarding the gradient descent algorithm in general, as well as its variants, mainly batch and mini-batch:

- Gradient descent is a first-order optimization algorithm, which means it doesn't take the second derivatives of the cost function into account. However, the curvature of the function affects the size of each learning step: the gradient measures the steepness of the curve, while the second derivative measures its curvature. Therefore, if:
  - Second derivative = 0: the curvature is linear, so the step size equals the learning rate $\alpha$.
  - Second derivative > 0: the curve bends upward, so the effective step size is smaller than the learning rate $\alpha$, and the update may lead to divergence.
  - Second derivative < 0: the curve bends downward, so the effective step size is larger than the learning rate $\alpha$.

  As a result, a direction that looks promising to the gradient may not actually be so, and may slow the learning process or even cause divergence.
- If the Hessian matrix has a poor condition number, i.e. the direction of most curvature has much more curvature than the direction of least curvature, the cost function will be very sensitive in some directions and insensitive in others.
  As a result, this makes things harder for the gradient: a direction that looks promising to the gradient may not lead to big changes in the cost function (see figure 7).
- The norm of the gradient, $g^T g$, is supposed to decrease slowly with each learning step, because the curve gets flatter and its steepness decreases. However, in practice the norm of the gradient often increases because of the curvature of the curve. Nonetheless, even though the gradient norm is increasing, we're able to achieve very low error rates (see figure 8).
- In small dimensions, local minima are common; in large dimensions, however, saddle points are more common. A saddle point is a point where the function curves up in some directions and down in others; in other words, it looks like a minimum from one direction and a maximum from another (see figure 9). This happens when at least one eigenvalue of the Hessian matrix is negative and the rest are positive.
- As discussed previously, choosing a proper learning rate is hard. Also, for mini-batch gradient descent, we have to adjust the learning rate during training to make sure it converges to the local minimum rather than wandering around it. Figuring out the decay rate of the learning rate is also hard and varies across datasets (a simple decay schedule is sketched after this list).
- All parameter updates use the same learning rate; however, we may want to perform larger updates for parameters whose directional derivatives are more in line with the trajectory toward the minimum than others.
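For the learning-rate scheduling point above, one common approach is inverse-time decay. The schedule below is a standard illustrative sketch, not a prescription from this article:

```python
def decayed_learning_rate(alpha0, decay_rate, epoch):
    """Inverse-time decay: alpha_t = alpha0 / (1 + decay_rate * epoch)."""
    return alpha0 / (1.0 + decay_rate * epoch)

# Example: the learning rate shrinks from 0.1 toward 0 as training progresses
for epoch in (0, 10, 100):
    print(epoch, decayed_learning_rate(0.1, 0.05, epoch))
```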