{"id":909,"date":"2018-10-01T03:17:08","date_gmt":"2018-10-01T03:17:08","guid":{"rendered":"http:\/\/kusuaks7\/?p=514"},"modified":"2023-07-25T16:45:49","modified_gmt":"2023-07-25T16:45:49","slug":"the-secrets-behind-reinforcement-learning","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/the-secrets-behind-reinforcement-learning\/","title":{"rendered":"The secrets behind Reinforcement Learning"},"content":{"rendered":"<p><strong><em>Ready to learn Machine Learning? Browse<\/em><\/strong> <strong><em><a href=\"https:\/\/www.experfy.com\/training\/tracks\/machine-learning-training-certification\">Machine Learning Training and Certification courses<\/a> developed by industry thought leaders and Experfy in Harvard Innovation Lab.<\/em><\/strong><\/p>\n<p id=\"7b02\">Bots that play Dota2, AI that beat the best Go players in the world, computers that excel at Doom. What\u2019s going on? Is there a reason why the AI community has been so busy playing games?<\/p>\n<p id=\"1508\">Let me put it that way. If you want a robot to learn how to walk what do you do? You build one, program it and release it on the streets of New York? Of course not. You build a simulation, a game, and you use that virtual space to teach it how to move around it. Zero cost, zero risks. That\u2019s why games are so useful in research areas. But how do you teach it to walk? The answer is the topic of today\u2019s article and is probably the most exciting field of Machine learning at the time:<\/p>\n<p id=\"7ceb\">You probably knew that there are two types of machine learning. Supervised and unsupervised. Well, there is a third one, called Reinforcement Learning. RL is arguably the most difficult area of ML to understand cause there are so many, many things going on at the same time. I\u2019ll try to simplify as much as I can because it is a really astonishing area and you should definitely know about it. But let me warn you. 
It involves complex thinking and 100% focus to grasp it. And some math. So, take a deep breath and let\u2019s dive in:<\/p>\n<h3 id=\"8507\">Markov Decision Processes<\/h3>\n<p id=\"6184\">Reinforcement learning is a trial-and-error process in which an AI (<strong>agent<\/strong>) performs a number of\u00a0<strong>actions<\/strong>\u00a0in an\u00a0<strong>environment<\/strong>. At each moment the agent is in a\u00a0<strong>state<\/strong>\u00a0and, by acting, moves from this state to a new one. This particular action may or may not yield a\u00a0<strong>reward.<\/strong>\u00a0Therefore, we can say that each learning epoch (or episode) can be represented as a sequence of states, actions, and rewards. Each state depends only on the previous state and action, and as the environment is inherently stochastic (we don\u2019t know which state comes next), this process satisfies the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Markov_property\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/en.wikipedia.org\/wiki\/Markov_property\" data->Markov property<\/a>. The Markov property says that the conditional probability distribution of future states of the process depends only upon the present state, not on the sequence of events that preceded it. The whole process is referred to as\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Markov_decision_process\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/en.wikipedia.org\/wiki\/Markov_decision_process\" data->a Markov Decision Process<\/a>.
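To make the agent-environment loop concrete, here is a minimal Python sketch. The ToyEnv corridor world below is entirely invented for illustration; any real environment would expose the same kind of step interface, mapping (state, action) to (next state, reward):

```python
import random

class ToyEnv:
    """A made-up environment: the agent walks a corridor from cell 0 to cell 4.
    It stands in for any MDP: step() maps (state, action) to (next_state, reward)."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # action: +1 (move right) or -1 (move left); reward only on reaching the goal
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

def run_episode(env, policy, max_steps=50):
    """One episode: the sequence of (state, action, reward) tuples described above."""
    trajectory = []
    state, done, steps = env.state, False, 0
    while not done and steps < max_steps:
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        steps += 1
    return trajectory

random.seed(0)
episode = run_episode(ToyEnv(), policy=lambda s: random.choice([-1, 1]))
```

A policy here is just a function from state to action; the random lambda is the "no learning yet" baseline that RL algorithms improve on.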
The Markov Decision Process is the main mathematical tool we use to frame almost any RL problem in a way that is easy to study and to experiment with different solutions.<\/p>\n<figure id=\"abf0\"><canvas width=\"75\" height=\"31\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/0*gxPFMVL-oVYahJMm.jpg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/0*gxPFMVL-oVYahJMm.jpg\" \/><\/figure>\n<p id=\"7ada\">Let\u2019s see a real example using Super Mario. In this case:<\/p>\n<ul>\n<li id=\"7554\">The agent is, of course, the beloved Mario<\/li>\n<li id=\"15dd\">The state is the current situation (let\u2019s say the frame of our screen)<\/li>\n<li id=\"8d45\">The actions are: move left, move right and jump<\/li>\n<li id=\"7298\">The environment is the virtual world of each level<\/li>\n<li id=\"3a3e\">And the reward is whether Mario is alive or dead.<\/li>\n<\/ul>\n<p id=\"211f\">Ok, we have properly defined the problem. What\u2019s next? We need a solution. But first, we need a way to evaluate how good a solution is.<\/p>\n<p id=\"3618\">What I am saying is that the reward at each individual step is not enough. Imagine a Mario game where Mario is controlled by an agent. He is constantly receiving positive rewards throughout the level, but just before the final flag he is killed by a Hammer Bro (I hate those guys). You see that each individual reward is not enough for us to win the game. We need a reward that captures the whole level. This is where the\u00a0<strong>discounted cumulative expected reward<\/strong>\u00a0comes into play.<\/p>\n<figure id=\"62ed\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/0*FFY9ji8x_LIgZii0.jpg\" data-height=\"73\" data-image-id=\"0*FFY9ji8x_LIgZii0.jpg\" data-width=\"278\" \/><\/figure>\n<p id=\"9328\">It is nothing more than the sum of all rewards discounted by a factor gamma, where gamma belongs to [0,1).
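In code, the discounted cumulative reward is a one-liner; the reward lists below are made up for illustration:

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
    with gamma in [0, 1): the later a reward arrives, the less it counts."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# The same reward is worth less the later it arrives:
early = discounted_return([1.0, 0.0, 0.0])  # 1.0
late = discounted_return([0.0, 0.0, 1.0])   # 0.9 squared, roughly 0.81
```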
The discount is essential because a reward received early is worth more than the same reward received later. And it makes perfect sense.<\/p>\n<p id=\"55f9\">The next step is to solve the problem. To do that, we define the goal of the learning task: the agent needs to learn which action to perform from a given state so that it maximizes the cumulative reward over time. In other words, to learn the\u00a0<strong>Policy \u03c0: S-&gt;A.<\/strong>\u00a0The policy is just a mapping from a state to an action.<\/p>\n<p id=\"c17a\">To sum up all the above, we use the following equation:<\/p>\n<figure id=\"4701\"><canvas width=\"75\" height=\"25\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/0*dguPqFh3pw1zgZrB.jpg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/0*dguPqFh3pw1zgZrB.jpg\" \/><\/figure>\n<p id=\"266d\">where V (<strong>Value<\/strong>) is the\u00a0<strong>expected<\/strong>\u00a0long-term reward achieved by a policy (\u03c0) from a state (s).<\/p>\n<p id=\"4c94\">Still with me? If you are, let\u2019s pause for 5 seconds, because this was a bit overwhelming:<\/p>\n<p id=\"0163\">1\u20262\u20263\u20264\u20265<\/p>\n<p id=\"1d66\">Now that we have regained our mental clarity, let\u2019s recap. We have the definition of the problem as a Markov Decision Process and we have our goal: to learn the best policy or the best value. How do we proceed?<\/p>\n<p id=\"f111\">We need an algorithm (Thank you, Sherlock\u2026)<\/p>\n<p id=\"4a20\">Well, there is an abundance of RL algorithms developed over the years. Each algorithm focuses on a different thing: whether to maximize the value, the policy or both; whether to use a model (e.g. a neural network) to simulate the environment or not; whether to capture the reward at each step or only at the end.
As you guessed, it is not very easy to categorize all those algorithms into classes, but that\u2019s what I am about to do.<\/p>\n<p id=\"3039\">As you can see, we can classify RL algorithms into two big categories: model-based and model-free:<\/p>\n<h3 id=\"0923\">Model-based<\/h3>\n<p id=\"0559\">These algorithms aim to learn how the environment works (its dynamics) from observations and then plan a solution using that model. Once they have a model, they use some planning method to find the best policy. They are known to be data-efficient, but they fail when the state space is too large. Try to build a model-based algorithm to play Go. Not gonna happen.<\/p>\n<p id=\"41bc\"><a href=\"https:\/\/en.wikipedia.org\/wiki\/Dynamic_programming\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/en.wikipedia.org\/wiki\/Dynamic_programming\" data->Dynamic programming<\/a>\u00a0methods are an example of model-based methods, as they require complete knowledge of the environment, such as transition probabilities and rewards.<\/p>\n<h3 id=\"a1f3\">Model-free<\/h3>\n<p id=\"bf9d\">Model-free algorithms do not need to learn a model of the environment or store all the combinations of states and actions. They can be divided into two categories, based on the ultimate goal of the training.<\/p>\n<p id=\"c9d2\"><strong>Policy-based<\/strong>\u00a0methods try to find the optimal policy, whether it\u2019s stochastic or deterministic. Algorithms like policy gradients and REINFORCE belong in this category. Their advantages are better convergence and effectiveness in high-dimensional or continuous action spaces.<\/p>\n<p id=\"b928\">Policy-based methods are essentially an optimization problem, where we find the maximum of a policy function. That\u2019s why we also use algorithms like evolution strategies and hill climbing.<\/p>\n<p id=\"2c8b\"><strong>Value-based<\/strong>\u00a0methods, on the other hand, try to find the optimal value.
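To make "value" concrete, here is a hypothetical sketch that estimates the value of a policy from a state by averaging the discounted returns of many rollouts (Monte Carlo estimation). The two-state chain world is invented purely for the example:

```python
import random

def rollout_return(step_fn, state, policy, gamma=0.9, max_steps=100):
    """Discounted return of a single rollout that follows the policy."""
    ret, discount = 0.0, 1.0
    for _ in range(max_steps):
        state, reward, done = step_fn(state, policy(state))
        ret += discount * reward
        discount *= gamma
        if done:
            break
    return ret

def estimate_value(step_fn, state, policy, gamma=0.9, n_rollouts=2000):
    """Monte Carlo estimate of V_pi(s): the expected discounted return,
    approximated by averaging over many independent rollouts."""
    return sum(rollout_return(step_fn, state, policy, gamma)
               for _ in range(n_rollouts)) / n_rollouts

# Invented chain world: the "go" action reaches the goal with probability 0.5.
def chain_step(state, action):
    if random.random() < 0.5:
        return "goal", 1.0, True
    return state, 0.0, False

random.seed(0)
v = estimate_value(chain_step, "start", policy=lambda s: "go")
# Analytically V = 0.5 / (1 - 0.45), which is about 0.909
```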
A big part of this category is a family of algorithms called\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Q-learning\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/en.wikipedia.org\/wiki\/Q-learning\" data->Q-learning<\/a>, which learn to optimize the Q-value. I plan to analyze Q-learning thoroughly in a future article, because it is an essential aspect of Reinforcement Learning. Other algorithms include SARSA and value iteration.<\/p>\n<p id=\"38bd\">At the intersection of policy-based and value-based methods, we find the\u00a0<strong>Actor-Critic<\/strong>\u00a0methods, where the goal is to optimize both the policy and the value function.<\/p>\n<p id=\"36dc\">And now to the cool part. In the past few years, there has been a new kid in town. And it was inevitable that it would affect and enhance all the existing methods for solving Reinforcement Learning. I am sure you guessed it: Deep Learning. And thus, we have a new term to represent all those new research ideas.<\/p>\n<figure id=\"5317\"><canvas width=\"75\" height=\"43\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/0*XsUtFVAdl-8kz5Pu.jpg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/0*XsUtFVAdl-8kz5Pu.jpg\" \/><\/figure>\n<p id=\"c3d2\">Deep neural networks have been used to model the dynamics of the environment (model-based), to enhance policy searches (policy-based) and to approximate the value function (value-based). Research on the last one (which is my favorite) has produced a model called\u00a0<a href=\"https:\/\/deepmind.com\/research\/dqn\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/deepmind.com\/research\/dqn\/\" data->Deep Q Network<\/a>, which is responsible, along with its many improvements, for some of the most astonishing breakthroughs in the area (take Atari, for example).
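To give a flavour of the Q-learning family, here is a minimal tabular sketch on an invented five-cell corridor; all parameters and the world itself are made up for illustration. A DQN essentially replaces this lookup table with a deep neural network:

```python
import random
from collections import defaultdict

def q_learning(n_episodes=500, alpha=0.5, gamma=0.9, epsilon=0.3):
    """Tabular Q-learning on a toy corridor: start at cell 0, reward 1.0 at cell 4.
    Core update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    q = defaultdict(float)   # Q-table keyed by (state, action), defaults to 0
    actions = [-1, 1]        # move left / move right
    for _ in range(n_episodes):
        state = 0
        for _ in range(100):                     # step cap per episode
            if random.random() < epsilon:        # epsilon-greedy exploration
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            next_state = max(0, min(4, state + action))
            reward = 1.0 if next_state == 4 else 0.0
            best_next = max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
            if state == 4:
                break
    return q

random.seed(0)
q = q_learning()
# After training, the greedy action in every non-goal cell should be "right" (+1)
greedy = [max([-1, 1], key=lambda a: q[(s, a)]) for s in range(4)]
```

Note how the agent learns only from (state, action, reward, next state) transitions, with no model of the corridor itself: that is exactly what makes this a model-free, value-based method.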
And to excite you even more: we don\u2019t just use simple neural networks, but Convolutional and Recurrent networks as well.<\/p>\n<p id=\"f30b\">Ok, I think that\u2019s enough for a first contact with Reinforcement Learning. I just wanted to give you the basics behind the whole idea and present an overview of all the important techniques developed over the years. But also to give you a hint of what\u2019s next for the field.<\/p>\n<p id=\"618f\">Reinforcement learning has applications both in industry and in research. To name a few, it has been used for: robotics control, optimizing chemical reactions, recommendation systems, advertising, product design, supply chain optimization and stock trading. I could go on forever.<\/p>\n<p id=\"51fd\">It\u2019s probably the most exciting area of AI right now and, in my opinion, it has every right to be.<\/p>\n<p id=\"1299\">This is the first post in a long series of posts, where we are going to unveil the secrets of Reinforcement Learning and try to explain both the intuition and the math behind all the different algorithms. The main focus will be how Deep Learning is used to greatly enhance the existing techniques and how it has led to revolutionary results in just a few years.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>You probably knew that there are two types of machine learning: supervised and unsupervised. Well, there is a third one, called Reinforcement Learning. RL is arguably the most difficult area of ML to understand, because there are so many things going on at the same time. It is a really astonishing area and you should definitely know about it.
It involves complex thinking and 100% focus to grasp it, and some math.&nbsp;<\/p>\n","protected":false},"author":356,"featured_media":3105,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[97],"ppma_author":[2101],"class_list":["post-909","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-artificial-intelligence"],"authors":[{"term_id":2101,"user_id":356,"is_guest":0,"slug":"sergios-karagiannakos","display_name":"Sergios Karagiannakos","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Karagiannakos","first_name":"Sergios","job_title":"","description":"<a href=\"https:\/\/sergioskar.github.io\">Sergios <\/a><a href=\"https:\/\/sergioskar.github.io\">Karagiannakos<\/a>&nbsp;is Data Scientist at EWORX S.A. where he designs recommendation systems leveraging Natural Language Processing and Data-centric Web Backends.&nbsp;He builds Artificial Intelligence software and Machine Learning 
applications."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/909","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/356"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=909"}],"version-history":[{"count":2,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/909\/revisions"}],"predecessor-version":[{"id":29576,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/909\/revisions\/29576"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/3105"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=909"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=909"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=909"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=909"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}