{"id":10054,"date":"2020-10-05T12:32:41","date_gmt":"2020-10-05T12:32:41","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/?p=10054"},"modified":"2023-10-25T09:17:25","modified_gmt":"2023-10-25T09:17:25","slug":"introduction-to-reinforcement-learning","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/introduction-to-reinforcement-learning\/","title":{"rendered":"Introduction to Reinforcement Learning"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"10054\" class=\"elementor elementor-10054\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-39a8f4a9 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"39a8f4a9\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-7ed93872\" data-id=\"7ed93872\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-2d430b6d elementor-widget elementor-widget-text-editor\" data-id=\"2d430b6d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p class=\"has-text-align-right has-small-font-size\">This article appeared in <a href=\"https:\/\/towardsdatascience.com\/introduction-to-reinforcement-learning-c99c8c0720ef\" target=\"_blank\" rel=\"noreferrer noopener\">Towards Data Science<\/a><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f315d90 elementor-widget elementor-widget-heading\" 
data-id=\"f315d90\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">A high-level structural overview of classical Reinforcement Learning algorithms<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-881e146 elementor-widget elementor-widget-text-editor\" data-id=\"881e146\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"4390\">Reinforcement Learning (RL) is a growing subset of Machine Learning and one of the most important frontiers of Artificial Intelligence: in recent years it has gained great popularity thanks to many successful real-world applications in robotics, games and many other fields. It denotes a set of algorithms that handle sequential decision-making and are able to make intelligent decisions depending on their local environment.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-88aeb1e elementor-widget elementor-widget-text-editor\" data-id=\"88aeb1e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"3269\">An RL algorithm can be described as a model that tells an agent which set of actions it should take within a closed environment in order to maximize a predefined overall reward. Generally speaking, the agent tries different sets of actions and evaluates the total return obtained. After many trials, the algorithm learns which actions give a greater reward and establishes a pattern of behavior. 
Thanks to this, it is able to tell the agent which actions to take in every condition.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4d744c9 elementor-widget elementor-widget-text-editor\" data-id=\"4d744c9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe goal of RL is to capture more complex structures and use more adaptable algorithms than classical Machine Learning; in fact, RL algorithms are more dynamic in their behavior than classical Machine Learning ones.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b3015ee elementor-widget elementor-widget-heading\" data-id=\"b3015ee\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Applications<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a80b76a elementor-widget elementor-widget-text-editor\" data-id=\"a80b76a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"9856\">Let\u2019s see some examples of applications based on RL:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>Robotics<\/em>\u00a0&#8211; RL can be used for high-dimensional control problems and for various industrial applications.<\/li>\n<li><em>Text mining\u00a0<\/em>&#8211; RL, along with a text generation model, can be used to develop a system that is able to produce highly readable summaries of long texts.<\/li>\n<li><em>Trade execution<\/em>\u00a0&#8211; Major companies in the financial industry use RL algorithms to improve their trading 
strategy.<\/li>\n<li><em>Healthcare<\/em>\u00a0&#8211; RL is useful for medication dosing, for the optimization of treatment for people suffering from chronic diseases, for clinical trials, etc.<\/li>\n<li><em>Games<\/em>\u00a0&#8211; RL is famous for being the main approach used to solve different games and to achieve superhuman performance.<\/li>\n<\/ul>\n\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1c72aa0 elementor-widget elementor-widget-heading\" data-id=\"1c72aa0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Actors<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7429218 elementor-widget elementor-widget-text-editor\" data-id=\"7429218\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"7d31\">RL algorithms are based on the Markov Decision Process (MDP). A Markov Decision Process is a discrete-time stochastic control process used for decision making. 
The main actors of an RL algorithm are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>Agent<\/em>: an entity which performs actions in an environment in order to optimize a long-term reward;<\/li>\n<li><em>Environment<\/em>: the scenario in which the agent takes decisions;<\/li>\n<li><em>Set of states<\/em>\u00a0(<em>S<\/em>): the set of all the possible states s of the environment, where the state describes the current situation of the environment;<\/li>\n<li><em>Set of actions\u00a0<\/em>(<em>A<\/em>): the set of all the possible actions a that can be performed by the agent;<\/li>\n<li><em>State transition model P (s_0|s , a)<\/em>: describes the probability that the environment state changes to\u00a0<em>s_0<\/em>\u00a0when the agent performs the action a at state\u00a0<em>s<\/em>, for all states\u00a0<em>s<\/em>,\u00a0<em>s_0<\/em>\u00a0and every action\u00a0<em>a<\/em>;<\/li>\n<li><em>Reward (r = R(s , a))<\/em>: a function that gives the immediate real-valued reward for taking action a at state s;<\/li>\n<li><em>Episode (rollout)<\/em>: a sequence of states\u00a0<em>s_t<\/em>\u00a0and actions\u00a0<em>a_t<\/em>\u00a0for\u00a0<em>t<\/em>\u00a0ranging from\u00a0<em>0<\/em>\u00a0to a final value\u00a0<em>L<\/em>\u00a0(which is called the horizon and can possibly be infinite); the agent starts in a given state of its environment; at each timestep\u00a0<em>t<\/em>\u00a0the agent observes the current state\u00a0<em>s_t \u2208 S<\/em>\u00a0and consequently takes an action\u00a0<em>a_t \u2208 A<\/em>; the state evolves into a new state\u00a0<em>s_(t+1)<\/em>, which depends only on the state\u00a0<em>s_t<\/em>\u00a0and on the action\u00a0<em>a_t<\/em>, according to the state transition model; the agent obtains a reward\u00a0<em>r_t<\/em>; then the agent observes the new state\u00a0<em>s_(t+1)\u2208 S<\/em>\u00a0and the loop restarts;<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-03471cd elementor-widget 
elementor-widget-image\" data-id=\"03471cd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1205\/1*PNvu5kOpkPf5-EwSoRh6_w.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-64461a3 elementor-widget elementor-widget-text-editor\" data-id=\"64461a3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<ul class=\"wp-block-list\">\n<li><em>Policy function<\/em>: a policy can be deterministic (<em>\u03c0 (s)<\/em>) or stochastic (<em>\u03c0 (a|s)<\/em>): a deterministic policy \u03c0 (s) indicates the action a performed by the agent when the environment is in the state s (<em>a = \u03c0 (s)<\/em>); a stochastic policy\u00a0<em>\u03c0 (a|s)<\/em>\u00a0is a function that describes the probability that action\u00a0<em>a<\/em>\u00a0is performed by the agent when the environment is in the state\u00a0<em>s<\/em>. 
Once the policy is specified, the new state depends only on the policy and on the state transition model;<\/li>\n<li><em>Return G_t<\/em>\u00a0: the total discounted long-term reward obtained by the end of the episode, computed from the immediate reward of the current timestep and of every following timestep, and from the discount factor \u03b3 &lt; 1:<\/li>\n<\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2c5d2a3 elementor-widget elementor-widget-image\" data-id=\"2c5d2a3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/398\/1*giupuOoXCMShk71ejr8qzQ.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a3d88f7 elementor-widget elementor-widget-text-editor\" data-id=\"a3d88f7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<ul class=\"wp-block-list\">\n<li>Value function\u00a0<em>V(s)<\/em>: the expected long-term return at the end of the episode, starting from state s at the current timestep\u00a0<em>t<\/em>:<\/li>\n<\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-189d09d elementor-widget elementor-widget-image\" data-id=\"189d09d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/626\/1*S0PivY-7AnjGWRrVPLUV2g.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-28025c7 elementor-widget 
elementor-widget-text-editor\" data-id=\"28025c7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<ul class=\"wp-block-list\">\n<li>Q-Value or Action-Value function\u00a0<em>Q(s , a)<\/em>: the expected long-term return at the end of the episode, starting from state\u00a0<em>s<\/em>\u00a0at the current timestep, and performing action\u00a0<em>a<\/em>;<\/li>\n<li><em>The Bellman equation<\/em>: the theoretical core of most RL algorithms; according to it, the current value function is equal to the current reward plus itself evaluated at the next step and discounted by \u03b3 (we recall that in the equation\u00a0<em>P<\/em>\u00a0is the state transition model):<\/li>\n<\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0f97a47 elementor-widget elementor-widget-image\" data-id=\"0f97a47\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/491\/1*ikm9T8t7ENdw5LR6BspWqA.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5c9f7f2 elementor-widget elementor-widget-heading\" data-id=\"5c9f7f2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Optimal policy<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7070259 elementor-widget elementor-widget-text-editor\" data-id=\"7070259\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe maximum of the action value function over all policies is referred to as the optimal action value function\u00a0<em>Q*(s , a)<\/em>, and according to the Bellman equation it is given by\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a7b84d1 elementor-widget elementor-widget-image\" data-id=\"a7b84d1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/513\/1*cuzPRgqfIAgfhcTgAE1REw.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d4b3345 elementor-widget elementor-widget-text-editor\" data-id=\"d4b3345\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThen the optimal policy\u00a0<em>\u03c0*(s)<\/em>\u00a0is given by the action that maximizes the optimal action value function:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f6f8733 elementor-widget elementor-widget-image\" data-id=\"f6f8733\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/272\/1*lWnmMkJS89gHLiTDFXbVYw.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fc20561 elementor-widget elementor-widget-text-editor\" data-id=\"fc20561\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><span style=\"font-size: 
19px;\">The problem is that in most real cases the state transition model and the reward function are unknown, so they must be learned from samples in order to estimate the optimal action value function and the best policy. This is why RL algorithms are used: they take actions in the environment, observe and learn the dynamics of the model, estimate the optimal value function and the optimal policy, and improve the rewards.<\/span><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-64f36fd elementor-widget elementor-widget-heading\" data-id=\"64f36fd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Exploration-exploitation dilemma<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0622153 elementor-widget elementor-widget-text-editor\" data-id=\"0622153\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tExploration means trying actions that yield new data points, while exploitation means using the data captured so far. If we always pick the currently best action at every iteration, we might remain stuck in a limited set of states without being able to explore the entire environment. 
To get out of this suboptimal set, a strategy called \u03f5-greedy is generally used: when we select the best action, there is a small probability \u03f5 that a random action is chosen instead.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d23d576 elementor-widget elementor-widget-image\" data-id=\"d23d576\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/491\/1*tkj2ypKLvpdqIJGZ-EbdeA.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0bf62f6 elementor-widget elementor-widget-heading\" data-id=\"0bf62f6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Approaches<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-84c1704 elementor-widget elementor-widget-text-editor\" data-id=\"84c1704\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"357d\">There are 3 main approaches that we can use when we implement an\u00a0<em>RL\u00a0<\/em>algorithm:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>Value-based methods &#8211;<\/em>\u00a0A Value-based algorithm approximates the optimal value function, or the optimal action value function, by continuously improving its estimate. Usually the value function or the action value function is initialized randomly and then updated repeatedly until it converges. 
For finite MDPs with exact (tabular) updates, a Value-based algorithm such as value iteration is guaranteed to converge to the optimal values.<\/li>\n<li><em>Policy-based methods &#8211;\u00a0<\/em>A Policy-based algorithm looks for a policy such that the action performed at each state is optimal to gain maximum reward in the future. It redefines the policy at each step and computes the value function according to this new policy until the policy converges. In the same setting, a Policy-based method is also guaranteed to converge to the optimal policy, and often takes fewer iterations to converge than Value-based algorithms.<\/li>\n<li><em>Model-based methods &#8211;\u00a0<\/em>A Model-based algorithm learns a virtual model starting from the original environment, and the agent learns how to perform in the virtual model. It uses a reduced number of interactions with the real environment during the learning phase, then it builds a new model based on these interactions, uses this model to simulate further episodes, and gets the results returned by the virtual model.<\/li>\n<\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b22aa87 elementor-widget elementor-widget-heading\" data-id=\"b22aa87\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Value-based methods<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2b5b703 elementor-widget elementor-widget-heading\" data-id=\"2b5b703\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">Value Function 
Approximation<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-53c2059 elementor-widget elementor-widget-text-editor\" data-id=\"53c2059\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tValue Function Approximation is one of the most classical Value-based methods. Its goal is to estimate the optimal policy\u00a0<em>\u03c0*(s)<\/em>\u00a0by iteratively approximating the optimal action value function\u00a0<em>Q*(s , a)<\/em>. We start by considering a parametric action value function\u00a0<em>Q^(s , a , w)<\/em>, where\u00a0<em>w\u00a0<\/em>is a vector of parameters. We randomly initialize the vector\u00a0<em>w<\/em>\u00a0and we iterate on every step of every episode. For every iteration, given the state\u00a0<em>s<\/em>\u00a0and the action\u00a0<em>a<\/em>, we observe the reward\u00a0<em>R(s , a)<\/em>\u00a0and the new state\u00a0<em>s\u2019<\/em>. According to the obtained reward we update the parameters using gradient descent:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0c9a57c elementor-widget elementor-widget-image\" data-id=\"0c9a57c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/491\/1*tkj2ypKLvpdqIJGZ-EbdeA.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-dc67a78 elementor-widget elementor-widget-text-editor\" data-id=\"dc67a78\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn the equation, \u03b1 is the learning rate. 
It can be shown that, under suitable conditions, this process converges, and the obtained action value function is our approximation of the optimal action value function. In most real cases the best choice for the parametric action value function\u00a0<em>Q^(s , a , w)<\/em>\u00a0is a Neural Network, and then the vector of parameters\u00a0<em>w<\/em>\u00a0is given by the vector of the weights of the Neural Network.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-6115867 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"6115867\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-63d0237\" data-id=\"63d0237\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-bf372e5 elementor-widget elementor-widget-text-editor\" data-id=\"bf372e5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<em>Value Function Approximation algorithm:<\/em>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-da80681 elementor-widget elementor-widget-image\" data-id=\"da80681\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/801\/1*T1vPl02AtIRMX_E0m5j0pA.png\" alt=\"\" 
\/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0dce071 elementor-widget elementor-widget-heading\" data-id=\"0dce071\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Deep Q-Networks<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-949ad4b elementor-widget elementor-widget-text-editor\" data-id=\"949ad4b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>A Deep Q-Network is a combination of Deep Learning and RL, since it is a Value Function Approximation algorithm where the parametric action value function\u00a0<em>Q^(s , a , w)<\/em>\u00a0is a Deep Neural Network, and in particular a Convolutional Neural Network. Moreover, a Deep Q-Network overcomes unstable learning mainly through 2 techniques:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>Target Network &#8211;\u00a0<\/em>The model updates could be very unstable since the real target changes each time the model updates itself. The solution is to create a Target Network\u00a0<em>Q^(s\u2019,a\u2019,w\u2019)<\/em>, which is a copy of the training model that is updated less frequently, for example every few thousand steps (we denote by\u00a0<em>w\u2019<\/em>\u00a0the weights of the Target Network). 
In every gradient descent update of the model, the Target Network is used as the target in place of the model itself:<\/li>\n<\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0570ab3 elementor-widget elementor-widget-image\" data-id=\"0570ab3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/580\/1*SqXbUwGgxXQPyhxKudP7cA.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-757ea1e elementor-widget elementor-widget-text-editor\" data-id=\"757ea1e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<ul class=\"wp-block-list\">\n<li><em>Experience Replay &#8211;<\/em>\u00a0In the described algorithm several consecutive updates are performed using data from the same episode, and this can cause overfitting. To solve this, an Experience Replay buffer is created that stores the four-tuples (<em>s<\/em>,<em>a<\/em>,<em>r<\/em>,<em>s\u2019<\/em>) of all the different episodes, and a batch of tuples is randomly selected from it each time the model is updated. 
This solution has 3 advantages: it reduces overfitting, increases learning speed with mini-batches, and reuses past tuples to avoid forgetting.<\/li>\n<\/ul>\n\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ce059a0 elementor-widget elementor-widget-heading\" data-id=\"ce059a0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">Fitted Q-Iteration<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-83cc107 elementor-widget elementor-widget-text-editor\" data-id=\"83cc107\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAnother popular Value-based algorithm is Fitted Q-Iteration. Consider the deterministic case, in which the new state\u00a0<em>s\u2019<\/em>\u00a0is uniquely determined by the state s and the action a according to some function\u00a0<em>f<\/em>,\u00a0so that we can write\u00a0<em>s\u2019 = f (s , a)<\/em>. Let\u00a0<em>L\u00a0<\/em>be the horizon, possibly infinite; recall that the horizon is the length of the episodes. The goal of this algorithm is to estimate the optimal action value function. 
By the Bellman equation, the optimal action value function\u00a0<em>Q*(s , a)\u00a0<\/em>can be seen as the application of an operator\u00a0<em>H<\/em>\u00a0to the action value function\u00a0<em>Q(s , a)<\/em>:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ebe6061 elementor-widget elementor-widget-image\" data-id=\"ebe6061\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/510\/1*wI3qM9CVPcF1nxDW-lgUXA.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-85b77cf elementor-widget elementor-widget-text-editor\" data-id=\"85b77cf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"fe60\">Consider now a temporal horizon\u00a0<em>N<\/em>\u00a0less than or equal to the horizon\u00a0<em>L<\/em>, and denote by\u00a0<em>Q_N (s , a)<\/em>\u00a0the action value function over\u00a0<em>N<\/em>\u00a0steps defined by the application of the just defined operator\u00a0<em>H<\/em>\u00a0to the action value function\u00a0<em>Q_(N\u22121) (s , a)<\/em>, with<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-578511f elementor-widget elementor-widget-image\" data-id=\"578511f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/337\/1*BfpWAD6KPrsxEuCGwzomrg.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-439934c elementor-widget 
elementor-widget-text-editor\" data-id=\"439934c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"c27d\">It is possible to show that this sequence of\u00a0<em>N<\/em>-step action value functions\u00a0<em>Q_N (s , a)<\/em>\u00a0converges to the optimal action value function\u00a0<em>Q*(s , a)<\/em>\u00a0as\u00a0<em>N \u2192 L<\/em>. Thanks to this, it\u2019s possible to build an algorithm to approximate the optimal action value function\u00a0<em>Q*(s , a)<\/em>\u00a0iterating on\u00a0<em>N<\/em>.<\/p>\n\n\n\n<p id=\"62d9\"><em>Fitted Q-Iteration algorithm:<\/em><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d3acf0b elementor-widget elementor-widget-text-editor\" data-id=\"d3acf0b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"266f\">A full implementation of Fitted Q-Iteration can be found on GitHub<br \/>(<a href=\"https:\/\/github.com\/teopir\/ifqi\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/github.com\/teopir\/ifqi<\/a>).<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5da62e8 elementor-widget elementor-widget-heading\" data-id=\"5da62e8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><em>Fitted Q-Iteration algorithm:<\/em><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1290853 elementor-widget elementor-widget-text-editor\" data-id=\"1290853\" data-element_type=\"widget\" data-e-type=\"widget\" 
data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"f459\">Consider a car, modeled by a point mass, that is traveling on a hill with the following shape:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8dc16f8 elementor-widget elementor-widget-image\" data-id=\"8dc16f8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1832\/1*sYHxxc_JvDAFchg87DwbRg.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-192c893 elementor-widget elementor-widget-text-editor\" data-id=\"192c893\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"7f5a\">The goal of the control problem is to bring the car to the top of the hill in minimum time, while preventing the position\u00a0<em>p<\/em>\u00a0of the car from dropping below\u00a0<em>-1<\/em>\u00a0and its speed\u00a0<em>v<\/em>\u00a0from leaving the interval\u00a0<em>[-3 , 3]<\/em>. 
The top of the hill is reached at position\u00a0<em>p = 1<\/em>.<\/p>\n\n\n\n<p id=\"2201\"><em><strong>State space<\/strong> &#8211;\u00a0<\/em>This problem has a (continuous) state space of dimension two (the position\u00a0<em>p<\/em>\u00a0and the speed\u00a0<em>v<\/em>\u00a0of the car), and we require that the absolute value of the position be less than or equal to\u00a0<em>1<\/em>, and that the absolute value of the speed be less than or equal to\u00a0<em>3<\/em>:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-67a50fc elementor-widget elementor-widget-image\" data-id=\"67a50fc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/411\/1*Iv0mddJ5S-I0pUe5S6pRBw.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3ac74ac elementor-widget elementor-widget-text-editor\" data-id=\"3ac74ac\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"2236\">Every other combination of position and speed is considered a terminal state.<\/p>\n\n\n\n<p id=\"86ea\"><em><strong>Action space<\/strong> &#8211;<\/em>\u00a0The action\u00a0<em>a<\/em>\u00a0acts directly on the acceleration of the car and can<br \/>only assume two extreme values: full acceleration (<em>a<\/em>\u00a0= 4) or full deceleration (<em>a<\/em>\u00a0= -4). 
Hence the action space is given by the set<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-621157e elementor-widget elementor-widget-image\" data-id=\"621157e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/131\/1*S_1VLB9RicGLU5v3kGM9Gg.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-24f3926 elementor-widget elementor-widget-text-editor\" data-id=\"24f3926\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"4c8e\"><em>System dynamics &#8211;\u00a0<\/em>Time is discretized in timesteps of\u00a0<em>0.1<\/em>\u00a0seconds. Given the state (<em>p<\/em>,\u00a0<em>v<\/em>) and the action\u00a0<em>a<\/em>\u00a0at timestep\u00a0<em>t<\/em>, we can compute the state (<em>p<\/em>,\u00a0<em>v<\/em>) at timestep\u00a0<em>t + 1<\/em>\u00a0by numerically solving the two differential equations for position and speed that describe the dynamics of the system:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b0d46bd elementor-widget elementor-widget-image\" data-id=\"b0d46bd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/714\/1*jdUC0BL3ExnnycMtUNLurw.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e2c6d84 elementor-widget elementor-widget-text-editor\" data-id=\"e2c6d84\" data-element_type=\"widget\" 
data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"d484\">Of course, for our purpose it\u2019s not important to understand the meaning of these equations; what matters is that, given the state and action at timestep\u00a0<em>t<\/em>, the state at timestep\u00a0<em>t + 1<\/em>\u00a0is uniquely determined.<\/p>\n\n\n\n<p id=\"c866\"><em><strong>Reward function<\/strong> &#8211;<\/em>\u00a0The reward function\u00a0<em>r(s , a)<\/em>\u00a0is defined through this expression:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e74121e elementor-widget elementor-widget-image\" data-id=\"e74121e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/488\/1*iY2T1xTlVQfD58TFiCQZgA.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-66e9a3e elementor-widget elementor-widget-text-editor\" data-id=\"66e9a3e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"e260\">The reward is\u00a0<em>-1<\/em>\u00a0if the position is less than\u00a0<em>-1<\/em>\u00a0or if the absolute value of the speed is greater than\u00a0<em>3<\/em>, because we reached a terminal state without reaching the top of the hill; the reward is\u00a0<em>1<\/em>\u00a0if the position is greater than 1 and the absolute value of the speed is less than 3, because we reached the top of the hill while respecting the speed limits; otherwise the reward is\u00a0<em>0<\/em>.<\/p>\n\n\n\n<p id=\"06ef\"><em><strong>Discount factor <\/strong>&#8211;<\/em>\u00a0The discount factor \u03b3 
has been chosen equal to 0.95.<\/p>\n\n\n\n<p id=\"85ee\"><em><strong>Initial point<\/strong><\/em>\u00a0&#8211; At the start the car is stopped at the bottom of the hill: (<em>p<\/em>\u00a0,\u00a0<em>v<\/em>) = (<em>0.5<\/em>,\u00a0<em>0<\/em>).<\/p>\n\n\n\n<p id=\"7712\"><em><strong>Regressor<\/strong> &#8211;\u00a0<\/em>The regressor used is an Extra Tree Regressor.<\/p>\n\n\n\n<p id=\"dbdf\">Running Fitted Q-Iteration for\u00a0<em>N = 1<\/em>\u00a0to\u00a0<em>50<\/em>, it turns out that for\u00a0<em>N &gt; 20<\/em>\u00a0the mean squared error between the action value functions\u00a0<em>Q^_N (s , a)<\/em>\u00a0and\u00a0<em>Q^_(N+1)(s , a)<\/em>\u00a0(computed on all the combinations of (<em>p<\/em>,\u00a0<em>v<\/em>)) decreases quickly to\u00a0<em>0<\/em>\u00a0as\u00a0<em>N<\/em>\u00a0increases. For this reason the results are studied using the action value function\u00a0<em>Q^_20(s , a)<\/em>.<\/p>\n\n\n\n<p id=\"e6f8\">In the figure on the left we can see the action chosen for every combination of<br \/>(<em>p<\/em>,\u00a0<em>v<\/em>), according to the action value function\u00a0<em>Q^_20(s , a)<\/em>\u00a0(the red area represents deceleration, the green area represents acceleration, and the blue area means that the action values of deceleration and acceleration are equal).<\/p>\n\n\n\n<p id=\"87df\">The optimal trajectory according to the action value function\u00a0<em>Q^_20(s , a)<\/em>\u00a0is represented in the figure on the right.<\/p>\n\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4a66cd8 elementor-widget elementor-widget-image\" data-id=\"4a66cd8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1296\/1*sFiwGqV42VKozJmGrX2XMQ.png\" alt=\"\" 
\/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-48e6f70 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"48e6f70\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-d11006c\" data-id=\"d11006c\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-7dea8e2 elementor-widget elementor-widget-heading\" data-id=\"7dea8e2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Policy-based Methods<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f93f77d elementor-widget elementor-widget-heading\" data-id=\"f93f77d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">Policy Gradient<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b7b5cbc elementor-widget elementor-widget-text-editor\" data-id=\"b7b5cbc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"b266\">Policy Gradient is the most classical Policy-based method. 
The goal of the Policy Gradient method is to find the vector of parameters\u00a0<em>\u03b8<\/em>\u00a0that maximizes the value function\u00a0<em>V (s , \u03b8)<\/em>\u00a0under a parametric policy\u00a0<em>\u03c0 (a|s , \u03b8)<\/em>.<\/p>\n\n\n\n<p id=\"4b93\">We start by considering a parametric policy\u00a0<em>\u03c0 (a|s , \u03b8)<\/em>\u00a0that is differentiable with respect to the vector of parameters\u00a0<em>\u03b8<\/em>; in particular, here we choose a stochastic policy (the method is then called Stochastic Policy Gradient; the case of a deterministic policy is very similar).<\/p>\n\n\n\n<p id=\"c234\">We randomly initialize the vector\u00a0<em>\u03b8<\/em>\u00a0and iterate over episodes. In each episode we generate a sequence of triplets (<em>s<\/em>,\u00a0<em>a<\/em>,\u00a0<em>r<\/em>), one for each timestep\u00a0<em>t<\/em>, choosing the action according to the parametric policy\u00a0<em>\u03c0 (a|s , \u03b8)<\/em>. For every timestep in the resulting sequence we compute the total discounted long-term reward\u00a0<em>G_t<\/em>\u00a0as a function of the obtained rewards:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-db89dd3 elementor-widget elementor-widget-image\" data-id=\"db89dd3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/199\/1*2EMC6GT-866ELwSNpeMXXg.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d7a5aa4 elementor-widget elementor-widget-text-editor\" data-id=\"d7a5aa4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"8a79\">Then the vector of parameters\u00a0<em>\u03b8_t<\/em>\u00a0is modified using 
a gradient update process<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9d3ebdd elementor-widget elementor-widget-image\" data-id=\"9d3ebdd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/507\/1*EFni7L0louDiwYgkDZ5LRw.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c6d703b elementor-widget elementor-widget-text-editor\" data-id=\"c6d703b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"22d8\">In the equation, \u03b1 &gt; 0 is the learning rate.<\/p>\n\n\n\n<p id=\"3700\">It can be shown that this process converges, and the resulting policy is our approximation of the optimal policy.<\/p>\n\n\n\n<p id=\"580d\"><em>Policy Gradient algorithm:<\/em><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d21abfd elementor-widget elementor-widget-image\" data-id=\"d21abfd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/778\/1*-T8GjUiWkSDzHE8i-jc49Q.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b772006 elementor-widget elementor-widget-heading\" data-id=\"b772006\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">\n<h3 class=\"wp-block-heading\" 
id=\"ea2c\">Examples of parametric policies<\/h3>\n<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fdbe7e3 elementor-widget elementor-widget-text-editor\" data-id=\"fdbe7e3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n\n<p id=\"2a4d\">The most used parametric policies are the Softmax Policy and the Gaussian Policy<em>.<\/em><\/p>\n\n\n\n<p id=\"69a2\"><em>Softmax Policy<\/em><br \/>The Softmax Policy consists of a softmax function that converts its output into a<br \/>probability distribution, and is mostly used in the case of discrete actions:<\/p>\n\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-389bc8e elementor-widget elementor-widget-image\" data-id=\"389bc8e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/302\/1*Kl2jiPDSade0MuSShHii-w.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8f8ba47 elementor-widget elementor-widget-text-editor\" data-id=\"8f8ba47\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"6a4f\">In this case the explicit formula for the gradient update is given by<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-34cc04b elementor-widget elementor-widget-image\" data-id=\"34cc04b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" 
src=\"https:\/\/miro.medium.com\/max\/389\/1*_IUaauYYydzLl5HSvkuHFQ.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1fcb918 elementor-widget elementor-widget-text-editor\" data-id=\"1fcb918\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"a429\">where\u00a0<em>\u03c6(s , a)<\/em>\u00a0is the feature vector related to the state and the action.<\/p>\n\n\n\n<p id=\"776c\"><em>Gaussian Policy<\/em><br \/>The Gaussian Policy is used in the case of a continuous action space, and is given by the Gaussian function<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6463ae9 elementor-widget elementor-widget-image\" data-id=\"6463ae9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/274\/1*XInVLcdX30U2hyLETsMllw.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6d21e10 elementor-widget elementor-widget-text-editor\" data-id=\"6d21e10\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"1056\">where\u00a0<em>\u00b5(s)<\/em>\u00a0is given by\u00a0<em>\u03c6(s)^T \u03b8<\/em>,\u00a0<em>\u03c6(s)<\/em>\u00a0is the feature vector of the state, and\u00a0<em>\u03c3<\/em>\u00a0can be fixed or parametric. 
Also in this case we have the explicit formula for the gradient update:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-3162212 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"3162212\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-936a236\" data-id=\"936a236\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5617c97 elementor-widget elementor-widget-image\" data-id=\"5617c97\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/328\/1*XLufifJpoSFHQ_x0vON8Sg.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f05d606 elementor-widget elementor-widget-heading\" data-id=\"f05d606\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Advantages and Disadvantages of Policy Gradient<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-506891e elementor-widget elementor-widget-heading\" data-id=\"506891e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">Advantages<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e398dc8 elementor-widget elementor-widget-text-editor\" data-id=\"e398dc8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<ul class=\"wp-block-list\">\n<li>A Policy Gradient method is a simpler process compared with value-based<br \/>methods.<\/li>\n<li>It allows the action to be continuous with respect to the state.<\/li>\n<li>It usually has better convergence properties than other methods.<\/li>\n<li>It avoids the growth in memory usage and computation time when the action and state sets are large, because the goal is to learn a set of parameters whose size is much smaller than that of the set of states and the set of actions.<\/li>\n<li>It can learn stochastic policies.<\/li>\n<li>It allows the use of the \u03f5-greedy method, so that the agent can have a probability \u03f5 of taking random actions.<\/li>\n<\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3001eb0 elementor-widget elementor-widget-heading\" data-id=\"3001eb0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">Disadvantages<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-147a718 elementor-widget elementor-widget-text-editor\" data-id=\"147a718\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<ul class=\"wp-block-list\">\n<li>A Policy Gradient method typically 
converges to a local rather than global<br \/>optimum.<\/li>\n<li>It usually has high variance (which, however, can be reduced with some techniques).<\/li>\n<\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4f433ad elementor-widget elementor-widget-heading\" data-id=\"4f433ad\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">Example of Policy Gradient application: CartPole<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9308e99 elementor-widget elementor-widget-text-editor\" data-id=\"9308e99\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n\n<p id=\"5148\">CartPole is a game where a pole is attached by an unactuated joint to a cart, which moves along a frictionless track. 
The pole starts upright.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-970a4c9 elementor-widget elementor-widget-image\" data-id=\"970a4c9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1152\/1*6gtbpi9ZZfxTQJlo0aiZ7w.jpeg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e836872 elementor-widget elementor-widget-text-editor\" data-id=\"e836872\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n\n<p id=\"6332\">The goal is to prevent the pole from falling over by increasing and reducing the cart\u2019s velocity.<\/p>\n\n\n\n<p id=\"1ef8\"><em>State space &#8211;<\/em>\u00a0A single state is composed of 4 elements:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>cart position<\/li>\n<li>cart velocity<\/li>\n<li>pole angle<\/li>\n<li>pole angular velocity<\/li>\n<\/ul>\n\n\n\n<p id=\"2541\">The game ends when the pole falls, that is when the absolute value of the pole angle is more than\u00a0<em>12\u00b0<\/em>, or when the cart position reaches the edge of the display.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fe69a0f elementor-widget elementor-widget-text-editor\" data-id=\"fe69a0f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"08f2\"><em>Action space &#8211;\u00a0<\/em>The agent can take only 2 actions:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>move the cart to the left<\/li>\n<li>move the cart to the right<\/li>\n<\/ul>\n\n\n\n<p id=\"1286\"><em>Reward 
&#8211;\u00a0<\/em>For every step taken (including the termination step), the reward is increased by\u00a0<em>1<\/em>. This is because we want to achieve the greatest possible number of steps.<\/p>\n\n\n\n<p id=\"0dd3\">The problem is solved with the Policy Gradient method using a Softmax Policy, with discount factor\u00a0<em>\u03b3 = 0.95<\/em>\u00a0and learning rate\u00a0<em>\u03b1 = 0.1<\/em>. For every episode a maximum number of 1000 iterations is fixed.<\/p>\n\n\n\n<p id=\"6a90\">After about 60 epochs (where 1 epoch is equal to 20 consecutive episodes) the agent learns a policy that yields a reward equal to 1000, which means that the pole doesn\u2019t fall for all the 1000 steps of the episode.<\/p>\n\n\n\n<p id=\"625a\">In these figures we can see how the choice of the action varies as a function of the pole angle and the cart velocity (left figure) and as a function of the pole angular velocity and the cart velocity (right figure). The red area is where the move left action is chosen, the green area is where the move right action is chosen, and the yellow area is where the two actions have similar probabilities of being chosen.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-295df47 elementor-widget elementor-widget-image\" data-id=\"295df47\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/825\/1*4buqt3ph7ASdttAeoS2OTw.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c433e86 elementor-widget elementor-widget-text-editor\" data-id=\"c433e86\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"bf7c\">A 
very interesting result is that if\u00a0<em>\u03b3<\/em>\u00a0is greater than 0.9 the reward of a single episode grows with the number of epochs and reaches the maximum value of 1000, while if\u00a0<em>\u03b3<\/em>\u00a0is lower than 0.9, after some epochs the reward of a single episode stops growing. This means that in this problem the reward of the next steps is very important to find the best policy, and this is actually reasonable since the fundamental information to learn how to prevent the pole from falling is to know after how many steps it falls in each single episode.<\/p>\n\n\n\n<p id=\"2b7c\">On GitHub it\u2019s possible to find many different implementations of this example.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c495162 elementor-widget elementor-widget-heading\" data-id=\"c495162\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">Actor-Critic Method<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2a98185 elementor-widget elementor-widget-text-editor\" data-id=\"2a98185\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"acb1\">Another popular policy-based method is Actor-Critic. It differs from the Policy Gradient method because it estimates both the policy and the value function, and updates both.<\/p>\n\n\n\n<p id=\"7db2\">In Policy Gradient, the vector of parameters\u00a0<em>\u03b8<\/em>\u00a0is updated using the long term reward\u00a0<em>G_t<\/em>, but this estimation often has high variance. 
To address this issue and reduce the wide changes in the results, the idea of the Actor-Critic method is to subtract from the total reward with discount\u00a0<em>G_t<\/em>\u00a0a baseline\u00a0<em>b(s)<\/em>.<\/p>\n\n\n\n<p id=\"aab6\">The obtained value\u00a0<em>\u03b4 = G_t \u2212 b(s)<\/em>, which is called the Temporal Difference error, is used to update the vector of parameters\u00a0<em>\u03b8<\/em>\u00a0in place of the long term reward\u00a0<em>G_t<\/em>. The baseline can take several forms, but the most used is the estimation of the value function\u00a0<em>V(s)<\/em>.<\/p>\n\n\n\n<p id=\"7bbc\">As in value-based methods, the value function\u00a0<em>V(s)<\/em>\u00a0can be learned with a Neural Network, whose output is the approximated value function\u00a0<em>V^(s , w)<\/em>, where\u00a0<em>w<\/em>\u00a0is the vector of weights. Then in every iteration the Temporal Difference error\u00a0<em>\u03b4<\/em>\u00a0is used not only to adjust the vector of parameters\u00a0<em>\u03b8<\/em>, but also to update the vector of weights\u00a0<em>w<\/em>.<\/p>\n\n\n\n<p id=\"d93c\">This method is called Actor-Critic because:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The Critic estimates the value function\u00a0<em>V(s)<\/em>.<\/li>\n<li>The Actor updates the policy distribution in the direction suggested by the Critic (as in policy gradient methods).<\/li>\n<\/ul>\n\n\n\n<p id=\"4f6c\"><em>Actor-Critic algorithm:<\/em><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-998cb2e elementor-widget elementor-widget-image\" data-id=\"998cb2e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/738\/1*PY9CKX6IUbKik9rzULBvKQ.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element 
elementor-element-acb018b elementor-widget elementor-widget-heading\" data-id=\"acb018b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Model-based Method<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b753e8e elementor-widget elementor-widget-text-editor\" data-id=\"b753e8e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"5b2c\">As already underlined, a Model-based method creates a virtual model starting from the original environment, and the agent learns how to perform in the virtual model. A Model-based method starts from a base parametric model, and then runs the following 3 steps:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><em>Acting<\/em>: the base policy\u00a0<em>\u03c0_0(a_t|s_t)<\/em>\u00a0is used to select the actions to perform in the real environment, in order to collect a set of observations given by the triplets (state, action, new state);<\/li>\n<li><em>Model learning<\/em>: from the collected experience, a new model\u00a0<em>m(s , a)<\/em>\u00a0is deduced in order to minimize the least square error between the model\u2019s new state and the real new state; a supervised learning algorithm can be used to train a model to minimize the least square error from the sampled trajectory;<\/li>\n<li><em>Planning<\/em>: the value function and the policy are updated according to the new model, in order to be used to select the actions to perform in the real environment in the next iteration.<\/li>\n<\/ol>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a639b37 elementor-widget elementor-widget-text-editor\" data-id=\"a639b37\" data-element_type=\"widget\" 
data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"cae2\">One of the most widely used models for representing the system dynamics is the Gaussian Process, in which the prediction interpolates the observations using a Gaussian distribution. Another possibility is the Gaussian Mixture Model, a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. It can be seen as a generalization of\u00a0<em>k<\/em>-means clustering that incorporates information about the covariance structure of the data as well as the centers of the latent Gaussians.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p id=\"e9eb\"><em>Model-based method sample algorithm:<\/em><\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3d8e525 elementor-widget elementor-widget-image\" data-id=\"3d8e525\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/721\/1*WhE3LBdKuhwWyYcgmeqWpQ.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-dda4122 elementor-widget elementor-widget-heading\" data-id=\"dda4122\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">Model Predictive Control<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ef5572e elementor-widget elementor-widget-text-editor\" data-id=\"ef5572e\" data-element_type=\"widget\" data-e-type=\"widget\" 
data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p id=\"4dac\">Model Predictive Control (MPC) is an evolution of the method just described. The Model-based algorithm described above is vulnerable to drifting: small errors accumulate quickly along the trajectory, and the search space is too big for any base policy to cover it entirely. For these reasons, the trajectory may reach areas where the model has not been learned yet. Without a proper model around these areas, it\u2019s impossible to plan the optimal control.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p id=\"804d\">To address this, instead of learning the model only at the beginning, sampling and model fitting are performed continuously along the trajectory. Nevertheless, the previous method executes all planned actions before fitting the model again.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p id=\"44c4\">In Model Predictive Control, the whole trajectory is optimized, but only the first action is performed; then the new triplet (<em>s<\/em>,\u00a0<em>a<\/em>,\u00a0<em>s\u2019<\/em>) is added to the observations and the planning is done again. This makes it possible to take corrective action once the current state has been observed again. For a stochastic model, this is particularly helpful.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p id=\"5507\">By constantly re-planning, MPC is less vulnerable to problems in the model. The new algorithm then runs 5 steps, of which the first 3 are the same as in the previous algorithm (acting, model learning, planning). 
The full sequence is:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list {\"ordered\":true} -->\n<ol>\n<li><em>Acting<\/em><\/li>\n<li><em>Model learning<\/em><\/li>\n<li><em>Planning<\/em><\/li>\n<li><em>Execution<\/em>: the first planned action is performed, and the resulting state\u00a0<em>s\u2019<\/em>\u00a0is observed;<\/li>\n<li><em>Dataset update<\/em>: the new triplet (<em>s<\/em>,<em>a<\/em>,<em>s\u2019<\/em>) is appended to the dataset; then go back to step 3, and every\u00a0<em>N<\/em>\u00a0iterations go back to step 2 (that is, planning is performed at every step, while the model is refitted every\u00a0<em>N<\/em>\u00a0steps of the trajectory).<\/li>\n<\/ol>\n<!-- \/wp:list -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0381166 elementor-widget elementor-widget-heading\" data-id=\"0381166\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">Model-based Methods Advantages and Disadvantages<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-cec9324 elementor-widget elementor-widget-text-editor\" data-id=\"cec9324\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p id=\"2fce\">Model-based RL has the strong advantage of being sample-efficient, since many models behave linearly, at least locally.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p id=\"dfae\">Once the model and the reward function are known, the planning of the optimal controls doesn\u2019t require additional sampling. 
Generally, the learning phase is fast, since there is no need to wait for the environment to respond or to reset the environment to some state in order to resume learning.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p id=\"11d4\">On the downside, if the model is inaccurate, we risk learning something completely different from reality. Another point worth noting is that Model-based algorithms still use Model-free methods, either to construct the model or in the planning and simulation phases.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-38fc07e elementor-widget elementor-widget-heading\" data-id=\"38fc07e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Conclusions<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5281054 elementor-widget elementor-widget-text-editor\" data-id=\"5281054\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p id=\"debb\">This article is a high-level structural overview of many classical RL algorithms. However, it\u2019s obvious that there are many variants in each model family that we have not covered. For example, in the Deep Q-Networks family, Double Deep Q-Networks give very interesting results.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p id=\"a830\">The main challenge in RL lies in preparing the simulation environment and choosing the most suitable approach. 
These aspects are highly dependent on the task to be performed and are very important because many real-world problems have enormous state or action spaces that must be represented efficiently and comprehensively.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p id=\"91f5\">The other main tasks are to optimize the rewards in order to obtain the desired results, to set up the system in order to let the learning process converge to the optimum in a reasonable time, and to avoid overfitting and forgetting.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a9c4fb8 elementor-widget elementor-widget-heading\" data-id=\"a9c4fb8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">References<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-cbe00a6 elementor-widget elementor-widget-text-editor\" data-id=\"cbe00a6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:list {\"ordered\":true} -->\n<ol>\n<li>Richard S. Sutton and Andrew G. Barto.\u00a0<em>Reinforcement Learning: An<br \/>Introduction<\/em>.<\/li>\n<li>Vincent Fran\u00e7ois-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare, Joelle Pineau.\u00a0<em>An Introduction to Deep Reinforcement Learning.<\/em><\/li>\n<li>Damien Ernst, Pierre Geurts, Louis Wehenkel.\u00a0<em>Tree-Based Batch Mode<br \/>Reinforcement Learning<\/em>. 
Journal of Machine Learning Research 6 (2005)<br \/>503\u2013556.<\/li>\n<li><a href=\"https:\/\/github.com\/openai\/gym\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/github.com\/openai\/gym<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/teopir\/ifqi\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/github.com\/teopir\/ifqi<\/a><\/li>\n<\/ol>\n<!-- \/wp:list -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Reinforcement Learning (RL) is an increasing subset of Machine Learning and one of the most important frontiers of Artificial Intelligence. This article is a high-level structural overview of many classical RL algorithms. <\/p>\n","protected":false},"author":927,"featured_media":10058,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[97,206,92,695],"ppma_author":[3681],"class_list":["post-10054","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-artificial-intelligence","tag-deep-learning","tag-machine-learning","tag-reinforcement-learning"],"authors":[{"term_id":3681,"user_id":927,"is_guest":0,"slug":"marco-del-pra","display_name":"Marco Del Pra","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/09\/Marco-Del-Pra-150x150.jpg","user_url":"https:\/\/www.afiniti.com\/%20","last_name":"Del Pra","first_name":"Marco","job_title":"","description":"Marco Del Pra is Senior Director, Data &amp; Analytics at 
Afiniti."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/10054","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/927"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=10054"}],"version-history":[{"count":6,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/10054\/revisions"}],"predecessor-version":[{"id":33712,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/10054\/revisions\/33712"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/10058"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=10054"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=10054"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=10054"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=10054"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}