{"id":2270,"date":"2020-02-20T02:06:41","date_gmt":"2020-02-19T23:06:41","guid":{"rendered":"http:\/\/kusuaks7\/?p=1875"},"modified":"2024-01-05T18:07:22","modified_gmt":"2024-01-05T18:07:22","slug":"deep-reinforcement-learning-for-supply-chain-and-price-optimization","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/","title":{"rendered":"Deep Reinforcement Learning For Supply Chain And Price Optimization"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"2270\" class=\"elementor elementor-2270\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-356d199d elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"356d199d\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-1d2b9b85\" data-id=\"1d2b9b85\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-3c1bdfc7 elementor-widget elementor-widget-text-editor\" data-id=\"3c1bdfc7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tSupply chain and price management were among the first areas of enterprise operations that adopted data science and combinatorial optimization methods and have a long history of using these techniques with great success. 
Although a wide range of traditional optimization methods are available for inventory and price management applications, deep reinforcement learning has the potential to substantially improve the optimization capabilities for these and other types of enterprise operations due to impressive recent advances in the development of generic self-learning algorithms for optimal control. In this article, we explore how deep <a href=\"https:\/\/www.experfy.com\/blog\/ai-ml\/introduction-to-reinforcement-learning\/\">reinforcement learning<\/a> methods can be applied in several basic supply chain and price management scenarios.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-649afeb elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"649afeb\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-9c0aaee\" data-id=\"9c0aaee\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-75c4289 elementor-widget elementor-widget-text-editor\" data-id=\"75c4289\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<strong>This article is structured as a hands-on tutorial that describes how to develop, debug, and evaluate reinforcement learning optimizers using PyTorch and RLlib:<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-48dd1e0 
elementor-widget elementor-widget-text-editor\" data-id=\"48dd1e0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li>We start with a simple motivating example that illustrates how slight modifications of traditional price optimization problems can result in complex behavior and increase optimization complexity.<\/li>\n \t<li>Next, we use this simplistic price management environment to develop and evaluate our first optimizer using only a vanilla PyTorch toolkit.<\/li>\n \t<li>We then discuss how the implementation can be drastically simplified and made more robust with RLlib, an open-source library for reinforcement learning.<\/li>\n \t<li>Next, we develop a more complex supply chain environment that includes a factory, several warehouses, and transportation. For this environment, we first implement a baseline solution using a traditional inventory management policy. 
Then we show how this baseline can be improved using continuous control algorithms provided by RLlib.<\/li>\n \t<li>Finally, we conclude the article with a discussion of how deep reinforcement learning algorithms and platforms can be applied in practical enterprise settings.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3a0c39d elementor-widget elementor-widget-heading\" data-id=\"3a0c39d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"introductoryexampledifferentialpriceresponseandhilopricing\">Introductory example: Differential price response and Hi-Lo pricing<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e5a1789 elementor-widget elementor-widget-text-editor\" data-id=\"e5a1789\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe traditional price optimization process in retail or manufacturing environments is typically framed as a what-if analysis of different pricing scenarios using some sort of demand model. In many cases, the development of a demand model is challenging because it has to properly capture a wide range of factors and variables that influence demand, including regular prices, discounts, marketing activities, seasonality, competitor prices, cross-product cannibalization, and halo effects. Once the demand model is developed, however, the optimization process for pricing decisions is relatively straightforward, and standard techniques such as linear or integer programming typically suffice. 
For instance, consider an apparel retailer that purchases a seasonal product at the beginning of the season and has to sell it out by the end of the period. Assuming that the retailer chooses pricing levels from a discrete set (e.g.,\u00a0$59.90,\u00a0$69.90, etc.) and can make price changes frequently (e.g., weekly), we can pose the following optimization problem:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fc156a7 elementor-widget elementor-widget-text-editor\" data-id=\"fc156a7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tmax ∑<sub>t<\/sub> ∑<sub>j<\/sub> p<sub>j<\/sub> · d(t, j) · x<sub>tj<\/sub><br \/>subject to: ∑<sub>j<\/sub> x<sub>tj<\/sub> = 1 for all t, and ∑<sub>t<\/sub> ∑<sub>j<\/sub> d(t, j) · x<sub>tj<\/sub> = c<br \/>where t iterates over time intervals, j is an index that iterates over the valid price levels, p<sub>j<\/sub> is the price with index j, d(t, j) is the demand at time t given price level j, c is the inventory level at the beginning of the season, and x<sub>tj<\/sub> is a binary dummy variable that is equal to one if price p<sub>j<\/sub> is assigned to time interval t, and zero otherwise. The first constraint ensures that each time interval has only one price, and the second constraint ensures that all demands sum up to the available stock level. This is an integer programming problem that can be solved using conventional optimization libraries.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8b2e7f2 elementor-widget elementor-widget-text-editor\" data-id=\"8b2e7f2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nThe above model is quite flexible because it allows for a price-demand function of an arbitrary shape (linear, constant elasticity, etc.) and arbitrary seasonal patterns. It can also be straightforwardly extended to support joint price optimization for multiple products. 
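To make the formulation concrete, here is a minimal brute-force sketch (not the article's code) that enumerates all price-schedule assignments for a toy instance; all prices, demands, and the stock level are hypothetical, and the sell-out constraint is relaxed to "demand must not exceed stock" for readability. Real instances would use an integer-programming solver instead of enumeration.

```python
import itertools

prices = [59.90, 69.90, 79.90]   # valid price levels (hypothetical)
T = 3                            # number of time intervals
# d[t][j]: demand at time t under price level j (hypothetical demand model)
d = [[100, 80, 60],
     [90, 75, 55],
     [80, 70, 50]]
stock = 230                      # inventory available for the season

best_schedule, best_revenue = None, -1.0
# Enumerate all assignments of one price level per time interval
# (the x_tj variables with sum_j x_tj = 1 for every t)
for levels in itertools.product(range(len(prices)), repeat=T):
    demand_total = sum(d[t][j] for t, j in enumerate(levels))
    if demand_total > stock:     # relaxed stock constraint
        continue
    revenue = sum(prices[j] * d[t][j] for t, j in enumerate(levels))
    if revenue > best_revenue:
        best_revenue = revenue
        best_schedule = [prices[j] for j in levels]
```

For this toy instance, the search selects the middle price level for every interval, which just fits the available stock.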
The model, however, assumes no dependency between time intervals.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-75f1439 elementor-widget elementor-widget-text-editor\" data-id=\"75f1439\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tLet us now explore how the dependencies between time intervals can impact the optimization process. In the real world, demand depends not only on the absolute price level but can also be impacted by the magnitude of recent price changes: a price decrease can create a temporary demand splash, while a price increase can result in a temporary demand drop. The impact of price changes can also be asymmetric, so that price increases have a much bigger or smaller impact than price decreases. We can codify these assumptions using the following price-demand function:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-721c9b3 elementor-widget elementor-widget-text-editor\" data-id=\"721c9b3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\td(p<sub>t<\/sub>, p<sub>t-1<\/sub>) = q<sub>0<\/sub> − k·p<sub>t<\/sub> − a·s((p<sub>t<\/sub> − p<sub>t-1<\/sub>)<sup>+<\/sup>) + b·s((p<sub>t-1<\/sub> − p<sub>t<\/sub>)<sup>+<\/sup>)<br \/>where x<sup>+<\/sup> denotes max(x, 0), p<sub>t<\/sub> is the price for the current time interval, and p<sub>t-1<\/sub> is the price for the previous time interval. The first two terms correspond to a linear demand model with intercept q<sub>0<\/sub> and slope k. The second two terms model the response to a price change between two intervals. Coefficients a and b define the sensitivity to positive and negative price changes, respectively, and s is a shock function that can be used to specify a non-linear dependency between the price change and demand. 
For the sake of illustration, we assume the simplest linear shock function, s(x) = x.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-41c04aa elementor-widget elementor-widget-text-editor\" data-id=\"41c04aa\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWe can visualize this environment by plotting profit functions that correspond to different magnitudes of price changes (see the\u00a0<a href=\"https:\/\/github.com\/ikatsov\/algorithmic-marketing-examples\/blob\/master\/pricing\/price-optimization-using-dqn-reinforcement-learning.ipynb\" rel=\"noopener\">complete notebook<\/a>\u00a0for implementation details):\n\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3b77b4e elementor-widget elementor-widget-image\" data-id=\"3b77b4e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/blog.griddynamics.com\/content\/images\/2020\/02\/hilo-pricing-demand-functions.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9b2cbd4 elementor-widget elementor-widget-text-editor\" data-id=\"9b2cbd4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWe can see that price increases &#8220;deflate&#8221; the baseline profit function, while price decreases &#8220;inflate&#8221; it. 
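The price-response environment described above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's code; all parameter values (episode length, price grid, demand intercept and slope, change sensitivities, unit cost) are hypothetical.

```python
# Minimal sketch of the price-response environment (illustrative parameters)
T = 20                                        # time steps in the selling season
price_grid = [40.0, 50.0, 60.0]               # valid price levels
q_0, k = 1000.0, 10.0                         # demand intercept and slope
a, b = 30.0, 80.0                             # sensitivity to price increases / decreases
unit_cost = 20.0

def plus(x):
    # positive part: x if x > 0, else 0
    return x if x > 0 else 0.0

def demand(p_t, p_prev):
    # linear demand deformed by the response to a price change;
    # the shock function s is taken to be the identity
    d = q_0 - k * p_t - a * plus(p_t - p_prev) + b * plus(p_prev - p_t)
    return plus(d)  # demand cannot be negative

def profit_t(p_t, p_prev):
    return demand(p_t, p_prev) * (p_t - unit_cost)

def profit_total(schedule):
    # the first interval has no preceding price change
    return sum(profit_t(p, schedule[i - 1] if i > 0 else p)
               for i, p in enumerate(schedule))
```

With these toy numbers, a price increase between intervals suppresses demand while a decrease temporarily boosts it, reproducing the "deflate"/"inflate" effect in the plot above.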
Next, we obtain our first profit baseline by searching for the optimal single (constant) price:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8ddb521 elementor-widget elementor-widget-text-editor\" data-id=\"8ddb521\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<em>(Code sample: constant-price baseline; see the complete notebook.)<\/em> The constant-price schedule is not optimal for this environment, and we can improve profits through greedy optimization: start with finding the optimal price for the first time step, then optimize the second time step having frozen the first one, and so on:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-53bfe68 elementor-widget elementor-widget-text-editor\" data-id=\"53bfe68\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<em>(Code sample: greedy price optimization; see the complete notebook.)<\/em> This approach improves profit significantly and produces the following price schedule:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2edb1fe elementor-widget elementor-widget-image\" data-id=\"2edb1fe\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/blog.griddynamics.com\/content\/images\/2020\/02\/hilo-pricing-optimal-pricing-policy.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-dfa7295 elementor-widget elementor-widget-text-editor\" data-id=\"dfa7295\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThis result is remarkable: a simple temporal dependency inside the price-demand function dictates a complex pricing strategy with price surges and discounts. 
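The constant-price and greedy baselines above can be sketched as follows. This is an illustrative reimplementation, not the notebook's code; the environment is restated inline with the same hypothetical parameters as in the environment sketch.

```python
# Environment restated with illustrative (hypothetical) parameters
T = 20
price_grid = [40.0, 50.0, 60.0]
q_0, k, a, b, unit_cost = 1000.0, 10.0, 30.0, 80.0, 20.0

def plus(x):
    return x if x > 0 else 0.0

def profit_total(schedule):
    total = 0.0
    for i, p in enumerate(schedule):
        p_prev = schedule[i - 1] if i > 0 else p
        d = plus(q_0 - k * p - a * plus(p - p_prev) + b * plus(p_prev - p))
        total += d * (p - unit_cost)
    return total

def best_constant_price():
    # Baseline 1: grid search over single constant-price schedules
    return max(price_grid, key=lambda p: profit_total([p] * T))

def greedy_price_schedule():
    # Baseline 2: optimize one time step at a time,
    # freezing all previously chosen prices
    schedule = [best_constant_price()] * T
    for t in range(T):
        schedule[t] = max(price_grid,
                          key=lambda p: profit_total(schedule[:t] + [p] + schedule[t + 1:]))
    return schedule
```

Because each greedy step keeps the incumbent price among its candidates, the total profit never decreases; with these toy parameters the greedy schedule alternates prices and beats the best constant price.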
It can be viewed as a formal justification of the Hi-Lo pricing strategy used by many retailers; we see how altering regular and promotional prices helps to maximize profit.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-d649ee2 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"d649ee2\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-fb98f4d\" data-id=\"fb98f4d\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b79a409 elementor-widget elementor-widget-heading\" data-id=\"b79a409\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"casestudy1pricingpolicyoptimizationusingdqn\">Case study 1: Pricing policy optimization using DQN<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-659921a elementor-widget elementor-widget-text-editor\" data-id=\"659921a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAlthough the greedy algorithm we implemented above produces the optimal pricing schedule for a simple differential price-response function, it becomes increasingly more challenging to reduce the problem to standard formulations, such as linear or integer 
programming, as we add more constraints or interdependencies. In this section, we approach the problem from a different perspective and apply a generic Deep Q Network (DQN) algorithm to learn the optimal price control policy.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a8ddb6f elementor-widget elementor-widget-text-editor\" data-id=\"a8ddb6f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWe use the original DQN in this example because it is a reasonably simple starting point that illustrates the main concepts of modern reinforcement learning. In practical settings, one is likely to use either more recent modifications of the original DQN or alternative algorithms\u2014we will discuss this topic more thoroughly at the end of the article.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f684647 elementor-widget elementor-widget-text-editor\" data-id=\"f684647\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nAlthough DQN implementations are available in most reinforcement learning libraries, we chose to implement the basic version of DQN from scratch to provide a clearer picture of how DQN is applied to this particular environment and to demonstrate several debugging techniques. 
Readers who are familiar with DQN can skip the next two sections that describe the core algorithm and its PyTorch implementation.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0072beb elementor-widget elementor-widget-heading\" data-id=\"0072beb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"definingtheenvironment\">Defining\u00a0the\u00a0environment<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-41f1b63 elementor-widget elementor-widget-text-editor\" data-id=\"41f1b63\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tReinforcement learning considers the setup where an agent interacts with the environment in discrete time steps with the goal of learning a reward-maximizing behavior policy. At each time step t, with a given state s<sub>t<\/sub>, the agent takes an action a<sub>t<\/sub> according to its policy π and receives the reward r<sub>t<\/sub>, moving to the next state s<sub>t+1<\/sub>. 
We redefine our pricing environment in these reinforcement learning terms as follows.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b82c9c1 elementor-widget elementor-widget-text-editor\" data-id=\"b82c9c1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tFirst, we encode the state of the environment at any time step t as a vector of prices for all previous time steps concatenated with a one-hot encoding of the time step itself:<br \/>s<sub>t<\/sub> = (p<sub>1<\/sub>, &hellip;, p<sub>t-1<\/sub>, 0, &hellip;, 0 | onehot(t))\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-11169d4 elementor-widget elementor-widget-text-editor\" data-id=\"11169d4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tNext, the action a<sub>t<\/sub> for every time step is just an index in the array of valid price levels. Finally, the reward r<sub>t<\/sub> is simply the profit of the seller. 
Our goal is to find a policy that prescribes a pricing action based on the current state in a way that the total profit for a selling season (episode) is maximized.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-16e92ad elementor-widget elementor-widget-heading\" data-id=\"16e92ad\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"overviewofthedqnalgorithm\">Overview\u00a0of\u00a0the\u00a0DQN\u00a0algorithm<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-001c5e4 elementor-widget elementor-widget-text-editor\" data-id=\"001c5e4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn this section, we briefly review the original DQN algorithm\u00a0<sup><a id=\"fnref1\" href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fn1\" rel=\"noopener\">[1]<\/a><\/sup>. 
The goal of the algorithm is to learn an action policy π that maximizes the total discounted cumulative reward (also known as the return) earned during the episode of T time steps:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-58074eb elementor-widget elementor-widget-text-editor\" data-id=\"58074eb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tR = ∑<sub>t=0&hellip;T-1<\/sub> γ<sup>t<\/sup>·r<sub>t<\/sub>, where γ is the discount factor. Such a policy can be defined if we know a function that estimates the expected return based on the current state and next action, under the assumption that all subsequent actions will also be taken according to the policy:<br \/>Q<sup>π<\/sup>(s, a) = E[R | s<sub>t<\/sub> = s, a<sub>t<\/sub> = a, π]\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8deb5fa elementor-widget elementor-widget-text-editor\" data-id=\"8deb5fa\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nAssuming that this function (known as the Q-function) is known, the policy can be straightforwardly defined as follows to maximize the return:<br \/>π(s) = argmax<sub>a<\/sub> Q(s, a)\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b27d96c elementor-widget elementor-widget-text-editor\" data-id=\"b27d96c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWe can combine the above definitions into the following recursive equation (the Bellman equation):<br \/>Q(s<sub>t<\/sub>, a<sub>t<\/sub>) = r<sub>t<\/sub> + γ·max<sub>a′<\/sub> Q(s<sub>t+1<\/sub>, a′)\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7135dd4 elementor-widget elementor-widget-text-editor\" data-id=\"7135dd4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\twhere\u00a0and\u00a0are the next state and the action taken in that state, respectively. If we estimate the Q-function using some approximator, then the quality of the approximation can be measured using the difference between the two sides of this equation:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1753669 elementor-widget elementor-widget-text-editor\" data-id=\"1753669\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThis value is called the temporal difference error. The main idea behind DQN is to train a deep neural network to approximate the Q-function using the temporal difference error as the loss function. In principle, the training process can be straightforward:\n<ol>\n \t<li>Initialize the network. Its input corresponds to state representation, while output is a vector of Q-values for all actions.<\/li>\n \t<li>For each time step:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-cfeed35 elementor-widget elementor-widget-text-editor\" data-id=\"cfeed35\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ol>\n \t<li>Estimate Q-values using the network.<\/li>\n \t<li>Execute the action with the maximum Q-value and observe the reward.<\/li>\n \t<li>Calculate the temporal difference error.<\/li>\n \t<li>Update the network&#8217;s parameters using stochastic gradient descent. 
The loss function is derived from the temporal difference error.<\/li>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a501252 elementor-widget elementor-widget-text-editor\" data-id=\"a501252\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<\/ol>\nThis simple approach, however, is known to be unstable for training complex non-linear approximators, such as deep neural networks. DQN addresses this issue using two techniques:\n<ul>\n \t<li><strong>Replay buffer.<\/strong>\u00a0One of the problems with the basic training process described above is that sequential observations are usually correlated, while network training generally requires independently distributed samples. DQN works around this by accumulating multiple observed transitions (s<sub>t<\/sub>, a<sub>t<\/sub>, r<sub>t<\/sub>, s<sub>t+1<\/sub>) in a buffer and sampling batches of such transitions to retrain the network. The buffer is typically chosen large enough to minimize correlations between samples.<\/li>\n \t<li><strong>Target networks.<\/strong>\u00a0The second problem with the basic process is that network parameters are updated based on the loss function computed using Q-values produced by the same network. In other words, the learning target moves simultaneously with the parameters we are trying to learn, making the optimization process unstable. DQN mitigates this issue by maintaining two instances of the network. The first one is used to take actions and is continuously updated as described above. The second one, called a target network, is a lagged copy of the first one and is used specifically to calculate Q-values for the loss function (i.e., the target). 
This technique helps to stabilize the learning process.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e54be49 elementor-widget elementor-widget-text-editor\" data-id=\"e54be49\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tCombining the basic learning process with these two ideas, we obtain the DQN algorithm:\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e2cd3b4 elementor-widget elementor-widget-text-editor\" data-id=\"e2cd3b4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAlgorithm: Deep Q Network (DQN)\n<ol>\n \t<li>Parameters and initialization:\n<ol>\n \t<li>θ \u2014 parameters of the policy network<\/li>\n \t<li>θ′ \u2014 parameters of the target network<\/li>\n \t<li>α \u2014 learning rate<\/li>\n \t<li>N \u2014 batch size<\/li>\n \t<li>M \u2014 frequency of target updates<\/li>\n \t<li>Initialize θ′ ← θ<\/li>\n<\/ol>\n<\/li>\n<\/ol>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-da1bd4e elementor-widget elementor-widget-text-editor\" data-id=\"da1bd4e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li>For t = 1, 2, &hellip; do<ol><li>Choose the action a<sub>t<\/sub> based on the ε-greedy policy derived from Q(s<sub>t<\/sub>, ·; θ)<\/li><li>Execute the action and save transition (s<sub>t<\/sub>, a<sub>t<\/sub>, r<sub>t<\/sub>, s<sub>t+1<\/sub>) in the buffer<\/li><li>Update the policy network<ol><li>Sample a batch of N transitions (s<sub>i<\/sub>, a<sub>i<\/sub>, r<sub>i<\/sub>, s′<sub>i<\/sub>) from the buffer<\/li><li>Calculate target Q-values for each sample in the batch: y<sub>i<\/sub> = r<sub>i<\/sub> + γ·max<sub>a′<\/sub> Q(s′<sub>i<\/sub>, a′; θ′), where y<sub>i<\/sub> = r<sub>i<\/sub> for last states of the episodes (initial condition)<\/li><li>Calculate the loss: L = (1\/N)·∑<sub>i<\/sub> (y<sub>i<\/sub> − Q(s<sub>i<\/sub>, a<sub>i<\/sub>; θ))<sup>2<\/sup><\/li><li>Update the network&#8217;s parameters: θ ← θ − α·∇<sub>θ<\/sub>L<\/li><\/ol><\/li><li>If t mod M = 0, update the target network: θ′ ← θ<\/li><\/ol><\/li><\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5b7d406 elementor-widget elementor-widget-text-editor\" data-id=\"5b7d406\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe last concern that we need to address is how the action is chosen based on Q-values estimated by the network. A policy that always takes the action with the maximum Q-value will not work well because the learning process will not sufficiently explore the environment, so we choose to randomize the action selection process. More specifically, we use the ε-greedy policy that takes the action with the maximum Q-value with the probability of 1 − ε and a random action with the probability of ε. 
We also use the annealing technique, starting with a relatively large value of ε and gradually decreasing it from one training episode to another.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-841939b elementor-widget elementor-widget-text-editor\" data-id=\"841939b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tMore details about DQN can be found in the original paper\u00a0<sup><a id=\"fnref1:1\" href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fn1\" rel=\"noopener\">[1:1]<\/a><\/sup>; its modifications and extensions are summarized in\u00a0<sup><a id=\"fnref2\" href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fn2\" rel=\"noopener\">[2]<\/a><\/sup>, and more thorough treatments of Q-learning are provided in the excellent books by Sutton and Barto\u00a0<sup><a id=\"fnref3\" href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fn3\" rel=\"noopener\">[3]<\/a><\/sup> and by Graesser and Keng\u00a0<sup><a id=\"fnref4\" href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fn4\" rel=\"noopener\">[4]<\/a><\/sup>.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-aeb8a0e elementor-widget elementor-widget-heading\" data-id=\"aeb8a0e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"implementingdqnusingpytorch\">Implementing\u00a0DQN\u00a0using\u00a0PyTorch<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element 
elementor-element-b69d3f1 elementor-widget elementor-widget-text-editor\" data-id=\"b69d3f1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tOur next step is to implement the DQN algorithm using PyTorch [5]. We develop all major components in this section, and the complete implementation with all auxiliary functions is available in this\u00a0<a href=\"https:\/\/github.com\/ikatsov\/algorithmic-marketing-examples\/blob\/master\/pricing\/price-optimization-using-dqn-reinforcement-learning.ipynb\" rel=\"noopener\">notebook<\/a>.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4cc3a72 elementor-widget elementor-widget-text-editor\" data-id=\"4cc3a72\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe first step is to implement a memory buffer that will be used to accumulate observed transitions and replay them during the network training. The implementation is straightforward, as it is just a generic cyclic buffer:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-51e6555 elementor-widget elementor-widget-text-editor\" data-id=\"51e6555\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<details open=\"\"><summary>Experience replay buffer. Click to expand the code sample.<\/summary><\/details>The second step is to implement the policy network. The input of the network is the environment state, and the output is a vector of Q-values for each possible pricing action. 
We choose to implement a simple network with three fully connected layers, although a recurrent neural network (RNN) would also be a reasonable choice here because the state is essentially a time series:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-12a8aa1 elementor-widget elementor-widget-text-editor\" data-id=\"12a8aa1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<details open=\"\"><summary>Policy network architecture. Click to expand the code sample.<\/summary><\/details>Next, we define the policy that converts Q-values produced by the network into pricing actions. We use the \u03b5-greedy policy with an annealed (decaying) exploration parameter: the probability \u03b5 of taking a random action (exploring) is set relatively high at the beginning of the training process, and then decays exponentially to fine-tune the policy.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b82a97f elementor-widget elementor-widget-text-editor\" data-id=\"b82a97f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<details open=\"\"><summary>Annealed \u03b5-greedy policy. Click to expand the code sample.<\/summary><\/details>The most complicated part of the implementation is the network update procedure. 
It samples a batch of non-final transitions from the replay buffer, computes Q-values for the current and next states, calculates the loss, and updates the network accordingly:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b622c43 elementor-widget elementor-widget-text-editor\" data-id=\"b622c43\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<details open=\"\"><summary>DQN model update. Click to expand the code sample.<\/summary><\/details>Finally, we define a helper function that executes the action and returns the reward and updated state:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2a8d104 elementor-widget elementor-widget-text-editor\" data-id=\"2a8d104\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<details open=\"\"><summary>Environment state update. Click to expand the code sample.<\/summary><\/details>Let us now wire all pieces together in a simulation loop that plays multiple episodes using the environment, updates the policy networks, and records pricing actions and returns for further analysis:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-225b9b7 elementor-widget elementor-widget-text-editor\" data-id=\"225b9b7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<details open=\"\"><summary>DQN training. Click to expand the code sample.<\/summary><\/details>This concludes our basic DQN implementation. 
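Putting the pieces from this section together, the cyclic buffer, the three-layer policy network, and the update step can be sketched in one self-contained snippet. Layer sizes, loss, and optimizer are illustrative assumptions, and for brevity this version omits the terminal-state masking that the full notebook implementation performs:

```python
import random

import torch
import torch.nn as nn
import torch.nn.functional as F

class ReplayMemory:
    """Fixed-capacity cyclic buffer of (state, action, next_state, reward) tuples."""
    def __init__(self, capacity):
        self.capacity, self.memory, self.position = capacity, [], 0

    def push(self, transition):
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        self.memory[self.position] = transition               # overwrite oldest when full
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

class PolicyNetwork(nn.Module):
    """State in, one Q-value per pricing action out."""
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, x):
        return self.net(x)

def update_model(policy_net, target_net, optimizer, memory, batch_size=32, gamma=0.8):
    """One DQN step: sample a minibatch and regress Q(s, a) onto r + gamma * max Q'(s')."""
    batch = memory.sample(batch_size)
    states = torch.stack([t[0] for t in batch])
    actions = torch.tensor([t[1] for t in batch])
    next_states = torch.stack([t[2] for t in batch])
    rewards = torch.tensor([t[3] for t in batch])
    q = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():   # target network is not updated by backprop
        target = rewards + gamma * target_net(next_states).max(dim=1).values
    loss = F.smooth_l1_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```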
We will now focus on experimentation and analysis of the results.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0d1f2c0 elementor-widget elementor-widget-heading\" data-id=\"0d1f2c0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"simulationresults\">Simulation\u00a0results<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6fe1d28 elementor-widget elementor-widget-text-editor\" data-id=\"6fe1d28\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe following plot visualizes pricing schedules generated during the training process, and the schedule that corresponds to the last episode is highlighted in red:\n\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-470e95b elementor-widget elementor-widget-image\" data-id=\"470e95b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/blog.griddynamics.com\/content\/images\/2020\/02\/hilo-pricing-dqn-training.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-f04e1f2 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"f04e1f2\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div 
class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-60f9d71\" data-id=\"60f9d71\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-37e069b elementor-widget elementor-widget-text-editor\" data-id=\"37e069b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWe can see that the result closely resembles the baseline pricing strategy obtained using greedy optimization (but not exactly the same because of randomization). The achieved profit is also very close to the optimum.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2815b21 elementor-widget elementor-widget-text-editor\" data-id=\"2815b21\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe following animation visualizes the same data, but better illustrates how the policy changes over training episodes:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ac80a93 elementor-widget elementor-widget-image\" data-id=\"ac80a93\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/blog.griddynamics.com\/content\/images\/2020\/02\/hilo-pricing-dqn-training-animation.gif\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e07d590 elementor-widget elementor-widget-text-editor\" data-id=\"e07d590\" data-element_type=\"widget\" data-e-type=\"widget\" 
data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe process starts with a random policy, but the network quickly learns the sawtooth pricing pattern. The following plot shows how returns change during the training process (the line is smoothed using a moving average filter with a window of size 10; the shaded area corresponds to two standard deviations over the window):\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ff4f84f elementor-widget elementor-widget-image\" data-id=\"ff4f84f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/blog.griddynamics.com\/content\/images\/2020\/02\/hilo-pricing-dqn-training-returns.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3095c8c elementor-widget elementor-widget-heading\" data-id=\"3095c8c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"policyvisualizationtuninganddebugging\">Policy\u00a0visualization,\u00a0tuning,\u00a0and\u00a0debugging<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-475fa1d elementor-widget elementor-widget-text-editor\" data-id=\"475fa1d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe learning process is very straightforward for our simplistic environment, but policy training can be much more difficult as the complexity of the environment increases. 
In this section, we discuss some visualization and debugging techniques that can help analyze and troubleshoot the learning process.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3831309 elementor-widget elementor-widget-text-editor\" data-id=\"3831309\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tOne of the most basic things we can do for policy debugging is to evaluate the network for a manually crafted input state and analyze the output Q-values. For example, let us make a state vector that corresponds to time step 1 and an initial price of\u00a0$170, then run it through the network:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ee7dc12 elementor-widget elementor-widget-text-editor\" data-id=\"ee7dc12\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<details open=\"\"><summary>Capturing Q-values for a given state. 
Click to expand the code sample.<\/summary><\/details>The output distribution of Q-values will be as follows for the network trained without reward discounting (that is, \u03b3 = 1):\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-46ac472 elementor-widget elementor-widget-image\" data-id=\"46ac472\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/blog.griddynamics.com\/content\/images\/2020\/02\/hilo-pricing-dqn-q-example-edited-2.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3971e9c elementor-widget elementor-widget-text-editor\" data-id=\"3971e9c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWe see that the network correctly suggests increasing the price (in accordance with the Hi-Lo pattern), but the distribution of Q-values is relatively flat and the optimal action is not differentiated well. 
If we retrain the policy with \u03b3 = 0.8, then the distribution of Q-values for the same input state will be substantially different and price surges will be articulated more clearly:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4620f4a elementor-widget elementor-widget-image\" data-id=\"4620f4a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/blog.griddynamics.com\/content\/images\/2020\/02\/hilo-pricing-dqn-q-example-gamma-0.8-edited-2.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3bae764 elementor-widget elementor-widget-text-editor\" data-id=\"3bae764\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tNote that the absolute values of the Q-function do not match the actual return in dollars because of discounting. More specifically, the Q-function now focuses only on the first 10\u201312 steps after the price action: for example, the discounting factor for the 13th action is 0.8<sup>13<\/sup> \u2248 0.055, so its contribution into the Q-value is negligible. This often helps to improve the policy or learn it more rapidly because the short-term rewards provide more stable and predictable guidance for the training process. 
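The effective horizon induced by discounting is easy to check numerically: with \u03b3 = 0.8, the weight of a reward 13 steps ahead has already decayed to about 5%.

```python
gamma = 0.8
# Discount weights gamma^k for the first 20 steps ahead
weights = [gamma ** k for k in range(20)]
print(round(weights[13], 3))  # → 0.055
```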
We can see this clearly by plotting the learning dynamics for the two values of \u03b3 together (note that this plot shows the actual returns, not Q-values, so the two lines have the same scale):\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-684d907 elementor-widget elementor-widget-image\" data-id=\"684d907\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/blog.griddynamics.com\/content\/images\/2020\/02\/hilo-pricing-dqn-training-returns-gamma1.00vs0.80.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-107bb27 elementor-widget elementor-widget-text-editor\" data-id=\"107bb27\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe second technique that can be useful for debugging and troubleshooting is visualization of temporal difference (TD) errors. 
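Recall that for a transition (s, a, r, s\u2032) the TD error is the mismatch between the two sides of the Bellman equation, which is exactly what such a bar chart decomposes:

```latex
\delta = \underbrace{r + \gamma \max_{a'} Q(s', a')}_{\text{bootstrapped target}} - \underbrace{Q(s, a)}_{\text{current estimate}}
```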
In the following bar chart, we randomly selected several transitions and visualized individual terms that enter the Bellman equation:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b33fb37 elementor-widget elementor-widget-text-editor\" data-id=\"b33fb37\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe chart shows that TD errors are reasonably small, and the Q-values are meaningful as well:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5477a5e elementor-widget elementor-widget-image\" data-id=\"5477a5e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/blog.griddynamics.com\/content\/images\/2020\/02\/hilo-pricing-td-error.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-104616a elementor-widget elementor-widget-text-editor\" data-id=\"104616a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tFinally, it can be very useful to visualize the correlation between Q-values and actual episode returns. 
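Conceptually, this instrumentation reduces to recording one (Q-value, return) pair per episode and measuring their linear correlation. A framework-free sketch with hypothetical recorded values (the notebook's instrumented loop records these quantities from the actual simulation):

```python
def pearson_corr(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical recorded values: one pair per episode, the initial state's
# maximum Q-value versus the realized episode return
q_estimates = [105.0, 98.0, 110.0, 101.0]
returns = [104.0, 97.5, 111.0, 100.0]
print(round(pearson_corr(q_estimates, returns), 2))  # → 0.99
```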
The following code snippet shows an instrumented simulation loop that records both values, and the correlation plot is shown right below (white crosses correspond to individual pairs of the Q-value and return).\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f74fae5 elementor-widget elementor-widget-text-editor\" data-id=\"f74fae5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<details open=\"\"><summary>Correlation between Q-values and actual returns. Click to expand the code sample.<\/summary><\/details>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ea5a801 elementor-widget elementor-widget-image\" data-id=\"ea5a801\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/blog.griddynamics.com\/content\/images\/2020\/02\/hilo-pricing-qvalues-vs-returns.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-799bb96 elementor-widget elementor-widget-text-editor\" data-id=\"799bb96\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThis correlation is almost ideal thanks to the simplicity of the toy price-response function we use. The correlation pattern can be much more sophisticated in more complex environments. 
A complicated correlation pattern might be an indication that a network fails to learn a good policy, but that is not necessarily the case (i.e., a good policy might have a complicated pattern).\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-071a33d elementor-widget elementor-widget-heading\" data-id=\"071a33d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"implementationusingrllib\">Implementation\u00a0using\u00a0RLlib<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7ebc920 elementor-widget elementor-widget-text-editor\" data-id=\"7ebc920\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe DQN implementation we have created in the previous sections can be viewed mainly as an educational exercise. In real industrial settings, it is preferable to use stable frameworks that provide reinforcement learning algorithms and other tools out of the box. 
Consequently, our next step is to reimplement the same optimizer using RLlib, an open-source library for reinforcement learning developed at the UC Berkeley RISELab\u00a0<sup><a id=\"fnref5\" href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fn5\" rel=\"noopener\">[5]<\/a><\/sup>.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3d85bc8 elementor-widget elementor-widget-text-editor\" data-id=\"3d85bc8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nWe start with the development of a simple wrapper for our environment that casts it to the standard OpenAI Gym interface. We simply need to add a few minor details. First, the environment needs to fully encapsulate the state. Second, dimensionality and type of action and state (observation) spaces have to be explicitly specified:\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4f34781 elementor-widget elementor-widget-text-editor\" data-id=\"4f34781\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<details open=\"\"><summary>Pricing environment: Gym wrapper. Click to expand the code sample.<\/summary><\/details>Once the environment is defined, training the pricing policy using a DQN algorithm can be very straightforward. 
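The shape of such a wrapper can be sketched as follows, here without a hard dependency on the Gym package: in the real implementation the class extends gym.Env and declares gym.spaces objects for the action and observation spaces, and the linear price-response inside step() is our placeholder, not the article's demand model:

```python
class GymStylePricingEnv:
    """Sketch of the Gym-interface wrapper: the state is fully encapsulated,
    and reset()/step() follow the classic Gym contract."""

    def __init__(self, price_grid, episode_len=20):
        self.price_grid = price_grid            # maps a discrete action to a price
        self.episode_len = episode_len
        self.num_actions = len(price_grid)      # gym.spaces.Discrete(n) in a real wrapper
        self.reset()

    def reset(self):
        self.t = 0
        # Observation: price/demand history, all zeros at the start of an episode
        self.state = [0.0] * (2 * self.episode_len)
        return list(self.state)

    def step(self, action):
        price = self.price_grid[action]
        # Placeholder linear price-response; the real environment plugs in
        # its own profit logic here
        demand = max(0.0, 50.0 - 0.2 * price)
        reward = price * demand
        self.state = [price, demand] + self.state[:-2]   # shift the history window
        self.t += 1
        done = self.t == self.episode_len
        return list(self.state), reward, done, {}
```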
In our case, it is enough to just specify a few parameters:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3a51d29 elementor-widget elementor-widget-text-editor\" data-id=\"3a51d29\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<details open=\"\"><summary>Pricing policy optimization using RLlib. Click to expand the code sample.<\/summary><\/details>The resulting policy achieves the same performance as our custom DQN implementation. However, RLlib provides many other tools and benefits out of the box such as a real-time integration with TensorBoard:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4ca337d elementor-widget elementor-widget-image\" data-id=\"4ca337d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/blog.griddynamics.com\/content\/images\/2020\/02\/hilo-pricing-tensorboard.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7a07737 elementor-widget elementor-widget-text-editor\" data-id=\"7a07737\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThis concludes our first case study. The solution we developed can work with more complex price-response functions, as well as incorporate multiple products and inventory constraints. 
We develop some of these capabilities in the next section.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ece1f8a elementor-widget elementor-widget-heading\" data-id=\"ece1f8a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"casestudy2multiecheloninventoryoptimizationusingddpg\">Case study 2: Multi-echelon inventory optimization using DDPG<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1202470 elementor-widget elementor-widget-text-editor\" data-id=\"1202470\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn the first case study, we discussed how deep reinforcement learning can be applied to the basic revenue management scenario. We also tried out several implementation techniques and frameworks, and we are now equipped to tackle a more complex problem. 
Our second project will be focused on supply chain optimization, and we will use a much more complex environment with multiple locations, transportation issues, seasonal demand changes, and manufacturing costs.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f1dde61 elementor-widget elementor-widget-heading\" data-id=\"f1dde61\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"definingtheenvironment\">Defining\u00a0the\u00a0environment<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-cb6e3c5 elementor-widget elementor-widget-text-editor\" data-id=\"cb6e3c5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>We start with defining the environment that includes a factory, central factory warehouse, and\u00a0distribution warehouses.<sup><a id=\"fnref6\" href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fn6\" rel=\"noopener\">[6]<\/a><\/sup><sup><a id=\"fnref7\" href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fn7\" rel=\"noopener\">[7]<\/a><\/sup>\u00a0An instance of such an environment with three warehouses is shown in the figure below.<br \/><br \/><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c66f43f elementor-widget elementor-widget-image\" data-id=\"c66f43f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" 
src=\"https:\/\/blog.griddynamics.com\/content\/images\/2020\/02\/hilo-pricing-tensorboard.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-695f57d elementor-widget elementor-widget-text-editor\" data-id=\"695f57d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWe assume that the factory produces a product with a constant cost of\u00a0dollars per unit, and the production level at time step\u00a0is\u00a0.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c3ea658 elementor-widget elementor-widget-text-editor\" data-id=\"c3ea658\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tNext, there is a factory warehouse with a maximum capacity of\u00a0units. The storage cost for one product unit for a one time step at the factory warehouse is\u00a0, and the stock level at time\u00a0is\u00a0.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-dd4a5e0 elementor-widget elementor-widget-text-editor\" data-id=\"dd4a5e0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAt any time\u00a0, the number of units shipped from the factory warehouse to the distribution warehouse\u00a0is\u00a0, and the transportation cost is\u00a0dollars per unit. 
Note that the transportation cost varies across the distribution warehouses.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bea5c54 elementor-widget elementor-widget-text-editor\" data-id=\"bea5c54\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tEach distribution warehouse j has maximum capacity c<sub>j<\/sub>, storage cost of s<sub>j<\/sub>, and stock level at time t equal to q<sub>j,t<\/sub>. Products are sold to retail partners at price p, which is the same across all warehouses, and the demand for time step t at warehouse j is d<sub>j,t<\/sub> units. We also assume that the manufacturer is contractually obligated to fulfill all orders placed by retail partners, and if the demand for a certain time step exceeds the corresponding stock level, it results in a penalty of b dollars for each unfulfilled unit. Unfulfilled demand is carried over between time steps (which corresponds to backordering), and we model it as a negative stock level.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8d7c37d elementor-widget elementor-widget-text-editor\" data-id=\"8d7c37d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tLet us now combine the above assumptions and define the environment in reinforcement learning terms. 
First, we obtain the following reward function for each time step:\n\nr<sub>t<\/sub> = p \u2211<sub>j=1..W<\/sub> d<sub>j,t<\/sub> \u2212 z<sub>0<\/sub>a<sub>0,t<\/sub> \u2212 \u2211<sub>j=0..W<\/sub> s<sub>j<\/sub> max(q<sub>j,t<\/sub>, 0) \u2212 \u2211<sub>j=1..W<\/sub> z<sub>j<\/sub>a<sub>j,t<\/sub> + b \u2211<sub>j=1..W<\/sub> min(q<sub>j,t<\/sub>, 0)\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c51815e elementor-widget elementor-widget-text-editor\" data-id=\"c51815e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nThe first term is revenue, the second corresponds to production cost, the third is the total storage cost, and the fourth is the transportation cost. The last term corresponds to the penalty cost and enters the equation with a plus sign because stock levels are already negative in case of unfulfilled demand.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8a21eb8 elementor-widget elementor-widget-text-editor\" data-id=\"8a21eb8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWe choose the state vector to include all current stock levels and demand values for all warehouses for several previous steps:\n\nx<sub>t<\/sub> = (q<sub>0,t<\/sub>, \u2026, q<sub>W,t<\/sub>, d<sub>1,t\u22121<\/sub>, \u2026, d<sub>W,t\u2212k<\/sub>), where k is the number of previous time steps whose demands are included.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7566027 elementor-widget elementor-widget-text-editor\" data-id=\"7566027\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tNote that we assume that the agent observes only past demand values, but not the demand for the current (upcoming) time step. 
This means that the agent can potentially benefit from learning the demand pattern and embedding the demand prediction capability into the policy.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-21daf3b elementor-widget elementor-widget-text-editor\" data-id=\"21daf3b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe state update rule will then be as follows:\n\nq<sub>0,t+1<\/sub> = min(q<sub>0,t<\/sub> + a<sub>0,t<\/sub> \u2212 \u2211<sub>j=1..W<\/sub> a<sub>j,t<\/sub>, c<sub>0<\/sub>), \u00a0 q<sub>j,t+1<\/sub> = min(q<sub>j,t<\/sub> + a<sub>j,t<\/sub> \u2212 d<sub>j,t<\/sub>, c<sub>j<\/sub>) for j = 1, \u2026, W\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bd18a58 elementor-widget elementor-widget-text-editor\" data-id=\"bd18a58\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nFinally, the action vector simply consists of production and shipping controls:\n\na<sub>t<\/sub> = (a<sub>0,t<\/sub>, a<sub>1,t<\/sub>, \u2026, a<sub>W,t<\/sub>)\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0ac6e88 elementor-widget elementor-widget-text-editor\" data-id=\"0ac6e88\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe code snippet below shows the implementation of the state and action classes (see the\u00a0<a href=\"https:\/\/github.com\/ikatsov\/algorithmic-marketing-examples\/blob\/master\/supply-chain\/supply-chain-reinforcement-learning.ipynb\" rel=\"noopener\">complete notebook<\/a>\u00a0for implementation details).\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d94996d elementor-widget elementor-widget-text-editor\" data-id=\"d94996d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<details open=\"\"><summary>Supply chain environment: 
State and action. Click to expand the code sample.<\/summary><\/details>The next code snippet shows how the environment is initialized. We assume episodes with 26 time steps (e.g., weeks), three warehouses, and storage and transportation costs that vary significantly across the warehouses.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-16e50d8 elementor-widget elementor-widget-text-editor\" data-id=\"16e50d8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<details open=\"\"><summary>Supply chain environment: Initialization. Click to expand the code sample.<\/summary><\/details>We also define a simple demand function that simulates seasonal demand changes and includes a stochastic component:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-dfe8262 elementor-widget elementor-widget-text-editor\" data-id=\"dfe8262\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\twhere the stochastic term is a random variable with a uniform distribution. This function is implemented below:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-00830d9 elementor-widget elementor-widget-text-editor\" data-id=\"00830d9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<details><summary>Supply chain environment: Demand function. 
Click to expand the code sample.<\/summary><\/details>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d1fbfc1 elementor-widget elementor-widget-image\" data-id=\"d1fbfc1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/blog.griddynamics.com\/content\/images\/2020\/02\/supply-chain-demands.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3460dcd elementor-widget elementor-widget-text-editor\" data-id=\"3460dcd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tFinally, we have to implement the state transition logic according to the specifications for reward and state we defined earlier in this section. This part is very straightforward: we just convert formulas for profit and state updates into the code.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f01ef3d elementor-widget elementor-widget-text-editor\" data-id=\"f01ef3d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<details open=\"\"><summary>Supply chain environment: Transition function. 
Click to expand the code sample.<\/summary><\/details>For the sake of simplicity, we assume that fractional amounts of the product can be produced or shipped (alternatively, one can think of it as measuring units in thousands or millions, so that rounding errors are immaterial).\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6e0c615 elementor-widget elementor-widget-heading\" data-id=\"6e0c615\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"establishingthebaselines\">Establishing\u00a0the\u00a0baselines<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-086dec9 elementor-widget elementor-widget-text-editor\" data-id=\"086dec9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWe have defined the environment, and now we need to establish some baselines for the supply chain control policy. One of the traditional solutions is the (s, Q)-policy. This policy can be expressed as the following simple rule: at every time step, compare the stock level with the reorder point s, and reorder Q units if the stock level drops below the reorder point, or take no action otherwise. 
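The rule above can be sketched in a few lines of Python; the names below (`s_q_policy_action`, `stock`) are illustrative and are not taken from the notebook's actual implementation:

```python
def s_q_policy_action(stock_level, s, Q):
    """(s, Q)-policy: reorder Q units when stock drops below the reorder point s."""
    return Q if stock_level < s else 0

# One warehouse over six time steps with a constant demand of 3 units per step:
stock, orders = 10, []
for t in range(6):
    order = s_q_policy_action(stock, s=5, Q=8)
    orders.append(order)
    stock += order - 3  # receive the reorder, then serve the demand

# orders == [0, 0, 8, 0, 0, 8] — the periodic reorders produce the sawtooth pattern
```

Running the loop shows the stock draining until it crosses the reorder point, then jumping back up, which is exactly the sawtooth discussed next.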
This policy typically results in a sawtooth stock level pattern similar to the following:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5eac4ac elementor-widget elementor-widget-image\" data-id=\"5eac4ac\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/blog.griddynamics.com\/content\/images\/2020\/02\/supply-chain-sQ-typical-pattern-labeled.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e49a089 elementor-widget elementor-widget-text-editor\" data-id=\"e49a089\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tReordering decisions are made independently for each warehouse, and the policy parameters s and Q can be different for different warehouses. We implement the (s,Q)-policy, as well as a simple simulator that allows us to evaluate this policy, in the code snippet below:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d5344da elementor-widget elementor-widget-text-editor\" data-id=\"d5344da\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<details open=\"\"><summary>(s,Q)-policy and simulator. Click to expand the code sample.<\/summary><\/details>Setting policy parameters represents a certain challenge because we have 8 parameters, i.e., four (s,Q) pairs, in our environment. 
In general, the parameters have to be set in a way that balances storage and shortage costs under the uncertainty of the demand (in particular, the reorder point has to be chosen to absorb demand shocks to a certain degree). This problem can be approached analytically given that the demand distribution parameters are known, but instead we take a simpler approach here and perform a black-box search over the parameter space using the Adaptive Experimentation Platform developed by Facebook\u00a0<sup><a id=\"fnref8\" href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fn8\" rel=\"noopener\">[8]<\/a><\/sup>. This framework provides a very convenient API and uses Bayesian optimization internally. The code snippet below shows exactly how the parameters of the (s,Q)-policy are optimized:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a518df2 elementor-widget elementor-widget-text-editor\" data-id=\"a518df2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<details open=\"\"><summary>Optimization of (s, Q)-policy parameters. 
Click to expand the code sample.<\/summary><\/details>We combine this optimization with grid-search fine-tuning to obtain the following policy parameters and achieve the following profit performance:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1adca74 elementor-widget elementor-widget-text-editor\" data-id=\"1adca74\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<table data-tab-size=\"8\">\n<tbody>\n<tr>\n<td>Optimized policy parameters:<\/td>\n<\/tr>\n<tr>\n<td>Factory (s, Q) = (0, 20)<\/td>\n<\/tr>\n<tr>\n<td>Warehouse 1 (s, Q) = (5, 5)<\/td>\n<\/tr>\n<tr>\n<td>Warehouse 2 (s, Q) = (5, 5)<\/td>\n<\/tr>\n<tr>\n<td>Warehouse 3 (s, Q) = (5, 10)<\/td>\n<\/tr>\n<tr>\n<td>Achieved profit: 6871.0<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-65dc61e elementor-widget elementor-widget-text-editor\" data-id=\"65dc61e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWe can get more insight into the policy behavior by visualizing how the stock levels, shipments, production levels, and profits change over time:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f8de4ce elementor-widget elementor-widget-image\" data-id=\"f8de4ce\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/blog.griddynamics.com\/content\/images\/2020\/02\/supply-chain-policy-trace-sQ-reward-6871.0.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3c99928 elementor-widget elementor-widget-text-editor\" data-id=\"3c99928\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn our testbed environment, the random component of the demand is relatively small, and it makes more sense to ship products on an as-needed basis rather than accumulate large safety stocks in distribution warehouses. 
This is clearly visible in the above plots: the shipment patterns loosely follow the oscillating demand pattern, while stock levels do not develop a pronounced sawtooth pattern.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5b29bcf elementor-widget elementor-widget-heading\" data-id=\"5b29bcf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"overviewoftheddpgalgorithm\">Overview\u00a0of\u00a0the\u00a0DDPG\u00a0algorithm<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-caf3c89 elementor-widget elementor-widget-text-editor\" data-id=\"caf3c89\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWe now turn to the development of a reinforcement learning solution that can outperform the (s,Q)-policy baseline. Our supply chain environment is substantially more complex than the simplistic pricing environment we used in the first part of the tutorial, but, in principle, we can consider using the same DQN algorithm because we managed to reformulate the problem in reinforcement learning terms. 
The issue, however, is that DQN generally requires a reasonably small discrete action space because the algorithm explicitly evaluates all actions to find the one that maximizes the target Q-value (see step 2.3.2 of the DQN algorithm described earlier):\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-808d5c2 elementor-widget elementor-widget-text-editor\" data-id=\"808d5c2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThis was not an issue in our first project because the action space was defined as a set of discrete price levels. In the supply chain environment, an action is a vector of production and shipping controls, so the action space grows exponentially with the size of the chain, and we also allow controls to be real numbers, so the action space is continuous. In principle, we can work around this through discretization. For example, we can allow only three levels for each of the four controls, which results in 3<sup>4<\/sup> = 81 possible actions. This approach, however, is not scalable. Fortunately, continuous control is a well-studied problem, and there is a whole range of algorithms designed to deal with continuous action spaces. We choose to use Deep Deterministic Policy Gradient (DDPG), which is one of the state-of-the-art algorithms suitable for continuous control problems\u00a0<sup><a id=\"fnref9\" href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fn9\" rel=\"noopener\">[9]<\/a><\/sup><sup><a id=\"fnref10\" href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fn10\" rel=\"noopener\">[10]<\/a><\/sup>. 
It is more complex than DQN, so we will review it only briefly; more theoretical and implementation details can be found in the referenced articles.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6e32104 elementor-widget elementor-widget-text-editor\" data-id=\"6e32104\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>Policy gradient.<\/strong>\u00a0DQN belongs to the family of Q-learning algorithms. The central idea of Q-learning is to optimize actions based on their Q-values, and thus all Q-learning algorithms explicitly learn or approximate the value function. The second major family of reinforcement learning algorithms is policy gradient algorithms. The central idea of the policy gradient is that the policy itself is a function with parameters \u03b8, and thus this function can be optimized directly using gradient descent. 
We can express this more formally using the following notation for the policy:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-06f3b3d elementor-widget elementor-widget-text-editor\" data-id=\"06f3b3d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWe are generally interested in finding the policy that maximizes the average return\u00a0, so we define the following objective function:\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0bed76c elementor-widget elementor-widget-text-editor\" data-id=\"0bed76c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe policy gradient solves the following problem:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2756a25 elementor-widget elementor-widget-text-editor\" data-id=\"2756a25\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tusing, for example, gradient ascent to update the policy parameters:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-cd4767c elementor-widget elementor-widget-text-editor\" data-id=\"cd4767c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe most straightforward way to implement this idea is to compute the above gradient directly for each observed episode and its return (which is known as the REINFORCE algorithm). 
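A minimal, self-contained illustration of the REINFORCE update on a toy two-action problem is sketched below; the problem, the constants, and all names are illustrative and not part of the supply chain notebook:

```python
import math
import random

random.seed(0)

def softmax(prefs):
    """Convert action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy two-action "episode": action 1 pays 1.0, action 0 pays only 0.1.
rewards = [0.1, 1.0]
theta = [0.0, 0.0]   # policy parameters
alpha = 0.1          # learning rate

for _ in range(3000):
    probs = softmax(theta)
    a = 0 if random.random() < probs[0] else 1
    G = rewards[a]   # one-step episodes, so the return equals the reward
    # REINFORCE: theta += alpha * G * grad log pi(a | theta),
    # where grad log pi = one_hot(a) - probs for a softmax policy
    for i in range(2):
        grad = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += alpha * G * grad

# After training, the policy strongly prefers the higher-reward action.
```

Each observed episode nudges the parameters in the direction of the log-probability gradient, weighted by the return, which is exactly the gradient ascent step described above.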
This is not particularly efficient because the estimates computed based on individual episodes are generally noisy, and each episode is used only once and then discarded. On the other hand, the policy gradient is well suited for continuous action spaces because individual actions are not explicitly evaluated.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-dd8969d elementor-widget elementor-widget-text-editor\" data-id=\"dd8969d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>Actor-Critic: Combining policy gradient with Q-learning.<\/strong>\u00a0The limitation of the basic policy gradient, however, can be overcome by combining it with Q-learning, and this approach is extremely successful. The main idea is that it can be more beneficial to compute the policy gradient based on learned value functions rather than raw observed rewards and returns. This helps to reduce the noise and increase the robustness of the algorithm because the learned Q-function is able to generalize and \u201csmooth\u201d the observed experiences. This leads to the third family of algorithms, known as Actor-Critic. All these algorithms have a dedicated approximator for the policy (actor) and a second approximator that estimates the Q-values collected under this policy (critic).\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-52179a7 elementor-widget elementor-widget-text-editor\" data-id=\"52179a7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe DDPG algorithm further combines the Actor-Critic paradigm with the stabilization techniques introduced in DQN: an experience replay buffer and target networks that allow for complex neural approximators. 
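These two stabilizers can be illustrated schematically in a few lines; the weights are stand-ins for real network parameters, and the constants are illustrative only:

```python
import random
from collections import deque

# Stabilizer 1: an experience replay buffer holding (s, a, r, s') transitions.
replay_buffer = deque(maxlen=10_000)

def store(transition):
    replay_buffer.append(transition)

def sample_batch(n):
    # Random sampling breaks the temporal correlation between transitions.
    return random.sample(replay_buffer, n)

# Stabilizer 2: a target network that slowly tracks the online network.
# Real implementations blend full weight tensors; plain lists suffice here.
online_weights = [1.0, -2.0]
target_weights = [0.0, 0.0]

def soft_update(target, online, tau=0.005):
    """Soft update: w_target <- tau * w_online + (1 - tau) * w_target."""
    return [tau * w + (1 - tau) * w_t for w, w_t in zip(online, target)]

for _ in range(1000):
    target_weights = soft_update(target_weights, online_weights)
# After many small steps, target_weights has drifted close to online_weights.
```

The slowly moving target keeps the bootstrapped learning targets from chasing themselves, which is the same motivation as in DQN.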
Another important aspect of DDPG is that it assumes a deterministic policy a = \u03bc(s), while traditional policy gradient methods assume stochastic policies that specify probability distributions over actions \u03c0(a|s). The deterministic policy approach has performance advantages and is generally more sample-efficient because the policy gradient integrates only over the state space, not the action space. The pseudocode below shows how all these pieces fit together:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b19ebfb elementor-widget elementor-widget-text-editor\" data-id=\"b19ebfb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAlgorithm: Deep Deterministic Policy Gradient (DDPG)\n<ol>\n \t<li>Parameters and initialization:\n<ol>\n \t<li>\u03b8 and \u03b8\u2032 \u2014 parameters of the policy network (actor) and its target copy<\/li>\n \t<li>\u03c6 and \u03c6\u2032 \u2014 parameters of the Q-function network (critic) and its target copy<\/li>\n \t<li>N \u2014 batch size<\/li>\n<\/ol>\n<\/li>\n \t<li>For t = 1, 2, \u2026 do\n<ol>\n \t<li>Choose the action a = \u03bc<sub>\u03b8<\/sub>(s) + exploration noise<\/li>\n \t<li>Execute the action and save transition (s, a, r, s\u2032) in the buffer<\/li>\n \t<li>Update the networks\n<ol>\n \t<li>Sample a batch of N transitions from the buffer<\/li>\n \t<li>Compute targets: y = r + \u03b3Q<sub>\u03c6\u2032<\/sub>(s\u2032, \u03bc<sub>\u03b8\u2032<\/sub>(s\u2032))<\/li>\n \t<li>Update critic&#8217;s network parameters by minimizing the mean squared error between Q<sub>\u03c6<\/sub>(s, a) and the targets y. This step is similar to DQN because the critic represents the Q-learning side of the algorithm.<\/li>\n \t<li>Update actor&#8217;s network parameters using the gradient \u2207<sub>\u03b8<\/sub>(1\/N)\u2211Q<sub>\u03c6<\/sub>(s, \u03bc<sub>\u03b8<\/sub>(s)). This is gradient ascent for the policy parameters, but the gradient is computed based on critic&#8217;s value estimates.<\/li>\n \t<li>Update target networks: \u03b8\u2032 \u2190 \u03c4\u03b8 + (1 \u2212 \u03c4)\u03b8\u2032 and \u03c6\u2032 \u2190 \u03c4\u03c6 + (1 \u2212 \u03c4)\u03c6\u2032<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<\/li>\n<\/ol>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-642e950 elementor-widget elementor-widget-text-editor\" data-id=\"642e950\" data-element_type=\"widget\" data-e-type=\"widget\" 
data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tNote that the gradient ascent update for the actor is based on the Q-function approximation, not on the original transitions. DDPG also uses soft updates (incremental blending) for the target networks, as shown in step 2.3.5, while DQN uses hard updates (replacement).\n\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-daa7761 elementor-widget elementor-widget-heading\" data-id=\"daa7761\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"implementationusingrllib\">Implementation\u00a0using\u00a0RLlib<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-fb9ac41 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"fb9ac41\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-a860b27\" data-id=\"a860b27\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-14ba1c6 elementor-widget elementor-widget-text-editor\" data-id=\"14ba1c6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tOur last step is to implement training of the supply chain management policy using 
RLlib. We first create a simple Gym wrapper for the environment we previously defined:\n\n<details open=\"\"><summary>Supply chain environment: Gym wrapper. Click to expand the code sample.<\/summary><\/details>Next, we implement the training process using RLlib, which is also very straightforward:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6165ba4 elementor-widget elementor-widget-text-editor\" data-id=\"6165ba4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<details open=\"\"><summary>Supply chain optimization using RLlib and DDPG. Click to expand the code sample.<\/summary><\/details>The policy trained this way substantially outperforms the baseline (s, Q)-policy. The figure below shows example episodes for two policies compared side by side:\n\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ba542d9 elementor-widget elementor-widget-image\" data-id=\"ba542d9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/blog.griddynamics.com\/content\/images\/2020\/02\/supply-chain-policy-trace-sQ-vs-DDPG.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0ec62f4 elementor-widget elementor-widget-text-editor\" data-id=\"0ec62f4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn principle, it is possible to combine DDPG with parametric inventory management models like (s,Q)-policy in different ways. 
For example, one can attempt to optimize reorder points and amount parameters of the (s,Q) policy using DDPG.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-35b17c1 elementor-widget elementor-widget-heading\" data-id=\"35b17c1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"deepreinforcementlearningforenterpriseoperations\">Deep reinforcement learning for enterprise operations<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3961bc7 elementor-widget elementor-widget-text-editor\" data-id=\"3961bc7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWe conclude this article with a broader discussion of how deep reinforcement learning can be applied in enterprise operations: what are the main use cases, what are the main considerations for selecting reinforcement learning algorithms, and what are the main implementation options.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2301fe5 elementor-widget elementor-widget-text-editor\" data-id=\"2301fe5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>Use cases.<\/strong>\u00a0Most enterprise use cases can be approached from both myopic (single stage) and strategic (multi-stage) perspectives. This can be illustrated by the following examples:\n<ul>\n \t<li>Traditional price optimization focuses on estimating the price-demand function and determining the profit-maximizing price point. 
In the strategic context, one would consider a sequence of prices and inventory movements that must be optimized jointly.<\/li>\n \t<li>Traditional personalization models are trained to optimize the click-through rate, conversion rate, or other myopic metrics. In the strategic context, a sequence of multiple marketing actions has to be optimized to maximize customer lifetime value or a similar long-term objective.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7433c8d elementor-widget elementor-widget-text-editor\" data-id=\"7433c8d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tReinforcement learning is a natural solution for strategic optimization, and it can be viewed as an extension of traditional predictive analytics that is usually focused on myopic optimization. Reinforcement learning is also a natural solution for dynamic environments where historical data is unavailable or quickly becomes obsolete (e.g., newsfeed personalization).\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-71d29a4 elementor-widget elementor-widget-text-editor\" data-id=\"71d29a4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>Action space.<\/strong>\u00a0Some enterprise use cases can be better modeled using discrete action spaces, and some are modeled using continuous action spaces. This is a major consideration for selecting a reinforcement learning algorithm. 
The DQN family (Double DQN, Dueling DQN, Rainbow) is a reasonable starting point for discrete action spaces, and the Actor-Critic family (DDPG, TD3, SAC) would be a starting point for continuous spaces.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6dbcdcd elementor-widget elementor-widget-text-editor\" data-id=\"6dbcdcd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>On-policy vs. off-policy.<\/strong>\u00a0In many reinforcement learning problems, one has access to an environment or simulator that can be used to sample transitions and evaluate the policy. Typical examples include video game simulators, car driving simulators, and physical simulators for robotics use cases. This can be an option for enterprise use cases as well. For instance, we previously created a supply chain simulator. However, many enterprise use cases do not allow for accurate simulation, and real-life policy testing can also be associated with unacceptable risks. Marketing is a good example: although reinforcement learning is a very compelling option for strategic optimization of marketing actions, it is generally not possible to create an adequate simulator of customer behavior, and random messaging to customers for policy training or evaluation is not feasible either. In such cases, one has to learn offline from historical data and carefully evaluate a new policy before deploying it to production. 
It is not trivial to correctly learn and evaluate a new policy having only the data collected under some other policy (off-policy learning), and this problem is one of the central challenges for enterprise adoption of reinforcement learning.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5c31b48 elementor-widget elementor-widget-text-editor\" data-id=\"5c31b48\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>Single-agent vs. multi-agent.<\/strong>\u00a0Most innovations and breakthroughs in reinforcement learning in recent years have been achieved in single-agent settings. However, many enterprise use cases, including supply chains, can be more adequately modeled using the multi-agent paradigm (multiple warehouses, stores, factories, etc.). The choice of algorithms and frameworks is somewhat more limited in such cases.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4100d6e elementor-widget elementor-widget-text-editor\" data-id=\"4100d6e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>Perception vs. combinatorial optimization.<\/strong>\u00a0The success of deep reinforcement learning largely comes from its ability to tackle problems that require complex perception, such as video game playing or car driving. The applicability of deep reinforcement learning to traditional combinatorial optimization problems has been studied as well, but less thoroughly\u00a0<sup><a id=\"fnref11\" href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fn11\" rel=\"noopener\">[11]<\/a><\/sup>. 
Many enterprise use cases, including supply chains, require combinatorial optimization, and this is an area of active research for reinforcement learning.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0545963 elementor-widget elementor-widget-text-editor\" data-id=\"0545963\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<strong>Technical platform.<\/strong>\u00a0There are a relatively large number of technical frameworks and platforms for reinforcement learning, including OpenAI Baselines, Berkeley RLlib, Facebook ReAgent, Keras-RL, and Intel Coach. Many of these frameworks provide only algorithm implementations, but some of them are designed as platforms that are able to learn directly from system logs and essentially provide reinforcement learning capabilities as a service. This latter approach is very promising in the context of enterprise operations.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-409e700 elementor-widget elementor-widget-heading\" data-id=\"409e700\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\"><h4 id=\"references\">References<\/h4><\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a54793c elementor-widget elementor-widget-text-editor\" data-id=\"a54793c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<section>\n<ol>\n \t<li id=\"fn1\">Mnih V., et al. 
\u201cHuman-level control through deep reinforcement learning,\u201d 2015\u00a0<a href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fnref1\" rel=\"noopener\">\u21a9\ufe0e<\/a>\u00a0<a href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fnref1:1\" rel=\"noopener\">\u21a9\ufe0e<\/a><\/li>\n \t<li id=\"fn2\">Hessel M., et al. \u201cRainbow: Combining Improvements in Deep Reinforcement Learning,\u201d 2017\u00a0<a href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fnref2\" rel=\"noopener\">\u21a9\ufe0e<\/a><\/li>\n \t<li id=\"fn3\">Graesser L., Keng W. L., Foundations of Deep Reinforcement Learning, 2020\u00a0<a href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fnref3\" rel=\"noopener\">\u21a9\ufe0e<\/a><\/li>\n \t<li id=\"fn4\">Sutton R., Barto A., Reinforcement Learning, 2018\u00a0<a href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fnref4\" rel=\"noopener\">\u21a9\ufe0e<\/a><\/li>\n \t<li id=\"fn5\">RLlib: Scalable Reinforcement Learning\u00a0<a href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fnref5\" rel=\"noopener\">\u21a9\ufe0e<\/a><\/li>\n \t<li id=\"fn6\">Kemmer L., et al. \u201cReinforcement learning for supply chain optimization,\u201d 2018\u00a0<a href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fnref6\" rel=\"noopener\">\u21a9\ufe0e<\/a><\/li>\n \t<li id=\"fn7\">Oroojlooyjadid A., et al. 
\u201cA Deep Q-Network for the Beer Game: Reinforcement Learning for Inventory Optimization,\u201d 2019\u00a0<a href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fnref7\" rel=\"noopener\">\u21a9\ufe0e<\/a><\/li>\n \t<li id=\"fn8\">Adaptive Experimentation Platform\u00a0<a href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fnref8\" rel=\"noopener\">\u21a9\ufe0e<\/a><\/li>\n \t<li id=\"fn9\">Silver D., Lever G., Heess N., Degris T., Wierstra D., Riedmiller M. \u201cDeterministic Policy Gradient Algorithms,\u201d 2014\u00a0<a href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fnref9\" rel=\"noopener\">\u21a9\ufe0e<\/a><\/li>\n \t<li id=\"fn10\">Lillicrap T., Hunt J., Pritzel A., Heess N., Erez T., Tassa Y., Silver D., Wierstra D., \u201cContinuous control with deep reinforcement learning,\u201d 2015\u00a0<a href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fnref10\" rel=\"noopener\">\u21a9\ufe0e<\/a><\/li>\n \t<li id=\"fn11\">Bello I., Pham H., Le Q., Norouzi M., Bengio S. 
\u201cNeural Combinatorial Optimization with Reinforcement Learning,\u201d 2017\u00a0<a href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/#fnref11\" rel=\"noopener\">\u21a9\ufe0e<\/a><\/li>\n<\/ol>\nThis article was originally published on <a href=\"https:\/\/blog.griddynamics.com\/deep-reinforcement-learning-for-supply-chain-and-price-optimization\/\" rel=\"noopener\">Grid Dynamics blog<\/a>.\n\n<\/section>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Although a wide range of traditional optimization methods are available for inventory and price management applications, deep reinforcement learning has the potential to substantially improve the optimization capabilities for these and other types of enterprise operations due to impressive recent advances in the development of generic self-learning algorithms for optimal control. In this article, we explore how deep reinforcement learning methods can be applied in several basic supply chain and price management scenarios. 
This article is structured as a hands-on tutorial that describes how to develop, debug, and evaluate reinforcement learning optimizers using PyTorch and RLlib.<\/p>\n","protected":false},"author":731,"featured_media":3737,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[92],"ppma_author":[3573],"class_list":["post-2270","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-machine-learning"],"authors":[{"term_id":3573,"user_id":731,"is_guest":0,"slug":"ilya-katsov","display_name":"Ilya Katsov","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Katsov","first_name":"Ilya","job_title":"","description":"Ilya Katsov is Head of Practice, Industrial AI at Grid Dynamics. &nbsp;He is the author of several scientific articles and international patents, and also authored a book, &ldquo;Introduction to Algorithmic Marketing: Artificial Intelligence for Marketing 
Operations&rdquo;."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2270","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/731"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=2270"}],"version-history":[{"count":5,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2270\/revisions"}],"predecessor-version":[{"id":35412,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2270\/revisions\/35412"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/3737"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=2270"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=2270"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=2270"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=2270"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}