{"id":22680,"date":"2021-03-12T10:40:00","date_gmt":"2021-03-12T10:40:00","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/why-learn-to-forget-in-recurrent-neural-networks\/"},"modified":"2023-08-30T12:07:45","modified_gmt":"2023-08-30T12:07:45","slug":"why-learn-to-forget-in-recurrent-neural-networks","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/why-learn-to-forget-in-recurrent-neural-networks\/","title":{"rendered":"Why \u2018Learn To Forget\u2019 In Recurrent Neural Networks"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"22680\" class=\"elementor elementor-22680\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-71e1cfa elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"71e1cfa\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-c47ce4b\" data-id=\"c47ce4b\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-e891dea elementor-widget elementor-widget-text-editor\" data-id=\"e891dea\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"has-medium-font-size\"><em>Illustrated with a simple example<\/em><\/p>\n<p id=\"909c\">Consider the following binary classification problem. The input is a binary sequence of arbitrary length. We want the output to be 1 if and only if a 1 occurred in the input but not too recently. 
Specifically, the last&nbsp;<em>n<\/em>&nbsp;bits must be 0.<\/p>\n<p id=\"f39d\">We can also write this problem as one of language recognition. For&nbsp;<em>n<\/em>&nbsp;= 4, the language, described as a regular expression, is&nbsp;<code>(0 or 1)*100000*<\/code>.<\/p>\n<p id=\"c3a3\">Below are some labeled instances for the case&nbsp;<em>n<\/em>&nbsp;= 3.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-55b440d elementor-widget elementor-widget-text-editor\" data-id=\"55b440d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">000 \u2192 0, 101 \u2192 0, 0100100000 \u2192 1, 1000 \u2192 1<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-78bb2e7 elementor-widget elementor-widget-text-editor\" data-id=\"78bb2e7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"7a0b\">Why this seemingly&nbsp;strange problem? It requires remembering that (i) a 1 occurred in the input and (ii) not too recently. 
As we will see soon, this example helps explain why simple recurrent neural networks are inadequate and how injecting a mechanism that learns to forget helps.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-600ebc2 elementor-widget elementor-widget-heading\" data-id=\"600ebc2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Simple Recurrent Neural Network<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d4a8f2f elementor-widget elementor-widget-text-editor\" data-id=\"d4a8f2f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"5277\">Let\u2019s start with a basic recurrent neural network (RNN).&nbsp;<em>x<\/em>(<em>t<\/em>) is the bit, 0 or 1, that arrives at time&nbsp;<em>t<\/em>&nbsp;in the input. This RNN maintains a state&nbsp;<em>h<\/em>(<em>t<\/em>) that tries to remember whether it saw a 1 sometime in the past. 
The output is just read out from this state after a suitable transformation.<\/p>\n<p>More formally, we have<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-12f7757 elementor-widget elementor-widget-text-editor\" data-id=\"12f7757\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\"><em>h<\/em>(<em>t<\/em>) = tanh(<em>a<\/em>*<em>h<\/em>(<em>t<\/em>-1) + <em>b<\/em>*<em>x<\/em>(<em>t<\/em>) + <em>c<\/em>)<br><em>y<\/em>(<em>t<\/em>) = sigmoid(<em>d<\/em>*<em>h<\/em>(<em>t<\/em>))<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-db5c5b4 elementor-widget elementor-widget-text-editor\" data-id=\"db5c5b4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Next, let\u2019s consider the following (<em>input sequence<\/em>,&nbsp;<em>output sequence<\/em>) pair and assume&nbsp;<em>n<\/em>&nbsp;= 3.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a6b144c elementor-widget elementor-widget-text-editor\" data-id=\"a6b144c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">x 10000000<br>y 00011111<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-837fbcd elementor-widget elementor-widget-text-editor\" data-id=\"837fbcd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"a61d\">To discuss the behavior and learning of 
the RNN on this pair, it will help to unroll the network in time as is commonly done.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-435faee elementor-widget elementor-widget-image\" data-id=\"435faee\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img fetchpriority=\"high\" decoding=\"async\" width=\"1000\" height=\"328\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0kt-aa6F0mudyt0LQ.png\" class=\"attachment-large size-large wp-image-18921\" alt=\"Why \u2018Learn To Forget\u2019 In Recurrent Neural Networks\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0kt-aa6F0mudyt0LQ.png 1000w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0kt-aa6F0mudyt0LQ-300x98.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0kt-aa6F0mudyt0LQ-768x252.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0kt-aa6F0mudyt0LQ-610x200.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0kt-aa6F0mudyt0LQ-750x246.png 750w\" sizes=\"(max-width: 1000px) 100vw, 1000px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Simple RNN unrolled in time<\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c0562fd elementor-widget elementor-widget-text-editor\" data-id=\"c0562fd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"6d73\">Think of this as a pipeline with stages. 
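The forward pass defined by the two equations above can be sketched in a few lines of Python (a minimal sketch; the parameter values a, b, c, d below are illustrative placeholders, not learned weights):

```python
import math

def simple_rnn(bits, a=1.0, b=1.0, c=0.0, d=1.0):
    """One pass of the simple RNN: each stage updates the state h
    from the previous state and the current bit, then reads out y."""
    h = 0.0
    ys = []
    for x in bits:
        h = math.tanh(a * h + b * x + c)       # h(t) = tanh(a*h(t-1) + b*x(t) + c)
        ys.append(1 / (1 + math.exp(-d * h)))  # y(t) = sigmoid(d*h(t))
    return ys
```

For example, simple_rnn([1, 0, 0, 0]) returns one output per time step, each strictly between 0 and 1.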
The state travels from left to right and gets modified along the way by the input at each stage.<\/p>\n<p id=\"90e2\">Let\u2019s walk through what happens inside a stage in a bit more detail. Consider the third stage. It takes as inputs the state&nbsp;<em>h<\/em>2 and the next input symbol&nbsp;<em>x<\/em>3.&nbsp;<em>h<\/em>2 may be thought of as a feature derived from&nbsp;<em>x<\/em>1 and&nbsp;<em>x<\/em>2 towards predicting&nbsp;<em>y<\/em>3. The box first computes the next state&nbsp;<em>h<\/em>3 from these two inputs.&nbsp;<em>h<\/em>3 is then carried forward to the next stage.&nbsp;<em>h<\/em>3 also determines the stage\u2019s output&nbsp;<em>y<\/em>3.<\/p>\n<p id=\"7817\">Consider what happens when the input 1000 is seen.&nbsp;<em>y<\/em>4 is 1, and since&nbsp;<em>y<\/em>^4 is less than 1 (which is always the case) there is some error. Following the backpropagation-through-time learning strategy, we will ripple the error back through time to the extent needed to update the various parameters.<\/p>\n<p id=\"da24\">Consider the parameter&nbsp;<em>b<\/em>. There are 4 instances of it, attached to&nbsp;<em>x<\/em>1 through&nbsp;<em>x<\/em>4 respectively. The instances attached to&nbsp;<em>x<\/em>2 through&nbsp;<em>x<\/em>4 don\u2019t change since&nbsp;<em>x<\/em>2 through&nbsp;<em>x<\/em>4 are all 0. So none of these&nbsp;<em>b<\/em>&nbsp;instances have any impact on&nbsp;<em>y^<\/em>4. The instance of&nbsp;<em>b<\/em>&nbsp;attached to&nbsp;<em>x<\/em>1 increases, since increasing it brings&nbsp;<em>y^<\/em>4 closer to 1.<\/p>\n<p id=\"e851\">As we continue seeing&nbsp;<em>x<\/em>5,&nbsp;<em>x<\/em>6,&nbsp;<em>x<\/em>7,&nbsp;<em>x<\/em>8, and their corresponding targets&nbsp;<em>y<\/em>5,&nbsp;<em>y<\/em>6,&nbsp;<em>y<\/em>7, and&nbsp;<em>y<\/em>8, the same learning behavior will happen.&nbsp;<em>b<\/em>&nbsp;will keep increasing. 
(Albeit less so, as we need to backpropagate the errors further back in time to get to&nbsp;<em>x<\/em>1.)<\/p>\n<p id=\"3ec4\">Now imagine&nbsp;<em>x<\/em>9 is 1.&nbsp;<em>y<\/em>9 must be 0. However,&nbsp;<em>y<\/em>^9 is large. This is because the parameter&nbsp;<em>b<\/em>&nbsp;has learned that&nbsp;<em>x<\/em>i = 1 predicts&nbsp;<em>y<\/em>j = 1 for&nbsp;<em>j<\/em>&nbsp;&gt;=&nbsp;<em>i<\/em>.&nbsp;<em>b<\/em>&nbsp;has no way of enforcing that&nbsp;<em>x<\/em>i = 1 must be followed only by 0s, numbering at least 3.<\/p>\n<p id=\"d185\">In short, this RNN is unable to capture, towards predicting&nbsp;<em>y<\/em>j, the joint condition that&nbsp;<em>x<\/em>i = 1 and that the bits following it, numbering at least 3, are all 0s. Also note that this is not a long-range influence.&nbsp;<em>n<\/em>&nbsp;is only 3. So the weakness of the RNN on this example cannot be explained in terms of vanishing error gradients when doing backpropagation-through-time [2]. There is something else going on here.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a569e80 elementor-widget elementor-widget-heading\" data-id=\"a569e80\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">An RNN that learns to forget<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9a80c49 elementor-widget elementor-widget-text-editor\" data-id=\"9a80c49\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"781d\">Now consider this version:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8e604b2 elementor-widget elementor-widget-text-editor\" data-id=\"8e604b2\" 
data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\"><em>z<\/em>(<em>t<\/em>)    = sigmoid(<em>a<\/em>*<em>x<\/em>(<em>t<\/em>) + <em>b<\/em>)<br><em>h<\/em>new(<em>t<\/em>) = tanh(<em>c<\/em>*<em>x<\/em>(<em>t<\/em>) + <em>d<\/em>)<br><em>h<\/em>(<em>t<\/em>)    = (1-<em>z<\/em>(<em>t<\/em>))*<em>h<\/em>(<em>t<\/em>-1) + <em>z<\/em>(<em>t<\/em>)*<em>h<\/em>new(<em>t<\/em>)<br><em>y<\/em>(<em>t<\/em>)    = sigmoid(<em>e<\/em>*<em>h<\/em>(<em>t<\/em>))<\/pre>\n<p id=\"fc8a\">We didn\u2019t just pull it out of a hat. It is a key update in a popular gated recurrent neural network called the GRU. We took this equation from its description in [1].<\/p>\n<p id=\"24fb\">This RNN has an explicit mechanism to forget! It is&nbsp;<em>z<\/em>(<em>t<\/em>), a value between 0 and 1, denoting the degree of forgetfulness. When&nbsp;<em>z<\/em>(<em>t<\/em>) approaches 1, the state&nbsp;<em>h<\/em>(<em>t<\/em>-1) is completely forgotten.<\/p>\n<p id=\"2f1a\">When&nbsp;<em>h<\/em>(<em>t<\/em>-1) is completely forgotten, what should&nbsp;<em>h<\/em>(<em>t<\/em>) be? We encapsulate this in an explicit function&nbsp;<em>h<\/em>new(<em>t<\/em>) denoting \u201cnew state\u201d.&nbsp;<em>h<\/em>new(<em>t<\/em>) is derived solely from the present input. This makes sense because if&nbsp;<em>h<\/em>(<em>t<\/em>-1) is to be forgotten, all we have in front of us is the new input&nbsp;<em>x<\/em>(<em>t<\/em>).<\/p>\n<p id=\"9680\">More generally, the next state&nbsp;<em>h<\/em>(<em>t<\/em>) is a mixture of the previous state&nbsp;<em>h<\/em>(<em>t<\/em>-1) and a new state&nbsp;<em>h<\/em>new(<em>t<\/em>), modulated by&nbsp;<em>z<\/em>(<em>t<\/em>).<\/p>\n<p id=\"f6e5\">Does this RNN have the capability to do better on this problem? We will answer this question in the affirmative by prescribing a solution that works. 
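The update equations above translate directly into Python. The sketch below hardwires one illustrative choice of the scalar parameters a through e (an assumption for illustration, not learned values):

```python
import math

def sigmoid(v):
    return 1 / (1 + math.exp(-v))

def forgetting_rnn(bits, a=10.0, b=0.0, c=10.0, d=-0.2, e=-1.0):
    """One pass of the gated RNN: z(t) decides how much of the old
    state to forget in favor of the candidate state hnew(t)."""
    h = 0.0
    ys = []
    for x in bits:
        z = sigmoid(a * x + b)        # degree of forgetting
        h_new = math.tanh(c * x + d)  # candidate new state
        h = (1 - z) * h + z * h_new   # mix old and new state
        ys.append(sigmoid(e * h))     # readout
    return ys
```

With these illustrative values, forgetting_rnn([1, 0, 0, 0]) produces outputs below ½ for the first three steps and above ½ on the fourth, matching the n = 3 targets 0001.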
The accompanying explanation will reveal what roles the various neurons play in making this solution work.<\/p>\n<p id=\"3f10\">Suppose&nbsp;<em>x<\/em>(<em>t<\/em>) is 1. Then&nbsp;<em>y<\/em>(<em>t<\/em>) must be 0. So we want to drive&nbsp;<em>y<\/em>^(<em>t<\/em>) towards 0. We can make this happen by setting&nbsp;<em>e<\/em>&nbsp;to a sufficiently negative number (say -1) and forcing&nbsp;<em>h<\/em>(<em>t<\/em>) to be close to 1. One way to get the desired&nbsp;<em>h<\/em>(<em>t<\/em>) is to force&nbsp;<em>z<\/em>(<em>t<\/em>) to be close to 1 and set&nbsp;<em>c<\/em>&nbsp;to a sufficiently positive number and&nbsp;<em>d<\/em>&nbsp;such that&nbsp;<em>c<\/em>+<em>d<\/em>&nbsp;is sufficiently positive. We can force&nbsp;<em>z<\/em>(<em>t<\/em>) to be close to 1 by setting&nbsp;<em>a<\/em>&nbsp;to a sufficiently positive number and&nbsp;<em>b<\/em>&nbsp;such that&nbsp;<em>a<\/em>+<em>b<\/em>&nbsp;is sufficiently positive.<\/p>\n<p id=\"69ce\">This prescription operates as if:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-111cf4a elementor-widget elementor-widget-text-editor\" data-id=\"111cf4a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">If <em>x<\/em>(<em>t<\/em>) is 1<br>   Set <em>h<\/em>new(<em>t<\/em>) close to 1.<br>   Reset <em>h<\/em>(<em>t<\/em>) to <em>h<\/em>new(<em>t<\/em>)<br>   Drive <em>y<\/em>^(<em>t<\/em>) towards 0 by setting <em>e<\/em> sufficiently negative<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e039f4c elementor-widget elementor-widget-text-editor\" data-id=\"e039f4c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"7b2e\">The
case where&nbsp;<em>x<\/em>(<em>t<\/em>) is 0 is more involved, since&nbsp;<em>y<\/em>(<em>t<\/em>) depends on the recent past values of&nbsp;<em>x<\/em>. Let\u2019s explain it in the following setting:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3c8fc0b elementor-widget elementor-widget-text-editor\" data-id=\"3c8fc0b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">Time \u2026 <em>t<\/em>    <em>t<\/em>+1     <em>t<\/em>+2      <em>t<\/em>+3<br><em>x<\/em>    \u2026 1     0       0        0<br><em>y<\/em>    \u2026 0     0       0        1<br><em>h<\/em>new \u2026 1 D=tanh(<em>d<\/em>)   D        D<br><em>z<\/em>    \u2026 1     \u00bd       \u00bd        \u00bd<br><em>h<\/em>    \u2026 1  \u00bd(1+D) \u00bd(<em>h<\/em>(<em>t<\/em>+1)+D) \u00bd(<em>h<\/em>(<em>t<\/em>+2)+D)<br><em>h<\/em>^   \u2026 &gt;&gt;0  &gt;&gt;0      &gt;&gt;0     &lt;&lt;0<br><em>y<\/em>^   \u2026 \u2192 0  \u2192 0      \u2192 0     \u2192 1<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ccdc623 elementor-widget elementor-widget-text-editor\" data-id=\"ccdc623\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"db41\">There is a lot in here! So let\u2019s walk through it row by row.<\/p>\n<p id=\"2545\">We are looking at the situation when processing the last 4 bits of the input&nbsp;<em>x<\/em>&nbsp;= \u20261000 in sequence. The corresponding target is&nbsp;<em>y<\/em>&nbsp;= \u20260001. We assume that the parameters of the RNN have been somehow chosen just right (or learned) as surfaced below. (These have to be consistent with the settings we used when&nbsp;<em>x<\/em>(t) was 1, of course.) 
In short, we are describing the behavior of a fixed network in this situation.<\/p>\n<p id=\"c1d5\">Now look at&nbsp;<em>h<\/em>new. When&nbsp;<em>x<\/em>(<em>t<\/em>) is 1, we have already discussed that&nbsp;<em>h<\/em>new(<em>t<\/em>) should approach 1. When&nbsp;<em>x<\/em>(<em>t<\/em>) is 0,&nbsp;<em>h<\/em>new(<em>t<\/em>) equals tanh(<em>cx<\/em>(<em>t<\/em>)+<em>d<\/em>)=tanh(<em>d<\/em>). We are calling this D.<\/p>\n<p id=\"1011\">Next look at&nbsp;<em>z<\/em>. When&nbsp;<em>x<\/em>(<em>t<\/em>) is 1, we already discussed that&nbsp;<em>z<\/em>(<em>t<\/em>) should approach 1. When&nbsp;<em>x<\/em>(<em>t<\/em>) is 0, since we want to remember the past, let\u2019s set&nbsp;<em>z<\/em>(<em>t<\/em>) to approximately \u00bd. For this, we just need to set&nbsp;<em>b<\/em>&nbsp;to 0. This can be achieved without unlearning the&nbsp;<em>z<\/em>(<em>t<\/em>) that works when&nbsp;<em>x<\/em>(<em>t<\/em>) is 1.<\/p>\n<p id=\"d311\">For the remaining rows, let\u2019s start from the last row and work our way in. In the&nbsp;<em>y<\/em>^ row, we describe what we want, given the&nbsp;<em>y<\/em>&nbsp;targets. Given that we have fixed&nbsp;<em>e<\/em>&nbsp;to a sufficiently negative number, this gives us what we want from our states. We call them&nbsp;<em>h<\/em>^.<\/p>\n<p id=\"d649\">So now all that remains is to show that&nbsp;<em>h<\/em>&nbsp;can be made to match up with&nbsp;<em>h<\/em>^. 
First, let\u2019s zoom into these two rows and, while at it, also rewrite&nbsp;<em>h<\/em>&nbsp;in a more convenient form:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-50c5b30 elementor-widget elementor-widget-text-editor\" data-id=\"50c5b30\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">h  \u2026 1        \u00bd + \u00bdD        \u00bc + \u00bcD + \u00bdD          \u215b + \u215bD + \u00bcD + \u00bdD<br>h^ \u2026 &gt;&gt; 0     &gt;&gt; 0          &gt;&gt; 0                 &lt;&lt; 0<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-98c1f05 elementor-widget elementor-widget-text-editor\" data-id=\"98c1f05\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"4418\">It can be seen that choosing D such that -\u2153 &lt; D &lt; -1\/7 will meet the desiderata. 
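A quick numeric check of this claim (a sketch; the sample values of D are arbitrary points inside the prescribed range):

```python
# Starting from h = 1 right after a 1, the update h <- (h + D)/2
# keeps h positive for the first two 0s and makes it negative on
# the third, for any D in (-1/3, -1/7). Checked at sample values.
for D in (-0.32, -0.25, -0.15):
    h = 1.0
    trajectory = []
    for _ in range(3):  # three 0s follow the 1
        h = 0.5 * (h + D)
        trajectory.append(h)
    assert trajectory[0] > 0 and trajectory[1] > 0
    assert trajectory[2] < 0
```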
It&#8217;s easy to find&nbsp;<em>d<\/em>&nbsp;such that tanh(<em>d<\/em>) is in this range.<\/p>\n<p id=\"2b26\">The prescription for the case&nbsp;<em>x<\/em>(<em>t<\/em>) = 0 may be summarized as:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8828db1 elementor-widget elementor-widget-text-editor\" data-id=\"8828db1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">If <em>x<\/em>(<em>t<\/em>) is 0<br>   Set <em>h<\/em>new(<em>t<\/em>) to be slightly negative.<br>   Set <em>h<\/em>(<em>t<\/em>) to the average of <em>h<\/em>(<em>t<\/em>-1) and <em>h<\/em>new(<em>t<\/em>)<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9007c74 elementor-widget elementor-widget-text-editor\" data-id=\"9007c74\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"8915\">So as 0s that follow a 1 are seen,&nbsp;<em>h<\/em>(<em>t<\/em>) keeps dropping. 
If enough 0s are seen,&nbsp;<em>h<\/em>(<em>t<\/em>) becomes negative.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-49537eb elementor-widget elementor-widget-heading\" data-id=\"49537eb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Summary<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5534a3f elementor-widget elementor-widget-text-editor\" data-id=\"5534a3f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"9de4\">In this post, we discussed recurrent <a href=\"https:\/\/www.experfy.com\/blog\/ai-ml\/an-introduction-to-recurrent-neural-networks\/\" target=\"_blank\" rel=\"noreferrer noopener\">neural networks<\/a> with and without an explicit \u2018forget\u2019 mechanism. We did so in the context of a simply described prediction problem that the simple RNN is incapable of solving. 
The RNN with the \u2018forget\u2019 mechanism is able to solve this problem.<\/p>\n<p id=\"eecc\">This post will be useful to readers who\u2019d like to understand how simple RNNs work, how an enhanced version with a forgetting mechanism works (GRU in particular), and how the latter improves upon the former.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-503bef4 elementor-widget elementor-widget-heading\" data-id=\"503bef4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Further Reading<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c3f9e78 elementor-widget elementor-widget-text-editor\" data-id=\"c3f9e78\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ol><li><a href=\"https:\/\/colah.github.io\/posts\/2015-08-Understanding-LSTMs\/\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/colah.github.io\/posts\/2015-08-Understanding-LSTMs\/<\/a><\/li><li><a href=\"https:\/\/www.superdatascience.com\/blogs\/recurrent-neural-networks-rnn-the-vanishing-gradient-problem\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/www.superdatascience.com\/blogs\/recurrent-neural-networks-rnn-the-vanishing-gradient-problem<\/a><\/li><\/ol>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>This post will be useful to readers who\u2019d like to understand how simple RNNs work, how an enhanced version with a forgetting mechanism works, GRU in particular, and how the latter improves upon the 
former.<\/p>\n","protected":false},"author":1044,"featured_media":18922,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[1409,1410,1411],"ppma_author":[3691],"class_list":["post-22680","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-gru","tag-rnn","tag-sequence-learning"],"authors":[{"term_id":3691,"user_id":1044,"is_guest":0,"slug":"arun-jagota","display_name":"Arun Jagota","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/Arun-Jagota-150x150.jpeg","user_url":"https:\/\/www.salesforce.com\/in\/?ir=1","last_name":"Jagota","first_name":"Arun","job_title":"","description":"Arun Jagota is Director of Data Science at Salesforce.com. A PhD in computer science, he has taught undergraduate, graduate, and continuing education courses in Computer Science at many US Universities from 1992 through 2006. 
He has written a number of books, most available at Amazon.com, 50 academic publications and has 17+ patents issued."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22680","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/1044"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=22680"}],"version-history":[{"count":4,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22680\/revisions"}],"predecessor-version":[{"id":31937,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22680\/revisions\/31937"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/18922"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=22680"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=22680"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=22680"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=22680"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}