{"id":10523,"date":"2020-10-19T10:41:18","date_gmt":"2020-10-19T10:41:18","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/?p=10523"},"modified":"2023-10-19T16:18:23","modified_gmt":"2023-10-19T16:18:23","slug":"understanding-transformers-data-science-way","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/understanding-transformers-data-science-way\/","title":{"rendered":"Understanding Transformers, the Data Science Way"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"10523\" class=\"elementor elementor-10523\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-5047cb0 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"5047cb0\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-3fa15227\" data-id=\"3fa15227\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-61ce4584 elementor-widget elementor-widget-text-editor\" data-id=\"61ce4584\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Transformers have become the defacto standard for any NLP tasks nowadays. Not only that, but they are now also being used in Computer Vision and to generate music. I am sure you would all have heard about the GPT3 Transformer and its applications thereof. 
<strong><em>But all these things aside, they are still as hard to understand as ever.<\/em><\/strong><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4bfa984 elementor-widget elementor-widget-text-editor\" data-id=\"4bfa984\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>It has taken me multiple readings of the Google research <a href=\"https:\/\/arxiv.org\/pdf\/1706.03762.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">paper<\/a> that first introduced transformers, along with many blog posts, to really understand how a transformer works.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2afbcbf elementor-widget elementor-widget-text-editor\" data-id=\"2afbcbf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>So, I thought of putting the whole idea down in as simple words as possible, along with some very basic math and some puns, as I am a proponent of having some fun while learning. I will try to keep both the jargon and the technicality to a minimum, yet it is such a topic that I could only do so much. 
And my goal is to make the reader understand even the goriest details of the Transformer by the end of this post.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-87bcd8e elementor-widget elementor-widget-text-editor\" data-id=\"87bcd8e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>So, here goes \u2014 This post will be a highly conversational one and it is about <strong><em>\u201cDecoding The Transformer\u201d.<\/em><\/strong><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-00d0688 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"00d0688\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-aa3210c\" data-id=\"aa3210c\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-4b7f79e elementor-widget elementor-widget-heading\" data-id=\"4b7f79e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><em>Q: So, why should I even understand the Transformer?<\/em><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element 
elementor-element-df66ecc elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"df66ecc\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-de40a82\" data-id=\"de40a82\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-35faab5 elementor-widget elementor-widget-text-editor\" data-id=\"35faab5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>In the past, LSTM and GRU architectures (as explained in my past <a href=\"https:\/\/towardsdatascience.com\/nlp-learning-series-part-3-attention-cnn-and-what-not-for-text-classification-4313930ed566\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">post<\/a> on NLP), along with the attention mechanism, used to be the state-of-the-art approach for language modeling problems (put very simply, predicting the next word) and translation systems. But the main problem with these architectures is that they are recurrent in nature, and the runtime increases as the sequence length increases. 
That is, these architectures take a sentence and process each word in a <strong><em>sequential<\/em><\/strong> way, so the runtime grows as the sentence length increases.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-912c1e0 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"912c1e0\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-99d66cf\" data-id=\"99d66cf\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-6d36da9 elementor-widget elementor-widget-text-editor\" data-id=\"6d36da9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>The Transformer, a model architecture first explained in the paper \u201cAttention Is All You Need\u201d, lets go of this recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output. 
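To see why dropping recurrence helps, here is a toy NumPy sketch of mine (not code from the paper) of scaled dot-product self-attention: every word attends to every other word in a couple of matrix multiplies, with no loop over time steps. The learned query/key/value projections and multiple heads from the paper are omitted for clarity.

```python
import numpy as np

def self_attention(x):
    """Toy scaled dot-product self-attention over a whole sentence at once.

    x: (S, D) matrix -- one D-dimensional vector per word."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                      # (S, S): all pairwise dependencies in one go
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)              # softmax over each row
    return w @ x                                       # (S, D): same shape as the input

x = np.random.randn(7, 16)    # a 7-word sentence with 16-dim embeddings
out = self_attention(x)
print(out.shape)              # (7, 16)
```

Every position is processed in parallel by the same matrix operations, which is exactly what the recurrent architectures above cannot do.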
And that makes it FAST.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-827e75d elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"827e75d\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-f91cd50\" data-id=\"f91cd50\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-d3ec75a elementor-widget elementor-widget-image\" data-id=\"d3ec75a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img fetchpriority=\"high\" decoding=\"async\" width=\"342\" height=\"504\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/0_JCGfOqUEHCrwxNv2.png\" class=\"attachment-large size-large wp-image-33548\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/0_JCGfOqUEHCrwxNv2.png 342w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/0_JCGfOqUEHCrwxNv2-204x300.png 204w\" sizes=\"(max-width: 342px) 100vw, 342px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-dfe4040 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"dfe4040\" data-element_type=\"section\" 
data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-e9c2fcc\" data-id=\"e9c2fcc\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-8417bfd elementor-widget elementor-widget-text-editor\" data-id=\"8417bfd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>This is the picture of the full transformer, taken from the paper, and it surely is intimidating. I will aim to demystify it in this post by going through each individual piece, so read ahead.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-c6e89dd elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"c6e89dd\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-c33ab5d\" data-id=\"c33ab5d\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5aac566 elementor-widget elementor-widget-heading\" data-id=\"5aac566\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title 
elementor-size-default\">The Big Picture<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-38d102f elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"38d102f\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-9004494\" data-id=\"9004494\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-e0abcb6 elementor-widget elementor-widget-heading\" data-id=\"e0abcb6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">Q: That sounds interesting. 
So, what does a transformer do exactly?<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-29f5f47 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"29f5f47\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-4ea2f68\" data-id=\"4ea2f68\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-77cb4e1 elementor-widget elementor-widget-text-editor\" data-id=\"77cb4e1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Essentially, a transformer can perform almost any NLP task. It can be used for language modeling, translation, or classification as required, and it does so fast by removing the sequential nature of the problem. 
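To make "only the output layer changes" concrete, here is a rough sketch of mine with made-up shapes and random weights (the names `W_cls`, `W_vocab`, the 3-class setup, and the 10k-word vocabulary are all hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
encoded = rng.standard_normal((7, 512))        # stand-in transformer output: 7 words, 512 dims

# Classification head: pool over the sequence, project to (say) 3 classes.
W_cls = rng.standard_normal((512, 3)) * 0.02
class_probs = softmax(encoded.mean(axis=0) @ W_cls)    # (3,) class probabilities

# Translation-style head: project every position onto a (hypothetical) 10k-word vocabulary.
W_vocab = rng.standard_normal((512, 10000)) * 0.02
word_probs = softmax(encoded @ W_vocab)                # (7, 10000): one distribution per position

print(class_probs.shape, word_probs.shape)    # (3,) (7, 10000)
```

The body of the network producing `encoded` stays the same in both cases; only the head on top changes.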
So, the transformer in a machine translation application would convert one language to another, while for a classification problem it would provide the class probabilities using an appropriate output layer.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6266688 elementor-widget elementor-widget-text-editor\" data-id=\"6266688\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>It all depends on the final output layer of the network, but the basic Transformer structure will remain much the same for any task. For this particular post, I will be continuing with the machine translation example.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b041311 elementor-widget elementor-widget-text-editor\" data-id=\"b041311\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>So, from a very high level, this is how the transformer looks for a translation task. 
It takes as input an English sentence and returns a German sentence.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-b49fbb1 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"b49fbb1\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-12f5175\" data-id=\"12f5175\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-c7dde84 elementor-widget elementor-widget-image\" data-id=\"c7dde84\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" width=\"1024\" height=\"166\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_NBCtrY02DTg9ZQiK7dxaKQ-1024x166.png\" class=\"attachment-large size-large wp-image-33549\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_NBCtrY02DTg9ZQiK7dxaKQ-1024x166.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_NBCtrY02DTg9ZQiK7dxaKQ-300x49.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_NBCtrY02DTg9ZQiK7dxaKQ-768x125.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_NBCtrY02DTg9ZQiK7dxaKQ-610x99.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_NBCtrY02DTg9ZQiK7dxaKQ-750x122.png 750w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_NBCtrY02DTg9ZQiK7dxaKQ-1140x185.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_NBCtrY02DTg9ZQiK7dxaKQ.png 1515w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-82d8399 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"82d8399\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-58b1c2c\" data-id=\"58b1c2c\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-75d8ac1 elementor-widget elementor-widget-heading\" data-id=\"75d8ac1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">The Building Blocks<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-42e0e6e elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"42e0e6e\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element 
elementor-element-b3d6332\" data-id=\"b3d6332\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-375d1c8 elementor-widget elementor-widget-heading\" data-id=\"375d1c8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><em>Q: That was too basic.&nbsp;Can you expand on it?<\/em><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-89d8298 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"89d8298\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-f36df2e\" data-id=\"f36df2e\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-eda432c elementor-widget elementor-widget-text-editor\" data-id=\"eda432c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Okay, just remember that, in the end, you asked for it. 
Let\u2019s go a little deeper and try to understand what a transformer is composed of.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-24ab973 elementor-widget elementor-widget-text-editor\" data-id=\"24ab973\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>So, a transformer is essentially composed of a stack of encoder and decoder layers. The role of an encoder layer is to encode the English sentence into a numerical form using the attention mechanism, while the decoder aims to use the encoded information from the encoder layers to give the German translation for the particular English sentence.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-df15c3c elementor-widget elementor-widget-text-editor\" data-id=\"df15c3c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>In the figure below, the transformer is given as input an English sentence, which gets encoded using 6 encoder layers. 
The output from the final encoder layer then goes to each decoder layer to translate English to German.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-707fd8e elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"707fd8e\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-22f2b70\" data-id=\"22f2b70\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-684cdd5 elementor-widget elementor-widget-image\" data-id=\"684cdd5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" width=\"1024\" height=\"986\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_VtAiyIZOXmaELPfwFPfH5w-1024x986.png\" class=\"attachment-large size-large wp-image-33550\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_VtAiyIZOXmaELPfwFPfH5w-1024x986.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_VtAiyIZOXmaELPfwFPfH5w-300x289.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_VtAiyIZOXmaELPfwFPfH5w-768x740.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_VtAiyIZOXmaELPfwFPfH5w-610x587.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_VtAiyIZOXmaELPfwFPfH5w-750x722.png 750w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_VtAiyIZOXmaELPfwFPfH5w-1140x1098.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_VtAiyIZOXmaELPfwFPfH5w.png 1377w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-482f62b elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"482f62b\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-cfe2c91\" data-id=\"cfe2c91\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-6c47535 elementor-widget elementor-widget-heading\" data-id=\"6c47535\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">1. 
Encoder Architecture<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-3ea75fb elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"3ea75fb\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-ac34c3f\" data-id=\"ac34c3f\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-0dd6e29 elementor-widget elementor-widget-heading\" data-id=\"0dd6e29\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><em>Q: That\u2019s alright, but how does an encoder stack encode an English sentence exactly?<\/em><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-321749e elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"321749e\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-1b81164\" data-id=\"1b81164\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap 
elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5cb1d07 elementor-widget elementor-widget-text-editor\" data-id=\"5cb1d07\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Patience, I am getting to it. So, as I said, the encoder stack contains six encoder layers stacked on top of each other (as given in the paper, though future versions of transformers use even more layers). And each encoder in the stack has essentially two main layers:<\/p>\n\n<ul class=\"wp-block-list\">\n<li><strong>a multi-head self-attention layer, and<\/strong><\/li>\n<li><strong>a position-wise fully connected feed-forward network<\/strong><\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-167d038 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"167d038\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-db13bf7\" data-id=\"db13bf7\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-deba491 elementor-widget elementor-widget-image\" data-id=\"deba491\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"360\" height=\"230\" 
src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_whDBhkUOBvLFykyhTVfJ7Q.png\" class=\"attachment-large size-large wp-image-33551\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_whDBhkUOBvLFykyhTVfJ7Q.png 360w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_whDBhkUOBvLFykyhTVfJ7Q-300x192.png 300w\" sizes=\"(max-width: 360px) 100vw, 360px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-dee3904 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"dee3904\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-85c2f82\" data-id=\"85c2f82\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-a572ff8 elementor-widget elementor-widget-text-editor\" data-id=\"a572ff8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Quite a mouthful, right? Don\u2019t lose me yet, as I will explain both of them in the coming sections. 
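As a quick preview of how those two sub-layers fit together, here is a toy NumPy skeleton of one encoder layer, my own sketch rather than the paper's code: single-head attention without learned projections, plus the residual connection and layer normalization that the paper wraps around each sub-layer.

```python
import numpy as np

D, D_FF = 512, 2048                        # model and feed-forward dims used in the paper

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    # Toy single-head attention without learned Q/K/V projections.
    return softmax(x @ x.T / np.sqrt(x.shape[-1])) @ x

def layer_norm(x, eps=1e-6):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
w1, b1 = rng.standard_normal((D, D_FF)) * 0.02, np.zeros(D_FF)
w2, b2 = rng.standard_normal((D_FF, D)) * 0.02, np.zeros(D)

def encoder_layer(x):
    x = layer_norm(x + self_attention(x))                       # sub-layer 1: multi-head self-attention (toy)
    x = layer_norm(x + np.maximum(0, x @ w1 + b1) @ w2 + b2)    # sub-layer 2: position-wise feed-forward network
    return x                                                    # same (S, D) shape as the input

x = rng.standard_normal((9, D))            # a 9-word sentence
print(encoder_layer(x).shape)              # (9, 512)
```

Note that the feed-forward network is applied to every word position independently, which is what "position-wise" means.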
Right now, just remember that the encoder layer incorporates attention and a position-wise feed-forward network.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-b6bb34f elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"b6bb34f\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-b009e78\" data-id=\"b009e78\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-00eb161 elementor-widget elementor-widget-heading\" data-id=\"00eb161\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><em>Q: But, what shape does this layer expect its inputs to be in?<\/em><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-3cff680 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"3cff680\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-890f6da\" data-id=\"890f6da\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div 
class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5927c9f elementor-widget elementor-widget-text-editor\" data-id=\"5927c9f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>This layer expects its inputs to be of the shape <code>SxD<\/code> (as shown in the figure below), where <code>S<\/code> is the source (English) sentence length and <code>D<\/code> is the dimension of the embedding, whose weights are trained along with the network. In this post, we will use <code>D<\/code> = 512 by default throughout, while <code>S<\/code> is the maximum sentence length in a batch, so it normally changes from batch to batch.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-ff6315c elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"ff6315c\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-963f578\" data-id=\"963f578\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-45024e3 elementor-widget elementor-widget-image\" data-id=\"45024e3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"270\" height=\"464\" 
src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_fJZGl35q-b3l0ZBNUUL3rg.png\" class=\"attachment-large size-large wp-image-33552\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_fJZGl35q-b3l0ZBNUUL3rg.png 270w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_fJZGl35q-b3l0ZBNUUL3rg-175x300.png 175w\" sizes=\"(max-width: 270px) 100vw, 270px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-b232dc7 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"b232dc7\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-1a8ca14\" data-id=\"1a8ca14\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-0fec8b4 elementor-widget elementor-widget-text-editor\" data-id=\"0fec8b4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>And what about the outputs of this layer? Remember that the encoder layers are stacked on top of each other. So, we want to be able to have an output of the same dimension as the input so that the output can flow easily into the next encoder. 
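As a quick shape check, here is a minimal numpy sketch of the bookkeeping (toy random values; S = 4 and D = 512 as used in this post; the variable names are my own, not from the paper):

```python
import numpy as np

S, D = 4, 512                    # S = sentence length, D = embedding dimension
rng = np.random.default_rng(0)

x = rng.standard_normal((S, D))  # embedded input: one D-dim vector per word

# whatever an encoder layer does internally, it must return an SxD output
# so that it can flow straight into the next encoder in the stack
encoder_out = x                  # placeholder standing in for a real encoder layer

print(x.shape, encoder_out.shape)  # (4, 512) (4, 512)
```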
So the output is also of shape <code>SxD<\/code>.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-584cf94 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"584cf94\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-1ad3fcd\" data-id=\"1ad3fcd\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-75f96cc elementor-widget elementor-widget-heading\" data-id=\"75f96cc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><em>Q: Enough size talk. I understand what goes in and what goes out, but what actually happens in the Encoder layer?<\/em><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-1f9b24b elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"1f9b24b\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-d928cc2\" data-id=\"d928cc2\" 
data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-727b17b elementor-widget elementor-widget-text-editor\" data-id=\"727b17b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Okay, let\u2019s go through the attention layer and the feedforward layer one by one:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-0ece4c3 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"0ece4c3\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-f03f8f2\" data-id=\"f03f8f2\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-9baa1b5 elementor-widget elementor-widget-heading\" data-id=\"9baa1b5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Self-attention layer<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-ca42e01 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"ca42e01\" 
data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-5182ba2\" data-id=\"5182ba2\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-344290a elementor-widget elementor-widget-image\" data-id=\"344290a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"838\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_VBpS722t1eKJjoL895ilWg-1024x838.png\" class=\"attachment-large size-large wp-image-33553\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_VBpS722t1eKJjoL895ilWg-1024x838.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_VBpS722t1eKJjoL895ilWg-300x246.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_VBpS722t1eKJjoL895ilWg-768x629.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_VBpS722t1eKJjoL895ilWg-1536x1258.png 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_VBpS722t1eKJjoL895ilWg-2048x1677.png 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_VBpS722t1eKJjoL895ilWg-610x499.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_VBpS722t1eKJjoL895ilWg-750x614.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_VBpS722t1eKJjoL895ilWg-1140x933.png 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" 
\/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-943b48d elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"943b48d\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-beba53f\" data-id=\"beba53f\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-9ab660b elementor-widget elementor-widget-text-editor\" data-id=\"9ab660b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>The above figure must look daunting, but it is easy to understand. So just stay with me here.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ae48a36 elementor-widget elementor-widget-text-editor\" data-id=\"ae48a36\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Deep Learning is essentially nothing but a lot of matrix calculations, and what we are doing in this layer is those matrix calculations, done intelligently. The self-attention layer initializes 3 weight matrices \u2014 Query(W_q), Key(W_k), and Value(W_v). Each of these matrices has a size of (<code>Dxd<\/code>), where d is taken as 64 in the paper. 
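Calc 1 can be sketched in numpy as follows (toy random values; D = 512 and d = 64 as in the paper; the variable names are my own):

```python
import numpy as np

S, D, d = 4, 512, 64               # sentence length, model dim, projection dim
rng = np.random.default_rng(0)

X = rng.standard_normal((S, D))    # embedded input sentence
W_q = rng.standard_normal((D, d))  # trainable Query weight matrix
W_k = rng.standard_normal((D, d))  # trainable Key weight matrix
W_v = rng.standard_normal((D, d))  # trainable Value weight matrix

# Calc 1: project the input into query, key, and value spaces
Q, K, V = X @ W_q, X @ W_k, X @ W_v
print(Q.shape, K.shape, V.shape)   # (4, 64) (4, 64) (4, 64)
```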
The weights for these matrices will be trained when we train the model.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3e5a0d4 elementor-widget elementor-widget-text-editor\" data-id=\"3e5a0d4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>In the first calculation (Calc 1 in the figure), we create matrices Q, K, and V by multiplying the input with the respective Query, Key, and Value matrix.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b3ac397 elementor-widget elementor-widget-text-editor\" data-id=\"b3ac397\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Till now it is all trivial, but it is at the second calculation where it gets interesting. Let\u2019s try to understand the output of the softmax function. We start by multiplying the Q and K\u1d40 matrices to get a matrix of size <code>SxS<\/code> and divide it by the scalar \u221ad. We then take a softmax to make each row sum to one.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bfd87a0 elementor-widget elementor-widget-text-editor\" data-id=\"bfd87a0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Intuitively, we can think of the resultant <code>SxS<\/code> matrix as the contribution of each word to every other word. 
For example, it might look like this:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-e0182d0 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"e0182d0\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-21e319b\" data-id=\"21e319b\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-e4c4e68 elementor-widget elementor-widget-image\" data-id=\"e4c4e68\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"315\" height=\"203\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_RA8VDIcTzVnf631VmakP-w.png\" class=\"attachment-large size-large wp-image-33554\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_RA8VDIcTzVnf631VmakP-w.png 315w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_RA8VDIcTzVnf631VmakP-w-300x193.png 300w\" sizes=\"(max-width: 315px) 100vw, 315px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-4b336dc elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"4b336dc\" data-element_type=\"section\" 
data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-cd96677\" data-id=\"cd96677\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-74f7691 elementor-widget elementor-widget-text-editor\" data-id=\"74f7691\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>As you can see, the diagonal entries are big, because each word\u2019s contribution to itself is high. That is reasonable. But we can also see that the word \u201cquick\u201d splits its attention between \u201cquick\u201d and \u201cfox\u201d, and the word \u201cbrown\u201d splits its attention between \u201cbrown\u201d and \u201cfox\u201d. That intuitively tells us that both the words \u201cquick\u201d and \u201cbrown\u201d refer to the \u201cfox\u201d.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a310059 elementor-widget elementor-widget-text-editor\" data-id=\"a310059\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Once we have this SxS matrix of contributions, we multiply it by the Value matrix (Sxd) of the sentence, which gives us back a matrix of shape Sxd (4&#215;64). 
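Calc 2 can be sketched the same way (toy random values; the row-wise softmax is hand-rolled for the sketch, not taken from any library):

```python
import numpy as np

S, d = 4, 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((S, d))
K = rng.standard_normal((S, d))
V = rng.standard_normal((S, d))

def softmax(m):
    # row-wise softmax: every row of the SxS score matrix sums to one
    e = np.exp(m - m.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = softmax(Q @ K.T / np.sqrt(d))  # SxS matrix of word-to-word contributions
Z = scores @ V                          # Sxd: attention-weighted value vectors

print(scores.shape, Z.shape)            # (4, 4) (4, 64)
```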
So, what the operation actually does is replace the embedding vector of a word like \u201cquick\u201d with, say, 0.75 x (quick embedding) + 0.2 x (fox embedding), so that the resultant output for the word \u201cquick\u201d now has attention embedded in itself.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-84fad6e elementor-widget elementor-widget-text-editor\" data-id=\"84fad6e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Note that the output of this layer has dimension (Sxd), and before we are done with the whole encoder we need to change it back to D = 512, as the output of this encoder becomes the input of another encoder.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-28dcb64 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"28dcb64\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-cbbab0a\" data-id=\"cbbab0a\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5a10dc4 elementor-widget elementor-widget-heading\" data-id=\"5a10dc4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><em>Q: But, you called this layer Multi-head 
self-attention Layer. What is the multi-head?<\/em><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-2b8fe4c elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"2b8fe4c\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-c80ad25\" data-id=\"c80ad25\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-d5350eb elementor-widget elementor-widget-text-editor\" data-id=\"d5350eb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Okay, my bad, but in my defense, I was just getting to that.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b0469f1 elementor-widget elementor-widget-text-editor\" data-id=\"b0469f1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>It\u2019s called multi-head because we use many such self-attention layers (heads) side by side, in parallel. The number of attention heads, h, is kept as 8 in the paper. So the input X goes through the 8 self-attention heads in parallel, each of which gives a z matrix of shape (Sxd) = 4&#215;64. 
We concatenate these 8 (h) matrices and then apply a final output linear layer, Wo, of size DxD.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1331a9b elementor-widget elementor-widget-text-editor\" data-id=\"1331a9b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>What size do we get? From the concatenate operation we get a size of SxD (4 x (64&#215;8) = 4&#215;512). And multiplying this output by Wo, we get the final output Z with the shape SxD (4&#215;512), as desired.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-41ae0f2 elementor-widget elementor-widget-text-editor\" data-id=\"41ae0f2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Also, note the relation between h, d, and D, i.e. 
h x d = D<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-2c2bf97 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"2c2bf97\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-fe94554\" data-id=\"fe94554\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-c6ab3b4 elementor-widget elementor-widget-image\" data-id=\"c6ab3b4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"266\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_h6oLUHJ9QvALsU_RR3gOow-1024x266.png\" class=\"attachment-large size-large wp-image-33555\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_h6oLUHJ9QvALsU_RR3gOow-1024x266.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_h6oLUHJ9QvALsU_RR3gOow-300x78.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_h6oLUHJ9QvALsU_RR3gOow-768x199.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_h6oLUHJ9QvALsU_RR3gOow-1536x399.png 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_h6oLUHJ9QvALsU_RR3gOow-2048x532.png 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_h6oLUHJ9QvALsU_RR3gOow-610x158.png 610w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_h6oLUHJ9QvALsU_RR3gOow-750x195.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_h6oLUHJ9QvALsU_RR3gOow-1140x296.png 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-0eb8091 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"0eb8091\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-f069964\" data-id=\"f069964\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-8299ddd elementor-widget elementor-widget-text-editor\" data-id=\"8299ddd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Thus, we finally get the output Z of shape 4&#215;512 as intended. 
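The multi-head bookkeeping above can be sketched as follows (toy values; h = 8, d = 64, D = 512 as in the paper; the per-head outputs are random stand-ins):

```python
import numpy as np

S, D, d, h = 4, 512, 64, 8               # note that h * d = D
rng = np.random.default_rng(0)

# stand-ins for the Sxd output z of each of the h self-attention heads
heads = [rng.standard_normal((S, d)) for _ in range(h)]

W_o = rng.standard_normal((D, D))        # final output linear layer

concat = np.concatenate(heads, axis=-1)  # SxD: 4 x (64 * 8) = 4 x 512
Z = concat @ W_o                         # SxD: final output, ready for the FFN

print(concat.shape, Z.shape)             # (4, 512) (4, 512)
```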
But before it goes into another encoder we pass it through a Feed-Forward Network.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-13c8769 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"13c8769\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-7f48231\" data-id=\"7f48231\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-78faa09 elementor-widget elementor-widget-heading\" data-id=\"78faa09\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Position-wise feed-forward network<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-b9a0a3a elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"b9a0a3a\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-ecfb17a\" data-id=\"ecfb17a\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap 
elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-0a40489 elementor-widget elementor-widget-text-editor\" data-id=\"0a40489\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Once we understand the multi-headed attention layer, the feed-forward network is actually pretty easy to understand. It is just a combination of various linear and dropout layers on the output Z. Consequently, it is again just a lot of matrix multiplication here.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-e2773bf elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"e2773bf\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-114e708\" data-id=\"114e708\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-fdbff7b elementor-widget elementor-widget-image\" data-id=\"fdbff7b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"383\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_1l5JbeGfEGh2oxjI8koHdQ-1024x383.png\" class=\"attachment-large size-large wp-image-33556\" alt=\"\" 
srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_1l5JbeGfEGh2oxjI8koHdQ-1024x383.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_1l5JbeGfEGh2oxjI8koHdQ-300x112.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_1l5JbeGfEGh2oxjI8koHdQ-768x287.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_1l5JbeGfEGh2oxjI8koHdQ-1536x575.png 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_1l5JbeGfEGh2oxjI8koHdQ-2048x767.png 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_1l5JbeGfEGh2oxjI8koHdQ-610x228.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_1l5JbeGfEGh2oxjI8koHdQ-750x281.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_1l5JbeGfEGh2oxjI8koHdQ-1140x427.png 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-e93c9b1 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"e93c9b1\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-914757e\" data-id=\"914757e\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-85574fb elementor-widget elementor-widget-text-editor\" data-id=\"85574fb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>The feed-forward network is applied to each position in the output Z in parallel (each position can be thought of as a word), hence the name position-wise feed-forward network. The feed-forward network also shares weights across positions, so the length of the source sentence doesn\u2019t matter (also, if it didn\u2019t share weights, we would have to initialize a separate network for every position up to the maximum source sentence length, and that is not feasible).<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-4d23f1e elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"4d23f1e\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-73b5a35\" data-id=\"73b5a35\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-1a54512 elementor-widget elementor-widget-image\" data-id=\"1a54512\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"281\" height=\"198\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_kv6ibcFq0lP7sVEpYBGukQ.png\" class=\"attachment-large size-large wp-image-33557\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_kv6ibcFq0lP7sVEpYBGukQ.png 281w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_kv6ibcFq0lP7sVEpYBGukQ-120x86.png 120w\" sizes=\"(max-width: 281px) 100vw, 281px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f4bc046 elementor-widget elementor-widget-text-editor\" data-id=\"f4bc046\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>With this, we near an okayish understanding of the encoder part of the Transformer.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-22cdee0 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"22cdee0\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-128f450\" data-id=\"128f450\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-6c69fcc elementor-widget elementor-widget-heading\" data-id=\"6c69fcc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><em>Q: Hey, I was just going through the picture in the paper, and the encoder stack has something called \u201cpositional encoding\u201d and \u201cAdd &amp; Norm\u201d also. 
What are these?<\/em><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-65f9a17 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"65f9a17\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-02b70aa\" data-id=\"02b70aa\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-64ded99 elementor-widget elementor-widget-image\" data-id=\"64ded99\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"342\" height=\"504\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/0_yteLD8FQFIm-GSWC.png\" class=\"attachment-large size-large wp-image-33558\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/0_yteLD8FQFIm-GSWC.png 342w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/0_yteLD8FQFIm-GSWC-204x300.png 204w\" sizes=\"(max-width: 342px) 100vw, 342px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-f1a3104 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"f1a3104\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div 
class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-ca2fa66\" data-id=\"ca2fa66\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-c204829 elementor-widget elementor-widget-text-editor\" data-id=\"c204829\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Okay, these two concepts are pretty essential to this particular architecture, and I am glad you asked. So, we will discuss these steps before moving on to the decoder stack.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-7092902 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"7092902\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-b9dc988\" data-id=\"b9dc988\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-0ced143 elementor-widget elementor-widget-heading\" data-id=\"0ced143\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">C. 
Positional Encodings<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-ca4fd03 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"ca4fd03\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-11cff0d\" data-id=\"11cff0d\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-d056bdf elementor-widget elementor-widget-text-editor\" data-id=\"d056bdf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Since our model contains no recurrence and no convolution, in order for it to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add \u201cpositional encodings\u201d to the input embeddings at the bottoms of both the encoder and decoder stacks (as we will see later). 
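To make this concrete, here is a minimal NumPy sketch of the sinusoidal encoding; the function name and the toy 4-word, 512-dimensional example are my own illustrative choices, not code from the paper:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings from the Transformer paper:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    """
    pe = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, None]                        # (max_len, 1)
    div_term = np.power(10000.0, np.arange(0, d_model, 2) / d_model)
    pe[:, 0::2] = np.sin(position / div_term)                     # even dims
    pe[:, 1::2] = np.cos(position / div_term)                     # odd dims
    return pe

# The encoding has the same dimension D as the embeddings,
# so the two can simply be summed element-wise:
pe = positional_encoding(max_len=4, d_model=512)
embeddings = np.random.randn(4, 512)   # S x D, e.g. a 4-word sentence
x = embeddings + pe                    # input to the first encoder layer
```

Note how each position gets a unique, deterministic pattern, which is what the heatmaps below visualize.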
The positional encodings need to have the same dimension D as the embeddings so that the two can be summed.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-6d8d69d elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"6d8d69d\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-a40a69c\" data-id=\"a40a69c\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-dbf2a9d elementor-widget elementor-widget-image\" data-id=\"dbf2a9d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"188\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_wvVZmlUGeuJUMgJPt4k2Bw-1024x188.png\" class=\"attachment-large size-large wp-image-33559\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_wvVZmlUGeuJUMgJPt4k2Bw-1024x188.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_wvVZmlUGeuJUMgJPt4k2Bw-300x55.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_wvVZmlUGeuJUMgJPt4k2Bw-768x141.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_wvVZmlUGeuJUMgJPt4k2Bw-610x112.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_wvVZmlUGeuJUMgJPt4k2Bw-750x138.png 750w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_wvVZmlUGeuJUMgJPt4k2Bw-1140x209.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_wvVZmlUGeuJUMgJPt4k2Bw.png 1471w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-65827d6 elementor-widget elementor-widget-text-editor\" data-id=\"65827d6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>In the paper, the authors used sine and cosine functions to create positional embeddings for different positions.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f43d2e5 elementor-widget elementor-widget-text-editor\" data-id=\"f43d2e5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>These functions generate a 2-D matrix that is added to the embedding matrix before it goes into the first encoder step.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6a7a1b4 elementor-widget elementor-widget-text-editor\" data-id=\"6a7a1b4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Put simply, it\u2019s just a constant matrix that we add to the sentence embeddings so that the network can infer the position of each word.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-438e6fb elementor-section-boxed elementor-section-height-default 
elementor-section-height-default\" data-id=\"438e6fb\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-ec46c65\" data-id=\"ec46c65\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-d8b8288 elementor-widget elementor-widget-image\" data-id=\"d8b8288\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"952\" height=\"362\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_9dlYkvIF81idv_owWPebgg.png\" class=\"attachment-large size-large wp-image-33560\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_9dlYkvIF81idv_owWPebgg.png 952w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_9dlYkvIF81idv_owWPebgg-300x114.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_9dlYkvIF81idv_owWPebgg-768x292.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_9dlYkvIF81idv_owWPebgg-610x232.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_9dlYkvIF81idv_owWPebgg-750x285.png 750w\" sizes=\"(max-width: 952px) 100vw, 952px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-47a9a06 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"47a9a06\" data-element_type=\"section\" 
data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-d5db57d\" data-id=\"d5db57d\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-923723d elementor-widget elementor-widget-image\" data-id=\"923723d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"961\" height=\"364\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_vkbzCQl841IOvgi6UQpwCw.png\" class=\"attachment-large size-large wp-image-33562\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_vkbzCQl841IOvgi6UQpwCw.png 961w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_vkbzCQl841IOvgi6UQpwCw-300x114.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_vkbzCQl841IOvgi6UQpwCw-768x291.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_vkbzCQl841IOvgi6UQpwCw-610x231.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_vkbzCQl841IOvgi6UQpwCw-750x284.png 750w\" sizes=\"(max-width: 961px) 100vw, 961px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-f9bd56e elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"f9bd56e\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container 
elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-17ce4b3\" data-id=\"17ce4b3\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-36b6abc elementor-widget elementor-widget-text-editor\" data-id=\"36b6abc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Above is the heatmap of the positional encoding matrix that we add to the input of the first encoder, shown for the first 300 positions and for the first 3000 positions. We can see that there is a distinct pattern that we provide to our Transformer to understand the position of each word. And since we are using functions composed of sine and cosine, we are able to encode very high positions pretty well too, as we can see in the second picture.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d9607b0 elementor-widget elementor-widget-text-editor\" data-id=\"d9607b0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><strong>Interesting Fact:<\/strong> The authors also tried letting the Transformer learn these encodings and didn\u2019t see any real difference in performance. 
So, they went with the sinusoidal version, as it doesn\u2019t depend on sentence length: even if a test sentence is longer than the training samples, we would still be fine.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-0dcf608 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"0dcf608\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-591654d\" data-id=\"591654d\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-7e2c9e1 elementor-widget elementor-widget-heading\" data-id=\"7e2c9e1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">D. 
Add and Normalize<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-e55148e elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"e55148e\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-6189619\" data-id=\"6189619\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5e9cc03 elementor-widget elementor-widget-text-editor\" data-id=\"5e9cc03\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Another thing that I didn\u2019t mention while explaining the encoder, for the sake of simplicity, is that the encoder architecture (and the decoder architecture too) also has skip-level residual connections (something akin to ResNet). So, the exact encoder architecture in the paper looks like the figure below. Simply put, residual connections help information travel a much greater distance through a deep neural network. 
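As a minimal sketch of this step (the function names and the dummy sublayer are my own illustrative choices), each sublayer's output is wrapped as LayerNorm(x + Sublayer(x)):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean, unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """Residual (skip) connection around a sublayer, followed by LayerNorm,
    i.e. LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

# Example with a dummy sublayer standing in for self-attention or the FFN:
x = np.random.randn(4, 512)                 # S x D activations
out = add_and_norm(x, lambda z: 0.1 * z)    # dummy sublayer for illustration
```

The residual path means the sublayer only has to learn a modification of its input rather than a full transformation, which is what makes deep stacks trainable.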
Intuitively, this can be thought of as information passing in an organization where you have access to your manager as well as to your manager\u2019s manager.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-ab3ae35 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"ab3ae35\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-b459ee1\" data-id=\"b459ee1\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-de07faf elementor-widget elementor-widget-image\" data-id=\"de07faf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"213\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_a4QnN4pf-tDnSyKba1ObWQ-1024x213.png\" class=\"attachment-large size-large wp-image-33563\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_a4QnN4pf-tDnSyKba1ObWQ-1024x213.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_a4QnN4pf-tDnSyKba1ObWQ-300x62.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_a4QnN4pf-tDnSyKba1ObWQ-768x159.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_a4QnN4pf-tDnSyKba1ObWQ-1536x319.png 1536w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_a4QnN4pf-tDnSyKba1ObWQ-2048x425.png 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_a4QnN4pf-tDnSyKba1ObWQ-610x127.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_a4QnN4pf-tDnSyKba1ObWQ-750x156.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_a4QnN4pf-tDnSyKba1ObWQ-1140x237.png 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-b09016f elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"b09016f\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-a9478fb\" data-id=\"a9478fb\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-7b8ba09 elementor-widget elementor-widget-heading\" data-id=\"7b8ba09\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">2. 
Decoder Architecture<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-318ba9c elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"318ba9c\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-3d20f5b\" data-id=\"3d20f5b\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-d98849b elementor-widget elementor-widget-heading\" data-id=\"d98849b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">Q: Okay, so far we have learned that an encoder takes an input sentence and encodes its information in a matrix of size SxD (4x512). 
That\u2019s all great, but how does it help the decoder decode it into German?<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-4215430 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"4215430\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-5014866\" data-id=\"5014866\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-8c0d34f elementor-widget elementor-widget-text-editor\" data-id=\"8c0d34f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Good things come to those who wait. 
So, before understanding how the decoder does that, let us understand the decoder stack.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7a49bd2 elementor-widget elementor-widget-text-editor\" data-id=\"7a49bd2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>The decoder stack contains 6 decoder layers (as given in the paper again), and each decoder in the stack is composed of these three main layers:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Masked multi-head self-attention layer<\/strong><\/li>\n<li><strong>Multi-head self-attention layer, and<\/strong><\/li>\n<li><strong>Position-wise fully connected feed-forward network<\/strong><\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-3bce84c elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"3bce84c\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-3c3a7b5\" data-id=\"3c3a7b5\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-bce5bae elementor-widget elementor-widget-text-editor\" data-id=\"bce5bae\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>It also has the 
same positional encodings and skip-level connections. We already know how the multi-head attention and feed-forward network layers work, so we will go straight to what is different in the decoder compared to the encoder.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-0eff726 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"0eff726\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-c43f6d4\" data-id=\"c43f6d4\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-d44c658 elementor-widget elementor-widget-image\" data-id=\"d44c658\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1011\" height=\"1024\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_U50KSNr_u5KtXb42KZ3Mpg-1011x1024.png\" class=\"attachment-large size-large wp-image-33564\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_U50KSNr_u5KtXb42KZ3Mpg-1011x1024.png 1011w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_U50KSNr_u5KtXb42KZ3Mpg-296x300.png 296w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_U50KSNr_u5KtXb42KZ3Mpg-768x778.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_U50KSNr_u5KtXb42KZ3Mpg-1516x1536.png 
1516w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_U50KSNr_u5KtXb42KZ3Mpg-2021x2048.png 2021w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_U50KSNr_u5KtXb42KZ3Mpg-610x618.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_U50KSNr_u5KtXb42KZ3Mpg-75x75.png 75w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_U50KSNr_u5KtXb42KZ3Mpg-750x760.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_U50KSNr_u5KtXb42KZ3Mpg-1140x1155.png 1140w\" sizes=\"(max-width: 1011px) 100vw, 1011px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-71a561b elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"71a561b\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-27e5a4a\" data-id=\"27e5a4a\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-2286361 elementor-widget elementor-widget-heading\" data-id=\"2286361\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><em>Q: Wait, but do I see the output we need flowing into the decoder as input? What? 
Why?&nbsp;<\/em><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-02b3f9d elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"02b3f9d\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-3a3c48c\" data-id=\"3a3c48c\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b9bac02 elementor-widget elementor-widget-text-editor\" data-id=\"b9bac02\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>I am noticing that you are getting pretty good at asking questions. And that is a great question, something I wondered about myself many times, and something that I hope will get much clearer by the time you reach the end of this post.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a77c409 elementor-widget elementor-widget-text-editor\" data-id=\"a77c409\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>But to give an intuition, we can think of a transformer as a conditional language model in this case. 
A model that predicts the next word given the previous words and an English sentence to condition its prediction on.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-79257e4 elementor-widget elementor-widget-text-editor\" data-id=\"79257e4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Such models are inherently sequential. How would you train such a model? You start by giving the start token (<code>&lt;s&gt;<\/code>) and the model predicts the first word conditioned on the English sentence. You change the weights based on whether the prediction is right or wrong. Then you give the start token and the first word (<code>&lt;s&gt; der<\/code>) and the model predicts the second word. You change the weights again. And so on.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-6ffa146 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"6ffa146\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-796d1bb\" data-id=\"796d1bb\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5993a6c elementor-widget elementor-widget-text-editor\" data-id=\"5993a6c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>The transformer decoder learns just like that, but the beauty is that it doesn\u2019t do it in a sequential manner. It uses masking to do this calculation and thus takes in the whole output sentence while training (although shifted right by adding an <code>&lt;s&gt;<\/code> token to the front). Also, please note that at prediction time we won\u2019t give the output to the network.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-78394a1 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"78394a1\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-1fe6f77\" data-id=\"1fe6f77\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-78112d3 elementor-widget elementor-widget-heading\" data-id=\"78112d3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><em>Q: But, how does this masking exactly work?<\/em><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-5bae4ef elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"5bae4ef\" 
data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-d8cad83\" data-id=\"d8cad83\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-e81edac elementor-widget elementor-widget-heading\" data-id=\"e81edac\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">A) Masked Multi-Head Self Attention Layer<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-7068399 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"7068399\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-291e120\" data-id=\"291e120\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-d62fc5b elementor-widget elementor-widget-text-editor\" data-id=\"d62fc5b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>It works as usual, you wear it, I mean. 
Kidding aside, as you can see, this time we have a <strong>Masked<\/strong> Multi-Head Attention Layer in our decoder. This means that we will mask our shifted output (that is, the input to the decoder) in a way that the network is never able to see the subsequent words, since otherwise it could simply copy them while training.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1acfae8 elementor-widget elementor-widget-text-editor\" data-id=\"1acfae8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>So, how exactly does the mask work in the masked attention layer? If you remember, in the attention layer we multiplied the query (Q) and key (K) matrices and divided the result by sqrt(d) before taking the softmax.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2808129 elementor-widget elementor-widget-text-editor\" data-id=\"2808129\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>In a masked attention layer, though, we add a masking matrix to the resultant matrix before the softmax (which will be of shape (TxT)).<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-617dc32 elementor-widget elementor-widget-text-editor\" data-id=\"617dc32\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>So, in a masked layer, the function changes from:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-e64b1e9 
elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"e64b1e9\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-c6e5b9d\" data-id=\"c6e5b9d\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-410ec5c elementor-widget elementor-widget-image\" data-id=\"410ec5c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"190\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_vdXoUklpf8XN6HEABbQgqA-1024x190.png\" class=\"attachment-large size-large wp-image-33565\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_vdXoUklpf8XN6HEABbQgqA-1024x190.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_vdXoUklpf8XN6HEABbQgqA-300x56.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_vdXoUklpf8XN6HEABbQgqA-768x143.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_vdXoUklpf8XN6HEABbQgqA-1536x285.png 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_vdXoUklpf8XN6HEABbQgqA-610x113.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_vdXoUklpf8XN6HEABbQgqA-750x139.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_vdXoUklpf8XN6HEABbQgqA-1140x212.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_vdXoUklpf8XN6HEABbQgqA.png 1744w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" 
\/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-2be3dfe elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"2be3dfe\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-1479923\" data-id=\"1479923\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5179e79 elementor-widget elementor-widget-heading\" data-id=\"5179e79\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><em>Q: I still don\u2019t get it, what happens if we do that?<\/em><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-42c2522 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"42c2522\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-cff5788\" data-id=\"cff5788\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div 
class=\"elementor-element elementor-element-4f46bd7 elementor-widget elementor-widget-text-editor\" data-id=\"4f46bd7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>That\u2019s understandable, actually. Let me break it into steps. So, our resultant matrix (QxK\/sqrt(d)) of shape (TxT) might look something like the one below (the numbers can be large since softmax has not been applied yet):<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-9b8e29e elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"9b8e29e\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-7bbf9f9\" data-id=\"7bbf9f9\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-e29648a elementor-widget elementor-widget-image\" data-id=\"e29648a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"432\" height=\"353\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_BuSTz8bskZy2PxrJZmmm2w.png\" class=\"attachment-large size-large wp-image-33566\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_BuSTz8bskZy2PxrJZmmm2w.png 432w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_BuSTz8bskZy2PxrJZmmm2w-300x245.png 300w\" sizes=\"(max-width: 432px) 100vw, 432px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-200ce6a elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"200ce6a\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-16d6ac6\" data-id=\"16d6ac6\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-f6a52f7 elementor-widget elementor-widget-text-editor\" data-id=\"f6a52f7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>If we took the softmax of the above matrix and multiplied it with the value matrix V, the representation of the word Schnelle would now contain information from both Braune and Fuchs. 
But we don\u2019t want that, so we add the mask matrix to it to give:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-b6664b4 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"b6664b4\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-07f4eeb\" data-id=\"07f4eeb\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-0abc7bb elementor-widget elementor-widget-image\" data-id=\"0abc7bb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"267\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_rqzdoHsE_GviBhKZM-liSw-1024x267.png\" class=\"attachment-large size-large wp-image-33567\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_rqzdoHsE_GviBhKZM-liSw-1024x267.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_rqzdoHsE_GviBhKZM-liSw-300x78.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_rqzdoHsE_GviBhKZM-liSw-768x200.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_rqzdoHsE_GviBhKZM-liSw-1536x401.png 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_rqzdoHsE_GviBhKZM-liSw-610x159.png 610w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_rqzdoHsE_GviBhKZM-liSw-750x196.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_rqzdoHsE_GviBhKZM-liSw-1140x298.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_rqzdoHsE_GviBhKZM-liSw.png 2038w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9f090f4 elementor-widget elementor-widget-text-editor\" data-id=\"9f090f4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>And, now what will happen after we do the softmax step?<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-446dabe elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"446dabe\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-abd2115\" data-id=\"abd2115\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-6957ab2 elementor-widget elementor-widget-image\" data-id=\"6957ab2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"430\" height=\"354\" 
src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_pr57axue8zssFIqetfUZMw.png\" class=\"attachment-large size-large wp-image-33569\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_pr57axue8zssFIqetfUZMw.png 430w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_pr57axue8zssFIqetfUZMw-300x247.png 300w\" sizes=\"(max-width: 430px) 100vw, 430px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-c9ef9e9 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"c9ef9e9\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-2d388f8\" data-id=\"2d388f8\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b1ba894 elementor-widget elementor-widget-text-editor\" data-id=\"b1ba894\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Since e^{-inf} = 0, all positions subsequent to Schnelle have been converted to 0. 
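To make the masking step concrete, here is a minimal NumPy sketch (the score values and the 4-token sentence are invented for illustration; this is a sketch of the mechanism, not the paper's actual implementation):

```python
import numpy as np

# Toy pre-softmax scores (QxK/sqrt(d)) for a 4-token target sentence,
# e.g. "<s> Schnelle Braune Fuchs" -- the numbers here are made up.
T = 4
scores = np.array([[4.0, 3.0, 2.0, 1.0],
                   [2.0, 5.0, 3.0, 1.0],
                   [1.0, 2.0, 6.0, 2.0],
                   [0.5, 1.0, 2.0, 7.0]])

# Mask matrix: 0 on and below the diagonal, -inf above it, so each
# position can only attend to itself and to earlier positions.
mask = np.triu(np.full((T, T), -np.inf), k=1)
masked = scores + mask

# Row-wise softmax; since e^{-inf} = 0, every future position gets weight 0.
weights = np.exp(masked - masked.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

print(np.round(weights, 3))
```

Multiplying these weights with the value matrix V then mixes in only the current and previous positions, exactly as described above.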
Now, if we multiply this matrix with the value matrix V, the vector corresponding to Schnelle\u2019s position in the Z vector passing through the decoder would not contain any information about the subsequent words Braune and Fuchs, just as we wanted.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0ab3f6c elementor-widget elementor-widget-text-editor\" data-id=\"0ab3f6c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>And that is how the transformer takes in the whole shifted output sentence at once rather than learning in a sequential manner. Pretty neat, I must say.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-1163ce3 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"1163ce3\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-f847184\" data-id=\"f847184\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-16879df elementor-widget elementor-widget-heading\" data-id=\"16879df\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><em>Q: Are you kidding me? 
That\u2019s actually awesome.<\/em><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-1e45280 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"1e45280\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-5def159\" data-id=\"5def159\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-186e3b9 elementor-widget elementor-widget-text-editor\" data-id=\"186e3b9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>So glad that you are still with me and you appreciate it. Now, coming back to the decoder. 
The next layer in the decoder is:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-1aad079 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"1aad079\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-aa0aa3d\" data-id=\"aa0aa3d\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-404a1fb elementor-widget elementor-widget-heading\" data-id=\"404a1fb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Multi-Headed Attention Layer<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-2c1909b elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"2c1909b\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-54bd431\" data-id=\"54bd431\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element 
elementor-element-e2734af elementor-widget elementor-widget-text-editor\" data-id=\"e2734af\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>As you can see in the decoder architecture, a Z vector (the output of the encoder) flows from the encoder to the multi-head attention layer in the decoder. This Z output from the last encoder has a special name and is often called the memory. The attention layer takes as input both the encoder output and the data flowing from below (the shifted outputs) and applies attention over them. The Query (Q) vectors are created from the data flowing in the decoder, while the Key (K) and Value (V) vectors come from the encoder output.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-70f737f elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"70f737f\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-bbcb506\" data-id=\"bbcb506\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-17ef9ef elementor-widget elementor-widget-heading\" data-id=\"17ef9ef\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><em>Q: Isn\u2019t there any mask 
here?<\/em><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-c8065b3 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"c8065b3\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-8cab041\" data-id=\"8cab041\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-07e9bae elementor-widget elementor-widget-text-editor\" data-id=\"07e9bae\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>No, there is no mask here. The output coming from below is already masked, and this allows every position in the decoder to attend over all the positions in the Value vector. 
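The shapes make this easy to check. Here is a minimal NumPy sketch of this encoder-decoder attention (single head, no mask; the dimensions d, S, and T are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8   # model dimension (illustrative)
S = 5   # source (English) sentence length
T = 3   # target (German) sentence length

Q = rng.normal(size=(T, d))  # queries built from the decoder's data
K = rng.normal(size=(S, d))  # keys from the encoder output (the "memory")
V = rng.normal(size=(S, d))  # values from the encoder output

# Every one of the T target positions attends over all S source positions.
scores = Q @ K.T / np.sqrt(d)                  # shape (T, S)
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # each row sums to 1
Z = weights @ V                                # shape (T, d): S cancels away

print(scores.shape, Z.shape)  # (3, 5) (3, 8)
```

Note how the source length S appears in the attention weights but cancels out of the final Z, which is why every generated position can see the whole English sentence.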
So, for every word position to be generated, the decoder has access to the whole English sentence.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-05234e2 elementor-widget elementor-widget-text-editor\" data-id=\"05234e2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Here is a single attention layer (which will be part of a multi-head, just like before):<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-5f522c2 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"5f522c2\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-d00df68\" data-id=\"d00df68\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-9512825 elementor-widget elementor-widget-image\" data-id=\"9512825\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"840\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_oX9qWLSpjXm9MfcNdeA9AQ-1024x840.png\" class=\"attachment-large size-large wp-image-33570\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_oX9qWLSpjXm9MfcNdeA9AQ-1024x840.png 1024w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_oX9qWLSpjXm9MfcNdeA9AQ-300x246.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_oX9qWLSpjXm9MfcNdeA9AQ-768x630.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_oX9qWLSpjXm9MfcNdeA9AQ-1536x1260.png 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_oX9qWLSpjXm9MfcNdeA9AQ-2048x1680.png 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_oX9qWLSpjXm9MfcNdeA9AQ-610x500.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_oX9qWLSpjXm9MfcNdeA9AQ-750x615.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_oX9qWLSpjXm9MfcNdeA9AQ-1140x935.png 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-51c0f00 elementor-widget elementor-widget-text-editor\" data-id=\"51c0f00\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>&#8220;Isn\u2019t there any mask here&#8221;<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-a10924c elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"a10924c\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-007a530\" data-id=\"007a530\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div 
class=\"elementor-element elementor-element-23cf935 elementor-widget elementor-widget-heading\" data-id=\"23cf935\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><em>Q: But won\u2019t the shapes of Q, K, and V be different this time?<\/em><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-e1bbe60 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"e1bbe60\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-c9449f6\" data-id=\"c9449f6\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-8be4718 elementor-widget elementor-widget-text-editor\" data-id=\"8be4718\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>You can look at the figure, where I have done all the weight calculations. I would also ask you to note the shape of the resultant Z vector, and how none of our weight matrices so far have used the target or source sentence length in any of their dimensions. Normally, the sentence-length dimension cancels away in our matrix calculations. For example, see how the S dimension cancels away in calculation 2 above. That is why, when selecting batches during training, the authors talk about tight batches. 
That is, within a batch, all source sentences have similar lengths, while different batches can have different source lengths.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1524956 elementor-widget elementor-widget-text-editor\" data-id=\"1524956\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>I will now talk about the skip-level connections and the feed-forward layer. They are actually the same as in \u2026.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-8028847 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"8028847\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-581b952\" data-id=\"581b952\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-ef6b757 elementor-widget elementor-widget-heading\" data-id=\"ef6b757\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><em>Q: Ok, I get it. 
We have the skip-level connections and the FF layer, and we get a matrix of shape TxD after this whole decode operation.<\/em>&nbsp;<em>But where is the German translation?<\/em><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-690f78d elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"690f78d\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-950b515\" data-id=\"950b515\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-28f6358 elementor-widget elementor-widget-heading\" data-id=\"28f6358\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">3. 
Output Head<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-62bf4a6 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"62bf4a6\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-3df7ba4\" data-id=\"3df7ba4\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-82bd07d elementor-widget elementor-widget-text-editor\" data-id=\"82bd07d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>We are actually very much there now, friend. Once we are done with the transformer, the next thing is to add a task-specific output head on top of the decoder output. This can be done by adding some linear layers and a softmax on top to get the probability <em>across all the words in the German vocabulary<\/em>. 
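To make this concrete, here is a minimal sketch of such an output head in PyTorch. The dimensions (d_model = 512, a 37,000-word shared vocabulary) follow the paper, but the class itself is my own illustration, not the authors' code:

```python
import torch
import torch.nn as nn

class OutputHead(nn.Module):
    """Project decoder outputs (T x D) to a probability over the target vocab."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, decoder_out: torch.Tensor) -> torch.Tensor:
        logits = self.proj(decoder_out)        # (T, D) -> (T, vocab_size)
        return torch.softmax(logits, dim=-1)   # one distribution per position

head = OutputHead(d_model=512, vocab_size=37000)
probs = head(torch.randn(10, 512))             # 10 target positions
print(probs.shape)                             # torch.Size([10, 37000])
```

Note that in the paper this projection actually shares its weights with the embedding layers, a detail the sketch above leaves out.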
We can do something like this:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-6a45c58 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"6a45c58\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-70f166f\" data-id=\"70f166f\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-93d6d28 elementor-widget elementor-widget-image\" data-id=\"93d6d28\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"328\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_Ofas691Sf7KBqANNgUUXkw-1024x328.png\" class=\"attachment-large size-large wp-image-33571\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_Ofas691Sf7KBqANNgUUXkw-1024x328.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_Ofas691Sf7KBqANNgUUXkw-300x96.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_Ofas691Sf7KBqANNgUUXkw-768x246.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_Ofas691Sf7KBqANNgUUXkw-1536x493.png 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_Ofas691Sf7KBqANNgUUXkw-610x196.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_Ofas691Sf7KBqANNgUUXkw-750x241.png 
750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_Ofas691Sf7KBqANNgUUXkw-1140x366.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_Ofas691Sf7KBqANNgUUXkw.png 2011w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-633d5b1 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"633d5b1\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-d56865b\" data-id=\"d56865b\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-2f116e3 elementor-widget elementor-widget-text-editor\" data-id=\"2f116e3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>As you can see we are able to generate probabilities. So far we know how to do a forward pass through this Transformer architecture. 
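At inference time, that forward pass can be run repeatedly to grow the German sentence one word at a time. Below is a toy greedy-decoding sketch; the `toy_model`, the token ids, and the vocabulary size are all made up for illustration, and the paper itself actually uses beam search rather than greedy decoding:

```python
import torch

def greedy_decode(model, src_ids, bos_id=1, eos_id=2, max_len=20):
    """Repeatedly run the forward pass and append the most probable next word.

    `model(src_ids, tgt_ids)` is assumed to return probabilities of shape
    (len(tgt_ids), vocab_size) -- one distribution per target position.
    """
    tgt = [bos_id]
    for _ in range(max_len):
        probs = model(src_ids, torch.tensor(tgt))
        next_id = int(probs[-1].argmax())      # distribution at the last position
        tgt.append(next_id)
        if next_id == eos_id:
            break
    return tgt

def toy_model(src_ids, tgt_ids):
    # A stand-in "model": predicts word 5 twice, then the end-of-sentence token.
    probs = torch.full((len(tgt_ids), 10), 0.1)
    probs[-1, 5 if len(tgt_ids) < 3 else 2] = 0.9
    return probs

print(greedy_decode(toy_model, torch.tensor([7, 8, 9])))  # [1, 5, 5, 2]
```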
Let us now see how to train such a neural network architecture.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-45b32d8 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"45b32d8\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-c994594\" data-id=\"c994594\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-9bf5072 elementor-widget elementor-widget-heading\" data-id=\"9bf5072\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Training:<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-dcbbb34 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"dcbbb34\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-5ede314\" data-id=\"5ede314\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div 
class=\"elementor-element elementor-element-5dbef94 elementor-widget elementor-widget-text-editor\" data-id=\"5dbef94\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>If we take a bird\u2019s-eye view of the structure so far, we have something like:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-54de3fc elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"54de3fc\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-ee1f8a5\" data-id=\"ee1f8a5\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-ee080ac elementor-widget elementor-widget-image\" data-id=\"ee080ac\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"171\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_15WHSTH8YTe_u4ChtIA9Jg-1024x171.png\" class=\"attachment-large size-large wp-image-33572\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_15WHSTH8YTe_u4ChtIA9Jg-1024x171.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_15WHSTH8YTe_u4ChtIA9Jg-300x50.png 300w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_15WHSTH8YTe_u4ChtIA9Jg-768x128.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_15WHSTH8YTe_u4ChtIA9Jg-1536x256.png 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_15WHSTH8YTe_u4ChtIA9Jg-2048x341.png 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_15WHSTH8YTe_u4ChtIA9Jg-610x102.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_15WHSTH8YTe_u4ChtIA9Jg-750x125.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_15WHSTH8YTe_u4ChtIA9Jg-1140x190.png 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-258a77b elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"258a77b\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-99e60c8\" data-id=\"99e60c8\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5eadc8f elementor-widget elementor-widget-text-editor\" data-id=\"5eadc8f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>We can give an English sentence and shifted output sentence and do a forward pass and get the probabilities over the German vocabulary. 
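In code, one such teacher-forced training step might look roughly like this (a toy sketch: the linear layer stands in for the full encoder-decoder, and all shapes and sizes are made-up values):

```python
import torch
import torch.nn as nn

vocab_size, T, D = 100, 6, 16        # toy German vocab, target length, model dim

model = nn.Linear(D, vocab_size)     # stand-in for the whole transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

decoder_out = torch.randn(T, D)                   # pretend decoder output
target_ids = torch.randint(0, vocab_size, (T,))   # the German words we want

logits = model(decoder_out)          # (T, vocab_size)
loss = loss_fn(logits, target_ids)   # cross-entropy vs. the target words
loss.backward()
optimizer.step()
optimizer.zero_grad()
```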
And thus we should be able to use a loss function like cross-entropy, where the target is the German word we want, and train the neural network using the Adam optimizer, just like any classification example. So, there is your German.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-666f298 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"666f298\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-b3f191f\" data-id=\"b3f191f\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-0960b16 elementor-widget elementor-widget-text-editor\" data-id=\"0960b16\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>In the paper though, the authors use slight variations of the optimizer and the loss. 
You can choose to skip the next two sections, on the KL divergence loss and the learning-rate schedule with Adam, if you want, as they are there only to squeeze more performance out of the model and are not an inherent part of the Transformer architecture as such.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-96ddcea elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"96ddcea\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-08a2e41\" data-id=\"08a2e41\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-ae6defd elementor-widget elementor-widget-heading\" data-id=\"ae6defd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><em>Q: I have been here for such a long time and have I complained?&nbsp;<\/em><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-fafa8c2 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"fafa8c2\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 
elementor-top-column elementor-element elementor-element-ee270bc\" data-id=\"ee270bc\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-4a3ca15 elementor-widget elementor-widget-text-editor\" data-id=\"4a3ca15\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Okay. Okay. I get you. Let\u2019s do it then.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-b6bcc55 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"b6bcc55\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-254fa42\" data-id=\"254fa42\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-89f7486 elementor-widget elementor-widget-heading\" data-id=\"89f7486\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">A) KL Divergence with Label Smoothing:<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-a723962 elementor-section-boxed 
elementor-section-height-default elementor-section-height-default\" data-id=\"a723962\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-198bb40\" data-id=\"198bb40\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-49b8697 elementor-widget elementor-widget-text-editor\" data-id=\"49b8697\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>KL divergence is the information loss incurred when a distribution P is approximated by a distribution Q. When we use the KL divergence loss, we try to estimate the target distribution (P) using the probabilities (Q) we generate from the model. 
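A quick numeric check (with toy numbers of my own, not from the paper): when the target P is one-hot, the KL divergence reduces to minus the log of the probability Q assigns to the correct word, which is exactly cross-entropy:

```python
import math

p = [0.0, 1.0, 0.0, 0.0]   # one-hot target distribution P (correct word at index 1)
q = [0.1, 0.6, 0.2, 0.1]   # model probabilities Q

# KL(P || Q) = sum_i p_i * log(p_i / q_i); terms with p_i = 0 contribute nothing
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
cross_entropy = -math.log(q[1])          # -log q(correct word)

print(abs(kl - cross_entropy) < 1e-9)    # True: identical for a one-hot target
```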
And we try to minimize this information loss in the training.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-810f96c elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"810f96c\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-fe13808\" data-id=\"fe13808\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-01cc96c elementor-widget elementor-widget-image\" data-id=\"01cc96c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"378\" height=\"81\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_AACNCkYiRmf3xmZLQsYhLw.png\" class=\"attachment-large size-large wp-image-33573\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_AACNCkYiRmf3xmZLQsYhLw.png 378w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_AACNCkYiRmf3xmZLQsYhLw-300x64.png 300w\" sizes=\"(max-width: 378px) 100vw, 378px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-ec498e0 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"ec498e0\" 
data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-ba7d739\" data-id=\"ba7d739\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-9def806 elementor-widget elementor-widget-text-editor\" data-id=\"9def806\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>If you notice, in this form (without label smoothing, which we will discuss shortly) this is exactly the same as cross-entropy. Take, for example, the two distributions below.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-c06ee1f elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"c06ee1f\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-a5c1584\" data-id=\"a5c1584\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b8dd781 elementor-widget elementor-widget-image\" data-id=\"b8dd781\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" 
width=\"630\" height=\"405\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_PvO3UDTV0oaVnz05Pk6sQg.png\" class=\"attachment-large size-large wp-image-33574\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_PvO3UDTV0oaVnz05Pk6sQg.png 630w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_PvO3UDTV0oaVnz05Pk6sQg-300x193.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_PvO3UDTV0oaVnz05Pk6sQg-610x392.png 610w\" sizes=\"(max-width: 630px) 100vw, 630px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-9ceb6e8 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"9ceb6e8\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-a7a58e6\" data-id=\"a7a58e6\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-7190f46 elementor-widget elementor-widget-image\" data-id=\"7190f46\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"630\" height=\"405\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_bmnluXCYHHgsXSng6oO6kw.png\" class=\"attachment-large size-large wp-image-33575\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_bmnluXCYHHgsXSng6oO6kw.png 630w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_bmnluXCYHHgsXSng6oO6kw-300x193.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_bmnluXCYHHgsXSng6oO6kw-610x392.png 610w\" sizes=\"(max-width: 630px) 100vw, 630px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-9a4ed55 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"9a4ed55\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-67fe764\" data-id=\"67fe764\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-3ceb965 elementor-widget elementor-widget-text-editor\" data-id=\"3ceb965\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>For these two distributions, the KL divergence formula simply gives <code>-log q(oder)<\/code>, which is exactly the cross-entropy loss.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-b0cf853 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"b0cf853\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column 
elementor-col-100 elementor-top-column elementor-element elementor-element-ad2ded9\" data-id=\"ad2ded9\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-62d126c elementor-widget elementor-widget-text-editor\" data-id=\"62d126c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>In the paper, though, the authors used label smoothing with \u03b1 = 0.1, so the KL divergence loss is not plain cross-entropy. What that means is that in the target distribution, the probability of the correct word is replaced by (1-\u03b1), and the remaining \u03b1 = 0.1 is distributed across all the words. The authors say this is so that the model does not become too confident.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-e0a5f5d elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"e0a5f5d\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-c15f099\" data-id=\"c15f099\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b22ce62 elementor-widget elementor-widget-image\" data-id=\"b22ce62\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" 
width=\"630\" height=\"405\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_xSy1vi7zUb9NCaHeWkkF1w.png\" class=\"attachment-large size-large wp-image-33576\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_xSy1vi7zUb9NCaHeWkkF1w.png 630w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_xSy1vi7zUb9NCaHeWkkF1w-300x193.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_xSy1vi7zUb9NCaHeWkkF1w-610x392.png 610w\" sizes=\"(max-width: 630px) 100vw, 630px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-230dcb3 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"230dcb3\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-4d843bd\" data-id=\"4d843bd\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-1874039 elementor-widget elementor-widget-image\" data-id=\"1874039\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"630\" height=\"405\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_bmnluXCYHHgsXSng6oO6kw-1.png\" class=\"attachment-large size-large wp-image-33577\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_bmnluXCYHHgsXSng6oO6kw-1.png 630w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_bmnluXCYHHgsXSng6oO6kw-1-300x193.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_bmnluXCYHHgsXSng6oO6kw-1-610x392.png 610w\" sizes=\"(max-width: 630px) 100vw, 630px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-f6129ca elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"f6129ca\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-f07804a\" data-id=\"f07804a\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-83d4440 elementor-widget elementor-widget-heading\" data-id=\"83d4440\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><em>Q: But, why do we make our models not confident? 
It seems absurd.<\/em><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-777e432 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"777e432\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-102aff6\" data-id=\"102aff6\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-26e79c2 elementor-widget elementor-widget-text-editor\" data-id=\"26e79c2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Yes, it does seem absurd, but think of it this way: when we give the target as 1 to our loss function, we are asserting with absolute certainty that the true label is correct and every other word is wrong. But vocabulary is inherently a non-standardized target. For example, who is to say that you cannot use good in place of great? 
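<\/p>
<p>Just to make this concrete, here is a tiny sketch of how such a smoothed target distribution can be built (my own illustrative code, not the paper's; the function name and arguments are made up for this example):<\/p>

```python
def smoothed_targets(target_idx, vocab_size, alpha=0.1):
    # Label smoothing: the true word gets probability 1 - alpha, and the
    # remaining alpha is spread uniformly over the other vocab_size - 1 words.
    off_value = alpha / (vocab_size - 1)
    dist = [off_value] * vocab_size
    dist[target_idx] = 1.0 - alpha
    return dist

# Tiny vocabulary of 5 words; the true word is at index 2.
probs = smoothed_targets(target_idx=2, vocab_size=5, alpha=0.1)
# probs -> [0.025, 0.025, 0.9, 0.025, 0.025], which still sums to 1.
```

<p>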
So we add a little uncertainty to our labels so that our model does not become too rigid.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-13d8814 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"13d8814\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-4c2cebd\" data-id=\"4c2cebd\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-adf9e62 elementor-widget elementor-widget-heading\" data-id=\"adf9e62\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">B) A particular Learning Rate schedule with Adam<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-c19d9b2 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"c19d9b2\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-f1c90be\" data-id=\"f1c90be\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap 
">
elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-581f3cb elementor-widget elementor-widget-text-editor\" data-id=\"581f3cb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>The authors use a learning rate scheduler that increases the learning rate linearly for the first warmup steps and then decreases it using the function below. They used the Adam optimizer with \u03b2\u2081 = 0.9 and \u03b2\u2082 = 0.98. Nothing too interesting here, just some training choices.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-6dfc5f2 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"6dfc5f2\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-80da4f8\" data-id=\"80da4f8\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-13bd354 elementor-widget elementor-widget-image\" data-id=\"13bd354\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"582\" height=\"65\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_nWQcmJkndGnl1gdwmJ2C1g.png\" class=\"attachment-large size-large wp-image-33578\" alt=\"\" 
srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_nWQcmJkndGnl1gdwmJ2C1g.png 582w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_nWQcmJkndGnl1gdwmJ2C1g-300x34.png 300w\" sizes=\"(max-width: 582px) 100vw, 582px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-dbc3227 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"dbc3227\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-3fe1cda\" data-id=\"3fe1cda\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-f5231b4 elementor-widget elementor-widget-heading\" data-id=\"f5231b4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><em>Q: But wait I just remembered that we won\u2019t have the shifted output at the prediction time, would we? 
How do we do predictions then?<\/em><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-a459663 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"a459663\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-16f3e20\" data-id=\"16f3e20\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-a87981c elementor-widget elementor-widget-text-editor\" data-id=\"a87981c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>If you think about it, what we have at this point is a generative model, and we will have to do predictions in a generative way since we won\u2019t know the target output vector at prediction time. 
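<\/p>
<p>In code, this sequential prediction loop looks roughly like the sketch below (a toy stand-in for the real model; all names and token ids here are illustrative, not from the paper):<\/p>

```python
BOS, EOS = 1, 2  # illustrative start and end token ids

def greedy_decode(step_fn, max_len=20):
    # The decoder input starts as just the start token; each forward pass
    # appends the most likely next token until EOS (or max_len) is reached.
    output = [BOS]
    for _ in range(max_len):
        nxt = step_fn(output)  # stands in for a full encoder-decoder forward pass
        output.append(nxt)
        if nxt == EOS:
            break
    return output

# Toy "model" that deterministically emits token ids 7, 8 and then EOS,
# the way the real model would emit "der", "schnelle" and the end token.
script = [7, 8, EOS]
result = greedy_decode(lambda seq: script[len(seq) - 1])
# result -> [1, 7, 8, 2]
```

<p>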
So predictions are still sequential.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-5638de4 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"5638de4\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-8c6174a\" data-id=\"8c6174a\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-635cb14 elementor-widget elementor-widget-heading\" data-id=\"635cb14\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Prediction Time<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-eaf199b elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"eaf199b\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-bf4d224\" data-id=\"bf4d224\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element 
elementor-element-eb45181 elementor-widget elementor-widget-image\" data-id=\"eb45181\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"397\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_sdjUHzfL9arBGjMModT9vw-1024x397.gif\" class=\"attachment-large size-large wp-image-33579\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_sdjUHzfL9arBGjMModT9vw-1024x397.gif 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_sdjUHzfL9arBGjMModT9vw-300x116.gif 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_sdjUHzfL9arBGjMModT9vw-768x298.gif 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_sdjUHzfL9arBGjMModT9vw-610x237.gif 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_sdjUHzfL9arBGjMModT9vw-750x291.gif 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/1_sdjUHzfL9arBGjMModT9vw-1140x442.gif 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-1b9429c elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"1b9429c\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-63d98f6\" data-id=\"63d98f6\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap 
">
elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-16b1a67 elementor-widget elementor-widget-text-editor\" data-id=\"16b1a67\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>This model does piece-wise predictions. In the original paper, they use Beam Search for prediction, but a greedy search works fine as well for the purpose of explanation. In the example above, I have shown exactly how a greedy search would work. The greedy search would start with:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul>\n<li>Passing the whole English sentence as encoder input and just the start token\u00a0<code>&lt;s&gt;<\/code>\u00a0as shifted output (input to the decoder) to the model and doing the forward pass.<\/li>\n<li>The model will predict the next word \u2014\u00a0<code>der<\/code><\/li>\n<li>Then, we pass the whole English sentence as encoder input and add the last predicted word to the shifted output (input to the decoder =\u00a0<code>&lt;s&gt; der<\/code>) and do the forward pass.<\/li>\n<li>The model will predict the next word \u2014\u00a0<code>schnelle<\/code><\/li>\n<li>Passing the whole English sentence as encoder input and\u00a0<code>&lt;s&gt; der schnelle<\/code>\u00a0as shifted output (input to the decoder) to the model and doing the forward pass.<\/li>\n<li>and so on, until the model predicts the end token\u00a0<code>&lt;\/s&gt;<\/code>\u00a0or we generate some maximum number of tokens (a limit we can define) so that the translation does not run for an infinite duration if the end token is never predicted.<\/li>\n<\/ul>\n<!-- \/wp:list -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-2134152 elementor-section-boxed elementor-section-height-default 
elementor-section-height-default\" data-id=\"2134152\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-b92a641\" data-id=\"b92a641\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-8b50b30 elementor-widget elementor-widget-heading\" data-id=\"8b50b30\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Beam Search:<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-0a32628 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"0a32628\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-1b75db7\" data-id=\"1b75db7\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-6d2f3e6 elementor-widget elementor-widget-heading\" data-id=\"6d2f3e6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><em>Q: Now I am greedy, Tell me about beam search as 
well.<\/em><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-012f022 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"012f022\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-9fef30c\" data-id=\"9fef30c\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-13b67ee elementor-widget elementor-widget-text-editor\" data-id=\"13b67ee\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Okay, the beam search idea is inherently very similar to the greedy idea above. In beam search, we don\u2019t just look at the single highest-probability word generated but at the top two words.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d506676 elementor-widget elementor-widget-text-editor\" data-id=\"d506676\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tSo, for example, when we give the whole English sentence as encoder input and just the start token as shifted output, we get the two best words\u00a0<code>i<\/code>\u00a0(p=0.6) and\u00a0<code>der<\/code>\u00a0(p=0.3). We will now run the model forward for both output sequences,\u00a0<code>&lt;s&gt; i<\/code>\u00a0and\u00a0<code>&lt;s&gt; der<\/code>, and look at the probability of the next top word generated. 
For example, if\u00a0<code>&lt;s&gt; i<\/code>\u00a0gave a probability of (p=0.05) for its next word and\u00a0<code>&lt;s&gt; der<\/code>\u00a0gave (p=0.5) for its next predicted word, we discard the sequence\u00a0<code>&lt;s&gt; i<\/code>\u00a0and go with\u00a0<code>&lt;s&gt; der<\/code>\u00a0instead, as the probability of the whole sequence is maximized (<code>&lt;s&gt; der next_word_to_der<\/code>\u00a0p = 0.3 \u00d7 0.5 = 0.15 compared to\u00a0<code>&lt;s&gt; i next_word_to_i<\/code>\u00a0p = 0.6 \u00d7 0.05 = 0.03). We then repeat this process to get the sentence with the highest probability.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-f756438 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"f756438\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-35d6ccc\" data-id=\"35d6ccc\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-08e0c88 elementor-widget elementor-widget-text-editor\" data-id=\"08e0c88\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Since we used the top 2 words, the beam size is 2 for this Beam Search. 
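<\/p>
<p>Here is a minimal beam search sketch (my own illustrative code, not the paper's implementation; the scripted probabilities mirror the toy example, and summing log-probabilities is equivalent to multiplying probabilities):<\/p>

```python
import math

def beam_search(step_probs, bos, eos, beam_size=2, max_len=10):
    # Each beam is a (sequence, cumulative log-probability) pair. At every
    # step we expand all live beams and keep only the beam_size best.
    beams = [([bos], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:          # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            for tok, p in step_probs(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]

# Scripted next-word distributions: "i" looks best after one step (0.6 vs
# 0.3), but "der" wins once two steps are scored (0.3 * 0.5 > 0.6 * 0.05).
table = {
    ("BOS",): {"i": 0.6, "der": 0.3},
    ("BOS", "i"): {"am": 0.05, "EOS": 0.01},
    ("BOS", "der"): {"schnelle": 0.5, "EOS": 0.01},
    ("BOS", "i", "am"): {"EOS": 0.9},
    ("BOS", "der", "schnelle"): {"EOS": 0.9},
}
best = beam_search(lambda seq: table[tuple(seq)], "BOS", "EOS")
# best -> ["BOS", "der", "schnelle", "EOS"]
```

<p>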
In the paper, they used a beam size of 4.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-7ed2904 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"7ed2904\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-64508ca\" data-id=\"64508ca\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-dc3ff91 elementor-widget elementor-widget-text-editor\" data-id=\"dc3ff91\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><strong>PS<\/strong>: I showed the English sentence being passed at every step for simplicity, but in practice, the output of the encoder is saved and only the shifted output passes through the decoder at each time step.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-adf8c7e elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"adf8c7e\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-379ec9f\" data-id=\"379ec9f\" 
data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-d02a220 elementor-widget elementor-widget-heading\" data-id=\"d02a220\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><em>Q: Anything else you forgot to tell me? I will let you have your moment.<\/em><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-861b434 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"861b434\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-7cc7f60\" data-id=\"7cc7f60\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-82ca12b elementor-widget elementor-widget-text-editor\" data-id=\"82ca12b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Yes. Since you asked. 
Here it is:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-04090b7 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"04090b7\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-6f57f91\" data-id=\"6f57f91\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-c917e7b elementor-widget elementor-widget-heading\" data-id=\"c917e7b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">BPE, Weight Sharing and Checkpointing<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-2a44697 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"2a44697\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-06ec684\" data-id=\"06ec684\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element 
elementor-element-6856e66 elementor-widget elementor-widget-text-editor\" data-id=\"6856e66\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>In the paper, the authors used Byte Pair Encoding to create a common English-German vocabulary. They then shared the weights between the English and German embeddings and the pre-softmax linear transformation, which works because these weight matrices all have the same shape (Vocab Length X D).<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b664c4b elementor-widget elementor-widget-text-editor\" data-id=\"b664c4b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Also, the authors average the last k checkpoints to create an ensembling effect. This is a pretty well-known technique where we average the weights from the last few epochs of training to create a new model which is, in effect, an ensemble.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-78c763a elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"78c763a\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-9f1a233\" data-id=\"9f1a233\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-1b445d0 
elementor-widget elementor-widget-heading\" data-id=\"1b445d0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><em>Q: Can you show me some code?<\/em><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-77a8147 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"77a8147\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-31b443d\" data-id=\"31b443d\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-ea70820 elementor-widget elementor-widget-text-editor\" data-id=\"ea70820\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>This post has already been so long, so I will do that in the <a href=\"https:\/\/mlwhiz.com\/blog\/2020\/10\/10\/create-transformer-from-scratch\/\" target=\"_blank\" rel=\"noreferrer noopener\">next post<\/a>. 
Stay tuned.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-ab180cc elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"ab180cc\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-6027bc2\" data-id=\"6027bc2\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-db37480 elementor-widget elementor-widget-heading\" data-id=\"db37480\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">References<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-5898021 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"5898021\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-7d245bc\" data-id=\"7d245bc\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-a0198e6 elementor-widget 
elementor-widget-text-editor\" data-id=\"a0198e6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:list -->\n<ul>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1706.03762\" target=\"_blank\" rel=\"noreferrer noopener\">Attention Is All You Need<\/a>: The Paper which started it all.<\/li>\n<li><a href=\"https:\/\/nlp.seas.harvard.edu\/2018\/04\/03\/attention.html\" target=\"_blank\" rel=\"noreferrer noopener\">The Annotated Transformer<\/a>: This one has all the code. Although I will write a simple transformer in the next post too.<\/li>\n<li><a href=\"http:\/\/jalammar.github.io\/illustrated-transformer\/\" target=\"_blank\" rel=\"noreferrer noopener\">The Illustrated Transformer<\/a>: This is one of the best posts on transformers.<\/li>\n<\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:paragraph -->\n<p id=\"f2c7\">In this post, I covered how the Transformer architecture works from a detail-oriented, intuitive perspective.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Transformers have become the defacto standard for any NLP tasks nowadays. Not only that, but they are now also being used in Computer Vision and to generate music. 
They remain as hard to understand as ever.<\/p>\n","protected":false},"author":653,"featured_media":10526,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[487,94,474,819],"ppma_author":[3409],"class_list":["post-10523","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-computer-vision","tag-data-science","tag-nlp","tag-transformers"],"authors":[{"term_id":3409,"user_id":653,"is_guest":0,"slug":"rahul-agarwal","display_name":"Rahul Agarwal","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/04\/medium_cc5785b8-8195-44e6-a0de-2e33be05d7cb-150x150.png","user_url":"http:\/\/bit.ly\/384SBYb","last_name":"Agarwal","first_name":"Rahul","job_title":"","description":"Rahul Agarwal is a Data Scientist at Walmart Labs."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/10523","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/653"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=10523"}],"version-history":[{"count":5,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/10523\/revisions"}],"predecessor-version":[{"id":33582,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/10523\/revisions\/33582"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/10526"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=10523"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?pos
t=10523"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=10523"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=10523"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}