{"id":26499,"date":"2021-09-21T17:54:07","date_gmt":"2021-09-21T17:54:07","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/?p=26499"},"modified":"2023-08-16T11:15:33","modified_gmt":"2023-08-16T11:15:33","slug":"review-of-attention-2","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/review-of-attention-2\/","title":{"rendered":"Review of Attention (Vision Models) &#8211; Part 2"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"26499\" class=\"elementor elementor-26499\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-3aae2a0 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"3aae2a0\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-46204a2\" data-id=\"46204a2\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-c92302a elementor-widget elementor-widget-text-editor\" data-id=\"c92302a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><strong>\u00a0Attention in Vision Models:<\/strong><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-af181d5 elementor-widget elementor-widget-text-editor\" data-id=\"af181d5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>In the previous article, we briefly discussed the concept of attention and its application in the language domain. In this article, we will discuss some of the works that have applied attention in visual models tasks.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3c3eb1a elementor-widget elementor-widget-text-editor\" data-id=\"3c3eb1a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Although vision models have traditionally used <a href=\"http:\/\/www.experfy.com\/blog\/ai-ml\/what-are-convolutional-neural-networks-cnn\/\" target=\"_blank\" rel=\"noreferrer noopener\">Convolution Networks<\/a> (ConvNet) as the standard for image encoders, there are several works that have tried using attention mechanisms at various stages in a vision model. Most of these works apply attention along with traditional convolution mechanisms to improve the performance of a model in a particular task. Some works have gone beyond that and tried to replace convolution with purely attention-based techniques. 
One common technique in vision is to apply self-attention to the image features generated by a traditional ConvNet and to use the output of the attention block in the downstream task.

Show, Attend, and Tell:

Xu et al. 2015 used attention for the task of image captioning [1]. The model used a ConvNet to extract image features, which were fed to an attention-based RNN that generated the text descriptions.

The paper introduced two methods of performing attention: soft and hard. The difference lies primarily in how the model operates on patches of the input image. In hard attention, the model operates on a single patch at a time. In soft attention, the model smoothly attends to all the patches at once, which makes the operation differentiable. Although hard attention is non-differentiable, the paper adopted complex techniques, such as variance reduction and reinforcement learning, to learn its parameters.
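To make the soft variant concrete, here is a minimal PyTorch sketch of soft attention over ConvNet patch features. It is a sketch under assumptions: the additive (MLP) scoring function, the layer names, and all dimensions are illustrative choices, not taken from the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Soft attention over a grid of ConvNet patch features: at each decoding
    step, score every patch against the decoder state and take a
    softmax-weighted sum. The whole operation is differentiable."""
    def __init__(self, feature_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feature_dim, attn_dim)   # project patch features
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)  # project RNN hidden state
        self.score = nn.Linear(attn_dim, 1)                 # scalar score per patch

    def forward(self, features, hidden):
        # features: (batch, num_patches, feature_dim); hidden: (batch, hidden_dim)
        scores = self.score(torch.tanh(
            self.feat_proj(features) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                                      # (batch, num_patches)
        alpha = F.softmax(scores, dim=1)                    # weights over all patches
        context = (alpha.unsqueeze(-1) * features).sum(1)   # weighted sum of features
        return context, alpha
```

Because soft attention produces an explicit distribution over patches at every step, the same weights that drive the weighted sum can be rendered as a heat map over the image, which is what enables the visualizations described next.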
Another useful aspect of applying attention is the ability to visualize the attention weights. They can be used to show the regions the model focused on while generating each word of the caption.

Fig. 1. A giraffe standing in a field with trees. [1]

Self-Attention Generative Adversarial Networks (SAGAN) [2]:

GANs have traditionally proved to be great tools for image synthesis. Convolutional GANs, however, limit the receptive field to spatially local points in low-resolution feature maps. Zhang et al. 2019 addressed this limitation by introducing self-attention, which lets the model attend to all feature locations and use them for image synthesis [2].

Three transformations are applied to the image features from the hidden layer; these can be viewed as the Query, Key, and Value matrices. Applying softmax to the dot product of the Query and Key matrices produces an attention map, which indicates the extent to which each position in the Query matrix should attend to each position in the Key matrix. The attention map is multiplied with the Value matrix, and a 1 x 1 convolution is applied to the result to obtain the self-attention feature maps.

Fig 2. Self-Attention module in SAGAN [2]

The self-attention feature map o_i is multiplied by a learnable scalar γ and added to the original feature map x_i from the hidden layer to generate the final output y_i:

y_i = γ o_i + x_i
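Putting these pieces together, here is a minimal PyTorch sketch of such a self-attention block. Using 1 x 1 convolutions for the Query/Key/Value transformations and reducing the Query/Key channels to C // 8 follow common SAGAN implementations; treat them as assumptions rather than details fixed by this article.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """SAGAN-style self-attention over a (B, C, H, W) feature map."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.out = nn.Conv2d(channels, channels, kernel_size=1)  # final 1x1 conv
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable scalar, starts at 0

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (b, h*w, c//8)
        k = self.key(x).flatten(2)                    # (b, c//8, h*w)
        attn = F.softmax(q @ k, dim=-1)               # (b, h*w, h*w) attention map
        v = self.value(x).flatten(2).transpose(1, 2)  # (b, h*w, c)
        o = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        o = self.out(o)                               # self-attention feature map o_i
        return self.gamma * o + x                     # y_i = gamma * o_i + x_i
```

Initializing γ at zero means the block starts out as an identity on x_i, so the network can first rely on local cues and gradually learn how much non-local evidence to mix in.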
SAGAN efficiently captures long-range dependencies and performs especially well in class-conditional image generation. At the time of its publication, it significantly improved the state-of-the-art Inception score on that task.

Self-Attention as an alternative to convolution:

Inspired by the success of the transformer architecture in the text domain, several papers have explored attention as an alternative to convolution in vision models [3] [4] [5]. Since attention has proven to be very effective at learning from sequential data, most of these models treat an image as a sequence of pixels and operate on that sequence.

There are a few advantages to using self-attention instead of convolution. ConvNets operate on pixels in a small neighbourhood (the kernel size) and efficiently learn local correlation structures, but they are limited by spatial proximity to the pixel. Attention, in contrast, is also effective at learning long-range dependencies between distant positions in an image [3].
The downside is that the computational complexity of a self-attention layer increases drastically with the size of the image. Parmar et al. 2018 calculated the time complexity of a self-attention layer operating on l_m positions as O(h · w · l_m · d). It is computationally feasible for the model to attend to the 192 positions of an 8 x 8 image (64 pixels x 3 channels), but this doesn't scale to the 3,072 positions of a 32 x 32 image.

Additionally, vision models need the right inductive biases to learn image features efficiently:

1. Translation equivariance:

The model should be resilient to minor perturbations in the pixel distribution. If an object is shifted by a few pixels, the model shouldn't interpret the result as a completely new image; it should recognize the similar global context present in the two images.
2. The relative position of the pixels:

The context of an image depends on the relative positions of its pixels. The model should be able to encode these relative positions and use that information to generate the output signal.

We'll explore a few works that deal with these challenges.

Image Transformer:

Parmar et al. 2018 developed a purely attention-based transformer architecture for image generation. Images were formulated as sequences of pixels, and the model was trained on a sequence-completion objective, i.e., to generate the next pixel in the image conditioned on the previously generated pixels [3].

The self-attention layer computes a d-dimensional representation for each position, i.e., for each channel of each pixel. The representation for a given position is calculated as a weighted sum of contributions from previous positions, with the weights given by an attention distribution over those positions. Instead of attending to all previous inputs, the self-attention layer attends only to a fixed number of positions in a local neighbourhood, much like a ConvNet. This addresses the computational challenge of applying attention to images.
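Here is a minimal sketch of that core computation for a single attention head, without batching. The Image Transformer adds multi-head attention, positional information, and feed-forward layers on top of this; the window size and projection matrices below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def local_causal_attention(x, wq, wk, wv, window):
    """x: (seq_len, d) flattened pixel representations; wq, wk, wv: (d, d)
    projections. Each position attends only to itself and up to window - 1
    preceding positions, a fixed local neighbourhood rather than the full
    sequence, and never to positions that come after it."""
    q, k, v = x @ wq, x @ wk, x @ wv
    d = q.shape[-1]
    out = torch.empty_like(x)
    for i in range(x.shape[0]):
        lo = max(0, i - window + 1)               # start of the local neighbourhood
        scores = q[i] @ k[lo:i + 1].T / d ** 0.5  # scaled dot-product scores
        weights = F.softmax(scores, dim=-1)       # attention distribution
        out[i] = weights @ v[lo:i + 1]            # weighted sum of value vectors
    return out
```

The position-by-position loop mirrors how generation is conditioned; in practice the computation is batched, which is where the query and memory blocks described next come in.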
Another advantage of attention is parallel processing: the query need not be computed separately for each pixel. Instead, the image is split into a fixed set of contiguous blocks called memory blocks; all the queries in a query block attend to the same memory matrix.

Parmar et al. proposed two schemes for laying out the query/memory blocks: 1-D local attention, where the model attends within non-overlapping query blocks of fixed length, and 2-D local attention, which uses non-overlapping rectangular query blocks.

Fig. 3. Local 1-D vs 2-D attention [3].

In both types of attention, each position in the query block attends to all the pixels in the memory block. In figure 3, the pixel marked q is the pixel generated last at that time step. The positions marked in white within the query/memory blocks use masked attention and don't contribute to the next representation of the positions in the query block.
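As an illustration of the 1-D scheme, here is a hypothetical helper that partitions a flattened pixel sequence into fixed-length query blocks, each paired with the memory span it attends over (the block itself plus a fixed number of positions to its left). The block and memory sizes are illustrative, and the masking within a block is left out.

```python
def one_d_blocks(seq_len, query_block, memory_before):
    """Yield (query_range, memory_range) index pairs for 1-D local attention.
    Every query in a block attends over the same memory span; masked attention
    (not shown) then keeps a position from seeing pixels generated after it."""
    for start in range(0, seq_len, query_block):
        q_range = (start, min(start + query_block, seq_len))
        m_range = (max(0, start - memory_before), q_range[1])
        yield q_range, m_range

# Example: the 64 positions of a flattened 8 x 8 image (ignoring channels),
# with query blocks of 16 positions and 32 extra positions of leftward memory.
for q, m in one_d_blocks(64, 16, 32):
    print(f"queries {q} attend over memory {m}")
```

Because every query in a block shares one memory matrix, the attention for a whole block can be computed as a single batched matrix product instead of one weighted sum per pixel.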
In this article, we reviewed some of the approaches to applying attention in vision models. We will continue this discussion in the next article and review a few additional approaches in this domain.