{"id":2320,"date":"2020-03-17T03:38:36","date_gmt":"2020-03-17T00:38:36","guid":{"rendered":"http:\/\/kusuaks7\/?p=1925"},"modified":"2023-12-27T18:27:42","modified_gmt":"2023-12-27T18:27:42","slug":"transformers-are-graph-neural-networks","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/transformers-are-graph-neural-networks\/","title":{"rendered":"Transformers are Graph Neural Networks"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"2320\" class=\"elementor elementor-2320\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-fe4a7c0 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"fe4a7c0\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-5f8f0034\" data-id=\"5f8f0034\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5ae28dea elementor-widget elementor-widget-text-editor\" data-id=\"5ae28dea\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tEngineer friends often ask me: Graph Deep Learning sounds great, but are there any big commercial success stories? Is it being deployed in practical applications?\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a062e20 elementor-widget elementor-widget-text-editor\" data-id=\"a062e20\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tBesides the obvious ones\u2013recommendation systems at\u00a0<a href=\"https:\/\/medium.com\/pinterest-engineering\/pinsage-a-new-graph-convolutional-neural-network-for-web-scale-recommender-systems-88795a107f48\" class=\"broken_link\" rel=\"noopener\">Pinterest<\/a>,\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1902.08730\" rel=\"noopener\">Alibaba<\/a>\u00a0and\u00a0<a href=\"https:\/\/blog.twitter.com\/en_us\/topics\/company\/2019\/Twitter-acquires-Fabula-AI.html\" class=\"broken_link\" rel=\"noopener\">Twitter<\/a>\u2013a slightly nuanced success story is the\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1706.03762\" rel=\"noopener\"><strong>Transformer architecture<\/strong><\/a>, which has\u00a0<a href=\"https:\/\/openai.com\/blog\/better-language-models\/\" class=\"broken_link\" rel=\"noopener\">taken<\/a>\u00a0<a href=\"https:\/\/www.blog.google\/products\/search\/search-language-understanding-bert\/\" rel=\"noopener\">the<\/a>\u00a0<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/large-scale-pretraining-for-response-generation\/\" rel=\"noopener\">NLP<\/a>\u00a0<a href=\"https:\/\/ai.facebook.com\/blog\/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems\/\" class=\"broken_link\" rel=\"noopener\">industry<\/a>\u00a0<a href=\"https:\/\/blog.einstein.ai\/introducing-a-conditional-transformer-language-model-for-controllable-generation\/\" rel=\"noopener\">by<\/a>\u00a0<a href=\"https:\/\/nv-adlr.github.io\/MegatronLM\" 
rel=\"noopener\">storm<\/a>.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-980ac3b elementor-widget elementor-widget-text-editor\" data-id=\"980ac3b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThrough this post, I want to establish links between\u00a0<a href=\"https:\/\/graphdeeplearning.github.io\/project\/spatial-convnets\/\" rel=\"noopener\">Graph Neural Networks (GNNs)<\/a>\u00a0and Transformers. I\u2019ll talk about the intuitions behind model architectures in the NLP and GNN communities, make connections using equations and figures, and discuss how we could work together to drive progress.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-536aa4d elementor-widget elementor-widget-text-editor\" data-id=\"536aa4d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tLet\u2019s start by talking about the purpose of model architectures\u2013<em>representation learning<\/em>.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-24139a5 elementor-widget elementor-widget-heading\" data-id=\"24139a5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"representation-learning-for-nlp\">Representation Learning for NLP<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-59b0132 elementor-widget elementor-widget-text-editor\" data-id=\"59b0132\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAt a high level, all neural network architectures build\u00a0<em>representations<\/em>\u00a0of input data as vectors\/embeddings, which encode useful statistical and semantic information about the data. These\u00a0<em>latent<\/em>\u00a0or\u00a0<em>hidden<\/em>\u00a0representations can then be used for performing something useful, such as classifying an image or translating a sentence. The neural network\u00a0<em>learns<\/em>\u00a0to build better-and-better representations by receiving feedback, usually via error\/loss functions.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8e0e55e elementor-widget elementor-widget-text-editor\" data-id=\"8e0e55e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tFor Natural Language Processing (NLP), conventionally,\u00a0<strong>Recurrent Neural Networks<\/strong>\u00a0(RNNs) build representations of each word in a sentence in a sequential manner,\u00a0<em>i.e.<\/em>,\u00a0<strong>one word at a time<\/strong>. Intuitively, we can imagine an RNN layer as a conveyor belt, with the words being processed on it\u00a0<em>autoregressively<\/em>\u00a0from left to right. 
> I highly recommend Chris Olah's legendary blog for recaps on [RNNs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) and [representation learning](http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/) for NLP.

![](https://graphdeeplearning.github.io/post/transformers-are-gnns/rnn-transf-nlp.jpg)

Initially introduced for machine translation, **Transformers** have gradually replaced RNNs in mainstream NLP. The architecture takes a fresh approach to representation learning: doing away with recurrence entirely, Transformers build features of each word using an [attention](https://distill.pub/2016/augmented-rnns/) [mechanism](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html) to figure out how important **all the other words** in the sentence are w.r.t. the aforementioned word. Knowing this, the word's updated features are simply the sum of linear transformations of the features of all the words, weighted by their importance.

> Back in 2017, this idea sounded very radical, because the NLP community was so used to the sequential–one-word-at-a-time–style of processing text with RNNs.
> The title of the paper probably added fuel to the fire! For a recap, Yannic Kilcher made an excellent [video overview](https://www.youtube.com/watch?v=iDulhoQ2pro).

### Breaking down the Transformer

Let's develop intuitions about the architecture by translating the previous paragraph into the language of mathematical symbols and vectors. We update the hidden feature $h$ of the $i$'th word in a sentence $S$ from layer $\ell$ to layer $\ell+1$ as follows:

$$h_i^{\ell+1} = \text{Attention}\left(Q^{\ell} h_i^{\ell},\ K^{\ell} h_j^{\ell},\ V^{\ell} h_j^{\ell}\right),$$

$$\text{i.e.,}\quad h_i^{\ell+1} = \sum_{j \in S} w_{ij} \left(V^{\ell} h_j^{\ell}\right),$$

$$\text{where}\quad w_{ij} = \text{softmax}_j\left(Q^{\ell} h_i^{\ell} \cdot K^{\ell} h_j^{\ell}\right),$$

where $j \in S$ denotes the set of words in the sentence and $Q^{\ell}, K^{\ell}, V^{\ell}$ are learnable linear weights (denoting the **Q**uery, **K**ey and **V**alue for the attention computation, respectively).
The attention mechanism is performed in parallel for each word in the sentence to obtain their updated features in *one shot*–another plus point for Transformers over RNNs, which update features word-by-word.

We can understand the attention mechanism better through the following pipeline:

![](https://graphdeeplearning.github.io/post/transformers-are-gnns/attention-block.jpg)

> Taking in the features of the word $h_i^{\ell}$ and the set of other words in the sentence $h_j^{\ell};\ \forall j \in S$, we compute the attention weights $w_{ij}$ for each pair $(i,j)$ through the dot-product, followed by a softmax across all $j$'s. Finally, we produce the updated word feature $h_i^{\ell+1}$ for word $i$ by summing over all $h_j^{\ell}$'s weighted by their corresponding $w_{ij}$. Each word in the sentence undergoes the same pipeline in parallel to update its features.
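Here is the same update as a minimal sketch in plain PyTorch (a single attention head, no scaling or masking yet; the variable names follow the equations above, everything else is my own illustration):

```python
import torch

torch.manual_seed(0)
n, d = 5, 8                        # a 5-word sentence, feature dimension d
h = torch.randn(n, d)              # rows are the word features h_i

Q = torch.randn(d, d)              # learnable query projection Q
K = torch.randn(d, d)              # learnable key projection   K
V = torch.randn(d, d)              # learnable value projection V

scores = (h @ Q) @ (h @ K).T       # dot-products Q h_i . K h_j for every pair (i, j)
w = torch.softmax(scores, dim=-1)  # w_ij: softmax over all words j in S
h_next = w @ (h @ V)               # h_i = sum_j w_ij (V h_j), for all i at once
```

Note that the last three lines update every word's features simultaneously: the whole sentence is a handful of matrix multiplications away.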
### Multi-head Attention mechanism

Getting this dot-product attention mechanism to work proves to be tricky–bad random initializations can de-stabilize the learning process. We can overcome this by performing multiple 'heads' of attention in parallel and concatenating the result (with each head now having separate learnable weights):

$$h_i^{\ell+1} = \text{Concat}\left(\text{head}_1, \ldots, \text{head}_K\right) O^{\ell},$$

$$\text{head}_k = \text{Attention}\left(Q^{k,\ell} h_i^{\ell},\ K^{k,\ell} h_j^{\ell},\ V^{k,\ell} h_j^{\ell}\right),$$

where $Q^{k,\ell}, K^{k,\ell}, V^{k,\ell}$ are the learnable weights of the $k$'th attention head and $O^{\ell}$ is a down-projection to match the dimensions of $h_i^{\ell+1}$ and $h_i^{\ell}$ across layers.

Multiple heads allow the attention mechanism to essentially 'hedge its bets', looking at different transformations or aspects of the hidden features from the previous layer. We'll talk more about this later.
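In code, multi-head attention is just several copies of the single-head sketch with separate weights, concatenated and down-projected. A toy sketch (real implementations batch all heads into single tensors rather than looping):

```python
import torch

torch.manual_seed(0)
n, d, num_heads = 5, 8, 4
d_head = d // num_heads                # feature dimension of each head
h = torch.randn(n, d)

def attention_head(h, Q, K, V):
    # the single-head update from the previous sketch
    w = torch.softmax((h @ Q) @ (h @ K).T, dim=-1)
    return w @ (h @ V)

heads = []
for k in range(num_heads):             # each head gets its own Q, K, V weights
    Qk, Kk, Vk = (torch.randn(d, d_head) for _ in range(3))
    heads.append(attention_head(h, Qk, Kk, Vk))

O = torch.randn(d, d)                  # down-projection O
h_next = torch.cat(heads, dim=-1) @ O  # concat heads, back to shape (n, d)
```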
### Scale issues and the Feed-forward sub-layer

A key issue motivating the final Transformer architecture is that the features for words *after* the attention mechanism might be at **different scales** or **magnitudes**: (1) This can be due to some words having very sharp or very distributed attention weights $w_{ij}$ when summing over the features of the other words. (2) At the level of individual feature/vector entries, concatenating across multiple attention heads–each of which might output values at different scales–can lead to the entries of the final vector $h_i^{\ell+1}$ having a wide range of values. Following conventional ML wisdom, it seems reasonable to add a [normalization layer](https://nealjean.com/ml/neural-network-normalization/) into the pipeline.

Transformers overcome issue (2) with [**LayerNorm**](https://arxiv.org/abs/1607.06450), which normalizes and learns an affine transformation at the feature level. Additionally, **scaling the dot-product** attention by the square-root of the feature dimension helps counteract issue (1).

Finally, the authors propose another 'trick' to control the scale issue: **a position-wise 2-layer MLP** with a special structure. After the multi-head attention, they project $h_i^{\ell+1}$ to an (absurdly) higher dimension by a learnable weight, where it undergoes the ReLU non-linearity, and is then projected back to its original dimension followed by another normalization:

$$h_i^{\ell+1} = \text{LN}\left(\text{MLP}\left(\text{LN}\left(h_i^{\ell+1}\right)\right)\right)$$

> To be honest, I'm not sure what the exact intuition behind the over-parameterized feed-forward sub-layer was, and nobody seems to be asking questions about it, either!
> I suppose LayerNorm and scaled dot-products didn't completely solve the issues highlighted, so the big MLP is a sort of hack to re-scale the feature vectors independently of each other. [Email me](mailto:chaitanya.joshi@ntu.edu.sg) if you know more!

The final picture of a Transformer layer looks like this:

![](https://graphdeeplearning.github.io/post/transformers-are-gnns/transformer-block.png)

The Transformer architecture is also extremely amenable to very deep networks, enabling the NLP community to [*scale*](https://arxiv.org/abs/1910.10683) [*up*](https://arxiv.org/abs/2001.08361) in terms of both model parameters and, by extension, data. **Residual connections** between the inputs and outputs of each multi-head attention sub-layer and the feed-forward sub-layer are key for stacking Transformer layers (but omitted from the diagram for clarity).
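Putting the sub-layers together, one Transformer layer can be sketched as follows. This is only a rough sketch using PyTorch's built-in `nn.MultiheadAttention` (which applies the dot-product scaling internally), with post-LayerNorm placement and residual connections as described above; it is not the reference implementation:

```python
import torch
import torch.nn as nn

class TransformerLayerSketch(nn.Module):
    """One layer: multi-head attention and a position-wise MLP, each wrapped
    in a residual connection followed by LayerNorm."""

    def __init__(self, d=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(          # project up (absurdly), ReLU, project back
            nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d),
        )
        self.norm2 = nn.LayerNorm(d)

    def forward(self, h):                  # h: (batch, n_words, d)
        attn_out, _ = self.attn(h, h, h)   # self-attention over the whole sentence
        h = self.norm1(h + attn_out)       # residual connection + LayerNorm
        h = self.norm2(h + self.mlp(h))    # residual connection + LayerNorm
        return h

layer = TransformerLayerSketch()
h = layer(torch.randn(2, 5, 512))          # two 5-word "sentences"
print(h.shape)                             # torch.Size([2, 5, 512])
```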
data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tGraph Neural Networks (GNNs) or Graph Convolutional Networks (GCNs) build representations of nodes and edges in graph data. They do so through\u00a0<strong>neighbourhood aggregation<\/strong>\u00a0(or message passing), where each node gathers features from its neighbours to update its representation of the\u00a0<em>local<\/em>\u00a0graph structure around it. Stacking several GNN layers enables the model to propagate each node\u2019s features over the entire graph\u2013from its neighbours to the neighbours\u2019 neighbours, and so on.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b0c3b5a elementor-widget elementor-widget-image\" data-id=\"b0c3b5a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/graphdeeplearning.github.io\/post\/transformers-are-gnns\/gnn-social-network.jpg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e8a5c26 elementor-widget elementor-widget-text-editor\" data-id=\"e8a5c26\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<blockquote>Take the example of this emoji social network: The node features produced by the GNN can be used for predictive tasks such as identifying the most influential members or proposing potential connections.<\/blockquote>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-20ea352 elementor-widget elementor-widget-text-editor\" data-id=\"20ea352\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn their most basic form, GNNs update the hidden features\u00a0h\u00a0of node\u00a0i\u00a0(for example, \ud83d\ude06) at layer\u00a0\u2113\u00a0via a non-linear transformation of the node\u2019s own features\u00a0hi\u2113\u00a0added to the aggregation of features\u00a0hj\u2113\u00a0from each neighbouring node\u00a0j\u2208N(i):\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5ce04f2 elementor-widget elementor-widget-text-editor\" data-id=\"5ce04f2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\thi\u2113+1=\u03c3(U\u2113hi\u2113+\u2211j\u2208N(i)(V\u2113hj\u2113)),\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-03ebd6f elementor-widget elementor-widget-text-editor\" data-id=\"03ebd6f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\twhere\u00a0U\u2113,V\u2113\u00a0are learnable weight matrices of the GNN layer and\u00a0\u03c3\u00a0is a non-linearity such as ReLU. 
In the example, $\mathcal{N}$(😆) = { 😘, 😎, 😜, 🤩 }.

The summation over the neighbourhood nodes $j \in \mathcal{N}(i)$ can be replaced by other input size-invariant **aggregation functions** such as simple mean/max or something more powerful, such as a weighted sum via an [**attention mechanism**](https://petar-v.com/GAT/).
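As a sketch, the basic update can be written for every node at once by using a binary adjacency matrix (a toy, randomly generated one here) to perform the neighbourhood sum:

```python
import torch

torch.manual_seed(0)
n, d = 5, 8                           # 5 nodes, feature dimension d
h = torch.randn(n, d)                 # h_j for every node j
A = (torch.rand(n, n) < 0.4).float()  # toy adjacency matrix: A[i, j] = 1 iff j in N(i)
A.fill_diagonal_(0)                   # self features enter via U, not via the sum

U = torch.randn(d, d)                 # learnable weight U for the node's own features
V = torch.randn(d, d)                 # learnable weight V for the neighbours

# h_i = ReLU( U h_i + sum_{j in N(i)} V h_j ), for all nodes i at once:
h_next = torch.relu(h @ U + A @ (h @ V))
```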
Does that sound familiar?

Maybe a pipeline will help make the connection:

![](https://graphdeeplearning.github.io/post/transformers-are-gnns/gnn-block.jpg)

If we were to do multiple parallel heads of neighbourhood aggregation and replace summation over the neighbours $j$ with the attention mechanism, *i.e.*, a weighted sum, we'd get the **Graph Attention Network** (GAT). Add normalization and the feed-forward MLP, and voila, we have a **Graph Transformer**!

### Sentences are fully-connected word graphs

To make the connection more explicit, consider a sentence as a fully-connected graph, where each word is connected to every other word. Now, we can use a GNN to build features for each node (word) in the graph (sentence), which we can then use to perform NLP tasks.

![](https://graphdeeplearning.github.io/post/transformers-are-gnns/gnn-nlp.jpg)

Broadly, this is what Transformers are doing: they are **GNNs with multi-head attention** as the neighbourhood aggregation function. Whereas standard GNNs aggregate features from their local neighbourhood nodes $j \in \mathcal{N}(i)$, Transformers for NLP treat the entire sentence $S$ as the local neighbourhood, aggregating features from each word $j \in S$ at each layer.
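One way to see this in code: the only difference between attention over a sentence and attention over a graph neighbourhood is a mask deciding which pairs $(i,j)$ get a weight. A sketch, reusing the single-head attention from before (the adjacency matrix is again a toy example):

```python
import torch

torch.manual_seed(0)
n, d = 5, 8
h = torch.randn(n, d)
Q, K, V = (torch.randn(d, d) for _ in range(3))

def attention(h, adj):
    scores = (h @ Q) @ (h @ K).T
    # hide non-edges: w_ij = 0 unless j is in N(i)
    scores = scores.masked_fill(adj == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ (h @ V)

full = torch.ones(n, n)               # a sentence: every word sees every word
A = (torch.rand(n, n) < 0.5).float()  # a toy sparse graph
A.fill_diagonal_(1)                   # keep self-loops so every row attends somewhere

h_transformer = attention(h, full)    # Transformer update: N(i) = S
h_gat_style = attention(h, A)         # graph-restricted update: N(i) from the graph
```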
Importantly, various problem-specific tricks–such as position encodings, causal/masked aggregation, learning rate schedules and extensive pre-training–are essential for the success of Transformers but seldom seen in the GNN community. At the same time, looking at Transformers from a GNN perspective could inspire us to get rid of a lot of the *bells and whistles* in the architecture.

### What can we learn from each other?

Now that we've established a connection between Transformers and GNNs, let me throw some ideas around…

#### Are fully-connected graphs the best input format for NLP?

Before statistical NLP and ML, linguists like Noam Chomsky focused on developing formal theories of [linguistic structure](https://en.wikipedia.org/wiki/Syntactic_Structures), such as **syntax trees/graphs**. [Tree LSTMs](https://arxiv.org/abs/1503.00075) already tried this, but maybe Transformers/GNNs are better architectures for bringing the world of linguistic theory and statistical NLP closer?

![](https://graphdeeplearning.github.io/post/transformers-are-gnns/syntax-tree.png)
src=\"https:\/\/graphdeeplearning.github.io\/post\/transformers-are-gnns\/syntax-tree.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7a7fd96 elementor-widget elementor-widget-heading\" data-id=\"7a7fd96\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">\n<h4 id=\"how-to-learn-long-term-dependencies\">How to learn long-term dependencies?<\/h4><\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-68dcfd5 elementor-widget elementor-widget-text-editor\" data-id=\"68dcfd5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAnother issue with fully-connected graphs is that they make learning very long-term dependencies between words difficult. This is simply due to how the number of edges in the graph\u00a0<strong>scales quadratically<\/strong>\u00a0with the number of nodes,\u00a0<em>i.e.<\/em>, in an\u00a0n\u00a0word sentence, a Transformer\/GNN would be doing computations over\u00a0n2\u00a0pairs of words. Things get out of hand for very large\u00a0n.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-35b90a4 elementor-widget elementor-widget-text-editor\" data-id=\"35b90a4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe NLP community\u2019s perspective on the long sequences and dependencies problem is interesting: Making the attention mechanism\u00a0<a href=\"https:\/\/openai.com\/blog\/sparse-transformer\/\" class=\"broken_link\" rel=\"noopener\">sparse<\/a>\u00a0or\u00a0<a href=\"https:\/\/ai.facebook.com\/blog\/making-transformer-networks-simpler-and-more-efficient\/\" class=\"broken_link\" rel=\"noopener\">adaptive<\/a>\u00a0in terms of input size, adding\u00a0<a href=\"https:\/\/ai.googleblog.com\/2019\/01\/transformer-xl-unleashing-potential-of.html\" rel=\"noopener\">recurrence<\/a>\u00a0or\u00a0<a href=\"https:\/\/deepmind.com\/blog\/article\/A_new_model_and_dataset_for_long-range_memory\" rel=\"noopener\">compression<\/a>\u00a0into each layer, and using\u00a0<a href=\"https:\/\/www.pragmatic.ml\/reformer-deep-dive\/\" class=\"broken_link\" rel=\"noopener\">Locality Sensitive Hashing<\/a>\u00a0for efficient attention are all promising new ideas for better Transformers.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ec839f6 elementor-widget elementor-widget-text-editor\" data-id=\"ec839f6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nIt would be interesting to see ideas from the GNN community thrown into the mix,\u00a0<em>e.g.<\/em>,\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1911.04070\" rel=\"noopener\">Binary Partitioning<\/a>\u00a0for sentence\u00a0<strong>graph sparsification<\/strong>\u00a0seems like another exciting approach.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a3670a5 elementor-widget elementor-widget-image\" data-id=\"a3670a5\" data-element_type=\"widget\" data-e-type=\"widget\" 
data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/graphdeeplearning.github.io\/post\/transformers-are-gnns\/long-term-depend.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8cc2f58 elementor-widget elementor-widget-heading\" data-id=\"8cc2f58\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\"><h4 id=\"are-transformers-learning-neural-syntax\">Are Transformers learning \u2018neural syntax\u2019?<\/h4><\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-667b897 elementor-widget elementor-widget-text-editor\" data-id=\"667b897\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThere have been\u00a0<a href=\"https:\/\/pair-code.github.io\/interpretability\/bert-tree\/\" rel=\"noopener\">several<\/a>\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1905.05950\" rel=\"noopener\">interesting<\/a>\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1906.04341\" rel=\"noopener\">papers<\/a>\u00a0from the NLP community on what Transformers might be learning. The basic premise is that performing attention on all word pairs in a sentence\u2013with the purpose of identifying which pairs are the most interesting\u2013enables Transformers to learn something like a\u00a0<strong>task-specific syntax<\/strong>.\nDifferent heads in the multi-head attention might also be \u2018looking\u2019 at different syntactic properties.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-09ad4b9 elementor-widget elementor-widget-text-editor\" data-id=\"09ad4b9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn graph terms, by using GNNs on full graphs, can we recover the most important edges\u2013and what they might entail\u2013from how the GNN performs neighbourhood aggregation at each layer? I\u2019m\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1909.07913\" rel=\"noopener\">not so convinced<\/a>\u00a0by this view yet.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-543d57c elementor-widget elementor-widget-image\" data-id=\"543d57c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/graphdeeplearning.github.io\/post\/transformers-are-gnns\/attention-heads.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8a32ead elementor-widget elementor-widget-heading\" data-id=\"8a32ead\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\"><h4 id=\"why-multiple-heads-of-attention-why-attention\">Why multiple heads of attention? 
I'm more sympathetic to the optimization view of the multi-head mechanism–having multiple attention heads **improves learning** and overcomes **bad random initializations**. For instance, [these](https://lena-voita.github.io/posts/acl19_heads.html) [papers](https://arxiv.org/abs/1905.10650) showed that Transformer heads can be 'pruned' or removed *after* training without significant performance impact.

Conversely, GNNs with simpler aggregation functions such as sum or max do not require multiple aggregation heads for stable training. Wouldn't it be nice for Transformers if we didn't have to compute pair-wise compatibilities between each word pair in the sentence?

Could Transformers benefit from ditching attention altogether? Yann Dauphin and collaborators' [recent](https://arxiv.org/abs/1705.03122) [work](https://arxiv.org/abs/1901.10430) suggests an alternative **ConvNet architecture**.
Transformers, too, might ultimately be doing [something](http://jbcordonnier.com/posts/attention-cnn/) [similar](https://twitter.com/ChrSzegedy/status/1232148457810538496) to ConvNets!

![](https://graphdeeplearning.github.io/post/transformers-are-gnns/attention-conv.png)

#### Why is training Transformers so hard?

Reading new Transformer papers makes me feel that training these models requires something akin to *black magic* when determining the best **learning rate schedule, warmup strategy** and **decay settings**.
This could simply be because the models are so huge and the NLP tasks studied are so challenging. But [recent](https://arxiv.org/abs/1906.01787) [results](https://arxiv.org/abs/1910.06764) [suggest](https://arxiv.org/abs/2002.04745) that it could also be due to the specific permutation of normalization and residual connections within the architecture.
&#8220;For word-based LM we used 16, 000 warmup steps with 500, 000 decay steps and sacrifice 9,000 goats.&#8221;<a dir=\"ltr\" title=\"https:\/\/arxiv.org\/pdf\/1911.05507.pdf\" href=\"https:\/\/t.co\/dP49GTa4ze\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-expanded-url=\"https:\/\/arxiv.org\/pdf\/1911.05507.pdf\" data-scribe=\"element:url\">https:\/\/arxiv.org\/pdf\/1911.05507.pdf\u00a0\u2026<\/a><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-cf9524e elementor-widget elementor-widget-text-editor\" data-id=\"cf9524e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<article dir=\"ltr\" data-scribe=\"component:card\">\n<p style=\"text-align: center;\"><a href=\"https:\/\/twitter.com\/chaitjo\/status\/1229335421806501888\/photo\/1\" rel=\"noopener\"><img decoding=\"async\" style=\"width: 680px; height: 157px;\" title=\"View image on Twitter\" src=\"https:\/\/pbs.twimg.com\/media\/EQ96C8zUYAE0b1u?format=png&amp;name=small\" alt=\"View image on Twitter\" data-image=\"https:\/\/pbs.twimg.com\/media\/EQ96C8zUYAE0b1u\" data-image-format=\"png\" \/><\/a><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-895bfe3 elementor-widget elementor-widget-text-editor\" data-id=\"895bfe3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<\/article>&nbsp;<\/blockquote>\nAt this point I\u2019m ranting, but this makes me sceptical: Do we really need multiple heads of expensive pair-wise attention, overparameterized MLP sub-layers, and complicated learning schedules?\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ccdca91 elementor-widget elementor-widget-text-editor\" data-id=\"ccdca91\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tDo we really need massive models with\u00a0<a href=\"https:\/\/www.technologyreview.com\/s\/613630\/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes\/\" rel=\"noopener\">massive carbon footprints<\/a>?\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e1310d4 elementor-widget elementor-widget-text-editor\" data-id=\"e1310d4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tShouldn\u2019t architectures with good\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1806.01261\" rel=\"noopener\">inductive biases<\/a>\u00a0for the task at hand be easier to train?\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ca5f23c elementor-widget elementor-widget-text-editor\" data-id=\"ca5f23c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tPublished in\u00a0<a href=\"https:\/\/graphdeeplearning.github.io\/post\/transformers-are-gnns\/\" rel=\"noopener\">NTU Graph Deep Learning 
Lab<\/a>.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>For Natural Language Processing (NLP), conventionally,&nbsp;Recurrent Neural Networks&nbsp;(RNNs) build representations of each word in a sentence in a sequential manner,&nbsp;i.e.,&nbsp;one word at a time. This post establishes links between&nbsp;Graph Neural Networks (GNNs)&nbsp;and Transformers. It will talk about the intuitions behind model architectures in the NLP and GNN communities, make connections using equations and figures, and discuss how we could work together to drive progress. It starts by talking about the purpose of model architectures&ndash;representation learning.<\/p>\n","protected":false},"author":745,"featured_media":8204,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[92],"ppma_author":[3593],"class_list":["post-2320","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-machine-learning"],"authors":[{"term_id":3593,"user_id":745,"is_guest":0,"slug":"chaitanya-joshi","display_name":"Chaitanya Joshi","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Joshi","first_name":"Chaitanya","job_title":"","description":"Chaitanya Joshi is Research Assistant at NTU, Singapore. His current research focuses on the emerging field of Graph Deep Learning and its applications for Operations Research and Combinatorial Optimization. He has co-authored patent applications and research papers at top Machine Learning conferences such as NeurIPS and ICLR."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2320","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/745"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=2320"}],"version-history":[{"count":5,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2320\/revisions"}],"predecessor-version":[{"id":35233,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2320\/revisions\/35233"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/8204"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=2320"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=2320"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=2320"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=2320"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}