{"id":9372,"date":"2020-08-20T06:42:00","date_gmt":"2020-08-20T06:42:00","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/?p=9372"},"modified":"2023-11-17T14:20:16","modified_gmt":"2023-11-17T14:20:16","slug":"simple-scalable-graph-neural-networks","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/simple-scalable-graph-neural-networks\/","title":{"rendered":"Simple Scalable Graph Neural Networks"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"9372\" class=\"elementor elementor-9372\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-31f7139d elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"31f7139d\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-3c413efc\" data-id=\"3c413efc\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-39fe7f4a elementor-widget elementor-widget-text-editor\" data-id=\"39fe7f4a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><strong>TL;DR:<\/strong><em>\u00a0One of the challenges that have so far precluded the wide adoption of graph neural networks in industrial applications is the difficulty to scale them to large graphs such as the Twitter follow graph. The interdependence between nodes makes the decomposition of the loss function into individual nodes\u2019 contributions challenging. 
In this post, we describe a simple graph neural network architecture developed at Twitter that can work on very large graphs.<\/em><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><em>This post was co-authored with\u00a0<\/em><a href=\"https:\/\/twitter.com\/ffabffrasca?lang=en\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Fabrizio Frasca<\/em><\/a><em>\u00a0and\u00a0<\/em><a href=\"https:\/\/www.emanuelerossi.co.uk\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Emanuele Rossi<\/em><\/a><em>.<\/em><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:separator --><hr class=\"wp-block-separator\" \/><!-- \/wp:separator -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a34733f elementor-widget elementor-widget-text-editor\" data-id=\"a34733f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Graph Neural Networks (GNNs) are a class of ML models that have emerged in recent years for learning on graph-structured data. GNNs have been successfully applied to model systems of relations and interactions in a variety of different domains, including social science, computer vision and graphics, particle physics, chemistry, and medicine. Until recently, most of the research in the field has focused on developing new GNN models and testing them on small graphs (with Cora, a citation network containing only about 5K nodes, still being widely used [1]); relatively little effort has been invested in dealing with large-scale applications. On the other hand, industrial problems often deal with giant graphs, such as Twitter or Facebook social networks containing hundreds of millions of nodes and billions of edges. 
Many of the methods described in the literature are unsuitable for these settings.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>In a nutshell, graph neural networks operate by aggregating the features from local neighbour nodes. Arranging the\u00a0<em>d<\/em>-dimensional node features into an\u00a0<em>n<\/em>\u00d7<em>d<\/em>\u00a0matrix\u00a0<strong>X<\/strong>\u00a0(here\u00a0<em>n<\/em>\u00a0denotes the number of nodes), the simplest convolution-like operation on graphs implemented in the popular\u00a0<a href=\"http:\/\/tkipf.github.io\/graph-convolutional-networks\/\" target=\"_blank\" rel=\"noreferrer noopener\">GCN model<\/a>\u00a0[2] combines node-wise transformations with feature diffusion across adjacent nodes<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Y<\/strong>\u00a0= ReLU(<strong>AXW<\/strong>).<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-67f84e3 elementor-widget elementor-widget-text-editor\" data-id=\"67f84e3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Here\u00a0<strong>W<\/strong>\u00a0is a learnable matrix shared across all nodes and\u00a0<strong>A<\/strong>\u00a0is a linear diffusion operator amounting to a weighted average of features in a neighbourhood [3]. Multiple layers of this form can be applied in sequence like in traditional CNNs. Graph neural networks can be designed to make predictions at the level of nodes (e.g. for applications such as detecting malicious users in a social network), edges (e.g. for link prediction, a typical scenario in recommender systems), or entire graphs (e.g. predicting chemical properties of molecular graphs). 
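To make the propagation rule concrete, here is a minimal NumPy sketch of a single GCN layer Y = ReLU(AXW). It is an illustration, not the authors' code; it assumes the symmetrically normalised adjacency matrix with self-loops as the diffusion operator A, one common choice [3]:

```python
import numpy as np

def normalized_adjacency(adj):
    """Symmetrically normalised adjacency with self-loops:
    A_hat = D^{-1/2} (adj + I) D^{-1/2}, the operator used in GCN."""
    a = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A, X, W):
    """One graph convolution: diffuse features over edges, apply the
    shared node-wise transform W, then an elementwise ReLU."""
    return np.maximum(A @ X @ W, 0.0)

# Toy graph: a triangle plus a pendant node (n = 4 nodes, d = 2 features).
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
A = normalized_adjacency(adj)
X = np.random.randn(4, 2)
W = np.random.randn(2, 8)   # maps 2 input features to 8 hidden units
Y = gcn_layer(A, X, W)      # shape (4, 8): one embedding per node
```

With a dense A this costs 𝒪(n²d) per layer; practical implementations store A as a sparse matrix so the cost scales with the number of edges instead.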
The node-wise classification task can be carried out, for instance, by a two-layer GCN of the form<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Y<\/strong>\u00a0= softmax(<strong>A\u00a0<\/strong>ReLU(<strong>AXW<\/strong>)<strong>W<\/strong>\u2019).<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Why is scaling graph neural networks challenging? In the aforementioned node-wise prediction problem, the nodes play the role of samples on which the GNN is trained. In traditional machine learning settings, it is typically assumed that the samples are drawn from some distribution in a statistically independent manner. This, in turn, allows one to decompose the loss function into the individual sample contributions and employ stochastic optimisation techniques working with small subsets (mini-batches) of the training data at a time. Virtually every deep neural network architecture is nowadays trained using mini-batches.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-402f1b5 elementor-widget elementor-widget-text-editor\" data-id=\"402f1b5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>In graphs, on the other hand, the fact that the nodes are inter-related via edges creates statistical dependence between samples in the training set. Moreover, because of the statistical dependence between nodes, sampling can introduce bias \u2014 for instance it can make some nodes or edges appear more frequently than others in the training set \u2014 and this \u2018side-effect\u2019 would need proper handling. 
Last but not least, one has to guarantee that the sampled subgraph maintains a meaningful structure that the GNN can exploit.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>In many early works on graph neural networks, these problems were swept under the carpet: architectures such as GCN and ChebNet [2], MoNet [4] and GAT [5] were trained using full-batch gradient descent. This has led to the necessity of holding the whole adjacency matrix of the graph and the node features in memory. As a result, for example, an\u00a0<em>L<\/em>-layer GCN model has time complexity \ud835\udcaa(<em>Lnd<\/em>\u00b2) and memory complexity\u00a0<em>\ud835\udcaa<\/em>(<em>Lnd + Ld<\/em>\u00b2) [7], prohibitive even for modestly-sized graphs.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d32d11f elementor-widget elementor-widget-text-editor\" data-id=\"d32d11f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>The first work to tackle the problem of scalability was GraphSAGE [8], a seminal paper by Will Hamilton and co-authors. GraphSAGE used neighbourhood sampling combined with mini-batch training to train GNNs on large graphs (the acronym SAGE, standing for \u201csample and aggregate\u201d, is a reference to this scheme). The main idea is that in order to compute the training loss on a single node with an\u00a0<em>L<\/em>-layer GCN, only the\u00a0<em>L<\/em>-hop neighbours of that node are necessary, as nodes further away in the graph are not involved in the computation. 
The problem is that, for graphs of the \u201c<a href=\"https:\/\/en.wikipedia.org\/wiki\/Small-world_network#:~:text=A%20small%2Dworld%20network%20is,number%20of%20hops%20or%20steps.\" target=\"_blank\" rel=\"noreferrer noopener\">small-world<\/a>\u201d type, such as social networks, the 2-hop neighbourhood of some nodes may already contain millions of nodes, making it too big to be stored in memory [9]. GraphSAGE tackles this problem by sampling the neighbours up to the\u00a0<em>L<\/em>-th hop: starting from the training node, it samples uniformly with replacement [10] a fixed number\u00a0<em>k<\/em>\u00a0of 1-hop neighbours, then for each of these neighbours it again samples\u00a0<em>k<\/em>\u00a0neighbours, and so on for\u00a0<em>L\u00a0<\/em>times. In this way, for every node we are guaranteed to have a bounded\u00a0<em>L<\/em>-hop sampled neighbourhood of \ud835\udcaa(<em>k\u1d38<\/em>) nodes. If we then construct a batch with\u00a0<em>b<\/em>\u00a0training nodes, each with its own independent\u00a0<em>L<\/em>-hop neighbourhood, we get to a memory complexity of \ud835\udcaa(<em>bk\u1d38<\/em>) independent of the graph size\u00a0<em>n<\/em>. 
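The recursive sampling scheme just described can be sketched in a few lines (a toy illustration with hypothetical helper names; GraphSAGE's aggregation and training steps are omitted):

```python
import random

def sample_l_hop(graph, node, k, L, rng):
    """Recursively sample k neighbours uniformly with replacement for
    L hops, as in GraphSAGE; returns the sampled computation tree.
    The tree has O(k**L) nodes regardless of the graph size n."""
    if L == 0:
        return (node, [])
    sampled = [rng.choice(graph[node]) for _ in range(k)]  # with replacement
    return (node, [sample_l_hop(graph, v, k, L - 1, rng) for v in sampled])

def tree_size(tree):
    """Count nodes in the sampled computation tree."""
    node, children = tree
    return 1 + sum(tree_size(c) for c in children)

# Toy adjacency list; node 0 plays the role of the training node.
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
rng = random.Random(0)
tree = sample_l_hop(graph, 0, k=2, L=2, rng=rng)
# Bounded size: 1 + k + k^2 = 7 tree nodes for k = 2, L = 2.
```

Because exactly k neighbours are drawn at every hop, the memory footprint per training node is bounded by the geometric sum 1 + k + … + kᴸ, independent of n.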
The computational complexity of one batch of GraphSAGE is \ud835\udcaa(<em>bLd<\/em>\u00b2<em>k\u1d38<\/em>).<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8f0b631 elementor-widget elementor-widget-image\" data-id=\"8f0b631\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1618\/1*0Fmpikt6XWGoxCfZ3ei2JQ.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9b345c7 elementor-widget elementor-widget-text-editor\" data-id=\"9b345c7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>A notable drawback of GraphSAGE is that sampled nodes might appear multiple times, thus potentially introducing a lot of redundant computation. For instance, in the figure above the dark green node appears in the\u00a0<em>L<\/em>-hop neighbourhoods of both training nodes, and therefore its embedding is computed twice in the batch. As the batch size\u00a0<em>b<\/em>\u00a0and the number of samples\u00a0<em>k<\/em>\u00a0increase, the amount of redundant computation increases as well. Moreover, despite having \ud835\udcaa(<em>bk\u1d38<\/em>) nodes in memory for each batch, the loss is computed on only\u00a0<em>b<\/em>\u00a0of them, and therefore, the computation for the other nodes is also in some sense wasted.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Multiple follow-up works focused on improving the sampling of mini-batches in order to remove the redundant computation of GraphSAGE and make each batch more efficient. 
The most recent works in this direction are ClusterGCN [11] and GraphSAINT [12], which take the approach of<em>\u00a0graph-sampling<\/em>\u00a0(as opposed to the neighbourhood-sampling of GraphSAGE). In graph-sampling approaches, for each batch, a subgraph of the original graph is sampled, and a full GCN-like model is run on the entire subgraph. The challenge is to make sure that these subgraphs preserve most of the original edges and still present a meaningful topological structure.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>ClusterGCN achieves this by first clustering the graph. Then, at each batch, the model is trained on one cluster. This allows the nodes in each batch to be as tightly connected as possible.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d6088c4 elementor-widget elementor-widget-text-editor\" data-id=\"d6088c4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>GraphSAINT proposes instead a general probabilistic graph sampler constructing training batches by sampling subgraphs of the original graph. The graph sampler can be designed according to different schemes: for example, it can perform uniform node sampling, uniform edge sampling, or \u201cimportance sampling\u201d by using random walks to compute the importance of nodes and use it as the probability distribution for sampling.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>It is also important to note that one of the advantages of sampling is that during training it acts as a sort of edge-wise dropout, which regularises the model and can help the performance [13]. However, edge dropout would still require seeing all the edges at inference time, which is not feasible here. 
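As a rough sketch of the graph-sampling idea, the simplest scheme (uniform node sampling followed by taking the induced subgraph) looks as follows. This is illustrative only; GraphSAINT additionally normalises the loss to compensate for sampling bias, which is omitted here:

```python
import numpy as np

def sample_induced_subgraph(adj, batch_size, rng):
    """Uniformly sample `batch_size` nodes without replacement and
    return the induced subgraph's adjacency matrix together with the
    chosen node indices. A full GCN-like model would then be run on
    this small subgraph only."""
    n = adj.shape[0]
    nodes = rng.choice(n, size=batch_size, replace=False)
    nodes.sort()
    return adj[np.ix_(nodes, nodes)], nodes

# 6-node ring graph as a toy example.
n = 6
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0
rng = np.random.default_rng(0)
sub_adj, nodes = sample_induced_subgraph(adj, batch_size=4, rng=rng)
# sub_adj keeps only the edges whose endpoints were both sampled.
```

Edges between a sampled and a non-sampled node are dropped, which is exactly why such subgraphs must be designed to preserve most of the original connectivity.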
Another effect graph sampling might have is reducing the bottleneck [14] and the resulting \u201cover-squashing\u201d phenomenon that stems from the exponential expansion of the neighbourhood.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>In our recent paper with Ben Chamberlain, Davide Eynard, and Federico Monti [15], we investigated the extent to which it is possible to design simple, sampling-free architectures for node-wise classification problems. You may wonder why one would prefer to abandon sampling strategies in light of the indirect benefits we have just highlighted above. There are a few reasons for that. First, instances of node classification problems may significantly vary from one another and, to the best of our knowledge, no work so far has systematically studied\u00a0<em>when<\/em>\u00a0sampling actually provides positive effects other than just alleviating computational complexity. Second, the implementation of sampling schemes introduces additional complexity and we believe a simple, strong, sampling-free, scalable baseline architecture is appealing.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Our approach is motivated by several recent empirical findings. First, simple fixed aggregators (such as GCN) were often shown to outperform more complex ones, such as GAT or MPNN [16]. Second, while the success of deep learning was built on models with many layers, in <a href=\"https:\/\/www.experfy.com\/blog\/do-we-need-deep-graph-neural-networks\/\" target=\"_blank\" rel=\"noreferrer noopener\">graph deep learning<\/a> it is still an open question\u00a0<a href=\"https:\/\/towardsdatascience.com\/do-we-need-deep-graph-neural-networks-be62d3ec5c59\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">whether depth is needed<\/a>. 
In particular, Wu and coauthors [17] argue that a GCN model with a single multi-hop diffusion layer can perform on par with models with multiple layers.<\/p>\n<!-- \/wp:paragraph -->\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-63263a5 elementor-widget elementor-widget-text-editor\" data-id=\"63263a5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>By combining different, fixed neighbourhood aggregators within a single convolutional layer, it is possible to obtain an extremely scalable model without resorting to graph sampling [18]. In other words, all the graph related (fixed) operations are in the first layer of the architecture and can therefore be precomputed; the pre-aggregated information can then be fed as inputs to the rest of the model which, due to the lack of neighbourhood aggregation, boils down to a multi-layer perceptron (MLP). Importantly, the expressivity in the graph filtering operations can still be retained even with such a shallow convolutional scheme by employing several, possibly specialised and more complex, diffusion operators. 
As an example, it is possible to design operators to include\u00a0<a href=\"https:\/\/towardsdatascience.com\/beyond-weisfeiler-lehman-using-substructures-for-provably-expressive-graph-neural-networks-d476ad665fa3\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">local substructure counting<\/a>\u00a0[19] or graph motifs [20].<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e59586a elementor-widget elementor-widget-image\" data-id=\"e59586a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1417\/0*MwmeKZbygKAMQRKX\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-afbd02f elementor-widget elementor-widget-text-editor\" data-id=\"afbd02f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>The proposed scalable architecture, which we call Scalable Inception-like Graph Network (SIGN) has the following form for node-wise classification tasks:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Y<\/strong>\u00a0= softmax(ReLU(<strong>XW<\/strong>\u2080<strong>\u00a0| A<\/strong>\u2081<strong>XW<\/strong>\u2081<strong>\u00a0| A<\/strong>\u2082<strong>XW<\/strong>\u2082<strong>\u00a0| \u2026 | A<\/strong><em>\u1d63<\/em><strong>XW<\/strong><em>\u1d63<\/em>)\u00a0<strong>W<\/strong>\u2019)<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Here\u00a0<strong>A<\/strong><em>\u1d63<\/em>\u00a0are linear diffusion matrices (such as a normalised adjacency matrix, its powers, or a motif matrix) and\u00a0<strong>W<\/strong><em>\u1d63<\/em>\u00a0and\u00a0<strong>W<\/strong>\u2019 are 
learnable parameters. As depicted in the figure above, the network can be made deeper with additional node-wise layers,<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Y<\/strong>\u00a0= softmax(ReLU(\u2026ReLU(<strong>XW<\/strong>\u2080<strong>\u00a0| A<\/strong>\u2081<strong>XW<\/strong>\u2081<strong>\u00a0| \u2026 | A<\/strong><em>\u1d63<\/em><strong>XW<\/strong><em>\u1d63<\/em>)\u00a0<strong>W<\/strong>\u2019)\u2026\u00a0<strong>W<\/strong>\u2019\u2019)<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-68a9e13 elementor-widget elementor-widget-text-editor\" data-id=\"68a9e13\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Finally, when employing different powers for the same diffusion operator (e.g.\u00a0<strong>A<\/strong>\u2081=<strong>B<\/strong>\u00b9<strong>, A<\/strong>\u2082=<strong>B<\/strong>\u00b2, etc.), the graph operations effectively aggregate from neighbours in further and further hops, akin to having convolutional filters of different receptive fields within the same network layer. This analogy to the popular inception module in classical CNNs explains the name of the proposed architecture [21].<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>As already mentioned, the matrix products\u00a0<strong>A<\/strong>\u2081<strong>X<\/strong>,\u2026,\u00a0<strong>A<\/strong><em>\u1d63<\/em><strong>X\u00a0<\/strong>in the above equations do not depend on the learnable model parameters and can thus be pre-computed. In particular, for very large graphs this pre-computation can be scaled efficiently using distributed computing infrastructures such as Apache Spark. This effectively reduces the computational complexity of the overall model to that of an MLP. 
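This two-stage pipeline can be sketched schematically, under simplifying assumptions (powers of a single row-stochastic operator as the diffusion matrices, random weights, and a one-layer softmax head in place of a full MLP); the function names are illustrative, not the Twitter implementation:

```python
import numpy as np

def precompute_diffused_features(A, X, r):
    """Offline step: compute [X, AX, A^2 X, ..., A^r X] once.
    Nothing here depends on learnable parameters, so for a very large
    graph this step could run in a batch system such as Apache Spark."""
    feats, AX = [X], X
    for _ in range(r):
        AX = A @ AX
        feats.append(AX)
    return feats  # r + 1 feature blocks, each of shape (n, d)

def sign_forward(feats, Ws, W_out):
    """Online step: per-operator linear maps, concatenation, ReLU,
    then a final linear layer with a row-wise softmax. This is just
    an MLP over the precomputed features."""
    H = np.maximum(np.concatenate([F @ W for F, W in zip(feats, Ws)], axis=1), 0.0)
    Z = H @ W_out
    E = np.exp(Z - Z.max(axis=1, keepdims=True))  # numerically stable softmax
    return E / E.sum(axis=1, keepdims=True)

n, d, hidden, classes, r = 5, 3, 4, 2, 2
A = np.full((n, n), 1.0 / n)          # stand-in row-stochastic diffusion operator
X = np.random.randn(n, d)
feats = precompute_diffused_features(A, X, r)
Ws = [np.random.randn(d, hidden) for _ in range(r + 1)]
W_out = np.random.randn((r + 1) * hidden, classes)
Y = sign_forward(feats, Ws, W_out)    # (n, classes); rows sum to 1
```

All graph-dependent work lives in the precomputation; training and inference then touch only dense matrix multiplications, so mini-batches can be formed over arbitrary subsets of rows with no sampling of the graph itself.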
Moreover, by moving the diffusion to the pre-computation step, we can aggregate information from\u00a0<em>all<\/em>\u00a0the neighbours, avoiding sampling and the possible loss of information and bias that comes with it [22].<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The main advantage of SIGN is its scalability and efficiency, as it can be trained using standard mini-batch gradient descent. We found our model to be up to two orders of magnitude faster than ClusterGCN and GraphSAINT at inference time, while also being significantly faster at training time (all this while maintaining accuracy performances very close to that of the state-of-the-art GraphSAINT).<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-da46e03 elementor-widget elementor-widget-image\" data-id=\"da46e03\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/2545\/1*CTKiIwNUmmEn5mxMsYbwGQ.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bbd3355 elementor-widget elementor-widget-image\" data-id=\"bbd3355\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1600\/0*YkrNHieOOLhDWhPo\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-27c6706 elementor-widget elementor-widget-text-editor\" data-id=\"27c6706\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph 
-->\n<p>Moreover, our model supports any diffusion operators. For different types of graphs, different diffusion operators may be necessary, and we found some tasks to benefit from having motif-based operators such as triangle counts.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a9db49b elementor-widget elementor-widget-image\" data-id=\"a9db49b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1600\/0*dUVM_oOF5enY6yXC\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ea07869 elementor-widget elementor-widget-text-editor\" data-id=\"ea07869\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Despite the limitation of having only a single graph convolutional layer and linear diffusion operators, SIGN performs very well in practice, achieving results on par or even better than much more complex models. Given its speed and simplicity of implementation, we envision SIGN to be a simple baseline graph learning method for large-scale applications. Perhaps more importantly, the success of such a simple model leads to a more fundamental question:\u00a0<a href=\"https:\/\/towardsdatascience.com\/do-we-need-deep-graph-neural-networks-be62d3ec5c59\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">do we really need deep graph neural networks<\/a>? 
We conjecture that in many problems of learning on social networks and \u201csmall world\u201d graphs, we should use\u00a0<a href=\"https:\/\/towardsdatascience.com\/beyond-weisfeiler-lehman-using-substructures-for-provably-expressive-graph-neural-networks-d476ad665fa3\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">richer local structures<\/a>\u00a0rather than resort to brute-force deep architectures. Interestingly, traditional CNN architectures evolved according to an opposite trend (deeper networks with smaller filters) because of computational advantages and the ability to compose complex features from simpler ones. We are not sure if the same approach is right for graphs, where compositionality is much more complex (e.g. certain structures cannot be computed by message passing, no matter how deep the network is). For sure, more elaborate experiments are still needed to test this conjecture.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:separator --><hr class=\"wp-block-separator\" \/><!-- \/wp:separator -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ce6b8f9 elementor-widget elementor-widget-text-editor\" data-id=\"ce6b8f9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>[1] The recently introduced\u00a0<a href=\"https:\/\/ogb.stanford.edu\/\" target=\"_blank\" rel=\"noreferrer noopener\">Open Graph Benchmark<\/a>\u00a0now offers large-scale graphs with millions of nodes. It will probably take some time for the community to switch to it.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>[2] T. Kipf and M. Welling,\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/1609.02907.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Semi-supervised classification with graph convolutional networks<\/a>\u00a0(2017). Proc. 
ICLR introduced the popular GCN architecture, which was derived as a simplification of the ChebNet model proposed by M. Defferrard et al.\u00a0<a href=\"https:\/\/papers.nips.cc\/paper\/6081-convolutional-neural-networks-on-graphs-with-fast-localized-spectral-filtering.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Convolutional neural networks on graphs with fast localized spectral filtering<\/a>\u00a0(2016). Proc. NIPS.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>[3] As the diffusion operator, Kipf and Welling used the graph adjacency matrix with self-loops (i.e., the node itself contributes to its feature update), but other choices are possible as well. The diffusion operation can be made feature-dependent of the form\u00a0<strong>A<\/strong>(<strong>X<\/strong>)<strong>X<\/strong>\u00a0(i.e., it is still a linear combination of the node features, but the weights depend on the features themselves) like in MoNet [4] or GAT [5] models, or completely nonlinear,\ud835\udc9c(<strong>X<\/strong>), like in message-passing neural networks (MPNN) [6]. For simplicity, we focus the discussion on the GCN model applied to node-wise classification.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-36e2a62 elementor-widget elementor-widget-text-editor\" data-id=\"36e2a62\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>[4] F. Monti et al.,\u00a0<a href=\"https:\/\/openaccess.thecvf.com\/content_cvpr_2017\/html\/Monti_Geometric_Deep_Learning_CVPR_2017_paper.html\" target=\"_blank\" rel=\"noreferrer noopener\">Geometric Deep Learning on Graphs and Manifolds Using Mixture Model CNNs<\/a>\u00a0(2017). In Proc. CVPR.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>[5] P. 
Veli\u010dkovi\u0107 et al.,\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1710.10903\" target=\"_blank\" rel=\"noreferrer noopener\">Graph Attention Networks<\/a>\u00a0(2018). In Proc. ICLR.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>[6] J. Gilmer et al.,\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1704.01212\" target=\"_blank\" rel=\"noreferrer noopener\">Neural message passing for quantum chemistry<\/a>\u00a0(2017). In Proc. ICML.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>[7] Here we assume for simplicity that the graph is sparse with the number of edges |<a href=\"https:\/\/en.wikipedia.org\/wiki\/%E2%84%B0\" target=\"_blank\" rel=\"noreferrer noopener\"><em>\u2130<\/em><\/a>|=\ud835\udcaa(<em>n<\/em>).<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>[8] W. Hamilton et al.,\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1706.02216\" target=\"_blank\" rel=\"noreferrer noopener\">Inductive Representation Learning on Large Graphs<\/a>\u00a0(2017). In Proc. NeurIPS.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>[9] The number of neighbours in such graphs tends to grow exponentially with the neighbourhood expansion.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>[10] Sampling with replacement means that some neighbour nodes can appear more than once, in particular if the number of neighbours is smaller than\u00a0<em>k<\/em>.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>[11] W.-L. Chiang et al.,\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1905.07953\" target=\"_blank\" rel=\"noreferrer noopener\">Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks<\/a>\u00a0(2019). In Proc. KDD.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>[12] H. Zeng et al.,\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1907.04931\" target=\"_blank\" rel=\"noreferrer noopener\">GraphSAINT: Graph Sampling Based Inductive Learning Method<\/a>\u00a0(2020) In Proc. 
ICLR.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>[13] Y. Rong et al.\u00a0<a href=\"https:\/\/openreview.net\/pdf?id=Hkx1qkrKPr\" target=\"_blank\" rel=\"noreferrer noopener\">DropEdge: Towards deep graph convolutional networks on node classification<\/a>\u00a0(2020). In Proc. ICLR. An idea similar to DropOut where a random subset of edges is used during training.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>[14] U. Alon and E. Yahav,\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/2006.05205.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">On the bottleneck of graph neural networks and its practical implications<\/a>\u00a0(2020). arXiv:2006.05205. Identified the over-squashing phenomenon in graph neural networks, which is similar to one observed in sequential recurrent models.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>[15] Frasca et al.,\u00a0<a href=\"https:\/\/grlplus.github.io\/papers\/77.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">SIGN: Scalable Inception Graph Neural Networks<\/a>\u00a0(2020). ICML workshop on Graph Representation Learning and Beyond.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>[16] O. Shchur et al.\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/1811.05868.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Pitfalls of graph neural network evaluation<\/a>\u00a0(2018). Workshop on Relational Representation Learning. Shows that simple GNN models perform on par with more complex ones.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>[17] F. Wu et al.,\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1902.07153\" target=\"_blank\" rel=\"noreferrer noopener\">Simplifying graph neural networks<\/a>\u00a0(2019). In Proc. ICML.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>[18] While we stress that SIGN does not need sampling for computational efficiency, there are other reasons why graph subsampling is useful. J. 
Klicpera et al.,\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/1911.05485.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Diffusion improves graph learning<\/a>\u00a0(2020). In Proc. NeurIPS. Shows that sampled diffusion matrices improve the performance of graph neural networks. We observed the same phenomenon in early SIGN experiments.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>[19] G. Bouritsas et al.,\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2006.09252\" target=\"_blank\" rel=\"noreferrer noopener\">Improving graph neural network expressivity via subgraph isomorphism counting<\/a>\u00a0(2020). arXiv:2006.09252. Shows how provably powerful GNNs can be obtained by structural node encoding.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>[20] F. Monti, K. Otness, M. M. Bronstein,\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/1802.01572\" target=\"_blank\" rel=\"noreferrer noopener\">MotifNet: a motif-based graph convolutional network for directed graphs<\/a>\u00a0(2018). arXiv:1802.01572. Uses motif-based diffusion operators.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0a1953e elementor-widget elementor-widget-text-editor\" data-id=\"0a1953e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>[21] C. Szegedy et al.,\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1409.4842\" target=\"_blank\" rel=\"noreferrer noopener\">Going deeper with convolutions<\/a>\u00a0(2015). In Proc. CVPR. Proposed the inception module in the now-classical GoogLeNet architecture. To be fair, we were not the first to think of graph inception modules. 
Our collaborator Anees Kazi from TU Munich, who was a visiting student at Imperial College last year, introduced them first.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>[22] Note that reaching higher-order neighbours is normally achieved by depth-wise stacking graph convolutional layers operating on direct neighbours; in our architecture this is achieved directly in the first layer by powers of graph operators.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>One of the challenges that have so far precluded the wide adoption of graph neural networks in industrial applications is the difficulty of scaling them to large graphs. This post describes a simple graph neural network architecture that can work on very large graphs.<\/p>\n","protected":false},"author":874,"featured_media":9373,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[560,561],"ppma_author":[3686],"class_list":["post-9372","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-gnn","tag-graph-neural-networks"],"authors":[{"term_id":3686,"user_id":874,"is_guest":0,"slug":"michael-bronstein","display_name":"Michael Bronstein","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/08\/Michael-Bronstein-150x150.jpg","user_url":"https:\/\/www.imperial.ac.uk\/people\/m.bronstein","last_name":"Bronstein","first_name":"Michael","job_title":"","description":"Michael Bronstein is Professor, Chair in Machine Learning and Pattern Recognition at Imperial College, London, besides Head of Graph ML at Twitter \/ ML Lead at ProjectCETI\/ ex Founder &amp; Chief Scientist at Fabula_ai\/ ex at Intel #AI #ML #graphs. 
His main expertise is in theoretical and computational geometric methods for data analysis, and his research encompasses a broad spectrum of applications ranging from machine learning, computer vision, and pattern recognition to geometry processing, computer graphics, and imaging. He has authored over 150 papers, the book Numerical geometry of non-rigid shapes (Springer 2008), and holds over 30 granted patents."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/9372","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/874"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=9372"}],"version-history":[{"count":6,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/9372\/revisions"}],"predecessor-version":[{"id":34153,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/9372\/revisions\/34153"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/9373"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=9372"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=9372"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=9372"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=9372"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}