{"id":22702,"date":"2021-03-23T08:14:00","date_gmt":"2021-03-23T08:14:00","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/deeper-neural-networks-lead-to-simpler-embeddings\/"},"modified":"2023-08-29T11:23:33","modified_gmt":"2023-08-29T11:23:33","slug":"deeper-neural-networks-lead-to-simpler-embeddings","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/deeper-neural-networks-lead-to-simpler-embeddings\/","title":{"rendered":"Deeper Neural Networks Lead To Simpler Embeddings"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"22702\" class=\"elementor elementor-22702\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-f173691 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"f173691\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-9e3bf9b\" data-id=\"9e3bf9b\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b7d4b90 elementor-widget elementor-widget-text-editor\" data-id=\"b7d4b90\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"has-medium-font-size\">A surprising explanation for generalization in neural networks<\/p>\n<p id=\"b5a0\">Recent research is increasingly investigating how neural networks, being as hyper-parametrized as they are, generalize. That is, according to traditional statistics, the more parameters, the more the model overfits. This notion is directly contradicted by a fundamental axiom of deep learning:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-67881cb elementor-widget elementor-widget-text-editor\" data-id=\"67881cb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<blockquote class=\"wp-block-quote\"><p>Increased parametrization improves generalization.<\/p><\/blockquote>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d1b0eb3 elementor-widget elementor-widget-text-editor\" data-id=\"d1b0eb3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Although it may not be explicitly stated anywhere, it\u2019s the intuition behind why researchers continue to push models larger to make them more powerful.<\/p><p>There have been many efforts to explain exactly why this is so. 
Most are quite interesting; the recently proposed [Lottery Ticket Hypothesis](https://arxiv.org/pdf/1803.03635v1.pdf) states that neural networks are essentially giant lotteries searching for the best subnetwork, and [another paper](https://proceedings.neurips.cc/paper/2019/file/c4ef9c39b300931b69a36fb3dbb8d60e-Paper.pdf) argues, with theoretical proofs, that such phenomena are built into the nature of deep learning.

Perhaps one of the most intriguing, though, is the proposal that *deeper neural networks lead to simpler embeddings*. This is also known as the "simplicity bias": neural network parameters are biased towards simpler mappings.

Minyoung Huh et al. propose in a recent paper, ["The Low-Rank Simplicity Bias in Deep Networks"](https://arxiv.org/pdf/2103.10427.pdf), that depth increases the proportion of simple solutions in the parameter space. This makes neural networks more likely, simply by chance, to find simple solutions rather than complex ones.

On terminology: the authors measure the "simplicity" of a matrix by its rank, roughly a measure of how many linearly independent rows or columns it contains. A higher rank can be considered more complex, since the matrix's parts are largely independent of one another and thus carry more "independent information"; a lower rank can be considered simpler.

Huh et al. begin by analyzing the rank of *linear* networks, that is, networks without any nonlinearities such as activation functions.

The authors trained several linear networks of different depths on the MNIST dataset. For each network, they randomly drew 128 sets of network weights and the associated kernels, plotting their ranks. As the depth increases, the rank of the network's parameters decreases.
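This trend is easy to reproduce in a few lines. Below is a minimal sketch (not the paper's exact protocol): it builds the end-to-end map of a randomly initialized *linear* network of increasing depth by multiplying Gaussian weight matrices, and scores each map with an entropy-based "effective rank" of its singular values, one common way such soft rank measures are defined.

```python
import numpy as np

def effective_rank(M):
    """Entropy-based effective rank: the exponentiated entropy of the
    normalized singular-value distribution. Close to n for a well-spread
    spectrum, close to 1 for a spectrum dominated by one direction."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))

rng = np.random.default_rng(0)
n = 256
for depth in (1, 2, 4, 8, 16):
    # End-to-end map of a depth-`depth` linear network at random init:
    # just the product of `depth` scaled Gaussian weight matrices.
    W = np.eye(n)
    for _ in range(depth):
        W = (rng.standard_normal((n, n)) / np.sqrt(n)) @ W
    print(f"depth {depth:2d}: effective rank ~ {effective_rank(W):.1f}")
```

On a typical run the printed effective rank drops steadily as depth grows, even though each individual factor is, numerically, full rank.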
[Figure. Source: Huh et al. (https://arxiv.org/pdf/2103.10427.pdf)]

This decrease in rank follows from the fact that the rank of a product of two matrices can only be less than or equal to the rank of each of its factors. Put a little more abstractly: each matrix carries its own independent information, but when two matrices are multiplied together, the information in one can only get muddled and entangled with the information in the other.

> rank(AB) ≤ min(rank(A), rank(B))
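The inequality itself is easy to check numerically; here is a tiny illustration (the specific shapes and ranks are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
# Construct matrices with known deficient ranks by multiplying thin factors.
A = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 50))    # rank 5
B = rng.standard_normal((50, 20)) @ rng.standard_normal((20, 50))  # rank 20
print(np.linalg.matrix_rank(A))      # 5
print(np.linalg.matrix_rank(B))      # 20
print(np.linalg.matrix_rank(A @ B))  # <= min(5, 20) = 5
```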
What is more interesting, though, is that the same applies to *nonlinear* networks. When nonlinear activation functions such as tanh or ReLU are applied, the pattern repeats: higher depth, lower rank.

[Figure: Effective Rank Of Kernels. Source: Huh et al. (https://arxiv.org/pdf/2103.10427.pdf)]

The authors also performed hierarchical clustering of the kernels at different network depths, for the two nonlinearities (ReLU and tanh). As the depth increases, block structures emerge, reflecting decreasing rank. The "independent information" each kernel carries decreases with depth.
[Figure. Source: Huh et al. (https://arxiv.org/pdf/2103.10427.pdf)]

Hence, although it may seem odd, *over-parametrizing a network acts as implicit regularization*. This is especially true of linear over-parametrization: a model's generalization can be improved simply by adding extra linear layers.

In fact, the authors find that on the CIFAR-10 and CIFAR-100 datasets, linearly expanding the network increases accuracy by 2.2% and 6.5%, respectively, over a simple baseline CNN. On ImageNet, linear over-parametrization increases AlexNet accuracy by 1.8%, ResNet10 accuracy by 0.9%, and ResNet18 accuracy by 0.4%.

This linear over-parametrization, which adds no real *learning capacity* to the network, only more linear transformations, performs even better than explicit regularizers such as penalties. Moreover, this implicit regularization *does not change the objective being minimized*.
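Concretely, linear over-parametrization amounts to replacing a single linear layer with a chain of linear layers with no activation in between. The sketch below is a minimal PyTorch illustration; the helper name, intermediate widths, and bias handling are assumptions for the example, not the exact construction used by Huh et al.

```python
import torch.nn as nn

def expand_linear(in_features, out_features, depth, width=None):
    """Replace a single Linear(in_features, out_features) with `depth` stacked
    Linear layers and no nonlinearity between them. The composition is still
    one linear map, so expressive capacity is unchanged; only the
    parametrization (and hence the training dynamics) changes."""
    width = width or max(in_features, out_features)
    dims = [in_features] + [width] * (depth - 1) + [out_features]
    layers = [nn.Linear(dims[i], dims[i + 1], bias=(i == depth - 1))
              for i in range(depth)]
    return nn.Sequential(*layers)

# Example: over-parametrize the classifier head of a small CNN.
baseline_head = nn.Linear(512, 10)                # ordinary head
expanded_head = expand_linear(512, 10, depth=3)   # 512 -> 512 -> 512 -> 10, all linear
```

Because the stacked layers compose into a single matrix, they could in principle be collapsed again after training; the reported benefit comes from optimizing in the expanded parametrization.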
Perhaps what is most satisfying about this contribution is that it agrees with Occam's razor, a principle many papers have questioned or outright rejected as relevant to deep learning.

> The simplest solution is usually the right one.
> – Occam's razor

Indeed, much prior discourse has taken more parameters to mean more complex, leading many to assert that Occam's razor, while relevant in the regime of classical statistics, does not apply to over-parametrized spaces.

This paper's fascinating contribution argues instead that simpler solutions are in fact better, and that successful, highly parametrized neural networks arrive at those simpler solutions because of, not despite, their parametrization.

Still, as the authors themselves term it, this is a "conjecture": it needs refinement and further investigation. But it is a well-founded, and certainly interesting, one, perhaps with the potential to shift the conversation on how we think about generalization in deep learning.