{"id":985,"date":"2018-11-16T01:42:21","date_gmt":"2018-11-15T22:42:21","guid":{"rendered":"http:\/\/kusuaks7\/?p=590"},"modified":"2021-05-11T14:01:29","modified_gmt":"2021-05-11T14:01:29","slug":"the-curse-of-dimensionality","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/the-curse-of-dimensionality\/","title":{"rendered":"The Curse of Dimensionality"},"content":{"rendered":"<p>Many machine learning practitioners have come to a crossroads where they must choose the number of features to include in their model. This decision depends on many factors: the nature of your data, whether each feature actually contributes information (or is merely some kind of ID), whether features are multi-collinear, and how many samples you have. Machine learning practitioners often work with high-dimensional vectors to build a model. Since dimensions higher than k=3 are impossible to visualize, it is often hard to know whether our model construction is right or wrong. In this post, I will illustrate the Curse of Dimensionality, a powerful concept whereby the volume of the vector space grows exponentially with the dimension, causing our data to become highly sparse and distant from one another, and discuss its implications for the work of data science.<\/p>\n<h3 id=\"95f2\" name=\"95f2\">High Dimensional Spaces<\/h3>\n<p id=\"4a7b\" name=\"4a7b\">Let&rsquo;s illustrate the vastness of a high dimensional space by considering a hypersphere inscribed in a hypercube. 
In two or three dimensions, they look like this:<\/p>\n<figure id=\"42f7\" name=\"42f7\">\n<p><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*kpqxmqs9PPoHgPLTR-aAtg.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*kpqxmqs9PPoHgPLTR-aAtg.png\" style=\"width: 700px; height: 345px;\" \/><\/p>\n<\/figure>\n<p id=\"37ca\" name=\"37ca\">Although we cannot visualize higher dimensions, the concept of a hypersphere inscribed in a hypercube generalizes to any number of dimensions N.<\/p>\n<p id=\"ca51\" name=\"ca51\">Now let&rsquo;s consider a hypercube of side s=2r, where r is the radius of the hypersphere. To show how fast the space grows, I will consider the ratio of the hypersphere&rsquo;s volume to the hypercube&rsquo;s volume. If the hypersphere takes up an ever-larger share of the volume as the dimension grows, this ratio should converge to 1. Conversely, if the hypersphere takes up an ever-smaller share as the dimension grows, the ratio should converge to 0. (Spoiler: As explained&nbsp;<a data-href=\"https:\/\/en.wikipedia.org\/wiki\/Curse_of_dimensionality#Distance_functions\" href=\"https:\/\/en.wikipedia.org\/wiki\/Curse_of_dimensionality#Distance_functions\" rel=\"noopener noreferrer\" target=\"_blank\">here<\/a>, the analytic solution shows that the ratio converges to 0.)<\/p>\n<h3 id=\"c9a6\" name=\"c9a6\">Monte Carlo Simulation<\/h3>\n<p id=\"38e9\" name=\"38e9\">Here, we will run a Monte Carlo simulation to show that this is true. First we sample points uniformly at random from the k-dimensional hypercube. Then we calculate each point&rsquo;s Euclidean distance from the origin: if the distance is less than r, the point lies inside the hypersphere of radius r; otherwise it lies outside. The fraction of points that land inside estimates the volume ratio. 
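The procedure just described can be sketched in a few lines of NumPy, together with the standard closed-form ratio &pi;^(k\/2) \/ (2^k &Gamma;(k\/2 + 1)) for comparison. This is a minimal illustrative sketch; the function and parameter names are my own and are not taken from the repository linked below.

```python
import numpy as np
from math import pi, gamma

def sphere_volume_fraction(k, n_samples=200_000, r=1.0, seed=0):
    """Monte Carlo estimate of the hypersphere-to-hypercube volume ratio in k dimensions."""
    rng = np.random.default_rng(seed)
    # Sample points uniformly from the hypercube [-r, r]^k.
    points = rng.uniform(-r, r, size=(n_samples, k))
    # A point lies inside the hypersphere if its Euclidean distance
    # from the origin is less than r.
    inside = np.linalg.norm(points, axis=1) < r
    return inside.mean()

def analytic_fraction(k):
    # Volume of the k-ball of radius r divided by the volume of the
    # enclosing hypercube of side 2r; the r^k factors cancel.
    return pi ** (k / 2) / (2 ** k * gamma(k / 2 + 1))

for k in (1, 2, 3, 10, 17):
    mc = sphere_volume_fraction(k)
    print(f"k = {k:2d}: Monte Carlo = {mc:.6f}, analytic = {analytic_fraction(k):.6f}")
```

For k = 2 the estimate should land near &pi;\/4 &asymp; 0.785 and for k = 3 near &pi;\/6 &asymp; 0.524, matching the statistics reported below, while by k &asymp; 17 essentially no samples fall inside the sphere.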
You can find the code for this experiment on my&nbsp;<a data-href=\"https:\/\/github.com\/taninaim\/curse_of_dimensionality\" href=\"https:\/\/github.com\/taninaim\/curse_of_dimensionality\" rel=\"noopener noreferrer\" target=\"_blank\">GitHub<\/a>. The result is shown below:<\/p>\n<figure id=\"23c0\" name=\"23c0\">\n<p><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*3uJ4uzHJFWJnuKIOQRDHnQ.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*3uJ4uzHJFWJnuKIOQRDHnQ.png\" style=\"width: 700px; height: 529px;\" \/><\/p>\n<\/figure>\n<p id=\"ccdf\" name=\"ccdf\">In the plot above, the x-axis is the number of dimensions and the y-axis is the ratio of the hypersphere&rsquo;s volume to the hypercube&rsquo;s volume. A run with 1 million samples from a k-dimensional uniform distribution, where k ranges from 1 to 100, gives the following statistics.<\/p>\n<div id=\"fe8f\" name=\"fe8f\"><span style=\"font-family:courier new,courier,monospace;\">k = 1<br \/>\nVolume of hypercube = 2<br \/>\nFraction of points inside hypersphere = 1.0<br \/>\n&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8211;<br \/>\nk = 2<br \/>\nVolume of hypercube = 4<br \/>\nFraction of points inside hypersphere = 0.784976<br \/>\n&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8211;<br \/>\nk = 3<br \/>\nVolume of hypercube = 8<br \/>\nFraction of points inside hypersphere = 0.52376<br \/>\n&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8211;<\/span><\/div>\n<div name=\"4a80\"><span style=\"font-family:courier new,courier,monospace;\">&#8230;<\/span><\/div>\n<div name=\"d1d4\"><span style=\"font-family:courier new,courier,monospace;\">k = 17<br \/>\nVolume of hypercube = 131072<br \/>\nFraction of points inside hypersphere = 1e-06<br 
\/>\n&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8211;<br \/>\nk = 18<br \/>\nVolume of hypercube = 262144<br \/>\nFraction of points inside hypersphere = 0.0<br \/>\n&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8211;<br \/>\nk = 19<br \/>\nVolume of hypercube = 524288<br \/>\nFraction of points inside hypersphere = 0.0<br \/>\n&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8211;<\/span><\/div>\n<h3 id=\"ce71\" name=\"ce71\">Discussion<\/h3>\n<p id=\"539c\" name=\"539c\">In the experiment above, notice that the fraction starts at 1. This makes intuitive sense: in one dimension, the hypercube and the hypersphere are the same interval of length 2r, so every sampled point falls inside. From there, the fraction decreases exponentially and converges to 0. This has a powerful implication for data scientists: if you increase your data&rsquo;s dimensionality, you need exponentially more data to compensate for the vastness of the space (i.e. the sparseness of the data), and too little data in too many dimensions may lead to overfitting. Therefore, the next time you consider adding more features to your data or model, weigh the problems that can arise when the dimensionality becomes too large. Finally, I encourage you to play around with the code and change some parameters to see this effect for yourself!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Many machine learning practitioners have come to a crossroads where they must choose the number of features to include in their model. This decision depends on many factors. Machine learning practitioners often work with high-dimensional vectors to build a model. 
Here is an illustration of the Curse of Dimensionality, a powerful concept where the volume of the vector space grows exponentially with the dimension, causing our data to be highly sparse and distant from one another.<\/p>\n","protected":false},"author":341,"featured_media":3491,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[94],"ppma_author":[2064],"class_list":["post-985","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":2064,"user_id":341,"is_guest":0,"slug":"tanin-phongpandecha","display_name":"Tanin Phongpandecha","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Phongpandecha","first_name":"Tanin","job_title":"","description":"Tanin Phongpandecha is Chief Data Officer at <a href=\"https:\/\/datawow.io\/\">Data Wow Co., Ltd<\/a>. 
He has a strong background in economics, statistics, and computer science, and work experience in the financial and information services industries.&nbsp;"}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/985","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/341"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=985"}],"version-history":[{"count":1,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/985\/revisions"}],"predecessor-version":[{"id":6243,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/985\/revisions\/6243"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/3491"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=985"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=985"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=985"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=985"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}