{"id":2253,"date":"2020-02-12T04:24:21","date_gmt":"2020-02-12T04:24:21","guid":{"rendered":"http:\/\/kusuaks7\/?p=1858"},"modified":"2024-01-11T07:06:56","modified_gmt":"2024-01-11T07:06:56","slug":"the-bootstrap-the-swiss-army-knife-of-any-data-scientist","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/the-bootstrap-the-swiss-army-knife-of-any-data-scientist\/","title":{"rendered":"The bootstrap. The Swiss army knife of any data scientist"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"2253\" class=\"elementor elementor-2253\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-32a01ffe elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"32a01ffe\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-7e626522\" data-id=\"7e626522\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-e7947d0 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"e7947d0\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-2c1dc67\" data-id=\"2c1dc67\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div 
class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-aca60bd elementor-widget elementor-widget-text-editor\" data-id=\"aca60bd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<section>\n<p id=\"f160\" data-selectable-paragraph=\"\">Every measure must be followed by an\u00a0<strong>error estimate<\/strong>. There\u2019s no chance to avoid this. If I tell you \u201cI\u2019m 1,93 metres tall\u201d, I\u2019m not giving you any information about the\u00a0<strong>precision\u00a0<\/strong>of this measure. You could think that my precision is on the second decimal digit, but you can\u2019t be sure.<\/p>\n<p id=\"1dc7\" data-selectable-paragraph=\"\">So, what we really need is some way to assess the precision of our measure starting from the data sample we have.<\/p>\n<p id=\"2754\" data-selectable-paragraph=\"\">If our observable is the mean value calculated over a sample, a simple precision estimate is given by the\u00a0<strong>standard error<\/strong>. But what can we do if we are measuring something that is not the mean value? 
That\u2019s where the bootstrap comes in.<\/p>\n\n<\/section>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-48c09c4 elementor-widget elementor-widget-heading\" data-id=\"48c09c4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"d98a\" data-selectable-paragraph=\"\">Bootstrap in a nutshell<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6616430 elementor-widget elementor-widget-text-editor\" data-id=\"6616430\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"bbca\" data-selectable-paragraph=\"\"><strong>Bootstrap<\/strong>\u00a0is a technique for estimating the confidence intervals and\/or the\u00a0<strong>standard error\u00a0<\/strong>of an observable that can be calculated on a sample.<\/p>\n<p id=\"e5d0\" data-selectable-paragraph=\"\">It relies on the concept of\u00a0<strong>resampling<\/strong>, which is a procedure that, starting from a data sample,\u00a0<strong>simulates a new sample\u00a0<\/strong>of the same size by drawing from the original values with\u00a0<strong>replacement<\/strong>. 
Each value is drawn with the same probability as the others (that is,\u00a0<em>1\/N<\/em>).<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c6902fd elementor-widget elementor-widget-text-editor\" data-id=\"c6902fd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"839d\" data-selectable-paragraph=\"\">So, for example, if our sample is made of the four values {1,2,3,4}, some possible resamples are {1,4,1,3}, {1,2,4,4}, {1,4,3,2} and so on. As you can see, the values all come from the original sample, they\u00a0<strong>can be repeated<\/strong>\u00a0and the resamples have the\u00a0<strong>same size<\/strong>\u00a0as the original one.<\/p>\n<p id=\"677b\" data-selectable-paragraph=\"\">Now, let\u2019s say we have a sample of\u00a0<em>N<\/em>\u00a0values and want to calculate some observable\u00a0<em>O<\/em>, that is, a function that takes all the\u00a0<em>N<\/em>\u00a0values as input and returns one real number. 
Examples include the average value, the standard deviation or even more complex functions such as the quantiles, the Sharpe ratio and so on.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7d8051b elementor-widget elementor-widget-text-editor\" data-id=\"7d8051b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"5162\" data-selectable-paragraph=\"\">Our goal is to calculate the expected value of this observable and, for example, its\u00a0<strong>standard error<\/strong>.<\/p>\n<p id=\"1864\" data-selectable-paragraph=\"\"><span style=\"color: #00ff00;\"><mark><span style=\"background-color: #e6e6fa;\">To accomplish this goal, we can run the resampling procedure\u00a0<\/span><\/mark><mark><em><span style=\"background-color: #e6e6fa;\">M<\/span><\/em><\/mark><mark><span style=\"background-color: #e6e6fa;\">\u00a0times and, for each resample, we can calculate our observable. 
At the end of the process, we have\u00a0<\/span><\/mark><mark><em><span style=\"background-color: #e6e6fa;\">M\u00a0<\/span><\/em><\/mark><mark><span style=\"background-color: #e6e6fa;\">values of\u00a0<\/span><\/mark><mark><em><span style=\"background-color: #e6e6fa;\">O<\/span><\/em><\/mark><mark><span style=\"background-color: #e6e6fa;\">, which can be used to calculate the expected value, the standard error, the confidence intervals and so on.<\/span><\/mark><\/span><\/p>\n<p id=\"f1cf\" data-selectable-paragraph=\"\">This simple procedure is the\u00a0<strong>bootstrap process<\/strong>.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-adf3983 elementor-widget elementor-widget-heading\" data-id=\"adf3983\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\">\n<h1 id=\"83ee\" data-selectable-paragraph=\"\">An example in R<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c4b0065 elementor-widget elementor-widget-text-editor\" data-id=\"c4b0065\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"f72c\" data-selectable-paragraph=\"\">R has a wonderful library called\u00a0<em>bootstrap\u00a0<\/em>that performs all the calculations for us.<\/p>\n<p id=\"649c\" data-selectable-paragraph=\"\">Let\u2019s say we want to measure the\u00a0<strong>skewness\u00a0<\/strong>of the\u00a0<strong>sepal length<\/strong>\u00a0in the famous\u00a0<strong>iris dataset<\/strong>\u00a0and let\u2019s say we want to know its expected value, its standard error and its 5th and 95th percentiles.<\/p>\n<p id=\"5c8f\" data-selectable-paragraph=\"\">Here follows the R code that makes this 
possible:<\/p>\n<p id=\"3288\" data-selectable-paragraph=\"\">The mean value is 0.29, the standard deviation is 0.13, the 5th percentile is 0.10 and the 95th percentile is 0.49.<\/p>\n<p id=\"c34b\" data-selectable-paragraph=\"\">These numbers allow us to say that the true skewness value of the dataset is 0.29 +\/- 0.13 and that it lies between 0.10 and 0.49 with 90% confidence.<\/p>\n<p id=\"ba3f\" data-selectable-paragraph=\"\">Surprising, isn\u2019t it?<\/p>\n<p id=\"dc87\" data-selectable-paragraph=\"\">This simple procedure is universal and can be used with\u00a0<strong>any kind of observable<\/strong>.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1dba84e elementor-widget elementor-widget-heading\" data-id=\"1dba84e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"13b4\" data-selectable-paragraph=\"\">The larger the dataset, the higher the precision<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-de7a2e0 elementor-widget elementor-widget-text-editor\" data-id=\"de7a2e0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"dddc\" data-selectable-paragraph=\"\">One might think that the larger our dataset, the smaller the standard error we get, due to the\u00a0<strong>law of large numbers<\/strong>.<\/p>\n<p id=\"db88\" data-selectable-paragraph=\"\">This is indeed true, and we can check it with a simulation.<\/p>\n<p id=\"ac1b\" data-selectable-paragraph=\"\">Let\u2019s simulate\u00a0<em>N<\/em>\u00a0uniformly distributed random numbers and let\u2019s calculate the skewness over them. 
What we expect is that the\u00a0<strong>standard deviation<\/strong>\u00a0of the bootstrapped skewness\u00a0<strong>decreases<\/strong>\u00a0as\u00a0<em>N<\/em>\u00a0increases.<\/p>\n<p id=\"6e72\" data-selectable-paragraph=\"\">The following R code performs this calculation and, (not so) surprisingly, the bootstrap standard deviation decreases as a\u00a0<strong>power law<\/strong>\u00a0with an exponent equal to -1\/2.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ac3e535 elementor-widget elementor-widget-image\" data-id=\"ac3e535\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1366\/1*wvt7Fu0Tf1ZYmTh0RVVjqQ.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ef14831 elementor-widget elementor-widget-text-editor\" data-id=\"ef14831\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"a937\" data-selectable-paragraph=\"\">The slope of the double-log linear regression is the\u00a0<strong>exponent<\/strong>\u00a0of the power law and it\u2019s equal to\u00a0<strong>-0.5<\/strong>.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e3aa7d6 elementor-widget elementor-widget-heading\" data-id=\"e3aa7d6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"e07f\" data-selectable-paragraph=\"\">Conclusions<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element 
elementor-element-88f1ef0 elementor-widget elementor-widget-text-editor\" data-id=\"88f1ef0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"331c\" data-selectable-paragraph=\"\">Bootstrap is a method to extract as much\u00a0<strong>information\u00a0<\/strong>as possible from a finite-size dataset and it lets us calculate the expected value of any observable and its precision (i.e. its standard deviation or confidence intervals).<\/p>\n<p id=\"0d1a\" data-selectable-paragraph=\"\">It\u2019s really useful when we need to calculate an error estimate for some scientific measure and it can easily be generalized for\u00a0<strong>multivariate observables<\/strong>.<\/p>\n<p id=\"7700\" data-selectable-paragraph=\"\">No data scientist should forget this powerful tool, which has several useful applications even in\u00a0<strong>machine learning<\/strong>\u00a0(e.g. in Random Forest classification\/regression models).<\/p>\n\n<\/section>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Bootstrap is a method to extract as much&nbsp;information&nbsp;as possible from a finite-size dataset and it lets us calculate the expected value of any observable and its precision. It&rsquo;s really useful when we need to calculate an error estimate for some scientific measure and it can easily be generalized for&nbsp;multivariate observables. 
Any data scientist should not forget to use this powerful tool, which has shown several useful applications even in&nbsp;machine learning.&nbsp;<\/p>\n","protected":false},"author":618,"featured_media":3655,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[94],"ppma_author":[3328],"class_list":["post-2253","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":3328,"user_id":618,"is_guest":0,"slug":"gianluca-malato","display_name":"Gianluca Malato","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/04\/medium_918623b2-8f36-4110-8343-6fc9228595dd-150x150.jpg","user_url":"http:\/\/www.gianlucamalato.it\/","last_name":"Malato","first_name":"Gianluca","job_title":"","description":"Gianluca Malato is Data Scientist at Poste Italiane SPA.\u00a0 He is also a fiction author and software developer, Editor of\u00a0<a href=\"https:\/\/medium.com\/data-science-journal?source=follow_footer--------------------------follow_footer-\">Data Science Journal<\/a>,\u00a0<a href=\"https:\/\/medium.com\/the-trading-scientist?source=follow_footer--------------------------follow_footer-\">The Trading Scientist<\/a>, and\u00a0<a href=\"https:\/\/medium.com\/the-writers-notebook?source=follow_footer--------------------------follow_footer-\">The Writer\u2019s Notebook<\/a>. 
His books are available on <a href=\"https:\/\/www.amazon.com\/Gianluca-Malato\/e\/B076CHTG3W?ref=dbs_a_mng_rwt_scns_share\">Amazon<\/a>."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2253","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/618"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=2253"}],"version-history":[{"count":6,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2253\/revisions"}],"predecessor-version":[{"id":35470,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2253\/revisions\/35470"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/3655"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=2253"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=2253"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=2253"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=2253"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}