{"id":1881,"date":"2019-08-13T02:30:27","date_gmt":"2019-08-12T23:30:27","guid":{"rendered":"http:\/\/kusuaks7\/?p=1486"},"modified":"2024-05-03T10:01:09","modified_gmt":"2024-05-03T10:01:09","slug":"how-to-correctly-select-a-sample-from-a-huge-dataset-in-machine-learning","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/how-to-correctly-select-a-sample-from-a-huge-dataset-in-machine-learning\/","title":{"rendered":"How to correctly select a sample from a huge dataset in machine learning"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"1881\" class=\"elementor elementor-1881\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-31d0a453 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"31d0a453\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-55105300\" data-id=\"55105300\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-da007d9 elementor-widget elementor-widget-heading\" data-id=\"da007d9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 style=\"color: #aaa;font-style: italic\">Choosing a small, representative dataset from a large population can improve model training reliability<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-1ab0cf2 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"1ab0cf2\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-27381d9\" data-id=\"27381d9\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-76eb7aa elementor-widget elementor-widget-text-editor\" data-id=\"76eb7aa\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn machine learning, we often need to train a model with a\u00a0<strong>very larg<\/strong>e dataset of thousands or even millions of records. The higher the size of a dataset, the higher its\u00a0<strong>statistical significance<\/strong>\u00a0and the information it carries, but we rarely ask ourselves: is such a huge dataset\u00a0<strong>really useful<\/strong>? Or we could reach a satisfying result with a smaller, much more manageable one? 
Selecting a reasonably small dataset that carries the right amount of information can save us time and money.

Let's run a simple thought experiment. Imagine that we are in a library and want to learn Dante Alighieri's *Divina Commedia* word by word.

We have two options:

1. Grab the first edition we find and start studying from it
2. Grab as many editions as possible and study from all of them

The right answer is clear. Why would we study from several books when just one of them is enough?

Machine learning works the same way. We have a model that learns something and, exactly like us, it needs time. What we want is the minimum amount of information required to learn the phenomenon properly, without wasting time. Information redundancy carries no business value for us.

But how can we be sure that our edition isn't corrupted or incomplete? We must perform some kind of high-level comparison with the population made up of the other editions. For example, we could check the number of *canti* and *cantiche*.
If our book has three *cantiche* of roughly 33 *canti* each, it is probably complete and we can safely learn from it.

What we are doing is learning from a sample (the single *Divina Commedia* edition) and checking its statistical significance (the macro comparison with the other books).

The exact same concept applies in machine learning. Instead of learning from a huge population of records, we can draw a sub-sample from it while keeping all its statistics intact.

## Statistical framework

In order to work with a small, easy-to-handle dataset, we must be sure we don't lose statistical significance with respect to the population. A dataset that is too small won't carry enough information to learn from, while one that is too large can be time-consuming to analyze. So how do we choose the right compromise between size and information?

Statistically speaking, we want our sample to preserve the probability distribution of the population at a reasonable significance level. In other words, if we look at the histogram of the sample, it must look like the histogram of the population.

There are many ways to accomplish this goal.
The simplest approach is to take a uniformly random sub-sample and check whether it is significant. If it is reasonably significant, we keep it. If it is not, we take another sample and repeat the procedure until we reach a good significance level.

## Multivariate vs. multiple univariate

If we have a dataset made of *N* variables, it can be binned into an *N*-variate histogram, and so can every sub-sample we take from it.

This operation, although academically correct, can be really difficult to perform in practice, especially if our dataset mixes numerical and categorical variables.

That's why I prefer a simpler approach, which usually introduces an acceptable approximation. What we are going to do is consider each variable independently from the others.
If each of the univariate histograms of the sample columns is comparable with the corresponding histogram of the population columns, we can assume that the sample is not biased.

The comparison between sample and population is then made this way:

1. Take one variable from the sample
2. Compare its probability distribution with the probability distribution of the same variable in the population
3. Repeat for all the variables

Some of you may think that we are forgetting the correlation between variables. It's not completely true, in my opinion, if we select our sample uniformly. It's widely known that a uniformly selected sub-sample will reproduce, as its size grows, the probability distribution of the original population. Powerful resampling methods like the bootstrap are built around this concept (see my [previous article](https://medium.com/data-science-journal/the-bootstrap-the-swiss-army-knife-of-any-data-scientist-acd6e592be13) for more information).

## Comparing sample and population

As I said before, for each variable we must compare its probability distribution on the sample with its probability distribution on the population.

The histograms of categorical variables can be compared using Pearson's chi-square test, while the cumulative distribution functions of numerical variables can be compared using the Kolmogorov-Smirnov test.
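As a quick illustration of the two tests, here is a minimal sketch with made-up vectors and my own variable names (not the article's dataset), showing how a single numerical and a single categorical column could be compared against their population in R:

```r
# Hypothetical numeric column: compare sample vs. population with the
# Kolmogorov-Smirnov test (R may warn about ties, since the sample is
# drawn from the population itself)
pop_num = rnorm(1e5)
smp_num = sample(pop_num, 1000)
ks.test(smp_num, pop_num)$p.value

# Hypothetical categorical column: compare sample counts against the
# population proportions with Pearson's chi-square test
pop_cat = sample(LETTERS[1:4], 1e5, replace = TRUE)
smp_cat = sample(pop_cat, 1000)
chisq.test(table(smp_cat), p = table(pop_cat) / length(pop_cat))$p.value
```

In both cases, a large p-value means we cannot reject the hypothesis that the sample follows the same distribution as the population.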
class=\"elementor-element elementor-element-ac7ff58 elementor-widget elementor-widget-text-editor\" data-id=\"ac7ff58\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tBoth statistical tests work under the null hypothesis that the sample has the\u00a0<strong>same distribution\u00a0<\/strong>of the population. Since a sample is made by many columns and we want all of them to be\u00a0<strong>significative<\/strong>, we can reject the null hypothesis if the p-value of\u00a0<strong>at least one<\/strong>\u00a0of the tests is lower than the usual\u00a0<strong>5% confidence level<\/strong>. In other words, we want\u00a0<strong>every column<\/strong>\u00a0to pass the significance test in order to accept the sample as valid.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-07de953 elementor-widget elementor-widget-heading\" data-id=\"07de953\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3>R Example<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7b955da elementor-widget elementor-widget-text-editor\" data-id=\"7b955da\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tLet\u2019s move from theory to practice. As usual, I\u2019ll use an example in R language. What I\u2019m going to show you is how the statistical tests can\u00a0<strong>give us a warning\u00a0<\/strong>when sampling is not done properly.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bbb99b8 elementor-widget elementor-widget-heading\" data-id=\"bbb99b8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\"><h4>Data simulation<\/h4><\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-640dd54 elementor-widget elementor-widget-text-editor\" data-id=\"640dd54\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tLet\u2019s simulate some (huge) data. We\u2019ll create a data frame with 1 million records and 2 columns. The first one has 500.000 records taken from a normal distribution, while the other 500.000 records are taken from a uniform distribution. 
This variable is\u00a0<strong>clearly biased\u00a0<\/strong>and it will help me explain the concepts of statistical significance later.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a05484f elementor-widget elementor-widget-text-editor\" data-id=\"a05484f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe other field is a\u00a0<strong>factor variable\u00a0<\/strong>created by using the first 10 letters from the alphabet uniformly distributed.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-61cb994 elementor-widget elementor-widget-text-editor\" data-id=\"61cb994\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tHere follows the code to create such a dataset.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-001f598 elementor-widget elementor-widget-text-editor\" data-id=\"001f598\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<span style=\"font-family: courier new,courier,monospace;\">set.seed(100)\nN = 1e6\ndataset = data.frame(\n# x1 variable has a bias. The first 500k values are taken\n# from a normal distribution, while the remaining 500k\n# are taken from a uniform distribution\nx1 = c(\nrnorm(N\/2,0,1) ,\nrunif(N\/2,0,1)\n),<\/span>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-af93dd9 elementor-widget elementor-widget-text-editor\" data-id=\"af93dd9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<span style=\"font-family: courier new,courier,monospace;\">\u00a0 # Categorical variable made by the first 10 letters\nx2 = sample(LETTERS[1:10],N,replace=TRUE)\n)<\/span>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-60585d9 elementor-widget elementor-widget-heading\" data-id=\"60585d9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\"><h4>Create a sample and check its significance<\/h4><\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d8bab02 elementor-widget elementor-widget-text-editor\" data-id=\"d8bab02\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tNow we can try to create a sample made by 10.000 records from the original dataset and check its significance.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-586bc3c elementor-widget elementor-widget-text-editor\" data-id=\"586bc3c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tRemember: numerical variables must be checked with the\u00a0<strong>Kolmogorov-Smirnov<\/strong>\u00a0test, while categorical variables (i.e. 
For each test, we'll store its p-value in a named list for the final check. If all the p-values are greater than 5%, we can say that the sample is not biased.

```r
sample_size = 10000
set.seed(1)

# Uniform random sample of 10,000 rows, without replacement
idxs = sample(1:nrow(dataset), sample_size, replace = FALSE)
subsample = dataset[idxs, ]

pvalues = list()
for (col in names(dataset)) {
  if (class(dataset[, col]) %in% c("numeric", "integer")) {
    # Numeric variable: Kolmogorov-Smirnov test
    pvalues[[col]] = ks.test(subsample[[col]], dataset[[col]])$p.value
  } else {
    # Categorical variable: Pearson's chi-square test against the
    # population proportions
    probs = table(dataset[[col]]) / nrow(dataset)
    pvalues[[col]] = chisq.test(table(subsample[[col]]), p = probs)$p.value
  }
}

pvalues
```

The p-values are, then:

![p-values of the random sample](https://cdn-images-1.medium.com/max/840/1*FkIqXnYLdIETcIPGJqavtw.png)

Each of them is greater than 5%, so we can say that the sample is statistically significant.

What happens if we take the first 10,000 records instead of taking them randomly?
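That non-random sample could be built and re-tested, for instance, like this: a sketch of my own that wraps the loop above in a hypothetical helper I call `sample_pvalues`, reusing the `dataset` and `sample_size` defined earlier.

```r
# Hypothetical helper that wraps the test loop above
sample_pvalues = function(population, smp) {
  pv = list()
  for (col in names(population)) {
    if (class(population[, col]) %in% c("numeric", "integer")) {
      # Numeric column: Kolmogorov-Smirnov test
      pv[[col]] = ks.test(smp[[col]], population[[col]])$p.value
    } else {
      # Categorical column: Pearson's chi-square test
      probs = table(population[[col]]) / nrow(population)
      pv[[col]] = chisq.test(table(smp[[col]]), p = probs)$p.value
    }
  }
  pv
}

# The first 10,000 rows are not a uniform random sample, so x1 should fail
sample_pvalues(dataset, dataset[1:sample_size, ])
```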
We know that the first half of the x1 variable has a different distribution than the whole column, so we expect such a sample not to be representative of the population.

If we repeat the tests, these are the p-values:

![p-values of the non-random sample](https://cdn-images-1.medium.com/max/840/1*vwWT_Da_Q4NSXh8msGuCpA.png)

As expected, x1 has too low a p-value, due to the bias introduced in the population. In this case, we must keep generating random samples until all the p-values are greater than the minimum allowed significance level.

## Conclusions

In this article, I've shown you that a properly chosen sample can be statistically significant enough to represent the whole population. This can help us in machine learning, because a small dataset lets us train models more quickly than a larger one while carrying the same amount of information.

However, everything is strongly related to the significance level we choose. For certain kinds of problems, it can be useful to raise the significance threshold or discard those variables that don't show a suitable p-value.
As usual, proper data discovery before training can help us decide how to perform the sampling correctly.

Here is the [original article](https://medium.com/data-science-journal/how-to-correctly-select-a-sample-from-a-huge-dataset-in-machine-learning-24327650372c).

*Gianluca Malato is a Data Scientist at Poste Italiane SPA. He is also a fiction author and software developer, and the editor of [Data Science Journal](https://medium.com/data-science-journal), [The Trading Scientist](https://medium.com/the-trading-scientist), and [The Writer's Notebook](https://medium.com/the-writers-notebook).*
*His books are available on [Amazon](https://www.amazon.com/Gianluca-Malato/e/B076CHTG3W).*