{"id":627,"date":"2018-03-26T04:55:18","date_gmt":"2018-03-26T04:55:18","guid":{"rendered":"http:\/\/kusuaks7\/?p=232"},"modified":"2025-05-23T10:20:10","modified_gmt":"2025-05-23T10:20:10","slug":"a-great-pitfall-neglecting-validation","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/a-great-pitfall-neglecting-validation\/","title":{"rendered":"A Great Pitfall: Neglecting Validation"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"627\" class=\"elementor elementor-627\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-6ae20c52 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"6ae20c52\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-7ef99fde\" data-id=\"7ef99fde\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-7f103a17 elementor-widget elementor-widget-text-editor\" data-id=\"7f103a17\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-ee009d8 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"ee009d8\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-16ea2f2\" data-id=\"16ea2f2\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-167be3d elementor-widget elementor-widget-image\" data-id=\"167be3d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"682\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2018\/03\/1_7e5_hADcH6D3HB1DHdZmJQ-1024x682.jpg\" class=\"attachment-large size-large wp-image-37860\" alt=\"\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2018\/03\/1_7e5_hADcH6D3HB1DHdZmJQ-1024x682.jpg 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2018\/03\/1_7e5_hADcH6D3HB1DHdZmJQ-300x200.jpg 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2018\/03\/1_7e5_hADcH6D3HB1DHdZmJQ-768x512.jpg 768w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2018\/03\/1_7e5_hADcH6D3HB1DHdZmJQ-1536x1024.jpg 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2018\/03\/1_7e5_hADcH6D3HB1DHdZmJQ-610x407.jpg 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2018\/03\/1_7e5_hADcH6D3HB1DHdZmJQ-750x500.jpg 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2018\/03\/1_7e5_hADcH6D3HB1DHdZmJQ-1140x760.jpg 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2018\/03\/1_7e5_hADcH6D3HB1DHdZmJQ.jpg 2000w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-3ee39b4 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"3ee39b4\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-2e06563\" data-id=\"2e06563\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-7cbe926 elementor-widget elementor-widget-text-editor\" data-id=\"7cbe926\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"bd5b\">It was in the first chapter in your book, it was in the first lecture you attended, or it was mentioned in the first tutorial you watched. It seems simple: you do not measure your predictor\u2019s performance based on the predictions you made on the data that you used for training. 
However, as management keeps putting on more pressure, or your stack takes longer to run, you start to sacrifice on validation. This whole story is about this one very basic thing: validation.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-35e193f elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"35e193f\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-5463e49\" data-id=\"5463e49\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-a2e2eb7 elementor-widget elementor-widget-heading\" data-id=\"a2e2eb7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"6dfb\"><strong>What is Validation?<\/strong><\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-45aecf1 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"45aecf1\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-d234af4\" 
data-id=\"d234af4\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-2009a68 elementor-widget elementor-widget-text-editor\" data-id=\"2009a68\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"8308\">You can skip this section if you already know the answer. If you are not sure, let me explain it roughly. Let us say you want a predictor that makes predictions for a specific task. After you get your predictor, you will want to know how well it performs. Does it perform so poorly that it is useless, or does it perform so well that you can declare the problem solved? In order to measure the performance, you need validation.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-540f304 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"540f304\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-a0a2f95\" data-id=\"a0a2f95\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-66f51b2 elementor-widget elementor-widget-text-editor\" data-id=\"66f51b2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"6a55\">In 
order to perform validation, you need data. More specifically, you need data with the information that you want to predict. We call this information the ground truth. According to Wikipedia, ground truth is defined as:\u00a0<em>\u201cinformation provided by direct observation\u201d<\/em><a href=\"https:\/\/en.wikipedia.org\/wiki\/Ground_truth\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/en.wikipedia.org\/wiki\/Ground_truth\" data-><em>*<\/em><\/a><em>.\u00a0<\/em>Ground truth is usually provided by humans. In our processes, we assume that the ground truth is the actual value that we want to predict for our data.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-1c59b15 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"1c59b15\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-46f053a\" data-id=\"46f053a\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-fd0f87d elementor-widget elementor-widget-text-editor\" data-id=\"fd0f87d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"ca41\">The adventure of validation begins once you have both your predictor and the data with ground truth. If the data with ground truth was never used during your development process, validation is easy. You just compare the predictions with the ground truth. 
However, this situation is rare. You rarely lock away all of your data and never peek into it until the very end. Very often you use this data while developing your predictor. You use some of your data for training, check the data to adjust thresholds, or peek into the data to get some idea of how to achieve a good predictor. In this case, you need to think about how to establish a solid and valid validation mechanism.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-71a9a5f elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"71a9a5f\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-91ebad7\" data-id=\"91ebad7\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-96636a8 elementor-widget elementor-widget-heading\" data-id=\"96636a8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"972f\"><strong>Why Are We Doing Validation?<\/strong><\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-b791109 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"b791109\" data-element_type=\"section\" 
data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-1657072\" data-id=\"1657072\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b000e12 elementor-widget elementor-widget-text-editor\" data-id=\"b000e12\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"dfe4\">It seems simple: we are doing validation because we want to measure the performance of our predictor. But let us not be fooled by this off-the-shelf definition. We do not just want to measure the performance. We want to know if our predictor will do a good job in a real-life setting.<\/p>\n<p id=\"b0b9\">Let us establish a prediction task example. In this example, we want to know if an image has a cat in it. In this task, you feed an image to the predictor, and it tells you whether there is a cat in the image or not. So by definition, it is a binary classification problem.<\/p>\n<p id=\"c684\">For this task, you can find a dataset or prepare your own. After that, you can go ahead and start developing your predictor. But you should not forget why you are developing a predictor. You do not want to detect cats in your dataset. You want a predictor that detects cats when you pick up your camera and take a picture of a cat in your local park. For validation, the dataset on your hard disk is there to help you, so that you do not have to chase cats in your local park all the time. 
But with this comfort, we often forget about the real goal and shoot for the grand prize: a higher KPI (key performance indicator) value!<\/p>\n<p id=\"517c\">We are doing validation so that it will give us a good idea of how our predictors will work in a real-life setting. Your KPI is just there to help you; it is not your actual goal. We need to keep this in mind when developing our predictors.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-5ca5bf5 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"5ca5bf5\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-174caf8\" data-id=\"174caf8\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-0dbe342 elementor-widget elementor-widget-heading\" data-id=\"0dbe342\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"51ef\"><strong>When You Forget About Validation<\/strong><\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-3286621 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"3286621\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div 
class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-7ba3dc7\" data-id=\"7ba3dc7\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-6d4bd57 elementor-widget elementor-widget-text-editor\" data-id=\"6d4bd57\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"0284\">What happens in a scenario where you forget or fail to establish a valid validation mechanism? Let me paint you a possible outcome of this scenario.<\/p>\n<p id=\"957a\">You work on your project. You focus on getting your KPI better and better. You work hard on it. You get more data, you do your research on different methods, you run a lot of experiments. In the end, you get a performance that looks great. Mission accomplished! You are now ready to present your work and deploy it.<\/p>\n<p id=\"47a3\">Everybody is happy at this point. You present it to your team, to management, or to whoever you work with. If you are lucky, at this point somebody will catch the mistake you made. If you are lucky, somebody will tell you that you failed to establish a valid validation mechanism. This is not the worst-case scenario. You have not deployed your predictor yet, so nobody has been affected by this mistake. You just need to go back to the drawing board and introduce a valid validation mechanism. 
But if you are not lucky, you will march into a bad situation.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-54bb345 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"54bb345\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-fadc0ad\" data-id=\"fadc0ad\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-1c0e20a elementor-widget elementor-widget-text-editor\" data-id=\"1c0e20a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"3452\">In the worst-case scenario, you deploy your predictor and it goes live. After a while, people start to realize that the predictor you developed is not performing well. Users start to complain. Everybody becomes confused. The KPI suggests that the predictor performs very well. So why are users unhappy about the performance? After a while, you are convinced that there is something wrong. You go back to the drawing board, trying to debug your model. Finally, you realize that what your KPI shows is not right. Your validation mechanism is broken. You were so focused on increasing the KPI that you sacrificed validation. 
Now your reputation is damaged, and you need to come up with a new model quickly before customers start to churn.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-202b378 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"202b378\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-2126b1d\" data-id=\"2126b1d\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-c2d3447 elementor-widget elementor-widget-heading\" data-id=\"c2d3447\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"fe8d\"><strong>Why Do We Give Up on or Forget Validation?<\/strong><\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-31f1c25 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"31f1c25\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-bb67146\" data-id=\"bb67146\" data-element_type=\"column\" 
data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-94e8093 elementor-widget elementor-widget-text-editor\" data-id=\"94e8093\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"6c25\">There are a lot of reasons why we sacrifice validation or forget to establish it completely. I believe the list is very long. My list below is not complete, but these are the ones that I have come across often.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-56072de elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"56072de\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-d4c61ad\" data-id=\"d4c61ad\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-79860e2 elementor-widget elementor-widget-heading\" data-id=\"79860e2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">\n<h4 id=\"1c74\"><strong><em>Management Puts\u00a0Pressure<\/em><\/strong><\/h4><\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section 
elementor-top-section elementor-element elementor-element-d03612c elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"d03612c\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-ef884a2\" data-id=\"ef884a2\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5b374c7 elementor-widget elementor-widget-text-editor\" data-id=\"5b374c7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"d8c7\">This issue is not specific to data science or validation. Management or the business side always wants results fast. They put pressure on the team to come up with a predictor quickly. As pressure builds up, you start to give up on certain aspects of your method. One aspect you choose to give up may be validation. You know that validation is not fun. You need to have a different workflow just because of it. In addition, it takes more time. 
Because of this, it may seem like a good place to prune.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-0266f66 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"0266f66\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-fb8ca28\" data-id=\"fb8ca28\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-a65e7aa elementor-widget elementor-widget-heading\" data-id=\"a65e7aa\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\"><h4 id=\"facb\"><strong><em>You Think You Do Not Need Validation<\/em><\/strong><\/h4><\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-af0f608 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"af0f608\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-4612a0f\" data-id=\"4612a0f\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap 
elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-fb6b739 elementor-widget elementor-widget-text-editor\" data-id=\"fb6b739\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"8f96\">I believe this is caused by forgetting why we are doing validation. Whatever the reason is, you may think you do not need validation. Let me list a couple of reasons that I have come across:<\/p>\n<p id=\"ab30\"><em>Because you do not use machine learning.<\/em><\/p>\n<p id=\"a61f\">The need for validation is not specific to machine learning. The complexity of your model does not determine the need for validation. Whether your method is a single if statement or a complicated machine learning model, it does not matter. You need validation.<\/p>\n<p id=\"4fad\"><em>Because you think your method is not overfitting.<\/em><\/p>\n<p id=\"f327\">Even if you are using sophisticated regularization methods, that does not mean they work perfectly. Techniques for preventing overfitting are not perfect. This means that you cannot rely on them and measure the performance on the training set.<\/p>\n<p id=\"3d34\"><em>Because your method has nothing to do with validation.<\/em><\/p>\n<p id=\"f786\">Maybe you think you do not need validation because you have not used any data in training. However, are you sure that you have not introduced any bias into your method by yourself? This is the part where the rabbit hole goes deeper. Even if you are not using any data in a training process, you as a person may have developed a method that is biased toward the data you use. 
Let me use the cat detection example to clarify this: if your dataset only consists of dark-colored cats and you use this information while developing your algorithm, it will perform better on your dataset than in a real-life setting, where there are cats of different colors.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-1bd9d82 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"1bd9d82\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-49e7ae2\" data-id=\"49e7ae2\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b79de3a elementor-widget elementor-widget-heading\" data-id=\"b79de3a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\"><h4 id=\"087b\"><em>Your Runs\/Experiments Are Taking a Long\u00a0Time<\/em><\/h4>\n<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-54030fc elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"54030fc\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div 
class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-1da1498\" data-id=\"1da1498\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-def7ef2 elementor-widget elementor-widget-text-editor\" data-id=\"def7ef2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"37e1\">It is no secret that machine learning operations usually take a long time. Even if a single run does not, you want many of them for various reasons, such as trying different hyper-parameters, methods, etc. When your patience starts to wear thin, cloud server bills start to pile up, and the deadline approaches, you try to cut some operations. At this point, you tend to cut from validation. You may start to use fewer folds, switch to subsampling, or, worst of all, drop validation altogether.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-03a8c97 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"03a8c97\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-39b1f9d\" data-id=\"39b1f9d\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-76d941b elementor-widget elementor-widget-heading\" data-id=\"76d941b\" 
data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"cf9f\"><strong>No Excuses. You Need Validation<\/strong><\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-1cf121c elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"1cf121c\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-40d7618\" data-id=\"40d7618\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-cb5074a elementor-widget elementor-widget-text-editor\" data-id=\"cb5074a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"f573\">Whatever the reason is, whether listed above or not, you cannot afford to go without a valid validation mechanism. Validation is fundamental to any prediction work. You may choose among different methods (e.g. cross-validation, subsampling), but you must have one. It is best to have one from the very early stages. It may evolve as the project progresses, but even in your first runs, you must employ validation. 
Failing to have a valid validation mechanism will cause bigger problems in the later stages of the project.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-41d4c5e elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"41d4c5e\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-ec0c9e6\" data-id=\"ec0c9e6\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-7dcaab9 elementor-widget elementor-widget-heading\" data-id=\"7dcaab9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"2a9d\"><strong>Rabbit Hole Goes Way Deeper: Valid Validation<\/strong><\/h3>\n<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-42e4f46 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"42e4f46\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-116175c\" data-id=\"116175c\" data-element_type=\"column\" 
data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-666e476 elementor-widget elementor-widget-text-editor\" data-id=\"666e476\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"cb51\">When we think again about why we want validation, the situation goes deeper than just employing off-the-shelf validation methods like K-fold cross-validation. If we want to make sure that we get a similar performance in real-life settings, we need to think deeper about validation.<\/p>\n<p id=\"df82\">This topic exceeds the boundaries of this story, but here is some food for thought:<\/p>\n\n<ul>\n \t<li id=\"d703\">Is your data implicitly leaking the label of the instance?<\/li>\n \t<li id=\"c81a\">Does the dataset cover the range of variation found in real life?<\/li>\n \t<li id=\"a439\">Was the data collected in a synthetic (in-lab) environment or under realistic conditions?<\/li>\n \t<li id=\"08eb\">Did you introduce bias into your method yourself?<\/li>\n<\/ul>\n<\/section>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Ready to learn Data Science? Browse courses\u00a0like\u00a0Data Science Training and Certification developed by industry thought leaders and Experfy in Harvard Innovation Lab. It was in the first chapter in your book, it was in the first lecture you attended, or it was mentioned in the first tutorial you watched. 
It seems simple: you do not<\/p>\n","protected":false},"author":263,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[94],"ppma_author":[1779],"class_list":["post-627","post","type-post","status-publish","format-standard","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":1779,"user_id":263,"is_guest":0,"slug":"kemal-yesilbek","display_name":"Kemal Yesilbek","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Yesilbek","first_name":"Kemal","job_title":"","description":"Kemal Tugrul Yesilbek, data scientist at Lone Rooftop, is focused on machine learning and data science practices. He published multiple research papers on machine learning and its applications in academic journals and conferences. He is experienced in building machine learning solutions from idea to operation."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/627","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/263"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=627"}],"version-history":[{"count":4,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/627\/revisions"}],"predecessor-version":[{"id":37863,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/627\/revisions\/37863"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=627"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=627"},{"taxonomy":"post_tag","embeddable":true,"href
":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=627"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=627"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}