{"id":976,"date":"2018-11-12T04:20:29","date_gmt":"2018-11-12T01:20:29","guid":{"rendered":"http:\/\/kusuaks7\/?p=581"},"modified":"2021-05-11T14:01:18","modified_gmt":"2021-05-11T14:01:18","slug":"stop-feeding-garbage-to-your-model-the-6-biggest-mistakes-with-datasets-and-how-to-avoid-them","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/stop-feeding-garbage-to-your-model-the-6-biggest-mistakes-with-datasets-and-how-to-avoid-them\/","title":{"rendered":"Stop Feeding Garbage To Your Model!\u200a\u2014\u200aThe 6 biggest mistakes with datasets and how to avoid them"},"content":{"rendered":"<p><strong><em>Ready to learn Big Data? Browse <a href=\"https:\/\/www.experfy.com\/training\/tracks\/big-data-training-certification\">Big Data Training and Certification Courses<\/a> developed by industry thought leaders and Experfy in Harvard Innovation Lab.<\/em><\/strong><\/p>\n<section name=\"b62d\">\n<blockquote>\n<p id=\"4872\" name=\"4872\">Learn how to build killer datasets by avoiding the most frequent mistakes in Data Science, plus tips, tricks and&nbsp;kittens.<\/p>\n<\/blockquote>\n<figure id=\"935e\" name=\"935e\">\n<p><canvas height=\"30\" width=\"75\"><\/canvas><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*3qE2JG-nVHzP6jgVG3xOKg.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*3qE2JG-nVHzP6jgVG3xOKg.png\" style=\"width: 700px; height: 281px;\" \/><\/p>\n<\/figure>\n<\/section>\n<section name=\"9f32\">\n<hr \/>\n<h3 id=\"066e\" name=\"066e\">Introduction<\/h3>\n<p id=\"a785\" name=\"a785\">If you haven&rsquo;t heard it already, let me tell you a truth that you should, as a&nbsp;<em>data scientist<\/em>, always keep in a corner of your head:<\/p>\n<blockquote id=\"ba63\" name=\"ba63\"><p><strong>&ldquo;Your results are only as good as your data.&rdquo;<\/strong><\/p><\/blockquote>\n<p id=\"6a93\" name=\"6a93\">Many people make the mistake of trying 
to&nbsp;<strong>compensate&nbsp;<\/strong>for<strong>&nbsp;<\/strong>their ugly&nbsp;<strong>dataset<\/strong>&nbsp;by&nbsp;<strong>improving<\/strong>&nbsp;their&nbsp;<strong>model<\/strong>. This is the equivalent of buying a&nbsp;<strong>supercar<\/strong>&nbsp;because your old car doesn&rsquo;t perform well with&nbsp;<em>cheap<\/em>&nbsp;<em>gasoline<\/em>. It makes much more sense to&nbsp;<em>refine<\/em>&nbsp;the&nbsp;<strong>oil<\/strong>&nbsp;instead of&nbsp;<em>upgrading<\/em>&nbsp;the&nbsp;<strong>car<\/strong>. In this article, I will explain how you can easily&nbsp;<strong>improve<\/strong>&nbsp;your&nbsp;<strong>results<\/strong>&nbsp;by&nbsp;<strong>enhancing<\/strong>&nbsp;your&nbsp;<strong>dataset<\/strong>.<\/p>\n<p id=\"df6a\" name=\"df6a\"><strong><em>Note<\/em><\/strong><em>: I will take the task of image classification as an example, but these tips can be applied to all sorts of datasets.<\/em><\/p>\n<h3 id=\"3bda\" name=\"3bda\">The 6 most frequent mistakes, and how to fix&nbsp;them<\/h3>\n<figure id=\"836f\" name=\"836f\">\n<p><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*u-ezFqnYLYUSs0il5pF9iQ.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*u-ezFqnYLYUSs0il5pF9iQ.png\" style=\"width: 700px; height: 200px;\" \/><\/p>\n<\/figure>\n<h4 id=\"64fe\" name=\"64fe\">1. Not enough&nbsp;data<\/h4>\n<p id=\"3676\" name=\"3676\">If your dataset is too&nbsp;<em>small<\/em>, your model doesn&rsquo;t have enough examples to find&nbsp;<strong>discriminative features<\/strong>&nbsp;that will be used to&nbsp;<strong>generalize<\/strong>. 
It will then&nbsp;<a data-href=\"https:\/\/machinelearningmastery.com\/overfitting-and-underfitting-with-machine-learning-algorithms\/\" href=\"https:\/\/machinelearningmastery.com\/overfitting-and-underfitting-with-machine-learning-algorithms\/\" rel=\"noopener noreferrer\" target=\"_blank\"><strong>overfit<\/strong><\/a>&nbsp;your data, resulting in a&nbsp;<strong>low<\/strong>&nbsp;<strong>training<\/strong>&nbsp;<strong>error<\/strong>&nbsp;but a&nbsp;<strong>high<\/strong>&nbsp;<strong>test<\/strong>&nbsp;<strong>error<\/strong>.<\/p>\n<p id=\"8a8b\" name=\"8a8b\"><strong><em>Solution #1:&nbsp;<\/em><\/strong>gather more data. You can try to find more from the&nbsp;<em>same<\/em>&nbsp;<em>source<\/em>&nbsp;as your original dataset, or from&nbsp;<em>another<\/em>&nbsp;<em>source<\/em>&nbsp;if the images are quite similar or if you&nbsp;<strong>absolutely want<\/strong>&nbsp;to&nbsp;<strong>generalize<\/strong>.<\/p>\n<p id=\"5988\" name=\"5988\"><strong><em>Caveats:&nbsp;<\/em><\/strong>This is usually not an easy thing to do, at least without investing time and money. Also, you might want to do an&nbsp;<em>analysis<\/em>&nbsp;to determine&nbsp;<strong>how much&nbsp;<\/strong>additional<strong>&nbsp;<\/strong>data you need. Compare your results with&nbsp;<strong>different dataset sizes,<\/strong>&nbsp;and try to&nbsp;<strong>extrapolate<\/strong>.<\/p>\n<figure id=\"fa61\" name=\"fa61\">\n<p><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*hbXZX9s4lQ12xTXuF5YzTQ.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*hbXZX9s4lQ12xTXuF5YzTQ.png\" style=\"width: 700px; height: 349px;\" \/><\/p>\n<\/figure>\n<p name=\"ebfa\" style=\"text-align: center;\">In this case, it seems that we would need&nbsp;<strong>500k samples<\/strong>&nbsp;to reach our&nbsp;<strong>target<\/strong>&nbsp;<strong>error<\/strong>. 
That would mean gathering&nbsp;<strong>50 times as much<\/strong>&nbsp;data as we currently have. It is probably more&nbsp;<em>efficient<\/em>&nbsp;to work on other&nbsp;<strong><em>aspects<\/em><\/strong>&nbsp;of the data, or on the&nbsp;<strong>model<\/strong>.<\/p>\n<p id=\"ebfa\" name=\"ebfa\"><strong><em>Solution #2:&nbsp;<\/em><\/strong>augment your data by creating multiple copies of the same image with slight variations. This technique works wonders, and it produces tons of additional images at a really low cost. You can try to&nbsp;<em>crop<\/em>,&nbsp;<em>rotate<\/em>,&nbsp;<em>translate<\/em>&nbsp;or&nbsp;<em>scale<\/em>&nbsp;your image. You can&nbsp;<em>add<\/em>&nbsp;<em>noise<\/em>,&nbsp;<em>blur it<\/em>,&nbsp;<em>change its colors&nbsp;<\/em>or&nbsp;<em>obstruct parts of it.<\/em>&nbsp;In all cases, you need to make sure the data&nbsp;<strong>still represents the same class<\/strong>.<\/p>\n<figure id=\"fc5a\" name=\"fc5a\">\n<p><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*Z8L-niHsUacRRKx2P7dhGg.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*Z8L-niHsUacRRKx2P7dhGg.png\" style=\"width: 631px; height: 664px;\" \/><\/p><figcaption>&nbsp;<\/figcaption><\/figure>\n<p name=\"65b3\" style=\"text-align: center;\">All these images still represent the &ldquo;cat&rdquo;&nbsp;category<\/p>\n<p id=\"65b3\" name=\"65b3\">This can be extremely powerful, as stacking these effects produces exponentially more samples for your dataset. 
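To make the stacking idea concrete, here is a minimal sketch in plain Python, using toy grayscale images represented as lists of pixel rows (a real pipeline would use a library such as Pillow or torchvision; the function names here are illustrative, not from any particular library):

```python
import random

def hflip(img):
    # Mirror each row horizontally; label-preserving for most classes
    return [row[::-1] for row in img]

def add_noise(img, amount=10, seed=0):
    # Perturb pixel values slightly, clamped to the valid 0-255 range
    rng = random.Random(seed)
    return [[max(0, min(255, p + rng.randint(-amount, amount))) for p in row]
            for row in img]

def augment(img):
    # Stacking transforms multiplies the number of samples per original image
    flipped = hflip(img)
    return [img, flipped, add_noise(img), add_noise(flipped, seed=1)]

image = [[(i * 16 + j) % 256 for j in range(16)] for i in range(16)]
variants = augment(image)
print(len(variants))  # 4 samples from a single original image
```

Adding crops, rotations and blurs to the list of transforms grows the number of combinations very quickly.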
Note that this is still usually&nbsp;<strong>inferior<\/strong>&nbsp;to collecting&nbsp;<strong>more<\/strong>&nbsp;<strong>raw<\/strong>&nbsp;<strong>data<\/strong>.<\/p>\n<p name=\"65b3\" style=\"text-align: center;\"><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*abXm_BwuZTJOisce_rjeMg.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*abXm_BwuZTJOisce_rjeMg.png\" \/><\/p>\n<p name=\"ab92\" style=\"text-align: center;\">Combined data augmentation techniques. The class is still &ldquo;cat&rdquo; and should be recognized as&nbsp;such.<\/p>\n<p id=\"ab92\" name=\"ab92\"><strong><em>Caveats:&nbsp;<\/em><\/strong>not all augmentation techniques will be usable for your problem. For example, if you want to classify Lemons and Limes, don&rsquo;t play with the hue, since&nbsp;<em>color<\/em>&nbsp;is precisely what distinguishes the two classes.<\/p>\n<figure id=\"a7d4\" name=\"a7d4\">\n<p><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*-o62yj2HqmVI_gtqCYeC3A.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*-o62yj2HqmVI_gtqCYeC3A.png\" style=\"width: 700px; height: 214px;\" \/><\/p><figcaption>This type of data augmentation would make it harder for the model to find discriminating features.<\/figcaption><\/figure>\n<figure id=\"f53a\" name=\"f53a\">\n<p><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*AfgtZz2gO5zeO9HWE7M3IA.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*AfgtZz2gO5zeO9HWE7M3IA.png\" \/><\/p>\n<\/figure>\n<h4 id=\"25a5\" name=\"25a5\">2. Low quality&nbsp;classes<\/h4>\n<p id=\"3a56\" name=\"3a56\">It&rsquo;s an easy one, but take time to go through your dataset if possible, and&nbsp;<strong>verify the label&nbsp;<\/strong>of each sample. 
This might take a while, but having&nbsp;<em>counter-examples&nbsp;<\/em>in your dataset will be<strong>&nbsp;detrimental&nbsp;<\/strong>to<strong>&nbsp;<\/strong>the learning process.<\/p>\n<p id=\"0137\" name=\"0137\">Also, choose the right level of&nbsp;<strong>granularity<\/strong>&nbsp;for your classes. Depending on the problem, you might need more or fewer classes.&nbsp;<em>For example<\/em>, you can classify the image of a&nbsp;<strong>kitten<\/strong>&nbsp;with a&nbsp;<strong>global classifier&nbsp;<\/strong>to determine it&rsquo;s an&nbsp;<strong>animal<\/strong>, then run it through an&nbsp;<strong>animal classifier&nbsp;<\/strong>to determine it&rsquo;s a&nbsp;<strong>kitten<\/strong>. A huge model could do both, but it would be much harder.<\/p>\n<figure id=\"aec2\" name=\"aec2\">\n<p><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*x78GxSvoQC2EhHd0CgoT1A.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*x78GxSvoQC2EhHd0CgoT1A.png\" style=\"width: 700px; height: 310px;\" \/><\/p>\n<\/figure>\n<p name=\"46bc\" style=\"text-align: center;\">Two-stage prediction with specialized classifiers.<\/p>\n<p name=\"46bc\" style=\"text-align: center;\"><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*a6dDXkAF2QZYr2vWVcjA8g.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*a6dDXkAF2QZYr2vWVcjA8g.png\" style=\"width: 700px; height: 200px;\" \/><\/p>\n<h4 id=\"46bc\" name=\"46bc\">3. Low quality&nbsp;data<\/h4>\n<p id=\"edf1\" name=\"edf1\">As said in the introduction,&nbsp;<em>low quality&nbsp;<\/em><strong>data<\/strong>&nbsp;will only lead to&nbsp;<em>low quality&nbsp;<\/em><strong>results<\/strong>.<\/p>\n<p id=\"4b38\" name=\"4b38\">You might have samples in your dataset that are too far from what you want to use. 
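One heuristic for triaging obviously low-quality samples (this is a common trick for detecting blurry images, not a method from this article) is the variance of a Laplacian filter: detailed images score high, while flat or blurry ones score near zero. A minimal sketch on toy grayscale images:

```python
def laplacian_variance(img):
    # Variance of the 4-neighbour Laplacian; near zero for flat/blurry images
    h, w = len(img), len(img[0])
    lap = [img[i-1][j] + img[i+1][j] + img[i][j-1] + img[i][j+1] - 4 * img[i][j]
           for i in range(1, h - 1) for j in range(1, w - 1)]
    mean = sum(lap) / len(lap)
    return sum((v - mean) ** 2 for v in lap) / len(lap)

def keep_sharp(images, threshold=100.0):
    # Drop samples whose detail score falls below the threshold
    return [im for im in images if laplacian_variance(im) >= threshold]

flat = [[128] * 8 for _ in range(8)]  # no detail at all
checker = [[255 * ((i + j) % 2) for j in range(8)] for i in range(8)]  # high detail
print(len(keep_sharp([flat, checker])))  # 1 -- only the detailed image survives
```

The threshold is dataset-dependent, so sort your images by score and eyeball where the junk starts rather than trusting a fixed cutoff.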
These might be more&nbsp;<strong>confusing<\/strong>&nbsp;for the model than helpful.<\/p>\n<p id=\"8ee1\" name=\"8ee1\"><strong><em>Solution<\/em><\/strong>:&nbsp;<em>remove<\/em>&nbsp;the worst images. This is a lengthy process, but it will improve your results.<\/p>\n<figure id=\"f186\" name=\"f186\">\n<p><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*MQIXyEFlLtHIy-OKUtBXIQ.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*MQIXyEFlLtHIy-OKUtBXIQ.png\" \/><\/p><figcaption>&nbsp;<\/figcaption><\/figure>\n<p name=\"aa0b\" style=\"text-align: center;\">Sure, these three images represent cats, but the model might not be able to work with&nbsp;them.<\/p>\n<p id=\"aa0b\" name=\"aa0b\">Another&nbsp;<strong>common<\/strong>&nbsp;issue is when your dataset is made of data that&nbsp;<strong>doesn&rsquo;t<\/strong>&nbsp;<strong>match<\/strong>&nbsp;the&nbsp;<strong>real-world<\/strong>&nbsp;<strong>application<\/strong>,&nbsp;<em>for instance<\/em>&nbsp;if the images are taken from completely different sources.<\/p>\n<p id=\"1bf8\" name=\"1bf8\"><strong><em>Solution:&nbsp;<\/em><\/strong>think about the long-term application of your technology and how data will be acquired in production. If possible, try to find\/build a dataset made with the same tools.<\/p>\n<figure id=\"70d5\" name=\"70d5\">\n<p><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*maCkqiUvMgeGmRRdGe_oYw.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*maCkqiUvMgeGmRRdGe_oYw.png\" style=\"width: 700px; height: 517px;\" \/><\/p><figcaption>Using data that doesn&rsquo;t represent your real-world application is usually a bad idea. 
Your model is likely to extract features that won&rsquo;t work in the real&nbsp;world.<\/figcaption><\/figure>\n<figure id=\"0dba\" name=\"0dba\">\n<p><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*CSzPpz3Pw1F7yvY-OPnwtg.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*CSzPpz3Pw1F7yvY-OPnwtg.png\" style=\"width: 700px; height: 200px;\" \/><\/p>\n<\/figure>\n<h4 id=\"c508\" name=\"c508\">4. Unbalanced classes<\/h4>\n<p id=\"07b1\" name=\"07b1\">If the&nbsp;<strong>number<\/strong>&nbsp;of samples per class&nbsp;<strong>isn&rsquo;t<\/strong>&nbsp;<em>roughly<\/em>&nbsp;the&nbsp;<strong>same<\/strong>&nbsp;for all classes, the model might have a tendency to favor the dominant class, as it results in a&nbsp;<strong>lower<\/strong>&nbsp;<strong>error<\/strong>. We say that the model is&nbsp;<strong>biased<\/strong>&nbsp;because the&nbsp;<em>class distribution<\/em>&nbsp;is&nbsp;<strong>skewed<\/strong>. This is a serious issue, and also why you need to take a look at&nbsp;<a data-href=\"https:\/\/en.wikipedia.org\/wiki\/Precision_and_recall\" href=\"https:\/\/en.wikipedia.org\/wiki\/Precision_and_recall\" rel=\"noopener noreferrer\" target=\"_blank\">precision, recall<\/a>&nbsp;or&nbsp;<a data-href=\"https:\/\/en.wikipedia.org\/wiki\/Confusion_matrix\" href=\"https:\/\/en.wikipedia.org\/wiki\/Confusion_matrix\" rel=\"noopener noreferrer\" target=\"_blank\">confusion matrices<\/a>.<\/p>\n<p id=\"034d\" name=\"034d\"><strong><em>Solution #1:&nbsp;<\/em><\/strong>gather more samples of the&nbsp;<em>underrepresented&nbsp;<\/em>classes. However, this is&nbsp;<strong>often<\/strong>&nbsp;<strong>costly<\/strong>&nbsp;in&nbsp;<em>time<\/em>&nbsp;and&nbsp;<em>money<\/em>, or simply&nbsp;<em>not feasible.<\/em><\/p>\n<p id=\"58f6\" name=\"58f6\"><strong><em>Solution #2:<\/em><\/strong>&nbsp;over\/under-sample your data. 
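As a rough illustration, random oversampling can be sketched in a few lines of plain Python (the function and variable names are mine, not from any library; `imbalanced-learn` offers production-grade versions of this):

```python
import random
from collections import Counter

def oversample(samples, labels, seed=0):
    # Duplicate minority-class samples until every class matches the largest one
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    out_s, out_y = [], []
    for y, group in by_class.items():
        out_s += group + [rng.choice(group) for _ in range(target - len(group))]
        out_y += [y] * target
    return out_s, out_y

X = ["lime"] * 8 + ["cat"] * 2   # 8 limes vs only 2 cats
y = [0] * 8 + [1] * 2
Xb, yb = oversample(X, y)
print(Counter(yb))  # both classes now count 8 samples
```

In practice you would replace the plain duplication with augmented copies, as described above.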
This means that you&nbsp;<strong>remove<\/strong>&nbsp;some samples from the&nbsp;<strong>over-represented<\/strong>&nbsp;classes, and\/or&nbsp;<strong>duplicate<\/strong>&nbsp;samples from the&nbsp;<strong>under-represented<\/strong>&nbsp;classes. Better than&nbsp;<em>duplication<\/em>, use data augmentation as seen previously.<\/p>\n<figure id=\"81c1\" name=\"81c1\">\n<p><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*bmpmzzkZF_o8DChLJI4WeQ.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*bmpmzzkZF_o8DChLJI4WeQ.png\" style=\"width: 700px; height: 339px;\" \/><\/p><figcaption>We need to&nbsp;<strong>augment<\/strong>&nbsp;the&nbsp;<strong>under-represented&nbsp;<\/strong>class (cat) and&nbsp;<strong>leave aside&nbsp;<\/strong>some<strong>&nbsp;<\/strong>samples from the&nbsp;<strong>over-represented&nbsp;<\/strong>class (lime). This will give a much smoother class distribution.<\/figcaption><\/figure>\n<figure id=\"a6b5\" name=\"a6b5\">\n<p><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*sNxSDCqVhrPaJDfsX9id1A.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*sNxSDCqVhrPaJDfsX9id1A.png\" style=\"width: 700px; height: 200px;\" \/><\/p>\n<\/figure>\n<h4 id=\"f62a\" name=\"f62a\">5. Unbalanced data<\/h4>\n<p id=\"4e45\" name=\"4e45\">If your data doesn&rsquo;t have a&nbsp;<strong>specific<\/strong>&nbsp;<strong>format<\/strong>, or if the values don&rsquo;t lie in a&nbsp;<strong>certain<\/strong>&nbsp;<strong>range<\/strong>, your model might have trouble dealing with it. 
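For the value-range part of the problem, min-max normalization is the usual fix; a minimal sketch on a toy grayscale image (names are illustrative):

```python
def normalize(img, new_min=0.0, new_max=1.0):
    # Min-max rescaling: map each sample's own range onto a common one
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    span = (hi - lo) or 1  # guard against flat images
    scale = (new_max - new_min) / span
    return [[new_min + (p - lo) * scale for p in row] for row in img]

dark = [[0, 40], [80, 20]]  # values squeezed into 0-80 instead of 0-255
out = normalize(dark)
print(min(map(min, out)), max(map(max, out)))  # 0.0 1.0
```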
You will have better results with images that share the same&nbsp;<em>aspect<\/em>&nbsp;<em>ratio<\/em>&nbsp;and&nbsp;<em>pixel<\/em>&nbsp;<em>value<\/em>&nbsp;range.<\/p>\n<p id=\"48d1\" name=\"48d1\"><strong><em>Solution #1:&nbsp;<\/em><\/strong><em>Crop<\/em>&nbsp;or&nbsp;<em>stretch<\/em>&nbsp;the data so that it has the same aspect or&nbsp;<strong>format<\/strong>&nbsp;as the other samples.<\/p>\n<figure id=\"8472\" name=\"8472\">\n<p><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*XQ4-W17DRJwJbcQDY1ubUQ.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*XQ4-W17DRJwJbcQDY1ubUQ.png\" \/><\/p><figcaption>&nbsp;<\/figcaption><\/figure>\n<p name=\"112c\" style=\"text-align: center;\">Two possibilities to improve a badly formatted image.<\/p>\n<p id=\"112c\" name=\"112c\"><strong><em>Solution #2:&nbsp;<\/em><\/strong><em>normalize<\/em>&nbsp;the data so that every sample has its data in the&nbsp;<strong>same<\/strong>&nbsp;value&nbsp;<strong>range<\/strong>.<\/p>\n<figure id=\"8787\" name=\"8787\">\n<p><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*LuAvDFaL0KwynRMZECBfrw.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*LuAvDFaL0KwynRMZECBfrw.png\" style=\"width: 700px; height: 253px;\" \/><\/p><figcaption>The value range is normalized to be consistent across the&nbsp;dataset.<\/figcaption><\/figure>\n<figure id=\"2b02\" name=\"2b02\">\n<p><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*2ZhkFwY3slItf9blqLnVWA.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*2ZhkFwY3slItf9blqLnVWA.png\" \/><\/p>\n<\/figure>\n<h4 id=\"25f5\" name=\"25f5\">6. 
No validation or&nbsp;testing<\/h4>\n<p id=\"056c\" name=\"056c\">Once your dataset has been&nbsp;<em>cleaned<\/em>,&nbsp;<em>augmented<\/em>&nbsp;and properly&nbsp;<em>labelled<\/em>, you need to&nbsp;<strong>split<\/strong>&nbsp;it. Many people split it the following way:&nbsp;<em>80%<\/em>&nbsp;for&nbsp;<strong>training<\/strong>, and&nbsp;<em>20%&nbsp;<\/em>for&nbsp;<strong>testing,&nbsp;<\/strong>which<strong>&nbsp;<\/strong>allows you to easily spot&nbsp;<em>overfitting.&nbsp;<\/em><strong>However<\/strong>, if you are trying multiple models on the same testing set, something else happens. By&nbsp;<em>picking<\/em>&nbsp;the model giving the best test accuracy, you are in fact&nbsp;<strong>overfitting the testing set<\/strong>. This happens because you are manually selecting a model&nbsp;<strong>not<\/strong>&nbsp;for its&nbsp;<strong>intrinsic<\/strong>&nbsp;<strong>value<\/strong>, but for its&nbsp;<em>performance<\/em>&nbsp;on a&nbsp;<strong>specific&nbsp;<\/strong>set of data.<\/p>\n<p id=\"6895\" name=\"6895\"><strong><em>Solution:<\/em><\/strong>&nbsp;split the dataset in three:&nbsp;<strong>training<\/strong>,&nbsp;<strong>validation<\/strong>&nbsp;and&nbsp;<strong>testing<\/strong>. This&nbsp;<strong>shields<\/strong>&nbsp;your testing set from being&nbsp;<em>overfitted<\/em>&nbsp;by the&nbsp;<strong>choice of the model<\/strong>. 
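A minimal way to produce such a three-way split (an illustrative sketch with an assumed 80\/10\/10 ratio; frameworks like scikit-learn offer `train_test_split` for the same purpose):

```python
import random

def train_val_test_split(samples, val_frac=0.1, test_frac=0.1, seed=42):
    # Shuffle once, then cut the deck into three disjoint parts
    items = list(samples)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

Fixing the seed keeps the split reproducible, so the testing set stays untouched across experiments.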
The selection process becomes:<\/p>\n<ol>\n<li id=\"65c0\" name=\"65c0\"><strong>Train<\/strong>&nbsp;your models on the&nbsp;<strong>training set.<\/strong><\/li>\n<li id=\"0836\" name=\"0836\"><strong>Test<\/strong>&nbsp;them on the&nbsp;<strong>validation set&nbsp;<\/strong>to make sure you aren&rsquo;t&nbsp;<em>overfitting<\/em>.<\/li>\n<li id=\"d373\" name=\"d373\">Pick the most promising model.&nbsp;<strong>Test<\/strong>&nbsp;it on the&nbsp;<strong>testing set<\/strong>; this will give you the&nbsp;<strong>true accuracy&nbsp;<\/strong>of your model.<\/li>\n<\/ol>\n<figure id=\"531b\" name=\"531b\">\n<p><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*8fphzGUa4BdpoXaNGttziw.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*8fphzGUa4BdpoXaNGttziw.png\" \/><\/p>\n<\/figure>\n<p id=\"3367\" name=\"3367\"><strong><em>Note<\/em><\/strong>: Once you have&nbsp;<em>chosen<\/em>&nbsp;your model for&nbsp;<strong>production<\/strong>, don&rsquo;t forget to train it on the&nbsp;<strong>whole<\/strong>&nbsp;<strong>dataset<\/strong>! The more data, the better!<\/p>\n<h4 id=\"841c\" name=\"841c\">Conclusion<\/h4>\n<p id=\"df44\" name=\"df44\">I hope by now you are&nbsp;<em>convinced<\/em>&nbsp;that you must pay attention to your&nbsp;<strong>dataset<\/strong>&nbsp;before even thinking about your model. You now know the biggest mistakes when working with data, how to avoid the&nbsp;<strong>pitfalls<\/strong>, plus&nbsp;<strong>tips<\/strong>&nbsp;and&nbsp;<strong>tricks<\/strong>&nbsp;on how to build&nbsp;<strong>killer<\/strong>&nbsp;<strong>datasets<\/strong>! 
In case of doubt, remember:<\/p>\n<blockquote id=\"c35a\" name=\"c35a\"><p><strong>&ldquo;The winner is not the one with the best model, it&rsquo;s the one with the best&nbsp;data.&rdquo;<\/strong>.<\/p><\/blockquote>\n<\/section>\n","protected":false},"excerpt":{"rendered":"<p>Many people make the mistake of trying to&nbsp;compensate&nbsp;for&nbsp;their ugly&nbsp;dataset by improving their model. This is the equivalent of buying a&nbsp;super car because your old car doesn&rsquo;t perform well with&nbsp;cheap&nbsp;gasoline. It makes much more sense to&nbsp;refine&nbsp;the&nbsp;oil&nbsp;instead of&nbsp;upgrading&nbsp;the&nbsp;car. This article explains how you can easily&nbsp;improve&nbsp;your&nbsp;results&nbsp;by&nbsp;enhancing&nbsp;your&nbsp;dataset, with the task of image classification as an example, but these tips can be applied to all sorts of datasets.<\/p>\n","protected":false},"author":389,"featured_media":3448,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[94],"ppma_author":[2196],"class_list":["post-976","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":2196,"user_id":389,"is_guest":0,"slug":"julien-despois","display_name":"Julien Despois","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/Screen-Shot-2019-10-07-at-5.45.59-PM-1-150x150.png","user_url":"","last_name":"Despois","first_name":"Julien","job_title":"","description":"Julien Despois is Machine Learning &amp; Deep Learning Scientist at L\u2019Or\u00e9al AI Research. With a\u00a0strong background in Machine Learning, computer science, and mathematics he has a passion for Artificial Intelligence. 
He is specialized in applications of Deep Learning."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/976","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/389"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=976"}],"version-history":[{"count":1,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/976\/revisions"}],"predecessor-version":[{"id":6245,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/976\/revisions\/6245"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/3448"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=976"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=976"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=976"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=976"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}