{"id":2023,"date":"2019-10-22T03:56:41","date_gmt":"2019-10-22T03:56:41","guid":{"rendered":"http:\/\/kusuaks7\/?p=1628"},"modified":"2024-03-11T10:39:24","modified_gmt":"2024-03-11T10:39:24","slug":"how-to-measure-feature-importance-in-a-binary-classification-model","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/how-to-measure-feature-importance-in-a-binary-classification-model\/","title":{"rendered":"How to measure feature importance in a binary classification model"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"2023\" class=\"elementor elementor-2023\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-74a965ed elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"74a965ed\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-134e32a2\" data-id=\"134e32a2\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-d28c7a3 elementor-widget elementor-widget-heading\" data-id=\"d28c7a3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"827d\" style=\"color: #aaa;font-style: italic\"><strong>An example in R language of how to check feature relevance in a binary classification problem<\/strong><\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-67a97e6 elementor-widget 
elementor-widget-text-editor\" data-id=\"67a97e6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<section>\n<p id=\"e63b\">One of the main tasks that a data scientist must face when he builds a machine learning model is the selection of the\u00a0<strong>most predictive variables<\/strong>. Selecting predictors with low predictive power can lead, in fact, to overfitting or\u00a0<strong>low model performance<\/strong>. In this article, I\u2019ll show you some techniques to better select the predictors of a dataset in a\u00a0<strong>binary classification<\/strong>\u00a0model.<\/p>\n\n<\/section><section>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d7652af elementor-widget elementor-widget-text-editor\" data-id=\"d7652af\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"fa7b\">When a data scientist starts working at some model, he often\u00a0<strong>doesn\u2019t have\u00a0<\/strong>a real idea of which the predictors should be. Maybe the previous phase of\u00a0<strong>business understanding<\/strong>\u00a0discarded some useless variables but we often have to face a giant table of\u00a0<strong>hundreds<\/strong>\u00a0of variables.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b558d00 elementor-widget elementor-widget-text-editor\" data-id=\"b558d00\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"5ec6\">Training a model on such a huge table\u00a0<strong>is not a good idea<\/strong>. You really run the risk of\u00a0<strong>collinearity\u00a0<\/strong>(i.e. correlations between variables). 
So we have to\u00a0<strong>choose\u00a0<\/strong>the best set of variables to use in order to make our model\u00a0<strong>learn properly\u00a0<\/strong>from the business\u00a0<strong>information\u00a0<\/strong>we are giving it.<\/p>\n<p id=\"16fb\">Our goal is to\u00a0<strong>increase\u00a0<\/strong>the predictive power of our model against our binary target, so we must find those variables that are\u00a0<strong>strongly correlated\u00a0<\/strong>with it. Remember: information is\u00a0<strong>hidden<\/strong>\u00a0inside the dataset and we must provide all the necessary conditions to make our model extract it. So we have to\u00a0<strong>prepare<\/strong>\u00a0data\u00a0<strong>before<\/strong>\u00a0the training phase in order to make the model work properly.<\/p>\n<p id=\"f6ef\">Numerical and categorical predictors require different approaches, and I\u2019ll show you both.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-068ab0c elementor-widget elementor-widget-heading\" data-id=\"068ab0c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"2bdd\">Numerical variables<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9563c24 elementor-widget elementor-widget-text-editor\" data-id=\"9563c24\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"0b1a\">Since our problem is a binary classification task, we can consider our\u00a0<strong>outcome\u00a0<\/strong>as a\u00a0<strong>number\u00a0<\/strong>that is either 0 or 1. 
In order to check if a variable is relevant or not, we can calculate the absolute value of the\u00a0<strong>Pearson linear correlation coefficient\u00a0<\/strong>between the target and the predictors.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9194d60 elementor-widget elementor-widget-image\" data-id=\"9194d60\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1000\/0*LYwqs4AzbiYPF-Wg.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f3c3360 elementor-widget elementor-widget-text-editor\" data-id=\"f3c3360\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"c8c0\">That is the covariance divided by the product of the standard deviations.<\/p>\n<p id=\"41fd\">We are not interested in the sign of the correlation. We just need to know its\u00a0<strong>intensity<\/strong>. That\u2019s why we use the absolute value.<\/p>\n<p id=\"f68c\">I have often seen this kind of approach in many AI projects and tools. Honestly, I have to say that it\u2019s not completely correct to calculate the correlation coefficient in this way. For a perfect predictor, we would expect a Pearson coefficient with an absolute value equal to 1, but we may not reach this value if we treat the binary outcome as a number. This isn\u2019t a problem, however. 
We are using the Pearson correlation coefficient to\u00a0<strong>sort our features<\/strong>\u00a0from the most relevant to the least relevant, so as long as the coefficient calculation is the same, we can\u00a0<strong>compare<\/strong>\u00a0the features with one another.<\/p>\n<p id=\"2d96\">The Pearson correlation coefficient is not flawless, however. It only measures\u00a0<strong>linear correlation<\/strong>, and our variables may not be linearly correlated. But as a first approximation, we can easily calculate it and use it for our purpose.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-26149f7 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"26149f7\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-e688bc5\" data-id=\"e688bc5\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-59f8de1 elementor-widget elementor-widget-heading\" data-id=\"59f8de1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"9431\">Categorical variables<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-313be56 elementor-widget elementor-widget-text-editor\" data-id=\"313be56\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"5f22\">For the categorical variables, there\u2019s no Pearson correlation coefficient, but we can use another great discovery of Pearson, which is the\u00a0<strong>chi-square test<\/strong>.<\/p>\n<p id=\"1086\">Let\u2019s say we have a histogram of\u00a0<em>N<\/em>\u00a0different categories with\u00a0<em>O<\/em>\u00a0observation that sum up to\u00a0<em>n\u00a0<\/em>and let\u2019s say want to compare it with a theoretical histogram made by probabilities\u00a0<em>p<\/em>. We can build a chi-square variable in this way:<\/p>\n<p id=\"fb7a\">This variable is asymptotically distributed as a\u00a0<strong>chi-square distribution<\/strong>\u00a0with\u00a0<em>N<\/em>-1 degrees of freedom.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8556ea6 elementor-widget elementor-widget-image\" data-id=\"8556ea6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1000\/0*1vnlkOZ20BMawQJG.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4a037e3 elementor-widget elementor-widget-text-editor\" data-id=\"4a037e3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"fb7a\">This variable is asymptotically distributed as a\u00a0<strong>chi-square distribution<\/strong>\u00a0with\u00a0<em>N<\/em>-1 degrees of freedom.<\/p><p id=\"50b9\">If our variable is not correlated to the target, we expect that, for each one of its values, we get 50% zeroes and 50% ones on our dataset. 
This is a\u00a0<strong>theoretical histogram\u00a0<\/strong>we could expect to have if there\u2019s no correlation, so a\u00a0<strong>one-tailed chi-square test<\/strong>\u00a0performed to check whether the real histogram is similar to this one should give us a large p-value (i.e. a low chi-square value) if our variable is not correlated to the target. On the contrary, a perfect predictor will push the p-value towards\u00a0<strong>lower values\u00a0<\/strong>(i.e. higher chi-square values).<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ffc833a elementor-widget elementor-widget-heading\" data-id=\"ffc833a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"ad74\">Example in\u00a0R<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-235abb2 elementor-widget elementor-widget-text-editor\" data-id=\"235abb2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"adc2\">To better explain the procedure, I\u2019ll show you an example in R code. I\u2019ll work with the famous\u00a0<strong>iris\u00a0<\/strong>dataset.<\/p>\n<p id=\"c17f\">Remember that R has a powerful function\u00a0<strong>cor<\/strong>\u00a0that calculates the correlation matrix and the function\u00a0<strong>chisq.test<\/strong>\u00a0that performs the chi-square test.<\/p>\n<p id=\"c485\">First, we create a column named\u00a0<strong>target<\/strong>\u00a0that is equal to 1 when the species is\u00a0<strong>virginica<\/strong>\u00a0and 0 otherwise. Then we\u2019ll check the correlations with the other variables.<\/p>\n<p id=\"f7f0\">Let\u2019s start with the\u00a0<strong>numerical features<\/strong>. 
With this simple code, it\u2019s very easy to find the most correlated ones.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bd4820f elementor-widget elementor-widget-text-editor\" data-id=\"bd4820f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<div id=\"c897\" style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><span style=\"font-family: courier new,courier,monospace;\"># Load iris dataset\ndata(\"iris\")<\/span><\/div>\n&nbsp;\n<div id=\"d48c\" style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><span style=\"font-family: courier new,courier,monospace;\"># Generate a binary target column\niris$target = ifelse(iris$Species == \"virginica\", 1, 0)\nnumeric_columns = setdiff(names(iris), \"Species\")<\/span><\/div>\n&nbsp;\n<div id=\"85e3\" style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><span style=\"font-family: courier new,courier,monospace;\"># Absolute correlation of each numeric column with the target\ntarget_corr = abs(cor(iris[, numeric_columns])[\"target\", ])<\/span><\/div>\n<p style=\"text-align: center;\"><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1ec70d1 elementor-widget elementor-widget-image\" data-id=\"1ec70d1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/freeze\/max\/1000\/1*_Kz5lVsseB9hzr2cQWjGZA.png?q=20\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d716d63 elementor-widget elementor-widget-text-editor\" data-id=\"d716d63\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<span style=\"background-color: #ffffff;\">As you can see, the most correlated one is the petal width, then comes the petal length and so on. The correlation of the target with itself is obviously 1.<\/span>\n<p id=\"a7aa\">Let\u2019s take a look at the plot of the target variable against the\u00a0<strong>petal width<\/strong>:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3f19d76 elementor-widget elementor-widget-image\" data-id=\"3f19d76\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/freeze\/max\/1000\/1*VA7X0bea8cEDZ7_PlyHKrw.png?q=20\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-618b645 elementor-widget elementor-widget-text-editor\" data-id=\"618b645\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAs you can see, higher values of petal width lead to 1 and lower values lead to 0. 
That\u2019s a clear correlation.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7dca5e3 elementor-widget elementor-widget-text-editor\" data-id=\"7dca5e3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"bc05\">Now, let\u2019s take a look at the plot of the target against the\u00a0<strong>sepal length<\/strong>, which has been classified as the least representative variable:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d4f0ccf elementor-widget elementor-widget-image\" data-id=\"d4f0ccf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/freeze\/max\/1000\/1*XGJt1vvAqc41iiJfD35tVQ.png?q=20\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6e225f6 elementor-widget elementor-widget-text-editor\" data-id=\"6e225f6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIt\u2019s clear that there is a wide region, approximately between 5.5 and 7, inside which we get 0 and 1 almost alternately. The lack of a\u00a0<strong style=\"background-color: rgba(0, 0, 0, 0.05);\">graphical pattern<\/strong>\u00a0is always a good reason to suspect the lack of\u00a0<strong style=\"background-color: rgba(0, 0, 0, 0.05);\">correlation<\/strong>.\n<p id=\"0406\">For the\u00a0<strong>categorical\u00a0<\/strong>case, we\u2019ll calculate the correlation between the target and the Species variable. 
Of course, we expect a strong correlation, because we have built the target as a direct function of the species.<\/p>\n<p id=\"570a\">I\u2019ll show you the single-line code and the results:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9ff73ce elementor-widget elementor-widget-image\" data-id=\"9ff73ce\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/freeze\/max\/1000\/1*aMyoO82Vhz4_Oiww1zLMpA.png?q=20\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c47120d elementor-widget elementor-widget-text-editor\" data-id=\"c47120d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe\u00a0<strong style=\"background-color: rgba(0, 0, 0, 0.05);\">table<\/strong>\u00a0function generates the contingency table, and the\u00a0<strong style=\"background-color: rgba(0, 0, 0, 0.05);\">chisq.test<\/strong>\u00a0function performs exactly the chi-square test we need for our case.\n<p id=\"6f75\">A very low p-value means a very strong difference from the uncorrelated case. 
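The single-line test can be sketched as follows; as a contrast (my addition, not in the article), shuffling the target destroys the association and pushes the p-value back up:

```r
# Sketch of the categorical test, with the target built as in the article
data('iris')
iris$target = ifelse(iris$Species == 'virginica', 1, 0)

# Single-line chi-square test on the Species/target contingency table:
# the p-value is essentially zero, flagging a very strong association.
chisq.test(table(iris$Species, iris$target))

# Contrast (my addition): with a randomly shuffled target the association
# disappears and the p-value is typically much larger.
set.seed(1)
chisq.test(table(iris$Species, sample(iris$target)))
```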
As usual in hypothesis testing, we never really\u00a0<strong>accept<\/strong>\u00a0a hypothesis: with such a low p-value, we simply\u00a0<strong>reject the null hypothesis<\/strong>\u00a0of no correlation.<\/p>\n<p id=\"ddae\">We can get further confirmation by taking a look at the contingency table:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-00f7496 elementor-widget elementor-widget-image\" data-id=\"00f7496\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/freeze\/max\/1000\/1*SxLBK301EG-P9Yj95TacjQ.png?q=20\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8a94fb7 elementor-widget elementor-widget-text-editor\" data-id=\"8a94fb7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAs you can see, the column values are\u00a0<strong style=\"background-color: rgba(0, 0, 0, 0.05);\">very unbalanced<\/strong>, which is exactly what we are looking for.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-39ff537 elementor-widget elementor-widget-heading\" data-id=\"39ff537\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"7bf1\">Conclusions<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-851f1c6 elementor-widget elementor-widget-text-editor\" data-id=\"851f1c6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p 
id=\"5aa2\">In this article, I\u2019ve shown you two simple techniques in R to measure the importance of numerical and categorical variables against a binary target. There are many more methods that can be used both with a multi-class categorical target and for a numerical target (i.e. regression).<\/p>\n<p id=\"9f62\">However, this simple procedure can be used to check at first the most important variables and start a deeper analysis to find the best set of predictors for our model.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>An example in R language of how to check feature relevance in a binary classification problem One of the main tasks that a data scientist must face when he builds a machine learning model is the selection of the\u00a0most predictive variables. Selecting predictors with low predictive power can lead, in fact, to overfitting or\u00a0low model<\/p>\n","protected":false},"author":618,"featured_media":2448,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[92],"ppma_author":[3328],"class_list":["post-2023","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-machine-learning"],"authors":[{"term_id":3328,"user_id":618,"is_guest":0,"slug":"gianluca-malato","display_name":"Gianluca Malato","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/04\/medium_918623b2-8f36-4110-8343-6fc9228595dd-150x150.jpg","user_url":"http:\/\/www.gianlucamalato.it\/","last_name":"Malato","first_name":"Gianluca","job_title":"","description":"Gianluca Malato is Data Scientist at Poste Italiane SPA.\u00a0 He is also a fiction author and software developer, Editor of\u00a0<a 
href=\"https:\/\/medium.com\/data-science-journal?source=follow_footer--------------------------follow_footer-\">Data Science Journal<\/a>,\u00a0<a href=\"https:\/\/medium.com\/the-trading-scientist?source=follow_footer--------------------------follow_footer-\">The Trading Scientist<\/a>, and\u00a0<a href=\"https:\/\/medium.com\/the-writers-notebook?source=follow_footer--------------------------follow_footer-\">The Writer\u2019s Notebook<\/a>. His books are available on <a href=\"https:\/\/www.amazon.com\/Gianluca-Malato\/e\/B076CHTG3W?ref=dbs_a_mng_rwt_scns_share\">Amazon<\/a>."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2023","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/618"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=2023"}],"version-history":[{"count":4,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2023\/revisions"}],"predecessor-version":[{"id":36330,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2023\/revisions\/36330"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/2448"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=2023"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=2023"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=2023"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=2023"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}