{"id":2028,"date":"2019-10-24T02:24:46","date_gmt":"2019-10-24T02:24:46","guid":{"rendered":"http:\/\/kusuaks7\/?p=1633"},"modified":"2024-03-08T13:03:15","modified_gmt":"2024-03-08T13:03:15","slug":"feature-selection-by-random-search-in-python","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/feature-selection-by-random-search-in-python\/","title":{"rendered":"Feature selection by random search in Python"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"2028\" class=\"elementor elementor-2028\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-75f025f2 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"75f025f2\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-2046421f\" data-id=\"2046421f\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-7a3c98c6 elementor-widget elementor-widget-text-editor\" data-id=\"7a3c98c6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>Feature selection<\/strong>\u00a0has always been a great task in machine learning. 
In my experience, feature selection\u00a0<strong>is often even more important<\/strong>\u00a0than model selection itself.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ac97b3a elementor-widget elementor-widget-heading\" data-id=\"ac97b3a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>Feature selection and collinearity<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b182804 elementor-widget elementor-widget-text-editor\" data-id=\"b182804\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p data-selectable-paragraph=\"\">I have already written an article\u00a0about feature selection. It was an\u00a0<strong>unsupervised\u00a0<\/strong>way to measure feature importance in a\u00a0<strong>binary classification<\/strong>\u00a0model, using Pearson\u2019s chi-square test and the correlation coefficient.<\/p>\n<p data-selectable-paragraph=\"\">Generally speaking, an unsupervised approach is often enough for simple feature selection. However, each model has\u00a0<strong>its own way<\/strong>\u00a0of \u201cthinking about\u201d the features and treating their correlation with the target variable. 
Moreover, there are models that do not care too much about\u00a0<strong>collinearity\u00a0<\/strong>(i.e., the correlation between the features) and other models that show\u00a0<strong>very big problems<\/strong>\u00a0when it occurs (for example, linear models).<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-31df585 elementor-widget elementor-widget-text-editor\" data-id=\"31df585\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p data-selectable-paragraph=\"\">Although it\u2019s possible to\u00a0<strong>rank\u00a0<\/strong>the features by some kind of\u00a0<strong>relevance<\/strong>\u00a0metric introduced by the model (for example, the p-value of the t-test performed on the coefficients of linear regression), taking only the most relevant variables may not be enough. Think about a feature that is equal to another one, just multiplied by two. The linear correlation between these features is 1, and this simple multiplication doesn\u2019t affect the correlation with the target variable, so if we take only the most relevant variables, we\u2019ll take both the original feature and the multiplied one. 
This leads to\u00a0<strong>collinearity<\/strong>, which can be quite dangerous for our model.<\/p>\n<p data-selectable-paragraph=\"\">That\u2019s why we need a better way to select our features.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0bb76d1 elementor-widget elementor-widget-heading\" data-id=\"0bb76d1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>Random search<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5d433a6 elementor-widget elementor-widget-text-editor\" data-id=\"5d433a6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p data-selectable-paragraph=\"\">Random search is a really useful tool in a data scientist\u2019s toolbox. It\u2019s a simple technique that is often used, for example, in cross-validation and\u00a0<strong>hyperparameter optimization<\/strong>.<\/p>\n<p data-selectable-paragraph=\"\">The idea is straightforward. 
If you have a multi-dimensional grid and want to look for the point on this grid which maximizes (or minimizes) some\u00a0<strong>objective function<\/strong>, random search works as follows:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-156b77f elementor-widget elementor-widget-text-editor\" data-id=\"156b77f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ol>\n \t<li>Take a random point on the grid and measure the objective function value.<\/li>\n \t<li>If the value is better than the best one achieved so far, keep the point in memory.<\/li>\n \t<li>Repeat for a pre-defined number of iterations.<\/li>\n<\/ol>\n<p data-selectable-paragraph=\"\">That\u2019s it: just generate random points and keep the best one.<\/p>\n<p data-selectable-paragraph=\"\">Is this a\u00a0<strong>good way<\/strong>\u00a0to find the global minimum (or maximum)? Of course, it\u2019s not. The point we are looking for is only one (if we are lucky) in a very large space, and we have only a limited number of iterations. The probability of hitting that single point in an\u00a0<em>N-<\/em>point grid is\u00a0<em>1\/N<\/em>\u00a0per iteration.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d72f5b3 elementor-widget elementor-widget-text-editor\" data-id=\"d72f5b3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p data-selectable-paragraph=\"\">So why is random search used so much? 
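Before answering, here is a minimal sketch of the three steps above. The grid, the objective function, and the iteration count are invented for this illustration; they are not part of the article's example.

```python
import random

# Hypothetical 100x100 grid of candidate points (an assumption for this sketch)
grid = [(x, y) for x in range(100) for y in range(100)]

def objective(point):
    # Toy objective to minimize; its true minimum sits at (42, 7)
    x, y = point
    return (x - 42) ** 2 + (y - 7) ** 2

random.seed(1)
best_point, best_value = None, float('inf')
for _ in range(300):                     # 3. repeat a pre-defined number of times
    point = random.choice(grid)          # 1. take a random point on the grid
    value = objective(point)
    if value < best_value:               # 2. keep it if it beats the best so far
        best_point, best_value = point, value
# After the loop, best_point holds the best point found (rarely the true optimum)
```

With only 300 draws from 10,000 grid points, the true minimum is usually missed, which illustrates why random search returns a good value rather than the optimal one.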
Because we\u00a0<strong>never really want<\/strong>\u00a0to maximize our performance measure; we want a good,\u00a0<strong>reasonably high value<\/strong>\u00a0that is not the highest possible, to avoid overfitting.<\/p>\n<p data-selectable-paragraph=\"\">That\u2019s why random search works and can be used in feature selection.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-003b758 elementor-widget elementor-widget-heading\" data-id=\"003b758\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>How to use random search for feature selection<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-797b2d5 elementor-widget elementor-widget-text-editor\" data-id=\"797b2d5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p data-selectable-paragraph=\"\">Random search can be used for feature selection with quite\u00a0<strong>good results<\/strong>. An example of a procedure similar to a random search is the\u00a0<strong>Random Forest<\/strong>\u00a0model, which performs a random selection of the features for each tree.<\/p>\n<p data-selectable-paragraph=\"\">The idea is pretty simple: choose the features\u00a0<strong>randomly<\/strong>, measure the model\u2019s performance by\u00a0<strong>k-fold cross-validation<\/strong>, and repeat many times. 
The feature combination that gives the best performance is the one we are looking for.<\/p>\n<p data-selectable-paragraph=\"\">More precisely, these are the steps to follow:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4b61c86 elementor-widget elementor-widget-text-editor\" data-id=\"4b61c86\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ol>\n \t<li>Generate a random integer number\u00a0<em>N<\/em>\u00a0between 1 and the number of features.<\/li>\n \t<li>Generate a random sequence of\u00a0<em>N<\/em>\u00a0integer numbers between 0 and the number of features minus 1, without repetition. This sequence represents our feature array. Remember that Python arrays start from 0.<\/li>\n \t<li>Train the model on these features and cross-validate it with k-fold cross-validation, saving the\u00a0<strong>average value\u00a0<\/strong>of some performance measure.<\/li>\n \t<li>Repeat from point 1 as many times as you want.<\/li>\n \t<li>Finally, get the feature array that gives the best performance according to the chosen performance measure.<\/li>\n<\/ol>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-77b8c88 elementor-widget elementor-widget-heading\" data-id=\"77b8c88\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>A practical example in Python<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-60afb8b elementor-widget elementor-widget-text-editor\" data-id=\"60afb8b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p 
data-selectable-paragraph=\"\">For this example, I\u2019ll use the\u00a0<strong>breast cancer dataset<\/strong>\u00a0included in the\u00a0<strong>sklearn<\/strong>\u00a0module. Our model will be a\u00a0<strong>logistic regression<\/strong>, and we\u2019ll perform a 5-fold cross-validation using\u00a0<strong>accuracy<\/strong>\u00a0as the performance measure.<\/p>\n<p data-selectable-paragraph=\"\">First of all, we must import the necessary modules.<\/p>\n\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><span style=\"font-family: courier new,courier,monospace;\">import sklearn.datasets\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import cross_val_score\nimport numpy as np<\/span><\/div>\n<p data-selectable-paragraph=\"\">Then we can load the breast cancer data and split it into input and target.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0e1f363 elementor-widget elementor-widget-text-editor\" data-id=\"0e1f363\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><span style=\"font-family: courier new,courier,monospace;\">dataset = sklearn.datasets.load_breast_cancer()\ndata = dataset.data\ntarget = dataset.target<\/span><\/div>\n<p data-selectable-paragraph=\"\">We can now create a logistic regression object.<\/p>\n\n<pre>lr = LogisticRegression()\n<\/pre>\n<p data-selectable-paragraph=\"\">Then, we can measure the average accuracy in k-fold CV with all the features.<\/p>\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><span style=\"font-family: courier new,courier,monospace;\"># Model accuracy using all the features\nnp.mean(cross_val_score(lr, data, target, cv=5, scoring=&quot;accuracy&quot;))\n# 
0.9509041939207385<\/span><\/div>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1b8ec0e elementor-widget elementor-widget-text-editor\" data-id=\"1b8ec0e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIt\u2019s about 95%. Let\u2019s keep this in mind.\n<p data-selectable-paragraph=\"\">Now, we can implement a random search with, for example, 300 iterations.<\/p>\n\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><span style=\"font-family: courier new,courier,monospace;\">result = []<\/span><\/div>\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><span style=\"font-family: courier new,courier,monospace;\"># Number of iterations\nN_search = 300<\/span><\/div>\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><span style=\"font-family: courier new,courier,monospace;\"># Random seed initialization for reproducibility\nnp.random.seed(1)<\/span><\/div>\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><span style=\"font-family: courier new,courier,monospace;\">for i in range(N_search):\n\u00a0\u00a0\u00a0 # Generate a random number of features between 1 and data.shape[1]\n\u00a0\u00a0\u00a0 N_columns = np.random.choice(range(data.shape[1])) + 1<\/span><\/div>\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><span style=\"font-family: courier new,courier,monospace;\">\u00a0\u00a0\u00a0 # Given the number of features, sample feature indices without replacement\n\u00a0\u00a0\u00a0 columns = list(np.random.choice(range(data.shape[1]), N_columns, replace=False))<\/span><\/div>\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><span style=\"font-family: courier new,courier,monospace;\">\u00a0\u00a0\u00a0 # Perform k-fold cross-validation\n\u00a0\u00a0\u00a0 scores = cross_val_score(lr, data[:, columns], target, cv=5, scoring=&quot;accuracy&quot;)<\/span><\/div>\n<div 
style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><span style=\"font-family: courier new,courier,monospace;\">\u00a0\u00a0\u00a0 # Store the result\nresult.append({&#8216;columns&#8217;:columns,&#8217;performance&#8217;:np.mean(scores)})<\/span><\/div>\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><span style=\"font-family: courier new,courier,monospace;\"># Sort the result array in descending order for performance measure\nresult.sort(key=lambda x : -x[&#8216;performance&#8217;])<\/span><\/div>\n&nbsp;\n<p data-selectable-paragraph=\"\">At the end of the loop and the sorting function, the first element of\u00a0<em>result\u00a0<\/em>list is the object we are looking for.<\/p>\n<p data-selectable-paragraph=\"\">We can use this value to calculate the new performance measure with this subset of the features.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8a8e553 elementor-widget elementor-widget-text-editor\" data-id=\"8a8e553\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><span style=\"font-family: courier new,courier,monospace;\">np.mean(cross_val_score(lr, data[:,result[0][\u2018columns\u2019]], target, cv=5, scoring=\u201daccuracy\u201d))\n# 0.9526741054251634<\/span><\/div>\n&nbsp;\n<p data-selectable-paragraph=\"\">As you can see, accuracy has increased.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a1ffda6 elementor-widget elementor-widget-heading\" data-id=\"a1ffda6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title 
elementor-size-default\"><h2>Conclusions<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1b313c6 elementor-widget elementor-widget-text-editor\" data-id=\"1b313c6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p data-selectable-paragraph=\"\">Random search can be a powerful tool to perform feature selection. It\u2019s not meant to explain why some features are more useful than others (as opposed to other feature selection procedures like Recursive Feature Elimination), but it can help reach\u00a0<strong>good results<\/strong>\u00a0in less time.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Random search is a really useful tool in a data scientist&rsquo;s toolbox. It&rsquo;s a very simple technique and can be a powerful tool to perform feature selection. It&rsquo;s not meant to explain why some features are more useful than others (as opposed to other feature selection procedures like Recursive Feature Elimination), but it can help reach&nbsp;good results&nbsp;in less time. 
Learn how to use a simple random search in Python to get good results in less time.<\/p>\n","protected":false},"author":618,"featured_media":2470,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[94],"ppma_author":[3328],"class_list":["post-2028","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":3328,"user_id":618,"is_guest":0,"slug":"gianluca-malato","display_name":"Gianluca Malato","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/04\/medium_918623b2-8f36-4110-8343-6fc9228595dd-150x150.jpg","user_url":"http:\/\/www.gianlucamalato.it\/","last_name":"Malato","first_name":"Gianluca","job_title":"","description":"Gianluca Malato is Data Scientist at Poste Italiane SPA.\u00a0 He is also a fiction author and software developer, Editor of\u00a0<a href=\"https:\/\/medium.com\/data-science-journal?source=follow_footer--------------------------follow_footer-\">Data Science Journal<\/a>,\u00a0<a href=\"https:\/\/medium.com\/the-trading-scientist?source=follow_footer--------------------------follow_footer-\">The Trading Scientist<\/a>, and\u00a0<a href=\"https:\/\/medium.com\/the-writers-notebook?source=follow_footer--------------------------follow_footer-\">The Writer\u2019s Notebook<\/a>. 
His books are available on <a href=\"https:\/\/www.amazon.com\/Gianluca-Malato\/e\/B076CHTG3W?ref=dbs_a_mng_rwt_scns_share\">Amazon<\/a>."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2028","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/618"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=2028"}],"version-history":[{"count":5,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2028\/revisions"}],"predecessor-version":[{"id":36313,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2028\/revisions\/36313"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/2470"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=2028"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=2028"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=2028"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=2028"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}