{"id":1115,"date":"2019-02-15T10:31:58","date_gmt":"2019-02-15T10:31:58","guid":{"rendered":"http:\/\/kusuaks7\/?p=720"},"modified":"2023-07-31T10:14:53","modified_gmt":"2023-07-31T10:14:53","slug":"dive-into-your-data-some-insight-about-an-efficient-variable-selection-process","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/dive-into-your-data-some-insight-about-an-efficient-variable-selection-process\/","title":{"rendered":"Dive into Your Data: Some Insight into an Efficient Variable Selection Process"},"content":{"rendered":"<p>As data scientists, one of the key ways we help our collaborators and clients is by sifting through large, complex collections of data to identify the underlying factors associated with a particular outcome (e.g. click-through rate, conversion rate, predicted revenue). In practical terms, identifying these predictive variables provides insight into how certain components of ad campaigns, customer demographics, etc., are related to outcomes. By having a better understanding of what these factors are, we can more accurately address and predict user needs.<\/p>\n<p>In statistical\/machine learning parlance this is commonly referred to as variable or model selection. If you have taken a stats course on linear models or searched for methods for variable selection, regression or classification, you have probably come across some of the staples, such as forward, backward and stepwise selection. While these approaches work well when there are a limited number of variables (on the order of ten or fewer), as we move to tens, hundreds, thousands and more (as in many web-scale applications) they are simply no longer practical or reliable. From a practical standpoint, filtering out irrelevant variables allows us to focus time and resources on the factors (e.g. ad campaigns, user demographics, the device our site was accessed from, etc.) 
that are driving sales, conversions, clicks, and other important metrics.<\/p>\n<p>I\u2019ll give a quick overview of these procedures, highlight some of their issues and introduce a more \u201cmodern\u201d framework that leverages so-called sparsity constraints, which many companies, including the likes of Yahoo and Google, use to wrangle data with many variables. For this post I\u2019ll focus on linear regression and show how to implement these various methods in R.<\/p>\n<h3>Classical Model Selection (this is by no means exhaustive!)<\/h3>\n<p>First recall that given a set of input measurements <strong><span style=\"font-size: 14px;\">x<span style=\"line-height: 17.3333320617676px;\"><sub>1<\/sub>, x<sub>2<\/sub>, x<sub>3<\/sub>, &#8230;, x<sub>p<\/sub><\/span><\/span><\/strong>\u00a0(e.g. customer demographics, the time of day a user logged on, etc.) where <strong>x<sub>j<\/sub> = (x<sub>1j<\/sub>, &#8230;, x<sub>Nj<\/sub>)<\/strong>, the standard linear model is expressed as:<\/p>\n<p><strong><span style=\"font-size: 14px;\">\u0177 =\u00a0\u00df<sub>0\u00a0<\/sub>+\u00a0\u00df<sub>1<\/sub>x<sub>1\u00a0<\/sub>+ <span style=\"line-height: 20.7999992370605px;\">\u00df<sub>2<\/sub>x<sub>2\u00a0<\/sub>+ &#8230; +\u00a0<\/span>\u00a0<span style=\"line-height: 20.7999992370605px;\">\u00df<sub>p<\/sub>x<sub>p<\/sub><\/span><\/span><\/strong><\/p>\n<p>where <strong>\u0177<\/strong>\u00a0is the predicted value of some outcome of interest (e.g. total sales for a particular user),\u00a0<strong><span style=\"font-size: 14px;\">\u00df<sub>0<\/sub><\/span><\/strong>\u00a0is the intercept term (representing, for example, average sales) and the\u00a0<span style=\"font-size: 14px;\"><strong>\u00df<sub>j<\/sub><\/strong><\/span>&#8217;s are the variable weights (i.e. coefficients representing each variable&#8217;s importance\/contribution to the observed outcome). 
The criterion by which we estimate the\u00a0<strong style=\"font-size: 14px; line-height: 22.3999996185303px;\">\u00df<sub>j<\/sub><\/strong>&#8217;s is to minimize the following loss function with respect to <strong style=\"line-height: 20.7999992370605px;\"><span style=\"font-size: 14px;\">\u00df<\/span><\/strong>:<\/p>\n<p><strong><em>RSS<\/em>(\u00df) = \u03a3<sub>i=1<\/sub><sup>N<\/sup> (y<sub>i<\/sub> \u2212 \u00df<sub>0<\/sub> \u2212 \u03a3<sub>j=1<\/sub><sup>p<\/sup> \u00df<sub>j<\/sub>x<sub>ij<\/sub>)<sup>2<\/sup><\/strong><\/p>\n<p>where the <strong>y<sub>i<\/sub><\/strong>&#8217;s are the observed outcomes. The subscript <em>i<\/em> denotes a single observation (e.g. one user&#8217;s visit to a site) and <em>RSS<\/em> stands for &#8220;residual sum of squares&#8221;.<\/p>\n<h3>Approaches to Variable Selection<\/h3>\n<p>Here are some of the classical approaches for assessing which variables to keep:<\/p>\n<p><em>Forward selection<\/em>: Starting with no variables in the model, add one at a time based on which is the most statistically significant. Significance is typically assessed based on one of several criteria: an individual variable\u2019s p-value, a comparison of the model with and without that variable (a comparison of the \u201cfull\u201d versus \u201creduced\u201d model that also results in a p-value) or the use of \u201cinformation criteria\u201d such as AIC and BIC. The algorithm continues adding variables and terminates when none of the remaining variables has a significant p-value (e.g. less than 0.05).<\/p>\n<p><em>Backward selection<\/em>: Similar to forward selection, except that now we start with all of the variables in the model and remove one variable at a time. Which variable is removed is, once again, most commonly determined by a p-value. The algorithm terminates when all remaining variables are significant and the non-significant ones have been removed.<\/p>\n<p><em>Stepwise selection<\/em>: A combination of the forward and backward procedures. 
There are a number of variants of this algorithm; the most common one follows a forward procedure, but after a new variable has been added, one backward selection step is taken to identify whether any variables should be removed.<\/p>\n<p>As previously mentioned, while the methods above work sufficiently well for a handful of variables, as we move to larger sets things start getting messy. For example, even at 20 variables, using something like the stepwise approach is not only comparatively slow but also highly unreliable. Consider the fact that we\u2019re only exploring a small subspace of the more than 1 million possible models (each of the 20 variables is either in or out of the model, giving 2<sup>20<\/sup> = 1,048,576 candidate subsets). This issue is rapidly compounded as we move to larger numbers. What we\u2019d ideally like is an approach that searches this enormous space of models more efficiently.<\/p>\n<h3>Illustration: Restaurant Revenue Prediction<\/h3>\n<p>To make this a little more concrete, consider the following example: a recently organized competition asked participants to predict restaurant revenue based on a collection of variables like location, demographics, etc. The data that was provided contained 137 restaurants and 42 variables. This data was to be used to build a model to predict revenue for 100,000 restaurants for which revenue information was \u201cmasked.\u201d Since the actual revenue for those 100,000 is not publicly available, we\u2019ll focus our analysis on the data provided.<\/p>\n<p>The first takeaway is to note the large number of variables compared to the number of restaurants (this imbalance is not uncommon in web-scale data where, although there may be billions of observations, there are also millions of variables). Generally speaking, these types of scenarios, if not carefully accounted for, tend to produce models that generalize poorly, i.e. 
do not provide accurate predictions outside the available data (commonly referred to as \u201coverfitting\u201d).<\/p>\n<p>The appeal of model selection procedures like those mentioned above is that they help us avoid overfitting by eliminating variables that are not predictive and end up introducing noise. This also allows for clearer insights into what\u2019s driving revenue. For someone looking into investing in a restaurant, this information might be valuable as it would tell them which factors are most critical for\/correlated with success\/growth.<\/p>\n<p>Below we include the R (<a href=\"http:\/\/cran.r-project.org\/\" rel=\"noopener\">http:\/\/cran.r-project.org\/<\/a>) code snippet (complete code will be made available on GitHub) showing the results for a model where all the variables are left in and another where variables have been selected using the stepwise procedure (we leave it to the reader to try the \u201cforward\u201d and \u201cbackward\u201d approaches).<\/p>\n<p>&nbsp;<\/p>\n<p>From the results above we can see a small bump in performance and a decrease in the number of variables left in the model, which drops from 42 to 29.<\/p>\n<h3>Sparsity-Constrained Model Selection<\/h3>\n<p>A more modern approach that\u2019s become a mainstay for analyzing data with many variables is the LASSO (Least Absolute Shrinkage and Selection Operator), also referred to as \u201csparse\u201d regression. The idea behind the LASSO is to leverage a class of mathematical constraints on the variable weights (i.e. coefficients) to more efficiently search through the space of models. From a practical standpoint this gives us the ability to take a large collection of variables and home in on those that are most predictive of\/relevant to the outcome of interest, e.g. revenue. 
This constraint takes the form:<\/p>\n<p><strong>|\u00df<sub>1<\/sub>| + |\u00df<sub>2<\/sub>| + &#8230; + |\u00df<sub>p<\/sub>| \u2264 <em>t<\/em><\/strong><\/p>\n<p>where the left-hand side of this inequality is the <strong><em>l<sub>1<\/sub><\/em><\/strong>-norm of the coefficient vector <strong>\u00df = (<span style=\"line-height: 20.7999992370605px;\">\u00df<sub>1<\/sub>, \u00df<sub>2<\/sub>, &#8230;, \u00df<sub>p<\/sub>)<\/span><\/strong> and <em>t<\/em> is a tuning parameter that bounds it. At first glance it&#8217;s not exactly intuitive how or why this helps with model selection. To gain some insight into what this constraint is doing, it&#8217;ll be helpful to look at a picture of what the constraint looks like compared to the standard <em>RSS<\/em> criterion.<\/p>\n<p>&nbsp;<\/p>\n<p>What&#8217;s being shown here is the following: the labeled point at the center of the red contours is the <em>unconstrained<\/em> least squares estimate, and the red contours surrounding it (imagine that these are coming &#8220;at you&#8221;, like in a topographic map) are the corresponding <em>RSS<\/em> values for other values of <strong><span style=\"line-height: 20.7999992370605px;\">\u00df<\/span><\/strong> (not the least squares estimate). The diamond region centered at the origin is the region associated with the <strong><em>l<sub>1<\/sub><\/em><\/strong> constraint. In order for this constraint to be satisfied, the <em>RSS<\/em> contours <em>must<\/em> make contact with this diamond region. The main takeaway is that the \u201csharp corners\u201d of this diamond lie on the coordinate axes, so when the contours first touch the diamond at a corner, some of the <strong><span style=\"line-height: 20.7999992370605px;\">\u00df<sub>j<\/sub><\/span><\/strong>&#8217;s will have a value of exactly 0. This effectively \u201cremoves\u201d these variables from our model.<\/p>\n<p>By decreasing\/increasing the value of <em>t<\/em> in the above constraint, we force more\/fewer coefficients to 0. Various procedures exist to select a specific <em>t<\/em>, e.g. Mallows&#8217; C<sub>p<\/sub>, a procedure that looks to balance model complexity (i.e. 
the number of coefficients) with how well the included variables predict the outcome.<\/p>\n<p>There are a number of algorithms that allow us to solve this problem quickly, efficiently and scalably. Considerable theory has also been developed to show that under certain conditions the LASSO model (and some variants of it we\u2019ll discuss in future posts) can recover the \u201ctrue\u201d set of predictive variables.<\/p>\n<p>The big takeaway here is that for large, complex problems there are some powerful modeling tools to help us gain insight into what exactly is going on. Returning to the restaurant revenue prediction problem and applying the LASSO model, we have the following:<\/p>\n<h3>Restaurant Revenue Prediction Revisited<\/h3>\n<p>&nbsp;<\/p>\n<p>This is a considerable improvement (~4x) in performance over the standard linear model and the stepwise procedure. Additionally, the total number of variables remaining in the model is down to 13, allowing us to focus our efforts and dig deeper into the underlying factors driving revenue.<\/p>\n<p>The standard LASSO just described is pretty powerful in and of itself, but a variety of extensions allow us to look for other, more complex sparsity patterns. In the next post I\u2019ll discuss group LASSO, which allows us to capture sparsity patterns in things like multi-level categorical variables (e.g. the type of device a user accesses a web page from).<\/p>\n<p>NB: Typically such an analysis would involve repeatedly resampling which observations fall into the \u201ctraining\u201d and \u201ctesting\u201d sets to more reliably estimate model prediction performance. 
What&#8217;s presented here is for illustrative purposes only.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>How you can efficiently select the appropriate model for your data using various variable\/model selection methods.<\/p>\n","protected":false},"author":14,"featured_media":4121,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[187],"tags":[140],"ppma_author":[2473],"class_list":["post-1115","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-predictive-analytics"],"authors":[{"term_id":2473,"user_id":14,"is_guest":0,"slug":"daniel-samarov","display_name":"Daniel Samarov","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/Daniel-Samarov-150x150.png","author_category":"","user_url":"","last_name":"Samarov","first_name":"Daniel","job_title":"","description":"Dan received his Ph.D. in Statistics from the University of North Carolina at Chapel Hill. He has over 15 years of experience working in the areas of machine learning, statistics, and data science on applications ranging from the biological\/physical sciences to marketing\/advertising, healthcare, retail, and the internet of things. He has worked as an independent consultant for six years and in 2018 cofounded Solid Data, a full-stack data science consultancy. 
In that role he has helped clients define and execute data science and AI strategies, extract valuable insights from their data to deliver better user experiences, expand core business competencies and drive business opportunities through the development and deployment of robust, scalable, production-ready data products."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1115","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/14"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1115"}],"version-history":[{"count":0,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1115\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/4121"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1115"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1115"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1115"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1115"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}