{"id":1634,"date":"2019-04-12T03:56:41","date_gmt":"2019-04-12T03:56:41","guid":{"rendered":"http:\/\/kusuaks7\/?p=1239"},"modified":"2023-08-08T08:43:25","modified_gmt":"2023-08-08T08:43:25","slug":"smarter-ways-to-encode-categorical-data-for-machine-learning","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/smarter-ways-to-encode-categorical-data-for-machine-learning\/","title":{"rendered":"Smarter Ways to Encode Categorical Data for Machine Learning"},"content":{"rendered":"<p id=\"8307\">Better encoding of categorical data can mean better model performance. In this series, I\u2019ll introduce you to a wide range of encoding options from the\u00a0Category Encoders package\u00a0for use with scikit-learn in Python.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*N2XdDMI0aqRRB6Odcpk50Q.jpeg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*N2XdDMI0aqRRB6Odcpk50Q.jpeg\" \/><\/p>\n<p style=\"text-align: center;\">Enigma for\u00a0encoding<\/p>\n<h3 id=\"b12a\">TL;DR;<\/h3>\n<p id=\"6fb0\">Use Category Encoders to improve model performance when you have nominal or ordinal data that may provide value.<\/p>\n<p id=\"04d0\">For nominal columns try OneHot, Hashing, LeaveOneOut, and Target encoding. Avoid OneHot for high cardinality columns and decision tree-based algorithms.<\/p>\n<p id=\"6587\">For ordinal columns try Ordinal (Integer), Binary, OneHot, LeaveOneOut, and Target. 
Helmert, Sum, BackwardDifference and Polynomial are less likely to be helpful, but if you have time or a theoretical reason you might want to try them.<\/p>\n<p id=\"7ccc\">For regression tasks, Target and LeaveOneOut probably won\u2019t work well.<\/p>\n<h3 id=\"4c49\">Roadmap<\/h3>\n<figure id=\"9ce4\"><canvas width=\"75\" height=\"47\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 466px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*_ROiCDwhoeL2pvpA5cWEmA.jpeg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*_ROiCDwhoeL2pvpA5cWEmA.jpeg\" \/><\/figure>\n<p style=\"text-align: center;\">Map<\/p>\n<p id=\"51a9\">In this article we\u2019ll discuss terms, general usage, and five classic encoding options: Ordinal, One Hot, Binary, BaseN, and Hashing. In the future I may evaluate Bayesian encoders and contrast encoders with roots in statistical hypothesis testing.<\/p>\n<p id=\"b0e0\">In an <a href=\"https:\/\/www.experfy.com\/blog\/seven-data-types-a-better-way-to-think-about-data-types-for-machine-learning\">earlier article<\/a>,\u00a0I argued we should classify data as one of seven types to make better models faster. Here are the seven data types:<\/p>\n<p id=\"2505\">Useless\u200a\u2014\u200auseless for machine learning algorithms, that is\u200a\u2014\u200adiscrete<br \/>\nNominal\u200a\u2014\u200agroups without order\u200a\u2014\u200adiscrete<br \/>\nBinary\u200a\u2014\u200aeither\/or\u200a\u2014\u200adiscrete<br \/>\nOrdinal\u200a\u2014\u200agroups with order\u200a\u2014\u200adiscrete<br \/>\nCount\u200a\u2014\u200athe number of occurrences\u200a\u2014\u200adiscrete<br \/>\nTime\u200a\u2014\u200acyclical numbers with a temporal component\u200a\u2014\u200acontinuous<br \/>\nInterval\u200a\u2014\u200apositive and\/or negative numbers without a temporal component\u200a\u2014\u200acontinuous<\/p>\n<p id=\"42de\">Here we\u2019re concerned with encoding nominal and ordinal data. 
A column with nominal data has values that cannot be ordered in any meaningful way. Nominal data is most often one-hot (aka dummy) encoded, but there are many options that might perform better for machine learning.<\/p>\n<figure id=\"d5a2\"><canvas width=\"75\" height=\"37\"><\/canvas><img decoding=\"async\" style=\"width: 640px; height: 320px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*2pEYsGuTjA2pN-UjjOjXxw.jpeg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*2pEYsGuTjA2pN-UjjOjXxw.jpeg\" \/><\/figure>\n<p style=\"text-align: center;\">Rank<\/p>\n<p id=\"b72f\">In contrast, ordinal data can be rank ordered. Ordinal data can be encoded one of three ways, broadly speaking, but I think it\u2019s safe to say that its encoding is often not carefully considered.<\/p>\n<ol>\n<li id=\"99c8\">It can be assumed to be close enough to interval data\u200a\u2014\u200awith relatively equal magnitudes between the values\u200a\u2014\u200ato treat it as such. Social scientists make this assumption all the time with Likert scales. For example, \u201cOn a scale from 1 to 7, 1 being extremely unlikely, 4 being neither likely nor unlikely, and 7 being extremely likely, how likely are you to recommend this movie to a friend?\u201d Here the difference between 3 and 4 and the difference between 6 and 7 can be reasonably assumed to be similar.<\/li>\n<li id=\"c66b\">It can be treated as nominal data, where each category has no numeric relationship to another. One-hot encoding and other encodings appropriate for nominal data make sense here.<\/li>\n<li id=\"e4b6\">The magnitude of the difference between the numbers can be ignored. You can just train your model with different encodings and see which encoding works best.<\/li>\n<\/ol>\n<p id=\"e02c\">In this series we\u2019ll look at Category Encoders\u2019 11 encoders as of version 1.2.8. 
<strong>Update: Version 1.3.0 is the latest version on PyPI as of April 11, 2019.<\/strong><\/p>\n<p id=\"6e9d\">Many of these encoding methods go by more than one name in the statistics world, and sometimes one name can mean different things. We\u2019ll follow the Category Encoders usage.<\/p>\n<p id=\"0457\">Big thanks to\u00a0<a href=\"http:\/\/www.willmcginnis.com\/2015\/11\/29\/beyond-one-hot-an-exploration-of-categorical-variables\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"http:\/\/www.willmcginnis.com\/2015\/11\/29\/beyond-one-hot-an-exploration-of-categorical-variables\/\">Will McGinnis<\/a>\u00a0for creating and maintaining this package. It is largely derived from StatsModels\u2019\u00a0<a href=\"https:\/\/patsy.readthedocs.io\/en\/latest\/API-reference.html\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/patsy.readthedocs.io\/en\/latest\/API-reference.html\">Patsy package<\/a>, which in turn is based on this\u00a0<a href=\"https:\/\/stats.idre.ucla.edu\/r\/library\/r-library-contrast-coding-systems-for-categorical-variables\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/stats.idre.ucla.edu\/r\/library\/r-library-contrast-coding-systems-for-categorical-variables\/\">UCLA statistics reference<\/a>.<\/p>\n<p id=\"8dcf\">There are an infinite number of ways to encode categorical information. The ones in Category Encoders should be sufficient for most uses.<\/p>\n<h3 id=\"d269\">Quick Summary<\/h3>\n<p id=\"4d10\">Here\u2019s the list of Category Encoders functions with their descriptions and the type of data they would be most appropriate to encode.<\/p>\n<h4 id=\"a0b4\">Classic Encoders<\/h4>\n<p id=\"e7d2\">The first group of five classic encoders can be seen on a continuum of embedding information in one column (Ordinal) up to\u00a0<em>k<\/em>\u00a0columns (OneHot). 
These are very useful encodings for machine learning practitioners to understand.<\/p>\n<p id=\"90aa\"><strong><em>Ordinal\u200a<\/em><\/strong>\u2014\u200aconvert string labels to integer values 1 through\u00a0<em>k<\/em>. Ordinal.<br \/>\n<strong><em>OneHot<\/em>\u200a<\/strong>\u2014\u200aone column for each value to compare vs. all other values. Nominal, ordinal.<br \/>\n<strong><em>Binary<\/em><\/strong>\u200a\u2014\u200aconvert each integer to binary digits. Each binary digit gets one column. Some info loss but fewer dimensions. Ordinal.<br \/>\n<strong><em>BaseN\u200a<\/em><\/strong>\u2014\u200aOrdinal, Binary, or higher encoding. Nominal, ordinal. Doesn\u2019t add much functionality. Probably avoid.<br \/>\n<strong><em>Hashing<\/em>\u200a<\/strong>\u2014\u200aLike OneHot but fewer dimensions, some info loss due to collisions. Nominal, ordinal.<\/p>\n<h4 id=\"3232\">Contrast Encoders<\/h4>\n<p id=\"ee1a\">The five contrast encoders all have multiple issues that I argue make them unlikely to be useful for machine learning. They all output one column for each column value. I would avoid them in most cases. 
Their\u00a0<a href=\"http:\/\/www.willmcginnis.com\/2015\/11\/29\/beyond-one-hot-an-exploration-of-categorical-variables\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"http:\/\/www.willmcginnis.com\/2015\/11\/29\/beyond-one-hot-an-exploration-of-categorical-variables\/\" data->stated intents<\/a>\u00a0are below.<\/p>\n<p id=\"022d\"><strong><em>Helmert<\/em><\/strong><em>\u00a0(reverse)<\/em>\u200a\u2014\u200aThe mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels.<br \/>\n<strong><em>Sum<\/em><\/strong><em>\u200a<\/em>\u2014\u200acompares the mean of the dependent variable for a given level to the overall mean of the dependent variable over all the levels.<br \/>\n<strong><em>Backward Difference<\/em><\/strong>\u200a\u2014\u200athe mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level.<br \/>\n<strong><em>Polynomial<\/em><\/strong><em>\u200a<\/em>\u2014\u200aorthogonal polynomial contrasts. The coefficients taken on by polynomial coding for k=4 levels are the linear, quadratic, and cubic trends in the categorical variable.<\/p>\n<h4 id=\"0942\">Bayesian Encoders<\/h4>\n<p id=\"bea0\">The Bayesian encoders use information from the dependent variable in their encodings. They output one column and can work well with high cardinality data.<\/p>\n<p id=\"acc4\"><strong><em>Target<\/em><\/strong>\u200a\u2014\u200ause the mean of the DV, must take steps to avoid overfitting\/ response leakage. Nominal, ordinal. For classification tasks.<br \/>\n<strong><em>LeaveOneOut<\/em><\/strong><em>\u200a<\/em>\u2014\u200asimilar to target but avoids contamination. Nominal, ordinal. For classification tasks.<br \/>\n<strong><em>WeightOfEvidence<\/em><\/strong><em>\u200a<\/em>\u2014\u200aadded in v1.3. 
Not documented in the\u00a0<a href=\"http:\/\/contrib.scikit-learn.org\/categorical-encoding\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"http:\/\/contrib.scikit-learn.org\/categorical-encoding\/\" data->docs<\/a>\u00a0as of April 11, 2019. The method is explained in\u00a0<a href=\"https:\/\/www.listendata.com\/2015\/03\/weight-of-evidence-woe-and-information.html\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/www.listendata.com\/2015\/03\/weight-of-evidence-woe-and-information.html\" data->this post<\/a>.<br \/>\n<strong><em>James-Stein<\/em><\/strong>\u200a\u2014\u200aforthcoming in v1.4. Described in the code\u00a0<a href=\"https:\/\/github.com\/scikit-learn-contrib\/categorical-encoding\/blob\/master\/category_encoders\/james_stein.py\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/github.com\/scikit-learn-contrib\/categorical-encoding\/blob\/master\/category_encoders\/james_stein.py\" data->here<\/a>.<br \/>\n<strong><em>M-estimator\u200a<\/em><\/strong>\u2014\u200aforthcoming in v1.4. Described in the code\u00a0<a href=\"https:\/\/github.com\/scikit-learn-contrib\/categorical-encoding\/blob\/master\/category_encoders\/m_estimate.py\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/github.com\/scikit-learn-contrib\/categorical-encoding\/blob\/master\/category_encoders\/m_estimate.py\" data->here<\/a>. Simplified target encoder.<\/p>\n<h4 id=\"720c\">Use<\/h4>\n<p id=\"15a8\">Category Encoders follow the same API as sklearn\u2019s preprocessors. They have some added conveniences, such as the ability to easily add an encoder to a pipeline. Additionally, the encoder returns a pandas DataFrame if a DataFrame is passed to it. Here\u2019s an example of the code with the BinaryEncoder:<\/p>\n<p id=\"7a90\">We\u2019ll tackle a few gotchas with implementation in the future. 
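The BinaryEncoder example promised just above was an embedded image in the original post. The real call is simply the encoder's sklearn-style fit\/transform, e.g. `ce.BinaryEncoder(cols=['color']).fit_transform(X)`. To make that contract concrete, here is a standard-library sketch; the `ToyBinaryEncoder` class and its column names are illustrative stand-ins, not the package's actual implementation:

```python
# Illustrative stand-in for ce.BinaryEncoder(cols=['color']).fit_transform(X):
# same fit/transform contract, implemented with the standard library only.

class ToyBinaryEncoder:
    """Map each unique value to an ordinal integer, then spread its bits
    across one new column per binary digit."""

    def __init__(self, col):
        self.col = col
        self.mapping_ = {}

    def fit(self, rows):
        # Assign integers 1..k in order of first appearance, as OrdinalEncoder does.
        for row in rows:
            self.mapping_.setdefault(row[self.col], len(self.mapping_) + 1)
        self.n_bits_ = max(self.mapping_.values()).bit_length()
        return self

    def transform(self, rows):
        out = []
        for row in rows:
            bits = format(self.mapping_[row[self.col]], f"0{self.n_bits_}b")
            new_row = {k: v for k, v in row.items() if k != self.col}
            new_row.update({f"{self.col}_{i}": int(b) for i, b in enumerate(bits)})
            out.append(new_row)
        return out

    def fit_transform(self, rows):
        return self.fit(rows).transform(rows)

rows = [{"color": c} for c in ["a", "b", "b", "c"]]
encoded = ToyBinaryEncoder("color").fit_transform(rows)
# a -> 1 -> 01, b -> 2 -> 10, c -> 3 -> 11: two bit columns cover three values
```

With a real DataFrame you would call the package's BinaryEncoder instead and get a DataFrame back, as described above.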
But you should be able to jump right into the first five if you are familiar with scikit-learn\u2019s API.<\/p>\n<p id=\"7be0\">Note that all Category Encoders impute missing values automatically by default. However, I recommend filling missing data yourself prior to encoding so you can test the results of several methods. I plan to discuss imputing options in a forthcoming article, so follow\u00a0<a href=\"https:\/\/medium.com\/@jeffhale\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/medium.com\/@jeffhale\">me<\/a>\u00a0on Medium if you want to make sure you don\u2019t miss it.<\/p>\n<h3 id=\"7125\">Terminology<\/h3>\n<p id=\"a4ff\">You might see commentators use the following terms interchangeably:\u00a0<em>dimension<\/em>,\u00a0<em>feature<\/em>,\u00a0<em>vector<\/em>,\u00a0<em>series<\/em>,\u00a0<em>independent variable<\/em>, and\u00a0<em>column<\/em>. I will too\u00a0\ud83d\ude42 Similarly, you might see\u00a0<em>row<\/em>\u00a0and\u00a0<em>observation<\/em>\u00a0used interchangeably.<\/p>\n<p id=\"891d\"><em>k<\/em>\u00a0is the original number of unique values in your data column.\u00a0<em>High cardinality<\/em>\u00a0means a lot of unique values (a large\u00a0<em>k<\/em>). A column with hundreds of zip codes is an example of a high-cardinality feature.<\/p>\n<figure id=\"814a\"><canvas width=\"75\" height=\"62\"><\/canvas><img decoding=\"async\" style=\"width: 640px; height: 544px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*VXFoPAllLJYd5n1Dk4TH_A.jpeg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*VXFoPAllLJYd5n1Dk4TH_A.jpeg\" \/><\/figure>\n<p style=\"text-align: center;\">High cardinality theme\u00a0bird<\/p>\n<p id=\"cd3a\"><em>High dimensionality<\/em>\u00a0means a matrix with many dimensions. 
High dimensionality comes with the Curse of Dimensionality\u200a\u2014\u200aa thorough treatment of this topic can be found\u00a0<a href=\"http:\/\/www.visiondummy.com\/2014\/04\/curse-dimensionality-affect-classification\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"http:\/\/www.visiondummy.com\/2014\/04\/curse-dimensionality-affect-classification\/\" data->here<\/a>. The take away is that high dimensionality requires many observations and often results in overfitting.<\/p>\n<figure id=\"e781\"><canvas width=\"75\" height=\"47\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*sR1o088FB_07s16qZfDlDg.jpeg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*sR1o088FB_07s16qZfDlDg.jpeg\" \/><\/figure>\n<p style=\"text-align: center;\">A wand to help ward off the Curse of Dimensionality<\/p>\n<p id=\"fe27\"><em>Sparse<\/em>\u00a0data is a matrix with lots of zeroes relative to other values. If your encoders transform your data so that it becomes sparse, some algorithms may not work well. Sparsity can often be managed by flagging it, but many algorithms don\u2019t work well unless the data is dense.<\/p>\n<figure id=\"34ba\"><canvas width=\"75\" height=\"47\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*Nf_5RSDQ7fG2dCbQC8PVbQ.jpeg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*Nf_5RSDQ7fG2dCbQC8PVbQ.jpeg\" \/><\/figure>\n<p style=\"text-align: center;\">Sparse<\/p>\n<h3 id=\"63d5\">Digging Into Category\u00a0Encoders<\/h3>\n<p id=\"51d2\">Without further ado, let\u2019s encode!<\/p>\n<h4 id=\"dbae\">Ordinal<\/h4>\n<p id=\"d0e7\">OrdinalEncoder converts each string value to a whole number. The first unique value in your column becomes 1, the second becomes 2, the third becomes 3, and so on.<\/p>\n<p id=\"e6eb\">What the actual value was prior to encoding does not affect what it becomes when you\u00a0<em>fit_transform<\/em>\u00a0with OrdinalEncoder. 
The first value could have been 10 and the second value could have been 3. Now they will be 1 and 2, respectively.<\/p>\n<p id=\"a3cb\">If the column contains nominal data, stopping after you use OrdinalEncoder is a bad idea. Your machine learning algorithm will treat the variable as continuous and assume the values are on a meaningful scale. Instead, if you have a column with the values\u00a0<em>car, bus,\u00a0<\/em>and\u00a0<em>truck<\/em>, you should first encode this nominal data using OrdinalEncoder. Then encode it again using one of the methods appropriate to nominal data that we\u2019ll explore below.<\/p>\n<p id=\"8091\">In contrast, if your column values are truly ordinal, that means that the integer assigned to each value is meaningful. Assignment should be done with intention. Say your column had the string values \u201cFirst\u201d, \u201cThird\u201d, and \u201cSecond\u201d in it. Those values should be mapped to the corresponding integers by passing OrdinalEncoder a list of dicts like so:<\/p>\n<p id=\"ed3f\"><span style=\"font-family: courier new,courier,monospace;\">[{\u2018col\u2019: \u2018finished_race_order\u2019,<br \/>\n\u2018mapping\u2019: [(\u2018First\u2019, 1),<br \/>\n(\u2018Second\u2019, 2),<br \/>\n(\u2018Third\u2019, 3)]<br \/>\n}]<\/span><\/p>\n<p id=\"4511\">Here\u2019s the basic setup for all the code samples to follow. 
You can get the full notebook at\u00a0<a href=\"https:\/\/www.kaggle.com\/discdiver\/category-encoders-examples\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/www.kaggle.com\/discdiver\/category-encoders-examples\" data->this Kaggle Kernel<\/a>.<\/p>\n<figure id=\"23a8\"><canvas width=\"75\" height=\"30\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*JqIGQLCCc9bFgUk-0RJb4w.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*JqIGQLCCc9bFgUk-0RJb4w.png\" \/><\/figure>\n<p id=\"cb58\">Here\u2019s the untransformed X column.<\/p>\n<figure id=\"29cd\"><canvas width=\"75\" height=\"30\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*yFQLITYTvlt7F753_75FAw.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*yFQLITYTvlt7F753_75FAw.png\" \/><\/figure>\n<p id=\"8894\">And here\u2019s the OrdinalEncoder code to transform the\u00a0<em>color<\/em>\u00a0column values from letters to integers.<\/p>\n<figure id=\"9b9c\"><canvas width=\"75\" height=\"32\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*0iLj4G7TPcmLRl4HjRg5Ug.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*0iLj4G7TPcmLRl4HjRg5Ug.png\" \/><\/figure>\n<p id=\"4f98\">All the string values are now integers.<\/p>\n<p id=\"3e9e\">Sklearn\u2019s LabelEncoder does pretty much the same thing as Category Encoder\u2019s OrdinalEncoder, but is not quite as user friendly. LabelEncoder won\u2019t return a DataFrame, instead it returns a numpy array if you pass a DataFrame. It also outputs values starting with 0, compared to OrdinalEncoder\u2019s default of outputting values starting with 1.<\/p>\n<p id=\"04fc\">You could accomplish ordinal encoding by mapping string values to integers manually with\u00a0<em>apply<\/em>. 
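A standard-library trace of that manual approach, using the race-order mapping from above (the `ordinal_encode` helper is a hypothetical illustration, not the package's API):

```python
# Manual ordinal encoding with an explicit, intentional mapping, mirroring
# the list-of-dicts shown for OrdinalEncoder above.

mapping = [{"col": "finished_race_order",
            "mapping": [("First", 1), ("Second", 2), ("Third", 3)]}]

def ordinal_encode(rows, mapping):
    # Build a per-column lookup table, then replace each mapped value.
    lookup = {m["col"]: dict(m["mapping"]) for m in mapping}
    return [{col: lookup.get(col, {}).get(val, val) for col, val in row.items()}
            for row in rows]

rows = [{"finished_race_order": "Third"},
        {"finished_race_order": "First"}]
encoded = ordinal_encode(rows, mapping)
# -> [{'finished_race_order': 3}, {'finished_race_order': 1}]
```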
But that\u2019s extra work once you know how to use Category Encoders.<\/p>\n<h4 id=\"71d5\">OneHot<\/h4>\n<p id=\"ada3\">One-hot encoding is the classic approach to dealing with nominal, and maybe ordinal, data. It\u2019s referred to as the \u201cThe Standard Approach for Categorical Data\u201d in Kaggle\u2019s\u00a0<a href=\"https:\/\/www.kaggle.com\/dansbecker\/using-categorical-data-with-one-hot-encoding\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/www.kaggle.com\/dansbecker\/using-categorical-data-with-one-hot-encoding\" data->Machine Learning tutorial series<\/a>. It also goes by the\u00a0<a href=\"https:\/\/stats.stackexchange.com\/questions\/308916\/what-is-one-hot-encoding-called-in-scientific-literature\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/stats.stackexchange.com\/questions\/308916\/what-is-one-hot-encoding-called-in-scientific-literature\" data->names<\/a><em>dummy\u00a0<\/em>encoding,\u00a0<em>indicator\u00a0<\/em>encoding, and occasionally\u00a0<em>binary<\/em>\u00a0encoding. Yes, this is confusing.<\/p>\n<figure id=\"b394\"><canvas width=\"75\" height=\"70\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*L5YLQUYFuZ5KXD_LQr167g.jpeg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*L5YLQUYFuZ5KXD_LQr167g.jpeg\" \/><\/figure>\n<p style=\"text-align: center;\">That\u2019s one hot\u00a0sun<\/p>\n<p id=\"2d31\">The one-hot encoder creates one column for each value to compare against all other values. For each new column, a row gets a 1 if the row contained that column\u2019s value and a 0 if it did not. 
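That rule is small enough to sketch with the standard library (the `one_hot` helper and the `color_` column names are illustrative, not the encoder's exact output format):

```python
# One-hot encoding by hand: one indicator column per unique value,
# with a 1 in the column matching the row's value and 0 elsewhere.
def one_hot(values, prefix="color"):
    levels = sorted(set(values))
    return [{f"{prefix}_{lvl}": int(v == lvl) for lvl in levels} for v in values]

encoded = one_hot(["a", "c", "a", "b"])
# Each row gets exactly one 1 across the three new columns.
```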
Here\u2019s how it looks:<\/p>\n<figure id=\"7b61\"><canvas width=\"75\" height=\"35\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*h8yQCZY1A6pKXlTH6nMe3g.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*h8yQCZY1A6pKXlTH6nMe3g.png\" \/><\/figure>\n<p id=\"dac7\"><em>color_-1<\/em>\u00a0is actually an extraneous column, because it\u2019s all 0s\u200a\u2014\u200awith no variation it\u2019s not helping your model learn anything. It may have been intended for missing values, but in version 1.2.8 of Category Encoders it isn\u2019t doing anything. However, it\u2019s only adding one column so it\u2019s not really a big deal for performance.<\/p>\n<p id=\"49a1\">One-hot encoding can perform very well, but the number of new features is equal to\u00a0<em>k,<\/em>\u00a0the number of unique values. This feature expansion can create serious memory problems if your data set has high cardinality features. One-hot-encoded data can also be difficult for decision-tree-based algorithms\u200a\u2014\u200asee discussion\u00a0<a href=\"https:\/\/roamanalytics.com\/2016\/10\/28\/are-categorical-variables-getting-lost-in-your-random-forests\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/roamanalytics.com\/2016\/10\/28\/are-categorical-variables-getting-lost-in-your-random-forests\/\" data->here<\/a>.<\/p>\n<p id=\"dca3\">The pandas\u00a0<a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/generated\/pandas.get_dummies.html\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/generated\/pandas.get_dummies.html\" data->GetDummies<\/a>\u00a0and sklearn\u00a0<a href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.OneHotEncoder.html\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.OneHotEncoder.html\" 
>OneHotEncoder<\/a>\u00a0functions perform the same role as Category Encoders\u2019 OneHotEncoder. I find the Category Encoders version a bit nicer to use.<\/p>\n<h4 id=\"1022\">Binary<\/h4>\n<p id=\"e06e\">Binary can be thought of as a hybrid of one-hot and hashing encoders. Binary creates fewer features than one-hot, while preserving some uniqueness of values in the column. It can work well with higher-cardinality ordinal data.<\/p>\n<figure id=\"5fee\"><canvas width=\"75\" height=\"52\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*YhFBGHfz6oPY0b8GhaVuaA.jpeg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*YhFBGHfz6oPY0b8GhaVuaA.jpeg\" \/><\/figure>\n<p style=\"text-align: center;\">Binary<\/p>\n<p id=\"2dd5\">Here\u2019s how it works:<\/p>\n<ul>\n<li id=\"97aa\">The categories are encoded by OrdinalEncoder if they aren\u2019t already in numeric form.<\/li>\n<li id=\"4a98\">Then those integers are converted into binary code, so, for example, 5 becomes 101 and 10 becomes 1010.<\/li>\n<li id=\"b5cf\">Then the digits from that binary string are split into separate columns. So if there are 4\u20137 values in an ordinal column, then 3 new columns are created: one for the first bit, one for the second, and one for the third.<\/li>\n<li id=\"f911\">Each observation is encoded across the columns in its binary form.<\/li>\n<\/ul>\n<p id=\"3175\">Here\u2019s how it looks:<\/p>\n<figure id=\"e86f\"><canvas width=\"75\" height=\"32\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*fVZyy09cfDzYQTvILCqtEQ.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*fVZyy09cfDzYQTvILCqtEQ.png\" \/><\/figure>\n<p id=\"b27f\">The first column has no variance, so it isn\u2019t doing anything to help the model.<\/p>\n<p id=\"84ce\">With only three levels, the information embedded becomes muddled. There are many collisions and the model can\u2019t glean much information from the features. 
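The bullet steps above can be traced numerically with the standard library (`to_bit_columns` is a hypothetical helper for illustration, not a package function):

```python
# Trace the binary-encoding steps for the example integers 5 and 10.
def to_bit_columns(codes):
    width = max(codes).bit_length()  # how many bit columns are needed
    return {c: [int(b) for b in format(c, f"0{width}b")] for c in codes}

cols = to_bit_columns([5, 10])
# 5 -> 0101 and 10 -> 1010, once padded to the 4-column width that 10 requires
```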
Just one-hot encode a column if it only has a few values.<\/p>\n<p id=\"9875\">In contrast, binary really shines when the cardinality of the column is higher\u200a\u2014\u200awith the 50 US states, for example.<\/p>\n<p id=\"ad2f\">Binary encoding creates fewer columns than one-hot encoding. It is more memory efficient. It also reduces the chances of dimensionality problems with higher cardinality.<\/p>\n<p id=\"4c41\">Similar values overlap with each other across many of the new columns. This allows many machine learning algorithms to learn the values\u2019 similarity. Binary encoding is a decent compromise for ordinal data with high cardinality.<\/p>\n<p id=\"a927\">For nominal data, a hashing algorithm with more fine-grained control usually makes more sense. If you\u2019ve used binary encoding successfully, please share in the comments.<\/p>\n<h4 id=\"19c2\">BaseN<\/h4>\n<p id=\"e490\">When BaseN\u2019s\u00a0<em>base = 1<\/em>\u00a0it is basically the same as one-hot encoding. When\u00a0<em>base = 2<\/em>\u00a0it is basically the same as binary encoding. McGinnis\u00a0<a href=\"http:\/\/www.willmcginnis.com\/2016\/12\/18\/basen-encoding-grid-search-category_encoders\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"http:\/\/www.willmcginnis.com\/2016\/12\/18\/basen-encoding-grid-search-category_encoders\/\">said<\/a>, \u201cPractically, this adds very little new functionality, rarely do people use base-3 or base-8 or any base other than ordinal or binary in real problems.\u201d<\/p>\n<figure id=\"20fc\"><canvas width=\"75\" height=\"52\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*WjQ00YC4gXA7vgdr6x7kgA.jpeg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*WjQ00YC4gXA7vgdr6x7kgA.jpeg\" \/><\/figure>\n<p style=\"text-align: center;\">Base 3<\/p>\n<p id=\"5a32\">The main reason for its existence is to possibly make grid searching easier. 
You could use BaseN with\u00a0<em>GridSearchCV<\/em>. However, if you\u2019re going to grid search with some of these encoding options, you\u2019re going to make that search part of your workflow anyway. I don\u2019t see a compelling reason to use BaseN. If you do, please share in the comments.<\/p>\n<figure id=\"0977\"><canvas width=\"75\" height=\"30\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*7ni3XWgN9WGSjgmOn84INA.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*7ni3XWgN9WGSjgmOn84INA.png\" \/><\/figure>\n<p id=\"dc19\">The default base for BaseNEncoder is 2, which is the equivalent of BinaryEncoder.<\/p>\n<h4 id=\"870f\">Hashing<\/h4>\n<p id=\"6fb6\">HashingEncoder implements the\u00a0<a href=\"https:\/\/medium.com\/value-stream-design\/introducing-one-of-the-best-hacks-in-machine-learning-the-hashing-trick-bf6a9c8af18f\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/medium.com\/value-stream-design\/introducing-one-of-the-best-hacks-in-machine-learning-the-hashing-trick-bf6a9c8af18f\">hashing trick<\/a>. It is similar to one-hot encoding but with fewer new dimensions and some info loss due to collisions. The collisions do not significantly affect performance unless there is a great deal of overlap. 
An excellent discussion of the hashing trick and guidelines for selecting the number of output features can be found\u00a0<a href=\"https:\/\/booking.ai\/dont-be-tricked-by-the-hashing-trick-192a6aae3087\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/booking.ai\/dont-be-tricked-by-the-hashing-trick-192a6aae3087\" data->here<\/a>.<\/p>\n<p id=\"bfce\">Here\u2019s the ordinal column again for a refresher.<\/p>\n<figure id=\"0ef3\"><canvas width=\"75\" height=\"30\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*cqYJAqZKBIxU8OKGdklwWQ.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*cqYJAqZKBIxU8OKGdklwWQ.png\" \/><\/figure>\n<p id=\"ffab\">And here\u2019s the HashingEncoder.<\/p>\n<figure id=\"ccf0\"><canvas width=\"75\" height=\"32\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*VkuC2TEJ5mhj1qphkmlKlg.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*VkuC2TEJ5mhj1qphkmlKlg.png\" \/><\/figure>\n<p id=\"3177\">The\u00a0<em>n_components\u00a0<\/em>parameter controls the number of expanded columns. The default is eight columns. In our example column with three values the default results in five columns full of 0s.<\/p>\n<p id=\"eb3c\">If you set\u00a0<em>n_components<\/em>\u00a0less than\u00a0<em>k<\/em>\u00a0you\u2019ll have a small reduction in the value provided by the encoded data. You\u2019ll also have fewer dimensions.<\/p>\n<p id=\"872b\">You can pass a hashing algorithm of your choice to HashingEncoder; the default is\u00a0<em>md5<\/em>. 
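The idea behind the trick can be sketched with the standard library's `hashlib` (illustrative only; `hash_encode` is a hypothetical helper, and the package's exact folding scheme may differ):

```python
import hashlib

# Hashing trick sketch: hash each value and fold the digest into a fixed
# number of indicator columns, regardless of how many unique values exist.
def hash_encode(value, n_components=8):
    digest = hashlib.md5(value.encode()).hexdigest()  # md5, as in the default
    index = int(digest, 16) % n_components            # which column gets the 1
    row = [0] * n_components
    row[index] = 1
    return row

rows = [hash_encode(v) for v in ["a", "b", "c"]]
# Three values land somewhere in 8 columns; shrink n_components below k
# and distinct values start colliding into the same column.
```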
Hashing algorithms have been very successful in some Kaggle\u00a0<a href=\"https:\/\/blog.myyellowroad.com\/using-categorical-data-in-machine-learning-with-python-from-dummy-variables-to-deep-category-66041f734512\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/blog.myyellowroad.com\/using-categorical-data-in-machine-learning-with-python-from-dummy-variables-to-deep-category-66041f734512\" data->competitions<\/a>. It\u2019s worth trying HashingEncoder for nominal and ordinal data if you have high cardinality features.<\/p>\n<h3 id=\"097b\">Wrap<\/h3>\n<figure id=\"09f5\"><canvas width=\"75\" height=\"47\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*bJvhmYNxVxkkW_2sL03xaQ.jpeg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*bJvhmYNxVxkkW_2sL03xaQ.jpeg\" \/><\/figure>\n<p style=\"text-align: center;\">Exercise break<\/p>\n<p id=\"8fa6\">That\u2019s all for now. Here\u2019s a recap and suggestions for remaining encoders.<\/p>\n<p id=\"3af2\">For nominal columns try OneHot, Hashing, LeaveOneOut, and Target encoding. Avoid OneHot for high cardinality columns and decision tree-based algorithms.<\/p>\n<p id=\"223b\">For ordinal columns try Ordinal (Integer), Binary, OneHot, LeaveOneOut, and Target. Helmert, Sum, BackwardDifference and Polynomial are less likely to be helpful, but if you have time or theoretic reason you might want to try them.<\/p>\n<p id=\"c8e7\">The Bayesian encoders can work well for some machine learning tasks. 
For example, Owen Zhang used the leave one out encoding method to perform well in a\u00a0<a href=\"https:\/\/www.slideshare.net\/OwenZhang2\/tips-for-data-science-competitions\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/www.slideshare.net\/OwenZhang2\/tips-for-data-science-competitions\" data->Kaggle classification challenge<\/a>.<\/p>\n<p>*Update April 2019: I updated this article to include information about forthcoming encoders and reworked the conclusion.**<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Better encoding of categorical data can mean better model performance. In this series, I&rsquo;ll introduce you to a wide range of encoding options from the&nbsp;Category Encoders package&nbsp;for use with scikit-learn in Python. Use Category Encoders to improve model performance when you have nominal or ordinal data that may provide value. In this article we&rsquo;ll discuss terms, general usage and five classic encoding options: Ordinal, One Hot, Binary, BaseN, and Hashing.<\/p>\n","protected":false},"author":369,"featured_media":2476,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[92],"ppma_author":[2134],"class_list":["post-1634","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-machine-learning"],"authors":[{"term_id":2134,"user_id":369,"is_guest":0,"slug":"jeff-hale","display_name":"Jeff Hale","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Hale","first_name":"Jeff","job_title":"","description":"Jeff Hale is a co-founder of Rebel Desk, where he oversees technology, finance, and operations for this company. 
He&nbsp;is an experienced entrepreneur who has managed technology, operations, and finances for several companies.&nbsp;"}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1634","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/369"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1634"}],"version-history":[{"count":4,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1634\/revisions"}],"predecessor-version":[{"id":29996,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1634\/revisions\/29996"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/2476"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1634"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1634"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1634"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1634"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}