{"id":1995,"date":"2019-10-07T03:59:33","date_gmt":"2019-10-07T03:59:33","guid":{"rendered":"http:\/\/kusuaks7\/?p=1600"},"modified":"2024-03-14T08:49:56","modified_gmt":"2024-03-14T08:49:56","slug":"data-cleaning-and-preprocessing-for-beginners","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/data-cleaning-and-preprocessing-for-beginners\/","title":{"rendered":"Data Cleaning and Preprocessing for Beginners"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"1995\" class=\"elementor elementor-1995\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-3ecb91c1 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"3ecb91c1\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-544635b3\" data-id=\"544635b3\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-75ef0db9 elementor-widget elementor-widget-text-editor\" data-id=\"75ef0db9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWhen our team\u2019s\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1908.02505\" rel=\"noopener\">project\u00a0<\/a>scored first in the text subtask of this year\u2019s CALL Shared Task challenge, one of the key components of our success was careful preparation and cleaning of data. 
Data cleaning and preparation is the most critical first step in any\u00a0<a href=\"https:\/\/datafloq.com\/read\/?q=Artificial%20intelligence#utm=internal\" rel=\"noopener\">AI<\/a>\u00a0project. As evidence shows,\u00a0<a href=\"https:\/\/www.forbes.com\/sites\/gilpress\/2016\/03\/23\/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says\/?utm_source=datafloq&amp;utm_medium=ref&amp;utm_campaign=datafloq#77d304176f63\" target=\"_blank\" rel=\"noopener noreferrer\" class=\"broken_link\">most data scientists spend most of their time\u200a\u2014\u200aup to\u00a0<strong>70%<\/strong>\u200a\u2014\u200aon cleaning\u00a0data<\/a>.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e0d5d08 elementor-widget elementor-widget-text-editor\" data-id=\"e0d5d08\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn this blog post, we\u2019ll guide you through these initial steps of data cleaning and preprocessing in\u00a0<a href=\"https:\/\/datafloq.com\/read\/?q=Python#utm=internal\" rel=\"noopener\">Python<\/a>, starting from importing the most popular libraries to actual encoding of features.\n<blockquote><strong>Data cleansing<\/strong>\u00a0or\u00a0<strong>data cleaning<\/strong>\u00a0is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. 
\/\/<a href=\"https:\/\/en.wikipedia.org\/wiki\/Data_cleansing?utm_source=datafloq&amp;utm_medium=ref&amp;utm_campaign=datafloq\" target=\"_blank\" rel=\"noopener noreferrer\"><strong>Wikipedia<\/strong><\/a><\/blockquote>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-772e678 elementor-widget elementor-widget-heading\" data-id=\"772e678\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2><strong>Step 1. Loading the data\u00a0set<\/strong><\/h2>\n<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-539881d elementor-widget elementor-widget-heading\" data-id=\"539881d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3><strong>Importing libraries<\/strong><\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8551d41 elementor-widget elementor-widget-text-editor\" data-id=\"8551d41\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe very first thing you need to do is import the libraries for data preprocessing. 
There are lots of libraries available, but the most popular and important Python libraries for working on data are NumPy, Matplotlib, and Pandas.\u00a0<strong>NumPy<\/strong>\u00a0is the library used for all kinds of mathematical operations.\u00a0<strong>Pandas<\/strong>\u00a0is the best tool available for importing and managing datasets.\u00a0<strong>Matplotlib<\/strong>\u00a0(Matplotlib.pyplot) is the library for making\u00a0charts.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-365737c elementor-widget elementor-widget-text-editor\" data-id=\"365737c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tTo make it easier for future use, you can import these libraries with a shortcut\u00a0alias:\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><em>import numpy as np\nimport matplotlib.pyplot as plt\nimport pandas as pd<\/em><\/div>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-cd05b4f elementor-widget elementor-widget-heading\" data-id=\"cd05b4f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3><strong>Loading data into\u00a0pandas<\/strong><\/h3>\n<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f4e3e61 elementor-widget elementor-widget-text-editor\" data-id=\"f4e3e61\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tOnce you have downloaded your data set as a\u00a0.csv file, you need to load it into a pandas DataFrame to explore it and perform some basic cleaning tasks, removing information you 
don\u2019t need and that would otherwise make data processing slower.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ef57efa elementor-widget elementor-widget-text-editor\" data-id=\"ef57efa\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tUsually, such tasks\u00a0include:\n<ul>\n \t<li>Removing the first line: it contains extraneous text instead of the column titles. This text prevents the data set from being parsed properly by the pandas\u00a0library:<\/li>\n<\/ul>\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><em>my_dataset = pd.read_csv('data\/my_dataset.csv', skiprows=1, low_memory=False)<\/em><\/div>\n<ul>\n \t<li>Removing columns with text explanations that we won\u2019t need, url columns, and other unnecessary columns:<\/li>\n<\/ul>\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><em>my_dataset = my_dataset.drop(['url'],axis=1)<\/em><\/div>\n<ul>\n \t<li>Removing all columns that have only one value or more than 50% missing values, so you can work faster (provided your data set is large enough that it will still be meaningful):<\/li>\n<\/ul>\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><em>half_count = len(my_dataset) \/\/ 2\nmy_dataset = my_dataset.dropna(thresh=half_count, axis=1)<\/em><\/div>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2e62146 elementor-widget elementor-widget-text-editor\" data-id=\"2e62146\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIt\u2019s also a good practice to name the filtered data set differently to keep it separate from the raw data. 
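Put together, the loading-and-filtering steps above can be sketched as one self-contained example (the inline CSV stands in for a real downloaded file, and half_count is defined as half the number of rows):

```python
import io
import pandas as pd

# Inline CSV standing in for a downloaded file; the first line is extraneous text
raw = 'exported by some tool\nid,url,score,notes\n1,http://a,0.5,\n2,http://b,,\n3,http://c,0.7,\n'

# skiprows=1 drops the extraneous first line so the header row parses correctly
my_dataset = pd.read_csv(io.StringIO(raw), skiprows=1, low_memory=False)

# Drop the url column we won't need
my_dataset = my_dataset.drop(['url'], axis=1)

# Keep only columns with at least 50% non-missing values
half_count = len(my_dataset) // 2
my_dataset = my_dataset.dropna(thresh=half_count, axis=1)
print(list(my_dataset.columns))
```

Here the all-empty notes column is dropped by the thresh filter, while id and score survive.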
This makes sure you still have the original data in case you need to go back to\u00a0it.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-033afd8 elementor-widget elementor-widget-heading\" data-id=\"033afd8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2><strong>Step 2. Exploring the data\u00a0set<\/strong><\/h2>\n<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-411986d elementor-widget elementor-widget-heading\" data-id=\"411986d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3><strong>Understanding the\u00a0data<\/strong><\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0bfc229 elementor-widget elementor-widget-text-editor\" data-id=\"0bfc229\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tNow you have your data set loaded, but you should still spend some time exploring it and understanding what feature each column represents. 
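A quick way to take such a first look is to build a per-column overview; a minimal sketch on a toy stand-in DataFrame:

```python
import pandas as pd

# Toy stand-in for the loaded data set
my_dataset = pd.DataFrame({'age': [25, 32, 47], 'salary': [50000, 64000, None]})

# One row per column: its dtype, first value, and missing-value count
overview = pd.DataFrame({
    'dtype': my_dataset.dtypes,
    'first_value': my_dataset.iloc[0],
    'n_missing': my_dataset.isnull().sum(),
})
print(overview)
```

On a real data set you would also join in each column's description from the data dictionary, when one exists.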
Such manual review of the data set is important to avoid mistakes in the data analysis and the modelling process.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-34890af elementor-widget elementor-widget-text-editor\" data-id=\"34890af\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tTo make the process easier, you can create a DataFrame with the names of the columns, data types, the first row\u2019s values, and a description from the data dictionary.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e39713b elementor-widget elementor-widget-text-editor\" data-id=\"e39713b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAs you explore the features, you can pay attention to any column\u00a0that:\n<ul>\n \t<li>is formatted poorly,<\/li>\n \t<li>requires more data or a lot of pre-processing to turn into a useful feature,\u00a0or<\/li>\n \t<li>contains redundant information,<\/li>\n<\/ul>\nsince these things can hurt your analysis if handled incorrectly.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1ebec39 elementor-widget elementor-widget-text-editor\" data-id=\"1ebec39\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong><em>You should also pay attention to data leakage<\/em><\/strong>, which can cause the model to overfit. This is because the model will also be learning from features that won\u2019t be available when we\u2019re using it to make predictions. 
We need to be sure our model is trained using only the data it would have at the point of a loan application.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2716cdb elementor-widget elementor-widget-heading\" data-id=\"2716cdb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3><strong>Deciding on a target\u00a0column<\/strong><\/h3>\n<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f7fae79 elementor-widget elementor-widget-text-editor\" data-id=\"f7fae79\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWith a filtered data set explored, you need to create a matrix of features (the independent variables) and a vector of the target (the dependent variable). First, you should decide on the appropriate column to use as a target column for modelling based on the question you want to answer. 
For example, if you want to predict the development of cancer, or the chance that a loan will be approved, you need to find a column with the status of the disease or the loan decision and use it as the target\u00a0column.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-cdda103 elementor-widget elementor-widget-text-editor\" data-id=\"cdda103\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tFor example, if the target column is the last one, you can create the matrix of features by\u00a0typing:\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><em>X = dataset.iloc[:, :-1].values<\/em><\/div>\nThat first colon (<strong>:<\/strong>) means that we want to take all the lines in our dataset.\u00a0<strong>:-1<\/strong>\u00a0means that we want to take all of the columns of data except the last one. 
The\u00a0.<strong>values<\/strong>\u00a0on the end means that we want all of the\u00a0values.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5dad2ec elementor-widget elementor-widget-text-editor\" data-id=\"5dad2ec\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tTo have a vector of the target (dependent) variable with only the data from the last column, you can\u00a0type\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><em>y = dataset.iloc[:, -1].values<\/em><\/div>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1ac4ab3 elementor-widget elementor-widget-heading\" data-id=\"1ac4ab3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2><strong>Step 3. Preparing the Features for Machine\u00a0Learning<\/strong><\/h2>\n<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-966841a elementor-widget elementor-widget-text-editor\" data-id=\"966841a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tFinally, it\u2019s time to do the preparatory work to feed the features to ML algorithms. To clean the data set, you need to\u00a0<strong>handle missing values and categorical features<\/strong>, because the mathematics underlying most\u00a0<a href=\"https:\/\/datafloq.com\/read\/?q=Machine%20learning#utm=internal\" rel=\"noopener\">machine learning<\/a>\u00a0models assumes that the data is numerical and contains no missing values. 
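The feature/target split from Step 2 can be run end-to-end on a toy data set (the column names here are made up for illustration):

```python
import pandas as pd

# Toy data set: two features plus a target in the last column
dataset = pd.DataFrame({
    'age': [25, 32, 47],
    'salary': [50000, 64000, 120000],
    'approved': [0, 1, 1],
})

X = dataset.iloc[:, :-1].values  # all rows, every column except the last
y = dataset.iloc[:, -1].values   # all rows, the last column only
print(X.shape, y.shape)
```

X comes out as a 3x2 array of features and y as a length-3 target vector.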
Moreover, the\u00a0<strong>scikit-learn<\/strong>\u00a0library returns an error if you try to train a model like linear regression or logistic regression using data that contains missing or non-numeric values.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-be25f5f elementor-widget elementor-widget-heading\" data-id=\"be25f5f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3><strong>Dealing with Missing\u00a0Values<\/strong><\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d4a3aea elementor-widget elementor-widget-text-editor\" data-id=\"d4a3aea\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tMissing data is perhaps the most common trait of unclean data. These values usually take the form of NaN or\u00a0None.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a569737 elementor-widget elementor-widget-text-editor\" data-id=\"a569737\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThere are several causes of missing values: sometimes values are missing because they do not exist, or because of improper collection of data or poor data entry. For example, if someone is underage, and the question applies to people over 18, then the question will contain a missing value. 
In such cases, it would be wrong to fill in a value for that question.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-867c1bc elementor-widget elementor-widget-text-editor\" data-id=\"867c1bc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThere are several ways to deal with missing\u00a0values:\n<ul>\n \t<li>you can remove the lines with missing data if your data set is big enough and the percentage of missing values is high (over 50%, for example);<\/li>\n \t<li>you can fill all null variables with 0 if dealing with numerical values;<\/li>\n \t<li>you can use the\u00a0<a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.impute.SimpleImputer.html\" rel=\"noopener\">SimpleImputer<\/a>\u00a0class from the\u00a0<a href=\"https:\/\/scikit-learn.org\/\" rel=\"noopener\">scikit-learn<\/a>\u00a0library to fill in missing values with the column\u2019s mean, median, or most frequent value;<\/li>\n \t<li>you can also decide to fill in missing values with whatever value comes directly after them in the same\u00a0column.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a8fbc31 elementor-widget elementor-widget-text-editor\" data-id=\"a8fbc31\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThese decisions depend on the type of data, what you want to do with the data, and the cause of the missing values. In reality, just because something is popular doesn\u2019t necessarily make it the right choice. 
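As a sketch of the scikit-learn route, here is mean imputation with SimpleImputer on a small numeric array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 6.0],
              [7.0, np.nan]])

# strategy can also be 'median' or 'most_frequent'
imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Each NaN is replaced by its column's mean, so both gaps above become 4.0.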
The most common\u00a0<a href=\"https:\/\/datafloq.com\/read\/?q=strategy#utm=internal\" rel=\"noopener\">strategy<\/a>\u00a0is to use the mean value, but depending on your data, you may come up with a totally different approach.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4ad6eac elementor-widget elementor-widget-heading\" data-id=\"4ad6eac\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3><strong>Handling categorical data<\/strong><\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f7e52c2 elementor-widget elementor-widget-text-editor\" data-id=\"f7e52c2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<a href=\"https:\/\/datafloq.com\/read\/?q=Machine%20learning#utm=internal\" rel=\"noopener\">Machine learning<\/a>\u00a0models use only numeric values (float or int data type). However, data sets often contain the object data type that needs to be transformed into numeric. In most cases, categorical values are discrete and can be encoded as dummy variables, assigning a number for each category. 
The simplest way is to use OneHotEncoder wrapped in a ColumnTransformer, specifying the index of the column you want to work\u00a0on:\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7bbcfd8 elementor-widget elementor-widget-text-editor\" data-id=\"7bbcfd8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><em>from sklearn.compose import ColumnTransformer\nfrom sklearn.preprocessing import OneHotEncoder\nct = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough')<\/em><\/div>\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><em>X = ct.fit_transform(X)<\/em><\/div>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-263932e elementor-widget elementor-widget-heading\" data-id=\"263932e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3><strong>Dealing with inconsistent data\u00a0entry<\/strong><\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-47efd1d elementor-widget elementor-widget-text-editor\" data-id=\"47efd1d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tInconsistency occurs, for example, when there are different unique values in a column which are meant to be the same. Think of different approaches to capitalization, simple misprints, and inconsistent formats. 
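For instance, capitalization and whitespace variants of the same entry can be collapsed with pandas string methods (the city values below are made up):

```python
import pandas as pd

# Four entries that are really only two distinct cities
cities = pd.Series([' Kyiv', 'kyiv ', 'KYIV', 'Lviv'])

# Strip surrounding whitespace and lowercase everything
cleaned = cities.str.strip().str.lower()
print(cleaned.nunique())
```

After cleaning, the three 'Kyiv' variants count as a single unique value.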
One way to remove data inconsistencies is to strip whitespace before or after entry names and to convert all text to lower\u00a0case.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ea6553d elementor-widget elementor-widget-text-editor\" data-id=\"ea6553d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIf there is a large number of inconsistent unique entries, however, it is impossible to manually check for the closest matches. You can use the\u00a0<a href=\"https:\/\/github.com\/seatgeek\/fuzzywuzzy\" rel=\"noopener\">Fuzzy Wuzzy\u00a0<\/a>package to identify which strings are most likely to be the same. It takes in two strings and returns a ratio. The closer the ratio is to 100, the more likely it is that the two strings should be\u00a0unified.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-785567a elementor-widget elementor-widget-heading\" data-id=\"785567a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h3><strong>Handling Dates and\u00a0Times<\/strong><\/h3><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-abdc5ce elementor-widget elementor-widget-text-editor\" data-id=\"abdc5ce\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tA specific type of data inconsistency is an inconsistent format of dates, such as dd\/mm\/yy and mm\/dd\/yy in the same column. Your date values might not be in the right data type, and this will not allow you to effectively perform manipulations and get insights from them. 
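One way to unify such mixed formats with the standard datetime module, sketched here under the assumption that the formats appearing in the column are known:

```python
from datetime import datetime

mixed = ['07/10/2019', '2019-10-07', '10-07-2019']
formats = ['%d/%m/%Y', '%Y-%m-%d', '%m-%d-%Y']  # formats assumed present in the column

def parse_date(value):
    # Try each known format until one fits
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None  # leave genuinely unparseable entries as missing

parsed = [parse_date(v) for v in mixed]
print(parsed)
```

All three strings above resolve to the same date once parsed.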
This time you can use the\u00a0<a href=\"https:\/\/docs.python.org\/2\/library\/datetime.html\" rel=\"noopener\">datetime<\/a>\u00a0package to fix the type of the\u00a0date.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-02da4f2 elementor-widget elementor-widget-heading\" data-id=\"02da4f2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3><strong>Scaling and Normalization<\/strong><\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-07cf155 elementor-widget elementor-widget-text-editor\" data-id=\"07cf155\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tScaling is important because many algorithms are sensitive to the magnitude of features: with the help of scaling, you ensure that a feature won\u2019t be used as the main predictor just because its values are big. For example, if you use the age and the salary of a person in prediction, some algorithms will pay attention to the salary more because its values are bigger, which does not make any\u00a0sense.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0f57d98 elementor-widget elementor-widget-text-editor\" data-id=\"0f57d98\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tNormalization involves transforming or rescaling your dataset so that it approximates a normal distribution. 
Some algorithms like SVM converge far faster on normalized data, so it makes sense to normalize your data to get better\u00a0results.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-dd6deb5 elementor-widget elementor-widget-text-editor\" data-id=\"dd6deb5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThere are many ways to perform feature scaling. In a nutshell, we put all of our features into the same scale so that none are dominated by another. For example, you can use the\u00a0<a href=\"https:\/\/scikit-learn.org\/stable\/modules\/preprocessing.html\" rel=\"noopener\">StandardScaler<\/a>\u00a0class from the sklearn.preprocessing package to fit and transform your data\u00a0set:\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><em>from sklearn.preprocessing import StandardScaler<\/em><\/div>\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><em>sc_X = StandardScaler()<\/em><\/div>\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><em>X_train = sc_X.fit_transform(X_train)\nX_test = sc_X.transform(X_test)<\/em><\/div>\nAs you don\u2019t need to fit the scaler to your test set, you only apply the transformation to it. If your target is a continuous value that also needs scaling, reshape it to a column first:\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><em>sc_y = StandardScaler()\ny_train = sc_y.fit_transform(y_train.reshape(-1, 1))<\/em><\/div>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ed1e1d0 elementor-widget elementor-widget-heading\" data-id=\"ed1e1d0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title 
">
elementor-size-default\"><h3><strong>Save to\u00a0CSV<\/strong><\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-058ad83 elementor-widget elementor-widget-text-editor\" data-id=\"058ad83\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tTo be sure that you still have the raw data, it is a good practice to store the final output of each section or stage of your workflow in a separate CSV file. In this way, you\u2019ll be able to make changes in your data processing flow without having to recalculate everything.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0b9e95c elementor-widget elementor-widget-text-editor\" data-id=\"0b9e95c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAs we did previously, you can store your DataFrame as a\u00a0.csv using the pandas to_csv() function.\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><em>my_dataset.to_csv('processed_data\/cleaned_dataset.csv', index=False)<\/em><\/div>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e7893dd elementor-widget elementor-widget-heading\" data-id=\"e7893dd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2><strong>Conclusion<\/strong><\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0ec363c elementor-widget elementor-widget-text-editor\" data-id=\"0ec363c\" data-element_type=\"widget\" data-e-type=\"widget\" 
data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThese are the very basic steps required to work through a large data set, cleaning and preparing the data for any Data Science project. There are other forms of data cleaning that you might find useful. But for now, we want you to understand that you need to properly arrange and tidy up your data before the formulation of any model. Better and cleaner data outperforms the best algorithms. If you use a very simple algorithm on the cleanest data, you will get very impressive results. And, what is more, it is not that difficult to perform basic preprocessing!\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>When our team\u2019s\u00a0project\u00a0scored first in the text subtask of this year\u2019s CALL Shared Task challenge, one of the key components of our success was careful preparation and cleaning of data. Data cleaning and preparation is the most critical first step in any\u00a0AI\u00a0project. 
As evidence shows,\u00a0most data scientists spend most of their time\u200a\u2014\u200aup to\u00a070%\u200a\u2014\u200aon cleaning\u00a0data.In this<\/p>\n","protected":false},"author":570,"featured_media":4163,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[94],"ppma_author":[3261],"class_list":["post-1995","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":3261,"user_id":570,"is_guest":0,"slug":"max-ved","display_name":"Max Ved","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/04\/medium_cbaf23d5-a78a-4ceb-8f6e-343134811364-150x150.jpg","user_url":"https:\/\/sciforce.solutions\/","last_name":"Ved","first_name":"Max","job_title":"","description":"Max Ved, a Scientist Entrepreneur, is Co-Founder &amp; CTO at SciForce, an IT company specialized in the development of software 
solutions.\u00a0"}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1995","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/570"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1995"}],"version-history":[{"count":5,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1995\/revisions"}],"predecessor-version":[{"id":36431,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1995\/revisions\/36431"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/4163"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1995"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1995"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1995"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1995"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}