{"id":9887,"date":"2020-09-29T06:55:20","date_gmt":"2020-09-29T06:55:20","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/?p=9887"},"modified":"2023-09-13T09:36:34","modified_gmt":"2023-09-13T09:36:34","slug":"essential-commands-for-data-preparation-with-pandas","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/essential-commands-for-data-preparation-with-pandas\/","title":{"rendered":"Essential commands for data preparation with Pandas"},"content":{"rendered":"\n<p class=\"has-medium-font-size\"><em>Pandas \u201ccheat sheet\u201d for data wrangling<\/em><\/p>\n\n\n\n<p id=\"355b\">If you like to cook you know this very well. Turning on the stove and cooking food is a tiny part of the whole cooking process. Much of your sweat and tears actually go into preparing the right ingredients.<\/p>\n\n\n\n<p id=\"dd60\">Clich\u00e9, but worth saying it again \u2014 data preparation is 80% of work in any data science project. Whether it is about making a dashboard, a simple statistical analysis, or fitting advanced machine learning model \u2014 it all starts with finding the data and transforming it into the right format so the algorithm can take care of the rest.<\/p>\n\n\n\n<p id=\"a064\">If you are a Python fan, then&nbsp;<code>pandas<\/code>&nbsp;is your best friend in your data science journey. Equipped with all the tools, it helps you get through the most difficult parts of a project.<\/p>\n\n\n\n<p id=\"38a0\">That said, like any new tool you first need to learn it\u2019s functionalities and how to put them into use. Many beginners in data science still struggle to make the best use of Pandas and instead spend much of their time on Stack Overflow. 
The principal reason for this is, I\u2019d say, not being able to match Pandas functionalities with their analytics needs.<\/p>\n\n\n\n<p id=\"e722\">Much of this struggle can be overcome simply by making an inventory of typical data preparation problems and matching them with the appropriate Pandas tools. Below I present a typical data preparation and exploratory analysis workflow and match each step with the necessary Pandas functions. I am not trying to document everything under the sun on Pandas; rather, I am demonstrating the process of creating your own data wrangling cheatsheet.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"54d0\">Set up<\/h1>\n\n\n\n<p id=\"a447\">Soon after you fire up your favorite Python IDE you might want to get started right away and import the necessary libraries. That\u2019s fine, but you still need to set up your environment: set the working directory, locate data and other files, etc.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># find out your current directory<br>import os<br>os.getcwd()<br><br># if you want to set a different working directory<br>os.chdir(\"folder-path\")<br><br># to get a list of all files in the directory<br>os.listdir()<\/pre>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"e1ef\">Data import<\/h1>\n\n\n\n<p id=\"8bff\">Next up is data import, and this is where you\u2019ll be using Pandas for the first time.<\/p>\n\n\n\n<p id=\"7250\">Your data may be sitting anywhere in the world \u2014 on your local machine, in a SQL database, in the cloud or even in an online database. And data can be saved in a variety of formats \u2014 csv, txt, excel, sav, etc.<\/p>\n\n\n\n<p id=\"c690\">Depending on where the data is coming from and its file extension, you\u2019d need different Pandas commands. 
Below are a couple of examples.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># import pandas and numpy libraries<br>import pandas as pd<br>import numpy as np<br><br># import a csv file from the local machine<br>df = pd.read_csv(\"file_path\")<br><br># import a csv file from an online database<br>df = pd.read_csv(\"<a href=\"https:\/\/raw.githubusercontent.com\/uiuc-cse\/data-fa14\/gh-pages\/data\/iris.csv\" rel=\"noopener\">https:\/\/raw.githubusercontent.com\/uiuc-cse\/data-fa14\/gh-pages\/data\/iris.csv<\/a>\")<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"585b\">Data inspection<\/h2>\n\n\n\n<p id=\"f1cd\">After importing the data, you\u2019d like to inspect it for a number of things, such as the number of rows and columns, column names, etc.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># description of index, entries, columns, data types, memory info<br>df.info()<br><br># check out the first few rows<br>df.head(5)<br><br># number of rows &amp; columns<br>df.shape<br><br># column names<br>df.columns<br><br># number of unique values of a column<br>df[\"sepal_length\"].nunique()<br><br># show unique values of a column<br>df[\"sepal_length\"].unique()<br><br># number of unique column names<br>df.columns.nunique()<br><br># value counts per category<br>df['species'].value_counts()<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"3be6\">Dealing with NA values<\/h2>\n\n\n\n<p id=\"98c3\">Next, check for NA, NaN or missing values. Some algorithms can handle missing values, but others require that they are taken care of before the data is put to use. 
Regardless, checking for missing values and understanding how to handle them is an essential part of \u201cgetting to know\u201d the data.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># show null\/NA values per column<br>df.isnull().sum()<br><br># show NA values as % of total observations per column<br>df.isnull().sum()*100\/len(df)<br><br># drop all rows containing null<br>df.dropna()<br><br># drop all columns containing null<br>df.dropna(axis=1)<br><br># drop columns with fewer than 5 non-NA values<br>df.dropna(axis=1, thresh=5)<br><br># replace all na values with -9999<br>df.fillna(-9999)<br><br># standardize missing values (e.g. None) as NaN<br>df.fillna(np.nan)<br><br># fill na values with a string<br>df.fillna(\"data missing\")<br><br># fill missing values with mean column values<br>df.fillna(df.mean())<br><br># replace na values of a specific column with its mean value<br>df[\"columnName\"] = df[\"columnName\"].fillna(df[\"columnName\"].mean())<br><br># interpolation of missing values (useful in time-series)<br>df[\"columnName\"].interpolate()<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"da66\">Column operation<\/h2>\n\n\n\n<p id=\"ba5a\">As is often the case, you may need to perform a wide range of column operations, such as renaming or dropping a column, sorting column values, creating new calculated columns, etc.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># select a column<br>df[\"sepal_length\"]<br><br># select multiple columns and create a new dataframe X<br>X = df[[\"sepal_length\", \"sepal_width\", \"species\"]]<br><br># select columns by column number<br>df.iloc[:, [1,3,4]]<br><br># drop a column from dataframe X<br>X = X.drop(\"sepal_width\", axis=1)<br><br># save all column names to a list<br>df.columns.tolist()<br><br># rename columns<br>df.rename(columns={\"old column1\": \"new column1\", \"old column2\": \"new column2\"})<br><br># sort values by column \"sepal_width\" in ascending order<br>df.sort_values(by=\"sepal_width\", ascending=True)<br><br># add a new calculated column<br>df['newcol'] = df[\"sepal_length\"]*2<br><br># create a conditional calculated column<br>df['newcol'] = [\"short\" if i&lt;3 else 
\"long\" for i in df[\"sepal_width\"]] <\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"a199\">Row operation (sort, filter, slice)<\/h2>\n\n\n\n<p id=\"e015\">Up until the previous section you have mostly cleaned up your data, but another important part of data preparation is slicing and filtering data to go into the next round of the analytics pipeline.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># select rows 3 to 10<br>df.iloc[3:10,]# select rows 3 to 49 and columns 1 to 3<br>df.iloc[3:50, 1:4]# randomly select 10 rows<br>df.sample(10)# find rows with specific strings<br>df[df[\"species\"].isin([\"Iris-setosa\"])]# conditional filtering<br>df[df.sepal_length &gt;= 5]# filtering rows with multiple values e.g. 0.2, 0.3<br>df[df[\"petal_width\"].isin([0.2, 0.3])]# multi-conditional filtering<br>df[(df.petal_length &gt; 1) &amp; (df.species==\"Iris-setosa\") | (df.sepal_width &lt; 3)]# drop rows<br>df.drop(df.index[1]) # 1 is row index to be deleted<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"2d3b\">Grouping<\/h2>\n\n\n\n<p id=\"a5ea\">Last but not least, often you will need to group data by different categories \u2014 and it is especially useful in exploratory data analysis and in getting insights on categorical variables.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># data grouped by column \"species\"<br>X = df.groupby(\"species\")# return mean values of a column (\"sepal_length\" ) grouped by \"species\" column<br>df.groupby(\"spp\")[\"sepal_length\"].mean()# return mean values of ALL columns grouped by \"species\" category<br>df.groupby(\"species\").mean()# get counts in different categories<br>df.groupby(\"spp\").nunique() <\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"e6ad\">Summary<\/h2>\n\n\n\n<p id=\"ef9f\">The purpose of this article was to show some essential Pandas functions needed for making data analysis-ready. 
In this demonstration, I followed a typical analytics process rather than presenting code in a random fashion, so that data scientists can find the right tool at the right stage of a project. Of course, I did not intend to show every single command required to deal with every single problem in data preparation; rather, the intention was to show how to create an essential Pandas cheatsheet.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Data preparation is 80% of the work in any data science project. If you are a Python fan, then Pandas is your best friend in your data science journey. Here are some essential Pandas functions for making data analysis-ready.<\/p>\n","protected":false},"author":925,"featured_media":9888,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[672,673,216,114],"ppma_author":[3920],"class_list":["post-9887","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-preparation","tag-data-wrangling","tag-pandas","tag-python"],"authors":[{"term_id":3920,"user_id":925,"is_guest":0,"slug":"mahbubul-alam","display_name":"Mahbubul Alam","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/09\/Mahbubul-Alam.png","user_url":"http:\/\/www.mccicorp.com","last_name":"Alam","first_name":"Mahbubul","job_title":"","description":"Mahbubul Alam, PhD, is a data scientist, economist and quantitative researcher at MCCI, an award-winning integrated marketing agency that specializes in Public &amp; Media Relations, Social Media, Video, Web &amp; Digital, Marketing &amp; 
Advertising."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/9887","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/925"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=9887"}],"version-history":[{"count":1,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/9887\/revisions"}],"predecessor-version":[{"id":9889,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/9887\/revisions\/9889"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/9888"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=9887"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=9887"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=9887"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=9887"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}