{"id":1562,"date":"2019-03-08T02:09:56","date_gmt":"2019-03-08T02:09:56","guid":{"rendered":"http:\/\/kusuaks7\/?p=1167"},"modified":"2023-07-17T11:20:51","modified_gmt":"2023-07-17T11:20:51","slug":"why-and-how-to-use-pandas-with-large-data-but-not-big-data","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/why-and-how-to-use-pandas-with-large-data-but-not-big-data\/","title":{"rendered":"Why and How to Use Pandas with Large Data &#8211; But Not Big Data"},"content":{"rendered":"<p id=\"a3f8\"><em><span style=\"font-size: 14px;\"><a href=\"https:\/\/pandas.pydata.org\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/pandas.pydata.org\/\" data->Pandas\u00a0<\/a><\/span><\/em><em><span style=\"font-size: 14px;\">has<\/span><\/em><em><span style=\"font-size: 14px;\"> been one of the most popular and <\/span><\/em><em><span style=\"font-size: 14px;\">favourite<\/span><\/em><em><span style=\"font-size: 14px;\"> data science tools used in\u00a0<a href=\"https:\/\/www.python.org\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/www.python.org\/\" data->Python<\/a>\u00a0programming language for data wrangling and analysis.<\/span><\/em><\/p>\n<p id=\"f732\">Data is unavoidably messy in real world. And Pandas is\u00a0<em>seriously\u00a0<\/em>a game changer when it comes to cleaning, transforming, manipulating and analyzing data. In simple terms, Pandas helps to\u00a0<strong>clean the mess.<\/strong><\/p>\n<h3 id=\"1e3b\">My Story of NumPy &amp;\u00a0Pandas<\/h3>\n<p id=\"e717\">When I first started out learning Python, I was naturally introduced to\u00a0<a href=\"http:\/\/www.numpy.org\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"http:\/\/www.numpy.org\/\" data->NumPy\u00a0<\/a>(Numerical Python). It is the fundamental package for scientific computing with Python that provides an\u00a0<a href=\"https:\/\/activewizards.com\/blog\/top-15-libraries-for-data-science-in-python\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/activewizards.com\/blog\/top-15-libraries-for-data-science-in-python\/\" data->abundance of useful features for operations on n-arrays and matrices in Python<\/a>.<\/p>\n<p id=\"1ae5\">In addition, the library provides vectorization of mathematical operations on the NumPy array type, which significantly optimizes computation with high performance and enhanced speed of execution.<\/p>\n<p id=\"6f69\"><strong>NumPy is cool.<\/strong><\/p>\n<p id=\"5db8\">But therein still lies some underlying needs for more higher level of data analysis tools. And this is where Pandas comes to my rescue.<\/p>\n<p id=\"8a43\">Fundamentally, the functionality of Pandas is built on top of NumPy and both libraries belong to the\u00a0<a href=\"https:\/\/www.scipy.org\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/www.scipy.org\/\" data->SciPy<\/a>\u00a0stack. This means that Pandas relies heavily on NumPy array to implement its objects for manipulation and computation \u2014 but used in a more convenient fashion.<\/p>\n<p id=\"a3cc\">In practice, NumPy &amp; Pandas are still being used interchangeably. The high level features and its convenient usage are what determine my preference in Pandas.<\/p>\n<h4 id=\"59d5\">Why use Pandas with Large Data \u2014 Not BIG\u00a0Data?<\/h4>\n<p id=\"cb6d\">There is a stark difference between large data and big data. With the\u00a0<a href=\"https:\/\/www.wired.com\/insights\/2014\/04\/big-data-big-hype\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/www.wired.com\/insights\/2014\/04\/big-data-big-hype\/\" data->hype around big data<\/a>, it is easy for us to consider everything as \u201c<a href=\"http:\/\/blog.syncsort.com\/2018\/03\/big-data\/big-data-vs-traditional-data\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"http:\/\/blog.syncsort.com\/2018\/03\/big-data\/big-data-vs-traditional-data\/\" data->big data<\/a>\u201d and just go with the flow.<\/p>\n<p>A famous joke by Prof. Dan Ariely:<\/p>\n<figure id=\"7050\"><canvas width=\"75\" height=\"21\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*h2AuMUgEWjaXMannlJUQKA.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*h2AuMUgEWjaXMannlJUQKA.png\" \/><\/figure>\n<p style=\"text-align: center;\"><span style=\"font-size: 11px;\"><a href=\"https:\/\/www.facebook.com\/dan.ariely\/posts\/904383595868\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/www.facebook.com\/dan.ariely\/posts\/904383595868\" data->(Source)<\/a><\/span><\/p>\n<p id=\"488d\">The word large and big are in themselves \u2018relative\u2019 and in my humble opinion, large data is data sets that are less than 100GB.<\/p>\n<p id=\"59f7\">Pandas is very efficient with small data (usually from 100MB up to 1GB) and performance is rarely a concern.<\/p>\n<p id=\"0736\">However, if you\u2019re in data science or big data field, chances are you\u2019ll encounter a common problem sooner or later when using Pandas \u2014 low performance and long runtime that ultimately result in insufficient memory usage \u2014 when you\u2019re dealing with large data sets.<\/p>\n<p id=\"7e0e\">Indeed, Pandas has its own limitation when it comes to big data due to its algorithm and local memory constraints. Therefore, big data is typically stored in computing clusters for higher scalability and fault tolerance. And it can often be accessed through big data ecosystem (<a href=\"https:\/\/aws.amazon.com\/ec2\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/aws.amazon.com\/ec2\/\" data->AWS EC2<\/a>,\u00a0<a href=\"https:\/\/hadoop.apache.org\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/hadoop.apache.org\/\" data->Hadoop<\/a>\u00a0etc.) using\u00a0<a href=\"https:\/\/spark.apache.org\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/spark.apache.org\/\" data->Spark\u00a0<\/a>and\u00a0<a href=\"https:\/\/www.edureka.co\/blog\/hadoop-ecosystem\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/www.edureka.co\/blog\/hadoop-ecosystem\" data->many other tools<\/a>.<\/p>\n<p id=\"0476\">Eventually, one of the ways to use Pandas with large data on local machines (with certain memory constraints) is to reduce memory usage of the data.<\/p>\n<h3 id=\"fbee\">How to use Pandas with Large\u00a0Data?<\/h3>\n<figure id=\"333d\"><canvas width=\"58\" height=\"75\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/0*FytVOpeZPmqKsfkr\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/0*FytVOpeZPmqKsfkr\" \/><\/figure>\n<p style=\"text-align: center;\"><span style=\"font-size: 11px;\"><a href=\"https:\/\/unsplash.com\/photos\/lO2qvjuM7ec\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/unsplash.com\/photos\/lO2qvjuM7ec\" data->(Source)<\/a><\/span><\/p>\n<p id=\"212a\">So the question is:\u00a0<strong>How to reduce memory usage of data using Pandas?<\/strong><\/p>\n<p id=\"4081\">The following explanation will be based my experience on an anonymous large data set (40\u201350 GB) which required me to reduce the memory usage to fit into local memory for analysis (even before reading the data set to a dataframe).<\/p>\n<h4 id=\"013d\">1.\u00a0Read CSV file data in chunk\u00a0size<\/h4>\n<p id=\"edec\">To be honest, I was baffled when I encountered an error and I couldn\u2019t read the data from CSV file, only to realize that the memory of my local machine was too small for the data with 16GB of RAM.<\/p>\n<p id=\"30c9\">Here comes the good news and the beauty of Pandas: I realized that\u00a0<strong>pandas.read_csv\u00a0<\/strong>has a parameter called\u00a0<strong>chunksize<\/strong>!<\/p>\n<p id=\"214a\">The parameter essentially means the number of rows to be read into a dataframe at any single time in order to fit into the local memory. Since the data consists of more than 70 millions of rows, I specified the chunksize as 1 million rows each time that broke the large data set into many smaller pieces.<\/p>\n<p id=\"6acb\"><span style=\"background-color: #ffffff;\">The operation above resulted in a TextFileReader object for iteration. Strictly speaking,\u00a0<\/span><strong style=\"background-color: #ffffff;\">df_chunk<\/strong><span style=\"background-color: #ffffff;\">\u00a0is not a <\/span>dataframe but<span style=\"background-color: #ffffff;\"> an object for further operation in the next step.<\/span><\/p>\n<p>Once I had the object ready, the basic workflow was to perform operation on each chunk and concatenated each of them to form a dataframe in the end (as shown below). By iterating each chunk, I performed data filtering\/preprocessing using a function \u2014 <strong>chunk_preprocessing\u00a0<\/strong>before appending each chunk to a list. And finally I concatenated the list into a final dataframe to fit into the local memory.<\/p>\n<h4 id=\"589f\">2. Filter out unimportant columns to save\u00a0memory<\/h4>\n<p id=\"ae83\">Great. At this stage, I already had a dataframe to do all sorts of analysis required.<\/p>\n<p id=\"f620\">To save more time for data manipulation and computation, I further filtered out some unimportant columns to save more memory.<\/p>\n<h4 id=\"7588\">3. Change dtypes for\u00a0columns<\/h4>\n<p id=\"2dc9\">The simplest way to\u00a0<a href=\"http:\/\/pbpython.com\/pandas_dtypes.html\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"http:\/\/pbpython.com\/pandas_dtypes.html\" data->convert a pandas column of data to a different type<\/a>\u00a0is to use\u00a0<code>astype().<\/code><\/p>\n<p id=\"91c7\">I can say that changing data types in Pandas is extremely helpful to save memory, especially if you have large data for intense analysis or computation (For example, feed data into your machine learning model for training).<\/p>\n<p id=\"9db1\">By reducing the bits required to store the data, I reduced the overall memory usage by the data up to 50%\u00a0!<\/p>\n<p id=\"dae8\">Give it a try. And I believe you\u2019ll find that useful as well! Let me know how it goes.<\/p>\n<h3 id=\"3941\">Final Thoughts<\/h3>\n<figure id=\"0f98\"><canvas width=\"75\" height=\"50\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/0*JelzfMU0UVqV1WxD\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/0*JelzfMU0UVqV1WxD\" \/><\/figure>\n<p id=\"24a4\" style=\"text-align: center;\"><span style=\"font-size: 11px;\"><a href=\"https:\/\/unsplash.com\/photos\/FXFz-sW0uwo\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/unsplash.com\/photos\/FXFz-sW0uwo\" data->(Source)<\/a><\/span><\/p>\n<p id=\"2614\">I hope that sharing my experience in using Pandas with large data could help you explore another useful feature in Pandas to deal with large data by reducing memory usage and ultimately improving computational efficiency.<\/p>\n<p id=\"9dd4\">Typically, Pandas has\u00a0<a href=\"https:\/\/medium.com\/p\/9594dda2ea4c\/edit\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/medium.com\/p\/9594dda2ea4c\/edit\" data->most of the features<\/a>\u00a0that we need for data wrangling and analysis. I strongly encourage you to check them out as they\u2019d come in handy to you next time.<\/p>\n<p id=\"203c\">Also, if you\u2019re serious about learning how to do data analysis in Python, then this book is for you \u2014 <a href=\"https:\/\/www.amazon.com\/Python-Data-Analysis-Wrangling-IPython\/dp\/1449319793\/ref=as_li_ss_tl?ie=UTF8&amp;linkCode=ll1&amp;tag=admond-20&amp;linkId=3b36a4cb00a369cf09eb1d7af9690b8a\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/www.amazon.com\/Python-Data-Analysis-Wrangling-IPython\/dp\/1449319793\/ref=as_li_ss_tl?ie=UTF8&amp;linkCode=ll1&amp;tag=admond-20&amp;linkId=3b36a4cb00a369cf09eb1d7af9690b8a\" data-><strong>Python for Data Analysis<\/strong><\/a>. With complete instructions for manipulating, processing, cleaning, and crunching datasets in Python using Pandas, the book gives a comprehensive and step-by-step guide to effectively use Pandas in your analysis.<\/p>\n<p id=\"e55f\">Hope this helps!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>There is a stark difference between large data and big data.&nbsp;Using Pandas with large data could help you explore another useful feature in Pandas to deal with large data by reducing memory usage and ultimately improving computational efficiency. Typically, Pandas has&nbsp;most of the features&nbsp;that we need for data wrangling and analysis. Pandas is&nbsp;seriously&nbsp;a game changer when it comes to cleaning, transforming, manipulating and analyzing data. In simple terms, Pandas helps to&nbsp;clean the mess.<\/p>\n","protected":false},"author":493,"featured_media":4080,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[95],"ppma_author":[1922],"class_list":["post-1562","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-big-data-amp-technology"],"authors":[{"term_id":1922,"user_id":493,"is_guest":0,"slug":"admond-lee-kin-lim","display_name":"Admond Lee","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/04\/medium_0878f2cd-4527-4f07-8f34-5711c7f91a5b-150x150.jpg","user_url":"http:\/\/www.micron.com","last_name":"Lee","first_name":"Admond","job_title":"","description":"Admond Lee Kin Lim\u00a0is Big Data Engineer at <a href=\"http:\/\/www.micron.com\/\">Micron Technology<\/a>, a world leader in innovative memory solutions that transform how the world uses information."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1562","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/493"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1562"}],"version-history":[{"count":4,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1562\/revisions"}],"predecessor-version":[{"id":29243,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1562\/revisions\/29243"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/4080"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1562"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1562"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1562"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1562"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}