{"id":9328,"date":"2020-08-14T07:51:31","date_gmt":"2020-08-14T07:51:31","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/?p=9328"},"modified":"2023-11-20T17:21:36","modified_gmt":"2023-11-20T17:21:36","slug":"why-and-how-to-use-dask-with-big-data","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/why-and-how-to-use-dask-with-big-data\/","title":{"rendered":"Why and How to Use Dask with Big Data"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"9328\" class=\"elementor elementor-9328\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-4054ff00 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-eae-slider=\"75944\" data-id=\"4054ff00\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-328414a\" data-eae-slider=\"40391\" data-id=\"328414a\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-11901371 elementor-widget elementor-widget-text-editor\" data-id=\"11901371\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p class=\"wp-block-paragraph\">If you\u2019ve been following my articles, chances are you\u2019ve already read one of my previous articles on\u00a0<a href=\"https:\/\/www.experfy.com\/blog\/why-and-how-to-use-pandas-with-large-data-but-not-big-data\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Why and How to Use Pandas with Large 
Data<\/strong><\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For data scientists,\u00a0<a href=\"https:\/\/pandas.pydata.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">Pandas<\/a>\u00a0is one of the best Python tools for data cleaning and analysis.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It\u2019s\u00a0<em>seriously\u00a0<\/em>a game changer when it comes to cleaning, transforming, manipulating and analyzing data.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">No doubt about it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In fact, I\u2019ve even created my own\u00a0<a href=\"https:\/\/towardsdatascience.com\/the-simple-yet-practical-data-cleaning-codes-ad27c4ce0a38\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>toolbox for data cleaning<\/strong><\/a>\u00a0using Pandas. The toolbox is simply a compilation of common tricks for dealing with messy data in Pandas.<\/p>\n\n\n<hr class=\"wp-block-separator\" \/>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2b1d6ae elementor-widget elementor-widget-heading\" data-id=\"2b1d6ae\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"4299\">My Love-Hate Relationship with Pandas<\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e23a088 elementor-widget elementor-widget-text-editor\" data-id=\"e23a088\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p class=\"wp-block-paragraph\">Don\u2019t get me wrong.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Pandas is great. 
It\u2019s powerful.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-69d5394 elementor-widget elementor-widget-image\" data-id=\"69d5394\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1024\/0*VkJfLzCCvsUZOABk.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-81585d8 elementor-widget elementor-widget-text-editor\" data-id=\"81585d8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p class=\"wp-block-paragraph\">It\u2019s still one of the most popular data science tools for data cleaning and analytics.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">However, after some time in the data science field, the data volume I was dealing with grew from 10MB to 10GB, 100GB, 500GB, and sometimes even more than that.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">My PC suffered from<strong>\u00a0low performance and long runtimes<\/strong>\u00a0due to inefficient local memory usage for data that was larger than 100GB.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">That was when I realized Pandas wasn\u2019t initially designed for data at that scale.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">That was also when I realized the\u00a0<strong>stark difference between large data and big data<\/strong>.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-105d259 elementor-widget elementor-widget-text-editor\" data-id=\"105d259\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>The words large and big are in themselves \u201crelative\u201d, and in my humble opinion, large data means data sets smaller than 100GB.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Now, Pandas is very efficient with small data (usually from 100MB up to 1GB), where performance is rarely a concern.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>But when your data is way larger than your local\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Random-access_memory\" target=\"_blank\" rel=\"noreferrer noopener\">RAM<\/a>\u00a0(say 100GB), you can either still use\u00a0<a href=\"https:\/\/www.experfy.com\/blog\/why-and-how-to-use-pandas-with-large-data-but-not-big-data\/\" target=\"_blank\" rel=\"noreferrer noopener\">Pandas to handle data with some tricks<\/a>\u00a0to a certain extent or choose a better tool: in this case,\u00a0<a href=\"https:\/\/dask.org\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Dask<\/strong><\/a>.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>This time, I chose the latter.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-366e814 elementor-widget elementor-widget-heading\" data-id=\"366e814\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"26c4\">Why Does Dask Work Like MAGIC?<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c5564bc elementor-widget elementor-widget-image\" data-id=\"c5564bc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" 
src=\"https:\/\/miro.medium.com\/max\/1308\/0*h4ZbcLyB3B3idib-.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8e7c027 elementor-widget elementor-widget-text-editor\" data-id=\"8e7c027\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Some of you might already be familiar with\u00a0<a href=\"https:\/\/dask.org\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Dask<\/strong><\/a>.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>But to most aspiring data scientists or people who just got started in data science, Dask might sound a little bit foreign.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>And this is perfectly fine.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>In fact, I didn\u2019t get to know Dask until I faced the real limitations of Pandas.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:quote -->\n<blockquote class=\"wp-block-quote\">\n<p>Keep in mind that Dask is\u00a0<strong><em>not a necessity<\/em><\/strong>\u00a0if your data volume is small enough to fit into your PC\u2019s memory.<\/p>\n<\/blockquote>\n<!-- \/wp:quote -->\n\n<!-- wp:quote -->\n<blockquote class=\"wp-block-quote\">\n<p>So the question now is\u2026<\/p>\n<p><strong>What\u2019s Dask, and why is Dask better than Pandas at handling big data?<\/strong><\/p>\n<\/blockquote>\n<!-- \/wp:quote -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-401febb elementor-widget elementor-widget-heading\" data-id=\"401febb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><!-- 
wp:heading -->\n<h2 id=\"5e34\">Dask is popularly known as a Python parallel computing library<\/h2>\n<!-- \/wp:heading --><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-11cc0fb elementor-widget elementor-widget-text-editor\" data-id=\"11cc0fb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Through its parallel computing features,\u00a0<a href=\"https:\/\/docs.dask.org\/en\/latest\/why.html\" target=\"_blank\" rel=\"noreferrer noopener\">Dask<\/a>\u00a0allows for rapid and efficient scaling of computation.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>It provides an easy way to\u00a0<strong>handle large and big data<\/strong>\u00a0in Python with minimal extra effort beyond the regular Pandas workflow.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>In other words, Dask allows us to easily\u00a0<strong>scale out to clusters<\/strong>\u00a0to handle big data or\u00a0<strong>scale down to single computers<\/strong>\u00a0to handle large data by harnessing the full power of the CPU\/GPU, all beautifully integrated with Python code.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Cool, isn\u2019t it?<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Think of Dask as an extension of Pandas in terms of\u00a0<strong>performance and scalability<\/strong>.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>What\u2019s even cooler is that you can switch between a Dask dataframe and a Pandas dataframe to do any data transformation and operation on demand.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8bcbba0 elementor-widget elementor-widget-heading\" data-id=\"8bcbba0\" data-element_type=\"widget\" data-e-type=\"widget\" 
data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"9ed5\">How to Use Dask with Big Data<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ea02fcf elementor-widget elementor-widget-text-editor\" data-id=\"ea02fcf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Okay, enough theory.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>It\u2019s time to get our hands dirty.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>You can\u00a0<a href=\"https:\/\/docs.dask.org\/en\/latest\/install.html\" target=\"_blank\" rel=\"noreferrer noopener\">install Dask<\/a>\u00a0and try it on your local PC using your CPU\/GPU.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:quote -->\n<blockquote class=\"wp-block-quote\">\n<p>But we\u2019re talking about\u00a0<strong>big data<\/strong>\u00a0here, so let\u2019s do something\u00a0<strong>different<\/strong>.<\/p>\n<p>Let\u2019s go\u00a0<strong>BIG<\/strong>.<\/p>\n<\/blockquote>\n<!-- \/wp:quote -->\n\n<!-- wp:paragraph -->\n<p>Instead of taming the \u201cbeast\u201d by scaling down to single computers, let\u2019s discover the full power of the \u201cbeast\u201d by\u00a0<strong>scaling out to clusters<\/strong>, for\u00a0<strong>FREE<\/strong>.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>YES, I mean it.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5c328af elementor-widget elementor-widget-text-editor\" data-id=\"5c328af\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph 
-->\n<p>Understanding that setting up a cluster (AWS for example) and connecting Jupyter notebook to the cloud can be a pain to some data scientists, especially for beginners in cloud computing, let\u2019s use\u00a0<a href=\"https:\/\/www.saturncloud.io\/s\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Saturn Cloud<\/strong><\/a>.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>This is a new platform that I\u2019ve been trying out recently.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Saturn Cloud is a managed data science and machine learning platform that automates DevOps and ML infrastructure engineering.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>To my surprise, it uses\u00a0<strong>Jupyter and Dask<\/strong>\u00a0to\u00a0<strong>scale Python for big data<\/strong>\u00a0using the libraries we know and love (Numpy, Pandas, Scikit-Learn etc.). It also leverages\u00a0<a href=\"https:\/\/www.docker.com\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Docker<\/strong><\/a><strong>\u00a0and\u00a0<\/strong><a href=\"https:\/\/kubernetes.io\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Kubernetes<\/strong><\/a>\u00a0so that your data science work is reproducible, shareable and ready for production.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>There are three main types of Dask\u2019s user interfaces, namely Array, Bag, and Dataframe. 
We\u2019ll focus mainly on\u00a0<strong>Dask Dataframe<\/strong>\u00a0in the code snippets below, as this is what we would mostly use for data cleaning and analytics as data scientists.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-59894dd elementor-widget elementor-widget-heading\" data-id=\"59894dd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><!-- wp:heading -->\n<h2 id=\"cd3b\">1. Read CSV files to Dask dataframe<\/h2><!-- \/wp:heading --><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6ee94ad elementor-widget elementor-widget-text-editor\" data-id=\"6ee94ad\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:preformatted -->\n<pre class=\"wp-block-preformatted\">import dask.dataframe as dd\n\n# blocksize=32e6 asks Dask to split the CSV into partitions of roughly 32 MB each\ndf = dd.read_csv('<a href=\"https:\/\/e-commerce-data.s3.amazonaws.com\/E-commerce+Data+(1).csv\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/e-commerce-data.s3.amazonaws.com\/E-commerce+Data+(1).csv<\/a>', encoding='ISO-8859-1', blocksize=32e6)<\/pre>\n<!-- \/wp:preformatted -->\n\n<!-- wp:paragraph -->\n<p>A Dask dataframe works much like a Pandas dataframe for ordinary file reading and data transformation, which makes it so attractive to data scientists, as you\u2019ll see later.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Here we just read a single CSV file stored in\u00a0<a href=\"https:\/\/aws.amazon.com\/s3\/\" target=\"_blank\" rel=\"noreferrer noopener\">S3<\/a>. 
Since we just want to test out the Dask dataframe, the file is quite small, with 541,909 rows.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-61df158 elementor-widget elementor-widget-image\" data-id=\"61df158\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/837\/1*qSBydmkhFAJ_cj9GD3Z28w.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0f4f4fe elementor-widget elementor-widget-text-editor\" data-id=\"0f4f4fe\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><strong>NOTE:<\/strong>\u00a0We can also\u00a0<a href=\"http:\/\/docs.saturncloud.io\/en\/articles\/3760116-read-public-data-from-s3-in-saturn\" target=\"_blank\" rel=\"noreferrer noopener\">read multiple files<\/a>\u00a0into the Dask dataframe in one line of code, regardless of the file sizes.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>When we load up our data from the CSV, Dask will create a DataFrame that is\u00a0<a href=\"https:\/\/docs.dask.org\/en\/latest\/dataframe.html#design\" target=\"_blank\" rel=\"noreferrer noopener\">row-wise partitioned<\/a>, i.e., rows are grouped by index value. 
That\u2019s how Dask is able to load the data into memory on-demand and process it super fast \u2014<strong>\u00a0it goes by partition<\/strong>.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8c52152 elementor-widget elementor-widget-image\" data-id=\"8c52152\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/372\/0*IZmDXucl3oksi6oF.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f822d5d elementor-widget elementor-widget-text-editor\" data-id=\"f822d5d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>In our case, we see that the Dask dataframe has 2 partitions (this is because of the\u00a0<code>blocksize<\/code>\u00a0specified when reading CSV) with 8 tasks.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>\u201cPartitions\u201d<\/strong>\u00a0here simply mean the number of Pandas dataframes split within the Dask dataframe.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The more partitions we have, the more tasks we will need for each computation.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-27cc8af elementor-widget elementor-widget-image\" data-id=\"27cc8af\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/675\/1*OCPrJHdAbn3cAYNl6BcLSA.png\" alt=\"\" 
\/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9e8921c elementor-widget elementor-widget-heading\" data-id=\"9e8921c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"e5c0\">2. Use\u00a0<code>compute()<\/code>\u00a0to execute the operation<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bdf07b0 elementor-widget elementor-widget-text-editor\" data-id=\"bdf07b0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>We\u2019ve now read the CSV file into a Dask dataframe.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>It is important to remember that, while a Dask dataframe is very similar to a Pandas dataframe, some differences do exist.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The main difference that I noticed is the\u00a0<code>compute<\/code>\u00a0method in Dask dataframe.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:preformatted -->\n<pre class=\"wp-block-preformatted\">df.UnitPrice.mean().compute()<\/pre>\n<!-- \/wp:preformatted -->\n\n<!-- wp:paragraph -->\n<p>Most Dask user interfaces are\u00a0<strong><em>lazy<\/em><\/strong>, meaning that\u00a0<strong>they don\u2019t evaluate until you explicitly ask for a result\u00a0<\/strong>using the\u00a0<code>compute<\/code>\u00a0method.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>This is how we calculate the mean of\u00a0<code>UnitPrice<\/code>: by adding the\u00a0<code>compute<\/code>\u00a0method right after the\u00a0<code>mean<\/code>\u00a0method.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element 
elementor-element-d3c7244 elementor-widget elementor-widget-heading\" data-id=\"d3c7244\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"734f\">3. Check number of missing values for each column<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4711c25 elementor-widget elementor-widget-text-editor\" data-id=\"4711c25\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:preformatted -->\n<pre class=\"wp-block-preformatted\">df.isnull().sum().compute()<\/pre>\n<!-- \/wp:preformatted -->\n\n<!-- wp:paragraph -->\n<p>Similarly, if we want to check the number of missing values for each column, we need to add the\u00a0<code>compute<\/code>\u00a0method.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1521750 elementor-widget elementor-widget-heading\" data-id=\"1521750\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">4. 
Filter rows based on conditions<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1974509 elementor-widget elementor-widget-text-editor\" data-id=\"1974509\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:preformatted -->\n<pre class=\"wp-block-preformatted\">df[df.Quantity &lt; 10].compute()<\/pre>\n<!-- \/wp:preformatted -->\n\n<!-- wp:paragraph -->\n<p>During the data cleaning or Exploratory Data Analysis (EDA) process, we often need to filter rows based on certain conditions to understand the \u201cstory\u201d behind the data.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>We can do exactly the same operation as in Pandas by just adding the\u00a0<code>compute<\/code>\u00a0method.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>And BOOM! We get the results!<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-34e4507 elementor-widget elementor-widget-heading\" data-id=\"34e4507\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><!-- wp:heading -->\n<h2 id=\"08b0\">DEMO to create Dask cluster &amp; run Jupyter at scale with Python<\/h2>\n<!-- \/wp:heading --><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e9b7698 elementor-widget elementor-widget-text-editor\" data-id=\"e9b7698\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>We now understand how to use Dask in general.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph 
-->\n<p>It\u2019s time to see how to\u00a0create a Dask cluster on Saturn Cloud\u00a0and run Python code in Jupyter at scale.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>I recorded a short video to show you exactly how to do the setup and run Python code in a Dask cluster in minutes. Enjoy!<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:html -->\n<figure><iframe src=\"https:\/\/cdn.embedly.com\/widgets\/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fn4SvV1spra0%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dn4SvV1spra0&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fn4SvV1spra0%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube\" width=\"654\" height=\"430\" allowfullscreen=\"allowfullscreen\"><\/iframe>\n<figcaption><a href=\"http:\/\/bit.ly\/saturn-cloud-demo\" target=\"_blank\" rel=\"noreferrer noopener\">How to create a Dask cluster and run Jupyter Notebook on Saturn Cloud<\/a><\/figcaption>\n<\/figure>\n<!-- \/wp:html -->\n\n<!-- wp:separator --><hr class=\"wp-block-separator\" \/><!-- \/wp:separator -->\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9cc3ac8 elementor-widget elementor-widget-heading\" data-id=\"9cc3ac8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"e0b1\">Final Thoughts<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-93fdebc elementor-widget elementor-widget-text-editor\" data-id=\"93fdebc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><br><\/p>\n<p>Thank you for reading.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph 
--><\/p>\n<p>In terms of functionality, Pandas still wins.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>In terms of performance and scalability, Dask is ahead of Pandas.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>In my opinion, if you have data that\u2019s larger than a few GB (comparable to your RAM), go with Dask for the sake of performance and scalability.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>If you want to create a Dask cluster in minutes and run your Python code at scale, I highly recommend getting the&nbsp;<a href=\"http:\/\/bit.ly\/saturn-cloud-community-edition\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>community edition of Saturn Cloud here for FREE<\/strong><\/a>.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>When you have more data that\u2019s way larger than your local RAM (say 100GB), you can either still use Pandas to handle data with some tricks to a certain extent or choose a better tool \u2014 in this case, Dask.<\/p>\n","protected":false},"author":493,"featured_media":9329,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[187],"tags":[122,281,216],"ppma_author":[1922],"class_list":["post-9328","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-big-data","tag-dask","tag-pandas"],"authors":[{"term_id":1922,"user_id":493,"is_guest":0,"slug":"admond-lee-kin-lim","display_name":"Admond 
Lee","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/04\/medium_0878f2cd-4527-4f07-8f34-5711c7f91a5b-150x150.jpg","author_category":"","user_url":"http:\/\/www.micron.com","last_name":"Lee","first_name":"Admond","job_title":"","description":"Admond Lee Kin Lim\u00a0is Big Data Engineer at <a href=\"http:\/\/www.micron.com\/\">Micron Technology<\/a>, a world leader in innovative memory solutions that transform how the world uses information."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/9328","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/493"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=9328"}],"version-history":[{"count":0,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/9328\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/9329"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=9328"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=9328"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=9328"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=9328"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}