{"id":22666,"date":"2021-03-05T10:52:31","date_gmt":"2021-03-05T10:52:31","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/still-using-pandas-process-big-data-2021\/"},"modified":"2023-09-04T11:58:11","modified_gmt":"2023-09-04T11:58:11","slug":"still-using-pandas-process-big-data-2021","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/still-using-pandas-process-big-data-2021\/","title":{"rendered":"Are You Still Using Pandas to Process Big Data in 2021?"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"22666\" class=\"elementor elementor-22666\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-61ccca0 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"61ccca0\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-f762f67\" data-id=\"f762f67\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-8b5bc64 elementor-widget elementor-widget-text-editor\" data-id=\"8b5bc64\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"has-medium-font-size\"><em>Pandas doesn\u2019t handle Big Data well. These two libraries do! Which one is better? 
Faster?<\/em><\/p>\n<p id=\"4a36\">I recently wrote two introductory articles about processing Big Data with\u00a0<a href=\"https:\/\/towardsdatascience.com\/are-you-still-using-pandas-for-big-data-12788018ba1a\" target=\"_blank\" rel=\"noreferrer noopener\">Dask<\/a>\u00a0and\u00a0<a href=\"https:\/\/towardsdatascience.com\/how-to-process-a-dataframe-with-billions-of-rows-in-seconds-c8212580f447\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">Vaex<\/a>, two libraries for processing bigger-than-memory datasets. While writing, a question popped into my mind:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d6560b9 elementor-widget elementor-widget-text-editor\" data-id=\"d6560b9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<blockquote class=\"wp-block-quote\"><p><strong><em>Can these libraries really process bigger-than-memory datasets or is it all just a sales slogan?<\/em><\/strong><\/p><\/blockquote>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-76510d0 elementor-widget elementor-widget-text-editor\" data-id=\"76510d0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"42b8\">This prompted me to run a practical experiment with <a href=\"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/why-and-how-to-use-dask-with-big-data\/\" target=\"_blank\" rel=\"noreferrer noopener\">Dask <\/a>and Vaex and try to process a bigger-than-memory dataset. 
The dataset was so big that you cannot even open it with pandas.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9367a28 elementor-widget elementor-widget-heading\" data-id=\"9367a28\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">What do I mean by Big Data?<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1d85edd elementor-widget elementor-widget-image\" data-id=\"1d85edd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"683\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Ao9I4CsTxA9Zd-uZ-1024x683.jpeg\" class=\"attachment-large size-large wp-image-18862\" alt=\"Are You Still Using Pandas to Process Big Data in 2021?\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Ao9I4CsTxA9Zd-uZ-1024x683.jpeg 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Ao9I4CsTxA9Zd-uZ-300x200.jpeg 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Ao9I4CsTxA9Zd-uZ-768x512.jpeg 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Ao9I4CsTxA9Zd-uZ-1536x1024.jpeg 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Ao9I4CsTxA9Zd-uZ-2048x1365.jpeg 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Ao9I4CsTxA9Zd-uZ-610x407.jpeg 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Ao9I4CsTxA9Zd-uZ-750x500.jpeg 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0Ao9I4CsTxA9Zd-uZ-1140x760.jpeg 1140w\" 
sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@ev?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">ev<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Unsplash<\/a><\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c852360 elementor-widget elementor-widget-text-editor\" data-id=\"c852360\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"7d0c\">Big Data is a loosely defined term,&nbsp;which has as many definitions as there are hits on Google. In this article, I use the term to describe a dataset that is so big that we need specialized software to process it. 
With Big, I am referring to \u201cbigger than the main memory on a single machine\u201d.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9c29741 elementor-widget elementor-widget-text-editor\" data-id=\"9c29741\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<blockquote class=\"wp-block-quote\"><p>Definition from Wikipedia:<\/p><p>Big data is a field that treats ways to analyze, systematically extract information from, or otherwise, deal with data sets that are too large or complex to be dealt with by traditional data-processing application software.<\/p><\/blockquote>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7a3e312 elementor-widget elementor-widget-heading\" data-id=\"7a3e312\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">What are Dask and Vaex?<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-89d0090 elementor-widget elementor-widget-image\" data-id=\"89d0090\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" width=\"1024\" height=\"682\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0rlzJyQLjigA-3HTJ-1024x682.jpeg\" class=\"attachment-large size-large wp-image-18863\" alt=\"What are Dask and Vaex?\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0rlzJyQLjigA-3HTJ-1024x682.jpeg 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0rlzJyQLjigA-3HTJ-300x200.jpeg 300w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0rlzJyQLjigA-3HTJ-768x512.jpeg 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0rlzJyQLjigA-3HTJ-1536x1024.jpeg 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0rlzJyQLjigA-3HTJ-2048x1365.jpeg 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0rlzJyQLjigA-3HTJ-610x407.jpeg 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0rlzJyQLjigA-3HTJ-750x500.jpeg 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0rlzJyQLjigA-3HTJ-1140x760.jpeg 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@jeshoots?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">JESHOOTS.COM<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Unsplash<\/a><\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-39afda4 elementor-widget elementor-widget-text-editor\" data-id=\"39afda4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"a798\"><strong>Dask<\/strong>&nbsp;provides advanced parallelism for analytics, enabling performance at scale for the tools you love. This includes numpy, pandas and sklearn. It is open-source and freely available. It uses existing Python APIs and data structures to make it easy to switch between these libraries and their Dask-powered equivalents.<\/p>\n<p id=\"cd0e\"><strong>Vaex<\/strong>&nbsp;is a high-performance Python library for lazy Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. 
It can calculate basic statistics at a rate of more than a billion rows per second. It supports multiple visualizations, allowing interactive exploration of big data.<\/p>\n<p id=\"f20e\">Dask and Vaex Dataframes are not fully compatible with Pandas Dataframes, but some of the most common \u201cdata wrangling\u201d operations are supported by both tools.&nbsp;<mark>Dask is more focused on scaling the code to compute clusters, while Vaex makes it easier to work with large datasets on a single machine.<\/mark><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-874dcb2 elementor-widget elementor-widget-heading\" data-id=\"874dcb2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">The Experiment<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2020c9a elementor-widget elementor-widget-image\" data-id=\"2020c9a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" width=\"1024\" height=\"682\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0XsiFMRdMK5OTtmpx-1024x682.jpeg\" class=\"attachment-large size-large wp-image-18864\" alt=\"Are You Still Using Pandas to Process Big Data in 2021?\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0XsiFMRdMK5OTtmpx-1024x682.jpeg 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0XsiFMRdMK5OTtmpx-300x200.jpeg 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0XsiFMRdMK5OTtmpx-768x512.jpeg 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0XsiFMRdMK5OTtmpx-1536x1024.jpeg 1536w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0XsiFMRdMK5OTtmpx-2048x1365.jpeg 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0XsiFMRdMK5OTtmpx-610x407.jpeg 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0XsiFMRdMK5OTtmpx-750x500.jpeg 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0XsiFMRdMK5OTtmpx-1140x760.jpeg 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@_louisreed?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Louis Reed<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Unsplash<\/a><\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-accafe5 elementor-widget elementor-widget-text-editor\" data-id=\"accafe5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"eda8\">I\u2019ve generated two CSV files with 1 million rows and 1000 columns. The size of a file was 18.18 GB, which is 36.36 GB combined. 
Files have random numbers from a Uniform distribution between 0 and 100.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fef99b2 elementor-widget elementor-widget-image\" data-id=\"fef99b2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"660\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1AlkOauFhiLK1PyaxUFRK7Q-1024x660.png\" class=\"attachment-large size-large wp-image-18865\" alt=\"Are You Still Using Pandas to Process Big Data in 2021?\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1AlkOauFhiLK1PyaxUFRK7Q-1024x660.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1AlkOauFhiLK1PyaxUFRK7Q-300x193.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1AlkOauFhiLK1PyaxUFRK7Q-768x495.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1AlkOauFhiLK1PyaxUFRK7Q-1536x990.png 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1AlkOauFhiLK1PyaxUFRK7Q-610x393.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1AlkOauFhiLK1PyaxUFRK7Q-750x483.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1AlkOauFhiLK1PyaxUFRK7Q-1140x735.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1AlkOauFhiLK1PyaxUFRK7Q.png 1974w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Two CSV files with random data. 
Photo made by the author<\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a8f4579 elementor-widget elementor-widget-text-editor\" data-id=\"a8f4579\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">import pandas as pd<br>import numpy as np<br>from os import path<br><br>n_rows = 1_000_000<br>n_cols = 1000<br><br># write two CSV files filled with uniform random numbers in [0, 100)<br>for i in range(1, 3):<br>    filename = 'analysis_%d.csv' % i<br>    file_path = path.join('csv_files', filename)<br>    df = pd.DataFrame(np.random.uniform(0, 100, size=(n_rows, n_cols)), columns=['col%d' % c for c in range(n_cols)])<br>    print('Saving', file_path)<br>    df.to_csv(file_path, index=False)<br><br>df.head()<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4cda382 elementor-widget elementor-widget-image\" data-id=\"4cda382\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"257\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1in_UallF5DDsxiBHuCIMWw-1024x257.png\" class=\"attachment-large size-large wp-image-18866\" alt=\"Head of a file\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1in_UallF5DDsxiBHuCIMWw-1024x257.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1in_UallF5DDsxiBHuCIMWw-300x75.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1in_UallF5DDsxiBHuCIMWw-768x192.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1in_UallF5DDsxiBHuCIMWw-1536x385.png 1536w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1in_UallF5DDsxiBHuCIMWw-610x153.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1in_UallF5DDsxiBHuCIMWw-750x188.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1in_UallF5DDsxiBHuCIMWw-1140x286.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1in_UallF5DDsxiBHuCIMWw.png 1572w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Head of a file. Photo made by the author<\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1efcd16 elementor-widget elementor-widget-text-editor\" data-id=\"1efcd16\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"c594\">The experiment was run on a MacBook Pro with 32 GB of main memory \u2014 quite a beast. When testing the limits of a pandas Dataframe, I surprisingly found out that reaching a Memory Error on such a machine is quite a challenge!<\/p>\n<p id=\"7ba1\">macOS starts dumping data from the main memory to SSD when the memory is running near its capacity. The upper limit for pandas Dataframe was 100 GB of free disk space on the machine.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-254634c elementor-widget elementor-widget-text-editor\" data-id=\"254634c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<blockquote class=\"wp-block-quote\"><p>When your Mac needs memory it will push something that isn\u2019t currently being used into a swapfile for temporary storage. 
When it needs access again, it will read the data from the swap file and back into memory.<\/p><\/blockquote>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7599b12 elementor-widget elementor-widget-text-editor\" data-id=\"7599b12\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"0757\">I spent some time thinking about how I should address this issue so that the experiment would be fair. The first idea that came to mind was to disable swapping so that each library would have only the main memory available, but good luck with that on macOS. Even after a few hours of trying, I wasn\u2019t able to disable swapping.<\/p>\n<p id=\"dc04\">The second idea was to use a brute-force approach. I filled the SSD to its full capacity so that the operating system couldn\u2019t swap, as there was no free space left on the device.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-860d4dc elementor-widget elementor-widget-image\" data-id=\"860d4dc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"563\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1GzQv3Hi5NqJPGRlLmDkLDQ-1024x563.png\" class=\"attachment-large size-large wp-image-18867\" alt=\"Your disk is almost full notification during the experiment.\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1GzQv3Hi5NqJPGRlLmDkLDQ-1024x563.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1GzQv3Hi5NqJPGRlLmDkLDQ-300x165.png 300w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1GzQv3Hi5NqJPGRlLmDkLDQ-768x422.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1GzQv3Hi5NqJPGRlLmDkLDQ-1536x844.png 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1GzQv3Hi5NqJPGRlLmDkLDQ-2048x1125.png 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1GzQv3Hi5NqJPGRlLmDkLDQ-610x335.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1GzQv3Hi5NqJPGRlLmDkLDQ-750x412.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1GzQv3Hi5NqJPGRlLmDkLDQ-1140x626.png 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Your disk is almost full notification during the experiment. Photo made by the author<\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-038aa24 elementor-widget elementor-widget-text-editor\" data-id=\"038aa24\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"eeff\">This worked! pandas couldn\u2019t read the two 18 GB files and the Jupyter kernel crashed.<\/p>\n<p id=\"c086\">If I were to perform this experiment again, I would create a virtual machine with less memory. That way it would be easier to show the limits of these tools.<\/p>\n<p id=\"fbde\">Can Dask or Vaex help us process these large files? Which one is faster? 
Let\u2019s find out.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-18f1af7 elementor-widget elementor-widget-heading\" data-id=\"18f1af7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Vaex vs Dask<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9f17924 elementor-widget elementor-widget-image\" data-id=\"9f17924\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"709\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0asf1BDHdxjbq41-3-1024x709.jpeg\" class=\"attachment-large size-large wp-image-18868\" alt=\"Are You Still Using Pandas to Process Big Data in 2021?\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0asf1BDHdxjbq41-3-1024x709.jpeg 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0asf1BDHdxjbq41-3-300x208.jpeg 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0asf1BDHdxjbq41-3-768x532.jpeg 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0asf1BDHdxjbq41-3-1536x1064.jpeg 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0asf1BDHdxjbq41-3-2048x1419.jpeg 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0asf1BDHdxjbq41-3-610x423.jpeg 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0asf1BDHdxjbq41-3-750x520.jpeg 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0asf1BDHdxjbq41-3-1140x790.jpeg 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" 
\/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@fridooh?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" class=\"broken_link\" rel=\"noopener\">Frida Bredesen<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Unsplash<\/a><\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-23034b0 elementor-widget elementor-widget-text-editor\" data-id=\"23034b0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"abed\">When designing the experiment, I thought about basic operations when performing Data Analysis, like grouping, filtering and visualizing data. I came up with the following operations:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-849890c elementor-widget elementor-widget-text-editor\" data-id=\"849890c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li>calculating 10th quantile of a column,<\/li><li>adding a new column,<\/li><li>filtering by column,<\/li><li>grouping by column and aggregating,<\/li><li>visualizing a column.<\/li><\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d6cf26e elementor-widget elementor-widget-text-editor\" data-id=\"d6cf26e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"553f\">All of the above operations perform a calculation using a single column, 
e.g.:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3bd3893 elementor-widget elementor-widget-text-editor\" data-id=\"3bd3893\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\"># filtering with a single column<br>df[df.col2 &gt; 10]<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bbf82fe elementor-widget elementor-widget-text-editor\" data-id=\"bbf82fe\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"5e01\">So I was also curious to try an operation that requires all of the data to be processed:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-02b97ac elementor-widget elementor-widget-text-editor\" data-id=\"02b97ac\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li>calculate the sum of all of the columns.<\/li><\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ca672d4 elementor-widget elementor-widget-text-editor\" data-id=\"ca672d4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"ddbf\">This can be achieved by breaking the calculation down into smaller chunks, for example by reading each column separately, calculating its sum, and in the last step adding the per-column sums into the overall sum. 
These types of computational problems are known as\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Embarrassingly_parallel\" target=\"_blank\" rel=\"noreferrer noopener\">Embarrassingly parallel<\/a>\u00a0\u2014 no effort is required to separate the problem into separate tasks.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4e918e2 elementor-widget elementor-widget-heading\" data-id=\"4e918e2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Vaex<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f89750e elementor-widget elementor-widget-image\" data-id=\"f89750e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"682\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0JiVrNI8tHm1yEyNA-1024x682.jpeg\" class=\"attachment-large size-large wp-image-18869\" alt=\"Vaex\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0JiVrNI8tHm1yEyNA-1024x682.jpeg 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0JiVrNI8tHm1yEyNA-300x200.jpeg 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0JiVrNI8tHm1yEyNA-768x512.jpeg 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0JiVrNI8tHm1yEyNA-1536x1024.jpeg 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0JiVrNI8tHm1yEyNA-2048x1365.jpeg 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0JiVrNI8tHm1yEyNA-610x407.jpeg 610w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0JiVrNI8tHm1yEyNA-750x500.jpeg 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0JiVrNI8tHm1yEyNA-1140x760.jpeg 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@photos_by_lanty?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Photos by Lanty<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Unsplash<\/a><\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3cea5f3 elementor-widget elementor-widget-text-editor\" data-id=\"3cea5f3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"3669\">Let\u2019s start with Vaex. The experiment was designed to follow best practices for each tool; for Vaex, this means using the binary HDF5 format. 
So we need to convert the CSV files to the HDF5 format (Hierarchical Data Format version 5).<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5e02e49 elementor-widget elementor-widget-text-editor\" data-id=\"5e02e49\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">import glob<br>import vaex<br><br>csv_files = glob.glob('csv_files\/*.csv')<br><br># convert each CSV file to HDF5 in chunks of up to 5 million rows<br>for i, csv_file in enumerate(csv_files, 1):<br>    for j, dv in enumerate(vaex.from_csv(csv_file, chunk_size=5_000_000), 1):<br>        print('Exporting %d %s to hdf5 part %d' % (i, csv_file, j))<br>        dv.export_hdf5(f'hdf5_files\/analysis_{i:02}_{j:02}.hdf5')<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9348bfd elementor-widget elementor-widget-text-editor\" data-id=\"9348bfd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"d931\">Vaex needed 405 seconds to convert the two CSV files (36.36 GB) to two HDF5 files, which take up 16 GB combined. 
Conversion from text to binary format reduced the file size.<\/p>\n<p id=\"c4d0\"><strong>Open HDF5 dataset with Vaex:<\/strong><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-31bfcd5 elementor-widget elementor-widget-text-editor\" data-id=\"31bfcd5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">dv = vaex.open('hdf5_files\/*.hdf5')<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-51f028d elementor-widget elementor-widget-text-editor\" data-id=\"51f028d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"220a\">Vaex needed 1218 seconds to read the HDF5 files. I expected it to be faster as Vaex claims near-instant opening of files in binary format.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d2e0d71 elementor-widget elementor-widget-text-editor\" data-id=\"d2e0d71\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<blockquote class=\"wp-block-quote\"><p><a href=\"https:\/\/vaex.readthedocs.io\/en\/latest\/example_io.html#Binary-file-formats\" target=\"_blank\" rel=\"noreferrer noopener\">From Vaex documentation<\/a>:<\/p><p>Opening such data is instantenous regardless of the file size on disk: Vaex will just memory-map the data instead of reading it in memory. 
This is the optimal way of working with large datasets that are larger than available RAM.<\/p><\/blockquote>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3aec053 elementor-widget elementor-widget-text-editor\" data-id=\"3aec053\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"a218\"><strong>Display head with Vaex:<\/strong><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c159a85 elementor-widget elementor-widget-text-editor\" data-id=\"c159a85\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">dv.head()<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a97e3df elementor-widget elementor-widget-text-editor\" data-id=\"a97e3df\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"ba32\">Vaex needed 1189 seconds to display head. 
I am not sure why displaying the first 5 rows of each column took so long.<\/p>\n<p id=\"b9e9\"><strong>Calculate the 10th quantile with Vaex:<\/strong><\/p>\n<p id=\"9990\">Note that Vaex has a percentile_approx function, which calculates an approximation of the quantile.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c4de39d elementor-widget elementor-widget-text-editor\" data-id=\"c4de39d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">quantile = dv.percentile_approx('col1', 10)<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-54a7ff0 elementor-widget elementor-widget-text-editor\" data-id=\"54a7ff0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"f402\">Vaex needed 0 seconds to calculate the approximation of the 10th quantile for the col1 column.<\/p>\n<p id=\"9517\"><strong>Add a new column with Vaex:<\/strong><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-34642b8 elementor-widget elementor-widget-text-editor\" data-id=\"34642b8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">dv['col1_binary'] = dv.col1 &gt; dv.percentile_approx('col1', 10)<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5e83944 elementor-widget elementor-widget-text-editor\" data-id=\"5e83944\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"e309\">Vaex has a concept of virtual columns: a virtual column stores an expression as a column. It does not take up any memory and is computed on the fly when needed. A virtual column is treated just like a normal column. As expected, Vaex needed 0 seconds to execute the command above.<\/p>\n<p id=\"1091\"><strong>Filter data with Vaex:<\/strong><\/p>\n<p id=\"3623\">Vaex has a concept of\u00a0<a href=\"https:\/\/vaex.readthedocs.io\/en\/latest\/tutorial.html#Selections-and-filtering\" target=\"_blank\" rel=\"noreferrer noopener\">selections<\/a>, which I didn\u2019t use because Dask doesn\u2019t support them, and using them would make the experiment unfair. The filter below is similar to filtering with pandas, except that Vaex does not copy the data.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ad7f7ea elementor-widget elementor-widget-text-editor\" data-id=\"ad7f7ea\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">dv = dv[dv.col2 &gt; 10]<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1eee4bb elementor-widget elementor-widget-text-editor\" data-id=\"1eee4bb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"1e2f\">Vaex needed 0 seconds to execute the filter above.<\/p>\n<p id=\"1e96\"><strong>Grouping and aggregating data with Vaex:<\/strong><\/p>\n<p id=\"d8e7\">The command below is slightly different from pandas as it combines grouping and aggregation. 
The command groups the data by col1_binary and calculates the mean for col3:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-747bab9 elementor-widget elementor-widget-text-editor\" data-id=\"747bab9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">group_res = dv.groupby(by=dv.col1_binary, agg={'col3_mean': vaex.agg.mean('col3')})<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3b5f799 elementor-widget elementor-widget-image\" data-id=\"3b5f799\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"137\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1ZQVYuxSE3ZvLf7Ex-1uknA-1024x137.png\" class=\"attachment-large size-large wp-image-18870\" alt=\"Image for post\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1ZQVYuxSE3ZvLf7Ex-1uknA-1024x137.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1ZQVYuxSE3ZvLf7Ex-1uknA-300x40.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1ZQVYuxSE3ZvLf7Ex-1uknA-768x103.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1ZQVYuxSE3ZvLf7Ex-1uknA-610x82.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1ZQVYuxSE3ZvLf7Ex-1uknA-750x100.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1ZQVYuxSE3ZvLf7Ex-1uknA-1140x152.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1ZQVYuxSE3ZvLf7Ex-1uknA.png 1406w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" 
\/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Calculating mean with Vaex. Photo made by the author<\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-afa543c elementor-widget elementor-widget-text-editor\" data-id=\"afa543c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"3868\">Vaex needed 0 seconds to execute the command above.<\/p>\n<p id=\"09fa\"><strong>Visualize the histogram:<\/strong><\/p>\n<p id=\"e173\">Visualization with bigger datasets is problematic as traditional tools for data analysis are not optimized to handle them. Let\u2019s try if we can make a histogram of col3 with Vaex.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a95dea3 elementor-widget elementor-widget-text-editor\" data-id=\"a95dea3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">plot = dv.plot1d(dv.col3, what='count(*)', limits=[0, 100])<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ee16b22 elementor-widget elementor-widget-image\" data-id=\"ee16b22\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"507\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1FPcoGub4_AH_1Oe_86iaLg-1024x507.png\" class=\"attachment-large size-large wp-image-18871\" alt=\"Visualizing data with Vaex\" 
srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1FPcoGub4_AH_1Oe_86iaLg-1024x507.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1FPcoGub4_AH_1Oe_86iaLg-300x149.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1FPcoGub4_AH_1Oe_86iaLg-768x381.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1FPcoGub4_AH_1Oe_86iaLg-1536x761.png 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1FPcoGub4_AH_1Oe_86iaLg-2048x1015.png 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1FPcoGub4_AH_1Oe_86iaLg-610x302.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1FPcoGub4_AH_1Oe_86iaLg-750x372.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1FPcoGub4_AH_1Oe_86iaLg-1140x565.png 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Visualizing data with Vaex. Photo made by the author<\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-40ffdf6 elementor-widget elementor-widget-text-editor\" data-id=\"40ffdf6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"a492\">Vaex needed 0 seconds to display the plot, which was surprisingly fast.<\/p>\n<p id=\"85f0\"><strong>Calculate the sum of all columns<\/strong><\/p>\n<p id=\"9c8f\">Memory is not an issue when processing a single column at a time. 
Let\u2019s try to calculate the sum of all the numbers in the dataset with Vaex.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d8dd901 elementor-widget elementor-widget-text-editor\" data-id=\"d8dd901\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">suma = np.sum(dv.sum(dv.column_names))<\/pre>\n<p id=\"100d\">Vaex needed 40 seconds to calculate the sum of all columns.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0492ee9 elementor-widget elementor-widget-heading\" data-id=\"0492ee9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">Dask<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ab9598a elementor-widget elementor-widget-image\" data-id=\"ab9598a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"678\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0CjC-ygYYj2tzyrry-1024x678.jpeg\" class=\"attachment-large size-large wp-image-18872\" alt=\"Are You Still Using Pandas to Process Big Data in 2021?\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0CjC-ygYYj2tzyrry-1024x678.jpeg 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0CjC-ygYYj2tzyrry-300x199.jpeg 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0CjC-ygYYj2tzyrry-768x509.jpeg 768w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0CjC-ygYYj2tzyrry-1536x1017.jpeg 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0CjC-ygYYj2tzyrry-2048x1356.jpeg 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0CjC-ygYYj2tzyrry-610x404.jpeg 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0CjC-ygYYj2tzyrry-750x497.jpeg 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0CjC-ygYYj2tzyrry-1140x755.jpeg 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@kellysikkema?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Kelly Sikkema<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Unsplash<\/a><\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-22f30d6 elementor-widget elementor-widget-text-editor\" data-id=\"22f30d6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"dda9\">Now, let\u2019s repeat the operations above but with Dask. 
The Jupyter Kernel was restarted before running the Dask commands.<\/p>\n<p id=\"4dbd\">Instead of reading CSV files directly with Dask\u2019s read_csv function, we convert the CSV files to HDF5 to make the experiment fair.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1db446c elementor-widget elementor-widget-text-editor\" data-id=\"1db446c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">import dask.dataframe as dd<br><br>ds = dd.read_csv('csv_files\/*.csv')<br>ds.to_hdf('hdf5_files_dask\/analysis_01_01.hdf5', key='table')<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-25d53b8 elementor-widget elementor-widget-text-editor\" data-id=\"25d53b8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"631f\">Dask needed 763 seconds for the conversion. Let me know in the comments if there is a faster way to convert the data with Dask. I tried, with no luck, to read the HDF5 files that Vaex had produced.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7f4a20f elementor-widget elementor-widget-text-editor\" data-id=\"7f4a20f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<blockquote class=\"wp-block-quote\"><p><a href=\"https:\/\/docs.dask.org\/en\/latest\/dataframe-best-practices.html#store-data-in-apache-parquet-format\" target=\"_blank\" rel=\"noreferrer noopener\">Best practices with Dask<\/a>:<\/p><p>HDF5 is a popular choice for Pandas users with high performance needs. 
We encourage Dask DataFrame users to store and load data using Parquet instead.<\/p><\/blockquote>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-deb3864 elementor-widget elementor-widget-text-editor\" data-id=\"deb3864\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"5b22\"><strong>Open HDF5 dataset with Dask:<\/strong><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-33a46e7 elementor-widget elementor-widget-text-editor\" data-id=\"33a46e7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">import dask.dataframe as dd<br><br>ds = dd.read_hdf('hdf5_files_dask\/analysis_01_01.hdf5', key='table')<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d85235a elementor-widget elementor-widget-text-editor\" data-id=\"d85235a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"9a63\">Dask needed 0 seconds to open the HDF5 file. 
This is because I didn\u2019t explicitly run the compute command, which would actually read the file.<\/p><p id=\"1f66\"><strong>Display head with Dask:<\/strong><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5c9eac1 elementor-widget elementor-widget-text-editor\" data-id=\"5c9eac1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">ds.head()<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-683a3ed elementor-widget elementor-widget-text-editor\" data-id=\"683a3ed\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"76c0\">Dask needed 9 seconds to output the first 5 rows of the file.<\/p>\n<p><strong>Calculate the 10th quantile with Dask:<\/strong><\/p>\n<p>Dask has a quantile function, which calculates the actual quantile, not an approximation.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-18e5889 elementor-widget elementor-widget-text-editor\" data-id=\"18e5889\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">quantile = ds.col1.quantile(0.1).compute()<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2c63ecc elementor-widget elementor-widget-text-editor\" data-id=\"2c63ecc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"fee4\">Dask wasn\u2019t able to calculate the quantile as the Jupyter Kernel 
crashed.<\/p>\n<p id=\"8a94\"><strong>Define a new column with Dask:<\/strong><\/p>\n<p id=\"0113\">The function below uses the quantile function to define a new binary column. Dask wasn\u2019t able to calculate it because it uses quantile.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3955e21 elementor-widget elementor-widget-text-editor\" data-id=\"3955e21\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">ds['col1_binary'] = ds.col1 &gt; ds.col1.quantile(0.1)<\/pre>\n<p id=\"dad9\"><strong>Filter data with Dask:<\/strong><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6a47c8b elementor-widget elementor-widget-text-editor\" data-id=\"6a47c8b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">ds = ds[(ds.col2 &gt; 10)]<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e4990a6 elementor-widget elementor-widget-text-editor\" data-id=\"e4990a6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"8d66\">The command above needed 0 seconds to execute as Dask uses the delayed execution paradigm.<\/p>\n<p id=\"5fa7\"><strong>Grouping and aggregating data with Dask:<\/strong><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8913af1 elementor-widget elementor-widget-text-editor\" data-id=\"8913af1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">group_res = ds.groupby('col1_binary').col3.mean().compute()<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-770e6f9 elementor-widget elementor-widget-text-editor\" data-id=\"770e6f9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"f409\">Dask wasn\u2019t able to group and aggregate the data.<\/p>\n<p id=\"ef4a\"><strong>Visualize the histogram of col3:<\/strong><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-92c9aec elementor-widget elementor-widget-text-editor\" data-id=\"92c9aec\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">plot = ds.col3.compute().plot.hist(bins=64, ylim=(13900, 14400))<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-605035d elementor-widget elementor-widget-text-editor\" data-id=\"605035d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"33f2\">Dask wasn\u2019t able to visualize the data.<\/p>\n<p id=\"4d2a\"><strong>Calculate the sum of all columns:<\/strong><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b9e6f5d elementor-widget elementor-widget-text-editor\" data-id=\"b9e6f5d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">suma = 
ds.sum().sum().compute()<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f697501 elementor-widget elementor-widget-text-editor\" data-id=\"f697501\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"c196\">Dask wasn\u2019t able to sum all the data.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-08cb6d7 elementor-widget elementor-widget-heading\" data-id=\"08cb6d7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Results<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a2485a4 elementor-widget elementor-widget-text-editor\" data-id=\"a2485a4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"d7e6\">The table below shows the execution times of the Vaex vs Dask experiment. 
NA means that the tool couldn\u2019t process the data and Jupyter Kernel crashed.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8944e28 elementor-widget elementor-widget-image\" data-id=\"8944e28\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"600\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/11lha93QSqHt9TunVECjRKA-1024x600.png\" class=\"attachment-large size-large wp-image-18873\" alt=\"Summary of execution times in the experiment.\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/11lha93QSqHt9TunVECjRKA-1024x600.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/11lha93QSqHt9TunVECjRKA-300x176.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/11lha93QSqHt9TunVECjRKA-768x450.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/11lha93QSqHt9TunVECjRKA-1536x901.png 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/11lha93QSqHt9TunVECjRKA-610x358.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/11lha93QSqHt9TunVECjRKA-750x440.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/11lha93QSqHt9TunVECjRKA-1140x668.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/11lha93QSqHt9TunVECjRKA.png 1586w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Summary of execution times in the experiment. 
Photo made by the author<\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-88e9c59 elementor-widget elementor-widget-heading\" data-id=\"88e9c59\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Conclusion<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5897f0f elementor-widget elementor-widget-image\" data-id=\"5897f0f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0jB0lzD1N3LCLmLgY-1024x576.jpeg\" class=\"attachment-large size-large wp-image-18874\" alt=\"Are You Still Using Pandas to Process Big Data in 2021?\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0jB0lzD1N3LCLmLgY-1024x576.jpeg 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0jB0lzD1N3LCLmLgY-300x169.jpeg 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0jB0lzD1N3LCLmLgY-768x432.jpeg 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0jB0lzD1N3LCLmLgY-1536x864.jpeg 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0jB0lzD1N3LCLmLgY-2048x1152.jpeg 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0jB0lzD1N3LCLmLgY-610x343.jpeg 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0jB0lzD1N3LCLmLgY-750x422.jpeg 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0jB0lzD1N3LCLmLgY-1140x641.jpeg 1140w\" sizes=\"(max-width: 1024px) 
100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@joshgmit?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Joshua Golde<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Unsplash<\/a><\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c1a49b1 elementor-widget elementor-widget-text-editor\" data-id=\"c1a49b1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"7387\">Vaex requires converting CSV to HDF5 format, which doesn\u2019t bother me, as you can go to lunch, come back, and the data will be converted. I also understand that in harsh conditions (like in the experiment), with little or no main memory, reading data will take longer.<\/p>\n<p id=\"06f5\">What I don\u2019t understand is the time that Vaex needed to display the head of the file (1189 seconds for the first 5 rows!). Other operations in Vaex are heavily optimized, which enables us to do interactive data analysis on bigger than main memory datasets.<\/p>\n<p id=\"8745\">I kinda expected the problems with Dask, as it is optimized more for compute clusters than for a single machine. Dask is built on top of pandas, which means that operations that are slow in pandas stay slow in Dask.<\/p>\n<p id=\"25e4\">The winner of the experiment is clear. Vaex was able to process a bigger than main memory file on a laptop, while Dask couldn\u2019t. 
This experiment is specific as I am testing performance on a single machine, not a compute cluster.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Pandas doesn\u2019t handle well Big Data.Can processing Big Data with Dask and Vaex really process bigger than memory datasets or is it all just a sales slogan?<\/p>\n","protected":false},"author":784,"featured_media":18875,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[122,281,216,283],"ppma_author":[3778],"class_list":["post-22666","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-big-data","tag-dask","tag-pandas","tag-vaex"],"authors":[{"term_id":3778,"user_id":784,"is_guest":0,"slug":"roman-orac","display_name":"Roman Orac","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/04\/medium_b7d17fbf-b990-4540-aa64-0ff5333f3943-150x150.jpg","user_url":"https:\/\/www.sportradar.com\/","last_name":"Orac","first_name":"Roman","job_title":"","description":"Roman Orac is Senior Data Scientist at <a href=\"http:\/\/www.sportradar.com\/\">Sportradar<\/a>, a global leader in understanding and leveraging the power of sports data and digital content for its clients around the 
world."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22666","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/784"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=22666"}],"version-history":[{"count":4,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22666\/revisions"}],"predecessor-version":[{"id":32035,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22666\/revisions\/32035"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/18875"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=22666"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=22666"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=22666"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=22666"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}