{"id":2260,"date":"2020-02-14T04:04:34","date_gmt":"2020-02-14T01:04:34","guid":{"rendered":"http:\/\/kusuaks7\/?p=1865"},"modified":"2024-01-10T13:14:11","modified_gmt":"2024-01-10T13:14:11","slug":"tips-and-tricks-for-fast-data-analysis-in-python","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/tips-and-tricks-for-fast-data-analysis-in-python\/","title":{"rendered":"Tips and Tricks for Fast Data Analysis in Python"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"2260\" class=\"elementor elementor-2260\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-508a961b elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-eae-slider=\"6681\" data-id=\"508a961b\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-40250326\" data-eae-slider=\"75988\" data-id=\"40250326\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5509760 elementor-widget elementor-widget-heading\" data-id=\"5509760\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>Quickly summarise and describe datasets with python<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e6c024d elementor-widget elementor-widget-text-editor\" data-id=\"e6c024d\" data-element_type=\"widget\" data-e-type=\"widget\" 
data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<section>\n<p id=\"3247\" data-selectable-paragraph=\"\">The Python programming language has a large number of both built-in functions and libraries for data analysis. Combining some of these libraries gives you powerful ways to summarise, describe and filter large amounts of data.<\/p>\n<p id=\"5c80\" data-selectable-paragraph=\"\">In this article, I want to share some tips on how to combine\u00a0<a href=\"https:\/\/pandas.pydata.org\/\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">pandas<\/a>,\u00a0<a href=\"https:\/\/matplotlib.org\/\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">matplotlib<\/a>\u00a0and some built-in Python functionality to analyse a dataset very quickly.<\/p>\n<p id=\"542f\" data-selectable-paragraph=\"\">All the libraries used in this post can be installed with the pip package manager.<\/p>\n\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6639b5a elementor-widget elementor-widget-heading\" data-id=\"6639b5a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"2e30\" data-selectable-paragraph=\"\">Data<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1e3aad3 elementor-widget elementor-widget-text-editor\" data-id=\"1e3aad3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"7fec\" data-selectable-paragraph=\"\">In this article, I am going to use the adult income dataset, which can be downloaded from the\u00a0<a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/adult\" target=\"_blank\" 
rel=\"noopener nofollow noreferrer\">UCI machine learning repository<\/a>. This dataset contains a number of features about each adult and a target variable which tells us whether or not they earn over $50,000 per year.<\/p>\n<p id=\"616c\" data-selectable-paragraph=\"\">Here are all the imports for the libraries that I am using.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fd44fea elementor-widget elementor-widget-text-editor\" data-id=\"fd44fea\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\">import pandas as pd\nimport numpy as np\nfrom sklearn import preprocessing<\/div>\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\">import matplotlib.pyplot as plt\n%matplotlib inline<\/div>\n<p id=\"acd2\" data-selectable-paragraph=\"\">I am using pandas to read in the dataset and display the first few rows.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0089cff elementor-widget elementor-widget-text-editor\" data-id=\"0089cff\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\">data = pd.read_csv('adults_data.csv')\ndata.head()<\/div>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4ae9a96 elementor-widget elementor-widget-image\" data-id=\"4ae9a96\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" 
src=\"https:\/\/miro.medium.com\/max\/2144\/1*kwnlZDg9RYgaC4Q-d6I4wg.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4c3af22 elementor-widget elementor-widget-text-editor\" data-id=\"4c3af22\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"6086\" data-selectable-paragraph=\"\">This dataset is usually used to build a machine learning model which predicts the income class from the features. However, before getting to the model building stage it is useful to perform some data analysis first.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-75813c0 elementor-widget elementor-widget-heading\" data-id=\"75813c0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"8ae8\" data-selectable-paragraph=\"\">Describe<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8d15f76 elementor-widget elementor-widget-text-editor\" data-id=\"8d15f76\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"ed07\" data-selectable-paragraph=\"\">The describe function allows us to very quickly look at some basic descriptive statistics for the numerical features in the dataset. 
Running\u00a0<code>data.describe()<\/code>\u00a0shows that our dataset has 32,561 rows, and gives the mean of each numerical feature along with a view of how its values are distributed (standard deviation, minimum, maximum and quartiles).<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7ce1a30 elementor-widget elementor-widget-image\" data-id=\"7ce1a30\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1226\/1*aiiuqi07_KLGIEkL7Qp4Dg.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b5adb84 elementor-widget elementor-widget-heading\" data-id=\"b5adb84\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"c5e7\" data-selectable-paragraph=\"\">Value counts<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-953eacc elementor-widget elementor-widget-text-editor\" data-id=\"953eacc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"3e2d\" data-selectable-paragraph=\"\">This dataset also has categorical variables, and it is useful to get a basic understanding of their distributions too. The\u00a0<code>value_counts()<\/code>\u00a0method provides a very simple way to do this. 
Let\u2019s use this to inspect the\u00a0<code>marital-status<\/code>\u00a0feature.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-500635d elementor-widget elementor-widget-image\" data-id=\"500635d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/524\/1*9kJk9a1ffVXxFe-6svZblg.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b959524 elementor-widget elementor-widget-text-editor\" data-id=\"b959524\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"d5d1\" data-selectable-paragraph=\"\">To make this easier to visualise we can quickly create a bar plot for this value by adding just a small amount of extra code. 
The title is optional, and you can customise axis labels, colours and other aspects of the chart with the usual matplotlib functionality.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e2829c0 elementor-widget elementor-widget-text-editor\" data-id=\"e2829c0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\">plt.title('Marital Status')\ndata['marital-status'].value_counts().plot.bar()<\/div>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5c663ae elementor-widget elementor-widget-image\" data-id=\"5c663ae\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/784\/1*QIIDt7NmqXmrppwR2t0iHg.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0e89b2a elementor-widget elementor-widget-text-editor\" data-id=\"0e89b2a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"02b1\" data-selectable-paragraph=\"\">Plotting with value counts doesn\u2019t work so well when we have a feature with high cardinality (a large number of unique values).<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-590adcf elementor-widget elementor-widget-text-editor\" data-id=\"590adcf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<div 
style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\">plt.title('Native Country')\ndata['native-country'].value_counts().plot.bar()<\/div>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6ca114b elementor-widget elementor-widget-image\" data-id=\"6ca114b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/776\/1*k_UQ58wUrIhkX4H1m0Apag.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-889c62c elementor-widget elementor-widget-text-editor\" data-id=\"889c62c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"2f5e\" data-selectable-paragraph=\"\">For a feature like native-country, it is more useful to plot only the top n values, which keeps the chart readable while still showing the most common categories. 
We can do this by adding just a little more code.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-07167d4 elementor-widget elementor-widget-text-editor\" data-id=\"07167d4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\">plt.title('Native Country')\ndata['native-country'].value_counts().nlargest(10).plot.bar()<\/div>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-9efa745 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-eae-slider=\"78641\" data-id=\"9efa745\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-001b75d\" data-eae-slider=\"25548\" data-id=\"001b75d\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-3afaae6 elementor-widget elementor-widget-image\" data-id=\"3afaae6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/754\/1*Y2BRiQHThnIVw_t7FMl3sQ.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-505367d elementor-widget elementor-widget-heading\" 
data-id=\"505367d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"3627\" data-selectable-paragraph=\"\">Pandas groupby<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3d039a6 elementor-widget elementor-widget-text-editor\" data-id=\"3d039a6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"9d91\" data-selectable-paragraph=\"\">The pandas groupby function is very useful when we have data where we want to compare segments. In this dataset, we want to perform analysis to understand the differences, and magnitude of differences in the features between the two income classes. The pandas groupby function provides a very quick way to do this.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f40e9cb elementor-widget elementor-widget-text-editor\" data-id=\"f40e9cb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"ed2e\" data-selectable-paragraph=\"\">If we run the code below we can analyse the differences in mean, for all numerical values, between the two income groups.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d6d90ea elementor-widget elementor-widget-text-editor\" data-id=\"d6d90ea\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 
10px;\">data.groupby('income').mean(numeric_only=True).round(2)<\/div>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-2ff8ec0 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-eae-slider=\"13378\" data-id=\"2ff8ec0\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-f8a6951\" data-eae-slider=\"54038\" data-id=\"f8a6951\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b8da36a elementor-widget elementor-widget-image\" data-id=\"b8da36a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1082\/1*IBWp2azrHDlY5M8X9L3ubg.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-67fb3a9 elementor-widget elementor-widget-text-editor\" data-id=\"67fb3a9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"699a\" data-selectable-paragraph=\"\">A better way to compare the two groups is to look at the spread of each feature\u2019s distribution rather than just its mean. A boxplot is a useful way to do that. This can be accomplished by using the plotting functionality alongside groupby. 
The visualisation is shown below.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4e2bc24 elementor-widget elementor-widget-text-editor\" data-id=\"4e2bc24\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\">data.groupby('income').boxplot(fontsize=20, rot=90, figsize=(20, 10), patch_artist=True)<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c7e90fd elementor-widget elementor-widget-image\" data-id=\"c7e90fd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/2350\/1*K656_PQxWenxzO4-sVyQPg.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7bbd923 elementor-widget elementor-widget-text-editor\" data-id=\"7bbd923\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"b7aa\" data-selectable-paragraph=\"\">You will notice that, because the features are on different scales, it is difficult to compare the two distributions. To overcome this we can\u00a0<a href=\"https:\/\/medium.com\/greyatom\/why-how-and-when-to-scale-your-features-4b30ab09db5e\" target=\"_blank\" rel=\"noopener noreferrer\">scale<\/a>\u00a0the values. To do this I am using the\u00a0<a href=\"https:\/\/scikit-learn.org\/\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">scikit-learn<\/a>\u00a0MinMaxScaler class. This scales each feature so that its values all lie between 0 and 1. 
We can now clearly see substantial differences between the two income groups in some of the features, such as age and hours-per-week.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ded8643 elementor-widget elementor-widget-image\" data-id=\"ded8643\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/2266\/1*AKFTUSozksr-2i56iegbAw.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-577d8e7 elementor-widget elementor-widget-text-editor\" data-id=\"577d8e7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"0ebc\" data-selectable-paragraph=\"\">We can also use the groupby function to compare categorical features. 
In the graph below we can quickly see that there are more males than females in the higher income bracket.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-015c94a elementor-widget elementor-widget-text-editor\" data-id=\"015c94a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\">data.groupby('income')['gender'].value_counts().unstack(0).plot.barh()<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a2518fe elementor-widget elementor-widget-image\" data-id=\"a2518fe\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/804\/1*GiZJ15VOqxenc2KW-2b94g.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1267d48 elementor-widget elementor-widget-heading\" data-id=\"1267d48\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"8d14\" data-selectable-paragraph=\"\">Pivot tables<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d9d03a9 elementor-widget elementor-widget-text-editor\" data-id=\"d9d03a9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"8eba\" data-selectable-paragraph=\"\">Pandas has functionality that enables you to create 
spreadsheet-style pivot tables in Python. Pivot tables allow you to quickly summarise, group and filter data to perform more complex analyses.<\/p>\n<p id=\"ed40\" data-selectable-paragraph=\"\">We can use the pivot table to explore more complex relationships. Let\u2019s look a little deeper into the relationship between gender and income class. Do females earn less because they work fewer hours per week?<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b071f4f elementor-widget elementor-widget-text-editor\" data-id=\"b071f4f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\">pivot_workclass = pd.pivot_table(data, values=['hours-per-week'],\n    index='gender',\n    columns='income', aggfunc='mean', fill_value=0)\npivot_workclass<\/div>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c938494 elementor-widget elementor-widget-image\" data-id=\"c938494\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/430\/1*guLHR4vjggyVr3X5WCE5nw.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4578a78 elementor-widget elementor-widget-text-editor\" data-id=\"4578a78\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"0a58\" data-selectable-paragraph=\"\">We can add plotting functionality to make this easier to visualise.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div 
class=\"elementor-element elementor-element-9297c3f elementor-widget elementor-widget-text-editor\" data-id=\"9297c3f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\">pd.pivot_table(data, values=['hours-per-week'],\n    index='gender',\n    columns='income', aggfunc='mean', fill_value=0).plot.bar()<\/div>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5beafb9 elementor-widget elementor-widget-image\" data-id=\"5beafb9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/716\/1*Rik4wqBLcYiJMLe_dP3gnw.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-db8d730 elementor-widget elementor-widget-text-editor\" data-id=\"db8d730\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<section>\n<p id=\"f2a3\" data-selectable-paragraph=\"\">All the methods described above can be extended to create much richer and more complex analyses.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>The Python programming language has a large number of both built-in functions and libraries for data analysis. Combining some of these libraries can produce very powerful methods of summarising, describing and filtering large amounts of data. 
This article shares some tips on how to combine pandas, matplotlib and some built-in Python functionality to very quickly analyse a dataset. All the methods described can be extended to create much richer and more complex analyses.<\/p>\n","protected":false},"author":795,"featured_media":3688,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[187],"tags":[94],"ppma_author":[2924],"class_list":["post-2260","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":2924,"user_id":795,"is_guest":0,"slug":"rebecca-vickery","display_name":"Rebecca Vickery","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","author_category":"","user_url":"","last_name":"Vickery","first_name":"Rebecca","job_title":"","description":"Rebecca Vickery is a Data Scientist at Holiday Extras. She has been working in data &amp; analytics in the Travel industry for the past 10 years. 
Areas of interest include machine learning, customer lifecycle analytics, python and sql development."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2260","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/795"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=2260"}],"version-history":[{"count":0,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2260\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/3688"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=2260"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=2260"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=2260"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=2260"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}