{"id":9303,"date":"2020-08-13T06:48:49","date_gmt":"2020-08-13T06:48:49","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/?p=9303"},"modified":"2023-11-21T10:46:06","modified_gmt":"2023-11-21T10:46:06","slug":"you-can-do-everything-in-pandas-with-sql","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/you-can-do-everything-in-pandas-with-sql\/","title":{"rendered":"You Can Do Everything in Pandas With SQL"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"9303\" class=\"elementor elementor-9303\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-5c3259f9 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"5c3259f9\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-67dfc876\" data-id=\"67dfc876\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-4950937a elementor-widget elementor-widget-text-editor\" data-id=\"4950937a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Pandas is one of the most useful tools a data scientist can use. It provides several handy functionalities to extract information. Unfortunately, using Pandas requires data to be loaded into DataFrames, which aren\u2019t great for handling massive quantities of data \u2014 quantities of data common at a company where you would need such data manipulation skills.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>If you were to download a complete, massive dataset \u2014 perhaps store it as a .csv file \u2014 you would need to spend at least several minutes waiting for the file to complete download and be converted into a DataFrame by Pandas. On top of this, any operations you perform will be slow because Pandas has trouble dealing with these massive amounts of data and needs to run through every row.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f01559c elementor-widget elementor-widget-text-editor\" data-id=\"f01559c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>A better solution? First query the database with SQL \u2014 the native, efficient processing language it likely runs on \u2014 then download a reduced dataset, if necessary, and use Pandas to operate on the smaller-scale tables it is designed to work on. While SQL is very efficient with large databases, it cannot replace the value of Panda\u2019s plotting integration with other libraries and with the Python language in general.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>In the case where we only need to find numerical answers, however, like the average cost of an item or how many employees receive commission but not an hourly wage above $40, there is (usually) no need to touch Pandas and a Python environment at all.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8189e58 elementor-widget elementor-widget-text-editor\" data-id=\"8189e58\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>While most people\u2019s knowledge of SQL stops at\u00a0<code>SELECT * FROM table WHERE id = \u2018Bob\u2019<\/code>, one would be surprised by the functionalities that SQL offers.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>As an example, we will work on the SQL Tryit Editor database provided by\u00a0<a href=\"https:\/\/www.w3schools.com\/sql\/trysql.asp?filename=trysql_asc\" target=\"_blank\" rel=\"noreferrer noopener\">w3schools<\/a>. This site allows you to run SQL queries on a fictional database. The table Customers (you can see the full list of tables on the right panel) has 7 columns with numerical and text data:<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-59540c5 elementor-widget elementor-widget-image\" data-id=\"59540c5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1939\/1*_TeBnwCBtB1IrNE9H-SeLw.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6468b2d elementor-widget elementor-widget-text-editor\" data-id=\"6468b2d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Say we want to send a prepared statement in the form of \u201cName LIVES IN Address, City\u201d. This is simple: the double pipes operator\u00a0<code>||<\/code>\u00a0acts as a concatenation. Then, we can add column values and strings, and save the result under an alias, or the column name of the result, using\u00a0<code>AS<\/code>.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5ec3fe7 elementor-widget elementor-widget-image\" data-id=\"5ec3fe7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/08\/Untitled-3-2.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-63096c3 elementor-widget elementor-widget-image\" data-id=\"63096c3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1932\/1*OkHVee9WcyHsrPVJpraTVA.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c992105 elementor-widget elementor-widget-text-editor\" data-id=\"c992105\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Note that since we additionally specified\u00a0<code>Country=\u2019Mexico\u2019<\/code>, our results are all of addresses and cities within Mexico. This type of operation would be more difficult and less easy to do in Python.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Note that not all databases support the same syntax. This article uses syntax for PostgreSQL, although every operation discussed has the operations in DB2, Oracle, MySQL, and SQL Server databases, sometimes with the same syntax, sometimes with different syntaxes. StackOverflow or Google can help you find these database-specific keywords.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Say that, in the\u00a0<code>Products<\/code>\u00a0table, we want to group products into three price buckets:\u00a0<code>Cheap<\/code>,\u00a0<code>Regular<\/code>, and\u00a0<code>Expensive<\/code>, for prices less than $12, between $12 and $21, and above $21, respectively.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3649678 elementor-widget elementor-widget-image\" data-id=\"3649678\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1940\/1*Mo3PZFu4oSPMQ6io1UkBLA.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1084043 elementor-widget elementor-widget-text-editor\" data-id=\"1084043\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>No problem! The\u00a0<code>CASE<\/code>\u00a0keyword can help. This keyword acts like an if\/else if\/else statement in other languages like Python.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The\u00a0<code>CASE<\/code>\u00a0keyword uses the syntax\u00a0<code>WHEN condition THEN value<\/code>. When multiple\u00a0<code>WHEN<\/code>s are stacked, they assume an \u2018else-if\u2019 relationship. Lastly, an\u00a0<code>ELSE value<\/code>\u00a0can be added if the condition is not met. Lastly,\u00a0<code>END<\/code>\u00a0is written to indicate the end of the\u00a0<code>CASE<\/code>\u00a0statement, and the results are saved (aliased) to a column named Bucket through\u00a0<code>AS Bucket<\/code>.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3352bda elementor-widget elementor-widget-image\" data-id=\"3352bda\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/08\/Untitled-2-1.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1fd7f3e elementor-widget elementor-widget-image\" data-id=\"1fd7f3e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1941\/1*fKI-Ta_EPlvJ1e31Us1htQ.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b5b6f5d elementor-widget elementor-widget-text-editor\" data-id=\"b5b6f5d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>This could also be done in Pandas with\u00a0<code>.apply()<\/code>, with much slower speed.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Say we want to randomly sample 5 rows from\u00a0<code>Products<\/code>. Although there is no direct method to do this, we can get creative by using both the\u00a0<code>ORDER BY<\/code>\u00a0and\u00a0<code>LIMIT<\/code>\u00a0keywords.\u00a0<code>ORDER BY<\/code>\u00a0orders the data in a certain format; for instance, using\u00a0<code>ORDER BY Price ASC<\/code>\u00a0would order the data such that the price was in ascending format. Using\u00a0<code>DESC<\/code>\u00a0uses descending, and\u00a0<code>ORDER BY<\/code>\u00a0works with strings by sorting them alphabetically.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><code>ORDER BY random()<\/code>\u00a0orders the data randomly, and\u00a0<code>LIMIT x<\/code>\u00a0returns the first\u00a0<em>x<\/em>\u00a0rows in the selected subset of data. This way, five random rows are selected from the data (<code>*<\/code>\u00a0means all columns).<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3073add elementor-widget elementor-widget-image\" data-id=\"3073add\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/08\/Untitled-1-3.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6c466ef elementor-widget elementor-widget-text-editor\" data-id=\"6c466ef\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Note: Unfortunately, SQL Tryit Editor does not support random(), but real databases do (or use a variant, like\u00a0<code>rand()<\/code>).<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Like this task of randomly sampling, most of SQL is about chaining together several simpler commands like\u00a0<code>SELECT<\/code>\u00a0and integrating them with built-in functions to yield astonishingly complex results.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Moreover, SQL provides all the statistical functions you may need. With everything from\u00a0<code>MIN()<\/code>\u00a0to\u00a0<code>MAX()<\/code>\u00a0to\u00a0<code>COUNT()<\/code>\u00a0to\u00a0<code>SUM()<\/code>\u00a0to\u00a0<code>AVG()<\/code>\u00a0to\u00a0<code>ASIN()<\/code>\u00a0(arcsine), you\u2019re set. You can either use package extensions for metrics like standard deviation or create them yourself using existing default functions, which is not difficult to do at all.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>These are standard tasks \u2014 but what is more amazing is that SQL is a Turing-complete language. Put simply, you could represent a program in, say, Python or C++, in SQL, by building your own complex memory systems and using elements of SQL like functions, if\/elses, recursion, etc. You can view some fascinating demonstrations of Turing-complete SQL\u00a0<code>here<\/code>. The main point of this is not to encourage you to use SQL as an operational language, but to demonstrate that SQL can be used to do so much more than you thought.<\/p>\n<!-- \/wp:paragraph -->\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b51f606 elementor-widget elementor-widget-text-editor\" data-id=\"b51f606\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>There\u2019s so much more that you can do in SQL that we haven\u2019t discussed:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul>\n<li>Specify your own custom functions, like you would declare a function in Python or C++. These can be used to, for example, parse IP addresses.<\/li>\n<li>Use recursion to create complex looping and data generation with the\u00a0<code>WITH<\/code>\u00a0keyword.<\/li>\n<li>Sort string columns by a\u00a0<em>substring<\/em>.<\/li>\n<li>Perform complex joining between multiple tables.<\/li>\n<li>Use SQL to generate SQL (automating tasks).<\/li>\n<li>Generate forecasts using statistical models.<\/li>\n<li>Create histograms.<\/li>\n<li>Build tree structures (with leaf, branch, root nodes).<\/li>\n<\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:paragraph -->\n<p>It\u2019s true that you can do a lot more with SQL than you can with Pandas. That being said, usually that additional functionality is not necessary. The main reason why you should be using SQL is because it is built to handle large quantities of data in a custom environment that DataFrames are not.<\/p>\n<!-- \/wp:paragraph -->\n<!-- wp:paragraph -->\n<p>Generally, SQL is a simple but sometimes very messy language, and it should usually be used just to reduce the size of the data until it becomes more manageable to handle in Pandas\u2019 smaller environment.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c7baf63 elementor-widget elementor-widget-heading\" data-id=\"c7baf63\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"af28\">Key Points<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-166a6ac elementor-widget elementor-widget-text-editor\" data-id=\"166a6ac\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:list -->\n<ul>\n<li>Pandas isn\u2019t good at handling big data, and its features can all be done with SQL. However, Pandas\u2019 value comes from its integration with other plotting libraries, machine learning libraries, and the Python language.<\/li>\n<li>The goal should usually be to use SQL to narrow down a large dataset into one that is more relevant for the task, then to handle it in a Python environment, using Pandas\u2019 DataFrame as the basis for storage.<\/li>\n<li>Don\u2019t be scared to touch SQL to handle big data problems. As demonstrated above, SQL\u2019s syntax is simple and is almost all about chaining together simple commands to yield more complex results. If you have a clear vision of the result, you can make it happen with SQL.<\/li>\n<li>SQL can do a lot more than most people realize.<\/li>\n<\/ul>\n<!-- \/wp:list -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Pandas is one of the most useful tools a data scientist can use. It provides several handy functionalities to extract information. Unfortunately, using Pandas requires data to be loaded into DataFrames, which aren\u2019t great for handling massive quantities of data \u2014 quantities of data common at a company where you would need such data manipulation<\/p>\n","protected":false},"author":884,"featured_media":9304,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[122,94,216,114,545],"ppma_author":[3782],"class_list":["post-9303","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-big-data","tag-data-science","tag-pandas","tag-python","tag-sql"],"authors":[{"term_id":3782,"user_id":884,"is_guest":0,"slug":"andre-ye","display_name":"Andre Ye","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/08\/Andre-Ye-150x150.jpg","user_url":"https:\/\/www.critiq.tech\/","last_name":"Ye","first_name":"Andre","job_title":"","description":"Andre Ye is Cofounder at Critiq, and Editor and Writer at Medium"}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/9303","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/884"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=9303"}],"version-history":[{"count":4,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/9303\/revisions"}],"predecessor-version":[{"id":34211,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/9303\/revisions\/34211"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/9304"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=9303"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=9303"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=9303"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=9303"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}