{"id":1861,"date":"2019-08-02T10:02:31","date_gmt":"2019-08-02T10:02:31","guid":{"rendered":"http:\/\/kusuaks7\/?p=1466"},"modified":"2024-07-22T14:53:56","modified_gmt":"2024-07-22T14:53:56","slug":"surprising-sorting-tips-for-data-scientists","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/surprising-sorting-tips-for-data-scientists\/","title":{"rendered":"Surprising Sorting Tips for Data Scientists"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"1861\" class=\"elementor elementor-1861\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-3c57d079 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-eae-slider=\"87076\" data-id=\"3c57d079\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-4ec7e401\" data-eae-slider=\"37826\" data-id=\"4ec7e401\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-9883dcd elementor-widget elementor-widget-heading\" data-id=\"9883dcd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 style=\"color: #aaa;font-style: italic\">Python, Numpy, Pandas, PyTorch, TensorFlow &amp; SQL<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-544fb51 elementor-widget elementor-widget-text-editor\" data-id=\"544fb51\" data-element_type=\"widget\" 
data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"e9a7\" data-selectable-paragraph=\"\">Sorting data is a basic task for data scientists and data engineers. Python users have a number of libraries to choose from with built-in, optimized sorting options. Some even work in parallel on GPUs. Surprisingly, some sort methods don\u2019t use the stated algorithm types and others don\u2019t perform as expected.<\/p>\n<p id=\"07fd\" data-selectable-paragraph=\"\">Choosing which library and type of sorting algorithm to use can be tricky. Implementations change quickly. As of this writing, the Pandas documentation isn\u2019t even up to date with the code.<\/p>\n<p id=\"05dc\" data-selectable-paragraph=\"\">In this article I\u2019ll give you the lay of the land, provide tips to help you remember the methods, and share the results of a speed test.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-04a9086 elementor-widget elementor-widget-image\" data-id=\"04a9086\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1920\/1*YdXDTn55ZZQmpRbnMbSFuQ.jpeg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d3a0b53 elementor-widget elementor-widget-text-editor\" data-id=\"d3a0b53\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\" data-selectable-paragraph=\"\">Sorted Tea<\/p>\n<p id=\"6583\" data-selectable-paragraph=\"\">Let\u2019s get sorting!<\/p>\n<p id=\"e4eb\" data-selectable-paragraph=\"\">UPDATE July 17, 2019: Speed test evaluation 
results now include GPU implementations of PyTorch and TensorFlow. TensorFlow also includes CPU results under both\u00a0<code>tensorflow==2.0.0-beta1<\/code>\u00a0and\u00a0<code>tensorflow-gpu==2.0.0-beta1<\/code>. Surprising findings: PyTorch GPU is lightning fast and TensorFlow GPU is slower than TensorFlow CPU.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b22982c elementor-widget elementor-widget-heading\" data-id=\"b22982c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"5801\" data-selectable-paragraph=\"\">Context<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-96de67e elementor-widget elementor-widget-text-editor\" data-id=\"96de67e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"7533\" data-selectable-paragraph=\"\">There are many different basic sorting algorithms. Some perform faster and use less memory than others. Some are better suited to big data and some work better if the data are arranged in certain ways. 
See the chart below for time and space complexity of many common algorithms.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8664887 elementor-widget elementor-widget-image\" data-id=\"8664887\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/700\/1*niPobJI4MOr5hHp1yOqqiQ.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-414a482 elementor-widget elementor-widget-text-editor\" data-id=\"414a482\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\" data-selectable-paragraph=\"\">From\u00a0<a href=\"http:\/\/bigocheatsheet.com\/\" rel=\"noopener\">http:\/\/bigocheatsheet.com\/<\/a><\/p>\n<p id=\"bfac\" data-selectable-paragraph=\"\">Being an expert at basic implementations isn\u2019t necessary for most data science problems. In fact, premature optimization is occasionally cited as the root of all evil. However, knowing which library and which keyword arguments to use can be quite helpful when you need to repeatedly sort a lot of data. 
Here\u2019s my cheat sheet.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-78a7d93 elementor-widget elementor-widget-image\" data-id=\"78a7d93\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/2400\/1*9JRNRN86A4Qp_-iJ9iwSFQ.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b9cb088 elementor-widget elementor-widget-text-editor\" data-id=\"b9cb088\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\" data-selectable-paragraph=\"\"><span style=\"text-align: center; background-color: rgba(0, 0, 0, 0.05);\">My Google Sheet available here:\u00a0<\/span><a style=\"text-align: center; background-color: rgba(0, 0, 0, 0.05);\" href=\"https:\/\/docs.google.com\/spreadsheets\/d\/1zQbDvpmrvTYVnRz_2OTlfB6knLlotdbAoFH6Oy48uSc\/edit?usp=sharing\" rel=\"noopener\">https:\/\/docs.google.com\/spreadsheets\/d\/1zQbDvpmrvTYVnRz_2OTlfB6knLlotdbAoFH6Oy48uSc\/edit?usp=sharing<\/a><\/p>\n<p id=\"3569\" data-selectable-paragraph=\"\">The sorting algorithms have changed over the years in many libraries. 
These software versions were used in the analysis performed for this article.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-6c28a18 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-eae-slider=\"73067\" data-id=\"6c28a18\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-2a238b1\" data-eae-slider=\"36168\" data-id=\"2a238b1\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-6c3626c elementor-widget elementor-widget-text-editor\" data-id=\"6c3626c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<span style=\"font-family: courier new,courier,monospace;\">python 3.6.8\nnumpy 1.16.4\npandas 0.24.2\ntensorflow==2.0.0-beta1\u00a0 #tensorflow-gpu==2.0.0-beta1 slows sorting\npytorch 1.1<\/span>\n<p id=\"18a2\" data-selectable-paragraph=\"\">Let\u2019s start with the basics.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-901c9ca elementor-widget elementor-widget-heading\" data-id=\"901c9ca\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"9f14\" data-selectable-paragraph=\"\">Python (vanilla)<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div 
class=\"elementor-element elementor-element-d93d509 elementor-widget elementor-widget-image\" data-id=\"d93d509\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/296\/1*H1olKNHMeAiPbDoDf95MYw.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-07c34df elementor-widget elementor-widget-text-editor\" data-id=\"07c34df\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"6e2b\" data-selectable-paragraph=\"\">Python contains two built-in sorting methods.<\/p>\n\n<ul>\n \t<li id=\"7fc3\" data-selectable-paragraph=\"\"><code>my_list.<a href=\"https:\/\/docs.python.org\/3\/library\/stdtypes.html#list.sort\" rel=\"noopener\">sort()<\/a><\/code>\u00a0sorts a list in-place. It mutates the list.\u00a0<code>sort()<\/code>\u00a0returns\u00a0<code>None<\/code>.<\/li>\n \t<li id=\"98dd\" data-selectable-paragraph=\"\"><code><a href=\"https:\/\/docs.python.org\/3\/library\/functions.html#sorted\" rel=\"noopener\">sorted(my_list)<\/a><\/code>\u00a0makes a sorted copy of any iterable.\u00a0<code>sorted()<\/code>\u00a0returns a new sorted list.\u00a0<code>sorted()<\/code>\u00a0does not mutate the original iterable.<\/li>\n<\/ul>\n<p id=\"c161\" data-selectable-paragraph=\"\">In principle,\u00a0<code>sort()<\/code>\u00a0should be faster because it sorts in place. Surprisingly, that\u2019s not what I found in the test below. 
In-place sorting is more dangerous because it mutates the original data.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f181382 elementor-widget elementor-widget-image\" data-id=\"f181382\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/700\/1*Vc7JacKNVeW1jCeZxEXH-w.jpeg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8f87823 elementor-widget elementor-widget-text-editor\" data-id=\"8f87823\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\" data-selectable-paragraph=\"\">Vanilla<\/p>\n<p id=\"fcb6\" data-selectable-paragraph=\"\">For vanilla Python and all of the other implementations we\u2019ll look at in this article, the default sorting order is ascending \u2014 from smallest to largest. Most sorting methods accept a keyword parameter to switch the sort order to descending. Unfortunately for your brain, this parameter name is different for each library.<\/p>\n<p id=\"09cb\" data-selectable-paragraph=\"\">To change either sort order to descending in vanilla Python, pass\u00a0<code>reverse=True<\/code>.<\/p>\n<p id=\"2d45\" data-selectable-paragraph=\"\"><code>key<\/code>\u00a0can be passed as a keyword argument to create your own sort criteria. For example,\u00a0<code>sort(key=len)<\/code>\u00a0will sort by the length of each list item.<\/p>\n<p id=\"4dee\" data-selectable-paragraph=\"\">The only sorting algorithm used in vanilla Python is Timsort. Timsort chooses a sorting method depending upon the characteristics of the data to be sorted. 
For example, if a short list is to be sorted, then an insertion sort is used. See\u00a0<a href=\"https:\/\/towardsdatascience.com\/u\/2480a7e35749?source=post_page---------------------------\" rel=\"noopener\">Brandon Skerritt<\/a>\u2019s great article for more details on Timsort\u00a0here.<\/p>\n<p id=\"dd44\" data-selectable-paragraph=\"\">Timsort, and thus Vanilla Python sorts,\u00a0<a href=\"https:\/\/docs.python.org\/3\/howto\/sorting.html?highlight=sort\" rel=\"noopener\">are stable<\/a>. This means that if multiple values are the same, then those items remain in the original order after sorting.<\/p>\n<p id=\"8201\" data-selectable-paragraph=\"\">To remember\u00a0<code>sort()<\/code>\u00a0vs.\u00a0<code>sorted()<\/code>, I just remember that\u00a0<em>sorted<\/em>\u00a0is a longer word than\u00a0<em>sort<\/em>\u00a0and that sorted should take longer to run because it has to make a copy. Although the results below didn\u2019t support the conventional wisdom, the mnemonic still works.<\/p>\n<p id=\"fe3f\" data-selectable-paragraph=\"\">Now let\u2019s look at using Numpy.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2f3676d elementor-widget elementor-widget-heading\" data-id=\"2f3676d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"4db3\" data-selectable-paragraph=\"\">Numpy<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5f6c175 elementor-widget elementor-widget-image\" data-id=\"5f6c175\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/354\/1*rkzjM21lmcx3sPOTqbJ0oA.png\" alt=\"\" 
\/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d9fe3cf elementor-widget elementor-widget-text-editor\" data-id=\"d9fe3cf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p id=\"ea24\" data-selectable-paragraph=\"\">Numpy is the bedrock Python library for scientific computing. Like vanilla Python, it has two sort implementations, one that mutates the array and one that copies it.<\/p>\n\n<ul>\n \t<li id=\"54dc\" data-selectable-paragraph=\"\"><code><a href=\"https:\/\/docs.scipy.org\/doc\/numpy-1.16.0\/reference\/generated\/numpy.ndarray.sort.html#numpy.ndarray.sort\" rel=\"noopener\">my_array.sort()<\/a><\/code>\u00a0mutates the array in place and returns\u00a0<code>None<\/code>.<\/li>\n \t<li id=\"36ff\" data-selectable-paragraph=\"\"><code><a href=\"https:\/\/docs.scipy.org\/doc\/numpy-1.16.0\/reference\/generated\/numpy.sort.html#numpy.sort\" rel=\"noopener\">np.sort(my_array)<\/a><\/code>\u00a0returns a sorted copy of the array, so it doesn\u2019t mutate the original array.<\/li>\n<\/ul>\n<p id=\"2a9f\" data-selectable-paragraph=\"\">Here are the optional arguments.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7a8e611 elementor-widget elementor-widget-text-editor\" data-id=\"7a8e611\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li id=\"ce89\" data-selectable-paragraph=\"\"><code><strong>axis<\/strong><em>\u00a0<\/em><\/code><em>: int, optional \u2014\u00a0<\/em>Axis along which to sort. 
Default is -1, which means sort along the last axis.<\/li>\n \t<li id=\"42a7\" data-selectable-paragraph=\"\"><code><strong>kind<\/strong><\/code><em>\u00a0: {\u2018quicksort\u2019, \u2018mergesort\u2019, \u2018heapsort\u2019, \u2018stable\u2019}, optional \u2014\u00a0<\/em>Sorting algorithm. Default is \u2018quicksort\u2019. More on this below.<\/li>\n \t<li id=\"8546\" data-selectable-paragraph=\"\"><code><strong>order<\/strong><\/code><em>\u00a0: str or list of str, optional \u2014\u00a0<\/em>When\u00a0<em>a<\/em>\u00a0is an array with fields defined, this argument specifies which fields to compare first, second, etc. A single field can be specified as a string, and not all fields need be specified, but unspecified fields will still be used, in the order in which they come up in the dtype, to break ties.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b313391 elementor-widget elementor-widget-text-editor\" data-id=\"b313391\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"eb78\" data-selectable-paragraph=\"\">The sorting algorithms used are now a bit different than you might expect based on their names. Passing\u00a0<code>kind=quicksort<\/code>\u00a0means sorting actually starts with an\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Introsort\" rel=\"noopener\">introsort<\/a>\u00a0algorithm. 
The\u00a0<a href=\"https:\/\/docs.scipy.org\/doc\/numpy\/reference\/generated\/numpy.sort.html\" rel=\"noopener\">docs<\/a>\u00a0explain.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0b34b5e elementor-widget elementor-widget-text-editor\" data-id=\"0b34b5e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<blockquote>\n<p id=\"df82\" data-selectable-paragraph=\"\">When [it] does not make enough progress it switches to a heapsort algorithm. This implementation makes quicksort O(n*log(n)) in the worst case.<\/p>\n<p id=\"673f\" data-selectable-paragraph=\"\"><em>stable\u00a0<\/em>automatically chooses the best stable sorting algorithm for the data type being sorted. It, along with mergesort is currently mapped to timsort or radix sort depending on the data type. API forward compatibility currently limits the ability to select the implementation and it is hardwired for the different data types.<\/p>\n<p id=\"9311\" data-selectable-paragraph=\"\">Timsort is added for better performance on already or nearly sorted data. On random data timsort is almost identical to mergesort. It is now used for stable sort while quicksort is still the default sort if none is chosen\u2026\u2018mergesort\u2019 and \u2018stable\u2019 are mapped to radix sort for integer data types.<\/p>\n<\/blockquote>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7d6ac9d elementor-widget elementor-widget-text-editor\" data-id=\"7d6ac9d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"fc9e\" data-selectable-paragraph=\"\">One take-away is that Numpy provides a wider range of control for sorting algorithm options than vanilla Python. 
A second take-away is that the\u00a0<em>kind<\/em>\u00a0keyword value doesn\u2019t necessarily correspond to the actual sort type used. A final take-away is that the\u00a0<code>mergesort<\/code>\u00a0and\u00a0<code>stable<\/code>\u00a0values are stable sorts, but\u00a0<code>quicksort<\/code>\u00a0and\u00a0<code>heapsort<\/code>\u00a0are not.<\/p>\n<p id=\"dd65\" data-selectable-paragraph=\"\">Numpy sorts are the only implementations on our list without a keyword argument to reverse the sort order. Luckily, it\u2019s quick to reverse an array with a slice like this:\u00a0<code>my_arr[::-1]<\/code>.<\/p>\n<p id=\"2c13\" data-selectable-paragraph=\"\">The Numpy algorithm options are also available in the more user-friendly Pandas \u2014 and I find the functions easier to keep straight.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8dacfff elementor-widget elementor-widget-heading\" data-id=\"8dacfff\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"8187\" data-selectable-paragraph=\"\">Pandas<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-153245c elementor-widget elementor-widget-image\" data-id=\"153245c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/640\/1*UIJ2FGS-px4aFywabG7qvg.jpeg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-84e0a84 elementor-widget elementor-widget-text-editor\" data-id=\"84e0a84\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\" data-selectable-paragraph=\"\">Panda<\/p>\n<p id=\"6536\" data-selectable-paragraph=\"\">Sort a Pandas DataFrame with\u00a0<code><a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.DataFrame.sort_values.html\" rel=\"noopener\">df.sort_values(by=my_column)<\/a><\/code>. There are a number of keyword arguments available.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b2455b1 elementor-widget elementor-widget-text-editor\" data-id=\"b2455b1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li id=\"8b29\" data-selectable-paragraph=\"\"><code><strong>by<\/strong><\/code>:\u00a0<em>str<\/em>\u00a0or\u00a0<em>list of str<\/em>, required \u2014 Name or list of names to sort by. If axis is\u00a0<em>0<\/em>\u00a0or\u00a0<em>index<\/em>\u00a0then\u00a0<em>by<\/em>\u00a0may contain index levels and\/or column labels. If axis is\u00a0<em>1<\/em>\u00a0or\u00a0<em>columns<\/em>\u00a0then\u00a0<em>by<\/em>\u00a0may contain column levels and\/or index labels<\/li>\n \t<li id=\"90b5\" data-selectable-paragraph=\"\"><code><strong>axis<\/strong><\/code>: {<em>0<\/em>\u00a0or\u00a0<em>index<\/em>,\u00a0<em>1\u00a0<\/em>or\u00a0<em>columns<\/em>}, default\u00a0<em>0<\/em>\u00a0\u2014 Axis to be sorted.<\/li>\n \t<li id=\"c2ab\" data-selectable-paragraph=\"\"><code><strong>ascending<\/strong><\/code>:\u00a0<em>bool<\/em>\u00a0or<em>\u00a0list of bool<\/em>, default\u00a0<em>True<\/em>\u00a0\u2014 Sort ascending vs. descending. Specify\u00a0<em>list<\/em>\u00a0for multiple sort orders. 
If this is a\u00a0<em>list<\/em>\u00a0of\u00a0<em>bools<\/em>, must match the length of the\u00a0<em>by\u00a0<\/em>argument.<\/li>\n \t<li id=\"90ed\" data-selectable-paragraph=\"\"><code><strong>inplace<\/strong><\/code>:\u00a0<em>bool<\/em>, default\u00a0<em>False<\/em>\u00a0\u2014 if\u00a0<em>True<\/em>, perform operation in-place.<\/li>\n \t<li id=\"b2f3\" data-selectable-paragraph=\"\"><code><strong>kind<\/strong><\/code>: {<em>quicksort, mergesort, heapsort,\u00a0<\/em>or\u00a0<em>stable<\/em>}, default\u00a0<em>quicksort \u2014<\/em>Choice of sorting algorithm. See also\u00a0<code>ndarray.np.sort<\/code>\u00a0for more information. For DataFrames, this option is only applied when sorting on a single column or label.<\/li>\n \t<li id=\"9463\" data-selectable-paragraph=\"\"><code><strong>na_position<\/strong><\/code>: {\u2018first\u2019, \u2018last\u2019}, default \u2018last\u2019 \u2014\u00a0<em>first<\/em>\u00a0puts\u00a0<em>NaNs<\/em>\u00a0at the beginning,\u00a0<em>last<\/em>\u00a0puts\u00a0<em>NaN<\/em>s at the end.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-80f9a16 elementor-widget elementor-widget-text-editor\" data-id=\"80f9a16\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"c724\" data-selectable-paragraph=\"\">Sort a Pandas Series by following the same syntax. With a Series you don\u2019t provide a\u00a0<code>by<\/code>\u00a0keyword, because you don\u2019t have multiple columns.<\/p>\n<p id=\"280c\" data-selectable-paragraph=\"\">Because Pandas uses Numpy under the hood, you have the same nicely optimized sorting options at your fingertips. 
However, Pandas requires some extra time for its conveniences.<\/p>\n<p id=\"49c3\" data-selectable-paragraph=\"\">The default when sorting by a single column is to use Numpy\u2019s\u00a0<code>quicksort<\/code>. You\u2019ll recall that\u00a0<code>quicksort<\/code>\u00a0is now actually an introsort that becomes a heapsort if the sorting progress is slow. Pandas ensures that sorting by multiple columns uses Numpy\u2019s\u00a0<code>mergesort<\/code>. Mergesort in Numpy actually uses Timsort or radix sort algorithms. These are stable sorting algorithms and stable sorting is necessary when sorting by multiple columns.<\/p>\n<p id=\"0366\" data-selectable-paragraph=\"\">The key things to try to remember for Pandas:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3c63019 elementor-widget elementor-widget-text-editor\" data-id=\"3c63019\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li id=\"39b9\" data-selectable-paragraph=\"\">The function name:\u00a0<code>sort_values()<\/code>.<\/li>\n \t<li id=\"c701\" data-selectable-paragraph=\"\">You need\u00a0<code>by=column_name<\/code>\u00a0or a list of column names.<\/li>\n \t<li id=\"b547\" data-selectable-paragraph=\"\"><code>ascending<\/code>\u00a0is the keyword for reversing.<\/li>\n \t<li id=\"6664\" data-selectable-paragraph=\"\">Use\u00a0<code>mergesort<\/code>\u00a0if you want a stable sort.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b8ffe56 elementor-widget elementor-widget-text-editor\" data-id=\"b8ffe56\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"bdac\" data-selectable-paragraph=\"\">When doing exploratory data analysis, I 
often find myself counting and sorting values in a Pandas DataFrame with\u00a0<code>Series.value_counts()<\/code>. Here\u2019s a code snippet to count and sort the most frequent values for each column.<\/p>\n<span style=\"font-family: courier new,courier,monospace;\">for c in df.columns:\n    print(f\"---- {c} ---\")\n    print(df[c].value_counts().head())<\/span>\n<p id=\"0d02\" data-selectable-paragraph=\"\"><a href=\"https:\/\/github.com\/dask\/dask\/issues\/4368\" rel=\"noopener\">Dask<\/a>, which is basically Pandas for big data, doesn\u2019t yet have a parallel sorting implementation as of mid 2019, although\u00a0<a href=\"https:\/\/github.com\/dask\/dask\/issues\/4368\" rel=\"noopener\">it\u2019s being discussed<\/a>.<\/p>\n<p id=\"1f1b\" data-selectable-paragraph=\"\">Sorting in Pandas is a nice option for exploratory data analysis on smaller datasets. When you have a lot of data and want parallelized sorting on a GPU, you may want to use TensorFlow or PyTorch.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1d86e90 elementor-widget elementor-widget-heading\" data-id=\"1d86e90\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"b124\" data-selectable-paragraph=\"\">TensorFlow<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0c3e43a elementor-widget elementor-widget-image\" data-id=\"0c3e43a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/700\/1*zmMOdVZ_j9vwMcpdD8Uceg.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element 
elementor-element-b49c47b elementor-widget elementor-widget-text-editor\" data-id=\"b49c47b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"328d\" data-selectable-paragraph=\"\"><a href=\"https:\/\/www.tensorflow.org\/versions\/r2.0\/api_docs\/python\/tf\/sort\" rel=\"noopener\">TensorFlow<\/a>\u00a0is the most popular deep learning framework. See my article on deep learning framework popularity and usage\u00a0<a href=\"https:\/\/towardsdatascience.com\/which-deep-learning-framework-is-growing-fastest-3f77f14aa318?source=friends_link&amp;sk=0a10207f22f4dbc143e7a90a3f843515\" rel=\"noopener\">here<\/a>. The following is for the GPU version of TensorFlow 2.0.<\/p>\n<p id=\"2417\" data-selectable-paragraph=\"\"><code>tf.sort(my_tensor)<\/code>\u00a0returns a sorted copy of a tensor. Optional arguments:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5b82ac7 elementor-widget elementor-widget-text-editor\" data-id=\"5b82ac7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li id=\"c027\" data-selectable-paragraph=\"\"><code><strong>axis<\/strong><\/code>: {int, optional} The axis along which to sort. 
The default is -1, which sorts the last axis.<\/li>\n \t<li id=\"ef81\" data-selectable-paragraph=\"\"><code><strong>direction<\/strong><\/code>: {<em>ascending<\/em>\u00a0or\u00a0<em>descending<\/em>} \u2014 direction in which to sort the values.<\/li>\n \t<li id=\"b3a2\" data-selectable-paragraph=\"\"><code><strong>name<\/strong><\/code>: {str, optional} \u2014 name for the operation.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-998960e elementor-widget elementor-widget-text-editor\" data-id=\"998960e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"f1eb\" data-selectable-paragraph=\"\"><code>tf.sort<\/code>\u00a0uses the<code>\u00a0top_k()<\/code>\u00a0method behind the scenes.\u00a0<code>top_k<\/code>\u00a0uses\u00a0<a href=\"https:\/\/nvlabs.github.io\/cub\/\" rel=\"noopener\">CUB library<\/a>\u00a0for CUDA GPUs to make parallelism easier to implement. As the docs explain \u201cCUB provides state-of-the-art, reusable software components for every layer of the CUDA programming model.\u201d TensorFlow uses radix sort on GPU via CUB, as discussed\u00a0<a href=\"https:\/\/github.com\/tensorflow\/tensorflow\/issues\/288\" rel=\"noopener\">here<\/a>.<\/p>\n<p id=\"7e2d\" data-selectable-paragraph=\"\">TensorFlow GPU info can be found\u00a0<a href=\"https:\/\/www.tensorflow.org\/install\/gpu\" rel=\"noopener\">here<\/a>. To enable GPU capabilities with TensorFlow 2.0 you need to\u00a0<code>!pip3 install tensorflow-gpu==2.0.0-beta1<\/code>. 
As we\u2019ll see from the evaluation below, you might want to stick with\u00a0<code>tensorflow==2.0.0-beta1<\/code>\u00a0if all you are doing is sorting (which isn\u2019t very likely).<\/p>\n<p id=\"7046\" data-selectable-paragraph=\"\">Use the following code snippet to see whether each line of code is running on the CPU or GPU:<\/p>\n<p id=\"2266\" data-selectable-paragraph=\"\"><code>tf.debugging.set_log_device_placement(True)<\/code><\/p>\n<p id=\"b0be\" data-selectable-paragraph=\"\">To specify that you want to use a GPU, use the following\u00a0<em>with<\/em>\u00a0block:<\/p>\n<span style=\"font-family: courier new,courier,monospace;\">with tf.device('\/GPU:0'):\n    %time tf.sort(my_tf_tensor)<\/span>\n<p id=\"09d2\" data-selectable-paragraph=\"\">Use\u00a0<code>with tf.device('\/CPU:0'):<\/code>\u00a0to use the CPU.<\/p>\n<p id=\"0b51\" data-selectable-paragraph=\"\"><code>tf.sort()<\/code>\u00a0is a pretty intuitive method to remember and use if you work in TensorFlow. Just remember\u00a0<code>direction='DESCENDING'<\/code>\u00a0to switch the sort order.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-76bd882 elementor-widget elementor-widget-heading\" data-id=\"76bd882\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"d649\" data-selectable-paragraph=\"\">PyTorch<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5f2ca05 elementor-widget elementor-widget-image\" data-id=\"5f2ca05\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/503\/1*8u5YFObocx7AviTbZYo8-g.png\" alt=\"\" 
\/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7c10ca1 elementor-widget elementor-widget-text-editor\" data-id=\"7c10ca1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p data-selectable-paragraph=\"\"><code>torch.sort(my_tensor)<\/code>\u00a0returns the sorted values of a tensor along with the indices of those values in the original tensor. Optional arguments:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9ccd072 elementor-widget elementor-widget-text-editor\" data-id=\"9ccd072\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li id=\"3abf\" data-selectable-paragraph=\"\"><code><strong>dim<\/strong><\/code>: {int, optional<em>}<\/em>\u00a0\u2014 the dimension to sort along.<\/li>\n \t<li id=\"8393\" data-selectable-paragraph=\"\"><code><strong>descending<\/strong><\/code>: {bool, optional<em>}<\/em>\u00a0\u2014 controls the sorting order (ascending or descending).<\/li>\n \t<li id=\"e48c\" data-selectable-paragraph=\"\"><code><strong>out<\/strong><\/code>: {tuple, optional<em>}<\/em>\u00a0\u2014 the output tuple of (Tensor, LongTensor) that can be optionally given to be used as output buffers.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c11980e elementor-widget elementor-widget-text-editor\" data-id=\"c11980e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"cc0e\" data-selectable-paragraph=\"\">To sort on the GPU, append\u00a0<code>.cuda()<\/code>\u00a0to your tensor.<\/p>\n<span style=\"font-family: courier 
new,courier,monospace;\">gpu_tensor = my_pytorch_tensor.cuda()\n%time torch.sort(gpu_tensor)<\/span>\n<p id=\"a6c3\" data-selectable-paragraph=\"\">Some digging showed that PyTorch uses a segmented parallel sort via\u00a0<a href=\"https:\/\/thrust.github.io\/\" rel=\"noopener\">Thrust<\/a>\u00a0if a dataset larger than 1 million rows by 100,000 columns is being sorted.<\/p>\n<p id=\"fdcd\" data-selectable-paragraph=\"\">Unfortunately, I ran out of memory when trying to\u00a0create a 1.1M x 100K array of random data points via Numpy in Google Colab. I then tried GCP with 416 GB of RAM and still ran out of memory.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-23d66af elementor-widget elementor-widget-text-editor\" data-id=\"23d66af\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<blockquote>\n<p id=\"01b6\" data-selectable-paragraph=\"\">Segmented sort and locality sort are high-performance variants of mergesort that operate on non-uniform random data. Segmented sort allows us to sort many variable-length arrays in parallel. \u2014\u00a0<a href=\"https:\/\/moderngpu.github.io\/segsort.html\" rel=\"noopener\">https:\/\/moderngpu.github.io\/segsort.html<\/a><\/p>\n<\/blockquote>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b1eda0a elementor-widget elementor-widget-text-editor\" data-id=\"b1eda0a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"865a\" data-selectable-paragraph=\"\"><a href=\"https:\/\/github.com\/thrust\/thrust\/wiki\/Quick-Start-Guide\" rel=\"noopener\">Thrust<\/a>\u00a0is a parallel algorithms library that enables performance portability between GPUs and multicore CPUs. 
It provides a sort primitive that selects the most efficient implementation automatically. Thrust\u2019s CUDA backend is itself built on CUB, the library TensorFlow uses, so PyTorch and TensorFlow rely on similar implementations for GPU sorting under the hood: whatever Thrust and CUB choose for the situation.<\/p>\n<p id=\"a4ab\" data-selectable-paragraph=\"\">Like TensorFlow, the sorting method in PyTorch is fairly straightforward to remember:\u00a0<code>torch.sort()<\/code>. The only tricky thing to remember is how to set the sort order: TensorFlow uses\u00a0<code>direction<\/code>\u00a0while PyTorch uses\u00a0<code>descending<\/code>. And don\u2019t forget to use\u00a0<code>.cuda()<\/code>\u00a0to get a giant speed boost with large data sets.<\/p>\n<p id=\"b6a7\" data-selectable-paragraph=\"\">While sorting with GPUs could be a good option for really large datasets, it might also make sense to sort data directly in SQL.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4f7c55b elementor-widget elementor-widget-heading\" data-id=\"4f7c55b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"2817\" data-selectable-paragraph=\"\">SQL<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f896bc0 elementor-widget elementor-widget-text-editor\" data-id=\"f896bc0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"a9d2\" data-selectable-paragraph=\"\">Sorting in SQL is often very fast, particularly when the sort is in-memory.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bb2b5a1 elementor-widget elementor-widget-image\" data-id=\"bb2b5a1\" 
data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/176\/1*tCcPfdWHFyacdflarVXy5w.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4b1a0f3 elementor-widget elementor-widget-text-editor\" data-id=\"4b1a0f3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"70c7\" data-selectable-paragraph=\"\">SQL is a specification, but doesn\u2019t dictate things like which sort algorithm an implementation must use.\u00a0<a href=\"https:\/\/madusudanan.com\/blog\/all-you-need-to-know-about-sorting-in-postgres\/\" rel=\"noopener\">Postgres uses<\/a>\u00a0a disk merge sort, heap sort, or quick sort, depending upon the circumstances. If you have enough memory, sorts can be made much faster by making them in-memory. Increase the available memory for sorts via the\u00a0<code><a href=\"https:\/\/wiki.postgresql.org\/wiki\/Tuning_Your_PostgreSQL_Server\" rel=\"noopener\">work_mem<\/a><\/code><a href=\"https:\/\/wiki.postgresql.org\/wiki\/Tuning_Your_PostgreSQL_Server\" rel=\"noopener\">\u00a0setting<\/a>.<\/p>\n<p id=\"cbc0\" data-selectable-paragraph=\"\">Other SQL implementations use different sorting algorithms. 
For example, Google BigQuery uses introsort with some tricks, according to\u00a0<a href=\"https:\/\/stackoverflow.com\/a\/53026600\/4590385\" rel=\"noopener\">this Stack Overflow answer<\/a>.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-860dfe3 elementor-widget elementor-widget-text-editor\" data-id=\"860dfe3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"ae43\" data-selectable-paragraph=\"\">Sorts in SQL are performed with the\u00a0<code>ORDER BY<\/code>\u00a0command. This syntax is distinct from the Python implementations that all use some form of the word\u00a0<em>sort.\u00a0<\/em>I find it easier to remember that ORDER BY goes with SQL syntax because it\u2019s so unique.<\/p>\n<p id=\"7f53\" data-selectable-paragraph=\"\">To make the sort descending, use the keyword DESC. So a query to return customers in alphabetical order from last to first would look like this:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-586e4c8 elementor-widget elementor-widget-text-editor\" data-id=\"586e4c8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<span style=\"font-family: courier new,courier,monospace;\">SELECT Names FROM Customers\nORDER BY Names DESC;<\/span>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f4e8f45 elementor-widget elementor-widget-heading\" data-id=\"f4e8f45\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"8050\" 
data-selectable-paragraph=\"\">Comparisons<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a50ace1 elementor-widget elementor-widget-text-editor\" data-id=\"a50ace1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"ee7c\" data-selectable-paragraph=\"\">For each of the Python libraries above, I conducted an analysis of the wall time to sort the same 1,000,000 data points in a single column, array, or list. I used a\u00a0<a href=\"https:\/\/colab.research.google.com\/\" rel=\"noopener\">Google Colab<\/a>\u00a0Jupyter Notebook with a K80 GPU and Intel(R) Xeon(R) CPU @ 2.30GHz.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5276982 elementor-widget elementor-widget-image\" data-id=\"5276982\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1000\/1*oAzzPmtk4-lxWzNqDlE11w.gif\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6423e8d elementor-widget elementor-widget-text-editor\" data-id=\"6423e8d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\" data-selectable-paragraph=\"\">Source code:\u00a0<a href=\"https:\/\/colab.research.google.com\/drive\/1NNarscUZHUnQ5v-FjbfJmB5D3kyyq9Av\" rel=\"noopener\">https:\/\/colab.research.google.com\/drive\/1NNarscUZHUnQ5v-FjbfJmB5D3kyyq9Av<\/a><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-69b6b2a 
elementor-widget elementor-widget-heading\" data-id=\"69b6b2a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"46d1\" data-selectable-paragraph=\"\">Observations<\/h2>\n<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-062b179 elementor-widget elementor-widget-text-editor\" data-id=\"062b179\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li id=\"6ba3\" data-selectable-paragraph=\"\">PyTorch with GPU is super fast.<\/li>\n \t<li id=\"b234\" data-selectable-paragraph=\"\">For both Numpy and Pandas, inplace is generally faster than copying the data.<\/li>\n \t<li id=\"7dac\" data-selectable-paragraph=\"\">The default Pandas quicksort is rather fast.<\/li>\n \t<li id=\"1bbf\" data-selectable-paragraph=\"\">Most Pandas functions are comparatively slower than their Numpy counterparts.<\/li>\n \t<li id=\"540b\" data-selectable-paragraph=\"\">TensorFlow CPU is quite fast. The GPU install slows down TensorFlow even when the CPU is used. The GPU sort is quite slow. This looks like a possible bug.<\/li>\n \t<li id=\"0e26\" data-selectable-paragraph=\"\">Vanilla Python inplace sorting is surprisingly slow. It was nearly 100x slower than the PyTorch GPU-enabled sort. I tested it multiple times (with different data) to double check that this was not an anomaly.<\/li>\n<\/ul>\n<p id=\"5f26\" data-selectable-paragraph=\"\">Again, this is just one small test. 
It\u2019s far from definitive.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-dff2e47 elementor-widget elementor-widget-heading\" data-id=\"dff2e47\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"533b\" data-selectable-paragraph=\"\">Wrap<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-215b69f elementor-widget elementor-widget-text-editor\" data-id=\"215b69f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"3ef3\" data-selectable-paragraph=\"\">You generally shouldn\u2019t need custom sorting implementations. The off-the-shelf options are strong. They generally don\u2019t rely on a single sorting method. Instead, they evaluate the data first and then use a sorting algorithm that performs well. Some implementations even change algorithms if the sort is not progressing quickly.<\/p>\n<p id=\"16fd\" data-selectable-paragraph=\"\">In this article, you\u2019ve seen how to sort in each part of the Python data science stack and in SQL. I hope you\u2019ve found it helpful. If you have, please share it on your favorite social media so others can find it, too.<\/p>\n<p id=\"9781\" data-selectable-paragraph=\"\">You just need to remember which option to choose and how to call it. Use my cheat sheet above to save time. 
My general recommendations are the following:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-a0f0eb3 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-eae-slider=\"10595\" data-id=\"a0f0eb3\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-32c663e\" data-eae-slider=\"75383\" data-id=\"32c663e\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b89868a elementor-widget elementor-widget-text-editor\" data-id=\"b89868a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li id=\"e088\" data-selectable-paragraph=\"\">Use the default Pandas\u00a0<code>sort_values()<\/code>\u00a0for exploration on relatively small datasets.<\/li>\n \t<li id=\"f649\" data-selectable-paragraph=\"\">For large datasets or when speed is at a premium, try Numpy\u2019s in-place mergesort, a PyTorch or TensorFlow parallel GPU implementation, or SQL.<\/li>\n<\/ul>\n<p id=\"b1c7\" data-selectable-paragraph=\"\">Sorting on GPUs isn\u2019t something I\u2019ve seen much written about. It\u2019s an area that appears ripe for more research and tutorials. Here\u2019s a 2017 article to give you a taste of recent\u00a0<a href=\"https:\/\/dl.acm.org\/citation.cfm?id=3079105\" rel=\"noopener\">research<\/a>. 
More info on GPU sorting algorithms can be found\u00a0<a href=\"https:\/\/devtalk.nvidia.com\/default\/topic\/951795\/fastest-sorting-algorithm-on-gpu-currently\/\" rel=\"noopener\">here<\/a>.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>There are many different basic sorting algorithms. Some perform faster and use less memory than others. Some are better suited to big data and some work better if the data are arranged in certain ways.&nbsp;Choosing which library and type of sorting algorithm to use can be tricky. Implementations change quickly.&nbsp;In this article, I&rsquo;ll give you the lay of the land, provide tips to help you remember the methods and share the results of a speed test.<\/p>\n","protected":false},"author":369,"featured_media":3504,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[187],"tags":[94],"ppma_author":[2134],"class_list":["post-1861","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":2134,"user_id":369,"is_guest":0,"slug":"jeff-hale","display_name":"Jeff Hale","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","author_category":"","user_url":"","last_name":"Hale","first_name":"Jeff","job_title":"","description":"Jeff Hale is a co-founder of Rebel Desk, where he oversees technology, finance, and operations for this company. 
He&nbsp;is an experienced entrepreneur who has managed technology, operations, and finances for several companies.&nbsp;"}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1861","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/369"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1861"}],"version-history":[{"count":0,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1861\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/3504"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1861"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1861"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1861"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1861"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}