{"id":1217,"date":"2019-02-15T10:32:01","date_gmt":"2019-02-15T10:32:01","guid":{"rendered":"http:\/\/kusuaks7\/?p=822"},"modified":"2023-07-17T13:07:55","modified_gmt":"2023-07-17T13:07:55","slug":"why-you-should-forget-loops-and-embrace-vectorization-for-data-science","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/why-you-should-forget-loops-and-embrace-vectorization-for-data-science\/","title":{"rendered":"Why you should forget loops and embrace vectorization for Data Science"},"content":{"rendered":"<p><strong><em>Ready to learn Data Science? <a href=\"https:\/\/www.experfy.com\/training\/courses\">Browse courses<\/a>\u00a0like\u00a0<a href=\"https:\/\/www.experfy.com\/training\/tracks\/data-science-training-certification\">Data Science Training and Certification<\/a> developed by industry thought leaders and Experfy in Harvard Innovation Lab.<\/em><\/strong><\/p>\n<h3><strong><em><u>Python and Numpy in modern data science<\/u><\/em><\/strong><\/h3>\n<p>Python is fast emerging as the <a href=\"https:\/\/www.quora.com\/Why-is-Python-a-language-of-choice-for-data-scientists\" target=\"_blank\" rel=\"noopener noreferrer\">de-facto programming language of choice<\/a> for data scientists. But unlike R or Julia, it is a general-purpose language and does not have a functional syntax to start analyzing and transforming numerical data right out of the box. So, it needs specialized library.<\/p>\n<p><strong>Numpy<\/strong>, short for <a href=\"http:\/\/numpy.org\" target=\"_blank\" rel=\"noopener noreferrer\">Numerical Python<\/a>, is the fundamental package required for high performance scientific computing and data analysis in Python ecosystem. 
It is the foundation on which nearly all of the higher-level tools such as <a href=\"https:\/\/pandas.pydata.org\" target=\"_blank\" rel=\"noopener noreferrer\">Pandas<\/a> and <a href=\"http:\/\/scikit-learn.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">scikit-learn<\/a> are built. <a href=\"https:\/\/www.tensorflow.org\" target=\"_blank\" rel=\"noopener noreferrer\">TensorFlow<\/a> uses NumPy arrays as the fundamental building block on top of which it builds its Tensor objects and computational graphs for deep learning tasks (which make heavy use of linear algebra operations on long lists\/vectors\/matrices of numbers).<\/p>\n<p>Two of the most important advantages Numpy provides are:<\/p>\n<ul>\n<li>ndarray, a fast and space-efficient multidimensional array providing vectorized arithmetic operations and sophisticated <a href=\"https:\/\/towardsdatascience.com\/two-cool-features-of-python-numpy-mutating-by-slicing-and-broadcasting-3b0b86e8b4c7\" target=\"_blank\" rel=\"noopener noreferrer\"><em>broadcasting<\/em> capabilities<\/a><\/li>\n<li>Standard mathematical functions for fast operations on entire arrays of data <em>without having to write iteration loops<\/em><\/li>\n<\/ul>\n<p>You will often come across the assertion in the data science, machine learning, and Python communities that <strong>Numpy is much faster because of its vectorized implementation<\/strong> and because many of its core routines are written in C (as extensions to the <a href=\"https:\/\/en.wikipedia.org\/wiki\/CPython\" target=\"_blank\" rel=\"noopener noreferrer\">CPython<\/a> interpreter).<\/p>\n<p>And it is indeed true (<a href=\"http:\/\/notes-on-cython.readthedocs.io\/en\/latest\/std_dev.html\" target=\"_blank\" rel=\"noopener noreferrer\">this article is a beautiful demonstration<\/a> of the various ways one can work with Numpy, even writing bare-bones C routines against Numpy APIs). <strong>Numpy arrays are densely packed arrays of homogeneous type. 
Python lists, by contrast, are arrays of pointers to objects<\/strong>, even when all of them are of the same type. So, you get the benefits of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Locality_of_reference\" target=\"_blank\" rel=\"noopener noreferrer\">locality of reference<\/a>. Many Numpy operations are implemented in C, avoiding the general cost of loops in Python, pointer indirection, and per-element <a href=\"https:\/\/www.sitepoint.com\/typing-versus-dynamic-typing\/\" target=\"_blank\" rel=\"noopener noreferrer\">dynamic type checking<\/a>. The speed boost depends on which operations you\u2019re performing. <strong>For data science and modern machine learning tasks, this is an invaluable advantage<\/strong>, as often the data set size runs into millions if not billions of records and you do not want to iterate over it using a <em>for-loop<\/em> along with its associated baggage.<\/p>\n<h3><strong><em><u>How much faster is Numpy compared to a \u2018for-loop\u2019?<\/u><\/em><\/strong><\/h3>\n<p>Now, we have all used <em>for-loops<\/em> for the majority of tasks that need iteration over a long list of elements. I am sure almost everybody reading this article wrote their first code for matrix or vector multiplication using a for-loop back in high school or college. The <em>for-loop<\/em> has served the programming community long and well. However, it comes with some baggage and is often slow in execution when it comes to processing large data sets (many millions of records, as in this age of <em>Big Data<\/em>). This is particularly true for an interpreted language like Python, where, if the body of your loop is simple, the<strong> interpreter overhead of the loop itself can be a substantial fraction of the total execution time<\/strong>. 
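<\/p>
<p>This overhead is easy to measure yourself. The following is a minimal sketch (variable names are illustrative, not from the linked notebook): the same base-10 logarithm is computed element by element in a Python loop and then in a single vectorized call, and the two results are checked for agreement:<\/p>

```python
import math
import time
import numpy as np

n = 1_000_000
a = np.random.uniform(1.0, 100.0, n)  # strictly positive, so log10 is defined
lst = a.tolist()

# Python for-loop: interpreter overhead on every iteration
t1 = time.perf_counter()
out_loop = []
for x in lst:
    out_loop.append(math.log10(x))
t2 = time.perf_counter()

# Numpy: one vectorized call over the whole array
t3 = time.perf_counter()
out_vec = np.log10(a)
t4 = time.perf_counter()

print('for-loop: {:.4f} s'.format(t2 - t1))
print('np.log10: {:.4f} s'.format(t4 - t3))
print('results match:', np.allclose(out_loop, out_vec))
```

Absolute timings depend on your hardware; only the relative gap matters here.
<p>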
Therefore, an equivalent Numpy vectorized operation can offer a significant speed boost for the repetitive mathematical operations that a data scientist needs to perform routinely.<\/p>\n<p>In this short article, I intend to demonstrate this definitively with an example on a moderately sized data set.<\/p>\n<p>Here is the <a href=\"https:\/\/github.com\/tirthajyoti\/PythonMachineLearning\/blob\/master\/How%20fast%20are%20NumPy%20ops.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\"><strong><em>link to my Github code<\/em><\/strong><\/a> (Jupyter notebook) that shows, in a few easy lines of code, the difference in speed between Numpy operations and regular Python programming constructs like the <em>for-loop<\/em>, <a href=\"https:\/\/stackoverflow.com\/questions\/10973766\/understanding-the-map-function\" target=\"_blank\" rel=\"noopener noreferrer\"><em>map-function<\/em><\/a>, or <a href=\"http:\/\/www.pythonforbeginners.com\/basics\/list-comprehensions-in-python\" target=\"_blank\" rel=\"noopener noreferrer\"><em>list-comprehension<\/em><\/a>.<\/p>\n<p>I will just outline the basic flow:<\/p>\n<ul>\n<li>Create a list of a moderately large number of floating-point numbers, preferably drawn from a continuous statistical distribution like a Gaussian or uniform distribution. I chose 1 million for the demo.<\/li>\n<li>Create an ndarray object out of that list, i.e. vectorize.<\/li>\n<li>Write short code blocks to iterate over the list and apply a mathematical operation to each element, say taking the base-10 logarithm. Use for-loop, map-function, and list-comprehension. 
Each time, use the time.time() function to determine how much time it takes in total to process the 1 million records.<\/li>\n<\/ul>\n<p><span style=\"font-family: courier new,courier,monospace;\">from math import log10 as lg10<br \/>\nt1 = time.time()<br \/>\nfor item in l1:<br \/>\n\u00a0\u00a0\u00a0\u00a0l2.append(lg10(item))<br \/>\nt2 = time.time()<br \/>\nprint(\"With for loop and appending it took {} seconds\".format(t2-t1))<br \/>\nspeed.append(t2-t1)<\/span><\/p>\n<ul>\n<li>Do the same operation using Numpy\u2019s built-in mathematical method (np.log10) over the ndarray object. Time it.<\/li>\n<\/ul>\n<p><span style=\"font-family: courier new,courier,monospace;\">t1 = time.time()<br \/>\na2 = np.log10(a1)<br \/>\nt2 = time.time()<br \/>\nprint(\"With direct Numpy log10 method it took {} seconds\".format(t2-t1))<br \/>\nspeed.append(t2-t1)<\/span><\/p>\n<ul>\n<li>Store the execution times in a list and plot a bar chart showing the comparative difference.<\/li>\n<\/ul>\n<p>Here is the result. You can repeat the whole process by running all the cells of the Jupyter notebook. Every time it will generate a new set of random numbers, so the exact execution time may vary a little bit, but overall the trend will always be the same. You can try various other mathematical functions\/string operations or combinations thereof, to check whether this holds true in general.<\/p>\n<h3><strong><em><u>You can do this trick even with if-then-else conditional logic<\/u><\/em><\/strong><\/h3>\n<p>The vectorization trick is fairly well known to data scientists and is used routinely in coding to speed up the overall data transformation, where simple mathematical transformations are performed over an iterable object, e.g. a list. What is less appreciated is that it even pays to vectorize non-trivial code blocks such as conditional loops.<\/p>\n<p>Now, mathematical transformations based on some predefined condition are fairly common in data science tasks. 
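<\/p>
<p>As a small illustration (the branch thresholds here are arbitrary, and numpy.select is an alternative technique beyond what this article covers), the same conditional transformation can be written as a plain loop or as a fully vectorized expression that evaluates every condition array-wide:<\/p>

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

# Loop version: pick a branch for each element
out_loop = []
for v in x:
    if v > 1.0:
        out_loop.append(v * 2)
    elif v > 0.0:
        out_loop.append(v)
    else:
        out_loop.append(0.0)

# Vectorized version: np.select evaluates the conditions over whole
# arrays and picks the first matching branch per element
out_vec = np.select([x > 1.0, x > 0.0], [x * 2, x], default=0.0)

print(out_vec.tolist())  # [0.0, 0.0, 0.0, 0.5, 4.0]
```

<p>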
It turns out one can easily vectorize simple blocks of conditional loops by first turning them into functions and then using the numpy.vectorize method. As we saw above, there is the possibility of an order-of-magnitude speed improvement with vectorization for simple mathematical transformations. For the case with conditional loops, the speedup is less dramatic, as the internal conditional looping is still somewhat inefficient. However, there is at least a 20\u201350% improvement in execution time over plain vanilla Python code.<\/p>\n<p>Here is the simple code to demonstrate it:<\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\">import numpy as np<br \/>\nfrom math import sin as sn<br \/>\nimport matplotlib.pyplot as plt<br \/>\nimport time<\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><strong># Number of test points<\/strong><br \/>\nN_point = 1000<\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><strong># Define a custom function with some if-else branches<\/strong><br \/>\ndef myfunc(x,y):<br \/>\n\u00a0\u00a0\u00a0\u00a0if (x&gt;0.5*y and y&lt;0.3):<br \/>\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0return (sn(x-y))<br \/>\n\u00a0\u00a0\u00a0\u00a0elif (x&lt;0.5*y):<br \/>\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0return 0<br \/>\n\u00a0\u00a0\u00a0\u00a0elif (x&gt;0.2*y):<br \/>\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0return (2*sn(x+2*y))<br \/>\n\u00a0\u00a0\u00a0\u00a0else:<br \/>\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0return (sn(y+x))<\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><strong># Lists of input elements, generated from a Normal distribution<\/strong><br \/>\nlst_x = np.random.randn(N_point)<br \/>\nlst_y = np.random.randn(N_point)<br \/>\nlst_result = []<\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><strong># Optional plots of the data<\/strong><br \/>\nplt.hist(lst_x,bins=20)<br \/>\nplt.show()<br \/>\nplt.hist(lst_y,bins=20)<br \/>\nplt.show()<\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><strong># First, plain vanilla for-loop<\/strong><br \/>\nt1 = time.time()<br \/>\nfor i in range(len(lst_x)):<br \/>\n\u00a0\u00a0\u00a0\u00a0x = lst_x[i]<br \/>\n\u00a0\u00a0\u00a0\u00a0y = lst_y[i]<br \/>\n\u00a0\u00a0\u00a0\u00a0if (x&gt;0.5*y and y&lt;0.3):<br \/>\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0lst_result.append(sn(x-y))<br \/>\n\u00a0\u00a0\u00a0\u00a0elif (x&lt;0.5*y):<br \/>\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0lst_result.append(0)<br \/>\n\u00a0\u00a0\u00a0\u00a0elif (x&gt;0.2*y):<br \/>\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0lst_result.append(2*sn(x+2*y))<br \/>\n\u00a0\u00a0\u00a0\u00a0else:<br \/>\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0lst_result.append(sn(y+x))<br \/>\nt2 = time.time()<\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\">print(\"\\nTime taken by the plain vanilla for-loop\\n\" + '-'*40 + \"\\n{} us\".format(1000000*(t2-t1)))<\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><strong># List comprehension<\/strong><br \/>\nprint(\"\\nTime taken by list comprehension and zip\\n\" + '-'*40)<br \/>\n%timeit lst_result = [myfunc(x,y) for x,y in zip(lst_x,lst_y)]<\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><strong># Map() function<\/strong><br \/>\nprint(\"\\nTime taken by map function\\n\" + '-'*40)<br \/>\n%timeit list(map(myfunc,lst_x,lst_y))<\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><strong># Numpy.vectorize method<\/strong><br \/>\nprint(\"\\nTime taken by numpy.vectorize method\\n\" + '-'*40)<br \/>\nvectfunc = np.vectorize(myfunc,otypes=[float],cache=False)<br \/>\n%timeit list(vectfunc(lst_x,lst_y))<\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><strong># Results<\/strong><br \/>\nTime taken by the plain vanilla for-loop<br \/>\n----------------------------------------<br \/>\n<strong>2000.0934600830078<\/strong> us<\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\">Time taken by list comprehension and zip<br 
\/>\n----------------------------------------<br \/>\n1000 loops, best of 3: <strong>810 \u00b5s<\/strong> per loop<\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\">Time taken by map function<br \/>\n----------------------------------------<br \/>\n1000 loops, best of 3: <strong>726 \u00b5s<\/strong> per loop<\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\">Time taken by numpy.vectorize method<br \/>\n----------------------------------------<br \/>\n1000 loops, best of 3: <strong>516 \u00b5s<\/strong> per loop<\/span><\/p>\n<p>Notice that I have used the<strong> %timeit<\/strong> <a href=\"http:\/\/ipython.readthedocs.io\/en\/stable\/interactive\/magics.html\" target=\"_blank\" rel=\"noopener noreferrer\">Jupyter magic command<\/a> everywhere I could write the evaluated expression in one line. That way I am effectively running at least 1000 loops of the same expression and averaging the execution time to avoid any random effects. Consequently, if you run this whole script in a Jupyter notebook, you may get slightly different results for the first case, i.e. plain vanilla for-loop execution, but the next three should show a very consistent trend (depending on your computer hardware).<\/p>\n<h3><strong><em><u>Summary and conclusion<\/u><\/em><\/strong><\/h3>\n<p>We see evidence that, for this data transformation task based on a series of conditional checks, the vectorization approach using Numpy routinely gives some 20\u201350% speedup compared to general Python methods.<\/p>\n<p><strong>It may not seem like a dramatic improvement, but every bit of time saving adds up in a data science pipeline and pays back in the long run! <\/strong>If a data science job requires this transformation to happen a million times, it may make the difference between 2 days and 8 hours.<\/p>\n<p>In short, wherever you have a long list of data and need to perform some mathematical transformation over it, strongly consider turning those Python data structures (lists, tuples, or dictionaries) into numpy.ndarray objects and using their inherent vectorization capabilities.<\/p>\n<p>There is an entire open-source, online book on this topic by a French neuroscience researcher. <a href=\"https:\/\/www.labri.fr\/perso\/nrougier\/from-python-to-numpy\/#id7\" target=\"_blank\" rel=\"noopener noreferrer\">Check it out here<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Ready to learn Data Science? Browse courses\u00a0like\u00a0Data Science Training and Certification developed by industry thought leaders and Experfy in Harvard Innovation Lab. Python and Numpy in modern data science Python is fast emerging as the de-facto programming language of choice for data scientists. 
But unlike R or Julia, it is a general-purpose language and does<\/p>\n","protected":false},"author":137,"featured_media":14381,"comment_status":"open","ping_status":"open","sticky":false,"template":"single-post-2.php","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[94],"ppma_author":[1967],"class_list":["post-1217","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":1967,"user_id":137,"is_guest":0,"slug":"tirthajyoti-sarkar","display_name":"Tirthajyoti Sarkar","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Sarkar","first_name":"Tirthajyoti","job_title":"","description":"Dr. Tirthajyoti Sarkar, Principal Engineer at ON Semiconductor, conducts research on and designs advanced semiconductor technology and products, which power various things from smartphones to electric cars, with data centers and washing machines in between. He also moonlights by learning and practicing data science, machine learning, and Python\/R programming. 
He writes for multiple Data Science\/Artificial intelligence focused publications and loves to experiment with advanced machine learning techniques for application to semiconductor designs."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1217","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/137"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1217"}],"version-history":[{"count":5,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1217\/revisions"}],"predecessor-version":[{"id":29263,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1217\/revisions\/29263"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/14381"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1217"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1217"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1217"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1217"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}