This is a guest post by Dr. Andrew Peterson, Senior Data Scientist. A related post, "Is Python Becoming the King of the Data Science Forest?", has generated spirited debate on LinkedIn and has resulted in the highest number of shares on social media from Experfy Insights. The evaluation of Python below continues this discussion and we are grateful to Dr. Peterson for his timely contribution.
The following is a list of issues that were identified with Python and associated packages during an evaluation of Python as a replacement for R. The evaluation was done at the request of a member of the senior executive team who was responding to the common but inaccurate perception that Python is now being used by more people for analytics than R is. The issues listed here were deemed significant because they could easily lead to important coding errors that may be difficult to track down and would thus require strict coding standards and extensive unit testing to minimise the risk of them occurring. The context of the evaluation was strictly in terms of developing predictive models based on machine learning, time series analysis, GLMs, Markov chains, etc.
- Indexing Arrays: Python indexes arrays by identifying the gaps between cells in a vector, not the cells themselves. This is not consistent with other languages such as C++, Java, Matlab, R, etc., all of which index the actual cells in vectors and arrays. Python indexing is not intuitive from a mathematical or statistical perspective either and creates a significant opportunity for coding errors that may go undetected. Complicating matters further, NumPy introduces another two ways of indexing arrays that can be chosen on the fly, and Pandas adds a fourth way of indexing (which happens to be the same approach used in R). So there are at least four different ways that arrays can be indexed, all of which can be used together in the same code. However, different indexing methods have different implications for speed and memory usage, and they can also return different types of objects: either a view of the array or a copy of the array (see below).
- Views vs. Copies: In NumPy, assigning a slice from an array to a different variable creates a view of the original array. This means if an element in the new variable (the view) is changed, that element in the original array will also change. To avoid this behaviour the .copy() method must be explicitly used to make an independent copy of the slice. However, the so-called fancy indexing in NumPy returns a copy by default, not a view. Additionally, some matrix operations such as transposing will return a view, not a new object, and reshaping an array will return either a view or a copy depending on the circumstances. Thus, the scenarios in which a view or a copy is returned are complex and create additional opportunities for subtle coding errors that could return very significant numerical errors while being difficult to identify and correct.
- Broadcasting: NumPy allows arithmetic operations to be performed on arrays that are non-conformable. It does this by automatically repeating rows or columns to force the arrays to be conformable. For example, a 3×3 matrix and a 1×3 matrix can be added together because NumPy will automatically repeat the 1×3 matrix 3 times to create a new 3×3 matrix and then return the result of the addition as a 3×3 matrix. Other languages won't permit this operation at all because the arrays, as defined, are not conformable. Broadcasting, which is promoted as a feature in NumPy, creates an opportunity for significant errors in matrix operations that cannot occur in other languages.
- Numeric and Integer Data Types: There are specific and nonstandard rules for the transformation (or not) of one data type to another following different mathematical operations. These rules have precedence depending on the size of the data type, where the biggest data type always wins. For example, multiplying a 64-bit integer by a 32-bit float will return an integer, but multiplying a 32-bit integer by a 64-bit float returns a float and multiplying a 64-bit integer by a 64-bit float returns a float. In other languages, multiplying an integer by a float returns a float. This means that analysts need to be conscious of data types when performing even simple mathematical operations, which conflicts with the duck typing philosophy promoted within the Python community. In addition, keeping track of more complex object types (and their associated methods) is tedious in a dynamically-typed environment. This is one of the main reasons for introducing Traits, which forces static typing within Python to generate significant benefits. An apt quote from the Enthought Traits web page: "Python does not require the data type of variables to be declared. As any experienced Python programmer knows, this flexibility has both good and bad points. The Traits package was developed to address some of the problems caused by not having declared variable types, in those cases where problems might arise."
- Ignored Errors: There are certain errors that the Python interpreter will ignore (e.g. not putting the name for assigned data types in quotes), but it is not clear exactly what errors the interpreter will or won't ignore, or the assumptions it is making when it does ignore those errors. In addition, there are differences between the IPython interpreter and the standard interpreter in the way these errors are, or are not, ignored. This means code with errors may run in one interpreter (and potentially return incorrect results without the analyst knowing) but not run in the other interpreter. In addition, IPython supports "magic" functions, which are not available in standard Python.
- Inconsistencies Across Packages: SciPy loads NumPy into the SciPy namespace so that functions in NumPy can be called using SciPy conventions. But certain basic mathematical functions such as log functions call different underlying libraries depending on whether they are called from the SciPy or NumPy namespaces. This is also true for the linear algebra functions for solving systems of linear equations. Calling those functions from SciPy will load one linear algebra library, but calling the same function from NumPy will load a different linear algebra library.
- Pivot Tables in Pandas: There does not appear to be a simple way to get true counts of values in a pivot table (as you can with pivot tables in Excel). Aggregate functions like sum and mean appear to work, but count-based aggregate functions count not only the cells containing the values of interest, but also the cells that identify those rows or columns. Thus the final count will be twice what is expected. If the marginal values are included then the row and column counts will also be twice what they should be, but the overall count (the grand sum) will be correct.
- Sorting a Data Frame Re-orders the Row Index: When a data frame is sorted with df.sort_index(), the row index is included in the sorting and is therefore reordered along with the rows. This means that if the first row of the sorted data frame is extracted with df.ix[1,:], the result will be the row that was the first row in the unsorted data frame, because this row will still have the index value 1. Also, when a data frame is sorted and the row index of the sorted data frame is, for example, [1, 4, 3, 2, 5], and then the first two rows are extracted using df.ix[1:2,:], the result will be the first 4 rows of the data frame, because these sorted row indices now fall between those rows with indices 1 and 2. However, df.head(5) will return the first 5 rows of the sorted data frame as expected. The only way to deal with this appears to be to create a new index after the data frame has been sorted: df.index = range(1, len(df) + 1)
- Poor Documentation: As an example, the Pandas df.groupby method returns an object of type DataFrameGroupBy. This type, and the numerous attributes associated with it, are undocumented. Poor documentation is a consistent criticism of Python packages in general.
- Incompatibility Across Different Python Distributions: A number of scientific Python distributions have been created in order to manage the very complex package dependencies involved with setting up Python for scientific computing. These distributions typically include packages and base Python that are at least 1-2 versions old due to the complexity of the dependencies. However, we have observed that despite different distributions containing (almost!) the same packages, code developed using one scientific distribution does not necessarily run on a different scientific distribution. Tracking down this type of bug can be exceedingly time-consuming without an expert understanding of Python as a system, not just as a language.
- IDE Instability: The Python code that was written for this evaluation was developed in the Spyder IDE because IPython was too cumbersome for code development. We observed that the Python interpreter was unpredictably unstable, crashing regularly but without any clear pattern or apparent reason; this was true for both IPython and Spyder. For example, code that caused a crash on its first run would run successfully if run again immediately. Simple syntax errors were found that would cause a Python crash, sometimes with an error message and sometimes without. It is unusual for a production-ready language to be unable to exit gracefully under almost any circumstances.
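To make the Indexing Arrays item more concrete, here is a minimal sketch of the four indexing styles side by side. It assumes reasonably recent NumPy and Pandas installs; the array contents and variable names are made up for illustration and are not taken from the evaluation code.

```python
import numpy as np
import pandas as pd

values = np.array([10, 20, 30, 40, 50])

# 1. Basic slicing: a half-open range, so only positions 1 and 2 are returned.
basic = values[1:3]

# 2. Integer-array ("fancy") indexing: an explicit list of positions.
by_position = values[[0, 2, 4]]

# 3. Boolean-mask indexing: keeps the elements where the mask is True.
by_mask = values[values > 25]

# 4. Pandas label-based indexing: the slice includes BOTH end labels.
series = pd.Series(values, index=[1, 2, 3, 4, 5])
by_label = series.loc[1:3]

print(basic.tolist())        # positions 1..2
print(by_position.tolist())  # positions 0, 2, 4
print(by_mask.tolist())      # values greater than 25
print(by_label.tolist())     # labels 1..3 inclusive
```

Note that the slice 1:3 selects two elements in the first case and three in the last, which is the kind of mismatch that strict coding standards would need to guard against.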
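The Views vs. Copies item can be checked directly. The sketch below is illustrative only (the arrays and names are hypothetical) and uses np.shares_memory to ask whether two arrays share the same underlying data.

```python
import numpy as np

original = np.arange(6)            # [0, 1, 2, 3, 4, 5]

# A basic slice is a view: writing to it also changes the original array.
view = original[1:4]
view[0] = 99                       # original[1] is now 99 as well

# An explicit .copy() is independent of the original.
independent = original[1:4].copy()
independent[1] = -1                # original[2] is unchanged

# Fancy indexing returns a copy by default.
fancy = original[[1, 2, 3]]
fancy[2] = -5                      # original[3] is unchanged

# Transposing returns a view; reshaping may return a view or a copy.
matrix = np.arange(6).reshape(2, 3)
transposed = matrix.T
flat_from_matrix = matrix.reshape(6)          # contiguous data: a view
flat_from_transposed = transposed.reshape(6)  # non-contiguous data: a copy

print(original.tolist())
print(np.shares_memory(matrix, transposed))
print(np.shares_memory(matrix, flat_from_matrix))
print(np.shares_memory(matrix, flat_from_transposed))
```

A defensive convention, if one is wanted, is to call .copy() whenever a slice is going to be modified and to use np.shares_memory in unit tests where aliasing would be harmful.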
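The Broadcasting item can be illustrated with the 3×3 plus 1×3 case it describes. The strict_add helper below is a hypothetical guard, not part of NumPy; it is one possible way to refuse non-conformable operands when broadcasting is not wanted.

```python
import numpy as np

matrix = np.arange(9).reshape(3, 3)   # shape (3, 3)
row = np.array([[10, 20, 30]])        # shape (1, 3)

# NumPy silently broadcasts the single row across all three rows.
broadcast_sum = matrix + row

# The same result, with the repetition written out explicitly.
explicit_sum = matrix + np.repeat(row, 3, axis=0)


def strict_add(a, b):
    """Hypothetical helper: add two arrays only if their shapes match exactly."""
    if a.shape != b.shape:
        raise ValueError(f"shapes {a.shape} and {b.shape} are not conformable")
    return a + b


print(broadcast_sum.tolist())
print(np.array_equal(broadcast_sum, explicit_sum))

try:
    strict_add(matrix, row)
except ValueError as error:
    print("rejected:", error)
```

Wrapping arithmetic in a shape check like this trades convenience for the stricter behaviour the evaluation was looking for.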
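The sorting item can be reproduced with a small, made-up data frame. This sketch assumes a recent Pandas release, where .ix has been removed, so it uses sort_values for the sort, .loc for label-based access and .iloc for position-based access; the column name "value" is hypothetical.

```python
import pandas as pd

# Row labels 1..5, as in the article's 1-based examples.
df = pd.DataFrame({"value": [20, 40, 10, 30, 50]}, index=[1, 2, 3, 4, 5])

# Sorting by value carries the row labels along: the index is now [3, 1, 4, 2, 5].
sorted_df = df.sort_values("value")

# Label 1 still points at the row that was first BEFORE sorting.
by_label = sorted_df.loc[1, "value"]

# Position 0 is the first row of the sorted frame.
by_position = sorted_df.iloc[0]["value"]

# A label slice runs from label 1 through label 2, picking up label 4 on the way.
label_slice = sorted_df.loc[1:2]

# Re-creating the index after sorting makes labels and positions agree again.
renumbered = sorted_df.copy()
renumbered.index = range(1, len(renumbered) + 1)

print(sorted_df.index.tolist())
print(by_label, by_position)
print(label_slice["value"].tolist())
print(renumbered.loc[1:2, "value"].tolist())
```

Using .iloc (or head) whenever a position is meant, and .loc only when a label is meant, avoids the ambiguity without having to rebuild the index.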