{"id":1511,"date":"2019-02-18T04:09:27","date_gmt":"2019-02-18T04:09:27","guid":{"rendered":"http:\/\/kusuaks7\/?p=1116"},"modified":"2023-07-24T17:55:18","modified_gmt":"2023-07-24T17:55:18","slug":"optimal-tooling-for-machine-learning-and-ai","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/optimal-tooling-for-machine-learning-and-ai\/","title":{"rendered":"Optimal Tooling for Machine Learning and AI"},"content":{"rendered":"<p><strong><em>Ready to learn Data Science? <a href=\"https:\/\/www.experfy.com\/training\/courses\">Browse courses<\/a>\u00a0like\u00a0<a href=\"https:\/\/www.experfy.com\/training\/tracks\/data-science-training-certification\">Data Science Training and Certification<\/a> developed by industry thought leaders and Experfy in Harvard Innovation Lab.<\/em><\/strong><\/p>\n<p><em>Note: this post is based on talks I recently gave at Facebook Developer Circles and Data Natives Berlin. You can get the slides\u00a0<\/em><a href=\"https:\/\/www.slideshare.net\/boyan-angelov\/slide-atlas\" target=\"_blank\" rel=\"noopener noreferrer\"><em>here<\/em><\/a><em>.<\/em><\/p>\n<p>I\u2019ll be the first to admit that tooling is probably the least exciting topic in data science at the moment. People seem to be more interested in speaking about the latest chatbot technology or deep learning framework.<\/p>\n<p>This just does not make sense. Why would you not dedicate enough time to pick your tools carefully? And there\u2019s the added problem which is typical for my profession \u2014 when all you have a hammer, everything becomes a nail (this is why you can actually\u00a0<a href=\"http:\/\/rmarkdown.rstudio.com\/rmarkdown_websites.html\" target=\"_blank\" rel=\"noopener noreferrer\">build websites with R<\/a>\u00a0;-)). 
Let\u2019s talk about this.<\/p>\n<p>Let\u2019s start with the essentials.<\/p>\n<h2 style=\"margin-left: -1.6pt;\">Which language should I\u00a0use?<\/h2>\n<p>&nbsp;<\/p>\n<p>Ok, this is a controversial one. Opinions on this range widely, from one extreme to the other. Mine is probably the least common one \u2014 more is better.\u00a0<strong>You should use both R and Python.<\/strong><\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/0*g9PV8j4tt6RkPoLk.jpg\" alt=\"experfy-blog\" \/><\/p>\n<p style=\"text-align: center;\">More is\u00a0better<\/p>\n<p>So, why? R is arguably much better at data visualization and has a ton of stats packages. Python, on the other hand, will help you put your model into production and will be more appreciated by the other developers on the team (imagine handing them an R model to deploy).<\/p>\n<p>Here I want to give a shout-out to\u00a0<a href=\"http:\/\/julialang.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Julia<\/a>. It is a newcomer to the field, but it has huge potential. Keep an eye out for this one.<\/p>\n<h2 style=\"margin-left: -1.6pt;\">Essential software\u00a0packages<\/h2>\n<p>&nbsp;<\/p>\n<p>We don\u2019t want to constantly reinvent the wheel while working, and we should take advantage of the awesome open-source communities around these languages. 
First, a quick refresher on the main tasks in a typical data science workflow.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" style=\"width: 600px; height: 236px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/0*0YBFvXKv-6tcTAkB.\" alt=\"experfy-blog\" \/><\/p>\n<p style=\"text-align: center;\">A typical machine learning\u00a0workflow<\/p>\n<p>The most important steps are:\u00a0<strong>ingestion<\/strong>,\u00a0<strong>cleaning<\/strong>,\u00a0<strong>visualizing<\/strong>,\u00a0<strong>modeling<\/strong>\u00a0and\u00a0<strong>communicating<\/strong> \u2014 we need libraries for all of these.<\/p>\n<p>For\u00a0<strong>data cleaning<\/strong>\u00a0in R there is a wonderful package called\u00a0<a href=\"https:\/\/github.com\/tidyverse\/dplyr\" target=\"_blank\" rel=\"noopener noreferrer\">dplyr<\/a>. Admittedly, it has a weird syntax, but therein lies its power. Pay attention to the\u00a0<strong>%&gt;%<\/strong> operator \u2014 it works exactly like the pipe (<strong>|<\/strong>) in *nix: the output of the previous operation becomes the input for the next. In this way, in just a few lines of code, you can construct quite complex yet readable data cleaning or subsetting operations.<\/p>\n<p>The alternative for Python is\u00a0<a href=\"https:\/\/pandas.pydata.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Pandas<\/a>. This library borrows heavily from R, especially the concept of a dataframe (where rows are observations and columns are features). 
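<\/p>\n<p><em>To make the piping idea concrete, here is a minimal sketch in Pandas (the data and column names are invented for illustration): method chaining plays the role of dplyr\u2019s %&gt;%, with each step\u2019s output feeding the next.<\/em><\/p>\n<pre><code>import pandas as pd

# toy data: rows are observations, columns are features
df = pd.DataFrame({'species': ['setosa', 'setosa', 'virginica'],
                   'petal_length': [1.4, 1.3, 5.1]})

# chain filter, group and aggregate, much like a dplyr pipeline
result = (df[df['petal_length'].gt(1.3)]
          .groupby('species', as_index=False)
          .mean())
<\/code><\/pre>\n<p>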
It has a bit of a learning curve, but once you get used to it, you can do pretty much anything in data manipulation (you can even write directly to databases).<\/p>\n<p>For\u00a0<strong>data visualization<\/strong>\u00a0we have\u00a0<a href=\"http:\/\/ggplot2.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">ggplot2\u00a0<\/a>and\u00a0<a href=\"https:\/\/plot.ly\/\" target=\"_blank\" rel=\"noopener noreferrer\">plotly\u00a0<\/a>for R. ggplot2 is extremely powerful, but quite low-level. Again, it has a bit of a weird syntax, and you should read about the\u00a0<a href=\"http:\/\/vita.had.co.nz\/papers\/layered-grammar.html\" target=\"_blank\" rel=\"noopener noreferrer\">Grammar of Graphics<\/a>\u00a0to understand why. Plotly is a newer library that gives your ggplots superpowers by making them interactive with just one line of code. The base package for dataviz in Python is matplotlib. It has some pretty arcane quirks, such as a weird syntax and horrible default colors, which is why I strongly suggest that you use the newer\u00a0<a href=\"http:\/\/seaborn.pydata.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">seaborn\u00a0<\/a>package. One area where Python is lacking is the visualization of model performance. This gap is filled by the excellent\u00a0<a href=\"https:\/\/github.com\/DistrictDataLabs\/yellowbrick\" target=\"_blank\" rel=\"noopener noreferrer\">yellowbrick\u00a0<\/a>project. You can use it to create nice plots for evaluating your classifiers, looking at feature importance, or even plotting some text models.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" style=\"width: 621px; height: 357px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/0*7RVJglaXdVK_1Hrg.\" alt=\"experfy-blog\" \/><\/p>\n<p style=\"text-align: center;\">Using Seaborn for scatter pair plotting of the iris\u00a0dataset<\/p>\n<p><strong>Machine learning<\/strong>\u00a0in R suffers from a consistency problem. 
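<\/p>\n<p><em>As an aside, a pair plot like the one in the caption above takes only a couple of lines of seaborn. A minimal sketch, using a tiny invented dataframe in place of the full iris dataset:<\/em><\/p>\n<pre><code>import matplotlib
matplotlib.use('Agg')  # headless backend, so no display is needed

import pandas as pd
import seaborn as sns

# tiny made-up iris-like data; in practice you would load the real dataset
toy = pd.DataFrame({'sepal_length': [5.1, 4.9, 6.3, 5.8],
                    'petal_length': [1.4, 1.5, 6.0, 5.1],
                    'species': ['setosa', 'setosa', 'virginica', 'virginica']})

# one scatter plot per pair of numeric columns, colored by species
grid = sns.pairplot(toy, hue='species')
grid.savefig('pairs.png')
<\/code><\/pre>\n<p>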
Pretty much any model has a different API, and you have to either memorize everything by heart or keep quite a few documentation tabs open if you just want to test different algorithms on your data (which you should). This deficiency is solved by two main packages \u2014 <a href=\"https:\/\/github.com\/topepo\/caret\" target=\"_blank\" rel=\"noopener noreferrer\">caret\u00a0<\/a>and\u00a0mlr, the latter being the newer of the two. I would go for mlr, since it seems to be more structured and more actively maintained. You have everything you need there, from functions for splitting your data to training, prediction and performance evaluation. The corresponding library in Python is perhaps my favorite, and it is no wonder that some major tech companies support it \u2014 <a href=\"http:\/\/scikit-learn.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">scikit-learn<\/a>. It has an extremely consistent API, over 150 algorithms (including neural networks), wonderful documentation, active maintenance and tutorials.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" style=\"width: 600px; height: 413px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/0*hQcwl4jIYFQ_8Yzk.\" alt=\"experfy-blog\" \/><\/p>\n<p style=\"text-align: center;\">ROC\/AUC plot in Python, using yellowbrick<\/p>\n<h2 style=\"margin-left: -1.6pt;\">Integrated Development Environment<\/h2>\n<p>&nbsp;<\/p>\n<p>Choosing an IDE for R is a no-brainer.\u00a0<a href=\"http:\/\/rstudio.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">RStudio\u00a0<\/a>is an absolutely fantastic tool, and it simply has no competition. Ideally, we want something like it for Python. I have looked at a dozen options (Spyder, PyCharm, Rodeo, spacemacs, Visual Studio, Canopy, 
etc.), and there are just two contenders:\u00a0<a href=\"https:\/\/github.com\/jupyterlab\/jupyterlab\" target=\"_blank\" rel=\"noopener noreferrer\">Jupyter Lab<\/a>\u00a0and\u00a0<a href=\"https:\/\/atom.io\/\" target=\"_blank\" rel=\"noopener noreferrer\">Atom\u00a0<\/a>+\u00a0<a href=\"https:\/\/atom.io\/packages\/hydrogen\" target=\"_blank\" rel=\"noopener noreferrer\">Hydrogen<\/a>.<\/p>\n<p>Jupyter Lab is still under (active) construction, and it looks pretty awesome. That said, it still inherits some of the drawbacks of Jupyter notebooks, such as hidden cell state, security issues and, worst of all, poor VCS integration. For this reason, my recommendation is Atom + Hydrogen. You can do all kinds of data science things with this setup, such as inspecting your dataframes and variables and plotting things, all inline in\u00a0<strong>.py<\/strong>\u00a0scripts.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" style=\"width: 657px; height: 371px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*sdOJL0OJ_1Q1RO3WSsOWUQ.png\" alt=\"experfy-blog\" \/><\/p>\n<p style=\"text-align: center;\">Atom +\u00a0Hydrogen<\/p>\n<h2 style=\"margin-left: -1.6pt;\"><strong>EDA Tools<\/strong><\/h2>\n<p>&nbsp;<\/p>\n<p>Why do we need them? Often (especially at the beginning of the data science process) we have to explore data quickly. Before we commit to a visualization, we need to explore, and to do that with minimal technological investment. This is why writing a ton of seaborn or ggplot code is sub-optimal, and you should use a GUI tool instead. As a bonus, such tools can also be used by business people, since no code is involved. There are two very cool, cross-platform, free tools available:\u00a0<a href=\"https:\/\/folk.uio.no\/ohammer\/past\/\" target=\"_blank\" rel=\"noopener noreferrer\">Past<\/a>\u00a0and\u00a0<a href=\"http:\/\/orange.biolab.si\/\" target=\"_blank\" rel=\"noopener noreferrer\">Orange<\/a>. 
The former is more focused on statistical analysis, while the latter focuses on modeling. Both can do awesome data visualization, so they serve our purpose perfectly.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" style=\"width: 616px; height: 430px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/0*ykvej9QI8Tm8Blrd.\" alt=\"experfy-blog\" \/><\/p>\n<p style=\"text-align: center;\">Stuff you can do with\u00a0Orange<\/p>\n<h2 style=\"margin-left: -1.6pt;\">Conclusion<\/h2>\n<p>&nbsp;<\/p>\n<p>As a parting note, I wish you to remain productive and to optimize your tools as much as you can (without using this as an excuse not to work\u00a0;)).<\/p>\n<p>Originally posted at <a href=\"https:\/\/towardsdatascience.com\/optimal-tooling-for-machine-learning-and-ai-e43495db59da\" rel=\"noopener\">Towards Data Science<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Ready to learn Data Science? Browse courses\u00a0like\u00a0Data Science Training and Certification developed by industry thought leaders and Experfy in Harvard Innovation Lab. Note: this post is based on talks I recently gave at Facebook Developer Circles and Data Natives Berlin. You can get the slides\u00a0here. 
I\u2019ll be the first to admit that tooling is probably<\/p>\n","protected":false},"author":100,"featured_media":3838,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[94],"ppma_author":[2637],"class_list":["post-1511","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":2637,"user_id":100,"is_guest":0,"slug":"boyan-angelov","display_name":"Boyan Angelov","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Angelov","first_name":"Boyan","job_title":"","description":"Boyan Angelov leads the machine learning efforts at a Berlin startup, building an AI to help tech companies get better candidates. He started using machine learning during his work on microbial metagenomes at the Max Planck Institute for Marine Microbiology. The discoveries he made there involved applications of dimensionality reduction methods. Later he worked in the clinical trials space, focusing on information retrieval and natural language processing. 
&nbsp;"}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1511","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/100"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1511"}],"version-history":[{"count":2,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1511\/revisions"}],"predecessor-version":[{"id":28267,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1511\/revisions\/28267"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/3838"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1511"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1511"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1511"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1511"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}