{"id":1044,"date":"2018-12-27T02:37:18","date_gmt":"2018-12-26T23:37:18","guid":{"rendered":"http:\/\/kusuaks7\/?p=649"},"modified":"2021-05-11T14:02:44","modified_gmt":"2021-05-11T14:02:44","slug":"revisiting-the-data-science-suitcase","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/revisiting-the-data-science-suitcase\/","title":{"rendered":"Revisiting the Data Science Suitcase"},"content":{"rendered":"<p><strong><em>Ready to learn Data Analytics? Browse <a href=\"https:\/\/www.experfy.com\/training\/tracks\/data-analyst-training-certification\">Data Analyst Training and Certification courses<\/a> developed by industry thought leaders and Experfy in Harvard Innovation Lab.<\/em><\/strong><\/p>\n<p>Two years ago, I wrote a blog entitled&nbsp;<a href=\"https:\/\/www.information-management.com\/opinion\/containing-big-data-what-size-is-your-suitcase\" rel=\"noopener noreferrer\" target=\"_blank\">&ldquo;What Size is Your Suitcase?&rdquo;<\/a>&nbsp;in which I recounted a holiday shopping &ldquo;dilemma&rdquo; my wife and I experienced as we purchased mutual gift suitcases. The muddle revolved around which size bags to buy &ndash; large ones that could handle all our travel needs, or more agile small-to-mid-size pieces that would be convenient for 95+% of our planned trips, if inadequate for the extreme. I chose the latter, settling on a 21-inch spinner that fits in the overhead bins of commercial aircraft. My wife initially went large, opting for a bulky 25-incher. Once she had it in hand, though, she deemed it clunky and had me exchange it for the easier-to-maneuver 23-inch model.<\/p>\n<p>I used the suitcase dilemma as a metaphor for the types of decisions I saw being made in the analytics technology world by customers of Inquidia, the consultancy I worked for at the time. 
Companies we contracted with were invariably confronted with decisions on the type\/size\/complexity of solutions to implement, and many initially demanded the 100% answer to their forecast needs into the mid-to-long-range future. One customer pondered a Hadoop &ldquo;Big Data&rdquo; ecosystem surrounding its purported 10 TB of analytic data. In reality, the &ldquo;real&rdquo; data size was more like 1 TB and growing slowly &ndash; easily managed by an open source analytic database. The customer seemed disappointed when we told them they didn&rsquo;t need a Big Data solution. Another customer fretted about the data size limitation of the R statistical platform, confiding they might go instead with an expensive proprietary competitor. It turned out that their largest statistical data set the year we engaged was a modest 2 GB. I showed them R working comfortably with a 20 GB data table on a 64 GB RAM Wintel notebook and they were sold. My summary take: be a suitcase-skeptic &ndash; don&rsquo;t be too quick to purchase the largest, handle-all-cases bags. Consider in addition the frugality and simplicity of a 95+% solution, simultaneously planning for, but not implementing, the 100% case.<\/p>\n<p>I found a birds-of-a-feather fellow suitcase-skeptic when reviewing the splendid presentation&nbsp;<a href=\"https:\/\/twitter.com\/i\/moments\/1063121156612546567\" rel=\"noopener noreferrer\" target=\"_blank\">&ldquo;Best Practices for Using Machine Learning in Business in 2018&rdquo;<\/a>&nbsp;by data scientist Szil&aacute;rd Pafka. 
Pafka teased his readers with a prez subtitle &ldquo;Deeper than Deep Learning&rdquo;, complaining that today&rsquo;s AI is pretty much yesterday&rsquo;s ML and that deep learning is overkill for many business prediction applications. &ldquo;No doubt, deep learning has had great success in computer vision, some success in sequence modeling (time series\/text), and (combined with reinforcement learning) fantastic results in virtual environments such as playing games&hellip;However, in most problems with tabular\/structured data (mix of numeric and categorical variables) as most often encountered in business problems, deep learning usually cannot match the predictive accuracy of tree-based ensembles such as random forests or boosting\/GBMs.&rdquo; And of course deep learning models are generally a good deal more cumbersome to work with than gradient boosting ensembles. So Pafka is a suitcase-skeptic about deep learning for traditional business ML uses, preferring a 95% ensemble solution for most challenges. He also promotes open source and multi-language (R and Python) packages such as H2O and xgboost as ML cornerstones. &ldquo;The best open source tools are on par or better in features and performance compared to the commercial tools, so unlike 10+ years ago when a majority of people used various expensive tools, nowadays open source rules.&rdquo;<\/p>\n<p>Count Pafka as a skeptic of distributed analytics as well. It&rsquo;s not that analytics clusters have no value; it&rsquo;s just that they&rsquo;re oftentimes needlessly deployed. I couldn&rsquo;t make the argument better than Szilard himself: &lsquo;And the good news is that you most likely don&rsquo;t need distributed &ldquo;Big Data&rdquo; ML tools. Even if you have Terabytes of raw data (e.g. user clicks) after you prepare\/refine your data for ML (e.g. 
user behavior features) your model matrix is much smaller and will fit in RAM.&rsquo; He cites Netflix&rsquo;s neural net library Vectorflow, &ldquo;an efficient solution in a single machine setting, lowering iteration time of modeling without sacrificing the scalability for small to medium size problems (100 M rows).&rdquo; To be sure, there are many instances for which distributed\/cluster computing for analytics is the best and perhaps only choice. The suitcase-skeptic, though, opts for the 95% case, planning for distributed solutions but saving them for when they&rsquo;re necessary.<\/p>\n<p>I&rsquo;ve had many discussions with companies deploying analytics about batch versus real-time data loading and model scoring. Out of the gate, most will claim that real-time updates are&nbsp;<em>sine qua non<\/em> &ndash; that is, until they understand the complexity and cost of that approach. They then often compromise, settling for small batch-update windows of hours or even minutes. Pafka is a suitcase-skeptic here too. &ldquo;Batch scoring is usually simpler to do. I think batch scoring is perfectly fine if you don&rsquo;t need real-time scoring\/ you don&rsquo;t do real-time decisions. Batch can be daily, hourly, every 5 minutes if you want. You can use the same ML lib as for training from R or python.&rdquo; Again, adopt the simpler 95% strategy and save the biggest solutions for when they&rsquo;re really needed.<\/p>\n<p>I&rsquo;d recommend readers consume Pafka&rsquo;s best practices enthusiastically. I also think the suitcase-skeptic approach is the right one for companies getting started with analytics and learning as they go. Plan for the 100% solution down the road while implementing the 95% case that can deliver results immediately. 
I suspect Data Science luminary Szil&aacute;rd Pafka would agree.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I use the suitcase dilemma as a metaphor for the types of decisions being made in the analytics technology world by customers. Companies invariably confront the decision on the type\/size\/complexity of solutions to implement, and many initially demand the 100% answer to their forecast needs into the mid-to-long-range future. Be a suitcase-skeptic &ndash; don&rsquo;t be too quick to purchase the largest, handle-all-cases bags. Consider in addition the frugality and simplicity of a 95+% solution, simultaneously planning for, but not implementing, the 100% case.<\/p>\n","protected":false},"author":430,"featured_media":3770,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[94],"ppma_author":[2310],"class_list":["post-1044","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":2310,"user_id":430,"is_guest":0,"slug":"steve-miller","display_name":"Steve Miller","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Miller","first_name":"Steve","job_title":"","description":"Steve Miller is Co-founder and President of&nbsp;<a href=\"http:\/\/www.inquidia.com\/\" target=\"_blank\" rel=\"noopener\">Inquidia Consulting<\/a>. 
He has over 35 years&rsquo; experience in business intelligence and statistics, the last 25 revolving around the delivery of analytics technology services.&nbsp;"}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1044","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/430"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1044"}],"version-history":[{"count":1,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1044\/revisions"}],"predecessor-version":[{"id":6236,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1044\/revisions\/6236"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/3770"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1044"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1044"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1044"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1044"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}