Revisiting the Data Science Suitcase

Steve Miller
December 27, 2018 Big Data, Cloud & DevOps


Two years ago, I wrote a blog post entitled “What Size is Your Suitcase?” in which I recounted a holiday shopping “dilemma” my wife and I experienced as we purchased mutual gift suitcases. The muddle revolved around which size bags to buy – large ones that could handle all our travel needs, or more agile small to mid-size pieces that would be convenient for 95+% of our planned trips, if inadequate for the extreme. I chose the latter, settling on a 21-inch spinner that fits in the overhead bins of commercial aircraft. My wife initially went large, opting for a bulky 25-incher. Once she had it in hand, though, she deemed it clunky and had me exchange it for the easier-to-maneuver 23-inch model.

I used the suitcase dilemma as a metaphor for the types of decisions I saw being made in the analytics technology world by customers of Inquidia, the consultancy I worked for at the time. Companies we contracted with were invariably confronted with a decision on the type, size, and complexity of solutions to implement, and they often initially demanded the 100% answer to their forecast needs into the mid- to long-range future. One customer pondered a Hadoop “Big Data” ecosystem surrounding its purported 10 TB of analytic data. In reality, the “real” data size was more like 1 TB and growing slowly – easily managed by an open source analytic database. The customer seemed disappointed when we told them they didn’t need a Big Data solution. Another customer fretted about the data size limitation of the R statistical platform, confiding they might go instead with an expensive proprietary competitor. It turned out that their largest statistical data set the year we engaged was a modest 2 GB. I showed them R working comfortably with a 20 GB data table on a 64 GB RAM Wintel notebook and they were sold. My summary take: be a suitcase-skeptic – don’t be too quick to purchase the largest, handle-all-cases bags. Consider as well the frugality and simplicity of a 95+% solution, planning for, but not yet implementing, the 100% case.

I found a birds-of-a-feather suitcase-skeptic when reviewing the splendid presentation “Best Practices for Using Machine Learning in Business in 2018” by data scientist Szilárd Pafka. Pafka teased his readers with the subtitle “Deeper than Deep Learning”, complaining that today’s AI is pretty much yesterday’s ML and that deep learning is overkill for many business prediction applications. “No doubt, deep learning has had great success in computer vision, some success in sequence modeling (time series/text), and (combined with reinforcement learning) fantastic results in virtual environments such as playing games… However, in most problems with tabular/structured data (mix of numeric and categorical variables) as most often encountered in business problems, deep learning usually cannot match the predictive accuracy of tree-based ensembles such as random forests or boosting/GBMs.” And of course deep learning models are generally a good deal more cumbersome to work with than gradient boosting ensembles. So Pafka is a suitcase-skeptic about deep learning for traditional business ML uses, preferring a 95% ensemble solution to most challenges. He also promotes open source, multi-language (R and Python) packages such as H2O and xgboost as ML cornerstones. “The best open source tools are on par or better in features and performance compared to the commercial tools, so unlike 10+ years ago when a majority of people used various expensive tool, nowadays open source rules.”

Count Pafka as a skeptic of distributed analytics as well. It’s not that analytics clusters have no value; it’s just that they’re oftentimes needlessly deployed. I couldn’t make the argument better than Szilárd: ‘And the good news is that you most likely don’t need distributed “Big Data” ML tools. Even if you have Terabytes of raw data (e.g. user clicks) after you prepare/refine your data for ML (e.g. user behavior features) your model matrix is much smaller and will fit in RAM.’ He cites Netflix’s neural net library Vectorflow, “an efficient solution in a single machine setting, lowering iteration time of modeling without sacrificing the scalability for small to medium size problems (100 M rows).” To be sure, there are many instances for which distributed/cluster computing for analytics is the best and perhaps only choice. The suitcase-skeptic, though, opts for the 95% case, planning for distributed solutions but saving them for when they’re necessary.
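The “will it fit in RAM” question is easy to sanity-check with back-of-the-envelope arithmetic before committing to a cluster. The Python sketch below is only an illustration: the function names, the 50-column width, the 4-byte (float32) values, and the rule of leaving half of RAM free for working copies are all assumptions of mine, not figures from Pafka’s presentation.

```python
BYTES_PER_GB = 10**9  # decimal gigabytes, to keep the arithmetic simple


def model_matrix_gb(rows: int, cols: int, bytes_per_value: int = 8) -> float:
    """Approximate size in GB of a dense numeric model matrix."""
    return rows * cols * bytes_per_value / BYTES_PER_GB


def fits_in_ram(rows: int, cols: int, ram_gb: float,
                bytes_per_value: int = 8,
                usable_fraction: float = 0.5) -> bool:
    """Return True if the matrix fits within a fraction of available RAM.

    usable_fraction is an assumed safety margin: training libraries
    typically need room for copies and intermediate structures.
    """
    return model_matrix_gb(rows, cols, bytes_per_value) <= ram_gb * usable_fraction
```

Under these assumptions, 100 M rows of 50 float32 features come to 100,000,000 × 50 × 4 bytes = 20 GB, which `fits_in_ram(100_000_000, 50, 64, bytes_per_value=4)` reports as fitting on a 64 GB machine. Real footprints vary with data types, sparsity, and the library in use, so treat the result as a first-pass estimate only.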

I’ve had many discussions with companies deploying analytics about batch versus real-time data loading and model scoring. Out of the gate, most will claim that real-time updates are a sine qua non – that is, until they understand the complexity and cost of that approach. They then often compromise on batch updates at small windows of hours or even minutes. Pafka is a suitcase-skeptic here too. “Batch scoring is usually simpler to do. I think batch scoring is perfectly fine if you don’t need real-time scoring/ you don’t do real-time decisions. Batch can be daily, hourly, every 5 minutes if you want. You can use the same ML lib as for training from R or python.” Again, adopt the simpler 95% strategy and save the biggest solutions for when they’re really needed.
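To make the batch option concrete, here is a minimal Python sketch of a scoring job that a scheduler could run daily, hourly, or every five minutes. Everything in it is hypothetical: `batches`, `score_batch_job`, and `toy_predict` are names I made up, and `toy_predict` merely stands in for the predict call of whatever library trained the model.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence

Row = Sequence[float]


def batches(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield successive lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


def score_batch_job(rows: Iterable[Row],
                    predict: Callable[[List[Row]], List[float]],
                    batch_size: int = 1000) -> List[float]:
    """Score all rows chunk by chunk, so memory use is bounded by batch_size."""
    scores: List[float] = []
    for chunk in batches(rows, batch_size):
        scores.extend(predict(chunk))
    return scores


def toy_predict(chunk: List[Row]) -> List[float]:
    """Hypothetical stand-in for a trained model: the mean of each row."""
    return [sum(row) / len(row) for row in chunk]
```

A call such as `score_batch_job(new_rows, toy_predict, batch_size=500)` would be wrapped in a script and triggered by cron or a similar scheduler. The point of the sketch is that nothing beyond the training library and a timer is required, which is where the simplicity Pafka describes comes from.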

I’d recommend readers consume Pafka’s best practices enthusiastically. I also think the suitcase-skeptic approach is the right one for companies getting started with analytics and learning as they go. Plan for the 100% solution down the road while implementing the 95% case that can deliver results immediately. I suspect Data Science luminary Szilárd Pafka would agree.

    Tags
    Data Science