At breakfast last week, my wife and I noticed water dripping out from under our coffee maker.
Saeco Barista, circa 2004. Source: http://www.imghoot.com/barista-coffee-machine.html
We’ve had our coffee maker for over 10 years, and it has faithfully poured us coffee zealots our morning dose nearly every day we’ve been home. This coffee maker is old but great: it automatically grinds, tamps, pours and disposes of coffee, requiring only a single button push to get a cup. I’m also partially convinced that this coffee maker, which also makes lattes and mochas, is a primary driver for Grandma and Grandpa’s visits.
The convenience of this machine is great, but it still requires human intervention: near-daily refills of water and beans, emptying of the grind pucks, adjustments, and maintenance. In return, you can fine-tune the amount and grind of the coffee and the size and strength of the pour. As a result, it pours a great cup.
During our research for a suitable replacement, we tried one of the little plastic cup systems that have replaced drip and espresso machines since we were last shopping for coffee makers. This new coffee maker promised to be even more efficient since you never actually have to *see* the coffee – it remains clean and safe inside its capsules. The downside? We have yet to pour a decent cup from it. We miss the customizations afforded by the near- but not fully-automatic maker it replaced.
At the Data Guild our love of coffee is bested only by our love of data products. The challenges that confront coffee drinkers in optimizing the continuum between fully-automatic and fully-crafted are similar to what we see every day in the data marketplace: finding the balance between bespoke solutions and one-size-fits-all products.
At the Data Guild we're shifting more energy to data product development, which will address key market pains we’ve observed while building custom data products in various industries.
A Typical Dialog
In the last two years of strategic data consulting engagements with several Fortune 100 companies, we’ve had several similar initial conversations. While there are variations, typically they go something like this:
Client: “We’re looking to turn our data into actionable insights and competitive advantages.”
Data Guild: “Great, we’ve got some experience in that and would love to help.”
Client: “OK I need to hire a team of 12 data scientists from you to work on-site for the next 6 months and help me build a hadoop [sic]”
Data Guild: “OK, but first we’d love to explore your needs and determine what it is you actually need before you start spending money.”
Client: “Can we do this in parallel? I need to get some people and products deployed in the next three months.”
The Data Science Team at work. Source: http://www.vintag.es/2014/09/antique-office-photographs-ca-1920s.html
So What is a “Data Product”?
Data products can take two forms: algorithms and data sources.
On the algorithmic side, there are opportunities to integrate theory from existing literature to develop systems which can enable new, smarter and faster capabilities that streamline existing business processes. Nearly every industry has gone through a major shift relative to the integration of algorithmic systems into workflow. Product/media suggestions, healthcare predictive capabilities, and energy optimization are just a few that we’ve been fortunate enough to help develop.
On the data source side, we see the need for secure, reliable data sources that may be used by many to develop a disparate set of solutions. Real estate, government, electronic health records and national health registries are a few examples where data – open or otherwise – can have significantly greater impact when centralized and shared.
However, in either of these cases, we believe that data products must deliver value on implementation: it’s not enough to have the “promise” of value creation; rather, demonstrable ROI in its first deployment is the only path to sustainability.
A Data Product For Everyone?
It’s no surprise given the current hype over big data, IIoT (Industrial Internet of Things), data science, predictive analytics, machine learning, deep learning, etc., that traditional industries are feeling the pressure to make headway or risk lagging their competition in putting their data to work.
Source: New Yorker Magazine
However, in the new age of data-driven software, there is a new rule: one-size-fits-all fits no one. This is true generally of software products, where even the most widespread consumer apps and services are experienced completely differently by each individual:
Your Facebook experience is defined by who you "friend" and what you "like."
Your sphere of musical influence is defined organically, growing outwards from the seed artists you set in your music services.
Your news is based on your outlet selection and article preferences.
All of these factors lead us into a world that’s more fragmented, personalized and customized than ever before.
Data products are similar: what is deployed in each instance must be modified, or must learn, to fit its context. There are certainly exceptions to this rule – data platforms such as Spark, Hadoop, Hive, NoSQL, etc., certainly address common needs across many industries and applications. Similarly, basic analytics and visualization are needed everywhere and are also suited for economies of scope.
However, there is a difference between building blocks of a solution and the solution itself. When integrating data products, our deliverables must always achieve ROI, so it's not enough to build a system – we must be able to quantify the value generated from data systems. Though basic tools can generate some answers, good data questions invariably create more and more interesting/actionable questions. It is at this point that most off-the-shelf data products break down.
To achieve real business value, data products must be built top-down: from the solution back to data assets. Only by this process can the requisite components be defined and properly integrated.
To Pay or Not To Pay?
Another common misconception in the market is the need to invest significant capital in software licenses for access to the best algorithms. Or the opposite: that you can just use open source libraries as a stand-alone product to achieve a solution.
As in all areas of software, open source has taken a central role in data software products. Companies that claim to have invented a proprietary approach to machine learning, prediction, or recommendation should be treated with healthy skepticism in a world where public literature and open source have such a central role.
The latest and best algorithms in our space have historically come, and continue to come, from academic pursuits in algorithmic performance, increased accuracy, and decreased variance. These are based on publicly available published academic works. The proliferation of R packages and Python libraries based on the very best and latest algorithms makes the claim of proprietary machine learning tenuous.
Aside from platform and general purpose analytics/visualization, solutions which span the last mile of the business problem through successful integration of these pieces are those that will win the market.
There will always be services (i.e. training and deployment) around open source which can and should garner large project fees (see Cloudera/Hortonworks). Together these comprise the total cost of ownership (TCO) of software. However, companies and products should no longer be judged on core algorithms as they were in the days of Google. Rather, they should be examined based on the depth of understanding of the existing market problems they address.
The Coming Backlash in 2016
While the data-driven world is still in its infancy, we see 2016 as a pivot point where many prior investments in data systems (software, services, and teams) will go bad – businesses will react first with a backlash.
Managers will retreat to instinct-driven decision-making and distrust of data products and teams that promised so much in 2012-2015, but then failed to deliver. Gartner calls this the “trough of disillusionment.”
However (continuing Gartner-speak), this period is also followed by the “slope of enlightenment” and “plateau of productivity.” What then drives the industry to these next two desirable states?
2016 Data Products: The Rise of the Computer-Human Learning Systems
Among our clients, those that have made this transition have done so through a sober understanding that being data-driven does not simplify your life, but rather fortuitously complicates it. With good questions come more questions, and therefore more resource requirements: people, data storage, compute, telemetry, reporting, etc.
Running the data-driven enterprise is a dirty business: it means integrating data from sparse and far-flung sources, often ingesting dirty data exhaust for the purpose of signal extraction. Oftentimes data is sourced from uncontrolled historical sources which were designed for fault detection and diagnosis, not strategic decision support.
To turn the corner, data products must cater to the humans they serve. They must adapt to the knowledge of expert-operators and be flexible enough to snap into existing workflows and business systems. In short, data products must quickly be seen as solutions, not tools, if they are to survive integration into complex organizational decision processes.
The key difference is machine learning vs. human-machine learning. At the Data Guild we’ve built and deployed machine learning systems into Manufacturing, Energy, Finance, Retail, Tech and other industries, and had a lot of success in achieving measurable ROI. The key to all of these projects has been the humans (our clients) involved in this process.
In the design process, client subject matter experts (SMEs) define and characterize the data. They highlight industry best practices and “things to look for” in the data, even if this has never been done formally or quantitatively before.
In pilots, business experts define key performance indicators (KPIs) and report outcomes.
In deployment, systems experts define integration points, APIs, and testing.
In scaling, channel partners drive “last mile” engineering, making the difference between “works once” and “works at scale.”
Without humans in the mix, the “data product” would be little more than “hypeware” – machine learning systems that work well in the lab, but not in the real world.
As with the coffee maker on my counter, making something truly great that fits my taste requires both a bit of technology and a bit of human oversight and training. As we make the transition in 2016 from the “robots will take my job…then kill me” hype to a more productive and realistic norm, we hope to continue building, shipping, and deploying products that integrate the best of humans and machines and generate many cupfuls of value along the way.