Another tip for success with your Data Science team is to be more methodical.
By this, I mean establishing and using a consistent methodology, process or workflow. This enables repeatable results, simpler collaboration & knowledge transfer. A well-designed methodology should also ensure appropriate QA stages and reduce the cost of rework.
A few different influences have had me thinking about this topic recently.
Lack of a methodical approach – a common problem
Over recent years, within academic research, there has been a plea for better reproducibility of results. Time and time again exciting studies have failed to have their findings reproduced, with the inevitable call for more rigour.
My own work training Data Science teams and coaching their leaders has also revealed this problem. Too many teams, sometimes under the smokescreen of being “agile” or “innovative”, are basically making it up as they go along. Different team members use different work processes, hindering both consistent quality & collaboration.
Thirdly, I have recently started as a guest lecturer at my local university, helping out on their MSc Data Science programme. The module I am teaching includes a focus on Data Science methodologies. Researching this has reminded me how little progress has been made.
That comment relates to my memory of Data Mining prior to what is now called the “AI Winter”. Back in the 1990s, I was working in R&D building Neural Network, Genetic Algorithm & Fuzzy Logic models. I then went on to create Analytics and nascent Data Science teams. Even then it was clear that we needed to improve processes. There was a lack of common standards, and that hindered our progress.
A few of the more popular Data Science methodologies
So, rather than gripe (or worse still “point fingers”), let me try and help by highlighting what does exist in the way of Data Science methodologies. Back in the 1990s others helped me by pointing out emerging standards in Data Mining, so let me try and pass on the favour.
First, a caveat. I am not a Data Scientist, although I have worked around them and managed them for years. So, please feel free to share any pitfalls in what I am sharing or gaps in my knowledge. This post will not be comprehensive; I just hope it helps prompt Data Science leaders to think more about this need for their teams.
Let me start with a methodology that is as old as my experience. The CRoss-Industry Standard Process for Data Mining (CRISP-DM) was the standard that emerged as the most popular back in the 1990s. It helped bring more consistency and a more methodical approach to what was an iterative exploration.
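Purely as an illustration (this code is not part of CRISP-DM itself), the six phases and the loop back from Evaluation can be sketched in Python. The `evaluate` callback and the iteration cap are my own hypothetical stand-ins for a real "are the results good enough to deploy?" review:

```python
# Sketch of CRISP-DM's six phases as an iterative loop.
# The evaluate() callback and max_iterations cap are illustrative
# assumptions, standing in for a real evaluation gate.
CRISP_DM_PHASES = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modelling",
    "Evaluation",
    "Deployment",
]

def run_crisp_dm(evaluate, max_iterations=3):
    """Walk the phases, looping back after Evaluation until results pass."""
    log = []
    for iteration in range(1, max_iterations + 1):
        for phase in CRISP_DM_PHASES[:-1]:  # all phases up to Evaluation
            log.append((iteration, phase))
        if evaluate(iteration):  # gate: only Deploy once Evaluation passes
            log.append((iteration, "Deployment"))
            break
    return log
```

For example, `run_crisp_dm(lambda i: i >= 2)` records two full passes through the earlier phases before Deployment is ever reached – capturing the iterative, discovery-driven character of the method.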
Although I knew it was still used by some Data Science teams, I was surprised to find that polls still showed it to be the most popular method. It certainly is still relevant for Data Science, but the dominance shown in this poll from KDnuggets surprised me:
Fortunately, this methodology has been around long enough to have more supporting material available than just the most familiar diagram. Nicole Leaper’s excellent EXDE design blog shares this useful visual summary:
Despite my greater awareness of CRISP-DM at the time, the KDD process apparently preceded it. It was published in 1996 in AI Magazine, from the American Association for Artificial Intelligence (AAAI). Once again, this was grounded in the experience of Data Mining practice.
It had the advantage of more clearly calling out the need for not just preprocessing of data but also, normally, transformation before “data mining” to discover patterns. Visually, it was also simpler to understand, whilst retaining the feedback arrows that CRISP-DM used to indicate the cycles prompted by discovery.
As with CRISP-DM, if you replace the term Data Mining with Modelling or Algorithm Selection & Usage – this method still works for Data Science. I can see why it also retains its place as the 5th most popular Data Science methodology in use today.
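To make that sequence concrete, here is a minimal sketch of the five KDD steps – Selection, Preprocessing, Transformation, Data Mining, Interpretation/Evaluation – as composable Python functions. The record fields and the toy "pattern" being mined are invented for illustration only:

```python
# Illustrative sketch of the five KDD steps as composable stages.
# The record structure and the trivial "pattern" are assumptions.
def selection(records):
    # Select only records relevant to the question at hand.
    return [r for r in records if r.get("relevant", True)]

def preprocessing(records):
    # Clean: drop records with missing values.
    return [r for r in records if r.get("value") is not None]

def transformation(records):
    # Transform raw values into model-ready features (simple scaling).
    return [{**r, "feature": r["value"] / 100} for r in records]

def data_mining(records):
    # "Data mining" (modelling): flag the pattern of interest.
    return [r["feature"] > 0.5 for r in records]

def interpretation(flags):
    # Interpret/evaluate: summarise the discovered pattern.
    return sum(flags) / len(flags) if flags else 0.0

def kdd_pipeline(records):
    # Chain the steps in KDD order; in practice each feedback arrow
    # would mean revisiting an earlier stage.
    return interpretation(
        data_mining(transformation(preprocessing(selection(records))))
    )
```

Note that each stage here feeds the next linearly; the feedback arrows in the published diagram would correspond to re-running an earlier stage after what Interpretation reveals.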
The original paper is also still worth reading to understand the nuances of this method:
Lastly, it would be remiss of me not to mention the methodology pioneered by what used to be the most dominant statistical modelling software. Around the time that these Data Mining methods were being pioneered, large corporations were committing to SAS Software.
It is easy to criticise large (even privately owned) software companies, especially when they become overly dominant in their markets. Microsoft & IBM have been there in the past. Think Apple, Amazon & Google now. However, just as we tend to now acknowledge how IBM & Microsoft advanced IT usage over decades, perhaps we can also look more kindly on SAS. It has invested huge amounts in R&D over the years.
One of the fruits of their labours in this regard is a methodology and associated process called SEMMA. It was supported by SAS training and exhaustive levels of documentation. But it did bring to a generation of SAS coders who were also statistical modellers (surely nascent Data Scientists) a method for the software they used day to day.
If only for that reason, it is still worth today’s R & Python coders checking this out. A chance to think through how the same rigour could be applied on their data platforms and via a wider range of software tools & languages. SAS still publishes useful detail on this method:
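For orientation, SEMMA's five stages are Sample, Explore, Modify, Model and Assess. A toy sketch in Python follows – the z-score "model" and the outlier threshold are illustrative assumptions of mine, not part of SEMMA:

```python
# Illustrative sketch of SEMMA's five stages on numeric observations.
# The z-score model and 2-sigma threshold are invented for this example.
import random
import statistics

def semma(observations, sample_size=100, seed=0):
    rng = random.Random(seed)
    # Sample: draw a representative subset of the data.
    sample = rng.sample(observations, min(sample_size, len(observations)))
    # Explore: summary statistics to understand the sample.
    summary = {"mean": statistics.mean(sample),
               "stdev": statistics.pstdev(sample)}
    # Modify: transform to standardised scores (z-scores).
    if summary["stdev"]:
        z = [(x - summary["mean"]) / summary["stdev"] for x in sample]
    else:
        z = [0.0] * len(sample)
    # Model: a trivial "model" - flag outliers beyond 2 standard deviations.
    outliers = [x for x in z if abs(x) > 2]
    # Assess: report how well the model's assumption held.
    return {"summary": summary, "outlier_rate": len(outliers) / len(sample)}
```

The point is not the statistics but the discipline: each stage has a defined input, purpose and output, which is exactly what R & Python teams can replicate on their own platforms.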
Do you still use one of these 3 older methods to be methodical?
There is no shame in it if you do. Another way to phrase that would be “tried and tested”. If that KDnuggets poll from 2017 is still valid, it sounds like most Data Science teams do still use one of these methods. I’d certainly advocate that over the Wild West of lacking a methodical common process.
However, as Dylan’s often-quoted song goes, “the times they are a-changin’”. Even that KDnuggets poll identified that the 2nd most popular methodology used by Data Science teams was one developed ‘in house’. In the second post of this series, I will share some of those more recent options.
That post will include examples from a technology company, a management consultancy and others. Before then, I’d love to hear your view of this post. Do you also see a need for more methodical processes in Data Science teams? Does your team have the common methodology, process or workflow it needs?