On being a Senior Data Scientist
I present some ideas here; they aren’t necessarily in order.
1. Senior Data Scientists understand that Software/Machine Learning has a lifecycle and so spend a lot of time thinking about that.
Technical debt, maintainability, systems design and design docs are all part of that lifecycle. It prompts questions such as:
“What could I be missing?”
“How will this not work?”
“Will you please shoot as many holes as possible into my thinking on this?”
“Even if it’s technically sound, is it understandable enough for the rest of the organization to operate, troubleshoot, and extend it?”
2. Senior Data Scientists understand that ‘data’ always has flaws. These flaws can come from the data-generating process or from biases in the data.
I once did a technical interview with a Senior Data Scientist as a candidate, and I was a bit flummoxed by the question at the end: ‘what if the data is wrong?’ It’s a valid question and one we should think about.
Often the populations we end up observing aren’t randomly sampled, and we need to think about how to manage this. Anecdotally, I find that Junior Data Scientists often assume the sample is random.
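To make the sampling point concrete, here is a minimal sketch of one way to manage a non-random sample: reweighting each group by its known share of the population (post-stratification). The channel names, the values and the population shares are invented for illustration, and this is only one of several possible corrections.

```python
from collections import defaultdict


def naive_mean(sample):
    """Plain average of the observed values, ignoring how they were sampled."""
    return sum(value for _, value in sample) / len(sample)


def poststratified_mean(sample, population_shares):
    """Weight each group's sample mean by its (assumed known) population share.

    Raises ZeroDivisionError if a group in population_shares has no
    observations in the sample, which is itself a useful warning sign.
    """
    by_group = defaultdict(list)
    for group, value in sample:
        by_group[group].append(value)
    return sum(
        share * (sum(by_group[group]) / len(by_group[group]))
        for group, share in population_shares.items()
    )


# Hypothetical data: 'app' customers are over-represented in what we observed,
# although each channel is assumed to make up half of the real population.
sample = [("app", 10.0)] * 8 + [("branch", 20.0)] * 2
population_shares = {"app": 0.5, "branch": 0.5}

naive = naive_mean(sample)
adjusted = poststratified_mean(sample, population_shares)
```

The gap between `naive` and `adjusted` is a rough indication of how much the sampling skew moves the answer, assuming the population shares themselves can be trusted.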
3. Senior Data Scientists understand the ‘soft’ side of technical decision making.
Increasingly I see tool choices being made and wonder about the ‘feeling’ aspect of those. The claim might be, for example, that ‘static languages are best’ or that ‘we should use pytest not unittest’. Increasingly this comes down to ‘taste’, ‘feelings’ or ‘philosophy’, and those are perfectly reasonable things. For example, I love the pytest functional syntax, but I know other engineers like other tools, and that’s ok.
The other thing is that sometimes people have bad experiences with tools from particular vendors, or in particular ecosystems. If, for example, you worked at a company that wrote software in Zorg, you found it incredibly difficult to deploy, and the project was a complete failure, then you’d have an emotional response to Zorg if it’s brought up in a company meeting. Engineers and Data Scientists are often obsessed with the rational, but our feelings about architectures and software matter. If we ignore them, we’ll never get the buy-in we need.
I’ve not finished reading it, but a book that’s been recommended by a few senior Technologists whom I respect is Words That Work: It's Not What You Say, It's What People Hear.
A corollary to this is that we can produce Machine Learning models that don’t get used.
4. Senior Data Scientists focus on impact and value
If a deep learning model doesn’t get into production because of a lack of trust, you’ve failed. It’s not about satisfying your intellectual curiosity, or your need for ‘Resume Driven Development’. It’s important to think about buy-in and your time to value. As Erik Bernhardsson tweeted:
“Think most of my value of knowing machine learning these days is gained from telling people why ML won’t solve their problem”
This is terribly important: sometimes a simple rules engine will do, and sometimes just a SQL query. Using the right tool for the job matters. It is complicated, though: there is rarely one ‘best’ solution, and all solutions have trade-offs.
Often you can make things simple with data. A question I like asking Data Scientists is ‘when did you decide not to use ML?’ For example, a few years ago I saved thousands of dollars at $OLD_EMPLOYER by building a data pipeline for some analytics. Part of the analysis pipeline involved matching text for inventory management: inventory names would be similar, so it seemed natural to use fuzzy-matching or something similar. It turned out that this algorithm was too slow and impractical. It also turned out that, by monetary value, there were 100 inventory items that needed matching, so I simply encoded the most common misspellings and abbreviations in a dictionary. This was tremendously valuable, and a much more robust solution than using Machine Learning. Sometimes automation is what you need, sometimes counting, and sometimes Machine Learning.
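The dictionary approach described above might look something like the following sketch. The item names, the `ALIASES` table and the `canonical_name` helper are all invented for illustration and are not the original pipeline. The idea is only that a small hand-maintained lookup, applied after light normalisation, can stand in for fuzzy matching when the set of items that matter is small.

```python
from typing import Optional

# Hypothetical alias table: known misspellings/abbreviations -> canonical name.
# In practice this would be built by hand for the highest-value items.
ALIASES = {
    "wdgt blue": "widget blue",
    "widget blu": "widget blue",
    "sprckt lg": "sprocket large",
    "sprocket lrg": "sprocket large",
}

# Canonical names are accepted as they are.
CANONICAL = set(ALIASES.values())


def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so trivial variations compare equal."""
    return " ".join(name.lower().split())


def canonical_name(raw_name: str) -> Optional[str]:
    """Return the canonical item name, or None if the name is not recognised."""
    key = normalise(raw_name)
    if key in CANONICAL:
        return key
    return ALIASES.get(key)


matched = canonical_name("  WDGT   Blue ")
unmatched = canonical_name("mystery item")
```

Names that come back as `None` can be logged and reviewed by a person, which also tells you which entries are worth adding to the table next.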
5. Senior Data Scientists care about ethics
Recently in the Data Science and Tech communities, we’ve seen the need for discussion of ethics. There’s been some interesting literature on this from the academic communities that is worth reading, and I’ll not wade too far into these debates.
However, as a Senior Data Scientist working in the regulated world of Financial Services, I’ve grown to appreciate that it’s my job to have a working knowledge of GDPR. It’s something we regularly bring up when we discuss the viability of projects, and it’s a ‘risk factor’. It would be immature to just ignore this, and frankly unethical and unprofessional.
At the very least, Senior Data Scientists should read some of the codes of ethics in Data Science and have views on these. Ideally, you should have your own code of ethics, and maybe enforce it on yourself. Certainly, you should take it into account in your risk planning, in terms of what data you get access to, and in how you integrate security. This can, unfortunately, add to time frames, but doing things ‘right’, both in terms of customer trust and in terms of good compliance, often takes longer. As we’ve seen with the Theranos affair, ‘move fast and break things’ isn’t always the best motto.
Acknowledgments: Thanks to Eoin Hurrell and Bertil Hatt, who helped with fleshing out these ideas. I’m also grateful for conversations with friends such as Eddie Bell, Mick Cooney, Mick Crawford and Ian Ozsvald, and with some of my Zopa colleagues, including Dat Nguyen and Vlasios Vasileiou. I learn from most people I speak to, so sorry if I’ve forgotten anyone. Finally, thanks also to Audrey Somnard, who has constantly reminded me that ‘algorithms do what they want’ isn’t a sufficient ethical explanation and that I should think more about these issues.