What a Data Scientist is Not
Let’s begin by talking about what a Data Scientist isn’t. As is well known, there are a lot of people claiming to be Data Scientists, many of whom clearly are not. To start with, Data Scientists are not people who have completed a few Coursera Machine Learning courses and know only Hadoop. Statisticians are not Data Scientists, even if they have a Masters or a PhD. Lastly, Software Engineers are not Data Scientists, even if they are fantastic programmers.
So What is a Data Scientist?
A Data Scientist is a true Scientist – that is, someone who uses the Scientific Method. The OED defines the scientific method as "a method or procedure that has characterized natural science since the 17th century, consisting in systematic observation, measurement, and experiment, and the formulation, testing, and modification of hypotheses." The key point here is hypothesis testing, which is meant in the normal statistical sense: formulate a null hypothesis and test whether the data allow it to be rejected.
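To make "the normal statistical sense" concrete, here is a minimal sketch of a two-sample permutation test in pure Python. The data and function names are illustrative, and a permutation test is just one of many valid approaches to testing a hypothesis:

```python
import random

def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sample permutation test: p-value for the observed difference
    in means under the null hypothesis of no difference between groups."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            count += 1
    return count / n_permutations

# Illustrative data: some measured outcome for two experimental groups.
control = [0.9, 1.1, 1.0, 0.8, 1.2, 1.0, 0.9, 1.1]
treatment = [1.4, 1.6, 1.5, 1.3, 1.7, 1.5, 1.6, 1.4]
p = permutation_test(control, treatment)
print(f"p-value ~ {p:.4f}")  # a small p-value lets us reject the null
```

A statistician, a programmer, and a Data Scientist would all recognise this; the point is that the Data Scientist reaches for it by habit whenever a claim needs testing.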
A Data Scientist is someone with a very deep understanding of the relation Data -> Information -> Knowledge, as well as an intuitive grasp of the Man/Machine Boundary. An advanced degree in Data Science is not necessarily enough: you need to find a master and serve as an apprentice. This is because Data Science is intensely practical; a lot of knowledge, understanding, and (especially) experience is required. It also pays to think in a cross-disciplinary manner, as some of the best and most powerful methods have been borrowed from the most unlikely places.
How do you become a true Data Scientist? Unicorns are real.
So how would someone become a Data Scientist? The path is somewhat analogous to that of a medieval Master Mason: these people were a combination of Quantity Surveyor, Civil and/or Mechanical Engineer, Stone Worker (cutting, shaping, and carving), Architect, and Project Manager. Today these are all separate specializations, but in the Middle Ages they were done by one person. An apprenticeship took at least seven years, and a person was only promoted to Master Mason when their Master decided they had enough experience to be examined. Such examinations were skill-based, and once candidates passed they would be given their own unique mason’s mark and admitted to the Lodge.
The disciplines relevant to Data Science are Computer Science and Programming; Mathematics, Statistics, and Machine Learning; Systems Architecture; and Data Management – and of course, a deep subject-matter or domain understanding. Large-scale systems engineering methods (essentially Supercomputing) and various types of Grid and Parallel computing are also extremely relevant. It is important to understand numerous programming languages and technologies, as well as how to transform data as required – a skill known as Data Wrangling in the USA and Data Munging in the UK. The ability to choose the proper statistical method for the problem is paramount, as is the ability to interpret the results (and to explain them to a lay audience).
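As a small taste of what wrangling looks like in practice, here is a hedged sketch in pure Python: the records, field names, and date formats are all invented for illustration, and real wrangling would handle far more cases:

```python
from datetime import datetime

# Raw rows as they might arrive from an export: inconsistent case,
# stray whitespace, mixed date formats, and missing values.
raw = [
    {"name": "  Alice ", "joined": "2021-03-05", "spend": "120.50"},
    {"name": "BOB",      "joined": "05/03/2021", "spend": ""},
    {"name": "carol",    "joined": "2021-03-07", "spend": "80"},
]

def parse_date(s):
    """Try a couple of common formats; illustrative, not exhaustive."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            pass
    return None

def wrangle(rows):
    clean = []
    for row in rows:
        clean.append({
            "name": row["name"].strip().title(),           # normalise whitespace and case
            "joined": parse_date(row["joined"]),            # unify date formats
            "spend": float(row["spend"]) if row["spend"] else 0.0,  # handle missing values
        })
    return clean

tidy = wrangle(raw)
print(tidy[0])
```

The experienced practitioner knows that this unglamorous step typically consumes the majority of a project’s effort.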
A good knowledge of the various technologies and their history, their proper application, and enough understanding to build a working solution from scratch if need be are all extremely important. Long before Hadoop existed there were Beowulf clusters (1994), and some 15 years before that commercial MPP (Massively Parallel Processing) database systems were established (e.g. Teradata). The technology available today is diverse and complicated, with many of the newer products available in the cloud. Many products can also handle enormous data matrices and execute in batch as well as real time (e.g. HPCC).
Last but not least, a competent Data Scientist is methodical and thorough, and uses an appropriate methodology such as the open CRISP-DM (Cross-Industry Standard Process for Data Mining). Conceived in 1996, it remains the most widely used project methodology for serious analytics work.
CRISP-DM consists of six major phases:
Business Understanding (essentially the requirements phase);
Data Understanding (the business requirements are mapped to data attributes);
Data Preparation (Data Wrangling or Munging);
Modeling (predictive models are built using a variety of algorithms/methods, e.g. GLMs (Generalized Linear Models), SVMs (Support Vector Machines), and so on);
Evaluation (the models are back-tested on holdout samples); and
Deployment (the models are put into production).
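The Modeling and Evaluation phases can be sketched in miniature. The following pure-Python example fits a simple least-squares line (a far humbler model than a GLM or SVM, but the same workflow) and back-tests it on a holdout sample it never saw during fitting; the synthetic data stands in for a real project’s data:

```python
import random

rng = random.Random(42)

# Synthetic data: y = 2x + 1 plus noise, standing in for real observations.
data = [(x, 2 * x + 1 + rng.gauss(0, 0.5)) for x in range(40)]

# Split into a training set and a holdout sample for back-testing.
rng.shuffle(data)
train, holdout = data[:30], data[30:]

def fit(points):
    """Modeling phase: closed-form least squares for y = a*x + b."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = (sum((x - mx) * (y - my) for x, y in points)
         / sum((x - mx) ** 2 for x, _ in points))
    return a, my - a * mx

# Evaluation phase: score the model only on the holdout sample.
a, b = fit(train)
mse = sum((y - (a * x + b)) ** 2 for x, y in holdout) / len(holdout)
print(f"slope={a:.2f} intercept={b:.2f} holdout MSE={mse:.3f}")
```

The discipline being illustrated is the separation: the holdout points never influence the fit, so the MSE is an honest estimate of how the model will behave on unseen data.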
The last phase can lead into what is known as Operational Analytics, an enormous subject area in its own right – the modern way to describe it is providing a “turn-key” solution. Model Management is also important: models are only as good as the day they were created, and constant checking is needed to ensure they still work properly. The phases are run in a staggered manner and are often re-run over the course of a project. CRISP-DM is extensible and can easily be adapted to the specific needs of a particular user.
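One simple form that constant checking can take is a threshold alarm on a deployed model’s recent performance. This is a minimal sketch, assuming accuracy is the metric being tracked; the function name, numbers, and tolerance are all illustrative, and real Model Management usually involves richer drift statistics:

```python
def needs_retraining(baseline_accuracy, recent_accuracies, tolerance=0.05):
    """Flag a deployed model whose recent average accuracy has drifted
    more than `tolerance` below the accuracy measured at deployment."""
    recent = sum(recent_accuracies) / len(recent_accuracies)
    return (baseline_accuracy - recent) > tolerance

# At deployment the model scored 0.91; suppose weekly scores have slipped.
flag = needs_retraining(0.91, [0.88, 0.85, 0.82, 0.80])
print(flag)  # True => trigger the re-run of earlier CRISP-DM phases
```

When the alarm fires, the staggered nature of CRISP-DM pays off: the project loops back through Data Understanding and Modeling rather than starting from zero.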