A lot of material is available on how to learn machine learning (ML) and data science (DS), but when we work on an actual ML/DS project, we realize that the core aspects we learned (modeling and evaluation) are just a small part of the overall solution. When working as a data scientist, nobody tells us which ML/DS problem we need to solve or which prediction we need to make. We first need to understand the business process, identify the problem, and determine whether it is suitable for an ML/DS solution.
Then we need to collect the underlying data used by the business and assess whether it is sufficient and useful enough to convert the business problem into an ML/DS problem. Next, we explore the data, prepare it for consumption by prediction algorithms and models, and evaluate model performance before deploying the model to production. Along the way, we also need to identify a suitable evaluation methodology and agree on monitoring and support activities with the business.
In this article, I will cover these aspects to give you a holistic view of a data science framework built on the CRISP-DM methodology:
- Business understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment
Three of these activities (data preparation, modeling, and evaluation) are performed iteratively to reach a well-optimized model that generalizes and avoids both under-fitting and over-fitting:
Data preparation <-> Modeling <-> Evaluation
This initial phase, business understanding, focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data science problem definition and a preliminary plan designed to achieve the objectives.
The data understanding phase starts with initial data collection and proceeds with activities that enable you to become familiar with the data, identify data quality problems, discover first insights into the data, and/or detect interesting subsets to form hypotheses regarding hidden information.
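To make the idea of "becoming familiar with the data" concrete, here is a minimal sketch of a first-pass profile that counts missing values, distinct values, and numeric ranges per attribute. The `profile` function and the customer records are hypothetical and invented purely for illustration. On a real project you would more likely use a data-frame library for this.

```python
def profile(records):
    """Summarize each attribute: missing count, distinct values, numeric range."""
    # Collect every attribute name that appears in any record.
    columns = sorted({key for row in records for key in row})
    summary = {}
    for col in columns:
        values = [row.get(col) for row in records]
        # Treat None and empty strings as missing (an assumption of this sketch).
        present = [v for v in values if v is not None and v != ""]
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        summary[col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
        }
    return summary


# Hypothetical customer records, invented for illustration.
records = [
    {"age": 34, "plan": "basic", "monthly_spend": 20.0},
    {"age": None, "plan": "premium", "monthly_spend": 55.5},
    {"age": 51, "plan": "basic", "monthly_spend": None},
    {"age": 29, "plan": "", "monthly_spend": 18.0},
]

report = profile(records)
for column, stats in report.items():
    print(column, stats)
```

A summary like this is often enough to surface the first data quality questions to take back to the business, such as why a field is frequently empty.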
The data preparation phase covers all activities needed to construct the final dataset from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record, and attribute selection, as well as transformation and cleaning of data for modeling tools.
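The tasks listed above (attribute selection, record selection, cleaning, and transformation) can be sketched in a few lines. This is only one possible approach under stated assumptions: the `prepare` function, the column names, the median imputation, and the min-max scaling are all illustrative choices, not steps prescribed by CRISP-DM.

```python
from statistics import median


def prepare(records, features, target):
    """Build a modeling dataset: select attributes, clean, impute, scale."""
    # Record selection: drop rows whose target value is missing.
    rows = [r for r in records if r.get(target) is not None]

    # Cleaning: impute missing feature values with the column median.
    medians = {
        f: median(r[f] for r in rows if r.get(f) is not None) for f in features
    }
    # Attribute selection: keep only the requested feature columns.
    filled = [
        {f: r[f] if r.get(f) is not None else medians[f] for f in features}
        for r in rows
    ]

    # Transformation: min-max scale every feature to the range [0, 1].
    lows = {f: min(r[f] for r in filled) for f in features}
    highs = {f: max(r[f] for r in filled) for f in features}
    X = [
        [
            (r[f] - lows[f]) / (highs[f] - lows[f]) if highs[f] > lows[f] else 0.0
            for f in features
        ]
        for r in filled
    ]
    y = [r[target] for r in rows]
    return X, y


# Hypothetical raw data, invented for illustration.
raw = [
    {"age": 20, "monthly_spend": 10.0, "notes": "x", "churned": 0},
    {"age": None, "monthly_spend": 30.0, "notes": "y", "churned": 1},
    {"age": 40, "monthly_spend": 50.0, "notes": "", "churned": None},
    {"age": 60, "monthly_spend": None, "notes": "z", "churned": 1},
]

X, y = prepare(raw, features=["age", "monthly_spend"], target="churned")
print(X)
print(y)
```

In practice, imputation and scaling statistics should be computed on the training split only and then reused on validation and test data, so that no information leaks from the data used for evaluation.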
In the modeling phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, going back to the data preparation phase is often necessary.
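As an illustration of calibrating a parameter, the sketch below tunes `k` for a small k-nearest-neighbors classifier by comparing accuracy on a held-out validation set. k-NN is used here only because it fits in a few lines. The function names and the toy data (including one deliberately mislabeled training point) are assumptions made for this example.

```python
from collections import Counter


def knn_predict(train_X, train_y, point, k):
    """Predict the majority label among the k nearest training points."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], point)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_X, train_y, X, y, k):
    """Fraction of points in (X, y) that the k-NN model labels correctly."""
    hits = sum(
        knn_predict(train_X, train_y, p, k) == label for p, label in zip(X, y)
    )
    return hits / len(y)


def calibrate_k(train_X, train_y, val_X, val_y, candidates):
    """Score each candidate k on the validation set and return the best one."""
    scores = {k: accuracy(train_X, train_y, val_X, val_y, k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy one-dimensional data, invented for illustration.
# The training point at 2.0 carries a deliberately noisy label.
train_X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [6.0], [6.5], [7.0]]
train_y = [0, 0, 0, 1, 0, 0, 0, 1, 1, 1]
val_X = [[1.9], [2.2], [0.8], [6.2], [7.2]]
val_y = [0, 0, 0, 1, 1]

best_k, scores = calibrate_k(train_X, train_y, val_X, val_y, candidates=[1, 3, 10])
print(best_k, scores)
```

With this toy data, `k=1` memorizes the noisy label (over-fitting), `k=10` always predicts the majority class (under-fitting), and `k=3` scores best on the validation set. On a real project you would typically use cross-validation rather than a single small holdout set.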
In the evaluation phase, before proceeding to final deployment of the model, it is important to thoroughly evaluate it and review the steps executed to create it, to be certain the model properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data science results should be reached.
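One way to tie model evaluation back to business objectives is to compare metrics against thresholds agreed with the business. The sketch below assumes a binary classification problem and hypothetical minimum precision and recall targets. The metrics, the thresholds, and the labels are illustrative assumptions, since the right criteria depend entirely on the business problem.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification result."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def evaluate(y_true, y_pred, min_precision, min_recall):
    """Compute precision and recall and check them against agreed minimums."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "meets_business_criteria": (
            precision >= min_precision and recall >= min_recall
        ),
    }


# Hypothetical test-set labels and predictions, invented for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Hypothetical thresholds that would be agreed with the business.
result = evaluate(y_true, y_pred, min_precision=0.7, min_recall=0.8)
print(result)
```

Here the model reaches a precision and recall of 0.75 each, so it meets the precision target but misses the recall target, which would send the project back to data preparation or modeling rather than forward to deployment.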
Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable DS process across the enterprise. In many cases, it is the customer, not the data analyst, who carries out the deployment steps. However, even if the analyst will carry out the deployment effort, it is important for the customer to understand up front what actions need to be carried out in order to actually make use of the created models.