The Three Cores of Data Science

Tanin Phongpandecha
September 5, 2018

Many people ask me: what is data science? Ask this question of different data scientists and you'll probably get a range of different answers. Not only is the field highly interdisciplinary, but the job market also demands different skill sets from data scientists in different roles. Depending on the nature of their work, you could find a data scientist who spends most of their time researching and developing new theories for existing tasks, or even inventing entirely new model families (Convolutional/Recurrent Neural Nets; someone out there is probably working on the X Neural Net that might someday blow these existing models away). On the other hand, you could find data scientists who work with CSV files daily, clean and visualize data, and produce insightful reports that help drive high-stakes decisions. In other sciences such as biology, physics, and chemistry, it is fairly clear what people in the field do. So what exactly is data science?

Data is to data science as elements are to chemistry. In chemistry, you identify the most basic building blocks and their properties, then build more complex models out of them to understand and predict what would happen in new scenarios. If a model is correct and generalizable, it is valid; otherwise, chemists construct new models. The same goes for data science. The most basic thing you can have in data science is a data point. From data, data scientists build a model to explain what is going on in the scenario at hand, validate it, and test it. But to do all this, we need a little bit of Computer Science, Math and Statistics, and Domain/Business Expertise.

Computer Science

Most pattern-recognition algorithms don't have closed-form solutions. With Linear Regression, you can use the nice geometric property of linearity to derive the Normal Equations, which give you a formula for the optimal weights directly from your inputs. Now that's neat. But for more advanced machine learning models like the kernelized SVM and Neural Networks, your best bet is numerical optimization (e.g., gradient descent) to find an optimum, and Decision Trees are grown by greedy search rather than solved in closed form.
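
To make the contrast concrete, here is a minimal NumPy sketch (my own illustration, not code from the original post) that fits the same linear model both ways: once in closed form via the Normal Equations, and once iteratively with gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # 100 samples, 3 features
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Closed form: the Normal Equations, theta = (X^T X)^{-1} X^T y
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Iterative: batch gradient descent on the mean-squared-error loss
theta = np.zeros(3)
learning_rate = 0.1
for _ in range(1000):
    grad = X.T @ (X @ theta - y) / len(y)           # gradient of MSE/2
    theta -= learning_rate * grad

print(theta_closed, theta)                          # both approach [1.5, -2.0, 0.5]
```

Models like kernelized SVMs and neural networks have no analogue of that first closed-form line; the iterative route is the only practical one.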

Large-scale machine learning requires strong programming experience: data parallelism, distributed computing, and memory management. For example, in a toy exercise like training an image classifier on MNIST, you can fit the whole dataset into memory. But say you have 1 TB of image data: simply assigning all the images to X in your Python code would crash your program. Another example is vectorization. A naive way to train a neural network would be to write a bunch of nested for loops that update single elements of the weight matrix. In the Platonic world of mathematics, that would in theory be sufficient to obtain a strong machine learning classifier, but in practice it could take months or years to complete (not to mention the bugs you'll be discovering two months from now)! At my company (Datawow, a micro-task platform where human moderators label high volumes of data to train ML models for task automation), we train image classifiers on terabytes of data, and data parallelism across multiple GPUs is essential. These are just a few examples to start with, and you never know what problems you will face as a data scientist, so strong programming skills and experience help a lot in data science.
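
As a toy illustration of the vectorization point (my own example, not the author's code), both snippets below compute the same linear layer over a batch of flattened images, but the vectorized version hands the loops to optimized native code:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 784))    # a batch of 256 flattened 28x28 images
W = rng.normal(size=(784, 10))     # weights of a 10-class linear layer

# Naive: explicit Python loops over every output element
out_loops = np.zeros((256, 10))
for i in range(256):
    for j in range(10):
        for k in range(784):
            out_loops[i, j] += X[i, k] * W[k, j]

# Vectorized: a single matrix multiply, orders of magnitude faster
out_vec = X @ W

assert np.allclose(out_loops, out_vec)
```

Scale the loop version up to terabytes of images and a real training schedule, and the months-or-years estimate stops sounding like a joke.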

Mathematics and Statistics

[Figure: Machine Learning Cheat-Sheet]

Machine Learning is, at its core, the idea of agents learning from their environment or data to perform better at a given task. How does an algorithm learn? It's largely statistical. Some machine learning algorithms (e.g., Linear/Quadratic Discriminant Analysis) are essentially Bayesian models: we assume a parametric distributional structure for the data and update the parameters algorithmically. Other classifiers, such as Neural Networks, map real-valued vectors into a probability space (a number between 0 and 1) through a series of additions, multiplications, and output activations. The updates to the weights are computed via gradient descent, where the calculations require the chain rule to propagate information from the output back to the inner parts of the network. After all, we're modelling numbers, so the better your math and stats foundations, the better your life will be.
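
Here is a minimal sketch of that idea (my own example, assuming a one-layer logistic model): the chain rule applied to the cross-entropy loss collapses into a compact gradient, and gradient descent repeatedly nudges the weights along it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # maps real values into (0, 1)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 1.0])
y = (sigmoid(X @ true_w) > 0.5).astype(float)

w, lr = np.zeros(5), 0.1
for _ in range(500):
    p = sigmoid(X @ w)                # predicted probabilities
    # chain rule: dL/dw = dL/dp * dp/dz * dz/dw simplifies to X^T (p - y) / n
    grad = X.T @ (p - y) / len(y)
    w -= lr * grad                    # gradient descent update
```

In a deep network, the same chain rule is applied layer by layer (backpropagation) to carry the gradient from the output back to the inner weights.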

Domains/Business Knowledge

People may imagine AI the way we see it in the movies: independent robots that set their own objectives, self-repair, and eventually take over the world. That may one day be true, but not now (as of the time of writing). We are living in the age of vertical Artificial Intelligence. The systems we build are trained for one specific task, and they're only good at that task. For example, if you trained an image classifier to tell whether an image shows a dog or a cat, and then used that classifier on a picture of a car, it would output either dog or cat with very high probability (assuming you used softmax for your final-layer activation). The model would find some features in the image of the car, whether edges, colors, or blurriness, and relate them to the dog and cat images it saw during training.
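
The softmax point is easy to see with hypothetical numbers (my own illustration): whatever logits the network produces for the car image, softmax renormalizes them into a dog-vs-cat distribution that sums to 1, so the model has no way to answer "neither".

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical logits a dog-vs-cat classifier might produce for a car image
logits = np.array([4.1, 1.3])          # [dog, cat]
print(softmax(logits))                  # ~[0.94, 0.06]: "94% dog", confidently wrong
```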

What does this imply? Robots are vertically smart, but horizontally stupid! For a different task in the same domain (e.g., image classification), one can take advantage of transfer learning, which I discuss in detail here. It also means we still need domain expertise for whatever AI we're building. For instance, if you're in the medical field, the machine learning model you're building must be adapted to medical images (they're very different from ImageNet images). If you're training a text classifier for Wikipedia, then word2vec may be a good starting point; but if you're working with tweets, where people make spelling mistakes all the time, you should probably consider character-level modelling.
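
As a rough sketch of what transfer learning looks like in practice (illustrative Keras code of my own; the model choice, class count, and hyperparameters are assumptions, not from the article): reuse a backbone pretrained on ImageNet as a frozen feature extractor and train only a new classification head on your domain's images.

```python
import tensorflow as tf

NUM_CLASSES = 5  # hypothetical: five categories of domain-specific images

# Pretrained ImageNet backbone, frozen so only the new head is trained
base = tf.keras.applications.ResNet50(weights="imagenet",
                                      include_top=False, pooling="avg")
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # new head
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, epochs=5)  # train_ds: your labelled domain images
```

The closer your images are to ImageNet, the more of the backbone you can keep frozen; for something like medical scans you would likely unfreeze and fine-tune more of it.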

Computer Science + Math & Stats + Domain Expertise

When you combine the three elements described above, you get someone who is very comfortable identifying what the problem is, what's at stake, what data should be used, which models are suitable, how to train them, and finally how to put them into production. With these skills, there's no question that you're ready to turn data into value.
