Experfy

AI Series: Data Scientists, the modern alchemists.

by Michele Vaccaro
November 9, 2018
in Big Data & Cloud
5 min read


"…The narrow spiral staircase led into a larger room, barely illuminated by a few torches hanging on the brick wall. Two tables in the center of the room were completely covered by the strangest shapes of alchemical stills. A glass alembic was inhaling the smelly, lazy vapors produced by a bubbling liquid in a heated cucurbit, near a mortar and its pestle. Copper retorts of different sizes, small flasks containing white lead, sulfur and mercury, and other distilling vessels were aligned on old wooden shelves. Strange light effects were created by a bottle of Spiritus Vini reflecting the light coming from a heated pot where vaporized sulfur was transforming liquid mercury into a yellow solid, very similar to gold…"

Many centuries have passed since alchemists tried to transmute base metals into gold. Our scientific knowledge is now vastly deeper and broader in every field, and alembics and cucurbits have been replaced by powerful computers. Still, I cannot help recalling those medieval alchemists when I think about the modern data scientist's fascinating mission: transforming data… into gold.

First of all, data scientists need to understand the nature of the problem they have to solve. In machine learning there are three main types of problems: classification, regression and clustering. Classification tasks involve assigning input data to categorical labels, simple ones like “yes”/“no” or “true”/“false”, or more complex ones, as in face recognition, where a face is assigned to the name of the person it belongs to. Regression tasks are similar, but the prediction is a continuous value rather than a category: teaching an algorithm to predict how the price of a specific product or service will change under a given set of circumstances is a regression problem. Clustering problems are closer to traditional data mining tasks, where the goal is to analyze unlabeled data to discover hidden patterns and extract powerful insights, as in product recommendation.
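As a rough sketch of the three task types, here is a toy NumPy example (all data, names and thresholds are invented for illustration): a least-squares regression, a nearest-centroid classifier, and a small k-means loop that recovers the two groups without ever seeing the labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Regression: fit a line y = w*x + b with ordinary least squares.
x = rng.uniform(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, 50)
w, b = np.polyfit(x, y, 1)  # slope and intercept, close to 3 and 2

# Classification: two labeled blobs, nearest-centroid decision rule.
no_pts = rng.normal([0, 0], 0.5, (30, 2))
yes_pts = rng.normal([4, 4], 0.5, (30, 2))
centroids = np.stack([no_pts.mean(axis=0), yes_pts.mean(axis=0)])

def classify(p):
    # Assign the label of the closest class centroid.
    return ["no", "yes"][int(np.argmin(np.linalg.norm(centroids - np.asarray(p), axis=1)))]

# Clustering: k-means on the same points, ignoring the labels entirely.
pts = np.vstack([no_pts, yes_pts])
centers = pts[[0, 30]]  # naive fixed init; real k-means uses random or k-means++ init
for _ in range(10):
    assign = np.argmin(np.linalg.norm(pts[:, None] - centers[None], axis=2), axis=1)
    centers = np.stack([pts[assign == k].mean(axis=0) for k in range(2)])
```

The same points serve the classifier (with labels) and the clustering loop (without): the difference between the tasks is not the data but what the algorithm is asked to do with it.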

Once the problem is clear, the data scientist has to decide which learning strategy will best serve the cause. The choice depends on many factors, including: How much data is available? Is it labeled or not? Are there algorithms or neural networks that have already been trained on similar datasets? In my previous post I introduced the two most popular learning strategies: supervised and unsupervised learning.

A supervised learning approach is probably the best choice when I have large labeled datasets, plenty of computing power, and a classification or regression problem, while unsupervised learning is the best choice for clustering tasks where no labeled data is available. But many other learning strategies have emerged over time. Transfer learning, for example, leverages an existing network previously trained on a similar domain and fine-tunes it by re-training only the last few fully connected layers, thus re-using the features that were detected and learned during supervised training on a different task.
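The transfer-learning idea can be sketched in a few lines of NumPy. This is a toy stand-in: the "pre-trained" layer here is just random weights, where in practice it would come from a network trained on a large related dataset. The point is the mechanics: the first layer is frozen and only the new head is updated.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "pre-trained" feature extractor: random weights stand in for a
# layer that, in real transfer learning, would come from a network trained on
# a large related dataset. It is frozen and never updated below.
W1 = rng.normal(size=(2, 8))

def features(X):
    return np.tanh(X @ W1)  # frozen layer's output, re-used as-is

# Small labeled dataset for the new task: two separable blobs.
X = np.vstack([rng.normal([0, 0], 0.5, (20, 2)), rng.normal([3, 3], 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# Fine-tune only the new head (a logistic-regression layer) by gradient descent.
w, b = np.zeros(8), 0.0
F = features(X)
for _ in range(500):
    p = 1 / (1 + np.exp(-(F @ w + b)))   # sigmoid of the head's logits
    w -= 0.5 * (F.T @ (p - y)) / len(y)  # gradient step on the logistic loss
    b -= 0.5 * (p - y).mean()

p = 1 / (1 + np.exp(-(F @ w + b)))
accuracy = ((p > 0.5) == y).mean()
```

Because only the 9 head parameters are trained, a few dozen labeled examples are enough, which is exactly the appeal of transfer learning.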

Another approach is offered by Deep Belief Networks, or DBNs. They use standard neural networks but implement a radically different training method. Instead of starting from random values, the network is initialized by an unsupervised pre-training phase using unlabeled datasets, from which it learns multiple layers of features. When the pre-training phase is over, all the weights and biases of the net are already very close to their optimal values, and the final phase consists of just a short, supervised fine-tuning session with backpropagation and relatively few labeled examples.
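A full DBN stacks several Restricted Boltzmann Machines and pre-trains them greedily, layer by layer. As a minimal sketch of that unsupervised pre-training step (toy data, simplified one-step contrastive divergence, all sizes invented), here is a single RBM layer learning features from unlabeled binary vectors; its reconstruction error drops without any labels being used:

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Toy unlabeled data: two binary "prototypes" with 5% bit-flip noise.
proto = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]])
data = proto[rng.integers(0, 2, 200)]
data = np.abs(data - (rng.random(data.shape) < 0.05))

# One RBM layer: 6 visible units, 2 hidden units, trained with CD-1.
n_vis, n_hid, lr = 6, 2, 0.1
W = rng.normal(0, 0.1, (n_vis, n_hid))
bv, bh = np.zeros(n_vis), np.zeros(n_hid)

def recon_error(v):
    h = sigmoid(v @ W + bh)
    return np.mean((v - sigmoid(h @ W.T + bv)) ** 2)

err_before = recon_error(data)
for _ in range(300):
    v0 = data
    ph0 = sigmoid(v0 @ W + bh)                       # hidden probabilities
    h0 = (rng.random(ph0.shape) < ph0).astype(float) # sample hidden states
    pv1 = sigmoid(h0 @ W.T + bv)                     # reconstruction
    ph1 = sigmoid(pv1 @ W + bh)
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)   # CD-1 weight update
    bv += lr * (v0 - pv1).mean(axis=0)
    bh += lr * (ph0 - ph1).mean(axis=0)
err_after = recon_error(data)
```

In a real DBN the hidden activations of this layer would become the "visible" input of the next RBM, and the whole stack would then be fine-tuned with backpropagation on a small labeled set.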

Both transfer learning and DBNs reduce the training time and the need for huge labeled datasets.

Last, but definitely not least, the data scientist has to decide which algorithm, among a wide variety, will provide the best performance.

In my previous article I introduced the very popular neural networks, which come in many different flavors: from the simplest form, the Multi-Layer Perceptron, to the powerful architecture of Convolutional Networks, or the sophisticated complexity of Recurrent Neural Networks, which specialize in sequential data where the next data point depends on the previous ones, as in stock prediction, text generation and voice recognition.

But neural networks and deep learning are just elements of a much broader and richer set of machine learning algorithms covering all kinds of problems. The regression family, clearly well suited to regression-type problems, offers algorithms that are fast to model and particularly useful when the relationship to be modeled is not extremely complex and not much data is available; Linear and Logistic Regression are its simplest members. Clustering algorithms, as the name suggests, are particularly effective on unsupervised learning tasks, grouping sets of objects so that objects in the same group (called a cluster) are more similar to each other than to those in other groups; clustering is a main task of exploratory data mining and a common technique in statistical data analysis, with K-Means and Hierarchical Clustering among the most popular algorithms of the family. For supervised learning on regression and classification tasks, decision trees and Bayesian algorithms are often a good, simple and powerful approach.
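To show how simple a tree-based learner can be, here is a decision stump, a one-level decision tree, fitted by brute force on toy data (the data, the `fit_stump` helper and all values are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: feature 0 separates the classes, feature 1 is pure noise.
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([4, 0], 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

def fit_stump(X, y):
    """Brute-force search for the single feature/threshold split
    with the fewest training errors (a one-level decision tree)."""
    best = (0, 0.0, 1, len(y) + 1)  # feature, threshold, polarity, errors
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            pred = (X[:, f] > t).astype(int)
            for polarity, p in ((1, pred), (-1, 1 - pred)):
                errors = int((p != y).sum())
                if errors < best[3]:
                    best = (f, t, polarity, errors)
    return best

feat, thr, pol, errors = fit_stump(X, y)

def predict(x):
    p = int(x[feat] > thr)
    return p if pol == 1 else 1 - p
```

The search correctly ignores the noise feature and lands on a threshold between the two blobs; a full decision tree simply repeats this split recursively on each branch.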

And these are just a few examples of the many machine learning algorithms that data scientists can use to solve their challenges, and that we'll explore in the next articles.

But while data scientists can leverage existing best practices and guidelines about which combination of problem, dataset, learning strategy and algorithm should be used to achieve the best results, it's also true that machine learning is not an exact science: it's evolving rapidly and it's relatively new. And this is where the art of experimenting with new approaches, by wisely (and often empirically) combining the different ingredients, makes the mission of our modern data scientists so complex and fascinating as to appear magical.

A data scientist is not 'only' a physicist or a mathematician who knows how to implement code in Python. He or she develops these abilities use case by use case, leveraging best practices but often exploring new ways to approach old problems: combining different learning techniques or chaining different classes of algorithms to optimize data, improve prediction quality and performance, or overcome previously unseen obstacles and challenges.

And just as their alchemist ancestors, in their quest to transform rocks into gold, paved the way for modern chemistry, our modern data scientists, in their effort to extract gold out of data, are laying the foundations for future generations of AI.

Tags: Data Science
Michele Vaccaro

Michele Vaccaro is Solution Consultant Director at OpenText, a market leader in Enterprise Information Management software and solutions.


© 2020, Experfy Inc. All rights reserved.
