Blog series

Scaling ML from Zero to Millions of Users

Introduction:

A small number of machine learning models fail quickly, while others look promising and demonstrate some level of predictive power. Testing and deploying them in a production environment is another challenge, where they either fail or prove their worth. This series demonstrates how to train and scale ML models from humble beginnings to world domination.

Scaling Machine Learning from 0 to Millions of Users — Part 2 Training: EC2, EMR, ECS, EKS or SageMaker?

Part 1 of this article discussed a few simple techniques that helped with the initial scalability of machine learning… and hopefully with reducing manual ops. Since then, despite a few production hiccups due to the lack of high availability, life has been pretty good. However, traffic soon starts to increase, data piles up, and more models need to be trained. Technical and business stakes are getting higher, and the current architecture will go underwater soon. This post focuses on scaling training to a large number of machines.

11 MINUTES READ

AI & Machine Learning

Scaling Machine Learning from 0 to Millions of Users — Part 1

So you want to build an ML model. No Machine Learning is easier to manage than no Machine Learning. Figuring out a way to use high-level services could save you weeks of work, maybe months. In this series of posts, we’ll discuss how to train ML models and deploy them to production, from humble beginnings to world domination. Along the way, we’ll try to take justified and reasonable steps, fighting the evil forces of over-engineering.

10 MINUTES READ