So you want to build an ML model. Nothing is easier to manage than no Machine Learning: figuring out a way to use high-level services could save you weeks of work, maybe months. In this series of posts, we’ll discuss how to train ML models and deploy them to production, from humble beginnings to world domination. Along the way, we’ll try to take justified and reasonable steps, fighting the evil forces of over-engineering.
Part 1 of this article discussed a few simple techniques that helped with the initial scalability of machine learning… and hopefully with reducing manual ops. Since then, despite a few production hiccups due to the lack of high availability, life has been pretty good. However, traffic soon starts to increase, data piles up, and more models need to be trained. Technical and business stakes are getting higher, and the current architecture will soon be underwater. This post focuses on scaling training to a large number of machines.