A small number of machine learning models fail quickly, some look promising and demonstrate a level of predictive power. Then testing and deploying them in a production environment is another challenge where they either fail or prove their worth. This series shall demonstrate how to train and scale up ML models from humble beginnings to world dominations.
Part 1 of this article discussed a few simple techniques that helped with initial scalability of machine learning… and hopefully with reducing manual ops. Since then, despite a few production hiccups due the lack of high availability, life has been pretty good. However, traffic soon starts to increase, data piles up, more models need to be trained, etc. Technical and business stakes are getting higher, and the current architecture will go underwater soon. This post focuses on scaling training to a large number of machines.
So you want to build a ML model. No Machine Learning is easier to manage than no Machine Learning. Figuring a way to use high-level services could save you weeks of work, maybe months. In this series of posts, we’ll discuss how to train ML models and deploy them to production, from humble beginnings to world domination. Along the way, we’ll try to take justified and reasonable steps, fighting the evil forces of over-engineering.