{"id":1568,"date":"2019-03-12T04:13:10","date_gmt":"2019-03-12T04:13:10","guid":{"rendered":"http:\/\/kusuaks7\/?p=1173"},"modified":"2023-09-08T12:20:34","modified_gmt":"2023-09-08T12:20:34","slug":"scaling-machine-learning-from-0-to-millions-of-users-part-2-training-ec2-emr-ecs-eks-or-sagemaker","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/scaling-machine-learning-from-0-to-millions-of-users-part-2-training-ec2-emr-ecs-eks-or-sagemaker\/","title":{"rendered":"Scaling Machine Learning from 0 to Millions of Users\u200a\u2014\u200aPart 2 Training: EC2, EMR, ECS, EKS or SageMaker?"},"content":{"rendered":"<p id=\"82dd\">In <a href=\"https:\/\/www.experfy.com\/blog\/scaling-machine-learning-from-0-to-millions-of-users-part-1\">Part 1<\/a>, we broke out of the laptop, and decided to deploy our prediction service on a virtual machine. By doing so, we discussed a few simple techniques that helped with initial scalability\u2026 and hopefully with reducing manual ops. Since then, despite a few production hiccups due to the lack of high availability, life has been pretty good.<\/p>\n<p id=\"bb3b\">However, traffic soon starts to increase, data piles up, more models need to be trained, etc. Technical and business stakes are getting higher, and let\u2019s face it, the current architecture will go underwater soon. Time\u2019s up: in this post, we\u2019ll focus on\u00a0<strong>scaling training to a large number of machines<\/strong>.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" style=\"width: 560px; height: 425px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/0*0OcxdRHpzI3An9BQ\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/0*0OcxdRHpzI3An9BQ\" \/><\/p>\n<p style=\"text-align: center;\"><span style=\"font-size: 11px;\">Two hundred\u2026 and fifty\u2026 six\u2026 GPUs!!!\u00a0Ja!<\/span><\/p>\n<blockquote id=\"decd\"><p>This is an opinionated series, remember? 
It\u2019s also based on what I hear when talking to real-life customers, not ideal ones. Reality is often ugly, and only tech hipsters (and top management) think that it looks exactly like that fancy article you read on &lt;insert_name_here&gt;\u00a0\ud83d\ude09<\/p><\/blockquote>\n<blockquote id=\"4f15\"><p>Special thanks to rockin\u2019 Evangelists Abby Fuller, Jerry Hargrove, Adrian Hornsby, Brent Langston and Ian Massingham for their tips and ideas.<\/p><\/blockquote>\n<h3 id=\"3dcc\"><strong>Scaling up will fix it\u2026\u00a0right?<\/strong><\/h3>\n<p id=\"4f1f\">Yes and no. Yes, it can be a\u00a0<strong>short-term solution<\/strong>\u00a0to use a large server for training and prediction. Amazon EC2 has a ton of\u00a0<a href=\"https:\/\/aws.amazon.com\/ec2\/instance-types\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/aws.amazon.com\/ec2\/instance-types\/\" data->instance types<\/a>\u00a0to pick from, and all it takes is\u00a0<a href=\"https:\/\/docs.aws.amazon.com\/cli\/latest\/reference\/ec2\/stop-instances.html\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/docs.aws.amazon.com\/cli\/latest\/reference\/ec2\/stop-instances.html\" data->stopping<\/a>\u00a0your instance,\u00a0<a href=\"https:\/\/docs.aws.amazon.com\/cli\/latest\/reference\/ec2\/modify-instance-attribute.html\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/docs.aws.amazon.com\/cli\/latest\/reference\/ec2\/modify-instance-attribute.html\" data->changing the instance type<\/a>, and\u00a0<a href=\"https:\/\/docs.aws.amazon.com\/cli\/latest\/reference\/ec2\/start-instances.html\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/docs.aws.amazon.com\/cli\/latest\/reference\/ec2\/start-instances.html\" data->starting<\/a>\u00a0it again.<\/p>\n<blockquote id=\"b704\"><p>You should only be using APIs from now on. 
When it comes to scaling tech &amp; ops, manual work in any GUI is the Antichrist.<\/p><\/blockquote>\n<p id=\"ba67\">No, because it\u2019s only a\u00a0<strong>temporary solution<\/strong>:<\/p>\n<ul>\n<li id=\"3b94\">It\u2019s fine to use bigger instances, until\u00a0<strong>there is no bigger instance<\/strong>. Then what?<\/li>\n<li id=\"bd7f\">It\u2019s also quite possible that\u00a0<strong>your workload won\u2019t scale nicely<\/strong>, and won\u2019t make full use of the additional hardware (RAM, CPU cores, I\/O, etc.). A marginal performance gain isn\u2019t worth the extra spend.<\/li>\n<li id=\"fcff\">Most of all,\u00a0<strong>scaling up will simply delay the inevitable<\/strong>. Keep doing it, and the only thing you\u2019ll end up with is a bigger problem to solve.<\/li>\n<\/ul>\n<p id=\"9023\">In the spirit of avoiding over-engineering, it\u2019s OK to scale up a couple of times, but if monitoring keeps pointing at the Impassable Wall of Scalability Doom, I\u2019d advise you to\u00a0<strong>act a little too early rather than a little too late<\/strong>: things scale linearly until they don\u2019t, and you don\u2019t want to find out what happens when the exponential starts rising!<\/p>\n<h3 id=\"bfac\">Scaling out<\/h3>\n<p id=\"bf7d\">When it comes to training Machine Learning (ML) models, the top requirements are actually pretty simple:<\/p>\n<ul>\n<li id=\"517c\"><strong>Reliable, scalable storage\u00a0<\/strong>for your data sets.<\/li>\n<li id=\"c33a\"><strong>Elastic compute clusters<\/strong>, that can be started on-demand in lots of different configurations (hardware, frameworks, etc.).<\/li>\n<li id=\"9332\"><strong>As little ops as possible:\u00a0<\/strong>ML is what you should focus on, because ML is what turns your raw data into revenue \/ profit \/ improved customer experience.<\/li>\n<\/ul>\n<blockquote id=\"20c6\"><p>\u201cThat\u2019s not the full list! 
We want total control, no lock-in, low bills, top performance\u2026 and everything else too, whatever it is\u201d. Yes, yes, we\u2019ll get there\u00a0\ud83d\ude42<\/p><\/blockquote>\n<h3 id=\"0068\">Storage<\/h3>\n<p id=\"2ed6\">Let\u2019s get that one out of the way:\u00a0<strong>your data goes to Amazon S3<\/strong>. Any other choice would need a bullet-proof justification (try me!). Throughput? Well, you now get up to\u00a0<strong>25 Gigabit per second between EC2 and S3<\/strong>. That should be enough for now. Scalability? High availability? Security? No ops? Cost? Check.<\/p>\n<blockquote><p>At this point, anyone in your team coming up with a \u201chigh performance storage cluster based on this super cool open source project\u201d should be hit on the head with a heavy object until they stop moving. No mercy for wasting time, putting projects at risk, and over-engineering.<\/p><\/blockquote>\n<figure id=\"c88c\"><canvas width=\"75\" height=\"60\"><\/canvas><img decoding=\"async\" style=\"width: 600px; height: 483px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/0*RsLrN3NR8QTSY7g0.jpg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/0*RsLrN3NR8QTSY7g0.jpg\" \/><\/figure>\n<p style=\"text-align: center;\"><span style=\"font-size: 11px;\">Because \u201cthe doc wasn\u2019t clear\u201d and \u201cbugs happen to everyone\u201d.<\/span><\/p>\n<p id=\"c5c7\">OK, now what about compute options?\u00a0<strong>Amazon EC2? Amazon EMR? Container services? Amazon SageMaker?<\/strong><\/p>\n<blockquote id=\"c09a\"><p>Begun the Scaling War\u00a0has.<\/p><\/blockquote>\n<h3 id=\"4732\">Amazon EC2<\/h3>\n<p id=\"7192\">Our journey started on EC2, so it can be quite tempting to continue there. 
Is that a good enough reason?<\/p>\n<p id=\"dc08\">As discussed in<a href=\"https:\/\/www.experfy.com\/blog\/scaling-machine-learning-from-0-to-millions-of-users-part-1\"> Part 1<\/a>, the\u00a0<a href=\"https:\/\/aws.amazon.com\/machine-learning\/amis\/\" target=\"_blank\" rel=\"noopener nofollow noreferrer\" data-href=\"https:\/\/aws.amazon.com\/machine-learning\/amis\/\" data-><strong>Deep Learning AMI<\/strong><\/a>\u00a0makes your life much simpler. It\u2019s packed with open source tools and libraries optimized by AWS (<a href=\"https:\/\/aws.amazon.com\/about-aws\/whats-new\/2018\/10\/chainer4-4_theano_1-0-2_launch_deep_learning_ami\/?nc1=h_ls\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/aws.amazon.com\/about-aws\/whats-new\/2018\/10\/chainer4-4_theano_1-0-2_launch_deep_learning_ami\/?nc1=h_ls\" data->11x speedup<\/a>\u00a0on TensorFlow 1.11, anyone? Or maybe\u00a0<a href=\"https:\/\/aws.amazon.com\/about-aws\/whats-new\/2018\/11\/tensorflow-scalability-to-256-gpus\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/aws.amazon.com\/about-aws\/whats-new\/2018\/11\/tensorflow-scalability-to-256-gpus\/\" data->linear scaling up to 256 GPUs<\/a>?).\u00a0<a href=\"https:\/\/github.com\/uber\/horovod\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/github.com\/uber\/horovod\" data-><strong>Horovod<\/strong><\/a>, a popular library for distributed training, is also\u00a0<a href=\"https:\/\/docs.aws.amazon.com\/dlami\/latest\/devguide\/tutorial-horovod-tensorflow.html\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/docs.aws.amazon.com\/dlami\/latest\/devguide\/tutorial-horovod-tensorflow.html\" data->included<\/a>. 
All of this will make it much easier to set up efficient, distributed training clusters.<\/p>\n<p id=\"fb38\">Some constraints may prevent you from using that\u00a0<strong>AWS-maintained AMI<\/strong>: need for a specific OS vendor, licensing restrictions, having to run identical builds across different providers, etc. Unfortunately, you\u2019ll have to set everything up yourself. Don\u2019t underestimate that task: you\u2019ll have to do it again and again as new versions are released. Even with automation, that\u2019ll never be a lights-out operation.<\/p>\n<p id=\"d7a1\">Whichever AMI you use, setting up a training cluster means:<\/p>\n<ol>\n<li id=\"832c\">Firing up a bunch of instances,<\/li>\n<li id=\"3b9f\">Picking one as the leader, and setting up distributed training. That usually involves listing hostnames \/ IP addresses of other machines in the cluster.<\/li>\n<li id=\"b07e\">Starting training,<\/li>\n<li id=\"0189\">Once training is complete, grab the trained model and save it in Amazon S3.<\/li>\n<li id=\"cfba\">Shut the training cluster down.<\/li>\n<\/ol>\n<p id=\"26a6\">Quite a bit of work, then, which is probably why most of you go through steps 1 and 2 once,\u00a0<strong>run the cluster 24\/7<\/strong>, and never make it to step 5\u2026 And this, ladies and gentlemen, is my main concern for using EC2 here.\u00a0<strong>Unless you automate all of this<\/strong>\u00a0(with\u00a0<a href=\"http:\/\/aws.amazon.com\/cloudformation\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"http:\/\/aws.amazon.com\/cloudformation\" data->AWS CloudFormation<\/a>,\u00a0<a href=\"https:\/\/www.terraform.io\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/www.terraform.io\">Terraform<\/a>, CLI scripts, etc.),\u00a0<strong>you will waste a ton of money<\/strong>. 
Someone up above will quickly put a cap on your budget, meaning that you\u2019ll probably end up with a fixed-size cluster that needs to be time-shared by multiple developers \/ teams\u2026 and of course, someone will develop a nice intranet page to book time slots on the cluster. Congratulations, you\u2019ve reinvented the mainframe!<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" style=\"width: 590px; height: 479px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/0*mkO2AvthJStvlArN\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/0*mkO2AvthJStvlArN\" \/><\/p>\n<p style=\"text-align: center;\"><span style=\"font-size: 11px;\">Resnet-50 training in 3\u2026 2\u2026\u00a01\u2026<\/span><\/p>\n<blockquote id=\"4ebf\"><p>Don\u2019t laugh. I\u2019ve met a very large\u200a\u2014\u200aand otherwise brilliant\u200a\u2014\u200aAI company managing hundreds of physical GPU servers just like this. Unfortunately, I\u2019m sure some customers do the same on EC2\u2026 Get in touch if you\u2019re stuck there, we can help!<\/p><\/blockquote>\n<p id=\"a20c\">Of course, maybe your DevOps team was kind enough to provide\u00a0<strong>an all-singing, all-dancing cluster provisioning script<\/strong>\u00a0that each developer can run to get their own training cluster (buy them lots of beer: that rarely ever happens). Would that be a good enough reason to stick to EC2 for training?<\/p>\n<p id=\"a9fb\">Maybe. Techniques for cost optimization on EC2 are well-known:\u00a0<a href=\"https:\/\/aws.amazon.com\/ec2\/pricing\/reserved-instances\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/aws.amazon.com\/ec2\/pricing\/reserved-instances\/\" data->reserved instances<\/a>,\u00a0<a href=\"https:\/\/aws.amazon.com\/ec2\/spot\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/aws.amazon.com\/ec2\/spot\/\" data->spot instances<\/a>, etc. Doing that right may offset the extra DevOps costs. 
Only you can find out, and you definitely should.<\/p>\n<blockquote id=\"d77e\"><p><strong>If you don\u2019t have time or skills to get automation and cost optimization right, I\u2019d think twice about running training jobs at scale on\u00a0EC2.<\/strong><\/p><\/blockquote>\n<h3 id=\"ba95\"><strong>Amazon EMR<\/strong><\/h3>\n<p id=\"224b\"><a href=\"http:\/\/aws.amazon.com\/emr\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"http:\/\/aws.amazon.com\/emr\" data->Amazon EMR<\/a>\u00a0in an ML discussion? Well, yes:\u00a0<a href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/ReleaseGuide\/emr-tensorflow.html\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/ReleaseGuide\/emr-tensorflow.html\" data->TensorFlow<\/a>\u00a0and\u00a0<a href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/ReleaseGuide\/emr-mxnet.html\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/ReleaseGuide\/emr-mxnet.html\" data->Apache MXNet<\/a> are part of the EMR distribution, and of course,\u00a0<a href=\"https:\/\/spark.apache.org\/mllib\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/spark.apache.org\/mllib\/\" data->Spark MLlib<\/a>\u00a0is also included. 
EMR supports\u00a0<a href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/ManagementGuide\/emr-supported-instance-types.html\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/ManagementGuide\/emr-supported-instance-types.html\" data->compute-optimized instances<\/a>\u00a0(c5 and p3), so it looks like we have everything we need.<\/p>\n<p id=\"ca99\">Here are\u00a0<strong>a few good reasons to run training jobs on EMR<\/strong>:<\/p>\n<ul>\n<li id=\"acba\">You already use EMR for other tasks, with solid automation (on-demand clusters,\u00a0<a href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/ManagementGuide\/emr-work-with-steps.html\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/ManagementGuide\/emr-work-with-steps.html\" data->steps<\/a>, etc.) and cost optimization (<a href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/ManagementGuide\/emr-instance-purchasing-options.html\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/ManagementGuide\/emr-instance-purchasing-options.html\" data->spot instances<\/a>!).<\/li>\n<li id=\"7adf\">Your data requires a lot of ETL, and Hive \/ Spark would work great for that. Why not run everything in one place?<\/li>\n<li id=\"d0e0\">Spark MLlib has the algos you need.<\/li>\n<li id=\"acb4\">You read somewhere that there is a\u00a0<a href=\"https:\/\/github.com\/aws\/sagemaker-spark\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/github.com\/aws\/sagemaker-spark\" data->SageMaker SDK for Spark<\/a>, so that could be another option for the future.<\/li>\n<\/ul>\n<p id=\"841d\">And\u00a0<strong>equally good reasons NOT to do it<\/strong>:<\/p>\n<ul>\n<li id=\"3b6a\">You don\u2019t have time or skills to automate and optimize costs. 
GPU-based EMR clusters running 24\/7 at on-demand price with zero load\u00a0<em>&lt;rolls a D100 for sanity check\u2026 &gt;<\/em><\/li>\n<li id=\"c5c3\">You\u2019re using neither TensorFlow, Apache MXNet nor Spark MLlib. Yes, you could install additional packages to your clusters, but that\u2019s extra work.<\/li>\n<li id=\"4a5f\">Your ETL and ML jobs have\u00a0<strong>conflicting instance requirements<\/strong>. Let\u2019s say that they would respectively run best on 8\u00a0<em>r5.4xlarge<\/em>\u00a0and 2\u00a0<em>p3.8xlarge<\/em>\u00a0for training. How do you compromise? That\u2019s a hard call, and you may end up picking an instance type that\u2019s suboptimal for both ETL and training\u2026 or creating a dedicated GPU cluster (another one to manage and worry about).<\/li>\n<\/ul>\n<p id=\"253f\">I have\u00a0<strong>mixed feelings<\/strong>\u00a0about this: I\u2019d be fine with piling some amount of ML on top of an existing cluster, but\u00a0<strong>unless it was 100% based on Spark MLlib, scaling it simply wouldn\u2019t feel right<\/strong>.<\/p>\n<h3 id=\"62f4\">Container services<\/h3>\n<p id=\"f0ec\">In\u00a0<a href=\"https:\/\/medium.com\/@julsimon\/scaling-machine-learning-from-0-to-millions-of-users-part-1-a2d36a5e849\" target=\"_blank\" rel=\"noopener noreferrer\" class=\"broken_link\">part 1<\/a>, I suggested early on that you containerize your code in order to solve deployment issues. 
Obviously, this would also pay dividends when deploying to Docker clusters, whether on\u00a0<a href=\"http:\/\/aws.amazon.com\/ecs\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"http:\/\/aws.amazon.com\/ecs\" data-><strong>Amazon ECS<\/strong><\/a>\u00a0or\u00a0<a href=\"http:\/\/aws.amazon.com\/eks\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"http:\/\/aws.amazon.com\/eks\" data-><strong>Amazon EKS<\/strong><\/a>.<\/p>\n<p id=\"33a0\">You can\u00a0<strong>run any workload<\/strong>\u00a0in a Docker container, and you can\u00a0<strong>move it around\u00a0<\/strong>without any restriction, from your laptop to your production environment (or so I\u2019m told). When running on AWS, costs can be squeezed with auto scaling, reserved instances, spot instances, etc. Woohoo.<\/p>\n<p id=\"eaaa\">From a training perspective, containers give you\u00a0<strong>full flexibility<\/strong>\u00a0to use any open source library, or even your own custom code. All popular ML\/DL libraries provide\u00a0<strong>base images<\/strong>, which you can either run directly or customize, and these will save you a lot of time.<\/p>\n<p id=\"5ecd\">Thanks to auto scaling now supporting\u00a0<a href=\"https:\/\/aws.amazon.com\/blogs\/aws\/new-ec2-auto-scaling-groups-with-multiple-instance-types-purchase-options\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/aws.amazon.com\/blogs\/aws\/new-ec2-auto-scaling-groups-with-multiple-instance-types-purchase-options\/\" data->mixed instance types<\/a>,\u00a0<strong>different instance types can coexist within the same cluster.\u00a0<\/strong>Thus, you can easily add compute-optimized instances to any cluster and schedule your ML\/DL trainings there. 
Of course, you can also create a dedicated cluster for training if you think that makes more sense.<\/p>\n<p id=\"3e2b\">Last but not least,\u00a0<strong>GPU instances<\/strong>\u00a0are supported on both services, with\u00a0<strong>GPU-optimized AMIs<\/strong>\u00a0to boot (nvidia-docker, NVIDIA drivers, etc).<\/p>\n<h4 id=\"e1b4\">Training on Amazon\u00a0ECS<\/h4>\n<p id=\"f0a4\">To minimize training time and cost, you need to make sure that training jobs run on the\u00a0<strong>most appropriate instance type<\/strong>\u00a0(say\u00a0<em>c5<\/em>\u00a0or\u00a0<em>p3<\/em>). Amazon ECS lets you add\u00a0<a href=\"https:\/\/docs.aws.amazon.com\/AmazonECS\/latest\/developerguide\/task-placement-constraints.html\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/docs.aws.amazon.com\/AmazonECS\/latest\/developerguide\/task-placement-constraints.html\" data->placement constraints<\/a>\u00a0in\u00a0<a href=\"https:\/\/docs.aws.amazon.com\/AmazonECS\/latest\/developerguide\/task_definitions.html\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/docs.aws.amazon.com\/AmazonECS\/latest\/developerguide\/task_definitions.html\" data->task definitions<\/a>. 
Here\u2019s how we would ask ECS to schedule this task only on\u00a0<em>p3<\/em>\u00a0instances.<\/p>\n<pre id=\"7126\"><code>\"placementConstraints\": [\r\n  {\r\n    \"expression\": \"attribute:ecs.instance-type =~ p3.*\",\r\n    \"type\": \"memberOf\"\r\n  }\r\n]<\/code><\/pre>\n<p id=\"bbbb\">We can also go one step further, thanks to a new feature that lets\u00a0<a href=\"https:\/\/docs.aws.amazon.com\/AmazonECS\/latest\/developerguide\/ecs-gpu.html\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/docs.aws.amazon.com\/AmazonECS\/latest\/developerguide\/ecs-gpu.html\" data-><strong>pin a specific number of GPUs<\/strong><\/a> to a given task.<\/p>\n<h4 id=\"604c\">Training on Amazon\u00a0EKS<\/h4>\n<p id=\"6688\">You can pretty much do the same thing on EKS. This nice blog post will walk you through the whole process of adding\u00a0<em>p3<\/em>\u00a0worker nodes to an existing cluster, defining a GPU-powered pod, and launching it on the cluster.<\/p>\n<p><a title=\"https:\/\/aws.amazon.com\/blogs\/compute\/running-gpu-accelerated-kubernetes-workloads-on-p3-and-p2-ec2-instances-with-amazon-eks\/\" href=\"https:\/\/aws.amazon.com\/blogs\/compute\/running-gpu-accelerated-kubernetes-workloads-on-p3-and-p2-ec2-instances-with-amazon-eks\/\" rel=\"nofollow noopener\" data-href=\"https:\/\/aws.amazon.com\/blogs\/compute\/running-gpu-accelerated-kubernetes-workloads-on-p3-and-p2-ec2-instances-with-amazon-eks\/\" data-><strong>Running GPU-Accelerated Kubernetes Workloads on P3 and P2 EC2 Instances with Amazon EKS | Amazon\u2026<\/strong><br \/>\n<em>This post contributed by Scott Malkie, AWS Solutions Architect Amazon EC2 P3 and P2 instances, featuring NVIDIA GPUs\u2026<\/em>aws.amazon.com<\/a><\/p>\n<h4 id=\"0513\">Container services for ML\/DL training, yes or\u00a0no?<\/h4>\n<blockquote id=\"bef0\"><p>If you belong to the Wild Hyperborean Horde who\u2019d rather eat frozen yak poo than NOT use home-made containers for absolutely everything, you\u2019ve 
already answered the question, haven\u2019t you?\u00a0\ud83d\ude09<\/p><\/blockquote>\n<figure id=\"234e\"><canvas width=\"75\" height=\"45\"><\/canvas><img decoding=\"async\" style=\"width: 474px; height: 296px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/0*tDaTZrUsPZkBpkpE\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/0*tDaTZrUsPZkBpkpE\" \/><\/figure>\n<p style=\"text-align: center;\"><span style=\"font-size: 11px;\">Yes, for a very short while. And then you got your guts ripped out.\u00a0Hmm.<\/span><\/p>\n<p id=\"9b91\">If the need ever arose,\u00a0<strong>I wouldn\u2019t worry about scaling container services to a large number of nodes<\/strong>\u00a0(ECS was designed to\u00a0<a href=\"https:\/\/www.allthingsdistributed.com\/2015\/07\/under-the-hood-of-the-amazon-ec2-container-service.html\" target=\"_blank\" rel=\"noopener nofollow noreferrer\" data-href=\"https:\/\/www.allthingsdistributed.com\/2015\/07\/under-the-hood-of-the-amazon-ec2-container-service.html\" data->scale linearly to 1,000+ nodes<\/a>). However, once again, you should very much worry about your ability to\u00a0<strong>scale ops<\/strong>\u00a0(containers, clusters, etc.) and\u00a0<strong>manage cost<\/strong>.<\/p>\n<p id=\"5410\">If you work in a Docker shop where another team is managing clusters, providing you with\u00a0<strong>agile, automated and cost-effective ways to provision<\/strong> them, then sure. It could be as easy as committing a TensorFlow script, and then letting a CI\/CD pipeline deploy it automatically to a cluster. Not a lot of extra work for ML developers and data scientists.<\/p>\n<p id=\"c077\">Now, if you live in a world where you have to\u00a0<strong>build and operate clusters on top of your actual ML job<\/strong>, that\u2019s not such an exciting proposition any more. 
Plumbing, large bills, fire, brimstone\u2026 you know the story.<\/p>\n<p id=\"7083\">Let\u2019s look at setting up and managing large, distributed training jobs with Horovod: here\u2019s the\u00a0<a href=\"https:\/\/github.com\/uber\/horovod\/blob\/master\/docs\/running.md\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" class=\"broken_link\">documentation<\/a>. Once you\u2019ve got everything figured out (<a href=\"https:\/\/github.com\/uber\/horovod\/blob\/master\/docs\/docker.md\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" class=\"broken_link\">Docker<\/a>,\u00a0<a href=\"https:\/\/github.com\/kubeflow\/kubeflow\/blob\/master\/kubeflow\/openmpi\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" class=\"broken_link\">Kubeflow<\/a>,\u00a0<a href=\"https:\/\/github.com\/kubeflow\/mpi-operator\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/github.com\/kubeflow\/mpi-operator\/\" data->MPI Operator<\/a>,\u00a0<a href=\"https:\/\/github.com\/kubernetes\/charts\/tree\/master\/stable\/horovod\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/github.com\/kubernetes\/charts\/tree\/master\/stable\/horovod\/\" data->Helm Chart<\/a>, and\u00a0<a href=\"https:\/\/github.com\/IBM\/FfDL\/tree\/master\/etc\/examples\/horovod\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/github.com\/IBM\/FfDL\/tree\/master\/etc\/examples\/horovod\/\" data->FfDL<\/a>), here\u2019s how to run a training job on 4 machines with 4 GPUs each:<\/p>\n<p id=\"760b\">Don\u2019t get me wrong: Docker, Kubernetes, Horovod and so on are impressive projects, but if you insist on building and maintaining everything yourself (or if Hyperborean Harald sneers that it\u2019s \u201c<em>the only proper way to do it<\/em>\u201d), you should know what you\u2019re stepping into, as you will be using this all day long.<\/p>\n<blockquote id=\"f535\"><p><strong>Is this what you really 
need?\u00a0<\/strong>Maybe, maybe not.\u00a0<strong>Please make up your own\u00a0mind.<\/strong><\/p><\/blockquote>\n<h3 id=\"e0b6\">Amazon SageMaker<\/h3>\n<p id=\"bace\">One more option to go:\u00a0<a href=\"http:\/\/aws.amazon.com\/sagemaker\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"http:\/\/aws.amazon.com\/sagemaker\" data->Amazon SageMaker<\/a>. I\u2019ve discussed it at length in previous posts and talks (start\u00a0<a href=\"https:\/\/medium.com\/@julsimon\/talk-from-notebook-to-production-with-amazon-sagemaker-ee2a2036c0fe\" target=\"_blank\" rel=\"noopener noreferrer\" class=\"broken_link\">here<\/a>\u00a0for a recent overview), and as the most recent service of the bunch, I\u2019ve saved it for last to see where it improves on previous options with respect to training large jobs.<\/p>\n<p id=\"e701\">A quick reminder:<\/p>\n<ul>\n<li id=\"9402\">All activity in SageMaker is driven by a\u00a0<strong>high-level\u00a0<\/strong><a href=\"https:\/\/github.com\/aws\/sagemaker-python-sdk\" target=\"_blank\" rel=\"noopener nofollow noreferrer\" data-href=\"https:\/\/github.com\/aws\/sagemaker-python-sdk\" data-><strong>Python SDK<\/strong><\/a>.<\/li>\n<li id=\"c505\">Training is based on on-demand, fully-managed instances.\u00a0Zero infrastructure work. 
Spot instances are not available.<\/li>\n<li id=\"b3c7\">Models may be trained using\u00a0<a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/algos.html\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/algos.html\" data-><strong>AWS-maintained built-in algorithms<\/strong><\/a>\u00a0and\u00a0<a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/frameworks.html\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/frameworks.html\" data-><strong>optimized frameworks<\/strong><\/a>\u00a0(same ones as in the Deep Learning AMI), as well as\u00a0<a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/your-algorithms.html\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/your-algorithms.html\" data-><strong>custom algorithms<\/strong><\/a>.<\/li>\n<li id=\"70ce\">Distributed training is built-in.\u00a0<strong>Zero<\/strong>\u00a0setup.<\/li>\n<li id=\"6493\">Plenty of\u00a0<strong>examples<\/strong>\u00a0are available\u00a0<a href=\"https:\/\/github.com\/awslabs\/amazon-sagemaker-examples\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/github.com\/awslabs\/amazon-sagemaker-examples\" data->here<\/a>.<\/li>\n<\/ul>\n<p id=\"73e5\">For instance, here\u2019s how you would\u00a0<strong>train a TensorFlow model<\/strong>. First, put your data in Amazon S3 (it\u2019s hopefully already there). Then, configure your training job: pass your code, the instance type, the number of instances. Finally, call\u00a0<em>fit()<\/em>.<\/p>\n<p>That\u2019s it.\u00a0<strong>This is at least 10x (100x\u00a0?) less code than any automation you\u2019d be using with EC2 or containers<\/strong>.<\/p>\n<p id=\"abe0\">And lock-in? 
Well, none: you\u2019re free to\u00a0<strong>take your TensorFlow code and run it anywhere<\/strong>\u00a0else.<\/p>\n<h4 id=\"9a6f\">EC2 or SageMaker?<\/h4>\n<p id=\"52e2\">Compared to EC2,\u00a0<strong>SageMaker saves you from managing any infrastructure<\/strong>, and probably a lot of framework containers too. That might not be a big deal when you\u2019re working with a couple of instances, but it sure is when you start scaling to tens or hundreds, running all kinds of different jobs.<\/p>\n<p id=\"062c\">SageMaker also terminates training clusters automatically once training is complete.<\/p>\n<blockquote id=\"6df4\"><p><strong>You will never overpay for training<\/strong>.<\/p><\/blockquote>\n<p id=\"6106\">Yes, SageMaker instances are more expensive than EC2 instances. However, if you factor in less ops and automatic termination, I\u2019d be really surprised if the gap wasn\u2019t significantly reduced.<\/p>\n<blockquote id=\"7c45\"><p>Total cost of ownership is what\u00a0matters.<\/p><\/blockquote>\n<h4 id=\"b167\">EMR or SageMaker?<\/h4>\n<p id=\"14c5\">As mentioned earlier, I don\u2019t see any compelling reason to use EMR at scale for training unless you stick to Spark MLlib. Still, if you\u2019re asking yourself the question, you\u2019re probably already using EMR&#8230; so how about both? As it happens, SageMaker also includes a\u00a0<a href=\"https:\/\/github.com\/aws\/sagemaker-spark\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/github.com\/aws\/sagemaker-spark\" data->Spark SDK<\/a>. 
I\u2019ve covered this topic before.<\/p>\n<p id=\"f9f5\">The short version is: separate concerns.<\/p>\n<blockquote id=\"6f61\"><p><strong>Spark for large-scale ETL<\/strong>,\u00a0<strong>SageMaker for large-scale training<\/strong>.<\/p><\/blockquote>\n<h4 id=\"3441\">Container services or SageMaker?<\/h4>\n<p id=\"1125\">There is a myriad of technical details that separate these two approaches, but at the end of the day, I think the choice does come down to\u00a0<strong>engineering culture\u00a0<\/strong>and\u00a0<strong>focus<\/strong>.<\/p>\n<p id=\"28d7\">Some teams are convinced\u00a0<strong>doing everything themselves creates value for the company<\/strong>\u00a0(in some cases, it does), and some other teams would rather\u00a0<strong>rely on managed services in order to iterate as fast as possible<\/strong>. Some teams feel better about\u00a0<strong>putting all their eggs in a single basket<\/strong>\u00a0(\u201cwe run everything on Docker clusters\u201d), some other teams are happier with using\u00a0<strong>different services for different things<\/strong>. No one but them can judge what\u2019s best for their particular use case.<\/p>\n<blockquote id=\"8b3e\"><p>My personal choice would still go to SageMaker, because unlike ECS and EKS,\u00a0<strong>SageMaker is built for Machine Learning only: the team is obsessed with simplifying and optimizing the service for ML users, and ML users only<\/strong>. No offence to the ECS and EKS teams, but their focus is different, as they have to accommodate literally every possible workload.<\/p><\/blockquote>\n<p id=\"562a\">These services are all based on containers anyway, and if you\u2019re able to run distributed TensorFlow with Horovod on Kubernetes, the SageMaker SDK will feel like a breeze! Give it a try and let me know what you think.<\/p>\n<p id=\"55bc\">That\u2019s the end of the second part. In the next post, we\u2019ll talk about optimizing training from a framework perspective. 
Plenty more to come: we haven\u2019t even talked about prediction yet!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Part 1 of this article discussed a few simple techniques that helped with initial scalability of machine learning&hellip; and hopefully with reducing manual ops. Since then, despite a few production hiccups due to the lack of high availability, life has been pretty good. However, traffic soon starts to increase, data piles up, more models need to be trained, etc. Technical and business stakes are getting higher, and the current architecture will go underwater soon. This post focuses on&nbsp;scaling training to a large number of machines.<\/p>\n","protected":false},"author":491,"featured_media":14352,"comment_status":"open","ping_status":"open","sticky":false,"template":"single-post-2.php","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[92],"ppma_author":[3117],"class_list":["post-1568","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-machine-learning"],"authors":[{"term_id":3117,"user_id":491,"is_guest":0,"slug":"julien-simon","display_name":"Julien SIMON","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"SIMON","first_name":"Julien","job_title":"","description":"Julien SIMON&nbsp;is Global Technical Evangelist, Artificial Intelligence and Machine Learning at Amazon Web Services. 
He frequently speaks at conferences and holds all eight AWS certifications."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1568","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/491"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1568"}],"version-history":[{"count":7,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1568\/revisions"}],"predecessor-version":[{"id":29704,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1568\/revisions\/29704"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/14352"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1568"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1568"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1568"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1568"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}