{"id":1710,"date":"2019-05-21T03:19:00","date_gmt":"2019-05-21T03:19:00","guid":{"rendered":"http:\/\/kusuaks7\/?p=1315"},"modified":"2023-09-12T13:17:52","modified_gmt":"2023-09-12T13:17:52","slug":"architecting-a-machine-learning-pipeline","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/architecting-a-machine-learning-pipeline\/","title":{"rendered":"Architecting a Machine Learning Pipeline"},"content":{"rendered":"<h3 id=\"178e\">How to build scalable Machine Learning systems\u200a\u2014\u200aPart\u00a02\/2<\/h3>\n<section>\n<h3 id=\"ee01\">Preface<\/h3>\n<p id=\"89d3\">When developing a model, data scientists work in some development environment tailored for Statistics and Machine Learning (Python, R etc) and are able to train and test models all in one \u2018sandboxed\u2019 environment while writing relatively little code. This is great for building interactive prototypes with fast time to market\u200a\u2014\u200athey are not productionised, low latency systems though!<\/p>\n<p id=\"01e8\">This is the 2nd in a series of articles, namely \u2018<strong><em>Being a Data Scientist does not make you a Software Engineer!<\/em><\/strong>\u2019, which covers how you can architect an end-to-end scalable Machine Learning (ML) pipeline.<\/p>\n<\/section>\n<section>\n<hr \/>\n<h4 id=\"64a4\">Revision<\/h4>\n<p>Hopefully, you have gone through the <a href=\"https:\/\/www.experfy.com\/blog\/being-a-data-scientist-does-not-make-you-a-software-engineer\">1st part\u00a0<\/a>of the series, where we introduced the basic architectural styles, design patterns, and the SOLID principles.<\/p>\n<figure id=\"fd63\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*jqnpgX0GUKwbhRCCcN8pnQ.png\" data-height=\"40\" data-image-id=\"1*jqnpgX0GUKwbhRCCcN8pnQ.png\" data-width=\"500\" \/><\/figure>\n<p id=\"ba67\"><strong>\u00a0TL;DR:<\/strong>\u00a0In case you haven\u2019t read it, let\u2019s repeat the \u2018holy grail\u2019\u200a\u2014\u200ai.e. the problem statement that a production-ready ML system should try to address:<\/p>\n<blockquote id=\"ac5f\"><p>The main objectives are to build a system that:<br \/>\n\u25b8 Reduces\u00a0<strong>latency<\/strong>;<br \/>\n\u25b8 Is integrated but\u00a0<strong>loosely coupled<\/strong>\u00a0with the other parts of the system, e.g. data stores, reporting, graphical user interface;<br \/>\n\u25b8 Can\u00a0<strong>scale<\/strong>\u00a0both horizontally and vertically;<br \/>\n\u25b8 Is\u00a0<strong>message driven<\/strong>\u00a0i.e. the system communicates via asynchronous, non-blocking message passing;<br \/>\n\u25b8 Provides efficient computation with regards to\u00a0<strong>workload management<\/strong>;<br \/>\n\u25b8 Is\u00a0<strong>fault-tolerant<\/strong>\u00a0and self healing i.e. breakdown management;<br \/>\n\u25b8 Supports\u00a0<strong>batch<\/strong>\u00a0and\u00a0<strong>real-time<\/strong>\u00a0processing.<\/p><\/blockquote>\n<figure id=\"78b6\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*dYmpIO39mAUq0EcgYLttGQ.png\" data-height=\"40\" data-image-id=\"1*dYmpIO39mAUq0EcgYLttGQ.png\" data-width=\"500\" \/><\/figure>\n<p id=\"3e9c\"><strong>\ud83c\udfa6 Scene Setting:<\/strong>\u00a0So by now you have seen the fundamental concepts of Software Engineering and you already are a seasoned Data Scientist.<\/p>\n<p id=\"31ba\"><strong>Without further ado, let\u2019s put two and two together\u2026<\/strong><\/p>\n<p id=\"774c\">For each of the ML Pipeline steps, I will be demonstrating how to design a production-grade architecture. I will intentionally not be referring to any specific technologies (apart from a couple of times that I give some examples for demonstration purposes).<\/p>\n<p id=\"38bc\"><strong>N.B.:\u00a0<\/strong>If you need to refresh on the ML pipeline steps, take a look at\u00a0<a href=\"https:\/\/towardsdatascience.com\/not-yet-another-article-on-machine-learning-e67f8812ba86\" target=\"_blank\" rel=\"noopener noreferrer\" class=\"broken_link\">this resource<\/a>.<\/p>\n<figure id=\"bc88\"><a href=\"https:\/\/towardsdatascience.com\/not-yet-another-article-on-machine-learning-e67f8812ba86\" action=\"image-link\" only=\"true\" class=\"broken_link\" rel=\"noopener\"><canvas width=\"75\" height=\"32\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 326px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*ElodORq2cRIXfKMy8F7a-w.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*ElodORq2cRIXfKMy8F7a-w.png\" \/><\/a><\/figure>\n<\/section>\n<section>\n<p style=\"text-align: center;\">ML Pipeline<\/p>\n<hr \/>\n<h3 id=\"ee0e\">Architecting a ML\u00a0Pipeline<\/h3>\n<p id=\"6c9f\">Traditionally, pipelines involve overnight batch processing, i.e. collecting data, sending it through an enterprise message bus and processing it to provide pre-calculated results and guidance for next day\u2019s operations. Whilst this works in some industries, it is really insufficient in others, and especially when it comes to ML applications.<\/p>\n<p id=\"b66e\">The following diagram shows a ML pipeline applied to a real-time business problem where\u00a0<strong>features and predictions are time sensitive<\/strong>\u00a0(e.g. Netflix\u2019s recommendation engines, Uber\u2019s arrival time estimation, LinkedIn\u2019s connections suggestions, Airbnb\u2019s search engines etc).<\/p>\n<figure id=\"18c0\"><canvas width=\"75\" height=\"15\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 153px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*hdwy2LGZMVC6gcZK0w7KlA.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*hdwy2LGZMVC6gcZK0w7KlA.png\" \/><\/figure>\n<p style=\"text-align: center;\">Real-Time ML<\/p>\n<p id=\"b51f\">It comprises of two clearly defined components:<\/p>\n<ul>\n<li id=\"861c\"><strong>Online Model Analytics<\/strong>: The top row represents the operational component of the application i.e. where the model is applied for\u00a0<strong>real-time decision<\/strong>\u00a0making.<\/li>\n<li id=\"4b13\"><strong>Offline Data Discovery<\/strong>: The bottom row represents the learning component i.e. analysis on historical data to create the ML model in a\u00a0<strong>batch-processing<\/strong>\u00a0mode.<\/li>\n<\/ul>\n<p id=\"bd39\">We will now take this simplified diagram and unfold its inner workings.<\/p>\n<\/section>\n<section>\n<hr \/>\n<h3 id=\"9e14\">\u2014 \u2460: Data Ingestion<\/h3>\n<blockquote id=\"b902\"><p>Data collection.<\/p><\/blockquote>\n<figure id=\"cbf3\"><canvas width=\"75\" height=\"27\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 276px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*WNh36Lf6eVov3uUPEOdjDg.jpeg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*WNh36Lf6eVov3uUPEOdjDg.jpeg\" \/><\/figure>\n<p id=\"694f\">Funneling incoming data into a data store is the first step of any ML workflow. The key point is that data is persisted without undertaking any transformation at all, to allow us to have an\u00a0<strong>immutable<\/strong>\u00a0record of the original dataset. Data can be fed from various data sources; either obtained by request (pub\/sub) or streamed from other services.<\/p>\n<p id=\"33a8\"><strong>NoSQL document databases<\/strong>\u00a0are ideal for storing large volumes of rapidly changing structured and\/or unstructured data since they are schema-less. They also offer a distributed, scalable, replicated data storage.<\/p>\n<h4 id=\"eb05\">Offline<\/h4>\n<p id=\"1040\">In the offline layer, data flows into the Raw Data Store via an\u00a0<strong>Ingestion Service\u200a<\/strong>\u2014\u200aa composite orchestration service, which encapsulates the data sourcing and persistence. Internally, a repository pattern is employed to interact with a data service, which in return interacts with the data store. When the data is saved in the database, a unique batch-id is assigned to the dataset, to allow for efficient querying and end-to-end data lineage and traceability.<\/p>\n<p id=\"8e6e\">To be performant, the ingestion distribution is twofold:<\/p>\n<p id=\"9419\">\u2022 there is a dedicated pipeline for\u00a0<strong>each dataset<\/strong>\u00a0so all of them are processed independently and concurrently, and<br \/>\n\u2022 within each pipeline, the data is\u00a0<strong>partitioned<\/strong>\u00a0to take advantage of the multiple server cores, processors or even servers.<\/p>\n<p id=\"b39b\">Spreading the data preparation across multiple pipelines, horizontally and vertically, reduces the overall time to complete the job.<\/p>\n<figure id=\"70b2\"><canvas width=\"75\" height=\"27\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 268px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*Abaj_g0TjFTnr0cwfcdTTQ.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*Abaj_g0TjFTnr0cwfcdTTQ.png\" \/><\/figure>\n<p id=\"9837\">The ingestion service runs regularly on a\u00a0<strong>schedule\u00a0<\/strong>(once or multiple times per day) or on a\u00a0<strong>trigger<\/strong>: a topic decouples producers (i.e. the sources of data) from consumers (in our case the ingestion pipeline), so when source data is available, the producer system publishes a message to the broker, and the embedded notification service responds to the subscription by triggering the ingestion. The notification service also broadcasts to the broker that the source data has been successfully processed and is saved in the database.<\/p>\n<h4 id=\"03fb\">Online<\/h4>\n<p id=\"c28c\">In the online layer, the\u00a0<strong>Online Ingestion Service<\/strong>\u00a0is the entry point to the streaming architecture as it decouples and manages the flow of information from data sources to the processing and storage components, by providing reliable, high throughput, low latency capabilities. It functions as an enterprise-scale \u2018<strong>Data Bus<\/strong>\u2019. Data is saved in the long term Raw Data Store, but is also a\u00a0<strong>pass-through<\/strong>\u00a0layer to the next online streaming service, for further real-time processing.<\/p>\n<p id=\"3318\">Example technologies used here can be Apache Kafka (pub\/sub messaging system) and Apache Flume (data collection to long term db), but there are more you will come across, depending on your enterprise\u2019s tech stack.<\/p>\n<\/section>\n<section>\n<hr \/>\n<h3 id=\"1a5a\">\u2014 \u2461: Data Preparation<\/h3>\n<blockquote id=\"ec4c\"><p>Data exploration, data transformation and feature engineering.<\/p><\/blockquote>\n<figure id=\"644d\"><canvas width=\"75\" height=\"27\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 276px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*rarKVMm-KG093RWJiGvVdg.jpeg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*rarKVMm-KG093RWJiGvVdg.jpeg\" \/><\/figure>\n<p id=\"2d8b\">Once the data is ingested, a distributed pipeline is generated which assesses the condition of the data, i.e. looks for format differences, outliers, trends, incorrect, missing, or skewed data and rectify any anomalies along the way. This step also includes the feature engineering process. There are three main phases in a feature pipeline: extraction, transformation and selection.<\/p>\n<figure id=\"da6e\"><canvas width=\"75\" height=\"15\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*g7DbrsT2Xr-Qwdq72S2Qyg.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*g7DbrsT2Xr-Qwdq72S2Qyg.png\" \/><\/figure>\n<p style=\"text-align: center;\">Feature Engineering Operations<\/p>\n<p id=\"157b\">As this is the most complex part of a ML project, introducing the right design patterns is crucial, so in terms of code organisation having a\u00a0<strong>factory method to<\/strong>\u00a0generate the features based on some common abstract feature behaviour as well as a\u00a0<strong>strategy<\/strong>\u00a0pattern to allow the selection of the right features at run time is a sensible approach.\u00a0Both feature extractors and transformers should structure with composition and re-usability in mind.<\/p>\n<p id=\"47e5\">Selecting the features can be left to the caller, or can be automated e.g. apply a\u00a0<strong>chi-squared<\/strong>\u00a0statistical test to rank the impact of each feature on the concept label and discard the less impactful features prior to model training. A series of selector APIs can be defined to enable this. Either way, to ensure consistency on the features used as model inputs and at scoring, a\u00a0<strong>unique id is<\/strong>\u00a0assigned to each feature set.<\/p>\n<p id=\"e41f\">Broadly speaking, a data preparation pipeline should be assembled into a\u00a0<strong>series of immutable transformations<\/strong>, that can easily be combined. This is where the significance of testing and high code coverage becomes an important factor for the project\u2019s success.<\/p>\n<h4 id=\"d3b2\">Offline<\/h4>\n<p id=\"45c6\">In the offline layer, the\u00a0<strong>Data Preparation Service<\/strong>\u00a0is triggered by the completion of the ingestion service. It sources the Raw Data, undertakes all the feature engineering logic, and saves the generated features in the\u00a0<em>Feature Data Store<\/em>.<\/p>\n<p id=\"d96e\">The same partitioning applies here too (i.e. dedicated pipelines\/parallelism).<\/p>\n<p id=\"e590\">Optionally, the features from multiple data sources can be combined, so a \u2018join\/sync\u2019 task is designed to aggregate all the intermediate completion events and create these new, combined features. In the end, the notification service broadcasts to the broker this process is complete and the features are available.<\/p>\n<p id=\"9cfb\">When each data preparation pipeline \ufb01nishes, the features are also <b>replicated to<\/b>\u00a0the Online Feature Data Store, so that the features can be queried with low latency for real-time prediction.<\/p>\n<h4 id=\"52be\">Online<\/h4>\n<p id=\"a98c\">The raw data is streamed from the ingestion pipeline into the\u00a0<strong>Online Data Preparation Service<\/strong>. The generated features are stored in an\u00a0<strong>in-memory online<\/strong><em>\u00a0Feature Data Store<\/em>\u00a0where they can be read at low latency at prediction time but are also persisted in the long term Feature Data Store for future training. Additionally, the in-memory database can be pre-warmed by loading features from the long term Feature Data Store.<\/p>\n<p id=\"1052\">Continuing the earlier tech stack example, a frequently used streaming engine is Apache Spark.<\/p>\n<figure id=\"396e\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*5Ukva4hwld-6yGZ35ym-Tg.png\" data-height=\"25\" data-image-id=\"1*5Ukva4hwld-6yGZ35ym-Tg.png\" data-width=\"632\" \/><\/figure>\n<p id=\"945c\"><strong>Offline Drill Through:\u00a0<\/strong>If we were to drill through the offline Ingestion and the Data Preparation services interaction, we would have something like below:<\/p>\n<p id=\"20ef\">\u2022\u00a0<strong>(1)\u00a0<\/strong>One or more data producers publish events to a designated \u2018Source Data Available\u2019 topic of the message broker, that the data is ready for consumption.<br \/>\n\u2022\u00a0<strong>(2)\u00a0<\/strong>The Ingestion Service is listening to the topic.<br \/>\n\u2022 Once the respective event is received, it handles it by\u00a0<strong>(3)\u00a0<\/strong>sourcing the data and\u00a0<strong>(4)\u00a0<\/strong>persisting it in its raw format in the data store.<br \/>\n\u2022\u00a0<strong>(5)\u00a0<\/strong>When the process finishes, it raises a new event to the \u2018Raw Data Extracted\u2019 topic to notify that the raw data is ready.<br \/>\n\u2022\u00a0<strong>(6)<\/strong>\u00a0The Data Preparation Service is listening to the topic.<br \/>\n\u2022 Once the respective event is received, it handles it by\u00a0<strong>(7)\u00a0<\/strong>sourcing the raw data, preparing it and engineering new features and\u00a0<strong>(8)\u00a0<\/strong>persisting the features in the data store.<br \/>\n\u2022\u00a0<strong>(9)\u00a0<\/strong>When the process finishes, it raises a new event to the \u2018Features Generated\u2019 topic to notify that the features have been generated.<\/p>\n<figure id=\"d5f5\"><canvas width=\"75\" height=\"62\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*cQ1maXIlczpkI4NzKKYDfQ.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*cQ1maXIlczpkI4NzKKYDfQ.png\" \/><\/figure>\n<\/section>\n<section>\n<p style=\"text-align: center;\">Offline Data Ingestion \/ Preparation Interactions<\/p>\n<hr \/>\n<h3 id=\"69a3\">\u2014 \u2462: Data Segregation<\/h3>\n<blockquote id=\"a071\"><p>Split subsets of data to train the model and further validate how it performs against new data.<\/p><\/blockquote>\n<figure id=\"73f7\"><canvas width=\"75\" height=\"27\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 276px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*6jS1vIWrXf-mBG0rIXX60Q.jpeg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*6jS1vIWrXf-mBG0rIXX60Q.jpeg\" \/><\/figure>\n<p id=\"e765\">The fundamental goal of the ML system is to use an accurate model based on the quality of its pattern prediction for data that it has not been trained on. As such, existing labelled data is used as a\u00a0<strong>proxy\u00a0<\/strong>for future\/unseen data, by splitting it into training and evaluation subsets.<\/p>\n<p id=\"4359\">There are many strategies to do that, four of the most common ones are:<\/p>\n<p id=\"5059\">\u2022 Use a default or custom ratio to split it into the two subsets,\u00a0<strong>sequentially\u00a0<\/strong>i.e. in the order it appears in the source, making sure there is no overlapping. For instance, use the first 70% of data for training and the subsequent 30% of data for testing.<br \/>\n\u2022 Use a default or custom ratio to split it into the two subsets via a\u00a0<strong>random\u00a0<\/strong>seed. For instance, select a random 70% of the source data for training and the complement of this random subset for testing.<br \/>\n\u2022 Use either of the methods above (sequential vs. random) but also <b>shuffle the<\/b>\u00a0records within each dataset.<br \/>\n\u2022 Use a custom injected strategy to split the data, when an\u00a0<strong>explicit control over<\/strong>\u00a0the separation is needed.<\/p>\n<p id=\"1765\">The data segregation is not a separate ML pipeline as such, but an API or service must be available to facilitate this task. The next two pipelines (model training and evaluation) must be able to call this API to get back the requested datasets. In terms of code organisation, a strategy pattern is necessary so the caller service can select the right algorithm at run time, and obviously, the ability to\u00a0<strong>inject<\/strong>\u00a0the ratio or random seed is needed. Additionally, the API must be able to return the data with or without labels\/traits\u200a\u2014\u200afor training and evaluation respectively.<\/p>\n<p id=\"4922\">To protect the caller from specifying parameters that cause an\u00a0<strong>uneven data distribution<\/strong>, a warning should be raised and returned along with the dataset.<\/p>\n<\/section>\n<section>\n<hr \/>\n<h3 id=\"cc87\">\u2014 \u2463: Model\u00a0Training<\/h3>\n<blockquote id=\"a72a\"><p>Use the training subset of data to let the ML algorithm recognise the patterns in it.<\/p><\/blockquote>\n<figure id=\"3f56\"><canvas width=\"75\" height=\"27\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 277px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*6wwlGuFRfd0ZZ4x3BLAEqA.jpeg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*6wwlGuFRfd0ZZ4x3BLAEqA.jpeg\" \/><\/figure>\n<p id=\"1b73\">The model training pipeline is\u00a0<strong>offline\u00a0<\/strong>only and its schedule varies depending on the criticality of the application, from every couple of hours to once a day. Apart from schedulers, the service is also time and event triggered.<\/p>\n<p id=\"8017\">It consists of a library of training model algorithms (linear regression, ARIMA, k-means, decision trees, etc), which is built in a\u00a0<strong>SOLID\u00a0<\/strong>way to make provision for continuous development of new model types as well as making them interchangeable. Also, containment, using the\u00a0<strong>facade\u00a0<\/strong>pattern, is a crucial technique for integrating third-party APIs (this is where your Python Jupyter notebook can be wrapped and called too).<\/p>\n<p id=\"35f5\">There are a few options for parallelisation:<\/p>\n<p id=\"43ab\">\u2022 The simplest form is to have a\u00a0<strong>dedicated pipeline for each model<\/strong>, i.e. all models run concurrently.<br \/>\n\u2022 Another idea is to parallelise the\u00a0<strong>training data<\/strong>\u00a0i.e. the data is partitioned and each partition has a replica of the model. This is preferred for those models that they need all fields of an instance to perform the computation (e.g. LDA, MF).<br \/>\n\u2022 A third option is to parallelise the\u00a0<strong>model<\/strong>\u00a0itself i.e. the model is partitioned and each partition is responsible for the updates of a portion of parameters. It is ideal for Linear models, such as LR, SVM.<br \/>\n\u2022 Finally, a\u00a0<strong>hybrid<\/strong>\u00a0approach can be used, combining one or more options. (For more info I recommend you read\u00a0<a href=\"https:\/\/academic.oup.com\/nsr\/article\/5\/2\/216\/3052720\" target=\"_blank\" rel=\"noopener noreferrer\" class=\"broken_link\">this publication<\/a>).<\/p>\n<p id=\"21bd\">Model training must be implemented with\u00a0<strong>error tolerance<\/strong>\u00a0in mind and also data checkpoints and failover on training partitions should be enabled\u200a\u2014\u200ae.g. each partition can be retrained if the previous attempt fails due to some transient issue (e.g. timeout).<\/p>\n<figure id=\"3129\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*5Ukva4hwld-6yGZ35ym-Tg.png\" data-height=\"25\" data-image-id=\"1*5Ukva4hwld-6yGZ35ym-Tg.png\" data-width=\"632\" \/><\/figure>\n<p id=\"d374\">Since we covered the capabilities of this pipeline, let\u2019s unpack the workflow: The\u00a0<strong>Model Training Service<\/strong>\u00a0gets the training configuration parameters (e.g. model type, hyper-parameters, features to be used etc) from the Configuration Service and will then request the training dataset from the Data Segregation API. This dataset is sent to all the models in parallel and once complete, the models, the original configuration, the learned parameters, plus metadata on the training set and timings, will be saved in the\u00a0<em>Model Candidate Data Store<\/em>.<\/p>\n<\/section>\n<section>\n<hr \/>\n<h3 id=\"6aa2\">\u2014 \u2464: Candidate Model Evaluation<\/h3>\n<blockquote id=\"6aa0\"><p>Assess the performance of the model using the test subset of data to understand how accurate the prediction is.<\/p><\/blockquote>\n<figure id=\"c308\"><canvas width=\"75\" height=\"27\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 276px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*J0EOgRUfMGU6l37ilTFlkw.jpeg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*J0EOgRUfMGU6l37ilTFlkw.jpeg\" \/><\/figure>\n<p id=\"4e6d\">This pipeline is also\u00a0<strong>offline<\/strong>. The predictive performance of a model is evaluated by comparing predictions on the evaluation dataset with true values using a variety of metrics. The\u00a0<strong>\u201cbest\u201d model<\/strong>\u00a0on the evaluation subset is selected to make predictions on future\/new instances. A library of several evaluators is designed to provide a model\u2019s accuracy metrics (e.g. ROC curve, PR curve), which are also saved against the model in the data store. Again, same patterns are applicable here to allow flexibility on combining and switching between evaluators.<\/p>\n<figure id=\"626b\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*5Ukva4hwld-6yGZ35ym-Tg.png\" data-height=\"25\" data-image-id=\"1*5Ukva4hwld-6yGZ35ym-Tg.png\" data-width=\"632\" \/><\/figure>\n<p id=\"b816\">In terms of orchestration, the\u00a0<strong>Model Evaluation Service<\/strong>\u00a0requests the evaluation dataset from the Data Segregation API, and for each model sourced from the Model Candidate repository, it applies the relevant evaluators. The evaluation results are saved back to the repository. This is an iterative process and hyperparameter optimisation, as well as regularisation techniques, are also applied to come up with the final model. The best model is marked for deployment. Finally, the notification service broadcasts that a model is ready for deployment.<\/p>\n<p id=\"d4d4\">This pipeline also needs to live up to all the reactive traits.<\/p>\n<\/section>\n<section>\n<hr \/>\n<h3 id=\"2540\">\u2014 \u2465: Model Deployment<\/h3>\n<blockquote id=\"7048\"><p>Once the chosen model is produced, it is typically deployed and embedded in decision-making frameworks.<\/p><\/blockquote>\n<figure id=\"e70b\"><canvas width=\"75\" height=\"27\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 276px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*AIV7NzI0iu7XTjs1-M45fw.jpeg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*AIV7NzI0iu7XTjs1-M45fw.jpeg\" \/><\/figure>\n<p id=\"c207\">Model deployment is not the end; it is just the beginning!<\/p>\n<p id=\"222d\">The best model selected is deployed for offline (asynchronous) and online (synchronous) predictions.\u00a0<strong>More than one<\/strong>\u00a0models can be deployed at any time to enable the safe transition between old and new models\u200a\u2014\u200ai.e. when<br \/>\ndeploying a new model, the services need to keep serving prediction requests.<\/p>\n<p id=\"3b1a\">Traditionally, a challenge in deployment has been that the\u00a0<strong>programming languages<\/strong>\u00a0needed to operationalise models have been different from those that have been used to develop them.\u00a0<strong>Porting<\/strong>\u00a0a Python or R model into a production language like C++, C# or Java is challenging, and often results in reduced performance (speed &amp; accuracy) of the original model. There are a few ways to address this issue. In no particular order:<\/p>\n<p id=\"93bf\">\u2022 Rewrite the code in the new language [i.e. translate from Python to CSharp]<br \/>\n\u2022 Create custom DSL (Domain Specific Language) to describe the model<br \/>\n\u2022 Microservice (accessed through a RESTful API)<br \/>\n\u2022 API-first approach<br \/>\n\u2022 Containerisation<br \/>\n\u2022 Serialise the model and load into a in-memory key-value store<\/p>\n<p id=\"1436\">More specifically:<\/p>\n<h4 id=\"5f14\">Offline<\/h4>\n<p id=\"ed72\">\u2022 In an offline mode, the prediction model can be deployed to a <b>container and<\/b>\u00a0run as a microservice to create predictions on demand or on a periodic schedule.<br \/>\n\u2022 A different choice is to create a wrapper around it so you gain control over the functions available. Once a batch prediction request is made, you can\u00a0<strong>load it dynamically into memory<\/strong>\u00a0as a separate process, call the prediction functions, unload it from memory and free up the resources (native handles).<br \/>\n\u2022 Finally, another approach is to wrap the library into an\u00a0<strong>API\u00a0<\/strong>and let the caller invoke it directly or wrap it around their service to fully take over the reins of the prediction instrumentation.<\/p>\n<p id=\"9bfe\">In terms of scalability, multiple parallel pipelines can be created to accommodate the load. This involves inconsiderable effort, as the ML models are stateless.<\/p>\n<h4 id=\"7b24\">Online<\/h4>\n<p id=\"e3cf\">\u2022 Here the prediction model can be deployed in a container to a\u00a0<strong>service cluster<\/strong>, generally distributed in many servers behind a queue for load balancing to assure scalability, low latency and high throughput. The clients can send prediction requests as network\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Remote_procedure_call\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/en.wikipedia.org\/wiki\/Remote_procedure_call\" data->remote procedure calls<\/a>\u00a0(RPC).<br \/>\n\u2022 Alternatively,\u00a0<strong>key-value stores<\/strong>\u00a0(e.g Redis) support the storage of a model and its parameters, which gives a big boost on performance.<\/p>\n<figure id=\"30a8\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*5Ukva4hwld-6yGZ35ym-Tg.png\" data-height=\"25\" data-image-id=\"1*5Ukva4hwld-6yGZ35ym-Tg.png\" data-width=\"632\" \/><\/figure>\n<p id=\"31f3\">With regards to the actual model deployment activity, it can be automated via a\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Continuous_delivery\" target=\"_blank\" rel=\"noopener nofollow noreferrer\" data-href=\"https:\/\/en.wikipedia.org\/wiki\/Continuous_delivery\" data->continuous delivery<\/a>\u00a0implementation: The required files are packaged, the model is validated by a reliable testing suite and is finally deployed into a running container.<br \/>\nThe tests are executed by an automated build pipeline: Short, self-contained, stateless unit tests are evaluated first. If they pass, the prediction model quality is measured in bigger integration or regression tests. When both levels of testing have passed, the application is deployed to the serving environment.<\/p>\n<p id=\"356e\">Enabling\u00a0<strong>one-click deployment<\/strong>\u00a0is ideal.<\/p>\n<\/section>\n<section>\n<hr \/>\n<h3 id=\"579d\">\u2014 \u2466: Model\u00a0Scoring<\/h3>\n<blockquote id=\"4ba0\"><p>Process of applying a ML model to a new dataset in order to uncover practical insights that will help solve a business problem. A.k.a. Model Serving.<\/p><\/blockquote>\n<figure id=\"d1e6\"><canvas width=\"75\" height=\"27\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 277px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*9QnXW7Qk-CaG7a2C0elYlg.jpeg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*9QnXW7Qk-CaG7a2C0elYlg.jpeg\" \/><\/figure>\n<p id=\"37ea\"><em>Model Scoring\u00a0<\/em>and<em>\u00a0Model Serving<\/em>\u00a0are two terms that are used interchangeably in the industry. What scoring really means, occurred to me after reading\u00a0<a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/machine-learning\/studio-module-reference\/machine-learning-score\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/machine-learning\/studio-module-reference\/machine-learning-score\" data->this resource<\/a>, so before moving on, let\u2019s quickly cover the basics, in case this is not clear to you either:<\/p>\n<p id=\"11d2\">Model Scoring is the process of generating new values, given a model and some new input. The generic term\u00a0<em>score<\/em>\u00a0is used, rather than\u00a0<em>prediction<\/em>, as it may result in different types of values:<\/p>\n<p id=\"0b07\">\u2022 A list of recommended items<br \/>\n\u2022 Numeric values, for time series models and regression models<br \/>\n\u2022 A probability value, indicating the likelihood that a new input belongs to some existing category<br \/>\n\u2022 The name of a category or cluster to which a new item is most similar<br \/>\n\u2022 A predicted class or outcome, for classification models.<\/p>\n<figure id=\"336b\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*5Ukva4hwld-6yGZ35ym-Tg.png\" data-height=\"25\" data-image-id=\"1*5Ukva4hwld-6yGZ35ym-Tg.png\" data-width=\"632\" \/><\/figure>\n<p id=\"52ce\">Moving on\u2026 Once models are deployed they are used for scoring based on the feature data loaded by previous pipelines or directly from a client service. Models should\u00a0<strong>behave the same<\/strong>\u00a0in both offline and online modes when serving predictions.<\/p>\n<h4 id=\"c852\">Offline<\/h4>\n<p id=\"53e6\">In the offline layer, the\u00a0<strong>Scoring Service<\/strong>\u00a0is optimised for high throughput, fire-and-forget predictions for a large collection of data. An application can send an asynchronous request to kick off the scoring process but needs to wait until the batch scoring process completes before it can access the prediction results. The service prepares the data, generates the features, but also fetches extra features from the Feature Data Store. Once scoring takes place, the results are saved in the\u00a0<em>Score Data Store<\/em>. A message is sent to the broker to notify that the scoring has completed. The application is listening to this event and when notified it fetches the scores.<\/p>\n<h4 id=\"570e\">Online<\/h4>\n<p id=\"6ac2\">A client sends a request to the\u00a0<strong>Online Scoring Service<\/strong>. It can optionally specify the version of the model to be invoked, so the\u00a0<strong>Model Router<\/strong>\u00a0inspects the request and sends it to the respective model. Based on the request, similarly to the offline layer, the service prepares the data, generates the features, and optionally fetches extra features from the Feature Data Store. Once scoring takes place, the results are saved in the\u00a0<em>Score Data Store<\/em>\u00a0and then sent back to the client over the network.<\/p>\n<p id=\"bcf1\">Depending on the use-case, scores can also be delivered to the client asynchronously i.e. independently of the request:<br \/>\n\u2022\u00a0<strong>Push<\/strong>: Once the scores are generated, they are pushed to the caller as a notification.<br \/>\n\u2022\u00a0<strong>Poll<\/strong>: Once the scores are generated, they are stored in a low read-latency database; the caller periodically polls the database for available predictions.<\/p>\n<p id=\"b1f6\">In order to minimise the time the system takes to serve the scoring when it receives the request, two methods are employed:<br \/>\n\u2022 the input features are stored in a low-read latency in-memory data store,<br \/>\n\u2022 predictions precomputed in an offline batch-scoring job are cached for easy access [this is depending on the use-case, as offline predictions might not be relevant].<\/p>\n<\/section>\n<section>\n<hr \/>\n<h3 id=\"5d13\">\u2014 \u2467: Performance Monitoring<\/h3>\n<blockquote id=\"bd7f\"><p>The model is continuously monitored to observe how it behaved in the real world and calibrated accordingly.<\/p><\/blockquote>\n<figure id=\"7df6\"><canvas width=\"75\" height=\"27\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 276px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*F6TRLtVSBP-Gbey_E3dOMQ.jpeg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*F6TRLtVSBP-Gbey_E3dOMQ.jpeg\" \/><\/figure>\n<p id=\"9eff\">Any ML solution requires a well-defined performance monitoring solution. An example of information that we might want to see for model serving applications includes:<br \/>\n\u2022 model identifier,<br \/>\n\u2022 deployment date\/time,<br \/>\n\u2022 number of times the model has been served,<br \/>\n\u2022 average\/min\/max model serving times,<br \/>\n\u2022 distribution of features used.<br \/>\n\u2022 predicted vs. actual\/observed results.<\/p>\n<p id=\"a63d\">This\u00a0<strong>metadata\u00a0<\/strong>is calculated during the model scoring and then used for monitoring.<\/p>\n<p id=\"afe4\">This is another\u00a0<strong>offline<\/strong>\u00a0pipeline. The\u00a0<strong>Performance Monitoring Service<\/strong>\u00a0is notified when a new prediction is served, executes the performance evaluation, persists the results and the relevant notifications are raised. The evaluation itself takes place by comparing the scores to the observed outcomes generated by the data pipeline (training set). The basic implementation for monitoring can follow different approaches with the most popular being\u00a0<strong>logging analytics<\/strong>\u00a0(Kibana, Grafana, Splunk, etc).<\/p>\n<p id=\"8892\">To ensure\u00a0<strong>built-in resilience<\/strong>\u00a0of the ML system, a poor\u00a0<strong>speed\u00a0<\/strong>performance of the new model triggers the scores to be generated by the previous model. A\u00a0<em>\u201cbetter wrong than late\u201d<\/em>\u00a0philosophy is adopted: if a term in the model takes too long to be computed, the model is substituted by a previously deployed model, rather than blocking.<\/p>\n<p id=\"d3a2\">Additionally, the scores are joined to the observed outcomes when they become available\u200a\u2014\u200athat means that continuous\u00a0<strong>accuracy\u00a0<\/strong>measurements of the model are generated, and along the same lines with speed performance, any sign of degradation can be dealt with by reverting to the previous model.<\/p>\n<p id=\"ab1f\">A chain of responsibility pattern can be used to chain the different versions together.<\/p>\n<p id=\"6fa6\">Model monitoring is a continuous process: a shift<strong>\u00a0<\/strong>in prediction might result in restructuring the model design. Providing accurate predictions\/recommendations\u00a0<strong>continuously\u00a0<\/strong>to drive the business forward is what defines the benefits of ML!<\/p>\n<\/section>\n<section>\n<hr \/>\n<h3 id=\"12ed\">Cross Cutting\u00a0Concerns<\/h3>\n<p id=\"cd4b\">We cannot end this article by not referring to the cross cutting concerns. A ML application, like any other application, has some common functionality that spans across layers\/pipelines. Even in an individual layer, such functionalities could be used across all classes\/services,\u00a0<strong>cutting<\/strong>\u00a0through and\u00a0<strong>crossing<\/strong>\u00a0all the normal boundaries.<\/p>\n<p id=\"38f9\">The cross cutting concerns are normally\u00a0<strong>centralised<\/strong>\u00a0in one place, which increases the application\u2019s modularity. They are often managed by other teams in the organisation or are\u00a0<strong>off-the-shelf<\/strong>\u00a0\/ thrid-party products.\u00a0<strong>Dependency injection<\/strong>\u00a0is the best way to inject these in the relevant places in the code.<\/p>\n<p id=\"4b43\">The most important concerns to be addressed, in our use-case are:<\/p>\n<p id=\"845d\">\u2022 Notifications<br \/>\n\u2022 Scheduling<br \/>\n\u2022 Logging Framework (and Alert mechanism)<br \/>\n\u2022 Exception Management<br \/>\n\u2022 Configuration Service<br \/>\n\u2022 Data Service (to expose querying in a data store)<br \/>\n\u2022 Auditing<br \/>\n\u2022 Data Lineage<br \/>\n\u2022 Caching<br \/>\n\u2022 Instrumentation<\/p>\n<\/section>\n<section>\n<hr \/>\n<h3 id=\"a2e4\">And putting it all\u00a0together<\/h3>\n<p id=\"11a2\">There you have it\u2026 A production ready ML system:<\/p>\n<figure id=\"640d\" data-scroll=\"native\"><canvas width=\"75\" height=\"37\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 358px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/2560\/1*Bt1PVhREAdFLUZeGYB8eSQ.jpeg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/2560\/1*Bt1PVhREAdFLUZeGYB8eSQ.jpeg\" \/><figcaption>\u00a0<\/figcaption><\/figure>\n<\/section>\n<section>\n<p style=\"text-align: center;\">End to End ML Architecture<\/p>\n<hr \/>\n<h3 id=\"ba84\">Footnote<\/h3>\n<p id=\"f826\">Congratulations! You made it till the end! I do hope you enjoyed the ride into Software Engineering for Data Science!<\/p>\n<p id=\"16e0\">Thanks for reading!<\/p>\n<p>Originally published in <a href=\"https:\/\/towardsdatascience.com\/architecting-a-machine-learning-pipeline-a847f094d1c7\" rel=\"noopener\">Medium<\/a>.<\/p>\n<\/section>\n","protected":false},"excerpt":{"rendered":"<p>Traditionally, pipelines involve overnight batch processing, i.e. collecting data, sending it through an enterprise message bus and processing it to provide pre-calculated results and guidance for the next day&rsquo;s operations. Whilst this works in some industries, it is really insufficient in others, and especially when it comes to ML applications. When developing a model, data scientists work in some development environment tailored for Statistics and Machine Learning (Python, R etc) and are able to train and test models all in one &lsquo;sandboxed&rsquo; environment while writing relatively little code.<\/p>\n","protected":false},"author":556,"featured_media":2803,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[92],"ppma_author":[3236],"class_list":["post-1710","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-machine-learning"],"authors":[{"term_id":3236,"user_id":556,"is_guest":0,"slug":"semi-koen","display_name":"Semi\u00a0Koen Semi\u00a0Koen","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Semi\u00a0Koen","first_name":"Semi\u00a0Koen","job_title":"","description":"Semi Koen&nbsp;is Director | Technical Architect, Investment Banking at Mizuho International."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1710","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/556"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1710"}],"version-history":[{"count":3,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1710\/revisions"}],"predecessor-version":[{"id":30140,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1710\/revisions\/30140"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/2803"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1710"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1710"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1710"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1710"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}