{"id":2102,"date":"2019-11-29T01:04:38","date_gmt":"2019-11-29T01:04:38","guid":{"rendered":"http:\/\/kusuaks7\/?p=1707"},"modified":"2024-02-19T15:47:42","modified_gmt":"2024-02-19T15:47:42","slug":"how-to-extend-scikit-learn-and-bring-sanity-to-your-machine-learning-workflow","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/how-to-extend-scikit-learn-and-bring-sanity-to-your-machine-learning-workflow\/","title":{"rendered":"How to Extend Scikit-learn and Bring Sanity to Your Machine Learning Workflow"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"2102\" class=\"elementor elementor-2102\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-9918c6c elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"9918c6c\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-406098ae\" data-id=\"406098ae\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-358107c0 elementor-widget elementor-widget-text-editor\" data-id=\"358107c0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWe usually hear (and say) that machine learning is just a commercial name for Statistics. 
That might be true, but if we&#8217;re building models using computers, what machine learning really encompasses is Statistics\u00a0<em>and<\/em>\u00a0<strong>Software Engineering<\/strong>.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-24419e5 elementor-widget elementor-widget-heading\" data-id=\"24419e5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\"><blockquote>\n<h4><strong><em>To make great products: do machine learning like the great engineer you are, not like the great machine learning expert you aren't - Rules of Machine Learning: Best Practices for ML Engineering [1]<\/em><\/strong><\/h4>\n<\/blockquote><\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d83b02a elementor-widget elementor-widget-text-editor\" data-id=\"d83b02a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThis combination of Statistics and Software Engineering brings new challenges to the Software Engineering world. Developing applications in the ML domain is fundamentally different from prior software application domains, as Microsoft researchers point out [2].\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6ae9a32 elementor-widget elementor-widget-text-editor\" data-id=\"6ae9a32\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWhen you don&#8217;t work in a large company, or when you&#8217;re just starting out in the field, it&#8217;s difficult to learn and apply Software Engineering best practices because this information is not easy to find. 
Fortunately, open-source projects can be a great source of knowledge and can help us learn from people who have more experience than we do. One of my favorite ML libraries (and sources of knowledge) is\u00a0<strong>Scikit-learn<\/strong>.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f203a40 elementor-widget elementor-widget-text-editor\" data-id=\"f203a40\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nThe project does a great job of providing an easy-to-use interface on top of solid implementations, which makes it both a great way to start in the field of ML and a tool used in industry. Using scikit-learn tools and even reading the maintainers&#8217; answers in the issue discussions on GitHub is a great way to learn from them. Scikit-learn has a lot of contributors from industry and from academia, so as these people make contributions their knowledge gets \u201cembedded\u201d in the library. One rule of thumb of the scikit-learn project is that user code should not be tied to scikit-learn, which is a library and not a framework [3]. This makes it easy to extend scikit-learn&#8217;s functionality to suit our needs.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5db40bd elementor-widget elementor-widget-text-editor\" data-id=\"5db40bd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tToday we&#8217;re going to learn how to do this by building a custom transformer and using it to build pipelines. 
By doing so our code becomes easy to maintain and reuse, two aspects of Software Engineering best practices.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-48a5fdc elementor-widget elementor-widget-heading\" data-id=\"48a5fdc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>The scikit-learn API<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d922219 elementor-widget elementor-widget-text-editor\" data-id=\"d922219\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIf you&#8217;re familiar with scikit-learn you probably know how to use objects such as estimators and transformers, but it&#8217;s good to formalize their definitions so we can build on top of them. The basic API consists of three interfaces (and one class can implement multiple interfaces):\n<ul>\n \t<li>estimator &#8211; the base object, implements the fit() method<\/li>\n \t<li>predictor &#8211; an interface for making predictions, implements the predict() method<\/li>\n \t<li>transformer &#8211; an interface for converting data, implements the transform() method<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c10a20a elementor-widget elementor-widget-text-editor\" data-id=\"c10a20a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tScikit-learn has many out-of-the-box transformers and predictors, but we often need to transform data in different ways. 
Building custom transformers using the transformer interface makes our code maintainable, and we can also use the new transformer with other scikit-learn objects like Pipeline and RandomizedSearchCV or GridSearchCV. Let&#8217;s see how to do that. All the code can be found\u00a0<a href=\"https:\/\/github.com\/dmesquita\/extending-scikit\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a>.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b4e4e75 elementor-widget elementor-widget-heading\" data-id=\"b4e4e75\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>Building a custom transformer<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ea8af73 elementor-widget elementor-widget-text-editor\" data-id=\"ea8af73\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThere are two kinds of transformers: stateless transformers and stateful transformers. Stateless transformers\u00a0<em>treat samples independently<\/em>\u00a0while stateful transformers\u00a0<em>depend on the previous data<\/em>. If we need a stateful transformer, we save the state in the fit() method. For both stateless and stateful transformers, fit() should return self.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e93bdb6 elementor-widget elementor-widget-text-editor\" data-id=\"e93bdb6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tMost examples of custom transformers use numpy arrays, so let&#8217;s try something different and build a transformer that uses spaCy models. 
Our goal is to create a model to classify documents. We want to know if lemmatization and stopword removal can increase the performance of the model. RandomizedSearchCV and GridSearchCV are great for testing whether different parameters can improve the performance of a model.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-375d146 elementor-widget elementor-widget-text-editor\" data-id=\"375d146\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWhen we create a transformer class inheriting from the BaseEstimator class we get the get_params() and set_params() methods for free, allowing us to use the new transformer in the search for the best parameter values. But to do that we need to follow some rules [4]:\n<ul>\n \t<li>The names of the keyword arguments accepted by\u00a0__init__() should\u00a0<strong>correspond<\/strong>\u00a0to the attributes on the instance<\/li>\n \t<li>All parameters should have\u00a0<strong>sensible defaults<\/strong>, so a user can instantiate an estimator by simply calling EstimatorName()<\/li>\n \t<li>Validation should be done\u00a0<em>where the parameters are used;<\/em>\u00a0this means that there should be no logic (not even input validation) in\u00a0__init__()<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-550f511 elementor-widget elementor-widget-text-editor\" data-id=\"550f511\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe parameters we need are the spaCy language model, lemmatization and remove_stopwords.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-46e8864 elementor-widget elementor-widget-heading\" 
data-id=\"46e8864\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>Using scikit-learn pipelines<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0576622 elementor-widget elementor-widget-text-editor\" data-id=\"0576622\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn machine learning many tasks are expressible as\u00a0<strong>sequences or combinations of transformations to data<\/strong>\u00a0[3]. Pipelines offer a clear overview of our preprocessing steps, turning a chain of estimators into one single estimator. Using pipelines is also a way to make sure that we always perform exactly the same steps while training, doing cross-validation or making a prediction.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9b3c7c1 elementor-widget elementor-widget-text-editor\" data-id=\"9b3c7c1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tEvery step of the pipeline except the last should implement the transform() method; the final step only needs to implement fit(). To create the model we&#8217;ll use the new transformer, a TfidfVectorizer and a RandomForestClassifier. Each of these becomes a pipeline step. 
The steps are defined as tuples, where the first element is the name of the step and the second element is the estimator object itself.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-44425b1 elementor-widget elementor-widget-text-editor\" data-id=\"44425b1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWith that we can use the pipeline object to call the fit() and predict() methods, like text_clf.fit(train, labels) and text_clf.predict(data). We can use all the methods that the last step of the pipeline implements, so we can also call text_clf.predict_proba(data) to get the probability scores from the RandomForestClassifier, for example.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-149d05a elementor-widget elementor-widget-heading\" data-id=\"149d05a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>Finding the best parameters with GridSearchCV<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b3651dc elementor-widget elementor-widget-text-editor\" data-id=\"b3651dc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWith GridSearchCV we can run an exhaustive search for the best parameters on a grid of possible values (RandomizedSearchCV is the non-exhaustive alternative). 
To do that we define a dict for the parameters, where the keys should be\u00a0<em>name_of_pipeline_step__parameter_name<\/em>\u00a0and the values should be lists of the parameter values we want to try.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-64acb0f elementor-widget elementor-widget-heading\" data-id=\"64acb0f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>Takeaways<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f688ef4 elementor-widget elementor-widget-text-editor\" data-id=\"f688ef4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tMachine Learning comes with challenges that the Software Engineering world is not familiar with. Building experiments represents a large part of our workflow, and doing that with messy code doesn&#8217;t usually end up well. 
When we extend scikit-learn and use the components to write our experiments we make the task of maintaining our codebase easier, bringing sanity to our day-to-day tasks.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ada2547 elementor-widget elementor-widget-text-editor\" data-id=\"ada2547\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<b>References<\/b>\n<div>[1]\u00a0<a href=\"https:\/\/developers.google.com\/machine-learning\/guides\/rules-of-ml\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/developers.google.com\/machine-learning\/guides\/rules-of-ml<\/a><\/div>\n<div>[2]\u00a0<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/03\/amershi-icse-2019_Software_Engineering_for_Machine_Learning.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/03\/amershi-icse-2019_Software_Engineering_for_Machine_Learning.pdf<\/a><\/div>\n<div>[3]\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/1309.0238.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/arxiv.org\/pdf\/1309.0238.pdf<\/a><\/div>\n<div>[4]\u00a0<a href=\"https:\/\/scikit-learn.org\/stable\/developers\/contributing.html#apis-of-scikit-learn-objects\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/scikit-learn.org\/stable\/developers\/contributing.html#apis-of-scikit-learn-objects<\/a><\/div>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Machine Learning comes with challenges that the Software Engineering world is not familiar with. Building experiments represents a large part of our workflow, and doing that with messy code doesn&#8217;t usually end up well. 
When we extend scikit-learn and use the components to write our experiments we make the task of maintaining our codebase easier, bringing sanity to our day-to-day tasks. Learn how to extend Scikit-learn code to make your experiments easier to maintain and reproduce.<\/p>\n","protected":false},"author":680,"featured_media":2906,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[92],"ppma_author":[3465],"class_list":["post-2102","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-machine-learning"],"authors":[{"term_id":3465,"user_id":680,"is_guest":0,"slug":"deborah-mesquita","display_name":"D\u00e9borah Mesquita","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Mesquita","first_name":"D\u00e9borah","job_title":"","description":"D&eacute;borah Mesquita is Data Scientist at Eye4Fraud that screens, verifies and guarantees your online orders. 
She<a href=\"https:\/\/www.deborahmesquita.com\/azuremlaward\/\"> won the Azure Machine Learning Award<\/a>&nbsp;by using a model to help kids with math."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2102","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/680"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=2102"}],"version-history":[{"count":6,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2102\/revisions"}],"predecessor-version":[{"id":36041,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2102\/revisions\/36041"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/2906"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=2102"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=2102"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=2102"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=2102"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}