{"id":521,"date":"2017-07-28T10:48:36","date_gmt":"2017-07-28T07:48:36","guid":{"rendered":"http:\/\/kusuaks7\/?p=126"},"modified":"2025-03-31T08:30:01","modified_gmt":"2025-03-31T08:30:01","slug":"claims-severity-prediction-with-apache-spark-2-0-and-scala","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/claims-severity-prediction-with-apache-spark-2-0-and-scala\/","title":{"rendered":"Claims Severity Prediction with Apache Spark 2.0 and Scala"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"521\" class=\"elementor elementor-521\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-2959dfb5 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"2959dfb5\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-534fe656\" data-id=\"534fe656\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-7d8f98fc elementor-widget elementor-widget-text-editor\" data-id=\"7d8f98fc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<em><strong>Need training for Insurance Analytics?\u00a0<a href=\"https:\/\/www.experfy.com\/blog\/claims-severity-prediction-with-apache-spark-2-0-and-scala\">Browse courses<\/a>\u00a0developed by industry thought leaders and Experfy in Harvard Innovation 
Lab.<\/strong><\/em>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-4ccec49 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"4ccec49\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-e2e23e2\" data-id=\"e2e23e2\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-39e91c8 elementor-widget elementor-widget-text-editor\" data-id=\"39e91c8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<a href=\"https:\/\/www.allstate.com\/\" rel=\"noopener\">Allstate Corporation<\/a>, the second-largest insurance company in the United States, founded in 1931, recently launched a Machine Learning recruitment challenge in partnership with\u00a0Kaggle. 
\u00a0Allstate&#8217;s objective was\u00a0to predict the cost, and hence the severity, of claims.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-250fcac elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"250fcac\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-05d5ab3\" data-id=\"05d5ab3\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-ba77b97 elementor-widget elementor-widget-text-editor\" data-id=\"ba77b97\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nThe competition organizers provided competitors with more than 300,000 examples of masked and anonymized data, consisting of more than 100 categorical and numerical attributes, thus complying with confidentiality constraints.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-a4e6edd elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"a4e6edd\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element 
elementor-element-dacdd5c\" data-id=\"dacdd5c\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-29c3446 elementor-widget elementor-widget-text-editor\" data-id=\"29c3446\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe Spark\/Scala script\u00a0explained in this post obtains the training and test input datasets from local or\u00a0<a href=\"https:\/\/aws.amazon.com\/s3\/details\/\" rel=\"noopener\">Amazon&#8217;s AWS S3<\/a>\u00a0environment\u00a0and trains a\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Random_forest\" rel=\"noopener\">Random Forest<\/a>\u00a0model\u00a0over it. The objective is to demonstrate the use of\u00a0<a href=\"https:\/\/spark.apache.org\/releases\/spark-release-2-0-0.html\" rel=\"noopener\">Spark 2.0<\/a>\u00a0Machine Learning pipelines with\u00a0<a href=\"http:\/\/www.scala-lang.org\/\" rel=\"noopener\">Scala language<\/a>,\u00a0<a href=\"https:\/\/aws.amazon.com\/s3\/details\/\" rel=\"noopener\">AWS S3<\/a>\u00a0integration and some general good practices for building Machine Learning models. 
To keep this focus, more sophisticated techniques (such as thorough exploratory data analysis and feature engineering) are intentionally omitted.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-17b2d63 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"17b2d63\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-32e33bf\" data-id=\"32e33bf\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-d52a3ec elementor-widget elementor-widget-heading\" data-id=\"d52a3ec\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Why Spark?<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-0d46046 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"0d46046\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-08045a8\" data-id=\"08045a8\" data-element_type=\"column\" 
data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-6c15008 elementor-widget elementor-widget-text-editor\" data-id=\"6c15008\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tSince almost all personal computers nowadays have many Gigabytes of RAM (and it is in an accelerated growing) and powerful CPUs and GPUs, many real-world machine learning problems can be solved with a single computer and frameworks such as\u00a0<a href=\"http:\/\/scikit-learn.org\/\" rel=\"noopener\">ScikitLearn<\/a>, with no need of a distributed system.\u00a0\u00a0Sometimes, though, data grows and keeps growing. Who hasn&#8217;t heard the term &#8220;Big Data&#8221;? When big data is involved, a non-distributed solution may solve the problem for a short time, but afterwards such solution needs to be reviewed and may reqire a significantly different approach. 
This is where Spark comes in.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-17f4b6e elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"17f4b6e\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-5090ec1\" data-id=\"5090ec1\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-8a9134a elementor-widget elementor-widget-text-editor\" data-id=\"8a9134a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tSpark started as a research project at\u00a0<a href=\"http:\/\/www.berkeley.edu\/\" rel=\"noopener\">UC Berkeley<\/a>\u00a0in the\u00a0<a href=\"https:\/\/amplab.cs.berkeley.edu\/\" rel=\"noopener\">AMPLab<\/a>, a research group that focuses on big data analytics. Since then, it has become an\u00a0<a href=\"https:\/\/www.apache.org\/\" rel=\"noopener\">Apache<\/a>\u00a0project and has delivered many new releases, reaching consistent maturity with a wide range of functionality. Most of all, Spark can process anywhere from a few gigabytes to hundreds of petabytes of data with essentially the same code, only requiring a proper cluster of machines in the background (check\u00a0<a href=\"https:\/\/databricks.com\/blog\/2014\/10\/10\/spark-petabyte-sort.html\" class=\"broken_link\" rel=\"noopener\">this link<\/a>). 
In some very specific cases the developer may need to tune the process by changing the granularity of data distribution and other related aspects, but in general there are plenty of providers that automate all this cluster configuration for the developer. For instance, the script described here used\u00a0<a href=\"https:\/\/aws.amazon.com\/emr\/\" rel=\"noopener\">AWS Elastic MapReduce (EMR)<\/a>, which plays exactly this role.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-90a51d8 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"90a51d8\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-507e872\" data-id=\"507e872\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-c75a102 elementor-widget elementor-widget-heading\" data-id=\"c75a102\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Why Scala?<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-31977a0 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"31977a0\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div 
class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-f6447b3\" data-id=\"f6447b3\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-e58e142 elementor-widget elementor-widget-text-editor\" data-id=\"e58e142\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<a href=\"https:\/\/www.scala-lang.org\/\" rel=\"noopener\">Scala<\/a>\u00a0is a beautiful and very well-devised programming language, with a strong scientific background from professor\u00a0<a href=\"https:\/\/scala.epfl.ch\/\" rel=\"noopener\">Martin Odersky&#8217;s research team<\/a>\u00a0at\u00a0<a href=\"https:\/\/www.epfl.ch\/\" rel=\"noopener\">Ecole Polytechnique F\u00e9d\u00e9rale de Lausanne<\/a>.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-af5cae1 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"af5cae1\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-796974b\" data-id=\"796974b\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-51c2bed elementor-widget elementor-widget-text-editor\" data-id=\"51c2bed\" data-element_type=\"widget\" 
data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn more technical terms, Scala was created with a strong functional paradigm, but also fully compatible with the imperative object-oriented paradigm from JVM platform, taking advantage of all JVM&#8217;s decades of evolution and maturity. In summary, everything one does in Java can be done in Scala and much more with a much shorter and cleaner code.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-b80b03b elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"b80b03b\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-50bf98a\" data-id=\"50bf98a\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-62367af elementor-widget elementor-widget-text-editor\" data-id=\"62367af\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIt isn&#8217;t a surprise that Spark is built precisely over Scala, although it also provides programming interfaces for\u00a0<a href=\"https:\/\/www.python.org\/\" rel=\"noopener\">Python<\/a>,\u00a0<a href=\"https:\/\/www.r-project.org\/\" rel=\"noopener\">R<\/a>\u00a0and, naturally, Java.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider 
elementor-section elementor-top-section elementor-element elementor-element-bfabf92 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"bfabf92\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-2b1b266\" data-id=\"2b1b266\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-a5f2108 elementor-widget elementor-widget-heading\" data-id=\"a5f2108\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">The challenge<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-eb68280 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"eb68280\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-d75705d\" data-id=\"d75705d\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-4952af0 elementor-widget elementor-widget-text-editor\" data-id=\"4952af0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAllstate and Kaggle challenged competitors to build a Machine Learning solution that predicts cost\/severity of insurance claims. As in any learning process, the effectiveness of the solution\u00a0depends on the availability of credible\u00a0data, which in this case was\u00a0data from 188,318 insurance claims with a set of attributes and the expected outcome, the cost. \u00a0This represented\u00a0the <em>training dataset<\/em>. An additional dataset with\u00a0125,546 observations, with the same attributes but the outcome\u00a0represented the <em>test dataset<\/em>, using which the trained model was to be\u00a0run in order to produce the solution to the given problem.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-8e56282 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"8e56282\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-3021e4f\" data-id=\"3021e4f\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-afc9820 elementor-widget elementor-widget-text-editor\" data-id=\"afc9820\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tBoth datasets contain\u00a0116 categorical columns, with values as &#8220;A&#8221;, &#8220;B&#8221;, &#8220;C&#8221;, with no explicit meaning, and 14 numerical columns, with values in a range from 
0.0 to 1.0 (<a href=\"http:\/\/en.wikipedia.org\/wiki\/Feature_scaling\" rel=\"noopener\">normalized<\/a>). It&#8217;s clearly a masked dataset, which ensures full\u00a0confidentiality and still allows any Machine Learning algorithm to learn from\u00a0it. In many real-world projects, particularly those that hire data scientists around the globe, often to work remotely, the approach is exactly the same: the consultant is provided with anonymized and masked data, whereas in other projects the data scientist may be required to obtain the data, clean it, organize it, normalize it and so on.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-6120a04 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"6120a04\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-4c516bc\" data-id=\"4c516bc\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-f57e8c3 elementor-widget elementor-widget-text-editor\" data-id=\"f57e8c3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe following section describes in detail a solution for this competition implemented with Apache Spark 2.0 and Scala.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element 
elementor-element-ceeb17b elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"ceeb17b\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-8111d9c\" data-id=\"8111d9c\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5af3126 elementor-widget elementor-widget-heading\" data-id=\"5af3126\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">The solution<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-8e3de56 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"8e3de56\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-ae98a9a\" data-id=\"ae98a9a\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-87f1aec elementor-widget elementor-widget-text-editor\" data-id=\"87f1aec\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAlthough not so labored in terms of Machine Learning techniques, the script that follows\u00a0provides many important learnings for building ML applications with Apache Spark 2.0, Scala,\u00a0<a href=\"http:\/\/www.scala-sbt.org\/\" rel=\"noopener\">SBT<\/a>\u00a0and finally running it. Some learnings are detailed as follows:\n<ul>\n \t<li>A sophisticated command line interface is provided by\u00a0<a href=\"https:\/\/github.com\/scopt\/scopt\" rel=\"noopener\">scopt<\/a>, through which the runtime can be configured with specific named parameters. It is detailed in the section\u00a0<em>Running the Script\u00a0Locally<\/em>. You must add this to your\u00a0<code>build.sbt<\/code>\u00a0file:<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-9e97307 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"9e97307\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-5509d50\" data-id=\"5509d50\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-517bc04 elementor-widget elementor-widget-text-editor\" data-id=\"517bc04\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<\/ul>\n<pre><code class=\"language-java\">libraryDependencies += \"com.github.scopt\" %% \"scopt\" % \"3.5.0\"<\/code><\/pre>\n<ul>\n \t<li>And your script code 
will include something like this:<\/li>\n<\/ul>\n<pre><code class=\"language-java\">val parser = new OptionParser[Params](\"AllstateClaimsSeverityRandomForestRegressor\") {\n  head(\"AllstateClaimsSeverityRandomForestRegressor\", \"1.0\")\n\n  opt[String](\"s3AccessKey\").required().action((x, c) =&gt;\n    c.copy(s3AccessKey = x)).text(\"The access key for S3\")\n\n  opt[String](\"s3SecretKey\").required().action((x, c) =&gt;\n    c.copy(s3SecretKey = x)).text(\"The secret key for S3\")\n...<\/code><\/pre>\n<pre><code class=\"language-java\">parser.parse(args, Params()) match {\n  case Some(params) =&gt;\n    process(params)\n  case None =&gt;\n    throw new IllegalArgumentException(\"One or more parameters are invalid or missing\")\n}<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-2bf57f3 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"2bf57f3\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-0651179\" data-id=\"0651179\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-37b389c elementor-widget elementor-widget-text-editor\" data-id=\"37b389c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li>In order for SBT to package a jar file containing this and other third-party libraries, you need to use the command\u00a0<code>sbt assembly<\/code>\u00a0instead 
of\u00a0<code>sbt package<\/code>. To do so, you need to use\u00a0<a href=\"https:\/\/github.com\/sbt\/sbt-assembly\" rel=\"noopener\">sbt-assembly<\/a>\u00a0and configure your project accordingly by creating a file\u00a0<code>project\/assembly.sbt<\/code>\u00a0with the following content:<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-4520ace elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"4520ace\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-d357bdf\" data-id=\"d357bdf\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5d136a3 elementor-widget elementor-widget-text-editor\" data-id=\"5d136a3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre><code class=\"language-java\">resolvers += Resolver.url(\"artifactory\", url(\"http:\/\/scalasbt.artifactoryonline.com\/scalasbt\/sbt-plugin-releases\"))(Resolver.ivyStylePatterns)\n\naddSbtPlugin(\"com.eed3si9n\" % \"sbt-assembly\" % \"0.14.3\")<\/code><\/pre>\n<ul>\n \t<li>The method\u00a0<code>process<\/code>\u00a0is called with a\u00a0<em>case class<\/em>\u00a0instance which encapsulates the parameters provided at the command line.<\/li>\n<\/ul>\n<pre><code class=\"language-java\">case class Params(s3AccessKey: String = \"\", s3SecretKey: String = \"\",\ntrainInput: String = \"\", testInput: String = 
\"\",\noutputFile: String = \"\",\nalgoNumTrees: Seq[Int] = Seq(3),\nalgoMaxDepth: Seq[Int] = Seq(4),\nalgoMaxBins: Seq[Int] = Seq(32),\nnumFolds: Int = 10,\ntrainSample: Double = 1.0,\ntestSample: Double = 1.0)<\/code><\/pre>\n<pre><code class=\"language-java\">def process(params: Params) {\n...<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-d68f691 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"d68f691\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-32a23b6\" data-id=\"32a23b6\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-662eb54 elementor-widget elementor-widget-text-editor\" data-id=\"662eb54\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li><em>SparkSession.builder<\/em>\u00a0is used for building a\u00a0<em>Spark session<\/em>. 
It was introduced in Spark 2.0 and is recommended in place of the old\u00a0<em>SparkConf<\/em>\u00a0and\u00a0<em>SparkContext<\/em>.\u00a0<a href=\"https:\/\/databricks.com\/blog\/2016\/08\/15\/how-to-use-sparksession-in-apache-spark-2-0.html\" class=\"broken_link\" rel=\"noopener\">This link<\/a>\u00a0provides a good description of the new strategy and its equivalence with the old one.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-80b6277 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"80b6277\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-aa43f29\" data-id=\"aa43f29\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-21548a9 elementor-widget elementor-widget-text-editor\" data-id=\"21548a9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre><code class=\"language-java\">val sparkSession = SparkSession.builder.\nappName(\"AllstateClaimsSeverityRandomForestRegressor\")\n.getOrCreate()<\/code><\/pre>\n<ul>\n \t<li>Access to S3 is configured with\u00a0<strong>s3a<\/strong>\u00a0support, which, compared to its predecessor\u00a0<strong>s3n<\/strong>, improves support for large files (no more 5GB limit) and provides higher performance. 
For more information, see\u00a0<a href=\"http:\/\/stackoverflow.com\/questions\/30385981\/how-to-access-s3a-files-from-apache-spark\" rel=\"noopener\">this link<\/a>.<\/li>\n<\/ul>\n<pre><code class=\"language-java\">sparkSession.conf.set(\"spark.hadoop.fs.s3a.impl\", \"org.apache.hadoop.fs.s3a.S3AFileSystem\")\nsparkSession.conf.set(\"spark.hadoop.fs.s3a.access.key\", params.s3AccessKey)\nsparkSession.conf.set(\"spark.hadoop.fs.s3a.secret.key\", params.s3SecretKey)<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-7055b5f elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"7055b5f\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-665f8e3\" data-id=\"665f8e3\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-6820b7c elementor-widget elementor-widget-text-editor\" data-id=\"6820b7c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li>Besides using the new\u00a0<strong>sparkSession.read.csv<\/strong>\u00a0method, the reading process also includes important settings: it is set to read the header of the CSV file, which is applied directly to the column names of the resulting dataframe; and the\u00a0<strong>inferSchema<\/strong>\u00a0property is set to\u00a0<em>true<\/em>. 
Without the\u00a0<strong>inferSchema<\/strong>\u00a0configuration, the float values would be read as\u00a0<em>strings<\/em>, which would later cause the\u00a0<strong>VectorAssembler<\/strong>\u00a0to raise an ugly error:\u00a0<code>java.lang.IllegalArgumentException: Data type StringType is not supported<\/code>. Finally, both raw dataframes are\u00a0<em>cached<\/em>\u00a0since they are used again later for\u00a0<em>fitting<\/em>\u00a0the\u00a0<strong>StringIndexer<\/strong>\u00a0transformations, and re-reading the CSV files from the filesystem or S3 would be wasteful.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-236d03a elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"236d03a\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-07d1f9c\" data-id=\"07d1f9c\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-6320962 elementor-widget elementor-widget-text-editor\" data-id=\"6320962\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre><code class=\"language-java\">val trainInput = sparkSession.read\n.option(\"header\", \"true\")\n.option(\"inferSchema\", \"true\")\n.csv(params.trainInput)\n.cache\n\nval testInput = sparkSession.read\n.option(\"header\", \"true\")\n.option(\"inferSchema\", 
\"true\")\n.csv(params.testInput)\n.cache<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-fdb5660 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"fdb5660\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-e7587b1\" data-id=\"e7587b1\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b5e183c elementor-widget elementor-widget-text-editor\" data-id=\"b5e183c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li>The column &#8220;loss&#8221; is renamed to &#8220;label&#8221;. For some reason, even after using the\u00a0<em>setLabelCol<\/em>\u00a0on the regression model, it still looks for a column called &#8220;label&#8221;, raising an ugly error:\u00a0<code>org.apache.spark.sql.AnalysisException: cannot resolve 'label' given input <\/code> <code>columns<\/code>. It may be hardcoded somewhere in Spark&#8217;s source code.<\/li>\n \t<li>The content of\u00a0<em>train.csv<\/em>\u00a0is split into\u00a0<em>training<\/em>\u00a0and\u00a0<em>validation<\/em>\u00a0data, 70% and 30%, respectively. The content of &#8220;test.csv&#8221; is reserved for building the final CSV file for submission on Kaggle. 
Both original dataframes are sampled according to command line parameters, which is particularly useful for fast runs on your local machine.<\/li>\n<\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-bccb20e elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"bccb20e\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-194f5e0\" data-id=\"194f5e0\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-1e67d1b elementor-widget elementor-widget-text-editor\" data-id=\"1e67d1b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre><code class=\"language-java\">val data = trainInput.withColumnRenamed(\"loss\", \"label\")\n.sample(false, params.trainSample)\n\nval splits = data.randomSplit(Array(0.7, 0.3))\nval (trainingData, validationData) = (splits(0), splits(1))\n\ntrainingData.cache\nvalidationData.cache\n\nval testData = testInput.sample(false, params.testSample).cache<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-d0f6cfa elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"d0f6cfa\" data-element_type=\"section\" 
data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-ff60fea\" data-id=\"ff60fea\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-edf23fc elementor-widget elementor-widget-text-editor\" data-id=\"edf23fc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li>Using the custom function\u00a0<em>isCateg<\/em>, the column names are filtered and a\u00a0<a href=\"http:\/\/spark.apache.org\/docs\/latest\/ml-features.html#stringindexer\" rel=\"noopener\">StringIndexer<\/a>\u00a0is created for each categorical column, producing a new numerical column named according to the custom function\u00a0<em>categNewCol<\/em>. Note: this is weak feature engineering, since it wrongly invites the learning model to assume an order among the categories (that one is greater or less than another). 
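As an aside, here is a minimal sketch of the alternative, one-hot approach using Spark's `ml.feature` API (the column name `cat1` is hypothetical, not one defined by this script):

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Hypothetical column "cat1": OneHotEncoder consumes the numeric
// output of a fitted StringIndexer and emits a sparse 0/1 vector
// with one slot per category, so no artificial order is implied.
val indexer = new StringIndexer()
  .setInputCol("cat1")
  .setOutputCol("idx_cat1")

val encoder = new OneHotEncoder()
  .setInputCol("idx_cat1")
  .setOutputCol("vec_cat1")
```

Both stages could then be appended to the pipeline in place of the plain indexed column.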
Whenever categories are confirmed to be unordered, it is better to use some other technique such as\u00a0<a href=\"http:\/\/spark.apache.org\/docs\/latest\/ml-features.html#onehotencoder\" rel=\"noopener\">OneHotEncoder<\/a>, which yields a different new column for each category holding a boolean (0\/1) value.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-1058442 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"1058442\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-c3a03e5\" data-id=\"c3a03e5\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-3600bc3 elementor-widget elementor-widget-text-editor\" data-id=\"3600bc3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre><code class=\"language-java\">def isCateg(c: String): Boolean = c.startsWith(\"cat\")\ndef categNewCol(c: String): String = if (isCateg(c)) s\"idx_${c}\" else c\n\nval stringIndexerStages = trainingData.columns.filter(isCateg)\n.map(c =&gt; new StringIndexer()\n.setInputCol(c)\n.setOutputCol(categNewCol(c))\n.fit(trainInput.select(c).union(testInput.select(c))))<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element 
elementor-element-429581a elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"429581a\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-efe4036\" data-id=\"efe4036\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-c5f25c9 elementor-widget elementor-widget-text-editor\" data-id=\"c5f25c9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li>There are some very important aspects to consider when building a feature transformation such as StringIndexer or OneHotEncoder. Such transformations need to be\u00a0<em>fitted<\/em>\u00a0before being included in the pipeline, and the\u00a0<em>fit<\/em>\u00a0process needs to be done over a dataset that contains all possible categories. For instance, if you fit a StringIndexer over the training dataset only and the pipeline later faces an unseen category while predicting over another dataset (validation, test, etc.), it will fail and raise the error:\u00a0<code>org.apache.spark.SparkException: Failed to execute user defined function($anonfun$4: (string) =&gt; double) ... Caused by: org.apache.spark.SparkException: Unseen label: XYZ ... at org.apache.spark.ml.feature.StringIndexerModel<\/code>. 
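Another option worth noting, sketched here with a hypothetical column name: StringIndexer also offers `setHandleInvalid("skip")`, which silently drops rows carrying unseen labels instead of failing. That avoids the exception but loses the predictions for those rows, which would be unacceptable for a Kaggle submission that must cover every test id:

```scala
import org.apache.spark.ml.feature.StringIndexer

// Sketch: tolerate unseen categories by skipping those rows.
// "skip" filters them out at transform time; the default "error"
// raises the SparkException shown above.
val indexer = new StringIndexer()
  .setInputCol("cat1")      // hypothetical column name
  .setOutputCol("idx_cat1")
  .setHandleInvalid("skip")
```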
This is the reason why this script fits the StringIndexer transformations over a union of original data from\u00a0<code>train.csv<\/code>\u00a0and\u00a0<code>test.csv<\/code>, bypassing the sampling and split parts.<\/li>\n \t<li>After the sequence of StringIndexer transformations, the next transformation in the pipeline is the\u00a0<a href=\"http:\/\/spark.apache.org\/docs\/latest\/ml-features.html#vectorassembler\" rel=\"noopener\">VectorAssembler<\/a>, which groups a set of columns into a new &#8220;features&#8221; column to be considered by the regression model. Filtering to feature columns only is performed with the custom function\u00a0<em>onlyFeatureCols<\/em>. Additionally, the custom function\u00a0<em>removeTooManyCategs<\/em>\u00a0is used to filter out the few columns whose number of distinct categories far exceeds what the default\u00a0<em>maxBins<\/em>\u00a0parameter (for RandomForest) supports. In a seriously competitive scenario, it would be better to perform some exploratory analysis to understand these features, their impact on the outcome variable and which feature engineering techniques could be applied.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-fb07323 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"fb07323\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-fdcf7ff\" data-id=\"fdcf7ff\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element 
elementor-element-2e2f5a0 elementor-widget elementor-widget-text-editor\" data-id=\"2e2f5a0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre><code class=\"language-java\">def removeTooManyCategs(c: String): Boolean = !(c matches \"cat(109$|110$|112$|113$|116$)\")\n\ndef onlyFeatureCols(c: String): Boolean = !(c matches \"id|label\")\n\nval featureCols = trainingData.columns\n.filter(removeTooManyCategs)\n.filter(onlyFeatureCols)\n.map(categNewCol)\n\nval assembler = new VectorAssembler()\n.setInputCols(featureCols)\n.setOutputCol(\"features\")<\/code><\/pre>\n<ul>\n \t<li>The very last stage in the pipeline is the regression model, which in this script\u00a0is a\u00a0<a href=\"http:\/\/spark.apache.org\/docs\/2.0.1\/api\/java\/org\/apache\/spark\/ml\/regression\/RandomForestRegressor.html\" class=\"broken_link\" rel=\"noopener\">RandomForestRegressor<\/a>.<\/li>\n<\/ul>\n<pre><code class=\"language-java\">val algo = new RandomForestRegressor().setFeaturesCol(\"features\").setLabelCol(\"label\")\n\nval pipeline = new Pipeline().setStages((stringIndexerStages :+ assembler) :+ algo)<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-8821f6a elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"8821f6a\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-e764435\" data-id=\"e764435\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap 
elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-f8a7134 elementor-widget elementor-widget-text-editor\" data-id=\"f8a7134\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li>It is worth running the pipeline a number of times with different\u00a0<em>hyperparameters<\/em>\u00a0for the transformations and the learning algorithm in order to find the combination that best fits the data (see\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Hyperparameter_optimization\" rel=\"noopener\">Hyperparameter optimization<\/a>). It is also important to evaluate each combination against a separate slice of the data (see\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Cross-validation_(statistics)\" rel=\"noopener\">K-fold Cross Validation<\/a>). To accomplish these objectives, a\u00a0<a href=\"http:\/\/spark.apache.org\/docs\/latest\/api\/java\/org\/apache\/spark\/ml\/tuning\/CrossValidator.html\" rel=\"noopener\">CrossValidator<\/a>\u00a0is used in conjunction with a\u00a0<a href=\"http:\/\/spark.apache.org\/docs\/latest\/api\/java\/org\/apache\/spark\/ml\/tuning\/ParamGridBuilder.html\" rel=\"noopener\">ParamGridBuilder<\/a>\u00a0(more documentation at\u00a0<a href=\"http:\/\/spark.apache.org\/docs\/latest\/ml-tuning.html\" rel=\"noopener\">this link<\/a>), queueing executions for the distinct combinations of\u00a0<em>hyperparameters<\/em>\u00a0specified at the command line.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-a780557 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"a780557\" 
data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-c275450\" data-id=\"c275450\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-d9aa87f elementor-widget elementor-widget-text-editor\" data-id=\"d9aa87f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre><code class=\"language-java\">val paramGrid = new ParamGridBuilder()\n.addGrid(algo.numTrees, params.algoNumTrees)\n.addGrid(algo.maxDepth, params.algoMaxDepth)\n.addGrid(algo.maxBins, params.algoMaxBins)\n.build()\n\nval cv = new CrossValidator()\n.setEstimator(pipeline)\n.setEvaluator(new RegressionEvaluator)\n.setEstimatorParamMaps(paramGrid)\n.setNumFolds(params.numFolds)\n\nval cvModel = cv.fit(trainingData)<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-7bc2b22 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"7bc2b22\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-4e4cf2d\" data-id=\"4e4cf2d\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-4cdde43 
elementor-widget elementor-widget-text-editor\" data-id=\"4cdde43\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li>Note: As observed in\u00a0<a href=\"https:\/\/databricks.com\/blog\/2015\/01\/21\/random-forests-and-boosting-in-mllib.html\" class=\"broken_link\" rel=\"noopener\">this post<\/a>, the Random Forest model is much faster than GBT on Spark. I experienced executions about 20 times slower with GBT than with Random Forest under equivalent\u00a0<em>hyperparameters<\/em>.<\/li>\n \t<li>With an instance of\u00a0<a href=\"http:\/\/spark.apache.org\/docs\/latest\/api\/java\/org\/apache\/spark\/ml\/tuning\/CrossValidatorModel.html\" rel=\"noopener\">CrossValidatorModel<\/a>\u00a0already trained, it is time to evaluate the model over the whole training and validation datasets. From the resulting predictions, evaluation metrics are easily obtained with\u00a0<a href=\"http:\/\/spark.apache.org\/docs\/latest\/api\/java\/org\/apache\/spark\/mllib\/evaluation\/RegressionMetrics.html\" rel=\"noopener\">RegressionMetrics<\/a>. 
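As a sketch of what reading those metrics can look like (`metrics` here stands for any already-built `RegressionMetrics` instance, such as the ones constructed from the training and validation predictions):

```scala
// Sketch: headline numbers exposed by a RegressionMetrics instance.
println(s"MAE  = ${metrics.meanAbsoluteError}")   // the metric Kaggle ranks this competition by
println(s"RMSE = ${metrics.rootMeanSquaredError}")
println(s"R2   = ${metrics.r2}")
```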
Additionally, the instance of the best model can be obtained, providing thus access to some other interesting attributes, such as\u00a0<em>featureImportances<\/em>.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-3da1b25 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"3da1b25\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-1fb0f11\" data-id=\"1fb0f11\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-9ec8f57 elementor-widget elementor-widget-text-editor\" data-id=\"9ec8f57\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre><code class=\"language-java\">val trainPredictionsAndLabels = cvModel.transform(trainingData).select(\"label\", \"prediction\")\n.map { case Row(label: Double, prediction: Double) =&gt; (label, prediction) }.rdd\n\nval validPredictionsAndLabels = cvModel.transform(validationData).select(\"label\", \"prediction\")\n.map { case Row(label: Double, prediction: Double) =&gt; (label, prediction) }.rdd\n\nval trainRegressionMetrics = new RegressionMetrics(trainPredictionsAndLabels)\nval validRegressionMetrics = new RegressionMetrics(validPredictionsAndLabels)\n\nval bestModel = cvModel.bestModel.asInstanceOf[PipelineModel]\nval featureImportances = 
bestModel.stages.last.asInstanceOf[RandomForestRegressionModel].featureImportances.toArray<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-6257a5c elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"6257a5c\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-af36499\" data-id=\"af36499\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-4be7565 elementor-widget elementor-widget-text-editor\" data-id=\"4be7565\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li>Finally, the model can be used to predict the answer for the\u00a0<em>test<\/em>\u00a0dataset and save a csv file ready to be submitted on Kaggle! Again, Spark 2.0 simplifies the process. 
The\u00a0<code>coalesce<\/code>\u00a0call merges all partitions into a single one, so one output file is saved instead of many.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-06dd06f elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"06dd06f\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-9399937\" data-id=\"9399937\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-c69e10b elementor-widget elementor-widget-text-editor\" data-id=\"c69e10b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre><code class=\"language-java\">cvModel.transform(testData)\n.select(\"id\", \"prediction\")\n.withColumnRenamed(\"prediction\", \"loss\")\n.coalesce(1)\n.write.format(\"csv\")\n.option(\"header\", \"true\")\n.save(params.outputFile)<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-8d91245 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"8d91245\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column 
elementor-col-100 elementor-top-column elementor-element elementor-element-120049c\" data-id=\"120049c\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-f55301a elementor-widget elementor-widget-heading\" data-id=\"f55301a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">Running the script\u00a0locally<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-c26bbc7 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"c26bbc7\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-8a044ee\" data-id=\"8a044ee\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-21b5560 elementor-widget elementor-widget-text-editor\" data-id=\"21b5560\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAssuming you have your local environment all set up with Java 8 or higher, Scala 2.11.x and Spark 2.0, you can run the script with the following command structure:\n<pre><code class=\"language-bash\">spark-submit --class com.adornes.spark.kaggle.AllstateClaimsSeverityRandomForestRegressor the_jar_file.jar 
--s3AccessKey YOUR_AWS_ACCESS_KEY_HERE --s3SecretKey YOUR_AWS_SECRET_KEY_HERE --trainInput \"file:\/\/\/path\/to\/the\/train.csv\" --testInput \"file:\/\/\/path\/to\/the\/test.csv\" --outputFile  \"file:\/\/\/path\/to\/any\/name\/for\/submission.csv\" --algoNumTrees 3 --algoMaxDepth 3 --algoMaxBins 32 --numFolds 5 --trainSample 0.01 --testSample 0.01<\/code><\/pre>\nAs previously mentioned,\u00a0<a href=\"https:\/\/github.com\/scopt\/scopt\" rel=\"noopener\">scopt<\/a>\u00a0is the library that provides the friendly, named command-line parameters. If you type an option incorrectly, it prints the usage summary:\n<pre><code>AllstateClaimsSeverityRandomForestRegressor 1.0\nUsage: AllstateClaimsSeverityRandomForestRegressor [options]<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-25de92d elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"25de92d\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-1596cfd\" data-id=\"1596cfd\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-1cc7f5d elementor-widget elementor-widget-heading\" data-id=\"1cc7f5d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">Running the script\u00a0on AWS Elastic MapReduce 
(EMR)<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-7500e3e elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"7500e3e\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-e159736\" data-id=\"e159736\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-a1872b1 elementor-widget elementor-widget-text-editor\" data-id=\"a1872b1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>EMR<\/strong>\u00a0abstracts away most of the background setup for a cluster running the Spark\/Hadoop ecosystem. You can build as many clusters as you want (and can afford). 
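The named options shown earlier are parsed by scopt into a typed configuration object. Purely as an illustration of the job scopt does for the script (this is not the script's actual code, and the `Params` fields shown are assumptions mirroring a few of the options), a hand-rolled equivalent might look like:

```scala
// Hand-rolled sketch of what scopt provides: mapping "--name value"
// argument pairs onto a typed, immutable configuration.
// Field names are illustrative assumptions, not the script's real code.
case class Params(
  trainInput: String = "",
  testInput: String = "",
  algoNumTrees: Seq[Int] = Seq(3),
  trainSample: Double = 1.0
)

object ArgParse {
  def parse(args: Array[String]): Params =
    args.grouped(2).foldLeft(Params()) {
      case (p, Array("--trainInput", v))   => p.copy(trainInput = v)
      case (p, Array("--testInput", v))    => p.copy(testInput = v)
      case (p, Array("--algoNumTrees", v)) => p.copy(algoNumTrees = v.split(",").map(_.toInt).toSeq)
      case (p, Array("--trainSample", v))  => p.copy(trainSample = v.toDouble)
      case (p, _)                          => p // unknown option: real scopt prints the usage text instead
    }
}
```

In the real script, scopt additionally validates required options and prints the usage summary shown above when parsing fails.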
By the way, EC2 instances used with EMR are charged at a considerably reduced rate (detailed\u00a0<a href=\"https:\/\/aws.amazon.com\/emr\/pricing\" rel=\"noopener\">here<\/a>).\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-3f2bd14 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"3f2bd14\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-03d3c77\" data-id=\"03d3c77\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-8e3ce0b elementor-widget elementor-widget-text-editor\" data-id=\"8e3ce0b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAlthough it considerably abstracts the cluster configuration, EMR still allows you to customize almost any background detail through the\u00a0<em>advanced<\/em>\u00a0options when creating a cluster. For this Spark script, for instance, you&#8217;ll need to customize the Java version according to\u00a0<a href=\"http:\/\/docs.aws.amazon.com\/ElasticMapReduce\/latest\/ReleaseGuide\/emr-configure-apps.html#configuring-java8\" rel=\"noopener\">this link<\/a>. Everything else can be created with the options provided. 
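For reference, the Java 8 configuration on that AWS page takes roughly the shape below (you paste it into the software-settings box during cluster creation). This is reproduced from memory of the linked doc, so verify it against the page itself:

```json
[
  {
    "Classification": "hadoop-env",
    "Configurations": [
      {
        "Classification": "export",
        "Configurations": [],
        "Properties": { "JAVA_HOME": "/usr/lib/jvm/java-1.8.0" }
      }
    ],
    "Properties": {}
  },
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Configurations": [],
        "Properties": { "JAVA_HOME": "/usr/lib/jvm/java-1.8.0" }
      }
    ],
    "Properties": {}
  }
]
```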
So, going step by step: log in to your AWS console, look for\u00a0<em>EMR<\/em>\u00a0under the\u00a0<em>Services<\/em>\u00a0tab, choose to create a cluster, select\u00a0<em>Go to advanced options<\/em>\u00a0at the top of the screen and fill in the options as follows:\n<ul>\n \t<li><strong>Vendor<\/strong>\u00a0&#8211; Leave it as\u00a0<em>Amazon<\/em><\/li>\n \t<li><strong>Release<\/strong>\u00a0&#8211; Choose\u00a0<em>emr-5.1.0<\/em>. Select\u00a0<em>Hadoop<\/em>\u00a0and\u00a0<em>Spark<\/em>. I&#8217;d also recommend selecting\u00a0<em>Zeppelin<\/em>\u00a0(for working with notebooks) and\u00a0<em>Ganglia<\/em>\u00a0(for detailed monitoring of your cluster)<\/li>\n \t<li><strong>Edit software settings (optional)<\/strong>\u00a0&#8211; Ensure the option\u00a0<em>Enter configuration<\/em>\u00a0is selected and paste in the configuration from\u00a0<a href=\"http:\/\/docs.aws.amazon.com\/ElasticMapReduce\/latest\/ReleaseGuide\/emr-configure-apps.html#configuring-java8\" rel=\"noopener\">the aforementioned link<\/a><\/li>\n \t<li><strong>Add steps<\/strong>\u00a0&#8211; You don&#8217;t need to add any at this moment; I prefer to do it later, once the cluster is started and ready for processing. Click Next for\u00a0<em>Hardware<\/em>\u00a0settings<\/li>\n \t<li><strong>Hardware<\/strong>\u00a0&#8211; You can leave the defaults (and resize later), but you may want to increase the core instances from 2 to 4 or more. Don&#8217;t forget that your choice affects the cost. Click Next for\u00a0<em>General Cluster Settings<\/em>.<\/li>\n \t<li><strong>Cluster name<\/strong>\u00a0&#8211; Give your cluster a name. Feel free to leave all other options with the default values. Click Next for\u00a0<em>Security<\/em>.<\/li>\n \t<li><strong>EC2 Key Pair<\/strong>\u00a0&#8211; Useful if you want to log into your EC2 instances via SSH. Either create a Key Pair or choose an existing one. 
Leave the remaining options with the default values and click on\u00a0<em>Create Cluster<\/em>.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-d383763 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"d383763\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-39c6367\" data-id=\"39c6367\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b5dd5c4 elementor-widget elementor-widget-text-editor\" data-id=\"b5dd5c4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tNow you&#8217;ll have an overview of your cluster&#8217;s basic data, including the state of your instances. When they indicate to be ready for processing steps, go to the\u00a0<strong>Steps<\/strong>\u00a0tab, click on\u00a0<strong>Add step<\/strong>\u00a0and fill the options as follows:\n<ul>\n \t<li><strong>Step type<\/strong>\u00a0&#8211; Select\u00a0<em>Spark application<\/em><\/li>\n \t<li><strong>Application location<\/strong>\u00a0&#8211; Navigate through your S3 buckets and select the jar file there. 
You&#8217;ll need to have uploaded it to S3 beforehand.<\/li>\n \t<li><strong>Spark-submit options<\/strong>\u00a0&#8211; Type\u00a0<code>--class com.adornes.spark.kaggle.AllstateClaimsSeverityRandomForestRegressor<\/code>, indicating the class that holds the code you want to run.<\/li>\n \t<li><strong>Arguments<\/strong>\u00a0&#8211; Type the rest of the command-line arguments as demonstrated before, but this time pointing to S3 paths, as follows:<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-8399f2f elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"8399f2f\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-73a3cc6\" data-id=\"73a3cc6\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b7f4cc4 elementor-widget elementor-widget-text-editor\" data-id=\"b7f4cc4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre><code>--s3AccessKey YOUR_AWS_ACCESS_KEY_HERE --s3SecretKey YOUR_AWS_SECRET_KEY_HERE\n--trainInput \"s3:\/\/path\/to\/the\/train.csv\" --testInput \"s3:\/\/path\/to\/the\/test.csv\"\n--outputFile  \"s3:\/\/path\/to\/any\/name\/for\/submission.csv\"\n--algoNumTrees 20,40,60 --algoMaxDepth 5,7,9 --algoMaxBins 32 --numFolds 10\n--trainSample 1.0 --testSample 1.0<\/code><\/pre>\nThat&#8217;s it! 
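Note that <code>--algoNumTrees 20,40,60<\/code> and <code>--algoMaxDepth 5,7,9<\/code> are comma-separated lists: they define a hyperparameter grid, and with <code>--numFolds 10<\/code> cross-validation evaluates every combination. As a plain-Scala sketch of that expansion (a hypothetical helper for illustration, not the script's code):

```scala
// Expand the comma-separated hyperparameter lists into the full grid of
// (numTrees, maxDepth) combinations that cross-validation evaluates.
// With 3 values for each parameter, 9 candidate models are trained per fold.
def paramGrid(numTrees: Seq[Int], maxDepths: Seq[Int]): Seq[(Int, Int)] =
  for (t <- numTrees; d <- maxDepths) yield (t, d)
```

In Spark ML this Cartesian expansion is what <code>ParamGridBuilder<\/code> produces for the <code>CrossValidator<\/code>.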
In the list of steps you will see your step running, and you will also have access to the system logs. Detailed logs are saved to the path defined in your cluster configuration. Additionally, EMR lets you clone both steps and clusters, so you don&#8217;t have to type everything again.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-8a46f06 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"8a46f06\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-3cc4630\" data-id=\"3cc4630\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b2cc4e7 elementor-widget elementor-widget-heading\" data-id=\"b2cc4e7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">Submission on Kaggle<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-4184652 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"4184652\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 
elementor-top-column elementor-element elementor-element-4efa9ff\" data-id=\"4efa9ff\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-9e19935 elementor-widget elementor-widget-text-editor\" data-id=\"9e19935\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAs mentioned throughout the explanations, many improvements could (and should) be made in terms of exploratory data analysis, feature engineering and the evaluation of other models (starting with the simplest ones, such as Linear Regression) in order to reduce the prediction error.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-f16a969 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"f16a969\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-8034458\" data-id=\"8034458\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-fb72c89 elementor-widget elementor-widget-text-editor\" data-id=\"fb72c89\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tBeing over-simplistic, this model achieved a Mean Absolute Error (MAE) of 1286 on the\u00a0<a 
href=\"https:\/\/www.kaggle.com\/c\/allstate-claims-severity\/leaderboard\" rel=\"noopener\">public leaderboard<\/a>, far from the top positions.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-17021b8 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"17021b8\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-e5e9607\" data-id=\"e5e9607\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-400a153 elementor-widget elementor-widget-text-editor\" data-id=\"400a153\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe submission file and the detailed metrics of the model evaluation can be found under the\u00a0<code>output<\/code>\u00a0directory in the <a href=\"https:\/\/github.com\/adornes\/spark_scala_ml_examples\" rel=\"noopener\">Github repository<\/a>.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-218a0c0 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"218a0c0\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column 
elementor-col-100 elementor-top-column elementor-element elementor-element-982d5d6\" data-id=\"982d5d6\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-bcb291d elementor-widget elementor-widget-heading\" data-id=\"bcb291d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Conclusion<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-d1565e6 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"d1565e6\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-5f63833\" data-id=\"5f63833\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-fe1ff18 elementor-widget elementor-widget-text-editor\" data-id=\"fe1ff18\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tApache Spark 2.0 is indeed a powerful framework for building Machine Learning models, transformation pipelines and evaluations, all as a highly scalable end product. 
Scala, in\u00a0turn, is also a powerful programming language and the natural choice for developing Spark applications.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-9b6eeda elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"9b6eeda\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-4b056a4\" data-id=\"4b056a4\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-8719d8a elementor-widget elementor-widget-text-editor\" data-id=\"8719d8a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAllstate&#8217;s challenge has a very interesting objective, well aligned with real-world Machine Learning problems, which makes the concepts discussed in this article applicable to many other problems.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-5f261fc elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"5f261fc\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element 
elementor-element-931e51b\" data-id=\"931e51b\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-09337bd elementor-widget elementor-widget-text-editor\" data-id=\"09337bd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe full source code can be found at <a href=\"https:\/\/github.com\/adornes\/spark_scala_ml_examples\" rel=\"noopener\">this Github repository<\/a>.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-eb6ea79 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"eb6ea79\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-eb838eb\" data-id=\"eb838eb\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b1261dd elementor-widget elementor-widget-text-editor\" data-id=\"b1261dd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tMany technical and conceptual ideas discussed here are open. 
Suggestions and corrections are highly appreciated.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Allstate, the second largest insurance company in United States, recently launched a Machine Learning recruitment challenge to predict the cost and the severity of claims.<\/p>\n","protected":false},"author":519,"featured_media":2937,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[95],"ppma_author":[1616],"class_list":["post-521","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-big-data-amp-technology"],"authors":[{"term_id":1616,"user_id":519,"is_guest":0,"slug":"admin-experfy-com","display_name":"Daniel Adornes","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Adornes","first_name":"Daniel","job_title":"","description":""}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/521","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/519"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=521"}],"version-history":[{"count":5,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/521\/revisions"}],"predecessor-version":[{"id":37507,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/521\/revisions\/37507"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/2937"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-
json\/wp\/v2\/media?parent=521"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=521"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=521"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=521"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}