{"id":9318,"date":"2020-08-14T06:57:57","date_gmt":"2020-08-14T06:57:57","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/?p=9318"},"modified":"2023-11-20T18:06:18","modified_gmt":"2023-11-20T18:06:18","slug":"how-to-build-a-machine-learning-model","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/how-to-build-a-machine-learning-model\/","title":{"rendered":"How to Build a Machine Learning Model"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"9318\" class=\"elementor elementor-9318\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-4bf1d6c elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"4bf1d6c\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-341a8941\" data-id=\"341a8941\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-466e4f4 elementor-widget elementor-widget-heading\" data-id=\"466e4f4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">\n<h2 class=\"wp-block-heading\">A Visual Guide to Learning Data Science<\/h2>\n<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ff3c7e4 elementor-widget elementor-widget-text-editor\" data-id=\"ff3c7e4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p>Learning data science may seem intimidating but it doesn\u2019t have to be that way. Let\u2019s make learning data science fun and easy. So the challenge is how do we exactly make learning data science both fun and easy?<\/p>\n\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-227c2f3 elementor-widget elementor-widget-image\" data-id=\"227c2f3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/2244\/0*IT9aLhgbOVDkMNKM\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2607dff elementor-widget elementor-widget-text-editor\" data-id=\"2607dff\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p>Cartoons are fun and since \u201c<em>a picture is worth a thousand words\u201d<\/em>, so why not make a cartoon about data science? With that goal in mind, I\u2019ve set out to doodle on my iPad the elements that are required for building a machine learning model. After a few days, the infographic shown above is what I came up with, which was also published on LinkedIn and on the\u00a0<a href=\"https:\/\/github.com\/dataprofessor\/infographic\" target=\"_blank\" rel=\"noreferrer noopener\">Data Professor GitHub<\/a>.<\/p>\n\n\n<hr class=\"wp-block-separator\" \/>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-032f857 elementor-widget elementor-widget-heading\" data-id=\"032f857\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"1b7c\">Dataset<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d13f31e elementor-widget elementor-widget-text-editor\" data-id=\"d13f31e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p>A dataset is the starting point in your journey of building the machine learning model. Simply put, the dataset is essentially an\u00a0<strong>M<\/strong>\u00d7<strong>N<\/strong>\u00a0matrix where\u00a0<strong>M<\/strong>\u00a0represents the columns (features) and\u00a0<strong>N<\/strong>\u00a0the rows (samples).<\/p>\n\n\n\n<p>Columns can be broken down to\u00a0<strong>X<\/strong>\u00a0and\u00a0<strong>Y<\/strong>. Firstly,\u00a0<strong>X<\/strong>\u00a0is synonymous with several similar terms such as features, independent variables and input variables. Secondly,\u00a0<strong>Y<\/strong>\u00a0is also synonymous with several terms namely class label, dependent variable and output variable.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-03263ba elementor-widget elementor-widget-image\" data-id=\"03263ba\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1090\/1*-foC2sZDjaT6eUG3ZzdSSQ@2x.jpeg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6d087b2 elementor-widget elementor-widget-text-editor\" data-id=\"6d087b2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p>It should be noted that a dataset that can be used for\u00a0<strong><em>supervised learning\u00a0<\/em><\/strong>(can perform either regression or classification)would contain both\u00a0<strong>X<\/strong>\u00a0and\u00a0<strong>Y<\/strong>\u00a0whereas a dataset that can be used for\u00a0<strong><em>unsupervised learning<\/em><\/strong>\u00a0will only have\u00a0<strong>X<\/strong>.<\/p>\n\n\n\n<p>Moreover, if\u00a0<strong>Y<\/strong>\u00a0contains quantitative values then the dataset (comprising of\u00a0<strong>X<\/strong>\u00a0and\u00a0<strong>Y<\/strong>) can be used for\u00a0<strong><em>regression<\/em><\/strong>\u00a0tasks whereas if\u00a0<strong>Y<\/strong>\u00a0contains qualitative values then the dataset (comprising of\u00a0<strong>X<\/strong>\u00a0and\u00a0<strong>Y<\/strong>) can be used for\u00a0<strong><em>classification tasks<\/em><\/strong>.<\/p>\n\n<hr class=\"wp-block-separator\" \/>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f8ab293 elementor-widget elementor-widget-heading\" data-id=\"f8ab293\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Exploratory Data Analysis<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7b7edf1 elementor-widget elementor-widget-text-editor\" data-id=\"7b7edf1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p>Exploratory data analysis (EDA) is performed in order to gain a preliminary understanding and allow us to get acquainted with the dataset. In a typical data science project, one of the first things that I would do is\u00a0<em>\u201ceyeballing the data\u201d\u00a0<\/em>byperforming EDA so as to gain a better understanding of the data.<\/p>\n\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f1068c0 elementor-widget elementor-widget-heading\" data-id=\"f1068c0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Three major EDA approaches that I normally use includes:<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b4c0807 elementor-widget elementor-widget-text-editor\" data-id=\"b4c0807\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<ul class=\"wp-block-list\">\n<li><strong>Descriptive statistics<\/strong>\u00a0\u2014 Mean, median, mode, standard deviation<\/li>\n<li><strong>Data visualisations<\/strong>\u00a0\u2014 Heat maps (discerning feature intra-correlation), box plot (visualize group differences), scatter plots (visualize correlations between features), principal component analysis (visualize distribution of clusters presented in the dataset), etc.<\/li>\n<li><strong>Data shaping<\/strong>\u00a0\u2014 Pivoting data, grouping data, filtering data, etc.<\/li>\n<\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1d6036b elementor-widget elementor-widget-image\" data-id=\"1d6036b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/382\/1*ayh3ezdNbV7iFWzWh03XdA.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-b5d7ca0 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"b5d7ca0\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-6077733\" data-id=\"6077733\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b743dea elementor-widget elementor-widget-image\" data-id=\"b743dea\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/378\/1*pC_hLMMyxeV6qGNAbxi1MA.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ae146d6 elementor-widget elementor-widget-image\" data-id=\"ae146d6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/955\/1*bOa8F7qPRnBU-K167Vw8gA.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b066916 elementor-widget elementor-widget-image\" data-id=\"b066916\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/4692\/1*GwPx0V016iE-avAxVl3Rcw.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-14c4178 elementor-widget elementor-widget-text-editor\" data-id=\"14c4178\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p>For more step-by-step tutorial on performing these\u00a0<a href=\"https:\/\/www.youtube.com\/watch?v=9m4n2xVzk9o\" target=\"_blank\" rel=\"noreferrer noopener\">exploratory data analysis in Python<\/a>, please check out the video I made on the\u00a0<a href=\"https:\/\/www.youtube.com\/dataprofessor\/\" target=\"_blank\" rel=\"noreferrer noopener\">Data Professor YouTube channel<\/a>.<\/p>\n\n\n\n<figure><iframe src=\"https:\/\/cdn.embedly.com\/widgets\/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F9m4n2xVzk9o%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D9m4n2xVzk9o&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2F9m4n2xVzk9o%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube\" width=\"654\" height=\"400\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/figure>\n\n\n<hr class=\"wp-block-separator\" \/>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2c9ecbd elementor-widget elementor-widget-heading\" data-id=\"2c9ecbd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"1c87\">Data Pre-Processing<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-75f21a9 elementor-widget elementor-widget-text-editor\" data-id=\"75f21a9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Data pre-processing (also known as data cleaning, data wrangling or data munging) is the process by which the data is subjected to various checks and scrutiny in order to remedy issues of missing values, spelling errors, normalizing\/standardizing values such that they are comparable, transforming data (e.g. logarithmic transformation), etc.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:quote -->\n<blockquote class=\"wp-block-quote\">\n<p>\u201cGarbage in, Garbage out.\u201d<br \/>\u2014 George Fuechsel<\/p>\n<\/blockquote>\n<!-- \/wp:quote -->\n\n<!-- wp:paragraph -->\n<p>As the above quote suggests, the quality of data is going to exert a big impact on the quality of the generated model. Therefore, to achieve the highest model quality, significant effort should be spent in the data pre-processing phase. It is said that data pre-processing could easily account for 80% of the time spent on data science projects while the actual model building phase and subsequent post-model analysis account for the remaining 20%.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:separator --><hr class=\"wp-block-separator\" \/><!-- \/wp:separator -->\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fbe5d94 elementor-widget elementor-widget-heading\" data-id=\"fbe5d94\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"ee5a\">Train-Test Split<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-067e7f9 elementor-widget elementor-widget-text-editor\" data-id=\"067e7f9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>In the development of machine learning models, it is desirable that the trained model perform well on new, unseen data. In order to simulate the new, unseen data, the available data is subjected to\u00a0<strong><em>data splitting<\/em><\/strong>\u00a0whereby it is split to 2 portions (sometimes referred to as the\u00a0<strong><em>train-test split<\/em><\/strong>). Particularly, the first portion is the larger data subset that is used as the\u00a0<strong><em>training set<\/em><\/strong>\u00a0(such as accounting for 80% of the original data) and the second is normally a smaller subset and used as the\u00a0<strong><em>testing set\u00a0<\/em><\/strong>(the remaining 20% of the data). It should be noted that such data split is performed once.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Next, the training set is used to build a predictive model and such\u00a0<em>trained model<\/em>\u00a0is then applied on the testing set (<em>i.e.<\/em>\u00a0serving as the new, unseen data) to make predictions. Selection of the best model is made on the basis of the model\u2019s performance on the testing set and in efforts to obtain the best possible model, hyperparameter optimization may also be performed.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-07bf4cc elementor-widget elementor-widget-image\" data-id=\"07bf4cc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1454\/1*Hs7RCpyvj4NrjANdwiFHaQ@2x.jpeg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e599af4 elementor-widget elementor-widget-heading\" data-id=\"e599af4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"d450\">Train-Validation-Test Split<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-82728a8 elementor-widget elementor-widget-text-editor\" data-id=\"82728a8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Another common approach for\u00a0<strong><em>data splitting<\/em><\/strong>\u00a0is to split the data to 3 portions: (1) training set, (2) validation set and (3) testing set. Similar to what was explained above, the training set is used to build a predictive model and is also evaluated on the\u00a0<strong><em>validation set<\/em><\/strong>\u00a0whereby predictions are made, model tuning can be made (e.g. hyperparameter optimization) and selection of the best performing model based on results of the validation set. As we can see, similar to what was performed above to the test set, here we do the same procedure on the validation set instead. Notice that the\u00a0<strong><em>testing set<\/em><\/strong>\u00a0is not involved in any of the model building and preparation. Thus, the testing set can truly act as the new, unseen data. A more in-depth treatment of this topic is provided by\u00a0<a href=\"https:\/\/developers.google.com\/machine-learning\/crash-course\/validation\/another-partition\" target=\"_blank\" rel=\"noreferrer noopener\">Google\u2019s Machine Learning Crash Course<\/a>.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7841a23 elementor-widget elementor-widget-image\" data-id=\"7841a23\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1531\/1*bWUWy30TYvcBcaT60jEQ9g@2x.jpeg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-71e60c0 elementor-widget elementor-widget-heading\" data-id=\"71e60c0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Cross-Validation<\/h3><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4f70217 elementor-widget elementor-widget-text-editor\" data-id=\"4f70217\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>In order to make the most economical use of the available data, an\u00a0<strong><em>N-fold cross-validation (CV)\u00a0<\/em><\/strong>is normally used whereby the dataset is partitioned to\u00a0<em>N<\/em>\u00a0folds (<em>i.e.<\/em>\u00a0commonly 5-fold or 10-fold CV are used). In such\u00a0<em>N<\/em>-fold CV, one of the fold is left out as the testing data while the remaining folds are used as the training data for model building.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>For example, in a 5-fold CV, 1 fold is left out and used as the testing data while the remaining 4 folds are pooled together and used as the training data for model building. The trained model is then applied on the aforementioned left-out fold (<em>i.e.<\/em>\u00a0the test data). This process is carried out iteratively until all folds had a chance to be left out as the testing data. As a result, we will have built 5 models (i.e. where each of the 5 folds have been left out as the testing set) where each of the 5 models contain associated performance metrics (which we will discuss soon in the forthcoming section). Finally, the metric values are based on the average performance computed from the 5 models.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b1006a6 elementor-widget elementor-widget-image\" data-id=\"b1006a6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/2182\/1*pb_PK_pzQZz4EdKcqfGJ1Q@2x.jpeg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1b61698 elementor-widget elementor-widget-text-editor\" data-id=\"1b61698\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>In situations when\u00a0<em>N<\/em>\u00a0is equal to the number of data samples, we call this\u00a0<strong><em>leave-one-out cross-validation<\/em><\/strong>. In this type of CV, each data sample represents a fold. For example, if\u00a0<em>N<\/em>\u00a0is equal to 30 then there are 30 folds (1 sample per fold). As in any other\u00a0<em>N<\/em>-fold CV, 1 fold is left out as the testing set while the remaining 29 folds are used to build the model. Next, the built model is applied to make prediction on the left-out fold. As before, this process is performed iteratively for a total of 30 times; and the average performance from the 30 models are computed and used as the CV performance metric.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:separator --><hr class=\"wp-block-separator\" \/><!-- \/wp:separator -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-46e284e elementor-widget elementor-widget-heading\" data-id=\"46e284e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"2d6d\">Model Building<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ab914de elementor-widget elementor-widget-text-editor\" data-id=\"ab914de\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Now, comes the fun part where we finally get to use the meticulously prepared data for model building. Depending on the data type (qualitative or quantitative) of the target variable (commonly referred to as the\u00a0<strong>Y<\/strong>\u00a0variable) we are either going to be building a classification (if\u00a0<strong>Y<\/strong>\u00a0is qualitative) or regression (if\u00a0<strong>Y<\/strong>\u00a0is quantitative) model.<\/p>\n<!-- \/wp:paragraph -->   \t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-93c352f elementor-widget elementor-widget-heading\" data-id=\"93c352f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"a69b\">Learning Algorithms<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-90d50d5 elementor-widget elementor-widget-text-editor\" data-id=\"90d50d5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Machine learning algorithms could be broadly categorised to one of three types:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list {\"ordered\":true} -->\n<ol>\n<li><em>Supervised learning<\/em>\u00a0\u2014 is a machine learning task that establishes the mathematical relationship between input\u00a0<strong>X<\/strong>\u00a0and output\u00a0<strong>Y<\/strong>\u00a0variables. Such\u00a0<strong>X<\/strong>,\u00a0<strong>Y<\/strong>\u00a0pair constitutes the labeled data that are used for model building in an effort to learn how to predict the output from the input.<\/li>\n<li><em>Unsupervised learning<\/em>\u00a0\u2014 is a machine learning task that makes use of only the input\u00a0<strong>X<\/strong>\u00a0variables. Such\u00a0<strong>X<\/strong>\u00a0variables are unlabeled data that the learning algorithm uses in modeling the inherent structure of the data.<\/li>\n<li><em>Reinforcement learning<\/em>\u00a0\u2014 is a machine learning task that decides on the next course of action and it does this by learning through trial and error in an effort to maximize the reward.<\/li>\n<\/ol>\n<!-- \/wp:list -->\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-293ddc5 elementor-widget elementor-widget-heading\" data-id=\"293ddc5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"15d8\">Hyperparameter Optimization<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e65b44b elementor-widget elementor-widget-text-editor\" data-id=\"e65b44b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<!-- wp:paragraph -->\n<p>Hyperparameters are essentially parameters of the machine learning algorithm that directly impacts the learning process and prediction performance. As there are no \u201cone-size fits all\u201d hyperparameter settings that will universally work for all datasets therefore one will need to perform\u00a0<strong><em>hyperparameter optimization<\/em><\/strong>\u00a0(also known as\u00a0<em>hyperparameter tuning<\/em>\u00a0or\u00a0<em>model tuning<\/em>).<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Let\u2019s take random forest as an example. Two common hyperparameters that are typically subjected to optimization when using the\u00a0<strong>randomForest<\/strong>\u00a0R package includes the\u00a0<code>mtry<\/code>\u00a0and\u00a0<code>ntree<\/code>\u00a0parameters (this corresponds to\u00a0<code>n_estimators<\/code>\u00a0and\u00a0<code>max_features<\/code>\u00a0in\u00a0<code>RandomForestClassifier()<\/code>\u00a0and\u00a0<code>RandomForestRegressor()<\/code>\u00a0functions from the\u00a0<strong>scikit-learn<\/strong>\u00a0Python library).\u00a0<code>mtry<\/code>\u00a0(<code>max_features<\/code>) represents the number of variables that are randomly sampled as candidates at each split while\u00a0<code>ntree<\/code>\u00a0(<code>n_estimators<\/code>) represents the number of trees to grow.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Another popular machine learning algorithm is support vector machine. Hyperparameters to be optimized is the\u00a0<code>C<\/code>\u00a0and\u00a0<code>gamma<\/code>\u00a0parameters for the radial basis function (RBF) kernel (i.e. only the\u00a0<code>C<\/code>\u00a0parameter for the linear kernel; the\u00a0<code>C<\/code>\u00a0and\u00a0<code>exponential number<\/code>\u00a0for the polynomial kernel). The\u00a0<code>C<\/code>\u00a0parameter is a penalty term that limits overfitting while the\u00a0<code>gamma<\/code>\u00a0parameter controls the width of the RBF kernel. As mentioned above, tuning is typically performed so as to arrive at the optimal set of values to use for the hyperparameters and in spite of this there are research directed towards finding good starting values for the\u00a0<code>C<\/code>\u00a0and\u00a0<code>gamma<\/code>\u00a0parameters (<a href=\"https:\/\/pubs.acs.org\/doi\/10.1021\/ci500344v\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">Alvarsson et al. 2014<\/a>).<\/p>\n<!-- \/wp:paragraph -->\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-86c8449 elementor-widget elementor-widget-heading\" data-id=\"86c8449\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"47af\">Feature Selection<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d2ed9b6 elementor-widget elementor-widget-text-editor\" data-id=\"d2ed9b6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>As the name implies, feature selection is literally the process of selecting a subset of features from an initially large volume of features. Aside from achieving highly accurate models, one of the most important aspect of machine learning model building is to obtain actionable insights and in order to achieve that it is important to be able to select a subset of important features from the vast number.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The task of feature selection in itself can constitute an entirely new area of research where intense efforts are geared toward devising novel algorithms and approaches. From amongst the plethora of available feature selection algorithms, some of the classical methods are based on\u00a0<em>simulated annealing<\/em>\u00a0and\u00a0<em>genetic algorithm<\/em>. In addition to these, there are a large collection of approaches based on\u00a0<em>evolutionary algorithms\u00a0<\/em>(e.g. Particle Swarms Optimization, Ant Colony Optimization, etc.)and\u00a0<em>stochastic approaches<\/em>\u00a0(e.g. Monte Carlo).<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Our own research group have also explored the use of Monte Carlo simulation for feature selection in a study of modeling the quantitative structure-activity relationship of aldose reductase inhibitors (<a href=\"https:\/\/doi.org\/10.1016\/j.ejmech.2014.02.043\" target=\"_blank\" rel=\"noreferrer noopener\">Nantasenamat et al. 2014<\/a>). We have also devised a novel feature selection approach based on combining two popular evolutionary algorithms namely genetic algorithm and particle swarm algorithm in our work entitled\u00a0<a href=\"https:\/\/doi.org\/10.1016\/j.chemolab.2013.08.009\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Genetic algorithm search space splicing particle swarm optimization as general-purpose optimizer<\/em><\/a>\u00a0(<a href=\"https:\/\/doi.org\/10.1016\/j.chemolab.2013.08.009\" target=\"_blank\" rel=\"noreferrer noopener\">Li\u00a0<em>et al.<\/em>\u00a02013<\/a>).<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5f9ba21 elementor-widget elementor-widget-image\" data-id=\"5f9ba21\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/573\/0*sm4IFr5OcsOZNjKM.jpg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9d3c901 elementor-widget elementor-widget-text-editor\" data-id=\"9d3c901\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t&#8220;Schematic diagram of the principles of the genetic algorithm search space splicing particle swarms optimization (GA-SSS-PSO)&#8221; \/>\n<figcaption><strong>Schematic diagram of the principles of the genetic algorithm search space splicing particle swarms optimization (GA-SSS-PSO) approach as illustrated using the Schwefel function in 2 dimensions.\u00a0<\/strong>\u201cThe original search space (a)\u00a0<em>x<\/em>\u2208[\u2013500,0] is spliced into sub-spaces at fixed interval of 2 at each dimension (a dimension equals an horizontal axis in the picture). This resulted in four subspaces (b\u2013e) where the range of\u00a0<em>x<\/em>\u00a0at each dimension is half that of the original. Each string of GA encodes the indexes for one subspace. Then, GA heuristically selects a subspace (e) and PSO is initiated there (particles are shown as red dots). PSO searches for the global minimum of the subspaces and the best particle fitness is used as fitness of the GA string encoding the indexes for that subspace. Finally, GA undergoes evolution and selects a new subspace to explore. The whole process is repeated until satisfactory error level is reached.\u201d(Reprinted from Chemometrics and Intelligent Laboratory Systems, Volume 128, Genetic algorithm search space splicing particle swarm optimization as general-purpose optimizer, Pages 153\u2013159, Copyright (2013), with permission from Elsevier)<\/figcaption>\n<!-- wp:separator --><hr class=\"wp-block-separator\" \/><!-- \/wp:separator -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-70d4890 elementor-widget elementor-widget-heading\" data-id=\"70d4890\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"1b0c\">Machine Learning Tasks<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0b237db elementor-widget elementor-widget-text-editor\" data-id=\"0b237db\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Two common machine learning tasks in supervised learning includes classification and regression.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6991022 elementor-widget elementor-widget-heading\" data-id=\"6991022\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><!-- wp:heading {\"level\":3} -->\n<h3 id=\"473b\">Classification<\/h3>\n<!-- \/wp:heading --><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f434372 elementor-widget elementor-widget-text-editor\" data-id=\"f434372\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>A trained classification model takes as\u00a0<strong>input<\/strong>\u00a0a set of variables (either quantitative or qualitative) and predicts the\u00a0<strong>output<\/strong>\u00a0class label (qualitative). The following figure hows three classes as indicated by the different colors and labels. Each small colored spheres represent a data sample whereby each sample<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ec735e0 elementor-widget elementor-widget-image\" data-id=\"ec735e0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1570\/1*ACzXRSiJrJZzTOvRcXlTuA@2x.jpeg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f0443d3 elementor-widget elementor-widget-text-editor\" data-id=\"f0443d3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>Schematic illustration of a multi-class classification problem.\u00a0<\/strong>Three classes of data samples are shown in 2-dimensions. This drawing shows a hypothetical distribution of data samples. Such visualisation plot can be created by performing PCA analysis and displaying the first two principal components (PCs); alternatively a simple scatter plot of two variables can also be selected and visualized. (Drawn by Chanin Nantasenamat)\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-14d39bf elementor-widget elementor-widget-heading\" data-id=\"14d39bf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"78d7\">Example dataset<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-491f8d5 elementor-widget elementor-widget-text-editor\" data-id=\"491f8d5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Take for example, the\u00a0<strong>Penguins<\/strong>\u00a0<strong>dataset<\/strong>\u00a0(recently proposed as a replacement dataset for the heavily used\u00a0<strong>Iris dataset<\/strong>) where we take as input\u00a0<strong><em>quantitative<\/em><\/strong>\u00a0(bill length, bill depth, flipper length and body mass)and\u00a0<strong><em>qualitative<\/em><\/strong>\u00a0(sex and island) features that uniquely describes the characteristics of penguins and classifying it as belonging to one of three\u00a0<strong><em>species<\/em><\/strong>\u00a0class label (Adelie, Chinstrap or Gentoo). The dataset is comprised of 344 rows and 8 columns. A prior analysis revealed that the dataset contains 333 complete cases where 19 missing values were presented in the 11 incomplete cases.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2a3e66e elementor-widget elementor-widget-image\" data-id=\"2a3e66e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1800\/0*PwAMj9iYPyLJzusj.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-55ec449 elementor-widget elementor-widget-heading\" data-id=\"55ec449\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"3c2d\">Performance metrics<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f2b42ff elementor-widget elementor-widget-text-editor\" data-id=\"f2b42ff\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>How do we know when our model performs good or bad? The answer is to use performance metrics and some of the common ones for assessing the classification performance includes accuracy (Ac), sensitivity (Sn), specificity (Sp) and the Matthew\u2019s correlation coefficient (MCC).<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a50d300 elementor-widget elementor-widget-image\" data-id=\"a50d300\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/783\/0*36gdfD59MadDm0up.gif\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9ed8566 elementor-widget elementor-widget-image\" data-id=\"9ed8566\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/450\/0*-_v6XV-iLCfAifbQ.gif\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ce5b581 elementor-widget elementor-widget-image\" data-id=\"ce5b581\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/450\/0*xo6a3rJxy0-mxKGl.gif\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d6cff57 elementor-widget elementor-widget-image\" data-id=\"d6cff57\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1616\/0*vAtkVq-SUjKt5pUt.gif\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a4dab31 elementor-widget elementor-widget-text-editor\" data-id=\"a4dab31\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>where TP, TN, FP and FN denote the instances of true positives, true negatives, false positives and false negatives, respectively. It should be noted that MCC ranges from \u22121 to 1 whereby an MCC of \u22121 indicates the worst possible prediction while a value of 1 indicates the best possible prediction scenario. Also, an MCC of 0 is indicative of random prediction.<\/p>\n<!-- \/wp:paragraph -->\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-693d6e4 elementor-widget elementor-widget-heading\" data-id=\"693d6e4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"f9bc\">Regression<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b4f362f elementor-widget elementor-widget-text-editor\" data-id=\"b4f362f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>In a nutshell, a trained regression model can be best summarised by the following simple equation:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><code>Y=f(X)<\/code><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>where\u00a0<strong>Y<\/strong>\u00a0corresponds to the quantitative\u00a0<strong>output\u00a0<\/strong>variable,\u00a0<strong>X<\/strong>\u00a0refers to the\u00a0<strong>input<\/strong>\u00a0variables and\u00a0<strong>f<\/strong>\u00a0refers to the mapping function (obtained from the trained model) for computing the output values as a function of input features. The essence of the above equation for the regression example is that\u00a0<strong>Y<\/strong>\u00a0can be deduced if\u00a0<strong>X<\/strong>\u00a0is known. Once\u00a0<strong>Y<\/strong>\u00a0is calculated (we can also say \u2018predicted\u2019), a popular way to visualise the results is to make a simple scatter plot of the actual values versus the predicted values as shown below.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b1e582d elementor-widget elementor-widget-image\" data-id=\"b1e582d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/850\/1*YRafdHgQ3bCVnwwK2UlrEQ@2x.jpeg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-bc801aa elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"bc801aa\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-5abd86f\" data-id=\"5abd86f\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-fe38d2d elementor-widget elementor-widget-heading\" data-id=\"fe38d2d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"83bc\">Example dataset<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bac4c7f elementor-widget elementor-widget-text-editor\" data-id=\"bac4c7f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>The\u00a0<strong>Boston Housing dataset<\/strong>\u00a0is a popular example dataset typically used in data science tutorials. The dataset is comprised of 506 rows and 14 columns. For conciseness, shown below is the header (showing the names of variables) plus the first 4 rows of the dataset.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6404604 elementor-widget elementor-widget-image\" data-id=\"6404604\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/08\/Untitled-1-4.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-dbd437b elementor-widget elementor-widget-text-editor\" data-id=\"dbd437b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Of the 14 columns, the first 13 variables are used as\u00a0<strong>input<\/strong>\u00a0variables while the median house price (<code>medv<\/code>) is used as the\u00a0<strong>output<\/strong>\u00a0variable. As can be seen all 14 variables contain quantitative values and thus suitable for regression analysis. I also made a step-by-step YouTube video showing <a href=\"https:\/\/www.experfy.com\/blog\/how-to-build-a-regression-model-in-python\/\" target=\"_blank\" rel=\"noreferrer noopener\">how to build<\/a> a\u00a0<a href=\"https:\/\/www.youtube.com\/watch?v=R15LjD8aCzc\" target=\"_blank\" rel=\"noreferrer noopener\">linear regression model in Python<\/a>.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:html -->\n<figure><iframe src=\"https:\/\/cdn.embedly.com\/widgets\/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FR15LjD8aCzc%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DR15LjD8aCzc&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FR15LjD8aCzc%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube\" width=\"654\" height=\"430\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/figure>\n<!-- \/wp:html -->\n\n<!-- wp:paragraph -->\n<p>In the video, I started by showing you how to read in the Boston Housing dataset, separating the data to X and Y matrices, perform 80\/20 data split, build a linear regression model using the 80% subset and applying the trained model to make prediction on the 20% subset. Finally the performance metrics and scatter plot of the actual versus predicted\u00a0<code>medv<\/code>\u00a0values are shown.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-93007fb elementor-widget elementor-widget-image\" data-id=\"93007fb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/368\/1*x-gU_Cmny0ohjaynyEC0KA.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fd15b21 elementor-widget elementor-widget-heading\" data-id=\"fd15b21\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"c09d\">Performance metrics<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ff59161 elementor-widget elementor-widget-text-editor\" data-id=\"ff59161\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Evaluation of the performance of regression models are performed to assess the degree at which a fitted model can accurately predict the values of input data.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Common metric for evaluating the performance of regression models is the\u00a0<strong><em>coefficient of determination<\/em><\/strong>\u00a0(R\u00b2).<\/p>\n<!-- \/wp:paragraph -->\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-839d950 elementor-widget elementor-widget-image\" data-id=\"839d950\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/98\/1*uSVnBrs01jxhy-q3JZAjGg.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-43afc98 elementor-widget elementor-widget-text-editor\" data-id=\"43afc98\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>As can be seen from the equation, R\u00b2 is essentially 1 minus the ratio of the residual sum of squares to that of the total sum of squares. In simple terms, it can be said to represent the relative measure of explained variance. For example if R\u00b2 = 0.6 then it means that the model could explain 60% of the variance (<em>i.e.<\/em>\u00a0that is 60% of the data fits the regression model) whereas the unexplained variance accounted for the remaining 40%.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Additionally, the\u00a0<strong><em>mean squared error (MSE)\u00a0<\/em><\/strong>as well as the\u00a0<strong><em>root mean squared error (RMSE)<\/em><\/strong>\u00a0are also common measures of the residuals or error of prediction.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2139fd9 elementor-widget elementor-widget-image\" data-id=\"2139fd9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/148\/1*r9w-CQI-lS6nue5mf75lzQ.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-693742e elementor-widget elementor-widget-text-editor\" data-id=\"693742e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>As can be seen from the above equation, the MSE is as the name implies easily computed by taking the mean of the squared error. Furthermore, a simple square root of the MSE yields the RMSE.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:separator --><hr class=\"wp-block-separator\" \/><!-- \/wp:separator -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4df5634 elementor-widget elementor-widget-heading\" data-id=\"4df5634\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"69d0\">A Visual Explanation of the Classification Process<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ea0d159 elementor-widget elementor-widget-text-editor\" data-id=\"ea0d159\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Now let\u2019s take another look at the entire process of a classification model. Using the Penguins dataset as an example, we can see that penguins can be characterised by 4 quantitative features and 2 qualitative features, which are then used as input for training a classification model. In training the model, some of the issues that one would need to consider includes the following:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul>\n<li>What machine learning algorithm to use?<\/li>\n<li>What search space should be explored for hyperparameter optimization?<\/li>\n<li>Which data splitting scheme to use? 80\/20 split or 60\/20\/20 split? Or 10-fold CV?<\/li>\n<\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:paragraph -->\n<p>Once the model has been trained, the resulting model can be used to make predictions on the class label (<em>i.e.<\/em>\u00a0in our case the penguins species), which can be one of three penguins species: Adelie, Chinstrap or Gentoo.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Aside from performing only classification modeling, one could also perform principal component analysis (PCA), which will make use of only the X (independent) variables to discern the underlying structure of the data and in doing so would allow the visualisation of the inherent data clusters (shown below as a hypothetical plot where the clusters are color-coded according to the 3 penguins species).<\/p>\n<!-- \/wp:paragraph -->\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-45e367c elementor-widget elementor-widget-image\" data-id=\"45e367c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/2182\/1*DQn42T9e2ne72XrBXReQ5A@2x.jpeg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-51e9659 elementor-widget elementor-widget-text-editor\" data-id=\"51e9659\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>Schematic illustration of the process of building a classification model.\u00a0<\/strong>(Drawn by Chanin Nantasenamat)\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>How to Build a Machine Learning Model? What are the elements that are required for building a machine learning model? A dataset is the starting point in your journey of building the machine learning model. <\/p>\n","protected":false},"author":886,"featured_media":9321,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[92,457],"ppma_author":[3736],"class_list":["post-9318","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-machine-learning","tag-model-building"],"authors":[{"term_id":3736,"user_id":886,"is_guest":0,"slug":"chanin-nantasenamat","display_name":"Chanin Nantasenamat","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/08\/Chanin-Nantasenamat-150x150.jpg","user_url":"http:\/\/www.mahidol.ac.th\/mueng\/","last_name":"Nantasenamat","first_name":"Chanin","job_title":"","description":"Chanin Nantasenamat is Associate Professor and Head, Center of Data Mining and Biomedical Informatics at Mahidol University, Thailand. He is also Founder of Data Professor YouTube Channel and Associate Editor at Frontiers in Pharmacology. Thought Leader on AI and ML Education, he was a Visiting Professor at Uppsala University, Lund University, University of California at Los Angeles as well as the California State University at Fullerton."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/9318","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/886"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=9318"}],"version-history":[{"count":9,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/9318\/revisions"}],"predecessor-version":[{"id":34199,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/9318\/revisions\/34199"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/9321"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=9318"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=9318"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=9318"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=9318"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}