{"id":9588,"date":"2020-09-07T09:04:41","date_gmt":"2020-09-07T09:04:41","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/?p=9588"},"modified":"2023-11-09T06:12:33","modified_gmt":"2023-11-09T06:12:33","slug":"how-to-build-a-regression-model-in-python","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/how-to-build-a-regression-model-in-python\/","title":{"rendered":"How to Build a Regression Model in Python"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"9588\" class=\"elementor elementor-9588\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-2582c111 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-eae-slider=\"77991\" data-id=\"2582c111\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-2c17146a\" data-eae-slider=\"14598\" data-id=\"2c17146a\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-a15f15f elementor-widget elementor-widget-heading\" data-id=\"a15f15f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><em>A Detailed and Visual Step-by-Step Walkthrough<\/em><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d3801f4 elementor-widget elementor-widget-text-editor\" data-id=\"d3801f4\" data-element_type=\"widget\" data-e-type=\"widget\" 
data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>If you are an aspiring data scientist or a veteran data scientist, this article is for you! In this article, we will be building a simple regression model in Python. To spice things up a bit, we will not be using the widely popular and ubiquitous\u00a0<em>Boston Housing<\/em>\u00a0dataset but instead, we will be using a simple Bioinformatics dataset. Particularly, we will be using the\u00a0<strong><em>Delaney Solubility<\/em><\/strong>\u00a0dataset that represents an important physicochemical property in computational drug discovery.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5317984 elementor-widget elementor-widget-text-editor\" data-id=\"5317984\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>The aspiring data scientist will find the step-by-step tutorial particularly accessible while the veteran data scientist may want to find a new challenging dataset for which to try out their state-of-the-art machine learning algorithm or workflow.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:separator --><hr class=\"wp-block-separator\" \/><!-- \/wp:separator -->\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-523053a elementor-widget elementor-widget-heading\" data-id=\"523053a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">1. 
What Are We Building Today?<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-14cee62 elementor-widget elementor-widget-text-editor\" data-id=\"14cee62\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>A regression model! And we are going to use Python to do that. While we\u2019re at it, we are going to use a bioinformatics dataset (technically, it\u2019s a cheminformatics dataset) for the model building.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Particularly, we are going to predict the LogS value, which is the log of the aqueous solubility of small molecules. The aqueous solubility value is a relative measure of the ability of a molecule to be soluble in water. It is an important physicochemical property of effective drugs.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>What better way to get acquainted with the concept of what we are building today than a cartoon illustration!<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1b3e0eb elementor-widget elementor-widget-image\" data-id=\"1b3e0eb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/2182\/1*ydtFKo4yssZJ6QXDiY1oaw@2x.jpeg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4c4bd29 elementor-widget elementor-widget-text-editor\" data-id=\"4c4bd29\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:separator --><hr class=\"wp-block-separator\" \/><!-- \/wp:separator 
-->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9a71635 elementor-widget elementor-widget-heading\" data-id=\"9a71635\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">2. Delaney Solubility Dataset<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f3b7978 elementor-widget elementor-widget-heading\" data-id=\"f3b7978\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">2.1. Data Understanding<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-e0427bc elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-eae-slider=\"84869\" data-id=\"e0427bc\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-622ee90\" data-eae-slider=\"93970\" data-id=\"622ee90\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-754e7e7 elementor-widget elementor-widget-text-editor\" data-id=\"754e7e7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>As the name implies, the\u00a0<strong><em>Delaney 
solubility<\/em><\/strong>\u00a0dataset comprises the\u00a0<strong><em>aqueous solubility<\/em><\/strong>\u00a0values along with the corresponding chemical structures for a set of 1,144 molecules. For those outside the field of biology, there are some terms that we will spend some time clarifying.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong><em>Molecules<\/em><\/strong>, sometimes referred to as small molecules or compounds, are chemical entities that are made up of atoms. Let\u2019s use an analogy here and think of atoms as being equivalent to Lego blocks, where 1 atom is 1 Lego block. When we use several Lego blocks to build something, whether it be a house, a car or some abstract entity, the constructed entities are comparable to molecules. Thus, we can refer to the specific arrangement and connectivity of atoms to form a molecule as the\u00a0<strong><em>chemical structure<\/em><\/strong>.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b6bebc9 elementor-widget elementor-widget-image\" data-id=\"b6bebc9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1761\/1*j2yXft-XVviXNxXqTfSmow@2x.jpeg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1094019 elementor-widget elementor-widget-text-editor\" data-id=\"1094019\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>So how does each of the entities that you are building differ? Well, they differ by the spatial connectivity of the blocks (i.e. how the individual blocks are connected). 
In chemical terms, molecules differ by their chemical structures. Thus, if you alter the connectivity of the blocks, you effectively alter the entity that you are building. For molecules, if atom types (e.g. carbon, oxygen, nitrogen, sulfur, phosphorus, fluorine, chlorine, etc.) or groups of atoms (e.g. hydroxy, methoxy, carboxy, ether, etc.) are altered, then the molecule is also altered, becoming a new chemical entity (i.e. a new molecule is produced).<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a9bd8ee elementor-widget elementor-widget-image\" data-id=\"a9bd8ee\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/2182\/1*QslCNe6yCH4vYJM8TYgElQ@2x.jpeg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0ade919 elementor-widget elementor-widget-text-editor\" data-id=\"0ade919\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>To become an effective drug, a molecule needs to be taken up and distributed in the human body, and this property is directly governed by the\u00a0<strong><em>aqueous solubility<\/em><\/strong>. Solubility is an important property that researchers take into consideration in the design and development of therapeutic drugs. 
Thus, a potent drug that is unable to reach the desired destination target owing to its poor solubility would be a poor drug candidate.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:separator --><hr class=\"wp-block-separator\" \/><!-- \/wp:separator -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3de0bf2 elementor-widget elementor-widget-heading\" data-id=\"3de0bf2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">2.2. Retrieving the Dataset<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ee6bbfb elementor-widget elementor-widget-text-editor\" data-id=\"ee6bbfb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>The aqueous solubility dataset as performed by Delaney in the research paper entitled\u00a0<a href=\"https:\/\/pubs.acs.org\/doi\/10.1021\/ci034243x\" target=\"_blank\" rel=\"noreferrer noopener\">ESOL: Estimating Aqueous Solubility Directly from Molecular Structure<\/a>\u00a0is available as a\u00a0<a href=\"https:\/\/pubs.acs.org\/doi\/10.1021\/ci034243x\" target=\"_blank\" rel=\"noreferrer noopener\">Supplementary file<\/a>. 
For your convenience, we have also downloaded the entire\u00a0<a href=\"https:\/\/github.com\/dataprofessor\/data\/blob\/master\/delaney.csv\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Delaney solubility dataset<\/strong><\/a>\u00a0and made it available on the\u00a0<a href=\"https:\/\/github.com\/dataprofessor\" target=\"_blank\" rel=\"noreferrer noopener\">Data Professor GitHub<\/a>.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ee8a50d elementor-widget elementor-widget-text-editor\" data-id=\"ee8a50d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:image {\"align\":\"center\",\"id\":9591,\"sizeSlug\":\"large\"} -->\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><br \/>\n<figcaption><strong>Preview of the raw version of the Delaney solubility dataset.<\/strong>\u00a0The\u00a0<a href=\"https:\/\/github.com\/dataprofessor\/data\/blob\/master\/delaney.csv\" target=\"_blank\" rel=\"noreferrer noopener\">full version<\/a>\u00a0is available on the\u00a0<a href=\"https:\/\/github.com\/dataprofessor\" target=\"_blank\" rel=\"noreferrer noopener\">Data Professor GitHub<\/a>.<\/figcaption>\n<\/figure>\n<\/div>\n<!-- \/wp:image -->\n\n<!-- wp:paragraph -->\n<p><strong>CODE PRACTICE<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Let\u2019s get started, shall we?<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Fire up Google Colab or your Jupyter Notebook and run the following code cells.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:image {\"align\":\"center\",\"id\":9593,\"sizeSlug\":\"large\"} -->\n<div class=\"wp-block-image\">\u00a0<\/div>\n<!-- \/wp:image -->\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5812982 elementor-widget 
elementor-widget-text-editor\" data-id=\"5812982\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><strong>CODE EXPLANATION<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Let\u2019s now go over what each code cell means.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The\u00a0<strong>first code cell<\/strong>:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul>\n<li>As the code literally says, we are going to import the\u00a0<code>pandas<\/code>\u00a0library as\u00a0<code>pd<\/code>.<\/li>\n<\/ul>\n<!-- \/wp:list -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e16692a elementor-widget elementor-widget-text-editor\" data-id=\"e16692a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>The\u00a0<strong>second code cell<\/strong>:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul>\n<li>Assigns the URL where the Delaney solubility dataset resides to the\u00a0<code>delaney_url<\/code>\u00a0variable.<\/li>\n<li>Reads in the Delaney solubility dataset via the\u00a0<code>pd.read_csv()<\/code>\u00a0function and assigns the resulting dataframe to the\u00a0<code>delaney_df<\/code>\u00a0variable.<\/li>\n<li>Evaluates the\u00a0<code>delaney_df<\/code>\u00a0variable, which makes the notebook display a dataframe containing the following 4 columns:<\/li>\n<\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list {\"ordered\":true} -->\n<ol>\n<li><strong>Compound ID\u00a0<\/strong>\u2014 Names of the compounds.<\/li>\n<li><strong>measured log(solubility:mol\/L)\u00a0<\/strong>\u2014 The experimental aqueous solubility values as reported in the original research article by Delaney.<\/li>\n<li><strong>ESOL predicted 
log(solubility:mol\/L)\u00a0<\/strong>\u2014 Predicted aqueous solubility values as reported in the original research article by Delaney.<\/li>\n<li><strong>SMILES<\/strong>\u00a0\u2014 A 1-dimensional encoding of the chemical structure information.<\/li>\n<\/ol>\n<!-- \/wp:list -->\n\n<!-- wp:separator --><hr class=\"wp-block-separator\" \/><!-- \/wp:separator -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b73e421 elementor-widget elementor-widget-heading\" data-id=\"b73e421\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">2.3. Calculating the Molecular Descriptors<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7d1b0a5 elementor-widget elementor-widget-text-editor\" data-id=\"7d1b0a5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>A point to note is that the above dataset, as originally provided by the authors, is not yet usable out of the box. 
Particularly, we will have to use the\u00a0<strong><em>SMILES notation<\/em><\/strong>\u00a0to calculate the\u00a0<strong><em>molecular descriptors\u00a0<\/em><\/strong>via the\u00a0<em>rdkit<\/em>\u00a0Python library as demonstrated in a step-by-step manner in a previous Medium article (<a href=\"https:\/\/towardsdatascience.com\/how-to-use-machine-learning-for-drug-discovery-1ccb5fdf81ad\" target=\"_blank\" rel=\"noreferrer noopener\"><em>How to Use Machine Learning for Drug Discovery<\/em><\/a>).<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>It should be noted that the\u00a0<strong><em>SMILES notation<\/em><\/strong>\u00a0is a one-dimensional depiction of the chemical structure information of the molecules.\u00a0<strong><em>Molecular descriptors<\/em><\/strong>\u00a0are quantitative or qualitative descriptions of the unique physicochemical properties of molecules.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Let\u2019s think of molecular descriptors as a way to uniquely represent the molecules in a numerical form that machine learning algorithms can learn from in order to make predictions and provide useful knowledge on the\u00a0<strong><em>structure-activity relationship<\/em><\/strong>. As previously noted, the specific arrangement and connectivity of atoms produce different chemical structures that consequently dictate the resulting activity that they will produce. 
This notion is known as the structure-activity relationship.<\/p>\n<!-- \/wp:paragraph -->\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d38cc9c elementor-widget elementor-widget-text-editor\" data-id=\"d38cc9c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>The processed version of the dataset containing the calculated molecular descriptors along with their corresponding response variable (logS) is shown below. This processed dataset is now ready to be used for machine learning model building, whereby the first 4 variables can be used as the\u00a0<strong>X<\/strong>\u00a0variables and the logS variable can be used as the\u00a0<strong>Y<\/strong>\u00a0variable.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:image {\"align\":\"center\",\"id\":9594,\"sizeSlug\":\"large\"} -->\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><br \/>\n<figcaption><strong>Preview of the processed version of the Delaney solubility dataset.<\/strong>\u00a0Essentially, the SMILES notation from the raw version was used as input to compute the 4 molecular descriptors as described in detail in a previous\u00a0<a href=\"https:\/\/towardsdatascience.com\/how-to-use-machine-learning-for-drug-discovery-1ccb5fdf81ad\" target=\"_blank\" rel=\"noreferrer noopener\">Medium article<\/a>\u00a0and\u00a0<a href=\"https:\/\/www.youtube.com\/watch?v=VXFFHHoE1wk\" target=\"_blank\" rel=\"noreferrer noopener\">YouTube video<\/a>. 
The\u00a0<a href=\"https:\/\/github.com\/dataprofessor\/data\/blob\/master\/delaney.csv\" target=\"_blank\" rel=\"noreferrer noopener\">full version<\/a>\u00a0is available on the\u00a0<a href=\"https:\/\/github.com\/dataprofessor\" target=\"_blank\" rel=\"noreferrer noopener\">Data Professor GitHub<\/a>.<\/figcaption>\n<\/figure>\n<\/div>\n<!-- \/wp:image -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-328c7cf elementor-widget elementor-widget-text-editor\" data-id=\"328c7cf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>A quick description of the 4 molecular descriptors and response variable is provided below:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list {\"ordered\":true} -->\n<ol>\n<li><strong>cLogP\u00a0<\/strong>\u2014 Octanol-water partition coefficient<\/li>\n<li><strong>MW\u00a0<\/strong>\u2014 Molecular weight<\/li>\n<li><strong>RB\u00a0<\/strong>\u2014Number of rotatable bonds<\/li>\n<li><strong>AP<\/strong><em>\u2014<\/em>Aromatic proportion = number of aromatic atoms \/ total number of heavy atoms<\/li>\n<li><strong>LogS<\/strong>\u00a0\u2014 Log of the aqueous solubility<\/li>\n<\/ol>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fe21303 elementor-widget elementor-widget-text-editor\" data-id=\"fe21303\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p><strong>CODE PRACTICE<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Let\u2019s continue by reading in the CSV file that contains the calculated molecular descriptors.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:image {\"align\":\"center\",\"id\":9595,\"sizeSlug\":\"large\"} -->\n<div 
class=\"wp-block-image\">\u00a0<\/div>\n<!-- \/wp:image -->\n\n<!-- wp:paragraph -->\n<p><strong>CODE EXPLANATION<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-62fc9d5 elementor-widget elementor-widget-text-editor\" data-id=\"62fc9d5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Let\u2019s now go over what the code cell does.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul>\n<li>Assigns the URL where the Delaney solubility dataset (with calculated descriptors) resides to the\u00a0<code>delaney_url<\/code>\u00a0variable.<\/li>\n<li>Reads in the Delaney solubility dataset (with calculated descriptors) via the\u00a0<code>pd.read_csv()<\/code>\u00a0function and assigns the resulting dataframe to the\u00a0<code>delaney_descriptors_df<\/code>\u00a0variable.<\/li>\n<li>Evaluates the\u00a0<code>delaney_descriptors_df<\/code>\u00a0variable, which makes the notebook display a dataframe containing the following 5 columns:<\/li>\n<\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list {\"ordered\":true} -->\n<ol>\n<li>MolLogP<\/li>\n<li>MolWt<\/li>\n<li>NumRotatableBonds<\/li>\n<li>AromaticProportion<\/li>\n<li>logS<\/li>\n<\/ol>\n<!-- \/wp:list -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6c4177e elementor-widget elementor-widget-text-editor\" data-id=\"6c4177e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>The first 4 columns are molecular descriptors computed using the\u00a0<code>rdkit<\/code>\u00a0Python library. 
The fifth column is the response variable\u00a0<em>logS<\/em>.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:separator --><hr class=\"wp-block-separator\" \/><!-- \/wp:separator -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-020b9e1 elementor-widget elementor-widget-heading\" data-id=\"020b9e1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">3. Data Preparation<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-dc4b7d6 elementor-widget elementor-widget-heading\" data-id=\"dc4b7d6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">3.1. Separating the data into X and Y variables<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5621e91 elementor-widget elementor-widget-text-editor\" data-id=\"5621e91\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>In building a machine learning model using the\u00a0<code>scikit-learn<\/code>\u00a0library, we need to separate the dataset into the input features (the\u00a0<strong>X<\/strong>\u00a0variables) and the target response variable (the\u00a0<strong>Y<\/strong>\u00a0variable).<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>CODE PRACTICE<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Follow along and implement the following 2 code cells to separate the dataset contained in the\u00a0<code>delaney_descriptors_df<\/code>\u00a0dataframe 
to\u00a0<strong>X<\/strong>\u00a0and\u00a0<strong>Y<\/strong>\u00a0subsets.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:image {\"align\":\"center\",\"id\":9596,\"sizeSlug\":\"large\"} -->\n<div class=\"wp-block-image\">\u00a0<\/div>\n<!-- \/wp:image -->\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-54fcbfb elementor-widget elementor-widget-text-editor\" data-id=\"54fcbfb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p><strong>CODE EXPLANATION<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Let\u2019s take a look at the 2 code cells.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong><em>First code cell:<\/em><\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul>\n<li>Here we are using the drop() function to specifically \u2018drop\u2019 the logS variable (which is the\u00a0<strong>Y<\/strong>\u00a0variable and we will be dealing with it in the next code cell). As a result, we will have 4 remaining variables which are assigned to the\u00a0<strong>X<\/strong>\u00a0dataframe. 
Particularly, we apply the\u00a0<code>drop()<\/code>\u00a0method to the\u00a0<code>delaney_descriptors_df<\/code>\u00a0dataframe as in\u00a0<code>delaney_descriptors_df.drop('logS', axis=1)<\/code>\u00a0where the first input argument is the specific column that we want to drop and the second input argument of\u00a0<code>axis=1<\/code>\u00a0specifies that the first input argument is a column.<\/li>\n<\/ul>\n<!-- \/wp:list -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1492be2 elementor-widget elementor-widget-text-editor\" data-id=\"1492be2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p><strong><em>Second code cell:<\/em><\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul>\n<li>Here we select a single column (the \u2018logS\u2019 column) from the\u00a0<code>delaney_descriptors_df<\/code>\u00a0dataframe via\u00a0<code>delaney_descriptors_df.logS<\/code>\u00a0and assign it to the\u00a0<strong>Y<\/strong>\u00a0variable.<\/li>\n<\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:separator --><hr class=\"wp-block-separator\" \/><!-- \/wp:separator -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8a7e199 elementor-widget elementor-widget-heading\" data-id=\"8a7e199\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">3.2. 
Data splitting<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1daf810 elementor-widget elementor-widget-text-editor\" data-id=\"1daf810\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>In evaluating the model performance, the standard practice is to split the dataset into 2 (or more) partitions. Here we will be <a href=\"https:\/\/www.experfy.com\/blog\/using-regression-with-correlated-data\/\" target=\"_blank\" rel=\"noreferrer noopener\">using <\/a>the 80\/20 split ratio, whereby the 80% subset will be used as the train set and the 20% subset as the test set. As scikit-learn requires that the data be further separated into its\u00a0<strong>X<\/strong>\u00a0and\u00a0<strong>Y<\/strong>\u00a0components, the\u00a0<code>train_test_split()<\/code>\u00a0function can readily perform the above-mentioned task.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>CODE PRACTICE<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Let\u2019s implement the following 2 code cells.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:image {\"align\":\"center\",\"id\":9597,\"sizeSlug\":\"large\"} -->\n<div class=\"wp-block-image\">\u00a0<\/div>\n<!-- \/wp:image -->\n\n<!-- wp:paragraph -->\n<p><strong>CODE EXPLANATION<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Let\u2019s take a look at what the code is doing.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-730eadc elementor-widget elementor-widget-text-editor\" data-id=\"730eadc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><strong><em>First code cell:<\/em><\/strong><\/p>\n<!-- 
\/wp:paragraph -->\n\n<!-- wp:list -->\n<ul>\n<li>Here we import the\u00a0<code>train_test_split<\/code>\u00a0function from the scikit-learn library.<\/li>\n<\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:paragraph -->\n<p><strong><em>Second code cell:<\/em><\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul>\n<li>We start by defining the names of the 4 variables that the\u00a0<code>train_test_split()<\/code>\u00a0function will generate:\u00a0<code>X_train<\/code>,\u00a0<code>X_test<\/code>,\u00a0<code>Y_train<\/code>\u00a0and\u00a0<code>Y_test<\/code>. The first 2 correspond to the X dataframes for the train and test sets while the last 2 correspond to the Y variables for the train and test sets.<\/li>\n<\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:separator --><hr class=\"wp-block-separator\" \/><!-- \/wp:separator -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b26d39b elementor-widget elementor-widget-heading\" data-id=\"b26d39b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">4. Linear Regression Model<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c393b40 elementor-widget elementor-widget-text-editor\" data-id=\"c393b40\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p>Now comes the fun part: let\u2019s build a regression model.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 id=\"3c5b\">4.1. 
Training a linear regression model<\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p><strong>CODE PRACTICE<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Here, we will be using the\u00a0<code>LinearRegression()<\/code>\u00a0function from scikit-learn to build a model using ordinary least squares linear regression.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:image {\"align\":\"center\",\"id\":9598,\"sizeSlug\":\"large\"} -->\n<div class=\"wp-block-image\">\u00a0<\/div>\n<!-- \/wp:image -->\n\n<!-- wp:paragraph -->\n<p><strong>CODE EXPLANATION<\/strong><\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1600f63 elementor-widget elementor-widget-text-editor\" data-id=\"1600f63\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Let\u2019s see what the code is doing.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>First code cell:<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul>\n<li>Here we import the\u00a0<code>linear_model<\/code>\u00a0module from the scikit-learn library.<\/li>\n<\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:paragraph -->\n<p><strong>Second code cell:<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul>\n<li>We assign a\u00a0<code>linear_model.LinearRegression()<\/code>\u00a0instance to the\u00a0<code>model<\/code>\u00a0variable.<\/li>\n<li>A model is built using the command\u00a0<code>model.fit(X_train, Y_train)<\/code>\u00a0whereby the\u00a0<code>model.fit()<\/code>\u00a0function will take\u00a0<code>X_train<\/code>\u00a0and\u00a0<code>Y_train<\/code>\u00a0as input arguments to build or train a model. 
Particularly, the\u00a0<code>X_train<\/code>\u00a0contains the input features while the\u00a0<code>Y_train<\/code>\u00a0contains the response variable (logS).<\/li>\n<\/ul>\n<!-- \/wp:list -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-df46c3e elementor-widget elementor-widget-heading\" data-id=\"df46c3e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">4.2. Apply trained model to predict logS from the training and test set<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8b869a8 elementor-widget elementor-widget-text-editor\" data-id=\"8b869a8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>As mentioned above,\u00a0<code>model.fit()<\/code>\u00a0trains the model and the resulting trained model is saved into the\u00a0<code>model<\/code>\u00a0variable.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>CODE PRACTICE<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>We will now apply the trained model to make predictions on the training set (<code>X_train<\/code>).<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:image {\"align\":\"center\",\"id\":9599,\"sizeSlug\":\"large\"} -->\n<div class=\"wp-block-image\">\u00a0<\/div>\n<!-- \/wp:image -->\n\n<!-- wp:paragraph -->\n<p>We will now apply the trained model to make predictions on the test set (<code>X_test<\/code>).<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:image {\"align\":\"center\",\"id\":9600,\"sizeSlug\":\"large\"} -->\n<div class=\"wp-block-image\">\u00a0<\/div>\n<!-- \/wp:image -->\n\n<!-- wp:paragraph -->\n<p><strong>CODE EXPLANATION<\/strong><\/p>\n<!-- \/wp:paragraph 
-->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5868c41 elementor-widget elementor-widget-text-editor\" data-id=\"5868c41\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Let\u2019s proceed to the explanation.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The following explanation covers only the training set (<code>X_train<\/code>), as the same concept applies to the test set (<code>X_test<\/code>) after the following simple tweaks:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul>\n<li>Replace\u00a0<code>X_train<\/code>\u00a0with\u00a0<code>X_test<\/code><\/li>\n<li>Replace\u00a0<code>Y_train<\/code>\u00a0with\u00a0<code>Y_test<\/code><\/li>\n<li>Replace\u00a0<code>Y_pred_train<\/code>\u00a0with\u00a0<code>Y_pred_test<\/code><\/li>\n<\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:paragraph -->\n<p>Everything else is exactly the same.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong><em>First code cell:<\/em><\/strong><\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-edfbcc7 elementor-widget elementor-widget-text-editor\" data-id=\"edfbcc7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n<li>Predictions of the logS values are made by calling\u00a0<code>model.predict()<\/code>\u00a0with\u00a0<code>X_train<\/code>\u00a0as the input argument, that is, by running the command\u00a0<code>model.predict(X_train)<\/code>. 
The resulting predicted values will be assigned to the\u00a0<code>Y_pred_train<\/code>\u00a0variable.<\/li>\n<\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:paragraph -->\n<p><strong><em>Second code cell:<\/em><\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Model performance metrics are now printed.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul>\n<li>Regression coefficient values are obtained from\u00a0<code>model.coef_<\/code>,<\/li>\n<li>The y-intercept value is obtained from\u00a0<code>model.intercept_<\/code>,<\/li>\n<li>The mean squared error (MSE) is computed using the\u00a0<code>mean_squared_error()<\/code>\u00a0function using\u00a0<code>Y_train<\/code>\u00a0and\u00a0<code>Y_pred_train<\/code>\u00a0as input arguments such that we run\u00a0<code>mean_squared_error(Y_train, Y_pred_train)<\/code><\/li>\n<li>The coefficient of determination (also known as R\u00b2) is computed using the\u00a0<code>r2_score()<\/code>\u00a0function using\u00a0<code>Y_train<\/code>\u00a0and\u00a0<code>Y_pred_train<\/code>\u00a0as input arguments such that we run\u00a0<code>r2_score(Y_train, Y_pred_train)<\/code><\/li>\n<\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:separator --><hr class=\"wp-block-separator\" \/><!-- \/wp:separator -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-63dee55 elementor-widget elementor-widget-heading\" data-id=\"63dee55\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">4.3. 
Printing out the Regression Equation<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-898737a elementor-widget elementor-widget-text-editor\" data-id=\"898737a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<!-- wp:paragraph -->\n<p>The equation of a linear regression model is the regression model itself: you can plug in the input feature values and the equation will return the target response value (LogS).<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>CODE PRACTICE<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Let\u2019s now print out the regression model equation.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:image {\"align\":\"center\",\"id\":9601,\"sizeSlug\":\"large\"} -->\n<div class=\"wp-block-image\">\u00a0<\/div>\n<!-- \/wp:image -->\n\n<!-- wp:paragraph -->\n<p><strong>CODE EXPLANATION<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong><em>First code cell:<\/em><\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul>\n<li>All the components of the regression model equation are derived from the\u00a0<code>model<\/code>\u00a0variable. 
The y-intercept and the regression coefficients for LogP, MW, RB and AP are provided in\u00a0<code>model.intercept_<\/code>,\u00a0<code>model.coef_[0]<\/code>,\u00a0<code>model.coef_[1]<\/code>,\u00a0<code>model.coef_[2]<\/code>\u00a0and\u00a0<code>model.coef_[3]<\/code>.<\/li>\n<\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:paragraph -->\n<p><strong><em>Second code cell:<\/em><\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul>\n<li>Here we put together the components and print out the equation via the\u00a0<code>print()<\/code>\u00a0function.<\/li>\n<\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:separator --><hr class=\"wp-block-separator\" \/><!-- \/wp:separator -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-390c4ba elementor-widget elementor-widget-heading\" data-id=\"390c4ba\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">5. Scatter Plot of experimental vs. predicted LogS<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a1173a2 elementor-widget elementor-widget-text-editor\" data-id=\"a1173a2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>We will now visualize the relative distribution of the experimental versus predicted LogS by means of a scatter plot. 
Such a plot will allow us to quickly see the model performance.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>CODE PRACTICE<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>In the forthcoming examples, I will show you how to lay out the 2 sub-plots in two different ways: (1) as a vertical plot and (2) as a horizontal plot.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:image {\"align\":\"center\",\"id\":9602,\"sizeSlug\":\"large\"} -->\n<div class=\"wp-block-image\">\u00a0<\/div>\n<!-- \/wp:image -->\n\n<!-- wp:paragraph -->\n<p><strong>CODE EXPLANATION<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Let\u2019s now take a look at the underlying code for implementing the vertical and horizontal plots. Here, I provide 2 options so that you can choose between a vertical and a horizontal layout for this multi-plot figure.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong><em>Import libraries<\/em><\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Both start by importing the necessary libraries, namely\u00a0<code>matplotlib<\/code>\u00a0and\u00a0<code>numpy<\/code>. Most of the code uses\u00a0<code>matplotlib<\/code>\u00a0to create the plot, while the\u00a0<code>numpy<\/code>\u00a0library is used here to add a trend line.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong><em>Define figure size<\/em><\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Next, we specify the figure dimensions (the width and height of the figure) via\u00a0<code>plt.figure(figsize=(5,11))<\/code>\u00a0for the vertical plot and\u00a0<code>plt.figure(figsize=(11,5))<\/code>\u00a0for the horizontal plot. 
Particularly, (5,11) tells matplotlib that the figure for the vertical plot should be 5 inches wide and 11 inches tall while the inverse is used for the horizontal plot.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong><em>Define placeholders for the sub-plots<\/em><\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>We will tell matplotlib that we want to have 2 rows and 1 column and thus its layout will be that of a vertical plot. This is specified by\u00a0<code>plt.subplot(2, 1, 1)<\/code>\u00a0where the input arguments\u00a0<code>2, 1, 1<\/code>\u00a0refer to 2 rows, 1 column and the index of the sub-plot that we are creating. In other words, think of the\u00a0<code>plt.subplot()<\/code>\u00a0function as a way of structuring the plot by creating placeholders for the various sub-plots that the figure contains. The second sub-plot of the vertical plot is specified by the value of 2 in the third input argument of the\u00a0<code>plt.subplot()<\/code>\u00a0function as in\u00a0<code>plt.subplot(2, 1, 2)<\/code>.<\/p>\n<!-- \/wp:paragraph -->\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f93e40c elementor-widget elementor-widget-text-editor\" data-id=\"f93e40c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>By applying the same concept, the structure of the horizontal plot is created to have 1 row and 2 columns via\u00a0<code>plt.subplot(1, 2, 1)<\/code>\u00a0and\u00a0<code>plt.subplot(1, 2, 2)<\/code>, which house the 2 sub-plots.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong><em>Creating the scatter plot<\/em><\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Now that the general structure of the figure is in place, let\u2019s now add the data visualizations. 
The data points are added using the\u00a0<code>plt.scatter()<\/code>\u00a0function as in\u00a0<code>plt.scatter(x=Y_train, y=Y_pred_train, c=\"#7CAE00\", alpha=0.3)<\/code>\u00a0where\u00a0<code>x<\/code>\u00a0refers to the data column to use for the\u00a0<em>x<\/em>\u00a0axis,\u00a0<code>y<\/code>\u00a0refers to the data column to use for the\u00a0<em>y<\/em>\u00a0axis,\u00a0<code>c<\/code>\u00a0refers to the color to use for the scattered data points and\u00a0<code>alpha<\/code>\u00a0refers to the alpha transparency level (how translucent the scattered data points should be; the lower the number, the more transparent they become).<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong><em>Adding the trend line<\/em><\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Next, we use the\u00a0<code>np.polyfit()<\/code>\u00a0and\u00a0<code>np.poly1d()<\/code>\u00a0functions from\u00a0<code>numpy<\/code>\u00a0together with the\u00a0<code>plt.plot()<\/code>\u00a0function from\u00a0<code>matplotlib<\/code>\u00a0to create the trend line.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:preformatted -->\n<pre class=\"wp-block-preformatted\"><em># Add trendline<\/em><br \/><em># https:\/\/stackoverflow.com\/questions\/26447191\/how-to-add-trendline-in-python-matplotlib-dot-scatter-graphs<\/em><br \/>z = np.polyfit(Y_train, Y_pred_train, 1)<br \/>p = np.poly1d(z)<br \/>plt.plot(Y_test,p(Y_test),\"#F8766D\")<\/pre>\n<!-- \/wp:preformatted -->\n\n<!-- wp:paragraph -->\n<p><strong><em>Adding the x and y axes labels<\/em><\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>To add labels for the\u00a0<em>x<\/em>\u00a0and\u00a0<em>y<\/em>\u00a0axes, we use the\u00a0<code>plt.xlabel()<\/code>\u00a0and\u00a0<code>plt.ylabel()<\/code>\u00a0functions. Note that for the vertical plot, we omit the x-axis label for the top sub-plot (<em>Why? 
Because it is redundant with the x-axis label for the bottom sub-plot<\/em>).<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong><em>Saving the figure<\/em><\/strong><\/p>\n<!-- \/wp:paragraph -->\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c568cf3 elementor-widget elementor-widget-text-editor\" data-id=\"c568cf3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Finally, we are going to save the constructed figure to file and we can do that using the\u00a0<code>plt.savefig()<\/code>\u00a0function from\u00a0<code>matplotlib<\/code>\u00a0and specifying the file name as the input argument. Lastly, finish off with\u00a0<code>plt.show()<\/code>.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:preformatted -->\n<pre class=\"wp-block-preformatted\">plt.savefig('plot_vertical_logS.png')<br \/>plt.savefig('plot_vertical_logS.pdf')<br \/>plt.show()<\/pre>\n<!-- \/wp:preformatted -->\n\n<!-- wp:paragraph -->\n<p><strong>VISUAL EXPLANATION<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The above section provides a text-based explanation and in this section we are going to do the same with this visual explanation that makes use of color highlights to distinguish the different components of the plot.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:image -->\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/2182\/1*TVV9kgSL4WHJV_NaXyQN0g@2x.jpeg\" alt=\"Visual explanation on creating a scatter plot\" \/>\n<figcaption><strong>Visual explanation on creating a scatter plot.<\/strong>\u00a0Here we color highlight the specific lines of code and their corresponding plot component. 
(Drawn by Chanin Nantasenamat)<\/figcaption>\n<\/figure>\n<!-- \/wp:image -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>This article shows how to build a simple regression model in Python. It is a detailed and visual step-by-step walkthrough. <\/p>\n","protected":false},"author":886,"featured_media":9589,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[187],"tags":[700,114,399,699],"ppma_author":[3736],"class_list":["post-9588","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-build-a-regression-model","tag-python","tag-regression-model","tag-regression-model-in-python"],"authors":[{"term_id":3736,"user_id":886,"is_guest":0,"slug":"chanin-nantasenamat","display_name":"Chanin Nantasenamat","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/08\/Chanin-Nantasenamat-150x150.jpg","author_category":"","user_url":"http:\/\/www.mahidol.ac.th\/mueng\/","last_name":"Nantasenamat","first_name":"Chanin","job_title":"","description":"Chanin Nantasenamat is Associate Professor and Head, Center of Data Mining and Biomedical Informatics at Mahidol University, Thailand. He is also Founder of Data Professor YouTube Channel and Associate Editor at Frontiers in Pharmacology. 
Thought Leader on AI and ML Education, he was a Visiting Professor at Uppsala University, Lund University, University of California at Los Angeles as well as the California State University at Fullerton."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/9588","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/886"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=9588"}],"version-history":[{"count":0,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/9588\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/9589"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=9588"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=9588"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=9588"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=9588"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}