{"id":9183,"date":"2020-07-31T08:31:19","date_gmt":"2020-07-31T08:31:19","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/?p=9183"},"modified":"2023-11-24T10:32:33","modified_gmt":"2023-11-24T10:32:33","slug":"the-most-used-probability-distributions-in-data-science","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/the-most-used-probability-distributions-in-data-science\/","title":{"rendered":"The most used probability distributions in Data Science"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"9183\" class=\"elementor elementor-9183\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-4e219955 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-eae-slider=\"66444\" data-id=\"4e219955\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-729e0ef9\" data-eae-slider=\"57562\" data-id=\"729e0ef9\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-1c4a36c elementor-widget elementor-widget-heading\" data-id=\"1c4a36c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">\n<h4 class=\"wp-block-heading\" id=\"e268\"><em>Statistics and probability are the best arrows in a Data Scientist\u2019s quiver. Let\u2019s see some probability distribution often used<\/em><\/h4>\n<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ed59034 elementor-widget elementor-widget-text-editor\" data-id=\"ed59034\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p class=\"wp-block-paragraph\">Statistics is a must-have skill in a Data Scientist\u2019s CV, so there are concepts and topics that must be known in advance if somebody wants to work with data and machine learning models.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Probability distributions are a must-have tool. Let\u2019s see the most important ones to know for a Data Scientist.<\/p>\n\n\n<hr class=\"wp-block-separator\" \/>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5d9ca35 elementor-widget elementor-widget-heading\" data-id=\"5d9ca35\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"1ab3\">Uniform distribution<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-f1017bf elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-eae-slider=\"99301\" data-id=\"f1017bf\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-bac0cad\" data-eae-slider=\"27457\" data-id=\"bac0cad\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-437c066 elementor-widget elementor-widget-text-editor\" data-id=\"437c066\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p class=\"wp-block-paragraph\">The simplest probability distribution is the uniform distribution, which gives the same probability to any points of a set.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In its continuous form, a uniform distribution between\u00a0<em>a<\/em>\u00a0and\u00a0<em>b\u00a0<\/em>has this density function:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-180f7f5 elementor-widget elementor-widget-image\" data-id=\"180f7f5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/132\/0*0Vl3Ms0hq7tSsm8t\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-daf0ab2 elementor-widget elementor-widget-text-editor\" data-id=\"daf0ab2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n\n<p class=\"wp-block-paragraph\">And here\u2019s how it appears.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-34bf52f elementor-widget elementor-widget-image\" data-id=\"34bf52f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/601\/1*zC57BBMvx3lkj3A46ktnYA.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d2a8263 elementor-widget elementor-widget-text-editor\" data-id=\"d2a8263\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p class=\"wp-block-paragraph\">As you can see, the wider the range, the lower the distribution. This is necessary to keep the area equal to one.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The uniform distribution is everywhere. Any Monte Carlo simulation starts by generating uniformly distributed pseudo-random numbers. The uniform distribution is used, for example, in bootstrapping techniques for calculating confidence intervals. It\u2019s a very simple distribution and you\u2019ll find it very often.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-32ab97a elementor-widget elementor-widget-heading\" data-id=\"32ab97a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"c7f4\">Gaussian distribution<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5f9e157 elementor-widget elementor-widget-text-editor\" data-id=\"5f9e157\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut elit tellus, luctus nec ullamcorper mattis, pulvinar dapibus leo.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9db0f16 elementor-widget elementor-widget-text-editor\" data-id=\"9db0f16\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p class=\"wp-block-paragraph\">For simplicity, let\u2019s assume there are three customers (c1, c2, c3) in this batch, and one vehicle (v1) information is provided as a sale.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>P(C=c1) represents the likelihood of c1 to buy any car. Assuming no prior knowledge about each customer, their likelihood of buying any car should be the same: P(C=c1) = P(C=c2) = P(C=c3), which equals a constant (e.g. 1\/3 in this situation)<\/li><li>P(V=v1) is the likelihood for v1 to be sold, given it is shown in this batch, this should be 1 (100% likelihood to be sold)<\/li><\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Since there is only one customer making the purchase, this probability can be extended into:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">P(V=v1) = P(C=c1, V=v1) + P(C=c2, V=v1) + P(C=c3, V=v1) = 1.0<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For each of the item, given the following formula<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">P(C=c1, V=v1) = P(C=c1|V=v1) * P(V=v1) = P(V=v1|C=c1) * P(C=c1)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We can see P(C=c1|V=v1) is proportional to P(V=v1|C=c1). So now, we can get the formula for the probability calculation:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">P(C=c1|V=v1) = P(V=v1|C=c1) \/ (P(V=v1|C=c1) + P(V=v1|C=c2) + P(V=v1|C=c3))<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">and the key is to get the probability for each P(V|C). Such a formula can be verbally explained as: the likelihood for a vehicle to be purchased by a specific customer is proportional to the likelihood for the customer to buy this specific vehicle.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The above formula may look too \u201cmathematical\u201d, so let me put it into an intuitive context: assuming three people were in a room, one is a musician, one is an athlete, and one is a data scientist. You were told there is a violin in this room belong to one of them. Now guess, whom do you think is the owner of the violin? This is pretty straightforward, right? given the likelihood of musician to own a violin is high, and the likelihood of athlete and data scientists to own a violin is lower, it is much more likely for the violin to belong to the musician. The \u201cmathematical\u201d thinking process is illustrated below.<\/p>\n\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d677ec7 elementor-widget elementor-widget-image\" data-id=\"d677ec7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/217\/0*iH_gSLFzqooGArvX\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5c34256 elementor-widget elementor-widget-text-editor\" data-id=\"5c34256\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n\n<p class=\"wp-block-paragraph\">\u03bc and \u03c3 are the mean and the standard deviation of the distribution. The mean is the peak position, while the standard deviation is related to the width. The higher the standard deviation, the larger the distribution (and the lower the peak, to keep area equal to 1).<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-cfbb226 elementor-widget elementor-widget-image\" data-id=\"cfbb226\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/601\/1*NO0MvW1WK6f46-HRRH2pNg.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-15af89c elementor-widget elementor-widget-text-editor\" data-id=\"15af89c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p class=\"wp-block-paragraph\">A particular case of gaussian distribution is the normal distribution, whose mean is equal to 0 and the standard deviation is equal to 1.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Every regression model using the least-squares approach assumes that the residuals are normally distributed, so it\u2019s important to know if our data is normally distributed. The\u00a0<em>F<\/em>\u00a0test for comparing variances of the residuals of a model in two datasets assumes that the residuals are normally distributed. If you work with data science, you and the Gaussian distribution will become best friends soon.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-cafe52d elementor-widget elementor-widget-heading\" data-id=\"cafe52d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"6ed8\">Student\u2019s t distribution<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7b8b1a0 elementor-widget elementor-widget-text-editor\" data-id=\"7b8b1a0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n\n<p class=\"wp-block-paragraph\">A very particular distribution is Student\u2019s t distribution, whose probability density function is:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6180d92 elementor-widget elementor-widget-image\" data-id=\"6180d92\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/368\/0*11JKwSzMRq1c6C7O\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0556027 elementor-widget elementor-widget-text-editor\" data-id=\"0556027\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p class=\"wp-block-paragraph\">The free parameter\u00a0<em>n<\/em>\u00a0is called \u201cdegrees of freedom\u201d (often abbreviated in d.o.f.). It\u2019s easy to show that t distribution gets closer to a normal distribution for high values of\u00a0<em>n<\/em>.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b4c4577 elementor-widget elementor-widget-image\" data-id=\"b4c4577\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/610\/1*fGY_7WXf3CrhUFJO3i1d3g.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9cb4cd2 elementor-widget elementor-widget-text-editor\" data-id=\"9cb4cd2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p class=\"wp-block-paragraph\">If you have to compare mean values of different samples or the mean value of a sample with the mean value of a population and if the samples are taken from a Gaussian distribution, you\u2019ll often use Student\u2019s t-test, which is a hypothesis test performed using Student\u2019s t distribution. It\u2019s also possible to use Student&#8217;s t-test to check the linear correlation between two variables. So it\u2019s very likely that you\u2019ll need this distribution someday in the future in your Data Scientist career.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-396f29c elementor-widget elementor-widget-heading\" data-id=\"396f29c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"bf55\">Chi-squared distribution<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a9741c1 elementor-widget elementor-widget-text-editor\" data-id=\"a9741c1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>If you take some normally distributed and independent random variables, take their square value and sum them, you get a chi-square variable, whose probability density function is:<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6c72fbf elementor-widget elementor-widget-image\" data-id=\"6c72fbf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/342\/0*e7UivxywwOp8DfZD\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1b3409f elementor-widget elementor-widget-text-editor\" data-id=\"1b3409f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><em>k<\/em>\u00a0is the number of degrees of freedom. It\u2019s often equal to the number of the squared normal variables you sum, but sometimes it\u2019s reduced if there\u2019s some relationship between these variables (for example, the number of model parameters in a curve-fitting problem).<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Here are some examples of this distribution.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0196254 elementor-widget elementor-widget-image\" data-id=\"0196254\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/601\/1*RHrCiqveRT0FIn9Z1jjwWA.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ab0061e elementor-widget elementor-widget-text-editor\" data-id=\"ab0061e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>You\u2019ll use chi-squared distribution in Pearson\u2019s chi-square test, which compares an experimental histogram with a theoretical distribution under certain assumptions. The chi-squared distribution is related to the\u00a0<em>F<\/em>\u00a0test for the equality of variances because the\u00a0<em>F<\/em>\u00a0distribution is calculated as the probability distribution of the ratio between two chi-square variables.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e3e1ada elementor-widget elementor-widget-heading\" data-id=\"e3e1ada\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"2e18\">Lognormal distribution<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-201bd6d elementor-widget elementor-widget-text-editor\" data-id=\"201bd6d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>If you take a gaussian variable and exponentiate it, you get the lognormal distribution, whose probability density function is:<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6897bc1 elementor-widget elementor-widget-image\" data-id=\"6897bc1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/234\/0*lE7lM9yx77rCq7aB\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-08857d8 elementor-widget elementor-widget-text-editor\" data-id=\"08857d8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>\u03bc and \u03c3 are the same as the original gaussian distribution.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Here are some examples:<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b171a14 elementor-widget elementor-widget-image\" data-id=\"b171a14\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/601\/1*cvzRSWt5JeV94YN1UwwoRQ.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7adad90 elementor-widget elementor-widget-text-editor\" data-id=\"7adad90\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>The lognormal distribution is widely measured in nature. Blood pressure follows a lognormal distribution, city sizes, and so on. A very interesting use is in the geometric Brownian motion, which is a model of random walk often used to describe financial markets, particularly in the Black-Scholes equation for options pricing.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Just like Gaussian distribution is widely present in nature for observables in the real domain, lognormal distribution is almost as frequent for observables with positive values. You can notice a strong asymmetry and a long tail, both very frequent phenomena in many situations when you deal with positive variables.<\/p>\n<!-- \/wp:paragraph -->\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9a2e78f elementor-widget elementor-widget-heading\" data-id=\"9a2e78f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"b83a\">Binomial distribution<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-da266a7 elementor-widget elementor-widget-text-editor\" data-id=\"da266a7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>The binomial distribution is often described as the coin toss probability distribution. It measures the probability of two events, one of them occurring with probability\u00a0<em>p<\/em>\u00a0and the other one occurring with probability\u00a0<em>1-p.<\/em><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>If our events are 0 and 1 numbers and the 1 event occurs with\u00a0<em>p<\/em>\u00a0probability, we can easily write the density using Dirac\u2019s delta distribution:<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5251ca2 elementor-widget elementor-widget-image\" data-id=\"5251ca2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/371\/0*7UBFf9c3UDqNHBB8\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8c1ab3c elementor-widget elementor-widget-text-editor\" data-id=\"8c1ab3c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>The binomial distribution is the basic tool for any probability course and, in a Data Scientist\u2019s career, can occur many times. Just think about binary classification models, that work with a 0\u20131 variable and its statistics.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8c94d22 elementor-widget elementor-widget-heading\" data-id=\"8c94d22\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"c7c0\">Poisson distribution<\/h1>\n<!-- \/wp:heading --><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f64b839 elementor-widget elementor-widget-text-editor\" data-id=\"f64b839\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Poisson distribution is often described as the distribution of rare events. Generally speaking, if you have an event that occurs with a fixed rate in time (i.e. 3 events per minute, 5 events per hour), the probability of observing a number\u00a0<em>n\u00a0<\/em>of events in the time unit can be described with Poisson distribution, which has this formula:<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-adc4f16 elementor-widget elementor-widget-image\" data-id=\"adc4f16\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/172\/0*MZWwj9FAsWFqY3LJ\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7f8c404 elementor-widget elementor-widget-text-editor\" data-id=\"7f8c404\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>\u03bc is the event rate in a time unit.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Here are some examples:<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d6a923d elementor-widget elementor-widget-image\" data-id=\"d6a923d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/619\/1*7WhBzPbymkQH7beaZOqCIQ.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1fce047 elementor-widget elementor-widget-text-editor\" data-id=\"1fce047\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>As you can see, its shape is similar to a gaussian distribution and its peak is equal to \u03bc.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Poisson distribution is very used in particle physics and, in data science, it can be useful to describe events that have a fixed rate (e.g. a customer that enters into a supermarket in the morning).<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ee45f1c elementor-widget elementor-widget-heading\" data-id=\"ee45f1c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"d5ee\">Exponential distribution<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-dc08359 elementor-widget elementor-widget-text-editor\" data-id=\"dc08359\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>If you have a Poisson event that occurs at a fixed time rate, the time interval between two consecutive occurrences of that event is exponentially distributed.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Exponential distribution has this density function:<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b07e00d elementor-widget elementor-widget-image\" data-id=\"b07e00d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/168\/0*wy30H-r4VTUdV695\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8cb73d9 elementor-widget elementor-widget-text-editor\" data-id=\"8cb73d9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>\u03c4 is the average time interval between two consecutive events.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-76af2a8 elementor-widget elementor-widget-image\" data-id=\"76af2a8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/601\/1*IDWeij5Wy75HKngSEh99sQ.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-30636c8 elementor-widget elementor-widget-text-editor\" data-id=\"30636c8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>The exponential distribution is used in particle physics and, generally speaking, if you want to move from a Poisson process (in which you study the number of events) to something more related to time (e.g. how much time occurs between two consecutive customers entering into a shop).<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f894b5a elementor-widget elementor-widget-heading\" data-id=\"f894b5a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"196f\">Conclusions<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-046dbd9 elementor-widget elementor-widget-text-editor\" data-id=\"046dbd9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Data Science and statistics are two very related topics and it\u2019s not possible to work well with the former if you don\u2019t know the latter. In this article, I\u2019ve shown you some of the most important probability distributions you\u2019ll find during your Data Scientist career. Other distributions (e.g. Gamma distribution) can be found, but it\u2019s likely that in a Data Scientist\u2019s daily job he will work with one of the distributions described here.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Data Science and statistics are two very related topics and it\u2019s not possible to work well with the former if you don\u2019t know the latter. In this article, I\u2019ve shown you some of the most important probability distributions you\u2019ll find during your Data Scientist career.<\/p>\n","protected":false},"author":618,"featured_media":9184,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[187],"tags":[94,511,510],"ppma_author":[3328],"class_list":["post-9183","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science","tag-probability","tag-statistics"],"authors":[{"term_id":3328,"user_id":618,"is_guest":0,"slug":"gianluca-malato","display_name":"Gianluca Malato","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/04\/medium_918623b2-8f36-4110-8343-6fc9228595dd-150x150.jpg","author_category":"","user_url":"http:\/\/www.gianlucamalato.it\/","last_name":"Malato","first_name":"Gianluca","job_title":"","description":"Gianluca Malato is Data Scientist at Poste Italiane SPA.\u00a0 He is also a fiction author and software developer, Editor of\u00a0<a href=\"https:\/\/medium.com\/data-science-journal?source=follow_footer--------------------------follow_footer-\">Data Science Journal<\/a>,\u00a0<a href=\"https:\/\/medium.com\/the-trading-scientist?source=follow_footer--------------------------follow_footer-\">The Trading Scientist<\/a>, and\u00a0<a href=\"https:\/\/medium.com\/the-writers-notebook?source=follow_footer--------------------------follow_footer-\">The Writer\u2019s Notebook<\/a>. His books are available on <a href=\"https:\/\/www.amazon.com\/Gianluca-Malato\/e\/B076CHTG3W?ref=dbs_a_mng_rwt_scns_share\">Amazon<\/a>."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/9183","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/618"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=9183"}],"version-history":[{"count":0,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/9183\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/9184"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=9183"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=9183"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=9183"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=9183"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}