{"id":22698,"date":"2021-03-22T04:10:00","date_gmt":"2021-03-22T04:10:00","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/k-means-clustering-and-variants\/"},"modified":"2023-08-29T11:56:03","modified_gmt":"2023-08-29T11:56:03","slug":"k-means-clustering-and-variants","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/k-means-clustering-and-variants\/","title":{"rendered":"K-means Clustering And Variants"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"22698\" class=\"elementor elementor-22698\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-a23331b elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"a23331b\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-fd9f338\" data-id=\"fd9f338\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-e33f66d elementor-widget elementor-widget-text-editor\" data-id=\"e33f66d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"4580\">The clustering problem is to group a set of data points into clusters. Clusters should be internally tight. Clusters should also be well-separated.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-05d496b elementor-widget elementor-widget-image\" data-id=\"05d496b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img fetchpriority=\"high\" decoding=\"async\" width=\"433\" height=\"237\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1RxeGEQ53poYZuZPpIVP1CA.png\" class=\"attachment-large size-large wp-image-18994\" alt=\"K-means Clustering And Variants\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1RxeGEQ53poYZuZPpIVP1CA.png 433w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1RxeGEQ53poYZuZPpIVP1CA-300x164.png 300w\" sizes=\"(max-width: 433px) 100vw, 433px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-98ba666 elementor-widget elementor-widget-text-editor\" data-id=\"98ba666\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"e6da\">Here is an example. Consider a geospatial map in which each pixel is green or yellow. A green pixel denotes a tree at that location. A yellow pixel denotes its absence. We\u2019d like to cluster the trees into forests.<\/p>\n<p id=\"37a9\">This is a representative example of a&nbsp;large class of clustering problems on geospatial data, at varying scales. 
For example, if we replace "green denoting a tree" with "red denoting a lit location", we might hope to discover clusters of well-lit areas such as towns or neighborhoods.

In this post, we will cover the K-means clustering algorithm and its variants.

The K-means clustering algorithm requires fixing K, the number of clusters, in advance. This may seem like a serious limitation. In our example, how would we know the number of forests in advance?

That said, we can imagine some user interface that lets the user try out different values of K and see what forests get found. This interactive process might yet be useful.

## The Algorithm

The K-means clustering algorithm works as follows. We are given the data set to be clustered, and a value of K. In our example, the data set would be the locations of the green pixels.

The algorithm initializes the K clusters (somehow). It then iteratively adjusts the clusters so long as they keep improving. We can stop the algorithm at any time, or when the clusters stop changing (sufficiently).

How do we measure "clusters improving"? Let's begin by noting that in K-means clustering, clusters are represented by their means. (This explains the word "means" in the algorithm's name.) In our trees example, a cluster's mean is the associated forest's center.

Next, we need the notion of how tight a cluster is. We will define a cluster's width as the average distance of the cluster's members from its mean. In our trees example, this is the average distance of a tree in a forest from the forest's center, a rough proxy for the forest's area.

A global measure of the tightness of the full set of clusters may then be defined as the sum of the widths of the clusters.

Next, let's look at what happens inside an iteration. The algorithm tries to improve the current clustering in two steps. First, it reassigns each data point to the cluster whose mean is closest to it. Next, it recomputes the cluster means from the updated assignments.

The intuition that an iteration (in which an assignment changes) improves the clustering is as follows. (This is not a proof, just intuition!) In the first step, at least one data point got reassigned to a cluster whose center it is nearer to. This tends to reduce the width of one or both affected clusters. The second step, which recomputes the cluster means, often reduces the width further.
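To make the two steps concrete, here is a minimal sketch in Python with NumPy. It is not from any particular library: the function name, the use of Euclidean distance, and the pick-K-random-points initialization are all our own choices.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-means sketch. X is an (n, d) array of data points."""
    rng = np.random.default_rng(seed)
    # Initialize: pick K distinct data points as the initial cluster means.
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 1: reassign each point to the cluster whose mean is closest.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each cluster's mean from the updated assignments.
        new_means = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                              else means[j]   # keep an empty cluster's mean as-is
                              for j in range(k)])
        if np.allclose(new_means, means):     # clusters stopped changing
            break
        means = new_means
    return means, labels
```

In the trees example, X would hold the (row, column) coordinates of the green pixels, and the returned means would be the forest centers.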
### Example

Let's illustrate this with the example below, in which both effects come into play. We have three 1D points to be clustered: 1, 5, and 6. Say the current clusters are {1, 5} and {6}. Their means are 3 and 6 respectively. In the next iteration, 5 moves to the second cluster. This results in the clustering {1} and {5, 6}, having means 1 and 5.5 respectively.

This shift is depicted below.

{1, 5}, {6} → {1}, {5, 6}

Let's see how the cluster widths changed. The original clusters {1, 5} and {6} were 2 and 0 units wide respectively. The new clusters {1} and {5, 6} are 0 and 0.5 units wide respectively. So the summed widths reduced. That is, the overall clustering got tighter.

Finally, for completeness, let's note that after the clusters have been initialized (before the iterations begin), their means need to be computed. This extra step is needed since step 1 of an iteration uses the means.

## K-medians Clustering

Consider the 1D data 1, 100, 102, 200 and say K = 2. What should the two clusters be? Either {{1, 100, 102}, {200}} or {{1}, {100, 102, 200}}. In the first case 1 is an outlier; in the second case, 200 is. The mean of a cluster with an outlier is skewed towards the outlier.

In the presence of outliers, a sensible variant of K-means clustering involves just replacing 'mean' with 'median'. Nothing else changes.

Consider {1, 100, 102}. The mean is 67.67 whereas the median is 100. This is graphically depicted below.

![Using the median is more robust when there are outliers](https://www.experfy.com/blog/wp-content/uploads/2021/05/1hlCy5v7KA6lkfhXsv40uuw.png)

*Using the median is more robust when there are outliers*
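The corresponding change to the code is equally small. A sketch of one K-medians iteration, assuming the L1 distance (a common pairing with the median); the array conventions follow the K-means sketch above.

```python
import numpy as np

def kmedians_step(X, medians):
    """One K-medians iteration: K-means with 'mean' replaced by 'median'."""
    # Step 1: assign each point to the cluster with the nearest median.
    dists = np.abs(X[:, None, :] - medians[None, :, :]).sum(axis=2)  # L1 distance
    labels = dists.argmin(axis=1)
    # Step 2: recompute each cluster's coordinate-wise median.
    new_medians = np.array([np.median(X[labels == j], axis=0) if (labels == j).any()
                            else medians[j]
                            for j in range(len(medians))])
    return new_medians, labels

# The outlier example: the median of {1, 100, 102} ignores the outlier.
print(np.mean([1, 100, 102]), np.median([1, 100, 102]))  # ~67.67 and 100.0
```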
## Soft K-means Clustering: The EM algorithm

K-means clustering is a special case of a powerful statistical algorithm called EM. We will describe EM in the context of K-means clustering, calling it EMC. For contrast, we will denote K-means clustering as KMC.

EMC models a cluster as a probability distribution over the data space. Think of this as specifying a cluster by a soft membership function.

Take our trees example. Imagine that we are okay with elliptical forests (of which circular ones are special cases). We might model such a cluster as a two-dimensional Gaussian whose mean models the cluster's center. Its covariance matrix captures the parameters that describe the shape of the ellipse (the major and minor axes and the tilt).

Before going further, let's note that when we say elliptical clusters, this is just an approximation for visual insight. A cluster doesn't have a hard boundary, much like a mountain doesn't have a hard boundary separating it from the plains. Nonetheless, as with a mountain, the imagined boundary is useful.
As in K-means clustering, the number of clusters is fixed in advance.

The next key concept is the probability model that describes the process of generating clusters. We call such a model a generative one.

The EMC generative model is a mixture whose components are the K individual cluster models. This mixture is parametrized by a distribution, call it *C*, of mixing probabilities over the K clusters.

The generation process works as follows. We repeat the following independently as long as we like.

1. Randomly sample a cluster, call it *c*, from *C*.
2. Randomly sample a data point, call it *x*, from *c*'s distribution.

This results in a set of data points labeled with their cluster ids.
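In code, the generation process might look like this for a mixture of two 2D Gaussians; the mixing probabilities and cluster parameters below are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up mixture with K = 2: mixing probabilities C, and each cluster's Gaussian.
mixing = np.array([0.7, 0.3])                        # the distribution C
means = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
covs = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]

points, labels = [], []
for _ in range(1000):
    c = rng.choice(2, p=mixing)                      # 1. sample a cluster c from C
    x = rng.multivariate_normal(means[c], covs[c])   # 2. sample a point x from c
    points.append(x)
    labels.append(c)
```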
During the clustering process, the cluster ids of the data points are of course not known. The job of EMC is to find them.

In more detail, and a bit more generally, EMC's aim is to learn the parameters of the various cluster models, together with the mixing probabilities, from the data. (We say "more generally" because this way we can also cluster new data that the algorithm hasn't seen yet.)

## The EMC algorithm's steps

Just like K-means clustering, EMC works iteratively to improve the clustering. In fact, each iteration has the same two steps.

The two steps in an EMC iteration are more advanced versions of the corresponding steps in KMC. We'll start by reviewing the KMC versions, then morph them into EMC ones. Along the way, new ideas will emerge.

The first KMC step was to reassign the data points to the nearest clusters. The two words to focus on here are 'reassign' and 'nearest'. Let's look at 'reassign' first. As it turns out, we first have to abstract out a bit, to the word 'assign'.

## Cluster Assignments

In KMC, the cluster assigned to a data point at any given time is unique. In EMC, a data point's membership is modeled as a probability distribution over the various clusters. Thus the datum potentially belongs to multiple clusters, to varying degrees. This is graphically depicted below.

![Cluster Assignments](https://www.experfy.com/blog/wp-content/uploads/2021/05/1NmhGPii3XS7XBXkX_UY74w.png)

The KMC version is a special case in which the distribution's mass is concentrated on one cluster.

How do we compute this probability distribution at any point in time? Actually, we could do it even for KMC. Whereas the EMC version will be more accurate, let's start by describing one from KMC, as it is simpler and we are on familiar ground.

Consider a datum *x*. We can compute the distance of *x* to every cluster. Next, we turn distances into membership scores, which we call cluster affinities.
One way to do this is to define data point *x*'s affinity for a cluster *c* at distance *d* from *x* to be 1/(*d* + *p*). Here *p* is a tiny positive number to avoid divide-by-zero. Next, we sum up these affinities. Finally, we divide each affinity by this sum. This gives us the soft cluster memberships of *x* as a probability distribution.

What we described above is fine for illustrative purposes but probably not effective enough for actual use. Simply put, cluster affinities can get overly diluted. That is, a single data point may belong to way too many clusters. Such diffusion can get in the way of finding a good clustering.

This issue can be mitigated by using softmax instead. This involves surgery on only the step going from distance to affinity. Specifically, the affinity at distance *d* is defined as *e*^(-*d*). Thus affinity drops exponentially with distance. We then normalize as before.

Using the softmax this way produces a more K-means-like behavior. Simply put, softmax is a better approximation to max than the 1/(*d* + *p*) function above.
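Both recipes take only a few lines. A sketch, with p the tiny constant from above and made-up distances for the demo:

```python
import numpy as np

def affinities_inverse(dists, p=1e-9):
    """Soft memberships via 1/(d + p), normalized into a probability distribution."""
    a = 1.0 / (dists + p)
    return a / a.sum()

def affinities_softmax(dists):
    """Soft memberships via affinity e^-d, normalized (a softmax on -d)."""
    a = np.exp(-dists)
    return a / a.sum()

d = np.array([0.5, 2.0, 4.0])   # x's distances to three cluster means
print(affinities_inverse(d))    # ~[0.73, 0.18, 0.09]: mass spread over all clusters
print(affinities_softmax(d))    # ~[0.80, 0.18, 0.02]: the far cluster is nearly shut out
```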
## Bayesian Cluster Assignments

Whereas the softmax approach described above can be effective in practice, a more elaborate approach is the following. Let *c* denote a cluster and *x* a data point. By Bayes rule,

P(c|x) = P(x|c)·P(c) / Σ_c' P(x|c')·P(c')

At any given point in time, we have an estimate of P(c). We can also estimate P(x|c), as cluster *c*'s current probability model is fully specified. If, as in our trees example, a cluster's model is a two-dimensional Gaussian, fully specified just means that the cluster's mean and covariance matrix have specific values.

## Completing The First Step's Description

So we have all the information to compute P(c|x). Note that this approach takes the cluster sizes into account as well. (Think of P(c) as proportional to *c*'s size.) So if two clusters were equally near a data point, the data point would favor the larger cluster.
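In code, this Bayesian assignment might look as follows for Gaussian cluster models. The function name e_step is ours (this is the E-step of EM); scipy supplies the density P(x|c).

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, means, covs, mixing):
    """Compute P(c|x) for every data point x and cluster c via Bayes rule."""
    # likelihood[i, c] = P(x_i | c) under cluster c's current Gaussian model.
    likelihood = np.column_stack(
        [multivariate_normal.pdf(X, mean=m, cov=S) for m, S in zip(means, covs)])
    joint = likelihood * mixing                       # P(x|c) * P(c)
    return joint / joint.sum(axis=1, keepdims=True)   # normalize over clusters c'
```

Each row of the result is one data point's soft membership distribution over the K clusters.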
## The Second Step

The second step in KMC was to recompute the cluster means. In EMC, this generalizes to "re-estimate the model parameters". These include the parameters of the various cluster models as well as the mixing probabilities of the mixture model.

We'll illustrate this step for the case in which the cluster models are two-dimensional Gaussians. This immediately applies to our trees example.

Let's start with re-estimating a cluster's mean. In KMC, this was just the mean of the data points in the cluster. In EMC, a data point can be in multiple clusters to varying degrees, so this needs some generalization.

Fortunately, the generalization is intuitively easy to describe. In EMC, a data point *x* contributes to the mean of cluster *c* with weight proportional to P(c|x). In more detail,

mean(c) = Σ_x P(c|x)·x / Σ_x P(c|x)

Let's illustrate this in the KMC special case. P(c|x) is 1 for *c*'s members and 0 for the rest. So mean(c) = Σ_{x in c} x / Σ_{x in c} 1 = Σ_{x in c} x / |c|, which is exactly what we had.

In summary, in KMC a data point contributes to the mean of its unique cluster. In EMC a data point contributes to the means of potentially multiple clusters, with strengths based on the involved cluster affinities.

The covariance matrix of a cluster may be estimated similarly, albeit with more involved details.

The mixing probabilities are re-estimated as follows.

P(c) = Σ_x P(c|x) / Σ_{x, c'} P(c'|x)

Think of P(c) as the sum of the weighted votes cluster *c* receives, divided by the sum of the weighted votes all clusters receive.
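The corresponding re-estimation step in code, under the same conventions as the e_step sketch; we leave out the covariance update, which, as noted, is more involved.

```python
import numpy as np

def m_step(X, resp):
    """Re-estimate cluster means and mixing probabilities.

    X is (n, d); resp is (n, K), with resp[i, c] = P(c | x_i).
    """
    votes = resp.sum(axis=0)               # sum_x P(c|x): each cluster's total vote
    means = (resp.T @ X) / votes[:, None]  # weighted mean, one row per cluster
    mixing = votes / votes.sum()           # P(c): cluster c's share of all votes
    return means, mixing
```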
### Example: A Mixture Of Two Gaussians

EMC is a complex algorithm. An example will help.

We will operate in a single dimension and set K = 2. We will model each cluster as a Gaussian.

The situation is graphically depicted below.

![K-means Clustering And Variants](https://www.experfy.com/blog/wp-content/uploads/2021/05/1sHsjJF0XcvJ9o9BMnmrRfw.png)

*Cluster 1's points are distributed according to the Gaussian on the left; Cluster 2's by the Gaussian on the right*

For numerical calculations, let's restrict the situation further. Let's consider the same 3 data points 1, 5, and 6, and the same initial clusters {1, 5} and {6}. In EMC, we also need to initialize the model parameters. Let's just estimate them from the initial clusters.

Let's set the cluster posterior probabilities from the initial clustering. That is, as in KMC, a datum determines its cluster uniquely. Next, we compute the cluster means to be 3 and 6 respectively. The cluster standard deviations are 2 and 0 respectively.

The second cluster's standard deviation of 0 poses a slight problem. The resulting Gaussian is degenerate. We can't use it.

We'll patch this situation as follows. We will add *n* copies of 1, 5, and 6, where *n* >> 1. To these, we will add a few points sampled randomly from a Gaussian whose mean and standard deviation are the corresponding statistics of the sample 1, 5, and 6.

Now we will initialize the clusters from this expanded data. The reason for *n* >> 1 is that we want 1, 5, and 6 to remain the dominant numbers.

This may be interpreted as a Bayesian approach in which we have incorporated a prior that the standard deviations of all clusters must be positive. While we are at it, let's tack on another sensible prior: that neither cluster is empty, i.e., that no mixing probability is 0.
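As an aside, a simpler patch often seen in practice is a variance floor: clamp every estimated variance away from zero. This is an alternative to the synthetic-data patch above, not a part of it; VAR_FLOOR below is a made-up constant.

```python
import numpy as np

VAR_FLOOR = 1e-3  # assumed minimum variance; any small positive value will do

def floored_variance(points, mean):
    """A cluster's variance, clamped so its Gaussian never degenerates."""
    pts = np.asarray(points, dtype=float)
    var = np.mean((pts - mean) ** 2) if pts.size else VAR_FLOOR
    return max(var, VAR_FLOOR)

print(floored_variance([6], 6.0))     # 0.001 rather than a degenerate 0
print(floored_variance([1, 5], 3.0))  # 4.0, i.e. standard deviation 2, unchanged
```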
For illustration purposes, let's wish away the synthetic data we added. We will bring it in whenever it is relevant. That is, let's start as before, with three points 1, 5, and 6, and the initial clustering {1, 5} and {6}.

We get the cluster means to be approximately 3 and 6 respectively. (This is because the synthetic data has some effect.) The cluster standard deviations are approximately 2 and some small positive value respectively.

Next, we update the posterior probabilities of the two clusters. For visual convenience, we'll use the distance versions of the update formulae.

The before-and-after depictions below will help illustrate the update process. First, let's depict the data points.

```
x       1               5  6
```

Next, let's see the probabilities of cluster 1 before the update. (P(2|x) is 1 - P(1|x).)

```
x       1               5  6
P(1|x)  1               1  0
```

Next, let's see the updated probabilities of cluster 1, based on the distances.

```
x       1               5  6
P(1|x)  ~1             ~0  0
```

The entry for x = 5 changed the most. Effectively, we softly moved data point 5 from cluster 1 to cluster 2.

Now each data point's cluster probabilities are concentrated on one cluster. Moreover, nearby points (5 and 6) have high posteriors for the same cluster. So the clustering is in a happy state.
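Putting the pieces together, here is a small end-to-end sketch of this run, under our own assumptions: 1D Gaussians, an assumed initial variance of 1 for the patched second cluster, a variance floor during updates, and mixing probabilities initialized from the cluster sizes.

```python
import numpy as np

X = np.array([1.0, 5.0, 6.0])
means = np.array([3.0, 6.0])    # from the initial clustering {1, 5} and {6}
vars_ = np.array([4.0, 1.0])    # cluster 2's variance patched to an assumed 1
mixing = np.array([2/3, 1/3])   # initial cluster sizes 2 and 1

for _ in range(20):
    # E-step: responsibilities P(c|x) from Gaussian densities and mixing weights.
    pdf = np.exp(-(X[:, None] - means) ** 2 / (2 * vars_)) / np.sqrt(2 * np.pi * vars_)
    resp = pdf * mixing
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: weighted means, floored variances, and mixing probabilities.
    votes = resp.sum(axis=0)
    means = (resp * X[:, None]).sum(axis=0) / votes
    vars_ = np.maximum((resp * (X[:, None] - means) ** 2).sum(axis=0) / votes, 1e-3)
    mixing = votes / votes.sum()

print(means)  # approaches [1, 5.5]: cluster 1 keeps {1}, cluster 2 takes {5, 6}
print(resp)   # each point's probability mass concentrated on one cluster
```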
## Summary

In this post, we described a well-known classical algorithm for clustering known as K-means clustering. We also covered an advanced variant called soft K-means clustering, and a robust variant that uses the median instead of the mean.

## Further Reading

1. [k-means clustering](https://en.wikipedia.org/wiki/K-means_clustering)