{"id":498,"date":"2016-09-27T16:37:12","date_gmt":"2016-09-27T13:37:12","guid":{"rendered":"http:\/\/kusuaks7\/?p=103"},"modified":"2025-02-26T07:56:58","modified_gmt":"2025-02-26T07:56:58","slug":"k-means-clustering-in-text-data","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/k-means-clustering-in-text-data\/","title":{"rendered":"K Means Clustering in Text Data"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"498\" class=\"elementor elementor-498\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-2322b63a elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"2322b63a\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-1f2f6dab\" data-id=\"1f2f6dab\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-271d428f elementor-widget elementor-widget-text-editor\" data-id=\"271d428f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tClustering\/segmentation is one of the most important techniques used in Acquisition Analytics. 
K means clustering groups similar observations into clusters so that insights can be extracted from vast amounts of unstructured data.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-c358311 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"c358311\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-04a97a7\" data-id=\"04a97a7\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-3f0929d elementor-widget elementor-widget-text-editor\" data-id=\"3f0929d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWhen you want to analyze the Facebook\/Twitter\/YouTube comments on a particular event, it is impractical to read every mention manually and judge where the sentiment toward a particular brand\/event\/person lies.\n<ul style=\"list-style-type: circle;\">\n \t<li>The basic idea of K Means clustering is to first choose K seeds, and then group the observations into K clusters based on their distance from each of the K seeds. 
An observation is assigned to the n<sup>th<\/sup> seed\/cluster if its distance to the n<sup>th<\/sup> seed is the smallest among its distances to all K seeds.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-a443b58 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"a443b58\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-50ed826\" data-id=\"50ed826\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-c2c88d8 elementor-widget elementor-widget-text-editor\" data-id=\"c2c88d8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tBelow is a brief overview of the methodology involved in performing a K Means Clustering Analysis.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-fb76552 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"fb76552\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-85b4d9e\" 
data-id=\"85b4d9e\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-1ebc295 elementor-widget elementor-widget-heading\" data-id=\"1ebc295\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">The Process of Building K Clusters on Social Media Text Data:<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-207b108 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"207b108\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-6a6d7e0\" data-id=\"6a6d7e0\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-ba7b651 elementor-widget elementor-widget-text-editor\" data-id=\"ba7b651\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li>The first step is to pull the social media mentions for a particular timeframe using social media listening tools (Radian6, Sysomos, Synthesio, etc.). You need to build a query or add keywords to pull the data from these tools.<\/li>\n \t<li>The next step is data cleansing.\u00a0This is the most important part as social media 
comments do not have any specific format. People use local expressions, slang, etc. on social media to express their emotions, so it&#8217;s important to be able to see through them and understand the underlying sentiment.<\/li>\n \t<li>Remove punctuation, numbers, and stopwords (R has a stopword library, but you can also create your own list of stopwords). Also, remove duplicate rows and URLs from the social media mentions.<\/li>\n \t<li>The next step is to create a corpus vector of all the words.<\/li>\n \t<li>Once you have created the corpus vector of words, the next step is to create a document term matrix.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-8075472 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"8075472\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-2cc43e5\" data-id=\"2cc43e5\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-380c915 elementor-widget elementor-widget-text-editor\" data-id=\"380c915\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tLet\u2019s visualize the problem with one example. Let\u2019s assume that there are 10 documents\/mentions and 5 unique words after data cleansing. Below is the document term matrix for this dataset. It shows how many times each word appears in each document. 
For example, in document 1 (D1), the words\u00a0<em>online, book\u00a0<\/em>and <i>Delhi<\/i>\u00a0have each been mentioned once.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-2d94e92 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"2d94e92\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-73c6965\" data-id=\"73c6965\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-979bfde elementor-widget elementor-widget-text-editor\" data-id=\"979bfde\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<table style=\"width: 404px;\" border=\"0\" width=\"404\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td style=\"width: 148px; height: 20px;\" colspan=\"2\" nowrap=\"nowrap\"><strong>Document Term Matrix<\/strong><\/td>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-3ed8fac elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"3ed8fac\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 
elementor-top-column elementor-element elementor-element-ec473c4\" data-id=\"ec473c4\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-1a04285 elementor-widget elementor-widget-text-editor\" data-id=\"1a04285\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\"><\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\"><\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\"><\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\"><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 93px; height: 20px;\" nowrap=\"nowrap\"><\/td>\n<td style=\"width: 55px; height: 20px;\" nowrap=\"nowrap\"><\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\"><\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\"><\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\"><\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\"><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 93px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\"><strong>Documents<\/strong><\/p>\n<\/td>\n<td style=\"width: 55px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\"><strong>Online<\/strong><\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\"><strong>Festival<\/strong><\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\"><strong>Book<\/strong><\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\"><strong>Flight<\/strong><\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\"><strong>Delhi<\/strong><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 93px; height: 20px;\" 
nowrap=\"nowrap\">\n<p align=\"center\">D1<\/p>\n<\/td>\n<td style=\"width: 55px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 93px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D2<\/p>\n<\/td>\n<td style=\"width: 55px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 93px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D3<\/p>\n<\/td>\n<td style=\"width: 55px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 93px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D4<\/p>\n<\/td>\n<td style=\"width: 55px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" 
nowrap=\"nowrap\">\n<p align=\"center\">2<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 93px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D5<\/p>\n<\/td>\n<td style=\"width: 55px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">3<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0<\/p>\n<\/td>\n<\/tr>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-c760a5a elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"c760a5a\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-4d44cc7\" data-id=\"4d44cc7\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-c25571e elementor-widget elementor-widget-text-editor\" data-id=\"c25571e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<tr>\n<td style=\"width: 93px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D6<\/p>\n<\/td>\n<td style=\"width: 55px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 93px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D7<\/p>\n<\/td>\n<td style=\"width: 55px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 93px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D8<\/p>\n<\/td>\n<td style=\"width: 55px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 93px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D9<\/p>\n<\/td>\n<td style=\"width: 55px; height: 20px;\" 
nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 93px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D10<\/p>\n<\/td>\n<td style=\"width: 55px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-745435d elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"745435d\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-28f91c4\" data-id=\"28f91c4\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-0eb061e elementor-widget elementor-widget-text-editor\" data-id=\"0eb061e\" 
data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li>Let\u2019s assume that we want to create K=3 clusters. First, three seeds must be chosen. Suppose D2, D5 &amp; D7 are chosen as the three initial seeds.<\/li>\n \t<li>The next step is to calculate the Euclidean distance of every other document from D2, D5 &amp; D7.<\/li>\n \t<li>Assuming U=Online, V=Festival, X=Book, Y=Flight, Z=Delhi, the Euclidean distance between D1 &amp; D2 would be: &radic;[(U<sub>1<\/sub> &minus; U<sub>2<\/sub>)<sup>2<\/sup> + (V<sub>1<\/sub> &minus; V<sub>2<\/sub>)<sup>2<\/sup> + (X<sub>1<\/sub> &minus; X<sub>2<\/sub>)<sup>2<\/sup> + (Y<sub>1<\/sub> &minus; Y<sub>2<\/sub>)<sup>2<\/sup> + (Z<sub>1<\/sub> &minus; Z<sub>2<\/sub>)<sup>2<\/sup>] = &radic;[(1 &minus; 2)<sup>2<\/sup> + (0 &minus; 1)<sup>2<\/sup> + (1 &minus; 2)<sup>2<\/sup> + (0 &minus; 1)<sup>2<\/sup> + (1 &minus; 1)<sup>2<\/sup>] = &radic;4 = 2.0<\/li>\n<\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-b20c359 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"b20c359\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-d4cd0ec\" data-id=\"d4cd0ec\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-44ed2f9 elementor-widget elementor-widget-text-editor\" data-id=\"44ed2f9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<table style=\"width: 500px;\" border=\"0\" width=\"500\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td style=\"width: 107px; height: 20px;\" nowrap=\"nowrap\"><strong>Distance Matrix<\/strong><\/td>\n<td style=\"width: 80px; height: 20px;\" nowrap=\"nowrap\"><\/td>\n<td style=\"width: 75px; height: 20px;\" 
nowrap=\"nowrap\"><\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\"><\/td>\n<td style=\"width: 91px; height: 20px;\" nowrap=\"nowrap\"><\/td>\n<td style=\"width: 84px; height: 20px;\" nowrap=\"nowrap\"><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 107px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\"><\/p>\n<\/td>\n<td style=\"width: 393px; height: 20px;\" colspan=\"5\" nowrap=\"nowrap\">\n<p align=\"center\"><strong>Distance from 3 clusters<\/strong><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 107px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\"><strong>Documents<\/strong><\/p>\n<\/td>\n<td style=\"width: 80px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\"><strong>D2<\/strong><\/p>\n<\/td>\n<td style=\"width: 75px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\"><strong>D5<\/strong><\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\"><strong>D7<\/strong><\/p>\n<\/td>\n<td style=\"width: 91px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\"><strong>Min. 
Distance<\/strong><\/p>\n<\/td>\n<td style=\"width: 84px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\"><strong>Movement<\/strong><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 107px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D1<\/p>\n<\/td>\n<td style=\"width: 80px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.0<\/p>\n<\/td>\n<td style=\"width: 75px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.6<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.2<\/p>\n<\/td>\n<td style=\"width: 91px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.0<\/p>\n<\/td>\n<td style=\"width: 84px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D2<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 107px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D2<\/p>\n<\/td>\n<td style=\"width: 80px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0.0<\/p>\n<\/td>\n<td style=\"width: 75px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.6<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">1.7<\/p>\n<\/td>\n<td style=\"width: 91px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0.0<\/p>\n<\/td>\n<td style=\"width: 84px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\"><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 107px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D3<\/p>\n<\/td>\n<td style=\"width: 80px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.4<\/p>\n<\/td>\n<td style=\"width: 75px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">3.6<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.2<\/p>\n<\/td>\n<td style=\"width: 91px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.2<\/p>\n<\/td>\n<td style=\"width: 84px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D7<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td 
style=\"width: 107px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D4<\/p>\n<\/td>\n<td style=\"width: 80px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.8<\/p>\n<\/td>\n<td style=\"width: 75px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">3.0<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.6<\/p>\n<\/td>\n<td style=\"width: 91px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.6<\/p>\n<\/td>\n<td style=\"width: 84px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D7<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 107px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D5<\/p>\n<\/td>\n<td style=\"width: 80px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.6<\/p>\n<\/td>\n<td style=\"width: 75px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0.0<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.8<\/p>\n<\/td>\n<td style=\"width: 91px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0.0<\/p>\n<\/td>\n<td style=\"width: 84px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\"><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 107px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D6<\/p>\n<\/td>\n<td style=\"width: 80px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.4<\/p>\n<\/td>\n<td style=\"width: 75px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">3.9<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.6<\/p>\n<\/td>\n<td style=\"width: 91px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.4<\/p>\n<\/td>\n<td style=\"width: 84px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D2<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 107px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D7<\/p>\n<\/td>\n<td style=\"width: 80px; height: 20px;\" nowrap=\"nowrap\">\n<p 
align=\"center\">1.7<\/p>\n<\/td>\n<td style=\"width: 75px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.8<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0.0<\/p>\n<\/td>\n<td style=\"width: 91px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">0.0<\/p>\n<\/td>\n<td style=\"width: 84px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\"><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 107px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D8<\/p>\n<\/td>\n<td style=\"width: 80px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.6<\/p>\n<\/td>\n<td style=\"width: 75px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.0<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.8<\/p>\n<\/td>\n<td style=\"width: 91px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.0<\/p>\n<\/td>\n<td style=\"width: 84px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D5<\/p>\n<\/td>\n<\/tr>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-11c78c5 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"11c78c5\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-f25191d\" data-id=\"f25191d\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-ad2cdbd elementor-widget elementor-widget-text-editor\" data-id=\"ad2cdbd\" data-element_type=\"widget\" data-e-type=\"widget\" 
data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<tr>\n<td style=\"width: 107px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D9<\/p>\n<\/td>\n<td style=\"width: 80px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.0<\/p>\n<\/td>\n<td style=\"width: 75px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">3.0<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.6<\/p>\n<\/td>\n<td style=\"width: 91px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.0<\/p>\n<\/td>\n<td style=\"width: 84px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D2<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 107px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D10<\/p>\n<\/td>\n<td style=\"width: 80px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.2<\/p>\n<\/td>\n<td style=\"width: 75px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">3.5<\/p>\n<\/td>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.4<\/p>\n<\/td>\n<td style=\"width: 91px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2.2<\/p>\n<\/td>\n<td style=\"width: 84px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D2<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p style=\"margin-left: .75in;\"><\/p>\n\n<table style=\"width: 191px;\" border=\"0\" width=\"191\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\"><strong>Clusters<\/strong><\/p>\n<\/td>\n<td style=\"width: 127px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\"><strong># of Observations<\/strong><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D2<\/p>\n<\/td>\n<td style=\"width: 127px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">5<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 
64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D5<\/p>\n<\/td>\n<td style=\"width: 127px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">2<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 64px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">D7<\/p>\n<\/td>\n<td style=\"width: 127px; height: 20px;\" nowrap=\"nowrap\">\n<p align=\"center\">3<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n&nbsp;\n<ul>\n \t<li>Hence, the 10 documents have been assigned to 3 different clusters. Medoids are then formed instead of centroids, and the distances are recalculated so that each document is assigned to the cluster of the medoid it is closest to.<\/li>\n \t<li>The medoids are then used to build the story for each cluster.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-07bf4c4 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"07bf4c4\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-68072d8\" data-id=\"68072d8\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-9ac6c5c elementor-widget elementor-widget-text-editor\" data-id=\"9ac6c5c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tBut there is still one important question remaining: How do you choose the optimal number of
clusters?\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-d7e1937 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"d7e1937\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-8dbd65a\" data-id=\"8dbd65a\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-dcfb9d7 elementor-widget elementor-widget-text-editor\" data-id=\"dcfb9d7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tOne approach is to use the Elbow method. It is based on plotting the cost function against the number of clusters and identifying the breakpoints in the curve. Once adding another cluster no longer significantly reduces the variance within the clusters, one should stop adding more clusters. Although this method cannot give you the optimal number of clusters as an exact point, it can give you an optimal range.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>K means clustering is a method that is often used in sentiment analysis.
This post gives an overview of how to implement this method.<\/p>\n","protected":false},"author":20,"featured_media":8229,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[122],"ppma_author":[1610],"class_list":["post-498","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-big-data"],"authors":[{"term_id":1610,"user_id":20,"is_guest":0,"slug":"madhukar-kumar","display_name":"Madhukar Kumar","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Kumar","first_name":"Madhukar","job_title":"","description":"Madhukar has over 10 years of experience in the analytics industry. He has been providing analytics&nbsp;consulting services to the UK, US, Canada, Europe and Australia for a decade, and has worked for respected corporations such as American Express and GE Money in the past.&nbsp;"}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/498","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/20"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=498"}],"version-history":[{"count":5,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/498\/revisions"}],"predecessor-version":[{"id":37296,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/498\/revisions\/37296"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/8229"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=498"}],"wp:term":[{"taxonomy":"category","embeddable
":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=498"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=498"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=498"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}