K Means Clustering in Text Data

Madhukar Kumar
September 27, 2016 Big Data, Cloud & DevOps
Clustering, or segmentation, is one of the most important techniques used in Acquisition Analytics. K Means clustering groups similar observations into clusters so that insights can be extracted from vast amounts of unstructured data.
When you want to analyze the Facebook/Twitter/YouTube comments on a particular event, it would be impossible to manually read every mention to see where the sentiment regarding a particular brand, event or person lies.
  • The basic idea of K Means clustering is to form K seeds first, and then group the observations into K clusters on the basis of their distance from each of the K seeds. An observation is assigned to the nth seed/cluster if its distance to the nth seed is smaller than its distance to every other seed.
Below is a brief overview of the methodology involved in performing a K Means clustering analysis.

The Process of building K clusters on Social Media text data:

  • The first step is to pull the social media mentions for a particular timeframe using social media listening tools (Radian6, Sysomos, Synthesio etc.). You will need to build a query or add keywords to pull the data from these tools.
  • The next step is data cleansing. This is the most important part, as social media comments do not follow any specific format. People use local expressions, slang etc. on social media to express their emotions, so it’s important to be able to see through them and understand the underlying sentiment.
  • Remove punctuation, numbers and stopwords (R has a dedicated stopword library, but you can also create your own list of stopwords). Also remove duplicate rows and URLs from the social media mentions.
  • The next step is to create a corpus vector of all the words.
  • Once you have created the corpus vector of words, the next step is to create a document term matrix.
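The cleansing and matrix-building steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation (the article refers to R): the stopword list, the sample mentions and the function names `clean` and `document_term_matrix` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stopword list, for illustration only; a real project would use a fuller one.
STOPWORDS = {"a", "an", "the", "to", "in", "for", "and", "i", "my", "is", "on", "of"}

def clean(mention):
    """Lowercase a mention, drop URLs, punctuation and numbers, then remove stopwords."""
    text = mention.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # remove punctuation and digits
    return [word for word in text.split() if word not in STOPWORDS]

def document_term_matrix(mentions):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    unique_mentions = list(dict.fromkeys(mentions))  # drop exact duplicates, keep order
    docs = [clean(m) for m in unique_mentions]
    vocab = sorted({word for doc in docs for word in doc})
    counts = [Counter(doc) for doc in docs]
    matrix = [[c[word] for word in vocab] for c in counts]
    return vocab, matrix

# Made-up sample mentions; the second one is an exact duplicate of the first.
mentions = [
    "Book a flight to Delhi online! http://example.com/deal",
    "Book a flight to Delhi online! http://example.com/deal",
    "Online festival in Delhi, 2 days only",
]
vocab, dtm = document_term_matrix(mentions)
print(vocab)
for row in dtm:
    print(row)
```

Each row of `dtm` corresponds to one cleaned mention and each column to one word of the vocabulary, which is the same layout as the document term matrix used in the example that follows.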
Let’s visualize the problem with an example. Assume that there are 10 documents/mentions and 5 unique words after data cleansing. Below is the document term matrix for this dataset. It shows how many times each word appears in each document. For example, in document 1 (D1), the words Online, Book and Delhi are each mentioned once.
Document Term Matrix

Documents    Online    Festival    Book    Flight    Delhi
D1           1         0           1       0         1
D2           2         1           2       1         1
D3           0         0           1       1         1
D4           1         2           0       2         0
D5           3         1           0       0         0
D6           0         1           1       1         2
D7           2         0           1       2         1
D8           1         1           0       1         0
D9           1         0           2       0         0
D10          0         1           1       1         1

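  • Let’s assume that we want to create K=3 clusters. First, three seeds must be chosen. Suppose D2, D5 & D7 are chosen as the initial three seeds.
  • The next step is to calculate the Euclidean distance of every other document from D2, D5 & D7.
  • Assuming U=Online, V=Festival, X=Book, Y=Flight, Z=Delhi, the Euclidean distance between D1 & D2 would be:

\[
d(D1, D2) = \sqrt{(U_1 - U_2)^2 + (V_1 - V_2)^2 + (X_1 - X_2)^2 + (Y_1 - Y_2)^2 + (Z_1 - Z_2)^2}
          = \sqrt{(1-2)^2 + (0-1)^2 + (1-2)^2 + (0-1)^2 + (1-1)^2} = \sqrt{4} = 2.0
\]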
Distance Matrix

             Distance from 3 clusters
Documents    D2     D5     D7     Min. Distance    Movement
D1           2.0    2.6    2.2    2.0              D2
D2           0.0    2.6    1.7    0.0
D3           2.4    3.6    2.2    2.2              D7
D4           2.8    3.0    2.6    2.6              D7
D5           2.6    0.0    2.8    0.0
D6           2.4    3.9    2.6    2.4              D2
D7           1.7    2.8    0.0    0.0
D8           2.6    2.0    2.8    2.0              D5
D9           2.0    3.0    2.6    2.0              D2
D10          2.2    3.5    2.4    2.2              D2

Clusters    # of Observations
D2          5
D5          2
D7          3
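The distance and assignment step can be reproduced with a short Python sketch. It takes the document term matrix exactly as printed above and the seeds D2, D5 and D7 from the text; the names `DTM`, `euclidean` and `assign_to_seeds` are illustrative assumptions, and ties are broken in favour of the seed listed first.

```python
import math

# Document term matrix from the article; columns: Online, Festival, Book, Flight, Delhi.
DTM = {
    "D1": [1, 0, 1, 0, 1],
    "D2": [2, 1, 2, 1, 1],
    "D3": [0, 0, 1, 1, 1],
    "D4": [1, 2, 0, 2, 0],
    "D5": [3, 1, 0, 0, 0],
    "D6": [0, 1, 1, 1, 2],
    "D7": [2, 0, 1, 2, 1],
    "D8": [1, 1, 0, 1, 0],
    "D9": [1, 0, 2, 0, 0],
    "D10": [0, 1, 1, 1, 1],
}
SEEDS = ["D2", "D5", "D7"]

def euclidean(a, b):
    """Square root of the summed squared differences between two count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_to_seeds(dtm, seeds):
    """Map each document to its nearest seed (the first seed wins a tie)."""
    assignment = {}
    for doc, vec in dtm.items():
        distances = {seed: euclidean(vec, dtm[seed]) for seed in seeds}
        assignment[doc] = min(distances, key=distances.get)
    return assignment

assignment = assign_to_seeds(DTM, SEEDS)
sizes = {seed: sum(1 for s in assignment.values() if s == seed) for seed in SEEDS}

for doc, vec in DTM.items():
    row = [round(euclidean(vec, DTM[seed]), 1) for seed in SEEDS]
    print(doc, row, "->", assignment[doc])
print(sizes)
```

One caveat: computed from the matrix as printed, D8's distances come out as roughly 2.4, 2.2 and 2.2 rather than the 2.6, 2.0 and 2.8 shown in the distance matrix, so D8 is tied between D5 and D7. The printed row would match a D8 with a Flight count of 0. Because the sketch lets the earlier seed win a tie, D8 still lands with D5 and the cluster sizes agree with the table.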

 
  • Hence, the 10 documents have moved into 3 different clusters. Instead of centroids, medoids are formed (a medoid is the actual document in a cluster with the smallest total distance to the other members), and the distances are recalculated to ensure that each document is assigned to the cluster of its closest medoid.
  • Medoids are used to build the story for each cluster.
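A minimal sketch of the medoid step, assuming the cluster membership from the table above and taking the medoid to be the member with the smallest total Euclidean distance to the other members of its cluster. The names `CLUSTERS`, `euclidean` and `medoid` are illustrative, not from the article.

```python
import math

# Document term matrix from the article; columns: Online, Festival, Book, Flight, Delhi.
DTM = {
    "D1": [1, 0, 1, 0, 1],
    "D2": [2, 1, 2, 1, 1],
    "D3": [0, 0, 1, 1, 1],
    "D4": [1, 2, 0, 2, 0],
    "D5": [3, 1, 0, 0, 0],
    "D6": [0, 1, 1, 1, 2],
    "D7": [2, 0, 1, 2, 1],
    "D8": [1, 1, 0, 1, 0],
    "D9": [1, 0, 2, 0, 0],
    "D10": [0, 1, 1, 1, 1],
}

# Cluster membership after the first assignment step, keyed by the initial seed.
CLUSTERS = {
    "D2": ["D1", "D2", "D6", "D9", "D10"],
    "D5": ["D5", "D8"],
    "D7": ["D3", "D4", "D7"],
}

def euclidean(a, b):
    """Square root of the summed squared differences between two count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def medoid(members, dtm):
    """Return the member with the smallest total distance to all cluster members."""
    def total_distance(doc):
        return sum(euclidean(dtm[doc], dtm[other]) for other in members)
    return min(members, key=total_distance)  # the first member wins a tie

medoids = {seed: medoid(members, DTM) for seed, members in CLUSTERS.items()}
print(medoids)
```

With the matrix as printed, the cluster seeded by D2 ends up with D1 as its medoid, the two-member cluster is a tie that the sketch resolves to D5, and the third cluster keeps D7. In a full run, the documents would then be reassigned to their nearest medoid and the two steps repeated until the assignment stops changing.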
But one important question remains: how do you choose the optimal number of clusters?
One approach is to use the Elbow method. It is based on plotting the cost function (the variance within the clusters) for different numbers of clusters and identifying the breakpoints. If adding more clusters does not significantly reduce the variance within the clusters, you should stop adding clusters. Although this method cannot give you the optimal number of clusters as an exact point, it can give you an optimal range.