{"id":2333,"date":"2020-03-23T05:25:59","date_gmt":"2020-03-23T02:25:59","guid":{"rendered":"http:\/\/kusuaks7\/?p=1938"},"modified":"2023-12-25T18:17:17","modified_gmt":"2023-12-25T18:17:17","slug":"exploratory-data-analysis-for-natural-language-processing-a-compete-guide-to-python-tools","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/exploratory-data-analysis-for-natural-language-processing-a-compete-guide-to-python-tools\/","title":{"rendered":"Exploratory Data Analysis for Natural Language Processing: A Complete Guide to Python Tools"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"2333\" class=\"elementor elementor-2333\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-591a6c43 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"591a6c43\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-e770887\" data-id=\"e770887\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-65d0345e elementor-widget elementor-widget-text-editor\" data-id=\"65d0345e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<section data-element_type=\"section\" data-id=\"3f160ca\">Exploratory data analysis is one of the most important parts of any machine learning workflow and Natural Language Processing is no different. 
But <strong>which tools should you choose</strong> to explore and visualize text data efficiently?

In this article, we will <strong>discuss and implement nearly all the major techniques</strong> that you can use to understand your text data and give you a complete(ish) tour of the <a href="https://www.experfy.com/blog/software/tools-for-working-with-excel-and-python/">Python tools</a> that get the job done.

<h2>Before we start: Dataset and Dependencies</h2>

In this article, we will use <a href="https://www.kaggle.com/therohk/million-headlines">a million news headlines dataset</a> from Kaggle. If you want to follow the analysis step by step, you may want to install the following libraries:

<pre>
pip install pandas matplotlib numpy nltk seaborn scikit-learn gensim pyldavis wordcloud textblob spacy textstat
</pre>

Now, we can take a look at the data.

<pre>
import pandas as pd

news = pd.read_csv('data/abcnews-date-text.csv', nrows=10000)
news.head(3)
</pre>

The dataset contains only two columns: the publish date and the news headline. For simplicity, I will be exploring the first <strong>10,000 rows</strong> from this dataset. Since the headlines are sorted by <em>publish_date</em>, this is actually <strong>2 months of data, from February 19, 2003 until April 7, 2003.</strong>
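A quick way to confirm that date range is to look at the minimum and maximum of the <em>publish_date</em> column. This is just a sanity-check sketch of my own; it assumes the column is loaded as plain YYYYMMDD integers, as in the raw Kaggle CSV.

<pre>
# check the period covered by the first 10,000 rows (publish_date is a YYYYMMDD integer)
print(news['publish_date'].min())   # expected to be around 20030219
print(news['publish_date'].max())   # expected to be around 20030407
</pre>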
OK, I think we are ready to start our data exploration!

<h2>Analyzing text statistics</h2>

Text statistics visualizations are simple but very insightful techniques. They include:
<ul>
	<li>word frequency analysis,</li>
	<li>sentence length analysis,</li>
	<li>average word length analysis,</li>
	<li>etc.</li>
</ul>
Those really help <strong>explore the fundamental characteristics</strong> of the text data.

To do so, we will mostly be using <strong>histograms</strong> (continuous data) and <strong>bar charts</strong> (categorical data).

First, I'll take a look at the number of characters present in each sentence.
This can give us a rough idea about the news headline length.

<pre>
news['headline_text'].str.len().hist()
</pre>

<p><a href="https://ui.neptune.ai/o/neptune-ml/org/eda-nlp-tools/n/1-0-character-length-histogram-27f4f679-09fd-4490-b4e3-9020acc1c55d/0e390d7b-fd8a-4612-8a08-d388cce901a7">Code Snippet that Generates this Chart</a></p>

The histogram shows that news headlines range from 10 to 70 characters, generally falling between 25 and 55 characters.

Now, we will move on to data exploration at the word level.
Let's plot the number of words appearing in each news headline.

<pre>
news['headline_text'].str.split().map(lambda x: len(x)).hist()
</pre>

<p><a href="https://ui.neptune.ai/o/neptune-ml/org/eda-nlp-tools/n/1-1-word-number-histogram-aff0bde6-6ad1-45cf-a8f8-68a2ad7da521/e4cee3db-8d07-4dc6-8584-063b11e76809">Code Snippet that Generates this Chart</a></p>

It is clear that the number of words in news headlines ranges from 2 to 12 and mostly falls between 5 and 7 words.

Up next, let's check the <strong>average word length</strong> in each sentence.

<pre>
import numpy as np

(news['headline_text'].str.split()
   .apply(lambda x: [len(i) for i in x])
   .map(lambda x: np.mean(x))
   .hist())
</pre>
data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-e57f475 elementor-widget elementor-widget-text-editor\" data-id=\"e57f475\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<\/section><section data-element_type=\"section\" data-id=\"15e1b57\"><section data-element_type=\"section\" data-id=\"ed65b2b\">\n<p style=\"text-align: center;\"><a href=\"https:\/\/ui.neptune.ai\/o\/neptune-ml\/org\/eda-nlp-tools\/n\/1-2-word-length-histogram-6204616c-6314-4ddd-9398-fe73415c09ff\/e5c67525-6a16-4751-b4c5-4c64c1ad2730\" target=\"_blank\" rel=\"noopener noreferrer\" class=\"broken_link\">Code Snippet that Generates this Chart<\/a><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5d8d538 elementor-widget elementor-widget-text-editor\" data-id=\"5d8d538\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<\/section><\/section><section data-element_type=\"section\" data-id=\"e56c12f\">The average word length ranges between 3 to 9 with 5 being the most common length. Does it mean that people are using really short words in news headlines?\nLet\u2019s find out.One reason why this may not be true is stopwords.\u00a0<strong>Stopwords are the words that are most commonly used in any language<\/strong>\u00a0such as\u00a0<em>\u201cthe\u201d,\u201d a\u201d,\u201d an<\/em>\u201d etc. As these words are probably small in length these words may have caused the above graph to be left-skewed.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e75a30b elementor-widget elementor-widget-text-editor\" data-id=\"e75a30b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAnalyzing the amount and the types of stopwords can give us some good insights into the data.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1402175 elementor-widget elementor-widget-text-editor\" data-id=\"1402175\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tTo get the corpus containing stopwords you can use the\u00a0<a href=\"https:\/\/www.nltk.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">nltk library<\/a>. Nltk contains stopwords from many languages. 
Since we are only dealing with English news, I will filter the English stopwords from the corpus.

<pre>
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop = set(stopwords.words('english'))
</pre>

Now, we'll create the corpus

<pre>
corpus = []
new = news['headline_text'].str.split()
new = new.values.tolist()
corpus = [word for i in new for word in i]

from collections import defaultdict

# count how often each stopword appears in the corpus
dic = defaultdict(int)
for word in corpus:
    if word in stop:
        dic[word] += 1
</pre>

and plot the top stopwords.
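The exact plotting code lives in the snippet linked below; a minimal sketch of my own (simply sorting the dictionary we just built and drawing it with seaborn) could look like this:

<pre>
import seaborn as sns

# take the 10 most frequent stopwords and draw a horizontal bar chart
top = sorted(dic.items(), key=lambda kv: kv[1], reverse=True)[:10]
words, counts = zip(*top)
sns.barplot(x=list(counts), y=list(words))
</pre>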
<p><a href="https://ui.neptune.ai/o/neptune-ml/org/eda-nlp-tools/n/1-3-top-stopwords-barchart-b953763c-3fea-4331-bff0-429411793e5f/5c0fca05-ba07-4564-a02e-c44b08bfb8cb">Code Snippet that Generates this Chart</a></p>

We can clearly see that stopwords such as "to", "in" and "for" dominate the news headlines.

So now <strong>we know which stopwords occur frequently in our text, let's inspect which words other than these stopwords occur frequently.</strong> We will use the <a href="https://pymotw.com/2/collections/counter.html">Counter</a> class from the collections library to count and store the occurrences of each word in a list of tuples. This is a <strong>very useful tool for word-level analysis</strong> in natural language processing.

<pre>
from collections import Counter
import seaborn as sns

counter = Counter(corpus)
most = counter.most_common()

x, y = [], []
for word, count in most[:40]:
    if word not in stop:
        x.append(word)
        y.append(count)

sns.barplot(x=y, y=x)
</pre>

<p><a href="https://ui.neptune.ai/o/neptune-ml/org/eda-nlp-tools/n/1-4-top-non-stopwords-barchart-36267acc-a418-4a5f-a3ba-67a3b51dde12/b57bc536-8cec-46a7-918c-60fba6f2c83d">Code Snippet that Generates this Chart</a></p>

Wow! "us", "Iraq" and "war" dominate the headlines in this two-month sample.

Here "us" could mean either the USA or us (you and me).
"us" is not a stopword, but when we look at the other words in the graph they are all related to the US–Iraq war, so "us" here probably refers to the USA.

<h2>Ngram exploration</h2>

Ngrams are simply <strong>contiguous sequences of n words</strong>, for example "river bank" or "the three musketeers". If the number of words is two, it is called a bigram; for 3 words it is called a trigram, and so on.

<strong>Looking at the most frequent n-grams can give you a better understanding of the context</strong> in which a word was used.

To implement n-grams we will use the <em>ngrams</em> function from <em>nltk.util</em>. For example:

<pre>
from nltk.util import ngrams

list(ngrams(['I', 'went', 'to', 'the', 'river', 'bank'], 2))
</pre>

Now that we know how to create n-grams, let's visualize them. <strong>To build a representation of our vocabulary we will use <em>CountVectorizer</em>.</strong> <em>CountVectorizer</em> is a simple way to tokenize, vectorize and represent the corpus in an appropriate form.
It is available in <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html"><em>sklearn.feature_extraction.text</em></a>.

With all this, we will analyze the top bigrams in our news headlines.

<pre>
from sklearn.feature_extraction.text import CountVectorizer

def get_top_ngram(corpus, n=None):
    # count all n-grams of length n and return the 10 most frequent ones
    vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx])
                  for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:10]
</pre>

<pre>
top_n_bigrams = get_top_ngram(news['headline_text'], 2)[:10]
x, y = map(list, zip(*top_n_bigrams))
sns.barplot(x=y, y=x)
</pre>

We can observe that bigrams related to the war, such as "anti war" and "killed in", dominate the news headlines. How about trigrams?

<pre>
top_tri_grams = get_top_ngram(news['headline_text'], n=3)
x, y = map(list, zip(*top_tri_grams))
sns.barplot(x=y, y=x)
</pre>
<p><a href="https://ui.neptune.ai/o/neptune-ml/org/eda-nlp-tools/n/2-0-top-ngrams-barchart-671a187d-c3b4-475a-bc9e-8aa6c937923b/c427446f-7b0e-4621-b791-47b0fd31a39e">Code Snippet that Generates this Chart</a></p>

We can see that many of these trigrams are combinations of <em>"to face court"</em> and <em>"anti war protest"</em>. <strong>It means that we should put some effort into data cleaning</strong> and see if we are able to combine those synonym terms into one clean token.
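As an illustration of that kind of cleanup (my own sketch, not part of the original analysis), one could normalize a near-duplicate phrase into a single token before recounting the n-grams; the replacement map here is purely hypothetical:

<pre>
# hypothetical normalization: merge "anti war" / "anti-war" into one token, then recount bigrams
cleaned = (news['headline_text']
           .str.lower()
           .str.replace('anti-war', 'anti war', regex=False)
           .str.replace('anti war', 'antiwar', regex=False))
get_top_ngram(cleaned, 2)
</pre>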
<h2>Topic Modeling exploration with pyLDAvis</h2>

Topic modeling is the process of <strong>using unsupervised learning techniques to extract the main topics that occur in a collection of documents.</strong>

<a href="https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158">Latent Dirichlet Allocation</a> (LDA) is an easy-to-use and efficient model for topic modeling. Each document is represented by a distribution of topics and each topic is represented by a distribution of words.

Once we categorize our documents into topics we can dig into further <strong>data exploration for each topic or topic group</strong>.

But before getting into topic modeling we have to pre-process our data a little. We will:
<ul>
	<li><strong><em>tokenize</em></strong>: the process by which sentences are converted to a list of tokens or words.</li>
	<li><strong><em>remove stopwords</em></strong></li>
	<li><em><strong>lemmatize</strong></em>: reduce the inflectional forms of each word to a common base or root.</li>
	<li><em><strong>convert to a bag of words</strong></em>: a bag of words is a dictionary where the keys are words (or ngrams/tokens) and the values are the number of times each word occurs in the corpus.</li>
</ul>
With NLTK you can tokenize and lemmatize easily:

<pre>
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('wordnet')

def preprocess_news(df):
    corpus = []
    lem = WordNetLemmatizer()
    for text in df['headline_text']:
        # tokenize, drop stopwords and very short tokens, then lemmatize
        words = [w for w in word_tokenize(text) if w not in stop]
        words = [lem.lemmatize(w) for w in words if len(w) > 2]
        corpus.append(words)
    return corpus

corpus = preprocess_news(news)
</pre>

Now, let's create the bag-of-words model using gensim.
data-id=\"f74d241\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<\/section><section data-element_type=\"section\" data-id=\"06c2555\">\n<div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><span style=\"font-family: courier new,courier,monospace;\">dic=gensim.corpora.Dictionary(corpus)\nbow_corpus = [dic.doc2bow(doc) for doc in corpus]<\/span><\/div>\n<\/section><section data-element_type=\"section\" data-id=\"7d7823e\">and we can finally create the LDA model:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fe898f6 elementor-widget elementor-widget-text-editor\" data-id=\"fe898f6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<section data-element_type=\"section\" data-id=\"b8c6afc\"><div style=\"background: #eee; border: 1px solid #ccc; padding: 5px 10px;\"><span style=\"font-family: courier new,courier,monospace;\">lda_model = gensim.models.LdaMulticore(bow_corpus,<br \/>num_topics = 4,<br \/>id2word = dic,<br \/>passes = 10,<br \/>workers = 2)<br \/>lda_model.show_topics()<\/span><\/div><\/section><section data-element_type=\"section\" data-id=\"73fad4f\"><p style=\"text-align: center;\">\u00a0<\/p><\/section>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-eff3e3c elementor-widget elementor-widget-text-editor\" data-id=\"eff3e3c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<\/section><section data-element_type=\"section\" data-id=\"5153bc2\">The topic 0 indicates something related to the Iraq war and police. Topic 3 shows the involvement of Australia in the Iraq war.You can print all the topics and try to make sense of them but there are tools that can help you run this data exploration more efficiently. 
You can print all the topics and try to make sense of them, but there are tools that can help you run this data exploration more efficiently. One such tool is <a href="https://github.com/bmabey/pyLDAvis">pyLDAvis</a>, which <strong>visualizes the results of LDA interactively.</strong>

<pre>
import pyLDAvis
import pyLDAvis.gensim

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, bow_corpus, dic)
vis
</pre>

<video src="https://neptune.ai/wp-content/uploads/pyldavis.mp4" controls="controls"></video>

<p><a href="https://ui.neptune.ai/o/neptune-ml/org/eda-nlp-tools/n/3-0-topic-modeling-vis-ddd6a861-62d0-40cb-9207-ebd5b47d74d0/e7cb3e68-cc7b-443e-992b-414640a55a0b">Code Snippet that Generates this Chart</a></p>

On the left side, the <strong>area of each circle represents the importance of the topic</strong> relative to the corpus. As there are four topics, we have four circles.

<ul>
	<li>The <strong>distance between the centers of the circles indicates the similarity</strong> between the topics. Here you can see that topic 3 and topic 4 overlap, which indicates that these topics are more similar.</li>
	<li>On the right side, the <strong>histogram of each topic shows the top 30 relevant words</strong>.
For example, in topic 1 the most relevant words are police, new, may, war, etc.</li>
</ul>
So in our case, we can see a lot of words and topics associated with the war in the news headlines.

<h2>Wordcloud</h2>

A wordcloud is a great way to represent text data. The size and color of each word that appears in the wordcloud indicate its frequency or importance.

Creating a <a href="https://amueller.github.io/word_cloud/index.html">wordcloud in python</a> is easy, but we need the data in the form of a corpus. Luckily, I prepared it in the previous section.

<pre>
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

stopwords = set(STOPWORDS)

def show_wordcloud(data):
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=100,
        max_font_size=30,
        scale=3,
        random_state=1)

    # the corpus is a list of token lists; one big string is enough for WordCloud
    wordcloud = wordcloud.generate(str(data))

    fig = plt.figure(1, figsize=(12, 12))
    plt.axis('off')
    plt.imshow(wordcloud)
    plt.show()
</pre>
<pre>
show_wordcloud(corpus)
</pre>

<p><a href="https://ui.neptune.ai/o/neptune-ml/org/eda-nlp-tools/n/4-0-wordclouds-853dfded-4d17-4f37-83e4-15ec53f74e60/5833b046-3cf9-4c0f-8fbf-4a5933da924e">Code Snippet that Generates this Chart</a></p>

Again, you can see that the terms associated with the war are highlighted, which indicates that these words occurred frequently in the news headlines.

There are <strong>many parameters that can be adjusted</strong>. Some of the most prominent ones are:
<ul>
	<li><strong><em>stopwords</em></strong>: the set of words that are blocked from appearing in the image.</li>
	<li><strong><em>max_words</em></strong>: indicates the maximum number of words to be displayed.</li>
	<li><strong><em>max_font_size</em></strong>: maximum font size.</li>
</ul>
There are many more options to create beautiful word clouds; for more details, you can refer to the library documentation linked above.

<h2>Sentiment analysis</h2>

Sentiment analysis is a very common natural language processing task in which we <strong>determine if the text is positive, negative or neutral.</strong> This is very useful for finding the sentiment associated with reviews and comments, which can get us some valuable insights out of text data.

There are many projects that will help you do sentiment analysis in python.
I personally like <a href="https://github.com/sloria/TextBlob">TextBlob</a> and <a href="https://github.com/cjhutto/vaderSentiment">VADER Sentiment</a>.

<h3>Textblob</h3>

Textblob is a python library built on top of nltk. It has been around for some time and is very easy and convenient to use. The sentiment function of TextBlob returns two properties:
<ul>
	<li><strong><em>polarity</em></strong>: a floating-point number that lies in the range [-1, 1], where <strong>1 means a positive</strong> statement and <strong>-1 means a negative</strong> statement.</li>
	<li><strong><em>subjectivity</em></strong>: refers to <strong>how much someone's judgment is shaped by personal opinions</strong> and feelings. Subjectivity is represented as a floating-point value in the range [0, 1].</li>
</ul>
I will run this function on our news headlines.

<pre>
from textblob import TextBlob

TextBlob('100 people killed in Iraq').sentiment
</pre>

TextBlob claims that the text <em>"100 people killed in Iraq"</em> is negative and is not an opinion or feeling but rather a factual statement.
I think we can agree with TextBlob here.

Now that we know how to calculate those sentiment scores, <strong>we can visualize them using a histogram and explore the data even further.</strong>

<pre>
def polarity(text):
    return TextBlob(text).sentiment.polarity

news['polarity_score'] = news['headline_text'].apply(lambda x: polarity(x))
news['polarity_score'].hist()
</pre>

<p><a href="https://ui.neptune.ai/o/neptune-ml/org/eda-nlp-tools/n/5-0-polarity-score-histogram-7435097b-2554-423d-82f9-a4dfce94ea9b">Code Snippet that Generates this Chart</a></p>

You can see that the polarity mainly ranges between 0.00 and 0.20.
This indicates that the <strong>majority of the news headlines are neutral.</strong> Let's dig a bit deeper by classifying the news as negative, positive and neutral based on the scores.

<pre>
def sentiment(x):
    if x < 0:
        return 'neg'
    elif x == 0:
        return 'neu'
    else:
        return 'pos'

news['polarity'] = news['polarity_score'].map(lambda x: sentiment(x))

plt.bar(news.polarity.value_counts().index,
        news.polarity.value_counts())
</pre>

<p><a href="https://ui.neptune.ai/o/neptune-ml/org/eda-nlp-tools/n/5-1-sentiment-barchart-1da2f77b-db4e-4636-b186-0328dcbb791b/ea6a3450-6d61-4b3f-9274-f1f0c241fa5c">Code Snippet that Generates this Chart</a></p>

Yep, 70% of the news is neutral, with only 18% positive and 11% negative.

Let's take a look at <strong>some of the positive and negative headlines.</strong>

<pre>
news[news['polarity']=='pos']['headline_text'].head()
</pre>

Positive news headlines are mostly about some victory in sports.
Positive news headlines are mostly about some victory in sports.

<pre>
news[news['polarity']=='neg']['headline_text'].head()
</pre>

<h3>VADER Sentiment Analysis</h3>

The next library we are going to discuss is VADER. <strong>VADER works better at detecting negative sentiment</strong>, which makes it very useful for social media sentiment analysis. <strong>VADER, or Valence Aware Dictionary and sEntiment Reasoner</strong>, is a pre-built, rule/lexicon-based, open-source sentiment analyzer released under the MIT license.

The VADER sentiment analyzer <strong>returns a dictionary that contains scores for the text being positive, negative, and neutral</strong> (plus a compound score). We can then filter and choose the sentiment with the highest score; the short example after the next code block shows what that dictionary actually looks like.

We will do the same analysis using VADER and check if there is much difference.

<pre>
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()

def get_vader_score(sent):
    # polarity_scores returns a dictionary: neg, neu, pos, compound
    ss = sid.polarity_scores(sent)
    # drop the compound score and return the index of the strongest class
    return np.argmax(list(ss.values())[:-1])
</pre>
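To make the <em>argmax</em> trick above concrete, here is a minimal, self-contained sketch of what <em>polarity_scores</em> returns. The example sentence is mine, and the exact numbers will depend on your VADER lexicon version:

<pre>
import nltk
import numpy as np
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()

scores = sid.polarity_scores('storm damage worse than expected')
print(scores)
# A dict of the form {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}.
# list(scores.values())[:-1] keeps only [neg, neu, pos], so np.argmax over
# it yields 0, 1 or 2, which maps to 'neg', 'neu' or 'pos'.
print(np.argmax(list(scores.values())[:-1]))
</pre>

Note that this relies on the dictionary keeping <em>compound</em> as its last entry, which holds for the NLTK implementation on Python 3.7+ where dicts preserve insertion order.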
<pre>
news['polarity'] = news['headline_text'].map(lambda x: get_vader_score(x))
polarity = news['polarity'].replace({0: 'neg', 1: 'neu', 2: 'pos'})

plt.bar(polarity.value_counts().index,
        polarity.value_counts())
</pre>

<a href="https://ui.neptune.ai/o/neptune-ml/org/eda-nlp-tools/n/5-1-sentiment-barchart-1da2f77b-db4e-4636-b186-0328dcbb791b/ea6a3450-6d61-4b3f-9274-f1f0c241fa5c" target="_blank" rel="noopener noreferrer">Code Snippet that Generates this Chart</a>

Yep, there is a slight difference in distribution: even more headlines are classified as neutral (85%), and the share of negative news headlines has increased (to 13%).
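If you want to compare the two analyzers head-on rather than eyeballing two bar charts, a confusion-style crosstab works well. This is a minimal sketch with hypothetical column names, since the code above overwrites the same <em>polarity</em> column for both tools:

<pre>
import pandas as pd

# Hypothetical columns: assumes the TextBlob labels were kept as
# news['polarity_tb'] and the VADER labels as news['polarity_vader'],
# instead of reusing the single 'polarity' column as above.
pd.crosstab(news['polarity_tb'], news['polarity_vader'],
            rownames=['TextBlob'], colnames=['VADER'])
</pre>

The diagonal counts the headlines on which the two tools agree; large off-diagonal cells show exactly where they disagree.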
<h2>Named Entity Recognition</h2>

Named entity recognition is an information extraction method in which entities present in the text are classified into predefined entity types like "Person", "Place", "Organization", etc. By using <strong>NER we can get great insights about the types of entities present in a given text dataset</strong>.

Let's consider an example: in a news story about the Reserve Bank of India, a named entity recognition model should be able to identify entities such as RBI as an organization, and Mumbai and India as places.

There are three standard libraries for Named Entity Recognition:

<ul>
<li><a href="https://nlp.stanford.edu/software/CRF-NER.shtml" target="_blank" rel="noopener noreferrer">Stanford NER</a></li>
<li><a href="https://spacy.io/" target="_blank" rel="noopener noreferrer">spaCy</a></li>
<li><a href="https://www.nltk.org/" target="_blank" rel="noopener noreferrer">NLTK</a></li>
</ul>

In this tutorial, <strong>I will use spaCy</strong>, an open-source library for advanced natural language processing tasks. It is written in Cython and is known for its industrial applications.
Besides NER, <strong>spaCy provides many other functionalities like POS tagging, word-to-vector transformations, etc.</strong>

<a href="https://spacy.io/api/annotation#section-named-entities" target="_blank" rel="noopener noreferrer">spaCy's named entity recognition</a> has been trained on the <a href="https://catalog.ldc.upenn.edu/LDC2013T19" target="_blank" rel="noopener noreferrer">OntoNotes 5</a> corpus and supports entity types such as PERSON, NORP, ORG, GPE, LOC, PRODUCT, EVENT, WORK_OF_ART, LAW, LANGUAGE, DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL, and CARDINAL; you can also list them programmatically, as shown below.
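If you'd rather not memorize that list, a small sketch like this prints the labels straight from a loaded pipeline. It assumes the <em>en_core_web_sm</em> model, introduced just below, is already installed:

<pre>
import spacy

nlp = spacy.load('en_core_web_sm')
# The 'ner' pipeline component exposes the entity labels it was trained on.
print(nlp.get_pipe('ner').labels)
</pre>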
There are three <a href="https://spacy.io/models/en/" target="_blank" rel="noopener noreferrer">pre-trained models for English</a> in spaCy. I will use <em>en_core_web_sm</em> for our task, but you can try the other models. To use it, we have to download it first:

<pre>
python -m spacy download en_core_web_sm
</pre>

Now we can initialize the language model:

<pre>
import spacy

nlp = spacy.load("en_core_web_sm")
</pre>

One of the nice things about spaCy is that we only need to apply the <em>nlp</em> function once; the entire background pipeline will return all the objects we need.

<pre>
doc = nlp('India and Iran have agreed to boost the economic viability '
          'of the strategic Chabahar port through various measures, '
          'including larger subsidies to merchant shipping firms using the facility, '
          'people familiar with the development said on Thursday.')

[(x.text, x.label_) for x in doc.ents]
</pre>

We can see that India and Iran are recognized as geographical locations (GPE), Chabahar as a person, and Thursday as a date. We can also visualize the output using the <em>displacy</em> module in spaCy.
<pre>
from spacy import displacy

displacy.render(doc, style='ent')
</pre>

This creates a very neat <strong>visualization of the sentence with the recognized entities</strong>, where each entity type is marked in a different color.

Now that we know how to perform NER, we can explore the data even further by doing a variety of visualizations on the named entities extracted from our dataset.

First, we will <strong>run the named entity recognition on our news headlines</strong> and store the entity types.

<pre>
from collections import Counter

def ner(text):
    doc = nlp(text)
    return [X.label_ for X in doc.ents]

ent = news['headline_text'].apply(lambda x: ner(x))
ent = [x for sub in ent for x in sub]

counter = Counter(ent)
count = counter.most_common()
</pre>

Now, we can visualize the entity frequencies:

<pre>
x, y = map(list, zip(*count))
sns.barplot(x=y, y=x)
</pre>
<a href="https://ui.neptune.ai/o/neptune-ml/org/eda-nlp-tools/n/6-0-named-entity-barchart-9012f4a0-3761-4ebf-9c25-d4f363858010/ac08ec73-ddd3-4a42-a35b-b2311eb9d075" target="_blank" rel="noopener noreferrer">Code Snippet that Generates this Chart</a>

We can see that GPE and ORG dominate the news headlines, followed by the PERSON entity.

We can also <strong>visualize the most common tokens per entity.</strong> Let's check which places appear the most in news headlines.

<pre>
# redefine ner() to return the entity text for a chosen entity type
def ner(text, ent="GPE"):
    doc = nlp(text)
    return [X.text for X in doc.ents if X.label_ == ent]

gpe = news['headline_text'].apply(lambda x: ner(x))
gpe = [i for x in gpe for i in x]
counter = Counter(gpe)

x, y = map(list, zip(*counter.most_common(10)))
sns.barplot(x=y, y=x)
</pre>

<a href="https://ui.neptune.ai/o/neptune-ml/org/eda-nlp-tools/n/6-1-most-common-named-entity-barchart-0614fdac-0400-4460-ac3a-b3c5669906a0/4d3d398d-df9d-484c-97ec-07390ba4dd21" target="_blank" rel="noopener noreferrer">Code Snippet that Generates this Chart</a>
I think we can confirm that "us" means the USA in news headlines. Let's also find the most common names that appear in the headlines.

<pre>
per = news['headline_text'].apply(lambda x: ner(x, "PERSON"))
per = [i for x in per for i in x]
counter = Counter(per)

x, y = map(list, zip(*counter.most_common(10)))
sns.barplot(x=y, y=x)
</pre>

<a href="https://ui.neptune.ai/o/neptune-ml/org/eda-nlp-tools/n/6-1-most-common-named-entity-barchart-0614fdac-0400-4460-ac3a-b3c5669906a0/4d3d398d-df9d-484c-97ec-07390ba4dd21" target="_blank" rel="noopener noreferrer">Code Snippet that Generates this Chart</a>

Saddam Hussein and George Bush were the presidents of Iraq and the USA during wartime.
Also, we can see that the model is far from perfect, classifying <em>"vic govt"</em> or <em>"nsw govt"</em> as a person rather than a government agency.

<h2>Exploration through Parts of Speech Tagging in Python</h2>

Parts of speech (POS) tagging is a <strong>method that assigns part of speech labels to words in a sentence.</strong> There are eight main parts of speech:

<ul>
<li>Noun (NN): Joseph, London, table, cat, teacher, pen, city</li>
<li>Verb (VB): read, speak, run, eat, play, live, walk, have, like, are, is</li>
<li>Adjective (JJ): beautiful, happy, sad, young, fun, three</li>
<li>Adverb (RB): slowly, quietly, very, always, never, too, well, tomorrow</li>
<li>Preposition (IN): at, on, in, from, with, near, between, about, under</li>
<li>Conjunction (CC): and, or, but, because, so, yet, unless, since, if</li>
<li>Pronoun (PRP): I, you, we, they, he, she, it, me, us, them, him, her, this</li>
<li>Interjection (INT): Ouch! Wow! Great! Help! Oh! Hey! Hi!</li>
</ul>

This is not a straightforward task, as the same word may be used in different sentences in different contexts; the short sketch below shows one such ambiguity.
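As a quick illustration of that ambiguity (the sentences are my own examples, and the exact tags can vary with the tagger version), the same surface word can receive different tags depending on context:

<pre>
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# 'book' is typically tagged as a noun (NN) in the first sentence
# and as a verb (VB) after the modal 'will' in the second.
print(nltk.pos_tag(word_tokenize("She read a good book")))
print(nltk.pos_tag(word_tokenize("I will book a flight tomorrow")))
</pre>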
However, once you do it, there are a lot of helpful visualizations you can create to get additional insights into your dataset.

<strong>I will use NLTK to do the parts of speech tagging</strong>, but there are other libraries that do a good job too (spaCy, TextBlob).

Let's look at an example.

<pre>
import nltk
from nltk.tokenize import word_tokenize

sentence = "The greatest comeback stories in 2019"
tokens = word_tokenize(sentence)
nltk.pos_tag(tokens)
</pre>

<strong>Note:</strong> You can also visualize the sentence's parts of speech and its dependency graph with the <em>spacy.displacy</em> module.

<pre>
doc = nlp('The greatest comeback stories in 2019')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})
</pre>

We can observe various dependency tags here.
For example, the <em>det</em> tag denotes the relationship between the determiner "the" and the noun "stories".

You can check the list of dependency tags and their meanings <a href="https://universaldependencies.org/u/dep/index.html" target="_blank" rel="noopener noreferrer">here</a>.

Ok, now that we know what POS tagging is, let's use it to explore our headlines dataset.

<pre>
def pos(text):
    pos = nltk.pos_tag(word_tokenize(text))
    pos = list(map(list, zip(*pos)))[1]
    return pos

tags = news['headline_text'].apply(lambda x: pos(x))
tags = [x for l in tags for x in l]
counter = Counter(tags)

x, y = list(map(list, zip(*counter.most_common(7))))
sns.barplot(x=y, y=x)
</pre>

<a href="https://ui.neptune.ai/o/neptune-ml/org/eda-nlp-tools/n/7-0-parts-of-speach-barchart-9140250c-50d2-4343-b5e2-b3f5fb9c2089/15b07733-f02d-4a7c-b2fc-05ecffdf3e7b" target="_blank" rel="noopener noreferrer">Code Snippet that Generates this Chart</a>

We can clearly see that the noun (NN) dominates in news headlines, followed by the adjective (JJ).
This is typical for news articles, while artistic forms often show a <strong>higher adjective (JJ) frequency</strong>. You can dig deeper into this by investigating <strong>which singular nouns occur most commonly in news headlines.</strong> Let's find out.

<pre>
def get_nouns(text):
    # collect all tokens tagged as singular nouns (NN)
    nouns = []
    pos = nltk.pos_tag(word_tokenize(text))
    for word, tag in pos:
        if tag == 'NN':
            nouns.append(word)
    return nouns

words = news['headline_text'].apply(lambda x: get_nouns(x))
words = [x for l in words for x in l]
counter = Counter(words)

x, y = list(map(list, zip(*counter.most_common(7))))
sns.barplot(x=y, y=x)
</pre>

<a href="https://ui.neptune.ai/o/neptune-ml/org/eda-nlp-tools/n/7-1-most-common-part-of-speach-barchart-3f896e91-e21c-4ea7-811f-02acb497479f/a41302f3-8803-47ce-98e5-bb1a73eda5cc" target="_blank" rel="noopener noreferrer">Code Snippet that Generates this Chart</a>

Nouns such as <em>"war", "iraq", "man"</em> dominate the news headlines.
You can visualize and examine other parts of speech using the same approach; a parameterized version of the function above is sketched below.
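Here is a small generalization of the earlier noun extractor. The function name and default tag are my own choices, not from the original analysis:

<pre>
def get_words_by_tag(text, tag='JJ'):
    # return all tokens whose Penn Treebank tag matches `tag`
    return [word for word, t in nltk.pos_tag(word_tokenize(text)) if t == tag]

# e.g. the most common adjectives in the headlines
adjs = news['headline_text'].apply(lambda x: get_words_by_tag(x, 'JJ'))
adjs = [w for l in adjs for w in l]
x, y = list(map(list, zip(*Counter(adjs).most_common(7))))
sns.barplot(x=y, y=x)
</pre>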
<h2>Exploring through Text Complexity</h2>

It can be very informative to know <strong>how readable (difficult to read) a text is</strong> and what type of reader can fully understand it. Do we need a college degree to understand the message, or can a first-grader clearly see what the point is?

You can actually put a number, called a readability index, on a document or text. A <strong>readability index is a numeric value that indicates how difficult (or easy) it is to read and understand a text.</strong>

There are many readability score formulas available for the English language. Some of the most prominent ones are:

<table>
<tbody>
<tr><th>Readability Test</th><th>Interpretation</th><th>Formula</th></tr>
<tr><td>Automated Readability Index (ARI)</td><td>The output is an approximate representation of the U.S. grade level needed to comprehend a text.</td><td>ARI = 4.71 * (characters/words) + 0.5 * (words/sentences) - 21.43</td></tr>
<tr><td>Flesch Reading Ease (FRE)</td><td>Higher scores indicate material that is easier to read; lower numbers mark harder-to-read passages: 0-30 college, 50-60 high school, 60+ fourth grade.</td><td>FRE = 206.835 - 1.015 * (total words/total sentences) - 84.6 * (total syllables/total words)</td></tr>
<tr><td>Flesch-Kincaid Grade Level (FKGL)</td><td>The result is a number that corresponds to a U.S. grade level.</td><td>FKGL = 0.39 * (total words/total sentences) + 11.8 * (total syllables/total words) - 15.59</td></tr>
<tr><td>Gunning Fog Index (GFI)</td><td>The result is a number that corresponds to a U.S. grade level.</td><td>GFI = 0.4 * ((words/sentences) + 100 * (complex words/words))</td></tr>
</tbody>
</table>
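To make the formulas above concrete, here is a minimal sketch of the ARI formula applied to a single headline. It is a naive implementation (it counts alphanumeric characters and splits sentences on terminal punctuation), so treat it as illustrative rather than a reference implementation:

<pre>
import re

def ari(text):
    # naive counts: alphanumeric characters, whitespace-split words,
    # sentences split on '.', '!' and '?'
    characters = sum(ch.isalnum() for ch in text)
    words = len(text.split())
    sentences = max(1, len([s for s in re.split(r'[.!?]+', text) if s.strip()]))
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

print(ari('aba decides against community broadcasting licence'))
</pre>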
<a href="https://github.com/shivam5992/textstat" target="_blank" rel="noopener noreferrer">Textstat</a> is a cool Python library that provides implementations of all these text statistics. Let's use Textstat to compute the Flesch Reading Ease index and plot a histogram of the scores.

<pre>
from textstat import flesch_reading_ease

# store the scores so we can reuse them below
reading = news['headline_text'].apply(lambda x: flesch_reading_ease(x))
reading.hist()
</pre>

<a href="https://ui.neptune.ai/o/neptune-ml/org/eda-nlp-tools/n/8-0-text-complexity-histogram-b00e38f2-5710-4efe-85c2-77b6366dbe3b/b6b8dd8f-3ec6-4fd1-a548-7daf889444e5" target="_blank" rel="noopener noreferrer">Code Snippet that Generates this Chart</a>

Almost all of the readability scores fall above 60. This means that an average 11-year-old student can read and understand the news headlines. Let's check all the news headlines that have a readability score below 5.

<pre>
x = [i for i in range(len(reading)) if reading[i] < 5]
news.iloc[x]['headline_text'].head()
</pre>

You can see some of the complex words being used in news headlines, like <em>"capitulation", "interim", "entrapment"</em>, etc. These words may have caused the scores to fall under 5.
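Textstat also implements the other indices from the table, so it is easy to sanity-check a headline against several formulas at once. A quick sketch (the function names come from the textstat package):

<pre>
from textstat import (flesch_reading_ease, flesch_kincaid_grade,
                      gunning_fog, automated_readability_index)

headline = news['headline_text'].iloc[0]
print('FRE :', flesch_reading_ease(headline))
print('FKGL:', flesch_kincaid_grade(headline))
print('GFI :', gunning_fog(headline))
print('ARI :', automated_readability_index(headline))
</pre>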
<h2>Final Thoughts</h2>

In this article, we discussed and implemented various exploratory data analysis methods for text data. Some are common, some lesser-known, but all of them could be a great addition to your data exploration toolkit.

Hopefully, you will find some of them useful in your current and future projects.

To make data exploration even easier, I have created an <strong>"Exploratory Data Analysis for Natural Language Processing Template"</strong> that you can use for your work.

<a href="https://ui.neptune.ai/o/neptune-ml/org/eda-nlp-tools/n/8-0-text-complexity-histogram-b00e38f2-5710-4efe-85c2-77b6366dbe3b/95b28cf3-a123-4104-bdd9-358e123b8a58" target="_blank" rel="noopener noreferrer">Get Exploratory Data Analysis for Natural Language Processing Template</a>

Also, as you may have seen already, <strong>for every chart in this article there is a code snippet</strong> that creates it; just click on the link below a chart.

Happy exploring!