{"id":9380,"date":"2020-08-20T09:03:53","date_gmt":"2020-08-20T09:03:53","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/?p=9380"},"modified":"2023-11-17T10:50:51","modified_gmt":"2023-11-17T10:50:51","slug":"text-preprocessing-for-nlp-and-machine-learning-tasks","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/text-preprocessing-for-nlp-and-machine-learning-tasks\/","title":{"rendered":"Text Preprocessing For NLP and Machine Learning Tasks"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"9380\" class=\"elementor elementor-9380\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-58c82bb5 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"58c82bb5\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-3af77e6a\" data-id=\"3af77e6a\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-7864e5a3 elementor-widget elementor-widget-text-editor\" data-id=\"7864e5a3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p>As soon as you start working on a data science task you realize the dependence of your results on the data quality. The initial step \u2014 data preparation \u2014 of any data science project sets the basis for effective performance of any sophisticated algorithm.<\/p>\n\n\n\n<p>In textual data science tasks, this means that any raw text needs to be carefully preprocessed before the algorithm can digest it. In the most general terms, we take some predetermined body of text and perform upon it some basic analysis and transformations, in order to be left with artefacts which will be much more useful for a more meaningful analytic task afterward.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f83e4dc elementor-widget elementor-widget-text-editor\" data-id=\"f83e4dc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>The preprocessing usually consists of several steps that depend on a given task and the text, but can be roughly categorized into segmentation, cleaning, normalization, annotation and analysis.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul>\n<li><strong>Segmentation<\/strong>, lexical analysis, or tokenization, is the process that splits longer strings of text into smaller pieces, or tokens. Chunks of text can be tokenized into sentences, sentences can be tokenized into words, etc.<\/li>\n<li><strong>Cleaning<\/strong>\u00a0consists of getting rid of the less useful parts of text through stop-word removal, dealing with capitalization and characters and other details.<\/li>\n<li><strong>Normalization<\/strong>\u00a0consists of the translation (mapping) of terms in the scheme or linguistic reductions through stemming, lemmatization and other forms of standardization.<\/li>\n<li><strong>Annotation<\/strong>\u00a0consists of the application of a scheme to texts. Annotations may include labeling, adding markups, or part-of-speech tagging.<\/li>\n<li><strong>Analysis<\/strong>\u00a0means statistically probing, manipulating and generalizing from the dataset for feature analysis and trying to extract relationships between words.<\/li>\n<\/ul>\n<!-- \/wp:list -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1587c63 elementor-widget elementor-widget-image\" data-id=\"1587c63\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/08\/Home-448x253_5.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0fc47e6 elementor-widget elementor-widget-heading\" data-id=\"0fc47e6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"41de\"><strong>Segmentation<\/strong><\/h2>\n<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0ca1792 elementor-widget elementor-widget-text-editor\" data-id=\"0ca1792\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Sometimes\u00a0<em>segmentation<\/em>\u00a0is used to refer to the breakdown of a text into pieces larger than words, such as paragraphs and sentences, while\u00a0<em>tokenization<\/em>\u00a0is reserved for the breakdown process which results exclusively in words.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>This may sound like a straightforward process, but in reality it is anything but. Do you need a sentence or a phrase? And what is a phrase then? How are sentences identified within larger bodies of text? The school grammar suggests that sentences have \u201csentence-ending punctuation\u201d. But for machines the point is the same be it at the end of an abbreviation or of a sentence.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><em>\u201cShall we call Mr. Brown?\u201d<\/em>\u00a0can easily fall into two sentences if abbreviations are not taken care of.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a4d0b34 elementor-widget elementor-widget-text-editor\" data-id=\"a4d0b34\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Sometimes\u00a0<em>segmentation<\/em>\u00a0is used to refer to the breakdown of a text into pieces larger than words, such as paragraphs and sentences, while\u00a0<em>tokenization<\/em>\u00a0is reserved for the breakdown process which results exclusively in words.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>This may sound like a straightforward process, but in reality it is anything but. Do you need a sentence or a phrase? And what is a phrase then? How are sentences identified within larger bodies of text? The school grammar suggests that sentences have \u201csentence-ending punctuation\u201d. But for machines the point is the same be it at the end of an abbreviation or of a sentence.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><em>\u201cShall we call Mr. Brown?\u201d<\/em>\u00a0can easily fall into two sentences if abbreviations are not taken care of.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e053949 elementor-widget elementor-widget-heading\" data-id=\"e053949\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"f5d9\"><strong>Cleaning<\/strong><\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a752a1c elementor-widget elementor-widget-text-editor\" data-id=\"a752a1c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>The process of cleaning helps put all text on equal footing, involving relatively simple ideas of substitution or removal:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul>\n<li>setting all characters to lowercase<\/li>\n<li>noise removal, including removing numbers and punctuation (it is a part of tokenization, but still worth keeping in mind at this stage)<\/li>\n<li>stop words removal (language-specific)<\/li>\n<\/ul>\n<!-- \/wp:list -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b10dd3e elementor-widget elementor-widget-image\" data-id=\"b10dd3e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/08\/Home-448x253_5.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-403ccd4 elementor-widget elementor-widget-heading\" data-id=\"403ccd4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"134a\">Lowercasing<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-09926ed elementor-widget elementor-widget-text-editor\" data-id=\"09926ed\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Text often has a variety of capitalization reflecting the beginning of sentences or proper nouns emphasis. The common approach is to reduce everything to lower case for simplicity. Lowercasing is applicable to most text mining and NLP tasks and significantly helps with consistency of the output. However, it is important to remember that some words, like \u201cUS\u201d and \u201cus\u201d, can change meanings when reduced to the lower case.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3f06e2a elementor-widget elementor-widget-heading\" data-id=\"3f06e2a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"0890\">Noise Removal<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-393d704 elementor-widget elementor-widget-text-editor\" data-id=\"393d704\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Noise removal refers to removing characters digits and pieces of text that can interfere with the text analysis. There are various ways to remove noise, including punctuation removal<em>,\u00a0<\/em>special character removal<em>,\u00a0<\/em>numbers removal, html formatting removal, domain specific keyword removal, source code removal, and more. Noise removal is highly domain dependent. For example, in tweets, noise could be all special characters except hashtags as they signify concepts that can characterize a tweet. We should also remember that strategies may vary depending on the specific task: for example, numbers can be either removed or converted to textual representations.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0f951a1 elementor-widget elementor-widget-heading\" data-id=\"0f951a1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"c500\">Stop-word removal<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7b5aca9 elementor-widget elementor-widget-text-editor\" data-id=\"7b5aca9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Stop words are a set of commonly used words in a language like \u201ca\u201d, \u201cthe\u201d, \u201cis\u201d, \u201care\u201d and etc in English. These words do not carry important meaning and are removed from texts in many data science tasks. The intuition behind this approach is that, by removing low information words from text, we can focus on the important words instead. Besides, it reduces the number of features in consideration which helps keep your models better sized. Stop word removal is commonly applied in search systems, text classification applications, topic modeling, topic extraction and others.<a href=\"http:\/\/kavita-ganesan.com\/what-are-stop-words\/\" target=\"_blank\" rel=\"noreferrer noopener\">\u00a0Stop word lists<\/a>\u00a0can come from pre-established sets or you can create a<a href=\"http:\/\/kavita-ganesan.com\/tips-for-constructing-custom-stop-word-lists\/\" target=\"_blank\" rel=\"noreferrer noopener\">\u00a0custom one for your domain<\/a>.<\/p>\n<!-- \/wp:paragraph -->\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-af6b7fb elementor-widget elementor-widget-heading\" data-id=\"af6b7fb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"f86e\"><strong>Normalization<\/strong><\/h2>\n<!-- \/wp:heading --><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3ed999d elementor-widget elementor-widget-text-editor\" data-id=\"3ed999d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Normalization puts all words on equal footing, and allows processing to proceed uniformly. It is closely related to cleaning, but brings the process a step forward putting all words on equal footing by stemming and lemmatizing them.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1f8d231 elementor-widget elementor-widget-heading\" data-id=\"1f8d231\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"0d84\"><strong><em>Stemming<\/em><\/strong><\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-10851c6 elementor-widget elementor-widget-text-editor\" data-id=\"10851c6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Stemming is the process of eliminating affixes (suffixes, prefixes, infixes, circumfixes) from a word in order to obtain a word stem. The results can be used to identify relationships and commonalities across large datasets. There are several stemming models, including Porter and Snowball. The danger here lies in the possibility of overstemming where words like \u201cuniverse\u201d and \u201cuniversity\u201d are reduced to the same root of \u201cunivers\u201d.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9d35df2 elementor-widget elementor-widget-heading\" data-id=\"9d35df2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"ab7d\"><strong><em>Lemmatization<\/em><\/strong><\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-576d81b elementor-widget elementor-widget-text-editor\" data-id=\"576d81b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Lemmatization is related to stemming, but it is able to capture canonical forms based on a word\u2019s lemma. By determining the part of speech and utilizing special tools, like WordNet\u2019s lexical database of English, lemmatization can get better results:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The stemmed form of leafs is: leaf<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The stemmed form of leaves is: leav<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The lemmatized form of leafs is: leaf<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The lemmatized form of leaves is: leaf<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Stemming may be more useful in queries for databases whereas lemmazation may work much better when trying to determine text sentiment.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bcc83b4 elementor-widget elementor-widget-image\" data-id=\"bcc83b4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/08\/Home-448x253_5.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8aca37c elementor-widget elementor-widget-heading\" data-id=\"8aca37c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"d567\"><strong>Annotation<\/strong><\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ea00c93 elementor-widget elementor-widget-text-editor\" data-id=\"ea00c93\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Text annotation is a sophisticated and task-specific process of providing text with relevant markups. The most common and general practice is to add part-of-speech (POS) tags to the words.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4c91a25 elementor-widget elementor-widget-heading\" data-id=\"4c91a25\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"0de4\"><strong><em>Part-of-speech tagging<\/em><\/strong><\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8d5b7a5 elementor-widget elementor-widget-text-editor\" data-id=\"8d5b7a5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Understanding parts of speech can make a difference in determining the meaning of a sentence as it provides more granular information about the words. For example, in a document classification problem, the appearance of the word\u00a0<strong>book<\/strong>\u00a0as a\u00a0<strong>noun<\/strong>\u00a0could result in a different classification than\u00a0<strong>book<\/strong>\u00a0as a\u00a0<strong>verb<\/strong>. Part-of-speech tagging tries to assign a part of speech (such as nouns, verbs, adjectives, and others) to each word of a given text based on its definition and the context. It often requires looking at the proceeding and following words and combined with either a rule-based or stochastic method.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6fb80b5 elementor-widget elementor-widget-heading\" data-id=\"6fb80b5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"c7fb\"><strong>Analysis<\/strong><\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b84de8b elementor-widget elementor-widget-text-editor\" data-id=\"b84de8b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Finally, before actual model training, we can explore our data for extracting features that might be used in model building.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4f2a213 elementor-widget elementor-widget-heading\" data-id=\"4f2a213\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"49db\">Count<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-846ebee elementor-widget elementor-widget-text-editor\" data-id=\"846ebee\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>This is perhaps one of the more basic <a href=\"https:\/\/www.experfy.com\/blog\/top-tools-and-strategies-for-the-future-of-remote-work\/\" target=\"_blank\" rel=\"noreferrer noopener\">tools <\/a>for feature engineering. Adding such statistical information as word count, sentence count, punctuation counts and industry-specific word counts can greatly help in prediction or classification.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-71201d7 elementor-widget elementor-widget-heading\" data-id=\"71201d7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"e45a\"><strong><em>Chunking (shallow parsing)<\/em><\/strong><\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7865fb1 elementor-widget elementor-widget-text-editor\" data-id=\"7865fb1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<!-- wp:paragraph -->\n<p>Chunking is a process that identifies constituent parts of sentences, such as nouns, verbs, adjectives, etc. and links them to higher order units that have discrete grammatical meanings, for example, noun groups or phrases, verb groups, etc..<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-16fd6c3 elementor-widget elementor-widget-heading\" data-id=\"16fd6c3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"ad64\">Collocation extraction<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6d5afa0 elementor-widget elementor-widget-text-editor\" data-id=\"6d5afa0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Collocations are more or less stable word combinations, such as \u201cbreak the rules,\u201d \u201cfree time,\u201d \u201cdraw a conclusion,\u201d \u201ckeep in mind,\u201d \u201cget ready,\u201d and so on. As they usually convey a specific established meaning it is worthwhile to extract them before the analysis.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6e1c5c2 elementor-widget elementor-widget-heading\" data-id=\"6e1c5c2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"23d8\">Word Embedding\/Text Vectors<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-573424a elementor-widget elementor-widget-text-editor\" data-id=\"573424a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Word embedding is the modern way of representing words as vectors to redefine the high dimensional word features into low dimensional feature vectors. In other words, it represents words at an X and Y vector coordinate where related words, based on a corpus of relationships, are placed closer together.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:separator --><hr class=\"wp-block-separator\" \/><!-- \/wp:separator -->\n\n<!-- wp:paragraph -->\n<p>Preparing a text for analysis is a complicated art which requires choosing optimal tools depending on the text properties and the task. There are multiple pre-built libraries and services for the most popular languages used in data science that help automate text pre-processing, however, certain steps will still require manually mapping terms, rules and words.<\/p>\n<!-- \/wp:paragraph -->\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Preparing a text for analysis is a complicated art which requires choosing optimal tools depending on the text properties and the task. There are multiple pre-built libraries and services for the most popular languages used in data science that help automate text pre-processing, however, certain steps will still require manually mapping terms, rules and words. <\/p>\n","protected":false},"author":570,"featured_media":23500,"comment_status":"open","ping_status":"open","sticky":false,"template":"single-post-2.php","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[563,97,94,92,474],"ppma_author":[3261],"class_list":["post-9380","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-algorithms","tag-artificial-intelligence","tag-data-science","tag-machine-learning","tag-nlp"],"authors":[{"term_id":3261,"user_id":570,"is_guest":0,"slug":"max-ved","display_name":"Max Ved","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/04\/medium_cbaf23d5-a78a-4ceb-8f6e-343134811364-150x150.jpg","user_url":"https:\/\/sciforce.solutions\/","last_name":"Ved","first_name":"Max","job_title":"","description":"Max Ved, a Scientist Entrepreneur, is Co-Founder &amp; CTO at SciForce, an IT company specialized in the development of software solutions.\u00a0"}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/9380","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/570"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=9380"}],"version-history":[{"count":7,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/9380\/revisions"}],"predecessor-version":[{"id":34145,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/9380\/revisions\/34145"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/23500"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=9380"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=9380"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=9380"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=9380"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}