{"id":1222,"date":"2019-02-15T10:32:01","date_gmt":"2019-02-15T10:32:01","guid":{"rendered":"http:\/\/kusuaks7\/?p=827"},"modified":"2023-07-14T17:15:44","modified_gmt":"2023-07-14T17:15:44","slug":"pre-processing-in-natural-language-machine-learning","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/pre-processing-in-natural-language-machine-learning\/","title":{"rendered":"Pre-Processing in Natural Language Machine Learning"},"content":{"rendered":"<p><strong><em>Ready to learn Data Science? <a href=\"https:\/\/www.experfy.com\/training\/courses\">Browse courses<\/a>\u00a0like\u00a0<a href=\"https:\/\/www.experfy.com\/training\/tracks\/data-science-training-certification\">Data Science Training and Certification<\/a> developed by industry thought leaders and Experfy in Harvard Innovation Lab.<\/em><\/strong><\/p>\n<p>It is easy to forget how much data is stored in the conversations we have every day. With the evolution of the digital landscape, tapping into text, or Natural Language Processing (NLP), is a growing field in artificial intelligence and machine learning. This article covers the common pre-processing concepts applied to NLP problems.<\/p>\n<p>Text can come in a variety of forms from a list of individual words, to sentences to multiple paragraphs with special characters (like tweets for example). Like any data science problem, understand the questions that are being asked will inform what steps may be employed to transform words into numerical features that work with machine learning algorithms.<\/p>\n<h2>The History of\u00a0NPL<\/h2>\n<p>&gt;<\/p>\n<p>When I was a kid sci-fi almost always had a computer that you could bark orders and have them understood and sometime, but not always, executed. At the time the technology seemed far away in the future but today I carry a phone in my pocket that is smaller and more powerful than any of those imagined. 
The history of speech-to-text is long and complicated, but it was the seed for much of NLP.<\/p>\n<p>Early efforts required large amounts of manually coded vocabulary and linguistic rules. The first automatic translations from English to Russian, in 1954 at Georgetown, were limited to a handful of sentences.<\/p>\n<p>By 1964 the first chatbot,\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/ELIZA\" target=\"_blank\" rel=\"noopener noreferrer\">ELIZA<\/a>,\u00a0was created at MIT. Built on pattern matching and substitution, it mimicked the therapy process by asking open-ended questions. While it seemed to replicate awareness, it had no true contextualization of the conversation. Even with these limited capabilities, many were surprised by how human the interactions felt.<\/p>\n<p>Much of the growth in the field really started in the 1980&#8217;s with the introduction of machine learning algorithms. Researchers moved from the more rigid\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Transformational_grammar\" target=\"_blank\" rel=\"noopener noreferrer\">Transformational Grammar models<\/a>\u00a0into the looser probabilistic relationships described in\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Cache_language_model\" target=\"_blank\" rel=\"noopener noreferrer\">Cache Language Models<\/a>,\u00a0which allowed more rapid scaling and handled unfamiliar inputs with greater ease.<\/p>\n<p>Through the 90&#8217;s the exponential increase in computing power helped forge advancement, but it wasn\u2019t until 2011, when IBM\u2019s Watson went on Jeopardy, that the progress of computer intelligence was visible to the general public. For me it was the introduction of Siri on the iPhone in 2011 that made me realize the potential.<\/p>\n<p>The current NLP landscape could easily be its own article. Unprecedented investment from private companies and a general open source attitude have expanded something that was largely exclusive to a much larger audience and range of applications. 
One fascinating example is in the field of translation, where Google is working on translating any language on the fly (even if some of the bugs in the user experience need to be worked out).<\/p>\n<h2>The Importance of Pre-Processing<\/h2>\n<p>None of the magic described above happens without a lot of work on the back end. Transforming text into something an algorithm can digest is a complicated process. There are four different parts:<\/p>\n<p>Cleaning\u00a0consists of getting rid of the less useful parts of the text through stopword removal, dealing with capitalization and characters, and other details.<\/p>\n<p>Annotation\u00a0consists of the application of a scheme to texts. Annotations may include structural markup and\u00a0<a title=\"Lexical category\" href=\"https:\/\/en.wikipedia.org\/wiki\/Lexical_category\" target=\"_blank\" rel=\"noopener noreferrer\">part-of-speech<\/a>\u00a0tagging.<\/p>\n<p>Normalization\u00a0consists of the translation (mapping) of terms in the scheme, or linguistic reductions through Stemming, Lemmatization and other forms of standardization.<\/p>\n<p>Analysis\u00a0consists of statistically probing, manipulating and generalizing from the dataset for feature analysis.<\/p>\n<h2>The Tools<\/h2>\n<p>There are a variety of pre-processing methods. The list below is far from exhaustive, but it does give an idea of where to start. It is important to realize, as with all data problems, that converting anything into a format for machine learning reduces it to a generalized state, which means losing some of the fidelity of the data along the way. The true art is understanding the pros and cons of each method in order to carefully choose the right ones.<\/p>\n<h2>Capitalization<\/h2>\n<p>Text often has a variety of capitalization, reflecting the beginning of sentences, proper nouns and emphasis. 
The most common approach is to reduce everything to lower case for simplicity, but it is important to remember that some words, like \u201cUS\u201d and \u201cus\u201d, can change meaning when reduced to lower case.<\/p>\n<h2>Stopwords<\/h2>\n<p>A majority of the words in a given text are connecting parts of a sentence rather than showing subjects, objects or intent. Words like \u201cthe\u201d or \u201cand\u201d can be removed by comparing text to a list of stopwords.<\/p>\n<p>&nbsp;<\/p>\n<p>IN:<br \/>\n[&#8216;He&#8217;, &#8216;did&#8217;, &#8216;not&#8217;, &#8216;try&#8217;, &#8216;to&#8217;, &#8216;navigate&#8217;, &#8216;after&#8217;, &#8216;the&#8217;, &#8216;first&#8217;, &#8216;bold&#8217;, &#8216;flight&#8217;, &#8216;,&#8217;, &#8216;for&#8217;, &#8216;the&#8217;, &#8216;reaction&#8217;, &#8216;had&#8217;, &#8216;taken&#8217;, &#8216;something&#8217;, &#8216;out&#8217;, &#8216;of&#8217;, &#8216;his&#8217;, &#8216;soul&#8217;, &#8216;.&#8217;]<\/p>\n<p>&nbsp;<\/p>\n<p>OUT:<br \/>\n[&#8216;try&#8217;, &#8216;navigate&#8217;, &#8216;first&#8217;, &#8216;bold&#8217;, &#8216;flight&#8217;, &#8216;,&#8217;, &#8216;reaction&#8217;, &#8216;taken&#8217;, &#8216;something&#8217;, &#8216;soul&#8217;, &#8216;.&#8217;]<\/p>\n<p>In the example above the process reduced the list of 23 tokens to 11, but it is important to note that the word \u201cnot\u201d was dropped, which, depending on what I am working on, could be a large problem. One might create their own stopword dictionary manually or utilize prebuilt libraries, depending on the sensitivity required.<\/p>\n<h2>Tokenization<\/h2>\n<p>Tokenization describes splitting paragraphs into sentences, or sentences into individual words. For the former,\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Sentence_boundary_disambiguation\" target=\"_blank\" rel=\"noopener noreferrer\">Sentence Boundary Disambiguation<\/a>\u00a0(SBD) can be applied to create a list of individual sentences. 
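As a rough sketch of the idea, a naive sentence split can be done with a regular expression: break after a `.`, `!` or `?` followed by whitespace. The function name here is invented for illustration, and real SBD models are trained to handle abbreviations and other edge cases that this sketch is not.

```python
import re

def naive_sent_tokenize(text):
    # Naive sentence boundary disambiguation: split after ., ! or ?
    # followed by whitespace. Trained models (e.g. NLTK's Punkt) also
    # handle abbreviations like "Dr." -- this sketch does not.
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = ("He did not try to navigate after the first bold flight, "
        "for the reaction had taken something out of his soul. "
        "And from their high summits, one by one, drop everlasting dews.")

sentences = naive_sent_tokenize(text)
print(len(sentences))  # 2
```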
This relies on pre-trained, language-specific algorithms like the Punkt models from NLTK.<\/p>\n<p>Sentences can be split into individual words and punctuation through a similar process. Most commonly this split happens across white spaces, for example:<\/p>\n<p>&nbsp;<\/p>\n<p>IN:<\/p>\n<p>&#8220;He did not try to navigate after the first bold flight, for the reaction\u00a0had taken something out of his soul.&#8221;<\/p>\n<p>&nbsp;<\/p>\n<p>OUT:<\/p>\n<p>[&#8216;He&#8217;, &#8216;did&#8217;, &#8216;not&#8217;, &#8216;try&#8217;, &#8216;to&#8217;, &#8216;navigate&#8217;, &#8216;after&#8217;, &#8216;the&#8217;, &#8216;first&#8217;, &#8216;bold&#8217;, &#8216;flight&#8217;, &#8216;,&#8217;, &#8216;for&#8217;, &#8216;the&#8217;, &#8216;reaction&#8217;, &#8216;had&#8217;, &#8216;taken&#8217;, &#8216;something&#8217;, &#8216;out&#8217;, &#8216;of&#8217;, &#8216;his&#8217;, &#8216;soul&#8217;, &#8216;.&#8217;]<\/p>\n<p>&nbsp;<\/p>\n<p>There are occasions where this can cause problems, such as when a word is abbreviated, truncated or possessive. Proper nouns may also suffer in the case of names that use punctuation (like O\u2019Neil).<\/p>\n<h2>Parts of Speech\u00a0Tagging<\/h2>\n<p>Understanding parts of speech can make a difference in determining the meaning of a sentence. Part of Speech (POS) tagging often requires looking at the preceding and following words, combined with either a rule-based or stochastic method. 
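A toy illustration of the rule-based idea: look each token up in a lexicon and use the preceding tag as context to disambiguate. The mini-lexicon and function name below are invented for illustration; a real tagger such as NLTK's `pos_tag` is trained on a large annotated corpus.

```python
# Tiny invented lexicon mapping words to their most common tag.
LEXICON = {
    "the": "DT", "their": "PRP$", "from": "IN", "by": "IN",
    "one": "CD", "drop": "VB", "dews": "NNS", "everlasting": "JJ",
}

def toy_pos_tag(tokens):
    tagged = []
    for tok in tokens:
        tag = LEXICON.get(tok.lower(), "NN")  # unknown words default to noun
        # Contextual rule: a "verb" right after a determiner or possessive
        # pronoun is more likely being used as a noun ("the drop").
        if tagged and tagged[-1][1] in ("DT", "PRP$") and tag == "VB":
            tag = "NN"
        tagged.append((tok, tag))
    return tagged

print(toy_pos_tag(["the", "drop"]))   # [('the', 'DT'), ('drop', 'NN')]
print(toy_pos_tag(["drop", "dews"]))  # [('drop', 'VB'), ('dews', 'NNS')]
```

The same word receives a different tag depending on its neighbor, which is the core of the context-based approach.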
It can then be combined with other processes for more feature engineering.<\/p>\n<p>IN:<br \/>\n[&#8216;And&#8217;, &#8216;from&#8217;, &#8216;their&#8217;, &#8216;high&#8217;, &#8216;summits&#8217;, &#8216;,&#8217;, &#8216;one&#8217;, &#8216;by&#8217;, &#8216;one&#8217;, &#8216;,&#8217;, &#8216;drop&#8217;, &#8216;everlasting&#8217;, &#8216;dews&#8217;, &#8216;.&#8217;]<\/p>\n<p>&nbsp;<\/p>\n<p>OUT:<br \/>\n[(&#8216;And&#8217;, &#8216;CC&#8217;),<br \/>\n(&#8216;from&#8217;, &#8216;IN&#8217;),<br \/>\n(&#8216;their&#8217;, &#8216;PRP$&#8217;),<br \/>\n(&#8216;high&#8217;, &#8216;JJ&#8217;),<br \/>\n(&#8216;summits&#8217;, &#8216;NNS&#8217;),<br \/>\n(&#8216;,&#8217;, &#8216;,&#8217;),<br \/>\n(&#8216;one&#8217;, &#8216;CD&#8217;),<br \/>\n(&#8216;by&#8217;, &#8216;IN&#8217;),<br \/>\n(&#8216;one&#8217;, &#8216;CD&#8217;),<br \/>\n(&#8216;,&#8217;, &#8216;,&#8217;),<br \/>\n(&#8216;drop&#8217;, &#8216;NN&#8217;),<br \/>\n(&#8216;everlasting&#8217;, &#8216;VBG&#8217;),<br \/>\n(&#8216;dews&#8217;, &#8216;NNS&#8217;),<br \/>\n(&#8216;.&#8217;, &#8216;.&#8217;)]<\/p>\n<p>&nbsp;<\/p>\n<p>Definitions of Parts of Speech<br \/>\n(&#8216;their&#8217;, &#8216;PRP$&#8217;) PRP$: pronoun, possessive<br \/>\nher his mine my our ours their thy your<\/p>\n<h2>Stemming<\/h2>\n<p>Much of natural language machine learning is about the sentiment of the text. Stemming is a process where words are reduced to a root by removing inflection through dropping unnecessary characters, usually a suffix. There are several stemming models, including Porter and Snowball. 
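The suffix-stripping mechanism can be sketched in a few lines. This crude version (the suffix list and minimum-length rule are invented for illustration) only hints at the idea; Porter and Snowball apply much richer, ordered rule sets with conditions on what remains of the word.

```python
# Drastically simplified suffix-stripping stemmer: drop a known suffix
# if what remains is at least three characters long.
SUFFIXES = ("ing", "ed", "es", "s", "e")

def crude_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["occurred", "fumbling", "mistakes"]])
# ['occurr', 'fumbl', 'mistak']
```

Note that, just as in the article's example below, the "roots" produced are often not real words.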
The results can be used to identify relationships and commonalities across large datasets.<\/p>\n<p>&nbsp;<\/p>\n<p>IN:<br \/>\n[&#8220;It never once occurred to me that the fumbling might be a mere mistake.&#8221;]<\/p>\n<p>OUT:<br \/>\n[&#8216;it&#8217;, &#8216;never&#8217;, &#8216;onc&#8217;, &#8216;occur&#8217;, &#8216;to&#8217;, &#8216;me&#8217;, &#8216;that&#8217;, &#8216;the&#8217;, &#8216;fumbl&#8217;, &#8216;might&#8217;, &#8216;be&#8217;, &#8216;a&#8217;, &#8216;mere&#8217;, &#8216;mistake.&#8217;]<\/p>\n<p>&nbsp;<\/p>\n<p>It is easy to see where reductions may produce a \u201croot\u201d word that isn\u2019t an actual word. This doesn\u2019t necessarily adversely affect its efficiency, but there is a danger of \u201coverstemming\u201d where words like \u201cuniverse\u201d and \u201cuniversity\u201d are reduced to the same root of \u201cunivers\u201d.<\/p>\n<h2>Lemmatization<\/h2>\n<p>Lemmatization is an alternative approach to stemming for removing inflection. By determining the part of speech and utilizing WordNet\u2019s lexical database of English, lemmatization can get better results.<\/p>\n<p>&nbsp;<\/p>\n<p>The stemmed form of leafs is: leaf<\/p>\n<p>The stemmed form of leaves is: leav<\/p>\n<p>&nbsp;<\/p>\n<p>The lemmatized form of leafs is: leaf<\/p>\n<p>The lemmatized form of leaves is: leaf<\/p>\n<p>&nbsp;<\/p>\n<p>Lemmatization is a more intensive and therefore slower process, but it is more accurate. Stemming may be more useful for database queries, whereas lemmatization may work much better when trying to determine text sentiment.<\/p>\n<h2>Count \/\u00a0Density<\/h2>\n<p>Perhaps one of the more basic tools for feature engineering, adding word counts, sentence counts, punctuation counts and industry-specific word counts can greatly help in prediction or classification. 
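A minimal example of such count features, using only the standard library; the feature names are illustrative, not from any particular package.

```python
import string

def count_features(text):
    # Simple count/density features over a raw string.
    words = text.split()
    return {
        "word_count": len(words),
        "char_count": len(text),
        "punct_count": sum(ch in string.punctuation for ch in text),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
    }

feats = count_features("He did not try to navigate after the first bold flight.")
print(feats["word_count"])   # 11
print(feats["punct_count"])  # 1
```

In practice each feature becomes one numeric column per document, ready to feed to a model alongside other engineered features.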
There are multiple statistical approaches whose relevance is heavily dependent on context.<\/p>\n<p><img decoding=\"async\" style=\"width: 384px; height: 275px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*IGhqvs3ueoX2blQMJGRs2Q.png\" alt=\"experfy-blog\" \/><\/p>\n<h2>Word Embedding\/Text Vectors<\/h2>\n<p>Word embedding is the modern way of representing words as vectors. The aim of word embedding is to redefine high-dimensional word features as low-dimensional feature vectors. In other words, it represents each word as coordinates in a vector space where related words, based on a corpus of relationships, are placed closer together.\u00a0<a href=\"https:\/\/code.google.com\/archive\/p\/word2vec\/\" target=\"_blank\" rel=\"noopener noreferrer\">Word2Vec<\/a>\u00a0and\u00a0<a href=\"http:\/\/nlp.stanford.edu\/projects\/glove\/\" target=\"_blank\" rel=\"noopener noreferrer\">GloVe<\/a>\u00a0are the most common models used to convert text to vectors.<\/p>\n<h2>Conclusion<\/h2>\n<p>While this is far from a comprehensive list, preparing text is a complicated art which requires choosing the optimal tools given both the data and the question you are asking. Many pre-built libraries and services are there to help, but some may require manually mapping terms and words.<\/p>\n<p>Once a dataset is ready, supervised and unsupervised machine learning techniques can be applied. From my initial experiments, which will be their own article, there is a sharp difference in applying preprocessing techniques to a single string compared to large dataframes. 
Tuning the steps for optimal efficiency will be key to remaining flexible in the face of scaling.<\/p>\n<p>Clap if you liked the article, and follow if you are interested in seeing more on Natural Language Processing!<\/p>\n<p><strong>Additional Resources<\/strong><\/p>\n<p><a title=\"https:\/\/en.wikipedia.org\/wiki\/Natural_language_processing\" href=\"https:\/\/en.wikipedia.org\/wiki\/Natural_language_processing\" rel=\"noopener\">Natural language processing &#8211; Wikipedia<br \/>\nNatural language processing (NLP) is a field of computer science, artificial intelligence and computational\u2026en.wikipedia.org<\/a><\/p>\n<p><a title=\"https:\/\/www.analyticsvidhya.com\/blog\/2017\/01\/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python\/\" href=\"https:\/\/www.analyticsvidhya.com\/blog\/2017\/01\/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python\/\" rel=\"noopener\">Ultimate Guide to Understand &amp; Implement Natural Language Processing (with codes in Python)<br \/>\nAccording to industry estimates, only 21% of the available data is present in structured form. 
Data is being generated\u2026www.analyticsvidhya.com<\/a><\/p>\n<p><a title=\"http:\/\/textminingonline.com\/dive-into-nltk-part-i-getting-started-with-nltk\" href=\"http:\/\/textminingonline.com\/dive-into-nltk-part-i-getting-started-with-nltk\" rel=\"noopener\">Dive Into NLTK, Part I: Getting Started with NLTK<br \/>\nPart I: Getting Started with NLTK (this article) Part II: Sentence Tokenize and Word Tokenize Part III: Part-Of-Speech\u2026textminingonline.com<\/a><\/p>\n<p><a title=\"https:\/\/www.theatlantic.com\/technology\/archive\/2014\/06\/when-parry-met-eliza-a-ridiculous-chatbot-conversation-from-1972\/372428\/\" href=\"https:\/\/www.theatlantic.com\/technology\/archive\/2014\/06\/when-parry-met-eliza-a-ridiculous-chatbot-conversation-from-1972\/372428\/\" rel=\"noopener\">When PARRY Met ELIZA: A Ridiculous Chatbot Conversation From 1972<br \/>\nThey might not have passed the Turing Test, but they won the battle for wackiness.www.theatlantic.com<\/a><\/p>\n<p>Originally appeared in Towards Data Science<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Ready to learn Data Science? Browse courses\u00a0like\u00a0Data Science Training and Certification developed by industry thought leaders and Experfy in Harvard Innovation Lab. It is easy to forget how much data is stored in the conversations we have every day. 
With the evolution of the digital landscape, tapping into text, or Natural Language Processing (NLP), is<\/p>\n","protected":false},"author":112,"featured_media":2552,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[94],"ppma_author":[2687],"class_list":["post-1222","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":2687,"user_id":112,"is_guest":0,"slug":"kendall-fortney","display_name":"Kendall Fortney","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Fortney","first_name":"Kendall","job_title":"","description":"Kendall Fortney is the Data Innovation Fellow at the Vermont Center for Geographic Information and a Consultant focused on Python, machine learning and Geospatial Data with background in Art and years of experience in tech in 
Vermont."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1222","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/112"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1222"}],"version-history":[{"count":2,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1222\/revisions"}],"predecessor-version":[{"id":29221,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1222\/revisions\/29221"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/2552"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1222"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1222"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1222"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1222"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}