{"id":22605,"date":"2021-02-05T10:15:26","date_gmt":"2021-02-05T10:15:26","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/statistical-language-models\/"},"modified":"2023-09-05T11:00:20","modified_gmt":"2023-09-05T11:00:20","slug":"statistical-language-models","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/statistical-language-models\/","title":{"rendered":"Statistical Language Models"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"22605\" class=\"elementor elementor-22605\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-7ccb4c0 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"7ccb4c0\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-9dcb461\" data-id=\"9dcb461\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-3456458 elementor-widget elementor-widget-heading\" data-id=\"3456458\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">From simple to ++, with use cases, examples &amp; code snippets<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7b88579 elementor-widget elementor-widget-text-editor\" data-id=\"7b88579\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"72d2\">In NLP, a language model is a&nbsp;<em>probability distribution<\/em>&nbsp;over strings on an alphabet. In formal language theory, a language is a&nbsp;<em>set<\/em>&nbsp;of strings on an alphabet. The NLP version is a soft variant of the one in formal language theory.<\/p>\n<p id=\"e4b5\">The NLP version is better suited to modeling natural languages such as English or French. No hard rules dictate exactly which strings are in the language and which not. Rather we have observations to work with. People write. People talk. Their utterances characterize the language.<\/p>\n<p id=\"4339\">Importantly,&nbsp;the NLP statistical version is good for&nbsp;<em>learning<\/em>&nbsp;languages over strings from examples. Consider learning a model to recognize product names. The training set may contain iPhone 4 and iPhone 5 but not iPhone 12. It should recognize iPhone 12 as a product name.<\/p>\n<p id=\"8b62\">In this post, we cover statistical language models from simple to elaborate. The covered models include: Independent model, first-order Markov model, Kth-order Markov model, Hidden Markov Model, Conditional Markov model, and Conditional Random Fields. 
Next Token Probabilities

We've defined a language as a probability distribution over strings. In many use cases, what we really want is the probability of the next symbol given the current string. It turns out these two are equivalent, as noted in [1].

Consider a string on an alphabet. (The alphabet may be composed of characters or words or other tokens.) Denote this string x1, x2, …, xn. We can write P(x1, x2, …, xn) as P(x1)*P(x2|x1)*…*P(xn|xn-1, xn-2, …, x1).

Use Cases

What can we do with a language model? Quite a lot.

Suggest auto-completes. Smartphones offer auto-complete suggestions as we type. Web search engines offer auto-complete suggestions as we start typing a query. Under the hood, these are powered by language models.

Recognize handwriting. Imagine pointing your smartphone at your handwritten notes and asking them to be recognized. Perhaps to be digitized and searchable. Or at least to make them legible!

Handwriting recognition is challenging. Even people often get it wrong. Sometimes even their own writing!

Consider trying to recognize the words in a poorly-written text. A pure image processing approach can make a lot of errors. Adding a language model can cut these down a lot. The language model provides a useful context.

Example 1: Let's glimpse this in a simple example. Imagine the OCR thought the next word was databasc. The e was misrecognized as a c. They do look similar.

Now let's add a language model trained on English words.
Specifically, one trained to assign high probabilities to strings that look like English words and low probabilities to those that don't. bath would get a high probability, but not axbs.

This language model will know that database is much more likely than databasc. So adding it will help detect and correct the OCR's error.

By adding a multi-word language model, we can improve the error-detection and correction accuracy further. This model's alphabet is the lexicon of words. Its high-probability strings model high-probability word sequences. The alphabet becomes way larger. Every distinct word is in the alphabet. So care needs to be exercised to reap its benefits while avoiding the cost (the model getting overly complex). More on this later.

A multi-word language model can also help fill in missing words in the written text.

Detect and correct spelling errors. We'll reinterpret the same example databasc. The last character, c, is now a spelling error.

Recognize speech. Similar reasoning holds here. Except that the modality is different. The underlying language being expressed is the same. This is not to say that accurate handwriting or speech recognition is easy. Just that adding a language model helps.

Recognize multi-token named entities. Multi-token named entities often have a language structure. As an example, in a person's name, the first name word(s) typically appear before the last name words. Sandwiched between the two might be middle name word(s). As another example, in a US street address, the street number typically appears before the street name. As we will see later, such entities are often recognized via latent language models such as Hidden Markov Models.

Now that we have seen some use cases, let's dive into the models.

Models

We'll denote the string as x1, x2, …, xn.

Independent. This is the simplest approach. It assumes that all characters in the string are generated independently.

P(x1, x2, …, xn) = Q(x1)*Q(x2)*…*Q(xn).

Here Q is a probability distribution over the alphabet.

This approach is often effective on long strings on a large alphabet. Such as documents with many words. The text is seen as a sequence of words. The alphabet is the lexicon of words.
In this setting, more elaborate approaches get complex very quickly, so they need to work convincingly and significantly better to justify their added complexity.

This approach would be effective in detecting the document's language. For each language, we would train a separate Q, a distribution over the words seen in the language's training set. For a new document W expressed as the sequence of words W1 W2 … Wn, we would calculate PL(W) = QL(W1)*QL(W2)*…*QL(Wn) for every language L we have modeled. We would deem L* = argmax_L PL(W) as the language of W.

Example I1: Train on the sequence of words the dog is wagging the cat's tail. We get Q(the) = 2/7, Q(is) = 1/7, etc.

Python Code I1: Not tested, or even run. Plus, needs a runner.

from collections import Counter
import math


class IndependentModel:

    def __init__(self):
        self.ctr = Counter()

    def train(self, words):
        for word in words:
            self.ctr[word] += 1

    def Q(self, word):
        # Relative frequency of the word in the training data.
        return float(self.ctr[word]) / sum(self.ctr.values())

    def P(self, words):
        # Probability of the word sequence under the independence assumption.
        return math.prod(map(lambda word: self.Q(word), words))

Exercise I1: Evolve this code snippet into a language recognizer. Say English vs French.
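As a rough illustration of Exercise I1, here is one way it might be done. This sketch assumes the IndependentModel class above; the class name LanguageRecognizer and the toy training sentences are invented for illustration. Note that unseen words get probability zero here; a real recognizer would want smoothing.

class LanguageRecognizer:

    def __init__(self):
        self.models = {}

    def train(self, language, words):
        # One independent model per language.
        model = self.models.setdefault(language, IndependentModel())
        model.train(words)

    def recognize(self, words):
        # Return the language L maximizing P_L(W), as described above.
        return max(self.models, key=lambda lang: self.models[lang].P(words))


recognizer = LanguageRecognizer()
recognizer.train('english', 'the dog is wagging the tail'.split())
recognizer.train('french', 'le chien remue la queue'.split())
print(recognizer.recognize('the dog'.split()))  # expected: english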
First-order Markov Model. Here, a symbol is allowed to depend on the previous one.

P(x1, x2, …, xn) = Q(x1)*Q(x2|x1)*…*Q(xn|xn-1)

A first-order Markov model is described by a state-transition matrix P. pij is the probability of going from state i to state j. Denoting xt as i and xt+1 as j, Q(xt+1|xt) equals pij.

This Markov model may also be expressed as a directed graph. There is an arc i → j if the value i can be followed by the value j. This arc has the probability p(j|i) on it. The graph representation merely makes explicit which state transition probabilities are 0.

Example 1M1: A first-order Markov model is effective in modeling short phrases. Say we want to discover salient short phrases (of 2-to-4 words each) from a large corpus of English documents. A first-order Markov model strikes a good balance between being rich enough to do a reasonable job and not becoming too complex.

Consider the text "data mining is a phrase. data mining is a method of extracting meaningful information from a large data set".

From this text, we will build the Markov graph. This graph's nodes are the distinct words that appear in the text. There is an arc from node u to node v if word u is followed by word v at least once in the text. A few arcs are shown below.

data → mining   mining → is    method → of  extracting → meaningful

We see that P(data, mining) = Q(data)*P(mining | data) = (3/19)*(2/3) = 2/19. By contrast, P(a, method) = Q(a)*P(method | a) = (3/19)*(1/3) = 1/19 < P(data, mining). Unfortunately, P(is, a) = 2/19 = P(data, mining). So (is, a) would be a false positive because it is not a salient phrase.

So the first-order Markov model on its own, while somewhat useful, also has glaring false positives. Adding grammatical information to this approach, specifically the part-of-speech tags of words, significantly improves its quality. That is, it mostly removes the false positives while hardly missing any of the true ones. The intuition is that salient phrases are comprised of words favoring certain parts of speech, especially nouns. Consider (data, mining). Both words are nouns. Consider (is, a). 'is' is a verb, 'a' an article.

We won't go into detail on how to combine the two approaches. The main point is that the Markov model is still useful.

Python Code 1M1: Not tested, or even run. Plus, needs a runner.
from collections import defaultdict, Counter


class FirstOrderMarkovModel:

    def __init__(self):
        # ctr2[previous_word][word] counts bigram occurrences.
        self.ctr2 = defaultdict(lambda: defaultdict(int))
        # ctr1[previous_word] counts how often previous_word starts a bigram.
        self.ctr1 = Counter()

    def train(self, words):
        # '_B_' is a sentinel marking the beginning of the sequence.
        words_with_sentinel = ['_B_'] + words
        for i in range(len(words_with_sentinel) - 1):
            self.ctr2[words_with_sentinel[i]][words_with_sentinel[i + 1]] += 1
            self.ctr1[words_with_sentinel[i]] += 1

    def Q(self, word, previous_word):
        return float(self.ctr2[previous_word][word]) / self.ctr1[previous_word]

    def P(self, words):
        p = self.Q(words[0], '_B_')
        for i in range(len(words) - 1):
            p *= self.Q(words[i + 1], words[i])
        return p

Kth-order Markov Model. In this model, a character is allowed to depend on the previous K characters.

P(x1, x2, …, xn) = Q(x1)*Q(x2|x1)*…*Q(xn|xn-1, xn-2, …, xn-K)

With K set to 2 or 3, this approach can discover better salient phrases than a first-order Markov model without incurring a huge increase in complexity. What do we mean by "better"? Consider a phrase of length 3. Call it x1, x2, x3. The first-order Markov model loses information as it assumes x3 is independent of x1 given x2. The second-order model doesn't: P(x1)*P(x2|x1)*P(x3|x2,x1) equals P(x1,x2,x3).

As before, the need for combining the statistical approach with the grammatical approach remains.
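As a rough sketch of how the first-order code above might be generalized, here is a Kth-order variant. It conditions each word on the tuple of the K previous words, padding the start with begin sentinels; the class name and the padding scheme are my own choices, not from the original post.

from collections import defaultdict


class KthOrderMarkovModel:

    def __init__(self, K):
        self.K = K
        # ctr2[context][word] counts occurrences of word after context,
        # where context is the tuple of the K previous words.
        self.ctr2 = defaultdict(lambda: defaultdict(int))
        self.ctr1 = defaultdict(int)

    def train(self, words):
        padded = ['_B_'] * self.K + words
        for i in range(self.K, len(padded)):
            context = tuple(padded[i - self.K:i])
            self.ctr2[context][padded[i]] += 1
            self.ctr1[context] += 1

    def Q(self, word, context):
        if self.ctr1[context] == 0:
            return 0.0
        return self.ctr2[context][word] / self.ctr1[context]

    def P(self, words):
        padded = ['_B_'] * self.K + words
        p = 1.0
        for i in range(self.K, len(padded)):
            p *= self.Q(padded[i], tuple(padded[i - self.K:i]))
        return p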
Example KM1: We'd like to build a second-order Markov model on three-word product names. Why second-order? Consider these made-up examples: 4k tv samsung, 3k tv sony, … Imagine that Sony does not offer a 4k option. If we used a first-order Markov model, we would lose the interaction between the first word (4k or 3k or …) and the brand name. The second-order model would capture this interaction. In other words, the brand name (the third word) would be influenced not only by the second word (tv in our example) but also by the first one (4k or 3k or …).

Hidden Markov Model

Here, strings are probabilistically generated from a model with latent variables, i.e., hidden states.

Notationally, let X = x1, …, xn denote a string of length n and S = s1, …, sn denote a string of states associated with X. Symbol xi is generated from state si.

X is called the observation sequence; S the (hidden) state sequence from which X was generated.

Structurally, the HMM is a first-order Markov model over the set of states. Some states are connected to other states by arcs. In addition, each state has a probability distribution over the alphabet of observables.

This model generates a pair (X, S) as follows. First, it generates s1. Then it emits x1 from s1 with a certain probability. Next, it moves to state s2 with a certain probability. This probability depends only on where it came from, i.e. s1. What was emitted, i.e. x1, is immaterial. Next, it generates x2 from s2 with a certain probability. And so it goes.

In summary, the HMM generator alternates between generating states and generating observables from those states. The state transition probabilities depend only on the previous state. The emission probabilities depend only on the current state, i.e. the state from which the observable is emitted.
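To make the generative story concrete, here is a minimal sketch of how one might score a (state sequence, token sequence) pair under an HMM. The dictionary-based parameterization (start, trans, emit) is my own choice for illustration, not from the post.

def hmm_joint_probability(states, tokens, start, trans, emit):
    """P(X, S): the probability that the HMM follows state sequence `states`
    and emits token sequence `tokens`.

    start[s]    : probability of starting in state s
    trans[s][t] : probability of moving from state s to state t
    emit[s][x]  : probability of emitting token x from state s
    """
    p = start.get(states[0], 0.0) * emit[states[0]].get(tokens[0], 0.0)
    for i in range(1, len(states)):
        p *= trans[states[i - 1]].get(states[i], 0.0)
        p *= emit[states[i]].get(tokens[i], 0.0)
    return p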
Example HMM1: Say we want to generate sentences in English that resemble real ones. Consider specifically sentences having one of two structures:

Article Noun Verb Adverb
Pronoun Verb

An example sentence of the first structure is The boy is fast. An example sentence of the second structure is She sings.

A realistic sentence generator would accommodate a lot more structures. We chose two because the maximum insight comes in going from one to two.

What does an HMM that models these two structures look like? First, it helps to add a begin state and an end state. The HMM always starts from the begin state and stops at the end state. The begin and end states are called silent. They don't emit any token.

Okay, now to the structure, i.e. the arcs connecting the states.

begin → article → noun → verb → adverb → end
begin → pronoun → verb → end

Note that some states are shown twice. This is just due to the presentation constraint. Take an example. Since the HMM has arcs begin → article and begin → pronoun, what this really means is that from the begin state the HMM can move to the article state with a certain probability and to the pronoun state with another probability. The two probabilities sum to 1. From the state begin the HMM has to move somewhere.

Similarly, from the state verb we can either move to the state adverb or to the state end.
Note that the HMM does not track how we arrived at the state verb, which means the state sequence begin ⇒ article ⇒ noun ⇒ verb ⇒ end is also possible, even though we didn't ask for it. This type of generalization is good if we want the HMM to be able to generate sentences such as The boy ate, and bad if not.

Training: We can train the emission parameters of this model easily. We can take advantage of the fact that our states are named entities, for which training sets are readily available. It's easy to collect words that form articles, those that form nouns, etc.

Next, we look into training the state transition probabilities. For example, we need to estimate the probability of going from the begin state to the article state. This is less than 1, as from the state begin we can also go to the state pronoun instead.

For training the transitions, we assume we have access to sentences whose words have POS-tags attached to them. A rich training set of this type is easy to assemble. It is easy to get lots of sentences. It is also easy to get the POS-tag sequences of these sentences by running a suitable POS-tagger on each.

The transition probabilities of the HMM are easy to train from such a training set. In fact, we only need the POS-tag sequence for each sentence. Let's illustrate this. The probability on the arc begin → article is estimated as the number of POS-tag sequences whose first token is article divided by the number of POS-tag sequences whose first token is either article or pronoun.

Two generated sentences

Let's see a few.
The first one is below.

          The     boy     is     fast
           ^       ^      ^       ^
           |       |      |       |
begin → article → noun → verb → adverb → end

We start from the state begin, walk to article, emit The from it, walk to noun, emit boy from it, and so on.

The second one is below.

         She     sings
          ^        ^
          |        |
begin → pronoun → verb → end

We start from begin, walk to pronoun, emit She from it, move to verb, emit sings from it, and finally walk to end and stay put there.

Versus Independent: Imagine, by contrast, the sentences generated by the independent model. The words would be spewed out based on their probabilities with no regard for the desired structure.
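Here is a minimal sketch of a generator for the toy two-structure HMM above. Only the state structure comes from the example; the specific probabilities and word lists are invented for illustration.

import random

# State transition probabilities (begin and end are silent states).
trans = {
    'begin':   {'article': 0.6, 'pronoun': 0.4},
    'article': {'noun': 1.0},
    'noun':    {'verb': 1.0},
    'verb':    {'adverb': 0.7, 'end': 0.3},
    'adverb':  {'end': 1.0},
}

# Emission probabilities for the non-silent states.
emit = {
    'article': {'The': 0.5, 'A': 0.5},
    'noun':    {'boy': 0.5, 'girl': 0.5},
    'pronoun': {'She': 0.5, 'He': 0.5},
    'verb':    {'is': 0.4, 'sings': 0.3, 'ate': 0.3},
    'adverb':  {'fast': 0.5, 'loudly': 0.5},
}


def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]


def generate_sentence():
    state, words = 'begin', []
    while True:
        state = sample(trans[state])       # move to the next state
        if state == 'end':
            return ' '.join(words)
        words.append(sample(emit[state]))  # emit a word from that state


print(generate_sentence())  # e.g. "The boy is fast" or "She sings"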
Position-specific Independent Model aka chain HMM

In some use cases, the generated symbols are significantly influenced by their positions in the string. As one example, consider product names such as iPhone 11, canon camera, and sony tv. There is clearly some sequential structure. In these examples, the brand name (canon, sony) precedes the product type (camera, tv).

The position-specific generalization of the independent model is suited for such modeling.

P(x1, x2, …, xn) = Q1(x1)*Q2(x2)*…*Qn(xn).

Here Qi is a position-specific probability distribution over the alphabet. So with n positions we have n distributions Q1, …, Qn.

A position-specific independent model may be viewed as an HMM whose states 1, 2, …, n represent token positions 1 through n respectively. Such an HMM's graph is a single directed path 1 → 2 → 3 → … → n. Consequently, the transition probabilities on all the arcs are 1. The HMM's parameters are the position-specific emission probabilities. We will call such a model a chain HMM.

A chain HMM is hardly an HMM, as there is nothing Markovian about it. That said, calling it out as an HMM is useful. It surfaces paths to enhancing the model as needed.

An example of such an enhanced version is commonly used in biomolecular sequence analysis. It goes by the name profile HMM [3]. A profile HMM is a chain HMM with a few judiciously chosen states and transitions added to detour off the chain at certain positions. Detour paths merge back to the chain at certain other positions. Detour paths model position-specific mutations. In biomolecular sequences, such mutations often happen. A model that captures them recognizes membership of biomolecular sequences in the modeled family more accurately.

A different enhancement path from a position-specific chain HMM is into a so-called Conditional Markov Model (CMM). We will cover CMMs later in this post. We will also illustrate, in the example below, what benefits we get by moving from a chain HMM to a CMM.

Example CHMM-Park-Names: Say we want to model national or state park names in the US. Such as Yellowstone National Park, Yosemite National Park, Castle Rock State Park, … A chain HMM is a good way to go.

This HMM will be trained off a list of national plus state park names. We will set n to the maximum number of words in an entry in this list. The position-specific emission probabilities of the chain HMM are easy to estimate. The tokenized version of a park name reveals the position of each word in it. For example, in yosemite national park, yosemite is the first word, national the second, and park the third. So from a token in a park name we know which state's emission probability to train.

The model above is attractive for its simplicity and flexibility. Just train it on a list of park names. As the list gets richer, the model will improve on its own.
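Here is a rough sketch of such a position-specific model, trained from a list of park names. The class name and details are mine; the idea of one word distribution per position comes from the text.

from collections import Counter


class ChainHMM:

    def __init__(self, n):
        # One word distribution per token position, positions 0 .. n-1.
        self.position_counts = [Counter() for _ in range(n)]

    def train(self, names):
        for name in names:
            for position, word in enumerate(name.lower().split()):
                self.position_counts[position][word] += 1

    def Q(self, position, word):
        ctr = self.position_counts[position]
        total = sum(ctr.values())
        return ctr[word] / total if total else 0.0

    def P(self, name):
        p = 1.0
        for position, word in enumerate(name.lower().split()):
            p *= self.Q(position, word)
        return p


model = ChainHMM(n=4)
model.train(['Yosemite National Park',
             'Yellowstone National Park',
             'Castle Rock State Park'])
print(model.P('Yosemite National Park'))   # > 0
print(model.P('National Yosemite Park'))   # 0: words in unexpected positions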
That said, it seems a bit unnatural. A natural model for the type of park names we have in mind is

word+ ( state | national ) park

This just means that a park name has one or more words, followed by national or state, followed by park.

The model below is closer to the natural model we seek.

Example HMM-Park-Names: Here we will use three states: prefix_word, regional_word, and park. The state regional_word will emit state and national (for now), with equal probability (for now). The state prefix_word will emit a word in a park name that appears before regional_word. The state park will (for now) emit the word park with probability 1.

A training set of state sequences is easy to construct. We just need a few:

prefix_word regional_word park
prefix_word prefix_word regional_word park
…
prefix_word prefix_word prefix_word regional_word park

Why not just use a regular expression then? The HMM is more accurate. Consider the phrase "the national park" in some text. This phrase is not an actual national park name. The HMM can in principle model this by assigning a very low probability of emitting the from the state prefix_word.

In fact, this takes us to the topic of how to train the emission probabilities of prefix_word. (The emission probabilities of the other states have already been settled, for now.) Here is a simple approach. Take a list of national and state park names. Strip off the last two words in each park name. The words that remain are prefix words.

The state park could be extended to emit beach, forest, monument, etc. (In California, state beaches are often classified under state parks.)
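A minimal sketch of how this three-state park-name HMM might be assembled, reusing the hmm_joint_probability helper sketched earlier. The specific probabilities are invented; the transition structure simply allows prefix_word to repeat before regional_word.

# States: prefix_word (repeats), regional_word, park.
park_start = {'prefix_word': 1.0}

park_trans = {
    'prefix_word':   {'prefix_word': 0.5, 'regional_word': 0.5},
    'regional_word': {'park': 1.0},
    'park':          {},
}

park_emit = {
    # In practice, trained from park-name lists with the last two words
    # stripped off; a tiny hand-made stand-in is shown here.
    'prefix_word':   {'yosemite': 0.3, 'castle': 0.3, 'rock': 0.3, 'the': 0.1},
    'regional_word': {'national': 0.5, 'state': 0.5},
    'park':          {'park': 1.0},
}

tokens = ['castle', 'rock', 'state', 'park']
states = ['prefix_word', 'prefix_word', 'regional_word', 'park']
print(hmm_joint_probability(states, tokens, park_start, park_trans, park_emit))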
To get a better feel for the trade-offs between a chain HMM and an entity-based HMM, let's see another example involving the two.

Example CHMM-Product: Consider a chain HMM to model product names. (By 'product name' we mean both specific products and product categories.) We could use a trained version to recognize product names in unstructured text, some that it hasn't even been trained on.

This HMM will be trained off a list of product names. The training is similar to that of the park chain HMM. Remember, we just need to learn (i) the number of states and (ii) the emission probabilities from the various states.

This HMM has a simple structure and is easy to train. That said, the entity-based version we describe below will likely be more accurate.

Example HMM-Product

We might choose as our entities brand_token, product_type_token, product_version_token, and product_base_name_token. Example values are brand_token = canon, product_type_token = camera, product_version_token = 12, product_base_name_token = iphone. Our HMM's states would be these entities.

Why does each entity's name end with the word token? Because the entity applies to a single token. So the state sequence associated with "smart phone" would be [product_type_token, product_type_token].

To train this HMM we would need training sets for the various entities. These training sets could be derived from lists of brands, product types, etc. We say "derived" because we can't use the lists as-is if they contain multi-word entries. Consider smart phone listed as a product type. From this, we would derive two examples: smart → product_type_token and phone → product_type_token.

We also need state sequences that capture the sequences of entities in tokenized product names. We could construct such a training set manually. This isn't hard because product names tend to follow some common conventions. For instance, in many two-word product names, the first name is a brand, and the second a product type. We already saw an example: canon camera. Consequently, [brand_token, product_type_token] should be in the training set of state sequences.
Automatically Deriving State Sequences From Product Names

Constructing state sequences manually will get us quite far. A few state sequences cover a lot of product names. That said, the manual approach risks not scaling to a robust industrial-strength model, one covering millions of products. The product names may span a long tail of state sequences. Consider the state sequence [product_type_token, brand_token]. Some product names follow this convention.

As it turns out, we can automate the construction of state sequences from the tokenized product names. This way, as we add more and more products to our training set, state sequences automatically get extracted from them.

Basic version: The basic idea is to take the tokenized product name and, for each word in it, find its most likely entity. We can do this because we have training sets of (word, entity) pairs for the various entities.

Let's see a simple example.

canon camera ⇒ [canon,camera] ⇒ [brand_token,product_type_token]
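A minimal sketch of this basic version, assuming we already have per-entity word counts accumulated from the entity training sets (the counts shown are invented):

from collections import defaultdict

# word -> entity -> count, accumulated from the per-entity training sets.
word_entity_counts = defaultdict(lambda: defaultdict(int))
word_entity_counts['canon']['brand_token'] = 40
word_entity_counts['camera']['product_type_token'] = 55
word_entity_counts['smart']['product_type_token'] = 10
word_entity_counts['phone']['product_type_token'] = 30


def most_likely_entity(word):
    entities = word_entity_counts[word]
    return max(entities, key=entities.get) if entities else 'unknown_token'


def derive_state_sequence(product_name):
    return [most_likely_entity(word) for word in product_name.lower().split()]


print(derive_state_sequence('Canon camera'))
# ['brand_token', 'product_type_token']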
Refined version: We can refine this basic idea as follows. As before, we tokenize the product name. We then run the sequence of tokens through the HMM to find a state sequence that fits it best.

This refinement uses information from both the entity training sets and from the HMM's state transitions. Consequently, it can be more accurate.

I like to call this approach "bootstrap training". We initialize the HMM manually. This includes seeding it with whatever state sequences we choose to. We can then run tokenized product names through it in the hope of discovering more state sequences.

To enable the discovery of new state sequences, when initializing the HMM, we should allow transitions from all states to all states. (Other than from the state begin to the state end, and any going out from the state end.) We can initialize our training set for these transitions automatically, so that the initial probabilities on these transitions are very low, albeit not zero. We call this pseudo-training.

Curation: It's possible that some of the new state sequences we discover via "bootstrap training" have errors in them. So these should be reviewed by humans. Nonetheless, we have probably benefited despite the human in the loop. Manually discovering new state sequences is a lot more time-consuming than curating automatically discovered ones.

Conditional Markov Model

Just like an HMM, a conditional Markov model (CMM) operates on (token sequence, state sequence) pairs. Unlike an HMM, a CMM is optimized for finding the best state sequence for a given token sequence. What exactly does this mean? This will become clear gradually in the examples below.

That said, in this post, we will not discuss how to find the best state sequence for a given token sequence. We will discuss how to score a particular state sequence for a given token sequence. This scoring will surface the "conditional" part of the CMM.

Example CMM1 (First Part): Imagine training a language model on (tokenized full person name, token entities) pairs, with the aim of using it to parse person names. Parsing means breaking up a person's name into its constituents, such as first name and last name.

Consider the name John K Smith. The model should be able to infer that K is a middle name word, even if the training set does not contain a K emitted from middle_name_word.
The CMM directly models this as the first expression below; the second shows, for contrast, how an HMM would model it.

CMM: P(state[2] = middle_name_word | state[1] = first_name_word, token[2] = K)

HMM: P(state[2] = middle_name_word | state[1] = first_name_word)
     * P(token[2] = K | state[2] = middle_name_word)

How is this beneficial, compared to how the HMM models it?

Towards answering this question, first let's abstract out the specifics of this example. We have P(S[i+1] | S[i], X[i+1]) for the CMM versus P(S[i+1] | S[i])*P(X[i+1] | S[i+1]) for the HMM. Here X[i] is the ith token and S[i] is the state from which it was emitted.

The CMM approach may be viewed as a multinomial classifier whose input is the pair (S[i], X[i+1]) and whose output is a probability distribution over the various values S[i+1] can take. This formulation lends itself to extracting any features that we see fit from the input. Possibly overlapping. Possibly taking into account interactions between S[i] and X[i+1].

We can now use any multinomial classifier algorithm on this problem. Such as (multinomial) logistic regression, decision trees, or random forests.

By contrast, the HMM cannot capture interactions between S[i] and X[i+1]. Nor can it accommodate arbitrary features. Nor can it leverage sophisticated machine learning classifier algorithms. The CMM, on the other hand, cannot leverage the entity training sets. We no longer learn emission probabilities.

Example CMM 1 (completed): Let's complete this example, starting with the features. From input (S[i], X[i+1]) we will extract the features state = S[i], token = X[i+1], and token length = number of characters in X[i+1]. That is, we have three predictors: state, token, and token length. Each is categorical. The response is the value of S[i+1]. Also categorical.

Why these features? The feature state lets us model the influence of the current state on the next state.
The feature token lets us use the actual value of the token as a predictor of its state. This is useful. Certain tokens favor certain entities. For example, John is much more likely to be a first name word than a last name word. As an added bonus, we get the interaction between state and token for free, so long as our learning algorithm can exploit it. (If there is an interaction, that is.) The feature token length is useful because a token's length favors certain entities over others. First or middle names are often abbreviated to a single character. Last names hardly ever are.

Let's see a few training examples. We'll start with (token sequence, state sequence) pairs.

John   Smith                        John    K    Smith
first  last                         first middle last

From these, let's derive a few training instances for the CMM formulation. (We don't show all.) The first three columns list the predictors. The last one lists the response.

(previous) state    token   token's length    (next) state
      first           K           1              middle
      begin          John         4              first
      first          Smith        5              last
        …             …           …                …

The first training instance above reads as: "(previous state equaling first, token equaling K, token's length equaling 1) leads to the next state being middle".
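Here is a rough sketch of turning (token sequence, state sequence) pairs into such training instances and fitting a classifier. It assumes scikit-learn's DictVectorizer and LogisticRegression; the token length is kept as a string so that it stays categorical, as in the text.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


def features(previous_state, token):
    return {'state': previous_state,
            'token': token.lower(),
            'token_length': str(len(token))}


def training_instances(pairs):
    X, y = [], []
    for tokens, states in pairs:
        previous_states = ['begin'] + list(states[:-1])
        for prev, tok, st in zip(previous_states, tokens, states):
            X.append(features(prev, tok))
            y.append(st)
    return X, y


pairs = [(['John', 'Smith'], ['first', 'last']),
         (['John', 'K', 'Smith'], ['first', 'middle', 'last'])]

X, y = training_instances(pairs)
vectorizer = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vectorizer.fit_transform(X), y)

# Predict the state of 'K' given that the previous state was 'first'.
print(clf.predict(vectorizer.transform([features('first', 'K')])))  # expected: ['middle']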
Example CMM 2 (Sketched): Consider parsing US street addresses. For simplicity, assume they have the form

street_num_word (street_name_word)+ street_suffix [unit_prefix unit_num_word]

Below are some pairs of tokenized US street addresses together with their entities. The entity names are abbreviated so the examples can fit in the space allocated.

123             Marine           Dr
street_num_word street_name_word street_suffix

203             Lake             Forest           Ave
street_num_word street_name_word street_name_word street_suffix

123             Main             St            Apt         22
street_num_word street_name_word street_suffix unit_prefix unit_num

For the design of our features, let's start by focusing on those extracted from a token. Which features of a token discriminate among the various entities? The actual value of the token can predict whether it is a street_name_word or a street_suffix or a unit_prefix.
## Conditional Random Fields

Just like a CMM, a CRF operates on (S, X) pairs, where X denotes a token sequence and S its state sequence. Both model P(S|X), to optimize for the use case of finding a high-probability state sequence S for a given X.

The CMM factors this as

P(S|X) = product_i P(S[i] | S[i-1], X[i])  **(CMM 1)**

That is, the probability of state S[i] is conditioned on the previous state S[i-1] and the current token X[i]. For the *John K Smith* example above, this expands to P(first | begin, John) * P(middle | first, K) * P(last | middle, Smith).
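The factorization (CMM 1) also suggests how to label a new token sequence: predict each state from the previous state and the current token, then multiply the per-step probabilities. The sketch below does this greedily with a hard-coded toy probability table; everything in it is made up for illustration. A real system would use the trained classifier's predicted distributions, and Viterbi-style dynamic programming over the factors rather than a greedy choice, to find the exact highest-probability sequence.

```python
# Toy table standing in for a trained model of P(next state | previous state, token).
P = {
    ("begin",  "John"):  {"first": 0.9, "last": 0.1},
    ("first",  "K"):     {"middle": 0.8, "last": 0.2},
    ("middle", "Smith"): {"last": 0.95, "first": 0.05},
}

def greedy_decode(tokens):
    """Label tokens left to right, choosing the most probable next state at each step."""
    states, prob, prev = [], 1.0, "begin"
    for tok in tokens:
        dist = P[(prev, tok)]
        best = max(dist, key=dist.get)
        states.append(best)
        prob *= dist[best]          # P(S|X) factors per (CMM 1)
        prev = best
    return states, prob

print(greedy_decode(["John", "K", "Smith"]))
# (['first', 'middle', 'last'], 0.9 * 0.8 * 0.95 ~= 0.68)
```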
The CRF is less restrictive. That is, it accommodates a more general P(S|X).

The CRF operates on an undirected graph whose nodes model states and whose edges model direct influences among pairs of states. These influences are symmetric.

In this post, we will limit ourselves to a linear-chain CRF. It has the structure

S[1] — S[2] — S[3] — … — S[n]

This just models that a state is influenced by the previous state and the next state. That is, S[i] is directly influenced by S[i-1] and by S[i+1].

In the CRF world, the linear-chain CRF is the closest analog to a CMM, so let's look into it in more detail.

The linear CRF models P(S|X) as

P(S|X) proportional to product_i product_f e^(w(f) * f(S[i-1], S[i], X))  **(CRF 1)**

The intuition is a bit easier to convey if we define a score C(S, X) from this by taking logs.

C(S, X) = sum_i sum_f w(f) * f(S[i-1], S[i], X)  **(CRF 2)**

We call it C because we think of it as a compatibility function. As a function of S, C(S, X) is monotone in P(S|X). So for the purposes of finding high-scoring state sequences, C(S, X) works equally well.

What is f? It's a feature function that scores the compatibility of the triple (S[i-1], S[i], X) along some dimension. More positive values are more compatible. w(f) is f's weight. It controls the compatibility's strength and direction.

We can have as many feature functions as we choose.

Okay, so in a linear CRF, we have a set of possibly overlapping feature functions. Each is applied to every edge. That said, they can use any information from any subset of the tokens in X. By contrast, a CMM only scores the compatibility of (S[i-1], S[i], X[i]) triples.
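Here is a minimal sketch of computing the compatibility score (CRF 2) for a candidate state sequence in the street-address setting. The two feature functions and their weights are invented for illustration, and the position index i is passed explicitly so a feature can look at the current token; the post writes the feature simply as f(S[i-1], S[i], X).

```python
def f_state_token(prev_state, state, X, i):
    """State-token compatibility: does the current token look like its state?"""
    return (1.0 if X[i].isdigit() else -1.0) if state == "street_num_word" else 0.0

def f_transition(prev_state, state, X, i):
    """Edge compatibility: reward a transition we expect to see."""
    return 1.0 if (prev_state, state) == ("street_num_word", "street_name_word") else 0.0

FEATURES = [(f_state_token, 2.0), (f_transition, 1.5)]   # (f, w(f)) pairs; weights made up

def compatibility(S, X):
    """C(S, X) = sum over positions i and feature functions f of w(f) * f(S[i-1], S[i], X)."""
    score, prev = 0.0, "begin"                            # dummy initial state
    for i, state in enumerate(S):
        score += sum(w * f(prev, state, X, i) for f, w in FEATURES)
        prev = state
    return score

X = ["123", "Marine", "Dr"]
print(compatibility(["street_num_word", "street_name_word", "street_suffix"], X))   # 3.5
print(compatibility(["street_name_word", "street_num_word", "street_suffix"], X))   # -2.0
```

The correct labeling of "123 Marine Dr" scores higher than the shuffled one; learning a CRF amounts to setting the weights w(f) so that this holds across the training set.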
At this point, it's best to see a concrete example with specific feature functions.

**Example CRF 1**: Consider a generalized version of an example we saw earlier: parsing street addresses, this time globally rather than just for the US.

Street addresses across the globe come in many different formats. In some, street numbers come before the street name; in others, after. Some start with units, others end with them. (An example of a unit is "Apt 30".) These are just a few of the variations.

Global street addresses also have varying quality, in terms of how well they adhere to the format for their locale. We want to be able to parse poor-quality addresses as well as possible.

Let's see two examples:

```
123 St Francis St
Rue du Vivier 15
```

The first is a US address, the second a Belgian one. In addition to the positions of the street numbers and names, the positions of the street keywords also vary. (In these examples, "St" and "Rue" are street keywords.)

We'll use a linear-chain CRF.

Are there any global features we should extract from the tokenized street address? Two come to mind: *country* and *language*.

Both the language and the country influence address formats. These factors overlap, but they also have complementary influences. Such is the case with Belgian vs French addresses, both written in French. In some aspects they are similar; in others they are not. That is, the country matters.

In view of this, we should include both. We'll build two specific feature functions around these:

```
F1: country-edge compatibilities:  (S[i-1], S[i], country(X))
F2: language-edge compatibilities: (S[i-1], S[i], language(X))
```

We'll add a third feature function to prefer tokens that are compatible with their tagged states:

```
F3: state-token compatibilities: (S[i], X[i])
```

**Training**: F1 and F2 may be trained from (*tokenized street address*, *state sequence*) pairs tagged with their country and language. Transitions from S[i-1] to S[i] that are compatible with the country and language score high; those that are not, don't. F3 may be trained from the (*state*, *token*) pairs in the training set.
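To make this concrete, here is a small training sketch using the sklearn-crfsuite package; the post does not prescribe a library, so this is just one way to do it. The feature dictionaries below approximate F1–F3: a per-token feature for F3, plus country and language attached to every position so the learned transition weights can interact with them. The toy training pairs, the country and language tags, and the state label for the street keyword in the Belgian example are all made up.

```python
import sklearn_crfsuite

def position_features(tokens, i, country, language):
    tok = tokens[i]
    return {
        "token": tok.lower(),                                     # feeds F3-style state-token compatibility
        "prop_digits": sum(c.isdigit() for c in tok) / len(tok),
        "country": country,                                       # global signal for F1
        "language": language,                                     # global signal for F2
    }

def featurize(tokens, country, language):
    return [position_features(tokens, i, country, language) for i in range(len(tokens))]

# Toy training pairs: (tokens, states, country, language). "street_kw" is a made-up label.
train = [
    (["123", "Marine", "Dr"],
     ["street_num_word", "street_name_word", "street_suffix"], "US", "en"),
    (["Rue", "du", "Vivier", "15"],
     ["street_kw", "street_name_word", "street_name_word", "street_num_word"], "BE", "fr"),
]

X_train = [featurize(toks, c, lang) for toks, states, c, lang in train]
y_train = [states for toks, states, c, lang in train]

# A linear-chain CRF; its learned transition weights play the role of the edge compatibilities.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100, all_possible_transitions=True)
crf.fit(X_train, y_train)

print(crf.predict([featurize(["45", "Ocean", "Ave"], "US", "en")]))
```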
#### Further Reading

1. [https://www.cl.cam.ac.uk/teaching/1718/R228/lectures/lec9.pdf](https://www.cl.cam.ac.uk/teaching/1718/R228/lectures/lec9.pdf)
2. [Maximum-entropy Markov model](https://en.wikipedia.org/wiki/Maximum-entropy_Markov_model)
3. [What are profile hidden Markov models (HMMs)?](https://www.ebi.ac.uk/training-beta/online/courses/pfam-creating-protein-families/what-are-profile-hidden-markov-models-hmms/)
4. [CS838-1 Advanced NLP: Conditional Random Fields](http://pages.cs.wisc.edu/~jerryzhu/cs838/CRF.pdf)
5. [Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data](https://repository.upenn.edu/cgi/viewcontent.cgi?referer=https://en.wikipedia.org/&httpsredir=1&article=1162&context=cis_papers)