{"id":22617,"date":"2021-02-11T11:20:11","date_gmt":"2021-02-11T11:20:11","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/text-classification-with-nlp-tf-idf-vs-word2vec-vs-bert\/"},"modified":"2023-09-05T07:01:16","modified_gmt":"2023-09-05T07:01:16","slug":"text-classification-with-nlp-tf-idf-vs-word2vec-vs-bert","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/text-classification-with-nlp-tf-idf-vs-word2vec-vs-bert\/","title":{"rendered":"Text Classification With NLP: Tf-Idf vs Word2Vec vs BERT"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"22617\" class=\"elementor elementor-22617\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-7d79292 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"7d79292\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-c9338e4\" data-id=\"c9338e4\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-464ea99 elementor-widget elementor-widget-heading\" data-id=\"464ea99\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">Preprocessing, Model Design, Evaluation, Explainability for Bag-of-Words, Word Embedding, Language models<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f4889a4 elementor-widget elementor-widget-heading\" data-id=\"f4889a4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Summary<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6670c4f elementor-widget elementor-widget-text-editor\" data-id=\"6670c4f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"84c0\">In this article, using NLP and Python, I will explain 3 different strategies for text multiclass classification: the old-fashioned&nbsp;<em>Bag-of-Words&nbsp;<\/em>(with Tf-Idf )<em>,&nbsp;<\/em>the famous&nbsp;<em>Word Embedding (<\/em>with Word2Vec), and the cutting edge<em>&nbsp;Language models<\/em>&nbsp;(with BERT).<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5ba9888 elementor-widget elementor-widget-image\" data-id=\"5ba9888\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"500\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1T8WWibd7u8b7gfgeG0LgAA-1024x500.gif\" class=\"attachment-large size-large wp-image-18662\" alt=\"Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT\" 
srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1T8WWibd7u8b7gfgeG0LgAA-1024x500.gif 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1T8WWibd7u8b7gfgeG0LgAA-300x147.gif 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1T8WWibd7u8b7gfgeG0LgAA-768x375.gif 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1T8WWibd7u8b7gfgeG0LgAA-1536x750.gif 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1T8WWibd7u8b7gfgeG0LgAA-610x298.gif 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1T8WWibd7u8b7gfgeG0LgAA-750x366.gif 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1T8WWibd7u8b7gfgeG0LgAA-1140x557.gif 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-077475a elementor-widget elementor-widget-text-editor\" data-id=\"077475a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"d857\"><a href=\"https:\/\/en.wikipedia.org\/wiki\/Natural_language_processing\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>NLP (Natural Language Processing)<\/strong><\/a>\u00a0is the field of <a href=\"https:\/\/www.experfy.com\/blog\/ai-ml\/product-definition-in-the-age-of-ai\/\" target=\"_blank\" rel=\"noreferrer noopener\">artificial intelligence<\/a> that studies the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data. NLP is often applied for classifying text data.\u00a0<strong>Text classification<\/strong>\u00a0is the problem of assigning categories to text data according to its content.<\/p>\n<p id=\"deee\">There&nbsp;are different techniques to extract information from raw text data and use it to train a classification model. 
This tutorial compares the old-school approach of *Bag-of-Words* (used with a simple machine learning algorithm), the popular *Word Embedding* model (used with a deep learning neural network), and the state-of-the-art *Language models* (used with transfer learning from attention-based transformers) that have completely revolutionized the NLP landscape.

I will present some useful Python code that can be easily applied to other similar cases (just copy, paste, run) and walk through every line of code with comments so that you can replicate this example: [full notebook on GitHub](https://github.com/mdipietro09/DataScience_ArtificialIntelligence_Utils/blob/master/natural_language_processing/example_text_classification.ipynb).

I will use the "[News Category Dataset](https://www.kaggle.com/rmisra/news-category-dataset)", which contains news headlines from 2012 to 2018 obtained from *HuffPost*; the task is to assign each headline the right category, so this is a multiclass classification problem.

In particular, I will go through:

- Setup: import packages, read data, preprocessing, partitioning.
- Bag-of-Words: feature engineering, feature selection, and machine learning with *scikit-learn*; testing and evaluation; explainability with *lime*.
- Word Embedding: fitting a Word2Vec model with *gensim*; feature engineering and deep learning with *tensorflow/keras*; testing and evaluation; explainability with the Attention mechanism.
- Language Models: feature engineering with *transformers*; transfer learning from pre-trained BERT with *transformers* and *tensorflow/keras*; testing and evaluation.

## Setup
class=\"elementor-element elementor-element-7b052bf elementor-widget elementor-widget-heading\" data-id=\"7b052bf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">First of all, I need to import the following libraries:<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-06cc257 elementor-widget elementor-widget-text-editor\" data-id=\"06cc257\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\"><strong>## for data<br><\/strong>import <strong>json<br><\/strong>import <strong>pandas <\/strong>as pd<br>import <strong>numpy <\/strong>as np<strong>## for plotting<\/strong><br>import <strong>matplotlib<\/strong>.pyplot as plt<br>import <strong>seaborn <\/strong>as sns<strong>## for processing<br><\/strong>import <strong>re<\/strong><br>import <strong>nltk<\/strong><strong>## for bag-of-words<\/strong><br>from <strong>sklearn <\/strong>import feature_extraction, model_selection, naive_bayes, pipeline, manifold, preprocessing<strong>## for explainer<\/strong><br>from <strong>lime <\/strong>import lime_text<strong>## for word embedding<\/strong><br>import <strong>gensim<br><\/strong>import gensim.downloader as gensim_api<strong>## for deep learning<\/strong><br>from <strong>tensorflow<\/strong>.keras import models, layers, preprocessing as kprocessing<br>from tensorflow.keras import backend as K<strong>## for bert language model<\/strong><br>import <strong>transformers<\/strong><\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d4b219e elementor-widget elementor-widget-text-editor\" data-id=\"d4b219e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"9c06\">The dataset is contained into a json file, so I will first read it into a list of dictionaries with&nbsp;<em>json&nbsp;<\/em>and then transform it into a&nbsp;<em>pandas&nbsp;<\/em>Dataframe.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bb9a1c4 elementor-widget elementor-widget-text-editor\" data-id=\"bb9a1c4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">lst_dics = []<br>with <strong>open<\/strong>('data.json', mode='r', errors='ignore') as json_file:<br>    for dic in json_file:<br>        lst_dics.append( json<strong>.loads<\/strong>(dic) )<strong>## print the first one<\/strong><br>lst_dics[0]<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-48f7804 elementor-widget elementor-widget-image\" data-id=\"48f7804\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" width=\"1024\" height=\"155\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1N7xAYy2MBRJHKMBXnxMi0A-1024x155.png\" class=\"attachment-large size-large wp-image-18663\" alt=\"Image for post\" 
srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1N7xAYy2MBRJHKMBXnxMi0A-1024x155.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1N7xAYy2MBRJHKMBXnxMi0A-300x45.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1N7xAYy2MBRJHKMBXnxMi0A-768x116.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1N7xAYy2MBRJHKMBXnxMi0A-1536x233.png 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1N7xAYy2MBRJHKMBXnxMi0A-2048x310.png 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1N7xAYy2MBRJHKMBXnxMi0A-610x92.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1N7xAYy2MBRJHKMBXnxMi0A-750x114.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1N7xAYy2MBRJHKMBXnxMi0A-1140x173.png 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fa16ffd elementor-widget elementor-widget-text-editor\" data-id=\"fa16ffd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"6721\">The original dataset contains over 30 categories, but for the purposes of this tutorial, I will work with a subset of 3: Entertainment, Politics, and Tech.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7ec09b1 elementor-widget elementor-widget-text-editor\" data-id=\"7ec09b1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\"><strong>## create dtf<\/strong><br>dtf = pd.DataFrame(lst_dics)<strong>## filter categories<\/strong><br>dtf = dtf[ dtf[\"category\"].isin(['<strong>ENTERTAINMENT<\/strong>','<strong>POLITICS<\/strong>','<strong>TECH<\/strong>']) ][[\"category\",\"headline\"]]<strong>## rename columns<\/strong><br>dtf = dtf.rename(columns={\"category\":\"<strong>y<\/strong>\", \"headline\":\"<strong>text<\/strong>\"})<strong>## print 5 random rows<\/strong><br>dtf.sample(5)<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c8dd1e7 elementor-widget elementor-widget-image\" data-id=\"c8dd1e7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" width=\"1024\" height=\"360\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1iurA976CkC9i1Yi1L6hIIw-1024x360.png\" class=\"attachment-large size-large wp-image-18664\" alt=\"Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1iurA976CkC9i1Yi1L6hIIw-1024x360.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1iurA976CkC9i1Yi1L6hIIw-300x105.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1iurA976CkC9i1Yi1L6hIIw-768x270.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1iurA976CkC9i1Yi1L6hIIw-610x214.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1iurA976CkC9i1Yi1L6hIIw-750x263.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1iurA976CkC9i1Yi1L6hIIw-1140x400.png 
In order to understand the composition of the dataset, I am going to look into the univariate distribution of the target by showing label frequencies with a bar plot.

```python
fig, ax = plt.subplots()
fig.suptitle("y", fontsize=12)
dtf["y"].reset_index().groupby("y").count().sort_values(by="index").plot(
    kind="barh", legend=False, ax=ax).grid(axis='x')
plt.show()
```

The dataset is imbalanced: the proportion of Tech news is really small compared to the others, which will make recognizing Tech news rather tough for the models.
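As a quick numeric check (this snippet is my addition, not part of the original notebook), the class proportions can be printed directly with *pandas*:

```python
## class proportions of the target (addition for illustration);
## value_counts with normalize=True returns fractions instead of counts
print(dtf["y"].value_counts(normalize=True).round(3))
```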
Before explaining and building the models, I am going to give an example of preprocessing by cleaning text, removing stop words, and applying lemmatization. I will write a function and apply it to the whole dataset.

```python
'''
Preprocess a string.
:parameter
    :param text: string - text to clean
    :param lst_stopwords: list - list of stopwords to remove
    :param flg_stemm: bool - whether stemming is to be applied
    :param flg_lemm: bool - whether lemmatization is to be applied
:return
    cleaned text
'''
def utils_preprocess_text(text, flg_stemm=False, flg_lemm=True, lst_stopwords=None):
    ## clean (convert to lowercase, remove punctuation and special characters, then strip)
    text = re.sub(r'[^\w\s]', '', str(text).lower().strip())

    ## Tokenize (convert from string to list)
    lst_text = text.split()

    ## remove Stopwords
    if lst_stopwords is not None:
        lst_text = [word for word in lst_text if word not in
                    lst_stopwords]

    ## Stemming (remove -ing, -ly, ...)
    if flg_stemm == True:
        ps = nltk.stem.porter.PorterStemmer()
        lst_text = [ps.stem(word) for word in lst_text]

    ## Lemmatization (convert the word into its root form)
    if flg_lemm == True:
        lem = nltk.stem.wordnet.WordNetLemmatizer()
        lst_text = [lem.lemmatize(word) for word in lst_text]

    ## back to string from list
    text = " ".join(lst_text)
    return text
```

That function removes a given set of words from the corpus.
I can create a list of generic stop words for the English language with *nltk* (and we could edit this list by adding or removing words).

```python
lst_stopwords = nltk.corpus.stopwords.words("english")
lst_stopwords
```

Now I shall apply the function to the whole dataset and store the result in a new column named "*text_clean*", so that you can choose to work with the raw corpus or the preprocessed text.

```python
dtf["text_clean"] = dtf["text"].apply(lambda x:
          utils_preprocess_text(x, flg_stemm=False, flg_lemm=True,
          lst_stopwords=lst_stopwords))

dtf.head()
```
loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"228\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1t-R6djtHnK4cBVqFrjfLgA-1024x228.png\" class=\"attachment-large size-large wp-image-18667\" alt=\"Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1t-R6djtHnK4cBVqFrjfLgA-1024x228.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1t-R6djtHnK4cBVqFrjfLgA-300x67.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1t-R6djtHnK4cBVqFrjfLgA-768x171.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1t-R6djtHnK4cBVqFrjfLgA-610x136.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1t-R6djtHnK4cBVqFrjfLgA-750x167.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1t-R6djtHnK4cBVqFrjfLgA-1140x254.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1t-R6djtHnK4cBVqFrjfLgA.png 1494w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-57c9a5f elementor-widget elementor-widget-text-editor\" data-id=\"57c9a5f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"d7f8\">If you are interested in a deeper text analysis and preprocessing, you can check\u00a0<a href=\"https:\/\/towardsdatascience.com\/text-analysis-feature-engineering-with-nlp-502d6ea9225d\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">this article<\/a>. With this in mind, I am going to partition the dataset into training set (70%) and test set (30%) in order to evaluate the models performance.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b2d9d82 elementor-widget elementor-widget-text-editor\" data-id=\"b2d9d82\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\"><strong>## split dataset<\/strong><br>dtf_train, dtf_test = model_selection.<strong>train_test_split<\/strong>(dtf, test_size=0.3)<strong>## get target<\/strong><br>y_train = dtf_train[<strong>\"y\"<\/strong>].values<br>y_test = dtf_test[<strong>\"y\"<\/strong>].values<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5e5e2dd elementor-widget elementor-widget-text-editor\" data-id=\"5e5e2dd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"e70d\">Let\u2019s get started, shall we?<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fad31cc elementor-widget elementor-widget-heading\" data-id=\"fad31cc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Bag-of-Words<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4b58cbd elementor-widget elementor-widget-text-editor\" data-id=\"4b58cbd\" data-element_type=\"widget\" data-e-type=\"widget\" 
data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"96c7\">The\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Bag-of-words_model\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Bag-of-Words<\/em><\/a>\u00a0model is simple: it builds a vocabulary from a corpus of documents and counts how many times the words appear in each document. To put it another way, each word in the vocabulary becomes a feature and a document is represented by a vector with the same length of the vocabulary (a \u201cbag of words\u201d). For instance, let\u2019s take 3 sentences and represent them with this approach:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c894e6b elementor-widget elementor-widget-image\" data-id=\"c894e6b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"190\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1m1O25pvl8R5DlkhuJjRrDw-1024x190.png\" class=\"attachment-large size-large wp-image-18668\" alt=\"Image for post\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1m1O25pvl8R5DlkhuJjRrDw-1024x190.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1m1O25pvl8R5DlkhuJjRrDw-300x56.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1m1O25pvl8R5DlkhuJjRrDw-768x142.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1m1O25pvl8R5DlkhuJjRrDw-1536x285.png 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1m1O25pvl8R5DlkhuJjRrDw-610x113.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1m1O25pvl8R5DlkhuJjRrDw-750x139.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1m1O25pvl8R5DlkhuJjRrDw-1140x211.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1m1O25pvl8R5DlkhuJjRrDw.png 1601w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Feature matrix shape:&nbsp;<strong><em>Number of documents<\/em><\/strong>x<strong><em>Length of vocabulary<\/em><\/strong><\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8e99170 elementor-widget elementor-widget-text-editor\" data-id=\"8e99170\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"f309\">As you can imagine, this approach causes a significant dimensionality problem: the more documents you have the larger is the vocabulary, so the feature matrix will be a huge sparse matrix. Therefore, the Bag-of-Words model is usually preceded by an important preprocessing (word cleaning, stop words removal, stemming\/lemmatization) aimed to reduce the dimensionality problem.<\/p>\n<p id=\"1375\">Terms frequency is not necessarily the best representation for text. In fact, you can find in the corpus common words with the highest frequency but little predictive power over the target variable. 
As you can imagine, this approach causes a significant dimensionality problem: the more documents you have, the larger the vocabulary, so the feature matrix will be a huge sparse matrix. Therefore, the Bag-of-Words model is usually preceded by substantial preprocessing (word cleaning, stop word removal, stemming/lemmatization) aimed at reducing the dimensionality problem.

Term frequency is not necessarily the best representation for text. In fact, the corpus may contain common words with the highest frequency but little predictive power over the target variable. To address this problem, there is an advanced variant of the Bag-of-Words that, instead of simple counting, uses the **term frequency-inverse document frequency** ([Tf-Idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)). Basically, the value of a word increases proportionally to its count in the document, but it is inversely proportional to the frequency of the word in the corpus.
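Written as a formula (the standard Tf-Idf weighting; the article describes it only in words), the weight of a term $t$ in a document $d$ from a corpus of $N$ documents is

$$\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot \log\frac{N}{\mathrm{df}(t)}$$

where $\mathrm{tf}(t,d)$ is the count of $t$ in $d$ and $\mathrm{df}(t)$ is the number of documents containing $t$. Note that scikit-learn's `TfidfVectorizer` uses a smoothed variant of the idf term and normalizes each document vector by default.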
data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">sns.<strong>heatmap<\/strong>(X_train.todense()[:,np.random.randint(0,X.shape[1],100)]==0, vmin=0, vmax=1, cbar=False).set_title('Sparse Matrix Sample')<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d71d22f elementor-widget elementor-widget-image\" data-id=\"d71d22f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"358\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1CpZ9fxPY5iSEzgdyS021_Q-1024x358.png\" class=\"attachment-large size-large wp-image-18669\" alt=\"Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1CpZ9fxPY5iSEzgdyS021_Q-1024x358.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1CpZ9fxPY5iSEzgdyS021_Q-300x105.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1CpZ9fxPY5iSEzgdyS021_Q-768x269.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1CpZ9fxPY5iSEzgdyS021_Q-610x213.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1CpZ9fxPY5iSEzgdyS021_Q-750x262.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1CpZ9fxPY5iSEzgdyS021_Q-1140x399.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1CpZ9fxPY5iSEzgdyS021_Q.png 1427w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Random sample from the feature matrix (non-zero values in black)<\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-90cb423 elementor-widget elementor-widget-text-editor\" data-id=\"90cb423\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"9df7\">In order to know the position of a certain word, we can look it up in the vocabulary:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-36e40e6 elementor-widget elementor-widget-text-editor\" data-id=\"36e40e6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">word = \"new york\"dic_vocabulary[word]<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bdb42e1 elementor-widget elementor-widget-text-editor\" data-id=\"bdb42e1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"931e\">If the word exists in the vocabulary, this command prints a number&nbsp;<em>N<\/em>, meaning that the&nbsp;<em>N<\/em>th feature of the matrix is that word.<\/p>\n<p id=\"18fd\">In order to drop some columns and reduce the matrix dimensionality, we can carry out some&nbsp;<strong>Feature Selection<\/strong>, the process of selecting a 
In order to drop some columns and reduce the matrix dimensionality, we can carry out some **Feature Selection**, the process of selecting a subset of relevant variables. I will proceed as follows:

1. treat each category as binary (for example, the "Tech" category is 1 for the Tech news and 0 for the others);
2. perform a [Chi-Square test](https://en.wikipedia.org/wiki/Chi-squared_test) to determine whether a feature and the (binary) target are independent;
3. keep only the features with a certain p-value from the Chi-Square test.

```python
y = dtf_train["y"]
X_names = vectorizer.get_feature_names()
p_value_limit = 0.95

dtf_features = pd.DataFrame()
for cat in np.unique(y):
    chi2, p = feature_selection.chi2(X_train, y==cat)
    dtf_features = pd.concat([dtf_features, pd.DataFrame(
                   {"feature":X_names, "score":1-p, "y":cat})])
    dtf_features = dtf_features.sort_values(["y","score"],
                    ascending=[True,False])
    dtf_features = dtf_features[dtf_features["score"]>p_value_limit]

X_names = dtf_features["feature"].unique().tolist()
```

I reduced the number of features from 10,000 to 3,152 by keeping the most statistically relevant ones. Let's print some:
top features:\", \",\".join(<br>dtf_features[dtf_features[\"y\"]==cat][\"feature\"].values[:10]))<br>   print(\" \")<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-095a475 elementor-widget elementor-widget-image\" data-id=\"095a475\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"218\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Fo0EjcD4Ibo2Jz1y6eNL0A-1024x218.png\" class=\"attachment-large size-large wp-image-18670\" alt=\"Image for post\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Fo0EjcD4Ibo2Jz1y6eNL0A-1024x218.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Fo0EjcD4Ibo2Jz1y6eNL0A-300x64.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Fo0EjcD4Ibo2Jz1y6eNL0A-768x164.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Fo0EjcD4Ibo2Jz1y6eNL0A-610x130.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Fo0EjcD4Ibo2Jz1y6eNL0A-750x160.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Fo0EjcD4Ibo2Jz1y6eNL0A-1140x243.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Fo0EjcD4Ibo2Jz1y6eNL0A.png 1479w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0fe99c0 elementor-widget elementor-widget-text-editor\" data-id=\"0fe99c0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"1761\">We can refit the vectorizer on the corpus by giving this new set of words as input. That will produce a smaller feature matrix and a shorter vocabulary.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7f60c3d elementor-widget elementor-widget-text-editor\" data-id=\"7f60c3d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">vectorizer = feature_extraction.text.<strong>TfidfVectorizer<\/strong>(vocabulary=X_names)vectorizer.fit(corpus)<br>X_train = vectorizer.transform(corpus)<br>dic_vocabulary = vectorizer.vocabulary_<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f872d46 elementor-widget elementor-widget-text-editor\" data-id=\"f872d46\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"d6e5\">The new feature matrix&nbsp;<em>X_train<\/em>&nbsp;has a shape of is 34,265 (Number of documents in training) x 3,152 (Length of the given vocabulary). 
Let's see if the matrix is less sparse:

*(Figure: random sample from the new feature matrix, non-zero values in black.)*

It's time to train a **machine learning model** and test it. I recommend using a Naive Bayes algorithm: a probabilistic classifier that makes use of [Bayes' Theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem), a rule that uses probability to make predictions based on prior knowledge of conditions that might be related. This algorithm is well suited to such a large dataset because it considers each feature independently, calculates the probability of each category, and then predicts the category with the highest probability.
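Written out (the standard Naive Bayes decision rule, which the article describes only in words): for a document with words $w_1,\dots,w_n$, the model picks the category $c$ that maximizes

$$P(c \mid w_1,\dots,w_n) \;\propto\; P(c)\prod_{i=1}^{n} P(w_i \mid c)$$

where the product form comes from the "naive" assumption that features are conditionally independent given the category.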
```python
classifier = naive_bayes.MultinomialNB()
```

I'm going to train this classifier on the feature matrix and then test it on the transformed test set. To that end, I need to build a *scikit-learn* pipeline: a sequential application of a list of transformations and a final estimator. Putting the Tf-Idf vectorizer and the Naive Bayes classifier in a pipeline allows us to transform and predict test data in just one step.

```python
## pipeline
model = pipeline.Pipeline([("vectorizer", vectorizer),
                           ("classifier", classifier)])

## train classifier
model["classifier"].fit(X_train, y_train)

## test
X_test = dtf_test["text_clean"].values
predicted = model.predict(X_test)
predicted_prob = model.predict_proba(X_test)
```

We can now evaluate the performance of the Bag-of-Words model. I will use the following metrics:

- Accuracy: the fraction of predictions the model got right.
- Confusion Matrix: a summary table that breaks down the number of correct and incorrect predictions by each class.
- ROC: a plot that illustrates the true positive rate against the false positive rate at various threshold settings. The area under the curve (AUC) indicates the probability that the classifier will rank a randomly chosen positive observation higher than a randomly chosen negative one.
- Precision: the fraction of relevant instances among the retrieved instances.
- Recall: the fraction of the total amount of relevant instances that were actually retrieved.

```python
classes = np.unique(y_test)
y_test_array = pd.get_dummies(y_test, drop_first=False).values

## Accuracy, Precision, Recall
accuracy = metrics.accuracy_score(y_test, predicted)
auc = metrics.roc_auc_score(y_test, predicted_prob,
                            multi_class="ovr")
print("Accuracy:", round(accuracy,2))
print("Auc:", round(auc,2))
print("Detail:")
print(metrics.classification_report(y_test, predicted))

## Plot confusion matrix
cm = metrics.confusion_matrix(y_test, predicted)
fig, ax = plt.subplots()
sns.heatmap(cm, annot=True, fmt='d', ax=ax, cmap=plt.cm.Blues,
            cbar=False)
ax.set(xlabel="Pred", ylabel="True", xticklabels=classes,
       yticklabels=classes, title="Confusion matrix")
plt.yticks(rotation=0)

fig, ax = plt.subplots(nrows=1, ncols=2)

## Plot roc
for i in range(len(classes)):
    fpr, tpr, thresholds = metrics.roc_curve(y_test_array[:,i],
                           predicted_prob[:,i])
    ax[0].plot(fpr, tpr, lw=3,
               label='{0} (area={1:0.2f})'.format(classes[i],
                               metrics.auc(fpr, tpr)))
ax[0].plot([0,1], [0,1], color='navy', lw=3, linestyle='--')
ax[0].set(xlim=[-0.05,1.0], ylim=[0.0,1.05],
          xlabel='False Positive Rate',
          ylabel="True Positive Rate (Recall)",
          title="Receiver operating characteristic")
ax[0].legend(loc="lower right")
ax[0].grid(True)

## Plot precision-recall curve
for i in range(len(classes)):
    precision, recall, thresholds = metrics.precision_recall_curve(
                 y_test_array[:,i], predicted_prob[:,i])
    ax[1].plot(recall, precision, lw=3,
               label='{0} (area={1:0.2f})'.format(classes[i],
                                  metrics.auc(recall, precision)))
ax[1].set(xlim=[0.0,1.05], ylim=[0.0,1.05], xlabel='Recall',
          ylabel="Precision", title="Precision-Recall curve")
ax[1].legend(loc="best")
ax[1].grid(True)
plt.show()
```
The BoW model got 85% of the test set right (accuracy is 0.85), but it struggles to recognize Tech news (only 252 predicted correctly).

Let's try to understand why the model classifies news into a certain category and assess the **explainability** of these predictions. The *lime* package can help us build an explainer. To give an illustration, I will take a random observation from the test set and see what the model predicts, and why.

```python
## select observation
i = 0
txt_instance = dtf_test["text"].iloc[i]

## check true value and predicted value
print("True:", y_test[i], "--> Pred:", predicted[i],
      "| Prob:", round(np.max(predicted_prob[i]),2))

## show explanation
explainer = lime_text.LimeTextExplainer(class_names=np.unique(y_train))
explained = explainer.explain_instance(txt_instance,
             model.predict_proba, num_features=3)
explained.show_in_notebook(text=txt_instance, predict_proba=False)
```
100vw, 607px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8527fdb elementor-widget elementor-widget-image\" data-id=\"8527fdb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"141\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1qcAV1wvucxNogDz_eh3dpQ-1024x141.png\" class=\"attachment-large size-large wp-image-18674\" alt=\"Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1qcAV1wvucxNogDz_eh3dpQ-1024x141.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1qcAV1wvucxNogDz_eh3dpQ-300x41.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1qcAV1wvucxNogDz_eh3dpQ-768x106.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1qcAV1wvucxNogDz_eh3dpQ-1536x212.png 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1qcAV1wvucxNogDz_eh3dpQ-610x84.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1qcAV1wvucxNogDz_eh3dpQ-750x104.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1qcAV1wvucxNogDz_eh3dpQ-1140x158.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1qcAV1wvucxNogDz_eh3dpQ.png 1947w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ef43d30 elementor-widget elementor-widget-text-editor\" data-id=\"ef43d30\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"cf12\">That makes sense: the words \u201c<em>Clinton<\/em>\u201d and \u201c<em>GOP<\/em>\u201d pointed the model in the right direction (Politics news) even if the word \u201c<em>Stage<\/em>\u201d is more common among Entertainment news.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-76871e9 elementor-widget elementor-widget-heading\" data-id=\"76871e9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Word Embedding<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2be0a2b elementor-widget elementor-widget-text-editor\" data-id=\"2be0a2b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"58f4\"><a href=\"https:\/\/en.wikipedia.org\/wiki\/Word_embedding\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Word Embedding<\/em><\/a>\u00a0is the collective name for feature learning techniques where words from the vocabulary are mapped to vectors of real numbers. These vectors are calculated from the probability distribution for each word appearing before or after another. To put it another way, words of the same context usually appear together in the corpus, so they will be close in the vector space as well. 
For instance, let\u2019s take the 3 sentences from the previous example:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c4d3f44 elementor-widget elementor-widget-image\" data-id=\"c4d3f44\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1012\" height=\"401\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1u67szEvNSMqrQeitdahw_A.png\" class=\"attachment-large size-large wp-image-18675\" alt=\"Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1u67szEvNSMqrQeitdahw_A.png 1012w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1u67szEvNSMqrQeitdahw_A-300x119.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1u67szEvNSMqrQeitdahw_A-768x304.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1u67szEvNSMqrQeitdahw_A-610x242.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1u67szEvNSMqrQeitdahw_A-750x297.png 750w\" sizes=\"(max-width: 1012px) 100vw, 1012px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Words embedded in 2D vector space<\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-cf6a51c elementor-widget elementor-widget-text-editor\" data-id=\"cf6a51c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"efe4\">In this tutorial, I\u2019m going to use the first model of this family: Google\u2019s\u00a0<em><a href=\"https:\/\/en.wikipedia.org\/wiki\/Word2vec\" target=\"_blank\" rel=\"noreferrer noopener\">Word2Vec<\/a><\/em> (2013). Other popular Word Embedding models are Stanford\u2019s\u00a0<em><a href=\"https:\/\/en.wikipedia.org\/wiki\/GloVe_(machine_learning)\" target=\"_blank\" rel=\"noreferrer noopener\">GloVe<\/a><\/em> (2014) and Facebook\u2019s\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/FastText\" target=\"_blank\" rel=\"noreferrer noopener\"><em>FastText<\/em><\/a>\u00a0(2016).<\/p>\n<p id=\"b046\"><strong>Word2Vec&nbsp;<\/strong>produces a vector space, typically of several hundred dimensions, in which each unique word in the corpus is assigned a vector, such that words that share common contexts in the corpus are located close to one another in the space. 
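<\/p>\n<p>To make \u201ccloseness\u201d concrete, here is a minimal sketch (assuming <em>nlp<\/em> is a <em>gensim<\/em> word-vector model, like the ones loaded or fitted below, and that both words are in its vocabulary) of how the similarity between two word vectors can be measured:<\/p>\n<pre class=\"wp-block-preformatted\">import numpy as np<strong><br>## cosine similarity: 1 = same direction, ~0 = unrelated<\/strong><br>v1, v2 = nlp[\"king\"], nlp[\"queen\"]<br>print(round(np.dot(v1, v2) \/ (np.linalg.norm(v1)*np.linalg.norm(v2)), 2))<strong><br>## gensim can also return the nearest words directly<\/strong><br>print(nlp.most_similar(\"king\", topn=3))<\/pre>\n<p>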
That can be done using 2 different approaches: starting from a single word to predict its context (<em>Skip-gram<\/em>) or starting from the context to predict a word (<em>Continuous Bag-of-Words<\/em>).<\/p>\n<p id=\"3453\"><strong>In Python, you can load a pre-trained Word Embedding model from\u00a0<em><a href=\"https:\/\/github.com\/RaRe-Technologies\/gensim-data\" target=\"_blank\" rel=\"noreferrer noopener\">gensim-data<\/a><\/em> like this:<\/strong><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6e9e2e2 elementor-widget elementor-widget-text-editor\" data-id=\"6e9e2e2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">nlp = gensim_api.load(\"<strong>word2vec-google-news-300\"<\/strong>)<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-881d5e6 elementor-widget elementor-widget-text-editor\" data-id=\"881d5e6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"3b3c\">Instead of using a pre-trained model, I am going to fit my own Word2Vec on the training data corpus with&nbsp;<em>gensim.<\/em>&nbsp;Before fitting the model, the corpus needs to be transformed into a list of lists of n-grams. In this particular case, I\u2019ll try to capture unigrams (\u201c<em>york<\/em>\u201d), bigrams (\u201c<em>new york<\/em>\u201d), and trigrams (\u201c<em>new york city<\/em>\u201d).<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-53d6614 elementor-widget elementor-widget-text-editor\" data-id=\"53d6614\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">corpus = dtf_train[\"<strong>text_clean<\/strong>\"]<strong><br>## create list of lists of unigrams<\/strong><br>lst_corpus = []<br>for string in corpus:<br>   lst_words = string.split()<br>   lst_grams = [\" \".join(lst_words[i:i+1]) <br>               for i in range(0, len(lst_words), 1)]<br>   lst_corpus.append(lst_grams)<strong><br>## detect bigrams and trigrams<\/strong><br>bigrams_detector = gensim.models.phrases.<strong>Phrases<\/strong>(lst_corpus, <br>                 delimiter=\" \".encode(), min_count=5, threshold=10)<br>bigrams_detector = gensim.models.phrases.<strong>Phraser<\/strong>(bigrams_detector)<br>trigrams_detector = gensim.models.phrases.<strong>Phrases<\/strong>(bigrams_detector[lst_corpus], <br>            delimiter=\" \".encode(), min_count=5, threshold=10)<br>trigrams_detector = gensim.models.phrases.<strong>Phraser<\/strong>(trigrams_detector)<strong><br>## apply the detectors to the training corpus (mirroring the test-set step later)<\/strong><br>lst_corpus = list(bigrams_detector[lst_corpus])<br>lst_corpus = list(trigrams_detector[lst_corpus])<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b565d8c elementor-widget elementor-widget-text-editor\" data-id=\"b565d8c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"c423\"><strong>When fitting the Word2Vec, you need to specify:<\/strong><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8e5a7e1 elementor-widget elementor-widget-text-editor\" 
data-id=\"8e5a7e1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li>the target size of the word vectors, I\u2019ll use 300;<\/li><li>the window, or the maximum distance between the current and predicted word within a sentence, I\u2019ll use the mean length of text in the corpus;<\/li><li>the training algorithm, I\u2019ll use skip-grams (sg=1) as in general it has better results.<\/li><\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-14feea4 elementor-widget elementor-widget-text-editor\" data-id=\"14feea4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\"><strong>## fit w2v<\/strong><br>nlp = gensim.models.word2vec.<strong>Word2Vec<\/strong>(lst_corpus, size=300,   <br>            window=8, min_count=1, sg=1, iter=30)<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f7ceeaa elementor-widget elementor-widget-text-editor\" data-id=\"f7ceeaa\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"9959\">We have our embedding model, so we can select any word from the corpus and transform it into a vector.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4b47d60 elementor-widget elementor-widget-text-editor\" data-id=\"4b47d60\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">word = \"data\"<br>nlp[word].shape<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d8fed5c elementor-widget elementor-widget-image\" data-id=\"d8fed5c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"120\" height=\"32\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1nMBpatACacNe1SZYF39j3Q.png\" class=\"attachment-large size-large wp-image-18676\" alt=\"300 image\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-aaa8902 elementor-widget elementor-widget-text-editor\" data-id=\"aaa8902\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"d20c\">We can even use it to visualize a word and its context into a smaller dimensional space (2D or 3D) by applying any dimensionality reduction algorithm (i.e.\u00a0<a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.manifold.TSNE.html\" target=\"_blank\" rel=\"noreferrer noopener\">TSNE<\/a>).<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9c74bae elementor-widget elementor-widget-text-editor\" data-id=\"9c74bae\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">word = \"data\"<br>fig = plt.figure()<strong>## word embedding<\/strong><br>tot_words = [word] + [tupla[0] for tupla in <br>                 nlp.most_similar(word, topn=20)]<br>X = nlp[tot_words]<strong>## pca to reduce dimensionality from 300 to 3<\/strong><br>pca = manifold.<strong>TSNE<\/strong>(perplexity=40, n_components=3, init='pca')<br>X = pca.fit_transform(X)<strong>## create dtf<\/strong><br>dtf_ = pd.DataFrame(X, index=tot_words, columns=[\"x\",\"y\",\"z\"])<br>dtf_[\"input\"] = 0<br>dtf_[\"input\"].iloc[0:1] = 1<strong>## plot 3d<\/strong><br>from mpl_toolkits.mplot3d import Axes3D<br>ax = fig.add_subplot(111, projection='3d')<br>ax.scatter(dtf_[dtf_[\"input\"]==0]['x'], <br>           dtf_[dtf_[\"input\"]==0]['y'], <br>           dtf_[dtf_[\"input\"]==0]['z'], c=\"black\")<br>ax.scatter(dtf_[dtf_[\"input\"]==1]['x'], <br>           dtf_[dtf_[\"input\"]==1]['y'], <br>           dtf_[dtf_[\"input\"]==1]['z'], c=\"red\")<br>ax.set(xlabel=None, ylabel=None, zlabel=None, xticklabels=[], <br>       yticklabels=[], zticklabels=[])<br>for label, row in dtf_[[\"x\",\"y\",\"z\"]].iterrows():<br>    x, y, z = row<br>    ax.text(x, y, z, s=label)<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-039d56b elementor-widget elementor-widget-image\" data-id=\"039d56b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"500\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1T8WWibd7u8b7gfgeG0LgAA-1024x500.gif\" class=\"attachment-large size-large wp-image-18662\" alt=\"Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1T8WWibd7u8b7gfgeG0LgAA-1024x500.gif 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1T8WWibd7u8b7gfgeG0LgAA-300x147.gif 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1T8WWibd7u8b7gfgeG0LgAA-768x375.gif 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1T8WWibd7u8b7gfgeG0LgAA-1536x750.gif 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1T8WWibd7u8b7gfgeG0LgAA-610x298.gif 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1T8WWibd7u8b7gfgeG0LgAA-750x366.gif 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1T8WWibd7u8b7gfgeG0LgAA-1140x557.gif 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4544678 elementor-widget elementor-widget-text-editor\" data-id=\"4544678\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"2637\">That\u2019s pretty cool and all, but how can the word embedding be useful to predict the news category? Well, the word vectors can be used in a neural network as weights. 
This is how:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c536b6b elementor-widget elementor-widget-text-editor\" data-id=\"c536b6b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li>First, transform the corpus into padded sequences of word ids to get a feature matrix.<\/li><li>Then, create an embedding matrix so that the vector of the word with id&nbsp;<em>N&nbsp;<\/em>is located at the&nbsp;<em>Nth<\/em>&nbsp;row.<\/li><li>Finally, build a neural network with an embedding layer that weighs every word in the sequences with the corresponding vector.<\/li><\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5ed7dbf elementor-widget elementor-widget-text-editor\" data-id=\"5ed7dbf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"7f89\">Let\u2019s start with the&nbsp;<strong>Feature Engineering&nbsp;<\/strong>by transforming the same preprocessed corpus (list of lists of n-grams) given to the Word2Vec into a list of sequences using&nbsp;<em>tensorflow\/keras<\/em>:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-89f5ec1 elementor-widget elementor-widget-text-editor\" data-id=\"89f5ec1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\"><strong>## tokenize text<\/strong><br>tokenizer = kprocessing.text.<strong>Tokenizer<\/strong>(lower=True, split=' ', <br>                     oov_token=\"NaN\", <br>                     filters='!\"#$%&amp;()*+,-.\/:;&lt;=&gt;?@[\\\\]^_`{|}~\\t\\n')<br>tokenizer.fit_on_texts(lst_corpus)<br>dic_vocabulary = tokenizer.word_index<strong>## create sequence<\/strong><br>lst_text2seq= tokenizer.texts_to_sequences(lst_corpus)<strong>## padding sequence<\/strong><br>X_train = kprocessing.sequence.<strong>pad_sequences<\/strong>(lst_text2seq, <br>                    maxlen=15, padding=\"post\", truncating=\"post\")<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fa3b83e elementor-widget elementor-widget-text-editor\" data-id=\"fa3b83e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"5ade\">The feature matrix&nbsp;<em>X_train<\/em>&nbsp;has a shape of 34,265 x 15 (Number of sequences x Sequences max length). 
Let\u2019s visualize it:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-53813a7 elementor-widget elementor-widget-text-editor\" data-id=\"53813a7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">sns.heatmap(X_train==0, vmin=0, vmax=1, cbar=False)<br>plt.show()<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e23dab6 elementor-widget elementor-widget-image\" data-id=\"e23dab6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"340\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1tQ588zm4i96xfEBCfiUpzQ-1024x340.png\" class=\"attachment-large size-large wp-image-18677\" alt=\"Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1tQ588zm4i96xfEBCfiUpzQ-1024x340.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1tQ588zm4i96xfEBCfiUpzQ-300x100.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1tQ588zm4i96xfEBCfiUpzQ-768x255.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1tQ588zm4i96xfEBCfiUpzQ-610x203.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1tQ588zm4i96xfEBCfiUpzQ-750x249.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1tQ588zm4i96xfEBCfiUpzQ-1140x379.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1tQ588zm4i96xfEBCfiUpzQ.png 1453w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Feature matrix (34,265 x 15)<\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ce384fe elementor-widget elementor-widget-text-editor\" data-id=\"ce384fe\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"23eb\">Every text in the corpus is now an id sequence with length 15. For instance, if a text had 10 tokens in it, then the sequence is composed of 10 ids + 5 0s, which is the padding element (while the id for word not in the vocabulary is 1). 
Let\u2019s print how a text from the train set has been transformed into a sequence with the padding and the vocabulary.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8a379a0 elementor-widget elementor-widget-text-editor\" data-id=\"8a379a0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">i = 0<br><br><strong>## list of text: [\"I like this\", ...]<\/strong><br>len_txt = len(dtf_train[\"text_clean\"].iloc[i].split())<br>print(\"from: \", dtf_train[\"text_clean\"].iloc[i], \"| len:\", len_txt)<br><br><strong>## sequence of token ids: [[1, 2, 3], ...]<\/strong><br>len_tokens = len(X_train[i])<br>print(\"to: \", X_train[i], \"| len:\", len(X_train[i]))<br><br><strong>## vocabulary: {\"I\":1, \"like\":2, \"this\":3, ...}<\/strong><br>print(\"check: \", dtf_train[\"text_clean\"].iloc[i].split()[0], <br>      \" -- idx in vocabulary --&gt;\", <br>      dic_vocabulary[dtf_train[\"text_clean\"].iloc[i].split()[0]])<br><br>print(\"vocabulary: \", dict(list(dic_vocabulary.items())[0:5]), \"... (padding element, 0)\")<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9e138fa elementor-widget elementor-widget-image\" data-id=\"9e138fa\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"114\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1vhJWO3cTN2jfrg2upX2kGQ-1024x114.png\" class=\"attachment-large size-large wp-image-18678\" alt=\"Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1vhJWO3cTN2jfrg2upX2kGQ-1024x114.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1vhJWO3cTN2jfrg2upX2kGQ-300x33.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1vhJWO3cTN2jfrg2upX2kGQ-768x85.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1vhJWO3cTN2jfrg2upX2kGQ-610x68.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1vhJWO3cTN2jfrg2upX2kGQ-750x83.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1vhJWO3cTN2jfrg2upX2kGQ-1140x127.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1vhJWO3cTN2jfrg2upX2kGQ.png 1333w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-af73ec5 elementor-widget elementor-widget-text-editor\" data-id=\"af73ec5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"a07a\">Before moving on, don\u2019t forget to do the same feature engineering on the test set as well:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-09a3eeb elementor-widget elementor-widget-text-editor\" data-id=\"09a3eeb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">corpus = 
dtf_test[\"<strong>text_clean<\/strong>\"]<strong><br>## create list of n-grams<\/strong><br>lst_corpus = []<br>for string in corpus:<br>    lst_words = string.split()<br>    lst_grams = [\" \".join(lst_words[i:i+1]) for i in range(0, <br>                 len(lst_words), 1)]<br>    lst_corpus.append(lst_grams)<br>    <strong>## detect common bigrams and trigrams using the fitted detectors<\/strong><br>lst_corpus = list(bigrams_detector[lst_corpus])<br>lst_corpus = list(trigrams_detector[lst_corpus])<br><strong>## text to sequence with the fitted tokenizer<\/strong><br>lst_text2seq = tokenizer.texts_to_sequences(lst_corpus)<br><strong>## padding sequence<\/strong><br>X_test = kprocessing.sequence.<strong>pad_sequences<\/strong>(lst_text2seq, maxlen=15,<br>             padding=\"post\", truncating=\"post\")<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ec9b2cc elementor-widget elementor-widget-image\" data-id=\"ec9b2cc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"341\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1H6ZZAT-9tmu8PcaY74Un6w-1024x341.png\" class=\"attachment-large size-large wp-image-18679\" alt=\"Image for post\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1H6ZZAT-9tmu8PcaY74Un6w-1024x341.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1H6ZZAT-9tmu8PcaY74Un6w-300x100.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1H6ZZAT-9tmu8PcaY74Un6w-768x256.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1H6ZZAT-9tmu8PcaY74Un6w-610x203.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1H6ZZAT-9tmu8PcaY74Un6w-750x250.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1H6ZZAT-9tmu8PcaY74Un6w-1140x380.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1H6ZZAT-9tmu8PcaY74Un6w.png 1432w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">X_test (14,697 x 15)<\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a7032cf elementor-widget elementor-widget-text-editor\" data-id=\"a7032cf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"a0f8\">We\u2019ve got our&nbsp;<em>X_train<\/em>&nbsp;and&nbsp;<em>X_test<\/em>, now we need to create the&nbsp;<strong>matrix of embedding<\/strong>&nbsp;that will be used as a weight matrix in the neural network classifier.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d7b64ac elementor-widget elementor-widget-text-editor\" data-id=\"d7b64ac\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\"><strong>## start the matrix (length of vocabulary x vector size) with all 0s<\/strong><br>embeddings = np.zeros((len(dic_vocabulary)+1, 300))for word,idx in dic_vocabulary.items():<br> 
   <strong>## update the row with vector<\/strong><br>    try:<br>        embeddings[idx] = nlp[word]<br>    <strong>## if word not in model then skip and the row stays all 0s<\/strong><br>    except KeyError:<br>        pass<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0367efe elementor-widget elementor-widget-text-editor\" data-id=\"0367efe\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"2aa4\">That code generates a matrix of shape 22,338 x 300 (Length of vocabulary extracted from the corpus x Vector size). It can be navigated by word id, which can be obtained from the vocabulary.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4b62791 elementor-widget elementor-widget-text-editor\" data-id=\"4b62791\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">word = \"data\"<br>print(\"dic[word]:\", dic_vocabulary[word], \"|idx\")<br>print(\"embeddings[idx]:\", embeddings[dic_vocabulary[word]].shape, <br>      \"|vector\")<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7fdf519 elementor-widget elementor-widget-image\" data-id=\"7fdf519\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"490\" height=\"52\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1EWLnzZ0abpnhH1zR8jq5PQ.png\" class=\"attachment-large size-large wp-image-18680\" alt=\"Length of vocabulary extracted from the corpus x Vector size\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1EWLnzZ0abpnhH1zR8jq5PQ.png 490w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1EWLnzZ0abpnhH1zR8jq5PQ-300x32.png 300w\" sizes=\"(max-width: 490px) 100vw, 490px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f95888b elementor-widget elementor-widget-text-editor\" data-id=\"f95888b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"f4fc\">It\u2019s finally time to build a&nbsp;<strong>deep learning model<\/strong>. I\u2019m going to use the embedding matrix in the first Embedding layer of the neural network that I will build and train to classify the news. Each id in the input sequence will be used as the index to access the embedding matrix. The output of this Embedding layer will be a 2D matrix with a word vector for each word id in the input sequence (Sequence length x Vector size). 
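<\/p>\n<p>Here is a minimal sketch of that lookup in isolation (a hypothetical 2-id input, assuming the keras <em>layers<\/em> module imported for the model below, not the actual classifier itself): a frozen Embedding layer simply maps each id to its row of the matrix.<\/p>\n<pre class=\"wp-block-preformatted\"><strong>## demo: a frozen Embedding layer maps ids to their word vectors<\/strong><br>demo_emb = layers.<strong>Embedding<\/strong>(input_dim=embeddings.shape[0],<br>            output_dim=embeddings.shape[1],<br>            weights=[embeddings], trainable=False)<br>out = demo_emb(np.array([[dic_vocabulary[\"data\"], 0]]))<br>print(out.shape)  <strong>## (1, 2, 300): 1 sequence, 2 ids, 300-dim vectors<\/strong><\/pre>\n<p>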
Let\u2019s use the sentence \u201c<em>I like this article<\/em>\u201d as an example:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-04af1c0 elementor-widget elementor-widget-image\" data-id=\"04af1c0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"610\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1KvBp0xzRThA7qTXACT4A-g-1024x610.png\" class=\"attachment-large size-large wp-image-18681\" alt=\"Sequence length x Vector size\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1KvBp0xzRThA7qTXACT4A-g-1024x610.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1KvBp0xzRThA7qTXACT4A-g-300x179.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1KvBp0xzRThA7qTXACT4A-g-768x457.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1KvBp0xzRThA7qTXACT4A-g-1536x915.png 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1KvBp0xzRThA7qTXACT4A-g-610x363.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1KvBp0xzRThA7qTXACT4A-g-750x447.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1KvBp0xzRThA7qTXACT4A-g-1140x679.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1KvBp0xzRThA7qTXACT4A-g.png 1627w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5a5127a elementor-widget elementor-widget-text-editor\" data-id=\"5a5127a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"e0c6\"><strong>My neural network shall be structured as follows:<\/strong><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3f5e443 elementor-widget elementor-widget-text-editor\" data-id=\"3f5e443\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li>an Embedding layer that takes the sequences as input and the word vectors as weights, just as described before.<\/li><li>A simple Attention layer that won\u2019t affect the predictions but it\u2019s going to capture the weights of each instance and allow us to build a nice explainer (it isn&#8217;t necessary for the predictions, just for the explainability, so you can skip it). The Attention mechanism was presented in\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1409.0473\" target=\"_blank\" rel=\"noreferrer noopener\">this paper<\/a>\u00a0(2014) as a solution to the problem of the sequence models (i.e. 
LSTM) to understand what parts of a long text are actually relevant.<\/li><li>Two layers of Bidirectional LSTM to model the order of words in a sequence in both directions.<\/li><li>Two final dense layers that will predict the probability of each news category.<\/li><\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-dfdeff2 elementor-widget elementor-widget-text-editor\" data-id=\"dfdeff2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\"><strong>## code attention layer<\/strong><br>def <strong>attention_layer<\/strong>(inputs, neurons):<br>    x = layers.<strong>Permute<\/strong>((2,1))(inputs)<br>    x = layers.<strong>Dense<\/strong>(neurons, activation=\"softmax\")(x)<br>    x = layers.<strong>Permute<\/strong>((2,1), name=\"<strong>attention<\/strong>\")(x)<br>    x = layers.<strong>multiply<\/strong>([inputs, x])<br>    return x<strong><br>## input<\/strong><br>x_in = layers.<strong>Input<\/strong>(shape=(15,))<strong>## embedding<\/strong><br>x = layers.<strong>Embedding<\/strong>(input_dim=embeddings.shape[0],  <br>                     output_dim=embeddings.shape[1], <br>                     weights=[embeddings],<br>                     input_length=15, trainable=False)(x_in)<strong>## apply attention<\/strong><br>x = attention_layer(x, neurons=15)<strong>## 2 layers of bidirectional lstm<\/strong><br>x = layers.<strong>Bidirectional<\/strong>(layers.<strong>LSTM<\/strong>(units=15, dropout=0.2, <br>                         return_sequences=True))(x)<br>x = layers.<strong>Bidirectional<\/strong>(layers.<strong>LSTM<\/strong>(units=15, dropout=0.2))(x)<strong>## final dense layers<\/strong><br>x = layers.<strong>Dense<\/strong>(64, activation='relu')(x)<br>y_out = layers.<strong>Dense<\/strong>(3, activation='softmax')(x)<strong>## compile<\/strong><br>model = models.<strong>Model<\/strong>(x_in, y_out)<br>model.compile(loss='sparse_categorical_crossentropy',<br>              optimizer='adam', metrics=['accuracy'])<br><br>model.summary()<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-469b785 elementor-widget elementor-widget-image\" data-id=\"469b785\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"624\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Dfu-2hqEMaBe6YHx1C71Uw-1024x624.png\" class=\"attachment-large size-large wp-image-18682\" alt=\"My neural network shall be structured as follows:\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Dfu-2hqEMaBe6YHx1C71Uw-1024x624.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Dfu-2hqEMaBe6YHx1C71Uw-300x183.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Dfu-2hqEMaBe6YHx1C71Uw-768x468.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Dfu-2hqEMaBe6YHx1C71Uw-610x372.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Dfu-2hqEMaBe6YHx1C71Uw-750x457.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Dfu-2hqEMaBe6YHx1C71Uw-1140x694.png 1140w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Dfu-2hqEMaBe6YHx1C71Uw.png 1241w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fd172db elementor-widget elementor-widget-text-editor\" data-id=\"fd172db\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"9280\">Now we can train the model and check the performance on a subset of the training set used for validation before testing it on the actual test set.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5027537 elementor-widget elementor-widget-text-editor\" data-id=\"5027537\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\"><strong>## encode y<\/strong><br>dic_y_mapping = {n:label for n,label in <br>                 enumerate(np.unique(y_train))}<br>inverse_dic = {v:k for k,v in dic_y_mapping.items()}<br>y_train = np.array([inverse_dic[y] for y in y_train])<strong><br>## train<\/strong><br>training = model.fit(x=X_train, y=y_train, batch_size=256, <br>                     epochs=10, shuffle=True, verbose=0, <br>                     validation_split=0.3)<strong><br>## plot loss and accuracy (metrics_names avoids shadowing the sklearn metrics module used for evaluation)<\/strong><br>metrics_names = [k for k in training.history.keys() if (\"loss\" not in k) and (\"val\" not in k)]<br>fig, ax = plt.subplots(nrows=1, ncols=2, sharey=True)<br>ax[0].set(title=\"Training\")<br>ax11 = ax[0].twinx()<br>ax[0].plot(training.history['loss'], color='black')<br>ax[0].set_xlabel('Epochs')<br>ax[0].set_ylabel('Loss', color='black')<br>for metric in metrics_names:<br>    ax11.plot(training.history[metric], label=metric)<br>ax11.set_ylabel(\"Score\", color='steelblue')<br>ax11.legend()<br>ax[1].set(title=\"Validation\")<br>ax22 = ax[1].twinx()<br>ax[1].plot(training.history['val_loss'], color='black')<br>ax[1].set_xlabel('Epochs')<br>ax[1].set_ylabel('Loss', color='black')<br>for metric in metrics_names:<br>     ax22.plot(training.history['val_'+metric], label=metric)<br>ax22.set_ylabel(\"Score\", color=\"steelblue\")<br>plt.show()<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-774acdc elementor-widget elementor-widget-image\" data-id=\"774acdc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"242\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1MdYjdeOju4ez8gcAc-v8qA-1024x242.png\" class=\"attachment-large size-large wp-image-18683\" alt=\"complete the evaluation of the Word Embedding model\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1MdYjdeOju4ez8gcAc-v8qA-1024x242.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1MdYjdeOju4ez8gcAc-v8qA-300x71.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1MdYjdeOju4ez8gcAc-v8qA-768x182.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1MdYjdeOju4ez8gcAc-v8qA-610x144.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1MdYjdeOju4ez8gcAc-v8qA-750x177.png 750w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1MdYjdeOju4ez8gcAc-v8qA-1140x270.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1MdYjdeOju4ez8gcAc-v8qA.png 1514w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-11bae82 elementor-widget elementor-widget-text-editor\" data-id=\"11bae82\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"759d\">Nice! In some epochs, the accuracy reached 0.89. In order to complete the&nbsp;<strong>evaluation<\/strong>&nbsp;of the Word Embedding model, let\u2019s predict the test set and compare the same metrics used before (code for metrics is the same as before).<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6cb50e6 elementor-widget elementor-widget-text-editor\" data-id=\"6cb50e6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\"><strong>## test<\/strong><br>predicted_prob = model.predict(X_test)<br>predicted = [dic_y_mapping[np.argmax(pred)] for pred in <br>             predicted_prob]<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6c9815b elementor-widget elementor-widget-image\" data-id=\"6c9815b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"988\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1a39MMTNXnDaFOKFur2Z7xQ-1024x988.png\" class=\"attachment-large size-large wp-image-18684\" alt=\"complete the evaluation of the Word Embedding model\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1a39MMTNXnDaFOKFur2Z7xQ-1024x988.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1a39MMTNXnDaFOKFur2Z7xQ-300x290.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1a39MMTNXnDaFOKFur2Z7xQ-768x741.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1a39MMTNXnDaFOKFur2Z7xQ-610x589.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1a39MMTNXnDaFOKFur2Z7xQ-750x724.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1a39MMTNXnDaFOKFur2Z7xQ-1140x1100.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1a39MMTNXnDaFOKFur2Z7xQ.png 1430w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\"><\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6f802f5 elementor-widget elementor-widget-text-editor\" data-id=\"6f802f5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"8636\">The model performs as well as the previous one; in fact, it also struggles to classify Tech news.<\/p>\n<p id=\"a333\">But is 
it&nbsp;<strong>explainable&nbsp;<\/strong>as well? Yes, it is! I put an Attention layer in the neural network to extract the weights of each word and understand how much they contributed to classifying an instance. So I\u2019ll try to use Attention weights to build an explainer (similar to the one seen in the previous section):<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3f3b1d4 elementor-widget elementor-widget-text-editor\" data-id=\"3f3b1d4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\"><strong>## select observation<br><\/strong>i = 0<br>txt_instance = dtf_test[\"<strong>text<\/strong>\"].iloc[i]<strong><br>## check true value and predicted value<\/strong><br>print(\"True:\", y_test[i], \"--&gt; Pred:\", predicted[i], \"| Prob:\", round(np.max(predicted_prob[i]),2))<strong><br>## show explanation<br>### 1. preprocess input<br><\/strong>lst_corpus = []<br>for string in [re.sub(r'[^\\w\\s]','', txt_instance.lower().strip())]:<br>    lst_words = string.split()<br>    lst_grams = [\" \".join(lst_words[i:i+1]) for i in range(0, <br>                 len(lst_words), 1)]<br>    lst_corpus.append(lst_grams)<br>lst_corpus = list(bigrams_detector[lst_corpus])<br>lst_corpus = list(trigrams_detector[lst_corpus])<br>X_instance = kprocessing.sequence.pad_sequences(<br>              tokenizer.texts_to_sequences(lst_corpus), maxlen=15, <br>              padding=\"post\", truncating=\"post\")<strong><br>### 2. get attention weights<\/strong><br>layer = [layer for layer in model.layers if \"<strong>attention<\/strong>\" in <br>         layer.name][0]<br>func = K.function([model.input], [layer.output])<br>weights = func(X_instance)[0]<br>weights = np.mean(weights, axis=2).flatten()<strong><br>### 3. rescale weights, remove null vector, map word-weight<\/strong><br>weights = preprocessing.MinMaxScaler(feature_range=(0,1)).fit_transform(np.array(weights).reshape(-1,1)).reshape(-1)<br>weights = [weights[n] for n,idx in enumerate(X_instance[0]) if idx <br>           != 0]<br>dic_word_weight = {word:weights[n] for n,word in <br>                   enumerate(lst_corpus[0]) if word in <br>                   tokenizer.word_index.keys()}<strong><br>### 4. barplot<\/strong><br>if len(dic_word_weight) &gt; 0:<br>   top = 3  ## number of top words to display<br>   dtf = pd.DataFrame.from_dict(dic_word_weight, orient='index', <br>                                columns=[\"score\"])<br>   dtf.sort_values(by=\"score\", <br>           ascending=True).tail(top).plot(kind=\"barh\", <br>           legend=False).grid(axis='x')<br>   plt.show()<br>else:<br>   print(\"--- No word recognized ---\")<strong><br>### 5. produce html visualization<\/strong><br>text = []<br>for word in lst_corpus[0]:<br>    weight = dic_word_weight.get(word)<br>    if weight is not None:<br>         text.append('&lt;b&gt;&lt;span style=\"background-color:rgba(100,149,237,' + str(weight) + ');\"&gt;' + word + '&lt;\/span&gt;&lt;\/b&gt;')<br>    else:<br>         text.append(word)<br>text = ' '.join(text)<strong><br>### 6. 
visualize on notebook<br><\/strong>print(\"<strong>\\033<\/strong>[1m\"+\"Text with highlighted words\")<br>from IPython.core.display import display, HTML<br>display(HTML(text))<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f897b64 elementor-widget elementor-widget-image\" data-id=\"f897b64\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"594\" height=\"38\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/121iP8FD4XOBDn3XS5LKiJA.png\" class=\"attachment-large size-large wp-image-18685\" alt=\"Image for post\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/121iP8FD4XOBDn3XS5LKiJA.png 594w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/121iP8FD4XOBDn3XS5LKiJA-300x19.png 300w\" sizes=\"(max-width: 594px) 100vw, 594px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3d5acfa elementor-widget elementor-widget-image\" data-id=\"3d5acfa\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"867\" height=\"422\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1CrFeNLHgDzBxAmgGiXR8lg.png\" class=\"attachment-large size-large wp-image-18686\" alt=\"the words \u201cclinton\u201d and \u201cgop\u201d activated the neurons of the model, but this time also \u201chigh\u201d and \u201cbenghazi\u201d &quot;\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1CrFeNLHgDzBxAmgGiXR8lg.png 867w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1CrFeNLHgDzBxAmgGiXR8lg-300x146.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1CrFeNLHgDzBxAmgGiXR8lg-768x374.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1CrFeNLHgDzBxAmgGiXR8lg-610x297.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1CrFeNLHgDzBxAmgGiXR8lg-750x365.png 750w\" sizes=\"(max-width: 867px) 100vw, 867px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3096114 elementor-widget elementor-widget-text-editor\" data-id=\"3096114\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"3f7e\">Just like before, the words \u201c<em>clinton<\/em>\u201d and \u201c<em>gop<\/em>\u201d activated the neurons of the model, but this time also \u201c<em>high<\/em>\u201d and \u201c<em>benghazi<\/em>\u201d have been considered slightly relevant for the prediction.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f8c6cf3 elementor-widget elementor-widget-heading\" data-id=\"f8c6cf3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Language Models<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9195e64 elementor-widget elementor-widget-text-editor\" 
data-id=\"9195e64\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"0455\"><a href=\"https:\/\/en.wikipedia.org\/wiki\/Language_model\" target=\"_blank\" rel=\"noreferrer noopener\">Language Models<\/a>, or Contextualized\/Dynamic Word Embeddings<strong>,\u00a0<\/strong>overcome the biggest limitation of the classic Word Embedding approach: polysemy disambiguation, a word with different meanings (e.g. \u201c\u00a0<em>bank<\/em>\u201d or \u201c<em>stick<\/em>\u201d) is identified by just one vector. One of the first popular ones was ELMO (2018), which doesn\u2019t apply a fixed embedding but, using a bidirectional LSTM, looks at the entire sentence and then assigns an embedding to each word.<\/p>\n<p id=\"9efe\">Enter Transformers: a new modeling technique presented by Google\u2019s paper\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1706.03762\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Attention is All You Need<\/em><\/a>(2017)in which it was demonstrated that sequence models (like LSTM) can be totally replaced by Attention mechanisms, even obtaining better performances.<\/p>\n<p id=\"0e55\">Google\u2019s\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/BERT_(language_model)\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>BERT<\/strong><\/a>(Bidirectional Encoder Representations from Transformers, 2018) combines ELMO context embedding and several Transformers, plus it\u2019s bidirectional (which was a big novelty for Transformers). The vector BERT assigns to a word is a function of the entire sentence, therefore, a word can have different vectors based on the contexts. Let\u2019s try it using\u00a0<em>transformers<\/em>:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-cff60a3 elementor-widget elementor-widget-text-editor\" data-id=\"cff60a3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">txt = <strong>\"bank river\"<\/strong><strong>## bert tokenizer<\/strong><br>tokenizer = transformers.<strong>BertTokenizer<\/strong>.<strong>from_pretrained<\/strong>('bert-base-uncased', do_lower_case=True)<strong>## bert model<\/strong><br>nlp = transformers.<strong>TFBertModel<\/strong>.<strong>from_pretrained<\/strong>('bert-base-uncased')<strong>## return hidden layer with embeddings<\/strong><br>input_ids = np.array(tokenizer.encode(txt))[None,:]  <br>embedding = nlp(input_ids)embedding[0][0]<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ed08176 elementor-widget elementor-widget-image\" data-id=\"ed08176\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"188\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1f7-l1PPDu5Q6rgUBzgz12A-1024x188.png\" class=\"attachment-large size-large wp-image-18687\" alt=\"Bidirectional Encoder Representations from Transformers, 2018\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1f7-l1PPDu5Q6rgUBzgz12A-1024x188.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1f7-l1PPDu5Q6rgUBzgz12A-300x55.png 300w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1f7-l1PPDu5Q6rgUBzgz12A-768x141.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1f7-l1PPDu5Q6rgUBzgz12A-610x112.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1f7-l1PPDu5Q6rgUBzgz12A-750x138.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1f7-l1PPDu5Q6rgUBzgz12A.png 1039w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-27ff21c elementor-widget elementor-widget-text-editor\" data-id=\"27ff21c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"cfbd\">If we change the input text into \u201c<em>bank money<\/em>\u201d, we get this instead:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2cc33bb elementor-widget elementor-widget-image\" data-id=\"2cc33bb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"178\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1V2iWL5kNy9WBzIFPfoVZYQ-1024x178.png\" class=\"attachment-large size-large wp-image-18688\" alt=\"Bidirectional Encoder Representations from Transformers, 2018\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1V2iWL5kNy9WBzIFPfoVZYQ-1024x178.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1V2iWL5kNy9WBzIFPfoVZYQ-300x52.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1V2iWL5kNy9WBzIFPfoVZYQ-768x133.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1V2iWL5kNy9WBzIFPfoVZYQ-610x106.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1V2iWL5kNy9WBzIFPfoVZYQ-750x130.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1V2iWL5kNy9WBzIFPfoVZYQ.png 1037w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f5626f5 elementor-widget elementor-widget-text-editor\" data-id=\"f5626f5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"bd4c\">In order to complete a text classification task, you can use BERT in 3 different ways:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f06e7fc elementor-widget elementor-widget-text-editor\" data-id=\"f06e7fc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li>Train it from scratch and use it as a classifier.<\/li><li>Extract the word embeddings and use them in an embedding layer (like I did with Word2Vec).<\/li><li>Fine-tune the pre-trained model (transfer learning).<\/li><\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-06ceb0a elementor-widget elementor-widget-text-editor\" data-id=\"06ceb0a\" data-element_type=\"widget\" data-e-type=\"widget\" 
data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"8eb7\">I\u2019m going with the latter and do transfer learning from a pre-trained lighter version of BERT, called\u00a0<a href=\"https:\/\/huggingface.co\/transformers\/model_doc\/distilbert.html\" target=\"_blank\" rel=\"noreferrer noopener\">Distil-BERT<\/a>\u00a0(66 million of parameters instead of 110 million!).<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-336f1bf elementor-widget elementor-widget-text-editor\" data-id=\"336f1bf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\"><strong>## distil-bert tokenizer<\/strong><br>tokenizer = transformers.<strong>AutoTokenizer<\/strong>.<strong>from_pretrained<\/strong>('distilbert-base-uncased', do_lower_case=True)<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3c0b7b0 elementor-widget elementor-widget-text-editor\" data-id=\"3c0b7b0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"9ddd\">As usual, before fitting the model there is some&nbsp;<strong>Feature Engineering<\/strong>&nbsp;to do, but this time it\u2019s gonna be a little trickier. To give an illustration of what I\u2019m going to do, let\u2019s take as an example our beloved sentence \u201c<em>I like this article<\/em>\u201d, which has to be transformed into 3 vectors (Ids, Mask, Segment):<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-de51e14 elementor-widget elementor-widget-image\" data-id=\"de51e14\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"336\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1rENCe-2FhlIBIfUstVHRhA-1024x336.png\" class=\"attachment-large size-large wp-image-18689\" alt=\"pre-trained lighter version of BERT, called Distil-BERT\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1rENCe-2FhlIBIfUstVHRhA-1024x336.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1rENCe-2FhlIBIfUstVHRhA-300x99.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1rENCe-2FhlIBIfUstVHRhA-768x252.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1rENCe-2FhlIBIfUstVHRhA-1536x504.png 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1rENCe-2FhlIBIfUstVHRhA-610x200.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1rENCe-2FhlIBIfUstVHRhA-750x246.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1rENCe-2FhlIBIfUstVHRhA-1140x374.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1rENCe-2FhlIBIfUstVHRhA.png 1611w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Shape: 3 x Sequence length<\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div 
class=\"elementor-element elementor-element-c65d6ca elementor-widget elementor-widget-text-editor\" data-id=\"c65d6ca\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"c6b3\">First of all, we need to select the sequence max length. This time I\u2019m gonna choose a much larger number (i.e. 50) because BERT splits unknown words into sub-tokens until it finds a known unigrams. For example, if a made-up word like \u201c<em>zzdata<\/em>\u201d is given, BERT would split it into [\u201c<em>z<\/em>\u201d, \u201c<em>##z<\/em>\u201d, \u201c<em>##data<\/em>\u201d]. Moreover, we have to insert special tokens into the input text, then generate masks and segments. Finally, put all together in a tensor to get the feature matrix that will have the shape of 3 (ids, masks, segments) x Number of documents in the corpus x Sequence length.<\/p>\n<p id=\"37c8\">Please note that I\u2019m using the raw text as corpus (so far I\u2019ve been using the&nbsp;<em>clean_text&nbsp;<\/em>column).<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0622411 elementor-widget elementor-widget-text-editor\" data-id=\"0622411\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\">corpus = dtf_train[\"<strong>text<\/strong>\"]<br>maxlen = 50<strong><br>## add special tokens<\/strong><br>maxqnans = np.int((maxlen-20)\/2)<br>corpus_tokenized = [\"[CLS] \"+<br>             \" \".join(tokenizer.tokenize(re.sub(r'[^\\w\\s]+|\\n', '', <br>             str(txt).lower().strip()))[:maxqnans])+<br>             \" [SEP] \" for txt in corpus]<br><br><strong>## generate masks<\/strong><br>masks = [[1]*len(txt.split(\" \")) + [0]*(maxlen - len(<br>           txt.split(\" \"))) for txt in corpus_tokenized]<br>    <br><strong>## padding<\/strong><br>txt2seq = [txt + \" [PAD]\"*(maxlen-len(txt.split(\" \"))) if len(txt.split(\" \")) != maxlen else txt for txt in corpus_tokenized]<br>    <br><strong>## generate idx<\/strong><br>idx = [tokenizer.encode(seq.split(\" \")) for seq in txt2seq]<br>    <br><strong>## generate segments<\/strong><br>segments = [] <br>for seq in txt2seq:<br>    temp, i = [], 0<br>    for token in seq.split(\" \"):<br>        temp.append(i)<br>        if token == \"[SEP]\":<br>             i += 1<br>    segments.append(temp)<strong>## feature matrix<\/strong><br>X_train = [np.asarray(idx, dtype='int32'), <br>           np.asarray(masks, dtype='int32'), <br>           np.asarray(segments, dtype='int32')]<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fec820e elementor-widget elementor-widget-text-editor\" data-id=\"fec820e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"5aa3\">The feature matrix&nbsp;<em>X_train<\/em>&nbsp;has a shape of 3 x 34,265 x 50. 
<div class=\"elementor-element elementor-element-0622411 elementor-widget elementor-widget-text-editor\" data-id=\"0622411\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">
<div class=\"elementor-widget-container\">
<pre class=\"wp-block-preformatted\">corpus = dtf_train[\"<strong>text<\/strong>\"]<br>maxlen = 50<br><br><strong>## add special tokens<\/strong><br>maxqnans = int((maxlen-20)\/2)<br>corpus_tokenized = [\"[CLS] \"+<br>             \" \".join(tokenizer.tokenize(re.sub(r'[^\\w\\s]+|\\n', '', <br>             str(txt).lower().strip()))[:maxqnans])+<br>             \" [SEP] \" for txt in corpus]<br><br><strong>## generate masks<\/strong><br>masks = [[1]*len(txt.split(\" \")) + [0]*(maxlen - len(<br>           txt.split(\" \"))) for txt in corpus_tokenized]<br><br><strong>## padding<\/strong><br>txt2seq = [txt + \" [PAD]\"*(maxlen-len(txt.split(\" \"))) if len(txt.split(\" \")) != maxlen else txt for txt in corpus_tokenized]<br><br><strong>## generate idx<\/strong><br>idx = [tokenizer.encode(seq.split(\" \")) for seq in txt2seq]<br><br><strong>## generate segments<\/strong><br>segments = []<br>for seq in txt2seq:<br>    temp, i = [], 0<br>    for token in seq.split(\" \"):<br>        temp.append(i)<br>        if token == \"[SEP]\":<br>             i += 1<br>    segments.append(temp)<br><br><strong>## feature matrix<\/strong><br>X_train = [np.asarray(idx, dtype='int32'), <br>           np.asarray(masks, dtype='int32'), <br>           np.asarray(segments, dtype='int32')]<\/pre>
<\/div>
<\/div>
<div class=\"elementor-element elementor-element-fec820e elementor-widget elementor-widget-text-editor\" data-id=\"fec820e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">
<div class=\"elementor-widget-container\">
<p id=\"5aa3\">The feature matrix <em>X_train<\/em> has a shape of 3 x 34,265 x 50. We can check a random observation from the feature matrix:<\/p>
<\/div>
<\/div>
<div class=\"elementor-element elementor-element-a9b9983 elementor-widget elementor-widget-text-editor\" data-id=\"a9b9983\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">
<div class=\"elementor-widget-container\">
<pre class=\"wp-block-preformatted\">i = 0<br><br>print(\"txt: \", dtf_train[\"text\"].iloc[i])<br>print(\"tokenized:\", [tokenizer.convert_ids_to_tokens(idx) for idx in X_train[0][i].tolist()])<br>print(\"idx: \", X_train[0][i])<br>print(\"mask: \", X_train[1][i])<br>print(\"segment: \", X_train[2][i])<\/pre>
<\/div>
<\/div>
<div class=\"elementor-element elementor-element-e8c80ff elementor-widget elementor-widget-image\" data-id=\"e8c80ff\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">
<div class=\"elementor-widget-container\">
<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"262\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1vRUkclqzuGDvmgu1VFs-5A-1024x262.png\" class=\"attachment-large size-large wp-image-18690\" alt=\"ids, mask and segment vectors of one observation\" \/>
<\/div>
<\/div>
<div class=\"elementor-element elementor-element-30c7756 elementor-widget elementor-widget-text-editor\" data-id=\"30c7756\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">
<div class=\"elementor-widget-container\">
<p id=\"a1fc\">You can take the same code and apply it to dtf_test[\u201ctext\u201d] to get <em>X_test<\/em>.<\/p>
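<p>For instance, here is a minimal sketch of that reuse: the same steps as above, wrapped in a hypothetical helper function (not part of the original code) so they can be applied to any corpus:<\/p>
<pre class=\"wp-block-preformatted\"><strong>## the feature-engineering steps above, wrapped for reuse (hypothetical helper)<\/strong><br>def bert_features(corpus, tokenizer, maxlen=50):<br>    maxqnans = int((maxlen-20)\/2)<br>    <strong>## special tokens<\/strong><br>    tokenized = [\"[CLS] \"+\" \".join(tokenizer.tokenize(re.sub(r'[^\\w\\s]+|\\n', '',<br>                 str(txt).lower().strip()))[:maxqnans])+\" [SEP] \" for txt in corpus]<br>    <strong>## masks<\/strong><br>    masks = [[1]*len(txt.split(\" \")) + [0]*(maxlen - len(txt.split(\" \"))) for txt in tokenized]<br>    <strong>## padding<\/strong><br>    txt2seq = [txt + \" [PAD]\"*(maxlen-len(txt.split(\" \"))) if len(txt.split(\" \")) != maxlen else txt for txt in tokenized]<br>    <strong>## ids<\/strong><br>    idx = [tokenizer.encode(seq.split(\" \")) for seq in txt2seq]<br>    <strong>## segments<\/strong><br>    segments = []<br>    for seq in txt2seq:<br>        temp, i = [], 0<br>        for token in seq.split(\" \"):<br>            temp.append(i)<br>            if token == \"[SEP]\":<br>                i += 1<br>        segments.append(temp)<br>    return [np.asarray(idx, dtype='int32'),<br>            np.asarray(masks, dtype='int32'),<br>            np.asarray(segments, dtype='int32')]<br><br>X_test = bert_features(dtf_test[\"text\"], tokenizer)<\/pre>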
<p id=\"1d18\">Now, I\u2019m going to build the <strong>deep learning model with transfer learning<\/strong> from the pre-trained BERT. Basically, I\u2019m going to summarize the output of BERT into one vector with Average Pooling and then add two final Dense layers to predict the probability of each news category.<\/p>
<p id=\"f6c7\">If you want to use the original version of BERT, here\u2019s the code (remember to redo the feature engineering with the right tokenizer):<\/p>
<\/div>
<\/div>
<div class=\"elementor-element elementor-element-d9e32d5 elementor-widget elementor-widget-text-editor\" data-id=\"d9e32d5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">
<div class=\"elementor-widget-container\">
<pre class=\"wp-block-preformatted\"><strong>## inputs<\/strong><br>idx = layers.<strong>Input<\/strong>((50), dtype=\"int32\", name=\"input_idx\")<br>masks = layers.<strong>Input<\/strong>((50), dtype=\"int32\", name=\"input_masks\")<br>segments = layers.<strong>Input<\/strong>((50), dtype=\"int32\", name=\"input_segments\")<br><br><strong>## pre-trained bert<\/strong><br>nlp = transformers.<strong>TFBertModel.from_pretrained<\/strong>(\"bert-base-uncased\")<br>bert_out, _ = nlp([idx, masks, segments])<br><br><strong>## fine-tuning<\/strong><br>x = layers.<strong>GlobalAveragePooling1D<\/strong>()(bert_out)<br>x = layers.<strong>Dense<\/strong>(64, activation=\"relu\")(x)<br>y_out = layers.<strong>Dense<\/strong>(len(np.unique(y_train)), <br>                     activation='softmax')(x)<br><br><strong>## compile<\/strong><br>model = models.<strong>Model<\/strong>([idx, masks, segments], y_out)<br><br><strong>## freeze the pre-trained layers (transfer learning)<\/strong><br>for layer in model.layers[:4]:<br>    layer.trainable = False<br><br>model.compile(loss='sparse_categorical_crossentropy', <br>              optimizer='adam', metrics=['accuracy'])<br><br>model.summary()<\/pre>
<\/div>
<\/div>
<div class=\"elementor-element elementor-element-6c7dfd5 elementor-widget elementor-widget-image\" data-id=\"6c7dfd5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">
<div class=\"elementor-widget-container\">
<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"517\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1riJ2LlNVz_0MJvqAbYG3Bw-1024x517.png\" class=\"attachment-large size-large wp-image-18691\" alt=\"original version of BERT\" \/>
<\/div>
<\/div>
<div class=\"elementor-element elementor-element-df84cd3 elementor-widget elementor-widget-text-editor\" data-id=\"df84cd3\" data-element_type=\"widget\" data-e-type=\"widget\" 
data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"ee15\"><strong>As I said, I\u2019m going to use the lighter version instead, Distil-BERT:<\/strong><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bcfad03 elementor-widget elementor-widget-text-editor\" data-id=\"bcfad03\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre class=\"wp-block-preformatted\"><strong>## inputs<\/strong><br>idx = layers.<strong>Input<\/strong>((50), dtype=\"int32\", name=\"input_idx\")<br>masks = layers.<strong>Input<\/strong>((50), dtype=\"int32\", name=\"input_masks\")<strong>## pre-trained bert with config<\/strong><br>config = transformers.DistilBertConfig(dropout=0.2, <br>           attention_dropout=0.2)<br>config.output_hidden_states = Falsenlp = transformers.<strong>TFDistilBertModel.from_pretrained<\/strong>('distilbert-<br>                  base-uncased', config=config)<br>bert_out = nlp(idx, attention_mask=masks)[0]<strong>## fine-tuning<\/strong><br>x = layers.<strong>GlobalAveragePooling1D<\/strong>()(bert_out)<br>x = layers.<strong>Dense<\/strong>(64, activation=\"relu\")(x)<br>y_out = layers.<strong>Dense<\/strong>(len(np.unique(y_train)), <br>                     activation='softmax')(x)<strong>## compile<\/strong><br>model = models.<strong>Model<\/strong>([idx, masks], y_out)for layer in model.layers[:3]:<br>    layer.trainable = Falsemodel.compile(loss='sparse_categorical_crossentropy', <br>              optimizer='adam', metrics=['accuracy'])model.summary()<\/pre>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ad4f757 elementor-widget elementor-widget-image\" data-id=\"ad4f757\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"424\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1p5sc1h4l3DewgrHvZDrr9g-1024x424.png\" class=\"attachment-large size-large wp-image-18692\" alt=\"lighter version instead, Distil-BERT:\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1p5sc1h4l3DewgrHvZDrr9g-1024x424.png 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1p5sc1h4l3DewgrHvZDrr9g-300x124.png 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1p5sc1h4l3DewgrHvZDrr9g-768x318.png 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1p5sc1h4l3DewgrHvZDrr9g-610x252.png 610w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1p5sc1h4l3DewgrHvZDrr9g-750x310.png 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1p5sc1h4l3DewgrHvZDrr9g-1140x472.png 1140w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1p5sc1h4l3DewgrHvZDrr9g.png 1532w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-49953c8 elementor-widget elementor-widget-text-editor\" data-id=\"49953c8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"ed0a\">Let\u2019s&nbsp;<strong>train, test, 
<div class=\"elementor-element elementor-element-d42e359 elementor-widget elementor-widget-text-editor\" data-id=\"d42e359\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">
<div class=\"elementor-widget-container\">
<pre class=\"wp-block-preformatted\"><strong>## encode y<\/strong><br>dic_y_mapping = {n:label for n,label in <br>                 enumerate(np.unique(y_train))}<br>inverse_dic = {v:k for k,v in dic_y_mapping.items()}<br>y_train = np.array([inverse_dic[y] for y in y_train])<br><br><strong>## train (the Distil-BERT model takes only ids and masks, hence X_train[:2])<\/strong><br>training = model.fit(x=X_train[:2], y=y_train, batch_size=64, <br>                     epochs=1, shuffle=True, verbose=1, <br>                     validation_split=0.3)<br><br><strong>## test<\/strong><br>predicted_prob = model.predict(X_test[:2])<br>predicted = [dic_y_mapping[np.argmax(pred)] for pred in <br>             predicted_prob]<\/pre>
<\/div>
<\/div>
<div class=\"elementor-element elementor-element-09e2891 elementor-widget elementor-widget-image\" data-id=\"09e2891\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">
<div class=\"elementor-widget-container\">
<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"61\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1oS2xD1Y0IWPGDg8uVx2dPQ-1024x61.png\" class=\"attachment-large size-large wp-image-18693\" alt=\"train, test, evaluate this bad boy\" \/>
<\/div>
<\/div>
<div class=\"elementor-element elementor-element-6fbdd02 elementor-widget elementor-widget-image\" data-id=\"6fbdd02\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">
<div class=\"elementor-widget-container\">
<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"951\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1NsiKi7b0JGlCQPLpeVkftA-1024x951.png\" class=\"attachment-large size-large wp-image-18694\" alt=\"train, test, evaluate this bad boy\" \/>
<\/div>
<\/div>
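<div class=\"elementor-element elementor-widget elementor-widget-text-editor\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">
<div class=\"elementor-widget-container\">
<p>The evaluation itself follows the earlier sections; for reference, here is a minimal sketch (it assumes scikit-learn and the <em>y_test<\/em> labels from the original train\/test split):<\/p>
<pre class=\"wp-block-preformatted\"><strong>## minimal evaluation sketch (assumes y_test from the train\/test split)<\/strong><br>from sklearn import metrics<br><br>print(\"Accuracy:\", round(metrics.accuracy_score(y_test, predicted), 3))<br>print(metrics.classification_report(y_test, predicted))<br>print(metrics.confusion_matrix(y_test, predicted))<\/pre>
<\/div>
<\/div>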
<div class=\"elementor-element elementor-element-db6f3d5 elementor-widget elementor-widget-text-editor\" data-id=\"db6f3d5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">
<div class=\"elementor-widget-container\">
<p id=\"1319\">The performance of BERT is slightly better than that of the previous models: in particular, it recognizes more Tech news than the others.<\/p>
<\/div>
<\/div>
<div class=\"elementor-element elementor-element-5c6a70d elementor-widget elementor-widget-heading\" data-id=\"5c6a70d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">
<div class=\"elementor-widget-container\">
<h2 class=\"elementor-heading-title elementor-size-default\">Conclusion<\/h2>
<\/div>
<\/div>
<div class=\"elementor-element elementor-element-e31fae6 elementor-widget elementor-widget-text-editor\" data-id=\"e31fae6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">
<div class=\"elementor-widget-container\">
<p id=\"9228\">This article has been a tutorial demonstrating <strong>how to apply different NLP models to a multiclass classification use case<\/strong>. I compared 3 popular approaches: Bag-of-Words with Tf-Idf, Word Embedding with Word2Vec, and Language models with BERT. I went through Feature Engineering &amp; Selection, Model Design &amp; Testing, and Evaluation &amp; Explainability, comparing the 3 models in each step (where possible).<\/p>
<p id=\"9ace\">Please note that I haven\u2019t covered explainability for BERT, as I\u2019m still working on that, but I will update this article as soon as I can. 
If you have any useful resources about that, feel free to contact me.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>This article, using NLP and Python, will explain 3 different strategies for text multiclass classification<\/p>\n","protected":false},"author":1050,"featured_media":18695,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[97,474,408,114,1329],"ppma_author":[3834],"class_list":["post-22617","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-artificial-intelligence","tag-nlp","tag-programming","tag-python","tag-text-classification"],"authors":[{"term_id":3834,"user_id":1050,"is_guest":0,"slug":"mauro-di-pietro","display_name":"Mauro Di Pietro","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/Mauro-Di-Pietro-150x150.jpeg","user_url":"http:\/\/www.group.intesasanpaolo.com","last_name":"Di Pietro","first_name":"Mauro","job_title":"","description":"Mauro Di Pietro is Senior Data Scientist at Intesa Sanpaolo, the leading Italian banking group."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22617","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/1050"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=22617"}],"version-history":[{"count":4,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22617\/revisions"}],"predecessor-version":[{"id":32233,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22617\/revisions\/32233"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/18695"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=22617"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=22617"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=22617"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=22617"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}