{"id":22447,"date":"2020-11-17T10:41:37","date_gmt":"2020-11-17T10:41:37","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/user-agent-strings-parsing-ml-models\/"},"modified":"2021-11-25T09:12:06","modified_gmt":"2021-11-25T09:12:06","slug":"user-agent-strings-parsing-ml-models","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/user-agent-strings-parsing-ml-models\/","title":{"rendered":"Still Parsing User-Agent Strings for Your Machine Learning Models?"},"content":{"rendered":"\n<p class=\"has-medium-font-size\"><em>Use This Instead<\/em><\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>Information contained in User-Agent strings can be efficiently represented using low-dimensional embeddings, and then employed in downstream Machine Learning tasks.<\/p><\/blockquote>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"d360\">What on Earth are User-Agent strings?<\/h2>\n\n\n\n<p id=\"4fd1\">When a user interacts with a website, the browser sends HTTP requests to the server to fetch the required content, submit data, or perform other actions. Such requests typically contain several&nbsp;<a href=\"https:\/\/developer.mozilla.org\/en-US\/docs\/Glossary\/Request_header\" target=\"_blank\" rel=\"noreferrer noopener\">headers<\/a>, i.e. character key-value pairs that specify parameters of a given request. A&nbsp;<a href=\"https:\/\/en.wikipedia.org\/wiki\/User_agent\" target=\"_blank\" rel=\"noreferrer noopener\">User-Agent string<\/a>&nbsp;(referred to as \u201cUAS\u201d below) is an HTTP request header that describes the software acting on the user\u2019s behalf (Figure 1).<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1Z6oqwtoXqFXzpPuwa-1NdA.png\" alt=\"Still Parsing User-Agent Strings for Your Machine Learning Models?\"\/><figcaption>Figure 1. 
The browser acts as the user\u2019s agent when sending requests to the server. The User-Agent string describes properties of the browser.<\/figcaption><\/figure><\/div>\n\n\n\n<p id=\"44ab\">The original purpose of UAS was&nbsp;<a href=\"https:\/\/developer.mozilla.org\/en-US\/docs\/Web\/HTTP\/Content_negotiation\" target=\"_blank\" rel=\"noreferrer noopener\"><em>content negotiation<\/em><\/a>, i.e. a mechanism of determining the best content to serve to a user depending on the information contained in the respective UAS (e.g., image format, the language of a document, text encoding, etc.). This information typically includes details on the environment the browser runs in (device, operating system and its version, locale), browser engine and version, layout engine and version, and so on (Figure 2).<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1F70F9Ev4l6bKR6la2j_h9w.png\" alt=\"Example of a typical UAS and its elements\"\/><figcaption>Figure 2. Example of a typical UAS and its elements.<\/figcaption><\/figure><\/div>\n\n\n\n<p id=\"63bf\">Although serving different web pages to different browsers is&nbsp;<a href=\"https:\/\/developer.mozilla.org\/en-US\/docs\/Web\/HTTP\/Browser_detection_using_the_user_agent\" target=\"_blank\" rel=\"noreferrer noopener\">considered a bad idea<\/a>&nbsp;nowadays, UAS still have many practical applications. The most common one is&nbsp;<em>web analytics<\/em>, i.e. reporting on the traffic composition to optimise the effectiveness of a website. Another use case is web&nbsp;<em>traffic management<\/em>, which involves blocking nuisance crawlers, decreasing the load on a website from unwanted visitors, preventing&nbsp;<a href=\"https:\/\/en.wikipedia.org\/wiki\/Click_fraud\" target=\"_blank\" rel=\"noreferrer noopener\">click fraud<\/a>, and other similar tasks. 
Due to the rich information they contain, UAS can also serve as a source of data for&nbsp;<em>Machine Learning applications<\/em>. However, the latter use case has not received much attention so far. Here I address this issue and discuss an efficient way to create informative features from UAS for Machine Learning models. This article is a summary of the talk I presented recently at two conferences \u2014&nbsp;<a href=\"https:\/\/2020.whyr.pl\/\" rel=\"noopener\"><em>Why R?<\/em><\/a>&nbsp;and&nbsp;<a href=\"https:\/\/info.mango-solutions.com\/earl-online-2020\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Enterprise Applications of the R Language<\/em><\/a>. Data and code used in the examples described below are available on&nbsp;<a href=\"https:\/\/github.com\/ranalytics\/uas_embeddings\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"b8df\">UAS elements as features for Machine Learning models<\/h2>\n\n\n\n<p id=\"4012\">UAS elements can often serve as useful proxies of user characteristics, such as lifestyle, tech-savviness, and even affluence. For example, a user who typically visits a website from a high-end mobile device is likely different from one who visits the same website from Internet Explorer on a desktop computer running Windows XP. Having UAS-based proxies for such characteristics can be particularly valuable when no other demographic information is available for a user (e.g. when a new, unidentified person visits a website).<\/p>\n\n\n\n<p id=\"185b\">In certain applications, it can also be useful to distinguish between human and non-human web traffic. In some cases, this is straightforward to do, as automated web crawlers use a simplified UAS format that includes the word \u201cbot\u201d (e.g.,&nbsp;<code>Googlebot\/2.1 (+http:\/\/www.google.com\/bot.html)<\/code>). 
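<\/p>\n\n\n\n<p>For such straightforward cases, a simple keyword check is all that is needed. Here is a minimal sketch in base R (the example UAS values are illustrative):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">uas &lt;- c(\n  \"googlebot\/2.1 (+http:\/\/www.google.com\/bot.html)\",\n  \"mozilla\/5.0 (windows nt 10.0; win64; x64)\"\n)\n\n# TRUE for UAS that contain the word \"bot\":\ngrepl(\"bot\", uas, fixed = TRUE)\n## [1]  TRUE FALSE<\/pre>\n\n\n\n<p>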
However, some crawlers do not follow this convention (e.g., Facebook bots&nbsp;<a href=\"https:\/\/developers.whatismybrowser.com\/useragents\/explore\/software_name\/facebook-bot\/\" rel=\"noopener\">include<\/a>&nbsp;the word&nbsp;<code>facebookexternalhit<\/code>&nbsp;in their UAS), and identifying them requires a lookup dictionary.<\/p>\n\n\n\n<p id=\"a90e\">One seemingly straightforward way to create Machine Learning features from UAS is to&nbsp;<em>apply a parser<\/em>, extract individual UAS elements, and then one-hot encode these elements. This approach can work well in simple cases when only high-level and readily identifiable UAS elements need to be transformed into features. For example, it is relatively easy to determine the type of hardware of a User Agent (mobile vs desktop vs server, etc.). Several high-quality, regular-expression-based parsers can be used for this kind of feature engineering (e.g., see the&nbsp;<a href=\"https:\/\/github.com\/ua-parser\" target=\"_blank\" rel=\"noreferrer noopener\">\u201cua-parser\u201d project<\/a>&nbsp;and its implementations for a selection of languages).<\/p>\n\n\n\n<p id=\"6dd3\">However, the above approach quickly becomes impractical when one wants to use&nbsp;<em>all<\/em>&nbsp;<em>elements<\/em>&nbsp;making up a UAS and extract maximum information from them. There are two main reasons for this:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Existing&nbsp;<a href=\"https:\/\/tools.ietf.org\/html\/rfc7231#section-5.5.3\" target=\"_blank\" rel=\"noreferrer noopener\">recommendations for the formatting of User-Agent headers<\/a>&nbsp;are not enforced in any way, and one can encounter a wide variety of UAS specifications in the real world. As a result, consistent parsing of UAS is notoriously hard. 
Moreover, new devices and versions of operating systems and browsers emerge every day, turning the maintenance of high-quality parsers into a formidable task.<\/li><li>The number of possible UAS elements and their combinations is astronomically large. Even if it were possible to one-hot encode them, the resultant data matrix would be extremely sparse and too large to fit into the memory of computers typically used by Data Scientists these days.<\/li><\/ul>\n\n\n\n<p id=\"cf3e\">To overcome these challenges, one can apply a&nbsp;<em>dimensionality reduction<\/em>&nbsp;technique and represent UAS as vectors of fixed size, while minimising the loss of original information. This idea is not new, of course, and as UAS are simply strings of text, this can be achieved using a variety of methods for Natural Language Processing. In my projects, I have often found that the fastText algorithm developed by researchers from Facebook (<a href=\"https:\/\/arxiv.org\/abs\/1607.04606\" target=\"_blank\" rel=\"noreferrer noopener\">Bojanowski et al. 2016<\/a>) produces particularly useful solutions.<\/p>\n\n\n\n<p id=\"17f9\">Describing the fastText algorithm is out of the scope of this article. However, before we proceed with examples, it is worth mentioning some of the practical benefits of this method:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>it is not data-hungry, i.e. a decently performing model can be trained on a few thousand examples only;<\/li><li>it works well on short and structured documents, such as UAS;<\/li><li>as the name suggests, it is fast to train;<\/li><li>it deals well with \u201cout-of-vocabulary words\u201d, i.e. 
it can generate meaningful vector representations (embeddings) even for strings that have not been seen during the training.<\/li><\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"fdd2\">fastText implementations: choose your weapon<\/h2>\n\n\n\n<p id=\"9823\">The&nbsp;<a href=\"https:\/\/github.com\/facebookresearch\/fastText\" target=\"_blank\" rel=\"noreferrer noopener\">official implementation<\/a>&nbsp;of fastText is available as a standalone C++ library and as a Python wrapper. Both of these libraries are&nbsp;<a href=\"https:\/\/fasttext.cc\/\" target=\"_blank\" rel=\"noreferrer noopener\">well documented<\/a>&nbsp;and easy to install and use. The widely used Python library&nbsp;<code>gensim<\/code>&nbsp;has&nbsp;<a href=\"https:\/\/radimrehurek.com\/gensim\/models\/fasttext.html\" target=\"_blank\" rel=\"noreferrer noopener\">its own implementation<\/a>&nbsp;of the algorithm. Below I will demonstrate how one can also train fastText models in R.<\/p>\n\n\n\n<p id=\"e7e6\">There exist several R wrappers around the fastText C++ library (see&nbsp;<code><a href=\"https:\/\/github.com\/mlampros\/fastText\" target=\"_blank\" rel=\"noreferrer noopener\">fastText<\/a><\/code>,&nbsp;<code><a href=\"https:\/\/github.com\/pommedeterresautee\/fastrtext\" target=\"_blank\" rel=\"noreferrer noopener\">fastrtext<\/a><\/code>, and&nbsp;<code><a href=\"https:\/\/cran.r-project.org\/web\/packages\/fastTextR\/index.html\" target=\"_blank\" rel=\"noreferrer noopener\">fastTextR<\/a><\/code>). However, arguably the simplest way to train and use fastText models in R is by calling the official Python bindings via the&nbsp;<code><a href=\"https:\/\/rstudio.github.io\/reticulate\/\" target=\"_blank\" rel=\"noreferrer noopener\">reticulate<\/a><\/code>&nbsp;package. 
Importing the&nbsp;<code>fasttext<\/code>&nbsp;Python module into an R environment can be done as follows:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Install `fasttext` first \n# (see https:\/\/fasttext.cc\/docs\/en\/support.html)\n\n# Load the `reticulate` package \n# (install first, if needed):\nrequire(reticulate)\n\n# Make sure `fasttext` is available to R:\npy_module_available(\"fasttext\")\n## [1] TRUE\n\n# Import `fasttext`:\nft &lt;- import(\"fasttext\")\n\n# Then call the required methods using \n# the standard `$` notation, \n# e.g.: `ft$train_supervised()`<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"688c\">Learning the UAS embeddings with fastText in R<\/h2>\n\n\n\n<p id=\"a03c\">The examples described below are based on a sample of 200,000 unique UAS from the&nbsp;<a href=\"https:\/\/developers.whatismybrowser.com\/useragents\/database\/\" target=\"_blank\" rel=\"noreferrer noopener\">whatismybrowser.com<\/a>&nbsp;database (Figure 3). Data are stored in a plain text file, where each row contains a single UAS. Note that all UAS were normalised to lower case and no other pre-processing was applied.<\/p>\n\n\n\n<figure class=\"aligncenter\"><figcaption>Figure 3. A sample of UAS used in the examples in this article.<\/figcaption><\/figure>\n\n\n\n<p id=\"4703\">Training an unsupervised fastText model in R is as simple as calling a command similar to the following one:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">m_unsup &lt;- ft$train_unsupervised(<br>   input = \".\/data\/train_data_unsup.txt\",<br>   model = \"skipgram\",<br>   lr = 0.05, <br>   dim = 32L, <em># vector dimension<\/em><br>   ws = 3L, <br>   minCount = 1L,<br>   minn = 2L, <br>   maxn = 6L, <br>   neg = 3L, <br>   wordNgrams = 2L, <br>   loss = \"ns\",<br>   epoch = 100L, <br>   thread = 10L<br>)<\/pre>\n\n\n\n<p id=\"fbb5\">The&nbsp;<code>dim<\/code>&nbsp;argument in the above command specifies the dimensionality of the embedding space. 
In this example, we want to transform each UAS into a vector of size 32. Other arguments control the process of model training (<code>lr<\/code>&nbsp;\u2014 learning rate,&nbsp;<code>loss<\/code>&nbsp;\u2014 the loss function,&nbsp;<code>epoch<\/code>&nbsp;\u2014 number of epochs, etc.). To understand the meaning of all arguments, please refer to the official&nbsp;<a href=\"https:\/\/fasttext.cc\/docs\/en\/options.html\" target=\"_blank\" rel=\"noreferrer noopener\">fastText documentation<\/a>.<\/p>\n\n\n\n<p id=\"15d6\">Once the model is trained, it is easy to calculate embeddings for new cases (e.g., from a test set). The following example shows how to do this (the key command here is&nbsp;<code>m_unsup$get_sentence_vector()<\/code>, which returns a vector averaged across embeddings of individual elements that make up a given UAS):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Load `dplyr` for the pipe and bind_rows():<br>require(dplyr)<br><br>test_data &lt;- readLines(\".\/data\/test_data_unsup.txt\")<br><br>emb_unsup &lt;- test_data %&gt;% <br>  lapply(., function(x) {<br>    m_unsup$get_sentence_vector(text = x) %&gt;%<br>      t(.) %&gt;% as.data.frame(.)<br>  }) %&gt;% <br>  bind_rows(.) %&gt;% <br>  setNames(., paste0(\"f\", 1:32))<br><br># Printing out the first 5 values<br># of the vectors (of size 32)<br># that represent the first 3 UAS<br># from the test set:<br>emb_unsup[1:3, 1:5]<br>##      f1       f2    f3    f4      f5<br>## 1 0.197 -0.03726 0.147 0.153  0.0423<br>## 2 0.182  0.00307 0.147 0.101  0.0326<br>## 3 0.101 -0.28220 0.189 0.202 -0.1623<\/pre>\n\n\n\n<p id=\"c491\">But how do we know if the trained unsupervised model is any good? Of course, one way to test it would involve plugging the vector representations of UAS obtained with that model into a downstream <a href=\"https:\/\/www.experfy.com\/blog\/ai-ml\/text-preprocessing-for-nlp-and-machine-learning-tasks\/\" target=\"_blank\" rel=\"noreferrer noopener\">Machine Learning task<\/a> and evaluating the quality of the resultant solution. 
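<\/p>\n\n\n\n<p>As an illustration of such downstream use, the embedding columns can be fed directly into any standard model. The following sketch fits a logistic regression, assuming a hypothetical 0\/1 vector <code>is_mobile<\/code> of labels (obtained, for example, with a conventional UAS parser):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># `emb_unsup` is the data frame of embeddings computed above;\n# `is_mobile` is a hypothetical 0\/1 label vector:\nclf &lt;- glm(is_mobile ~ .,\n           data = cbind(emb_unsup, is_mobile),\n           family = binomial)\n\n# Predicted probabilities for the same (or new) embeddings:\nhead(predict(clf, newdata = emb_unsup, type = \"response\"))<\/pre>\n\n\n\n<p>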
However, before jumping to a downstream modelling task, one can also visually assess how good the fastText embeddings are. The well-known tSNE plots (<a href=\"https:\/\/www.jmlr.org\/papers\/volume9\/vandermaaten08a\/vandermaaten08a.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">van der Maaten &amp; Hinton 2008<\/a>; see also&nbsp;<a href=\"https:\/\/www.youtube.com\/watch?v=RJVL80Gg3lA\" target=\"_blank\" rel=\"noreferrer noopener\">this YouTube video<\/a>) can be particularly useful.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1R72eBKGlltq7cT1xOotA0Q.png\" alt=\"A 3D tSNE visualisation of embeddings\"\/><figcaption>Figure 4. A 3D tSNE visualisation of embeddings obtained with an unsupervised fastText model. Each point in this graph corresponds to a UAS from a test set. Points were colour-coded according to the User Agent\u2019s hardware type.<\/figcaption><\/figure><\/div>\n\n\n\n<p id=\"a4b5\">Figure 4 shows a 3D tSNE plot of embeddings calculated for UAS from a test set using the fastText model specified above. Although this model was trained in an unsupervised manner, it was able to produce embeddings that reflect important properties of the original User-Agent strings. For example, one can see a good separation of the points with regard to the hardware type.<\/p>\n\n\n\n<p id=\"9d14\">Training a supervised fastText model by definition requires labelled data. This is where existing UAS parsers can often be of great help, as one can use them to quickly label thousands of training examples. The fastText algorithm supports both multiclass and multilabel classification. The expected (default) format for labels is&nbsp;<code>__label__&lt;value&gt;<\/code>. 
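<\/p>\n\n\n\n<p>For instance, a line of a training file for the hardware-type task could look like this (a hypothetical, truncated example):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">__label__mobile mozilla\/5.0 (iphone; cpu iphone os 13_5 like mac os x) applewebkit\/605.1.15<\/pre>\n\n\n\n<p>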
Labels formatted this way (and potentially separated by a space in the case of a multilabel model) are to be prepended to each document in the training dataset.<\/p>\n\n\n\n<p id=\"108d\">Suppose we are interested in embeddings that emphasise differences in UAS with regard to the hardware type. Figure 4 shows an example of labelled data that are suitable for training the respective model.<\/p>\n\n\n<figure class=\"aligncenter\"><figcaption>Figure 4. Example of labelled data suitable for training a supervised fastText model.<\/figcaption><\/figure>\n\n\n<p id=\"1fb0\">The R command required to train a supervised fastText model on such labelled data is similar to what we have seen before:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">m_sup &lt;- ft$train_supervised(<br>    input = \".\/data\/train_data_sup.txt\",<br>    lr = 0.05, <br>    dim = 32L, <em># vector dimension<\/em><br>    ws = 3L, <br>    minCount = 1L,<br>    minCountLabel = 10L, <em># min label occurrence<\/em><br>    minn = 2L, <br>    maxn = 6L, <br>    neg = 3L, <br>    wordNgrams = 2L, <br>    loss = \"softmax\", <em># loss function<\/em><br>    epoch = 100L, <br>    thread = 10L<br>)<\/pre>\n\n\n\n<p id=\"4ea3\">We can evaluate the resultant supervised model by calculating precision, recall, and F1 score on a labelled test set. These metrics can be calculated across all labels or for individual labels. For example, here are the quality metrics for UAS corresponding to the label \u201cmobile\u201d:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">metrics &lt;- m_sup$test_label(\".\/data\/test_data_sup.txt\")<br>metrics[\"__label__mobile\"]<br>## $`__label__mobile`<br>## $`__label__mobile`$precision<br>## [1] 0.998351<br>##<br>## $`__label__mobile`$recall<br>## [1] 0.9981159<br>##<br>## $`__label__mobile`$f1score<br>## [1] 0.9982334<\/pre>\n\n\n\n<p id=\"968b\">Visual inspection of the tSNE plot for this supervised model also confirms its high quality: we can see a clear separation of the test cases with regard to the hardware type (Figure 5). 
This is not surprising: by training a supervised model, we provide additional information that helps the algorithm create task-specific embeddings.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1LUnXzrwN7LkyavgKeFjygw.png\" alt=\"Still Parsing User-Agent Strings for Your Machine Learning Models?\"\/><figcaption>Figure 5. A 3D tSNE visualisation of embeddings obtained with a fastText model trained in a supervised mode, with labels corresponding to the hardware type.<\/figcaption><\/figure><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"c487\">Conclusions<\/h2>\n\n\n\n<p id=\"1cf9\">This article has demonstrated that the rich information contained in UAS can be efficiently represented using low-dimensional embeddings. Models to generate such embeddings can be trained in both unsupervised and supervised modes. Unsupervised embeddings are generic and thus can be used in any downstream Machine Learning task. Whenever possible, however, I would recommend training a task-specific, supervised model. One particularly useful algorithm for representing UAS as vectors of fixed size is fastText. 
Its implementations are available in all major languages used by Data Scientists nowadays.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Information contained in User-Agent strings can be efficiently represented using low-dimensional embeddings, and then employed in downstream Machine Learning tasks.<\/p>\n","protected":false},"author":976,"featured_media":16864,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[1018,1019,92,1020,1021,1022],"ppma_author":[3672],"class_list":["post-22447","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-aalgorithms","tag-embedding","tag-machine-learning","tag-supervised-model","tag-unsupervised-model","tag-user-agent-strings"],"authors":[{"term_id":3672,"user_id":976,"is_guest":0,"slug":"sergey-mastitsky","display_name":"Sergey Mastitsky","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/Sergey-Ivanov-1-150x150.jpeg","user_url":"http:\/\/www.aviva.com","last_name":"Mastitsky","first_name":"Sergey","job_title":"","description":"Sergey Mastitsky is Data Science Lead at Aviva. He is the author of Statistical Analysis and Visualisation of Data Using R, Data Visualisation Using ggplot2, Implementing Classification, Regression, and other Algorithms of Data Mining Using R, and Time Series Analysis Using R. All these books were written in Russian. 
He provides data science consulting services at <a href=\"http:\/\/nextgamesolutions.com\/#slide=1\" target=\"_blank\" rel=\"noopener\">Next Game Solutions<\/a>"}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22447","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/976"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=22447"}],"version-history":[{"count":2,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22447\/revisions"}],"predecessor-version":[{"id":27840,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22447\/revisions\/27840"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/16864"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=22447"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=22447"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=22447"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=22447"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}