{"id":1316,"date":"2019-02-15T10:32:04","date_gmt":"2019-02-15T10:32:04","guid":{"rendered":"http:\/\/kusuaks7\/?p=921"},"modified":"2023-07-31T08:56:09","modified_gmt":"2023-07-31T08:56:09","slug":"data-analytics-with-python-by-web-scraping-illustration-with-cia-world-factbook","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/data-analytics-with-python-by-web-scraping-illustration-with-cia-world-factbook\/","title":{"rendered":"Data Analytics with Python by Web scraping: Illustration with CIA World Factbook"},"content":{"rendered":"<p style=\"margin-left: -1.95pt;\"><strong><em>Ready to learn Data Science? <a href=\"https:\/\/www.experfy.com\/training\/courses\">Browse courses<\/a>\u00a0like\u00a0<a href=\"https:\/\/www.experfy.com\/training\/tracks\/data-science-training-certification\">Data Science Training and Certification<\/a> developed by industry thought leaders and Experfy in Harvard Innovation Lab.<\/em><\/strong><\/p>\n<h4 style=\"margin-left: -1.3pt;\">In this article, we show how to use Python libraries and HTML parsing to extract useful information from a website and answer some important analytics questions afterwards.<\/h4>\n<p style=\"text-align: center;\"><img decoding=\"async\" style=\"width: 650px; height: 448px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*X2QkNgg-vR3NRnGDquRm9w.png\" alt=\"experfy-blog\" \/><\/p>\n<p><strong>In<\/strong>a data science project, almost always the most time consuming and messy part is the data gathering and cleaning. Everyone likes to build a cool deep neural network (or XGboost) model or two and show off one\u2019s skills with cool 3D interactive plots. 
But the models need raw data to start with, and that data doesn\u2019t come easy and clean.<\/p>\n<p><strong><em>Life, after all, is not Kaggle where a zip file full of data is waiting for you to be unpacked and modeled\u00a0\ud83d\ude42<\/em><\/strong><\/p>\n<p><strong>But why gather data or build a model anyway<\/strong>? The fundamental motivation is to answer a business or scientific or social question.\u00a0<em>Is there a trend<\/em>?\u00a0<em>Is this thing related to that<\/em>?\u00a0<em>Can the measurement of this entity predict the outcome of that phenomenon<\/em>? Answering such a question will validate a hypothesis you have as a scientist\/practitioner of the field. You are just using data (as opposed to test tubes like a chemist or magnets like a physicist) to test your hypothesis and prove\/disprove it scientifically.\u00a0<strong>That is the \u2018science\u2019 part of data science. Nothing more, nothing less\u2026<\/strong><\/p>\n<p>Trust me, it is not that hard to come up with a good-quality question that requires a bit of application of data science techniques to answer. Each such question then becomes a small project of your own, which you can code up and showcase on an open-source platform like GitHub to show to your friends. Even if you are not a data scientist by profession, nobody can stop you from writing a cool program to answer a good data question. That showcases you as a person who is comfortable around data and one who can tell a story with data.<\/p>\n<p>Let\u2019s tackle one such question today\u2026<\/p>\n<p><strong><em>Is there any relationship between the GDP (in terms of purchasing power parity) of a country and the percentage of its Internet users? And is this trend similar for low-income\/middle-income\/high-income countries?<\/em><\/strong><\/p>\n<p>Now, there can be any number of sources you can think of to gather data for answering this question. 
I found that a website from the CIA (yes, the \u2018AGENCY\u2019), which hosts basic factual information about all countries around the world, is a good place to scrape the data from.<\/p>\n<p>So, we will use the following Python modules to build our database and visualizations:<\/p>\n<ul>\n<li><strong>Pandas<\/strong>,\u00a0<strong>NumPy, matplotlib\/seaborn<\/strong><\/li>\n<li>Python\u00a0<strong>urllib<\/strong>\u00a0(for sending the HTTP requests)<\/li>\n<li><strong>BeautifulSoup<\/strong>\u00a0(for HTML parsing)<\/li>\n<li><strong>Regular expression module\u00a0<\/strong>(for finding the exact matching text to search for)<\/li>\n<\/ul>\n<p>Let\u2019s talk about the program structure to answer this data science question. The\u00a0<a href=\"https:\/\/github.com\/tirthajyoti\/Web-Database-Analytics-Python\/blob\/master\/CIA-Factbook-Analytics2.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">entire boilerplate code is available here<\/a>\u00a0in my\u00a0<a href=\"https:\/\/github.com\/tirthajyoti\/Web-Database-Analytics-Python\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub repository<\/a>. Please feel free to fork and star it if you like.<\/p>\n<h3 style=\"margin-left: -1.2pt;\"><strong>Reading the front HTML page and passing it on to BeautifulSoup<\/strong><\/h3>\n<p>Here is how the\u00a0front page of the CIA World Factbook\u00a0looks:<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" style=\"width: 650px; height: 426px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*CjEOFPmEDpz5z-Wc_YOfNg.png\" alt=\"experfy-blog\" \/><\/p>\n<p style=\"text-align: center;\">Fig: CIA World Factbook front\u00a0page<\/p>\n<p>We use a simple urllib request with an SSL-error-ignore context to retrieve this page and then pass it on to the magical BeautifulSoup, which parses the HTML for us and produces a pretty text dump. 
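That fetch-and-parse step might look like the following sketch. The front-page URL in the comment is a hypothetical placeholder (the Factbook site has been reorganized over the years), and the certificate-skipping context mirrors the SSL-error-ignore trick described above:

```python
import ssl
from urllib.request import urlopen

from bs4 import BeautifulSoup


def make_insecure_context():
    """Build an SSL context that skips certificate verification,
    mirroring the 'SSL error ignore' workaround mentioned above."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def read_page(url):
    """Fetch a page and hand the raw HTML to BeautifulSoup."""
    with urlopen(url, context=make_insecure_context()) as response:
        html = response.read()
    return BeautifulSoup(html, 'html.parser')


# Hypothetical front-page URL; confirm against the live site:
# soup = read_page('https://www.cia.gov/library/publications/the-world-factbook/')
# print(soup.prettify()[:500])
```

Skipping certificate checks is acceptable for a one-off scrape of a public page, but it should not be carried into production code.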
Those who are not familiar with the BeautifulSoup library can read this\u00a0<a href=\"https:\/\/medium.freecodecamp.org\/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe\" target=\"_blank\" rel=\"noopener noreferrer\">great informative article on Medium<\/a>.<\/p>\n<p>So, here is the code snippet for reading the front page HTML.<\/p>\n<p>Here is how we pass it on to BeautifulSoup and use the\u00a0find_all\u00a0method to find all the country names and codes embedded in the HTML. Basically, the idea is to\u00a0<strong>find the HTML tags named \u2018option\u2019<\/strong>. The text in that tag is the country name, and characters 5 and 6 of the tag\u2019s value represent the 2-character country code.<\/p>\n<p>Now, you may ask how you would know that you need to extract the 5th and 6th characters only. The simple answer is that\u00a0<strong>you have to examine the soup text, i.e. the parsed HTML text, yourself and determine those indices<\/strong>. There is no universal method to determine this. Each HTML page and its underlying structure is unique.<\/p>\n<h3 style=\"margin-left: -1.2pt;\"><strong>Crawling: Download all the text data of all countries into a dictionary by scraping each page individually<\/strong><\/h3>\n<p>This step is the essential scraping, or crawling, as they say. To do this,\u00a0<strong>the key thing to identify is how the URL of each country\u2019s information page is structured<\/strong>. In the general case, this may be hard to figure out. In this particular case, a quick examination shows a very simple and regular structure to follow. 
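The option-tag extraction just described might be sketched as below. The sample markup and the value format "geos\/as.html" are illustrative assumptions; as stressed above, the exact slice indices must be determined by inspecting the real parsed page yourself:

```python
from bs4 import BeautifulSoup


def extract_country_codes(soup):
    """Collect {country name: 2-char code} pairs from the <option> tags.
    The slice [5:7] (the 5th and 6th characters, 0-indexed) is specific
    to this page's value-attribute format -- not a universal rule."""
    countries = {}
    for tag in soup.find_all('option'):
        name = tag.get_text(strip=True)
        value = tag.get('value', '')
        if name and len(value) >= 7:
            countries[name] = value[5:7]
    return countries


# Tiny stand-in for the real country dropdown (hypothetical markup):
sample = BeautifulSoup(
    '<select>'
    '<option value="geos/as.html">Australia</option>'
    '<option value="geos/in.html">India</option>'
    '</select>',
    'html.parser')
print(extract_country_codes(sample))  # {'Australia': 'as', 'India': 'in'}
```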
Here is a screenshot of Australia\u2019s page, for example:<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" style=\"width: 640px; height: 419px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*vYfbPogbxVdPhX9hoSUc6g.png\" alt=\"experfy-blog\" \/><\/p>\n<p>That means there is a fixed URL to which you have to append the 2-character country code to get to the URL of that country\u2019s page. So, we can just iterate over the country codes\u2019 list and use BeautifulSoup to extract all the text and store it in a local dictionary. Here is the code snippet.<\/p>\n<h3 style=\"margin-left: -1.2pt;\"><strong>Store in a Pickle dump if you\u00a0like<\/strong><\/h3>\n<p>For good measure, I prefer to serialize and\u00a0<strong>store this data in a\u00a0<\/strong><a href=\"https:\/\/pythontips.com\/2013\/08\/02\/what-is-pickle-in-python\/\" target=\"_blank\" rel=\"noopener noreferrer\"><strong>Python pickle object<\/strong><\/a>\u00a0anyway. That way I can just read the data back directly the next time I open the Jupyter notebook, without repeating the web crawling steps.<\/p>\n<h3 style=\"margin-left: -1.2pt;\"><strong>Using regular expression to extract the GDP\/capita data from the text\u00a0dump<\/strong><\/h3>\n<p>This is the core text analytics part of the program, where we take the help of the\u00a0<a href=\"https:\/\/docs.python.org\/3\/howto\/regex.html\" target=\"_blank\" rel=\"noopener noreferrer\"><strong><em>regular expression<\/em><\/strong>\u00a0module<\/a>\u00a0to find what we are looking for in the huge text string and extract the relevant numerical data. Regular expressions are a rich resource in Python (as in virtually every high-level programming language), allowing you to search for and match particular patterns of strings within a large corpus of text. 
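Before moving on to the text analytics, the crawling loop and pickle dump from the previous two sections might be sketched together like this; the BASE_URL pattern and the pickle filename are assumptions for illustration, not taken from the original notebook:

```python
import pickle
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Hypothetical per-country URL pattern; confirm against the live site:
BASE_URL = 'https://www.cia.gov/library/publications/the-world-factbook/geos/{}.html'


def crawl_countries(country_codes, context=None):
    """Download each country's page and keep its full text in a dict,
    keyed by country name. A failing page is reported and skipped so
    that one bad page does not halt the whole crawl."""
    text_data = {}
    for name, code in country_codes.items():
        try:
            with urlopen(BASE_URL.format(code), context=context) as resp:
                page = BeautifulSoup(resp.read(), 'html.parser')
            text_data[name] = page.get_text()
        except Exception as exc:
            print(f'Could not scrape {name}: {exc}')
    return text_data


def save_pickle(obj, path='factbook_text.pickle'):
    """Serialize the scraped dictionary so the crawl need not be repeated."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)


def load_pickle(path='factbook_text.pickle'):
    """Read the dictionary back in the next notebook session."""
    with open(path, 'rb') as f:
        return pickle.load(f)
```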
Here, we use very simple regular expression methods to match exact phrases like \u201c<em>GDP<\/em> <em>\u2014<\/em> <em>per capita (PPP):<\/em>\u201d and then read a few characters after that, extracting the positions of certain symbols like $ and parentheses to eventually pull out the numerical value of GDP\/capita. Here is the idea illustrated with a figure.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" style=\"width: 640px; height: 529px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*1FgkmYUwds5pKIZC4HvkTw.png\" alt=\"experfy-blog\" \/><\/p>\n<p style=\"text-align: center;\">Fig: Illustration of the text analytics<\/p>\n<p>There are other regular expression tricks used in this notebook, for example, to extract the total GDP properly regardless of whether the figure is given in billions or trillions.<\/p>\n<p>Here is the example code snippet.\u00a0<strong>Notice the multiple error-handling checks placed in the code<\/strong>. This is necessary because of the supremely unpredictable nature of HTML pages. Not all countries may have the GDP data, not all pages may have the exact same wording for the data, not all numbers may look the same, not all strings may have $ and () placed similarly. Any number of things can go wrong.<\/p>\n<p><em>It is almost impossible to plan and write code for all scenarios, but at least you have to have code to handle the exceptions if they occur, so that your program does not come to a halt and can gracefully move on to the next page for processing.<\/em><\/p>\n<h3 style=\"margin-left: -1.2pt;\"><strong>Don\u2019t forget to use the pandas inner\/left join\u00a0method<\/strong><\/h3>\n<p>One thing to remember is that all these text analytics will produce dataframes with slightly different sets of countries, as different types of data may be unavailable for different countries. 
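An extraction routine of that shape, with the error-handling checks just discussed, might look like the sketch below. The exact label text (plain hyphen vs. dash, spacing) must be verified against your own parsed dump, and the 50-character window is an arbitrary illustrative choice:

```python
import re


def extract_gdp_per_capita(txt):
    """Locate the figure after the 'GDP - per capita (PPP):' label and
    return it as an integer number of dollars, or None when the page
    deviates from the expected pattern in any way.

    The label text (plain hyphen, exact spacing) is an assumption;
    check it against your own parsed text dump."""
    try:
        match = re.search(r'GDP - per capita \(PPP\):', txt)
        if match is None:
            return None          # this country page lacks the GDP entry
        # Read a short window after the label, then use the positions of
        # the $ sign and the opening parenthesis to bracket the number.
        window = txt[match.end():match.end() + 50]
        start = window.find('$')
        end = window.find('(')
        if start == -1 or end == -1 or end <= start:
            return None          # symbols missing or out of order
        return int(window[start + 1:end].strip().replace(',', ''))
    except ValueError:
        return None              # the bracketed text was not a number


print(extract_gdp_per_capita('GDP - per capita (PPP): $49,900 (2017 est.)'))  # 49900
print(extract_gdp_per_capita('no GDP entry on this page'))  # None
```

Returning None on every failure path, instead of raising, is what lets the crawl move gracefully past malformed pages.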
One could use a\u00a0<a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/merging.html\" target=\"_blank\" rel=\"noopener noreferrer\"><strong>Pandas inner join<\/strong><\/a>\u00a0to create a dataframe with the intersection of all common countries for which all the pieces of data are available or could be extracted.<\/p>\n<h3 style=\"margin-left: -1.2pt;\"><strong>Ah, the cool stuff now, Modeling\u2026but wait! Let\u2019s do filtering first!<\/strong><\/h3>\n<p>After all the hard work of HTML parsing, page crawling, and text mining, now you are ready to reap the benefits \u2014 eager to run the regression algorithms and cool visualization scripts! But wait, often you need to clean up your data (particularly for these kinds of socio-economic problems) a wee bit more before generating those plots. Basically, you want to filter out the outliers, e.g. very small countries (like island nations) which may have extremely skewed values of the parameters you want to plot but do not follow the main underlying dynamics you want to investigate. A few lines of code are enough for those filters. There may be more\u00a0<em>Pythonic<\/em>\u00a0ways to implement them, but I tried to keep it extremely simple and easy to follow. The following code, for example, creates filters to keep out small countries with &lt; $50 billion of total GDP, and sets low- and high-income boundaries of $5,000 and $25,000 of GDP\/capita, respectively.<\/p>\n<h3 style=\"margin-left: -1.2pt;\"><strong>Finally, the visualization<\/strong><\/h3>\n<p>We use the\u00a0<a href=\"https:\/\/seaborn.pydata.org\/generated\/seaborn.regplot.html\" target=\"_blank\" rel=\"noopener noreferrer\"><strong>seaborn regplot<\/strong>\u00a0function<\/a>\u00a0to create the scatter plots (Internet users % vs. GDP\/capita) with a linear regression fit and 95% confidence interval bands shown. They look like the following. One can interpret the result as:<\/p>\n<p><em>There is a strong positive correlation between Internet users % and GDP\/capita for a country. 
Moreover, the strength of the correlation is significantly higher for low-income\/low-GDP countries than for the high-GDP, advanced nations.\u00a0<\/em><strong><em>That could mean that access to the internet helps the lower-income countries to grow faster and improve the average condition of their citizens more than it does for the advanced nations<\/em><\/strong><em>.<\/em><\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" style=\"width: 650px; height: 197px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*UAMZrO5oXN_vKvwu-Zhaxg.png\" alt=\"experfy-blog\" \/><\/p>\n<h3><strong>Summary<\/strong><\/h3>\n<p>This article goes over a demo Python notebook to illustrate how to crawl web pages to download raw information by HTML parsing using BeautifulSoup. Thereafter, it also illustrates the use of the regular expression module to search for and extract the important pieces of information that the user demands.<\/p>\n<p><em>Above all, it demonstrates how and why there can be no simple, universal rule or program structure for mining messy HTML-parsed text. One has to examine the text structure and put in place appropriate error-handling checks to gracefully handle all the situations, to maintain the flow of the program (and not crash) even when it cannot extract data for some of those scenarios.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Ready to learn Data Science? Browse courses\u00a0like\u00a0Data Science Training and Certification developed by industry thought leaders and Experfy in Harvard Innovation Lab. In this article, we show how to use Python libraries and HTML parsing to extract useful information from a website and answer some important analytics questions afterwards. 
In a data science project, almost always<\/p>\n","protected":false},"author":137,"featured_media":2981,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[94],"ppma_author":[1967],"class_list":["post-1316","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":1967,"user_id":137,"is_guest":0,"slug":"tirthajyoti-sarkar","display_name":"Tirthajyoti Sarkar","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Sarkar","first_name":"Tirthajyoti","job_title":"","description":"Dr. Tirthajyoti Sarkar, Principal Engineer at ON Semiconductor, conducts research on and designs advanced semiconductor technology and products, which power various things from smartphones to electric cars, with data centers and washing machines in between. He also moonlights by learning and practicing data science, machine learning, and Python\/R programming. 
He writes for multiple Data Science\/Artificial intelligence focused publications and loves to experiment with advanced machine learning techniques for application to semiconductor designs."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1316","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/137"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1316"}],"version-history":[{"count":5,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1316\/revisions"}],"predecessor-version":[{"id":29796,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1316\/revisions\/29796"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/2981"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1316"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1316"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1316"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1316"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}