{"id":1350,"date":"2019-02-15T10:32:05","date_gmt":"2019-02-15T07:32:05","guid":{"rendered":"http:\/\/kusuaks7\/?p=955"},"modified":"2023-08-09T10:11:43","modified_gmt":"2023-08-09T10:11:43","slug":"using-scrapy-to-build-your-own-dataset","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/using-scrapy-to-build-your-own-dataset\/","title":{"rendered":"Using Scrapy to Build your Own Dataset"},"content":{"rendered":"<p id=\"a45f\">When I first started working in industry, one of the things I quickly realized is that sometimes you have to gather, organize, and clean your own data. For this tutorial, we will gather data from a crowdfunding website called\u00a0<a href=\"https:\/\/fundrazr.com\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/fundrazr.com\/\" data->FundRazr<\/a>. Like many websites, the site has its own structure and form and holds plenty of useful, accessible data, but it is hard to get data from the site because it doesn\u2019t have a structured API. As a result, we will scrape the site to get that unstructured data and put it into an ordered form to build our own dataset.<\/p>\n<p id=\"b3aa\">In order to scrape the website, we will use\u00a0<a href=\"https:\/\/scrapy.org\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/scrapy.org\/\" data->Scrapy<\/a>. In short, Scrapy is a framework designed to make building web scrapers easier and to relieve the pain of maintaining them. Basically, it lets you focus on data extraction with CSS selectors and XPath expressions rather than on the intricate internals of how spiders are supposed to work. 
This blog post goes a little beyond the great\u00a0<a href=\"https:\/\/doc.scrapy.org\/en\/latest\/intro\/tutorial.html\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/doc.scrapy.org\/en\/latest\/intro\/tutorial.html\" data->official tutorial from the Scrapy documentation<\/a>\u00a0in the hopes that if you need to scrape something a bit harder, you can do it on your own. With that, let\u2019s get started. If you get lost, I recommend opening the\u00a0<a href=\"https:\/\/www.youtube.com\/watch?v=O_j3OTXw2_E\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/www.youtube.com\/watch?v=O_j3OTXw2_E\" data->video<\/a>\u00a0in a separate tab.<\/p>\n<h4 id=\"0739\"><strong>Getting Started (Prerequisites)<\/strong><\/h4>\n<p id=\"55f8\">If you already have Anaconda and Google Chrome (or Firefox), skip to Creating a New Scrapy Project.<\/p>\n<p id=\"701e\"><strong>1.<\/strong>\u00a0Install Anaconda (Python) on your operating system. You can either download Anaconda from the official site and install it on your own or follow an Anaconda installation tutorial for your operating system.<\/p>\n<p id=\"fabc\"><strong>2.<\/strong>\u00a0Install Scrapy (Anaconda comes with it, but just in case). You can install it from your terminal (mac\/linux) or command line (windows) by typing the following:<\/p>\n<pre id=\"936a\">conda install -c conda-forge scrapy<\/pre>\n<p id=\"66d9\"><strong>3.<\/strong>\u00a0Make sure you have Google Chrome or Firefox. In this tutorial, I am using Google Chrome. If you don\u2019t have Google Chrome, you can install it using this\u00a0<a href=\"https:\/\/support.google.com\/chrome\/answer\/95346?co=GENIE.Platform%3DDesktop&amp;hl=en\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/support.google.com\/chrome\/answer\/95346?co=GENIE.Platform%3DDesktop&amp;hl=en\" data->link<\/a>.<\/p>\n<h4 id=\"c24c\">Creating a New Scrapy\u00a0Project<\/h4>\n<p id=\"96d0\">1. Open a terminal (mac\/linux) or command line (windows). 
Navigate to a desired folder (see the image below if you need help) and type<\/p>\n<pre id=\"cdf1\">scrapy startproject fundrazr<\/pre>\n<figure id=\"ba40\"><canvas width=\"75\" height=\"10\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 94px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*NSTOVwlIonixPtqM_YOx9w.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*NSTOVwlIonixPtqM_YOx9w.png\" \/><\/figure>\n<p id=\"38de\" style=\"text-align: center;\">scrapy startproject fundrazr<\/p>\n<p>This makes a fundrazr directory with the following contents:<\/p>\n<figure id=\"ca4e\"><canvas width=\"75\" height=\"23\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 223px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*ZedpgQ0cl7IPjRywCXwYbw.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*ZedpgQ0cl7IPjRywCXwYbw.png\" \/><\/figure>\n<p id=\"9a2a\" style=\"text-align: center;\">fundrazr project directory<\/p>\n<h4><strong>Finding good start URLs using Inspect on Google Chrome (or\u00a0Firefox)<\/strong><\/h4>\n<p id=\"787b\">In the spider framework,\u00a0<strong>start_urls\u00a0<\/strong>is a list of URLs where the spider will begin to crawl from, when no particular URLs are specified. We will use each element in the\u00a0<strong>start_urls<\/strong>\u00a0list as a means to get individual campaign links.<\/p>\n<p id=\"9515\">1. The image below shows that based on the category you choose, you get a different start url. 
The parts highlighted in black are the possible categories of fundrazrs to scrape.<\/p>\n<figure id=\"40d9\"><canvas width=\"75\" height=\"25\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 236px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*9ePsVj6nMeSSns2SSvUXvA.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*9ePsVj6nMeSSns2SSvUXvA.png\" \/><\/figure>\n<p style=\"text-align: center;\">Finding a good first start_url<\/p>\n<p>For this tutorial, the first element in the list\u00a0<strong>start_urls<\/strong>\u00a0is:<\/p>\n<p id=\"1bf3\"><a href=\"https:\/\/fundrazr.com\/find?category=Health\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/fundrazr.com\/find?category=Health\" data->https:\/\/fundrazr.com\/find?category=Health<\/a><\/p>\n<p id=\"41ec\">2. This part is about getting additional elements to put in the\u00a0<strong>start_urls<\/strong>\u00a0list. We need to find out how the site links to the next page so we can get additional URLs to put in\u00a0<strong>start_urls<\/strong>.<\/p>\n<figure id=\"2a3a\"><canvas width=\"75\" height=\"36\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 348px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*ZzB7x0mx3jj5_GVbJZfKeg.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*ZzB7x0mx3jj5_GVbJZfKeg.png\" \/><\/figure>\n<p id=\"296e\" style=\"text-align: center;\">Getting additional elements to put in the\u00a0<strong>start_urls<\/strong>\u00a0list by inspecting the Next\u00a0button<\/p>\n<p>The second start URL is:\u00a0<a href=\"https:\/\/fundrazr.com\/find?category=Health&amp;page=2\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-href=\"https:\/\/fundrazr.com\/find?category=Health&amp;page=2\" data->https:\/\/fundrazr.com\/find?category=Health&amp;page=2<\/a><\/p>\n<p id=\"319e\">The spider code later in the tutorial uses a short snippet for this. 
All it does is make a list of start_urls. The variable npages is how many additional pages (after the first page) we want to get campaign links from.<\/p>\n<h4 id=\"0c75\"><strong>Scrapy Shell for finding Individual Campaign\u00a0Links<\/strong><\/h4>\n<p id=\"70d5\">The best way to learn how to extract data with Scrapy is to use the Scrapy shell. We will use XPath expressions, which select elements from HTML documents.<\/p>\n<p id=\"50bf\">The first XPath we will work out is the one for the individual campaign links. First, we use Inspect to see roughly where the campaigns are in the HTML.<\/p>\n<figure id=\"2521\"><canvas width=\"75\" height=\"33\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 324px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*n5MWPew1dsD-f_FqU0xMTQ.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*n5MWPew1dsD-f_FqU0xMTQ.png\" \/><\/figure>\n<p id=\"d0e0\" style=\"text-align: center;\">Finding links to individual campaigns<\/p>\n<p>We will use XPath to extract the part enclosed in the red rectangle below.<\/p>\n<figure id=\"35d7\"><img decoding=\"async\" style=\"width: 700px; height: 70px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*XOzIItj5lS-wAiCeJn03FQ.png\" data-action=\"zoom\" data-action-value=\"1*XOzIItj5lS-wAiCeJn03FQ.png\" data-height=\"100\" data-image-id=\"1*XOzIItj5lS-wAiCeJn03FQ.png\" data-width=\"998\" \/><\/figure>\n<p id=\"c189\" style=\"text-align: center;\">The enclosed part is a partial URL we will\u00a0isolate<\/p>\n<p>In terminal type (mac\/linux):<\/p>\n<pre id=\"ccf1\">scrapy shell 'https:\/\/fundrazr.com\/find?category=Health'<\/pre>\n<p id=\"7519\">In command line type (windows):<\/p>\n<pre id=\"46cf\">scrapy shell \"https:\/\/fundrazr.com\/find?category=Health\"<\/pre>\n<p id=\"8089\">Type the following into scrapy shell (to help understand the code, please see the video):<\/p>\n<p id=\"c9b7\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #e6e6fa;\">response.xpath(\"\/\/h2[contains(@class, 'title headline-font')]\/a[contains(@class, 'campaign-link')]\/\/@href\").extract()<\/span><\/span><\/p>\n<figure id=\"a59d\"><canvas width=\"75\" height=\"17\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 165px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*gsFvVmUq85f1Xg5L7mHyAA.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*gsFvVmUq85f1Xg5L7mHyAA.png\" \/><\/figure>\n<p id=\"d9e2\" style=\"text-align: center;\">There is a good chance you will get different partial URLs as websites update over\u00a0time<\/p>\n<p>The spider uses code like this to get all the campaign links for a given start URL (more on this later in The Spider section).<\/p>\n<p id=\"0508\">Exit Scrapy Shell by typing\u00a0<strong>exit()<\/strong>.<\/p>\n<figure id=\"17c5\"><canvas width=\"75\" height=\"12\"><\/canvas><img decoding=\"async\" style=\"width: 606px; height: 108px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*62ksPbArt2iM732MQnN4wg.png\" 
data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*62ksPbArt2iM732MQnN4wg.png\" \/><\/figure>\n<p id=\"d082\" style=\"text-align: center;\">exit scrapy shell<\/p>\n<h4><strong>Inspecting Individual Campaigns<\/strong><\/h4>\n<p id=\"2ac4\">While we previously worked on understanding where the individual campaign links are, this section goes over where things are on individual campaign pages.<\/p>\n<ol>\n<li id=\"f981\">Next, we go to an individual campaign page (see the link below) to scrape. (I should note that some of these campaigns are difficult to view.)<\/li>\n<\/ol>\n<p id=\"7506\"><a href=\"https:\/\/fundrazr.com\/savemyarm\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/fundrazr.com\/savemyarm\" data->https:\/\/fundrazr.com\/savemyarm<\/a><\/p>\n<p id=\"c62e\">2. Using the same inspect process as before, we inspect the title on the page.<\/p>\n<figure id=\"4772\"><canvas width=\"75\" height=\"40\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 377px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*Os8G6pRp2ZC3iM8-rgCBWQ.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*Os8G6pRp2ZC3iM8-rgCBWQ.png\" \/><\/figure>\n<p id=\"ed51\" style=\"text-align: center;\">Inspect Campaign\u00a0Title<\/p>\n<p>3. Now we are going to use scrapy shell again, but this time with an individual campaign. 
We do this because we want to find out how individual campaigns are formatted (including how to extract the title from the webpage).<\/p>\n<p id=\"9226\">In terminal type (mac\/linux):<\/p>\n<pre id=\"d191\">scrapy shell 'https:\/\/fundrazr.com\/savemyarm'<\/pre>\n<p id=\"86e5\">In command line type (windows):<\/p>\n<pre id=\"f5ce\">scrapy shell \"https:\/\/fundrazr.com\/savemyarm\"<\/pre>\n<p id=\"e7f9\">The code to get the campaign title is:<\/p>\n<pre id=\"d3ad\">response.xpath(\"\/\/div[contains(@id, 'campaign-title')]\/descendant::text()\").extract()[0]<\/pre>\n<figure id=\"90b3\"><img decoding=\"async\" style=\"width: 700px; height: 35px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*H6eftqCEKlegtcRq7cGdYQ.png\" data-action=\"zoom\" data-action-value=\"1*H6eftqCEKlegtcRq7cGdYQ.png\" data-height=\"68\" data-image-id=\"1*H6eftqCEKlegtcRq7cGdYQ.png\" data-width=\"1358\" \/><\/figure>\n<p id=\"0e3a\">4. We can do the same for the other parts of the page.<\/p>\n<p id=\"cc2b\">amount raised:<\/p>\n<p id=\"e9d5\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #e6e6fa;\">response.xpath(\"\/\/span[contains(@class,'stat')]\/span[contains(@class, 'amount-raised')]\/descendant::text()\").extract()<\/span><\/span><\/p>\n<p id=\"962b\">goal:<\/p>\n<p id=\"76ad\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #e6e6fa;\">response.xpath(\"\/\/div[contains(@class, 'stats-primary with-goal')]\/\/span[contains(@class, 'stats-label hidden-phone')]\/text()\").extract()<\/span><\/span><\/p>\n<p id=\"6934\">currency type:<\/p>\n<pre id=\"533b\">response.xpath(\"\/\/div[contains(@class, 'stats-primary with-goal')]\/@title\").extract()<\/pre>\n<p id=\"45fe\">campaign end date:<\/p>\n<p id=\"cdc8\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #e6e6fa;\">response.xpath(\"\/\/div[contains(@id, 'campaign-stats')]\/\/span[contains(@class,'stats-label hidden-phone')]\/span[@class='nowrap']\/text()\").extract()<\/span><\/span><\/p>\n<p id=\"5391\">number of contributors:<\/p>\n<p id=\"ee77\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #e6e6fa;\">response.xpath(\"\/\/div[contains(@class, 'stats-secondary with-goal')]\/\/span[contains(@class, 'donation-count stat')]\/text()\").extract()<\/span><\/span><\/p>\n<p id=\"70a3\">story:<\/p>\n<pre id=\"aef1\">response.xpath(\"\/\/div[contains(@id, 'full-story')]\/descendant::text()\").extract()<\/pre>\n<p id=\"3b5e\">url:<\/p>\n<pre id=\"a254\">response.xpath(\"\/\/meta[@property='og:url']\/@content\").extract()<\/pre>\n<p id=\"eb69\">5. Exit scrapy shell by typing:<\/p>\n<pre id=\"e2fb\">exit()<\/pre>\n<h4 id=\"33d5\">Items<\/h4>\n<p id=\"3e89\">The main goal in scraping is to extract structured data from unstructured sources, typically web pages. Scrapy spiders can return the extracted data as Python dicts. 
While convenient and familiar, Python dicts lack structure: it is easy to make a typo in a field name or return inconsistent data, especially in a larger project with many spiders (this is copied almost word for word from the great official Scrapy documentation!).<\/p>\n<figure id=\"da78\"><canvas width=\"75\" height=\"23\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 226px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*0XU-5Axl8h9aDZFkXhGyLQ.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*0XU-5Axl8h9aDZFkXhGyLQ.png\" \/><\/figure>\n<p id=\"9088\" style=\"text-align: center;\">File we will be modifying<\/p>\n<p>The code for items.py is\u00a0<a href=\"https:\/\/github.com\/mGalarnyk\/Python_Tutorials\/raw\/master\/Scrapy\/fundrazr\/fundrazr\/items.py\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/github.com\/mGalarnyk\/Python_Tutorials\/raw\/master\/Scrapy\/fundrazr\/fundrazr\/items.py\" data->here<\/a>.<\/p>\n<p id=\"58f4\">Save it under the fundrazr\/fundrazr directory (overwrite the original items.py file).<\/p>\n<p id=\"03e4\">The item class (basically how we store our data before outputting it) used in this tutorial looks like this.<\/p>\n<figure id=\"bdbb\"><canvas width=\"75\" height=\"47\"><\/canvas><img decoding=\"async\" style=\"width: 552px; height: 350px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*itTpe8dWR8CXIPzymjoHOA.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*itTpe8dWR8CXIPzymjoHOA.png\" \/><\/figure>\n<p style=\"text-align: center;\">items.py code<\/p>\n<h4><strong>The Spider<\/strong><\/h4>\n<p id=\"fe3b\">Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). 
The code for our spider is below.<\/p>\n<figure id=\"06b4\"><canvas width=\"75\" height=\"45\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 427px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*HhPZAv8PvzoPV2YhdUdAOQ.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*HhPZAv8PvzoPV2YhdUdAOQ.png\" \/><\/figure>\n<p id=\"5dd3\" style=\"text-align: center;\">Fundrazr Scrapy\u00a0Code<\/p>\n<p>Download the code\u00a0<a href=\"https:\/\/raw.githubusercontent.com\/mGalarnyk\/Python_Tutorials\/master\/Scrapy\/fundrazr\/fundrazr\/spiders\/fundrazr_scrape.py\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/raw.githubusercontent.com\/mGalarnyk\/Python_Tutorials\/master\/Scrapy\/fundrazr\/fundrazr\/spiders\/fundrazr_scrape.py\" data->here<\/a>.<\/p>\n<p id=\"ae84\">Save it in a file named\u00a0<strong>fundrazr_scrape.py<\/strong>\u00a0under the fundrazr\/spiders directory.<\/p>\n<p id=\"5b64\">The current project should now have the following contents:<\/p>\n<figure id=\"bac9\"><canvas width=\"75\" height=\"27\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 262px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*VI0lowOHVKAifkZuaU5uoA.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*VI0lowOHVKAifkZuaU5uoA.png\" \/><\/figure>\n<p id=\"105e\" style=\"text-align: center;\">File we will be creating\/adding<\/p>\n<h4><strong>Running the\u00a0Spider<\/strong><\/h4>\n<ol>\n<li id=\"ce54\">Go to the fundrazr\/fundrazr directory and type:<\/li>\n<\/ol>\n<pre id=\"33f9\">scrapy crawl my_scraper -o MonthDay_Year.csv<\/pre>\n<figure id=\"e6f2\"><canvas width=\"75\" height=\"45\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 424px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*84JJg6FgMOKfcJXNivkJTQ.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*84JJg6FgMOKfcJXNivkJTQ.png\" \/><\/figure>\n<p id=\"1a93\" style=\"text-align: center;\">scrapy 
crawl my_scraper -o MonthDay_Year.csv<\/p>\n<p>2. The data should be written to the fundrazr\/fundrazr directory.<\/p>\n<figure id=\"62c2\"><canvas width=\"75\" height=\"32\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 309px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*v7eePNbptxkHWyJg3LKz4Q.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*v7eePNbptxkHWyJg3LKz4Q.png\" \/><\/figure>\n<p style=\"text-align: center;\">Data Output\u00a0Location<\/p>\n<h4 id=\"04cc\"><strong>Our data<\/strong><\/h4>\n<ol>\n<li id=\"53fb\">The data output in this tutorial should look roughly like the image below. The individual campaigns scraped will vary as the website is constantly updated. Also, there may be blank rows between campaigns, depending on how Excel interprets the CSV file.<\/li>\n<\/ol>\n<figure id=\"86c9\"><canvas width=\"75\" height=\"56\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 527px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*Kbi0QlS60C3ErTLJnGZDeQ.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*Kbi0QlS60C3ErTLJnGZDeQ.png\" \/><\/figure>\n<p id=\"cdb3\" style=\"text-align: center;\">The data should\u00a0<strong>roughly\u00a0<\/strong>be in this\u00a0format.<\/p>\n<p>2. If you want a larger file with roughly 6000 campaigns scraped (it was made by changing npages = 2 to npages = 450 and adding download_delay\u00a0<strong>=<\/strong>\u00a02), you can download it from my\u00a0<a href=\"https:\/\/github.com\/mGalarnyk\/Python_Tutorials\/tree\/master\/Scrapy\/fundrazr\/fundrazr\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/github.com\/mGalarnyk\/Python_Tutorials\/tree\/master\/Scrapy\/fundrazr\/fundrazr\" data->GitHub<\/a>. 
The file is called MiniMorningScrape.csv (it is a large file).<\/p>\n<figure id=\"ef63\"><canvas width=\"75\" height=\"43\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 418px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*bVE-wShga0SRq5OFyKam2g.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*bVE-wShga0SRq5OFyKam2g.png\" \/><\/figure>\n<p style=\"text-align: center;\">Roughly 6000 Campaigns Scraped<\/p>\n<h4 id=\"c31b\">Closing Thoughts<\/h4>\n<p id=\"7f34\">Creating a dataset can be a lot of work and is often an overlooked part of learning data science. One thing we didn\u2019t go over is that while we scraped a lot of data, we still haven\u2019t cleaned the data enough to do analysis. That is for another blog post though. If you have any questions or thoughts on the tutorial, feel free to reach out in the comments below, on the\u00a0<a href=\"https:\/\/www.youtube.com\/watch?v=O_j3OTXw2_E\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/www.youtube.com\/watch?v=O_j3OTXw2_E\" data->YouTube video page<\/a>, or through\u00a0<a href=\"https:\/\/twitter.com\/GalarnykMichael\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/twitter.com\/GalarnykMichael\" data->Twitter<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Scrapy is a framework built to build web scrapers more easily and relieve the pain of maintaining them. Basically, it allows you to focus on the data extraction using CSS selectors and choosing XPath expressions and less on the intricate internals of how spiders are supposed to work. If you need to scrape something a bit harder, you can do it on your own. 
With that, let&#8217;s get started.&nbsp;<\/p>\n","protected":false},"author":312,"featured_media":3061,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[187],"tags":[94],"ppma_author":[1985],"class_list":["post-1350","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":1985,"user_id":312,"is_guest":0,"slug":"michael-galarnyk","display_name":"Michael Galarnyk","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","author_category":"","user_url":"","last_name":"Galarnyk","first_name":"Michael","job_title":"","description":"Michael Galarnyk&nbsp;is a Data Scientist at The Scripps Research Institute."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1350","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/312"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1350"}],"version-history":[{"count":0,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1350\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/3061"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1350"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1350"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1350"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1350"}],"cu
ries":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}