{"id":2063,"date":"2019-11-12T01:19:56","date_gmt":"2019-11-12T01:19:56","guid":{"rendered":"http:\/\/kusuaks7\/?p=1668"},"modified":"2024-02-27T11:45:47","modified_gmt":"2024-02-27T11:45:47","slug":"everything-a-data-scientist-should-know-about-data-management","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/everything-a-data-scientist-should-know-about-data-management\/","title":{"rendered":"Everything a Data Scientist Should Know About Data Management"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"2063\" class=\"elementor elementor-2063\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-192663b2 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"192663b2\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-1f776588\" data-id=\"1f776588\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-644d3946 elementor-widget elementor-widget-text-editor\" data-id=\"644d3946\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tTo be a real \u201c<a href=\"https:\/\/medium.com\/applied-data-science\/new-series-the-full-stack-data-scientist-15791cbef626\" target=\"_blank\" rel=\"noopener noreferrer\" class=\"broken_link\">full-stack<\/a>\u201d data scientist, or what many bloggers and employers call a \u201c<a 
href=\"https:\/\/towardsdatascience.com\/whats-the-secret-sauce-to-transforming-into-a-unicorn-in-data-science-94082b01c39d\" target=\"_blank\" rel=\"noopener noreferrer\" class=\"broken_link\">unicorn<\/a>,\u201d you have to master every step of the data science process \u2014 all the way from storing your data, to putting your finished product (typically a predictive model) in production. But the bulk of data science training focuses on machine\/deep learning techniques; data management knowledge is often treated as an afterthought. Data science students usually learn modeling skills with processed and cleaned data in text files stored on their laptops, ignoring how the data sausage is made.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-832f7b7 elementor-widget elementor-widget-text-editor\" data-id=\"832f7b7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nStudents often don\u2019t realize that in industry settings, getting raw data from various sources ready for modeling is usually\u00a080% of the work.\u00a0And because enterprise projects usually involve more data than a local machine is equipped to handle, the entire modeling process often takes place in the cloud, with most of the applications and databases hosted on servers in data centers elsewhere. Even after a student lands a job as a data scientist, data management often becomes something that a separate data engineering team takes care of. As a result, too many data scientists know too little about data storage and infrastructure, often to the detriment of their ability to make the right decisions at their jobs. 
The goal of this article is to provide a roadmap of what a data scientist in 2019 should know about data management \u2014 from types of databases, where and how data is stored and processed, to the current commercial options \u2014 so the aspiring \u201cunicorns\u201d could dive deeper on their own, or at least learn enough to sound like one at interviews and cocktail parties.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-82a3931 elementor-widget elementor-widget-heading\" data-id=\"82a3931\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>The Rise of Unstructured Data &amp; Big Data Tools<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0544141 elementor-widget elementor-widget-image\" data-id=\"0544141\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/809\/1*Fz7nDoYorQS9cRQaLP-SkA.jpeg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-96a6642 elementor-widget elementor-widget-text-editor\" data-id=\"96a6642\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\"><span style=\"font-size: 11px;\">IBM 305 RAMAC (Source: <a href=\"https:\/\/commons.wikimedia.org\/wiki\/File:BRL61-IBM_305_RAMAC.jpeg\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">WikiCommons<\/a>).<\/span><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-003c6b7 
elementor-widget elementor-widget-text-editor\" data-id=\"003c6b7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nThe story of data science is really the\u00a0<a href=\"https:\/\/www.forbes.com\/sites\/insights-intelai\/2019\/05\/22\/automated-inspiration\/#5516de61c44f\" target=\"_blank\" rel=\"noopener noreferrer\">story of data storage<\/a>. In the pre-digital age, data was stored in our heads, on clay tablets, or on paper, which made aggregating and analyzing data extremely time-consuming.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6072d37 elementor-widget elementor-widget-text-editor\" data-id=\"6072d37\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn 1956, IBM introduced the first commercial computer with a magnetic hard drive, the\u00a0<a href=\"https:\/\/gizmodo.com\/ibm-305-ramac-the-grandaddy-of-modern-hard-drives-5494858\" target=\"_blank\" rel=\"noopener noreferrer\">305 RAMAC<\/a>. The entire unit required 30 ft x 50 ft of physical space and weighed over a ton, and companies could lease it for $3,200 a month to store up to 5 MB of data. In the 60 years since, the\u00a0<a href=\"https:\/\/www.computerworld.com\/article\/3182207\/cw50-data-storage-goes-from-1m-to-2-cents-per-gigabyte.html\" target=\"_blank\" rel=\"noopener noreferrer\">price per gigabyte of DRAM<\/a>\u00a0has dropped from a whopping $2.64 billion in 1965 to $4.90 in 2017. Besides becoming orders of magnitude cheaper, data storage also became much denser and smaller. 
A disk platter in the 305 RAMAC stored about 2,000 bits per square inch, compared to\u00a0<a href=\"https:\/\/www.wired.com\/2012\/03\/seagate-trillion-bits\/\" target=\"_blank\" rel=\"noopener noreferrer\">over a trillion bits per square inch<\/a>\u00a0in a typical disk platter today.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4b75cd5 elementor-widget elementor-widget-text-editor\" data-id=\"4b75cd5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p data-selectable-paragraph=\"\">This combination of dramatically reduced cost and size in data storage is what makes today\u2019s big data analytics possible. With ultra-low storage costs, building the data science infrastructure to collect and extract insights from huge amounts of data became a profitable approach for businesses. And with the profusion of\u00a0<a href=\"https:\/\/internetofthingsagenda.techtarget.com\/definition\/IoT-device\" target=\"_blank\" rel=\"noopener noreferrer\">IoT devices<\/a>\u00a0that constantly generate and transmit users\u2019 data, businesses are collecting data on an ever-increasing number of activities, creating a massive amount of high-volume, high-velocity, and high-variety information assets (or the \u201c<a href=\"https:\/\/www.zdnet.com\/article\/volume-velocity-and-variety-understanding-the-three-vs-of-big-data\/\" target=\"_blank\" rel=\"noopener noreferrer\">three Vs of big data<\/a>\u201d). Most of these activities (e.g. 
emails, videos, audio, chat messages, social media posts) generate\u00a0<a href=\"https:\/\/www.datamation.com\/big-data\/structured-vs-unstructured-data.html\" target=\"_blank\" rel=\"noopener noreferrer\">unstructured data<\/a>, which accounts for almost 80% of total enterprise data today and has grown twice as fast as structured data over the past decade.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-395b0a1 elementor-widget elementor-widget-image\" data-id=\"395b0a1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/893\/1*JeIC6PreHjgh06w3WqkXMA.jpeg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8299736 elementor-widget elementor-widget-text-editor\" data-id=\"8299736\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\"><span style=\"font-size: 11px;\">125 Exabytes of enterprise data was stored in 2017; 80% was unstructured data. 
(Source:\u00a0Credit Suisse).<\/span><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-28c9af6 elementor-widget elementor-widget-text-editor\" data-id=\"28c9af6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThis massive data growth dramatically transformed the way data is stored and analyzed, as the traditional tools and approaches were not equipped to handle the \u201cthree Vs of big data.\u201d New technologies were developed with the ability to handle the ever-increasing volume and variety of data, and at a faster speed and lower cost. These new tools also have profound effects on how data scientists do their job \u2014 allowing them to monetize the massive data volume by performing analytics and building new applications that were not possible before. Below are the major big data management innovations that we think every data scientist should know about.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2ba22d0 elementor-widget elementor-widget-heading\" data-id=\"2ba22d0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 data-selectable-paragraph=\"\"><strong>Relational Databases &amp; NoSQL<\/strong><\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-60e3f0c elementor-widget elementor-widget-text-editor\" data-id=\"60e3f0c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p data-selectable-paragraph=\"\"><a href=\"https:\/\/www.alooma.com\/blog\/types-of-modern-databases\" target=\"_blank\" rel=\"noopener 
noreferrer\">Relational Database Management Systems<\/a>\u00a0(RDBMS) emerged in the 1970s to store data as tables with rows and columns, using Structured Query Language (SQL) statements to query and maintain the database. A relational database is basically a collection of tables, each with a schema that rigidly defines the attributes and types of data that they store, as well as keys that identify specific columns or rows to facilitate access. The RDBMS landscape was once ruled by\u00a0<a href=\"https:\/\/db-engines.com\/en\/system\/Oracle\" target=\"_blank\" rel=\"noopener noreferrer\">Oracle<\/a>\u00a0and\u00a0<a href=\"https:\/\/db-engines.com\/en\/system\/IBM+Db2\" target=\"_blank\" rel=\"noopener noreferrer\">IBM<\/a>, but today many open source options, like\u00a0<a href=\"https:\/\/db-engines.com\/en\/system\/MySQL\" target=\"_blank\" rel=\"noopener noreferrer\">MySQL<\/a>,\u00a0<a href=\"https:\/\/db-engines.com\/en\/system\/SQLite\" target=\"_blank\" rel=\"noopener noreferrer\">SQLite<\/a>, and\u00a0<a href=\"https:\/\/db-engines.com\/en\/system\/PostgreSQL\" target=\"_blank\" rel=\"noopener noreferrer\">PostgreSQL<\/a>\u00a0are just as popular.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-61c9ce5 elementor-widget elementor-widget-image\" data-id=\"61c9ce5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1835\/0*BcJW0soJynaxRoK_\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e1bc89a elementor-widget elementor-widget-text-editor\" data-id=\"e1bc89a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: 
center;\"><span style=\"font-size: 11px;\">RDBMS ranked by popularity (Source:\u00a0<a href=\"https:\/\/db-engines.com\/en\/ranking\/relational+dbms\" target=\"_blank\" rel=\"noopener noreferrer\">DB-Engines<\/a>).<\/span><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ed4235f elementor-widget elementor-widget-text-editor\" data-id=\"ed4235f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tRelational databases found a home in the business world due to some very appealing properties.\u00a0<a href=\"https:\/\/towardsdatascience.com\/choosing-the-right-database-c45cd3a28f77\" target=\"_blank\" rel=\"noopener noreferrer\" class=\"broken_link\">Data integrity<\/a>\u00a0is absolutely paramount in relational databases. RDBMS satisfies the requirements of\u00a0<a href=\"https:\/\/blog.yugabyte.com\/a-primer-on-acid-transactions\/\" target=\"_blank\" rel=\"noopener noreferrer\">Atomicity, Consistency, Isolation, and Durability (or ACID-compliant)<\/a>\u00a0by imposing a number of constraints to ensure that the stored data is reliable and accurate, making them ideal for tracking and storing things like account numbers, orders, and payments. But these constraints come with costly tradeoffs. Because of the schema and type constraints, RDBMS are terrible at storing unstructured or semi-structured data. The rigid schema also makes RDBMS more expensive to set up, maintain and grow. 
Setting up an RDBMS requires users to define specific use cases in advance; any changes to the schema are usually difficult and time-consuming.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-26fc5bc elementor-widget elementor-widget-text-editor\" data-id=\"26fc5bc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn addition, traditional RDBMS were designed to\u00a0<a href=\"https:\/\/crate.io\/a\/rise-distributed-sql-database\/\" target=\"_blank\" rel=\"noopener noreferrer\">run on a single computer node<\/a>, which limits their speed when processing large volumes of data.\u00a0<a href=\"https:\/\/crate.io\/a\/rise-distributed-sql-database\/\" target=\"_blank\" rel=\"noopener noreferrer\">Sharding<\/a>\u00a0an RDBMS in order to scale horizontally while maintaining ACID compliance is also extremely challenging. All these attributes make traditional RDBMS ill-equipped to handle modern big data.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-72eae4f elementor-widget elementor-widget-text-editor\" data-id=\"72eae4f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p data-selectable-paragraph=\"\">By the mid-2000s, the existing RDBMS could no longer handle the changing needs and exponential growth of a few very successful online businesses, and many non-relational (or NoSQL) databases were developed as a result (here\u2019s\u00a0<a href=\"https:\/\/blog.yugabyte.com\/facebooks-user-db-is-it-sql-or-nosql\/\" target=\"_blank\" rel=\"noopener noreferrer\">a story<\/a>\u00a0on how Facebook dealt with the limitations of MySQL when their data volume started to grow). 
Without any known solutions at the time, these online businesses invented new approaches and tools to handle the massive amount of unstructured data they collected: Google created\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Google_File_System\" target=\"_blank\" rel=\"noopener noreferrer\">GFS<\/a>,\u00a0<a href=\"https:\/\/ai.google\/research\/pubs\/pub62\" target=\"_blank\" rel=\"noopener noreferrer\" class=\"broken_link\">MapReduce<\/a>, and\u00a0<a href=\"https:\/\/cloud.google.com\/bigtable\/docs\/\" target=\"_blank\" rel=\"noopener noreferrer\">BigTable<\/a>; Amazon created\u00a0<a href=\"https:\/\/cloudacademy.com\/blog\/amazon-dynamodb-ten-things\/\" target=\"_blank\" rel=\"noopener noreferrer\">DynamoDB<\/a>; Yahoo created\u00a0<a href=\"https:\/\/www.sas.com\/nl_nl\/insights\/big-data\/hadoop.html#hadoopworld\" target=\"_blank\" rel=\"noopener noreferrer\">Hadoop<\/a>; Facebook created\u00a0<a href=\"https:\/\/www.facebook.com\/notes\/facebook-engineering\/cassandra-a-structured-storage-system-on-a-p2p-network\/24413138919\/\" target=\"_blank\" rel=\"noopener noreferrer\">Cassandra<\/a>\u00a0and\u00a0<a href=\"https:\/\/www.facebook.com\/notes\/facebook-engineering\/hive-a-petabyte-scale-data-warehouse-using-hadoop\/89508453919\/\" target=\"_blank\" rel=\"noopener noreferrer\">Hive<\/a>; LinkedIn created\u00a0<a href=\"https:\/\/engineering.linkedin.com\/blog\/2016\/04\/kafka-ecosystem-at-linkedin\" target=\"_blank\" rel=\"noopener noreferrer\">Kafka<\/a>. 
Some of these businesses open-sourced their work; some published research papers detailing their designs, resulting in a proliferation of databases with the new technologies, and NoSQL databases emerged as a major player in the industry.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-de1b4d1 elementor-widget elementor-widget-image\" data-id=\"de1b4d1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1063\/0*EeRO-8cl2oZkZ49J\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-976c32b elementor-widget elementor-widget-text-editor\" data-id=\"976c32b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\"><span style=\"font-size: 11px;\">An explosion of database options since the 2000s. Source:\u00a0<a href=\"https:\/\/www.researchgate.net\/figure\/Landscape-and-categorization-of-the-high-variety-of-existing-database-systems-18_fig2_303562879\" target=\"_blank\" rel=\"noopener noreferrer\" class=\"broken_link\">Korflatis et al. (2016)<\/a>.<\/span><\/p>\n<a href=\"https:\/\/www.fullstackpython.com\/no-sql-datastore.html\" target=\"_blank\" rel=\"noopener noreferrer\">NoSQL databases<\/a>\u00a0are schema-agnostic and provide the flexibility needed to store and manipulate large volumes of\u00a0<a href=\"https:\/\/www.datamation.com\/big-data\/structured-vs-unstructured-data.html\" target=\"_blank\" rel=\"noopener noreferrer\">unstructured and semi-structured data<\/a>. Users don\u2019t need to know what types of data will be stored during set-up, and the system can accommodate changes in data types and schema. 
Designed to distribute data across different nodes, NoSQL databases are generally more horizontally scalable and fault-tolerant. However, these performance benefits also come with a cost \u2014 many NoSQL databases are not fully ACID compliant, and immediate data consistency is not guaranteed. They instead provide \u201c<a href=\"https:\/\/medium.baqend.com\/nosql-databases-a-survey-and-decision-guidance-ea7823a822d\" target=\"_blank\" rel=\"noopener noreferrer\">eventual consistency<\/a>\u201d: while old data is being overwritten, they may temporarily return results that are slightly out of date. For example,\u00a0<a href=\"https:\/\/arstechnica.com\/information-technology\/2016\/03\/to-sql-or-nosql-thats-the-database-question\/\" target=\"_blank\" rel=\"noopener noreferrer\">Google\u2019s search engine index<\/a>\u00a0can\u2019t overwrite its data while people are simultaneously searching a given term, so it doesn\u2019t always give us the most up-to-date results when we search, but it gives us the latest, best answer it can. While this setup won\u2019t work in situations where data consistency is absolutely necessary (such as financial transactions), it\u2019s just fine for tasks that require speed rather than pinpoint accuracy.\n<p data-selectable-paragraph=\"\">There are now several different categories of NoSQL, each serving specific purposes. Key-Value Stores, such as\u00a0<a href=\"https:\/\/db-engines.com\/en\/system\/Redis\" rel=\"noopener\">Redis<\/a>,\u00a0<a href=\"https:\/\/db-engines.com\/en\/system\/Amazon+DynamoDB\" rel=\"noopener\">DynamoDB<\/a>, and\u00a0<a href=\"https:\/\/db-engines.com\/en\/system\/Microsoft+Azure+Cosmos+DB\" rel=\"noopener\">Cosmos DB<\/a>, store only key-value pairs and provide basic functionality for retrieving the value associated with a known key. They work best with a simple database schema and when speed is important. 
Wide Column Stores, such as\u00a0<a href=\"https:\/\/db-engines.com\/en\/system\/Cassandra\" rel=\"noopener\">Cassandra<\/a>,\u00a0<a href=\"https:\/\/db-engines.com\/en\/system\/ScyllaDB\" rel=\"noopener\">Scylla<\/a>, and\u00a0<a href=\"https:\/\/db-engines.com\/en\/system\/HBase\" class=\"broken_link\" rel=\"noopener\">HBase<\/a>, store data in column families or tables and are built to manage petabytes of data across a massive, distributed system. Document Stores, such as\u00a0<a href=\"https:\/\/db-engines.com\/en\/system\/MongoDB\" rel=\"noopener\">MongoDB<\/a>\u00a0and\u00a0<a href=\"https:\/\/db-engines.com\/en\/system\/Couchbase\" rel=\"noopener\">Couchbase<\/a>, store data in XML or JSON format, with the document name as key and the contents of the document as value. The documents can contain many different value types and can be nested, making them particularly well-suited to managing semi-structured data across distributed systems. Graph Databases, such as\u00a0<a href=\"https:\/\/db-engines.com\/en\/system\/Neo4j\" rel=\"noopener\">Neo4j<\/a>\u00a0and\u00a0<a href=\"https:\/\/db-engines.com\/en\/system\/Amazon+Neptune\" rel=\"noopener\">Amazon Neptune<\/a>, represent data as a network of related nodes or objects in order to facilitate data visualizations and graph analytics.\u00a0<a href=\"https:\/\/towardsdatascience.com\/graph-databases-whats-the-big-deal-ec310b1bc0ed\" class=\"broken_link\" rel=\"noopener\">Graph databases<\/a>\u00a0are particularly useful for analyzing the relationships between heterogeneous data points, such as in fraud prevention or Facebook\u2019s friends graph.<\/p>\n<p data-selectable-paragraph=\"\">MongoDB is currently the\u00a0<a href=\"https:\/\/db-engines.com\/en\/ranking\" rel=\"noopener\">most popular NoSQL database<\/a>\u00a0and has delivered substantial value for some businesses that had been struggling to handle their unstructured data with the traditional RDBMS approach. 
Here are\u00a0<a href=\"https:\/\/arstechnica.com\/information-technology\/2016\/03\/to-sql-or-nosql-thats-the-database-question\/\" rel=\"noopener\">two industry examples<\/a>: after MetLife spent\u00a0<em>years<\/em>\u00a0trying to build a centralized customer database on an RDBMS that could handle all its insurance products, someone at an internal hackathon built one with MongoDB within hours, which went to production in 90 days. YouGov, a market research firm that collects 5 gigabits of data an hour, saved 70 percent of the storage capacity it formerly used by migrating from RDBMS to MongoDB.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-36df882 elementor-widget elementor-widget-heading\" data-id=\"36df882\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 data-selectable-paragraph=\"\"><strong>Data Warehouse, Data Lake, &amp; Data Swamp<\/strong><\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1f6f854 elementor-widget elementor-widget-text-editor\" data-id=\"1f6f854\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p data-selectable-paragraph=\"\">As data sources continued to grow, performing data analytics with multiple databases became inefficient and costly. One solution, the\u00a0<a href=\"https:\/\/blogs.oracle.com\/bigdata\/data-lake-database-data-warehouse-difference\" class=\"broken_link\" rel=\"noopener\">Data Warehouse<\/a>, emerged in\u00a0<a href=\"https:\/\/www.dataversity.net\/brief-history-data-warehouse\/\" rel=\"noopener\">the 1980s<\/a>; it centralizes an enterprise\u2019s data from all of its databases. 
A Data Warehouse supports the flow of data from operational systems to analytics\/decision systems by creating a single repository of data from various sources (both internal and external). In most cases, a Data Warehouse is a relational database that stores processed data that is optimized for gathering business insights. It collects data with predetermined structure and schema coming from transactional systems and business applications, and the data is typically used for operational reporting and analysis.<\/p>\n<p data-selectable-paragraph=\"\">But data that goes into a data warehouse needs to be processed before it gets stored, and with today\u2019s massive amount of unstructured data, that can take significant time and resources. In response, businesses started maintaining\u00a0<a href=\"https:\/\/blogs.oracle.com\/bigdata\/data-lake-database-data-warehouse-difference\" class=\"broken_link\" rel=\"noopener\">Data Lakes<\/a>\u00a0in\u00a0<a href=\"https:\/\/www.dataversity.net\/brief-history-data-lakes\/\" rel=\"noopener\">the 2010s<\/a>, which store all of an enterprise\u2019s structured and unstructured data at any scale. Data Lakes store raw data, and can be set up without having to first define the data structure and schema. Data Lakes allow users to run analytics without having to move the data to a separate analytics system, enabling businesses to gain insights from new sources of data that were not available for analysis before, for instance by building machine learning models using data from log files, click-streams, social media, and IoT devices. 
By making all of the enterprise data readily available for analysis, data scientists could answer a new set of business questions, or tackle old questions with new data.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-543a2d7 elementor-widget elementor-widget-image\" data-id=\"543a2d7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1478\/0*QC4YhkOjmsLrnt6f\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7a4281a elementor-widget elementor-widget-text-editor\" data-id=\"7a4281a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\"><span style=\"font-size: 11px;\">Data Warehouse and Data Lake Comparisons (Source:\u00a0<a href=\"https:\/\/aws.amazon.com\/big-data\/datalakes-and-analytics\/what-is-a-data-lake\/\" rel=\"noopener\">AWS<\/a>).<\/span><\/p>\nA common challenge with the Data Lake architecture is that without the appropriate data quality and governance framework in place, when terabytes of structured and unstructured data flow into the Data Lakes, it often becomes extremely difficult to sort through their content. The Data Lakes could turn into\u00a0<a href=\"https:\/\/s3-ap-southeast-1.amazonaws.com\/mktg-apac\/Big+Data+Refresh+Q4+Campaign\/ESG-White-Paper-AWS-Apr-2017+(FINAL).pdf\" rel=\"noopener\">Data Swamps<\/a>\u00a0as the stored data become too messy to be usable. 
Many organizations are now calling for more data governance and metadata management practices to prevent Data Swamps from forming.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4a00490 elementor-widget elementor-widget-heading\" data-id=\"4a00490\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 data-selectable-paragraph=\"\"><strong>Distributed &amp; Parallel Processing: Hadoop, Spark, &amp; MPP<\/strong><\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4bdb66f elementor-widget elementor-widget-text-editor\" data-id=\"4bdb66f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p data-selectable-paragraph=\"\">While storage and computing needs grew by leaps and bounds in the last several decades, traditional hardware has not advanced enough to keep up. Enterprise data no longer fits neatly in standard storage, and most big data analytics tasks might take weeks or months, or simply be impossible, to complete on a standard computer. To overcome this deficiency, many new technologies have evolved to include multiple computers working together, distributing the database to thousands of commodity servers. When a network of computers is connected and works together to accomplish the same task, the computers form a\u00a0<a href=\"https:\/\/www.datacamp.com\/community\/blog\/data-science-cloud\" class=\"broken_link\" rel=\"noopener\">cluster<\/a>. 
A cluster can be thought of as a single computer, but can dramatically improve the performance, availability, and scalability over a single, more powerful machine, and at a lower cost by using commodity hardware.\u00a0<a href=\"https:\/\/towardsdatascience.com\/what-happened-to-hadoop-what-should-you-do-now-2876f68dbd1d\" class=\"broken_link\" rel=\"noopener\">Apache Hadoop<\/a>\u00a0is an example of distributed data infrastructures that leverage clusters to store and process massive amounts of data, and what enables the Data Lake architecture.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4373733 elementor-widget elementor-widget-image\" data-id=\"4373733\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1720\/0*Qll4HLSAjByLHgn3\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-dfcf92c elementor-widget elementor-widget-text-editor\" data-id=\"dfcf92c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\"><span style=\"font-size: 11px;\">Evolution of database technologies (Source:\u00a0<a href=\"https:\/\/practicalanalytics.co\/2015\/06\/02\/the-maturing-nosql-ecoystem-a-c-level-guide\/\" rel=\"noopener\">Business Analytic 3.0<\/a>).<\/span><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c1c6556 elementor-widget elementor-widget-text-editor\" data-id=\"c1c6556\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWhen you think Hadoop, think 
\u201cdistribution.\u201d Hadoop consists of\u00a0<a href=\"https:\/\/hortonworks.com\/apache\/hadoop\/#section_2\" rel=\"noopener\">three main components<\/a>: Hadoop Distributed File System (HDFS), a way to store and keep track of your data across multiple (distributed) physical hard drives; MapReduce, a framework for processing data across distributed processors; and Yet Another Resource Negotiator (YARN), a cluster management framework that orchestrates the distribution of things such as CPU usage, memory, and network bandwidth allocation across distributed computers. Hadoop\u2019s processing layer is an especially notable innovation:\u00a0<a href=\"https:\/\/hci.stanford.edu\/courses\/cs448g\/a2\/files\/map_reduce_tutorial.pdf\" rel=\"noopener\">MapReduce<\/a>\u00a0is a two-step computational approach for processing large (multi-terabyte or greater) data sets distributed across large clusters of commodity hardware in a reliable, fault-tolerant way. The first step is to distribute your data across multiple computers (Map), with each performing a computation on its slice of the data in parallel. The next step is to combine those results in a pair-wise manner (Reduce). Google\u00a0<a href=\"https:\/\/ai.google\/research\/pubs\/pub62\" class=\"broken_link\" rel=\"noopener\">published a paper<\/a>\u00a0on MapReduce in 2004, which got\u00a0<a href=\"https:\/\/www.wired.com\/2011\/10\/how-yahoo-spawned-hadoop\/\" rel=\"noopener\">picked up by Yahoo programmers<\/a>\u00a0who implemented it in the open-source Apache environment in 2006, providing every business the capability to store an unprecedented volume of data using commodity hardware. 
Even though there are many open-source implementations of the idea, the Google brand name MapReduce has stuck around, kind of like Jacuzzi or Kleenex.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c47dbf3 elementor-widget elementor-widget-text-editor\" data-id=\"c47dbf3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p data-selectable-paragraph=\"\">Hadoop is built for batch computations: scanning massive amounts of data in a single operation from disk, distributing the processing across multiple nodes, and storing the results back on disk. A query over\u00a0<a href=\"https:\/\/hortonworks.com\/apache\/hadoop\/#section_2\" rel=\"noopener\">zettabytes of indexed data<\/a>\u00a0that would take 4 hours to run in a traditional data warehouse environment could be completed in 10\u201312 seconds with Hadoop and\u00a0<a href=\"https:\/\/hbase.apache.org\/\" rel=\"noopener\">HBase<\/a>. 
Hadoop is typically used to build complex analytics models or high-volume data storage applications such as retrospective and predictive analytics, machine learning and pattern matching, customer segmentation and churn analysis, and active archives.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-065120c elementor-widget elementor-widget-text-editor\" data-id=\"065120c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p data-selectable-paragraph=\"\">But MapReduce\u00a0<a href=\"https:\/\/datawhatnow.com\/batch-processing-mapreduce\/\" class=\"broken_link\" rel=\"noopener\">processes data in batches<\/a>\u00a0and is therefore not suitable for processing real-time data.\u00a0<a href=\"https:\/\/towardsdatascience.com\/a-beginners-guide-to-apache-spark-ff301cb4cd92\" class=\"broken_link\" rel=\"noopener\">Apache Spark<\/a>\u00a0was built in 2012 to fill that gap. Spark is a parallel data processing tool that is optimized for speed and efficiency by\u00a0<a href=\"https:\/\/dzone.com\/articles\/apache-spark-introduction-and-its-comparison-to-ma\" class=\"broken_link\" rel=\"noopener\">processing data in-memory<\/a>. It operates under the same MapReduce principle but runs much faster by completing most of the computation in memory and writing to disk only when memory is full or the computation is complete. This in-memory computation allows Spark to \u201crun programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.\u201d However, when the data set is so large that\u00a0<a href=\"https:\/\/towardsdatascience.com\/a-beginners-guide-to-apache-spark-ff301cb4cd92\" class=\"broken_link\" rel=\"noopener\">insufficient RAM becomes an issue<\/a>\u00a0(usually hundreds of gigabytes or more), Hadoop MapReduce might outperform Spark. 
Spark also has an extensive set of data analytics libraries covering a wide range of functions:\u00a0<a href=\"https:\/\/spark.apache.org\/sql\/\" rel=\"noopener\">Spark SQL<\/a>\u00a0for SQL and structured data,\u00a0<a href=\"https:\/\/spark.apache.org\/mllib\/\" rel=\"noopener\">MLlib<\/a>\u00a0for machine learning,\u00a0<a href=\"https:\/\/spark.apache.org\/streaming\/\" rel=\"noopener\">Spark Streaming<\/a>\u00a0for stream processing, and\u00a0<a href=\"https:\/\/spark.apache.org\/graphx\/\" rel=\"noopener\">GraphX<\/a>\u00a0for graph analytics. Since Spark\u2019s focus is on computation, it does not come with its own storage system and instead runs on a variety of storage systems such as\u00a0<a href=\"https:\/\/aws.amazon.com\/s3\/\" rel=\"noopener\">Amazon S3<\/a>,\u00a0<a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/storage\/\" rel=\"noopener\">Azure Storage<\/a>, and\u00a0<a href=\"https:\/\/hadoop.apache.org\/docs\/r1.2.1\/hdfs_design.html\" rel=\"noopener\">Hadoop\u2019s HDFS<\/a>.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2c11b67 elementor-widget elementor-widget-image\" data-id=\"2c11b67\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/463\/0*MbOv4dHH2DLiEjGd\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6f27d95 elementor-widget elementor-widget-text-editor\" data-id=\"6f27d95\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\"><span style=\"font-size: 11px;\">In an MPP system, all the nodes are interconnected, and data could be exchanged across the network (Source:\u00a0<a 
href=\"https:\/\/www.ibm.com\/support\/knowledgecenter\/en\/SSZJPZ_11.5.0\/com.ibm.swg.im.iis.productization.iisinfsv.install.doc\/topics\/wsisinst_pln_engscalabilityparallel.html\" rel=\"noopener\">IBM<\/a>).<\/span><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ccceafe elementor-widget elementor-widget-text-editor\" data-id=\"ccceafe\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tHadoop and Spark are not the only technologies that leverage clusters to process large volumes of data. Another popular computational approach to distributed query processing is called\u00a0<a href=\"https:\/\/www.zdnet.com\/article\/mapreduce-and-mpp-two-sides-of-the-big-data-coin\/\" rel=\"noopener\">Massively Parallel Processing (MPP)<\/a>. Similar to MapReduce, MPP distributes data processing across multiple nodes, and the nodes process the data in parallel for faster speed. But unlike Hadoop, MPP is used in RDBMS and utilizes a\u00a0<a href=\"https:\/\/0x0fff.com\/hadoop-vs-mpp\/\" rel=\"noopener\">\u201cshare-nothing\u201d architecture<\/a>\u00a0\u2014 each node processes its own slice of the data with multi-core processors, making them many times faster than traditional RDBMS. Some MPP databases, like\u00a0<a href=\"https:\/\/pivotal.io\/pivotal-greenplum\" rel=\"noopener\">Pivotal Greenplum<\/a>, have\u00a0<a href=\"https:\/\/madlib.apache.org\/\" rel=\"noopener\">mature machine learning libraries<\/a>\u00a0that allow for in-database analytics. However, as with traditional RDBMS, most MPP databases do not support unstructured data, and even structured data will require some processing to fit the MPP infrastructure; therefore, it takes additional time and resources to set up the data pipeline for an MPP database. 
Since MPP databases are ACID-compliant and much faster than traditional RDBMS, they are usually employed in high-end enterprise data warehousing solutions such as\u00a0<a href=\"https:\/\/db-engines.com\/en\/system\/Amazon+Redshift%3BGreenplum%3BSnowflake\" rel=\"noopener\">Amazon Redshift, Pivotal Greenplum, and Snowflake<\/a>. As an industry example, the\u00a0<a href=\"https:\/\/www.forbes.com\/sites\/tomgroenfeldt\/2013\/02\/14\/at-nyse-the-data-deluge-overwhelms-traditional-databases\/#3d171aa05aab\" rel=\"noopener\">New York Stock Exchange<\/a>\u00a0receives four to five terabytes of data daily and conducts complex analytics, market surveillance, capacity planning, and monitoring. The company had been using a traditional database that couldn\u2019t handle the workload: data took hours to load, and queries were slow. Moving to an MPP database reduced its daily analysis run time by eight hours.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8f8da85 elementor-widget elementor-widget-heading\" data-id=\"8f8da85\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 data-selectable-paragraph=\"\"><strong>Cloud Services<\/strong><\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6020995 elementor-widget elementor-widget-text-editor\" data-id=\"6020995\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p data-selectable-paragraph=\"\">Another innovation that completely transformed enterprise big data analytics capabilities is\u00a0the rise of cloud services. 
In the bad old days before cloud services were available, businesses had to buy on-premises data storage and analytics solutions from software and hardware vendors, usually paying upfront perpetual software license fees and annual hardware maintenance and service fees. On top of those were the costs of power, cooling, security, disaster protection, IT staff, etc., for building and maintaining the on-premises infrastructure. Even when it was technically possible to store and process big data, most businesses found it cost-prohibitive to do so at scale. Scaling with on-premises infrastructure also requires an extensive design and procurement process, which takes a long time to implement and requires substantial upfront capital. Many potentially valuable data collection and analytics possibilities were ignored as a result.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b4b6caf elementor-widget elementor-widget-image\" data-id=\"b4b6caf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1280\/0*T2yuJKY3GsI1DLux\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0ee04fb elementor-widget elementor-widget-text-editor\" data-id=\"0ee04fb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\"><span style=\"font-size: 11px;\">\u201cAs a Service\u201d providers: e.g. 
Infrastructure as a Service (IaaS) and Storage as a Service (STaaS) (Source:\u00a0<a href=\"https:\/\/imelgrat.me\/cloud\/cloud-services-models-help-business\/\" rel=\"noopener\">IMELGRAT.ME<\/a>).<\/span><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2b16be4 elementor-widget elementor-widget-text-editor\" data-id=\"2b16be4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe on-premises model began to lose market share quickly when cloud services were introduced in the late 2000s \u2014 the global cloud services market has been growing 15% annually in the past decade. Cloud service platforms provide subscriptions to\u00a0<a href=\"https:\/\/imelgrat.me\/cloud\/cloud-services-models-help-business\/\" rel=\"noopener\">a variety of services<\/a>\u00a0(from virtual computing to storage infrastructure to databases), delivered over the internet on a pay-as-you-go basis, offering customers rapid access to flexible and low-cost storage and virtual computing resources. Cloud service providers are responsible for all of their hardware and software purchases and maintenance, and usually have a vast network of servers and support staff to provide reliable services. Many businesses discovered that they could significantly reduce costs and improve operational efficiencies with cloud services, and are able to develop and productionize their products more quickly with the out-of-the-box cloud resources and their built-in scalability. 
By removing the upfront costs and time commitment of building on-premises infrastructure, cloud services also lower the barriers to adopting big data tools and have effectively democratized big data analytics for small and mid-size businesses.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-cd4505f elementor-widget elementor-widget-text-editor\" data-id=\"cd4505f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p data-selectable-paragraph=\"\">There are several cloud services models, with public clouds being the most common. In a\u00a0<a href=\"https:\/\/azure.microsoft.com\/en-us\/overview\/what-are-private-public-hybrid-clouds\/?cdn=disable\" rel=\"noopener\">public cloud<\/a>, all hardware, software, and other supporting infrastructure are owned and managed by the cloud service provider. Customers share the cloud infrastructure with other \u201ccloud tenants\u201d and access their services through a web browser. A\u00a0<a href=\"https:\/\/www.liquidweb.com\/kb\/difference-private-cloud-premise\/\" rel=\"noopener\">private cloud<\/a>\u00a0is often used by organizations with special security needs such as government agencies and financial institutions. In a private cloud, the services and infrastructure are dedicated solely to one organization and are maintained on a private network. The private cloud can be on-premises or hosted by a third-party service provider elsewhere.\u00a0<a href=\"https:\/\/www.sdxcentral.com\/cloud\/multi-cloud\/definitions\/what-is-multi-cloud\/\" class=\"broken_link\" rel=\"noopener\">Hybrid clouds<\/a>\u00a0combine private clouds with public clouds, allowing organizations to reap the advantages of both. 
In a hybrid cloud, data and applications can move between private and public clouds for greater flexibility: e.g., the public cloud could be used for high-volume, lower-security data, and the private cloud for sensitive, business-critical data like financial reporting. The\u00a0<a href=\"https:\/\/www.sdxcentral.com\/articles\/news\/pivotal-greenplum-adds-multicloud-support\/2017\/09\/\" class=\"broken_link\" rel=\"noopener\">multi-cloud model<\/a>\u00a0involves multiple cloud platforms, and each delivers a specific application service. A multi-cloud can be a combination of public, private, and hybrid clouds to achieve the organization\u2019s goals. Organizations often choose multi-cloud to suit their particular business, locations, and timing needs, and to avoid vendor lock-in.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-dbec10a elementor-widget elementor-widget-heading\" data-id=\"dbec10a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3>Case Study: Building the End-to-End Data Science Infrastructure for a Recommendation App Startup<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-220001c elementor-widget elementor-widget-image\" data-id=\"220001c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1836\/0*pY8S1N-OJpvGWneX\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-041a1c6 elementor-widget elementor-widget-text-editor\" data-id=\"041a1c6\" data-element_type=\"widget\" data-e-type=\"widget\" 
data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nBuilding out a viable data science product involves much more than just building a machine learning model with scikit-learn, pickling it, and loading it on a server. It requires an understanding of how all the parts of the enterprise\u2019s ecosystem work together, starting with where\/how the data flows into the data team, the environment where the data is processed\/transformed, the enterprise\u2019s conventions for visualizing\/presenting data, and how the model output will be converted into input for other enterprise applications. The main goals involve building a process that will be easy to maintain, where models can be iterated on and the performance is reproducible, and where the model\u2019s output can be easily understood and visualized for other stakeholders so that they may make better-informed business decisions. Achieving those goals requires selecting the right tools, as well as an understanding of industry best practices and what others in the industry are doing.\n<p data-selectable-paragraph=\"\">Let\u2019s illustrate with a scenario: suppose you just got hired as the lead data scientist for a vacation recommendation app startup that is expected to collect hundreds of gigabytes of both structured (customer profiles, temperatures, prices, and transaction records) and unstructured (customers\u2019 posts\/comments and image files) data from users daily. Your predictive models would need to be retrained with new data weekly and make recommendations instantaneously on demand. Since you expect your app to be a huge hit, your data collection, storage, and analytics capacity would have to be extremely scalable. How would you design your data science process and productionize your models? What are the tools that you\u2019d need to get the job done? 
Since this is a startup and you are the lead \u2014 and perhaps the only \u2014 data scientist, it\u2019s on you to make these decisions.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-32df312 elementor-widget elementor-widget-text-editor\" data-id=\"32df312\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p data-selectable-paragraph=\"\">First, you\u2019d have to figure out how to set up the data pipeline that takes in the raw data from data sources, processes the data, and feeds the processed data to databases. The ideal data pipeline has low event latency (ability to query data as soon as it\u2019s been collected); scalability (ability to handle massive amounts of data as your product scales); interactive querying (support for both batch queries and smaller interactive queries that allow data scientists to explore the tables and schemas); versioning (ability to make changes to the pipeline without bringing down the pipeline and losing data); monitoring (alerts generated when data stops coming in); and testing (ability to test the pipeline without interruptions). Perhaps most importantly, it had better not interfere with daily business operations \u2014 e.g., heads will roll if the new model you\u2019re testing causes your operational database to grind to a halt. 
Building and maintaining the data pipeline is usually the responsibility of a data engineer (for more details,\u00a0<a href=\"https:\/\/towardsdatascience.com\/data-science-for-startups-data-pipelines-786f6746a59a\" class=\"broken_link\" rel=\"noopener\">this article<\/a>\u00a0has an excellent overview on building the data pipeline for startups), but a data scientist should at least be familiar with the process, its limitations, and the tools needed to access the processed data for analysis.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bab46c3 elementor-widget elementor-widget-text-editor\" data-id=\"bab46c3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p data-selectable-paragraph=\"\">Next, you\u2019d have to decide if you want to set up on-premises infrastructure or use cloud services. For a startup, the top priority is to scale data collection without scaling operational resources. As mentioned earlier, on-premises infrastructure requires huge upfront and maintenance costs, so cloud services tend to be a better option for startups. 
Cloud services allow scaling to match demand and require minimal maintenance efforts so that your small team of staff could focus on the product and analytics instead of infrastructure management.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7e69e9e elementor-widget elementor-widget-image\" data-id=\"7e69e9e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/750\/0*VsCnCDsZeLUwRCam\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-755098c elementor-widget elementor-widget-text-editor\" data-id=\"755098c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\"><span style=\"font-size: 11px;\">Examples of vendors that provide Hadoop-based solutions (Source:\u00a0<a href=\"https:\/\/commons.wikimedia.org\/wiki\/File:Top-Hadoop-Companies_Final.png\" class=\"broken_link\" rel=\"nofollow noopener\">WikiCommons<\/a>).<\/span><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8684825 elementor-widget elementor-widget-text-editor\" data-id=\"8684825\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn order to choose a cloud service provider, you\u2019d have to first establish the data that you\u2019d need for analytics, and the databases and analytics infrastructure most suitable for those data types. Since there would be both structured and unstructured data in your analytics pipeline, you might want to set up both a Data Warehouse and a Data Lake. 
An important thing to consider for data scientists is whether the storage layer supports the big data tools that are needed to build the models and whether the database provides effective in-database analytics. For example, some ML libraries such as Spark\u2019s MLlib cannot be used effectively with databases as the main interface for data \u2014 the data would have to be unloaded from the database before it can be operated on, which could be extremely time-consuming as data volume grows and might become a bottleneck when you have to retrain your models regularly (thus causing another \u201cheads-rolling\u201d situation).\n<p data-selectable-paragraph=\"\">For data science in the cloud, most cloud providers are working hard to develop their native machine learning capabilities that allow data scientists to build and deploy machine learning models easily with data stored in their own platform (Amazon has\u00a0<a href=\"https:\/\/towardsdatascience.com\/chapter-1-intro-to-aws-sagemaker-a1ecf00ec761\" class=\"broken_link\" rel=\"noopener\">SageMaker<\/a>, Google has\u00a0<a href=\"https:\/\/cloud.google.com\/bigquery-ml\/docs\/\" rel=\"noopener\">BigQuery ML<\/a>, Microsoft has\u00a0<a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/machine-learning-studio\/\" rel=\"noopener\">Azure Machine Learning<\/a>). But the toolsets are still developing and often incomplete: for example,\u00a0<a href=\"https:\/\/cloud.google.com\/bigquery-ml\/docs\/bigqueryml-intro\" rel=\"noopener\">BigQuery ML<\/a>\u00a0currently only supports linear regression, binary and multiclass logistic regression, K-means clustering, and TensorFlow model importing. 
If you decide to use these tools, you\u2019d have to test their capabilities thoroughly to make sure they do what you need them to do.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5afd5b5 elementor-widget elementor-widget-text-editor\" data-id=\"5afd5b5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p data-selectable-paragraph=\"\">Another major thing to consider when choosing a cloud provider is vendor lock-in. If you choose a proprietary cloud database solution, you most likely won\u2019t be able to access the software or the data in your local environment, and switching vendors would require migrating to a different database, which could be costly. One way to address this problem is to choose vendors that support\u00a0<a href=\"https:\/\/opensourceforu.com\/2019\/04\/an-overview-of-open-source-cloud-platforms-for-enterprises\/\" rel=\"noopener\">open-source<\/a>\u00a0technologies (<a href=\"https:\/\/medium.com\/netflix-techblog\/why-we-use-and-contribute-to-open-source-software-1faa77c2e5c4\" class=\"broken_link\" rel=\"noopener\">here\u2019s Netflix explaining why they use open-source software<\/a>). Another advantage of using open-source technologies is that they tend to attract a larger community of users, meaning it\u2019d be easier for you to hire someone who has the experience and skills to work within your infrastructure. 
Another way to address the problem is to choose third-party vendors (such as\u00a0<a href=\"https:\/\/pivotal.io\/pivotal-greenplum\" rel=\"noopener\">Pivotal Greenplum<\/a>\u00a0and\u00a0<a href=\"https:\/\/www.snowflake.com\/\" rel=\"noopener\">Snowflake<\/a>) that provide cloud database solutions using other major cloud providers as the storage backend, which also allows you to store your data in multiple clouds if that fits your startup\u2019s needs.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2d267fe elementor-widget elementor-widget-text-editor\" data-id=\"2d267fe\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p data-selectable-paragraph=\"\">Finally, since you expect the company to grow, you\u2019d have to put in place a robust cloud management practice to secure your cloud and prevent\u00a0data loss and leakages\u00a0\u2014 such as managing data access and securing interfaces and APIs. You\u2019d also want to implement\u00a0<a href=\"https:\/\/hortonworks.com\/wp-content\/uploads\/2014\/05\/TeradataHortonworks_Datalake_White-Paper_20140410.pdf\" rel=\"noopener\">data governance best practices<\/a>\u00a0to maintain data quality and ensure your Data Lake won\u2019t turn into a Data Swamp.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-83eb75e elementor-widget elementor-widget-text-editor\" data-id=\"83eb75e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p data-selectable-paragraph=\"\">As you can see, there\u2019s so much more in an enterprise data science project than tuning the hyperparameters in your machine learning models! 
We hope this high-level overview has gotten you excited to learn more about data management, and maybe pick up a few things to impress the data engineers at the water cooler.<\/p>\n<p data-selectable-paragraph=\"\"><em><span style=\"font-family: courier new,courier,monospace;\">Co-author:\u00a0<a href=\"https:\/\/www.linkedin.com\/in\/robert-bennett-4002734b\/\" target=\"_blank\" rel=\"noopener nofollow noreferrer\" class=\"broken_link\">Robert Bennett<\/a><\/span><\/em><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Data science training mainly focuses on machine\/deep learning techniques. Data management knowledge is often treated as an afterthought. Data science students usually learn modeling skills with processed and cleaned data in text files stored on their laptop.&nbsp;To be a unicorn, you have to master every step of the data science process &mdash; all the way from storing your data, to putting your finished product in production. 
Here is a high-level overview to learn more about data management.<\/p>\n","protected":false},"author":670,"featured_media":2709,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[94],"ppma_author":[3445],"class_list":["post-2063","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":3445,"user_id":670,"is_guest":0,"slug":"phoebe-wong","display_name":"Phoebe Wong","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Wong","first_name":"Phoebe","job_title":"","description":"<a href=\"https:\/\/phoebetwong.github.io\/\">Phoebe Wong<\/a>, Data Science Fellow&nbsp;<a href=\"http:\/\/twitter.com\/FlatironSchool\" target=\"_blank\" title=\"Twitter profile for @FlatironSchool\" rel=\"noopener\"> at FlatironSchool<\/a>, is Finance Director at <a href=\"https:\/\/equalcitizens.us\/\" target=\"_blank\" rel=\"noopener\">Equal Citizens<\/a> that advances democracy reform in the U.S. 
through a variety of projects."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2063","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/670"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=2063"}],"version-history":[{"count":7,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2063\/revisions"}],"predecessor-version":[{"id":36180,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2063\/revisions\/36180"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/2709"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=2063"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=2063"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=2063"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=2063"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}