The Data Revolution is in progress.
Not so long ago, most businesses ran on mainframe computers. These computers were expensive to purchase and were typically housed at corporate headquarters.
Internal staff accessed applications through mainframe terminals. Data was typically stored in VSAM files, where individual fields were determined by their character positions within a line of data. COBOL programmers wrote code to pull the data by specifying the start and stop positions of the requested fields. Reports were then sent to a printer in off-hours batch jobs, and Business Managers sifted through reams of paper to find information. It was a slow, tedious process that required skilled programmers with domain knowledge, and for the most part the information was not shared freely.
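To give a sense of how positional that work was, here is a small Python sketch of reading one fixed-width record; the field positions and values are invented for illustration, not taken from any real layout.

```python
# Minimal sketch of parsing a fixed-width, VSAM-style record.
# The positions and field names below are hypothetical.
record = "000123SMITH     JANE      0001999"

customer_id = record[0:6]            # characters 1-6
last_name   = record[6:16].strip()   # characters 7-16
first_name  = record[16:26].strip()  # characters 17-26
balance     = int(record[26:33])     # characters 27-33

print(customer_id, last_name, first_name, balance)
```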
The move to relational databases allowed 4th-generation programming languages and reporting applications to query the data in a language called SQL. The resulting data appeared on screen in WYSIWYG (What You See Is What You Get) format and could be exported to a spreadsheet or PDF, sent to a printer, or emailed to another user. This removed the need for specialized programmers to sit between the business users and the data.
Traditional reports pulled data from the live systems, locking the data and causing performance problems for the underlying applications. The Data Warehouse was introduced to solve this issue by providing a standard methodology for storing data.
A developer created a model of the data, either a Star Schema or a Snowflake Schema, using Fact and Dimension tables. Fact tables stored the numeric measures, values that could be summed, averaged, or ranked for minimums and maximums, while Dimension tables held descriptive attributes such as Customer, Location, Time, or Product.
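As a rough sketch of what such a model looks like in practice, the snippet below builds a tiny, hypothetical Star Schema in SQLite and slices the measures by two dimensions; the table and column names are assumptions made purely for illustration.

```python
import sqlite3

# A minimal, hypothetical star schema: one Fact table of measures
# joined to Dimension tables of descriptive attributes.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DimProduct  (ProductKey INTEGER PRIMARY KEY, ProductName TEXT);
CREATE TABLE DimLocation (LocationKey INTEGER PRIMARY KEY, Region TEXT);
CREATE TABLE FactSales   (ProductKey INTEGER, LocationKey INTEGER, SalesAmount REAL);

INSERT INTO DimProduct  VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO DimLocation VALUES (1, 'East'),   (2, 'West');
INSERT INTO FactSales   VALUES (1, 1, 100.0), (1, 2, 250.0), (2, 1, 75.0);
""")

# Slice the measures in the Fact table by the descriptive dimensions.
for row in con.execute("""
    SELECT p.ProductName, l.Region, SUM(f.SalesAmount) AS TotalSales
    FROM FactSales f
    JOIN DimProduct  p ON p.ProductKey  = f.ProductKey
    JOIN DimLocation l ON l.LocationKey = f.LocationKey
    GROUP BY p.ProductName, l.Region
"""):
    print(row)
```

A Snowflake Schema simply normalizes the dimension tables further, for example splitting Location into separate Region and Country tables.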
Analysts could then quickly slice the data to determine, for example, how many sales occurred in a given time frame, in a given region, by a particular salesperson, at a particular store, for a particular product. This process pulled data from the source system, loaded it into a Staging Database, and finally moved it to the Data Warehouse. As the data flowed through each phase of the Extract, Transform and Load (ETL) process, the developer applied business logic, such as creating a "CustomerName" field that concatenates LastName + ", " + FirstName, to comply with corporate standards and make the data easier to work with.
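The CustomerName rule above could be applied during the Transform step with something like the following sketch; the row layout and field names are hypothetical.

```python
# Sketch of a Transform step: apply the corporate "CustomerName" rule
# while moving rows from the Staging Database toward the Data Warehouse.
# Field names here are hypothetical.
staging_rows = [
    {"FirstName": "Jane", "LastName": "Smith", "SalesAmount": 100.0},
    {"FirstName": "John", "LastName": "Doe",   "SalesAmount": 250.0},
]

transformed = [
    {**row, "CustomerName": row["LastName"] + ", " + row["FirstName"]}
    for row in staging_rows
]

for row in transformed:
    print(row["CustomerName"], row["SalesAmount"])
```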
Although the business now had a Single Version of the Truth, the cost of building and maintaining the Data Warehouse was high. As a result, only limited sets of data were stored. Finding and retaining qualified programmers to build and maintain it was a challenge, and adding new data sources was not easy either.
Self Service Reporting
As technology matured, each company formed its own Information Technology (IT) Department, but IT still could not satisfy the demands of the business, which was not receiving accurate data in a timely manner. As a result, business units enlisted internal staff or hired consultants to build reports in silos, without letting the IT department know about it. Piecing together bits of data from different locations, many of these report writers did not follow best practices or adhere to company policies on how data should be stored and accessed. Vendors soon saw the demand and created applications that allowed business users to pull their own reports without programming skills, an approach known as self-service reporting. Now any department with a corporate credit card could access the company's data without assistance from the IT department.
Business users now had the ability to run real-time reports against Data Warehouses and traditional relational databases through self-service tools. However, they could only reach data stored in database format. There were still mountains of idle data scattered throughout each organization that could not be used, primarily because the data was "unstructured" or "semi-structured".
Semi-structured data does not conform to standard relational database formats, yet it has a certain degree of predictability. Unstructured data, such as email archives, has no predefined data model at all. However, structure can be applied to such data using a product called Apache Hadoop.
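For example, a JSON event record is semi-structured: it has no fixed columns, but its keys are predictable enough that a tabular structure can be imposed as the data is read. The record below is invented for illustration.

```python
import json

# A hypothetical semi-structured record: no fixed columns,
# but predictable keys that can be mapped into a tabular row.
raw = '{"user": "jsmith", "action": "login", "timestamp": "2016-03-01T09:15:00"}'

event = json.loads(raw)
row = (event.get("user"), event.get("action"), event.get("timestamp"))
print(row)
```

Hadoop-era tools apply this same schema-on-read idea at much larger scale.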
Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
Hadoop has evolved over time, and data stored in it can now be queried through a SQL-like language, HiveQL, provided by Apache Hive. However, these queries can be slow because they are translated into MapReduce code. Another tool, Apache Pig, allows developers to manipulate the data, perform calculations and aggregations, and format data for reporting.
There are many advantages to Hadoop. It allows developers to query large data sets, read structured and unstructured data, and mash together different types of data. More recently, a new product called Apache Spark allows similar data manipulation and can run within the Hadoop ecosystem or stand alone.
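As a rough sketch of that kind of data manipulation using Spark's Python API (PySpark), the snippet below reads a set of JSON sales records and aggregates them for reporting; the file path and column names are assumptions made for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal PySpark sketch: read semi-structured files, aggregate, and
# produce a small result set for reporting. Paths and columns are hypothetical.
spark = SparkSession.builder.appName("sales-summary").getOrCreate()

sales = spark.read.json("hdfs:///data/sales/*.json")  # hypothetical location

summary = (sales
           .groupBy("region", "product")
           .agg(F.sum("amount").alias("total_sales")))

summary.show()
spark.stop()
```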
The industry soon realized that although data was once "nice to have," it rose in status to "have to have" because it became apparent that data could be used to drive business decisions.
Increase sales. Reduce costs. Streamline processes. Find patterns in the data by converting it to information, then analyze the insights and take action on them. With the rise of open data sets, social media, reduced costs in software and hardware, and the availability of programmers, organizations are leveraging this new technology to gain a competitive advantage.
The new role of Data Scientist was labeled the sexiest job of the 21st century, probably because a Data Scientist combines the skill sets of programmer, statistician, and business analyst. Someone working at the intersection of these three fields can prepare data, apply algorithms, and translate the resulting insights into a common language for consumption. Data Scientists understand the data, the business, and the statistics, and can crunch data from traditional relational databases as well as from unstructured and semi-structured data sets. Some of the algorithms they use identify patterns and predict future behavior. It is a highly sought-after position in almost any industry.
Artificial Intelligence first came about in the 1950s, when people saw the need for computers to think. Although researchers lacked the processing power at the time, they laid the foundation for future work. As the price of hardware and software decreased over time, advancements in pattern recognition, data mining, and predictive analytics gained pace. One of the main theoretical concepts used in these areas is the Artificial Neural Network: a series of connected nodes joined by weighted links, where a node activates when its inputs meet certain criteria and in turn activates other nodes further downstream. With training, a neural network can learn over time, remember things and events, and run simulations projected into the future to better predict what different outcomes will look like. These algorithms are growing in both the public and private sectors because they can automate many repetitive processes, and many organizations are investing in AI to streamline operations and reduce overhead costs.
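A single artificial neuron of the kind described above can be sketched in a few lines of Python; the inputs, weights, and bias below are arbitrary numbers chosen for illustration, and in a real network training would adjust the weights.

```python
import numpy as np

# Minimal sketch of one artificial neuron: inputs arrive over weighted
# connections and pass through an activation function, which determines
# how strongly the node "fires" toward downstream nodes.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

inputs  = np.array([0.5, 0.2, 0.9])   # arbitrary input signals
weights = np.array([0.8, -0.4, 0.3])  # connection weights (learned in training)
bias    = 0.1

activation = sigmoid(np.dot(inputs, weights) + bias)
print(activation)  # a value near 1 means the node fires strongly
```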
Each business or organization is now in the software business. This is because every company runs software, which accumulates data that can be harnessed and mashed with other data to provide useful insights.
This has created an explosion of data in the current Data Revolution. Data Scientists are able to extract knowledge from large volumes of both structured and unstructured data. Companies can now extract personal information about customers for marketing campaigns through sentiment analysis and algorithms that predict customer behavior. Being a data-driven enterprise is becoming the standard.