Big Data has grown at an unprecedented pace in less than a decade, driven by the spread of cloud infrastructure. Organisations that adopted a Big Data strategy have been able to move to the forefront of research and development. This course aims to help attendees develop strategies to better leverage Big Data in today’s data-driven economy. It covers a wide range of techniques for addressing Big Data’s challenges, with the aim of paving the way to new opportunities. The overall objective is to help attendees apply different techniques and tools to address Big Data challenges and to scale Big Data analytics with originality.
The course is intended for Data Engineers working on data integration and data preparation, including ETL processes; Data Scientists working on scaling Big Data analytics; Researchers working on Big Data discovery; Policy makers working on Big Data to address today’s challenges across sectors; and anyone who would like to learn different techniques to address Big Data challenges and become a Big Data savvy professional.
What am I going to get from this course?
At the end of this course, attendees will be able to develop pertinent strategies to better leverage Big Data. They will understand, thoroughly and intuitively, the new opportunities and the new challenges of Big Data. They will learn the much-needed new skills to address these challenges, spanning data integration, data preparation and analytics, including emerging analytics. They will also implement different techniques and tools to integrate, prepare and analyze Big Data, using one of the most powerful integrated development environments (IDE), combining R and Spark.
Prerequisites and Target Audience
What will students need to know or do before starting this course?
The following would help, but none of them is a strict prerequisite for this course:
- Knowledge of relational database management systems (RDBMS) and/or SQL (Structured Query Language) or NoSQL (Not only SQL),
- Concepts and notions of algebra (inverse problems) and probability,
- Some familiarity with algorithm development and coding (MATLAB or R).
Who should take this course? Who should not?
This course is intended for people working on Big Data across sectors and for people who want to become Big Data savvy professionals. It is specifically designed for:
- Data Engineers working on data integration and data preparation including ETL/ELT (Extract, Transform and Load or Extract, Load and Transform) processes,
- Data Scientists working on scaling Big Data Processing and Big Data Analytics and developing Machine Learning (ML or AI) applications,
- Researchers and Scientists working on Big Data for Discovery (drug or gene discovery) or for testing new algorithms,
- Policy makers working on Big Data to address today’s challenges across sectors including education,
- People who would like to learn different techniques to address Big Data challenges spanning data integration, data preparation and data analytics, including emerging analytics.
Module 1: Big Data - Opportunities & Challenges
Opportunities and challenges of Big Data
With the arrival of Big Data, there are new opportunities as well as new challenges that are more subtle than they appear. Big Data has brought radical changes: not only paradigm shifts but also shifts in jobs, which now require new skills ranging from data engineering to data science. Some of these newly required skills have been described as inquiry-type or “soft” skills.
Financial Stock Market Trends
This quiz is meant to help you explore the tools and techniques introduced throughout the course. A code file is attached to help extract data on financial stock market trends. Use this file to display the trends of companies such as Amazon, Apple and IBM.
Module 2: Big Data Integration & Preparation
Big Data Integration - Processing & Streaming of Big Data
Big Data has grown in volume, velocity and variety, requiring its integration and processing in real time. Processing such large and streaming data is a key operational challenge for major industries today. Tools such as the Hadoop ecosystem can help with Big Data integration. Processing can be scaled either horizontally, by distributing the work across additional machines (as in a Hadoop cluster), or vertically, by giving a single machine more resources, such as more CPU cores, more memory or a GPU that executes many instructions in parallel.
Big Data Preparation - Extract, Transform and Load (ETL or ELT)
Big Data is not only growing rapidly but also expanding to include other types of mostly unstructured and "un-formatted" data, such as image data, sensor data, textual data, web-related data and JavaScript Object Notation (JSON) data. In addition to integration, laborious data preparation is required before analysis, a process estimated to take up to 80% of the time needed to carry out any in-depth data analysis. Data preparation is as decisive as data integration, because the data must be in the right format for analytics and for feeding ML models. Data preparation includes ETL processes; ETL stands for “extract, transform and load”. Data preparation may overlap with data integration, and its tools can sit on top of data integration tools. Hive, for example, is added on top of Hadoop to carry out ETL and scale data preparation.
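The three ETL steps described above can be illustrated with a minimal sketch. It is written in plain Python with the standard library only, not with the course's R/Spark or Hive tooling, and the sensor records, field names and table name are hypothetical values chosen for illustration.

```python
import json
import sqlite3

# Hypothetical raw sensor readings in JSON; names and values are illustrative only.
RAW_JSON = """
[
  {"sensor": "A1", "temp_c": "21.5"},
  {"sensor": "a2", "temp_c": "19.0"},
  {"sensor": "A3", "temp_c": null}
]
"""


def extract(raw):
    """Extract: parse the raw JSON text into Python dictionaries."""
    return json.loads(raw)


def transform(records):
    """Transform: drop incomplete rows, normalise ids, convert text to numbers."""
    rows = []
    for rec in records:
        if rec.get("temp_c") is None:
            continue  # skip readings with a missing temperature
        rows.append((rec["sensor"].upper(), float(rec["temp_c"])))
    return rows


def load(rows, conn):
    """Load: write the cleaned rows into a relational table."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (sensor TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


def run_etl(raw=RAW_JSON):
    """Run extract, transform and load against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(raw)), conn)
    result = conn.execute(
        "SELECT sensor, temp_c FROM readings ORDER BY sensor"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(run_etl())
```

In an ELT variant, the raw records would be loaded first and the transformation would run inside the target system, for example as SQL.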
Data Integration - Parallel Processing
This quiz can be carried out either by connecting to data sets over the internet (cloud) or by working locally. A case is attached for practice and as a quiz on parallel processing: first find the number of processors (cores) in your computer, then apply parallel processing accordingly.
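The idea behind this quiz, counting the available cores and then splitting the work accordingly, can be sketched as follows. This is a Python illustration rather than the attached R case; the function names and the sum-of-squares workload are assumptions made for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def count_cores():
    """Return the number of logical cores, falling back to 1 if unknown."""
    return os.cpu_count() or 1


def split(data, n_chunks):
    """Split a list into at most n_chunks contiguous, near-equal chunks."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(chunk):
    """The per-chunk task each worker runs (a stand-in for real work)."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=None):
    """Process one chunk per worker and combine the partial results."""
    workers = workers or count_cores()
    chunks = split(data, workers)
    # Threads keep this sketch portable. For CPU-bound work in CPython,
    # ProcessPoolExecutor offers the same map() interface and uses
    # separate processes, one per core.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))


if __name__ == "__main__":
    print("cores:", count_cores())
    print("result:", parallel_sum_of_squares(list(range(1_000_000))))
```

Sizing the pool from the detected core count avoids hard-coding a number that would be wrong on another machine.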
Module 3: Big Data Analytics & At Work
Big Data Analytics
As with big data integration and big data preparation, big data analytics will continue to be challenging, as it is expanding to include not only predictive analytics but also prescriptive analytics and emerging analytics such as edge analytics. This lecture introduces different types of analytics, with a focus on machine learning, in particular supervised and unsupervised learning. It also presents an effective strategy for carrying out big data analytics, along with ways to measure the performance and accuracy of different ML techniques, with demos and examples to test them.
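As one example of measuring the performance of a supervised ML model, the sketch below computes accuracy, precision and recall for a binary classifier from its confusion counts. It is a plain Python illustration; the label vectors are made-up values, and the lecture's own demos may use other measures.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision and recall as a dictionary."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


if __name__ == "__main__":
    # Made-up labels: 1 = positive class, 0 = negative class.
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
    print(classification_metrics(y_true, y_pred))
```

Accuracy alone can be misleading when one class is rare, which is why precision and recall are reported alongside it.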
Big Data at Work
This lecture focuses on the implementation of the different techniques presented throughout this course, from data integration to data analytics. The implementation will be carried out using R, via its IDE RStudio, with Apache Spark. R, as a functional programming language, relies on two key concepts, functions and objects, to prepare and analyse any type of data. Spark builds on two key concepts familiar from Hadoop, distributed storage such as a Distributed File System (DFS) and the MapReduce processing model, to scale the processing of Big Data. Spark is similar to Hadoop; however, unlike Hadoop MapReduce, it keeps data in memory, which lets it process data, including streaming data, faster and closer to real time. R and Spark combined allow streaming analytics to be scaled, albeit mainly through horizontal scaling. Spark also provides SQL and Machine Learning capabilities. In this lecture, you will learn how to apply the combined R/Spark magic to merge, prepare, query and analyze Big Data and to keep up with Big Data's rapid growth.
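The MapReduce model mentioned above can be illustrated with a single-machine word count. This sketch is plain Python and does not use Spark's or Hadoop's API; the function names are hypothetical and only mirror the map, shuffle and reduce phases that a cluster would run across many nodes.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run the three phases in sequence on one machine."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["Big Data at work", "big data analytics"]))
```

On a cluster, the map and reduce steps run in parallel on different partitions of the data, and the shuffle moves intermediate pairs between machines.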
Data Analytics and Data at Work
This quiz consists of two parts. The first part relates to the use of R and Spark as platforms to develop functions (instructions) and data sets as data frames (objects). The second part covers the use of different types of data sets, including a large data set (the flight data set), and the use of different techniques for preparing and analyzing data.