
Sumit Pal

The instructor for this course has more than 22 years of experience in roles spanning companies from startups to enterprises. He has worked for Microsoft (SQL Server development team), Oracle (OLAP development team) and Verizon (as a Director for Big Data Architecture). He currently consults for multiple clients, advising them on their data architectures and big data solutions, and does hands-on coding with Spark, Scala, Java and Python. The author of the recently published book SQL on Big Data, he has extensive experience building scalable systems across the stack, from the middle tier and data tier to visualization for analytics applications, using Big Data and NoSQL databases. He has deep expertise in database internals, data warehouses, dimensional modeling, SQL, and data science with Java and Python.

Big Data Analyst

Instructor: Sumit Pal

Master the skills necessary to build a career in Big Data.

  • Get up to speed with big data technologies and start doing analyst work on massive data sets
  • Instructor background: Microsoft SQL Server team (1996-1997), Oracle development team (1997-2004) and Big Data team at Verizon Labs (2013-2015)
  • This big data online training prepares you for Cloudera's Business Analyst Certification

Course Description

This Big Data online training gives you the background needed to start doing analyst work on Big Data. It covers Big Data basics, Hadoop basics, and tools such as Hive and Pig, which let you load large data sets onto Hadoop, run SQL-like queries over them with Hive, and do analysis and data wrangling with Pig. The course also teaches machine learning basics and data science using R, and briefly covers Mahout, a recommendation and clustering engine for large data sets. It includes hands-on exercises with Hadoop, Hive, Pig and R, with some examples of using R for machine learning and data science work.

What am I going to get from this course?
  • Students will get a good overview of the Big Data landscape and learn the basics of Big Data, Hadoop and HDFS.
  • Students will learn to use tools such as Hive and Pig, both in theory and hands-on.
  • Students will learn some R and SparkR (R running on Spark, a big data processing framework)
  • Students will learn about Mahout, and about data science and where it is used
  • Students will learn the basics of some data science algorithms, such as Decision Trees, Naive Bayes and clustering algorithms, and do hands-on work with them
  • Students will learn about R on Hadoop: tools and solutions
  • Students will also learn how to use Hadoop virtual machines on their laptops

Prerequisites and Target Audience

What will students need to know or do before starting this course?
  • An interest in data, some SQL, and general aptitude
Who should take this course? Who should not?
  • The course is open to anyone who wants to learn about Big Data tools and technologies, or who is interested in data science, its algorithms and where they are used
  • It will be useful for business analysts, managers and anyone interested in working with big data.


Module 1: Big Data Analytics Overview
Lecture 1 Introduction

Introduction to the course and contents

Lecture 2 How Big Data Affects Our Daily Life

Lecture 3 Big Data Analytics Overview

Discusses the state of practice in analytics, the disruption under way, and how Big Data is displacing traditional analytics

Lecture 4 Big Data Analytics Across Verticals

Discusses the use of Big Data in different verticals and in the newly evolving fields of IoT and cybersecurity, and why Big Data is essential for them

Module 2: Big Data Analytics with Hadoop
Lecture 5 What is Hadoop?

Motivation for Hadoop and Distributed Data Processing, new Architectures and History of Hadoop

Lecture 6 Hadoop - Key Platform Components and Architecture

Here we cover how Hadoop evolved into what it is today, its main components, and the reason Hadoop exists

Lecture 7 Hadoop Cluster

This lecture covers the details of a Hadoop cluster and why data splitting and data compression are essential for Hadoop

Lecture 8 HDFS and Map Reduce Architecture

This section covers in detail the two major components that make up Hadoop, HDFS and MapReduce, and their internals

Lecture 9 Hadoop Ecosystem

In this section we cover the Hadoop ecosystem, deployment architectures and major Hadoop vendors, as well as when, where and how to use Hadoop deployments

Resource 1 Resources Download
Module 3: Hive
Lecture 10 Hive Overview

In this section we discuss how Hive fits into the overall Hadoop architecture, what Hive is, and what it is not

Lecture 11 Hive Architecture

In this section we discuss the Hive architecture and basic Hive commands: how to create tables, data types, and support for complex data types

Lecture 12 How to connect Tableau to Hive

In this section we see a basic demo of how to set up Tableau to connect to a Hive installation on your laptop or VM

Lecture 13 Hive Tables, Partitions and Data Formats

In this section we discuss Hive Tables, Data Formats and how to do data partitioning in Hive for better performance and scalability

Lecture 14 Hive deeper details

In this section we cover more capabilities of Hive: functions and joins, other Hive queries, building UDFs, and importing and exporting data

Lecture 15 Hive Hands On Video

Hands-on video showing how to work with Hive, with a walkthrough of a sample example

Module 4: Pig
Lecture 16 Pig Overview

In this section we discuss how Pig fits into the Hadoop ecosystem and introduce Pig: how it works, what it is, and what it is not

Lecture 17 Pig Data Types and Operators

In this section we cover more operators and commands available in Pig and their usage with examples. This is the meat of Pig

Lecture 18 Pig Hands On

In this section we show a video of how to start using Pig, along with some sample examples

Lecture 19 Deeper Into Pig - Some Advanced Things on Pig

More operators and advanced concepts in Pig

Module 5: Introduction to R
Lecture 20 What is R?

In this section we learn about the basics of the R Programming Language and the Data Exploration Capabilities of R

Lecture 21 Data Ingestion and Manipulation with R

In this section we learn the capabilities of R for doing basic Data Ingestion / Reading and Manipulation

Lecture 22 Data Visualization with R

Here we learn how to do some basic data visualization with R

Module 6: R with Big Data ( Hadoop and Spark )
Lecture 23 R with Big Data - 1

Here we cover how R and Big Data technologies have evolved and adapted to process large data sets using R language constructs, with MapReduce and Spark as the underlying engines that run the R code

Lecture 24 R with Big Data - 2

Here we cover how R and Big Data technologies have evolved and adapted to process large data sets using R language constructs, with MapReduce and Spark as the underlying engines that run the R code

Lecture 25 R with SparkR

See working examples of using R on Spark

Module 7: Fundamentals of Machine Learning
Lecture 26 Basics of Machine Learning

What machine learning and data science are, and where they are used

Lecture 27 Road to Data Science

In this section we discuss the skills and capabilities needed to become a data scientist, the life cycle of data science projects, and everyday uses of data science algorithms

Lecture 28 Basic Concepts and Terminology and their meaning

In this section we discuss basic data science concepts such as bias and variance, and why they matter for going further in this field

Lecture 29 Basic Concepts and Terminology and their meaning

This lecture continues the previous one: we discuss more of the fundamental concepts and terminology of machine learning and data science and how they help us build the right algorithms

Lecture 30 Classification and Regression

In this section we look at the basics of Classification and Regression Algorithms

Lecture 31 Naive Bayes and Decision Trees

In this section we look at two of the most commonly used algorithms in data science: Naive Bayes and Decision Trees

Module 8: Installation and Hands On Exercise
Lecture 32 Installation

In this section we will install and set up the VM. The zip file contains the following files:
  • Install.txt: start here and follow the instructions (tested on Windows 7 and Windows 10 laptops)
  • Vagrant_README.md: Install.txt will refer you to this file; follow its steps for installation and setup
  • VagrantNotes.txt: explains how to copy files from your laptop to the VM

These two files are to be used when setting up connectivity to Hive from Tableau:
  • TableauConnectToHive.png
  • HortonworksHiveODBC64.msi

Lecture 33 Hands On working session with Hadoop and HDFS

Try this only after the VM has been installed, is working, and you are comfortable with it. The zip file contains some very basic Hadoop and HDFS commands to help you get used to Hadoop, along with a sample dataset (a text file) to use with those commands.

Lecture 34 Hands on Working session and Exercises with Hive (RECAP)

This lecture contains the resources for working with the Hive examples. The zip file has all the examples and code for you to try out Hive and learn its commands and querying capabilities. The HivePigData.zip file has all the data you need for the exercises.

Lecture 35 Connecting Tableau to Hive (RECAP)

In this lecture we will see a demo of how to connect Tableau to Hive. It covers only the connectivity; visualizing data in Tableau is not part of the course. The zip file has the ODBC driver for connecting to Hive from Tableau and a screenshot of the setup; watch the video with this lecture to do the setup.
  • ODBC driver: HortonworksHiveODBC64.msi
  • Tableau to Hive connection setup: TableauConnectToHive.png

Lecture 36 Hands on Exercise with Pig (RECAP)

In this lecture we will do some sample exercises to learn Pig in more depth. The zip file has the code and exercises for Pig, and the HivePigData.zip file has the datasets we will use. Watch the video for a sample example of working with Pig.

Resource 2 Code and Data Sets

This is not a lecture per se, but the R code and data sets for Module 10, where we cover cluster analysis, decision trees, descriptive statistics and a little probability

Resource 3 DataSets to Download for Module 5
Resource 4 DataSets to Download for Module 5 - SparkR
Module 9: Apache Mahout Introduction
Lecture 37 Mahout Basics

This section discusses the basics of Mahout: where it started, where it is going, and its built-in capabilities and algorithms for large-scale data science.

Lecture 38 Mahout Demo for Recommendation

This section shows how to run Mahout's recommendation engine out of the box, plus one configurable example developed by the instructor

Module 10: Data Analysis and Statistical Methods
Lecture 39 Cluster Analysis - Part 1

This section walks through different ways of clustering data using out-of-the-box algorithms in R

Lecture 40 Cluster Analysis - Part 2

This section walks through different ways of clustering data using out-of-the-box algorithms in R

Lecture 41 Statistical Method - Part 1

This section covers the descriptive statistics part of data analysis using R

Lecture 42 Statistical Method - Part 2

This section covers the basics of probability theory as part of data analysis using R

Lecture 43 Statistical Method - Part 3

This section covers the inferential statistics part of data analysis using R

Lecture 44 Decision Tree - Part 1

This section covers building decision trees using R, with sample examples and demos

Lecture 45 Decision Tree - Part 2

This section covers decision trees using the Random Forest algorithm in R


4 Reviews

Ben J

January, 2017

Excellent course with right contents in terms of coverage and right amount of depth to get started, up and running. The trainer had done lot of hard work in building the right slides and right content which is appropriate for this extensive subject area

Martin R

January, 2017

Good bang for the buck - gets the trainee up to speed with the Big Data Analyst Skills in a short but comprehensive course. The course has a good balance of hands on and theoretical content

Jyotsna C

February, 2017

This is a great course to get started with Big Data. The hands-on exercises are very helpful and the instructor teaches complex concepts in easy to understand language.

Sanjay M

February, 2017

One of the best courses on Big Data. I have been searching for something like this for a while. I have taken many other courses before but the way Sumit takes you to the journey of Big Data is quite unique. He starts off with a big picture and explains each and every aspect of Hadoop with hands on exercises. Highly recommended. Must for anyone to learn hadoop the right way.