
Matt Pouttu Clarke

The instructor has provided technical leadership to market leaders such as CSC, General Dynamics, AAA, Daimler-Benz, Hearst, and Telstra. He has worked for 21 years in a variety of industries including Advertising, Defense, Finance, Telecom, and Manufacturing with architectures ranging from embedded systems to supercomputers.

Hadoop Developer Training

Instructor: Matt Pouttu Clarke

MapReduce, HDFS, YARN, Hive, Pig, Sqoop, Flume, and Drill plus advanced topics

  • Learn advanced techniques like Monte Carlo Simulation, intelligent hashing, partition skew detection, partition pruning, and Hadoop push predicates.
  • Develop rock solid end-to-end MapReduce applications.
  • Instructor: Has provided technical leadership to market leaders such as CSC, General Dynamics, AAA, Daimler-Benz, Hearst, and Telstra, with 21 years of hands-on experience in architectures ranging from embedded systems to supercomputers.

Course Description

Learn the fundamentals of how to produce industrial strength applications using the Hadoop ecosystem. In addition to the basics we introduce advanced topics such as intelligent hashing, partition skew detection, Monte Carlo simulation, partition pruning, and push predicates. Emerging industry standards in data formats, messaging, and stream processing provide guidance to students on future studies.

What am I going to get from this course?
  • Understand core Hadoop components, how they work together, and real world industry best practices in this Hadoop training course.
  • How to produce industrial strength MapReduce applications with the highest standards of quality and robustness.
  • Learn to utilize the Hadoop APIs for basic Data Science tasks such as Monte Carlo Simulation and data preparation.
  • How to partition, reduce, sort, and join data using MapReduce to produce any result you could produce using SQL.
  • Leverage the latest data storage formats to make data processing using MapReduce faster and easier than ever before.
  • Proper usage of compression in large scale environments.
  • How to collect data using Flume and Sqoop.
  • Data exploration using Hive, Pig, and Drill.
  • How to create truly reusable User Defined Functions which operate identically regardless of Hadoop distributions or version upgrades.
  • Methods of exposing an API to enable Hadoop as a Service (HaaS).
  • Future directions and trends in Big Data.

Prerequisites and Target Audience

What will students need to know or do before starting this course?
  1. Working knowledge of Java, equivalent courses, or Java certification.
  2. Ability to use the basics of the Unix command line.
Who should take this course? Who should not?
Students who want to learn Hadoop and are looking for a deep dive into real world usage of Hadoop and related APIs and tools will benefit most from the course. Students must master all the relevant details of the Hadoop APIs and complete rigorous and challenging assignments in the context of a data aggregator case study.


Module 1: Hadoop Cluster Overview
Lecture 1 YARN

We discuss the job execution framework YARN (Yet Another Resource Negotiator) and how its components interact to manage Hadoop job execution.

Lecture 2 HDFS

We discuss Hadoop Distributed File System (HDFS) including design principles, proper usage, and best practices.

Quiz 1 Hadoop Cluster Overview

Verifies student understanding of the basic YARN and HDFS components of Hadoop and how they interact to provide job management and storage management for an application.

Module 2: Industrial Strength MapReduce
Lecture 3 Industrial Strength MapReduce Part 1

Covers how to utilize Eclipse IDE and JUnit to produce and maintain an industrial quality Hadoop MapReduce code base.

Quiz 2 Industrial Strength MapReduce Part 1

Lecture 4 Industrial Strength MapReduce Part 2

Covers how to utilize Eclipse IDE and JUnit to produce and maintain an industrial quality Hadoop MapReduce code base.

Quiz 3 Industrial Strength MapReduce Part 2
Lecture 5 Industrial Strength MapReduce Part 3

Covers how to utilize Eclipse IDE and JUnit to produce and maintain an industrial quality Hadoop MapReduce code base.

Lecture 6 Industrial Strength MapReduce Part 4

Covers how to utilize Eclipse IDE and JUnit to produce and maintain an industrial quality Hadoop MapReduce code base.

Lecture 7 Viewing Log Files and Understanding Counters

How to view and interpret Hadoop MapReduce log files and counters.

Lecture 8 Exercise 00 - Your first Hadoop Test Case

Create a test case from scratch using the MapReduce APIs and complying with code coverage KPIs.

Lecture 9 Exercise 00 - Review

Review your answer and the provided answer. Compare and contrast.

Module 3: Basic Data Science with the Hadoop APIs
Lecture 10 Writable and WritableComparable

Covers fundamentals of how Hadoop serializes various Java data types. Also, how to utilize Eclipse and Maven to explore class and interface hierarchies in the Hadoop code base.

Quiz 4 Writable and WritableComparable

Review the basic structure of WritableComparable as it is visible in Eclipse.

Lecture 11 Introduction to Monte Carlo Simulation

Shows how to implement a Monte Carlo Simulation in Hadoop to verify job logic and to enable performance and load testing in local mode.

Quiz 5 Introduction to Monte Carlo Simulation

Review the role of Monte Carlo Simulation in producing robust Hadoop solutions.
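
As a rough illustration of the idea, a Monte Carlo style test feeds randomly generated input through the job logic many times and checks properties that must hold for every input. The sketch below is plain Python, not the Hadoop API, and the function names (generate_events, sum_by_key, run_trial) are hypothetical stand-ins rather than anything from the course material.

```python
import random


def generate_events(n, keys, seed):
    """Generate n pseudo-random (key, value) records; the seed makes runs repeatable."""
    rng = random.Random(seed)
    return [(rng.choice(keys), rng.randint(1, 100)) for _ in range(n)]


def sum_by_key(records):
    """Hypothetical stand-in for the job logic under test: total value per key."""
    totals = {}
    for key, value in records:
        totals[key] = totals.get(key, 0) + value
    return totals


def run_trial(n, keys, seed):
    """Run one trial and check invariants that must hold for any input."""
    records = generate_events(n, keys, seed)
    totals = sum_by_key(records)
    # Invariant 1: no value is lost or counted twice.
    assert sum(totals.values()) == sum(v for _, v in records)
    # Invariant 2: only keys present in the input appear in the output.
    assert set(totals) == {k for k, _ in records}
    return totals


if __name__ == "__main__":
    # Many seeds give many different inputs; a failing seed can be replayed exactly.
    for seed in range(100):
        run_trial(n=1000, keys=["a", "b", "c", "d"], seed=seed)
    print("all trials passed")
```

Because every trial is driven by a seed, a failure can be reproduced by rerunning the same seed, and raising n turns the same harness into a simple local load test.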

Lecture 12 Introduction to Intelligent Hashing

Shows how Intelligent Hashing APIs from Google can make Hadoop jobs more efficient and fault tolerant.

Quiz 6 Introduction to Intelligent Hashing

Review how to create an Intelligent Hash and what trade-offs are involved in its use.

Lecture 13 Exercise 01 - Data Enrichment using Hadoop

Enrich the data with an Intelligent Hash.

Lecture 14 Exercise 01 - Review

Review your solution and the provided solution. Compare and contrast.

Module 4: Partitioners, Reducers, and Sorting
Lecture 15 Partitioners, Reducers, and Sorting
Quiz 7 Partitioners, Reducers, and Sorting

Test your understanding of how partitioning, reducing, and sorting work.
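
The three steps can be sketched in a few lines. This is a plain-Python, in-memory illustration, not Hadoop code: zlib.crc32 stands in for a key's hash code (Hadoop's default HashPartitioner uses the key's hashCode modulo the number of reducers), and the function names are hypothetical.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def partition_for(key, num_partitions):
    """Deterministic hash partitioner: the same key always maps to the same partition."""
    # Assumption: crc32 is only a stand-in for a stable hash function.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(mapped, num_partitions):
    """Route each (key, value) pair to a partition, then sort each partition by key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in mapped:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return [sorted(p, key=itemgetter(0)) for p in partitions]


def reduce_partition(sorted_pairs):
    """Sum the values of each key; relies on the pairs already being sorted by key."""
    return [(key, sum(v for _, v in group))
            for key, group in groupby(sorted_pairs, key=itemgetter(0))]


def word_count(lines, num_partitions=3):
    """Map each word to (word, 1), then partition, sort, and reduce."""
    mapped = [(word, 1) for line in lines for word in line.split()]
    return [reduce_partition(p) for p in shuffle(mapped, num_partitions)]
```

Each inner list of the result corresponds to one reducer's output. Sorting before reducing is what lets the reducer see all values for a key as one contiguous group.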

Lecture 16 Exercise 02 - Monitoring Partition Skew

How to identify partition skew, potentially well before the job completes.
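
One simple way to quantify skew is to compare the largest partition with the average partition size. The sketch below is a plain-Python illustration under stated assumptions: it runs on a sample of keys, uses zlib.crc32 as a stand-in hash, and the function names and the threshold of 2.0 are hypothetical choices, not values from the course.

```python
import zlib
from collections import Counter


def partition_counts(keys, num_partitions):
    """Count how many of the given keys land in each partition."""
    counts = Counter({p: 0 for p in range(num_partitions)})
    for key in keys:
        counts[zlib.crc32(key.encode("utf-8")) % num_partitions] += 1
    return counts


def skew_ratio(counts):
    """Largest partition size divided by the mean partition size (1.0 means even)."""
    total = sum(counts.values())
    if total == 0:
        return 1.0
    mean = total / len(counts)
    return max(counts.values()) / mean


def is_skewed(counts, threshold=2.0):
    """Flag skew when the largest partition exceeds threshold times the mean."""
    return skew_ratio(counts) > threshold
```

Because the ratio can be computed from a running sample of keys, the same check could be applied while a job is still in progress, provided per-partition counts are available at that point.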

Lecture 17 Exercise 02 - Review

Review your answer and the provided answer. Compare and contrast.

Module 5: Data Formats, Compression, and Splitting
Lecture 18 Data Input and Output

  • Role of custom Writable and WritableComparable
  • File compression and splitting
  • Custom InputFormats and OutputFormats
  • Multiple inputs
  • Schema evolution
  • File formats: SequenceFile, Avro, and Parquet

Lecture 19 Exercise 03 - convert a file from text to Avro

How to create an Avro output from a text input.

Lecture 20 Exercise 04 - convert a file from text to ParquetAvro

How to create a ParquetAvro file from text input.

Module 6: Joining Using MapReduce
Lecture 21 Importance of EIPs - Enterprise Integration Patterns

Relation of EIPs to proper MapReduce application design and implementation.

Lecture 22 Orchestration and Routing Job Flows

Best practices for handling precedence and inter-job dependencies in MapReduce applications.

Lecture 23 Partition Pruning, Push Predicates, and Joins

How to control the exact behavior of joins as well as how to filter data in storage with all the functionality of the SQL WHERE clause.

Lecture 24 Exercise 05 - joining emp with activity logs

Create an inner join using MapReduce.
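
A common way to express an inner join in MapReduce is a reduce-side join: the map step tags each record with its source and emits it under the join key, and the reduce step pairs up records from both sources for each key. The sketch below shows that pattern in plain Python. The record layouts (emp_id, name) and (emp_id, action) and the function names are assumptions for illustration, not the exercise's actual data or solution.

```python
from collections import defaultdict


def map_side(employees, activity_logs):
    """Tag each record with its source and emit it under the join key (emp_id)."""
    for emp_id, name in employees:
        yield emp_id, ("emp", name)
    for emp_id, action in activity_logs:
        yield emp_id, ("log", action)


def inner_join(employees, activity_logs):
    """Join employees with their log entries on emp_id."""
    # Group tagged records by join key, as the shuffle phase would.
    grouped = defaultdict(lambda: {"emp": [], "log": []})
    for emp_id, (tag, payload) in map_side(employees, activity_logs):
        grouped[emp_id][tag].append(payload)
    # Reduce: emit the cross product per key.
    # A key missing from either side yields nothing, which makes the join inner.
    joined = []
    for emp_id in sorted(grouped):
        sides = grouped[emp_id]
        for name in sides["emp"]:
            for action in sides["log"]:
                joined.append((emp_id, name, action))
    return joined
```

Emitting a row with an empty placeholder when one side is missing, instead of skipping the key, would turn the same structure into an outer join.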

Lecture 25 Exercise 06 - partition pruning and push predicates

Add partition pruning and push predicates to the join job.
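
The two techniques differ in granularity: partition pruning skips whole partitions based on the partition key alone, while a pushed-down predicate filters individual rows as they are read, before they reach the join. The plain-Python sketch below illustrates both. The date-keyed partition layout, the row fields, and the scan function are hypothetical and only serve the example.

```python
def scan(partitions, partition_filter, row_predicate):
    """Read only partitions that pass partition_filter and rows that pass row_predicate.

    partitions: dict mapping a partition key (here a date string) to a list of row dicts.
    Returns the matching rows and the number of partitions actually read.
    """
    rows, partitions_read = [], 0
    for part_key in sorted(partitions):
        if not partition_filter(part_key):
            continue  # pruned: the rows of this partition are never touched
        partitions_read += 1
        # Pushed-down predicate: rows are filtered while reading, before any join.
        rows.extend(r for r in partitions[part_key] if row_predicate(r))
    return rows, partitions_read
```

For example, calling scan with a partition filter such as `lambda d: d >= "2017-01-02"` and a row predicate such as `lambda r: r["action"] == "error"` reads only the later partitions and keeps only error rows, so the join that follows sees far less data.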

Module 7: Data Collection with Flume and Sqoop
Lecture 26 Flume Fundamentals

Learn the best practices and key use cases for Flume, and how to stream data to HDFS using Flume.

Lecture 27 Sqoop Fundamentals

Use Sqoop to import a data set from a relational database to HDFS.

Module 8: Data Exploration with Hive, Pig, and Drill
Lecture 28 Hive Fundamentals

Learn how to import and query text data in Hive and the advantages of using ParquetAvro to simplify data management and improve performance.

Lecture 29 Pig Fundamentals

Learn some fundamentals of Pig and how to load, store, and transform data using PigLatin.

Lecture 30 Drill Fundamentals

Learn how Drill can scale out high performance end user queries using ANSI standard SQL.

Module 9: Integrating Hadoop into the Enterprise
Lecture 31 Reusable User Defined Functions (UDFs)

Learn how to create reusable User Defined Functions (UDFs) which work identically across multiple big data tools and across tool version upgrades.

Lecture 32 Exercise 07 - reusable Linear Regression function

Create a reusable function which integrates with multiple big data tools.
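
One way to make such a function reusable is to keep the math in a plain function with no tool-specific imports, so that a thin wrapper per tool can delegate to it. The sketch below is a minimal ordinary least squares fit in plain Python. The name fit_line and this layering are assumptions for illustration, not the exercise's provided solution.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept.

    Returns (slope, intercept). No tool-specific imports, so wrappers can reuse it.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x; zero means the line is undefined.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    # Sum of cross deviations of x and y.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

Since the core has no dependency on any one tool, a version upgrade of a tool only affects its thin wrapper, not the regression logic or its tests.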

Lecture 33 Hadoop as a Service (HaaS)

Learn how to apply EIPs to create Hadoop as a Service (HaaS).

Lecture 34 Exercise 08 - develop basic Hadoop as a Service

Learn how to expose a Hadoop endpoint to future proof and simplify your data processing architecture.

Lecture 35 Job Scheduling with Oozie

Learn best practices for leveraging Oozie to trigger workflows.

Module 10: Future Trends in Hadoop
Lecture 36 Truly Scalable Messaging - LinkedIn’s Apache Kafka

Learn how Kafka enables robust EIP based designs to scale to big data and beyond.

Lecture 37 Unified Batch and Real-time - Google’s Apache Beam

Learn about Apache Beam: the most significant open source contribution since Hadoop.

Lecture 38 Hadoop as a Service Cloud - Amazon Web Services and Google Cloud

Learn how Cloud providers expose APIs to allow pay-as-you-go Hadoop infrastructure.


4 Reviews

Patricia L

December, 2016

I wanted to take this course since I wanted to gain a deeper understanding of Hadoop for my job so I may expand my role. Matt, the instructor, was amazing and laid out a very detailed curriculum in an easy to understand format. I also really found the hands-on coding exercises very beneficial and feel this provided me with a great chance to practice what I learned during the course.

Linda J

May, 2017

I find Hadoop an excellent course, as I want to pursue a role with Hadoop. The course contains all the needed learning material. I could thoroughly understand the fundamentals and advanced Hadoop topics. The course material is very informative with all the details regarding Hadoop APIs, data science tasks, data processing with MapReduce, Flume and Sqoop. The course also acquaints us with the future trends in big data. With this course, it can be a matter of time to master Hadoop and APIs in the data aggregator case study. I also find the exercises in the course very relevant. The instructor’s knowledge is amazing.

Manjeet R

May, 2017

For a working professional who cannot attend classes, this is an opportunity to learn everything about Hadoop. The course nicely explains how to produce MapReduce applications meeting the highest standards of quality and robustness.

Tom J

May, 2017

This is one of the best courses. The course definitely benefits us in using Hadoop and related APIs and tools in the real world. With 10 modules in the course, every aspect of Hadoop is covered for the student.