Industry-recognized certification: upon completion of all courses, you can add this credential to your resume.

Instructor: Dr. Mark Plutowski - Apache Spark SQL

Dr. Plutowski has 20 years of experience in data-driven analysis and development. He holds a Ph.D. in Computer Science from UCSD and an M.S. in Electrical and Computer Engineering from USC, and has worked at IBM, Sony, and Netflix, generating 29 patents.

Create, run, and tune Spark SQL applications, end-to-end.

  • Learn how to build Spark applications.
  • Learn how to build your own local standalone cluster.
  • Learn from an instructor with a Ph.D. and 20 years of experience in data-driven analysis and development at organizations including Netflix, IBM, and Sony.

Course Description

In this course you'll learn the physical components of a Spark cluster and the Spark computing framework. You'll build your own local standalone cluster, run Spark jobs in a variety of ways, and write Spark code. You'll create Spark tables and query them using SQL, and you'll learn a process for creating successful Spark applications.

What am I going to get from this course?

  • Install and configure Spark.
  • Run Spark in several ways.
  • Build Spark applications.
  • Profile Spark applications.
  • Tune Spark applications.
  • Learn 30+ Spark commands.
  • Use Spark SQL window functions.
  • Step through 900 lines of Spark code.
  • Apply pro tips and best practices tested in production.

Prerequisites and Target Audience

What will students need to know or do before starting this course?

  • Have familiarity with Python. 
  • Be comfortable at the Unix command line.
  • Have a data orientation and enjoy combing through huge piles of data.
  • Have a laptop, software development environment, and internet connection.

Who should take this course? Who should not?

This course is for all data-driven professionals, including:
  • Data Scientists
  • Data Engineers
  • Quantitative Analysts
  • Engineering Managers
  • Data Analysts
  • Dashboard Developers
  • Machine Learning Developers
  • R and SQL Developers


Module 1:

Lecture 1 Course Overview and Objectives

[0:00 - 4:08] Introduction
  • Overview of objectives, goals, and benefits
  • About the instructor

[4:08 - 9:27] Course Curriculum
  • How Spark can be used
  • What kinds of problems Spark is good for
  • How you will learn in this class
  • Who this course is for
  • What topics will be covered
  • What you will learn
  • Prerequisites

Module 2:

Lecture 2 Building Your Local Dev Environment

[0:00 - 6:40] Spark Computing Framework
  • Components of a Spark physical cluster
  • Components of the Spark software architecture
  • Execution modes

[6:40 - 20:50] Installing Spark
  • Virtual environment
  • Using a package management system vs. building from source
  • Logging configuration

[20:51 - 29:30] Running Spark
  • Running the pyspark shell
  • Running "Hello World" in Spark
  • Running Spark in the python and ipython shells
  • Creating an RDD and inspecting its contents
  • Using the Spark session object and the Spark context

Module 3:

Lecture 3 Running Spark

[0:00 - 17:40] The Spark UI
  • Review of Spark cluster components
  • Review of Spark execution modes
  • Spark standalone cluster architecture
  • Using the spark-submit command
  • Running in an integrated development environment
  • Using the Spark UI

[17:41 - 29:00] Running a Spark Application in a Notebook and IDE
  • Writing a new Spark application
  • Running Spark in a Jupyter notebook
  • Creating a dataframe
  • Debugging Spark in PyCharm
  • Using the Spark UI to inspect resource usage
  • Inspecting the driver and executors
  • Observing the programmatic creation of jobs, stages, and tasks
  • Profiling memory and data movement

Module 4:

Lecture 4 The Spark Computing Framework

[0:00 - 9:34] Core Components of the Spark Computing Framework
  • Cluster manager
  • Workers, executors, task slots
  • Master node
  • Driver program
  • Driver node vs. driver process
  • What "job" means inside a Spark session vs. outside
  • How jobs are organized into stages and tasks

[9:35 - 15:26] Parallelized Collections
  • Resilient distributed datasets
  • Dataframes
  • Datasets
  • Partitions

[15:27 - 23:16] Parallel Operations
  • Transformations vs. actions
  • Lazy execution
  • Accumulators and lazy execution
  • Types of transformations
  • The shuffle operation and the shuffle boundary
  • Execution plans
  • How transformations and actions are assembled into stages
  • Shared variables: broadcast variables and accumulators
  • Description of Exercise 1

Module 5:

Lecture 5 Resilient Distributed Data

[0:00 - 12:22] The Goals of Exercise 1 and the Dataset
  • Introducing the example Spark application we'll be modifying
  • Description of the dataset we'll be using
  • Downloading the dataset
  • Goals of the exercise

[12:23 - 24:03] Code Review of a Solution to Exercise 1
  • Reviewing Exercise 1 in the Sublime text editor
  • Stepping through the exercise
  • Creating a module for configuring the Spark session
  • Inspecting the RDD code in the Spark UI
  • Stepping through the RDD in an IDE

[24:04 - 28:34] Description of Exercise 2
  • In Exercise 2 you will:
    - Profile the code written for Exercise 1 using the Spark UI
    - Rewrite the Exercise 1 code using dataframes
    - Calculate additional statistics from the dataset
    - Compare resource consumption of the dataframe version vs. the RDD version
  • Recap of what we accomplished in Module 5:
    - Loading a text file into an RDD
    - Tokenizing text using RDD operations
    - Using the flatMap, map, and reduceByKey operations
    - Sorting the RDD
  • Look ahead to what's next using dataframes

Module 6:

Lecture 6 Dataframes

[0:00 - 8:30] Demonstrating a Solution to Exercise 2
  • Introduction to Exercise 2
  • Demonstration of a solution from the command line and in a debugger

[8:31 - 24:09] Stepping Through Exercise 2
  • Recap of the RDD-based approach
  • Loading a dataframe from a text file
  • Using the select, alias, explode, lower, col, and length operations
  • Counting uniques using drop_duplicates and distinct
  • Aggregations using the groupBy operation
  • Introducing the GroupedData object
  • Set operations: joins, intersection, and subtraction
  • Filtering using where
  • Inspecting a sample of a result set using the show action

[24:10 - 29:33] Transforming Columns Using UDFs
  • Transforming a column using a UDF within a select operation
  • Adding a new column using withColumn
  • Adding a column containing a fixed literal value using a literal argument to a UDF
  • Creating a UDF that operates on two columns

[29:34 - 40:33] Bonus Material
  • Inspecting the RDDs used in Exercise 1
  • Comparing the RDD contents with the dataframes from Exercise 2
  • Introduction to Exercise 3

Module 7:

Lecture 7 Caching and Memory Storage Levels

[0:00 - 9:03] Caching and Logging
  • cache vs. persist
  • Removing objects from the cache using unpersist
  • Command-line demonstration of caching
  • An important quirk of logging when using the DEBUG log level

[9:04 - 13:44] How to Size a Dataset in the Spark UI
  • Creating an object, caching it, and pausing the application
  • Inspecting the object's size in the Spark UI

[13:45 - 20:54] Storage Levels, Serialization, and Cache Eviction Policy
  • Memory storage levels
  • Serialization
  • Cache eviction policy

[20:55 - 44:59] Tuning Cache and Best Practices for Caching and Logging
  • A systematic way to tune cache
  • Best practices for caching: when to cache and when not to
  • Implications of running Spark actions within a debug log statement

Module 8:

Lecture 8 SQL, Window Function, and Execution Plans

[0:00 - 8:43] Introduction to Spark SQL
  • Overview of the main concepts
  • A preview of the code
  • Examples of traditional SQL queries
  • Examples of window-function SQL queries
  • Running from the command line and in an IDE

[8:44 - 21:40] Spark Tables, the Spark Catalog, and Execution Plans
  • Registering a dataframe as a Spark table
  • Caching a Spark table
  • Inspecting the Spark catalog
  • Querying a Spark table using SQL
  • Examining the execution plan of a dataframe and of a query
  • A defensive programming technique for debugging lazy evaluations

[21:41 - 31:10] Window-Function SQL
  • Dot notation vs. SQL queries for dataframe operations
  • An example of an operation that is easier in dot notation
  • Window functions in action: identifying word sequences of a specified length
  • Creating a moving-window feature set
  • Finding the most frequent word sequences of a specified length
  • Observations gleaned from our dataset using windowed sequence analysis

[31:11 - 42:44] Pro Tips
  • Recap and capstone project ideas
  • Pro tips: window functions, UDFs, debugging, and tuning