$129.99
Certification

Industry-recognized certification allows you to add this credential to your resume upon completion of all courses.

Instructor

Dr. Mark Plutowski

He has 20 years of experience in data-driven analysis and development, a Ph.D. in Computer Science from UCSD, and an M.S. in Electrical/Computer Engineering from USC. He has worked at IBM, Sony, and Netflix, generating 29 patents.

Create, run, and tune Spark SQL applications end-to-end. Projects included.

  • Learn how to build Spark applications.
  • Learn how to build your own local standalone cluster.
  • The instructor has a Ph.D. and 20 years of experience in data-driven analysis and development at organizations including Netflix, IBM, and Sony.

Duration: 3h 58m

Course Description

In this course you'll learn the physical components of a Spark cluster and the Spark computing framework. You'll build your own local standalone cluster, write Spark code, and learn how to run Spark jobs in a variety of ways. You'll create Spark tables and query them using SQL, and you'll learn a process for creating successful Spark applications.

What am I going to get from this course?

  • Install and configure Spark.
  • Run Spark in several ways.
  • Build Spark applications.
  • Profile Spark applications.
  • Tune Spark applications.
  • Learn 30+ Spark commands.
  • Use Spark SQL window functions.
  • Step through 900 lines of Spark code.
  • Apply pro tips and best practices tested in production.
     

Prerequisites and Target Audience

What will students need to know or do before starting this course?

  • Have familiarity with Python. 
  • Be comfortable at the Unix command line.
  • Have a data orientation and enjoy combing through huge piles of data.
  • Have a laptop, software development environment, and internet connection.

Who should take this course? Who should not?

This course is for all Data Driven Professionals, including:
  • Data Scientists
  • Data Engineers
  • Quantitative Analysts
  • Engineering Managers
  • Data Analysts
  • Dashboard Developers
  • Machine Learning Developers
  • R and SQL Developers
     

Curriculum

Module 1:

Lecture 1 Course Overview and objectives

[0:00 - 4:08] Introduction
  • Overview of objectives, goals, and benefits
  • About the instructor

[4:08 - 9:27] Course Curriculum
  • How can Spark be used?
  • What kinds of problems is Spark good for?
  • How will you learn in this class?
  • Who is the audience of this course?
  • What topics will be covered?
  • What you will learn
  • Prerequisites

Module 2:

Lecture 2 Building Your Local Dev Environment

To install pyspark on any Unix system, first try the following: $ pip install pyspark. This is the recommended installation and works for most configurations. (See Module 9 for example installation code and setup instructions for version 2.2+.)

[0:00 - 6:40] Spark Computing Framework
  • Components of the Spark physical cluster
  • Components of the Spark software architecture
  • Execution modes

[6:40 - 20:50] Installing Spark
  • First try the following: $ pip install pyspark. The video procedure helps users for whom the package-manager install does not work.
  • Virtual environments
  • Using a package management system vs building from source
  • Logging configuration

[20:51 - 29:30] Running Spark
  • Running the pyspark shell
  • Running "Hello World" in Spark
  • Running Spark in the python shell and the ipython shell
  • Creating an RDD and inspecting its contents
  • Using the Spark session object and the Spark context
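
A minimal sketch of the pip-installed "Hello World" flow described above, assuming a local pyspark install; the app name and sample data are illustrative, not taken from the course:

    # Minimal "Hello World" after `pip install pyspark` (app name and data are illustrative).
    from pyspark.sql import SparkSession

    # Build a SparkSession in local mode; the SparkContext is available via .sparkContext.
    spark = SparkSession.builder.master("local[*]").appName("hello-spark").getOrCreate()
    sc = spark.sparkContext

    # Create an RDD from a Python list and inspect its contents.
    rdd = sc.parallelize(["hello", "world", "hello", "spark"])
    print(rdd.take(4))   # peek at the first few elements
    print(rdd.count())   # 4
    spark.stop()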

Module 3:

Lecture 3 Preview: Spark UI

How to use the Spark UI to observe the internals of an application while it is running. We inspect the Jobs, Stages, and Executors tabs.

Lecture 4 Running Spark

[0:00 - 17:40] The Spark UI
  • Review of Spark cluster components
  • Review of Spark execution modes
  • Spark standalone cluster architecture
  • Using the spark-submit command
  • Running in an integrated development environment
  • Using the Spark UI

[17:41 - 29:00] Running a Spark Application in a Notebook and an IDE
  • Writing a new Spark application
  • Running Spark in a Jupyter notebook
  • Creating a dataframe
  • Debugging Spark in PyCharm
  • Using the Spark UI to inspect resource usage
  • Inspecting the driver and executors
  • Observing the programmatic creation of jobs, stages, and tasks
  • Profiling memory and data movement
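
A minimal sketch of submitting a script with spark-submit and creating a dataframe, as covered above; the script name and columns are illustrative:

    # Run with, for example:  spark-submit --master local[*] create_df_example.py
    # (the script name is illustrative)
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dataframe-example").getOrCreate()

    # Create a small dataframe from an in-memory list of rows.
    df = spark.createDataFrame(
        [("alice", 34), ("bob", 45), ("carol", 29)],
        ["name", "age"],
    )
    df.show()          # an action; the resulting job is visible in the Spark UI (default port 4040)
    df.printSchema()
    spark.stop()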

Lecture 5 Preview: How to Debug Internals, and Recapping What We Covered

A debugging tip, followed by a comprehensive summary of what this module covered and a lookahead to what's next.

Module 4:

Lecture 6 Preview of Module 4

Preview of topics covered, including Driver Program, Jobs, Stages, Tasks, Transforms vs Actions, Wide vs Narrow Transforms, What is a Shuffle Boundary, Execution Plan vs Execution Mode, and Shared Variables.

Lecture 7 Preview: Parallel Operations, Transforms vs Actions

A succinct description of parallel operations, the difference between transforms and actions, and how accumulators are related to actions.

Lecture 8 The Spark Computing Framework

[00:00 - 9:34] Core Components of the Spark Computing Framework
  • Cluster manager
  • Workers, executors, task slots
  • Master node
  • Driver program
  • Driver node vs driver process
  • What "job" means inside a Spark session vs outside
  • How jobs are organized into stages and tasks

[9:35 - 15:26] Parallelized Collections
  • Resilient Distributed Datasets
  • Dataframes
  • Datasets
  • Partitions

[15:27 - 23:16] Parallel Operations
  • Transformation vs action
  • Lazy execution
  • Accumulators and lazy execution
  • Types of transformations
  • Shuffle operation and shuffle boundary
  • Execution plan
  • How transformations and actions are assembled into stages
  • Shared variables: broadcast variables and accumulators
  • Description of Exercise 1
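
A small sketch of lazy transformations, actions, and shared variables as described above; the data and names are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("lazy-eval-sketch").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(10))

    # Transformations are lazy: nothing executes yet.
    doubled = rdd.map(lambda x: x * 2)             # narrow transformation
    by_parity = doubled.groupBy(lambda x: x % 2)   # wide transformation (crosses a shuffle boundary)

    # Actions trigger execution and create a job you can observe in the Spark UI.
    print(by_parity.mapValues(list).collect())

    # Shared variables: a broadcast variable and an accumulator.
    lookup = sc.broadcast({0: "even", 1: "odd"})
    counter = sc.accumulator(0)

    def label(x):
        counter.add(1)                 # accumulator updates only happen when an action runs
        return lookup.value[x % 2]

    print(rdd.map(label).collect())
    print("elements processed:", counter.value)
    spark.stop()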

Lecture 9 Preview : Thread Dumps, and Recap

How to locate the thread dump in the PySpark Spark UI and how it differs from the Scala and Java versions of the Spark UI, plus a recap of shared variables and broadcast variables vs accumulators.

Module 5:

Lecture 10 Resilient Distributed Data

[00:00 - 12:22] The Goals of Exercise 1 and the Dataset
  • Introducing the example Spark application we'll be modifying
  • Description of the dataset we'll be using
  • Downloading the dataset
  • Goals of the exercise

[12:23 - 24:03] Code Review of a Solution to Exercise 1
  • Reviewing Exercise 1 in the Sublime text editor
  • Stepping through the exercise
  • Creating a module for configuring the Spark session
  • Inspecting the RDD code in the Spark UI
  • Stepping through the RDD in an IDE

[24:04 - 28:34] Description of Exercise 2 and Module Recap
  • In Exercise 2 you will profile the code written for Exercise 1 using the Spark UI, rewrite it using dataframes, calculate additional statistics from the dataset, and compare the resource consumption of the dataframe version vs the RDD version
  • Recap of what we accomplished in Module 5: loading a text file into an RDD, tokenizing text using RDD operations, using the flatMap, map, and reduceByKey operations, and sorting the RDD
  • Lookahead to what's next using dataframes
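
A minimal sketch of the RDD word-count pattern recapped above (flatMap, map, reduceByKey, then sort); the file path is illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("rdd-wordcount-sketch").getOrCreate()
    sc = spark.sparkContext

    # Load a text file into an RDD (the path is illustrative).
    lines = sc.textFile("data/corpus.txt")

    counts = (
        lines.flatMap(lambda line: line.lower().split())   # tokenize into words
             .map(lambda word: (word, 1))                  # pair each word with a count of 1
             .reduceByKey(lambda a, b: a + b)              # sum counts per word (a shuffle)
    )

    # Sort by descending frequency and inspect the top results.
    print(counts.sortBy(lambda pair: pair[1], ascending=False).take(10))
    spark.stop()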

Quiz 1 Practice your investigative skills.

There are several easter eggs in the data file. Can you match wits with Sherlock Holmes in the data?

Module 6:

Lecture 11 Dataframes

[00:00 - 8:30] Demonstrating a Solution to Exercise 2
  • Introduction to Exercise 2
  • Demonstration of a solution to Exercise 2, from the command line and in a debugger

[8:31 - 24:09] Stepping Through Exercise 2
  • Recap of the RDD-based approach
  • Loading a dataframe from a text file
  • Using the select, alias, explode, lower, col, and length operations
  • Counting uniques using drop_duplicates and distinct
  • Aggregations using the groupBy operation
  • Introducing the GroupedData object
  • Set operations: joins, set intersection, set subtraction
  • Filtering using where
  • Inspecting a sample of a result set using the show action

[24:10 - 29:33] Transforming Columns Using UDFs
  • Transforming a column using a UDF within a select operation
  • Adding a new column using withColumn
  • Adding a column containing a fixed literal value using a literal argument to a UDF
  • How to create a UDF that operates on two columns

[29:34 - 40:33] Bonus Material
  • Inspecting the RDDs used in Exercise 1
  • Comparing the RDD contents with the dataframes from Exercise 2
  • Introduction to Exercise 3
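
A condensed sketch of the dataframe operations listed above (select, explode, lower, length, groupBy, where, UDFs, withColumn, lit); the file path and the UDF are illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode, split, lower, length, lit, udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.master("local[*]").appName("dataframe-ops-sketch").getOrCreate()

    # Load a text file as a single-column dataframe named "value" (the path is illustrative).
    lines_df = spark.read.text("data/corpus.txt")

    # Tokenize: lowercase each line, split on whitespace, and explode into one word per row.
    words_df = lines_df.select(explode(split(lower(col("value")), r"\s+")).alias("word"))

    # Count uniques and aggregate with groupBy.
    print("distinct words:", words_df.distinct().count())
    freq_df = words_df.groupBy("word").count()

    # Filter with where and inspect a sample of the result set with show.
    freq_df.where(col("word") != "").orderBy(col("count").desc()).show(10)

    # Transform a column with a UDF and add columns with withColumn and lit.
    shout = udf(lambda w: w.upper(), StringType())
    freq_df.withColumn("shouted", shout(col("word"))) \
           .withColumn("word_length", length(col("word"))) \
           .withColumn("source", lit("corpus.txt")) \
           .show(5)
    spark.stop()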

Module 7:

Lecture 12 Caching and Memory Storage Levels

[00:00 - 9:03] Caching and Logging
  • cache vs persist
  • Removing objects from the cache using unpersist
  • Command-line demonstration of caching
  • Demonstrating an important quirk of logging when using the DEBUG log level

[9:04 - 13:44] How to Size a Dataset in the Spark UI
  • Creating the object, caching it, and pausing the application
  • Inspecting the object size in the Spark UI

[13:45 - 20:54] Storage Levels, Serialization, and Cache Eviction Policy
  • Memory storage levels
  • Serialization
  • Cache eviction policy

[20:55 - 44:59] Tuning Cache and Best Practices for Caching and Logging
  • A systematic way to tune the cache
  • Best practices for caching: when to cache and when not to cache
  • Implications of running Spark actions within a debug log statement
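
A small sketch of cache vs persist, explicit storage levels, and unpersist as discussed above; the dataset here is synthetic and illustrative:

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.master("local[*]").appName("caching-sketch").getOrCreate()

    df = spark.range(0, 1000000)

    # cache() is persist() with the default storage level (MEMORY_AND_DISK for dataframes).
    df.cache()
    df.count()            # an action is needed to actually materialize the cache

    # persist() lets you choose a storage level explicitly, e.g. memory only.
    doubled = df.selectExpr("id * 2 AS doubled")
    doubled.persist(StorageLevel.MEMORY_ONLY)
    doubled.count()

    # While the application is running, the Storage tab of the Spark UI shows the cached sizes.

    # Remove objects from the cache when they are no longer needed.
    doubled.unpersist()
    df.unpersist()
    spark.stop()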

Module 8:

Lecture 13 Preview: Moving Window N-Tuple Analysis using Window SQL Functions

We preview the 3-tuple, 4-tuple, 5-tuple, and 6-tuple analyses to be performed on the 6 MiB, million-word text using moving n-tuple windows via window SQL functions.

Lecture 14 Preview: Course Recap

A summary of what the course covers and what you will be able to do with what you have learned.

Lecture 15 Preview: N-Tuple Results and Module Recap

Recap of the results of using Spark to perform text analysis on a 6.5 MiB corpus containing one million tokens, and a recap of what we covered in this module.

Lecture 16 SQL, Window Function, and Execution Plans

[00:00 - 8:43] Introduction to Spark SQL
  • Overview of the main concepts
  • A preview of the code
  • Examples of traditional SQL queries
  • Examples of window function SQL queries
  • Running from the command line and in an IDE

[8:44 - 21:40] Spark Tables, the Spark Catalog, and Execution Plans
  • Registering a dataframe as a Spark table
  • Caching a Spark table
  • Inspecting the Spark catalog
  • Querying a Spark table using SQL
  • Examining the execution plan of a dataframe and of a query
  • A defensive programming technique for debugging lazy evaluations

[21:41 - 31:10] Window Function SQL
  • Using dot notation vs SQL queries for dataframe operations
  • An example of an operation that is easier using dot notation
  • Window functions in action: identifying word sequences of a specified length
  • Creating a moving window feature set
  • Finding the most frequent word sequences of a specified length
  • Observations gleaned from our dataset using windowed sequence analysis

[31:11 - 42:44] Pro Tips and Recap
  • Recap
  • Capstone project ideas
  • Pro tips: window functions, UDFs, debugging, and tuning
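
A compact sketch of registering a table, querying it with SQL, building a moving-window n-tuple with a window function, and examining an execution plan; the toy token list stands in for the course corpus:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, lead
    from pyspark.sql.window import Window

    spark = SparkSession.builder.master("local[*]").appName("sql-window-sketch").getOrCreate()

    # A toy sequence of tokens standing in for the corpus (illustrative data).
    words = ["it", "was", "the", "best", "of", "times", "it", "was", "the", "worst", "of", "times"]
    df = spark.createDataFrame([(i, w) for i, w in enumerate(words)], ["pos", "word"])

    # Register the dataframe as a Spark table, inspect the catalog, and query with traditional SQL.
    df.createOrReplaceTempView("tokens")
    print(spark.catalog.listTables())
    spark.sql("SELECT word, COUNT(*) AS freq FROM tokens GROUP BY word ORDER BY freq DESC").show()

    # Window function: build a moving 3-tuple window over word position.
    w = Window.orderBy("pos")
    triples = (
        df.withColumn("next1", lead("word", 1).over(w))
          .withColumn("next2", lead("word", 2).over(w))
          .where(col("next2").isNotNull())
    )
    triples.groupBy("word", "next1", "next2").count().orderBy(col("count").desc()).show(5)

    # Examine the execution plan of the windowed query.
    triples.explain()
    spark.stop()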

Module 9: Running Spark version 2.2.1

Resource 1 Running Spark 2.2.1

A code module, spark_2_2_1.py, is provided in the Downloads for running Spark 2.2.1. See https://github.com/minrk/findspark for how findspark resolves some cases where pyspark isn't on sys.path by default. Note also the use of an OS environment setting to explicitly instruct the code to use Python 3, which resolves some issues in environments where more than one version of Python is installed.
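
A minimal sketch of the approach described above; the actual spark_2_2_1.py provided in the Downloads may differ:

    import os
    os.environ["PYSPARK_PYTHON"] = "python3"   # pin the worker interpreter to Python 3

    import findspark
    findspark.init()                           # locates SPARK_HOME and puts pyspark on sys.path

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[*]").appName("spark-2-2-1").getOrCreate()
    print(spark.version)
    spark.stop()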