This course teaches you how to write Apache Storm programs that take streaming data in real time from tools like Kafka and Twitter, process it in Storm, and save it to tables in Cassandra or files in Hadoop HDFS. You will learn to develop distributed stream processing applications that process streaming data in parallel and handle failures. You will be able to implement data transformations like maps and filters, stateful stream processing, and exactly-once processing in Apache Storm. The course also covers administrative aspects such as setting up an Apache Storm cluster, scheduling, monitoring, and metrics reporting.
This is a hands-on course, so you will develop many Apache Storm programs using the Eclipse IDE and Java. Theory is intermixed with practice so that you apply what you learn as a developer. You will write more than thirty programs during this course.
The only way to learn a new tool quickly is to practice by writing programs. This course provides the right mix of theory and practice, along with real-life industry uses of Apache Storm. By enrolling in this course, you will be on your way to becoming a big data developer using Apache Storm.
What am I going to get from this course?
Implement Apache Storm programs that take real-time streaming data from tools like Kafka and Twitter, process it in Storm, and save it to tables in Cassandra or files in Hadoop HDFS. You will be able to develop distributed stream processing applications that process streaming data in parallel and handle failures, and to implement stateful stream processing, data transformations like maps and filters, and exactly-once processing.
Prerequisites and Target Audience
What will students need to know or do before starting this course?
- Experience in developing software projects
- Some programming experience in Java
- Use of a Java IDE like Eclipse or IntelliJ
Who should take this course? Who should not?
Real-time big data processing tools have become mainstream, and many organizations have started processing big data in real time. Apache Storm is one of the popular tools for processing big data in real time. If you are familiar with Java, you can easily learn Apache Storm programming to process streaming data in your organization. Through this course, I aim to give you a working knowledge of Apache Storm so that you can write distributed programs to process streaming data.
Data Sizes in Big Data
Big Data Problem
Big Data Solution
Demo and practice activity: Install Eclipse
Download, Install and start Eclipse
Download the training programs
Download the training program zip file. Create a directory C:\storm on Windows and unzip the training program zip file into that directory. This creates three directories, input, output and training, and copies the files into them. The output folder will be empty.
Demo and practice activity: Create a maven project in Eclipse
Create a maven project and set build path
Demo and practice activity: Add Apache Storm programs to Eclipse project
Add the training programs provided to the created Eclipse project
Demo and practice activity: Compile the storm program in Eclipse
Correct the mistakes, adjust the build path and create a run configuration to run the program
Demo and practice activity: Run the Apache Storm program from Eclipse
Using the run configuration, run the Storm program in the local cluster and see the results.
Module 2: Introduction to Apache Storm
Storm Data Model
Storm Topology Simple Example
Demo and practice activity: Create a simple Apache Storm program
In this demo, create a simple Storm program using the sample program provided and run it in the local cluster to see the results.
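As a preview of what a simple topology computes, here is a dependency-free Java sketch of the classic word-count flow (a spout emits sentences, a split bolt breaks them into words, a count bolt keeps totals). The class and method names are illustrative, not the course's actual program, and no Storm API is used:

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch (no Storm dependency) of the logic a simple word-count
// topology performs: a "spout" emits sentences, a "split bolt" breaks them
// into words, and a "count bolt" keeps a running total per word.
public class WordCountLogic {
    private final Map<String, Integer> counts = new HashMap<>();

    // Equivalent of the split bolt and count bolt execute() methods combined.
    public void process(String sentence) {
        for (String word : sentence.toLowerCase().split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
        }
    }

    public int countOf(String word) {
        return counts.getOrDefault(word, 0);
    }

    public static void main(String[] args) {
        WordCountLogic topology = new WordCountLogic();
        topology.process("storm processes streams");   // spout emits a tuple
        topology.process("storm counts words");        // spout emits a tuple
        System.out.println(topology.countOf("storm")); // prints 2
    }
}
```

In a real Storm program the same logic is distributed: the split and count steps run as separate bolts, possibly on different machines.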
Storm Topology: Case Study 1
Demo and practice activity: Implement the case study as Apache Storm program
Implement the case study 1 program in Eclipse and run it to see the results.
Storm Topology: Case Study 2
Demo and practice activity: Implement the case study 2 program
Implement the case study 2 program in Eclipse, run and see the results.
Demo and practice activity: Implement the periodic processing in Storm with tick tuples
Use tick tuples to implement the Apache Storm program for periodic processing. Run and see the results.
Practice activity: Write the storm programs for the five assignments and run them with the data provided
Five assignments are described in the document. Modify the programs in this section to complete the assignment programs and run them. Sample programs are provided to help with a few assignments; download them from the download section, and look at them only if you have trouble completing the assignments. Make sure to run them and see the results before you move on to the next section.
Module 3: Storm Installation & Configuration
Storm Environment Setup
Starting Storm Servers
Demo and practice activity: Create a thin jar in Eclipse
Use this demo to create a thin jar that you can use to run your Apache Storm programs. The Maven build in Eclipse can be used to build a thin jar.
Demo and practice activity: Create a fat jar in Eclipse
Create a fat jar for the Storm program so that it includes the dependent libraries.
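One common way to build a fat jar with Maven is the maven-shade-plugin. The pom.xml fragment below is a minimal sketch (the version number is illustrative), not necessarily the configuration used in the course:

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.4.1</version>
      <executions>
        <execution>
          <!-- Bundle dependencies into the jar during `mvn package` -->
          <phase>package</phase>
          <goals><goal>shade</goal></goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

When submitting to a Storm cluster, the Storm library itself is typically given `provided` scope in the pom so that it is not packaged into the fat jar, since the cluster already supplies it.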
Submitting a Job to Storm
Using Eclipse for Storm Programs
Setup a Storm Cluster
Practice activity: Perform the five activities specified
Practice what you have learned in this section by completing the five activities. A sample program is provided for one of the activities in the download section.
Module 4: Storm Classes & Groupings
The Fields Class
Storm Classes and Interfaces
Building a Topology
Demo and practice activity: Shuffle grouping with multiple tasks
Demo and practice activity: Fields grouping with multiple tasks
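The idea behind fields grouping can be sketched in plain Java: the value of the grouping field is hashed modulo the number of target tasks, so every tuple with the same field value lands on the same task. Storm's internal hashing differs in detail; this shows the principle only, and the names are illustrative:

```java
// Plain-Java sketch (no Storm dependency) of fields grouping routing:
// hash the grouping field's value modulo the number of target tasks.
public class FieldsGroupingDemo {
    static int taskFor(Object fieldValue, int numTasks) {
        // floorMod keeps the result in [0, numTasks) even for negative hashes
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        // The same word maps to the same task index on every call,
        // while different words may be spread across tasks.
        System.out.println(taskFor("storm", 4));
    }
}
```

This is why fields grouping is used for stateful bolts like word counters: all tuples for a given word reach the task that holds that word's count.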
Normal Tuple Processing in Storm
Demo and practice activity: Implement reliable processing in Storm
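The ack/fail mechanics behind reliable processing can be sketched in plain Java: the spout remembers each tuple by message id, an ack forgets it, and a fail re-queues it for replay. In real Storm code the spout emits with a message id and bolts call collector.ack(tuple) or collector.fail(tuple); the class below is illustrative only:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch (no Storm dependency) of reliable spout bookkeeping:
// pending holds emitted-but-unacknowledged messages; fail() re-queues.
public class ReliableSpoutLogic {
    private final Map<Long, String> pending = new HashMap<>();
    private final Deque<String> queue = new ArrayDeque<>();
    private long nextId = 0;

    public ReliableSpoutLogic(String... messages) {
        for (String m : messages) queue.add(m);
    }

    // nextTuple(): emit the next message with a tracking id, or null if empty.
    public Long emit() {
        String msg = queue.poll();
        if (msg == null) return null;
        long id = nextId++;
        pending.put(id, msg);
        return id;
    }

    public void ack(long id)  { pending.remove(id); }            // fully processed
    public void fail(long id) { queue.add(pending.remove(id)); } // replay later

    public int pendingCount() { return pending.size(); }
    public int queuedCount()  { return queue.size(); }
}
```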
Practice activity: Write the programs for the nine activities listed and run to check the output
The nine activities listed provide good practice for this section. Sample programs are provided for some of the activities in the download section; look at them only after attempting the activities yourself. Always run the programs, correct any mistakes, and check the output.
Module 5: Storm Trident
Case Study: Trident Operations
Demo and practice activity: Implement Trident stream transformations
The previous case study is illustrated with the actual program in Eclipse. You are encouraged to create this program in Eclipse using the training program files provided, then run it and check the results.
Operations on Grouped Streams
Trident Exactly Once Processing
Case Study: Trident State Updates
Demo and practice activity: Trident state implementation part 1 : Spout implementation
The Trident state processing and exactly-once processing implementation is quite complex, so it is implemented and illustrated step by step in multiple parts. I start by showing the spout that produces the batches of tuples.
Demo and practice activity: Trident state implementation part 2: IBackingMap implementation
I continue here with the implementation of the IBackingMap interface in Trident. The multiGet method of this class is illustrated here.
Demo and practice activity: Trident state implementation part 3: IBackingMap and StateFactory implementation
Here I cover the multiPut method of the IBackingMap implementation and continue with a simple implementation of StateFactory.
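The core idea behind Trident transactional state can be sketched in plain Java: each stored value remembers the txid of the batch that last wrote it, so when a failed batch is replayed with the same txid the update is skipped and nothing is counted twice. The method names below echo the multiGet/multiPut discussion but the class carries no Storm dependency and is illustrative only:

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of txid-based exactly-once state updates: a replayed
// batch finds its own txid already recorded and is not applied again.
public class TransactionalCounts {
    static class Stored {
        long txid = -1; // txid of the batch that last updated this entry
        long count = 0;
    }

    private final Map<String, Stored> table = new HashMap<>();

    // multiPut-style update: apply the delta at most once per txid.
    public void put(long txid, String key, long delta) {
        Stored s = table.computeIfAbsent(key, k -> new Stored());
        if (s.txid == txid) return; // replayed batch: already applied, skip
        s.count += delta;
        s.txid = txid;
    }

    // multiGet-style read of the current count.
    public long get(String key) {
        Stored s = table.get(key);
        return s == null ? 0 : s.count;
    }
}
```

This only works when batches are replayed with exactly the same tuples, which is why Trident pairs this state design with transactional spouts.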
Demo and practice activity: Trident state implementation part 4: The main method implementation
Now that all the pieces are in place, it is time to connect them in the main method by creating the Trident topology and adding the spout and state processing to it.
Demo and practice activity: Trident state implementation part 5: Run the Trident state processing program
It is finally time to see the fruits of our labor. Here I run the created program in the local cluster and see the results. Make sure you also follow the demo and run the program on your machine to check the results.
Practice activity: Write the programs for the six activities listed and check the output
These six activities help you apply the Trident interface to processing streams in Apache Storm. Sample programs are provided for some of the activities; you can download them from the download section.
Module 6: Storm Scheduling
Storm User Interface
Resource Aware Scheduler
Resource Aware Scheduler: Example
Configuration for Ganglia
Practice activity: Perform the two activities listed in this section
Perform the two activities listed in this section by modifying the existing programs. A sample modified file is provided; you can download it from the download section.
Demo: Monitor multiple topologies using Storm User Interface
Look at multiple topologies, including a reliable topology and a Trident topology, in the Storm UI.
Module 7: Storm Interfaces
Storm Kafka Spout Example
Compiling for Kafka
Demo and practice activity: Setup and start Zookeeper and Kafka servers
To illustrate Storm's interface to Kafka, let us first set up Zookeeper and Kafka and start the servers.
Demo and practice activity: Create a new topic in Kafka
Create a topic in Kafka so that Storm can receive messages from this topic.
Demo and practice activity: Start Kafka producer
Start the Kafka console producer process, which takes typed messages and publishes them to the topic that Storm reads.
Demo and practice activity: Storm Program for interfacing with Kafka
Here you look at the Apache Storm program that uses a Kafka client spout to connect to a Kafka topic, receive the messages, and print them out.
Demo and practice activity: Run the program and see flow of messages from Kafka to Storm
Here you start the Storm program that interfaces with Kafka. Messages entered in the Kafka producer can then be seen in the Storm output.
Setting Properties for Cassandra
Writing to Cassandra Table
Real Time Data Analytics Platform
Demo and practice activity: Setup and start Cassandra server
Install Cassandra and start the Cassandra server
Demo and practice activity: Create keyspace and table in Cassandra
Create a keyspace and a table in Cassandra to receive the data from Storm.
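In CQL this step looks roughly like the following; the keyspace, table, and column names here are placeholders, not the course's actual schema:

```sql
-- Illustrative CQL: create a keyspace and a table to receive Storm output.
-- SimpleStrategy with replication_factor 1 suits a single-node dev setup.
CREATE KEYSPACE IF NOT EXISTS stormdata
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE stormdata;

CREATE TABLE IF NOT EXISTS messages (
  id uuid PRIMARY KEY,
  body text
);
```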
Demo and practice activity: Look at the Storm program that takes messages from Kafka and stores to table in Cassandra
Here I illustrate the real-time data analytics platform with the Apache Storm program that takes messages from a topic in Kafka and stores them as rows in a table in Cassandra in real time.
Demo and practice activity: Run the Kafka-Storm-Cassandra interface program to see the flow of data from Kafka to Cassandra table
Finally, the real-time data analytics platform is illustrated by running the Storm interface program. You can enter messages for the Kafka topic in one console window and see the data updated in the Cassandra table in another.
Example Writing to HDFS
Demo and practice activity: Create the program to store data into Hadoop HDFS from Kafka
The program illustrates reading data from a Kafka topic and inserting it into a directory in HDFS.
Interfacing with Twitter
Demo and practice activity: Create the program for getting tweets from Twitter in Apache Storm
This program illustrates using Twitter4J to get data from Twitter and process the tweets in Storm. The link for creating a Twitter developer account and getting Twitter credentials is provided in the download section.
Demo and practice activity: Run the twitter interface program and look at the live tweets
The program filters the tweets in real time for certain keywords and displays them. You can run the program yourself by providing your Twitter credentials, which can be obtained from the link provided.
Practice activity: Write the programs for the seven activities listed in this section and check the results
The seven activities in this section can be used to practice the Storm interfaces. You can go through the demos multiple times to practice the commands for Kafka and Cassandra as well. Sample programs are provided for some of the activities; you can download them from the download section.
Course summary, next steps