Stream big data in real time with Spark and integrate any data source from Kafka to Twitter.
Nothing static, everything in motion.
You probably already know it: Spark is the most popular big data computing engine, the best supported, and with a proven performance record. It's up to 100 times faster than the old MapReduce paradigm, and it can be easily extended with machine learning, streaming, and more.
In this course, we will take a natural step forward: we will process big data as it becomes available.
What awaits you:
- You'll find out how Spark Structured Streaming and Spark's "regular" batch processing are similar and how they differ (see the sketch right after this list).
- You'll work with another set of streaming abstractions (DStreams) for low-level, high-control processing.
- You'll integrate Kafka, JDBC, Cassandra and Akka Streams (!), so you can integrate whatever you like afterwards.
- You'll work with powerful state-tracking APIs that few people know how to use correctly.
- You'll have access to all the code I write on camera (2200+ lines of code).
- (coming soon) You'll have access to the slides.
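To give you a taste of that first point, here is a minimal sketch of a Structured Streaming job whose transformations read exactly like batch code; the socket source, host, and port are placeholders I picked for illustration, not the course's exact setup.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streaming-word-count")
      .master("local[2]") // local run, just for illustration
      .getOrCreate()

    import spark.implicits._

    // A streaming DataFrame supports the same select/filter/groupBy API
    // as a batch one; only the execution model differs.
    val lines = spark.readStream
      .format("socket")           // toy source; Kafka etc. plug in the same way
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val counts = lines
      .select(explode(split($"value", " ")).as("word"))
      .groupBy($"word")
      .count()

    counts.writeStream
      .format("console")
      .outputMode("complete")     // reprint the full aggregation on every trigger
      .start()
      .awaitTermination()
  }
}
```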
Project 1: Twitter
In this project we will integrate live data from Twitter. We'll create a customizable data source to plug into Spark and run a variety of analyses: the length of tweets, the most commonly used hashtags, all in real time. You can use this project as a template for any data source you might want to integrate. At the very end, we'll use Stanford's NLP library to analyze the sentiment of tweets and find out the general mood of social media.
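As a taste of that last step, here is a rough sketch of scoring text with Stanford CoreNLP's sentiment annotator; the annotator list and the 0-to-4 score scale reflect CoreNLP's standard sentiment pipeline, not necessarily the exact code written in the course.

```scala
import java.util.Properties
import scala.jdk.CollectionConverters._ // Scala 2.13; use JavaConverters on 2.12
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations

object SentimentSketch {
  // sentiment needs tokenization, sentence splitting and parsing first
  private val props = new Properties()
  props.setProperty("annotators", "tokenize, ssplit, parse, sentiment")
  private lazy val pipeline = new StanfordCoreNLP(props)

  /** One score per sentence: 0 = very negative ... 4 = very positive. */
  def sentiment(text: String): List[Int] = {
    val annotation = new Annotation(text)
    pipeline.annotate(annotation)
    annotation
      .get(classOf[CoreAnnotations.SentencesAnnotation])
      .asScala
      .map { sentence =>
        val tree = sentence.get(classOf[SentimentCoreAnnotations.SentimentAnnotatedTree])
        RNNCoreAnnotations.getPredictedClass(tree)
      }
      .toList
  }
}
```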
You'll learn:
- how to set up your own data receiver, which you manage yourself to pull in new data (sketched right after this list)
- how to create a DStream from your custom code
- how to get data from Twitter
- how to aggregate tweets
- how to use the Stanford CoreNLP library for sentiment analysis
- how to apply sentiment analysis to tweets in real time
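For the first two points, here is a minimal sketch of Spark's custom Receiver API; fetchNextRecord is a made-up stand-in for the Twitter client built in the course.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver

// A custom receiver: you decide when data is fetched and pushed into Spark.
class MySourceReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  override def onStart(): Unit =
    // poll on a dedicated thread so onStart can return immediately
    new Thread("my-source-poller") {
      override def run(): Unit =
        while (!isStopped()) {
          store(fetchNextRecord()) // hand each record over to Spark
        }
    }.start()

  override def onStop(): Unit = () // the polling loop checks isStopped()

  // hypothetical stand-in for a real Twitter/HTTP client
  private def fetchNextRecord(): String = { Thread.sleep(500); "hello" }
}

object CustomReceiverApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("custom-receiver")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // a DStream backed entirely by our own code
    val stream = ssc.receiverStream(new MySourceReceiver)
    stream.map(_.length).print() // e.g. tweet lengths, batch by batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```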
Project 2: The Science Project
In this project, we will write a full-featured web application supporting multiple users who take part in a scientific experiment: we study the effects of alcohol/substances/insert_your_addictive_drug_like_Scala on reflexes and reaction times. Users will submit data through a web interface connected to a REST endpoint; from there the data will pass through a Kafka broker and finally reach the Spark Streaming backend, which will process it (a sketch of this REST-to-Kafka leg follows below). You can use this app as a template for any full-featured application that combines and processes real-time data with Spark Streaming from any number of concurrent users.
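To illustrate the web-to-Kafka leg of that pipeline, here is a minimal sketch of an Akka HTTP endpoint that forwards POSTed payloads to Kafka; the route, port, and topic name ("science-events") are placeholders I invented, and the server API assumes Akka HTTP 10.2+.

```scala
import java.util.Properties
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object RestToKafka {
  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("rest-to-kafka")

    // plain Kafka producer with the standard string serializers
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // POST /data with a text body -> record on the Kafka topic
    val route =
      path("data") {
        post {
          entity(as[String]) { payload =>
            producer.send(new ProducerRecord[String, String]("science-events", payload))
            complete("OK")
          }
        }
      }

    Http().newServerAt("localhost", 8080).bind(route)
  }
}
```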
You'll learn:
- how to set up an HTTP server in minutes with Akka HTTP
- how to manually send data via Kafka
- how to aggregate data in ways that are nearly impossible to express in SQL
- how to write a full-featured application with a web interface, Akka HTTP, Kafka and Spark Streaming
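And for the Kafka-to-Spark leg, here is a minimal sketch of subscribing to a topic from Spark Structured Streaming; the broker address and topic name are the same illustrative placeholders as above, and the job assumes the spark-sql-kafka connector is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object KafkaToSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-spark")
      .master("local[2]")
      .getOrCreate()

    // subscribe to the topic the web layer writes to (placeholder names)
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "science-events")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // from here you would parse the payload and aggregate as the project requires
    events.writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}
```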