In this tutorial, we will introduce core concepts of Apache Spark Streaming and run a Word Count demo that counts words arriving over a network socket, computed in two-second batches.


Concepts

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards.

[Figure: Spark Streaming data flow from input sources to output systems]

Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.

[Figure: Spark Streaming divides the input data stream into batches processed by the Spark engine]

DStream

Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from a source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, Spark’s abstraction of an immutable, distributed dataset (see the Spark Programming Guide for more details).
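To make the abstraction concrete, here is a minimal PySpark sketch (the host and port are placeholders, and the Spark version shipped on the Sandbox is assumed) that builds a DStream from a socket and uses foreachRDD to expose the per-batch RDDs:

```python
from __future__ import print_function
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="DStreamSketch")
ssc = StreamingContext(sc, 2)  # batch interval: one new RDD every 2 seconds

# The input DStream: a continuous series of RDDs, one per 2-second batch
lines = ssc.socketTextStream("localhost", 9999)

# foreachRDD exposes the underlying RDDs of the DStream directly
lines.foreachRDD(lambda rdd: print("records in this batch:", rdd.count()))

ssc.start()
ssc.awaitTermination()
```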

Setup

1. Start your Sandbox

First, start your Sandbox Virtual Machine (VM) in either a VirtualBox or VMware environment and note your VM IP address.

We will refer to your VM IP address as <HOST IP> throughout this tutorial.

If you need help finding your <HOST IP>, check out Learning the Ropes.

2. Launch a “Shell in a Box”

Now that your Sandbox is running, open a web browser and go to: http://<HOST IP>:4200

Where <HOST IP> is the IP address of your Sandbox machine.

For example, the default address for VirtualBox is http://127.0.0.1:4200

Next, log into the “Shell in a Box” using the Sandbox default credentials: username root, password hadoop.

If you’re logging for the first time you will be required to change your password.

3. Download a Spark Streaming Demo to Sandbox

Now let’s download a Spark Streaming demo code to your sandbox from GitHub.

In your “Shell in a Box”, execute two commands: one to fetch the demo script with wget, and one to verify the download. A sketch of the sequence (substitute the actual GitHub URL for <url>; the script name below is hypothetical):
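```bash
# <url> is a placeholder for the raw GitHub URL of the demo script
wget <url>

# Verify the download (SparkStreamingDemo.py is a hypothetical name --
# use whatever file name wget saved)
ls -l SparkStreamingDemo.py
```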

Note: wget <url> downloads Spark Streaming code that computes a simple Word Count. Words (i.e. strings) will be coming in via a network socket connection from a simple Netcat tool introduced later.

Several things are worth pointing out in the demo code you’ve just downloaded (a sketch of the script appears below):
1. We’ve set a 2-second batch interval to make it easier to inspect the results of each batch processed.
2. We perform a simple word count for each batch and print the results back to the terminal screen with the pprint() function.
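Since the exact script isn’t reproduced here, the following is a minimal sketch of what such a demo typically looks like (the app name, host, and port are assumptions; the port must match the one you give Netcat in step 5):

```python
from __future__ import print_function
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="SparkStreamingWordCount")
ssc = StreamingContext(sc, 2)  # 2-second batch interval

# Host/port are assumptions; the port must match the one Netcat listens on
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words, pair each word with 1, and sum per word
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()  # print each batch's counts to the terminal

ssc.start()
ssc.awaitTermination()
```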

4. Submit a Spark Streaming Job

Now you’re ready to submit a Spark job. In your terminal window, copy and paste the following (substituting the name of the script you downloaded in step 3) and hit Enter:
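```bash
# SparkStreamingDemo.py is a hypothetical name -- use your downloaded script
spark-submit SparkStreamingDemo.py
```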

You should see lots of INFO log messages interspersed with a timestamped header for each batch, printed every 2 seconds, similar to:
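```
-------------------------------------------
Time: 2016-03-31 21:40:02
-------------------------------------------
```

(The timestamps will differ; until data arrives, each batch prints an empty result.)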

5. Run Netcat

Netcat (often abbreviated to nc) is a computer networking utility for reading from and writing to network connections using TCP or UDP.

In your browser, open a second tab or window, and open another “Shell in a Box” by going to http://<HOST IP>:4200.

For example, http://127.0.0.1:4200 if you’re running VirtualBox.

Log in to your shell and run the following command to launch Netcat (the port below is an assumption; it must match the port the demo script connects to):
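```bash
# -l: listen for incoming connections; -k: keep listening after disconnects
# Port 9999 must match the port used in the demo script
nc -lk 9999
```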

At this point you should be connected and you may start typing or pasting any text.

For example, if you type hello hello world in the Netcat window, you should see output like the following in the tab or window where the Spark Streaming job is already running:
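```
-------------------------------------------
Time: 2016-03-31 21:40:12
-------------------------------------------
('hello', 2)
('world', 1)
```

(The tuple format assumes the Python pprint() output from the sketch above; timestamps will differ.)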

6. Stopping Spark Streaming and Netcat

When you’re done experimenting, press Ctrl + C in your shell tab or window to stop your Spark Job and/or Netcat process.

7. Suppressing INFO Messages (Optional)

If you want to remove the annoying INFO messages from the Spark Streaming terminal window, do the following:

Open conf/log4j.properties in a text editor. The exact path depends on your Spark installation; on the Sandbox it typically lives under the Spark client’s conf directory. For example:
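```bash
# The path is an assumption -- adjust it to your Spark installation
cd /usr/hdp/current/spark-client/conf

# If log4j.properties doesn't exist yet, create it from the bundled template
cp log4j.properties.template log4j.properties

vi log4j.properties
```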

Then change the first line, which sets the root logging level. In a typical Spark log4j.properties the change looks like this:
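```
# Before:
log4j.rootCategory=INFO, console

# After:
log4j.rootCategory=WARN, console
```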

Save log4j.properties and restart your spark-submit job. Now you should see only WARN messages.

Further Reading

Once you’ve completed this tutorial, check out the other Spark Tutorials.