
Add a little spark* to your dishes

By now, you must have heard of big data. Who hasn't, really?

If you've been actively involved with and reading about big data, and I really hope you are (there is no getting around it), you've probably been messing around with MapReduce.

Let’s just refresh our memory quickly:

MapReduce is the heart of Hadoop®. It is the programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster.

For those of you who haven't heard of Hadoop:

Apache™ Hadoop® is a highly scalable storage platform designed to process very large data sets across hundreds to thousands of computing nodes operating in parallel. It provides a cost-effective storage solution for large data volumes with no format requirements.

The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform: a map task, which takes input data and converts it into (key, value) pairs, and a reduce task, which combines the map output into a smaller, aggregated set of results.
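To make that concrete, here is a minimal conceptual sketch of the two phases, using word count as the running example. This is illustrative Scala, not the actual Hadoop Java API (which wraps these phases in Mapper and Reducer classes):

    // Conceptual sketch of the two MapReduce phases (word count), not the Hadoop API.

    // Map: turn each line of input into (key, value) pairs.
    def mapPhase(line: String): Seq[(String, Int)] =
      line.split(" ").toSeq.map(word => (word, 1))

    // Reduce: combine all values that share the same key into one result.
    def reducePhase(word: String, counts: Seq[Int]): (String, Int) =
      (word, counts.sum)

    // Wiring the two together on a single line of input:
    val pairs = mapPhase("to be or not to be")
    val counts = pairs.groupBy(_._1).map { case (w, ps) => reducePhase(w, ps.map(_._2)) }
    // counts: Map(to -> 2, be -> 2, or -> 1, not -> 1)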

For quite a long time, MapReduce (MR) has reigned as the king of distributed parallel processing frameworks, and rightly so: it gave us a way to use cheap commodity hardware to do tasks in parallel that would have taken years (literally!) to finish on a single costly machine.

Although it has loads of advantages, MR does come with a few disadvantages, the biggest being latency. A computation that was impossible 20 years ago became feasible with MR, which could deliver the result in, say, 20 minutes. Still, 20 minutes is quite a bit of time, and motivation enough for people to look for (and build!) alternatives.

Enter the knight in shining armor: Apache Spark.

But what does it do differently from our old friend, MapReduce?

When I was getting started with Apache Spark, I had the same question. From everything I had heard, it seemed as if Spark did the same things as MR, only better and faster. But, as it turns out, that's not quite the case.

Here are some key differences:

1. Spark is Faster

One of the key differences between Spark and MR is how they process data. Spark does everything in memory, while MR persists data to disk after every map or reduce task. So, by almost any standard, Spark can outperform MR quite easily: it can be as much as 10 times faster than MapReduce for batch processing and up to 100 times faster for in-memory analytics.

But this speed comes at a cost: Spark needs a lot of memory. It keeps data in memory to cache it, so if we have to run Spark alongside some other resource-heavy application, we could see a significant drop in performance. MR does much better in these situations, as it releases data as soon as it is no longer required.

So, essentially, we can say that Spark is faster than MR when it comes to processing the same data multiple times, rather than unloading it and loading new data each time.
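As a rough sketch of what that looks like in practice, assuming a spark-shell session (where the SparkContext sc is predefined) and a hypothetical log file, caching lets repeated passes over the same data reuse in-memory partitions instead of rereading the disk:

    // Hypothetical input path; assumes spark-shell, where `sc` is predefined.
    val events = sc.textFile("hdfs:///logs/events.log").cache() // mark for in-memory caching

    // The first action reads from disk and caches; the second reuses memory.
    val errors = events.filter(_.contains("ERROR")).count()
    val warnings = events.filter(_.contains("WARN")).count()

An equivalent MR pipeline would read the full input from disk for each of the two counts.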

2. Ease Of Use

Spark is generally agreed to be much easier to use than MapReduce. It has APIs in Java, Python, Scala, and R, so people can pick a language they are familiar with. It also has an interactive REPL, where you get instant feedback on your commands.

MapReduce programs, on the other hand, have to be written in Java, and they are usually difficult to program. Even something as simple as Word Count needs a huge program in MR, when you can do the same in Spark in just a couple of lines, as shown below. Fortunately, we have tools like Hive and Pig that try to make MR jobs much easier to write.
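Here is that couple-of-lines Word Count in Spark's Scala REPL; a minimal sketch assuming spark-shell (sc predefined) and a hypothetical input file:

    // Word count in a couple of lines; the input path is hypothetical.
    val counts = sc.textFile("hdfs:///data/input.txt")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println) // peek at the first ten (word, count) pairs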

So, Spark is easier to work with than MR in general, but there are several tools available for MapReduce that make it easier.

3. Graph Processing

Most graph processing algorithms perform multiple iterations over the data. And since MapReduce has to read the data from disk and write it back on every iteration, the latency adds up, making it significantly slower.

Spark, on the other hand, comes with a graph computation library, GraphX, which is really fast for these iterative workloads.

[Figure: performance comparison of graph computations with the PageRank algorithm]
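For a flavor of the API, here is a minimal GraphX sketch for the Scala shell, assuming a hypothetical edge-list file where each line holds a source and a destination vertex id:

    import org.apache.spark.graphx.GraphLoader

    // Hypothetical edge list; each line is "srcId dstId".
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")

    // Run PageRank until the ranks converge to within the given tolerance.
    val ranks = graph.pageRank(0.0001).vertices

    ranks.take(5).foreach(println) // (vertexId, rank) pairs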

In summary, we can conclude that Spark is faster than MapReduce in most cases, except when the available memory is significantly smaller than the size of the data.

Since Spark is usually deployed on clusters in production, the memory on such machines tends to be enough to hold huge chunks of data (thanks to the booming semiconductor industry).

Needless to say, Spark is the future of big data processing, until, of course, we come up with something even better.
