Apache Spark: ReduceByKey vs GroupByKey – differences and comparison

In this post I will try to introduce you to the main differences between the ReduceByKey and GroupByKey methods and explain why you should avoid the latter. But why? The answer is the “shuffle“.

Shuffle

In a parallel data processing environment like Hadoop, it is important that the “exchange” of data between nodes during the calculations is as small as possible. “Data exchange” is nothing but network traffic between machines. In addition, the data must be serialized and deserialized every time it is sent. The less data we send, the less data has to be serialized and deserialized. Each of these steps consumes computing power, memory, and, most importantly and worst of all, time.

Let’s take a look at the following two analyses to see how the two methods differ and why one of them is better.

ReduceByKey

We’ll start with the ReduceByKey method, which is the “better” one. The green rectangles represent the individual nodes in the cluster.

Let’s analyze node 1, i.e. the first from the left. It holds three pairs of data: (big, 1), (big, 2) and (data, 1). We want to count how many times each word occurs. We can see that the sums for each word were calculated “locally” on the nodes. In other words, the “reduce” operation was first performed on each node on its local portion of the data, before the data was sent to the nodes where the final “reduce” phase takes place and produces the result.

Thanks to this local reduce operation, we limit the amount of data that “circulates” between the nodes in the cluster. In addition, we reduce the amount of data that has to go through serialization and deserialization.
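To make this concrete, here is a minimal word-count sketch using reduceByKey. The data set and the local[*] master are just illustrative assumptions, not part of the original example.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ReduceByKeyExample")
  .master("local[*]") // assumption: running locally for illustration
  .getOrCreate()

// Hypothetical input: an RDD of single words spread across partitions
val words = spark.sparkContext.parallelize(Seq("big", "data", "big", "big", "data"))

// reduceByKey combines values for the same key on each node first (map-side combine),
// so only the partial sums are shuffled across the network.
val counts = words
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.collect().foreach(println) // e.g. (big,3), (data,2)
```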

GroupByKey

Now let’s look at what happens when we use the GroupByKey method. As you can see in the figure below, there is no local reduce phase, so the exchange of data between nodes is greater. In addition, remember the serialization and deserialization process, which in this case must be performed on a larger amount of data.
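For comparison, the same word count expressed with groupByKey might look like the sketch below (it reuses the words RDD from the previous snippet). Every (word, 1) pair is shuffled to the node holding that key before the values are summed, so more data crosses the network.

```scala
// Same result as reduceByKey, but all individual pairs travel through the shuffle
// and the summing only happens after grouping.
val countsViaGroup = words
  .map(word => (word, 1))
  .groupByKey()
  .mapValues(_.sum)

countsViaGroup.collect().foreach(println) // same output, more shuffle traffic
```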

If you enjoyed this post, please add a comment below or share it on Facebook, Twitter, LinkedIn or another social media site.
Thanks in advance!
