In this post I will introduce the main differences between Apache Spark's reduceByKey and groupByKey methods and explain why you should avoid the latter. Why? The answer is shuffle.
Shuffle in Apache Spark ReduceByKey vs GroupByKey
In a parallel data processing environment like Hadoop, it is important that the "exchange" of data between nodes during a computation is as small as possible. Data exchange is nothing but network traffic between machines. In addition, the data must be serialized and deserialized every time. The less data we send, the less data has to be serialized and deserialized. Each of these steps consumes computing power, memory, and, worst of all, time.
Let’s take a look at the following two analyses to see how the two methods differ and why one of them is better.
We’ll start with the RDD reduceByKey method, which is the better one. The green rectangles represent the individual nodes in the cluster.
Let’s analyse node 1, i.e. the first from the left. It holds three pairs of data: (big, 1), (big, 2) and (data, 1). We want to count how many times each word occurs. We can see that the sums for each word were calculated “locally” on the nodes. In other words, a “reduce” operation was first performed on each node over its local portion of the data, and only then were those partial results sent across the cluster for the final “reduce” phase, which produces the result.
Thanks to this local reduce step, we limit the amount of data that circulates between nodes in the cluster. In addition, we reduce the amount of data subjected to serialization and deserialization.
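The two-phase behaviour described above can be sketched in plain Python. This is a simulation of the mechanism, not the Spark API; the per-node partitions and their word counts are made up to mirror the figure (node 1 holds (big, 1), (big, 2), (data, 1)). In PySpark itself the whole computation would simply be `rdd.reduceByKey(add)`.

```python
from collections import defaultdict
from operator import add

# Hypothetical per-node partitions, mirroring the figure; the values are
# illustrative only.
partitions = [
    [("big", 1), ("big", 2), ("data", 1)],
    [("big", 1), ("data", 2)],
    [("data", 1), ("big", 3)],
]

def local_reduce(partition, func):
    """Map-side combine: merge values per key before any data leaves the node."""
    acc = {}
    for key, value in partition:
        acc[key] = func(acc[key], value) if key in acc else value
    return list(acc.items())

# Phase 1: reduce locally on each node (what reduceByKey does before the shuffle).
combined = [local_reduce(p, add) for p in partitions]

# Phase 2: shuffle only the per-node partial sums, then reduce them again.
totals = defaultdict(int)
for partition in combined:
    for key, value in partition:
        totals[key] += value

shuffled_records = sum(len(p) for p in combined)
print(dict(totals))       # {'big': 7, 'data': 4}
print(shuffled_records)   # 6 partial sums cross the network, not 7 raw pairs
```

Only six pre-aggregated records are shuffled here instead of the seven raw pairs, and the gap grows with the number of duplicate keys per node.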
Now let’s look at what happens when we use the RDD groupByKey method. As you can see in the figure below, there is no local reduce phase. As a result, the exchange of data between nodes is greater. In addition, remember the serialization and deserialization process, which in this case must be performed on a larger amount of data.
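For comparison, here is the same word count sketched in groupByKey style, again as a plain-Python simulation over the same made-up partitions rather than real Spark code. Every raw (key, value) pair is shuffled to the destination node before any aggregation happens; in PySpark this would correspond to `rdd.groupByKey().mapValues(sum)`.

```python
from collections import defaultdict

# Same hypothetical per-node partitions as in the reduceByKey sketch.
partitions = [
    [("big", 1), ("big", 2), ("data", 1)],
    [("big", 1), ("data", 2)],
    [("data", 1), ("big", 3)],
]

# groupByKey style: no map-side combine, so every raw pair crosses the network.
grouped = defaultdict(list)
for partition in partitions:
    for key, value in partition:
        grouped[key].append(value)

shuffled_records = sum(len(p) for p in partitions)  # all 7 raw pairs shuffled
totals = {key: sum(values) for key, values in grouped.items()}
print(totals)            # {'big': 7, 'data': 4}
print(shuffled_records)  # 7, versus 6 with the local reduce step
```

The final counts are identical, but all seven raw pairs are serialized, sent over the network, and buffered as lists on the receiving nodes before being summed, which is exactly the extra cost the post warns about.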