In this tutorial I will show you the difference between Spark repartition() and coalesce(). When working with Spark, we often need to change the number of partitions of a DataFrame or Dataset. Both repartition() and coalesce() do this for us, but with ONE major difference that is very important from a performance perspective.

Spark Repartition Vs Coalesce

As I mentioned, both methods change the number of partitions of a DataFrame or Dataset, but coalesce() can do it more efficiently!
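Here is a minimal sketch of both calls (it assumes a running SparkSession named spark on a local machine; the DataFrame and partition counts are just for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RepartitionVsCoalesce").getOrCreate()
import spark.implicits._

// Start with a small DataFrame spread over 8 partitions
val df = (1 to 1000).toDF("id").repartition(8)
println(df.rdd.getNumPartitions)               // 8

// Decrease the number of partitions in two ways:
val afterCoalesce    = df.coalesce(4)          // avoids a full shuffle
val afterRepartition = df.repartition(4)       // always performs a full shuffle

println(afterCoalesce.rdd.getNumPartitions)    // 4
println(afterRepartition.rdd.getNumPartitions) // 4
```

Both end up with 4 partitions, but as we will see below, they get there very differently.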

1st Difference – Why Is Coalesce() Better Than Repartition()?

The answer is: PERFORMANCE! When the DataFrame or Dataset is spread across the nodes, executing coalesce() lets Spark limit the data shuffle between nodes.

As we know, the Exchange (shuffle) is one of the most time-consuming operations, because data must be transferred between nodes, which causes unwanted network traffic.

If you look at the Spark source code, coalesce() is simply repartition() with shuffle = false as the default. Let's check the code: https://github.com/apache/spark/blob/v3.3.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L500

def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
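For comparison, in the same file RDD.repartition() is defined as nothing more than a call to coalesce() with the shuffle flag turned on:

```scala
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}
```

So the only thing repartition() adds on top of coalesce() is the forced full shuffle.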

Spark Repartition Vs Coalesce – Shuffle

Let's assume we have data spread across the nodes as shown in the diagram below.

When we execute coalesce(), the partitions on Node 1 and Node 3 are kept locally and only the data from Node 2 and Node 4 is moved, which limits the network traffic across the data nodes in your cluster.

[Diagram: data spread across four nodes — coalesce() keeps the partitions on Node 1 and Node 3 in place and moves only the data from Node 2 and Node 4]
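You can see this difference in the physical plan. A quick sketch (assuming a SparkSession named spark with spark.implicits._ imported):

```scala
val df = (1 to 1000).toDF("id").repartition(8)

df.coalesce(4).explain()
// The physical plan contains a Coalesce node — no Exchange, so no shuffle.

df.repartition(4).explain()
// The physical plan contains Exchange RoundRobinPartitioning(4) — a full shuffle.
```

Whenever you see Exchange in the plan, data is being transferred between nodes.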

2nd Difference – Number Of Partitions

When you call .coalesce(10) on a DataFrame/Dataset that already has fewer than 10 partitions, nothing happens — coalesce() can only decrease the number of partitions. To increase it, you need to run .repartition(10) instead.

Method         Increase   Decrease
repartition    Yes        Yes
coalesce       No         Yes
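A short sketch demonstrating the table above (again assuming a SparkSession named spark with spark.implicits._ imported):

```scala
val df4 = (1 to 100).toDF("id").repartition(4)

// coalesce() can only decrease — asking for more partitions is a no-op:
println(df4.coalesce(10).rdd.getNumPartitions)    // still 4
println(df4.coalesce(2).rdd.getNumPartitions)     // 2

// repartition() can both increase and decrease:
println(df4.repartition(10).rdd.getNumPartitions) // 10
println(df4.repartition(2).rdd.getNumPartitions)  // 2
```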

Save DataFrame As Single File

Based on the above knowledge, to save the DataFrame as a single file you first reduce it to one partition. Since this is a decrease, both .repartition(1) and .coalesce(1) will work, but .repartition(1) is usually the better choice: it performs a full shuffle, so the upstream stages still run in parallel, whereas .coalesce(1) can collapse the whole upstream computation into a single task.

df
   .repartition(1)
   .write
   .format("csv") // built-in CSV source since Spark 2.0; "com.databricks.spark.csv" is the legacy external package
   .option("header", "true")
   .save("all_data_in_one_file.csv") // note: this creates a directory containing a single part-*.csv file

Summary

Spark Repartition Vs Coalesce – In this post you learned the differences between repartition() and coalesce() and how to use them in Spark with Scala and PySpark.
