Apache Spark: How to save DataFrame as a single file on HDFS?

Problem

If you save a DataFrame to HDFS, you may notice that it is written as many files rather than one. This is the expected behaviour and results from Spark's parallel execution: each partition is written as a separate file. However, if you want to force the output into a single file, you must reduce the DataFrame to a single partition. To do this, call the “coalesce” method with the desired number of partitions before writing.
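
To make this concrete, here is a minimal sketch (assuming a SparkSession named spark and hypothetical input/output paths) that contrasts the default, multi-file output with a coalesced, single-partition write:

val df = spark.read.json("/data/input.json")   // hypothetical input, typically read into several partitions

// Default write: one part-* file per partition appears under the output directory.
df.write.csv("/data/output_many")

// Coalesced write: a single partition, so only one part-* file is produced.
df.coalesce(1).write.csv("/data/output_single")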

Solution

The following example shows how to save any DataFrame as a CSV file. It also demonstrates a few useful options:

  • mode – available options:
    • overwrite – always overwrite the output if it already exists
    • append – append to the output if it already exists
    • ignore – skip the write if the output already exists
    • error, errorIfExists, default – throw an error if the output already exists (the default behaviour)
  • header – whether the column names should be written as the first line of the file
  • delimiter – the column separator used in the file
  • quoteAll – when set to “true”, every value is wrapped in quotes (the older Databricks spark-csv package used quoteMode = “ALL” for the same effect), for example:
    • quoted: "Spark","is","Cool"
    • unquoted: Spark,is,Cool

myDataFrame.coalesce(1)                          // force a single partition -> a single output file
  .write.format("com.databricks.spark.csv")      // in Spark 2+ you can simply use format("csv")
  .mode("overwrite")                             // replace the output if it already exists
  .option("header", "true")                      // write the column names as the first line
  .option("delimiter", ",")                      // column separator
  .option("quoteAll", "true")                    // quote every value
  .save("<output_hdfs_path>")
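
Note that even with coalesce(1), Spark still writes a directory at <output_hdfs_path> containing a single part-* CSV file (plus a _SUCCESS marker), rather than a plain file with the name you chose. If you need a single file under a specific name, one option is to rename the part file afterwards with the Hadoop FileSystem API. Below is a minimal sketch, assuming a SparkSession named spark and a hypothetical target path /user/me/result.csv:

import org.apache.hadoop.fs.{FileSystem, Path}

val outputDir  = new Path("<output_hdfs_path>")   // the directory written by the code above
val targetFile = new Path("/user/me/result.csv")  // hypothetical final location

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Locate the single part-* file produced by the coalesced write ...
val partFile = fs.listStatus(outputDir)
  .map(_.getPath)
  .find(_.getName.startsWith("part-"))
  .getOrElse(sys.error(s"No part file found in $outputDir"))

// ... and move it to its final, nicely named location.
fs.rename(partFile, targetFile)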


If you enjoyed this post, please leave a comment below or share it on Facebook, Twitter, LinkedIn or another social network.
Thanks in advance!
