When you save a DataFrame to HDFS, the output is usually split across many files. This is the expected behavior: Spark writes in parallel, and each partition typically produces its own file. If you need a single file, reduce the DataFrame to one partition by calling the “coalesce” method with the target number of partitions (here, 1) before writing.
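To see the effect of coalesce before writing anything, you can compare partition counts. The PySpark snippet below is only a minimal sketch: `partition_counts` and `demo` are names made up for illustration, and the local master and the `repartition(8)` call are assumptions used just to get a DataFrame with several partitions.

```python
def partition_counts(df):
    """Return (before, after) partition counts for df and df.coalesce(1)."""
    before = df.rdd.getNumPartitions()
    after = df.coalesce(1).rdd.getNumPartitions()
    return before, after


def demo():
    # Imported here so the helper above does not need Spark just to be defined.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[4]")          # assumption: a local test session
        .appName("coalesce-demo")
        .getOrCreate()
    )
    # Assumption: force several partitions so the difference is visible.
    df = spark.range(1000).repartition(8)
    print(partition_counts(df))
    spark.stop()
```

The count before coalesce is roughly the number of part files you would get on HDFS if you wrote the DataFrame as it is.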
The example after the list below shows how to save any DataFrame to a CSV file. It also uses a few options:
- mode – what to do if the output already exists:
  - overwrite – replace the existing output
  - append – add the new data to the existing output
  - ignore – skip the write if the output already exists
  - error, errorIfExists, default – throw an error if the output already exists (this is the default)
- header – whether the header should be at the beginning of the file
- delimiter – column separator in the file
- quoteMode – when fields are written between quotes. The spark-csv package accepts ALL, MINIMAL (the default), NON_NUMERIC and NONE; with ALL, every field is quoted.

| Setting | Output line |
|---|---|
| quoteMode = ALL | "Spark","is","Cool" |
| quoteMode = MINIMAL | Spark,is,Cool |

    (myDataFrame
      .coalesce(1)
      .write.format("com.databricks.spark.csv")
      .mode("overwrite")
      .option("header", "true")
      .option("delimiter", ",")
      .option("quoteMode", "ALL")
      .save("<output_hdfs_path>"))
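On Spark 2.0 and later the CSV writer is built in, so the external com.databricks.spark.csv package is not needed. As a hedged sketch in PySpark, the same write could be wrapped in a small helper. `write_single_csv` and its parameter names are hypothetical; `sep` and `quoteAll` are the built-in writer's counterparts of the delimiter and quoting options used above.

```python
def write_single_csv(df, output_path, mode="overwrite",
                     header=True, sep=",", quote_all=False):
    """Write df to output_path as CSV using a single partition.

    Assumes Spark 2.0+ (built-in "csv" source). The helper name and its
    parameters are illustrative, not part of any Spark API.
    """
    (
        df.coalesce(1)                                 # one partition -> one part file
        .write.format("csv")
        .mode(mode)                                    # overwrite / append / ignore / error
        .option("header", str(header).lower())         # "true" or "false"
        .option("sep", sep)                            # column separator
        .option("quoteAll", str(quote_all).lower())    # quote every field when "true"
        .save(output_path)
    )
```

Calling `write_single_csv(myDataFrame, "<output_hdfs_path>")` still creates a directory at that path. With one partition it holds a single part-* file, typically next to a _SUCCESS marker. Keep in mind that coalesce(1) pushes all rows through a single task, so this approach suits modest output sizes.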