Apache Spark Save DataFrame As a Single File HDFS – 1 Min Solution?

Problem

In this tutorial I will show how to save an Apache Spark DataFrame as a single file on HDFS. When you write a DataFrame to HDFS, Spark usually produces many files instead of one.

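This is the expected behaviour and it results from the parallel work in Apache Spark: each partition is written by its own task into its own file. If you want to force the write into one file, you must reduce the DataFrame to a single partition. To do this, call the “coalesce” method before writing and pass 1 as the number of partitions. Keep in mind that all the data then goes through a single task, so this is best suited to small outputs, and that the output path is still a directory containing one part file.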
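The effect of coalesce on the partition count can be checked directly. The following is a minimal sketch, assuming a local SparkSession; the object name PartitionCount, the helper beforeAndAfter and the sample data are hypothetical and only serve as illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PartitionCount {
  // Returns the number of partitions before and after coalesce(1).
  def beforeAndAfter(df: DataFrame): (Int, Int) =
    (df.rdd.getNumPartitions, df.coalesce(1).rdd.getNumPartitions)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("partition-count")
      .getOrCreate()

    // Hypothetical sample data: 1000 rows spread over 8 partitions
    val df = spark.range(0L, 1000L, 1L, 8).toDF("id")

    val (before, after) = beforeAndAfter(df)
    println(s"partitions before: $before, after coalesce(1): $after")

    spark.stop()
  }
}
```

Each partition becomes one output file on write, so the second number is also the number of part files Spark would produce.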

Solution

The following example shows how to save any DataFrame to a CSV file. It also sets a few options:

  • Mode – available options:
    • overwrite – replace the output if it already exists
    • append – add the new data to the existing output (as additional files, where the data source supports it)
    • ignore – silently skip the write if the output already exists
    • error, errorIfExists, default – throw an error if the output already exists (the default option)
  • header – whether the column names should be written as the first line of the file
  • delimiter – column separator in the file
  • quoteMode – when values are written between quotes. The spark-csv package accepts ALL, MINIMAL (the default), NON_NUMERIC and NONE:
    • quoteMode = ALL: "Spark","is","Cool"
    • quoteMode = MINIMAL: Spark,is,Cool

// Collapse the DataFrame to one partition so that a single part file is written
myDataFrame.coalesce(1).write.format("com.databricks.spark.csv")
  .mode("overwrite")
  .option("header", "true")
  .option("delimiter", ",")
  .option("quoteMode", "ALL")
  .save("<output_hdfs_path>")
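On Spark 2.0 and later the CSV source is built in, and even with coalesce(1) the path passed to the writer is a directory holding one part-* file. The following is a minimal sketch, not a definitive implementation, of writing through the built-in source and then renaming that part file to a fixed name with the Hadoop FileSystem API. The names SingleCsvWriter, writeSingleCsv, tmpDir and targetFile are hypothetical, and the built-in quoteAll option is used here in place of the spark-csv quoteMode option.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SparkSession}

object SingleCsvWriter {
  // Hypothetical helper: writes df as one CSV file at targetFile.
  // tmpDir is a scratch directory on the same file system; it is deleted afterwards.
  def writeSingleCsv(spark: SparkSession, df: DataFrame,
                     tmpDir: String, targetFile: String): Unit = {
    // One partition => one part-* file inside tmpDir
    df.coalesce(1)
      .write
      .mode("overwrite")
      .option("header", "true")
      .option("delimiter", ",")
      .option("quoteAll", "true")
      .csv(tmpDir)

    val tmpPath = new Path(tmpDir)
    val fs = tmpPath.getFileSystem(spark.sparkContext.hadoopConfiguration)

    // Locate the single part file; this fails if Spark wrote none (e.g. for empty input)
    val parts = fs.globStatus(new Path(tmpDir, "part-*"))
    require(parts != null && parts.length == 1,
      s"expected exactly one part file in $tmpDir")

    val target = new Path(targetFile)
    // Make sure the destination directory exists and the old file is gone
    Option(target.getParent).foreach(p => fs.mkdirs(p))
    if (fs.exists(target)) fs.delete(target, false)

    require(fs.rename(parts(0).getPath, target), s"could not rename to $targetFile")

    // Remove the scratch directory together with _SUCCESS and any leftover files
    fs.delete(tmpPath, true)
  }
}
```

A call could look like SingleCsvWriter.writeSingleCsv(spark, myDataFrame, "<tmp_hdfs_dir>", "<output_hdfs_file>"), where both placeholders stand for paths of your choice. Because the rename happens inside one file system, only metadata changes and the data is not copied again.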

Datasets and DataFrames

A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java. Python does not have the support for the Dataset API. But due to Python’s dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally row.columnName). The case for R is similar.

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of Dataset[Row], while in the Java API users need to use Dataset<Row> to represent a DataFrame.
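The relationship between the two types can be sketched in Scala as follows. This is only an illustration, assuming a local SparkSession; the case class Person, the object DatasetVsDataFrame and the helper adultNames are hypothetical names.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical record type used only for this illustration
case class Person(name: String, age: Int)

object DatasetVsDataFrame {
  // Typed Dataset in, untyped DataFrame in the middle, plain Scala values out.
  def adultNames(spark: SparkSession, people: Dataset[Person]): Seq[String] = {
    import spark.implicits._

    // Typed API: the lambda works on Person objects and is checked at compile time
    val adults: Dataset[Person] = people.filter(p => p.age >= 18)

    // Untyped API: a DataFrame is a Dataset[Row], so this assignment compiles as-is
    val df: DataFrame = adults.toDF()
    val rows: Dataset[Row] = df

    // Fields of a Row are accessed by name at runtime
    rows.select("name").collect().map(row => row.getAs[String]("name")).toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("dataset-vs-dataframe")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Ann", 30), Person("Bob", 17)).toDS()
    println(adultNames(spark, people).mkString(","))

    spark.stop()
  }
}
```

Going the other way, df.as[Person] turns a DataFrame with matching columns back into a typed Dataset.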

https://spark.apache.org/docs/latest/sql-programming-guide.html

That’s all about saving an Apache Spark DataFrame as a single file on HDFS. Enjoy!
