Apache Spark Save DataFrame As a Single File HDFS – 1 Min Solution?


Problem

In this tutorial" I will show the example when using Apache Spark Save DataFrame as a single file HDFS". If you want to save DataFrame" as a file on HDFS", there may be a problem that it will be saved as many files.

This is the correct behaviour, and it results from the parallel execution in Apache Spark: each partition of the DataFrame is written out by a separate task, producing a separate part file. However, if you want to force the write into one file, you must change the partitioning of the DataFrame to a single partition. To do this, call the coalesce method before writing and specify the number of partitions, as in the sketch below.
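For illustration, here is a minimal sketch of the difference (the output path is a placeholder, and myDataFrame stands for any existing DataFrame):

// Default: each partition becomes its own part file under the output directory
// (<output_hdfs_path>/part-00000, part-00001, ...)
myDataFrame.write.format("com.databricks.spark.csv").save("<output_hdfs_path>")

// With coalesce(1): all rows are funnelled into a single part file
myDataFrame.coalesce(1).write.format("com.databricks.spark.csv").save("<output_hdfs_path>")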

Solution

The following example shows how to save any DataFrame to a CSV file. In addition, it demonstrates a few useful options:

  • mode – available options:
    • overwrite – always overwrite the file
    • append – append to the existing file if it exists
    • ignore – do nothing if the file already exists
    • error / errorIfExists – throw an error if the file exists (this is the default)
  • header – whether a header row should be written at the beginning of the file
  • delimiter – the column separator in the file
  • quoteMode – if set to “true”, each column value is written between quotes:
    • quoteMode = true: "Spark","is","Cool"
    • quoteMode = false: Spark,is,Cool
myDataFrame.coalesce(1)                       // one partition => one output file
  .write.format("com.databricks.spark.csv")   // the spark-csv data source
  .mode("overwrite")                          // overwrite the target if it exists
  .option("header", "true")                   // write a header row
  .option("delimiter", ",")                   // column separator
  .option("quoteMode", "true")                // quote every column value
  .save("<output_hdfs_path>")

Datasets and DataFrames

A Dataset" is a distributed collection of data. Dataset is a new interface added in Spark" 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. A Dataset" can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset" API is available in Scala and Java. Python does not have the support for the Dataset" API. But due to Python’s dynamic nature, many of the benefits of the Dataset" API are already available (i.e. you can access the field of a row by name naturally row.columnName). The case for R is similar.

A DF is a Dataset" organized into named columns. It is conceptually equivalent to a table in a relational database" or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive", external databases, or existing RDDs. The DataFrame API is available in Scala", Java", Python, and R. In Scala and Java", a DF is represented by a Dataset" of Rows. In the Scala API, DataFrame is simply a type alias of Dataset[Row]. While, in Java API, users need to use Dataset<Row> to represent a DataFrame.


Source: https://spark.apache.org/docs/latest/sql-programming-guide.html

That’s all about saving an Apache Spark DataFrame as a single file on HDFS. Enjoy!

Could you please share this post? I appreciate it, and thank you! :)
Have a nice day!
