In this tutorial I will show how to save an Apache Spark DataFrame as a single file on HDFS. When you save a DataFrame to HDFS, you may be surprised to find that it is written as many files.
This is expected behaviour and results from the parallel work in Apache Spark: each partition is written by its own task to its own part file. If you want to force the write into one file, you must reduce the DataFrame to a single partition. To do this, call the coalesce method before writing and pass 1 as the number of partitions. The output path is still a directory, but it then contains only one part file. Keep in mind that all the data then passes through a single task, so this approach is best suited to results of modest size.
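To see the effect of coalesce on a concrete DataFrame, the following PySpark sketch prints the partition count before and after the call. The helper name coalesce_to_single_partition is made up for this illustration, and df is assumed to be an existing PySpark DataFrame.

```python
def coalesce_to_single_partition(df):
    """Return `df` reduced to one partition, printing the partition counts.

    Assumes `df` is a PySpark DataFrame supplied by the caller.
    """
    before = df.rdd.getNumPartitions()     # roughly one output file per partition
    single = df.coalesce(1)                # merges partitions without a full shuffle
    after = single.rdd.getNumPartitions()
    print(f"partitions before: {before}, after: {after}")
    return single
```

Called as coalesce_to_single_partition(myDataFrame), it would typically report more than one partition before the call and exactly one after it.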
The following example shows how to save any DataFrame to a CSV file. It also uses a few write options:
- mode – what to do when the output path already exists:
  - overwrite – replace the existing output
  - append – add the new data to the existing output
  - ignore – skip the write and leave the existing output unchanged
  - error, errorIfExists, default – throw an error if the output exists (this is the default)
- header – whether to write the column names as the first line of the file
- delimiter – the column separator used in the file
- quoteMode – controls when values are written between quotes. The spark-csv package documents the values ALL, MINIMAL (default), NON_NUMERIC and NONE; use ALL to quote every column.
| quoteMode = ALL | "Spark","is","Cool" |
| quoteMode = MINIMAL | Spark,is,Cool |
myDataFrame.coalesce(1).write.format("com.databricks.spark.csv").mode("overwrite").option("header", "true").option("delimiter", ",").option("quoteMode", "ALL").save("<output_hdfs_path>")
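On Spark 2.0 and later the CSV data source is built in, so the external com.databricks.spark.csv package is not required. The PySpark sketch below wraps the same write in a small helper. The name write_single_csv and its parameters are hypothetical, and it assumes the built-in source's boolean quoteAll option in place of quoteMode.

```python
def write_single_csv(df, path, mode="overwrite", header=True,
                     delimiter=",", quote_all=False):
    """Write `df` to `path` as a directory holding a single CSV part file.

    Assumes `df` is a PySpark DataFrame and `path` is an HDFS (or other
    Hadoop-compatible) location; both are supplied by the caller.
    """
    (
        df.coalesce(1)                     # one partition -> one part file
        .write
        .mode(mode)                        # overwrite / append / ignore / error
        .option("header", header)          # write column names as the first line
        .option("sep", delimiter)          # column separator
        .option("quoteAll", quote_all)     # True wraps every value in quotes
        .csv(path)
    )
```

For example, write_single_csv(myDataFrame, "<output_hdfs_path>", quote_all=True) corresponds to the call shown above with every value quoted.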
A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java. Python does not have the support for the Dataset API. But due to Python's dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally: row.columnName). The case for R is similar.
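As a small illustration of accessing row fields by name in Python, the sketch below collects rows to the driver and reads two fields as attributes. The function adult_names and the name and age columns are hypothetical; any PySpark DataFrame with those two columns would do.

```python
def adult_names(df, min_age=18):
    """Return the `name` of every row whose `age` is at least `min_age`.

    Assumes `df` is a PySpark DataFrame with `name` and `age` columns
    (a made-up schema for this example).
    """
    rows = df.collect()  # brings all rows to the driver; suitable for small data only
    # Each row exposes its fields as attributes: row.name, row.age
    return [row.name for row in rows if row.age >= min_age]
```

For larger data the filter belongs in Spark itself, for example df.filter(df.age >= 18), so that only the matching rows are collected.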
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of Dataset[Row]. While, in Java API, users need to use Dataset<Row> to represent a DataFrame.
Source: https://spark.apache.org/docs/latest/sql-programming-guide.html
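Structured data files are one of the sources listed above. As a hedged sketch, this PySpark helper reads a CSV directory, such as the one produced by the write example, back into a DataFrame. The helper name read_csv_output is hypothetical, and spark is assumed to be an existing SparkSession.

```python
def read_csv_output(spark, path, delimiter=","):
    """Load the CSV files under `path` into a DataFrame.

    Assumes the files were written with a header line and with
    `delimiter` as the column separator.
    """
    return (
        spark.read
        .option("header", True)       # first line holds the column names
        .option("sep", delimiter)     # must match the separator used on write
        .csv(path)                    # a directory path reads every part file in it
    )
```

Calling read_csv_output(spark, "<output_hdfs_path>").show() is a quick way to confirm what was written. Without the inferSchema option every column is read back as a string.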
That's all about saving an Apache Spark DataFrame as a single file on HDFS. Enjoy!