[SOLVED] Apache Spark Convert DataFrame to DataSet in Scala – Read 1 min!


In this post I will show you how easy it is in Apache Spark to convert a DataFrame to a DataSet in Scala. Many times you might want strong typing on your data in Spark. The best way to get it is to use a DataSet instead of a DataFrame. In this post I give you a simple example of how to get a DataSet from data coming from a CSV file.

Introduction

In Apache Spark, data can be stored in a variety of formats and data structures, including:

  • RDDs (Resilient Distributed Datasets): RDDs are the original data structure in Spark and are used to represent a distributed immutable collection of data. RDDs are low-level and are not strongly typed, which means they do not have an explicitly defined schema.
  • DataFrames: DataFrames are distributed collections of data organized into named columns. They are similar to tables in a traditional relational database and are supported in Scala, Python, Java, and R. DataFrames are more flexible than RDDs and are often used for a wide variety of data processing tasks.
  • Datasets: Datasets are strongly typed data structures that are similar to DataFrames, but with a more explicitly defined schema. They are available in Scala and Java and are often used for more complex data processing tasks, such as machine learning, where the schema is known in advance and needs to be explicitly defined.
  • DStreams (Discretized Streams): DStreams are distributed collections of data that are processed in near real time by the Spark Streaming engine. They are used for tasks such as real-time analytics, event processing, and stream processing.

In addition to these data structures, Spark also supports a wide variety of data formats, including CSV, JSON, Parquet, and Avro, as well as external data sources such as databases and distributed file systems. Spark can read and write data in these formats using a variety of APIs, such as the Spark SQL API, the DataFrame API, and the RDD API.

Before we start, let's check what the difference between a DataFrame and a DataSet is.

Spark DataFrame vs DataSet

In Apache Spark, a DataFrame is a distributed collection of data organized into named columns. It is similar to a table in a traditional relational database, but can contain data of different types and can be processed in parallel using distributed computing. DataFrames are a key data structure in Spark and are used for a wide variety of data processing tasks.

A DataSet is a strongly typed data structure that is similar to a DataFrame, but with a more explicitly defined schema. DataSets are available in Scala and Java and are often used for more complex data processing tasks, such as machine learning, where the schema is known in advance and needs to be explicitly defined.


Apache Spark Convert DataFrame to DataSet in Scala

We’ll start at the beginning. By default, each DataFrame row is an instance of the Row class. To have a DataSet of strongly typed objects, we need a class that will serve as the template for each row in the DataSet, because a DataFrame is simply a DataSet[Row].

Scala Case Class

You can treat a Scala case class like a normal class. This kind of class is well suited to creating immutable objects.

In Scala, a case class is a special kind of class that is designed to be used in the context of pattern matching. Case classes are defined using the case keyword and can be used to define simple data structures, such as tuples with named fields.

Let’s create a case class Book which describes the data in our CSV file. The schema is: id, title, pagesCount.

case class Book(id: BigInt, title: String, pagesCount: Integer)

Case classes are useful for defining simple data structures and are often used in combination with pattern matching to process data. They are a convenient way to define immutable data structures and can be used in a variety of contexts, such as defining data models in a database or defining messages in a distributed system.
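To see these properties in action, here is a short sketch in plain Scala (no Spark required) showing how our Book case class behaves: instances are immutable, copy builds a modified copy, instances compare by field values, and pattern matching extracts the fields by position.

```scala
// Book mirrors the schema of our CSV file: id, title, pagesCount
case class Book(id: BigInt, title: String, pagesCount: Integer)

val book = Book(1, "Book_1", 113)

// Fields are immutable; copy() builds a new instance with the change applied
val secondEdition = book.copy(pagesCount = 120)

// Case classes support pattern matching on their fields
val description = book match {
  case Book(_, title, pages) => s"$title has $pages pages"
}

println(description)   // Book_1 has 113 pages
println(secondEdition) // Book(1,Book_1,120)

// Structural equality: two instances with the same field values are equal
println(book == Book(1, "Book_1", 113)) // true
```

These same properties are what Spark relies on when it maps each CSV row onto a Book instance in the DataSet.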

Spark Application

The next step is to write the Spark application which will read the data from the CSV file.

Please take a look at the three key lines of this code:

  • import spark.implicits._ enables implicit conversions from Scala objects to a DataFrame or DataSet.
  • To convert data from a DataFrame to a DataSet, use the method .as[U] and provide the case class name, in my case Book, which gives .as[Book].
  • In the line .map(book => book.title.toUpperCase()) you can see that you can refer to the Book class's fields and methods.

Please also note that I explicitly defined a schema describing the data in the file instead of relying on schema inference. For more information, see: How to use Dataframe API in spark efficiently when loading/reading data?

val schema = StructType.fromDDL("id bigint, title string, pagesCount integer")

package main.scala.com.bigdataetl

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object BookApp extends App {

  val spark = SparkSession.builder
    .master("local[*]")
    .appName("BookApp")
    .getOrCreate()

  val schema = StructType.fromDDL("id bigint, title string, pagesCount integer")

  import spark.implicits._

  val books = spark.read.format("csv") // spark.read is the idiomatic entry point; sqlContext is legacy
    .option("header", "true")
    .option("inferSchema", "false")
    .schema(schema)
    .load("Books.csv") // DataFrame
    .as[Book] // DataSet

  books.show()

  // UpperCase the book names
  books
    .map(book => book.title.toUpperCase())
    .show()
}

Print Results

First, let’s see the output of the show action after converting the DataFrame to a DataSet.

Output for books.show():

+---+------+----------+
| id| title|pagesCount|
+---+------+----------+
|  1|Book_1|       113|
|  2|Book_3|       355|
|  3|Book_3|       512|
+---+------+----------+

Now let’s see another transformation. In both cases, we previously converted the DataFrame to a DataSet.

UpperCased books (mapping each row to a String produces a DataSet[String], whose single column is named value by default):

+------+
| value|
+------+
|BOOK_1|
|BOOK_3|
|BOOK_3|
+------+

Summary

In general, DataFrames are more flexible and are easier to use than DataSets, as they do not require you to define a schema upfront. However, DataSets can be more efficient, as they can take advantage of type information and can be optimized by the Spark runtime.

Which one you should use depends on your specific use case. If you need the flexibility of a DataFrame and do not need to define a schema upfront, you should use a DataFrame. If you have a known schema and need the performance benefits of a DataSet, you should use a DataSet.
