
In this post I will show you how easy it is to convert a DataFrame to a Dataset in Apache Spark with Scala. Many times you may want strong typing on your data in Spark. The best way to get it is to use a Dataset instead of a DataFrame. In this post I give you a simple example of how to get a Dataset from data coming from a CSV file.


Apache Spark Convert DataFrame to DataSet in Scala

We’ll start at the beginning. By default, each DataFrame row is of the Row class. To have a Dataset (i.e. strongly typed) object, we must have a class that will serve as the template for each row in the Dataset. This works because a DataFrame is simply a Dataset[Row].

Scala Case Class

You can treat a Scala case class like a normal class. This kind of class is well suited for creating immutable objects.

Let’s create a case class Book which describes the data in our CSV file. The schema is: id, title, pagesCount.

case class Book(id: BigInt, title: String, pagesCount: Integer)
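As a quick aside (plain Scala, no Spark required), case classes give you structural equality, immutability, and a copy method out of the box. A minimal sketch, using the same field shape as the Book class above (the BookDemo object name is just for illustration):

```scala
object BookDemo extends App {
  // Same shape as the Book case class used with Spark below
  case class Book(id: BigInt, title: String, pagesCount: Integer)

  val b1 = Book(1, "Book_1", 113)
  // copy returns a new immutable instance; b1 is unchanged
  val b2 = b1.copy(title = "Book_2")

  println(b1 == Book(1, "Book_1", 113)) // structural equality: true
  println(b2.title)                     // Book_2
}
```

This immutability and field-by-field equality is exactly what makes case classes a natural fit as row templates for a Dataset.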

Spark Application

The next step is to write the Spark application which will read data from the CSV file.

Please take a look at the three main lines of this code:

  • import spark.implicits._ enables implicit conversions from Scala objects to DataFrames or Datasets.
  • To convert a DataFrame to a Dataset, you can use the method .as[U] and provide the case class name, in my case Book, which gives .as[Book].
  • In the line .map(book => book.title.toUpperCase()) you can see that you can refer to the Book class’s fields and methods.

Please also note why I defined a schema which describes the data from the file. For more information please go to: How to use Dataframe API in spark efficiently when loading/reading data?

val schema = StructType.fromDDL("id bigint, title string, pagesCount integer")


import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object BookApp extends App {

  val spark = SparkSession.builder
    .appName("BookApp")
    .master("local[*]") // run locally; adjust for your cluster

  val schema = StructType.fromDDL("id bigint, title string, pagesCount integer")

  import spark.implicits._

  val books ="csv")
    .option("header", "true")
    .option("inferSchema", "false")
    .load("Books.csv") // DataFrame
    .as[Book] // DataSet

  // UpperCase the book names
  val upperCased =
    .map(book => book.title.toUpperCase())


Print Results

First, let’s see the output of the show action after converting the DataFrame to a Dataset.

Output for

+---+------+----------+
| id| title|pagesCount|
+---+------+----------+
|  1|Book_1|       113|
|  2|Book_3|       355|
|  3|Book_3|       512|
+---+------+----------+

Now let’s see the output of the second transformation. In both cases, we operate on the Dataset we previously obtained from the DataFrame.

UpperCased books:

+------+
| value|
+------+
|BOOK_1|
|BOOK_3|
|BOOK_3|
+------+

That’s all about converting a DataFrame to a Dataset in Apache Spark with Scala!
