Apache Spark: Convert DataFrame to Dataset in Scala

In this post I will show you how easy it is to convert a DataFrame to a Dataset in Apache Spark with Scala. You will often want strong typing on your data in Spark, and the best way to get it is to use a Dataset instead of a DataFrame. Below is a simple example of how to get a Dataset from data coming from a CSV file.

Apache Spark: Convert DataFrame to Dataset in Scala

We’ll start at the beginning. By default, each DataFrame row is an object of the Row class. To get a Dataset (i.e. a strongly typed collection) we need a class that will serve as the template for each row, because DataFrame is simply an alias for Dataset[Row].
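A minimal sketch of what that alias means in practice (assuming a SparkSession named spark is already in scope):

import org.apache.spark.sql.{DataFrame, Dataset, Row}

val df: DataFrame = spark.read.option("header", "true").csv("Books.csv")
val ds: Dataset[Row] = df           // compiles: DataFrame is a type alias for Dataset[Row]
val title = df.first().getString(1) // untyped access by position, no compile-time check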

Scala Case Class

You can treat a Scala case class like a normal class. Its fields are immutable vals by default, which makes it a good fit for creating immutable objects such as row templates.

Let’s create a case class Book that describes the data in our CSV file. The schema is: id, title, pagesCount.

// bigint maps to Scala Long and integer to Int in Spark's type system
case class Book(id: Long, title: String, pagesCount: Int)
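To illustrate the immutability outside Spark, here is a quick hypothetical usage of Book:

val book = Book(1L, "Book_1", 113)

// Fields are vals, so there are no setters; to "change" a value you create a copy
val longer = book.copy(pagesCount = 500)

println(book == longer) // false – case classes compare by field values
println(longer)         // Book(1,Book_1,500)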

Spark Application

The next step is to write the Spark application that will read the data from the CSV file.

Please take a look at the three key lines of this code:

  • import spark.implicits._ enables the implicit conversions from Scala objects to a DataFrame or Dataset (see the short sketch after this list).
  • To convert a DataFrame to a Dataset, call the method .as[U] with the case class as the type parameter; in my case Book, which gives .as[Book].
  • In the line .map(book => book.title.toUpperCase()) you can see that you can refer to the fields and methods of the Book class.
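As a small illustration (the example data here is mine, not from the post), spark.implicits._ is what lets you call .toDS() and .toDF() directly on local collections:

import spark.implicits._ // assuming a SparkSession named spark is in scope

// Neither conversion compiles without the import above
val booksDs = Seq(Book(1L, "Book_1", 113)).toDS()                         // Dataset[Book]
val booksDf = Seq((1L, "Book_1", 113)).toDF("id", "title", "pagesCount")  // DataFrame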

Note also that I defined a schema describing the data in the file instead of inferring it; this saves Spark an extra pass over the data. For more information, see: How to use Dataframe API in spark efficiently when loading/reading data?
val schema = StructType.fromDDL("id bigint, title string, pagesCount integer")
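The DDL string is just shorthand; the same schema can also be built explicitly with StructField objects (both forms default to nullable columns):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("title", StringType, nullable = true),
  StructField("pagesCount", IntegerType, nullable = true)
))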

package main.scala.com.bigdataetl

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

// The case class must live at the top level (outside the object),
// otherwise Spark cannot derive an encoder for it
case class Book(id: Long, title: String, pagesCount: Int)

object BookApp extends App {

  val spark = SparkSession.builder
    .master("local[*]")
    .appName("BookApp")
    .getOrCreate()

  // Explicit schema instead of inferSchema, which would cost an extra pass over the file
  val schema = StructType.fromDDL("id bigint, title string, pagesCount integer")

  import spark.implicits._

  val books = spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "false")
    .schema(schema)
    .load("Books.csv") // DataFrame
    .as[Book] // Dataset[Book]

  books.show()

  // Upper-case the book titles
  books
    .map(book => book.title.toUpperCase())
    .show()

  spark.stop()
}
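For reference, a Books.csv file matching the output below could look like this (reconstructed from the show() output):

id,title,pagesCount
1,Book_1,113
2,Book_3,355
3,Book_3,512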

Print Results

First, let’s see the output of the show action after converting the DataFrame to a Dataset.

Output for books.show():

+---+------+----------+
| id| title|pagesCount|
+---+------+----------+
|  1|Book_1|       113|
|  2|Book_3|       355|
|  3|Book_3|       512|
+---+------+----------+

Now let’s see the second transformation, which upper-cases the titles. In both cases we are working on the Dataset we previously converted from the DataFrame.

UpperCased books:

+------+
| value|
+------+
|BOOK_1|
|BOOK_3|
|BOOK_3|
+------+
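Note that .map on a Dataset returns a new Dataset typed by the result of the lambda, which is why the column is called value and the result is a Dataset[String]. If you want to keep the Book type instead, here is a small sketch of my own (not from the original code) using copy:

// Assumes the books Dataset and import spark.implicits._ from the application above
val upperCasedBooks = books
  .filter(book => book.pagesCount > 200)                    // Dataset[Book]
  .map(book => book.copy(title = book.title.toUpperCase())) // still Dataset[Book]

upperCasedBooks.show()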

That’s all about converting a DataFrame to a Dataset in Apache Spark with Scala!


If you enjoyed this post, please add a comment below or share it on Facebook, Twitter, LinkedIn or another social media page.
Thanks in advance!
