Apache Spark: Convert DataFrame to DataSet – Scala

Many times you might want to have strong typing on your data in Spark. The best to get it is to DataSet insted of DataFrame. In this post I give you simple example how you can get DataSet from data which is coming from CSV file.

Scala Case Class

You can treat Scala Case Class like a normal class. This kind of class is good for creating immutable objects.

Let’s create case class Book which describes our data in CSV file. Schema is: id, title, pagesCount.

case class Book(id: BigInt, title: String, pagesCount: Integer)

Spark application

The next step is to write the Spark application which will read data from CSV file,

Please take a look for three main lines of this code:

  • import spark.implicits._ gives possibility to implicit convertion from Scala objects to DataFrame or DataSet.
  • to convert data from DataFrame to DataSet you can use method .as[U] and provide the Case Class name, in my case Book. It gives .as[Book].
  • In line .map(book => book.title.toUpperCase()) you can see that you can refer to Book class variables and methods.

Please take a look why I defined schema which describes data from file. For more information please go to: How to use Dataframe API in spark efficiently when loading/reading data?
val schema = StructType.fromDDL(“id bigint, title string, pagesCount integer”)

package main.scala.com.bigdataetl

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object BookApp extends App {

  val spark = SparkSession.builder
    .master("local[*]")
    .appName("BookApp")
    .getOrCreate()

  val schema = StructType.fromDDL("id bigint, title string, pagesCount integer")

  import spark.implicits._

  val books = spark.sqlContext.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "false")
    .schema(schema)
    .load("Books.csv") // DataFrame
    .as[Book] // DataSet

  books.show()

  // UpperCase the book names
  books
    .map(book => book.title.toUpperCase())
    .show()
}

Output for books.show():

+---+------+----------+
| id| title|pagesCount|
+---+------+----------+
|  1|Book_1|       113|
|  2|Book_3|       355|
|  3|Book_3|       512|
+---+------+----------+

UpperCased books:

+------+
| value|
+------+
|BOOK_1|
|BOOK_3|
|BOOK_3|
+------+

If you enjoyed this post please add the comment below or share this post on your Facebook, Twitter, LinkedIn or another social media webpage.
Thanks in advanced!

0 0 vote
Article Rating
Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments