Often you will want strong typing on your data in Spark. The best way to get it is to use a DataSet instead of a DataFrame. In this post I give a simple example of how to get a DataSet from data that comes from a CSV file.
Scala Case Class
You can treat a Scala case class like a normal class. Case classes are well suited to creating immutable objects.
Let’s create the case class Book, which describes the data in our CSV file. The schema is: id, title, pagesCount.
case class Book(id: BigInt, title: String, pagesCount: Integer)
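To make the point about immutability concrete, here is a minimal sketch in plain Scala (no Spark needed). The object name CaseClassSketch and the sample values are only illustrative assumptions; the Book definition is the one from this post.

```scala
// Same definition as in this post
case class Book(id: BigInt, title: String, pagesCount: Integer)

// Hypothetical demo object, not part of the original application
object CaseClassSketch {
  // Illustrative values only
  val original: Book = Book(BigInt(1), "Book_1", 113)

  // copy returns a new instance; the original object is left unchanged
  val renamed: Book = original.copy(title = "Book_1 (2nd edition)")

  // Case classes compare by field values, not by reference
  val equalByValue: Boolean = original == Book(BigInt(1), "Book_1", 113)

  def main(args: Array[String]): Unit = {
    println(original)
    println(renamed)
    println(equalByValue)
  }
}
```

Fields of a case class are vals by default, so an assignment such as original.title = "x" would be rejected by the compiler.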
Spark application
The next step is to write the Spark application, which reads the data from the CSV file and converts it to a DataSet.
Please take a look at the three main lines of this code:
- import spark.implicits._ enables the implicit conversion of Scala objects to a DataFrame or DataSet and provides the encoders that .as[Book] and .map need.
- To convert a DataFrame to a DataSet, use the .as[U] method and provide the case class name as U. In my case that is Book, which gives .as[Book].
- In the line .map(book => book.title.toUpperCase()) you can see that you can refer to the fields and methods of the Book class.
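The three points above can also be tried without a CSV file. The following is a rough sketch, assuming a local Spark installation on the classpath; the object name ImplicitsSketch, the helper upperCasedLongTitles, and the in-memory rows are hypothetical and are not part of the application in this post.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Same shape as the Book case class from this post. It is defined at the
// top level so that Spark can derive an encoder for it.
case class Book(id: BigInt, title: String, pagesCount: Integer)

// Hypothetical demo object, not part of the original application
object ImplicitsSketch {

  // Returns the upper-cased titles of books with more than 300 pages
  def upperCasedLongTitles(spark: SparkSession): Array[String] = {
    // Brings the encoders and the toDS() extension method into scope
    import spark.implicits._

    // Illustrative in-memory rows (an assumption, not read from Books.csv)
    val books: Dataset[Book] = Seq(
      Book(BigInt(1), "Book_A", 113),
      Book(BigInt(2), "Book_B", 355)
    ).toDS()

    // Typed access: a misspelled field such as book.titel would not compile
    books
      .filter(book => book.pagesCount.intValue > 300)
      .map(book => book.title.toUpperCase())
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("ImplicitsSketch")
      .getOrCreate()

    upperCasedLongTitles(spark).foreach(println)
    spark.stop()
  }
}
```

With an untyped DataFrame the same filter would refer to the column by a string name, and a typo would only show up at runtime.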
Note that I defined an explicit schema describing the data in the file instead of letting Spark infer it. For more information on why, please see: How to use Dataframe API in spark efficiently when loading/reading data?
val schema = StructType.fromDDL("id bigint, title string, pagesCount integer")
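For readers who prefer to see the column types spelled out, the DDL string can be compared with a schema built by hand. This is only a sketch, assuming spark-sql is on the classpath; the object name SchemaSketch is hypothetical. In Spark SQL, bigint corresponds to LongType and integer to IntegerType.

```scala
import org.apache.spark.sql.types.{IntegerType, LongType, StringType, StructField, StructType}

// Hypothetical demo object, not part of the original application
object SchemaSketch {

  // Schema parsed from the DDL string used in this post
  val fromDdl: StructType =
    StructType.fromDDL("id bigint, title string, pagesCount integer")

  // The same three columns declared programmatically
  val explicit: StructType = StructType(Seq(
    StructField("id", LongType, nullable = true),
    StructField("title", StringType, nullable = true),
    StructField("pagesCount", IntegerType, nullable = true)
  ))

  def main(args: Array[String]): Unit = {
    // Print both schemas as trees so they can be compared by eye
    fromDdl.printTreeString()
    explicit.printTreeString()
  }
}
```

Either form can be passed to .schema(...) when reading the CSV file; the DDL string is simply shorter.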
package main.scala.com.bigdataetl

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object BookApp extends App {

  val spark = SparkSession.builder
    .master("local[*]")
    .appName("BookApp")
    .getOrCreate()

  val schema = StructType.fromDDL("id bigint, title string, pagesCount integer")

  import spark.implicits._

  val books = spark.sqlContext.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "false")
    .schema(schema)
    .load("Books.csv") // DataFrame
    .as[Book] // DataSet

  books.show()

  // UpperCase the book names
  books
    .map(book => book.title.toUpperCase())
    .show()
}
Output for books.show():
+---+------+----------+
| id| title|pagesCount|
+---+------+----------+
|  1|Book_1|       113|
|  2|Book_3|       355|
|  3|Book_3|       512|
+---+------+----------+
UpperCased books:
+------+
| value|
+------+
|BOOK_1|
|BOOK_3|
|BOOK_3|
+------+