CreateOrReplaceTempView Performance: Apache Spark SQL DataFrame DataSet API – Difference in Performance? – 3 great APIs


Let’s consider two methods of reading data from the same Hive table. The execution plan will be the same for both, because both go through the Catalyst optimizer and the Tungsten engine, which are part of Spark 2.0 and later. In future posts I will cover these two Spark buzzwords (Catalyst and Tungsten) in more detail.

CreateOrReplaceTempView Performance

import org.apache.spark.sql.functions.when
import spark.implicits._ // enables the $"age" column syntax
val testDataSetDF = spark.sqlContext.table("bigdata_etl.some_dataset").withColumn("age_label", when($"age" === 30, "thirty-year-old person").otherwise("Other"))
val testDataSetSqlDF = spark.sqlContext.sql("SELECT *, CASE WHEN age = 30 THEN 'thirty-year-old person' ELSE 'Other' END age_label FROM bigdata_etl.some_dataset")

The difference is only in the syntax, so choose whichever you are more comfortable with. Personally, I think you should use the first version. The second option, writing an SQL query, has the advantage that the query can refer to temporary views.
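One way to check this claim yourself is to print the physical plan of both variants and compare them. The sketch below is not from the original post: it assumes a local SparkSession and uses a small generated dataset, registered under the name some_dataset, as a hypothetical stand-in for the Hive table bigdata_etl.some_dataset.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

// Assumption: a local session for experimenting; in spark-shell this reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("api-vs-sql").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for bigdata_etl.some_dataset: 100 rows with a derived age column.
val people = spark.range(0, 100).withColumn("age", ($"id" % 60).cast("int"))
people.createOrReplaceTempView("some_dataset")

// Variant 1: DataFrame API.
val viaApi = spark.table("some_dataset")
  .withColumn("age_label", when($"age" === 30, "thirty-year-old person").otherwise("Other"))

// Variant 2: SQL query.
val viaSql = spark.sql(
  "SELECT *, CASE WHEN age = 30 THEN 'thirty-year-old person' ELSE 'Other' END AS age_label FROM some_dataset")

// Print the physical plan chosen for each variant.
viaApi.explain()
viaSql.explain()
```

If the two variants really are equivalent, the printed plans should list the same operators, with any differences limited to internal expression IDs.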

// Refer to temporary tables
val testDataSetDF = spark.sqlContext.table("bigdata_etl.some_dataset").withColumn("age_label", when($"age" === 30, "thirty-year-old person").otherwise("Other"))
testDataSetDF.createOrReplaceTempView("someDataSetTempView")
val only30thPersons = spark.sqlContext.sql("SELECT * FROM someDataSetTempView where age_label = 'thirty-year-old person'")
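On the performance side, createOrReplaceTempView only registers the DataFrame's logical plan under a name in the current session; it does not compute or store any data. A query against the view is therefore planned like the equivalent DataFrame operation. The sketch below illustrates this under assumptions: a local SparkSession and a generated dataset as a hypothetical stand-in for the Hive table.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

// Assumption: a local session for experimenting; in spark-shell this reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("temp-view-plan").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the labelled dataset built from the Hive table.
val labelled = spark.range(0, 100)
  .withColumn("age", ($"id" % 60).cast("int"))
  .withColumn("age_label", when($"age" === 30, "thirty-year-old person").otherwise("Other"))

// Registers the plan under a name; no job runs at this point.
labelled.createOrReplaceTempView("someDataSetTempView")

// The same filter, once through the view and once directly on the DataFrame.
val viaView = spark.sql("SELECT * FROM someDataSetTempView WHERE age_label = 'thirty-year-old person'")
val viaDataFrame = labelled.filter($"age_label" === "thirty-year-old person")

viaView.explain()
viaDataFrame.explain()
```

Because the view is only a named plan, every query against it recomputes from the source unless the underlying DataFrame is cached explicitly.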

Catalyst

Catalyst contains a general library for representing trees and applying rules to manipulate them. On top of this framework, it has libraries specific to relational query processing (e.g., expressions, logical query plans), and several sets of rules that handle different phases of query execution: analysis, logical optimization, physical planning, and code generation to compile parts of queries to Java bytecode. For the latter, it uses another Scala feature, quasiquotes, that makes it easy to generate code at runtime from composable expressions. Catalyst also offers several public extension points, including external data sources and user-defined types. As well, Catalyst supports both rule-based and cost-based optimization.


https://databricks.com/glossary/catalyst-optimizer
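The phases named in the quote can be inspected for any query. As a minimal sketch, assuming a local SparkSession and a generated dataset (both are illustrative choices, not part of the original post), explain(true) prints the parsed, analyzed, and optimized logical plans followed by the physical plan, and queryExecution exposes the same stages programmatically.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting; in spark-shell this reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("catalyst-phases").getOrCreate()
import spark.implicits._

// Hypothetical query: derive an age column, then keep only the rows where age is 30.
val query = spark.range(0, 100)
  .withColumn("age", ($"id" % 60).cast("int"))
  .filter($"age" === 30)

// Prints all plan stages in one go.
query.explain(true)

// The same stages, accessed one by one.
val qe = query.queryExecution
println(qe.analyzed.treeString)      // after analysis (names and types resolved)
println(qe.optimizedPlan.treeString) // after logical optimization
println(qe.executedPlan.treeString)  // physical plan that will actually run
```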

Tungsten

Tungsten is the codename for the umbrella project to make changes to Apache Spark’s execution engine that focuses on substantially improving the efficiency of memory and CPU for Spark applications, to push performance closer to the limits of modern hardware.


https://databricks.com/glossary/tungsten
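One visible trace of this work in a query plan is whole-stage code generation, which fuses several operators into a single generated function. The sketch below is an illustration under assumptions (a local SparkSession, a generated dataset, default configuration); it counts the fused stages in the physical plan. In the output of explain(), operators that belong to such a stage are marked with an asterisk.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.WholeStageCodegenExec

// Assumption: a local session with default settings; in spark-shell this reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("tungsten-codegen").getOrCreate()
import spark.implicits._

// Hypothetical query: derive an age column, then keep only the rows where age is 30.
val query = spark.range(0, 100)
  .withColumn("age", ($"id" % 60).cast("int"))
  .filter($"age" === 30)

// Operators compiled together appear with a leading asterisk in this output.
query.explain()

// Collect the whole-stage codegen nodes from the physical plan.
val fusedStages = query.queryExecution.executedPlan.collect {
  case stage: WholeStageCodegenExec => stage
}
println(s"whole-stage codegen stages: ${fusedStages.size}")
```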

