You are currently viewing CreateOrReplaceTempView Performance: Apache Spark SQL DataFrame DataSet API – Difference In Performance ? – 3 great APIs
Could You Please Share This Post? I Appreciate It And Thank YOU! :) Have A Nice Day!
4.8
(1436)

Let’s consider two methods to read the data from the same Hive table. For both the execution plan will be the same, [ CreateOrReplaceTempView Performance ] because for both the Catalyst optimiser and Tangsten engine will be used, which were available since Spark 2.0 (Tangsten).  In the future I will prepare posts about these two buzzwords in Spark world (Catalyst and Tangsten).

CreateOrReplaceTempView Performance

val testDataSetDF = spark.sqlContext.table("bigdata_etl.some_dataset").withColumn("age_label", when($"age" = 30, "thirty-year-old person").otherwise("Other"))
val testDataSetDF = spark.sqlContext.sql("SELECT *, CASE WHEN age = 30 THEN 'thirty-year-old person' ELSE 'Other' END age_label FROM bigdata_etl.some_dataset")

The difference is only in the syntax. Choose the one that’s closer to you. Personally, I think that you should use the first version. The second option with writing an SQL query has the advantage that we can refer to temporary tables in the query.

// Refer to temporary tables
val testDataSetDF = spark.sqlContext.table("bigdata_etl.some_dataset").withColumn("age_label", when($"age" = 30, "thirty-year-old person").otherwise("Other"))
testDataSetDF.createOrReplaceTempView("someDataSetTempView")
val only30thPersons = spark.sqlContext.sql("SELECT * FROM someDataSetTempView where age_label = 'thirty-year-old person'")

CreateOrReplaceTempView In PySpark

Due to fact that Spark API is very similar in Scala and Python the command of CreateOrReplaceTempView In PySpark looks basically the same. Please find the below example. As you see the API is the same.

df.createOrReplaceTempView("someDataSetTempView")

Catalyst

Catalyst contains a general library for representing trees and applying rules to manipulate them. On top of this framework, it has libraries specific to relational query processing (e.g., expressions, logical query plans), and several sets of rules that handle different phases of query execution: analysis, logical optimization, physical planning, and code generation to compile parts of queries to Java bytecode. For the latter, it uses another Scala feature, quasiquotes, that makes it easy to generate code at runtime from composable expressions. Catalyst also offers several public extension points, including external data sources and user-defined types. As well, Catalyst supports both rule-based and cost-based optimization.

// CreateOrReplaceTempView Performance

https://databricks.com/glossary/catalyst-optimizer

Tangsten

Tungsten is the codename for the umbrella project to make changes to Apache Spark’s execution engine that focuses on substantially improving the efficiency of memory and CPU for Spark applications, to push performance closer to the limits of modern hardware.

// CreateOrReplaceTempView Performance

https://databricks.com/glossary/tungsten

Apache Spark SQL DataFrame DataSet API

Could You Please Share This Post? 
I appreciate It And Thank YOU! :)
Have A Nice Day!

BigData-ETL: image 7YOU MIGHT ALSO LIKE

How useful was this post?

Click on a star to rate it!

Average rating 4.8 / 5. Vote count: 1436

No votes so far! Be the first to rate this post.

As you found this post useful...

Follow us on social media!

We are sorry that this post was not useful for you!

Let us improve this post!

Tell us how we can improve this post?