Let’s consider two methods to read the data from the same Hive table. For both the execution plan will be the same, [ CreateOrReplaceTempView Performance ] because for both the Catalyst optimiser and Tangsten engine will be used, which were available since Spark 2.0 (Tangsten). In the future I will prepare posts about these two buzzwords in Spark world (Catalyst and Tangsten).
CreateOrReplaceTempView Performance
val testDataSetDF = spark.sqlContext.table("bigdata_etl.some_dataset").withColumn("age_label", when($"age" = 30, "thirty-year-old person").otherwise("Other")) val testDataSetDF = spark.sqlContext.sql("SELECT *, CASE WHEN age = 30 THEN 'thirty-year-old person' ELSE 'Other' END age_label FROM bigdata_etl.some_dataset")
The difference is only in the syntax. Choose the one that’s closer to you. Personally, I think that you should use the first version. The second option with writing an SQL query has the advantage that we can refer to temporary tables in the query.
// Refer to temporary tables val testDataSetDF = spark.sqlContext.table("bigdata_etl.some_dataset").withColumn("age_label", when($"age" = 30, "thirty-year-old person").otherwise("Other")) testDataSetDF.createOrReplaceTempView("someDataSetTempView") val only30thPersons = spark.sqlContext.sql("SELECT * FROM someDataSetTempView where age_label = 'thirty-year-old person'")
Catalyst
Catalyst contains a general library for representing trees and applying rules to manipulate them. On top of this framework, it has libraries specific to relational query processing (e.g., expressions, logical query plans), and several sets of rules that handle different phases of query execution: analysis, logical optimization, physical planning, and code generation to compile parts of queries to Java bytecode. For the latter, it uses another Scala feature, quasiquotes, that makes it easy to generate code at runtime from composable expressions. Catalyst also offers several public extension points, including external data sources and user-defined types. As well, Catalyst supports both rule-based and cost-based optimization.
// CreateOrReplaceTempView Performance
https://databricks.com/glossary/catalyst-optimizer
Tangsten
Tungsten is the codename for the umbrella project to make changes to Apache Spark’s execution engine that focuses on substantially improving the efficiency of memory and CPU for Spark applications, to push performance closer to the limits of modern hardware.
// CreateOrReplaceTempView Performance
https://databricks.com/glossary/tungsten
Apache Spark SQL DataFrame DataSet API
If you enjoyed this post please add the comment below and share this post on your Facebook, Twitter, LinkedIn or another social media webpage.
Thanks in advanced!