Let’s consider two methods of reading data from the same Hive table. For both, the execution plan will be the same, because both go through the Catalyst optimiser and the Tungsten execution engine, which have been used under the hood since Spark 2.0. In the future I will prepare posts about these two buzzwords in the Spark world (Catalyst and Tungsten).
// Method 1: DataFrame API
val testDataSetDF = spark.sqlContext.table("bigdata_etl.some_dataset")
  .withColumn("age_label", when($"age" === 30, "thirty-year-old person").otherwise("Other"))

// Method 2: SQL query
val testDataSetDF = spark.sqlContext.sql(
  "SELECT *, CASE WHEN age = 30 THEN 'thirty-year-old person' ELSE 'Other' END AS age_label FROM bigdata_etl.some_dataset")
The difference is only in the syntax, so choose whichever reads better to you. Personally, I think you should use the first version. The second option, writing an SQL query, has the advantage that the query can refer to temporary views.
// Refer to a temporary view
val testDataSetDF = spark.sqlContext.table("bigdata_etl.some_dataset")
  .withColumn("age_label", when($"age" === 30, "thirty-year-old person").otherwise("Other"))

testDataSetDF.createOrReplaceTempView("someDataSetTempView")

val only30thPersons = spark.sqlContext.sql(
  "SELECT * FROM someDataSetTempView WHERE age_label = 'thirty-year-old person'")
CreateOrReplaceTempView In PySpark
Because the Spark API is very similar in Scala and Python, createOrReplaceTempView in PySpark looks basically the same. Please find the example below; as you can see, the API is identical.
Catalyst contains a general library for representing trees and applying rules to manipulate them. On top of this framework, it has libraries specific to relational query processing (e.g., expressions, logical query plans), and several sets of rules that handle different phases of query execution: analysis, logical optimization, physical planning, and code generation to compile parts of queries to Java bytecode. For the latter, it uses another Scala feature, quasiquotes, that makes it easy to generate code at runtime from composable expressions. Catalyst also offers several public extension points, including external data sources and user-defined types. As well, Catalyst supports both rule-based and cost-based optimization.
Source: https://databricks.com/glossary/catalyst-optimizer
Tungsten is the codename for the umbrella project to make changes to Apache Spark’s execution engine that focuses on substantially improving the efficiency of memory and CPU for Spark applications, to push performance closer to the limits of modern hardware.
Source: https://databricks.com/glossary/tungsten